If Oracle closed the code base tomorrow or stopped developing MySQL, we could continue to develop MariaDB as if nothing had happened. This is because almost all of the original core engineers, including all of the MySQL architects and the original MySQL optimiser experts, are now working on MariaDB.
Another great paper from Google. If it works for them, it's probably worth copying. I particularly appreciated the Brighthouse reference.
Druid is an in-memory, high-speed analytic database distributed across a cloud platform. MetaMarkets says they plan to release the code in the coming year. Data is fed in semi-aggregated form from a Hadoop backend that stores the raw data. Their cluster of 40 modest machines churns through 1 billion rows of arbitrary dimensionality in 1 second. In another post they mention that they can use 6TB of memory spread among multiple machines before cloud communication overhead begins to degrade speed.
The ability to push data from QlikView back out to a database is beneficial for what-if analysis, financial reporting, CRM dashboards and more. Unfortunately, our ability to achieve this, even using custom code, is limited. On the frontend, only a few objects support user input, and we have very little control over how those objects display. On the backend, connecting QlikView back to a database is very difficult to do well.
Part of the problem is that the QlikView server is not always in control of this communication. If you're using the Plugin, it's done by the client. If you're using AJAX or Mobile, it's done by the server. If, however, you code your solution as an Extension, things shift back to the client side again.
The common way to implement writeback is through VBScript macros, and examples are plentiful on QlikCommunity. A simple version can be mocked up in a few lines of code, something like the sketch below, but deploying that solution to many users in a modestly secure environment has serious disadvantages.
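A minimal sketch of the macro approach; the connection string, field and table names are hypothetical, and error handling is omitted:

```vbscript
' Minimal writeback macro of the kind found all over QlikCommunity.
' Connection string, field and table names are hypothetical.
sub WriteBack
    dim conn, newValue
    set conn = CreateObject("ADODB.Connection")
    ' Credentials are embedded in the document, exposed to any user
    conn.Open "Driver={SQL Server};Server=dbhost;Database=sales;Uid=qvwriter;Pwd=secret;"
    ' Read the value the user entered (an Input Field named ForecastInput)
    newValue = ActiveDocument.Evaluate("only(ForecastInput)")
    conn.Execute "UPDATE forecast SET amount = " & newValue & " WHERE region_id = 1"
    conn.Close
    set conn = Nothing
end sub
```

The disadvantages: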
- Each client machine needs to communicate to the database. Therefore, database drivers need to be installed on each client. Credentials need to be included in the code of each QlikView document, leaving them exposed to users. Special ports need to be open in firewalls for driver communication. These are poor security practices and should be reason enough for any enterprise to abandon this approach.
- Client-side code is difficult to monitor. Error handling is poor. A separate system would need to be in place to capture errors for analysis and resolution.
- Managing conflicts in a distributed environment requires careful design and development.
- Communication between the VBS execution environment and the database can be slowed for any number of reasons. This leaves the client in an uncertain state, without feedback on progress or problems. Meanwhile, the application state in QlikView can continue to change. This can easily cause inconsistencies.
- Database driver communication uses proprietary protocols that are difficult to monitor for debugging and by network security software.
But there is a much better way to implement writeback to a database from QlikView: build a lightweight web service. This means having QlikView send a structured request to a web server that can interpret the request, make the appropriate database changes, and send a useful response back to QlikView. Overall, this approach is far more flexible, reliable, compatible, configurable and maintainable.
- The response to a web service command (HTTP POST) can itself be an extensive report on the success or failure of any updates. This data can be made visible to the user as a clear confirmation that changes were successful.
- Server-side code is more reliable. It’s far easier to manage many users updating data at the same time. It’s easier to record and react to errors. Implementing your web service in PHP gives you a community with examples of good design.
- Server-side code can handle any level of complexity such as triggering other systems. Client crashes need not leave complex processes in unresolved states.
- This approach only requires a web port to be open in the firewall and therefore is more likely to work regardless of where the user is located.
- With this approach, it is easier to handle database rollbacks, atomic transactions and other features that support the completion of a transaction.
- Changes do not need to be sent to a database one value at a time. Instead, changes can be aggregated into a single update on the QlikView side. Aggregating changes is much faster than a round of database communication per value, so there is less chance of stalling the user session or allowing QlikView’s state to change in the middle of an update process.
- Multiple tables can be updated. For example, not only can a value be updated, but a separate audit log can be updated with who made the change, when, and to what value.
- Communication using XML over HTTP is readily captured by network security software.
- Web services can leverage existing network infrastructure. For example, IIS & Active Directory will authenticate the user making the web request. The web service code can be passed this information reliably.
- The database is read and written by a single set of credentials, written once in the web service code and running on a secured server, without any access from other machines on the network. This is far more secure and a much easier sell to the IT department.
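To make this concrete, here is a sketch of what such an endpoint can look like in PHP. The DSN, table names and XML payload format are my own assumptions, not a prescription; the point is the shape: one POST in, one transaction (updates plus audit rows) against the database, one status report back out.

```php
<?php
// Writeback endpoint sketch. DSN, tables and payload shape are hypothetical.
// IIS + Active Directory have already authenticated the request.
$user = isset($_SERVER['REMOTE_USER']) ? $_SERVER['REMOTE_USER'] : 'unknown';

header('Content-Type: text/xml');

// Expected POST body: <changes><change key="12" value="45000"/>...</changes>
$xml = simplexml_load_string(file_get_contents('php://input'));
if ($xml === false) {
    echo '<result status="error">malformed request</result>';
    exit;
}

$db = new PDO('mysql:host=dbhost;dbname=sales', 'svc_writeback', 'secret');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

try {
    $db->beginTransaction();
    $upd = $db->prepare('UPDATE forecast SET amount = ? WHERE region_id = ?');
    $log = $db->prepare('INSERT INTO forecast_audit
        (region_id, new_amount, changed_by, changed_at) VALUES (?, ?, ?, NOW())');
    $count = 0;
    foreach ($xml->change as $change) {
        $upd->execute(array((string)$change['value'], (string)$change['key']));
        $log->execute(array((string)$change['key'], (string)$change['value'], $user));
        $count++;
    }
    $db->commit(); // the updates and their audit rows succeed or fail together
    echo '<result status="ok" updated="' . $count . '"/>';
} catch (Exception $e) {
    $db->rollBack();
    echo '<result status="error">' . htmlspecialchars($e->getMessage()) . '</result>';
}
```

On the QlikView side, a macro or Extension only needs to POST the XML and display the response.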
I’ve had plenty of success with this approach, combining IIS, Active Directory, PHP and QlikView VBScript macro code. I don’t think we need writeback as a QlikView feature. I would, however, like to see a few changes to QlikView to better support solutions like this.
- Support the editing of Input Fields in more objects, such as when used as dimensions in a Pivot Table, or in a bar chart.
- Support multi-line text in Input Fields.
- Add functions to VBScript and Extensions to identify the rows of a table with Input Fields that have been changed since reload. For Extensions, something like a “next” iterator that moves to the next changed value.
- Make it possible to share Input Fields across users, without using collaboration objects as a kludge.
- Add Extension/AJAX functions for managing the data behind Extension objects with millions of rows.
- Support the updating of Input Fields from AJAX.
While I have my head in the clouds, I should mention that Vertica has a cloud solution that they manage for you. Not new, but gives some perspective.
With competitive offerings in the $10-20k per terabyte range, this is an attractive offer and a great way to try before you invest when you have that much data.
I hear Vertica is a screamer, but I can’t imagine getting sub-second results for 3 TB of data on 3 virtualized servers, for the same reasons I gave in my previous post.
I wondered if InfoBright would do this. Before going open-source, their website described the product as a kind of bulk storage rather than a data warehouse: a place to put data that needs to remain accessible but that you don’t need to query fast or frequently. That was the enterprise story. As an open-source project, I think they have a much more compelling value proposition. It’s the democratization of analysis. Try before you buy (the Enterprise Edition). Rapid prototype, rapid failure. Connects to any SQL tool, platform or language. As easy as working with MySQL.
My test data set is 37 million rows of point-of-sale transactions. Total data size as CSV is 7GB. My test system stinks. I need to make that clear so that my numbers are not seen as representative of what’s possible with InfoBright. After seeing the product in action, I’m sure that server hardware will do much better.
How fast to bulk load?
InfoBright loads are multi-threaded, but even on my single-processor desktop the loads are fast! About 1.8 million rows/minute (336 MB/min) were inserted, and the load rate slowed by only about 10% over 37 million rows. Disk access was minimal as records were inserted. Overall, my little desktop moved an average of 30,000 rows/sec or 5.6 megabytes/sec. That’s 20GB/hour! My processor was fully loaded the entire time. With faster cores and multi-threading, the load should be much faster. When I get the chance to load Linux on a bigger box, I’ll be eager to see how it performs.
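The load itself was nothing exotic; it was essentially the statement below (the table definition and file path are hypothetical). As I note later, the InfoBright loader accepts only a subset of MySQL’s LOAD DATA options, so it pays to keep the statement minimal:

```sql
-- Bulk load a CSV into an InfoBright (Brighthouse) table.
-- Table definition and file path are hypothetical.
CREATE TABLE pos_transactions (
    sale_date  DATE,
    store_id   INT,
    product_id INT,
    quantity   INT,
    amount     DECIMAL(10,2)
) ENGINE=BRIGHTHOUSE;

-- Keep the options minimal; InfoBright supports only a subset
-- of MySQL's LOAD DATA syntax.
LOAD DATA INFILE '/data/pos_transactions.csv'
INTO TABLE pos_transactions
FIELDS TERMINATED BY ',' ENCLOSED BY '"';
```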
How big on disk?
I have 7GB of data. Using MySQL’s default MyISAM storage engine with an 8-bit ASCII representation requires… 7GB. No surprise there. InfoBright took 591.2MB, as reported from my MySQL management console. That’s a 92% reduction in size or a 12:1 compression ratio.
The status data coming from the InfoBright engine includes the storage size of each column and the total size of the table, so I can see exactly how much space I would save by, for example, removing the lowest level of detail. Helpful.
How much memory?
I don’t have much guidance because I don’t have enough data to stress the cache. My largest data set can fit comfortably inside the compressed cache. That means every company I’ve ever dealt with would be able to avoid disk reads and improve performance. Unfortunately, this does not put InfoBright’s performance on par with other in-memory databases. More on this later.
Here are some guidelines from InfoBright on the memory (in megabytes) that you should allocate given a certain amount of system memory. These figures have no relationship to the size of your data set. I also don’t know whether 32 GB represents an upper limit for the InfoBright software. I suspect the point of this table is that the loader heap does not need to increase and that the compressed heap should increase the fastest but will not exceed the main heap.
| System Memory | Server Main Heap Size | Server Compressed Heap Size | Loader Main Heap Size |
|---|---|---|---|
- ServerMainHeapSize – Size of the main memory heap in the server process, in MB.
- ServerCompressedHeapSize – Size of the compressed memory heap in the server process, in MB.
- LoaderMainHeapSize – Size of the memory heap in the loader process, in MB.
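These parameters are set in the engine’s configuration file (brighthouse.ini, if memory serves). The values below only illustrate the syntax; they are placeholders, not InfoBright’s recommendations:

```ini
; brighthouse.ini -- placeholder values, not vendor guidance
ServerMainHeapSize = 600
ServerCompressedHeapSize = 256
LoaderMainHeapSize = 320
```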
Is it fast? Slow? My hardware is too restrictive to see what InfoBright can really do, but all signs are promising. What I can say is that the cache grew over time until MySQL was barely touching the disk. My processor was completely pegged, with 99.8% allocated to the MySQL process. According to this article published by MySQL yesterday, InfoBright queries are (for now) restricted to one CPU core. Performance therefore depends on the size of my cache and the speed of each core, two things I have direct control over.
Even with my little desktop testbed, this much is clear: the QlikView in-memory database is much faster. On this dataset I’d see results in a split-second instead of 30, 60 or 120 seconds. You might think that comparing these two products isn’t fair, but if your goal is to deliver analysis in SMEs or enterprise departments, these two will definitely compete and complement one another.
One of the advantages of column-stores for data warehousing is that simply replicating the original transactional schema can yield adequate performance. There is also no performance hit for bringing in the lowest level of granularity. With column-stores, you may not need to build snowflake schemas or do much transformation, so they take less effort to get started with in smaller companies with resource-starved IT departments. This means a faster failure rate, which is what interests me most: implement quickly, measure early impact and choose investment (InfoBright Enterprise), deferral or elimination.
There is one other free column-store database of significance: MonetDB. It’s an academic project and, as such, it lacks the toolset and polish that InfoBright inherited from MySQL. I was up and running faster with InfoBright than I was with MonetDB because the installers and administration utilities for InfoBright are already familiar. My Windows tools for MySQL connected right in without a problem. My front-ends with simplified MySQL connectors were oblivious to the InfoBright backend, which is exactly how it should be.
InfoBright is not without its issues. Documentation is thin or non-existent. I spent hours and hours before I determined (and confirmed on the forums) that the InfoBright loader does not support all of the MySQL syntax for bulk loads. This would not have been such a problem if the error message had given some indication that my syntax, while perfectly legal in standard MySQL, was not supported.
All in all, I’m thrilled to have a no-cost column-store database available for prototyping, quick and dirty applications, and bulk data storage.
Over the weekend I revisited Tableau, enjoyed some success with MonetDB, tried to turn MySQL into a hundred-million-row data warehouse, was underwhelmed by Firebird, installed Greenplum, and spent many frustrated hours with Talend Open Studio, Pentaho Kettle and Jitterbit.
Of course, I could just buy QlikView, but what can be done for less money? Unfortunately, data warehouses and BI front-ends are not sexy problems in the open-source community. Graphs and charts get a little more attention, but you’ll need to write your own code to glue them to your application.
In summary, what can I say about our options?
First, write your own ETL. Why do open-source ETL tools like Talend and Kettle work so hard to rebuild Informatica? It reminds me of Linux in the 1990s, when the community wanted to beat Windows, kept working to look like Windows, and wondered when victory would arrive. Informatica, like OLAP and mainframes, is from an era when memory was scarce and languages were low-level, slow to compile and run, abstracted little, and were not at all portable. On top of that, ODBC drivers were tightly controlled and costly.
But now we can pick from many great scripting languages. Today’s languages abstract the hard parts, are easy to read, can be edited while executing and talk to any system, database, web service or application. I think the next direction for ETL will be a simple (but extensible) transformation language using an ORM wrapper… Rails on ETL. Until that arrives, you can achieve everything you need with PHP, Perl, Ruby and others.
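To illustrate how little code a hand-rolled ETL step needs, here is a sketch in PHP; the extract file, target table, column layout and the currency rule are all invented:

```php
<?php
// Hand-rolled ETL step: extract from a CSV dump, transform, load.
// File, table, column layout and the currency rule are all hypothetical.
$db = new PDO('mysql:host=dwhost;dbname=warehouse', 'etl', 'secret');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$insert = $db->prepare(
    'INSERT INTO sales_fact (sale_date, store_id, amount_usd) VALUES (?, ?, ?)');

$in = fopen('/extracts/sales.csv', 'r');
$db->beginTransaction(); // one transaction keeps the load all-or-nothing
while (($row = fgetcsv($in)) !== false) {
    list($date, $store, $amount, $currency) = $row;
    // Transform: normalize everything to USD with a fixed example rate
    $usd = ($currency === 'EUR') ? $amount * 1.4 : $amount;
    $insert->execute(array($date, $store, $usd));
}
$db->commit();
fclose($in);
```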
Best option for low-cost data warehouse?
One of the most useful tricks shared at the QlikView conference was from Nik Boman on improving the data extraction from databases.
ODBC is a slow protocol, running orders of magnitude slower than the database or a typical Ethernet connection. Very pricey ETL tools for data warehousing get around this by extracting through multiple connections to the database, and there’s no reason that a QlikView infrastructure can’t take advantage of it.
For example, run two copies of QlikView at the same time, each extracting approximately half of the data set. First, make a copy of the QV.exe file and give it a unique name; you can then open QV.exe and your uniquely named copy at the same time. You can run three or more copies of QlikView with this method.
Next, decide how to divide your data set; it could be based on date, country, state, or half the alphabet, for example. What you want is to divide the data set into roughly equal segments, one for each copy of QlikView.
How does each copy of QlikView know which segment to load? One way to do this dynamically is to use the command line to set a variable in the script. Reference this variable in the SQL SELECT statement in the script: WHERE YearField=$(vYearVariable). See the reference manual for command-line options.
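As a sketch, launching two instances from a batch file might look like the lines below. The /r (reload) and /vvarname=value (set a script variable) switches are from memory, so verify them against the reference manual for your version; I also assume each instance stores its segment to its own QVD so the instances don’t collide saving a single document.

```
rem Run two parallel extracts, one per segment of the data.
rem /r reloads the document; /v sets a script variable (verify in the manual).
start "" qv.exe     /r /vvYearVariable=2007 extract.qvw
start "" qvcopy.exe /r /vvYearVariable=2008 extract.qvw
```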
Your mileage will vary. Some databases don’t do much better with simultaneous ODBC reads. Oracle does quite well.
In my world, which is corporate software systems, I have a transactional database that is usually in second normal form and has very few aggregates. Building reports directly means joining at least 4 tables, often 8, and sometimes as many as 12. Unfortunately, the new crop of data warehouse vendors has made it very difficult to grasp how well they handle this. Some of these products handle your data model as-is, and some expect star/snowflake schemas, which adds a layer of design, coding, testing, validation and additional maintenance.
Netezza, Greenplum and Vertica all use off-the-shelf interconnects, meaning 1-gigabit Ethernet in most cases. Transferring large amounts of data from a distributed system over Ethernet can easily unravel any gains. In a simplistic design, an evenly distributed dataset would require every node to talk to every other node. With multiple joins, this would create a series of bottlenecks. It would also rely heavily on synchronization across the distributed system.
Vertica is a star/snowflake product. The Vertica distributed system replicates the dimension tables on each node and partitions the fact table. Vertica says that they have customers that use more transactional models, but what does that mean for overall performance? Greenplum’s website says: “Utilizes pipelining techniques and redistributes data among nodes for high performance execution of complex joins.” Encouraging, but what is considered “complex” and what will this do to my network in real-world conditions?
If you have any thoughts to share, please add them to the comments.
Gartner released the updated quadrant for DW DBMS software and appliances. DATAllegro seems too far below Netezza in ability to execute. DATAllegro has large, proven installations, and their recent releases run on Dell blades with EMC storage instead of Netezza’s customized FPGAs. And how is Greenplum rated higher than DATAllegro? (via DBMS2)