I wondered if InfoBright would do this. Before going open source, their website described the product as a kind of bulk storage rather than a data warehouse: a place to put data that you need to remain accessible but don’t need to query fast or frequently. That was the enterprise story. As an open-source project, I think they have a much more compelling value proposition. It’s the democratization of analysis. Try before you buy (the Enterprise Edition). Rapid prototyping / rapid failure. Connects to any SQL tool, platform or language. As easy as working with MySQL.
My test data set is 37 million rows of point-of-sale transactions. Total data size as CSV is 7GB. My test system stinks. I need to make that clear so that my numbers are not seen as representative of what’s possible with InfoBright. After seeing the product in action, I’m sure that server hardware will do much better.
How fast to bulk load?
InfoBright loads are multi-threaded, but even on my single-processor desktop the loads are fast. I was inserting about 1.8 million rows/minute (336 MB/min), and the load rate slowed only about 10% over 37 million rows. Disk access was minimal while records were inserted. Overall, my little desktop moved an average of 30,000 rows/sec, or 5.6 megabytes/sec. That’s 20GB/hour! The processor was pegged the entire time, so with faster cores and multiple threads the load should be much faster. When I get the chance to load Linux on a bigger box I’ll be eager to see how it performs.
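Those throughput figures are easy to sanity-check with back-of-envelope arithmetic; here it is as a plain MySQL SELECT (the constants are just the numbers from my test run):

```sql
-- Sanity-check the load-rate figures from the 37M-row test.
SELECT
  1800000 / 60        AS rows_per_sec,   -- 30,000 rows/sec
  336 / 60            AS mb_per_sec,     -- 5.6 MB/sec
  336 * 60 / 1024     AS gb_per_hour,    -- ~19.7 GB/hour
  37000000 / 1800000  AS total_minutes;  -- ~20.6 minutes for the full load
```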
How big on disk?
I have 7GB of data. Using MySQL’s default MyISAM storage engine with an 8-bit ASCII representation requires… 7GB. No surprise there. InfoBright took 591.2MB, as reported by my MySQL management console. That’s a 92% reduction in size, or a 12:1 compression ratio.
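The compression math, again as a quick SELECT if you want to check it for your own data set:

```sql
-- 7GB of CSV (7 * 1024 MB) compressed down to 591.2MB on disk.
SELECT
  ROUND(7 * 1024 / 591.2, 1)               AS compression_ratio,  -- ~12.1 : 1
  ROUND(100 * (1 - 591.2 / (7 * 1024)), 1) AS percent_saved;      -- ~91.8%
```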
The status data coming from the InfoBright engine includes the storage size of each column as well as the total size of the table. That means I can see exactly how much space I would save by removing the lowest-level detail columns. Helpful.
How much memory?
I don’t have much guidance because I don’t have enough data to stress the cache. My largest data set can fit comfortably inside the compressed cache. That means every company I’ve ever dealt with would be able to avoid disk reads and improve performance. Unfortunately, this does not put InfoBright’s performance on par with other in-memory databases. More on this later.
Here are some guidelines from InfoBright on the memory (in megabytes) that you should allocate given a certain amount of system memory. These figures have no relationship to the size of your data set. I also don’t know if 32 GB represents an upper limit for the InfoBright software. I suspect the point of this table is that the loader heap does not need to increase, and that the compressed heap should grow the fastest but never exceed the main heap.
| System Memory | Server Main Heap Size | Server Compressed Heap Size | Loader Main Heap |
| --- | --- | --- | --- |

- ServerMainHeapSize – size of the main memory heap in the server process, in MB.
- ServerCompressedHeapSize – size of the compressed memory heap in the server, in MB.
- LoaderMainHeapSize – size of the memory heap in the loader process, in MB.
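For reference, here is how setting those parameters might look. This is a sketch only: I’m assuming the config file is Infobright’s brighthouse.ini, and the numbers are placeholders, not recommendations — substitute the values from InfoBright’s guidelines for your amount of system memory.

```ini
; Hypothetical example values -- replace with InfoBright's
; recommended figures for your system memory.
ServerMainHeapSize=600
ServerCompressedHeapSize=400
LoaderMainHeapSize=320
```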
Is it fast? Slow? My hardware is too restrictive to see what InfoBright can really do, but all signs are promising. What I can say is that the cache grew over time until MySQL was barely touching the disk. My processor was completely pegged, with 99.8% allocated to the MySQL process. According to this article published by MySQL yesterday, InfoBright queries are (for now) restricted to one CPU core. Performance therefore depends on the size of my cache and the speed of each core, two things I have direct control over.
Even with my little desktop testbed, this much is clear: the QlikView in-memory database is much faster. On this dataset I’d see results in a split-second instead of 30, 60 or 120 seconds. You might think that comparing these two products isn’t fair, but if your goal is to deliver analysis in SMEs or enterprise departments, these two will definitely compete and complement one another.
One of the advantages of column-stores for data warehousing is that simply replicating the original transactional schema can yield adequate performance, and there is no performance penalty for bringing in the lowest level of granularity. With a column-store you may not need to build snowflake schemas or do much transformation, so it takes less effort to get started in smaller companies with resource-starved IT departments. That enables the fail-fast cycle that interests me most: implement quickly, measure early impact, and then choose investment (InfoBright Enterprise), deferral or elimination.
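As a sketch of what “replicating the transactional schema” looks like in practice — the table and columns here are hypothetical, and I’m assuming Infobright’s BRIGHTHOUSE storage engine name:

```sql
-- Hypothetical point-of-sale table, copied straight from the
-- transactional schema: no star or snowflake remodeling required.
CREATE TABLE pos_transactions (
  transaction_id  INT,
  store_id        INT,
  sku             VARCHAR(20),
  quantity        INT,
  unit_price      DECIMAL(10,2),
  sold_at         DATETIME
) ENGINE=BRIGHTHOUSE;
```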
There is one other free column-store database of significance, MonetDB. It’s an academic project and as such it lacks the toolset and polish that InfoBright inherited from MySQL. I was up and running faster with InfoBright than I was with MonetDB because the installers and administration utilities for InfoBright are already familiar. My Windows tools for MySQL connected right in without a problem. My front-ends with simplified MySQL connectors were oblivious to the InfoBright backend, which is absolutely how it should be.
InfoBright is not without its issues. Documentation is thin or non-existent. I spent hours and hours before I determined (and confirmed on the forums) that the InfoBright loader does not support all of the MySQL syntax for bulk loads. This would not have been such a problem if the error message had given some indication that my syntax, while perfectly legal in standard MySQL, wasn’t supported.
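For illustration, a stripped-down LOAD DATA statement of the kind that worked for me. The file path and table name are hypothetical, and exactly which MySQL clauses the InfoBright loader accepts may vary by version, so check the forums before relying on anything fancier than this:

```sql
-- Minimal bulk load: basic field and line delimiters only.
-- Extended MySQL LOAD DATA clauses may be rejected by the
-- InfoBright loader even though they are legal in stock MySQL.
LOAD DATA INFILE '/tmp/sales.csv'
INTO TABLE sales
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```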
All in all, I’m thrilled to have a no-cost column-store database available for prototyping, quick and dirty applications, and bulk data storage.