InfoBright Open-Source Column-Store DBMS

I wondered if InfoBright would do this. Before going open-source their website described the product as a kind of bulk-storage and not a data warehouse. A place to put data that you need to remain accessible but which you don’t need to query fast or frequently. That was the enterprise story. As an open-source project, I think they have a much more compelling value proposition. It’s the democratization of analysis. Try before you buy (the Enterprise Edition). Rapid prototype / rapid failure. Connects to any SQL tool, platform or language. As easy as working with MySQL.

My test data set is 37 million rows of point-of-sale transactions. Total data size as CSV is 7GB. My test system stinks. I need to make that clear so that my numbers are not seen as representative of what’s possible with InfoBright. After seeing the product in action, I’m sure that server hardware will do much better.

How fast to bulk load?

InfoBright loads are multi-threaded, but my test server is a single-processor desktop and the loads are still fast! With my single processor, about 1.8 million rows/minute (336 MB/min) are being inserted and the load rate slowed down about 10% over 37 million rows. Disk access was minimal as records were inserted. Overall, my little desktop moved an average of 30,000 rows/sec or 5.6 megabytes/sec. That’s 20GB/hour! My processor was fully loaded every second. With faster cores and multi-threading, the load should be much faster. When I get the chance to load Linux on a bigger box I’ll be eager to see how it performs.

How big on disk?

I have 7GB of data. Using MySQL’s default MyISAM storage engine with an 8-bit ASCII representation requires… 7GB. No surprise there. InfoBright took 591.2MB, as reported from my MySQL management console. That’s a 92% reduction in size or a 12:1 compression ratio.

The status data coming from the InfoBright engine includes the storage size of each column and total size of the table. If I could remove the lowest-level detail, InfoBright reports exactly how much space that would save. Helpful.

How much memory?

I don’t have much guidance because I don’t have enough data to stress the cache. My largest data set can fit comfortably inside the compressed cache. That means every company I’ve ever dealt with would be able to avoid disk reads and improve performance. Unfortunately, this does not put InfoBright’s performance on par with other in-memory databases. More on this later.

Here are some guidelines from InfoBright on the memory (in megabytes) that you should allocate given a certain amount of system memory. These figures have no relationship to the size of your data set. I also don’t know if 32 GB represents an upper limit for the InfoBright software. I suspect the point to this table is that the loader heap does not need to increase and that the compressed heap should increase the fastest but will not exceed the main heap.

# System Memory Server Main Heap Size Server Compressed Heap Size Loader Main Heap
32GB 24000 4000 800
16GB 10000 1000 800
8GB 4000 500 800

ServerMainHeapSize – Size of the main memory heap in the server process, in MB
ServerCompressedHeapSize – Size of the compressed memory heap in the server, in MB.
LoaderMainHeapSize – Size of the memory heap in the loader process, in MB.

Performance?

Is it fast? Slow? My hardware is too restrictive to see what InfoBright can do. All signs are promising. What I can say is that the cache grew over time until MySQL was barely touching the disk. My processor is completely peaked, with 99.8% allocated to the MySQL process. According to this article published by MySQL yesterday, InfoBright queries are (for now) restricted to one CPU core. Performance is dependent on the size of my cache and the speed of each core, two things I have direct control over.

Even with my little desktop testbed, this much is clear: the QlikView in-memory database is much faster. On this dataset I’d see results in a split-second instead of 30, 60 or 120 seconds. You might think that comparing these two products isn’t fair, but if your goal is to deliver analysis in SMEs or enterprise departments, these two will definitely compete and complement one another.

Summary?

One of the advantages of column-stores for data warehousing is that simply replicating the original transactional schema can yield adequate performance. Also, there is no performance hit for bringing in the lowest level of granularity. With column-stores, you may not need to build snowflake schemas or do much transformation. Column-stores are therefore less effort to get started in smaller companies with resource-starved IT departments. This means a faster failure rate which is what interests me most: implement quickly, measure early impact and choose investment (InfoBright Enterprise), deferral or elimination.

There is one other free column-store database of significance, MonetDB. It’s an academic project and as such it lacks the toolset and polish that InfoBright inherited from MySQL. I was up and running faster with InfoBright than I was with MonetDB because the installers and administration utilities for InfoBright are already familiar. My Windows tools for MySQL connected right in without a problem. My front-ends with simplified MySQL connectors were oblivious to the InfoBright backend, which is absolutely how it should be.

InfoBright is not without its issues. Documentation is thin or non-existant. I spent hours and hours until I determined (and confirmed on the forums) that the InfoBright loader does not support all of the MySQL syntax for bulk loads. This would not have been such a problem if the error message had provided some warning about my syntax that was perfectly legal in standard MySQL.

All in all, I’m thrilled to have a no-cost column-store database available for prototyping, quick and dirty applications, and bulk data storage.

Related posts:

  1. Open-Source QlikView Engine?
  2. Infobright 3.0.2 Released
  3. Low-Cost Data Analysis & Visualization: It’s Getting Better All The Time
  4. Pentaho Open-Source BI Suite Adds Weka Data-Mining Tool
  5. A Brief Look at QlikView Storage

7 thoughts on “InfoBright Open-Source Column-Store DBMS

  1. Jay,

    Thanks for taking the time to try out Infobright, and for blogging your thoughts on Infobright.

    We take community feedback seriously – we are working on improving the documentation (have a look at the community Wiki; in particular http://www.infobright.org/wiki/Data_Loading/), and will incorporate better error messaging in future releases.

    The Open Source community is vital to us – we’ll continue to enhance and improve the ICE offering.

    John.

  2. I’ll be interested when you do a concurrent user test. I suspect that those customers you reference woouldn’t fare so well if they had more than 20 users. I believe the statement from Infobright on their site is that the next release will bump concurrent users to something like 30, still very low if you want this to be even a departmental reporting platform.

  3. Pingback: Websites tagged "dataset" on Postsaver

  4. I do truly like your superior text! Could you accomplish the good term paper as example? Because I do know that a great custom writing service would compose well researched term papers of brilliant quality.

  5. I had got a desire to make my own company, but I did not earn enough of cash to do this. Thank heaven my friend suggested to take the mortgage loans. So I took the sba loan and realized my desire.

  6. Pingback: QlikCommunity | A Brief Look at QlikView Storage - The QlikView Blogroll

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>