The last missing enabling technology for my vision of business software was recently discussed on High Scalability as a future open source project. Soon all the pieces will exist in the open source ecosystem.
Big Metadata is my phrase for the application of Big Data tools and techniques to the larger ecosystem of data surrounding business software. While Big Data is focused on unwieldy data sets at large firms or new-technology companies, Big Metadata is about every company or organization taking advantage of the massive amounts of relevant metadata that is completely ignored.
Business software only captures the bits of each transaction needed to calculate changes in things like inventory, bank accounts, and production. This reductionist view of each transaction is disconnected from the reality of daily business: full of exceptions, rewrites, and human conversations. Just as consumer software is making strides in conforming to our habits and needs, business software should capture the reality of business rather than enforce an ideal.
How we got here is simple. The constraints on storage, processing, software development and error checking left us focused only on critical features. We now design business databases to be as narrow as possible. We write code to run the system for today and struggle to reconcile original decisions as processes, people and the whole business landscape continue to change.
When we talk of business software being more human, more predictive, and more like the consumer software we experience from Apple and others, I think we are looking for the same kind of Big Data solutions that come from Google, Apple, Amazon, and Facebook. These are solutions based on massive amounts of data, good choices of analysis factors, and a sprinkling of algorithms and statistics.
There are clear individual business uses for social networks, voice and gesture recognition, cloud infrastructure, schema-less databases, in-memory storage, distributed algorithms, graph processing, domain-specific languages, development frameworks and other technologies. I believe the combination of these technologies to solve the common challenges in business will fuel the next great wave. My hope is to sketch out the future of business software.
I wondered if InfoBright would do this. Before going open-source their website described the product as a kind of bulk-storage and not a data warehouse. A place to put data that you need to remain accessible but which you don’t need to query fast or frequently. That was the enterprise story. As an open-source project, I think they have a much more compelling value proposition. It’s the democratization of analysis. Try before you buy (the Enterprise Edition). Rapid prototype / rapid failure. Connects to any SQL tool, platform or language. As easy as working with MySQL.
My test data set is 37 million rows of point-of-sale transactions. Total data size as CSV is 7GB. My test system stinks. I need to make that clear so that my numbers are not seen as representative of what’s possible with InfoBright. After seeing the product in action, I’m sure that server hardware will do much better.
How fast to bulk load?
InfoBright loads are multi-threaded, but even on my single-processor desktop the loads are fast! Rows went in at about 1.8 million rows/minute (336 MB/min), and the load rate slowed by only about 10% over the full 37 million rows. Disk access was minimal as records were inserted. Overall, my little desktop moved an average of 30,000 rows/sec or 5.6 megabytes/sec. That’s 20GB/hour! My processor was fully loaded the entire time, so with faster cores and multiple threads the load should be much faster. When I get the chance to load Linux on a bigger box I’ll be eager to see how it performs.
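A quick sanity check of those figures (my own back-of-the-envelope arithmetic, not anything from InfoBright's tooling):

```python
# The inputs are the numbers observed on my desktop, not a vendor benchmark.
total_rows = 37_000_000    # rows in the test data set
rows_per_min = 1_800_000   # observed single-core insert rate
mb_per_min = 336.0         # observed load bandwidth

rows_per_sec = rows_per_min / 60          # 30,000 rows/sec
mb_per_sec = mb_per_min / 60              # 5.6 MB/sec
gb_per_hour = mb_per_min * 60 / 1000      # ~20 GB/hour
load_minutes = total_rows / rows_per_min  # ~21 minutes for the full set

print(f"{rows_per_sec:,.0f} rows/sec, {mb_per_sec:.1f} MB/sec, "
      f"{gb_per_hour:.0f} GB/hour, {load_minutes:.0f} min total")
```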
How big on disk?
I have 7GB of data. Using MySQL’s default MyISAM storage engine with an 8-bit ASCII representation requires… 7GB. No surprise there. InfoBright took 591.2MB, as reported in my MySQL management console. That’s a 92% reduction in size, or a 12:1 compression ratio.
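The arithmetic behind those two figures, for anyone who wants to check it:

```python
# Verifying the compression numbers quoted above.
raw_mb = 7 * 1024      # 7 GB of CSV / MyISAM storage, in MB
infobright_mb = 591.2  # table size reported through the MySQL console

reduction = 1 - infobright_mb / raw_mb  # fraction of space saved
ratio = raw_mb / infobright_mb          # compression ratio

print(f"{reduction:.0%} smaller, {ratio:.0f}:1 compression")
```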
The status data coming from the InfoBright engine includes the storage size of each column as well as the total size of the table, so I can see exactly how much space removing the lowest-level detail would save. Helpful.
How much memory?
I don’t have much guidance because I don’t have enough data to stress the cache. My largest data set can fit comfortably inside the compressed cache. That means every company I’ve ever dealt with would be able to avoid disk reads and improve performance. Unfortunately, this does not put InfoBright’s performance on par with other in-memory databases. More on this later.
Here are some guidelines from InfoBright on the memory (in megabytes) that you should allocate given a certain amount of system memory. These figures have no relationship to the size of your data set. I also don’t know if 32 GB represents an upper limit for the InfoBright software. I suspect the point of this table is that the loader heap does not need to grow, and that the compressed heap should grow fastest but never exceed the main heap.
| System Memory | Server Main Heap Size | Server Compressed Heap Size | Loader Main Heap Size |
| --- | --- | --- | --- |

ServerMainHeapSize – size of the main memory heap in the server process, in MB.
ServerCompressedHeapSize – size of the compressed memory heap in the server process, in MB.
LoaderMainHeapSize – size of the memory heap in the loader process, in MB.
Is it fast? Slow? My hardware is too limited to show what InfoBright can do, but all signs are promising. What I can say is that the cache grew over time until MySQL was barely touching the disk. My processor was completely pegged, with 99.8% going to the MySQL process. According to this article published by MySQL yesterday, InfoBright queries are (for now) restricted to one CPU core, so performance depends on the size of my cache and the speed of each core, two things I have direct control over.
Even with my little desktop testbed, this much is clear: the QlikView in-memory database is much faster. On this dataset I’d see results in a split-second instead of 30, 60 or 120 seconds. You might think that comparing these two products isn’t fair, but if your goal is to deliver analysis in SMEs or enterprise departments, these two will definitely compete and complement one another.
One of the advantages of column-stores for data warehousing is that simply replicating the original transactional schema can yield adequate performance. Also, there is no performance hit for bringing in the lowest level of granularity. With column-stores, you may not need to build snowflake schemas or do much transformation. Column-stores are therefore less effort to get started in smaller companies with resource-starved IT departments. This means a faster failure rate which is what interests me most: implement quickly, measure early impact and choose investment (InfoBright Enterprise), deferral or elimination.
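A toy sketch of why the column layout pays off (plain Python for illustration, nothing InfoBright-specific, and the rows are invented): an aggregate over one attribute reads a single column instead of every full record, and the repetition within a column is exactly what per-column compression exploits.

```python
# Three point-of-sale transactions stored two ways.
rows = [
    {"sku": "A1", "qty": 2, "price": 9.99},
    {"sku": "B7", "qty": 1, "price": 4.50},
    {"sku": "A1", "qty": 5, "price": 9.99},
]

# Column layout: one sequence per attribute, as a column store keeps it.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Row store: the scan visits every field of every record.
qty_rowwise = sum(r["qty"] for r in rows)
# Column store: the scan reads only the one column it needs.
qty_colwise = sum(columns["qty"])
assert qty_rowwise == qty_colwise == 8

# Repeated values within a column ("A1" twice, 9.99 twice) are what make
# compression ratios like 12:1 plausible.
print(columns["sku"])
```

The same reasoning explains why replicating the transactional schema can be good enough: there is no wide-row penalty to design around, so the snowflake-schema step becomes optional rather than mandatory.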
There is one other free column-store database of significance, MonetDB. It’s an academic project and as such it lacks the toolset and polish that InfoBright inherited from MySQL. I was up and running faster with InfoBright than I was with MonetDB because the installers and administration utilities for InfoBright are already familiar. My Windows tools for MySQL connected right in without a problem. My front-ends with simplified MySQL connectors were oblivious to the InfoBright backend, which is absolutely how it should be.
InfoBright is not without its issues. Documentation is thin or non-existent. I spent hours and hours until I determined (and confirmed on the forums) that the InfoBright loader does not support all of the MySQL syntax for bulk loads. This would not have been such a problem if the error message had indicated that my syntax, while perfectly legal in standard MySQL, was unsupported.
All in all, I’m thrilled to have a no-cost column-store database available for prototyping, quick and dirty applications, and bulk data storage.
I certainly agree with Mr. Debes that we have been here before. Technology analysts grab hold of the next great thing and see how far it will bend before it breaks. SaaS is a collection of ideas exemplified by Salesforce, each of which could be implemented by traditional software vendors. We want to understand what makes Salesforce successful, but we must recognize the difference between Salesforce and an ERP system. Sales pipelines can be abstracted from the daily operations of a business. Salesforce does not need to understand backflushing with 3-level bills of material in multiple currencies, or its impact on materials planning in a cellular manufacturing environment. Thankfully.
Mr. Mirchandani says the problem is that enterprise software vendors are not listening to their customers who are asking for Salesforce features and a Salesforce pricing model. Perhaps for some that makes sense, but what makes Salesforce and other SaaS companies successful is the tight scope of their product. Enterprise software has a huge scope and it must be completely customized in certain areas. Areas that are different for every type of business.
Two announcements this week, both centered around small and mid-market applications built on flexible platforms. Force.com from Salesforce.com is a rebranding of their technology with a focus beyond CRM applications. This article mentions some promising case studies, including 45,000 users already running on the Salesforce.com platform. That’s the Japanese post office, so obviously internationalization isn’t an issue. Not everyone is convinced of Salesforce.com’s potential. Joshua Greenbaum thinks that SAP’s A1S, to be announced on Sept 19th, will be the one to beat.
Either way, I’m pretty excited. On-demand, software-as-a-service business software platforms with straightforward development tools have incredible potential. Highly-tailored applications are the way to go, but you don’t want to be caught in the low-level details of databases and web service implementation.