What if you could turn on a massively parallel business intelligence database cluster with a few lines of code? What if you could leverage in-house and outsourced resources for computation and storage as needed? What if you could expand your analysis, data mining and text-search effort one node at a time, transparently, instantly?
There’s been a flurry of discussion around Hadoop and the HBase project, which brings Google’s BigTable design to Hadoop.
Now Amazon wants to talk about how to use Hadoop with EC2 and S3, its compute and storage services.
Can I search large volumes of data on the cheap? Yes, but my algorithms must fit within the MapReduce framework.
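To make “fit within the MapReduce framework” concrete, here is the canonical word-count job written against Hadoop’s old org.apache.hadoop.mapred Java API: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums the pairs for each word. The paths and the S3 settings in the comments are placeholders of my own, not anything Amazon prescribes.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map: one input line in, a (word, 1) pair out for every word on the line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce: all the 1s for a given word arrive together; sum them.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // Hypothetical S3 settings: with credentials set, the input/output
    // paths below can be s3n:// URIs instead of HDFS paths.
    // conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    // conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // pre-sum on each node to cut shuffle traffic
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```

The discipline is the point: because each map call is independent and each reduce call sees only one key at a time, the framework can scatter the work across as many machines as you care to pay for.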
Does someone have a MapReduce-enabled data query language? Well, there’s Pig from Yahoo and Sawzall from Google (Greg Linden has a discussion comparing those two), plus Abacus from the Hadoop project, and apparently Microsoft has DryadLINQ.
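For a sense of what such a language buys you, here is a sketch of the same word count embedded in Java through Pig’s PigServer API; this is my own illustration, not anything from the discussions above, and the file names are placeholders. Four registered statements replace the entire WordCount class.

```java
import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws IOException {
    // Run the Pig Latin statements as MapReduce jobs on the cluster.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("lines  = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grpd   = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grpd GENERATE group, COUNT(words);");
    pig.store("counts", "wordcount-output"); // triggers the actual jobs
  }
}
```

Behind the scenes, Pig compiles those statements into the same kind of MapReduce jobs you would otherwise write by hand.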
We are on the exponential curve just as it swoops upward. Thanks to the power and flexibility of open source, anyone can use Google’s secret sauce on Amazon’s computers for 18 cents per gigabyte and 10 cents per computing hour. At those rates, chewing through 100 gigabytes on a ten-machine cluster for twelve hours comes to about $30.
I finally got around to watching the Tableau 3.0 webinar. I agree with their very excited presenter that Tableau 3.0 is a leap forward. The support for ad-hoc grouping of dimension elements is excellent, as is the enhanced support for ad-hoc sets. The annotations look good and act sensibly. Generally, the new features focus on ease of use, better statistical analysis, and report clarity. All good things. Here are 3.0 examples.
Annotations should be required in every BI tool. The ability to mark reference lines and data points on graphs and tables is critical to clear communication. Placing an annotation on a point in space does not require a data point to exist there, which is another nice feature. The smart BI vendors are focusing on collaboration and communication among users.
“Groups” stole their name from the “groups” of 2.x, which are now the “sets” of 3.0. They work like so: similar dimensions such as coffee and tea, which may need to be represented in the database as separate product lines, can now be combined on the fly within Tableau by an end user under the simple heading “drinks”. That makes it easy to answer a question about food vs. drink sales without exporting to Excel and spending more time adding up the drink categories. In short, “groups” bring dimension values together, “sets” separate special values from the rest of a dimension’s values, and both can be done by the end user. Pretty nice.
I think the strongest competitor for visualization is Spotfire. However, Tableau’s use of live database interaction will become an advantage as data warehouse implementations shift to high-performance, in-memory, read-optimized databases. (Was that over-hyphenated?) Spotfire’s initial data loads are inflexible, and I wouldn’t recommend it if you need to update a large dataset frequently.
Unlike QlikView, Tableau needs all of its data in a single database. With good design, this is not a performance issue. The problem is that the extra expense of hardware and software to store a separate data warehouse and run ETL processing may push Tableau’s final price tag far above QlikView’s, since QlikView can easily pull from multiple sources and uses its own high-speed database.
I really hope Pentaho makes a compelling product, but they have a long road ahead. They’ve incorporated projects that were developed independently, use different methods for accessing data, and do not share the same code quality. The programs will change, and users who build against one version will find their tools broken by the next release. That’s part of the practical downside of open source. Until the product matures, it will be a moving target.