The Myth and Mystery of Big Data

“With enough data, you can discover patterns and facts using simple counting that you can’t discover in small data using sophisticated statistical and machine learning approaches.”

I used to assume that big data, data mining and statistics were inseparable. But the reality, in which companies are making a killing transforming data into value, is far from complex.

Big data is not hard. Statistics are not required. Neither are complex algorithms. Google’s Marissa Mayer attributed the company’s intelligence to the volume of data available for cross-referencing and not to clever algorithms. Google Translate leveraged massive volumes of cross-referenced text in multiple languages rather than a finely tuned understanding of grammar. Voice translation uses much the same technique, based on huge volumes of recorded and transcribed speech.

Right now our two best tools are visualization and data exploration (business discovery). Both are simple, easy to demonstrate and easy to grasp. The big data revolution’s message to the masses is that simple correlation will outstrip them both as long as enough data can be crunched. And much of this can be automated, pre-calculated, and even anticipated. Imagine the analysis system analyzing itself: these people tend to ask these questions at these times!
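Here is a rough sketch of what "the analysis system analyzing itself" might look like with nothing more than counting. The query log, its fields and the function name are all hypothetical, invented for illustration; a real system would read its own usage logs instead.

```python
from collections import Counter

# Hypothetical usage log: (user role, hour of day, question asked).
# None of these values come from a real system.
query_log = [
    ("sales", 9, "pipeline by region"),
    ("sales", 9, "pipeline by region"),
    ("finance", 16, "cash position"),
    ("sales", 9, "top accounts"),
    ("sales", 9, "pipeline by region"),
    ("finance", 16, "cash position"),
]


def most_common_questions(log, top_n=1):
    """Count identical (role, hour, question) entries and return the top ones."""
    return Counter(log).most_common(top_n)


if __name__ == "__main__":
    for (role, hour, question), count in most_common_questions(query_log, top_n=2):
        print(f"{role} users asked '{question}' around {hour}:00 ({count} times)")
```

The most frequent combinations are the obvious candidates to pre-calculate before anyone asks.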

Data can be correlated post-hoc. Correlation does not equal causation, but simple correlation is ample evidence on which to take action. Correlation is immediately perceived visually. Correlation is relative and easy to compare. Correlation can look at 2, 3, 4 or more factors at once. Correlation is business friendly. It is easily understood. Correlation is gut-instinct compatible. Kids understand it: mom gets upset when I put peanut butter on the cat. If I do it right now, she’ll probably be mad.
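To show how little machinery simple correlation needs, here is a minimal sketch of the standard Pearson correlation coefficient in plain Python. The sample numbers (site visits and orders) are made up for illustration, and the function name is my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical data: daily site visits and daily orders.
    visits = [1, 2, 3, 4, 5]
    orders = [2, 4, 6, 8, 10]
    print(pearson(visits, orders))
```

A result near 1 or -1 means the two factors move together; a result near 0 means they do not, at least not in a straight line.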

The business opportunity is really that so much big data is simply thrown away. The opportunity to store all this data didn’t exist, so we have an old habit of simply letting it vaporize. Every server message, every website click, every customer contact and interaction, every manufacturing activity, temperature, timeclock action, phone call received, phone call placed, security video, email sent. Every bit of data can be analyzed, and from multiple perspectives: employee, employer, customer, vendor, shipper, receiver, and on and on.

We don’t know what we’ll find. As more and more stories of big data at little(er) companies emerge, the snowball will become an avalanche.

TeraData Performance at KiloData Prices

What if you could turn on a massively parallel business intelligence database cluster with a few lines of code? What if you could leverage in-house and outsourced resources for computation and storage as needed? What if you could expand your analysis, data mining and text-search effort one node at a time, transparently, instantly?

There’s been a flurry of discussion around Hadoop and the HBase project, which aims to bring a Google Bigtable-style data store to Hadoop.

Now Amazon wants to talk about how to use Hadoop with EC2 and S3, their computing and storage clusters.

Can I search large volumes of data on the cheap? Yes, but my algorithms must fit within the MapReduce framework.
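What does "fit within the MapReduce framework" mean in practice? Roughly, the work has to be expressed as a map step that emits key/value pairs and a reduce step that combines all values for one key. The sketch below imitates that shape in a single Python process; it is not Hadoop's API, and the function names are my own.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    lines = ["big data big", "data mining"]
    print(run_mapreduce(lines, map_words, reduce_counts))
```

On a real cluster the framework runs the map step on many machines at once and handles the grouping across the network, which is why an algorithm that cannot be split this way does not benefit.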

Does someone have a MapReduce-enabled data query language? Well, there’s Pig from Yahoo and Sawzall from Google; here is a discussion comparing those two from Greg Linden. There’s also Abacus from the Hadoop project, and apparently Microsoft has DryadLINQ.

We are on the exponential curve as it swoops upward dramatically. Thanks to the power and flexibility of open source, anyone can use Google’s secret sauce on Amazon’s computers for 18 cents per gigabyte and 10 cents per computing hour.
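For a back-of-the-envelope feel for those prices, here is a small sketch that adds them up. It assumes a flat charge per gigabyte stored and per machine-hour, and it ignores data transfer and any other fees; the billing details are my assumption, not Amazon's published terms. The workload figures are hypothetical.

```python
STORAGE_CENTS_PER_GB = 18      # price quoted above
COMPUTE_CENTS_PER_HOUR = 10    # price quoted above, per machine-hour


def estimate_cost_cents(gigabytes, machines, hours):
    """Rough cost in cents: flat storage charge plus machine-hours of compute."""
    storage = gigabytes * STORAGE_CENTS_PER_GB
    compute = machines * hours * COMPUTE_CENTS_PER_HOUR
    return storage + compute


if __name__ == "__main__":
    # Hypothetical job: 1,000 GB of data crunched by 10 machines for 24 hours.
    cents = estimate_cost_cents(gigabytes=1000, machines=10, hours=24)
    print(f"${cents / 100:.2f}")
```

Under those assumptions a terabyte of data and a ten-machine cluster for a day come to about two hundred dollars.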

The Lean Enterprise And Business Intelligence

Here’s an article by Scott Wanless on the B-Eye Network: “Healthcare Business Intelligence Triples the Value of Lean Initiatives.”


Business intelligence helps to answer these strategic questions by combining internal data (e.g., patient data, sales data, staffing data) and external data (e.g., market statistics, partner data, payor data, etc.) into a historical repository for analysis of key trends, patterns, exceptions and opportunities.

The results of this analysis are used to make informed decisions regarding the direction of the organization, its performance and the value provided to the various stakeholder groups.

A Look At The Future Of Business Intelligence


The need to take action, not just be informed, is more urgent than ever due to the relentless externalization of business, the rapid emergence of loosely coupled computing environments based on standards, and the Web-as-the-platform paradigm.

This article from Neil Raden at Intelligent Enterprise covers some of the changes in the non-traditional and the leading-edge business intelligence space.