More specifically, I hope to discuss on this site the pros and cons of overengineered data solutions: simple algorithms that overwhelm predictive modeling problems with increasingly inexpensive cloud computing power. I think the trend in data analysis is to use a few analysts with a large team of servers to create better (or at least cheaper) predictive modeling solutions than a large team of analysts with one server. The key is simple but exhaustive ensemble solutions that apply a wide variety of predictive techniques that, in combination, match or outperform more targeted approaches.
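To make the ensemble idea concrete, here is a minimal sketch in plain Python. The three "models" are invented stand-ins for diverse learners trained on the same data; the point is only that averaging predictions whose errors point in different directions tends to cancel those errors out.

```python
from statistics import mean

# Three toy "models", each predicting a value from a feature x.
# In practice these would be diverse learners (trees, regressions,
# nearest neighbors) fit to the same training data.
model_a = lambda x: 2.0 * x + 1.0   # slope a bit too high
model_b = lambda x: 1.8 * x + 1.5   # slope a bit too low
model_c = lambda x: 1.9 * x + 1.3   # happens to match the truth

def ensemble(x, models):
    """Combine predictions by simple averaging."""
    return mean(m(x) for m in models)

# Hypothetical true relationship, for illustration only.
true = lambda x: 1.9 * x + 1.3
xs = [0.0, 1.0, 2.0, 5.0]

ens_err = mean(abs(ensemble(x, [model_a, model_b, model_c]) - true(x))
               for x in xs)
worst_single = max(mean(abs(m(x) - true(x)) for x in xs)
                   for m in (model_a, model_b))

# The averaged ensemble's error is smaller than the worst member's.
print(ens_err, worst_single)
```

Nothing here is clever; that is the argument. A few analysts can write this kind of exhaustive averaging once and let cheap servers run many model variants through it.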
A comment by Thomas Dinsmore at DBMS2.com
Re “beer and diapers” — I first heard this at a 1996 Data Mining conference in San Francisco, where a Teradata presenter used the story to tout the capabilities of Teradata.
A year or so later, Forbes ran a piece quoting the head of merchandising for Wal-Mart saying that even if the finding were true he wouldn’t know what to do with it.
That’s an important point. Suppose that it’s true that shoppers tend to purchase beer and diapers on Friday nights. Does this mean retailers should:
(1) Place beer and diapers next to one another in the store, for shopper convenience;
(2) Place beer and diapers far apart in the store, to maximize time in store;
(3) Issue a coupon for beer when purchased with diapers;
(4) Withhold the coupon, since shoppers buy beer and diapers anyway?
There is no way to answer the question without a follow-up test-and-learn experiment. In the absence of an experimental design, observed associations have little or no value for decision-making.
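The test-and-learn step can be sketched in a few lines. This is a hypothetical trial, not anything a named retailer ran: store IDs and the simulated revenue lifts are invented, and the key move is simply the random assignment, which is what observational "beer and diapers" data lacks.

```python
import random

random.seed(7)  # reproducible assignment for the sketch

# The four candidate actions from the list above.
treatments = ["adjacent", "far_apart", "coupon", "no_coupon"]
stores = [f"store_{i:02d}" for i in range(20)]

# Randomize stores to treatments -- the step that turns an observed
# association into something you can act on.
assignment = {s: random.choice(treatments) for s in stores}

# Simulated per-store revenue lift after the trial period
# (placeholder numbers standing in for real measurements).
observed = {s: random.gauss(0.0, 1.0) for s in stores}

def mean_lift(treatment):
    """Average outcome among stores assigned to one treatment."""
    vals = [observed[s] for s in stores if assignment[s] == treatment]
    return sum(vals) / len(vals) if vals else float("nan")

for t in treatments:
    print(t, round(mean_lift(t), 3))
```

A real trial would add sample-size planning and a significance test, but even this skeleton answers the question the association alone cannot: which of the four actions actually moves revenue.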
Here’s another angle on big data applied to the world surrounding the common business: graph databases. This is a more natural way to model the complete context of a transaction. It’s a big data problem and every business can benefit from analyzing it.
For example, each business transaction is a node. So is each person who touches it or is responsible for it, each customer and vendor and the communication with them, and each component of the transaction, such as the product or serial number. Between all these nodes run billions of relationships.
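A minimal sketch of that model, using plain dictionaries in place of a real graph database. All node IDs, property names, and relationship labels here are hypothetical; a production system would use a graph store and a query language rather than linear scans.

```python
# Node store: id -> properties; edge store: (source, relation, target).
nodes = {}
edges = []

def add_node(node_id, **props):
    nodes[node_id] = props

def add_edge(source, relation, target):
    edges.append((source, relation, target))

# One sales transaction and its surrounding context.
add_node("txn:1001", kind="transaction", amount=249.99)
add_node("person:alice", kind="employee", role="sales")
add_node("customer:acme", kind="customer")
add_node("product:sku-42", kind="product", serial="SN-0042")
add_node("email:552", kind="communication", channel="email")

add_edge("person:alice", "ENTERED", "txn:1001")
add_edge("customer:acme", "PARTY_TO", "txn:1001")
add_edge("txn:1001", "CONTAINS", "product:sku-42")
add_edge("email:552", "DISCUSSES", "txn:1001")

def neighbors(node_id):
    """Everything directly connected to a node, in either direction."""
    return ([(r, t) for s, r, t in edges if s == node_id] +
            [(r, s) for s, r, t in edges if t == node_id])

# The full context of the transaction is one hop away.
print(neighbors("txn:1001"))
```

The payoff is that questions a relational schema makes awkward ("show me every person, document, and conversation touching this transaction") become a single neighborhood lookup.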
Validation from IBM. Combine BPM, social communication, and mobile interfaces. OK, now just add two-way business intelligence and some Big Metadata concepts for a revolutionary product.
Netflix opted not to implement the entire algorithm that won the Netflix Prize because a) it would have cost too much to engineer, b) it no longer fit the direction of the company, and c) just two algorithms from the prize submission yielded the most significant gains.
This is such a great example of the reality of Big Data! Before embarking on a mathematical journey, companies need to assess the reward of developing models and balance against the risk of process changes, and the cost of implementation. Also, simpler models that are cheaper to implement and better understood in the general case are more likely to get business traction than deep models, highly tuned through iteration.
Netflix's factors, the pieces of data fed into the system, changed. Factor selection is repeatedly identified as the most significant step in model building. With streaming, a partial viewing of a video means something; so do clicks to go back, go forward, or browse for something else while you're watching. All of this reflects your behavior.
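Here is a sketch of how raw streaming events might be turned into factors. The event schema and thresholds are invented for illustration, not Netflix's actual data; the point is that the partial view, the skips, and the rewinds each become a candidate input to the model.

```python
# Hypothetical playback event log for one viewing session.
events = [
    {"action": "play",         "position": 0},
    {"action": "seek_forward", "position": 300},
    {"action": "seek_back",    "position": 900},
    {"action": "stop",         "position": 1200},
]
runtime = 5400  # video length in seconds

def session_factors(events, runtime):
    """Turn raw playback events into candidate model factors."""
    last_position = max(e["position"] for e in events)
    return {
        "completion_ratio": last_position / runtime,  # partial viewing
        "forward_seeks": sum(e["action"] == "seek_forward"
                             for e in events),
        "back_seeks": sum(e["action"] == "seek_back" for e in events),
        # Assumed threshold: under 90% watched counts as abandoned.
        "abandoned": last_position / runtime < 0.9,
    }

print(session_factors(events, runtime))
```

Choosing which of these derived factors to feed the model is exactly the factor-selection step the paragraph describes, and it matters more than the sophistication of the algorithm consuming them.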
Wonderful examples of diagnostic heat maps!
The last missing enabling technology for my vision of business software was recently discussed on High Scalability as a future open source project. Soon all the pieces will exist in the open source ecosystem.
Big Metadata is my phrase for the application of Big Data tools and techniques to the larger ecosystem of data surrounding business software. While Big Data is focused on unwieldy data sets at large firms or new-technology companies, Big Metadata is about every company or organization taking advantage of the massive amounts of relevant metadata that is completely ignored.
Business software captures only the bits of each transaction needed to calculate changes in things like inventory, bank accounts, and production. This reductionist view of each transaction is disconnected from the reality of daily business, which is full of exceptions, rewrites, and human conversations. Just as consumer software is making strides in conforming to our habits and needs, business software should capture the reality of business rather than enforce an ideal.
How we got here is simple. The constraints on storage, processing, software development, and error checking left us focused only on critical features. We now design business databases to be as narrow as possible. We write code to run the system for today and struggle to reconcile original decisions as processes, people, and the whole business landscape continue to change.
When we talk of business software being more human, more predictive, and more like the consumer software we experience from Apple and others, I think we are looking for the same kind of Big Data solutions that come from Google, Apple, Amazon, and Facebook. These are solutions built on massive amounts of data, well-chosen analysis factors, and a sprinkling of algorithms and statistics.
There are clear individual business uses for social networks, voice and gesture recognition, cloud infrastructure, schema-less databases, in-memory storage, distributed algorithms, graph processing, domain-specific languages, development frameworks and other technologies. I believe the combination of these technologies to solve the common challenges in business will fuel the next great wave. My hope is to sketch out the future of business software.