… and points beyond

mostly about data


A subtle and powerful shift happened last year.

I was building a QlikView application for financials. My client and I had discussed the idea of this application a year earlier but it was impossible then. Version 8.5 had not been released. The ability to look at dozens of simultaneous selection sets is the key to making this great idea work.

Zoom forward to 2009. Versions 8.5 and then 9.0 are released with features including Set Analysis, unlimited rows, chaining of document selections, data export from the script, and Dynamic Tables. These innovations remove the architectural limitations of QlikView that had tied my hands. A year ago I could not deliver the solution that was in my head and that my client needed. Now these limits are gone and I can build exactly what my client needs.

Build exactly what my client needs? This is the first time that this thought crossed my mind. It’s true! With the release of version 9, QlikView has entered a new phase. One that no other product can match.

QlikView is the first and ONLY tool on the market in which every business analysis that I have been asked to build can be built with confidence and an expectation of success. Dream big!

QlikView is not SPSS or JMP, and it never will be, but since I have never been asked to do anything more complex than a regression, QlikView works perfectly.

QlikView is the tool to turn to. It delivers results. Real value, right now. And you can be confident that it can achieve any business analysis you can think of. To get an idea of what QlikView can do, follow this thread on LinkedIn with over 100 unique uses for QlikView.

There’s a point where query response time is low enough that it changes the analysis game completely. This is the amount of time that a decision maker is willing to wait to get the next answer. Not the first answer, but the next one, and the next one. Eventually the frustration of waiting is worse than not knowing.

Salesperson: “What shipped yesterday? Ok, what’s the breakdown? Whoa, what happened in that department? That markdown is too steep, who wrote that order? Which customer? What’s that rep’s extension?”

With one-second results, that analysis would have happened in the time it took you to read it. This is a competition against human nature. One-second results make the difference between wishing you had the answer and getting it, multiplied over and over throughout the day.

The impact on a business is not from faster queries alone. Behavior changes when decision makers trust that the data is immediately at hand. The relationship to data changes when you can find the answer while you think about it and not lose your train of thought.

Because the query engine can respond to any query in one second, we can make every path of exploration available at the beginning. One application can take the place of many reports. Users can begin to query immediately and along any drill path. The benefit of one-second results is diminished if users have to first identify the report that has the data and filtering options they need.

Can OLAP deliver this? No. We must combine speed of execution with rapid application development and full transaction details, and we must eliminate predefined drill paths. OLAP/MOLAP/ROLAP/SCHMOLAP can’t take us into this new era. In-memory associative and column-store databases can.

With one-second results, you don’t build a query and then start the execution. Instead, the results update as soon as you pick the first filtering option, whether it’s the day, order number or country of origin. You get immediate feedback before you make your next selection. Also, the filter options can change based on the results. Maybe you remove options that are incompatible with the selections made so far. By shrinking the feedback loop with one-second results, the filtering options can show intelligent behavior to help guide users or add context to the results. This level of dynamism lets users roll back and forth through their ideas. They can cross-reference without losing a train of thought, or discover and follow tangents that are more important.
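To make the idea of filter options that react to earlier selections concrete, here is a rough Python sketch. It is not how QlikView or any other product implements it; the rows, field names and the helper functions `matching_rows` and `available_options` are all invented for illustration.

```python
# Hypothetical order rows; field names and values are invented for illustration.
ORDERS = [
    {"day": "Mon", "country": "US", "rep": "Ann"},
    {"day": "Mon", "country": "DE", "rep": "Bob"},
    {"day": "Tue", "country": "US", "rep": "Ann"},
    {"day": "Tue", "country": "FR", "rep": "Cid"},
]


def matching_rows(rows, selections):
    """Return the rows consistent with every current selection."""
    return [
        row for row in rows
        if all(row[field] in chosen for field, chosen in selections.items())
    ]


def available_options(rows, selections, field):
    """Values of `field` still compatible with the selections on the other fields."""
    others = {f: v for f, v in selections.items() if f != field}
    return sorted({row[field] for row in matching_rows(rows, others)})


# The user picks one country; every other filter list narrows immediately.
selections = {"country": {"US"}}
print(available_options(ORDERS, selections, "rep"))
print(available_options(ORDERS, selections, "day"))
```

Each click only changes `selections`, and the option lists are recomputed from the data every time. That recomputation is only tolerable when the engine answers in about a second.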

It’s not just one decision maker getting an answer quickly. Interactions and processes benefit. Workers get feedback in near-real-time. We can do tricks like running the same query once per second. Ridiculous? This isn’t paradise; I live in the land of low budgets and “getting it done.” Vendor and customer data is available right when they’re on the phone. Less “I’ll get back to you” and more “I have that info right in front of me.” I’ve also noticed that it’s harder to bullshit when anyone in the meeting can easily explore the data on their laptop and get the real answer.

In companies where I can deliver one-second results, I spend a lot of time reconditioning people to ask for anything they desire, because now I can put any information at their fingertips, no matter how many tables, how much detail and with little knowledge of how they want to look at the data.

For nearly all companies, the entire transactional database can be copied as-is into a one-second query engine. Add a BI tool on top, rename some fields and identify the table relationships. Time is spent developing the frontend to deliver the best reports and analysis. One person can build the entire solution. Since the transactional model is already validated, there is no data modeling, no formal architecture and little documentation. This might be frightening to enterprises but the benefits are huge for strapped IT budgets.

A one-second query engine needs an interactive frontend to take advantage of it. We also need simpler ETL tools. With the engine in place first, developers will connect the dots and the tools will be built to take advantage of the new abilities.

None of this is theoretical. I’ve been doing this for the past 7 years with an in-memory associative database, ETL tool and interactive frontend called QlikView. When information flows at the speed of thought, it changes decision-maker behavior and the business process. When we can prototype and deploy one-second query engines quickly, then ideas can be built and tested quickly. Most ideas won’t be new or unexpected, but they were impossible or impractical without one-second results.

Over the weekend I have revisited Tableau, enjoyed some success with MonetDB, tried to turn MySQL into a hundred million row data warehouse, been underwhelmed with Firebird, installed Greenplum and spent many frustrated hours with Talend Open Studio, Pentaho Kettle and Jitterbit.

Of course, I could just buy QlikView, but what can be done for less $money? Unfortunately, data warehouses and BI front-ends are not sexy problems in the open source community. Graphs and charts get a little more attention, but you’ll need to write your own code to glue them to your application.

In summary, what can I say about our options?

First, write your own ETL. Why do open source ETL tools like Talend and Kettle work so hard to rebuild Informatica? It reminds me of Linux in the 1990s, when the community wanted to beat Windows, kept working to look like Windows, and wondered when victory would arrive. Informatica, like OLAP and mainframes, is from an era when memory was scarce and languages were low-level, slow to compile and run, short on abstraction and not at all portable. On top of that, ODBC drivers were tightly controlled and costly.

But now we can pick from many great scripting languages. Today’s languages abstract the hard parts, are easy to read, can be edited while executing and talk to any system, database, web service or application. I think the next direction for ETL will be a simple (but extensible) transformation language using an ORM wrapper… Rails on ETL. Until that arrives, you can achieve everything you need with PHP, Perl, Ruby and others.
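As a taste of how little code a hand-written ETL job needs, here is a minimal sketch in Python using only the standard library (the same shape works in PHP, Perl or Ruby). The CSV extract, the `orders` table and the cleanup rules are all made up for the example; a real job would read from your own sources.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be a file or a database query.
RAW_CSV = """order_id,customer,amount
1001, acme ,19.99
1002,Globex,5.00
1003,ACME,
"""


def extract(text):
    """Read CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize customer names and drop rows that have no amount."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue
        clean.append(
            (int(row["order_id"]), row["customer"].strip().upper(), float(row["amount"]))
        )
    return clean


def load(conn, rows):
    """Write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
print(conn.execute("SELECT customer, amount FROM orders ORDER BY order_id").fetchall())
```

Three small functions, each readable on its own, and the transformation rules live in plain code that anyone on the team can edit.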

Best option for low-cost data warehouse?


I consult for QlikView and I have to agree it’s awesome. But hearing me rant about its greatness would sound like another fanboy foaming at the mouth. So I’ll let someone else, David Raab, explain why QlikView is so good. David has also put together a concrete example using a cross-sell table that answers the question “What other products do customers tend to buy if/when they purchase product X?” This is a powerful question that every sales person should be asking, but it’s hard to get an answer when you need to go to the IT department each time and wait for them to build the model in a traditional SQL query or OLAP tool. QlikView makes it easy to get immediate answers and explore your data “at the speed of thought”.
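For readers who want to see the cross-sell question in concrete terms, here is a small Python sketch. It is not David Raab’s example and not QlikView script; the purchase records and the `cross_sell` function are invented to show what the calculation boils down to.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records (customer, product); invented for illustration.
PURCHASES = [
    ("c1", "printer"), ("c1", "ink"), ("c1", "paper"),
    ("c2", "printer"), ("c2", "ink"),
    ("c3", "laptop"), ("c3", "paper"),
    ("c4", "printer"),
]


def cross_sell(purchases, product):
    """For buyers of `product`, return the share who also bought each other item."""
    baskets = defaultdict(set)
    for customer, item in purchases:
        baskets[customer].add(item)
    buyers = [items for items in baskets.values() if product in items]
    counts = Counter(item for items in buyers for item in items if item != product)
    # most_common() orders the result from strongest to weakest association.
    return {item: n / len(buyers) for item, n in counts.most_common()}


print(cross_sell(PURCHASES, "printer"))
```

The logic is simple. The hard part in a traditional setup is getting it rerun for product Y, then for one region, then for last quarter, without a round trip to IT each time.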

What if you could turn on a massively parallel business intelligence database cluster with a few lines of code? What if you could leverage in-house and outsourced resources for computation and storage as needed? What if you could expand your analysis, data mining and text-search effort one node at a time, transparently, instantly?

There’s been a flurry of discussion around Hadoop and the HBase project to bring Google’s Bigtable design to Hadoop.

Now Amazon wants to talk about how to use Hadoop with EC2 and S3, their computing and storage clusters.

Can I search large volumes of data on the cheap? Yes, but my algorithms must fit within the MapReduce framework.
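What does “fit within the MapReduce framework” mean in practice? Roughly, the work has to be expressed as a map step that emits key/value pairs and a reduce step that combines all values for one key. The toy below runs in a single Python process and does not use Hadoop at all; the log lines and function names are invented, and the `shuffle` function stands in for what the framework does between the two phases.

```python
from collections import defaultdict

# Hypothetical log lines; invented for illustration.
LINES = [
    "error disk full",
    "warning disk slow",
    "error network down",
]


def map_phase(line):
    """Emit (key, value) pairs for one input record."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


pairs = (pair for line in LINES for pair in map_phase(line))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["error"], counts["disk"])
```

Because each map call sees one record and each reduce call sees one key, the framework is free to spread both across as many nodes as you rent.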

Does someone have a MapReduce-enabled data query language? Well, there’s Pig from Yahoo. Sawzall from Google. Here is a discussion comparing those two from Greg Linden. Abacus from the Hadoop project. Apparently Microsoft has DryadLINQ.

We are on the exponential curve as it swoops upward dramatically. Thanks to the power and flexibility of open source, anyone can use Google’s secret sauce on Amazon’s computers for 18 cents per gigabyte and 10 cents per computing hour.

Enrico Bertini at Visuale asks how important interactivity is in information visualization. As a proponent of QlikView, Spotfire, Tableau and others, I think it’s extremely important. Interactivity is the future; it’s “make or break.”

I’ve been implementing speed-of-thought interactive BI tools for 6 years and I don’t want to do it any other way. When I watched my first seasoned executive lose restraint and laugh uncontrollably as he got instant answers to his hardest questions, I knew this was the only way to go. When my end-user training sessions end late because everyone is so excited about what they can do, it’s clear that people NEED interactivity.

I finally got around to watching the Tableau 3.0 webinar. I agree with their very excited presenter that Tableau 3.0 is a leap forward. The support of ad-hoc grouping of dimension elements is excellent as is the enhanced support of ad-hoc sets. The annotations look good and act sensibly. Generally, the new features are focused on ease of use, better statistical analysis, and report clarity. All good things. Here are 3.0 examples.

Annotations should be required in every BI tool. The ability to mark reference lines and data points on graphs and tables is critical to clear communication. Placing an annotation on a point in space does not require a data point to exist there, another nice feature. The smart BI vendors are focusing on collaboration and communication among users.

“Groups” take their name from the “groups” of 2.x, which are now the “sets” of 3.0. They can be used like so: similar dimension values such as coffee and tea, which may need to be represented in the database as separate product lines, can now be combined on the fly within Tableau by an end user under the simple heading “drinks”. This makes it easy to answer a question about food vs drink sales without exporting to Excel and spending more time adding up the drink categories. In short, “groups” bring dimension values together and “sets” separate special values from the rest of a dimension’s values, and both can be done by the end user. Pretty nice.
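The difference between the two is easy to see in a few lines of Python. This is only my illustration of the concept, not anything Tableau exposes; the sales figures, the `GROUPS` mapping, the `TOP_SELLERS` set and the `total_by` helper are all invented.

```python
# Hypothetical sales rows (product line, amount); invented for illustration.
SALES = [
    ("coffee", 120.0),
    ("tea", 80.0),
    ("bagel", 50.0),
    ("muffin", 30.0),
]

# A "group" maps several dimension values onto one ad-hoc heading.
GROUPS = {"coffee": "drinks", "tea": "drinks", "bagel": "food", "muffin": "food"}

# A "set" singles out special values from the rest of the dimension.
TOP_SELLERS = {"coffee"}


def total_by(rows, label):
    """Sum amounts after relabeling each product with label(product)."""
    totals = {}
    for product, amount in rows:
        key = label(product)
        totals[key] = totals.get(key, 0.0) + amount
    return totals


by_group = total_by(SALES, GROUPS.get)
by_set = total_by(SALES, lambda p: "in set" if p in TOP_SELLERS else "rest")
print(by_group)
print(by_set)
```

The point of the feature is that the end user defines the mapping and the set on the fly, with no change to the underlying database.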

I think the strongest competitor for visualization is Spotfire. However, Tableau’s use of live database interaction will become an advantage as data warehouse implementations shift to high-performance in-memory read-optimized databases. Was that over-hyphenated? Spotfire’s initial data loads are inflexible and I wouldn’t recommend it if you need to update a large dataset frequently.

Unlike QlikView, Tableau needs all of its data in a single database. With good design, this is not a performance issue. The problem is that the extra expense of hardware and software to store a separate data warehouse and run ETL processing may push Tableau’s final price tag far above that of QlikView, which can easily pull from multiple sources and uses its own high-speed database.

My favorite part of the Juice Analytics presentation (PDF) is the rundown of the essential BI toolset and the examples they chose such as Yahoo Pipes, Baby Name Voyager and We Feel Fine. Great job!