In my world, which is corporate software systems, I have a transactional database that is usually in second normal form and has very few aggregates. Building reports directly against it means joining at least 4 tables, often 8, and sometimes as many as 12. Unfortunately, the new crop of data warehouse vendors has made it very difficult to grasp how well they handle this. Some of these products take your data model as-is, and some expect star/snowflake schemas, which adds a layer of design, coding, testing, validation, and ongoing maintenance.
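To make that concrete, here is a minimal sketch of the kind of report I mean, using Python's built-in SQLite. The table names (regions, customers, products, orders, order_lines) and the data are entirely made up; the point is that even a simple "revenue by region" number fans out into a 5-way join on a normalized schema, and real reports go wider.

```python
import sqlite3

# Hypothetical normalized schema: each attribute of the report lives in
# its own table, so even a simple report has to stitch them back together.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions   (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT,
                            region_id INTEGER REFERENCES regions);
    CREATE TABLE products  (product_id INTEGER PRIMARY KEY, name TEXT,
                            unit_price REAL);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                            customer_id INTEGER REFERENCES customers);
    CREATE TABLE order_lines (order_id INTEGER REFERENCES orders,
                              product_id INTEGER REFERENCES products,
                              quantity INTEGER);

    INSERT INTO regions     VALUES (1, 'east');
    INSERT INTO customers   VALUES (1, 'Acme', 1);
    INSERT INTO products    VALUES (1, 'widget', 9.99);
    INSERT INTO orders      VALUES (1, 1);
    INSERT INTO order_lines VALUES (1, 1, 3);
""")

# A 5-way join just to answer "revenue by region" -- and this is the
# small case; the 8- and 12-way joins follow the same pattern.
report = conn.execute("""
    SELECT r.name AS region, SUM(ol.quantity * p.unit_price) AS revenue
    FROM order_lines ol
    JOIN orders    o ON o.order_id    = ol.order_id
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN regions   r ON r.region_id   = c.region_id
    JOIN products  p ON p.product_id  = ol.product_id
    GROUP BY r.name
""").fetchall()
print(report)  # [('east', 29.97)]
```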
Netezza, Greenplum, and Vertica all use off-the-shelf interconnects, meaning 1 gigabit Ethernet in most cases. Moving large amounts of data between nodes over Ethernet can easily unravel any gains from parallelism. In a naive design, joining an evenly distributed dataset would require every node to talk to every other node. With multiple joins, this creates a series of network bottlenecks, and it also leans heavily on synchronization across the distributed system.
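A back-of-envelope calculation shows why this worries me. Every number below is an assumption for illustration (cluster size, table size, achievable fraction of 1 Gb Ethernet wire speed), not a vendor measurement; what matters is the shape of the cost, which grows with each join that forces a redistribution.

```python
# Rough cost model for redistributing a table across nodes before a join.
# All figures are illustrative assumptions, not vendor benchmarks.

NODES = 8             # assumed cluster size
FACT_TABLE_GB = 100.0 # assumed table that must be repartitioned for a join
WIRE_MB_S = 125.0     # 1 Gb Ethernet is ~125 MB/s before overhead
EFFICIENCY = 0.6      # assumed fraction of wire speed actually achieved

def shuffle_seconds(table_gb: float, nodes: int) -> float:
    """Elapsed time for an all-to-all redistribution of table_gb.

    Each node holds table_gb / nodes and, on a rehash, must ship the
    fraction (nodes - 1) / nodes of its slice to other nodes. Nodes
    send in parallel, so elapsed time is one node's send time.
    """
    per_node_gb = table_gb / nodes
    sent_gb = per_node_gb * (nodes - 1) / nodes
    return (sent_gb * 1024) / (WIRE_MB_S * EFFICIENCY)

one_join = shuffle_seconds(FACT_TABLE_GB, NODES)
print(f"one redistribution: ~{one_join:.0f} s")            # ~149 s
print(f"eight such joins:   ~{8 * one_join / 60:.1f} min") # ~19.9 min
```

Even if the real fabric is faster than my assumptions, the cost scales linearly with the number of joins that cannot run co-located, which is exactly the multi-join reporting case above.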
Vertica is a star/snowflake product. Its distributed design replicates the dimension tables on each node and partitions the fact table. Vertica says it has customers that run more transactional models, but what does that mean for overall performance? Greenplum’s website says: “Utilizes pipelining techniques and redistributes data among nodes for high performance execution of complex joins.” Encouraging, but what counts as “complex,” and what will this do to my network under real-world conditions?
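Here is a toy model of the replicate-dimensions scheme as I understand it from Vertica's materials; it is my sketch, not their engine, and the tables and data are invented. The appeal is clear: with full dimension copies on every node, each node joins its fact slice locally and no rows cross the network until the small per-node results are gathered.

```python
# Toy model of "replicate dimensions, partition the fact table".
# Illustrative only; this is not Vertica's actual implementation.

# Dimension tables: replicated in full on every node.
products = {1: "widget", 2: "gadget"}
regions  = {10: "east", 20: "west"}

# Fact rows (product_id, region_id, amount), hash-partitioned across nodes.
fact = [(1, 10, 5.0), (2, 10, 3.0), (1, 20, 7.0), (2, 20, 2.0)]
NODES = 2
partitions = [[row for row in fact if hash(row) % NODES == n]
              for n in range(NODES)]

def node_join(partition):
    """Each node joins its fact slice against its local dimension copies.
    No fact row leaves its node, so the join itself costs zero network."""
    return [(products[p], regions[r], amt) for p, r, amt in partition]

# Only the small per-node results are gathered at the end.
results = [row for part in partitions for row in node_join(part)]
print(results)
```

The scheme breaks down once a join involves two tables that are each too large to replicate; at that point rows have to be redistributed over the interconnect, which is presumably what Greenplum's "redistributes data among nodes" refers to, and we are back to the network-cost question above.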
If you have any thoughts to share, please add them to the comments.