In my world, which is corporate software systems, I have a transactional database that is usually in second normal form and has very few aggregates. Building reports directly means joining at least 4 tables, often 8, and sometimes as many as 12. Unfortunately, the new crop of data warehouse vendors have made it very difficult to grasp how well they handle this. Some of these products handle your datamodel as-is, and some expect star/snowflake schemas, which adds a layer of design, coding, testing, validation and additional maintenance.
Netezza, Greenplum and Vertica all use off-the-shelf interconnects, meaning 1 gigabit ethernet in most cases. Transferring large amounts of data from a distributed system over ethernet can easily unravel any gains. In a simplistic design, an evenly distributed dataset would require that every node talks to every other node. With multiple joins, this would create a series of bottlenecks. It would also rely heavily on synchronization across the distributed system.
Vertica is a star/snowflake product. The Vertica distributed system replicates the dimension tables on each node and partitions the fact table. Vertica says that they have customers that use more transactional models, but what does that mean for overall performance? Greenplum’s website says: “Utilizes pipelining techniques and redistributes data among nodes for high performance execution of complex joins.” Encouraging, but what is considered “complex” and what will this do to my network in real-world conditions?
If you have any thoughts to share, please add them to the comments.

This is a complex topic – especially when considering a shared nothing database system with, typically, two layers of query optimization.
The first layer sorts out how to run the query across the various nodes in the system, taking into account how tables are distributed and/or replicated and how much network traffic will be generated.
The second layer plans the query at the node level.
In building query plans for joins with large numbers of tables, the keys are to:
- minimize movement between the nodes
- plan the query quickly while still coming up with a good plan (this gets exponentially more complex as the number of tables and/or indexes increases)
- ensure the join order minimizes I/O and maximizes RAM usage
We’ve done a lot of work in all of these areas to ensure that complex joins can be handled very efficiently.
One big advantage of a system such as DATAllegro is that indexes aren’t generally used, especially in joins. Also, we generally force hash joins. As a result, the search space for a query plan is much smaller than a traditional DBMS.
In addition, DATAllegro uses a very high speed network for inter-node communication (10Gbps Infiniband) and has a lot of I/O bandwidth on each node. This greatly reduces the cost of any data movements that are required.
Overall, we feel that our ability to handle complex joins is actually much better than traditional DBMS platforms – and our appliance competitors.
Stuart Frost
CEO, DATAllegro
Hi Jay,
With Greenplum it just works. We do not require a denormalized or specialized schema.
WRT interconnect bandwidth, on the Sun appliance we’ve seen many situations where a random distribution on the relations is equal or very close in performance to a “tuned” distribution, which also is true for intermediate redistributions. That’s a result of keeping the duplex interconnect bandwidth in a good ratio with the on-host simplex disk bandwidth.
Our cost based optimizer takes interconnect traffic into account and will choose a plan to suit.
More interesting question: how about 12-way joins between internal tables and external tables for ELT (Extract Load and Transform)? We have an external table capability that allows you to use external data as tables in full parallel mode. That means that you can do one step ELT for maintaining data in the warehouse. We’ve actually done 12-way joins for ELT queries and they work great.
Last point: why not try it yourself – you can download our software at gpn.greenplum.com.
BTW – I was shocked when I first heard that Vertica only allows one fact table – that’s far too limiting to be useful. No wonder we beat them in every POC
Since there was a recent tweet about this column, I thought I point out that the Vertica information is quite dated. Vertica has supported any type of schema for years and is no longer restricted in the ways mentioned above. We have customers in production with thousands of tables in their schema. We also have over 130 customers, so we clearly can’t be losing every POC.
Pingback: Tweets that mention How well do Netezza, Greenplum, Vertica and others handle 12-way joins? -- Topsy.com