Simplified example
We’ve just seen 3 types of join operations.
Now let’s say we need to join 5 tables to have a full view of a person. A PERSON can have:
- multiple MOBILES
- multiple MAILS
- multiple ADRESSES
- multiple BANK_ACCOUNTS
In other words we need a quick answer for the following query:
SELECT*fromPERSON, MOBILES, MAILS,ADRESSES, BANK_ACCOUNTSWHEREPERSON.PERSON_ID = MOBILES.PERSON_IDANDPERSON.PERSON_ID = MAILS.PERSON_IDANDPERSON.PERSON_ID = ADRESSES.PERSON_IDANDPERSON.PERSON_ID = BANK_ACCOUNTS.PERSON_ID |
---|
As a query optimizer, I have to find the best way to process the data. But there are 2 problems:
- What kind of join should I use for each join?
I have 3 possible joins (Hash Join, Merge Join, Nested Join) with the possibility to use 0,1 or 2 indexes (not to mention that there are different types of indexes).
- What order should I choose to compute the join?
For example, the following figure shows different possible plans for only 3 joins on 4 tables
So here are my possibilities:
- 1) I use a brute force approach
Using the database statistics, I compute the cost for every possible plan and I keep the best one. But there are many possibilities. For a given order of joins, each join has 3 possibilities: HashJoin, MergeJoin, NestedJoin. So, for a given order of joins there are 34 possibilities. The join ordering is a permutation problem on a binary tree and there are (2*4)!/(4+1)! possible orders. For this very simplified problem, I end up with 34*(2*4)!/(4+1)! possibilities.
In non-geek terms, it means 27 216 possible plans. If I now add the possibility for the merge join to take 0,1 or 2 B+Tree indexes, the number of possible plans becomes 210 000. Did I forget to mention that this query is VERY SIMPLE?
- 2) I cry and quit this job
It’s very tempting but you wouldn’t get your result and I need money to pay the bills.
- 3) I only try a few plans and take the one with the lowest cost.
Since I’m not superman, I can’t compute the cost of every plan. Instead, I can arbitrary choose a subset of all the possible plans, compute their costs and give you the best plan of this subset.
- 4) I apply smart rules to reduce the number of possible plans.
There are 2 types of rules:
I can use “logical” rules that will remove useless possibilities but they won’t filter a lot of possible plans. For example: “the inner relation of the nested loop join must be the smallest data set”
I accept not finding the best solution and apply more aggressive rules to reduce a lot the number of possibilities. For example “If a relation is small, use a nested loop join and never use a merge join or a hash join”
In this simple example, I end up with many possibilities. But a real query can have other relational operators like OUTER JOIN, CROSS JOIN, GROUP BY, ORDER BY, PROJECTION, UNION, INTERSECT, DISTINCT … which means even more possibilities.
So, how a database does it?