Join operators

So, now that we know how to get our data, let’s join it!

I’ll present the 3 common join operators: Merge Join, Hash Join and Nested Loop Join. But before that, I need to introduce new vocabulary: inner relation and outer relation. A relation can be:

  • a table
  • an index
  • an intermediate result from a previous operation (for example the result of a previous join)

When you’re joining two relations, the join algorithms manage the two relations differently. In the rest of the article, I’ll assume that:

  • the outer relation is the left data set
  • the inner relation is the right data set

For example, A JOIN B is the join between A and B where A is the outer relation and B the inner relation.

Most of the time, the cost of A JOIN B is not the same as the cost of B JOIN A.

In this part, I’ll also assume that the outer relation has N elements and the inner relation M elements. Keep in mind that a real optimizer knows the values of N and M thanks to the statistics.

Note: N and M are the cardinalities of the relations.

Nested loop join

The nested loop join is the easiest one.

Here is the idea:

  • for each row in the outer relation
  • you look at all the rows in the inner relation to see if there are rows that match

Here is the pseudocode:

nested_loop_join(array outer, array inner)
  foreach row a in outer
    foreach row b in inner
      if (match_join_condition(a,b))
        write_result_in_output(a,b)
      end if
    end for
  end for

Since it’s a double iteration, the time complexity is O(N*M)

In terms of disk I/O, for each of the N rows in the outer relation, the inner loop needs to read M rows from the inner relation. This algorithm needs to read N + N*M rows from disk. But, if the inner relation is small enough, you can put the whole relation in memory and end up with just M + N reads. For this modification to work, the inner relation must be the smaller one, since it has a better chance of fitting in memory.
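For example (with made-up numbers), if the outer relation has N = 100,000 rows and the inner relation M = 10,000 rows, the row-by-row version reads 100,000 + 100,000*10,000 rows, i.e. roughly 1 billion rows from disk, whereas keeping the inner relation in memory brings that down to about 110,000 reads.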

In terms of time complexity it makes no difference, but in terms of disk I/O it’s way better to read each relation only once.

Of course, the inner relation can be replaced by an index, which will be better for the disk I/O.
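To give a feel for this, here is a minimal Python sketch where a sorted list searched with binary search plays the role of an index (a stand-in for a B+Tree) on the join key of the inner relation; the function names and layout are made up for the example:

import bisect

def index_nested_loop_join(outer, inner, outer_key, inner_key):
    # Build a sorted "index" on the inner relation's join key.
    # Binary search on it stands in for a B+Tree lookup: log(M) per outer row
    # instead of a full scan of the inner relation.
    index = sorted(((inner_key(row), row) for row in inner), key=lambda kr: kr[0])
    keys = [k for k, _ in index]

    result = []
    for row in outer:
        k = outer_key(row)
        i = bisect.bisect_left(keys, k)
        while i < len(keys) and keys[i] == k:  # all inner rows sharing this key
            result.append((row, index[i][1]))
            i += 1
    return result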

Since this algorithm is very simple, here is another version that is more disk-I/O friendly when the inner relation is too big to fit in memory. Here is the idea:

  • instead of reading both relations row by row,
  • you read them bunch by bunch and keep 2 bunches of rows (one from each relation) in memory,
  • you compare the rows inside the two bunches and keep the rows that match,
  • then you load new bunches from disk and compare them
  • and so on until there are no bunches to load.

Here is a possible algorithm:

// improved version to reduce the disk I/O.
nested_loop_join_v2(file outer, file inner)
  foreach bunch ba in outer
    // ba is now in memory
    foreach bunch bb in inner
      // bb is now in memory
      foreach row a in ba
        foreach row b in bb
          if (match_join_condition(a,b))
            write_result_in_output(a,b)
          end if
        end for
      end for
    end for
  end for

With this version, the time complexity remains the same, but the number of disk accesses decreases:

  • With the previous version, the algorithm needs N + N*M accesses (each access gets one row).
  • With this new version, the number of disk accesses becomes number_of_bunches_for(outer) + number_of_bunches_for(outer) * number_of_bunches_for(inner).
  • If you increase the size of the bunch you reduce the number of disk accesses.

Note: Each disk access gathers more data than in the previous algorithm, but it doesn’t matter since they’re sequential accesses (the real issue with mechanical disks is the time to get the first data).
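To put numbers on it (again made up), if both relations hold 100,000 rows and a bunch holds 1,000 rows, each relation is read as 100 bunches, so the algorithm needs 100 + 100*100 = 10,100 disk accesses instead of the roughly 10 billion row-by-row accesses of the first version.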

Hash join

The hash join is more complicated but gives a better cost than a nested loop join in many situations.

The idea of the hash join is to:

  • 1) Get all elements from the inner relation
  • 2) Build an in-memory hash table
  • 3) Get all elements of the outer relation one by one
  • 4) Compute the hash of each element (with the hash function of the hash table) to find the associated bucket of the inner relation
  • 5) Find if there is a match between the elements in the bucket and the element of the outer relation (see the sketch after this list)
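Here is a minimal Python sketch of these 5 steps, using a Python dict as the in-memory hash table (the sample relations and key functions are invented for the example):

from collections import defaultdict

def hash_join(outer, inner, outer_key, inner_key):
    # Build phase: hash every row of the inner relation into buckets
    # keyed by its join attribute.
    buckets = defaultdict(list)
    for row in inner:
        buckets[inner_key(row)].append(row)

    # Probe phase: for each outer row, only the matching bucket is scanned.
    result = []
    for row in outer:
        for match in buckets.get(outer_key(row), []):
            result.append((row, match))
    return result

# Invented sample data: join people to their countries on country_id
people = [{"name": "Alice", "country_id": 1}, {"name": "Bob", "country_id": 2}]
countries = [{"country_id": 1, "name": "France"}, {"country_id": 2, "name": "Peru"}]
print(hash_join(people, countries,
                lambda r: r["country_id"], lambda r: r["country_id"]))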

In terms of time complexity I need to make some assumptions to simplify the problem:

  • The inner relation is divided into X buckets
  • The hash function distributes hash values almost uniformly for both relations. In other words the buckets are equally sized.
  • The matching between an element of the outer relation and all the elements inside a bucket costs the number of elements inside the bucket.

The time complexity is (M/X) * N + cost_to_create_hash_table(M) + cost_of_hash_function*N

If the hash function creates enough small buckets, then the time complexity is O(M+N)
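For example, if the M = 10,000 rows of the inner relation are spread over X = 1,000 equally sized buckets, each probe only scans about 10 rows, so probing the N = 100,000 rows of the outer relation costs roughly 1,000,000 comparisons (plus the cost of building the hash table and of hashing the N outer rows), instead of the N*M = 1,000,000,000 comparisons of a plain nested loop join.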

Here is another version of the hash join which is more memory friendly but less disk I/O friendly. This time:

  • 1) you compute the hash tables for both the inner and outer relations
  • 2) then you put them on disk
  • 3) then you compare the 2 relations bucket by bucket (with one loaded in-memory and the other read row by row), as sketched below
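Here is a rough, purely in-memory Python sketch of this partitioned variant; in a real database each partition would be written to disk and only one pair of partitions would be loaded at a time (the partition count, key function and data layout are assumptions for the example):

from collections import defaultdict

def partitioned_hash_join(outer, inner, key, nb_partitions=16):
    # 1) Partition phase: split both relations with the same hash function,
    #    so matching rows always end up in partitions with the same number.
    outer_parts = defaultdict(list)
    inner_parts = defaultdict(list)
    for row in outer:
        outer_parts[hash(key(row)) % nb_partitions].append(row)
    for row in inner:
        inner_parts[hash(key(row)) % nb_partitions].append(row)

    # 2) Join phase: the partitions are joined pair by pair; each inner
    #    partition is hashed in memory and probed with the rows of the
    #    corresponding outer partition.
    result = []
    for p in range(nb_partitions):
        lookup = defaultdict(list)
        for row in inner_parts[p]:
            lookup[key(row)].append(row)
        for row in outer_parts[p]:
            for match in lookup.get(key(row), []):
                result.append((row, match))
    return result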

Merge join

The merge join is the only join that produces a sorted result.

Note: In this simplified merge join, there are no inner or outer tables; they both play the same role. But real implementations make a difference, for example, when dealing with duplicates.

The merge join can be divided into two steps:

  1. (Optional) Sort operation: both inputs are sorted on the join key(s).
  2. Merge join operation: The sorted inputs are merged together.

Sort

We already spoke about the merge sort; in this case a merge sort is a good algorithm (but not the best one if memory is not an issue).

But sometimes the data sets are already sorted, for example:

  • If the table is natively ordered, for example an index-organized table on the join condition
  • If the relation is an index on the join condition
  • If this join is applied on an intermediate result already sorted during the process of the query

Merge join

This part is very similar to the merge operation of the merge sort we saw. But this time, instead of picking every element from both relations, we only pick the elements from both relations that are equal. Here is the idea:

  • 1) you compare both current elements in the 2 relations (current=first for the first time)
  • 2) if they’re equal, then you put both elements in the result and you go to the next element for both relations
  • 3) if not, you go to the next element for the relation with the lowest element (because the next element might match)
  • 4) and repeat 1,2,3 until you reach the last element of one of the relations.

This works because both relations are sorted and therefore you don’t need to “go back” in these relations.
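Here is a minimal Python sketch of this merge step for the simplified case where each value appears at most once per relation (the inputs are plain sorted lists of join keys):

def merge_join(a, b):
    # a and b must already be sorted on the join key
    result = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append((a[i], b[j]))  # match: keep the pair...
            i += 1                       # ...and advance both relations
            j += 1
        elif a[i] < b[j]:
            i += 1                       # advance the relation with the lowest current element
        else:
            j += 1
    return result

print(merge_join([1, 3, 4, 7], [1, 2, 4, 8]))  # -> [(1, 1), (4, 4)]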

This algorithm is a simplified version because it doesn’t handle the case where the same data appears multiple times in both relations (in other words, multiple matches). The real version is more complicated “just” for this case; this is why I chose a simplified version.

If both relations are already sorted then the time complexity is O(N+M)

If both relations need to be sorted then the time complexity is the cost to sort both relations: O(N*Log(N) + M*Log(M))

For the CS geeks, here is a possible algorithm that handles the multiple matches (note: I’m not 100% sure about my algorithm):

mergeJoin(relation a, relation b)
  relation output
  integer a_key:=0;
  integer b_key:=0;

  while (a[a_key]!=null and b[b_key]!=null)
    if (a[a_key] < b[b_key])
      a_key++;
    else if (a[a_key] > b[b_key])
      b_key++;
    else //Join predicate satisfied
      //i.e. a[a_key] == b[b_key]

      //count the number of duplicates in relation a
      integer nb_dup_in_a = 1;
      while (a[a_key]==a[a_key+nb_dup_in_a])
        nb_dup_in_a++;

      //count the number of duplicates in relation b
      integer nb_dup_in_b = 1;
      while (b[b_key]==b[b_key+nb_dup_in_b])
        nb_dup_in_b++;

      //write all the combinations of duplicates in output
      for (int i = 0; i < nb_dup_in_a; i++)
        for (int j = 0; j < nb_dup_in_b; j++)
          write_result_in_output(a[a_key+i], b[b_key+j])

      //jump over the duplicates in both relations
      a_key = a_key + nb_dup_in_a;
      b_key = b_key + nb_dup_in_b;
    end if
  end while

Which one is the best?

If there were a best type of join, there wouldn’t be multiple types. This question is very difficult because many factors come into play, like:

  • The amount of free memory: without enough memory you can say goodbye to the powerful hash join (at least the fully in-memory hash join)
  • The size of the 2 data sets. For example, if you’re joining a big table with a very small one, a nested loop join will be faster than a hash join because the hash join has an expensive creation of hashes. If you’re joining 2 very large tables, the nested loop join will be very CPU expensive.
  • The presence of indexes. With 2 B+Tree indexes, the smart choice seems to be the merge join.
  • If the result needs to be sorted: even if you’re working with unsorted data sets, you might want to use a costly merge join (with the sorts) because at the end the result will be sorted and you’ll be able to chain it with another merge join (or maybe because the query asks implicitly/explicitly for a sorted result with an ORDER BY/GROUP BY/DISTINCT operation)
  • If the relations are already sorted: in this case the merge join is the best candidate
  • The type of join you’re doing: is it an equijoin (i.e.: tableA.col1 = tableB.col2)? Is it an inner join, an outer join, a cartesian product or a self-join? Some joins can’t work in certain situations.
  • The distribution of data: if the data on the join condition are skewed (for example you’re joining people on their last name but many people share the same one), using a hash join will be a disaster because the hash function will create ill-distributed buckets.
  • If you want the join to be executed by multiple threads/processes

For more information, you can read the DB2, ORACLE or SQL Server documentation.
