Joins and the Oracle Optimizer I (Nested loop joins and Hash Joins)

Posted: April 21, 2014 in Database Administration
Tags: Joins

Greetings!!

Last week, I was asked a question in an interview. What is the difference between a Hash Join and a Nested Loop Join? I can tell you honestly that I have never set out to create either join and maybe you haven’t either. But pull up an explain plan and you are bound to see both from time to time.

So, to answer this question properly, we must needs take a look at the Oracle Query Optimizer.

The Oracle Query Optimizer

To choose an execution plan for a join, the optimizer must make these interrelated decisions:

Access paths

The optimizer must choose an access path to retrieve data from each table in the join statement.

Join method

To join each pair of row sources, Oracle must perform a join operation. Join methods include nested loop, hash, sort merge and cartesian joins.

Join order

To execute a statement that joins more than two tables, Oracle joins two of the tables and then joins the resulting row source to the next table. This process is continued until all tables are joined into the result.

How the Query Optimizer Chooses an Execution Plan for Joins

Before an execution plan is chosen, the query optimizer first considers the following…

Will the join between two or more tables result in a row source containing at most one row? The optimizer determines this based on UNIQUE and PRIMARY KEY constraints on the tables. If the answer is yes, the optimizer places these tables first in the join order. Then the optimizer will optimize the join on the remaining tables.
If an outer join condition exists, the table with the outer join must come after the other table in the condition in the join order. The optimizer will not consider join orders that violate this rule.

The optimizer generates a set of execution plans, according to possible join orders, join methods and available access paths. The optimizer then estimates the cost of each plan and chooses the one with the lowest cost. How is this estimation done?

The cost of a nested loopoperation is based on the cost of reading each selected row of the outer table and each of its matching rows of the inner table in memory. The optimizer estimates these costs using the statistics in the data dictionary. It behooves you, as the DBA, to ensure your stats are up to date!!
The cost of a hash joinis based largely on the cost of building a hash table on one of the input sides to the join and using the rows from the other side of the join to probe it. (more on this later)
The cost of a sort merge join is based largely on the cost of reading all the sources into memory and sorting them.

There are other factors the optimizer considers such as sort area size and multi block read count size but this thread will stick to the joins.

**The optimizer’s choice of join order can be overridden with the ORDERED hint. However, if the ORDERED hint specifies a join order that violates the rul for an outer join, the optimizer will ignore the hint and choose the order. You can also override the optimizer’s choice of join method with hints.**

Nested Loop Joins

A nested loop join involves the following steps:

The optimizer determines the driving table and designates it as the outer table.
The other table is designated as the inner table.
For every row in the outer table, Oracle accesses all the rows in the inner table. The outer loop is for every row in the outer table and the inner loop is for every row in the inner table. The outer loop appears before the inner loop in the execution plan, as follows:
```
NESTED LOOPS 
  outer_loop 
  inner_loop
```

Nested loop joins are useful when small subsets of data are being joined and if the join condition is an efficient way of accessing the second table.

It is very important to ensure that the inner table is driven from (dependent on) the outer table. If the inner table’s access path is independent of the outer table, then the same rows are retrieved for every iteration of the outer loop, degrading performance considerably. In such cases, hash joins joining the two independent row sources perform better

Consider the following query:

SELECT e.first_name, e.last_name, e.salary, d.department_name
    FROM hr.employees e, hr.departments d
    WHERE d.department_name IN ('Marketing', 'Sales')
      AND e.department_id = d.department_id;

New Implementation for Nested Loop Joins

Oracle Database 11g Release 1 (11.1) introduces a new implementation for nested loop joins to reduce overall latency for physical I/O. When an index or a table block is not in the buffer cache and is needed to process the join, a physical I/O is required. In Oracle Database 11g Release 1 (11.1), Oracle Database can batch multiple physical I/O requests and process them using a vector I/O instead of processing them one at a time. As part of the new implementation for nested loop joins, two NESTED LOOPS join row sources might appear in the execution plan where only one would have appeared in prior releases. In such cases, Oracle Database allocates one NESTED LOOPS join row source to join the values from the table on the outer side of the join with the index on the inner side. A second row source is allocated to join the result of the first join, which includes the rowids stored in the index, with the table on the inner side of the join.

Consider the query in “Original Implementation for Nested Loop Joins”. In Oracle Database 11g Release 1 (11.1), with the new implementation for nested loop joins, the execution plan for this query might appear similar to the following execution plan:

------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name              | Rows  | Bytes | Cost(%CPU)| Time      |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                   |    19 |   722 |     3   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                |                   |       |       |            |          |
|   2 |   NESTED LOOPS               |                   |    19 |   722 |     3   (0)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL         | DEPARTMENTS       |     2 |    32 |     2   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | EMP_DEPARTMENT_IX |    10 |       |     0   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| EMPLOYEES         |    10 |   220 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("D"."DEPARTMENT_NAME"='Marketing' OR "D"."DEPARTMENT_NAME"='Sales')
   4 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")

In this case, the rows from the hr.departments table constitute the outer side of the first join. The inner side of the first join is the index emp_department_ix. The results of the first join constitute the outer side of the second join, which has the hr.employees table as its inner side.

There are cases where a second join row source is not allocated, and the execution plan looks the same as it did in prior releases. The following list describes such cases:

All of the columns needed from the inner side of the join are present in the index, and there is no table access required. In this case, Oracle Database allocates only one join row source.
The order of the rows returned might be different than it was in previous releases. Hence, when Oracle Database tries to preserve a specific ordering of the rows, for example to eliminate the need for an ORDER BY sort, Oracle Database might use the original implementation for nested loop joins.
The OPTIMIZER_FEATURES_ENABLE initialization parameter is set to a release before Oracle Database 11g Release 1 (11.1). In this case, Oracle Database uses the original implementation for nested loop joins.

When the Optimizer Uses Nested Loop Joins

The optimizer uses nested loop joins when joining a small number of rows, with a good driving condition between the two tables. You drive from the outer loop to the inner loop, so the order of tables in the execution plan is important.

The outer loop is the driving row source. the row source can be a table accessed sing an index scan or a full table scan. Also, the rows can be produced from any other operation. For example, the output from a nested loop join can be used as a row source for another nested loop join.

The inner loop is iterated for every row returned from the outer loop, ideally by an index scan. If the access path for the inner loop is not dependent on the outer loop, then you can end up with a Cartesian product; for every iteration of the outer loop, the inner loop produces the same set of rows. Therefore, you should use other join methods when two independent row sources are joined together.

Nested Loop Join Hints

If the optimizer is choosing to use some other join method, you can use the USE_NL(table1 table2) hint, where table1 and table2 are the aliases of the tables being joined. For some SQL examples, the data is small enough for the optimizer to prefer full table scans and use hash joins.

Hash Joins

Hash joins are used for joining large data sets. The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing the hash table to find the joined rows.

This method is best used when the smaller table fits in available memory. The cost is then limited to a single read pass over the data for the two tables.

When the Optimizer Uses Hash Joins

The optimizer uses a hash join to join two tables if they are joined using an equijoin and if either of the following conditions are true:

A large amount of data needs to be joined.
A large fraction of a small table needs to be joined.

The following example was taken (borrowed) from the Oracle Documentation on the Query Optimizer. In this example, the table ORDERS is used to build the hash table, and ORDER_ITEMS is the larger table, which is scanned later.

http://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i76073

SELECT o.customer_id, l.unit_price * l.quantity
  FROM orders o ,order_items l
 WHERE l.order_id = o.order_id;

--------------------------------------------------------------------------
| Id  | Operation            |  Name        | Rows  | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |              |   665 | 13300 |     8  (25)|
|*  1 |  HASH JOIN           |              |   665 | 13300 |     8  (25)|
|   2 |   TABLE ACCESS FULL  | ORDERS       |   105 |   840 |     4  (25)|
|   3 |   TABLE ACCESS FULL  | ORDER_ITEMS  |   665 |  7980 |     4  (25)|
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("L"."ORDER_ID"="O"."ORDER_ID")

**Apply the USE_HASH hint to instruct the optimizer to use a hash join when joining two tables together.** Use this link for hints for join operations.

http://docs.oracle.com/cd/B28359_01/server.111/b28274/hintsref.htm#CHDBAFID

This should be enough to chew on for one post. In part two, I plan to dive into Sort Merge Joins, Cartesian Joins and Outer Joins. Learning how and when to use joins is an extremely important task for tuning and maintaining your SQL as a DBA.

Questions?

Comments?

@dba_jay

Blogroll

Recent Posts

Archives

Blog Stats