数据库教程二 - 查询处理之连接算法 - 技术笔记_jsks

查询处理中用到的算法

参考《database system concepts》

Selection
A1 (linear search
A2 (binary search).
A3 (primary index on candidate key, equality).
A4 (primary index on nonkey, equality)
A5 (equality on search-key of secondary index)
A6 (primary index, comparison)
A7 (secondary index, comparison)
A8 (conjunctive selection using one index)
A9 (conjunctive selection using multiple-key index)
A10 (conjunctive selection by intersection of identifiers)
A11 (disjunctive selection by union of identifiers)

Sort
External Sort Merge

Join
Nested-loop join
Block nested-loop join
Indexed nested-loop join
Merge-join
Hash-join

Nested loops, Hash join and Sort Merge joins – difference?
转载 http://oracle-online-help.blogspot.com/2007/03/nested-loops-hash-join-and-sort-merge.html

1. 嵌套循环连接Nested loop Join
例子:
Select tab1.*, tab2.* from tabl, tab2 where tabl.col1=tab2.col2;

查询处理的步骤大致如下
For i in (select * from tab1) loop
　　For j in (select * from tab2 where col2=i.col1) loop
　　　　Display results;
　　End loop;
End loop;

执行包含一下几个步骤
a) 决定那个作为外层表，哪个作为内层表，一般能一次放入内存的表作为内存表，这样就只需读取磁盘一
b) 一次循环，外层表的一行与内层表每一行比较
c) 重复步骤b

执行计划大致可以表示成如下
NESTED LOOPS
outer_loop
inner_loop

什么时候使用嵌套循环连接
（1）连接的表行数较少
（2）内层循环的表的连接属性上有索引，内层表尽可能的放入内存，避免多次磁盘读写

优化器什么情况下不使用Nested Loop
   1. No of rows of both the table is quite high
   2. Inner query always results in same set of records
   3. The access path of inner table is independent of data coming from outer table.

Note: You will see more use of nested loop when using FIRST_ROWS optimizer mode as it works on model of showing instantaneous results to user as they are fetched. There is no need for selecting caching any data before it is returned to user. In case of hash join it is needed and is explained below.

散列连接 Hash join
Hash joins are used when the joining large tables. The optimizer uses smaller of the 2 tables to build a hash table in memory and the scans the large tables and compares the hash value (of rows from large table) with this hash table to find the joined rows.

The algorithm of hash join is divided in two parts
   1. Build a in-memory hash table on smaller of the two tables.
   2. Probe this hash table with hash value for each row second table

In simpler terms it works like
Build phase
For each row RW1 in small (left/build) table loop
　　Calculate hash value on RW1 join key
　　Insert RW1 in appropriate hash bucket.
End loop;

Probe Phase
For each row RW2 in big (right/probe) table loop
　　Calculate the hash value on RW2 join key
　　For each row RW1 in hash table loop
　　　　If RW1 joins with RW2
　　　　　　Return RW1, RW2
　　End loop;
End loop;

When optimizer uses hash join?
Optimizer uses has join while joining big tables or big fraction of small tables.
Unlike nested loop, the output of hash join result is not instantaneous as hash joining is blocked on building up hash table.

Note: You may see more hash joins used with ALL_ROWS optimizer mode, because it works on model of showing results after all the rows of at least one of the tables are hashed in hash table.

排序归并连接 Sort merge join
Sort merge join is used to join two independent data sources. They perform better than nested loop when the volume of data is big in tables but not as good as hash joins in general.

They perform better than hash join when the join condition columns are already sorted or there is no sorting required.
The full operation is done in two parts:

    * Sort join operation
get first row RW1 from input1
get first row RW2 from input2.

    * Merge join operation
while not at the end of either input loop
　　if RW1 joins with RW2
　　　　get next row R2 from input 2
　　　　return (RW1, RW2)
　　else if RW1 < style="">
　　　　get next row RW1 from input 1
　　else
　　　　get next row RW2 from input 2
end loop

Note: If the data is already sorted, first step is avoided.
Important point to understand is, unlike nested loop where driven (inner) table is read as many number of times as the input from outer table, in sort merge join each of the tables involved are accessed at most once. So they prove to be better than nested loop when the data set is large.

When optimizer uses Sort merge join?
a) When the join condition is an inequality condition (like <, <=, >=). This is because hash join cannot be used for inequality conditions and if the data set is large, nested loop is definitely not an option.
b) If sorting is anyways required due to some other attribute (other than join) like “order by”, optimizer prefers sort merge join over hash join as it is cheaper.

Note: Sort merge join can be seen with both ALL_ROWS and FIRST_ROWS optimizer hint because it works on a model of first sorting both the data sources and then start returning the results. So if the data set is large and you have FIRST_ROWS as optimizer goal, optimizer may prefer sort merge join over nested loop because of large data. And if you have ALL_ROWS as optimizer goal and if any inequality condition is used the SQL, optimizer may use sort-merge join over hash join

发表于 2012-05-07 14:06 技术笔记_jsks 阅读(532) 评论(0) 收藏举报

专注于自然预言处理，机器学习，大数据，文本处理，搜索引擎，股票金融

数据库教程二 - 查询处理之连接算法

公告