Notes: YSmart: Yet Another SQL-to-MapReduce Translator - 过雁

Introduction
Sample query: "what is the average number of pages a user visits between a page in category X and a page in category Y?"

Background. A concise introduction to how MapReduce (MR) and Hive work.
III. CORRELATION-AWARE MAPREDUCE: AN OVERVIEW
Why correlation awareness matters: handling intermediate results costs far more in MR than in a DBMS, so packing several operations into a single MR job is more efficient. In the paper's words: the way of executing multiple operations in a single job (many-to-one), if possible, could be a much more effective choice than the one-to-one translation.
IV. INTRA-QUERY CORRELATIONS AND THEIR OPTIMIZATION PRINCIPLES

Input Correlation: Multiple nodes have input correlation (IC) if their input relation sets are not disjoint.
Two such operations can share a single table scan.
Transit Correlation: Multiple nodes have transit correlation (TC) if they have not only input correlation, but also the same partition key.
Their data overlaps, so executing them separately incurs redundant I/O.
Job Flow Correlation: A node has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.
The later MR job can be executed directly in the reduce phase of the preceding one.
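
The three definitions above can be sketched as simple checks on plan nodes. This is only an illustration: the Node class, its field names, and the example relation and key names (clicks, uid, page_id) are assumptions made for this note, not structures from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A plan node. All field names here are illustrative assumptions."""
    name: str
    inputs: frozenset            # names of the node's input relations
    partition_key: str           # key used to partition the map output
    children: list = field(default_factory=list)


def has_input_correlation(a, b):
    # IC: the input relation sets are not disjoint.
    return not a.inputs.isdisjoint(b.inputs)


def has_transit_correlation(a, b):
    # TC: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def job_flow_children(parent):
    # JFC: the parent has the same partition key as the child.
    return [c for c in parent.children if c.partition_key == parent.partition_key]


# Hypothetical plan: three aggregations over the same table, one join on top.
agg1 = Node("AGG1", frozenset({"clicks"}), "uid")
agg2 = Node("AGG2", frozenset({"clicks"}), "uid")
agg3 = Node("AGG3", frozenset({"clicks"}), "page_id")
join1 = Node("JOIN1", frozenset({"AGG1.out", "AGG2.out"}), "uid", children=[agg1, agg2])

print(has_input_correlation(agg1, agg3))
print(has_transit_correlation(agg1, agg2))
print(has_transit_correlation(agg1, agg3))
print([c.name for c in job_flow_children(join1)])
```

In this made-up plan AGG1 and AGG3 only share a scan (IC), AGG1 and AGG2 also share the partition key (TC), and JOIN1 has JFC with both of its children.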

Aggregation with grouping. An aggregation node with grouping can be directly executed in the reduce function of its only child node.
A join node J1 has job flow correlation with only one of its child nodes C1. Thus as long as the job of another child node of this join node C2 has been completed, a single job is sufficient to execute both C1 and J1
A join node J1 has job flow correlation with two child nodes C1 and C2. Then, according to the correlation definitions, C1 and C2 must have both input correlation and transit correlation. Thus a single job is sufficient to execute both C1 and C2. Besides, J1 can also be directly executed in the reduce phase of the job

An Example of Correlation Query and Its Optimization
The SQL query and its original execution plan (3 MR jobs)

After YSmart:

Selection and projection. A SELECTION-PROJECTION (SP) job is used to execute a simple query with only selection and projection operations on a base relation.
Aggregation. An AGGREGATION (AGG) job is used to execute aggregation and grouping on an input relation.
Join. A JOIN job is used to execute an equi-join (inner or left/right/full outer) of two input relations.
Sort. A SORT job is used to execute a sorting operation.

Rule 1: If two jobs have input correlation and transit correlation, they will be merged into a common job.
Rule 2: An AGGREGATION job that has job flow correlation with its only preceding job will be merged into this preceding job, that is, into the preceding job's reduce phase.
Rule 3: For a JOIN job with job flow correlation with its two preceding jobs, the join operation will be merged into the reduce phase of the common job. In this case, there must be transit correlation between the two preceding jobs, and the two jobs have been merged into a common job in the first step. Based on this, the join operation can be put into the reduce phase of the common job.
Rule 4: For a JOIN job that has job flow correlation with only one of its two preceding jobs, merge the JOIN job with the preceding job with job flow correlation – which has to be executed later than the other one.

An Example of Job Merging
We assume that 1) JOIN1 and AGG2 have input correlation and transit correlation, 2) JOIN2 has job flow correlation with JOIN1 but not AGG1, and 3) JOIN3 has job flow correlation with both JOIN2 and AGG2. In the figure, we show the job number for each node.

A post-order traversal of the execution plan gives the job sequence {J1, J2, J3, J4, J5}. Applying Rule 1 gives {J1+4, J2, J3, J5}. Applying the other rules gives {J1+4, J2, J3+5}, and finally {J2, J1+4+3+5}.
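
The first merging step (Rule 1) on this job sequence can be sketched as follows. This is a minimal illustration, not YSmart's implementation: the function name merge_rule1 and the representation of correlations as a set of job-name pairs are assumptions, and the pair (J1, J4) is taken from the merged name J1+4 in the sequence above. Rules 2 to 4 are not covered by this sketch.

```python
def merge_rule1(jobs, correlated):
    """Group jobs that have both input correlation and transit correlation.

    jobs: job names in post-order.
    correlated: set of frozenset pairs of job names assumed to have IC and TC.
    Returns a list of groups; each group becomes one common job.
    """
    merged = []
    for job in jobs:
        for group in merged:
            # Join an existing group if the job is correlated with any member.
            if any(frozenset((job, member)) in correlated for member in group):
                group.append(job)
                break
        else:
            # No correlated group found: the job stays on its own for now.
            merged.append([job])
    return merged


jobs = ["J1", "J2", "J3", "J4", "J5"]
correlated = {frozenset(("J1", "J4"))}   # assumption: J1 and J4 have IC and TC
groups = merge_rule1(jobs, correlated)
print(groups)   # [['J1', 'J4'], ['J2'], ['J3'], ['J5']]
```

The printed grouping corresponds to {J1+4, J2, J3, J5}, the sequence after Rule 1.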

The first requirement is to provide a flexible framework to allow different types of MapReduce jobs
The second requirement is to execute multiple merged jobs in a common job with minimal overhead

CMF provides a general template based approach to generate a common job that can merge a collection of correlated jobs. The template has the following structures. The common mapper executes operations (selection and/or projection operations) involved in the map functions of merged jobs. The common reducer executes all the operations (e.g. join or aggregation) involved in the reduce functions of merged jobs. The post-job computation is a subcomponent in the common reducer to execute further computations on the outputs of merged jobs.
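
The reducer side of this template might look roughly like the sketch below. It assumes, purely for illustration, that each value arrives tagged with the IDs of the merged jobs that need it; the function name common_reduce, the tagging format, and the two example jobs are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def common_reduce(key, tagged_values, reducers, post_job):
    """Run the reduce logic of every merged job for one key.

    tagged_values: iterable of (job_ids, value) pairs for this key.
    reducers: dict mapping job_id -> function(key, values) -> result.
    post_job: function(key, results_by_job) -> final output.
    """
    per_job = defaultdict(list)
    for job_ids, value in tagged_values:
        for job_id in job_ids:
            # Dispatch the value to every merged job that selected it.
            per_job[job_id].append(value)
    # Execute each merged job's reduce function on its own share of the values.
    results = {job_id: fn(key, per_job[job_id]) for job_id, fn in reducers.items()}
    # Post-job computation: combine the outputs of the merged jobs.
    return post_job(key, results)


# Two hypothetical merged aggregations on the same key, combined afterwards.
reducers = {
    "AGG1": lambda key, vals: sum(vals),
    "AGG2": lambda key, vals: len(vals),
}
post_job = lambda key, res: (key, res["AGG1"], res["AGG2"])
out = common_reduce("u1", [({"AGG1", "AGG2"}, 3), ({"AGG1"}, 5)], reducers, post_job)
print(out)   # ('u1', 8, 1)
```

Here the post-job step plays the role of a join on the shared key: it sees both aggregation results for the same key in one reduce call, so no extra job is needed.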

Common Mapper
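It reads one line of data and then produces key/value pairs for all the merged jobs. Because the merged jobs have different selection conditions, the common mapper has to record which jobs each piece of data belongs to.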
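
A minimal sketch of that idea, assuming the merged jobs share one partition key and that a set of job IDs is attached to each emitted value. The function common_map, the tag format, the job names J1 and J4, and the click records are all made up for illustration.

```python
def common_map(record, key_fn, predicates):
    """Emit at most one key/value pair per input record.

    key_fn: extracts the shared partition key from a record.
    predicates: dict mapping job_id -> selection predicate of that merged job.
    The value is tagged with the IDs of the jobs whose selection it passes.
    """
    job_ids = frozenset(j for j, pred in predicates.items() if pred(record))
    if job_ids:
        # Records that no merged job selects are dropped in the map phase.
        yield key_fn(record), (job_ids, record)


clicks = [
    {"uid": "u1", "category": "X"},
    {"uid": "u1", "category": "Z"},
    {"uid": "u2", "category": "Y"},
]
predicates = {
    "J1": lambda r: r["category"] == "X",
    "J4": lambda r: r["category"] in ("X", "Y"),
}
pairs = [kv for r in clicks for kv in common_map(r, lambda r: r["uid"], predicates)]
for key, (job_ids, value) in pairs:
    print(key, sorted(job_ids), value)
```

Emitting one tagged pair instead of one pair per job is a design choice of this sketch: the table is scanned once and a shared record is written once, which is where the saving from merging comes from.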

posted on 2015-04-27 14:45 过雁
