Hive执行计划详解

什么是Hive SQL执行计划

Hive SQL执行计划描绘了SQL实际执行的整体轮廓,即SQL转化为对应计算引擎的执行逻辑;毫无疑问,这一块对于Hive SQL的优化是非常重要的。

Hive SQL早期是基于规则的方式生成执行计划,在Hive 0.14及之后,集成了Apache Calcite,使得Hive能够基于成本代价来生成执行计划。

Hive目前提供的是预估的执行计划,而非真实的执行计划(SQL实际执行完后才能获得的计划)。

查看Hive SQl执行计划

  • explain:查看基本信息
  • explain extended:查看扩展信息
  • explain dependency:查看SQL数据输入依赖的信息
  • explain authorization:查看SQL操作相关权限的信息
  • explain vectorization:查看向量化的描述信息

# 看这个前请先理解sql语句
STAGE DEPENDENCIES: # 作业依赖关系
  Stage-1 is a root stage # 根stage
  Stage-0 depends on stages: Stage-1 # stage-0依赖stage-1

STAGE PLANS:
  Stage: Stage-1  # Stage-1详细任务
    Map Reduce  	# 表示当前引擎使用的是 MapReduce
      Map Operator Tree: # Map阶段操作信息
          TableScan  		# 对下面alias声明的结果集进行 表扫描操作
            alias: student  	# alias声明student表
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE # 对当前阶段的统计信息,如当前处理的行和数据量(都是预估值)
            Filter Operator	# 表示在Tablescan的结果集上进行过滤
              predicate: ((s_age < 30) and (s_name like '%red%')) (type: boolean) # 过滤条件
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator # 表示在过滤后的结果集上进行投影
                expressions: s_age (type: int) # 需要投影的列
                outputColumnNames: s_age	# 输出的列名
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Group By Operator # 在投影后结果集上进行分组聚合
                  aggregations: count() # 分组聚合使用的算法
                  keys: s_age (type: int) # 分组的列
                  mode: hash # 采用hash
                  outputColumnNames: _col0, _col1 # 输出的列名
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  Reduce Output Operator # Map端聚合
                    key expressions: _col0 (type: int) # Map端输出的key 
                    sort order: +  # +表示正序,-表示逆序
                    Map-reduce partition columns: _col0 (type: int) # 分区字段
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    value expressions: _col1 (type: bigint) # Map端输出的value
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0) # 对Map value的第一个值进行聚合算法
          keys: KEY._col0 (type: int) # key是Map Key的第一个值
          mode: mergepartial
          outputColumnNames: _col0, _col1 # 输出列名
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          File Output Operator # 对上面的结果集进行文件输出
            compressed: false # 不压缩
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            table: 
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat # 输入文件类型
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat # 输出文件类型
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe # 序列化、反序列化方式

  Stage: Stage-0
    Fetch Operator 
      limit: -1
      Processor Tree:
        ListSink

可以看到有两部分内容:STAGE DEPENDENCIES(作业的依赖关系图)、STAGE PLANS(每个作业的详细信息)。以及标注的详细解释。

执行计划归类

只有Map阶段的类型

  • select-from-where型:简单的SQL执行计划,不包含列操作、条件过滤、UDF、聚合、连接等操作的SQL。

    由于不需要经过聚合,所以只有Map阶段操作,如果文件大小控制合适的话,可以完全发挥任务本地化执行的优点,也就是不需要跨节点,非常高效!

    hive> explain
        > select s_age,s_name
        > from student
        > where s_age=10;
    OK
    STAGE DEPENDENCIES:
      Stage-0 is a root stage
    
    STAGE PLANS:
      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            TableScan
              alias: student
              Statistics: ...
              Filter Operator
                predicate: (s_age = 10) (type: boolean)
                Statistics: ...
                Select Operator
                  expressions: 10 (type: int), s_name (type: int)
                  outputColumnNames: _col0, _col1
                  Statistics: ...
                  ListSink
    
  • select-func(col)-from-where-func(col) / select-operation-from-where-operation型:只带普通函数(除UDTF、UDAF、窗口函数)

    同样只有Map阶段,非常高效!

    hive> explain
        > select case when s_age>20 then 'over 20' when s_age=20 then 'equal 20' else 'other' end age
        > from student
        > where s_name is not null;
    OK
    STAGE DEPENDENCIES:
      Stage-0 is a root stage
    
    STAGE PLANS:
      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            TableScan
              alias: student
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Filter Operator
                predicate: s_name is not null (type: boolean)
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Select Operator
                  expressions: CASE WHEN ((s_age > 20)) THEN ('over 20') WHEN ((s_age = 20)) THEN ('equal 20') ELSE ('other') END (type: string)
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  ListSink
    

Map+Reduce类型

  • select-aggr_func-from-where-groupby类型:带聚合函数的SQL。这类SQL可以分为如下几类:在Reduce阶段聚合的SQL执行计划、在Map和Reduce都有聚合的SQL计划、高级分组聚合的执行计划。

    hive中可以通过配置hive.map.aggr来设定是否开启Combine。

    高级分组聚合指的是聚合时使用:grouping sets、cube、rollup(使用高级分组聚合需要确保Map端reduce开启)。使用高级分组聚合可以将union需要多次的作业塞到一个作业中,可以减少多作业在磁盘和网络IO中的消耗,是一种优化。但是需要注意,这种聚合会造成数据极速膨胀;如果基表的数据量很大,容易导致Map或者Reduce任务因为硬件资源不足而崩溃,Hive中可以使用hive.new.job.grouping.set.cardinality配置这个问题。

  • 带窗口函数的SQL执行计划

    hive> explain
        > select s_name,row_number() over(partition by s_age order by s_score) rk
        > from student;
    OK
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1
    
    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: student
                Statistics: ...
                Reduce Output Operator
                  key expressions: s_age (type: int), s_score (type: int)
                  sort order: ++
                  Map-reduce partition columns: s_age (type: int)
                  Statistics: ...
                  value expressions: s_name (type: string)
          Execution mode: vectorized
          Reduce Operator Tree:
            Select Operator
              expressions: VALUE._col0 (type: string), KEY.reducesinkkey0 (type: int), KEY.reducesinkkey1 (type: int)
              outputColumnNames: _col0, _col1, _col2
              Statistics: ...
              PTF Operator # 窗口函数分析操作
                Function definitions: 
                    Input definition
                      input alias: ptf_0
                      output shape: _col0: string, _col1: int, _col2: int
                      type: WINDOWING
                    Windowing table definition
                      input alias: ptf_1
                      name: windowingtablefunction
                      order by: _col2 ASC NULLS FIRST # 窗口函数排序列
                      partition by: _col1	# 窗口函数分区列
                      raw input shape: 
                      window functions:
                          window function definition
                            alias: row_number_window_0
                            name: row_number # 窗口函数的方法
                            window function: GenericUDAFRowNumberEvaluator
                            window frame: ROWS PRECEDING(MAX)~FOLLOWING(MAX) # 表示当前窗口函数上下边界
                            isPivotResult: true
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Select Operator
                  expressions: _col0 (type: string), row_number_window_0 (type: int)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    
      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            ListSink
    

表连接的SQL计划

Hive表连接类型可以分为以下六种:

  • inner join
  • full outer join
  • left outer join
  • right outer join
  • left semi join:返回左表中与右表的匹配记录
  • cross join:返回两表连接字段的笛卡尔积

以下举例inner join

hive> explain
    > select a.s_no
    > from a a inner join b b on a.s_no=b.s_no;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan # 扫描表a
            alias: a
            Statistics: ...
            Filter Operator
              predicate: s_no is not null (type: boolean) # inner join 内置过滤
              Statistics: ...
              Select Operator
                expressions: s_no (type: int)
                outputColumnNames: _col0
                Statistics: ...
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: ...
          TableScan # 扫描表b
            alias: b
            Statistics: ...
            Filter Operator
              predicate: s_no is not null (type: boolean)
              Statistics: ...
              Select Operator
                expressions: s_no (type: int)
                outputColumnNames: _col0
                Statistics: ...
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: ...
      Reduce Operator Tree:
        Join Operator # 表连接操作
          condition map:
               Inner Join 0 to 1  # inner join 0 1 ;0 1指代的信息在下面
          keys:
            0 _col0 (type: int) # 0表示map阶段一个表的数据集 _col0表示连接字段
            1 _col0 (type: int) # 1表示map阶段另一个表的数据集 _col0表示连接字段
          outputColumnNames: _col0
          Statistics: ...
          File Output Operator
            compressed: false
            Statistics: ...
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
posted @ 2023-08-22 20:59  nangk  阅读(353)  评论(0编辑  收藏  举报