【源码解析】postgresql having clause 是如何实现的 (1)

该文章将分为上下两篇，

从源码层面了解 aggregate function 怎么实现
从源码层面了解 having clause 怎么实现

第一篇是阅读第二遍的前提。本文讲述的是第一篇的内容。每一篇文章内容会分为 5 个部分，对应一条 sql 的执行流程。这 5 个部分是：

parser
analyser
rewriter
planner/optimizer
executor

这 5 个阶段的关系如下：

flowchart TD main["`**postgres** backend PostgresMain()`"] parse["`**PARSE**: pg_parse_query()`"] analyze["`**ANALYZE:** parse_analyze_fixedparams()`"] rewrite["`**REWRITE**: pg_rewrite_query()`"] utilities["`**DDL PROCESSOR**: ProcessUtility()`"] optimize["`**PLAN/OPTIMIZE**: pg_plan_queries()`"] execute["`**DML EXECUTOR**: ExecutePlan()`"] main --> parse parse --> analyze analyze --> rewrite analyze --> utilities rewrite --> optimize optimize --> execute

setup

在正式分析前，先建立一张表：

create table foo(a int, b int);
insert into foo select i/100, random() * i from generate_series(1000, 2000) g(i);

后续内容是针对以下 sql 的源码解析:

select a, count(b) from foo group by a;

为了突出重点内容，有些显示结果为简化后的内容；同时，为了和源码中的术语保持一致，后续将采用英文描述。

parser

(gdb) call pprint(raw_parsetree_list)

The output is written to the postgres log file.

the raw parse tree:

(
   {RAWSTMT 
   :stmt 
      {SELECTSTMT
      :targetList (
         {RESTARGET 
         :val 
            {COLUMNREF
            :fields ("a")
            }
         }
         {RESTARGET
         :val 
            {FUNCCALL 
            :funcname ("count")
            :args (
               {COLUMNREF 
               :fields ("b")
               }
            )
            :funcformat 0 
            }
         }
      )
      :fromClause (
         {RANGEVAR
         :relname foo 
         :inh true 
         :relpersistence p
         }
      )
      :groupClause (
         {COLUMNREF 
         :fields ("a")
         }
      )
    }
   }
)

analyser

{QUERY 
:commandType 1 
:hasAggs true 
:rtable (
  {RANGETBLENTRY 
  :eref 
     {ALIAS 
     :aliasname foo 
     :colnames ("a" "b")
     }
  :rtekind 0 
  :relid 16401 
  :relkind r 
  }
)
:targetList (
  {TARGETENTRY 
  :expr 
     {VAR 
     :varno 1 
     :varattno 1 
     :vartype 23 
     }
  :resno 1 
  :resname a 
  :ressortgroupref 1 
  :resorigtbl 16401 
  :resorigcol 1 
  }
  {TARGETENTRY 
  :expr 
     {AGGREF 
     :aggfnoid 2147 
     :aggtype 20 
     :args (
        {TARGETENTRY 
        :expr 
           {VAR 
           :varno 1 
           :varattno 2 
           :vartype 23 
           }
        :resno 1 
        }
     )
     :aggkind n 
     }
  :resno 2 
  :resname count 
  }
)
:groupClause (
  {SORTGROUPCLAUSE 
  :tleSortGroupRef 1 
  :eqop 96 
  :sortop 97 
  :hashable true
  }
)
}

rewriter

the query tree:

(
   {QUERY 
   :commandType 1 
   :canSetTag true 
   :hasAggs true 
   :rtable (
      {RANGETBLENTRY 
      :eref 
         {ALIAS 
         :aliasname foo 
         :colnames ("a" "b")
         }
      :rtekind 0 
      :relid 16385 
      :relkind r 
      :rellockmode 1 
      :perminfoindex 1 
      :inh true 
      :inFromCl true 
      }
   )
   :rteperminfos (
      {RTEPERMISSIONINFO 
      :relid 16385 
      :inh true 
      :requiredPerms 2 
      :selectedCols (b 8 9)
      :insertedCols (b)
      :updatedCols (b)
      }
   )
   :jointree 
      {FROMEXPR 
      :fromlist (
         {RANGETBLREF 
         :rtindex 1
         }
      )
      }
   :targetList (
      {TARGETENTRY 
      :expr 
         {VAR 
         :varno 1 
         :varattno 1 
         :vartype 23 
         :varnullingrels (b)
         :varlevelsup 0 
         :varnosyn 1 
         :varattnosyn 1 
         }
      :resno 1 
      :resname a 
      :ressortgroupref 1 
      :resorigtbl 16385 
      :resorigcol 1 
      }
      {TARGETENTRY 
      :expr 
         {AGGREF 
         :aggfnoid 2147 
         :aggtype 20 
         :aggargtypes (o 23)
         :args (
            {TARGETENTRY 
            :expr 
               {VAR 
               :varno 1 
               :varattno 2 
               :vartype 23 
               :varnullingrels (b)
               :varnosyn 1 
               :varattnosyn 2 
               }
            :resno 1 
            }
         )
         :aggkind n 
         }
      :resno 2 
      :resname count 
      }
   )
   :groupClause (
      {SORTGROUPCLAUSE 
      :tleSortGroupRef 1 
      :eqop 96 
      :sortop 97 
      :nulls_first false 
      :hashable true
      }
   )
   }
)

planner/optimizer

the query plan:

(
   {PLANNEDSTMT 
   :commandType 1 
   :canSetTag true 
   :planTree 
      {AGG 
      :plan.targetlist (
         {TARGETENTRY 
         :expr 
            {VAR 
            :varno -2 
            :varattno 1 
            :vartype 23 
            }
         :resno 1 
         :resname a 
         :ressortgroupref 1 
         :resorigtbl 16385 
         }
         {TARGETENTRY 
         :expr 
            {AGGREF 
            :aggfnoid 2147 
            :aggtype 20 
            :aggtranstype 20 
            :aggargtypes (o 23)
            :args (
               {TARGETENTRY 
               :expr 
                  {VAR 
                  :varno -2 
                  :varattno 2 
                  :vartype 23 
                  :varnosyn 1 
                  :varattnosyn 2 
                  }
               :resno 1 
               }
            )
            :aggkind n 
            }
         :resno 2 
         :resname count 
         }
      )
      :plan.lefttree 
         {SEQSCAN 
         :scan.plan.targetlist (
            {TARGETENTRY 
            :expr 
               {VAR 
               :varno 1 
               :varattno 1 
               :vartype 23 
               :varnosyn 1 
               :varattnosyn 1 
               }
            :resno 1 
            :ressortgroupref 1 
            }
            {TARGETENTRY 
            :expr 
               {VAR 
               :varno 1 
               :varattno 2 
               :vartype 23 
               :varnosyn 1 
               :varattnosyn 2 
               }
            :resno 2 
            }
         )
         :scan.scanrelid 1
         }
      :aggstrategy 2 
      :numCols 1 
      :grpColIdx ( 1)
      :grpOperators ( 96)
      :grpCollations ( 0)
      :numGroups 11 
      :aggParams (b)
      }
   :rtable (
      {RANGETBLENTRY 
      :eref 
         {ALIAS 
         :aliasname foo 
         :colnames ("a" "b")
         }
      :rtekind 0 
      :relid 16385 
      :relkind r 
      :rellockmode 1 
      :perminfoindex 1 
      :inFromCl true 
      }
   )
   :permInfos (
      {RTEPERMISSIONINFO 
      :relid 16385 
      :inh true 
      :requiredPerms 2 
      :selectedCols (b 8 9)
      :insertedCols (b)
      :updatedCols (b)
      }
   )
   :relationOids (o 16385)
   }
)

evaluation steps

Evaluation steps are generated by function ExecInitExpr.

Each step is labelled with opcode. The first step is EEOP_SCAN_FETCHSOME for sequential scan and the last step is EEOP_DONE. The opcode is then replaced by code location inside function ExecReadyExpr. You can use function ExecEvalStepOp to get back the human readable opcode.

(gdb) p ExecEvalStepOp(state, op)

executor

overview

The executor uses the function execAgg to handle Agg node. The simplified version of ExecAgg is:

The most work is done with function agg_fill_hash_table inside which tuples are fetched and the aggregate function is applied in each group.

jargons

project
In the context of databases, the term "project" refers to the operation of extracting a subset of columns from a table or producing a new set of columns by applying expressions/transformations.
outer plan
```
Limit 
  ->  Sort 
        ->  SeqScan
```
In PostgreSQL, "outer plan" refers to the input plan node that feeds data into the current executing plan node. Here, the Sort node's outer plan is the SeqScan node. The SeqScan generates rows that become input to Sort.
qualification
In SQL queries, we can specify a WHERE or HAVING clause with conditions that filter which rows are included in the result. For example:
```
SELECT avg(salary) 
FROM employees
GROUP BY department
HAVING avg(salary) > 50000
```
Here the HAVING avg(salary) > 50000 is a qual condition (qualification) that checks if the average salary for a group is greater than 50,000.

the internals

aggregate strategies

AGG_HASHED
This builds a hash table from the input tuples, using the group
key columns as the hash key. It then scans the hash table to compute the
aggregates for each distinct group. This allows fast aggregation for
arbitrary groupings, but requires additional memory for the hash table.
AGG_MIXED
This uses both hashed and sorted aggregation. It builds the hash
table first, then switches to sorted aggregation once the table fills up.
This prevents excessive memory use while still utilizing the faster hash
aggregation.
AGG_PLAIN
This performs a naive aggregation by scanning the input tuples
repeatedly, updating the aggregates for each group as it encounters tuples
in that group. Simple but slow for large data sets.
AGG_SORTED
This first sorts the input data on the group key columns. It
then scans the sorted data and computes aggregates within each consecutive
group. This method is faster than plain aggregation and has low memory
overhead, but requires the ability to sort the data.

data structures

AggStatePerAggData
This contains the state for each aggregate function like sum, count etc.
AggStatePerTransData
This contains the transition state used when running aggregates for distributive functions.
AggStatePerGroupData
This contains the intermediate data used per grouping set during aggregation.
AggStatePerPhaseData
This contains state that can change between phases of aggregation like hashing vs sorting.
AggStatePerHashData
This contains state related to the hash table handling like table, batches etc.

agg_fill_hash_table

fetch_input_tuple

SeqNext is an access method and it uses table_scan_getnextslot to fetch data.

Use TableHashSlot to hold the tuple data.

what is a minimum tuple in postgresql?

A minimal tuple in PostgreSQL is a compact representation of a tuple that contains
only the essential data necessary to reconstruct the full tuple.

Key characteristics of minimal tuples:

Only stores the data for user columns along with some metadata like null flags.

Does not store control information like xmin, cmin, cmax etc.

The data is not aligned on byte boundaries.

Occupies minimal space as extra padding is removed.

Has a smaller header than a standard heap tuple header.

Pointer to the TupleDesc is needed to interpret the raw data.

Converting to/from a full tuple format requires metadata like TupleDesc.

Minimal tuples are used in PostgreSQL for:

Hash table tuples in hashed aggregation to save space.

Intermediate storage for some operations like grouping, windowing.

Representing tuples compactly in memory between executions.

Storing tuples during sorting to reduce data volume.

Transmitting tuples over the network without wasted space.

So in summary, a minimal tuple is an optimized storage format that removes
unnecessary control data and padding to represent tuples very compactly for
in-memory operations and network transfer.

system catalogs

pg_proc

select * from pg_proc where proname = 'count';

pg_aggregate

select * from pg_aggregate where aggfnoid = 2803;

name	value
aggfnoid	pg_catalog.count
aggkind	n
aggtransfn	int8inc
aggfinalfn	-
aggcombinefn	int8pl
aggfinalextra	f
aggfinalmodify	r
aggtranstype	20
aggtransspace	0
agginitval	0

how does `count(b)` get executed?

The following steps get executed:

EEOP_OUTER_FETCHSOME
EEOP_OUTER_VAR
EEOP_AGG_STRICT_INPUT_CHECK_ARGS
EEOP_AGG_PLAIN_TRANS_STRICT_BYVAL
EEOP_DONE

In the EEOP_AGG_PLAIN_TRANS_STRICT_BYVAL step, the function ExecAggPlainTransByVal executes logics of transition function. The transition value is stored in pergroup->transValue.

agg_retrieve_hash_table

Evaluation steps for function project_aggregate:

EEOP_OUTER_FETCHSOME
EEOP_ASSIGN_OUTER_VAR
EEOP_AGGREF
EEOP_ASSIGN_TMP
EEOP_DONE

posted on 2023-08-10 21:12 winter-loo 阅读(22) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

ldd.cool