【源码解析】postgresql having clause 是如何实现的 (1)

该文章将分为上下两篇,

  1. 从源码层面了解 aggregate function 怎么实现
  2. 从源码层面了解 having clause 怎么实现

第一篇是阅读第二遍的前提。本文讲述的是第一篇的内容。每一篇文章内容会分为 5 个部分,对应一条 sql 的执行流程。这 5 个部分是:

  • parser
  • analyser
  • rewriter
  • planner/optimizer
  • executor

这 5 个阶段的关系如下:

flowchart TD main["`**postgres** backend PostgresMain()`"] parse["`**PARSE**: pg_parse_query()`"] analyze["`**ANALYZE:** parse_analyze_fixedparams()`"] rewrite["`**REWRITE**: pg_rewrite_query()`"] utilities["`**DDL PROCESSOR**: ProcessUtility()`"] optimize["`**PLAN/OPTIMIZE**: pg_plan_queries()`"] execute["`**DML EXECUTOR**: ExecutePlan()`"] main --> parse parse --> analyze analyze --> rewrite analyze --> utilities rewrite --> optimize optimize --> execute

setup

在正式分析前,先建立一张表:

create table foo(a int, b int);
insert into foo select i/100, random() * i from generate_series(1000, 2000) g(i);

后续内容是针对以下 sql 的源码解析:

select a, count(b) from foo group by a;

为了突出重点内容,有些显示结果为简化后的内容;同时,为了和源码中的术语保持一致,后续将采用英文描述。

parser

(gdb) call pprint(raw_parsetree_list)

The output is written to the postgres log file.

the raw parse tree:

(
   {RAWSTMT 
   :stmt 
      {SELECTSTMT
      :targetList (
         {RESTARGET 
         :val 
            {COLUMNREF
            :fields ("a")
            }
         }
         {RESTARGET
         :val 
            {FUNCCALL 
            :funcname ("count")
            :args (
               {COLUMNREF 
               :fields ("b")
               }
            )
            :funcformat 0 
            }
         }
      )
      :fromClause (
         {RANGEVAR
         :relname foo 
         :inh true 
         :relpersistence p
         }
      )
      :groupClause (
         {COLUMNREF 
         :fields ("a")
         }
      )
    }
   }
)

analyser

{QUERY 
:commandType 1 
:hasAggs true 
:rtable (
  {RANGETBLENTRY 
  :eref 
     {ALIAS 
     :aliasname foo 
     :colnames ("a" "b")
     }
  :rtekind 0 
  :relid 16401 
  :relkind r 
  }
)
:targetList (
  {TARGETENTRY 
  :expr 
     {VAR 
     :varno 1 
     :varattno 1 
     :vartype 23 
     }
  :resno 1 
  :resname a 
  :ressortgroupref 1 
  :resorigtbl 16401 
  :resorigcol 1 
  }
  {TARGETENTRY 
  :expr 
     {AGGREF 
     :aggfnoid 2147 
     :aggtype 20 
     :args (
        {TARGETENTRY 
        :expr 
           {VAR 
           :varno 1 
           :varattno 2 
           :vartype 23 
           }
        :resno 1 
        }
     )
     :aggkind n 
     }
  :resno 2 
  :resname count 
  }
)
:groupClause (
  {SORTGROUPCLAUSE 
  :tleSortGroupRef 1 
  :eqop 96 
  :sortop 97 
  :hashable true
  }
)
}

rewriter

the query tree:

(
   {QUERY 
   :commandType 1 
   :canSetTag true 
   :hasAggs true 
   :rtable (
      {RANGETBLENTRY 
      :eref 
         {ALIAS 
         :aliasname foo 
         :colnames ("a" "b")
         }
      :rtekind 0 
      :relid 16385 
      :relkind r 
      :rellockmode 1 
      :perminfoindex 1 
      :inh true 
      :inFromCl true 
      }
   )
   :rteperminfos (
      {RTEPERMISSIONINFO 
      :relid 16385 
      :inh true 
      :requiredPerms 2 
      :selectedCols (b 8 9)
      :insertedCols (b)
      :updatedCols (b)
      }
   )
   :jointree 
      {FROMEXPR 
      :fromlist (
         {RANGETBLREF 
         :rtindex 1
         }
      )
      }
   :targetList (
      {TARGETENTRY 
      :expr 
         {VAR 
         :varno 1 
         :varattno 1 
         :vartype 23 
         :varnullingrels (b)
         :varlevelsup 0 
         :varnosyn 1 
         :varattnosyn 1 
         }
      :resno 1 
      :resname a 
      :ressortgroupref 1 
      :resorigtbl 16385 
      :resorigcol 1 
      }
      {TARGETENTRY 
      :expr 
         {AGGREF 
         :aggfnoid 2147 
         :aggtype 20 
         :aggargtypes (o 23)
         :args (
            {TARGETENTRY 
            :expr 
               {VAR 
               :varno 1 
               :varattno 2 
               :vartype 23 
               :varnullingrels (b)
               :varnosyn 1 
               :varattnosyn 2 
               }
            :resno 1 
            }
         )
         :aggkind n 
         }
      :resno 2 
      :resname count 
      }
   )
   :groupClause (
      {SORTGROUPCLAUSE 
      :tleSortGroupRef 1 
      :eqop 96 
      :sortop 97 
      :nulls_first false 
      :hashable true
      }
   )
   }
)

planner/optimizer

the query plan:

(
   {PLANNEDSTMT 
   :commandType 1 
   :canSetTag true 
   :planTree 
      {AGG 
      :plan.targetlist (
         {TARGETENTRY 
         :expr 
            {VAR 
            :varno -2 
            :varattno 1 
            :vartype 23 
            }
         :resno 1 
         :resname a 
         :ressortgroupref 1 
         :resorigtbl 16385 
         }
         {TARGETENTRY 
         :expr 
            {AGGREF 
            :aggfnoid 2147 
            :aggtype 20 
            :aggtranstype 20 
            :aggargtypes (o 23)
            :args (
               {TARGETENTRY 
               :expr 
                  {VAR 
                  :varno -2 
                  :varattno 2 
                  :vartype 23 
                  :varnosyn 1 
                  :varattnosyn 2 
                  }
               :resno 1 
               }
            )
            :aggkind n 
            }
         :resno 2 
         :resname count 
         }
      )
      :plan.lefttree 
         {SEQSCAN 
         :scan.plan.targetlist (
            {TARGETENTRY 
            :expr 
               {VAR 
               :varno 1 
               :varattno 1 
               :vartype 23 
               :varnosyn 1 
               :varattnosyn 1 
               }
            :resno 1 
            :ressortgroupref 1 
            }
            {TARGETENTRY 
            :expr 
               {VAR 
               :varno 1 
               :varattno 2 
               :vartype 23 
               :varnosyn 1 
               :varattnosyn 2 
               }
            :resno 2 
            }
         )
         :scan.scanrelid 1
         }
      :aggstrategy 2 
      :numCols 1 
      :grpColIdx ( 1)
      :grpOperators ( 96)
      :grpCollations ( 0)
      :numGroups 11 
      :aggParams (b)
      }
   :rtable (
      {RANGETBLENTRY 
      :eref 
         {ALIAS 
         :aliasname foo 
         :colnames ("a" "b")
         }
      :rtekind 0 
      :relid 16385 
      :relkind r 
      :rellockmode 1 
      :perminfoindex 1 
      :inFromCl true 
      }
   )
   :permInfos (
      {RTEPERMISSIONINFO 
      :relid 16385 
      :inh true 
      :requiredPerms 2 
      :selectedCols (b 8 9)
      :insertedCols (b)
      :updatedCols (b)
      }
   )
   :relationOids (o 16385)
   }
)

evaluation steps

Evaluation steps are generated by function ExecInitExpr.

image

Each step is labelled with opcode. The first step is EEOP_SCAN_FETCHSOME for sequential scan and the last step is EEOP_DONE. The opcode is then replaced by code location inside function ExecReadyExpr. You can use function ExecEvalStepOp to get back the human readable opcode.

(gdb) p ExecEvalStepOp(state, op)

executor

overview

The executor uses the function execAgg to handle Agg node. The simplified version of ExecAgg is:

image

The most work is done with function agg_fill_hash_table inside which tuples are fetched and the aggregate function is applied in each group.

jargons

  • project
    In the context of databases, the term "project" refers to the operation of extracting a subset of columns from a table or producing a new set of columns by applying expressions/transformations.

  • outer plan

    Limit 
      ->  Sort 
            ->  SeqScan
    

    In PostgreSQL, "outer plan" refers to the input plan node that feeds data into the current executing plan node. Here, the Sort node's outer plan is the SeqScan node. The SeqScan generates rows that become input to Sort.

  • qualification
    In SQL queries, we can specify a WHERE or HAVING clause with conditions that filter which rows are included in the result. For example:

    SELECT avg(salary) 
    FROM employees
    GROUP BY department
    HAVING avg(salary) > 50000
    

    Here the HAVING avg(salary) > 50000 is a qual condition (qualification) that checks if the average salary for a group is greater than 50,000.

the internals

aggregate strategies

  • AGG_HASHED
    This builds a hash table from the input tuples, using the group
    key columns as the hash key. It then scans the hash table to compute the
    aggregates for each distinct group. This allows fast aggregation for
    arbitrary groupings, but requires additional memory for the hash table.

  • AGG_MIXED
    This uses both hashed and sorted aggregation. It builds the hash
    table first, then switches to sorted aggregation once the table fills up.
    This prevents excessive memory use while still utilizing the faster hash
    aggregation.

  • AGG_PLAIN
    This performs a naive aggregation by scanning the input tuples
    repeatedly, updating the aggregates for each group as it encounters tuples
    in that group. Simple but slow for large data sets.

  • AGG_SORTED
    This first sorts the input data on the group key columns. It
    then scans the sorted data and computes aggregates within each consecutive
    group. This method is faster than plain aggregation and has low memory
    overhead, but requires the ability to sort the data.

data structures

  • AggStatePerAggData
    This contains the state for each aggregate function like sum, count etc.
  • AggStatePerTransData
    This contains the transition state used when running aggregates for distributive functions.
  • AggStatePerGroupData
    This contains the intermediate data used per grouping set during aggregation.
  • AggStatePerPhaseData
    This contains state that can change between phases of aggregation like hashing vs sorting.
  • AggStatePerHashData
    This contains state related to the hash table handling like table, batches etc.

agg_fill_hash_table

image

fetch_input_tuple

image

SeqNext is an access method and it uses table_scan_getnextslot to fetch data.

Use TableHashSlot to hold the tuple data.

what is a minimum tuple in postgresql?

A minimal tuple in PostgreSQL is a compact representation of a tuple that contains
only the essential data necessary to reconstruct the full tuple.

Key characteristics of minimal tuples:

  • Only stores the data for user columns along with some metadata like null flags.
  • Does not store control information like xmin, cmin, cmax etc.
  • The data is not aligned on byte boundaries.
  • Occupies minimal space as extra padding is removed.
  • Has a smaller header than a standard heap tuple header.
  • Pointer to the TupleDesc is needed to interpret the raw data.
  • Converting to/from a full tuple format requires metadata like TupleDesc.

Minimal tuples are used in PostgreSQL for:

  • Hash table tuples in hashed aggregation to save space.
  • Intermediate storage for some operations like grouping, windowing.
  • Representing tuples compactly in memory between executions.
  • Storing tuples during sorting to reduce data volume.
  • Transmitting tuples over the network without wasted space.

So in summary, a minimal tuple is an optimized storage format that removes
unnecessary control data and padding to represent tuples very compactly for
in-memory operations and network transfer.

system catalogs
  • pg_proc
select * from pg_proc where proname = 'count';
  • pg_aggregate
select * from pg_aggregate where aggfnoid = 2803;
name value
aggfnoid pg_catalog.count
aggkind n
aggtransfn int8inc
aggfinalfn -
aggcombinefn int8pl
aggfinalextra f
aggfinalmodify r
aggtranstype 20
aggtransspace 0
agginitval 0
how does count(b) get executed?

image

The following steps get executed:

EEOP_OUTER_FETCHSOME
EEOP_OUTER_VAR
EEOP_AGG_STRICT_INPUT_CHECK_ARGS
EEOP_AGG_PLAIN_TRANS_STRICT_BYVAL
EEOP_DONE

In the EEOP_AGG_PLAIN_TRANS_STRICT_BYVAL step, the function ExecAggPlainTransByVal executes logics of transition function. The transition value is stored in pergroup->transValue.

agg_retrieve_hash_table

image

Evaluation steps for function project_aggregate:

EEOP_OUTER_FETCHSOME
EEOP_ASSIGN_OUTER_VAR
EEOP_AGGREF
EEOP_ASSIGN_TMP
EEOP_DONE

posted on 2023-08-10 21:12  winter-loo  阅读(22)  评论(0编辑  收藏  举报

导航