postgre-分区和表继承

postgresql分区和表继承

继承

SQL99之后定义了继承

CREATE TABLE cities (
    name            text,
    population      float,
    altitude        int     -- 英尺
); 
CREATE TABLE capitals (
    state           char(2)
) INHERITS (cities);

capitals表继承自cities表
找出500英尺以上的城市名,包括首府(cities表和capitals表都查找)

SELECT name, altitude
    FROM cities
    WHERE altitude > 500;

不包括首府(只查找cities表,不查找capitals表)

SELECT name, altitude
    FROM ONLY cities
    WHERE altitude > 500;

SELECT, UPDATE,DELETE都支持only关键字
使用*显式指定后代

SELECT name, altitude
    FROM cities*
    WHERE altitude > 500;

每个表里都有一个tableoid属性告诉你源表是谁,获取的是一个OID

SELECT c.tableoid, c.name, c.altitude
FROM cities c
WHERE c.altitude > 500;

通过和pg_class做一个连接,就可以看到实际的表名字:

SELECT p.relname, c.name, c.altitude
FROM cities c, pg_class p
WHERE c.altitude > 500 AND c.tableoid = p.oid;

INSERT或COPY,继承并不自动影响其后代表;

更多参考:
http://www.postgres.cn/docs/9.4/ddl-inherit.html

分区

分区好处:

  • 减少索引体积,使用率部分的索引存放在内存中
  • 访问某一分区,不需要扫描全表
  • 删除分区速度快
  • 很少用的数据可以移动到便宜一些的慢速存储介质上

PostgreSQL通过表继承实现,父表常是空的,它的存在只是为了代表整个数据集;
分区形式

  • 范围分区:不同范围没有重叠
  • 列表分区:明确地列出每个分区里应该出现哪些关键字值实现

实现分区

  1. 创建"主表",所有分区都从它继承(不要定义任何约束,不要定义索引)
  2. 创建几个"子表",每个都从主表上继承,子表不会增加字段(这些是子表,尽管是普通的pg表)
  3. 给分区表增加约束,定义每个分区允许的健值
  4. 对于每个分区单独建立索引
  5. 定义一个规则或者触发器,来重定向数据插入主表到适当的分区。
  6. postgresql.conf中的constraint_exclusion参数

定义约束,保证键值在不同的分区之间不重叠

CHECK ( x = 1 )
CHECK ( county IN ( 'Oxfordshire', 'Buckinghamshire', 'Warwickshire' ))
CHECK ( outletID >= 100 AND outletID < 200 )

为一个巨大的冰激凌公司构造数据库,每天都需要测量温度,概念上有个表:

CREATE TABLE measurement (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
);

主表是measurement表,就像上面那样声明
然后我们为每个月创建一个分区

CREATE TABLE measurement_y2006m02 ( ) INHERITS (measurement);
CREATE TABLE measurement_y2006m03 ( ) INHERITS (measurement);
...
CREATE TABLE measurement_y2007m11 ( ) INHERITS (measurement);
CREATE TABLE measurement_y2007m12 ( ) INHERITS (measurement);
CREATE TABLE measurement_y2008m01 ( ) INHERITS (measurement);

这样就解决了我们的一个问题:删除旧数据。
每个月, 我们需要做的只是在最旧的子表上执行一个DROP TABLE, 然后为新月份创建一个新的子表。
我们必须提供非重叠的表约束。而不是只像上面那样创建分区表,所以我们的建表脚本就变成:

CREATE TABLE measurement_y2006m02 (
    CHECK ( logdate >= DATE '2006-02-01' AND logdate < DATE '2006-03-01' )
) INHERITS (measurement);

CREATE TABLE measurement_y2006m03 (
    CHECK ( logdate >= DATE '2006-03-01' AND logdate < DATE '2006-04-01' )
) INHERITS (measurement);
...

CREATE TABLE measurement_y2007m11 (
    CHECK ( logdate >= DATE '2007-11-01' AND logdate < DATE '2007-12-01' )
) INHERITS (measurement);

CREATE TABLE measurement_y2007m12 (
    CHECK ( logdate >= DATE '2007-12-01' AND logdate < DATE '2008-01-01' )
) INHERITS (measurement);

CREATE TABLE measurement_y2008m01 (
    CHECK ( logdate >= DATE '2008-01-01' AND logdate < DATE '2008-02-01' )
) INHERITS (measurement);

我们可能还需要在关键字字段上有索引:

CREATE INDEX measurement_y2006m02_logdate ON measurement_y2006m02 (logdate);
CREATE INDEX measurement_y2006m03_logdate ON measurement_y2006m03 (logdate);
...
CREATE INDEX measurement_y2007m11_logdate ON measurement_y2007m11 (logdate);
CREATE INDEX measurement_y2007m12_logdate ON measurement_y2007m12 (logdate);
CREATE INDEX measurement_y2008m01_logdate ON measurement_y2008m01 (logdate);

我们想让我们的应用可以说INSERT INTO measurement ... 并且数据被重定向到相应的分区表。
我们可以安排给主表附上一个合适的触发器。 如果数据只进入最新的分区,我们可以使用一个非常简单的触发器:

CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    INSERT INTO measurement_y2008m01 VALUES (NEW.*);
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;

创建完函数后,我们将创建一个调用触发器函数的触发器:

CREATE TRIGGER insert_measurement_trigger
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE measurement_insert_trigger();

我们必须每月重新定义触发器,以便它总是指向当前分区。然而,触发定义不需要更新。
我们可能想插入数据并且想让服务器自动定位应该向哪个分区插入数据。 我们可以用下面这个复杂的触发器来实现这个目标,比如:

CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.logdate >= DATE '2006-02-01' AND
         NEW.logdate < DATE '2006-03-01' ) THEN
        INSERT INTO measurement_y2006m02 VALUES (NEW.*);
    ELSIF ( NEW.logdate >= DATE '2006-03-01' AND
            NEW.logdate < DATE '2006-04-01' ) THEN
        INSERT INTO measurement_y2006m03 VALUES (NEW.*);
    ...
    ELSIF ( NEW.logdate >= DATE '2008-01-01' AND
            NEW.logdate < DATE '2008-02-01' ) THEN
        INSERT INTO measurement_y2008m01 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range.  Fix the measurement_insert_trigger() function!';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;

每一个触发器跟以前一样。注意,每一个IF测试必须匹配其分区的 CHECK约束。
当这个函数比单月的情况更复杂时,它不需要经常的更新,因为分支可以在需要之前被添加。
我们可以看出,一个复杂的分区方案可能要求相当多的 DDL 。 在上面的例子里我们需要每个月创建一次新分区, 因此写一个脚本自动生成需要的 DDL 是明智的。

管理分区
删除旧数据最简单的方法是删除不再需要的分区:

DROP TABLE measurement_y2006m02;

在删除分区同时,把它转化为一个独立的表,可以使用COPY, pg_dump或其他工具对它单独处理及其他操作;

ALTER TABLE measurement_y2006m02 NO INHERIT measurement;

同样,我们可以像前面创建最初的分区一样,创建一个新的空分区来处理新数据。

CREATE TABLE measurement_y2008m02 (
    CHECK ( logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01' )
) INHERITS (measurement);

有时在分区结构之外创建新表并在一段时间之后将其变为分区更为方便。 因为这将允许在该表变为分区之前对其中的数据进行加载、检查、转换之类的操作

CREATE TABLE measurement_y2008m02
  (LIKE measurement INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
ALTER TABLE measurement_y2008m02 ADD CONSTRAINT y2008m02
   CHECK ( logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01' );
\copy measurement_y2008m02 from 'measurement_y2008m02'
--其它可能的数据准备工作
ALTER TABLE measurement_y2008m02 INHERIT measurement;

分区和约束排除
约束排除是一种查询优化技巧,它改进了用上述方法定义的表分区的性能

SET constraint_exclusion = on;
SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01';

如果没有约束排除,上述查询会扫描每个分区,打开之后先筛选分区;
可以使用EXPLAIN显示一个规划在打开和关闭constraint_exclusion情况下的不同
典型的非最佳规划如下:

SET constraint_exclusion = off;
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01';

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Aggregate  (cost=158.66..158.68 rows=1 width=0)
   ->  Append  (cost=0.00..151.88 rows=2715 width=0)
         ->  Seq Scan on measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2008-01-01'::date)
         ->  Seq Scan on measurement_y2006m02 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2008-01-01'::date)
         ->  Seq Scan on measurement_y2006m03 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2008-01-01'::date)
...
         ->  Seq Scan on measurement_y2007m12 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2008-01-01'::date)
         ->  Seq Scan on measurement_y2008m01 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2008-01-01'::date)

在打开约束排除之后

SET constraint_exclusion = on;
EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01';
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Aggregate  (cost=63.47..63.48 rows=1 width=0)
   ->  Append  (cost=0.00..60.75 rows=1086 width=0)
         ->  Seq Scan on measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2008-01-01'::date)
         ->  Seq Scan on measurement_y2008m01 measurement  (cost=0.00..30.38 rows=543 width=0)
               Filter: (logdate >= '2008-01-01'::date)

约束排除只由CHECK约束驱动,在关键字字段上定义索引没有必要;
如果分区很多,只访问很少的分区,索引会有帮助,但是一般不会有帮助;

对于constraint_exclusion建议设置为partition,这样会在分区和不分区之间做一个权衡,取最优;

其他分区方法
可以再master表上设置规则,而不是触发器,比如

CREATE RULE measurement_insert_y2006m02 AS
ON INSERT TO measurement WHERE
( logdate >= DATE ’2006-02-01’ AND logdate < DATE ’2006-03-01’ )
DO INSTEAD
INSERT INTO measurement_y2006m02 VALUES (NEW.*);
...

CREATE RULE measurement_insert_y2008m01 AS
ON INSERT TO measurement WHERE
( logdate >= DATE ’2008-01-01’ AND logdate < DATE ’2008-02-01’ )
DO INSTEAD
INSERT INTO measurement_y2008m01 VALUES (NEW.*);

rule比触发器有显著的开销,每次查询一次的开销;COPY会忽略规则,但是会触发触发器,用触发器会正常使用;

也可以使用union all视图,增加和删除各个分区的数据集,需要重新创建视图,增加一个额外的步骤。
和继承相比基本没有可取之处

CREATE VIEW measurement AS
          SELECT * FROM measurement_y2006m02
UNION ALL SELECT * FROM measurement_y2006m03
...
UNION ALL SELECT * FROM measurement_y2007m11
UNION ALL SELECT * FROM measurement_y2007m12
UNION ALL SELECT * FROM measurement_y2008m01;

警告
已分区表的注意事项

  • 自动创建分区更安全
  • 对分区字段进行UPDATE会失败,如果需要修改可以创建update触发器,但是这样会极大的增加复杂度
  • VACUUM或者ANALYZE命令,需要单独运行在各个分区上,ANALYZE measurement;只会在master分区起作用
  • 分区只有在partition key在where中才生效

参考:
http://www.postgres.cn/docs/9.4/ddl-partitioning.html

建立分区尝试

建立分区

    $DB_LOAD \
        -d pg \
        -c $MAINTENANCE_CONNECT_STR \
        -x "
        drop table if exists international_hotel_physical_room_price_beat_lose_$DATE;

        CREATE TABLE international_hotel_physical_room_price_beat_lose_$DATE (
            CHECK ( date='$FDATE_SUB0DAY' )
        ) INHERITS (rpt_international_hotel_physical_room_price_beat_lose_base);

        CREATE OR REPLACE FUNCTION international_hotel_beat_lose_base_insert_trigger()
        RETURNS TRIGGER AS \$\$
        BEGIN
            INSERT INTO international_hotel_physical_room_price_beat_lose_$DATE VALUES (NEW.*);
            RETURN NULL;
        END;
        \$\$
        LANGUAGE plpgsql;

        drop trigger if exists
        insert_international_hotel_beat_lose_base_trigger
        on rpt_international_hotel_physical_room_price_beat_lose_base;

        CREATE TRIGGER insert_international_hotel_beat_lose_base_trigger
        BEFORE INSERT ON rpt_international_hotel_physical_room_price_beat_lose_base
        FOR EACH ROW EXECUTE PROCEDURE international_hotel_beat_lose_base_insert_trigger();
        "

向分区中导入数据

  $DB_LOAD \
        -d pg \
        -c $MAINTENANCE_CONNECT_STR \
        -t rpt_international_hotel_physical_room_price_beat_lose_base \
        -b 100000 \
        -i 3 \
        -s "
            select
            date,
            if(hotel_seq is not null,hotel_seq,''),
            if(hotel_name is not null,hotel_name,''),
            if(hotel_name_e is not null,hotel_name_e,''),
            if(physical_room_name is not null,physical_room_name,''),
            if(product_full_name is not null,product_full_name,''),
            if(has_breakfast is not null,has_breakfast,0),
            if(supplier_code is not null,supplier_code,''),
            if(supplier_name is not null,supplier_name,''),
            if(room_fee is not null,room_fee,0),
            if(city_code is not null,city_code,''),
            if(city_name is not null,city_name,''),
            if(province_name is not null,province_name,''),
            if(country_name is not null,country_name,''),
            if(checkin_date is not null,checkin_date,'1970-01-01'),
            if(checkout_date is not null,checkout_date,'1970-01-01'),
            if(opponent_name is not null,opponent_name,''),
            if(opponent_wrapper is not null,opponent_wrapper,''),
            if(opponent_product_full_name is not null,opponent_product_full_name,''),
            if(opponent_room_fee is not null,opponent_room_fee,0),
            if(result is not null,result,'')
            from rpt_international_hotel_physical_room_price_beat_lose_base where dt='$DATE';
        "
posted @ 2017-01-04 21:35  zhangshihai1232  阅读(499)  评论(0)    收藏  举报