Iceberg Spark Stored Procedures: Table Maintenance Tools

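I. Introduction

A stored procedure is a database concept: like a method or function in a programming language, it encapsulates a specific operation. Native Spark SQL does not support stored procedures. Starting with version 0.11.0, Iceberg extends Spark SQL with procedure support and ships implementations of several procedures. All procedures provided by Iceberg live under the system namespace and fall into three groups: snapshot management, metadata management, and table migration. To use them, add the following Spark configuration: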

spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

II. Using Stored Procedures

1. Syntax overview

-- Positional arguments
call catalog_name.system.procedure_name(arg_1, arg_2, ... arg_n);
-- Named arguments
call catalog_name.system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1);
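
As a concrete sketch of the two styles, the calls below invoke rollback_to_snapshot (covered in the snapshot management section) on a placeholder table db.sample with a placeholder snapshot ID of 1. Named arguments can be given in any order; note that the Iceberg documentation states that mixing positional and named arguments in one call is not supported.

```sql
-- Positional: arguments follow the procedure's declared order (table, snapshot_id)
CALL catalog_name.system.rollback_to_snapshot('db.sample', 1);

-- Named: the same call, with the arguments deliberately given out of order
CALL catalog_name.system.rollback_to_snapshot(snapshot_id => 1, table => 'db.sample');
```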

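III. Iceberg Snapshot Management

1. Roll back a table to a specific snapshot ID (rollback_to_snapshot)

(1) Parameters

table (required): string, the table name

snapshot_id (required): long, the snapshot ID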

(2) Example

call catalog_name.system.rollback_to_snapshot('db.sample', 1);
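
In practice the snapshot ID is a long generated value rather than 1. One way to look it up, sketched here on the assumption that the table is reachable as catalog_name.db.sample, is to query Iceberg's snapshots metadata table first. The ID in the CALL is a made-up placeholder.

```sql
-- List the table's snapshots, oldest first, to pick a rollback target
SELECT snapshot_id, parent_id, committed_at, operation
FROM catalog_name.db.sample.snapshots
ORDER BY committed_at;

-- Roll back to the chosen snapshot (placeholder ID, replace with a real one)
CALL catalog_name.system.rollback_to_snapshot('db.sample', 5781947118336215154);
```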

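2. Roll back a table to a specific point in time (rollback_to_timestamp)

(1) Parameters

table (required): string, the table name

timestamp (required): timestamp, the point in time to roll back to

(2) Example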

call catalog_name.system.rollback_to_timestamp('db.sample', TIMESTAMP '2023-12-08 00:00:00.000');
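
To check which snapshot was current at a given time before rolling back, the history metadata table can help. This is a sketch that assumes the table is reachable as catalog_name.db.sample; the named-argument form of the call is shown for comparison with the positional example above.

```sql
-- Show when each snapshot became current and whether it is an
-- ancestor of the current table state
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM catalog_name.db.sample.history
ORDER BY made_current_at;

-- Same rollback as above, written with named arguments
CALL catalog_name.system.rollback_to_timestamp(
  table => 'db.sample',
  timestamp => TIMESTAMP '2023-12-08 00:00:00.000'
);
```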

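3. Set the table's current snapshot ID (set_current_snapshot)

Unlike a rollback, this procedure can move freely between snapshots: the target snapshot does not have to be an ancestor of the current table state.

(1) Parameters

table (required): string, the table name

snapshot_id (required): long, the snapshot ID

(2) Example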

call spark_catalog.system.set_current_snapshot('db.sample', 1);

4. Create a new snapshot from an existing snapshot (cherrypick_snapshot)

(1) Parameters

table (required): string, the table name

snapshot_id (required): long, the snapshot ID
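
(2) Example

A minimal usage sketch, following the pattern of the other snapshot procedures. The catalog name, the table db.sample, and the snapshot ID 1 are placeholders. Cherry-picking applies the changes of the given snapshot as a new snapshot on top of the current table state, without altering or removing the original snapshot.

```sql
-- Positional form
CALL catalog_name.system.cherrypick_snapshot('db.sample', 1);

-- Named form
CALL catalog_name.system.cherrypick_snapshot(snapshot_id => 1, table => 'db.sample');
```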

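IV. Metadata Management

1. Remove expired snapshots and their associated data files (expire_snapshots)

(1) Parameters

table (required): string, the table name
older_than: timestamp, snapshots older than this timestamp are removed (default: 5 days ago)
retain_last: int, the number of ancestor snapshots to keep regardless of older_than (default: 1)
max_concurrent_deletes: int, size of the thread pool used for file deletion (by default, no thread pool is used)

(2) Example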

CALL spark_catalog.system.expire_snapshots(table => 'db.sample', older_than => TIMESTAMP '2023-12-07 10:40:00.000');
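
The optional parameters can be combined. As a sketch, the call below expires snapshots older than the given timestamp but always keeps the 10 most recent ones, and deletes files with a pool of 4 threads. The values 10 and 4 are arbitrary illustrations, not recommendations.

```sql
CALL spark_catalog.system.expire_snapshots(
  table => 'db.sample',
  older_than => TIMESTAMP '2023-12-07 10:40:00.000',
  retain_last => 10,            -- keep at least the last 10 snapshots
  max_concurrent_deletes => 4   -- delete files using 4 threads
);
```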

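2. Remove files not referenced by any Iceberg metadata (remove_orphan_files)

(1) Parameters

table (required): string, the table name.
older_than: timestamp, orphan files created before this timestamp are removed (default: 3 days ago).
location: string, the directory to search for files (default: the table's location).
dry_run: boolean, when true, no files are actually deleted (default: false).
max_concurrent_deletes: int, size of the thread pool used for file deletion (by default, no thread pool is used).

(2) Example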

call spark_catalog.system.remove_orphan_files(table => 'db.sample');
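
Because this procedure deletes files, it can be worth previewing the result first. The sketch below uses dry_run to list the candidate files without deleting anything, then restricts a real run to one directory. The path tablelocation/data is a placeholder.

```sql
-- Preview: list orphan files without deleting them
CALL spark_catalog.system.remove_orphan_files(table => 'db.sample', dry_run => true);

-- Only look for orphan files under a specific directory (placeholder path)
CALL spark_catalog.system.remove_orphan_files(table => 'db.sample', location => 'tablelocation/data');
```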

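3. Compact small files to speed up file scans (rewrite_data_files)

(1) Parameters

table (required): string, the table name.
strategy: string, the compaction strategy, either binpack or sort (default: binpack).
sort_order: string, the sort order; separate multiple fields with commas, e.g. name asc nulls last, age desc nulls first.
options: map<string, string>, additional options to use when rewriting files.
where: string, a predicate that selects which files to rewrite.

(2) Example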

CALL spark_catalog.system.rewrite_data_files(
  table => 'db.sample',
  options => map(
    'max-concurrent-file-group-rewrites','1',
    'max-file-group-size-bytes','1',
    'rewrite-all','true'
    )
  );
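
The strategy, sort_order, and where parameters are not used in the example above. A sketch of both follows, assuming the table has columns named id and name (hypothetical columns chosen for illustration).

```sql
-- Rewrite using the sort strategy with an explicit sort order
CALL spark_catalog.system.rewrite_data_files(
  table => 'db.sample',
  strategy => 'sort',
  sort_order => 'id DESC NULLS LAST, name ASC NULLS FIRST'
);

-- Rewrite only the files that may contain rows matching the predicate
CALL spark_catalog.system.rewrite_data_files(
  table => 'db.sample',
  where => 'id = 3 and name = "foo"'
);
```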

4. Rewrite manifest files to optimize scan planning (rewrite_manifests)

(1) Parameters

table (required): string, the table name.
use_caching: boolean, whether to use Spark caching (default: true).

(2) Example

call spark_catalog.system.rewrite_manifests('db.sample', false);
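
To judge whether a rewrite is worthwhile, or to see its effect afterwards, the manifests metadata table can be inspected. This is a sketch that assumes the table resolves as spark_catalog.db.sample in your session.

```sql
-- One row per manifest file: many small manifests suggest a rewrite may help
SELECT path, length, added_data_files_count, existing_data_files_count
FROM spark_catalog.db.sample.manifests;

-- The same call as above, written with named arguments
CALL spark_catalog.system.rewrite_manifests(table => 'db.sample', use_caching => false);
```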

5. Report the ancestry of a given snapshot (ancestors_of)

(1) Parameters

table (required): string, the table name.
snapshot_id (optional): long, the snapshot ID.

(2) Example

call spark_catalog.system.ancestors_of('db.sample');
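
The call above omits snapshot_id, in which case the ancestors of the current snapshot are reported. To start from a specific snapshot instead, pass its ID; the value 1 below is a placeholder.

```sql
-- Positional form
CALL spark_catalog.system.ancestors_of('db.sample', 1);

-- Named form
CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.sample');
```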

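V. Table Migration

1. snapshot

(1) Overview

Creates a snapshot table: a new, lightweight Iceberg table built from the source table for testing, without affecting the source table. When testing is finished, it can be removed with drop table. As long as no new data is inserted into the snapshot table, it keeps using the source table's data files; if data is inserted, the new data files are written to the snapshot table's own data directory and the source table is not affected.

Operations on the new table that do not physically delete the source table's data files are allowed, such as insert and delete. Operations that would physically delete data files, such as expiring snapshots (expire_snapshots), are prohibited because they would affect the source table.

(2) Parameters

source_table (required): string, the name of the source table.
table (required): string, the name of the new (snapshot) table.
location: string, the storage directory of the new table (by default, managed by the catalog).
properties: map<string, string>, properties to add to the new table.

(3) Example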

call catalog_name.system.snapshot('db.sample', 'db.snap');
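
A sketch of the full test cycle, with the optional location given explicitly. The directory /tmp/temptable/ is a placeholder, and the DROP TABLE statement assumes db.snap resolves in the same catalog.

```sql
-- Create the snapshot table at an explicit location (placeholder path)
CALL catalog_name.system.snapshot('db.sample', 'db.snap', '/tmp/temptable/');

-- ... run test queries or writes against db.snap ...

-- Remove the snapshot table when testing is done; the source table is untouched
DROP TABLE catalog_name.db.snap;
```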

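2. migrate

(1) Overview

Converts a non-Iceberg table into an Iceberg table. The files of the original table are loaded into the new Iceberg table, and the original table's schema, partitioning, properties, and location are all copied to the new table.

(2) Parameters

table (required): string, the original table to migrate.

properties: map<string, string>, properties for the new Iceberg table.

(3) Example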

call catalog_name.system.migrate('db.sample');
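
The properties parameter can be passed as a Spark map. In this sketch the key 'foo' and value 'bar' are placeholders for whatever table properties you want to set on the migrated table.

```sql
-- Migrate db.sample and set an extra table property on the new Iceberg table
CALL catalog_name.system.migrate(
  table => 'db.sample',
  properties => map('foo', 'bar')
);
```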

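3. add_files

(1) Overview

Adds data files from a Hive table or another file-based table to a given Iceberg table. Files can be imported from one or more partitions. add_files only adds metadata for the imported data files; it does not physically move them, and it does not check whether the schema of the imported files matches the Iceberg table.

(2) Parameters

table (required): string, the target table that the data is imported into.

source_table (required): string, the source table that supplies the data to import. For a Hive or Spark table, use db.tbname; for a directory of files, use the form `file_format`.`path`.

partition_filter: map<string, string>, the set of partitions in the source table to import from.

(3) Example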

-- Add the data in partition year = 2022 of db.src_tbl (a non-Iceberg table) to db.tbl
call spark_catalog.system.add_files(
    table => 'db.tbl',
    source_table => 'db.src_tbl',
    partition_filter => map('year', '2022')
);

-- Add the parquet files under the directory path/to/table to db.tbl
call spark_catalog.system.add_files(
    table => 'db.tbl',
    source_table => '`parquet`.`path/to/table`'
);

posted @ 2023-12-08 16:09 Robots2
