hash对比方案落地
PgSQL vs GaussDB 行级 Hash 数据比对工具
技术栈:Java 8 · Spring Boot 2.7.x · PostgreSQL JDBC 42.x · GaussDB JDBC · HikariCP
目录
1. 整体设计
用户请求(表名 + 日期范围)
│
▼
┌───────────────────────┐
│ HashSqlBuilder │ 查询 PgSQL information_schema 获取列信息
│ 按数据类型生成标准化 │ 生成 MD5(concat_ws(...)) 的 SQL
└───────────────────────┘
│
▼
┌───────────────────────┐
│ ShardSplitter │ 按 id 范围(或行数估算)切分 N 个分片
└───────────────────────┘
│ 提交 N 个任务
▼
┌───────────────────────────────────────┐
│ compareExecutor(线程池) │
│ 每个 Worker: │
│ ├─ PgSQL 查询(ioExecutor 并发) │
│ └─ Gauss 查询(ioExecutor 并发) │
│ └─ diffMaps → ShardDiff │
└───────────────────────────────────────┘
│ CompletableFuture.allOf
▼
┌───────────────────────┐
│ ResultAggregator │ 合并各分片差异 → CompareReport
└───────────────────────┘
核心优化点:
| 优化点 | 说明 |
|---|---|
| 分片并发 | 每片 20w 行,多片并行,线程池控制并发度 |
| 同片双库并发 | 每个 Worker 内 PgSQL / Gauss 查询同时发出,取两者中慢的等待时间 |
| 流式游标读取 | fetchSize=2000,启用游标防 OOM |
| 两阶段对比 | 先 XOR 粗筛有差异的分片,再精确行级 hash 对比 |
| 背压保护 | CallerRunsPolicy,队列满时主线程自行执行,不丢任务不 OOM |
2. Maven 依赖
<properties>
<java.version>1.8</java.version>
<spring-boot.version>2.7.18</spring-boot.version>
</properties>
<dependencies>
<!-- Spring Boot Web(REST 触发接口,可选) -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- Spring JDBC(核心) -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-jdbc</artifactId>
</dependency>
<!-- PostgreSQL 驱动(42.7.x 为 PgSQL 最新稳定版) -->
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>42.7.3</version>
</dependency>
<!-- GaussDB JDBC(使用华为官方驱动,替换为实际 jar 或私有仓库坐标) -->
<!-- 如无 Maven 仓库,使用 system scope:
<dependency>
<groupId>com.huawei.gauss</groupId>
<artifactId>gaussdb-jdbc</artifactId>
<version>503.2.0</version>
<scope>system</scope>
<systemPath>${project.basedir}/libs/gsjdbc4.jar</systemPath>
</dependency>
-->
<!-- HikariCP(Spring Boot 2.x 默认已内置,无需显式引入) -->
<!-- Lombok(简化 getter/builder/log) -->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<!-- Jackson(序列化报告 JSON) -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</dependency>
</dependencies>
3. 配置文件
# application.yml
spring:
datasource:
# ──────────────── PgSQL 数据源 ────────────────
pgsql:
jdbc-url: jdbc:postgresql://10.0.0.1:5432/your_db
username: your_user
password: your_pass
driver-class-name: org.postgresql.Driver
hikari:
pool-name: HikariPgSQL
maximum-pool-size: 20 # 根据 DB 承载能力调整
minimum-idle: 5
connection-timeout: 30000
idle-timeout: 600000
max-lifetime: 1800000
# ──────────────── GaussDB 数据源 ────────────────
gauss:
jdbc-url: jdbc:gaussdb://10.0.0.2:5432/your_db
username: your_user
password: your_pass
driver-class-name: com.huawei.gaussdb.jdbc.Driver
hikari:
pool-name: HikariGauss
maximum-pool-size: 20
minimum-idle: 5
connection-timeout: 30000
idle-timeout: 600000
max-lifetime: 1800000
# 比对工具参数
compare:
shard-size: 200000 # 每片行数,建议 10w~50w
compare-threads: 8 # 比对线程池核心线程数
compare-threads-max: 16 # 比对线程池最大线程数
io-threads: 32 # IO 线程池(同时跑 PgSQL+Gauss 查询)
fetch-size: 2000 # JDBC 游标每批获取行数
queue-capacity: 200 # 比对任务队列容量
4. 数据源配置(双数据源)
// DataSourceConfig.java
package com.example.dbcompare.config;
import com.zaxxer.hikari.HikariDataSource;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.boot.jdbc.DataSourceBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;
import org.springframework.jdbc.core.JdbcTemplate;
import javax.sql.DataSource;
@Configuration
public class DataSourceConfig {
/**
* PgSQL 数据源(Primary,用于读取 information_schema)
*/
@Primary
@Bean("pgsqlDataSource")
@ConfigurationProperties(prefix = "spring.datasource.pgsql")
public DataSource pgsqlDataSource() {
return DataSourceBuilder.create().type(HikariDataSource.class).build();
}
/**
* GaussDB 数据源
*/
@Bean("gaussDataSource")
@ConfigurationProperties(prefix = "spring.datasource.gauss")
public DataSource gaussDataSource() {
return DataSourceBuilder.create().type(HikariDataSource.class).build();
}
@Primary
@Bean("pgsqlJdbcTemplate")
public JdbcTemplate pgsqlJdbcTemplate(@Qualifier("pgsqlDataSource") DataSource ds) {
JdbcTemplate tpl = new JdbcTemplate(ds);
tpl.setFetchSize(2000); // 需配合事务只读+游标使用
return tpl;
}
@Bean("gaussJdbcTemplate")
public JdbcTemplate gaussJdbcTemplate(@Qualifier("gaussDataSource") DataSource ds) {
JdbcTemplate tpl = new JdbcTemplate(ds);
tpl.setFetchSize(2000);
return tpl;
}
}
注意:
@Qualifier需要引入org.springframework.beans.factory.annotation.Qualifier。
5. 数据类型标准化与 Hash SQL 生成器
5.1 列元数据模型
// ColumnMeta.java
package com.example.dbcompare.model;
import lombok.Data;
@Data
public class ColumnMeta {
private String columnName;
private String dataType; // information_schema 中的 data_type
private String udtName; // pg_catalog.pg_type 中的 udt_name(更精确)
private Integer numericPrecision;
private Integer numericScale;
private Integer datetimePrecision;
private String isNullable; // YES / NO
}
5.2 Hash SQL 生成器
// HashSqlBuilder.java
package com.example.dbcompare.builder;
import com.example.dbcompare.model.ColumnMeta;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.jdbc.core.BeanPropertyRowMapper;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;
import java.util.List;
import java.util.stream.Collectors;
/**
* 根据表结构动态生成 MD5(concat_ws(...)) 的 Hash 查询 SQL。
*
* <p>核心原则:两库各列显式 CAST 为 text,加分隔符拼接后 MD5,
* 保证跨库序列化结果一致。</p>
*/
@Slf4j
@Component
@RequiredArgsConstructor
public class HashSqlBuilder {
@Qualifier("pgsqlJdbcTemplate")
private final JdbcTemplate pgsqlJdbcTemplate;
// ----------------------------------------------------------------
// 公共入口
// ----------------------------------------------------------------
/**
* 构建分片 Hash 查询 SQL(含 OFFSET/LIMIT 分页)。
*
* @param schema schema 名称,如 "public"
* @param table 表名
* @param dateCol 日期过滤字段名
* @param dateFrom 开始日期(含),格式 "yyyy-MM-dd"
* @param dateTo 结束日期(含),格式 "yyyy-MM-dd"
* @param offset 分片偏移
* @param limit 分片大小
* @param isPgsql true=PgSQL语法,false=GaussDB语法(目前两者几乎相同)
*/
public String buildShardHashSql(String schema, String table, String dateCol,
String dateFrom, String dateTo,
long offset, int limit, boolean isPgsql) {
List<ColumnMeta> columns = getColumns(schema, table);
String hashExpr = buildHashExpr(columns, isPgsql);
return String.format(
"SELECT id, %s AS row_hash" +
" FROM %s.%s" +
" WHERE %s >= '%s' AND %s <= '%s'" +
" ORDER BY id" +
" OFFSET %d LIMIT %d",
hashExpr,
schema, table,
dateCol, dateFrom,
dateCol, dateTo,
offset, limit
);
}
/**
* 构建用于「粗筛」阶段的 XOR 聚合 SQL(按分片范围聚合单个 bigint)。
* XOR 相等则该分片数据极大概率一致,可跳过精细比对。
*/
public String buildXorAggregateSql(String schema, String table, String dateCol,
String dateFrom, String dateTo,
long offset, int limit, boolean isPgsql) {
List<ColumnMeta> columns = getColumns(schema, table);
String hashExpr = buildHashExpr(columns, isPgsql);
// hashtext() 是 PgSQL/GaussDB 内置函数,将 text 映射为 int4
// 用 BIT_XOR 聚合;若驱动不支持 BIT_XOR 可改用 SUM(hashtext(...)::bigint)
return String.format(
"SELECT COUNT(*) AS cnt," +
" BIT_XOR(hashtext(%s)::bigint) AS xor_val" +
" FROM (" +
" SELECT %s AS row_hash FROM %s.%s" +
" WHERE %s >= '%s' AND %s <= '%s'" +
" ORDER BY id OFFSET %d LIMIT %d" +
" ) t",
hashExpr, hashExpr,
schema, table,
dateCol, dateFrom,
dateCol, dateTo,
offset, limit
);
}
// ----------------------------------------------------------------
// 查询列元数据(仅从 PgSQL 查,GaussDB 结构相同)
// ----------------------------------------------------------------
public List<ColumnMeta> getColumns(String schema, String table) {
String sql =
"SELECT c.column_name," +
" c.data_type," +
" c.udt_name," +
" c.numeric_precision," +
" c.numeric_scale," +
" c.datetime_precision," +
" c.is_nullable" +
" FROM information_schema.columns c" +
" WHERE c.table_schema = ?" +
" AND c.table_name = ?" +
" ORDER BY c.ordinal_position";
return pgsqlJdbcTemplate.query(sql,
new BeanPropertyRowMapper<>(ColumnMeta.class),
schema, table);
}
// ----------------------------------------------------------------
// 核心:按数据类型生成标准化表达式
// ----------------------------------------------------------------
private String buildHashExpr(List<ColumnMeta> columns, boolean isPgsql) {
String colExprs = columns.stream()
.map(col -> toNormalizedText(col, isPgsql))
.collect(Collectors.joining(", "));
// concat_ws 使用 '|' 作分隔符;NULL 列由 COALESCE 转为字面 'NULL'
return "MD5(concat_ws('|', " + colExprs + "))";
}
/**
* 将单列转换为跨库一致的 text 表达式。
*
* <p>各类型处理策略:</p>
* <ul>
* <li>integer/bigint/smallint → 直接 ::text,两库一致</li>
* <li>numeric/decimal → to_char 固定精度,消除尾零差异</li>
* <li>real/double → to_char 固定精度,消除科学计数差异</li>
* <li>boolean → CASE 统一为 '1'/'0'</li>
* <li>timestamp(无时区) → to_char 固定格式</li>
* <li>timestamptz(带时区) → AT TIME ZONE 'UTC' 后 to_char</li>
* <li>date → to_char 固定格式</li>
* <li>time → to_char 固定格式</li>
* <li>jsonb/json → 两库都转 text,键序可能不同;
* 如需严格对比需应用层排序,此处简化处理</li>
* <li>array(_xxx) → array_to_string 展开</li>
* <li>其余(text/varchar/uuid等)→ 直接 ::text</li>
* </ul>
*/
private String toNormalizedText(ColumnMeta col, boolean isPgsql) {
String name = "\"" + col.getColumnName() + "\"";
String udtName = col.getUdtName() == null ? "" : col.getUdtName().toLowerCase();
String dataType = col.getDataType() == null ? "" : col.getDataType().toLowerCase();
String expr;
if (isIntegerType(dataType, udtName)) {
// 整数:直接转 text
expr = name + "::text";
} else if (isNumericType(dataType, udtName)) {
// numeric/decimal:to_char 保留最多 20 位小数,消除尾零
// FM 前缀去掉前导空格
expr = "to_char(" + name + ", 'FM99999999999999999990.####################')";
} else if (isFloatType(dataType, udtName)) {
// float/double:固定 15 位有效数字
expr = "to_char(" + name + "::double precision, 'FM0.####################')";
} else if (isBooleanType(dataType, udtName)) {
// boolean:统一为 1/0
expr = "CASE WHEN " + name + " THEN '1' ELSE '0' END";
} else if (isTimestampTzType(dataType, udtName)) {
// timestamptz:先转 UTC,再格式化(消除时区标记差异)
expr = "to_char(" + name + " AT TIME ZONE 'UTC', 'YYYY-MM-DD HH24:MI:SS.US')";
} else if (isTimestampType(dataType, udtName)) {
// timestamp(无时区):直接格式化,精度到微秒
expr = "to_char(" + name + ", 'YYYY-MM-DD HH24:MI:SS.US')";
} else if (isDateType(dataType, udtName)) {
expr = "to_char(" + name + ", 'YYYY-MM-DD')";
} else if (isTimeType(dataType, udtName)) {
expr = "to_char(" + name + ", 'HH24:MI:SS.US')";
} else if (isArrayType(udtName)) {
// 数组:展开为逗号分隔字符串
expr = "array_to_string(" + name + "::text[], ',')";
} else if (isJsonType(dataType, udtName)) {
// json/jsonb:转 text(注意键序不保证一致,若需严格对比需应用层处理)
expr = name + "::text";
} else {
// text / varchar / char / uuid / bytea(bytea 转 hex)等
if ("bytea".equals(udtName)) {
expr = "encode(" + name + ", 'hex')";
} else {
expr = name + "::text";
}
}
// 统一用 COALESCE 将 NULL 替换为字面 'NULL',避免 concat_ws 跳过 NULL 列
return "COALESCE(" + expr + ", 'NULL')";
}
// ----------------------------------------------------------------
// 类型判断辅助方法
// ----------------------------------------------------------------
private boolean isIntegerType(String dt, String udt) {
return dt.contains("integer") || dt.contains("int") || dt.contains("serial")
|| "int2".equals(udt) || "int4".equals(udt) || "int8".equals(udt);
}
private boolean isNumericType(String dt, String udt) {
return dt.contains("numeric") || dt.contains("decimal")
|| "numeric".equals(udt);
}
private boolean isFloatType(String dt, String udt) {
return dt.contains("real") || dt.contains("double") || dt.contains("float")
|| "float4".equals(udt) || "float8".equals(udt);
}
private boolean isBooleanType(String dt, String udt) {
return dt.contains("boolean") || "bool".equals(udt);
}
private boolean isTimestampTzType(String dt, String udt) {
return dt.contains("timestamp with time zone") || "timestamptz".equals(udt);
}
private boolean isTimestampType(String dt, String udt) {
return (dt.contains("timestamp") && !dt.contains("with time zone"))
|| "timestamp".equals(udt);
}
private boolean isDateType(String dt, String udt) {
return "date".equals(dt) || "date".equals(udt);
}
private boolean isTimeType(String dt, String udt) {
return (dt.contains("time") && !dt.contains("timestamp"))
|| "time".equals(udt) || "timetz".equals(udt);
}
private boolean isArrayType(String udt) {
return udt.startsWith("_"); // PgSQL 中数组类型 udt_name 以 _ 开头
}
private boolean isJsonType(String dt, String udt) {
return dt.contains("json") || "json".equals(udt) || "jsonb".equals(udt);
}
}
6. 核心比对服务
6.1 比对任务请求
// CompareRequest.java
package com.example.dbcompare.model;
import lombok.Data;
import javax.validation.constraints.NotBlank;
@Data
public class CompareRequest {
@NotBlank
private String schema = "public";
@NotBlank
private String tableName;
/** 日期过滤字段名 */
@NotBlank
private String dateColumn;
/** 开始日期,格式 yyyy-MM-dd(含) */
@NotBlank
private String dateFrom;
/** 结束日期,格式 yyyy-MM-dd(含) */
@NotBlank
private String dateTo;
/** 每片行数,默认 20w;可在请求中覆盖 */
private int shardSize = 200_000;
}
6.2 结果模型
// ShardXorResult.java
package com.example.dbcompare.model;
import lombok.Data;
@Data
public class ShardXorResult {
private long shardIndex;
private long offset;
private long limit;
private long pgCount;
private long gaussCount;
private Long pgXor;
private Long gaussXor;
/** 是否一致(XOR相等且行数相等) */
private boolean consistent;
}
// ShardDiff.java
package com.example.dbcompare.model;
import lombok.Data;
import java.util.List;
@Data
public class ShardDiff {
private long shardIndex;
/** PgSQL 有,Gauss 无(id 列表) */
private List<Object> onlyInPgsql;
/** Gauss 有,PgSQL 无(id 列表) */
private List<Object> onlyInGauss;
/** 两边都有但 hash 不同(id 列表) */
private List<Object> hashMismatch;
}
// CompareReport.java
package com.example.dbcompare.model;
import lombok.Data;
import java.time.LocalDateTime;
import java.util.List;
@Data
public class CompareReport {
private String tableName;
private String dateFrom;
private String dateTo;
private LocalDateTime startTime;
private LocalDateTime endTime;
private long costMillis;
private long totalShards;
/** XOR 粗筛一致的分片数 */
private long consistentShards;
/** 需要精细对比的分片数 */
private long inconsistentShards;
private long totalPgsqlRows;
private long totalGaussRows;
/** 仅在 PgSQL 的 id 列表 */
private List<Object> onlyInPgsql;
/** 仅在 GaussDB 的 id 列表 */
private List<Object> onlyInGauss;
/** Hash 不一致的 id 列表 */
private List<Object> hashMismatch;
public boolean isFullyConsistent() {
return onlyInPgsql.isEmpty() && onlyInGauss.isEmpty() && hashMismatch.isEmpty();
}
}
6.3 比对线程池配置
// CompareExecutorConfig.java
package com.example.dbcompare.config;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
@Configuration
public class CompareExecutorConfig {
@Value("${compare.compare-threads:8}")
private int compareThreads;
@Value("${compare.compare-threads-max:16}")
private int compareThreadsMax;
@Value("${compare.io-threads:32}")
private int ioThreads;
@Value("${compare.queue-capacity:200}")
private int queueCapacity;
/**
* 比对分片任务线程池(CPU 密集型:对比两个 Map)
*/
@Bean("compareExecutor")
public ExecutorService compareExecutor() {
return new ThreadPoolExecutor(
compareThreads, compareThreadsMax,
60L, TimeUnit.SECONDS,
new LinkedBlockingQueue<>(queueCapacity),
namedThreadFactory("compare-worker"),
new ThreadPoolExecutor.CallerRunsPolicy() // 背压:队满主线程自行执行
);
}
/**
* IO 线程池(用于 PgSQL / Gauss 并发查询,IO 密集型,线程数可以较多)
*/
@Bean("ioExecutor")
public ExecutorService ioExecutor() {
return new ThreadPoolExecutor(
ioThreads, ioThreads,
60L, TimeUnit.SECONDS,
new LinkedBlockingQueue<>(500),
namedThreadFactory("compare-io"),
new ThreadPoolExecutor.CallerRunsPolicy()
);
}
private ThreadFactory namedThreadFactory(String prefix) {
AtomicInteger counter = new AtomicInteger(0);
return r -> {
Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
t.setDaemon(true);
return t;
};
}
}
6.4 核心比对服务
// DataCompareService.java
package com.example.dbcompare.service;
import com.example.dbcompare.builder.HashSqlBuilder;
import com.example.dbcompare.model.*;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.LocalDateTime;
import java.util.*;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.stream.Collectors;
@Slf4j
@Service
@RequiredArgsConstructor
public class DataCompareService {
@Qualifier("pgsqlDataSource")
private final DataSource pgsqlDataSource;
@Qualifier("gaussDataSource")
private final DataSource gaussDataSource;
private final HashSqlBuilder hashSqlBuilder;
@Qualifier("compareExecutor")
private final ExecutorService compareExecutor;
@Qualifier("ioExecutor")
private final ExecutorService ioExecutor;
@Value("${compare.fetch-size:2000}")
private int fetchSize;
// ================================================================
// 公共入口
// ================================================================
public CompareReport compare(CompareRequest req) {
LocalDateTime startTime = LocalDateTime.now();
long t0 = System.currentTimeMillis();
log.info("[Compare] Start: table={}.{} dateRange=[{}, {}] shardSize={}",
req.getSchema(), req.getTableName(),
req.getDateFrom(), req.getDateTo(), req.getShardSize());
// 1. 估算总行数(从 PgSQL 查,决定分片数)
long total = countRows(req);
int shardCount = (int) Math.ceil((double) total / req.getShardSize());
log.info("[Compare] Estimated rows={}, shards={}", total, shardCount);
// 2. 第一阶段:XOR 粗筛(并发)
List<ShardXorResult> xorResults = phaseOneXorCheck(req, shardCount);
long consistent = xorResults.stream().filter(ShardXorResult::isConsistent).count();
long inconsistent = shardCount - consistent;
log.info("[Compare] Phase1 done: consistent={} inconsistent={}", consistent, inconsistent);
// 3. 第二阶段:对 XOR 不一致的分片做行级精细比对(并发)
List<ShardXorResult> dirtyShards = xorResults.stream()
.filter(r -> !r.isConsistent())
.collect(Collectors.toList());
List<ShardDiff> diffs = phaseTwoDetailCompare(req, dirtyShards);
// 4. 聚合结果
CompareReport report = aggregate(req, xorResults, diffs, startTime, t0,
shardCount, consistent, inconsistent);
log.info("[Compare] Done: totalPg={} totalGauss={} onlyPg={} onlyGauss={} mismatch={} costMs={}",
report.getTotalPgsqlRows(), report.getTotalGaussRows(),
report.getOnlyInPgsql().size(), report.getOnlyInGauss().size(),
report.getHashMismatch().size(), report.getCostMillis());
return report;
}
// ================================================================
// 第一阶段:XOR 粗筛
// ================================================================
private List<ShardXorResult> phaseOneXorCheck(CompareRequest req, int shardCount) {
List<CompletableFuture<ShardXorResult>> futures = new ArrayList<>(shardCount);
for (int i = 0; i < shardCount; i++) {
final long offset = (long) i * req.getShardSize();
final int limit = req.getShardSize();
final int shardIndex = i;
futures.add(CompletableFuture.supplyAsync(
() -> xorOneShard(req, shardIndex, offset, limit),
compareExecutor
));
}
return futures.stream()
.map(CompletableFuture::join)
.collect(Collectors.toList());
}
private ShardXorResult xorOneShard(CompareRequest req, int shardIndex,
long offset, long limit) {
String pgSql = hashSqlBuilder.buildXorAggregateSql(
req.getSchema(), req.getTableName(), req.getDateColumn(),
req.getDateFrom(), req.getDateTo(), offset, (int) limit, true);
String gaussSql = hashSqlBuilder.buildXorAggregateSql(
req.getSchema(), req.getTableName(), req.getDateColumn(),
req.getDateFrom(), req.getDateTo(), offset, (int) limit, false);
// 两库并发查询
CompletableFuture<long[]> pgFut = CompletableFuture.supplyAsync(
() -> queryXor(pgsqlDataSource, pgSql), ioExecutor);
CompletableFuture<long[]> gaussFut = CompletableFuture.supplyAsync(
() -> queryXor(gaussDataSource, gaussSql), ioExecutor);
long[] pgResult = pgFut.join(); // [cnt, xor]
long[] gaussResult = gaussFut.join();
ShardXorResult r = new ShardXorResult();
r.setShardIndex(shardIndex);
r.setOffset(offset);
r.setLimit(limit);
r.setPgCount(pgResult[0]);
r.setGaussCount(gaussResult[0]);
r.setPgXor(pgResult[1]);
r.setGaussXor(gaussResult[1]);
r.setConsistent(pgResult[0] == gaussResult[0]
&& Objects.equals(pgResult[1], gaussResult[1]));
return r;
}
/** 执行 XOR 聚合查询,返回 [cnt, xor_val] */
private long[] queryXor(DataSource ds, String sql) {
try (Connection conn = ds.getConnection();
PreparedStatement ps = conn.prepareStatement(sql);
ResultSet rs = ps.executeQuery()) {
if (rs.next()) {
return new long[]{rs.getLong("cnt"), rs.getLong("xor_val")};
}
} catch (Exception e) {
log.error("[XOR] Query failed: {}", e.getMessage(), e);
throw new RuntimeException("XOR query failed", e);
}
return new long[]{0L, 0L};
}
// ================================================================
// 第二阶段:行级精细比对
// ================================================================
private List<ShardDiff> phaseTwoDetailCompare(CompareRequest req,
List<ShardXorResult> dirtyShards) {
if (dirtyShards.isEmpty()) {
return Collections.emptyList();
}
List<CompletableFuture<ShardDiff>> futures = new ArrayList<>(dirtyShards.size());
for (ShardXorResult shard : dirtyShards) {
futures.add(CompletableFuture.supplyAsync(
() -> detailOneShard(req, shard),
compareExecutor
));
}
return futures.stream()
.map(CompletableFuture::join)
.collect(Collectors.toList());
}
private ShardDiff detailOneShard(CompareRequest req, ShardXorResult shard) {
String pgSql = hashSqlBuilder.buildShardHashSql(
req.getSchema(), req.getTableName(), req.getDateColumn(),
req.getDateFrom(), req.getDateTo(),
shard.getOffset(), (int) shard.getLimit(), true);
String gaussSql = hashSqlBuilder.buildShardHashSql(
req.getSchema(), req.getTableName(), req.getDateColumn(),
req.getDateFrom(), req.getDateTo(),
shard.getOffset(), (int) shard.getLimit(), false);
// 两库并发流式查询,结果存入 LinkedHashMap(保序)
CompletableFuture<Map<Object, String>> pgFut = CompletableFuture.supplyAsync(
() -> streamHashQuery(pgsqlDataSource, pgSql), ioExecutor);
CompletableFuture<Map<Object, String>> gaussFut = CompletableFuture.supplyAsync(
() -> streamHashQuery(gaussDataSource, gaussSql), ioExecutor);
Map<Object, String> pgMap = pgFut.join();
Map<Object, String> gaussMap = gaussFut.join();
return diffMaps(shard.getShardIndex(), pgMap, gaussMap);
}
/**
* 流式游标查询 id -> row_hash 映射。
* 使用原生 Connection + PreparedStatement 设置 fetchSize,
* 避免一次性将千万行加载进内存。
*/
private Map<Object, String> streamHashQuery(DataSource ds, String sql) {
Map<Object, String> result = new LinkedHashMap<>(50_000);
try (Connection conn = ds.getConnection()) {
// PgSQL 要求 autoCommit=false 才能启用服务端游标
conn.setAutoCommit(false);
try (PreparedStatement ps = conn.prepareStatement(sql,
ResultSet.TYPE_FORWARD_ONLY,
ResultSet.CONCUR_READ_ONLY)) {
ps.setFetchSize(fetchSize);
try (ResultSet rs = ps.executeQuery()) {
while (rs.next()) {
result.put(rs.getObject("id"), rs.getString("row_hash"));
}
}
} finally {
conn.setAutoCommit(true);
}
} catch (Exception e) {
log.error("[Detail] Stream query failed: {}", e.getMessage(), e);
throw new RuntimeException("Detail hash query failed", e);
}
return result;
}
/**
* 比较两个 Map,找出三类差异:仅PgSQL有 / 仅Gauss有 / hash不同
*/
private ShardDiff diffMaps(long shardIndex,
Map<Object, String> pgMap,
Map<Object, String> gaussMap) {
List<Object> onlyInPg = new ArrayList<>();
List<Object> onlyInGauss = new ArrayList<>();
List<Object> mismatch = new ArrayList<>();
// PgSQL 侧遍历
pgMap.forEach((id, pgHash) -> {
String gaussHash = gaussMap.get(id);
if (gaussHash == null) {
onlyInPg.add(id);
} else if (!pgHash.equals(gaussHash)) {
mismatch.add(id);
}
});
// Gauss 独有
gaussMap.forEach((id, v) -> {
if (!pgMap.containsKey(id)) {
onlyInGauss.add(id);
}
});
ShardDiff diff = new ShardDiff();
diff.setShardIndex(shardIndex);
diff.setOnlyInPgsql(onlyInPg);
diff.setOnlyInGauss(onlyInGauss);
diff.setHashMismatch(mismatch);
return diff;
}
// ================================================================
// 辅助:统计行数
// ================================================================
private long countRows(CompareRequest req) {
String sql = String.format(
"SELECT COUNT(*) FROM %s.%s WHERE %s >= '%s' AND %s <= '%s'",
req.getSchema(), req.getTableName(),
req.getDateColumn(), req.getDateFrom(),
req.getDateColumn(), req.getDateTo()
);
try (Connection conn = pgsqlDataSource.getConnection();
PreparedStatement ps = conn.prepareStatement(sql);
ResultSet rs = ps.executeQuery()) {
return rs.next() ? rs.getLong(1) : 0L;
} catch (Exception e) {
log.warn("[Count] Failed, fallback to 0: {}", e.getMessage());
return 0L;
}
}
// ================================================================
// 聚合最终报告
// ================================================================
private CompareReport aggregate(CompareRequest req,
List<ShardXorResult> xorResults,
List<ShardDiff> diffs,
LocalDateTime startTime, long t0,
int shardCount, long consistent, long inconsistent) {
CompareReport report = new CompareReport();
report.setTableName(req.getSchema() + "." + req.getTableName());
report.setDateFrom(req.getDateFrom());
report.setDateTo(req.getDateTo());
report.setStartTime(startTime);
report.setEndTime(LocalDateTime.now());
report.setCostMillis(System.currentTimeMillis() - t0);
report.setTotalShards(shardCount);
report.setConsistentShards(consistent);
report.setInconsistentShards(inconsistent);
report.setTotalPgsqlRows(
xorResults.stream().mapToLong(ShardXorResult::getPgCount).sum());
report.setTotalGaussRows(
xorResults.stream().mapToLong(ShardXorResult::getGaussCount).sum());
List<Object> allOnlyPg = new ArrayList<>();
List<Object> allOnlyGauss = new ArrayList<>();
List<Object> allMismatch = new ArrayList<>();
for (ShardDiff d : diffs) {
allOnlyPg.addAll(d.getOnlyInPgsql());
allOnlyGauss.addAll(d.getOnlyInGauss());
allMismatch.addAll(d.getHashMismatch());
}
report.setOnlyInPgsql(allOnlyPg);
report.setOnlyInGauss(allOnlyGauss);
report.setHashMismatch(allMismatch);
return report;
}
}
7. REST 接口(可选触发入口)
// DataCompareController.java
package com.example.dbcompare.controller;
import com.example.dbcompare.model.CompareReport;
import com.example.dbcompare.model.CompareRequest;
import com.example.dbcompare.service.DataCompareService;
import lombok.RequiredArgsConstructor;
import org.springframework.http.ResponseEntity;
import org.springframework.validation.annotation.Validated;
import org.springframework.web.bind.annotation.*;
@RestController
@RequestMapping("/api/compare")
@RequiredArgsConstructor
public class DataCompareController {
private final DataCompareService compareService;
/**
* 触发比对任务(同步返回,大数据量建议改为异步 + 轮询)。
*
* POST /api/compare/run
* {
* "schema": "public",
* "tableName": "order",
* "dateColumn": "created_at",
* "dateFrom": "2026-01-01",
* "dateTo": "2026-01-31",
* "shardSize": 200000
* }
*/
@PostMapping("/run")
public ResponseEntity<CompareReport> run(@Validated @RequestBody CompareRequest req) {
CompareReport report = compareService.compare(req);
return ResponseEntity.ok(report);
}
}
8. 关键设计说明
8.1 为什么用 MD5(concat_ws) 而不是 MD5((t.*)::text)
MD5(concat_ws(...)) |
MD5((t.*)::text) |
|
|---|---|---|
| 跨库一致性 | 高(显式标准化每列) | 低(行类型序列化格式依赖实现) |
| NULL 安全 | 显式 COALESCE 处理 | 行类型中 NULL 可能被忽略或格式不同 |
| 数值精度 | 可控(to_char 固定格式) | 可能因精度不同产生误报 |
| 维护成本 | 中(需按类型处理) | 低(但错误率高) |
8.2 两阶段比对的收益
场景:1000 万行,其中仅 500 行有差异,分 50 片(每片 20w)
第一阶段:XOR 粗筛(50 次聚合查询)
→ 仅 3 个分片 XOR 不一致(含那 500 行)
→ 其余 47 片(9,400,000 行)直接跳过精细比对
第二阶段:仅对 3 片(600,000 行)做行级 hash 比对
→ 传输量从 1000w 行降低到 60w 行,减少约 94%
8.3 流式游标读取的关键配置
PgSQL 启用服务端游标的前提:
Connection.setAutoCommit(false)— 必须在事务内PreparedStatement.setFetchSize(N)— N > 0 启用分批拉取ResultSet.TYPE_FORWARD_ONLY + CONCUR_READ_ONLY— 单向只读游标
GaussDB 与 PgSQL 在此行为一致。
8.4 线程模型
主线程
└─ compareExecutor(8~16线程)— 负责分片任务调度和 Map diff
└─ ioExecutor(32线程)— 负责 PgSQL/Gauss 并发查询(IO密集)
每个分片的时序(单 Worker):
t=0: 提交 pgSql 到 ioExecutor → CompletableFuture A
t=0: 提交 gaussSql 到 ioExecutor → CompletableFuture B
t=T: join(A, B)(T = max(pgQueryTime, gaussQueryTime))
t=T+δ: diffMaps(内存操作,通常毫秒级)
8.5 内存消耗估算
| 场景 | 内存用量 |
|---|---|
| 每片 20w 行,每行 id(8B)+hash(32B) = 40B | 20w × 40B × 2(双库)≈ 16 MB/片 |
| 16 线程并发 | 峰值 ≈ 16 × 16MB = 256 MB |
| JVM 堆建议 | -Xmx2g(含其他开销余量) |
9. 性能调优参数参考
| 参数 | 建议值 | 说明 |
|---|---|---|
shardSize |
100,000 ~ 500,000 | 取决于单行数据大小,行宽越大片越小 |
compareThreads |
CPU核心数 × 1 | Map diff 为 CPU 密集 |
ioThreads |
DB连接池上限 × 0.6 | IO 密集,可多于 CPU 核数 |
fetchSize |
1,000 ~ 5,000 | 太小网络往返多;太大内存峰值高 |
maximum-pool-size |
20(每个库) | 与 ioThreads 匹配,避免连接等待 |
| JVM | -Xmx2g -XX:+UseG1GC |
G1 对大量短生命周期对象友好 |
10. 快速启动示例
# 启动服务
java -Xmx2g -XX:+UseG1GC -jar db-compare-tool.jar
# 触发比对
curl -X POST http://localhost:8080/api/compare/run \
-H "Content-Type: application/json" \
-d '{
"schema": "public",
"tableName": "orders",
"dateColumn": "created_at",
"dateFrom": "2026-01-01",
"dateTo": "2026-01-31",
"shardSize": 200000
}'
响应示例:
{
"tableName": "public.orders",
"dateFrom": "2026-01-01",
"dateTo": "2026-01-31",
"startTime": "2026-05-14T21:30:00",
"endTime": "2026-05-14T21:31:42",
"costMillis": 102000,
"totalShards": 50,
"consistentShards": 47,
"inconsistentShards": 3,
"totalPgsqlRows": 10000000,
"totalGaussRows": 10000000,
"onlyInPgsql": [],
"onlyInGauss": [],
"hashMismatch": [10001, 28345, 99827],
"fullyConsistent": false
}
浙公网安备 33010602011771号