hash对比方案落地

PgSQL vs GaussDB 行级 Hash 数据比对工具

技术栈:Java 8 · Spring Boot 2.7.x · PostgreSQL JDBC 42.x · GaussDB JDBC · HikariCP


目录

  1. 整体设计
  2. Maven 依赖
  3. 配置文件
  4. 数据源配置(双数据源)
  5. 数据类型标准化与 Hash SQL 生成器
  6. 核心比对服务
  7. 结果模型
  8. REST 接口(可选触发入口)
  9. 关键设计说明
  10. 性能调优参数参考

1. 整体设计

用户请求(表名 + 日期范围)
        │
        ▼
┌───────────────────────┐
│  HashSqlBuilder       │  查询 PgSQL information_schema 获取列信息
│  按数据类型生成标准化   │  生成 MD5(concat_ws(...)) 的 SQL
└───────────────────────┘
        │
        ▼
┌───────────────────────┐
│  ShardSplitter        │  按 id 范围(或行数估算)切分 N 个分片
└───────────────────────┘
        │ 提交 N 个任务
        ▼
┌───────────────────────────────────────┐
│  compareExecutor(线程池)             │
│  每个 Worker:                        │
│    ├─ PgSQL 查询(ioExecutor 并发)    │
│    └─ Gauss 查询(ioExecutor 并发)    │
│    └─ diffMaps → ShardDiff            │
└───────────────────────────────────────┘
        │ CompletableFuture.allOf
        ▼
┌───────────────────────┐
│  ResultAggregator     │  合并各分片差异 → CompareReport
└───────────────────────┘

核心优化点:

优化点 说明
分片并发 每片 20w 行,多片并行,线程池控制并发度
同片双库并发 每个 Worker 内 PgSQL / Gauss 查询同时发出,取两者中慢的等待时间
流式游标读取 fetchSize=2000,启用游标防 OOM
两阶段对比 先 XOR 粗筛有差异的分片,再精确行级 hash 对比
背压保护 CallerRunsPolicy,队列满时主线程自行执行,不丢任务不 OOM

2. Maven 依赖

<properties>
    <java.version>1.8</java.version>
    <spring-boot.version>2.7.18</spring-boot.version>
</properties>

<dependencies>
    <!-- Spring Boot Web(REST 触发接口,可选) -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- Spring JDBC(核心) -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-jdbc</artifactId>
    </dependency>

    <!-- PostgreSQL 驱动(42.7.x 为 PgSQL 最新稳定版) -->
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <version>42.7.3</version>
    </dependency>

    <!-- GaussDB JDBC(使用华为官方驱动,替换为实际 jar 或私有仓库坐标) -->
    <!-- 如无 Maven 仓库,使用 system scope:
    <dependency>
        <groupId>com.huawei.gauss</groupId>
        <artifactId>gaussdb-jdbc</artifactId>
        <version>503.2.0</version>
        <scope>system</scope>
        <systemPath>${project.basedir}/libs/gsjdbc4.jar</systemPath>
    </dependency>
    -->

    <!-- HikariCP(Spring Boot 2.x 默认已内置,无需显式引入) -->

    <!-- Lombok(简化 getter/builder/log) -->
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>

    <!-- Jackson(序列化报告 JSON) -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
    </dependency>
</dependencies>

3. 配置文件

# application.yml

spring:
  datasource:
    # ──────────────── PgSQL 数据源 ────────────────
    pgsql:
      jdbc-url: jdbc:postgresql://10.0.0.1:5432/your_db
      username: your_user
      password: your_pass
      driver-class-name: org.postgresql.Driver
      hikari:
        pool-name: HikariPgSQL
        maximum-pool-size: 20        # 根据 DB 承载能力调整
        minimum-idle: 5
        connection-timeout: 30000
        idle-timeout: 600000
        max-lifetime: 1800000

    # ──────────────── GaussDB 数据源 ────────────────
    gauss:
      jdbc-url: jdbc:gaussdb://10.0.0.2:5432/your_db
      username: your_user
      password: your_pass
      driver-class-name: com.huawei.gaussdb.jdbc.Driver
      hikari:
        pool-name: HikariGauss
        maximum-pool-size: 20
        minimum-idle: 5
        connection-timeout: 30000
        idle-timeout: 600000
        max-lifetime: 1800000

# 比对工具参数
compare:
  shard-size: 200000          # 每片行数,建议 10w~50w
  compare-threads: 8          # 比对线程池核心线程数
  compare-threads-max: 16     # 比对线程池最大线程数
  io-threads: 32              # IO 线程池(同时跑 PgSQL+Gauss 查询)
  fetch-size: 2000            # JDBC 游标每批获取行数
  queue-capacity: 200         # 比对任务队列容量

4. 数据源配置(双数据源)

// DataSourceConfig.java
package com.example.dbcompare.config;

import com.zaxxer.hikari.HikariDataSource;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.boot.jdbc.DataSourceBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;
import org.springframework.jdbc.core.JdbcTemplate;

import javax.sql.DataSource;

@Configuration
public class DataSourceConfig {

    /**
     * PgSQL 数据源(Primary,用于读取 information_schema)
     */
    @Primary
    @Bean("pgsqlDataSource")
    @ConfigurationProperties(prefix = "spring.datasource.pgsql")
    public DataSource pgsqlDataSource() {
        return DataSourceBuilder.create().type(HikariDataSource.class).build();
    }

    /**
     * GaussDB 数据源
     */
    @Bean("gaussDataSource")
    @ConfigurationProperties(prefix = "spring.datasource.gauss")
    public DataSource gaussDataSource() {
        return DataSourceBuilder.create().type(HikariDataSource.class).build();
    }

    @Primary
    @Bean("pgsqlJdbcTemplate")
    public JdbcTemplate pgsqlJdbcTemplate(@Qualifier("pgsqlDataSource") DataSource ds) {
        JdbcTemplate tpl = new JdbcTemplate(ds);
        tpl.setFetchSize(2000); // 需配合事务只读+游标使用
        return tpl;
    }

    @Bean("gaussJdbcTemplate")
    public JdbcTemplate gaussJdbcTemplate(@Qualifier("gaussDataSource") DataSource ds) {
        JdbcTemplate tpl = new JdbcTemplate(ds);
        tpl.setFetchSize(2000);
        return tpl;
    }
}

注意@Qualifier 需要引入 org.springframework.beans.factory.annotation.Qualifier


5. 数据类型标准化与 Hash SQL 生成器

5.1 列元数据模型

// ColumnMeta.java
package com.example.dbcompare.model;

import lombok.Data;

@Data
public class ColumnMeta {
    private String columnName;
    private String dataType;       // information_schema 中的 data_type
    private String udtName;        // pg_catalog.pg_type 中的 udt_name(更精确)
    private Integer numericPrecision;
    private Integer numericScale;
    private Integer datetimePrecision;
    private String isNullable;     // YES / NO
}

5.2 Hash SQL 生成器

// HashSqlBuilder.java
package com.example.dbcompare.builder;

import com.example.dbcompare.model.ColumnMeta;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.jdbc.core.BeanPropertyRowMapper;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

import java.util.List;
import java.util.stream.Collectors;

/**
 * 根据表结构动态生成 MD5(concat_ws(...)) 的 Hash 查询 SQL。
 *
 * <p>核心原则:两库各列显式 CAST 为 text,加分隔符拼接后 MD5,
 * 保证跨库序列化结果一致。</p>
 */
@Slf4j
@Component
@RequiredArgsConstructor
public class HashSqlBuilder {

    @Qualifier("pgsqlJdbcTemplate")
    private final JdbcTemplate pgsqlJdbcTemplate;

    // ----------------------------------------------------------------
    //  公共入口
    // ----------------------------------------------------------------

    /**
     * 构建分片 Hash 查询 SQL(含 OFFSET/LIMIT 分页)。
     *
     * @param schema    schema 名称,如 "public"
     * @param table     表名
     * @param dateCol   日期过滤字段名
     * @param dateFrom  开始日期(含),格式 "yyyy-MM-dd"
     * @param dateTo    结束日期(含),格式 "yyyy-MM-dd"
     * @param offset    分片偏移
     * @param limit     分片大小
     * @param isPgsql   true=PgSQL语法,false=GaussDB语法(目前两者几乎相同)
     */
    public String buildShardHashSql(String schema, String table, String dateCol,
                                    String dateFrom, String dateTo,
                                    long offset, int limit, boolean isPgsql) {
        List<ColumnMeta> columns = getColumns(schema, table);
        String hashExpr = buildHashExpr(columns, isPgsql);

        return String.format(
            "SELECT id, %s AS row_hash" +
            " FROM %s.%s" +
            " WHERE %s >= '%s' AND %s <= '%s'" +
            " ORDER BY id" +
            " OFFSET %d LIMIT %d",
            hashExpr,
            schema, table,
            dateCol, dateFrom,
            dateCol, dateTo,
            offset, limit
        );
    }

    /**
     * 构建用于「粗筛」阶段的 XOR 聚合 SQL(按分片范围聚合单个 bigint)。
     * XOR 相等则该分片数据极大概率一致,可跳过精细比对。
     */
    public String buildXorAggregateSql(String schema, String table, String dateCol,
                                       String dateFrom, String dateTo,
                                       long offset, int limit, boolean isPgsql) {
        List<ColumnMeta> columns = getColumns(schema, table);
        String hashExpr = buildHashExpr(columns, isPgsql);

        // hashtext() 是 PgSQL/GaussDB 内置函数,将 text 映射为 int4
        // 用 BIT_XOR 聚合;若驱动不支持 BIT_XOR 可改用 SUM(hashtext(...)::bigint)
        return String.format(
            "SELECT COUNT(*) AS cnt," +
            "       BIT_XOR(hashtext(%s)::bigint) AS xor_val" +
            " FROM (" +
            "   SELECT %s AS row_hash FROM %s.%s" +
            "   WHERE %s >= '%s' AND %s <= '%s'" +
            "   ORDER BY id OFFSET %d LIMIT %d" +
            " ) t",
            hashExpr, hashExpr,
            schema, table,
            dateCol, dateFrom,
            dateCol, dateTo,
            offset, limit
        );
    }

    // ----------------------------------------------------------------
    //  查询列元数据(仅从 PgSQL 查,GaussDB 结构相同)
    // ----------------------------------------------------------------

    public List<ColumnMeta> getColumns(String schema, String table) {
        String sql =
            "SELECT c.column_name," +
            "       c.data_type," +
            "       c.udt_name," +
            "       c.numeric_precision," +
            "       c.numeric_scale," +
            "       c.datetime_precision," +
            "       c.is_nullable" +
            " FROM information_schema.columns c" +
            " WHERE c.table_schema = ?" +
            "   AND c.table_name   = ?" +
            " ORDER BY c.ordinal_position";

        return pgsqlJdbcTemplate.query(sql,
            new BeanPropertyRowMapper<>(ColumnMeta.class),
            schema, table);
    }

    // ----------------------------------------------------------------
    //  核心:按数据类型生成标准化表达式
    // ----------------------------------------------------------------

    private String buildHashExpr(List<ColumnMeta> columns, boolean isPgsql) {
        String colExprs = columns.stream()
            .map(col -> toNormalizedText(col, isPgsql))
            .collect(Collectors.joining(", "));

        // concat_ws 使用 '|' 作分隔符;NULL 列由 COALESCE 转为字面 'NULL'
        return "MD5(concat_ws('|', " + colExprs + "))";
    }

    /**
     * 将单列转换为跨库一致的 text 表达式。
     *
     * <p>各类型处理策略:</p>
     * <ul>
     *   <li>integer/bigint/smallint   → 直接 ::text,两库一致</li>
     *   <li>numeric/decimal           → to_char 固定精度,消除尾零差异</li>
     *   <li>real/double               → to_char 固定精度,消除科学计数差异</li>
     *   <li>boolean                   → CASE 统一为 '1'/'0'</li>
     *   <li>timestamp(无时区)        → to_char 固定格式</li>
     *   <li>timestamptz(带时区)      → AT TIME ZONE 'UTC' 后 to_char</li>
     *   <li>date                      → to_char 固定格式</li>
     *   <li>time                      → to_char 固定格式</li>
     *   <li>jsonb/json                → 两库都转 text,键序可能不同;
     *                                   如需严格对比需应用层排序,此处简化处理</li>
     *   <li>array(_xxx)             → array_to_string 展开</li>
     *   <li>其余(text/varchar/uuid等)→ 直接 ::text</li>
     * </ul>
     */
    private String toNormalizedText(ColumnMeta col, boolean isPgsql) {
        String name = "\"" + col.getColumnName() + "\"";
        String udtName = col.getUdtName() == null ? "" : col.getUdtName().toLowerCase();
        String dataType = col.getDataType() == null ? "" : col.getDataType().toLowerCase();

        String expr;

        if (isIntegerType(dataType, udtName)) {
            // 整数:直接转 text
            expr = name + "::text";

        } else if (isNumericType(dataType, udtName)) {
            // numeric/decimal:to_char 保留最多 20 位小数,消除尾零
            // FM 前缀去掉前导空格
            expr = "to_char(" + name + ", 'FM99999999999999999990.####################')";

        } else if (isFloatType(dataType, udtName)) {
            // float/double:固定 15 位有效数字
            expr = "to_char(" + name + "::double precision, 'FM0.####################')";

        } else if (isBooleanType(dataType, udtName)) {
            // boolean:统一为 1/0
            expr = "CASE WHEN " + name + " THEN '1' ELSE '0' END";

        } else if (isTimestampTzType(dataType, udtName)) {
            // timestamptz:先转 UTC,再格式化(消除时区标记差异)
            expr = "to_char(" + name + " AT TIME ZONE 'UTC', 'YYYY-MM-DD HH24:MI:SS.US')";

        } else if (isTimestampType(dataType, udtName)) {
            // timestamp(无时区):直接格式化,精度到微秒
            expr = "to_char(" + name + ", 'YYYY-MM-DD HH24:MI:SS.US')";

        } else if (isDateType(dataType, udtName)) {
            expr = "to_char(" + name + ", 'YYYY-MM-DD')";

        } else if (isTimeType(dataType, udtName)) {
            expr = "to_char(" + name + ", 'HH24:MI:SS.US')";

        } else if (isArrayType(udtName)) {
            // 数组:展开为逗号分隔字符串
            expr = "array_to_string(" + name + "::text[], ',')";

        } else if (isJsonType(dataType, udtName)) {
            // json/jsonb:转 text(注意键序不保证一致,若需严格对比需应用层处理)
            expr = name + "::text";

        } else {
            // text / varchar / char / uuid / bytea(bytea 转 hex)等
            if ("bytea".equals(udtName)) {
                expr = "encode(" + name + ", 'hex')";
            } else {
                expr = name + "::text";
            }
        }

        // 统一用 COALESCE 将 NULL 替换为字面 'NULL',避免 concat_ws 跳过 NULL 列
        return "COALESCE(" + expr + ", 'NULL')";
    }

    // ----------------------------------------------------------------
    //  类型判断辅助方法
    // ----------------------------------------------------------------

    private boolean isIntegerType(String dt, String udt) {
        return dt.contains("integer") || dt.contains("int") || dt.contains("serial")
            || "int2".equals(udt) || "int4".equals(udt) || "int8".equals(udt);
    }

    private boolean isNumericType(String dt, String udt) {
        return dt.contains("numeric") || dt.contains("decimal")
            || "numeric".equals(udt);
    }

    private boolean isFloatType(String dt, String udt) {
        return dt.contains("real") || dt.contains("double") || dt.contains("float")
            || "float4".equals(udt) || "float8".equals(udt);
    }

    private boolean isBooleanType(String dt, String udt) {
        return dt.contains("boolean") || "bool".equals(udt);
    }

    private boolean isTimestampTzType(String dt, String udt) {
        return dt.contains("timestamp with time zone") || "timestamptz".equals(udt);
    }

    private boolean isTimestampType(String dt, String udt) {
        return (dt.contains("timestamp") && !dt.contains("with time zone"))
            || "timestamp".equals(udt);
    }

    private boolean isDateType(String dt, String udt) {
        return "date".equals(dt) || "date".equals(udt);
    }

    private boolean isTimeType(String dt, String udt) {
        return (dt.contains("time") && !dt.contains("timestamp"))
            || "time".equals(udt) || "timetz".equals(udt);
    }

    private boolean isArrayType(String udt) {
        return udt.startsWith("_"); // PgSQL 中数组类型 udt_name 以 _ 开头
    }

    private boolean isJsonType(String dt, String udt) {
        return dt.contains("json") || "json".equals(udt) || "jsonb".equals(udt);
    }
}

6. 核心比对服务

6.1 比对任务请求

// CompareRequest.java
package com.example.dbcompare.model;

import lombok.Data;

import javax.validation.constraints.NotBlank;

@Data
public class CompareRequest {

    @NotBlank
    private String schema = "public";

    @NotBlank
    private String tableName;

    /** 日期过滤字段名 */
    @NotBlank
    private String dateColumn;

    /** 开始日期,格式 yyyy-MM-dd(含) */
    @NotBlank
    private String dateFrom;

    /** 结束日期,格式 yyyy-MM-dd(含) */
    @NotBlank
    private String dateTo;

    /** 每片行数,默认 20w;可在请求中覆盖 */
    private int shardSize = 200_000;
}

6.2 结果模型

// ShardXorResult.java
package com.example.dbcompare.model;

import lombok.Data;

@Data
public class ShardXorResult {
    private long shardIndex;
    private long offset;
    private long limit;
    private long pgCount;
    private long gaussCount;
    private Long pgXor;
    private Long gaussXor;
    /** 是否一致(XOR相等且行数相等) */
    private boolean consistent;
}
// ShardDiff.java
package com.example.dbcompare.model;

import lombok.Data;
import java.util.List;

@Data
public class ShardDiff {
    private long shardIndex;
    /** PgSQL 有,Gauss 无(id 列表) */
    private List<Object> onlyInPgsql;
    /** Gauss 有,PgSQL 无(id 列表) */
    private List<Object> onlyInGauss;
    /** 两边都有但 hash 不同(id 列表) */
    private List<Object> hashMismatch;
}
// CompareReport.java
package com.example.dbcompare.model;

import lombok.Data;
import java.time.LocalDateTime;
import java.util.List;

@Data
public class CompareReport {
    private String tableName;
    private String dateFrom;
    private String dateTo;
    private LocalDateTime startTime;
    private LocalDateTime endTime;
    private long costMillis;

    private long totalShards;
    /** XOR 粗筛一致的分片数 */
    private long consistentShards;
    /** 需要精细对比的分片数 */
    private long inconsistentShards;

    private long totalPgsqlRows;
    private long totalGaussRows;

    /** 仅在 PgSQL 的 id 列表 */
    private List<Object> onlyInPgsql;
    /** 仅在 GaussDB 的 id 列表 */
    private List<Object> onlyInGauss;
    /** Hash 不一致的 id 列表 */
    private List<Object> hashMismatch;

    public boolean isFullyConsistent() {
        return onlyInPgsql.isEmpty() && onlyInGauss.isEmpty() && hashMismatch.isEmpty();
    }
}

6.3 比对线程池配置

// CompareExecutorConfig.java
package com.example.dbcompare.config;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

@Configuration
public class CompareExecutorConfig {

    @Value("${compare.compare-threads:8}")
    private int compareThreads;

    @Value("${compare.compare-threads-max:16}")
    private int compareThreadsMax;

    @Value("${compare.io-threads:32}")
    private int ioThreads;

    @Value("${compare.queue-capacity:200}")
    private int queueCapacity;

    /**
     * 比对分片任务线程池(CPU 密集型:对比两个 Map)
     */
    @Bean("compareExecutor")
    public ExecutorService compareExecutor() {
        return new ThreadPoolExecutor(
            compareThreads, compareThreadsMax,
            60L, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(queueCapacity),
            namedThreadFactory("compare-worker"),
            new ThreadPoolExecutor.CallerRunsPolicy()  // 背压:队满主线程自行执行
        );
    }

    /**
     * IO 线程池(用于 PgSQL / Gauss 并发查询,IO 密集型,线程数可以较多)
     */
    @Bean("ioExecutor")
    public ExecutorService ioExecutor() {
        return new ThreadPoolExecutor(
            ioThreads, ioThreads,
            60L, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(500),
            namedThreadFactory("compare-io"),
            new ThreadPoolExecutor.CallerRunsPolicy()
        );
    }

    private ThreadFactory namedThreadFactory(String prefix) {
        AtomicInteger counter = new AtomicInteger(0);
        return r -> {
            Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
    }
}

6.4 核心比对服务

// DataCompareService.java
package com.example.dbcompare.service;

import com.example.dbcompare.builder.HashSqlBuilder;
import com.example.dbcompare.model.*;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.LocalDateTime;
import java.util.*;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.stream.Collectors;

@Slf4j
@Service
@RequiredArgsConstructor
public class DataCompareService {

    @Qualifier("pgsqlDataSource")
    private final DataSource pgsqlDataSource;

    @Qualifier("gaussDataSource")
    private final DataSource gaussDataSource;

    private final HashSqlBuilder hashSqlBuilder;

    @Qualifier("compareExecutor")
    private final ExecutorService compareExecutor;

    @Qualifier("ioExecutor")
    private final ExecutorService ioExecutor;

    @Value("${compare.fetch-size:2000}")
    private int fetchSize;

    // ================================================================
    //  公共入口
    // ================================================================

    public CompareReport compare(CompareRequest req) {
        LocalDateTime startTime = LocalDateTime.now();
        long t0 = System.currentTimeMillis();

        log.info("[Compare] Start: table={}.{} dateRange=[{}, {}] shardSize={}",
            req.getSchema(), req.getTableName(),
            req.getDateFrom(), req.getDateTo(), req.getShardSize());

        // 1. 估算总行数(从 PgSQL 查,决定分片数)
        long total = countRows(req);
        int shardCount = (int) Math.ceil((double) total / req.getShardSize());
        log.info("[Compare] Estimated rows={}, shards={}", total, shardCount);

        // 2. 第一阶段:XOR 粗筛(并发)
        List<ShardXorResult> xorResults = phaseOneXorCheck(req, shardCount);

        long consistent = xorResults.stream().filter(ShardXorResult::isConsistent).count();
        long inconsistent = shardCount - consistent;
        log.info("[Compare] Phase1 done: consistent={} inconsistent={}", consistent, inconsistent);

        // 3. 第二阶段:对 XOR 不一致的分片做行级精细比对(并发)
        List<ShardXorResult> dirtyShards = xorResults.stream()
            .filter(r -> !r.isConsistent())
            .collect(Collectors.toList());

        List<ShardDiff> diffs = phaseTwoDetailCompare(req, dirtyShards);

        // 4. 聚合结果
        CompareReport report = aggregate(req, xorResults, diffs, startTime, t0,
            shardCount, consistent, inconsistent);

        log.info("[Compare] Done: totalPg={} totalGauss={} onlyPg={} onlyGauss={} mismatch={} costMs={}",
            report.getTotalPgsqlRows(), report.getTotalGaussRows(),
            report.getOnlyInPgsql().size(), report.getOnlyInGauss().size(),
            report.getHashMismatch().size(), report.getCostMillis());

        return report;
    }

    // ================================================================
    //  第一阶段:XOR 粗筛
    // ================================================================

    private List<ShardXorResult> phaseOneXorCheck(CompareRequest req, int shardCount) {
        List<CompletableFuture<ShardXorResult>> futures = new ArrayList<>(shardCount);

        for (int i = 0; i < shardCount; i++) {
            final long offset = (long) i * req.getShardSize();
            final int limit = req.getShardSize();
            final int shardIndex = i;

            futures.add(CompletableFuture.supplyAsync(
                () -> xorOneShard(req, shardIndex, offset, limit),
                compareExecutor
            ));
        }

        return futures.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
    }

    private ShardXorResult xorOneShard(CompareRequest req, int shardIndex,
                                       long offset, long limit) {
        String pgSql   = hashSqlBuilder.buildXorAggregateSql(
            req.getSchema(), req.getTableName(), req.getDateColumn(),
            req.getDateFrom(), req.getDateTo(), offset, (int) limit, true);
        String gaussSql = hashSqlBuilder.buildXorAggregateSql(
            req.getSchema(), req.getTableName(), req.getDateColumn(),
            req.getDateFrom(), req.getDateTo(), offset, (int) limit, false);

        // 两库并发查询
        CompletableFuture<long[]> pgFut = CompletableFuture.supplyAsync(
            () -> queryXor(pgsqlDataSource, pgSql), ioExecutor);
        CompletableFuture<long[]> gaussFut = CompletableFuture.supplyAsync(
            () -> queryXor(gaussDataSource, gaussSql), ioExecutor);

        long[] pgResult    = pgFut.join();    // [cnt, xor]
        long[] gaussResult = gaussFut.join();

        ShardXorResult r = new ShardXorResult();
        r.setShardIndex(shardIndex);
        r.setOffset(offset);
        r.setLimit(limit);
        r.setPgCount(pgResult[0]);
        r.setGaussCount(gaussResult[0]);
        r.setPgXor(pgResult[1]);
        r.setGaussXor(gaussResult[1]);
        r.setConsistent(pgResult[0] == gaussResult[0]
            && Objects.equals(pgResult[1], gaussResult[1]));

        return r;
    }

    /** 执行 XOR 聚合查询,返回 [cnt, xor_val] */
    private long[] queryXor(DataSource ds, String sql) {
        try (Connection conn = ds.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                return new long[]{rs.getLong("cnt"), rs.getLong("xor_val")};
            }
        } catch (Exception e) {
            log.error("[XOR] Query failed: {}", e.getMessage(), e);
            throw new RuntimeException("XOR query failed", e);
        }
        return new long[]{0L, 0L};
    }

    // ================================================================
    //  第二阶段:行级精细比对
    // ================================================================

    private List<ShardDiff> phaseTwoDetailCompare(CompareRequest req,
                                                   List<ShardXorResult> dirtyShards) {
        if (dirtyShards.isEmpty()) {
            return Collections.emptyList();
        }

        List<CompletableFuture<ShardDiff>> futures = new ArrayList<>(dirtyShards.size());

        for (ShardXorResult shard : dirtyShards) {
            futures.add(CompletableFuture.supplyAsync(
                () -> detailOneShard(req, shard),
                compareExecutor
            ));
        }

        return futures.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
    }

    private ShardDiff detailOneShard(CompareRequest req, ShardXorResult shard) {
        String pgSql = hashSqlBuilder.buildShardHashSql(
            req.getSchema(), req.getTableName(), req.getDateColumn(),
            req.getDateFrom(), req.getDateTo(),
            shard.getOffset(), (int) shard.getLimit(), true);
        String gaussSql = hashSqlBuilder.buildShardHashSql(
            req.getSchema(), req.getTableName(), req.getDateColumn(),
            req.getDateFrom(), req.getDateTo(),
            shard.getOffset(), (int) shard.getLimit(), false);

        // 两库并发流式查询,结果存入 LinkedHashMap(保序)
        CompletableFuture<Map<Object, String>> pgFut = CompletableFuture.supplyAsync(
            () -> streamHashQuery(pgsqlDataSource, pgSql), ioExecutor);
        CompletableFuture<Map<Object, String>> gaussFut = CompletableFuture.supplyAsync(
            () -> streamHashQuery(gaussDataSource, gaussSql), ioExecutor);

        Map<Object, String> pgMap    = pgFut.join();
        Map<Object, String> gaussMap = gaussFut.join();

        return diffMaps(shard.getShardIndex(), pgMap, gaussMap);
    }

    /**
     * 流式游标查询 id -> row_hash 映射。
     * 使用原生 Connection + PreparedStatement 设置 fetchSize,
     * 避免一次性将千万行加载进内存。
     */
    private Map<Object, String> streamHashQuery(DataSource ds, String sql) {
        Map<Object, String> result = new LinkedHashMap<>(50_000);
        try (Connection conn = ds.getConnection()) {
            // PgSQL 要求 autoCommit=false 才能启用服务端游标
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(sql,
                    ResultSet.TYPE_FORWARD_ONLY,
                    ResultSet.CONCUR_READ_ONLY)) {
                ps.setFetchSize(fetchSize);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        result.put(rs.getObject("id"), rs.getString("row_hash"));
                    }
                }
            } finally {
                conn.setAutoCommit(true);
            }
        } catch (Exception e) {
            log.error("[Detail] Stream query failed: {}", e.getMessage(), e);
            throw new RuntimeException("Detail hash query failed", e);
        }
        return result;
    }

    /**
     * 比较两个 Map,找出三类差异:仅PgSQL有 / 仅Gauss有 / hash不同
     */
    private ShardDiff diffMaps(long shardIndex,
                                Map<Object, String> pgMap,
                                Map<Object, String> gaussMap) {
        List<Object> onlyInPg    = new ArrayList<>();
        List<Object> onlyInGauss = new ArrayList<>();
        List<Object> mismatch    = new ArrayList<>();

        // PgSQL 侧遍历
        pgMap.forEach((id, pgHash) -> {
            String gaussHash = gaussMap.get(id);
            if (gaussHash == null) {
                onlyInPg.add(id);
            } else if (!pgHash.equals(gaussHash)) {
                mismatch.add(id);
            }
        });

        // Gauss 独有
        gaussMap.forEach((id, v) -> {
            if (!pgMap.containsKey(id)) {
                onlyInGauss.add(id);
            }
        });

        ShardDiff diff = new ShardDiff();
        diff.setShardIndex(shardIndex);
        diff.setOnlyInPgsql(onlyInPg);
        diff.setOnlyInGauss(onlyInGauss);
        diff.setHashMismatch(mismatch);
        return diff;
    }

    // ================================================================
    //  辅助:统计行数
    // ================================================================

    private long countRows(CompareRequest req) {
        String sql = String.format(
            "SELECT COUNT(*) FROM %s.%s WHERE %s >= '%s' AND %s <= '%s'",
            req.getSchema(), req.getTableName(),
            req.getDateColumn(), req.getDateFrom(),
            req.getDateColumn(), req.getDateTo()
        );
        try (Connection conn = pgsqlDataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getLong(1) : 0L;
        } catch (Exception e) {
            log.warn("[Count] Failed, fallback to 0: {}", e.getMessage());
            return 0L;
        }
    }

    // ================================================================
    //  聚合最终报告
    // ================================================================

    private CompareReport aggregate(CompareRequest req,
                                    List<ShardXorResult> xorResults,
                                    List<ShardDiff> diffs,
                                    LocalDateTime startTime, long t0,
                                    int shardCount, long consistent, long inconsistent) {
        CompareReport report = new CompareReport();
        report.setTableName(req.getSchema() + "." + req.getTableName());
        report.setDateFrom(req.getDateFrom());
        report.setDateTo(req.getDateTo());
        report.setStartTime(startTime);
        report.setEndTime(LocalDateTime.now());
        report.setCostMillis(System.currentTimeMillis() - t0);
        report.setTotalShards(shardCount);
        report.setConsistentShards(consistent);
        report.setInconsistentShards(inconsistent);

        report.setTotalPgsqlRows(
            xorResults.stream().mapToLong(ShardXorResult::getPgCount).sum());
        report.setTotalGaussRows(
            xorResults.stream().mapToLong(ShardXorResult::getGaussCount).sum());

        List<Object> allOnlyPg    = new ArrayList<>();
        List<Object> allOnlyGauss = new ArrayList<>();
        List<Object> allMismatch  = new ArrayList<>();

        for (ShardDiff d : diffs) {
            allOnlyPg.addAll(d.getOnlyInPgsql());
            allOnlyGauss.addAll(d.getOnlyInGauss());
            allMismatch.addAll(d.getHashMismatch());
        }

        report.setOnlyInPgsql(allOnlyPg);
        report.setOnlyInGauss(allOnlyGauss);
        report.setHashMismatch(allMismatch);
        return report;
    }
}

7. REST 接口(可选触发入口)

// DataCompareController.java
package com.example.dbcompare.controller;

import com.example.dbcompare.model.CompareReport;
import com.example.dbcompare.model.CompareRequest;
import com.example.dbcompare.service.DataCompareService;
import lombok.RequiredArgsConstructor;
import org.springframework.http.ResponseEntity;
import org.springframework.validation.annotation.Validated;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/compare")
@RequiredArgsConstructor
public class DataCompareController {

    private final DataCompareService compareService;

    /**
     * 触发比对任务(同步返回,大数据量建议改为异步 + 轮询)。
     *
     * POST /api/compare/run
     * {
     *   "schema": "public",
     *   "tableName": "order",
     *   "dateColumn": "created_at",
     *   "dateFrom": "2026-01-01",
     *   "dateTo": "2026-01-31",
     *   "shardSize": 200000
     * }
     */
    @PostMapping("/run")
    public ResponseEntity<CompareReport> run(@Validated @RequestBody CompareRequest req) {
        CompareReport report = compareService.compare(req);
        return ResponseEntity.ok(report);
    }
}

8. 关键设计说明

8.1 为什么用 MD5(concat_ws) 而不是 MD5((t.*)::text)

MD5(concat_ws(...)) MD5((t.*)::text)
跨库一致性 高(显式标准化每列) 低(行类型序列化格式依赖实现)
NULL 安全 显式 COALESCE 处理 行类型中 NULL 可能被忽略或格式不同
数值精度 可控(to_char 固定格式) 可能因精度不同产生误报
维护成本 中(需按类型处理) 低(但错误率高)

8.2 两阶段比对的收益

场景:1000 万行,其中仅 500 行有差异,分 50 片(每片 20w)

第一阶段:XOR 粗筛(50 次聚合查询)
  → 仅 3 个分片 XOR 不一致(含那 500 行)
  → 其余 47 片(9,400,000 行)直接跳过精细比对

第二阶段:仅对 3 片(600,000 行)做行级 hash 比对
  → 传输量从 1000w 行降低到 60w 行,减少约 94%

8.3 流式游标读取的关键配置

PgSQL 启用服务端游标的前提:

  1. Connection.setAutoCommit(false) — 必须在事务内
  2. PreparedStatement.setFetchSize(N) — N > 0 启用分批拉取
  3. ResultSet.TYPE_FORWARD_ONLY + CONCUR_READ_ONLY — 单向只读游标

GaussDB 与 PgSQL 在此行为一致。

8.4 线程模型

主线程
  └─ compareExecutor(8~16线程)— 负责分片任务调度和 Map diff
       └─ ioExecutor(32线程)— 负责 PgSQL/Gauss 并发查询(IO密集)

每个分片的时序(单 Worker):
  t=0: 提交 pgSql 到 ioExecutor → CompletableFuture A
  t=0: 提交 gaussSql 到 ioExecutor → CompletableFuture B
  t=T: join(A, B)(T = max(pgQueryTime, gaussQueryTime))
  t=T+δ: diffMaps(内存操作,通常毫秒级)

8.5 内存消耗估算

场景 内存用量
每片 20w 行,每行 id(8B)+hash(32B) = 40B 20w × 40B × 2(双库)≈ 16 MB/片
16 线程并发 峰值 ≈ 16 × 16MB = 256 MB
JVM 堆建议 -Xmx2g(含其他开销余量)

9. 性能调优参数参考

参数 建议值 说明
shardSize 100,000 ~ 500,000 取决于单行数据大小,行宽越大片越小
compareThreads CPU核心数 × 1 Map diff 为 CPU 密集
ioThreads DB连接池上限 × 0.6 IO 密集,可多于 CPU 核数
fetchSize 1,000 ~ 5,000 太小网络往返多;太大内存峰值高
maximum-pool-size 20(每个库) 与 ioThreads 匹配,避免连接等待
JVM -Xmx2g -XX:+UseG1GC G1 对大量短生命周期对象友好

10. 快速启动示例

# 启动服务
java -Xmx2g -XX:+UseG1GC -jar db-compare-tool.jar

# 触发比对
curl -X POST http://localhost:8080/api/compare/run \
  -H "Content-Type: application/json" \
  -d '{
    "schema": "public",
    "tableName": "orders",
    "dateColumn": "created_at",
    "dateFrom": "2026-01-01",
    "dateTo": "2026-01-31",
    "shardSize": 200000
  }'

响应示例:

{
  "tableName": "public.orders",
  "dateFrom": "2026-01-01",
  "dateTo": "2026-01-31",
  "startTime": "2026-05-14T21:30:00",
  "endTime": "2026-05-14T21:31:42",
  "costMillis": 102000,
  "totalShards": 50,
  "consistentShards": 47,
  "inconsistentShards": 3,
  "totalPgsqlRows": 10000000,
  "totalGaussRows": 10000000,
  "onlyInPgsql": [],
  "onlyInGauss": [],
  "hashMismatch": [10001, 28345, 99827],
  "fullyConsistent": false
}

posted @ 2026-05-14 21:51  景之1231  阅读(8)  评论(0)    收藏  举报