Architure upgrade_Database Sharding_English

1. Functional Correctness Testing

Ensure the sharding logic is correct and all business scenarios function properly.

Shard Key Routing Accuracy:
- Core Test Point! Verify that data is precisely routed to the correct physical table based on the sharding rule (e.g., user ID modulo, range sharding, hash sharding).
- Design test cases covering boundary values, typical values, and edge cases for the shard key.
- Example: Data where user_id % 4 = 0 must be written to user_table_0.
CRUD Operation Correctness:
- Create (INSERT): Verify new data is correctly written to the target shard table based on the shard key.
- Read (SELECT):
  - Single-Shard Query: Query using the shard key; verify it correctly returns data from the target shard table.
  - Non-Shard-Key Query: Verify queries without the shard key (e.g., by name, phone number) can be executed correctly via broadcast query or Global Secondary Index (GSI) and return a complete result set.
- Update (UPDATE):
  - Verify the update operation locates the correct shard table and updates the data.
  - Special Attention: If the update operation modifies the shard key itself (should be avoided if possible), verify the system behavior meets expectations (often disallowed or requires special handling).
- Delete (DELETE): Verify the delete operation correctly removes data from the target shard table.
Join Query Testing:
- Same-Database, Same-Shard JOIN: Verify JOIN operations within the same shard (e.g., order table and order detail table sharded by order ID) execute correctly.
- Cross-Shard JOIN: Verify JOIN operations across tables in different shards (e.g., user table and order table with different shard keys) can be executed correctly via middleware (e.g., ShardingSphere) or application-layer logic. Performance is typically poor; assess if the business can tolerate it.
Aggregation Query Testing:
- Single-Shard Aggregation: Verify results for operations like SUM, COUNT, MAX, MIN on a single shard table are correct.
- Cross-Shard Aggregation: Verify operations like COUNT(*) for total row count or SUM(amount) for total amount. Verify the middleware can push the aggregation down to each shard, execute it, and merge (Gather) the results correctly.
Pagination Query Testing:
- Pagination by Shard Key: Relatively simple; verify correctness.
- Pagination by Non-Shard Key (e.g., ORDER BY create_time): Complex and High-Risk! Verify LIMIT offset, size returns globally ordered results correctly. Middleware typically needs to fetch data from all shards, merge and sort in memory, incurring high performance overhead. Test performance and correctness under large offset values.
Transaction Testing:
- Single-Shard Transaction: Verify local transactions (ACID) across multiple tables within the same shard function normally.
- Cross-Shard Transaction: The Biggest Challenge! Verify distributed transactions involving multiple shards (usually based on XA or flexible transactions like TCC, Saga) can guarantee eventual consistency. Test behavior under commit, rollback, network timeout, and node failure scenarios. It is strongly recommended to design business logic to avoid cross-shard transactions whenever possible.

2. Data Consistency & Migration Validation

Sharding often involves data migration, which is the highest-risk phase.

Full Data Migration Consistency Verification:
- After migration, use verification tools (e.g., open-source datax, gh-ost's verification feature, or custom scripts) to compare the total data volume, checksums of key fields, and primary key ranges between the source table and all target shard tables.
- Perform random sampling checks on a large number of records to ensure content is identical.
Incremental Data Sync Verification:
- During the migration cutover, incremental sync (e.g., based on Binlog) is often used.
- Verify incremental data can be synchronized to the corresponding shard table in real-time and accurately.
- Simulate writes at the moment of cutover to verify no data loss or corruption occurs.
Rollback Plan Verification:
- Develop and test a data rollback plan. If issues arise after sharding goes live, can data and traffic be quickly and safely switched back to the original single table?

3. Performance Testing

Verify that sharding has genuinely resolved performance bottlenecks.

Baseline Comparison:
- Conduct performance tests using the same scripts and environment before (single table) and after (sharded), comparing key metrics.
Key Metrics:
- Write Performance (Write Throughput): Has the single-table write bottleneck (e.g., IOPS, lock contention) been alleviated by data distribution? Has TPS improved?
- Read Performance (Read Performance): Have P95/P99 latencies for single-shard queries decreased? Is the performance degradation for non-shard-key queries acceptable?
- Database Load: Have CPU, IOPS, connection count, and slow query count for a single MySQL instance decreased significantly?
- Sharding Middleware Overhead: Analyze the CPU, memory, network overhead, and added latency of the sharding middleware (e.g., ShardingSphere-Proxy).
High-Concurrency Stress Testing:
- Simulate concurrency far exceeding normal levels to test the sharded system's performance under extreme load.
- Verify the system remains stable and identify any new bottlenecks (e.g., middleware, network bandwidth).
Hot-Shard Testing:
- Simulate one shard (e.g., user_table_0) becoming a hotspot (access volume far exceeding others).
- Verify if the MySQL instance hosting that shard can withstand the pressure and if performance degrades sharply.

4. Stability & Fault Tolerance Testing

Simulate various failures to test the resilience of the sharded architecture.

Single-Shard Database Failure:
- Simulate the MySQL instance corresponding to one shard going down.
- Verify:
  - Requests accessing that shard are handled correctly (e.g., return errors, degrade)?
  - Other parts of the system (accessing other shards) remain unaffected and available? (Verify fault isolation capability)
  - After database recovery, does master-slave replication and service discovery work normally, allowing the system to recover automatically?
Sharding Middleware Failure:
- Simulate a node failure or network issue in the sharding middleware cluster (e.g., ShardingSphere-Proxy).
- Verify the cluster's high availability and whether requests can be taken over by other nodes.
Network Partition:
- Simulate a network issue causing a shard database to lose connectivity with other components (application, middleware).
- Verify the system's degradation and fault tolerance strategies.
Scaling Drill (Expansion/Contraction):
- Core Operational Capability! Simulate future data growth requiring new shards (e.g., expanding from 4 to 8 shards).
- Verify the reliability of the Resharding process automation scripts or tools, and whether data migration and traffic switching are smooth with controllable impact on online services (typically requiring off-peak hours and minute-level cutover).

5. Monitoring & Observability Validation

Ensure the state of the complex sharded system is "transparent."

Monitoring Metrics:
- Independent Monitoring per Shard: Configure independent monitoring (CPU, IOPS, connections, slow queries, replication lag) for the database instance corresponding to each shard table.
- Middleware Monitoring: Monitor the sharding middleware's performance (QPS, latency, error rate, JVM metrics).
- Routing Monitoring: Monitor which shard SQL requests from key business operations are routed to, and track load balancing across shards.
- Global Aggregation Monitoring: Metrics like total table row count or total data size require periodic aggregation via scripts.
Alerting Setup:
- Set independent, reasonable alert thresholds for each shard database and middleware cluster.
- Set "Hot-Shard" alerts (e.g., QPS/latency for one shard significantly higher than others).
- Set "Data Inconsistency" alerts (based on verification tasks).
Logging & Tracing:
- Ensure logs include shard information (e.g., shard_id=user_3) for easier troubleshooting.
- Distributed tracing (Tracing) traces should clearly show which physical database and table the SQL request was routed to.

6. Regression Testing

Full Regression: Sharding is a fundamental storage change; full regression testing must be performed on all core business functions dependent on this table to ensure no unintended side effects are introduced.
Automation: Automate core sharding functional test cases (especially CRUD, key queries) and integrate them into the CI/CD pipeline.

Summary

Testing after sharding a table at the tens of millions scale is a systems engineering task that must be meticulously planned, implemented in steps, and thoroughly verified:

Functionality is the Prerequisite: Ensure it "works."
Data Consistency is the Baseline: Ensure "data is correct."
Performance is the Goal: Ensure it's "faster."
Stability is the Guarantee: Ensure it "holds up."
Observability is the Eyes: Ensure it's "visible."
Maintainability is the Future: Ensure it can "scale."

The best practice is to use a shadow table/database or dual-write strategy for canary validation, comparing results from the new and old logic under real traffic to minimize risk.

posted @ 2025-08-29 18:05 bestsarah 阅读(7) 评论(0) 收藏举报

刷新页面返回顶部

bestsarah