Architecture Upgrade: Cache Optimization (English)

Cache optimization is applied to hotspot data: data that is frequently accessed, costly to retrieve, and infrequently updated. The scenarios we optimized are as follows:

  1. Many methods across multiple system services need to check a user's account-cancellation status by calling the member service's Feign interface. Caching the member's cancellation status avoids these repeated inter-service calls (see the sketch after this list).

  2. Cache mechanisms must be added for querying flash sale events, products, and shipping addresses. Furthermore, cache warming should be performed for event and product queries before the flash sale begins.

  3. Lucky draw activities: Add data caching for prize-winning rules, points-based lottery conditions, and prize inventory.

  4. Password-based activities: Cache the association between passwords and activity information.
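
As an illustration of scenario 1, here is a minimal cache-aside sketch in Java, assuming Spring Data Redis's StringRedisTemplate and a hypothetical MemberClient Feign interface; the key format and TTL are illustrative, not the actual implementation:

```java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

// Hypothetical wrapper: checks the member's cancellation status via the cache first,
// falling back to the member service's Feign interface on a miss.
public class MemberCancellationCache {

    private static final Duration TTL = Duration.ofMinutes(30);   // illustrative TTL

    private final StringRedisTemplate redis;
    private final MemberClient memberClient;                      // assumed Feign client

    public MemberCancellationCache(StringRedisTemplate redis, MemberClient memberClient) {
        this.redis = redis;
        this.memberClient = memberClient;
    }

    public boolean isCancelled(long memberId) {
        String key = "member:" + memberId + ":cancelled";
        String cached = redis.opsForValue().get(key);
        if (cached != null) {
            return Boolean.parseBoolean(cached);                  // cache hit: no Feign call
        }
        boolean cancelled = memberClient.isCancelled(memberId);   // cache miss: call member service
        redis.opsForValue().set(key, Boolean.toString(cancelled), TTL);
        return cancelled;
    }
}

// Assumed Feign interface exposed by the member service.
interface MemberClient {
    boolean isCancelled(long memberId);
}
```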

1. Functional Testing

This is the most fundamental and critical test, ensuring the cache logic itself is correct.

  • Cache Hit and Miss:
    • Verify that on the first request (cache miss), the system correctly queries the database and writes the result to the cache.
    • Verify that on subsequent identical requests (cache hit), the system directly returns data from the cache and does not access the database (can be confirmed via database slow query logs or monitoring).
  • Cache Data Accuracy:
    • Verify that the data read from the cache is exactly consistent with the data in the database.
    • Test complex data structures (e.g., nested objects, lists) to ensure serialization/deserialization is error-free.
  • Cache Key Strategy:
    • Verify that the cache key generation rules are correct, unique, and predictable (e.g., user:123:profile, product:456:price).
    • Verify that different parameterized requests generate different cache keys to prevent data confusion.
  • Cache Expiration (TTL - Time To Live):
    • Verify that the set expiration time takes effect. After expiration, the next request should trigger a cache miss and reload the data.
    • Test scenarios where different data has different TTLs.
  • Cache Update/Invalidation:
    • Core Test Point! Verify that when data in the database is modified (INSERT, UPDATE, DELETE), the corresponding cache entry is promptly and correctly updated or deleted (see the cache-aside sketch after this list).
    • Test Scenarios:
      • Delete Cache after Update (Cache-Aside Pattern): After updating the database, immediately delete (or mark as invalid) the corresponding cache entry. Verify that the next read triggers a reload.
      • Update Cache after Update: After updating the database, synchronously update the value in the cache. Verify the new value is correct.
      • Batch Operations: When data is updated or deleted in bulk, verify that related cache entries are correctly handled (e.g., deleted via a list of primary keys, or the entire category's cache is deleted).
  • Degradation Strategy:
    • Simulate the cache service being completely unavailable (e.g., Redis downtime).
    • Verify the system can degrade gracefully, directly querying the database to return results, ensuring core functionality remains available (albeit with degraded performance).
    • Verify the degradation logic does not cause a sudden, overwhelming load on the database (mitigating "cache avalanche").
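
A minimal sketch of the cache-aside pattern exercised by the tests above (read-through on a miss, delete-after-update on a write), again assuming StringRedisTemplate; ProductRepository, the key format, and the TTL are hypothetical:

```java
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

// Cache-aside sketch for a product price (names are illustrative).
public class ProductPriceCache {

    private final StringRedisTemplate redis;
    private final ProductRepository repository;   // assumed DB access layer

    public ProductPriceCache(StringRedisTemplate redis, ProductRepository repository) {
        this.redis = redis;
        this.repository = repository;
    }

    public String getPrice(long productId) {
        String key = "product:" + productId + ":price";
        String cached = redis.opsForValue().get(key);
        if (cached != null) {
            return cached;                                        // hit: the DB is not touched
        }
        String price = repository.findPrice(productId);           // miss: load from the DB
        redis.opsForValue().set(key, price, Duration.ofMinutes(10));
        return price;
    }

    public void updatePrice(long productId, String newPrice) {
        repository.updatePrice(productId, newPrice);              // 1. write the database
        redis.delete("product:" + productId + ":price");          // 2. invalidate the cache
    }
}

interface ProductRepository {
    String findPrice(long productId);
    void updatePrice(long productId, String price);
}
```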

2. Data Consistency Testing

The biggest risk of caching is data inconsistency. This is a key and challenging area for testing.

  • Read-Write Consistency:
    • Under high concurrency, simulate a "read-write-read" operation sequence.
    • Verify that after a write operation updates the database and deletes the cache, the subsequent read operation correctly loads the new data from the database and rebuilds the cache, rather than reading stale cached data.
  • Concurrent Update Conflicts:
    • Simulate multiple threads/requests simultaneously updating the same data.
    • Verify that the final state of the cache and database meets expectations and prevents data corruption due to race conditions.
    • Verify that appropriate concurrency control mechanisms (e.g., distributed locks) are used to handle high-concurrency cache update scenarios (a lock sketch follows this list).
  • Eventual Consistency Verification:
    • In scenarios with asynchronous cache updates, verify that the system reaches data consistency within an acceptable timeframe.
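
One possible shape for the concurrency control mentioned above: a simple mutual-exclusion sketch based on Redis SET NX with an expiry (setIfAbsent). It is only an illustration; a production lock would need an atomic, token-checked release (e.g., a Lua script) or a library such as Redisson:

```java
import java.time.Duration;
import java.util.UUID;
import org.springframework.data.redis.core.StringRedisTemplate;

// Minimal lock sketch: only the holder of the random token performs the DB update + cache refresh.
public class CacheRebuildLock {

    private final StringRedisTemplate redis;

    public CacheRebuildLock(StringRedisTemplate redis) {
        this.redis = redis;
    }

    public boolean runExclusively(String lockKey, Runnable criticalSection) {
        String token = UUID.randomUUID().toString();
        Boolean acquired = redis.opsForValue()
                .setIfAbsent(lockKey, token, Duration.ofSeconds(10)); // SET key token NX EX 10
        if (!Boolean.TRUE.equals(acquired)) {
            return false;                          // someone else is rebuilding; caller can wait/retry
        }
        try {
            criticalSection.run();                 // e.g. update the DB, then refresh or delete the cache
        } finally {
            // Best-effort release; a production lock should check the token atomically (Lua script).
            if (token.equals(redis.opsForValue().get(lockKey))) {
                redis.delete(lockKey);
            }
        }
        return true;
    }
}
```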

3. Performance Testing

Verify whether the cache optimization has achieved the expected performance improvements.

  • Baseline Comparison:
    • Conduct performance tests using the same scripts and environment before and after optimization, and compare key metrics.
  • Key Metrics:
    • Response Time: Have P95 and P99 latencies significantly decreased?
    • Throughput: Has TPS (Transactions Per Second) significantly increased?
    • Database Load: Have the database's QPS, CPU, and IOPS significantly decreased?
    • Cache Hit Ratio: This is the core metric for measuring cache efficiency. Has the hit ratio reached the expected level (e.g., > 90%) after optimization? If not, analyze the causes of the low hit ratio (a quick way to read it from Redis is sketched after this list).
    • Resource Consumption: Have the CPU and memory usage of the application servers decreased due to reduced database access?
  • Stress Testing:
    • Under high load, verify the performance and stability of the cache system itself (e.g., Redis), ensuring it does not become a bottleneck.
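
A quick way to read the hit ratio referenced above directly from Redis, assuming the Jedis client; keyspace_hits and keyspace_misses come from the INFO stats section, and the host/port are placeholders:

```java
import redis.clients.jedis.Jedis;

// Rough hit-ratio check from Redis INFO stats (host/port are illustrative).
public class HitRatioCheck {

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String stats = jedis.info("stats");
            long hits = parse(stats, "keyspace_hits");
            long misses = parse(stats, "keyspace_misses");
            double ratio = (hits + misses) == 0 ? 0.0 : (double) hits / (hits + misses);
            System.out.printf("keyspace_hits=%d, keyspace_misses=%d, hit ratio=%.2f%%%n",
                    hits, misses, ratio * 100);
        }
    }

    private static long parse(String info, String field) {
        for (String line : info.split("\r\n")) {
            if (line.startsWith(field + ":")) {
                return Long.parseLong(line.substring(field.length() + 1).trim());
            }
        }
        return 0;
    }
}
```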

4. Stability and Fault Tolerance Testing

Verify the system's robustness under abnormal conditions.

  • Cache Service Failure:
    • Redis Downtime/Restart: Simulate the Redis service stopping or crashing.
    • Verify the application can continue serving via the degradation strategy (accessing the database).
    • Verify that after Redis recovers, the cache can be rebuilt normally and the system returns to normal operation.
  • Network Issues:
    • Simulate network latency, packet loss, or disconnection between the application server and Redis.
    • Verify that the application's timeout settings, retry mechanisms, and degradation strategies are effective.
  • Cache Penetration:
    • Simulate a large number of requests querying Keys that do not exist in the database at all (e.g., malicious attacks).
    • Verify if there are countermeasures (e.g., Bloom Filter) to intercept these invalid requests, preventing them from directly hitting the database.
  • Cache Breakdown:
    • Simulate highly concurrent access to a hot Key at the moment it expires, causing a large number of requests to flood the database simultaneously.
    • Verify if there are countermeasures (e.g., Mutex Lock) to ensure only one request loads from the database while others wait.
  • Cache Avalanche:
    • Simulate a large number of cache Keys expiring at the same time, causing a sudden surge of requests to the database.
    • Verify if there are countermeasures (e.g., randomized expiration times, never expire + asynchronous background update, multi-level caching) to mitigate the impact (a combined countermeasure sketch follows this list).
  • Memory Overflow (OOM):
    • Simulate continuous growth of cached data, approaching Redis's memory limit.
    • Verify that Redis's memory eviction policies (e.g., LRU, LFU) work as expected.
    • Verify that the application can handle cache misses due to eviction.
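
A combined sketch of the countermeasures listed above: caching a short-lived placeholder for keys absent from the database (penetration), a per-key mutex so only one thread reloads an expired hot key (breakdown), and a jittered TTL (avalanche). Names and durations are illustrative, and the in-process lock only protects a single instance; across instances, a distributed lock like the one sketched in section 2 would be needed:

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;
import org.springframework.data.redis.core.StringRedisTemplate;

public class GuardedCache {

    private static final String NULL_PLACEHOLDER = "__NULL__";    // penetration: cache "not found"

    private final StringRedisTemplate redis;
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>(); // illustrative; unbounded

    public GuardedCache(StringRedisTemplate redis) {
        this.redis = redis;
    }

    public String get(String key, Function<String, String> dbLoader) {
        String cached = redis.opsForValue().get(key);
        if (cached != null) {
            return NULL_PLACEHOLDER.equals(cached) ? null : cached;
        }
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock()); // breakdown: per-key mutex
        lock.lock();
        try {
            cached = redis.opsForValue().get(key);                // re-check after acquiring the lock
            if (cached != null) {
                return NULL_PLACEHOLDER.equals(cached) ? null : cached;
            }
            String value = dbLoader.apply(key);                   // only one thread hits the DB
            if (value == null) {
                redis.opsForValue().set(key, NULL_PLACEHOLDER, Duration.ofSeconds(60));
            } else {
                long ttl = 600 + ThreadLocalRandom.current().nextLong(120); // avalanche: jittered TTL
                redis.opsForValue().set(key, value, Duration.ofSeconds(ttl));
            }
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```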

5. Security Testing

  • Access Control: Is the Redis instance configured with strong password authentication? Is it only accessible from trusted IPs? Are dangerous commands (e.g., FLUSHALL, CONFIG) disabled? (A sample hardening snippet follows this list.)
  • Data Security: Does the cache store sensitive information (e.g., user passwords, ID numbers)? If so, is it encrypted or masked?
  • Injection Risks: Are cache keys constructed directly from user input? Is there a risk of injection?
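
A redis.conf fragment matching the access-control checks above; the password and bind addresses are placeholders, and on recent Redis versions ACL rules are the preferred way to restrict commands:

```
# Require authentication (use a strong, managed secret instead of this placeholder)
requirepass <strong-password-here>

# Only listen on trusted interfaces (addresses are illustrative)
bind 127.0.0.1 10.0.0.5

# Disable dangerous commands (on newer Redis versions, prefer ACL rules instead)
rename-command FLUSHALL ""
rename-command CONFIG ""
```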

6. Monitoring and Observability Validation

Ensure the cache's state and performance are "visible" for operations and troubleshooting.

  • Monitoring Metrics:
    • Verify that key Redis metrics are monitored: memory usage, CPU, number of connections, hit ratio, latency, number of rejected connections, keyspace hits/misses, number of evicted keys, etc.
    • Verify that application-level cache-related metrics are monitored: QPS for cache reads/writes/deletes, hit ratio, and failure rate (see the metrics sketch after this list).
  • Logging:
    • Verify that critical cache operations (e.g., cache miss, cache update, cache deletion, degradation) have clear log records for problem tracing.
  • Alerting Setup:
    • Verify that reasonable alert thresholds are set, for example:
      • Cache hit ratio falls below a certain value (e.g., < 80%).
      • Redis memory usage is too high (e.g., > 80%).
      • Too many Redis connections.
      • Cache latency is too high.
    • Simulate triggering alerts to verify they are timely delivered to the relevant personnel.
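
A minimal sketch of the application-level hit/miss counters mentioned above, assuming Micrometer is on the classpath; the metric names and tags are illustrative:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

// Records cache hits/misses so a hit ratio can be graphed and alerted on.
public class CacheMetrics {

    private final Counter hits;
    private final Counter misses;

    public CacheMetrics(MeterRegistry registry, String cacheName) {
        this.hits = Counter.builder("app.cache.gets")
                .tag("cache", cacheName).tag("result", "hit")
                .register(registry);
        this.misses = Counter.builder("app.cache.gets")
                .tag("cache", cacheName).tag("result", "miss")
                .register(registry);
    }

    public void recordHit()  { hits.increment(); }
    public void recordMiss() { misses.increment(); }
}
```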

Summary

Testing after cache optimization is a multi-layered, multi-dimensional process:

  1. Functionality is the Foundation: Ensure the cache logic is correct.
  2. Consistency is the Lifeline: Rigorously prevent data corruption.
  3. Performance is the Goal: Verify the optimization effect.
  4. Stability is the Guarantee: Simulate various failures to test system resilience.
  5. Security is the Baseline: Protect data and the system.
  6. Observability is the Eyes: Make problems impossible to hide.

The best practice is to automate these tests (especially the functional, core performance, and key fault-tolerance scenarios) and integrate them into the CI/CD pipeline, so that no code change silently breaks the correctness or stability of the cache. A minimal example of such an automated check follows.
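
For example, a JUnit 5 + Mockito sketch of the "cache hit must not touch the database" check, written against the illustrative ProductPriceCache from section 1:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.*;

import org.junit.jupiter.api.Test;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.ValueOperations;

class ProductPriceCacheTest {

    @Test
    void cacheHitDoesNotQueryDatabase() {
        StringRedisTemplate redis = mock(StringRedisTemplate.class);
        @SuppressWarnings("unchecked")
        ValueOperations<String, String> ops = mock(ValueOperations.class);
        when(redis.opsForValue()).thenReturn(ops);
        when(ops.get("product:456:price")).thenReturn("19.99");   // simulate a cache hit

        ProductRepository repository = mock(ProductRepository.class);
        ProductPriceCache cache = new ProductPriceCache(redis, repository);

        assertEquals("19.99", cache.getPrice(456L));
        verifyNoInteractions(repository);                          // a hit must not reach the DB
    }
}
```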
