[Load Balancing] Uneven load: with 4 pods configured, all 100 concurrent requests in a load test land on the same pod. Troubleshooting and solutions

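1. Root cause analysis

1.1 Kubernetes Service load-balancing mechanism

Core issue: a Kubernetes Service balances per connection (layer 4), not per HTTP request, so its default behaviour may not suit your load-test scenario.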
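To make the connection-level behaviour concrete, here is a minimal sketch. It is a toy model, not kube-proxy itself: the class name, the seed, and the assumption that a pod is picked uniformly at random once per connection are all illustrative.

```java
import java.util.Arrays;
import java.util.Random;

/** Toy model: a backend pod is chosen once per TCP connection, not once per request. */
final class ConnectionLevelBalancing {

    /** Sends `requests` requests over `connections` connections; returns the hits per pod. */
    static int[] simulate(int pods, int connections, int requests, long seed) {
        Random random = new Random(seed);
        int[] podOfConnection = new int[connections];
        for (int c = 0; c < connections; c++) {
            podOfConnection[c] = random.nextInt(pods); // decided at connect time
        }
        int[] hits = new int[pods];
        for (int r = 0; r < requests; r++) {
            hits[podOfConnection[r % connections]]++; // later requests reuse a connection
        }
        return hits;
    }

    /** Number of pods that received at least one request. */
    static int podsHit(int[] hits) {
        int n = 0;
        for (int h : hits) {
            if (h > 0) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("1 connection:    " + Arrays.toString(simulate(4, 1, 100, 42L)));
        System.out.println("100 connections: " + Arrays.toString(simulate(4, 100, 100, 42L)));
    }
}
```

With a single reused connection every request reaches the same pod, whatever the seed. That is the symptom described in the title.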

2. Detailed troubleshooting steps

2.1 Check the current Service configuration

# Show the Service's detailed configuration
kubectl describe service your-service-name

# Show how the Endpoints are distributed
kubectl get endpoints your-service-name

# Show pod labels and status
kubectl get pods -l app=your-app-label -o wide

2.2 Java code: diagnosing the request distribution

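import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

/**
 * Load-balancing diagnostic tool
 */
@Component
public class LoadBalancerDiagnoser {

    private static final Logger log = LoggerFactory.getLogger(LoadBalancerDiagnoser.class);

    @Autowired
    private RestTemplate restTemplate;

    @Value("${service.url}")
    private String serviceUrl;

    /**
     * Diagnose how requests are distributed across pods
     */
    public void diagnoseLoadDistribution(int requestCount) throws InterruptedException {
        Map<String, Integer> podRequestCount = new ConcurrentHashMap<>();
        CountDownLatch latch = new CountDownLatch(requestCount);

        for (int i = 0; i < requestCount; i++) {
            // Runs on the common ForkJoinPool, so real concurrency is
            // limited to roughly the number of CPU cores
            CompletableFuture.runAsync(() -> {
                try {
                    // Send a request and record which pod answered
                    ResponseEntity<Map> response = restTemplate.getForEntity(
                        serviceUrl + "/debug/pod-info", Map.class);

                    // Extract the pod name from the response body
                    String podName = extractPodName(response);
                    podRequestCount.merge(podName, 1, Integer::sum);

                } catch (Exception e) {
                    log.error("Request failed", e);
                } finally {
                    latch.countDown();
                }
            });
        }

        latch.await();

        // Print the distribution
        log.info("Request distribution:");
        podRequestCount.forEach((pod, count) -> {
            double percentage = (double) count / requestCount * 100;
            // SLF4J placeholders take no format spec, so format the number first
            log.info("Pod {}: {} requests ({}%)", pod, count, String.format("%.2f", percentage));
        });
    }

    // Reads the "podName" field returned by /debug/pod-info
    private String extractPodName(ResponseEntity<Map> response) {
        Map<?, ?> body = response.getBody();
        Object name = (body == null) ? null : body.get("podName");
        return (name == null) ? "unknown" : name.toString();
    }

    /**
     * Debug endpoint to add to the application; returns the current pod's info
     */
    @RestController
    public static class DebugController {

        // Kubernetes sets HOSTNAME to the pod name by default
        @Value("${HOSTNAME:unknown}")
        private String podName;

        @GetMapping("/debug/pod-info")
        public Map<String, String> getPodInfo() {
            Map<String, String> info = new HashMap<>();
            info.put("podName", podName);
            info.put("timestamp", Instant.now().toString());
            return info;
        }
    }
}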

3. Common causes and solutions

3.1 Cause 1: session affinity is enabled

How to check:

kubectl get service your-service -o yaml
# Look at the spec.sessionAffinity setting

Solution:

apiVersion: v1
kind: Service
metadata:
  name: your-service
spec:
  sessionAffinity: None  # make sure this is None, not ClientIP
  selector:
    app: your-app
  ports:
  - port: 80
    targetPort: 8080
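The effect of this setting on a load test can be sketched as follows. This is a toy model under stated assumptions: the class and method names are hypothetical, the affinity timeout is ignored, and the first pod for a client IP is picked at random.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Toy model of a Service with sessionAffinity: ClientIP (affinity timeout ignored). */
final class ClientIpAffinity {
    private final Map<String, Integer> pinnedPod = new HashMap<>();
    private final Random random = new Random();
    private final int pods;
    private final boolean affinityEnabled;

    ClientIpAffinity(int pods, boolean affinityEnabled) {
        this.pods = pods;
        this.affinityEnabled = affinityEnabled;
    }

    /** Returns the index of the pod that serves a new connection from clientIp. */
    int route(String clientIp) {
        if (!affinityEnabled) {
            return random.nextInt(pods); // sessionAffinity: None
        }
        // sessionAffinity: ClientIP, the first choice sticks for this source IP
        return pinnedPod.computeIfAbsent(clientIp, ip -> random.nextInt(pods));
    }

    public static void main(String[] args) {
        ClientIpAffinity service = new ClientIpAffinity(4, true);
        int[] hits = new int[4];
        for (int i = 0; i < 100; i++) {
            hits[service.route("10.0.0.7")]++; // one load generator, one source IP
        }
        System.out.println(Arrays.toString(hits));
    }
}
```

A load test usually runs from a single machine, so with ClientIP affinity all of its connections share one source IP and end up on one pod.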

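3.2 Cause 2: limitations of iptables-mode load balancing

On Linux, kube-proxy uses iptables mode by default. It picks a backend at random for each new connection, so the distribution is only statistically even: with few connections, or with long-lived connections, it can be noticeably skewed.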
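The random selection can be checked with a little arithmetic. The sketch below assumes the usual rule layout as I understand it: one rule per endpoint, evaluated in order, where rule i of n matches with probability 1/(n-i) and the last rule always matches. The class and method names are illustrative.

```java
/** Arithmetic behind random per-connection selection over n endpoints. */
final class IptablesRuleChain {

    /** Probability attached to rule i (0-based) of n; the last rule always matches. */
    static double ruleProbability(int n, int i) {
        return 1.0 / (n - i);
    }

    /** Overall share of new connections that end up on endpoint i. */
    static double effectiveShare(int n, int i) {
        double reachRule = 1.0; // probability that all earlier rules were skipped
        for (int k = 0; k < i; k++) {
            reachRule *= 1.0 - ruleProbability(n, k);
        }
        return reachRule * ruleProbability(n, i);
    }

    /** Chance that `connections` independent picks all land on the same endpoint. */
    static double allOnOnePod(int n, int connections) {
        return n * Math.pow(1.0 / n, connections);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            System.out.printf("endpoint %d share: %.4f%n", i, effectiveShare(4, i));
        }
        System.out.println("all 100 connections on one pod: " + allOnOnePod(4, 100));
    }
}
```

Each endpoint gets the same 1/n share, and the chance of 100 independent connections all hitting one of 4 pods is vanishingly small. If every request lands on one pod, random selection alone is an unlikely explanation; connection reuse or session affinity is a more plausible suspect.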

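Solution: switch to ipvs mode

# Check the current proxy mode
kubectl get configmap -n kube-system kube-proxy -o yaml | grep mode

# To use ipvs, the configuration has to be changed

Set the following in the KubeProxyConfiguration held by the kube-proxy ConfigMap, then restart the kube-proxy pods so they pick it up: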

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"  # use ipvs instead of iptables
ipvs:
  scheduler: "lc"  # least-connection algorithm, more even
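For intuition, a least-connection scheduler can be sketched as below. This is a simplified illustration, not the IPVS implementation (which also weighs inactive connections); the class name and the lowest-index tie-break are assumptions.

```java
/** Simplified least-connection scheduler: new connections go to the least-loaded backend. */
final class LeastConnectionScheduler {
    private final int[] active;

    LeastConnectionScheduler(int backends) {
        this.active = new int[backends];
    }

    /** Picks the backend with the fewest active connections (lowest index wins ties). */
    synchronized int connect() {
        int best = 0;
        for (int i = 1; i < active.length; i++) {
            if (active[i] < active[best]) {
                best = i;
            }
        }
        active[best]++;
        return best;
    }

    /** Marks one connection on the given backend as closed. */
    synchronized void disconnect(int backend) {
        if (active[backend] > 0) {
            active[backend]--;
        }
    }

    synchronized int activeOn(int backend) {
        return active[backend];
    }

    public static void main(String[] args) {
        LeastConnectionScheduler scheduler = new LeastConnectionScheduler(4);
        for (int i = 0; i < 100; i++) {
            scheduler.connect();
        }
        for (int b = 0; b < 4; b++) {
            System.out.println("backend " + b + ": " + scheduler.activeOn(b));
        }
    }
}
```

Unlike random selection, this evens out long-lived connections deterministically. It still balances connections rather than requests, so a single reused connection stays on one pod.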

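3.3 Cause 3: client connection-pool reuse

Analysis: the HTTP client's connection pool sends requests over the same keep-alive connection. Because the Service picks a pod per connection, every request on a reused connection reaches the same pod.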
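How many pods a pool can reach depends on how many connections it holds. A rough estimate, assuming each connection is pinned independently to a uniformly random pod, is n(1 - (1 - 1/n)^k) pods for k connections across n pods. The helper below is a hypothetical sketch of that formula.

```java
/** Expected number of distinct pods reached by a pool of independent connections. */
final class PoolReach {

    /** n * (1 - (1 - 1/n)^k) for n pods and k connections, each pinned at random. */
    static double expectedPodsReached(int pods, int connections) {
        return pods * (1.0 - Math.pow(1.0 - 1.0 / pods, connections));
    }

    public static void main(String[] args) {
        for (int k : new int[] {1, 2, 4, 8, 100}) {
            System.out.printf("%3d connections -> %.2f of 4 pods%n", k, expectedPodsReached(4, k));
        }
    }
}
```

A pool that has collapsed to one live connection reaches exactly one pod, and two connections reach 1.75 pods on average, so small pools explain skew even when nothing else is misconfigured.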

Java client solution:

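import java.time.Duration;
import java.util.concurrent.TimeUnit;

import io.netty.channel.ChannelOption;
import io.netty.handler.timeout.ReadTimeoutHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.ClientRequest;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

@Configuration
public class LoadBalancerConfig {

    /**
     * WebClient backed by a pool whose connections are recycled regularly.
     * A ReactorClientHttpConnector cannot be passed to RestTemplate, so a
     * WebClient is used here. Spring Cloud's @LoadBalanced is left out because
     * the Kubernetes Service does the balancing. Inject by bean name, since
     * this class defines two WebClient beans.
     */
    @Bean
    public WebClient pooledWebClient() {
        return WebClient.builder()
            .clientConnector(createPooledConnector())
            .build();
    }

    private ReactorClientHttpConnector createPooledConnector() {
        // Key point: configure the pool so long-lived connections do not stick to one pod
        ConnectionProvider provider = ConnectionProvider.builder("lb-connection-pool")
            .maxConnections(100)                     // maximum number of connections
            .maxIdleTime(Duration.ofSeconds(20))     // maximum idle time, avoids stale long-lived connections
            .maxLifeTime(Duration.ofMinutes(5))      // maximum lifetime of a connection
            .pendingAcquireTimeout(Duration.ofSeconds(10))
            .evictInBackground(Duration.ofSeconds(30))
            .build();

        // Build the client on top of the pool so the settings below are actually applied
        HttpClient httpClient = HttpClient.create(provider)
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 5000)
            .doOnConnected(conn ->
                conn.addHandlerLast(new ReadTimeoutHandler(5000, TimeUnit.MILLISECONDS)))
            .compress(true)
            .followRedirect(true);

        return new ReactorClientHttpConnector(httpClient);
    }

    /**
     * WebClient that opens a new connection for every request
     */
    @Bean
    public WebClient loadBalancedWebClient() {
        // No pooling: each request gets its own connection, so the Service
        // picks a pod for every request (at the cost of a handshake per request)
        return WebClient.builder()
            .clientConnector(new ReactorClientHttpConnector(
                HttpClient.create(ConnectionProvider.newConnection())))
            .filter((request, next) -> {
                // Add a timestamp header to each request to avoid cached responses
                return next.exchange(ClientRequest.from(request)
                    .header("X-Request-Timestamp", String.valueOf(System.currentTimeMillis()))
                    .build());
            })
            .build();
    }
}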

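3.4 Cause 4: DNS caching

This matters mainly for headless Services (clusterIP: None), where DNS returns the pod IPs directly; a regular ClusterIP Service resolves to a single virtual IP.

Solution: configure the DNS caching policy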

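import java.util.concurrent.TimeUnit;

import io.netty.channel.ChannelOption;
import io.netty.handler.timeout.ReadTimeoutHandler;
import jakarta.annotation.PostConstruct; // javax.annotation.PostConstruct on Spring Boot 2.x
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import reactor.netty.http.client.HttpClient;

@Configuration
public class DnsConfig {

    @Bean
    public HttpClient httpClientWithDns() {
        return HttpClient.create()
            .resolver(spec -> spec.roundRobinSelection(true))  // rotate through the resolved addresses
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 5000)
            .doOnConnected(conn ->
                conn.addHandlerLast(new ReadTimeoutHandler(5000, TimeUnit.MILLISECONDS)));
    }

    // Configure the JVM DNS cache time.
    // This only takes effect if it runs before the JVM's first name lookup;
    // setting it at the start of main() or in java.security is more reliable.
    @PostConstruct
    public void setDnsCacheSettings() {
        // Cache successful lookups for 10 seconds
        java.security.Security.setProperty("networkaddress.cache.ttl", "10");
        // Cache failed lookups for 5 seconds
        java.security.Security.setProperty("networkaddress.cache.negative.ttl", "5");
    }
}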
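Round-robin selection over resolved addresses can be sketched with the plain JDK as well. The class below is a hypothetical illustration; it assumes the host name resolves to several addresses, as a headless Service would.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Rotates through all addresses a host name resolves to. */
final class RoundRobinResolver {
    private final AtomicInteger next = new AtomicInteger();

    /** Returns the next address in rotation; floorMod keeps the index non-negative. */
    String pick(List<String> addresses) {
        return addresses.get(Math.floorMod(next.getAndIncrement(), addresses.size()));
    }

    /** Resolves every address for the host, then picks one in rotation. */
    String resolve(String host) throws UnknownHostException {
        List<String> addresses = new ArrayList<>();
        for (InetAddress address : InetAddress.getAllByName(host)) {
            addresses.add(address.getHostAddress());
        }
        return pick(addresses);
    }
}
```

Calling resolve with the headless Service's DNS name before each new connection would spread connections across the returned pod IPs, subject to the JVM cache TTL.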

4. Advanced solutions

4.1 Use a service mesh (Istio) for smarter load balancing

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: your-service
spec:
  host: your-service.default.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: LEAST_CONN  # least-connection algorithm (deprecated in newer Istio releases in favor of LEAST_REQUEST)
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 30ms
      http:
        http1MaxPendingRequests: 1024
        maxRequestsPerConnection: 1024
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: your-service
spec:
  hosts:
  - your-service.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: your-service.default.svc.cluster.local
      weight: 100

4.2 Client-side load-balancing strategy

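import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

import io.fabric8.kubernetes.client.KubernetesClient;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

/**
 * Custom load-balancing strategy
 */
@Component
public class CustomLoadBalancer {

    // Assumption: a fabric8 KubernetesClient bean is available in the context
    @Autowired
    private KubernetesClient kubernetesClient;

    private final AtomicInteger counter = new AtomicInteger(0);
    // Replaced as a whole on refresh, so readers never see a half-updated list
    private volatile List<String> availablePods = Collections.emptyList();

    @Scheduled(fixedRate = 30000) // refresh the pod list every 30 seconds
    public void refreshPodList() {
        // Fetch the current pod IPs from the Kubernetes API
        availablePods = kubernetesClient.pods()
            .inNamespace("default")
            .withLabels(Collections.singletonMap("app", "your-app"))
            .list()
            .getItems()
            .stream()
            .map(pod -> pod.getStatus().getPodIP())
            .filter(Objects::nonNull) // skip pods that have no IP yet
            .collect(Collectors.toList());
    }

    /**
     * Round-robin selection
     */
    public String choosePod() {
        List<String> pods = availablePods; // work on one consistent snapshot
        if (pods.isEmpty()) {
            throw new IllegalStateException("No available pods");
        }

        // Simple round robin; floorMod keeps the index non-negative after overflow
        int index = Math.floorMod(counter.getAndIncrement(), pods.size());
        return pods.get(index);
    }

    /**
     * Runs a request against a pod chosen by the custom load balancer
     */
    public <T> T executeWithLoadBalance(Function<String, T> requestFunction) {
        String targetPod = choosePod();
        String url = "http://" + targetPod + ":8080";

        return requestFunction.apply(url);
    }
}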
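One detail of counter-based round robin is easy to miss: an int counter eventually wraps to negative values, and Java's % operator then yields a negative index. The sketch below (hypothetical helper names) contrasts % with Math.floorMod.

```java
/** Index arithmetic for a round-robin counter that may have overflowed. */
final class RoundRobinIndex {

    /** Plain remainder: negative when the counter is negative. */
    static int naive(int counter, int size) {
        return counter % size;
    }

    /** floorMod always returns a value in [0, size) for a positive size. */
    static int safe(int counter, int size) {
        return Math.floorMod(counter, size);
    }

    public static void main(String[] args) {
        int wrapped = Integer.MAX_VALUE + 1; // overflows to Integer.MIN_VALUE
        System.out.println("naive: " + naive(wrapped, 3)); // a negative index
        System.out.println("safe:  " + safe(wrapped, 3));
    }
}
```

A negative index would make List.get throw IndexOutOfBoundsException, which only shows up after roughly two billion requests, so floorMod is the safer choice.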

5. Load-test script tuning

5.1 Make sure the load-testing tool is configured correctly

# Load test with wrk using multiple connections (-c is the short form of --connections)
wrk -t12 -c100 -d30s http://your-service

# Apache Bench with keep-alive disabled
ab -n 1000 -c 100 -H "Connection: close" http://your-service/

# hey (a more modern alternative)
hey -n 1000 -c 100 -disable-keepalive http://your-service

5.2 Tuning a Java load-test client

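import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Apache HttpClient 4.x
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

/**
 * Tuned load-test client that aims for an even request distribution
 */
public class BalancedLoadTest {

    // Assumption: the endpoint returns JSON containing a "podName" field
    private static final Pattern POD_NAME = Pattern.compile("\"podName\"\\s*:\\s*\"([^\"]+)\"");

    public void performLoadTest() throws InterruptedException {
        int concurrentUsers = 100;
        int requestsPerUser = 100;

        CountDownLatch latch = new CountDownLatch(concurrentUsers);
        Map<String, AtomicInteger> requestDistribution = new ConcurrentHashMap<>();

        // One thread per simulated user; the common ForkJoinPool would cap concurrency
        ExecutorService pool = Executors.newFixedThreadPool(concurrentUsers);

        for (int i = 0; i < concurrentUsers; i++) {
            final int userIndex = i;
            CompletableFuture.runAsync(() -> {
                // Each user gets its own HTTP client instance, closed when the user is done
                try (CloseableHttpClient httpClient = HttpClients.custom()
                        .setMaxConnTotal(1)  // one connection per client
                        .setMaxConnPerRoute(1)
                        .disableConnectionState() // disable connection-state tracking
                        .build()) {
                    for (int j = 0; j < requestsPerUser; j++) {
                        HttpGet request = new HttpGet("http://your-service/api");

                        // Headers to bypass caches and identify the simulated user
                        request.setHeader("Cache-Control", "no-cache");
                        request.setHeader("User-Agent", "LoadTest-User-" + userIndex);

                        try (CloseableHttpResponse response = httpClient.execute(request)) {
                            String podInfo = EntityUtils.toString(response.getEntity());
                            String podName = extractPodName(podInfo);

                            requestDistribution
                                .computeIfAbsent(podName, k -> new AtomicInteger())
                                .incrementAndGet();
                        }

                        // Small random delay so the requests are not fully synchronized
                        Thread.sleep(ThreadLocalRandom.current().nextInt(10));
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    latch.countDown();
                }
            }, pool);
        }

        if (!latch.await(2, TimeUnit.MINUTES)) {
            System.err.println("Timed out before all users finished");
        }
        pool.shutdown();

        // Print the distribution report
        System.out.println("=== Load distribution report ===");
        requestDistribution.forEach((pod, count) -> {
            double percentage = (double) count.get() / (concurrentUsers * requestsPerUser) * 100;
            System.out.printf("Pod %s: %d requests (%.2f%%)%n", pod, count.get(), percentage);
        });
    }

    // Pulls the pod name out of the JSON response body
    private static String extractPodName(String body) {
        Matcher matcher = POD_NAME.matcher(body);
        return matcher.find() ? matcher.group(1) : "unknown";
    }
}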

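6. Monitoring and verification

6.1 Monitor the request distribution in real time

# Show each pod's request counters (-o name avoids passing the header row to kubectl logs)
kubectl get pods -l app=your-app -o name | xargs -I {} sh -c 'echo "== {}"; kubectl logs {} --tail=10 | grep "REQUEST_COUNT"'

# Prometheus query
sum(rate(http_requests_total[1m])) by (pod)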

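6.2 Verify that the fix works

After applying any of the solutions above, rerun the load test and observe the request distribution:

# Example of the expected result
Pod your-app-7cbbf5d56f-abcde: 1250 requests (25.00%)
Pod your-app-7cbbf5d56f-fghij: 1248 requests (24.96%)
Pod your-app-7cbbf5d56f-klmno: 1252 requests (25.04%)
Pod your-app-7cbbf5d56f-pqrst: 1250 requests (25.00%)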
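Whether a distribution counts as "even" can be checked mechanically. The sketch below is one possible criterion, not a standard one: it measures each pod's deviation from the ideal equal share and compares the worst case with a tolerance. The class name and the 5% tolerance are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Checks how far per-pod request counts deviate from an equal share. */
final class DistributionCheck {

    /** Largest relative deviation of any pod's count from the ideal total/n. */
    static double maxDeviation(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        if (counts.isEmpty() || total == 0) {
            throw new IllegalArgumentException("no requests recorded");
        }
        double ideal = (double) total / counts.size();
        double worst = 0.0;
        for (int c : counts.values()) {
            worst = Math.max(worst, Math.abs(c - ideal) / ideal);
        }
        return worst;
    }

    static boolean isBalanced(Map<String, Integer> counts, double tolerance) {
        return maxDeviation(counts) <= tolerance;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("your-app-7cbbf5d56f-abcde", 1250);
        counts.put("your-app-7cbbf5d56f-fghij", 1248);
        counts.put("your-app-7cbbf5d56f-klmno", 1252);
        counts.put("your-app-7cbbf5d56f-pqrst", 1250);
        System.out.println("balanced within 5%: " + isBalanced(counts, 0.05));
    }
}
```

Pods that received no requests must be added to the map with a count of 0; otherwise a run where everything hit one pod would look perfectly balanced.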

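Summary

This problem usually has one of the following causes; check them in priority order:

Session affinity configuration (most common)
Client connection-pool reuse
Limitations of the iptables load-balancing algorithm
DNS caching
Misconfigured load-testing tool

Recommended order of fixes:

Check and disable session affinity
Tune the client connection-pool configuration
Consider switching to ipvs mode
Use a service mesh for advanced load balancing

Systematic troubleshooting and tuning along these lines should spread traffic evenly across all pods.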

Posted 2025-09-29 16:17 by NeoLshu
