在K8S环境里部署大模型
部署说明:
本示例使用SGlang作为大模型部署运行的框架,并且使用SGLang Router作为网关负载后端部署的大模型服务。示例使用模型为deepseek-v32,每台节点为8张H200GPU卡,一共三台实例。下面为具体内容
1、部署deepseek-v32的sts服务
apiVersion: apps/v1 kind: StatefulSet metadata: name: deepseek-v32-worker namespace: deepseek spec: serviceName: deepseek-v32-worker replicas: 3 selector: matchLabels: app: deepseek-v32-worker model: deepseek-v32 template: metadata: labels: app: deepseek-v32-worker model: deepseek-v32 spec: affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchLabels: app: deepseek-v32-worker topologyKey: kubernetes.io/hostname containers: - name: sglang image: lmsysorg/sglang:v0.4.1-cu124 command: ["python3", "-m", "sglang.launch_server"] args: - "--model-path=/models/DeepSeek-V3.2" - "--tp=8" - "--dp=1" - "--quantization=fp8" - "--context-length=131072" - "--mem-fraction-static=0.88" - "--trust-remote-code" - "--port=8000" - "--enable-metrics" ports: - containerPort: 8000 name: http resources: limits: nvidia.com/gpu: "8" cpu: "128" memory: "1Ti" volumeMounts: - name: models mountPath: /models - name: dshm mountPath: /dev/shm volumes: - name: models persistentVolumeClaim: claimName: deepseek-v32-pvc - name: dshm emptyDir: medium: Memory sizeLimit: "200Gi" tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule --- apiVersion: v1 kind: Service metadata: name: deepseek-v32-worker namespace: deepseek spec: selector: app: deepseek-v32-worker ports: - port: 8000 name: http
2、部署Router Deploy(并通过K8S机制实现worker自动发下)
apiVersion: v1 kind: ServiceAccount metadata: name: sglang-router namespace: deepseek --- apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: sglang-router namespace: deepseek rules: - apiGroups: [""] resources: ["pods"] verbs: ["get", "list", "watch"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: sglang-router namespace: deepseek subjects: - kind: ServiceAccount name: sglang-router namespace: deepseek roleRef: kind: Role name: sglang-router apiGroup: rbac.authorization.k8s.io --- apiVersion: apps/v1 kind: Deployment metadata: name: deepseek-v32-router namespace: deepseek spec: replicas: 2 selector: matchLabels: app: deepseek-v32-router template: metadata: labels: app: deepseek-v32-router spec: serviceAccountName: sglang-router containers: - name: router image: lmsysorg/sglang:v0.4.1-cu124 command: ["python3", "-m", "sglang_router.launch_router"] args: - "--service-discovery" # 启用 K8s 服务发现 - "--selector=app=deepseek-v32-worker" # 匹配 Worker 标签 - "--service-discovery-namespace=deepseek" # Namespace - "--service-discovery-port=8000" # Worker 端口 - "--policy=cache_aware" # 缓存感知路由 - "--cache-threshold=0.5" - "--port=8080" - "--host=0.0.0.0" ports: - containerPort: 8080 name: router resources: limits: cpu: "4" memory: "8Gi" livenessProbe: httpGet: path: /health port: 8080 initialDelaySeconds: 10 periodSeconds: 5 --- apiVersion: v1 kind: Service metadata: name: deepseek-v32-router namespace: deepseek spec: selector: app: deepseek-v32-router ports: - port: 8080 targetPort: 8080 type: ClusterIP --- apiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: deepseek-v32 namespace: deepseek annotations: nginx.ingress.kubernetes.io/proxy-read-timeout: "1200" nginx.ingress.kubernetes.io/proxy-send-timeout: "1200" spec: rules: - host: deepseek-v32.yourdomain.com http: paths: - path: / pathType: Prefix backend: service: name: deepseek-v32-router port: number: 8080
总结:
通过sglang-router并配合K8S的自动发现机制,实现扩容模型实例时,能够自动感知,无需人工干预。

浙公网安备 33010602011771号