Deploying Large Language Models on Kubernetes

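Deployment overview:

This example uses SGLang as the serving framework for the model and SGLang Router as the gateway that load-balances requests across the deployed model instances. The model is deepseek-v32, served by three instances, each running on a node with eight H200 GPUs. The details follow.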

1. Deploy the deepseek-v32 StatefulSet (sts) and its Service

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: deepseek-v32-worker
  namespace: deepseek
spec:
  serviceName: deepseek-v32-worker
  replicas: 3
  selector:
    matchLabels:
      app: deepseek-v32-worker
      model: deepseek-v32
  template:
    metadata:
      labels:
        app: deepseek-v32-worker
        model: deepseek-v32
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: deepseek-v32-worker
            topologyKey: kubernetes.io/hostname
      containers:
      - name: sglang
        # Verify that this tag supports DeepSeek-V3.2; older SGLang releases may not.
        image: lmsysorg/sglang:v0.4.1-cu124
        command: ["python3", "-m", "sglang.launch_server"]
        args:
        - "--model-path=/models/DeepSeek-V3.2"
        - "--tp=8"
        - "--dp=1"
        - "--quantization=fp8"
        - "--context-length=131072"
        - "--mem-fraction-static=0.88"
        - "--trust-remote-code"
        - "--host=0.0.0.0"   # listen on the pod IP; the server binds to 127.0.0.1 by default
        - "--port=8000"
        - "--enable-metrics"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: "8"
            cpu: "128"
            memory: "1Ti"
        volumeMounts:
        - name: models
          mountPath: /models
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: deepseek-v32-pvc
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "200Gi"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-v32-worker
  namespace: deepseek
spec:
  clusterIP: None   # headless Service, as expected for a StatefulSet's serviceName
  selector:
    app: deepseek-v32-worker
  ports:
  - port: 8000
    name: http
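
With `replicas: 3` and the pod anti-affinity rule, the StatefulSet above places one worker pod on each of three nodes, and Kubernetes names the pods by ordinal. The short Python sketch below only illustrates that naming and the GPU arithmetic. The helper names are hypothetical, and the script does not talk to a cluster.

```python
from __future__ import annotations

# Illustrative only: values copied from the manifest above.
STATEFULSET_NAME = "deepseek-v32-worker"  # metadata.name
REPLICAS = 3                              # spec.replicas
GPUS_PER_POD = 8                          # nvidia.com/gpu limit, matches --tp=8


def worker_pod_names(name: str, replicas: int) -> list[str]:
    """StatefulSet pods are named <statefulset-name>-<ordinal>, starting at 0."""
    return [f"{name}-{ordinal}" for ordinal in range(replicas)]


def total_gpus(replicas: int, gpus_per_pod: int) -> int:
    """Each replica is an independent model instance sharded across its own GPUs."""
    return replicas * gpus_per_pod


if __name__ == "__main__":
    print(worker_pod_names(STATEFULSET_NAME, REPLICAS))
    print(total_gpus(REPLICAS, GPUS_PER_POD))
```

These pod names are what `kubectl get pods -n deepseek` should list once the StatefulSet has been created.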

2. Deploy the Router Deployment (with automatic worker discovery through the Kubernetes API)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: sglang-router
  namespace: deepseek
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sglang-router
  namespace: deepseek
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sglang-router
  namespace: deepseek
subjects:
- kind: ServiceAccount
  name: sglang-router
  namespace: deepseek
roleRef:
  kind: Role
  name: sglang-router
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-v32-router
  namespace: deepseek
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deepseek-v32-router
  template:
    metadata:
      labels:
        app: deepseek-v32-router
    spec:
      serviceAccountName: sglang-router
      containers:
      - name: router
        image: lmsysorg/sglang:v0.4.1-cu124
        command: ["python3", "-m", "sglang_router.launch_router"]
        args:
        - "--service-discovery"                           # enable Kubernetes service discovery
        - "--selector=app=deepseek-v32-worker"            # match the worker pod labels
        - "--service-discovery-namespace=deepseek"        # namespace to watch
        - "--service-discovery-port=8000"                 # worker port
        - "--policy=cache_aware"                          # cache-aware routing
        - "--cache-threshold=0.5"
        - "--port=8080"
        - "--host=0.0.0.0"
        ports:
        - containerPort: 8080
          name: router
        resources:
          limits:
            cpu: "4"
            memory: "8Gi"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-v32-router
  namespace: deepseek
spec:
  selector:
    app: deepseek-v32-router
  ports:
  - port: 8080
    targetPort: 8080
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: deepseek-v32
  namespace: deepseek
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "1200"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "1200"
spec:
  rules:
  - host: deepseek-v32.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: deepseek-v32-router
            port:
              number: 8080
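
The router Deployment above sets `--policy=cache_aware` and `--cache-threshold=0.5`. As a rough illustration of the idea, the Python sketch below sends a request to the worker whose previously served prompts share the longest prefix with it, but only when that shared prefix covers more than the threshold fraction of the prompt. Otherwise it falls back to the worker that has cached the least text. This is a simplified model and not the router's actual implementation, which works on tokenized or tree-structured state and also considers load. All class names, function names, and worker URLs here are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading substring of a and b."""
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n


@dataclass
class Worker:
    url: str
    seen_prompts: list[str] = field(default_factory=list)

    def best_match(self, prompt: str) -> int:
        """Longest prefix this worker shares with any prompt it has served."""
        return max((common_prefix_len(prompt, p) for p in self.seen_prompts), default=0)

    def cached_chars(self) -> int:
        """Crude stand-in for how much cache this worker already holds."""
        return sum(len(p) for p in self.seen_prompts)


def pick_worker(workers: list[Worker], prompt: str, cache_threshold: float = 0.5) -> Worker:
    """Use the best prefix match if it is good enough, else the emptiest worker."""
    best = max(workers, key=lambda w: w.best_match(prompt))
    match_rate = best.best_match(prompt) / len(prompt) if prompt else 0.0
    if match_rate > cache_threshold:
        chosen = best
    else:
        chosen = min(workers, key=lambda w: w.cached_chars())
    chosen.seen_prompts.append(prompt)  # remember the prompt for later matches
    return chosen


if __name__ == "__main__":
    workers = [Worker("http://10.0.0.11:8000"), Worker("http://10.0.0.12:8000")]
    system = "You are a helpful assistant. "
    first = pick_worker(workers, system + "What is Kubernetes?")
    second = pick_worker(workers, system + "What is a StatefulSet?")
    third = pick_worker(workers, "Translate this sentence into French.")
    print(first.url, second.url, third.url)
```

In the demo, the two prompts that share a system prompt land on the same worker, while the unrelated prompt goes to the other one. A higher threshold makes the router demand a longer shared prefix before it prefers cache locality.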

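Summary:
With sglang-router and Kubernetes-based service discovery, model instances added during a scale-out are detected automatically, with no manual intervention.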
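
To make the discovery step concrete, here is a minimal Python sketch of the selection logic that `--selector=app=deepseek-v32-worker` and `--service-discovery-port=8000` imply: keep the pods whose labels match the selector and that are running with an IP, then build one worker URL per pod. It operates on plain dictionaries rather than the Kubernetes API, the pod data is invented for illustration, and the function names are hypothetical. The router's own implementation may differ in detail.

```python
from __future__ import annotations


def parse_selector(selector: str) -> dict[str, str]:
    """Turn 'key=value,key2=value2' into a dict of required labels."""
    pairs = (item.split("=", 1) for item in selector.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def matches(labels: dict[str, str], required: dict[str, str]) -> bool:
    """A pod matches when every required label is present with the same value."""
    return all(labels.get(key) == value for key, value in required.items())


def discover_workers(pods: list[dict], selector: str, port: int) -> list[str]:
    """Return worker URLs for running pods that match the selector."""
    required = parse_selector(selector)
    urls = []
    for pod in pods:
        if pod.get("phase") != "Running" or not pod.get("ip"):
            continue  # skip pods that are still pending or have no IP yet
        if matches(pod.get("labels", {}), required):
            urls.append(f"http://{pod['ip']}:{port}")
    return urls


if __name__ == "__main__":
    # Invented pod data: two running workers, one still starting, one router pod.
    pods = [
        {"labels": {"app": "deepseek-v32-worker"}, "phase": "Running", "ip": "10.0.0.11"},
        {"labels": {"app": "deepseek-v32-worker"}, "phase": "Running", "ip": "10.0.0.12"},
        {"labels": {"app": "deepseek-v32-worker"}, "phase": "Pending", "ip": None},
        {"labels": {"app": "deepseek-v32-router"}, "phase": "Running", "ip": "10.0.0.21"},
    ]
    print(discover_workers(pods, "app=deepseek-v32-worker", 8000))
```

Raising `replicas` on the StatefulSet adds another pod with the same label, so it enters this list once it is running. The Role in step 2 grants `get`, `list`, and `watch` on pods, which is what such a discovery loop needs.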

posted @ 2026-05-20 18:21  ZANAN