koordinator to hami
Production environment upgrade steps
Scheduler upgrade
The production environment already runs hami but not koordinator. To keep the change simple, deploy koordinator via its helm chart into the namespace where hami already lives. The annotations and label below let Helm adopt the existing namespace; a quick post-install check is sketched after the install commands.
kubectl annotate namespace hami meta.helm.sh/release-name=koordinator
kubectl annotate namespace hami meta.helm.sh/release-namespace=hami
kubectl label namespace hami app.kubernetes.io/managed-by=Helm
cd ./koordinator
helm install koordinator -n hami .
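A quick post-install sanity check (a sketch; the component set follows the upstream koordinator chart and may differ in this environment):
helm -n hami status koordinator
kubectl -n hami get pods   # koord-scheduler, koord-manager, koord-descheduler and koordlet should appear alongside the existing hami pods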
Modify the koord-scheduler-config ConfigMap: save the manifest below to a file and apply it to replace the ConfigMap as a whole (an apply sketch follows the manifest).
apiVersion: v1
data:
koord-scheduler-config: |
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
resourceLock: leases
resourceName: koord-scheduler
resourceNamespace: hami
extenders:
- urlPrefix: "https://127.0.0.1:443"
filterVerb: filter
bindVerb: bind
nodeCacheCapable: true
weight: 1
httpTimeout: 30s
enableHTTPS: true
tlsConfig:
insecure: true
managedResources:
- name: nvidia.com/gpu
ignoredByScheduler: true
- name: nvidia.com/gpumem
ignoredByScheduler: true
- name: nvidia.com/gpucores
ignoredByScheduler: true
- name: nvidia.com/gpumem-percentage
ignoredByScheduler: true
- name: nvidia.com/priority
ignoredByScheduler: true
- name: cambricon.com/vmlu
ignoredByScheduler: true
- name: hygon.com/dcunum
ignoredByScheduler: true
- name: hygon.com/dcumem
ignoredByScheduler: true
- name: hygon.com/dcucores
ignoredByScheduler: true
- name: iluvatar.ai/vgpu
ignoredByScheduler: true
- name: "metax-tech.com/gpu"
ignoredByScheduler: true
- name: metax-tech.com/sgpu
ignoredByScheduler: true
- name: metax-tech.com/vcore
ignoredByScheduler: true
- name: metax-tech.com/vmemory
ignoredByScheduler: true
- name: huawei.com/Ascend910A
ignoredByScheduler: true
- name: huawei.com/Ascend910A-memory
ignoredByScheduler: true
- name: huawei.com/Ascend910B2
ignoredByScheduler: true
- name: huawei.com/Ascend910B2-memory
ignoredByScheduler: true
- name: huawei.com/Ascend910B
ignoredByScheduler: true
- name: huawei.com/Ascend910B-memory
ignoredByScheduler: true
- name: huawei.com/Ascend910B4
ignoredByScheduler: true
- name: huawei.com/Ascend910B4-memory
ignoredByScheduler: true
- name: huawei.com/Ascend310P
ignoredByScheduler: true
- name: huawei.com/Ascend310P-memory
ignoredByScheduler: true
profiles:
- pluginConfig:
- name: NodeResourcesFit
args:
apiVersion: kubescheduler.config.k8s.io/v1
kind: NodeResourcesFitArgs
scoringStrategy:
type: LeastAllocated
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
- name: "kubernetes.io/batch-cpu"
weight: 1
- name: "kubernetes.io/batch-memory"
weight: 1
- name: LoadAwareScheduling
args:
apiVersion: kubescheduler.config.k8s.io/v1
kind: LoadAwareSchedulingArgs
filterExpiredNodeMetrics: false
nodeMetricExpirationSeconds: 300
resourceWeights:
cpu: 1
memory: 1
usageThresholds:
cpu: 65
memory: 95
# disable by default
# prodUsageThresholds indicates the resource utilization threshold of Prod Pods compared to the whole machine.
# prodUsageThresholds:
# cpu: 55
# memory: 75
# scoreAccordingProdUsage controls whether to score according to the utilization of Prod Pod
# scoreAccordingProdUsage: true
# aggregated supports resource utilization filtering and scoring based on percentile statistics
aggregated:
usageThresholds:
cpu: 65
memory: 95
usageAggregationType: "p95"
scoreAggregationType: "p95"
estimatedScalingFactors:
cpu: 85
memory: 70
- name: ElasticQuota
args:
apiVersion: kubescheduler.config.k8s.io/v1
kind: ElasticQuotaArgs
quotaGroupNamespace: hami
plugins:
queueSort:
disabled:
- name: "*"
enabled:
- name: Coscheduling
preFilter:
enabled:
- name: Coscheduling
- name: Reservation
- name: NodeNUMAResource
- name: DeviceShare
- name: ElasticQuota
filter:
enabled:
- name: Reservation
- name: LoadAwareScheduling
- name: NodeNUMAResource
- name: DeviceShare
postFilter:
disabled:
- name: "*"
enabled:
- name: Reservation
- name: Coscheduling
- name: ElasticQuota
- name: DefaultPreemption
preScore:
enabled:
- name: Reservation # The Reservation plugin must come first
score:
enabled:
- name: LoadAwareScheduling
weight: 1
- name: NodeNUMAResource
weight: 1
- name: DeviceShare
weight: 1
- name: Reservation
weight: 5000
reserve:
enabled:
- name: Reservation # The Reservation plugin must come first
- name: LoadAwareScheduling
- name: NodeNUMAResource
- name: DeviceShare
- name: Coscheduling
- name: ElasticQuota
permit:
enabled:
- name: Coscheduling
preBind:
enabled:
- name: NodeNUMAResource
- name: DeviceShare
- name: Reservation
- name: DefaultPreBind
bind:
disabled:
- name: "*"
enabled:
- name: Reservation
- name: DefaultBinder
postBind:
enabled:
- name: Coscheduling
schedulerName: hami-scheduler
kind: ConfigMap
metadata:
annotations:
meta.helm.sh/release-name: koordinator
meta.helm.sh/release-namespace: hami
labels:
app.kubernetes.io/managed-by: Helm
name: koord-scheduler-config
namespace: hami
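One way to apply it (the file name is an assumption):
# assuming the manifest above was saved as koord-scheduler-config.yaml
kubectl apply -f koord-scheduler-config.yaml
# restart the scheduler so it picks up the new configuration
kubectl -n hami rollout restart deployment/koord-scheduler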
After installing via helm, edit the koordinator deployment to add the hami container and volume entries (a quick verification sketch follows the manifest).
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "1"
meta.helm.sh/release-name: koordinator
meta.helm.sh/release-namespace: default
labels:
app.kubernetes.io/managed-by: Helm
koord-app: koord-scheduler
name: koord-scheduler
namespace: hami
resourceVersion: "350146"
uid: 89510e69-de4c-46c5-b9ac-aada7abbc123
spec:
minReadySeconds: 3
progressDeadlineSeconds: 600
replicas: 2
revisionHistoryLimit: 10
selector:
matchLabels:
koord-app: koord-scheduler
strategy:
rollingUpdate:
maxSurge: 100%
maxUnavailable: 0
type: RollingUpdate
template:
metadata:
creationTimestamp: null
labels:
koord-app: koord-scheduler
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: koord-app
operator: In
values:
- koord-scheduler
topologyKey: kubernetes.io/hostname
weight: 100
containers:
- args:
- --port=10251
- --authentication-skip-lookup=true
- --v=4
- --feature-gates=
- --config=/config/koord-scheduler.config
command:
- /koord-scheduler
image: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-scheduler:v1.6.0
imagePullPolicy: Always
name: scheduler
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10251
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "1"
memory: 1Gi
requests:
cpu: 500m
memory: 256Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /config
name: koord-scheduler-config-volume
# hami container added on top of the original koordinator deployment
- command:
- scheduler
- --http_bind=0.0.0.0:443
- --cert_file=/tls/tls.crt
- --key_file=/tls/tls.key
- --scheduler-name=hami-scheduler
- --metrics-bind-address=:9395
- --node-scheduler-policy=binpack
- --gpu-scheduler-policy=spread
- --device-config-file=/device-config.yaml
- --enable-ascend=true
- --debug
- -v=4
env:
- name: HAMI_NODELOCK_EXPIRE
value: 5m
image: 10.62.48.94:30085/hami/hami:v2.6.1 # subject to the final release
imagePullPolicy: IfNotPresent
name: vgpu-scheduler-extender
ports:
- containerPort: 443
name: http
protocol: TCP
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /tls
name: tls-config
- mountPath: /device-config.yaml
name: device-config
subPath: device-config.yaml
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: koord-scheduler
serviceAccountName: koord-scheduler
terminationGracePeriodSeconds: 10
volumes:
- configMap:
defaultMode: 420
items:
- key: koord-scheduler-config
path: koord-scheduler.config
name: koord-scheduler-config
name: koord-scheduler-config-volume
# volumes added for hami
- name: tls-config
secret:
defaultMode: 420
secretName: hami-scheduler-tls
- configMap:
defaultMode: 420
name: hami-scheduler-newversion
name: scheduler-config
- configMap:
defaultMode: 420
name: hami-scheduler-device
name: device-config
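After the edit, a quick way to confirm both containers come up (container and label names as in the manifest above):
kubectl -n hami rollout status deployment/koord-scheduler
kubectl -n hami get pods -l koord-app=koord-scheduler
kubectl -n hami logs deployment/koord-scheduler -c vgpu-scheduler-extender --tail=20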
- Modify the selector of the hami-scheduler Service so that it points to the koord-scheduler Pods. In the current product, the schedulerName on created Pods is handled by hami's mutating webhook, so it is enough to point the hami-scheduler Service at the combined koordinator+hami Pods deployed above.
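A sketch of the selector change (the Service name and namespace are assumed to match the existing hami install; the koord-app label comes from the deployment above):
# inspect the current selector first
kubectl -n hami get svc hami-scheduler -o jsonpath='{.spec.selector}'
# replace the whole selector so the Service targets the koord-scheduler pods
kubectl -n hami patch svc hami-scheduler --type json \
  -p '[{"op":"replace","path":"/spec/selector","value":{"koord-app":"koord-scheduler"}}]'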
- Modify the hami-scheduler ClusterRoleBinding to also bind the koord-scheduler ServiceAccount:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
annotations:
meta.helm.sh/release-name: hami
meta.helm.sh/release-namespace: koordinator-system
labels:
app.kubernetes.io/component: hami-scheduler
app.kubernetes.io/instance: hami
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: hami
app.kubernetes.io/version: 2.6.1
helm.sh/chart: hami-2.6.1
name: hami-scheduler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: hami-scheduler
subjects:
- kind: ServiceAccount
name: hami-scheduler
namespace: koordinator-system
# added koordinator ServiceAccount
- kind: ServiceAccount
name: koord-scheduler
namespace: koordinator-system
- Modify the koord-scheduler ClusterRole to allow the patch operation on nodes (a verification sketch follows the snippet):
kubectl edit clusterrole koord-scheduler-role
- apiGroups:
- ""
resources:
- pods
- nodes # added field
verbs:
- patch
- update
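To confirm the permission is effective (the ServiceAccount namespace depends on where koord-scheduler was installed; hami is assumed here):
kubectl auth can-i patch nodes --as=system:serviceaccount:hami:koord-scheduler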
Topology monitoring
- Deploy the sriov-metrics-exporter DaemonSet (a metrics spot check follows the Service manifest below):
apiVersion: apps/v1
kind: DaemonSet
metadata:
labels:
app.kubernetes.io/name: sriov-metrics-exporter
app.kubernetes.io/version: v0.0.1
name: sriov-metrics-exporter
namespace: nvidia-network-operator
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
app.kubernetes.io/name: sriov-metrics-exporter
template:
metadata:
labels:
app.kubernetes.io/name: sriov-metrics-exporter
app.kubernetes.io/version: v0.0.1
spec:
hostNetwork: true
containers:
- args:
- --path.kubecgroup=/host/kubecgroup
- --path.sysbuspci=/host/sys/bus/pci/devices/
- --path.sysclassnet=/host/sys/class/net/
- --path.cpucheckpoint=/host/cpu_manager_state
- --path.kubeletsocket=/host/kubelet.sock
- --collector.kubepoddevice=true
- --collector.vfstatspriority=sysfs,netlink
image: ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter:latest
imagePullPolicy: Always
name: sriov-metrics-exporter
resources:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 100Mi
cpu: 100m
securityContext:
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
volumeMounts:
- mountPath: /host/kubelet.sock
name: kubeletsocket
- mountPath: /host/sys/bus/pci/devices
name: sysbuspcidevices
readOnly: true
- mountPath: /host/sys/devices
name: sysdevices
readOnly: true
- mountPath: /host/sys/class/net
name: sysclassnet
readOnly: true
- mountPath: /host/kubecgroup
name: kubecgroup
readOnly: true
- mountPath: /host/cpu_manager_state
name: cpucheckpoint
readOnly: true
nodeSelector:
kubernetes.io/os: linux
feature.node.kubernetes.io/network-sriov.capable: "true"
restartPolicy: Always
tolerations:
- operator: Exists
volumes:
- hostPath:
path: /var/lib/kubelet/pod-resources/kubelet.sock
type: "Socket"
name: kubeletsocket
- hostPath:
path: /sys/fs/cgroup/kubepods.slice/
type: "Directory"
name: kubecgroup
- hostPath:
path: /var/lib/kubelet/cpu_manager_state
type: "File"
name: cpucheckpoint
- hostPath:
path: /sys/class/net
type: "Directory"
name: sysclassnet
- hostPath:
path: /sys/bus/pci/devices
type: "Directory"
name: sysbuspcidevices
- hostPath:
path: /sys/devices
type: "Directory"
name: sysdevices
---
apiVersion: v1
kind: Service
metadata:
name: sriov-metrics-exporter
namespace: nvidia-network-operator
annotations:
prometheus.io/target: "true"
spec:
selector:
app.kubernetes.io/name: sriov-metrics-exporter
ports:
- protocol: TCP
port: 9808
targetPort: 9808
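Once the DaemonSet is running, the metrics endpoint can be spot-checked through the Service (port 9808 as declared above; the curl pod is a throwaway):
kubectl -n nvidia-network-operator get pods -l app.kubernetes.io/name=sriov-metrics-exporter
kubectl -n nvidia-network-operator run curl-sriov --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://sriov-metrics-exporter:9808/metrics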
- Add a VMRule for the sriov exporter:
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
annotations:
meta.helm.sh/release-name: sriov-exporter
meta.helm.sh/release-namespace: monitoring-system
generation: 1
labels:
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: sriov
group: kubegien
kubegien.org/rule-level: builtin
name: sriov-exporter
namespace: monitoring-system
spec:
groups:
- name: sriov.rules
rules:
- expr: |
sum(rate(sriov_vf_tx_bytes{}[5m])) by (pf,node)
record: sriov_vf_tx_bytes:sum
- expr: |
sum(rate(sriov_vf_rx_bytes{}[5m])) by (pf,node)
record: sriov_vf_rx_bytes:sum
- Add a VMPodScrape for the sriov exporter:
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
annotations:
meta.helm.sh/release-name: seriov-exporter
meta.helm.sh/release-namespace: monitoring-system
labels:
app.kubernetes.io/name: seriov-exporter
name: seriov-exporter
namespace: monitoring-system
spec:
namespaceSelector:
any: true
podMetricsEndpoints:
- interval: 5m
path: /metrics
port: prometheus
relabelConfigs:
- action: replace
regex: (.*)
replacement: $1
sourceLabels:
- __meta_kubernetes_pod_node_name
targetLabel: node
selector:
matchLabels:
app.kubernetes.io/name: sriov-metrics-exporter
- Node NUMA memory metrics: modify the node-exporter DaemonSet to add the following flag, and add the recording rules below it (see the linked commit; a quick check of the raw series is sketched after the rules):
- --collector.meminfo_numa
# https://rigitlab.gientech.com/P6085080/components-helmchart/-/commit/3772a900e65f370fd34e2f1322b33a988518c65e
- expr: sum by (node, numa) (label_replace(label_replace(node_memory_numa_MemUsed,"numa",
"$1", "node", "(.*)"),"node", "$1", "instance", "(.*)"))
record: node:node_numa_memory_bytes_used:sum
- expr: sum by (node, numa) (label_replace(label_replace(node_memory_numa_MemTotal,"numa",
"$1", "node", "(.*)"),"node", "$1", "instance", "(.*)"))
record: node:node_numa_memory_bytes_total:sum
- expr: node:node_numa_memory_bytes_used:sum / node:node_numa_memory_bytes_total:sum
record: node:node_numa_memory_utilization
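After node-exporter restarts with the flag, the raw per-NUMA series should be visible on its endpoint (9100 is the default node-exporter port and <node-ip> is a placeholder; adjust for this environment):
curl -s http://<node-ip>:9100/metrics | grep node_memory_numa_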
- Add a dcgm-exporter recording rule:
#https://rigitlab.gientech.com/P6085080/components-helmchart/-/commit/f5c627eed5a806804d46f8a60025e6a4a3910b3b
- expr: |
sum by (uuid, driver, deviceType, node) ( label_replace (label_replace(label_replace(floor (DCGM_FI_DEV_FB_USED * 1024) * on (namespace, pod) group_left(node, host_ip, role) node_namespace_pod:kube_pod_info:, "driver", "$1", "DCGM_FI_DRIVER_VERSION", "(.+)" ), "uuid", "$1", "UUID", "(.+)" ), "deviceType", "$1", "modelName", "(.+)"))
record: gpu:nv_memory_usage_in_byte:sum
Kueue update
1. Configuration:
# update integrations
integrations:
frameworks:
- batch/job
- kubeflow.org/mpijob
- ray.io/rayjob
- ray.io/raycluster
- jobset.x-k8s.io/jobset
- trainer.kubeflow.org/trainjob
- kubeflow.org/paddlejob
- kubeflow.org/pytorchjob
- kubeflow.org/tfjob
- kubeflow.org/xgboostjob
- kubeflow.org/jaxjob
- workload.codeflare.dev/appwrapper
- pod
- deployment
- statefulset
# adjust the kueue flags: change feature-gates to
--feature-gates=TopologyAwareScheduling=true,VisibilityOnDemand=false,ElasticJobsViaWorkloadSlices=true
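In a standard kueue install, the integrations block lives in the manager ConfigMap and the feature gates are flags on the controller Deployment; the names below follow the upstream kueue manifests and are assumptions for this environment:
kubectl -n kueue-system edit configmap kueue-manager-config        # integrations block
kubectl -n kueue-system edit deployment kueue-controller-manager   # --feature-gates argument
kubectl -n kueue-system rollout restart deployment/kueue-controller-manager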
2. CRD
1. Delete the old kueue Topology CRD: kubectl delete crd topologies.kueue.x-k8s.io
2. kubectl apply -f ./topologies.yaml
3. Latest Kueue image (a rollout sketch follows):
10.62.48.94:30085/cap-system/kueue:v0.14.0-devel-329-g19ebce029-1
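A sketch for rolling out the new image (deployment and container names per the upstream kueue install; adjust to this environment):
kubectl -n kueue-system set image deployment/kueue-controller-manager \
  manager=10.62.48.94:30085/cap-system/kueue:v0.14.0-devel-329-g19ebce029-1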
Create the default Topology
apiVersion: kueue.x-k8s.io/v1beta1
kind: Topology
metadata:
name: "default"
spec:
levels:
- nodeLabel: "topology.kubegien.org/switch"
- nodeLabel: "topology.kubegien.org/rack"
- nodeLabel: "kubernetes.io/hostname"
---
# save the YAML above to a file and create it with kubectl create -f
# Note: in environments that already have ResourceFlavors, `topologyName` can be added as long as it is not yet set; once set, it cannot be changed.
# When updating an environment, modify the existing ResourceFlavors to add topologyName: default (a patch sketch follows)
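A sketch of adding the topology to an existing ResourceFlavor that does not have one yet (the flavor name default-flavor is a placeholder):
kubectl patch resourceflavor default-flavor --type merge -p '{"spec":{"topologyName":"default"}}'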
Create the infraNode resources
kubectl create -f https://rigitlab.gientech.com/cprd/ctl-icm/components-helmchart/-/commit/5564572016e6f4253c8fd24ef4d10863c4a8e872
Limitations
- When topology-aware scheduling is enabled for a partition, i.e. the topologyName field has been added to its ResourceFlavor, services using that partition's resource pool must declare a node topology preference. This affects existing services: running business workloads need to be updated, e.g. by adding the following annotation by default (a sketch follows):
kueue.x-k8s.io/podset-preferred-topology: topology.kubegien.org/switch
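A sketch of what this looks like on a workload's pod template (the Job, queue name and image are illustrative only):
apiVersion: batch/v1
kind: Job
metadata:
  name: tas-example                      # illustrative name
  labels:
    kueue.x-k8s.io/queue-name: default   # assumed LocalQueue name
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-preferred-topology: topology.kubegien.org/switch
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sleep", "30"]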
When deploying, label the nodes with:
node.koordinator.sh/numa-topology-policy: Restricted
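For example (the node name is a placeholder):
kubectl label node <node-name> node.koordinator.sh/numa-topology-policy=Restricted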
A complete example: create a Deployment -> Service -> Ingress and expose the service through the nginx IngressClass.
Complete YAML configuration
1. Create the Deployment
deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-doc-deployment
namespace: default
labels:
app: api-doc
version: v1
spec:
replicas: 3
selector:
matchLabels:
app: api-doc
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
template:
metadata:
labels:
app: api-doc
version: v1
spec:
containers:
- name: api-doc
image: swaggerapi/swagger-ui:v5.9.0 # Swagger UI used as an example
# replace with your own API documentation image if you have one
# image: your-api-doc-image:tag
ports:
- containerPort: 8080
name: http
env:
- name: SWAGGER_JSON
value: "/api-docs/openapi.json"
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
readinessProbe:
httpGet:
path: /
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
livenessProbe:
httpGet:
path: /
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
2. Create the Service
service.yaml:
apiVersion: v1
kind: Service
metadata:
name: api-doc-service
namespace: default
labels:
app: api-doc
spec:
type: ClusterIP
selector:
app: api-doc
ports:
- name: http
port: 80
targetPort: 8080
protocol: TCP
3. Create the Ingress (using the nginx IngressClass)
ingress.yaml:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-doc-ingress
namespace: default
annotations:
# annotations for the Nginx Ingress controller
nginx.ingress.kubernetes.io/rewrite-target: /
nginx.ingress.kubernetes.io/ssl-redirect: "false" # for plain HTTP
# if using HTTPS, use the following instead
# nginx.ingress.kubernetes.io/ssl-redirect: "true"
# nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
# other useful annotations
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
spec:
ingressClassName: nginx # use the nginx IngressClass
rules:
- host: api-doc.example.com # replace with your domain
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api-doc-service
port:
number: 80
4. Create all the resources
# create all resources
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
# or put all the manifests into a single file
cat <<EOF > api-doc-all.yaml
# paste the three YAML manifests above here, separated by ---
EOF
kubectl apply -f api-doc-all.yaml
Verify the configuration
# check the status of all the resources
kubectl get deployment,service,ingress -l app=api-doc
# detailed status
kubectl describe deployment api-doc-deployment
kubectl describe service api-doc-service
kubectl describe ingress api-doc-ingress
# Pod status
kubectl get pods -l app=api-doc
# logs
kubectl logs -l app=api-doc --tail=50
Test access
# for local testing, use port forwarding
kubectl port-forward service/api-doc-service 8080:80
# then open http://localhost:8080
# or go through the Ingress controller's LoadBalancer IP
kubectl get svc -n ingress-nginx # get the external IP of the nginx ingress controller
# test with curl (edit the hosts file or pass a Host header)
curl -H "Host: api-doc.example.com" http://<INGRESS-EXTERNAL-IP>/
Important notes
- IngressClass prerequisites:
  - The Nginx Ingress Controller must be installed first.
  - Make sure the nginx IngressClass exists: kubectl get ingressclass
- Installing the Nginx Ingress Controller (if it is missing):
# using Helm (recommended)
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace
# or with kubectl
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.2/deploy/static/provider/cloud/deploy.yaml
- DNS/domain configuration: point api-doc.example.com in DNS to the Ingress controller's external IP (or add a hosts entry for local testing).