K8s && K3s: Modifying an Application's Startup Configuration by Pod Name

Modifying the K3s configuration of an application installed on TrueNAS

Environment:

Linux:
	Linux version 6.1.74-production+truenas (root@tnsbuilds01.tn.ixsystems.net) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #2 SMP PREEMPT_DYNAMIC Wed Feb 21 20:30:38 UTC 2024

TrueNAS:
	TrueNAS-SCALE-23.10.2

K3s:
	Client Version: v1.26.6+k3s-e18037a7-dirty
	Kustomize Version: v4.5.7
	Server Version: v1.26.6+k3s-e18037a7-dirty

Date:
	June 29, 2024

Problem overview:

While using TrueNAS I noticed that the search and image-recognition features of the Immich photo-management app, installed from the application catalog, did not work. The cause turned out to be a failure in Immich's machine-learning service: its logs showed that the models it uses could not be downloaded automatically because of network problems, so they could not be loaded. The model files therefore had to be downloaded manually and imported into the machine-learning service.

After some trial and error it also became clear that copying the model files into the pod with k3s kubectl cp does not persist: whenever TrueNAS or K3s restarts, the pod is recreated and the copied models are gone. The fix is to mount a persistent volume from the NAS into the pod, so that the pod uses model files stored on the NAS and they no longer disappear when the pod restarts.

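For reference, the non-persistent approach looked roughly like the sketch below (the local directory ./model-cache is an example, and the pod name must be replaced with the one shown by k3s kubectl get pods; anything copied this way lives only inside the pod and is lost when the pod is recreated):

# copy locally downloaded model files into the running pod's cache directory (not persistent)
k3s kubectl cp ./model-cache ix-immich/immich-machinelearning-86d4f98657-dtbr6:/mlcache
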
Implementation plan:

Overview:

Use the pod name to find the application's controller (Deployment), export the controller's configuration to a local file, and modify it. You can change anything you need, including but not limited to the volumes; this article modifies the volumes, so for other kinds of changes please look up the relevant parameters yourself. Once the file has been modified, apply it.

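As a side note, the same kind of change can also be made directly with kubectl edit, which opens the live Deployment in an editor and applies the change on save; the export-modify-apply approach used below is equivalent but leaves you with a local copy of the file:

# edit the Deployment in place (an alternative to exporting and re-applying)
k3s kubectl edit deployment immich-machinelearning -n ix-immich
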
Implementation steps:

Note: the steps below follow the case from the problem description, i.e. they modify the K3s startup configuration of the immich machine-learning pod.
  1. Find the name of the controller that manages the pod to be modified

    1. Switch to the root user

      root@truenas[~]# sudo -i
      
    2. Look up the controller name

    root@truenas[~]# k3s kubectl get deployments -n ix-immich
    NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
    immich-postgres          1/1     1            1           3d23h
    immich-redis             1/1     1            1           3d23h
    immich                   1/1     1            1           3d23h
    immich-machinelearning   1/1     1            1           3d23h  # this is the controller (Deployment) we are after in this case
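
If you only know the pod name (as the title of this post suggests), the owning controller can also be looked up from the pod itself. A minimal sketch, using the pod name that appears later in this article as an example:

    # list the pods in the namespace to find the exact pod name
    root@truenas[~]# k3s kubectl get pods -n ix-immich
    # a pod's owner is a ReplicaSet named <deployment>-<hash>; this prints that name
    root@truenas[~]# k3s kubectl get pod immich-machinelearning-86d4f98657-dtbr6 -n ix-immich -o jsonpath='{.metadata.ownerReferences[0].name}'
    # prints e.g. immich-machinelearning-86d4f98657 -> drop the trailing hash to get the Deployment name: immich-machinelearning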
    
  2. Export the controller configuration to a local file

root@truenas[~]# k3s kubectl get deployment immich-machinelearning  -n ix-immich -o yaml > deployment.yaml
root@truenas[~]# ls
deployment.yaml  my-deployment.yaml  samba  tdb
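Before editing, it can be worth keeping an untouched copy of the exported file so the change is easy to roll back if something goes wrong:

root@truenas[~]# cp deployment.yaml deployment.yaml.bak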
  3. Modify the volumes section of the configuration file (the full exported file follows; only the volumes block near the end is changed)
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    meta.helm.sh/release-name: immich
    meta.helm.sh/release-namespace: ix-immich
  creationTimestamp: "2024-06-25T12:27:28Z"
  generation: 4
  labels:
    app: immich-4.0.3
    app.kubernetes.io/instance: immich
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: immich
    app.kubernetes.io/version: 1.106.4
    helm-revision: "3"
    helm.sh/chart: immich-4.0.3
    release: immich
  name: immich-machinelearning
  namespace: ix-immich
  resourceVersion: "1968579"
  uid: 78bb9425-18a8-4be9-b033-f0252ad46737
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app.kubernetes.io/instance: immich
      app.kubernetes.io/name: immich
      pod.name: machinelearning
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        rollme: eKH7i
      creationTimestamp: null
      labels:
        app: immich-4.0.3
        app.kubernetes.io/instance: immich
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: immich
        app.kubernetes.io/version: 1.106.4
        helm-revision: "3"
        helm.sh/chart: immich-4.0.3
        pod.name: machinelearning
        release: immich
    spec:
      automountServiceAccountToken: false
      containers:
      - env:
        - name: TZ
          value: Asia/Shanghai
        - name: UMASK
          value: "002"
        - name: UMASK_SET
          value: "002"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
        - name: PUID
          value: "568"
        - name: USER_ID
          value: "568"
        - name: UID
          value: "568"
        - name: PGID
          value: "568"
        - name: GROUP_ID
          value: "568"
        - name: GID
          value: "568"
        envFrom:
        - configMapRef:
            name: immich-ml-config
        image: altran1502/immich-machine-learning:v1.106.4
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /ping
            port: 32002
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: immich
        ports:
        - containerPort: 32002
          name: machinelearning
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /ping
            port: 32002
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 2
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "2"
            memory: 4Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: 10m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: false
          runAsGroup: 0
          runAsNonRoot: false
          runAsUser: 0
          seccompProfile:
            type: RuntimeDefault
        startupProbe:
          failureThreshold: 60
          httpGet:
            path: /ping
            port: 32002
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 2
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /mlcache
          name: mlcache
      dnsConfig:
        options:
        - name: ndots
          value: "2"
      dnsPolicy: ClusterFirst
      enableServiceLinks: false
      initContainers:
      - command:
        - /bin/ash
        - -c
        - |-
          echo "Pinging [http://immich:30041/api/server-info/ping] until it is ready..."
          until wget --spider --quiet --timeout=3 --tries=1 "http://immich:30041/api/server-info/ping"; do
            echo "Waiting for [http://immich:30041/api/server-info/ping] to be ready..."
            sleep 2
          done
          echo "URL [http://immich:30041/api/server-info/ping] is ready!"
        env:
        - name: TZ
          value: Asia/Shanghai
        - name: UMASK
          value: "002"
        - name: UMASK_SET
          value: "002"
        - name: NVIDIA_VISIBLE_DEVICES
          value: void
        - name: S6_READ_ONLY_ROOT
          value: "1"
        image: bash:4.4.23
        imagePullPolicy: IfNotPresent
        name: immich-init-wait-url
        resources:
          limits:
            cpu: "2"
            memory: 4Gi
          requests:
            cpu: 10m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
          runAsGroup: 568
          runAsNonRoot: true
          runAsUser: 568
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
      runtimeClassName: nvidia
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 568
        fsGroupChangePolicy: OnRootMismatch
        supplementalGroups:
        - 44
        - 107
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
#        - emptyDir: {}
#          name: mlcache
# replace the original mlcache emptyDir volume above with the configuration below
        - hostPath:
            path: /mnt/ssd/data/model-cache # path on the NAS where the model files are stored
            type: ""
          name: mlcache # must match the name used in the volumeMounts section above
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-06-25T12:27:28Z"
    lastUpdateTime: "2024-06-29T09:11:02Z"
    message: ReplicaSet "immich-machinelearning-754d78cf48" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-06-29T09:49:17Z"
    lastUpdateTime: "2024-06-29T09:49:17Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 4
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
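
Note that a file exported with -o yaml also contains read-only fields such as status, resourceVersion and uid; kubectl apply usually tolerates them, but removing them (especially resourceVersion) keeps the file clean and avoids potential conflict errors. Also, although the problem description above mentions an NFS volume, the configuration used here is a hostPath volume, which works because the K3s node is the NAS itself. If the model files lived on a separate NFS server instead, the volume entry would look roughly like this sketch (the server address and export path are placeholders):

      volumes:
        - name: mlcache              # must still match the name under volumeMounts
          nfs:
            server: 192.168.1.10          # placeholder: NFS server address
            path: /export/model-cache     # placeholder: exported directory on that server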

  4. Apply the configuration file

    root@truenas[~]# k3s kubectl apply -f deployment.yaml
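
Applying the file triggers a new rollout (the strategy is Recreate, so the old pod is terminated and a new one is created). Optionally wait for the rollout to finish:

    root@truenas[~]# k3s kubectl rollout status deployment/immich-machinelearning -n ix-immich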
    
  5. Verify that it worked
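
The check below is done from a shell inside the new pod. A minimal sketch for getting there (the pod name changes on every rollout, so look it up first; use /bin/sh if bash is not available in the image):

    root@truenas[~]# k3s kubectl get pods -n ix-immich | grep machinelearning
    root@truenas[~]# k3s kubectl exec -it immich-machinelearning-86d4f98657-dtbr6 -n ix-immich -- /bin/bash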

    root@immich-machinelearning-86d4f98657-dtbr6:/mlcache# ls
    clip  facial-recognition  image-classification  models--M-CLIP--XLM-Roberta-Large-Vit-B-16Plus  models--google--vit-base-patch16-224  models--immich-app--XLM-Roberta-Large-Vit-B-16Plus  version.txt
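
Beyond checking that the files are visible, the machine-learning service's logs can confirm that the models are now loaded from the mounted cache (the exact output will vary):

    root@truenas[~]# k3s kubectl logs deployment/immich-machinelearning -n ix-immich --tail=50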
    

As you can see, the model files stored on the NAS are now mounted inside the container.

Command summary:

# switch to the root user
sudo -i
# list the controller (Deployment) names; the value after -n is the namespace - mine is ix-immich, adjust to your setup
k3s kubectl get deployments -n ix-immich
# export the controller configuration to a local file for editing
k3s kubectl get deployment immich-machinelearning -n ix-immich -o yaml > deployment.yaml
# apply the modified configuration file
k3s kubectl apply -f deployment.yaml

Acknowledgements

Thank you for taking the time to read this article. If it helped you, a quick like would be appreciated~~

posted @ 2024-06-29 20:26  codeHi