Deploying Slurm
The cluster layout: one control node (master) and two compute nodes.

```
master
├── node01
└── node02
```
Munge
Munge (MUNGE Uid 'N' Gid Emporium) is an authentication service designed specifically for HPC clusters: it creates and validates credentials used between cluster nodes. It is a core dependency of Slurm and other HPC software.
- Install Munge on all nodes:

```
sudo apt install munge libmunge-dev libmunge2
```

  - munge: the main package, provides the munged daemon
  - libmunge2: the Munge runtime library
  - libmunge-dev: development headers
- Generate the Munge key on the master node:

```
sudo -u munge /usr/sbin/mungekey --verbose  # usually created automatically when the package is installed
```
- Copy the key to all worker nodes:

```
sudo scp /etc/munge/munge.key node01:/etc/munge/
sudo scp /etc/munge/munge.key node02:/etc/munge/
```
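munged refuses to start if the key file's ownership or permissions are too permissive, so after copying it is worth normalizing them on each worker (a minimal sketch; the munge user and group are created by the Ubuntu package):

```
# Run on each worker: the key must be owned by the munge user
# and unreadable by anyone else, or munged will not start.
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key
```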
- Start the Munge service on all nodes:

```
sudo systemctl enable --now munge
```
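Munge credentials are time-stamped, so node clocks should be kept in sync (e.g. with NTP). A quick way to verify the setup is to encode a credential and decode it on a remote node; `unmunge` ships with the munge package:

```
# Encode and decode locally on the master.
munge -n | unmunge

# Encode on the master, decode on a worker; success confirms
# the shared key and clock synchronization are both OK.
munge -n | ssh node01 unmunge
```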
See: Installation Guide · dun/munge Wiki
Slurm
- Install the Slurm tools on all nodes:

```
sudo apt install slurm-wlm
```

  - slurm-wlm: base package and tools
- Install the controller on the master node:

```
sudo apt install slurmctld
```

  - slurmctld: the control daemon, responsible for scheduling and management
- Synchronize the configuration files across all nodes:

Hostname configuration:

```
sudoedit /etc/hosts
```

```
192.168.1.10 master
192.168.1.11 node01
192.168.1.12 node02
```

Slurm main configuration file:
```
sudoedit /etc/slurm/slurm.conf
```

```
# Basic information
ClusterName=mycluster
SlurmctldHost=master

# Authentication
AuthType=auth/munge
CryptoType=crypto/munge

# Scheduler
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# Cgroup-based process tracking and task control
# (the Constrain* switches themselves belong in cgroup.conf)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# Generic resource (GRES) types
GresTypes=gpu

# Node definitions
NodeName=node01 CPUs=4 RealMemory=3600 Gres=gpu:8 State=UNKNOWN
NodeName=node02 CPUs=4 RealMemory=3600 Gres=gpu:8 State=UNKNOWN

# Partition definition
PartitionName=gpu Nodes=node01,node02 Default=YES MaxTime=INFINITE State=UP

# Logging
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid

# State directories
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
```

The value of CPUs can be determined with the `nproc` command.
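Rather than working out CPUs and RealMemory by hand, running `slurmd -C` on a compute node prints a ready-made NodeName line for the hardware it detects; the output below is illustrative for this cluster's 4-core, 3.6 GB nodes:

```
$ slurmd -C
NodeName=node01 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=3600
```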
GRES (generic resource) configuration file:
```
sudoedit /etc/slurm/gres.conf
```

```
# The 8 GPUs on node01
NodeName=node01 Name=gpu File=/dev/nvidia0
NodeName=node01 Name=gpu File=/dev/nvidia1
NodeName=node01 Name=gpu File=/dev/nvidia2
NodeName=node01 Name=gpu File=/dev/nvidia3
NodeName=node01 Name=gpu File=/dev/nvidia4
NodeName=node01 Name=gpu File=/dev/nvidia5
NodeName=node01 Name=gpu File=/dev/nvidia6
NodeName=node01 Name=gpu File=/dev/nvidia7
# The 8 GPUs on node02
NodeName=node02 Name=gpu File=/dev/nvidia0
NodeName=node02 Name=gpu File=/dev/nvidia1
NodeName=node02 Name=gpu File=/dev/nvidia2
NodeName=node02 Name=gpu File=/dev/nvidia3
NodeName=node02 Name=gpu File=/dev/nvidia4
NodeName=node02 Name=gpu File=/dev/nvidia5
NodeName=node02 Name=gpu File=/dev/nvidia6
NodeName=node02 Name=gpu File=/dev/nvidia7
```
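gres.conf also accepts numeric ranges in File=, as documented in the gres.conf man page, so the sixteen lines above can be collapsed to two:

```
# Equivalent shorthand using device-file ranges
NodeName=node01 Name=gpu File=/dev/nvidia[0-7]
NodeName=node02 Name=gpu File=/dev/nvidia[0-7]
```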
Cgroup constraint configuration:

```
sudoedit /etc/slurm/cgroup.conf
```

```
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedDevicesFile=/etc/slurm/allowed_devices.conf
```

Cgroup device whitelist:
```
sudoedit /etc/slurm/allowed_devices.conf
```

```
/dev/null
/dev/urandom
/dev/zero
/dev/console
/dev/tty*
/dev/pts/*
```

Enable slurmctld on the master:
```
sudo install -d -o slurm -g slurm -m 755 /var/spool/slurm/ctld
sudo systemctl enable --now slurmctld
```
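To confirm the controller is alive, `scontrol ping` reports whether slurmctld answers; checking the unit and the log configured in slurm.conf above is the usual fallback:

```
scontrol ping                           # reports UP/DOWN for the slurmctld host
sudo systemctl status slurmctld         # unit state
sudo tail /var/log/slurm/slurmctld.log  # startup errors, if any
```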
- Install the compute daemon on the workers:

```
sudo apt install slurmd
```

  - slurmd: the compute-node daemon, executes jobs
Enable slurmd on the workers:

```
sudo install -d -o slurm -g slurm -m 755 /var/spool/slurm/d
sudo systemctl enable --now slurmd
```
- Check node status from the master:

```
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu*         up   infinite      2   idle node[01-02]
```
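As a smoke test, run a trivial job across both nodes, then a GPU job through the gres plugin. `srun` and `--gres` are standard Slurm; `nvidia-smi` assumes the NVIDIA driver is installed on the workers:

```
srun -N2 hostname             # one task per node; should print node01 and node02
srun --gres=gpu:1 nvidia-smi  # with ConstrainDevices=yes, shows only the allocated GPU

# If a node is stuck in "down" after a daemon restart:
# sudo scontrol update NodeName=node01 State=RESUME
```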
MariaDB
- Install dependencies:

```
sudo apt install mariadb-client mariadb-server libmariadb-dev-compat libmariadb-dev libmariadb3
```
- Start MariaDB and set the root password:

```
sudo systemctl enable --now mariadb
sudo mysql_secure_installation
```
创建 Slurm 数据库和用户:
sudo mysql -u rootCREATE DATABASE slurm_acct_db; CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'your_password'; GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost'; FLUSH PRIVILEGES; EXIT;
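This database backs Slurm's accounting daemon, slurmdbd, which this section stops short of configuring. As a minimal sketch, assuming slurmdbd runs on master alongside the database and the password matches the grant above, /etc/slurm/slurmdbd.conf would look roughly like:

```
# Sketch only; values mirror the database created above.
AuthType=auth/munge
DbdHost=master
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=your_password   # replace with the real password
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
```

With the slurmdbd package installed and the service enabled, accounting is switched on by pointing slurm.conf at it (AccountingStorageType=accounting_storage/slurmdbd, AccountingStorageHost=master); treat this as a starting point rather than a verified configuration.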
