Deploying PBS-Torque on a Virtualized Cluster

1. Overview

This post describes how to deploy, configure, and use the PBS scheduling system (Torque) on a CentOS 7 cluster.

CentOS 7 version: CentOS Linux release 7.9.2009 (Core)

PBS scheduler version: torque-6.1.3.tar.gz

Cluster information:

Three virtual nodes are used to deploy the PBS system.

Node name   Node IP          Node role                      Node services
node16      192.168.80.16    management node, login node    pbs_server, pbs_sched
node17      192.168.80.17    compute node                   pbs_mom
node18      192.168.80.18    compute node                   pbs_mom

This post covers only the basic deployment, configuration, and use of the Torque scheduler. More advanced topics, such as the MPI environment, graphical interfaces, GPU scheduling, Munge authentication, pulling accounting data from a database, and high-availability configuration, are not explored in detail here and will be filled in later.

2. Deployment

A cluster normally needs synchronized time and unified identity authentication across all nodes; those two steps are not covered in this post.

node16 through node18 in this post already share a global identity through LDAP and SSSD.

We also use a software installation directory on node16 as the shared directory, exported to node17 and node18.

2.1 Creating and mounting the shared directory

On node16, run mkdir -p /hpc/torque/6.1.3; this directory holds the Torque installation and is shared with the other nodes.

Run mkdir -p /hpc/packages/ as the working directory for building Torque.

Edit the NFS exports file /etc/exports with the following content:

/hpc 192.168.80.0/24(rw,no_root_squash,no_all_squash)
/home 192.168.80.0/24(rw,no_root_squash,no_all_squash)

Run systemctl start nfs && systemctl enable nfs to start the NFS service and enable it at boot.

Run exportfs -r so the exported shares take effect immediately.
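Optionally, verify the exports before mounting; exportfs -v on the server and showmount -e from a client are standard NFS tools for this:

exportfs -v                      # on node16: list what is currently exported, with options
showmount -e 192.168.80.16       # on node17/node18: list the exports offered by node16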

On node17 and node18, run:

mkdir -p /hpc
mount.nfs 192.168.80.16:/hpc /hpc
mount.nfs 192.168.80.16:/home /home
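The mount.nfs commands above do not survive a reboot. A minimal sketch of /etc/fstab entries on node17 and node18 that would make the mounts persistent (the defaults,_netdev options are an assumption; adjust them to your environment):

# /etc/fstab on node17 and node18 (sketch)
192.168.80.16:/hpc   /hpc   nfs  defaults,_netdev  0 0
192.168.80.16:/home  /home  nfs  defaults,_netdev  0 0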

2.2 Deploying the Torque software

Download

wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.3.tar.gz

Extract

Run tar -zxvf torque-6.1.3.tar.gz -C /hpc/packages to extract the sources into the working directory used to build and install Torque.

Configure the build

# 1. Install the build dependencies with yum
yum -y install libxml2-devel boost-devel openssl-devel libtool readline-devel pam-devel hwloc-devel numactl-devel tcl-devel tk-devel
# yum groupinstall "GNOME Desktop" "Graphical Administrator Tools"    # only needed when configuring with --enable-gui
# 2. Pass the build and install options to configure
./configure \
	--prefix=/hpc/torque/6.1.3 \
	--mandir=/hpc/torque/6.1.3/man \
	--enable-cgroups \
	--enable-syslog \
	--enable-drmaa \
	--enable-gui \
	--with-xauth \
	--with-hwloc \
	--with-pam \
	--with-tcl \
	--with-tk
	# --enable-numa-support    # this option also requires editing mom.layout; the rules are unclear, so it is dropped for now
# 3. To be updated: support for MPI, GPUs, Munge authentication, high availability, etc. may be added in a later revision of this post

When configure finishes, it prints:

Building components: server=yes mom=yes clients=yes
                     gui=yes drmaa=yes pam=yes
PBS Machine type    : linux
Remote copy         : /usr/bin/scp -rpB
PBS home            : /var/spool/torque
Default server      : node13

Unix Domain sockets : 
Linux cpusets       : no
Tcl                 : -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
Tk                  : -L/usr/lib64 -ltk8.5 -lX11 -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
Authentication      : trqauthd

Ready for 'make'.

Compile and install

# 1. Compile and install
make -j4 && make install
# 2. Generate self-extracting install packages (optional)
# make packages

The output of make packages is shown below.

This step can be skipped: in this post the working directory sits on the shared filesystem, so running make install on each node is enough.

[root@node16 torque-6.1.3]# make packages
Building packages from /hpc/packages/torque-6.1.3/tpackages
rm -rf /hpc/packages/torque-6.1.3/tpackages
mkdir /hpc/packages/torque-6.1.3/tpackages
Building ./torque-package-server-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-mom-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-clients-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-gui-linux-x86_64.sh ...
Building ./torque-package-pam-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /lib64/security'
Building ./torque-package-drmaa-linux-x86_64.sh ...
libtool: install: warning: relinking `libdrmaa.la'
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-devel-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
Building ./torque-package-doc-linux-x86_64.sh ...
Done.

The package files are self-extracting packages that can be copied
and executed on your production machines.  Use --help for options.
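If you do use these packages (for example on a node without the shared filesystem), they are self-extracting shell scripts; a sketch of installing the MOM and client pieces on a compute node, assuming the --install option reported by their --help output:

# copy the packages to a compute node and install the MOM and client pieces (sketch)
scp torque-package-mom-linux-x86_64.sh torque-package-clients-linux-x86_64.sh node17:/tmp/
ssh node17 "sh /tmp/torque-package-mom-linux-x86_64.sh --install && sh /tmp/torque-package-clients-linux-x86_64.sh --install"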

Run libtool --finish /hpc/torque/6.1.3/lib

This step can also be skipped; make install already performs it by default.

[root@node16 torque-6.1.3]# libtool --finish /hpc/torque/6.1.3/lib
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/sbin" ldconfig -n /hpc/torque/6.1.3/lib
----------------------------------------------------------------------
Libraries have been installed in:
   /hpc/torque/6.1.3/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------

Next come the environment variables and the systemd unit files.

At this point, ls -lrt /usr/lib/systemd/system shows that the directory already contains:

-rw-r--r--  1 root root 1284 Oct 12 11:17 pbs_server.service
-rw-r--r--  1 root root  704 Oct 12 11:17 pbs_mom.service
-rw-r--r--  1 root root  335 Oct 12 11:17 trqauthd.service

One unit file, pbs_sched.service, is missing; copy it from /hpc/packages/torque-6.1.3/contrib/systemd into the system directory:

cp /hpc/packages/torque-6.1.3/contrib/systemd/pbs_sched.service /usr/lib/systemd/system

Running ls -lrt /etc/profile.d shows that torque.sh is already present; just run source /etc/profile to load it into the current shell.
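For reference, a profile script like this usually just prepends the install prefix to the relevant search paths. A minimal sketch of what /etc/profile.d/torque.sh is expected to contain for this prefix (the generated file may differ in detail):

# /etc/profile.d/torque.sh (sketch, assuming the /hpc/torque/6.1.3 prefix)
export PATH=/hpc/torque/6.1.3/bin:/hpc/torque/6.1.3/sbin:$PATH
export LD_LIBRARY_PATH=/hpc/torque/6.1.3/lib:$LD_LIBRARY_PATH
export MANPATH=/hpc/torque/6.1.3/man:$MANPATH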

3. Configuration

3.1 Configuring the management node

3.1.1 Adding the PBS administrative user

Here the root user is used.

Run ./torque.setup root; the script's own description is: create pbs_server database and default queue.

[root@node16 torque-6.1.3]# ./torque.setup root
hostname: node13
Currently no servers active. Default server will be listed as active server. Error  15133
Active server name: node13  pbs_server port is: 15001
trqauthd daemonized - port /tmp/trqauthd-unix
trqauthd successfully started
initializing TORQUE (admin: root)

You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?
# enter y

3.1.2 Starting the authentication service

The step in 3.1.1 already starts the trqauthd authentication daemon, which can be verified with ps aux | grep trqauthd.

A later systemctl start trqauthd will then fail because the daemon is already running, so it is recommended to kill that process first with pkill -f trqauthd and manage it through systemd from then on:

pkill -f trqauthd
systemctl start trqauthd
systemctl enable trqauthd

3.1.3 Starting the server

Declare the compute nodes: vim /var/spool/torque/server_priv/nodes

node17 np=4
node18 np=4

Then run the following commands:

systemctl status pbs_server 
systemctl start pbs_server   # if this step fails, check whether pbs_server is already running; if so, run pkill -f pbs_server and retry
systemctl enable pbs_server
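As an optional sanity check once pbs_server is up, qstat -B prints one status line per batch server:

qstat -B        # one line per server: job counts and server state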

Run qnodes to check the node information.

If the qnodes command is not found, run source /etc/profile to load the environment variables.

node17
     state = down
     power_state = Running
     np = 4
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     total_sockets = 0
     total_numa_nodes = 0
     total_cores = 0
     total_threads = 0
     dedicated_sockets = 0
     dedicated_numa_nodes = 0
     dedicated_cores = 0
     dedicated_threads = 0

node18
     state = down
     power_state = Running
     np = 4
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     total_sockets = 0
     total_numa_nodes = 0
     total_cores = 0
     total_threads = 0
     dedicated_sockets = 0
     dedicated_numa_nodes = 0
     dedicated_cores = 0
     dedicated_threads = 0
# node17 and node18 show state = down because pbs_mom has not been started on them yet

3.1.4 Starting the scheduler

On node16 you must also run systemctl start pbs_sched; otherwise every submitted job stays in the Q (queued) state.

Enable it at boot with systemctl enable pbs_sched.

3.2 Configuring the compute nodes

Section 3.1 completed the deployment of the management node node16, covering:

  • installing the dependencies with yum
  • extracting the source, configuring the build, compiling and installing
  • setting up the administrative user
  • editing the configuration files
  • starting the trqauthd, pbs_server, and pbs_sched services

What the compute nodes still need:

  • installing the dependencies with yum
  • configuring the management node (server) information
  • running the install packages, or make install
  • starting the trqauthd and pbs_mom services

Because all of this work happens in the shared directory, it is enough to run make install on node17 and node18; the remaining per-node steps are sketched below.
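A minimal sketch of those remaining steps on node17 and node18, assuming the default /var/spool/torque home directory; the server_name file and the $pbsserver directive in mom_priv/config are standard Torque settings, but check the paths against your install:

# on node17 and node18, after make install
yum -y install hwloc numactl tcl tk                              # runtime dependencies (assumed set)
echo node16 > /var/spool/torque/server_name                      # which host runs pbs_server
echo '$pbsserver node16' > /var/spool/torque/mom_priv/config     # tell pbs_mom where pbs_server lives
systemctl start trqauthd && systemctl enable trqauthd
systemctl start pbs_mom  && systemctl enable pbs_mom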

4. Usage

4.1 Viewing and activating queues

Running torque.setup in 3.1.1 adds a default queue named batch and sets some basic attributes on it.

At this point you need to run qmgr and issue active queue batch before jobs can be submitted to this queue.
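qmgr can also be driven non-interactively with -c; a short sketch of inspecting the server and the batch queue, plus the enabled/started attributes that control whether the queue accepts and runs jobs:

qmgr -c 'print server'                       # dump the full server and queue configuration
qmgr -c 'list queue batch'                   # show the attributes of the batch queue
qmgr -c 'set queue batch enabled = True'     # allow jobs to be submitted to the queue
qmgr -c 'set queue batch started = True'     # allow queued jobs to be scheduled and run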

Jobs are submitted on the management node; submitting from a compute node fails:

[liwl01@node18 ~]$ echo "sleep 120"|qsub
qsub: submit error (Bad UID for job execution MSG=ruserok failed validating liwl01/liwl01 from node18)

Submit a job on node16:

[liwl01@node16 ~]$ echo "sleep 300"|qsub
1.node16
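Piping a command into qsub is the simplest case; a real workload would normally go through a job script with #PBS directives. A small sketch (test.pbs is a hypothetical name and the resource values are only illustrative); submit it with qsub test.pbs and watch it with qstat -a -n as below:

#!/bin/bash
# test.pbs -- minimal job script sketch
#PBS -N sleep_test            # job name
#PBS -q batch                 # target queue
#PBS -l nodes=1:ppn=2         # one node, two processors per node
#PBS -l walltime=00:10:00     # wall-clock limit
#PBS -j oe                    # merge stdout and stderr into one file
cd $PBS_O_WORKDIR             # run from the directory the job was submitted from
sleep 120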

While the job is running, the S column shows state R (running):

[liwl01@node16 ~]$ qstat -a -n

node16: 
                                                                                  Req'd       Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.node16                liwl01      batch    STDIN             20038     1      1       --   01:00:00 R  00:04:17
   node17/0

After the job finishes, the S column shows state C (completed):

[liwl01@node16 ~]$ qstat -a -n

node16: 
                                                                                  Req'd       Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
1.node16                liwl01      batch    STDIN             20038     1      1       --   01:00:00 C       -- 
   node17/0

And the corresponding qnodes output:

[liwl01@node16 ~]$ qnodes 
node17
     state = free
     power_state = Running
     np = 4
     ntype = cluster
     jobs = 0/1.node16
     status = opsys=linux,uname=Linux node17 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64,sessions=20038,nsessions=1,nusers=1,idletime=2159,totmem=3879980kb,availmem=3117784kb,physmem=3879980kb,ncpus=4,loadave=0.00,gres=,netload=1039583908,state=free,varattr= ,cpuclock=Fixed,macaddr=00:00:00:80:00:17,version=6.1.3,rectime=1634101823,jobs=1.node16
     mom_service_port = 15002
     mom_manager_port = 15003
     total_sockets = 1
     total_numa_nodes = 1
     total_cores = 4
     total_threads = 4
     dedicated_sockets = 0
     dedicated_numa_nodes = 0
     dedicated_cores = 0
     dedicated_threads = 1

5. Maintenance

To be updated later.
