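Data parallelism: multiple GPUs each hold an identical copy of the model but train on different batches of data.

Model parallelism: multiple GPUs work on the same batch of data, with each GPU training a different part of the model.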




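  • Convergence indicates that the resulting model is stable and likely to be reproducible. If training diverges and every run gives a different result, the experimental results mean very little.
  • For fixed inputs, a model's output should be deterministic. When we say a model has converged, we usually mean that the training and validation loss curves show no large fluctuations, and that the fluctuations stay within an acceptable tolerance as the number of epochs keeps growing. My own understanding: convergence means the system is stable, i.e. a small change to one of the model's weights does not change the output drastically and crash the system, which is what divergence amounts to. Conversely, divergence means that a tiny change in the parameters changes the output enough to make the loss swing wildly. A model in that state has not been trained sufficiently.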






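batch: a group of samples fed to the model at one time. Training sets for neural networks are often large, with tens or even hundreds of thousands of samples; feeding all of them to the model at once would demand too much of the hardware and of the model's ability to learn. The training data is therefore split into multiple batches, and the samples in each batch go through the forward pass, loss computation, backward pass, and parameter update together. Note that the word batch on its own is used less often; most of the time people only care about the batch size.

batch size: the number of samples in each group fed to the model during training. As noted above, training data is usually split into multiple batches; the batch size specifies how many samples each batch contains.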




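Suppose we have a training dataset (not including the test set) with 1500 samples. Training on all 1500 samples once is one epoch. Because the dataset is relatively large (1500 samples is certainly not large in neural network research, but this is only an example), we split it into batches and train batch by batch. If each batch contains 100 samples, it takes 15 batches to go through all the data: here the batch size is 100 and the number of batches is 15. Finishing one batch completes one step (also called an iteration), so the step count and the iteration count are both 15.

The above covers a single pass over the dataset (1 epoch). In practice we train for multiple epochs. Suppose we train for 3 epochs, i.e. go through the 1500 samples 3 times. The step and iteration counts scale with the number of epochs: both become 45 (15 * 3). The number of batches, however, is still 15, because it is counted within a single epoch and does not depend on how many epochs are run.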
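The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes a final partial batch is kept rather than dropped (which makes no difference here, since 1500 divides evenly by 100).

```python
import math


def training_schedule(num_samples: int, batch_size: int, epochs: int) -> dict:
    """Return the number of batches per epoch and the total number of steps."""
    # A final, smaller batch still counts as one batch, hence the ceiling.
    batches_per_epoch = math.ceil(num_samples / batch_size)
    # One step (iteration) is one batch, so steps accumulate across epochs.
    total_steps = batches_per_epoch * epochs
    return {"batches_per_epoch": batches_per_epoch, "total_steps": total_steps}


print(training_schedule(num_samples=1500, batch_size=100, epochs=1))
print(training_schedule(num_samples=1500, batch_size=100, epochs=3))
```

The number of batches per epoch stays at 15 in both calls, while the total step count grows from 15 to 45.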



https://zhuanlan.zhihu.com/p/428448728 [ReLU activation function]






★ Installing DeepSpeed on Windows 10 (2024-03-06 11:05:44)

  • Open cmd under the SYSTEM account
  • NameError: name '_C' is not defined: fixed by installing the latest torch (2.2.1 at the time)
  • [ERROR]  Unable to pre-compile async_io






The flowchart of a typical GPU job is as follows:


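GPU Memory-Usage is most directly affected by the model size and the batch size. The model's contribution comes from its parameter count (network depth, width, and so on), but the architecture is usually fixed by the time training starts and is rarely changed afterwards. In practice, Memory-Usage is therefore controlled mainly through the batch size. For example, a batch size of 12 gave about 40% Memory-Usage and a batch size of 24 gave about 80%, close to a 2x relationship. So with a fixed architecture, set the batch size as large as memory allows to make full use of GPU memory. (The GPU finishes the data it is given quickly; the main bottleneck for training time is often CPU-side data throughput.)

On Volatile GPU-Util (GPU utilization)


This is the Volatile GPU-Util field. When the number of CPU threads is poorly configured, the value keeps jumping: 0%, 20%, 70%, 95%, 0%, pausing for 1-2 seconds and then repeating. What is happening is that the GPU is waiting for data from the CPU. Once the data arrives over the bus, the GPU starts computing and utilization jumps; but the GPU is fast and finishes the data in roughly 0.5 seconds, so utilization drops again until the next batch arrives. The bottleneck for GPU utilization therefore lies in memory bandwidth, the memory hardware, and CPU performance. The best remedy is DDR4 or faster memory paired with a better CPU.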




ChatGLM2-6B needs about 6 GB of GPU memory at int4 and about 13 GB at fp16.
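These figures can be sanity-checked with a back-of-the-envelope estimate of the memory taken by the weights alone. This is a sketch under stated assumptions: the parameter count of 6.24B is the value reported in the DeepSpeed log for chatglm2-6b later in these notes, the function name is illustrative, and activations, the KV cache, the CUDA context, and any layers left unquantized are ignored, which is presumably why the real figures are higher.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


NUM_PARAMS = 6.24e9  # assumption: taken from "finished initializing model with 6.24B parameters"

for dtype in ("fp16", "int4"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB for weights only")
```

The fp16 estimate already comes close to the 12 GiB of a 3080 Ti, which is consistent with the plain fp16 model running out of memory on a single card.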

ChatGLM-6B source walkthrough: cli_demo.py

  • model = AutoModel.from_pretrained("../ChatGLM-Tuning-master/chatglm-6b", trust_remote_code=True).half().cuda(): loads the pretrained model, converts it to half precision (fp16) to reduce memory use and speed up computation, and moves it to the GPU.

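(89.51, CentOS 7) RuntimeError: Library cudart is not initialized (see the referenced article)

  • Model quantization depends on cpm-kernels, and cpm-kernels calls into libcudart.so. The following command checks whether libcudart.so can be found: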

    python -c "import ctypes.util; print(ctypes.util.find_library('cudart'))"


  • Unresolved






InfiniBand is a high-bandwidth network interconnect commonly used for communication between GPU nodes.




  • Training: DeepSpeed ZeRO training supports the full ZeRO stages 1, 2, and 3, as well as CPU/disk offload of optimizer states, gradients, and parameters.

    • Stage 1: shards the optimizer states across data-parallel workers/GPUs.
    • Stage 2: shards the optimizer states + gradients across data-parallel workers/GPUs.
    • Stage 3: shards the optimizer states + gradients + model parameters across data-parallel workers/GPUs.
    • Optimizer Offload: offloads the optimizer states + gradients to CPU/disk; builds on ZeRO Stage 2.
    • Param Offload: offloads the model parameters to CPU/disk; builds on ZeRO Stage 3.

    Note: for disk offload the disk should be NVMe to get good speed, although technically any disk works.

  • Inference: DeepSpeed ZeRO Inference supports ZeRO Stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but does not use the optimizer or the lr scheduler.
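A rough sketch of how the three training stages change per-GPU memory for the model states. It assumes mixed-precision Adam with the commonly cited accounting of 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The function name and the example numbers are illustrative and not part of DeepSpeed; activations, temporary buffers, and offload are ignored.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory for model states, in GB (1 GB = 1e9 bytes).

    stage 0 means plain data parallelism with nothing sharded.
    """
    params = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 weights copy + Adam momentum + Adam variance
    if stage >= 1:
        optim /= num_gpus       # Stage 1 shards the optimizer states
    if stage >= 2:
        grads /= num_gpus       # Stage 2 additionally shards the gradients
    if stage >= 3:
        params /= num_gpus      # Stage 3 additionally shards the parameters
    return (params + grads + optim) / 1e9


# Assumed example: a 6.24B-parameter model trained on 2 GPUs.
for stage in range(4):
    print(f"stage {stage}: {zero_model_state_gb(6.24e9, 2, stage):.1f} GB per GPU")
```

With only two GPUs the savings are modest; the sharded terms shrink in proportion to the number of data-parallel workers.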





import torch


you appear to be running an x server please exit x before installing

  /etc/init.d/lightdm stop

libcudart.so.11.0: cannot open shared object file: No such file or directory

  • conda install cudatoolkit  ❌
  • export LD_LIBRARY_PATH=

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType. (Found by debugging the source; caused by a missing file.)



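# To import a module from the parent directory, add that directory to sys.path:
import sys
sys.path.append("..")  # ".." is resolved relative to the current working directory
import xxx  # xxx is a placeholder for the module in the parent directory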

How can a model be converted to the torch.nn.Module type?

<class 'transformers_modules.chatglm2-6b-int4.modeling_chatglm.ChatGLMForConditionalGeneration'>

How can a transformers model be converted to a DeepSpeed model?

cannot import name 'HfDeepSpeedConfig' from 'transformers.integrations'


Theory: PyTorch, Transformers, DeepSpeed






★ torchrun is a python console script to the main module torch.distributed.run declared in the entry_points configuration in setup.py. It is equivalent to invoking python -m torch.distributed.run.



--nnodes=1:4 (min 1; max 4)





torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacty of 11.76 GiB of which 130.00 MiB is free. Including non-PyTorch memory, this process has 11.63 GiB memory in use. Of the allocated memory 11.36 GiB is allocated by PyTorch, and 22.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.

server2: OSError: libcudart.so.11.0: cannot open shared object file: No such file or directory

server1: RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).

server3: torch.distributed.DistNetworkError: Connection reset by peer

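  • Check that the versions of deepspeed, transformers, accelerate, torch, peft, and bitsandbytes are exactly the same on both machines

  • Changing the Python version from 3.8 to 3.10 fixed it (note: Python 3.8 itself is not necessarily the problem, since one of the three nodes runs Python 3.8)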
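One way to do the version check is to print the installed versions on each node and compare the results. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper names are made up for illustration, and copying the output between machines is left to whatever tooling is at hand.

```python
import json
import platform
from importlib import metadata

PACKAGES = ["deepspeed", "transformers", "accelerate", "torch", "peft", "bitsandbytes"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(a, b):
    """Return {package: (version_a, version_b)} for every entry that differs."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Run this on every node and compare the printed JSON.
report = {"python": platform.python_version(), **installed_versions(PACKAGES)}
print(json.dumps(report, indent=2))
```

Feeding the reports from two nodes into `diff_versions` lists only the packages whose versions disagree.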

nvidia-smi driver error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.  Make sure that the latest NVIDIA driver is installed and running.

  • Taken from a referenced article and verified to work. The fix:
sudo apt install dkms
sudo dkms install -m nvidia -v 418.87.00
Here 418.87.00 is the version of the previously installed NVIDIA driver, which can be found with:
ls /usr/src | grep nvidia

Traceback (most recent call last):
  File "/data1/tong/GLM/ChatGLM2-6B/ptuning/deploy_demo_ds.py", line 55, in <module>
    model_hidden_size = config.d_model
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/configuration_utils.py", line 261, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'ChatGLMConfig' object has no attribute 'd_model'

Loading THUDM/chatglm2-6b requires to execute some code in that repo, you can inspect the content of the repository at https://hf.co/THUDM/chatglm2-6b. You can dismiss this prompt by passing `trust_remote_code=True`.
Do you accept? [y/N] y
Traceback (most recent call last):
  File "/data1/tong/GLM/ChatGLM2-6B/ptuning/deploy_demo_ds.py", line 145, in <module>
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2670, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 797, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 769, in __init__
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 941, in _configure_train_batch_size
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 909, in _set_batch_related_parameters
    grad_acc = train_batch // micro_batch
TypeError: unsupported operand type(s) for //: 'int' and 'str'

  • model = AutoModel.from_pretrained(model_name)
  • grad_acc = train_batch // micro_batch: one of the two config entries was set to 1 and the other to "auto", which caused this error. Both must have the same (integer) type.
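The failing line divides two batch settings, so both have to be integers by the time DeepSpeed reads the config. The sketch below is a hypothetical pre-flight check, not DeepSpeed code; it relies on the documented relation train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size. As far as I understand, "auto" placeholders are only filled in when the config goes through the Hugging Face Trainer integration, so a hand-built config should use concrete numbers.

```python
def resolve_gradient_accumulation(train_batch_size, micro_batch_per_gpu, world_size):
    """Derive gradient_accumulation_steps from the other batch settings.

    Raises a readable error instead of the bare
    "unsupported operand type(s) for //: 'int' and 'str'".
    """
    settings = {
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
    }
    for name, value in settings.items():
        if not isinstance(value, int):
            raise TypeError(f"{name} must be an int, got {value!r}")
    per_step = micro_batch_per_gpu * world_size
    if train_batch_size % per_step != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"micro_batch * world_size = {per_step}"
        )
    return train_batch_size // per_step


print(resolve_gradient_accumulation(train_batch_size=2, micro_batch_per_gpu=1, world_size=2))
```

Calling it with `micro_batch_per_gpu="auto"` raises a `TypeError` that names the offending setting.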

[2024-03-05 15:09:11,736] [WARNING] [partition_parameters.py:921:_post_init_method] param `weight` in Embedding not on GPU so was not broadcasted from rank 0
[2024-03-05 15:09:12,121] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 0.27B parameters
Traceback (most recent call last):
  File "/data1/tong/GLM/ChatGLM2-6B/ptuning/deploy_demo_ds.py", line 145, in <module>
    model = AutoModel.from_pretrained(model_name,trust_remote_code=True)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2675, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 854, in __init__
    self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 741, in __init__
    self.embedding = init_method(Embedding, config, **init_kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/utils/init.py", line 53, in skip_init
    return module_cls(*args, **kwargs).to_empty(device=final_device)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 711, in __init__
    self.word_embeddings = nn.Embedding(
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 441, in wrapper
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 924, in _post_init_method
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1157, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1296, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1381, in _partition_param
NotImplementedError: Cannot copy out of meta tensor; no data!

  • model = AutoModel.from_pretrained(xxx, empty_init=False)  (fix taken from a referenced article)

[2024-03-05 15:42:56,108] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented

Traceback (most recent call last):
  File "/data1/tong/GLM/ChatGLM2-6B/ptuning/deploy_demo_ds.py", line 161, in <module>
    tokenizer = AutoTokenizer.from_pretrained(model_name)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 688, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class ChatGLMTokenizer does not exist or is not currently imported.

  • tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


server3: master25:2726734:2726734 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation


server1: master33:3278978:3278978 [0] NCCL INFO Bootstrap : Using eno1:<0>
server1: master33:3278978:3278978 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
server1: master33:3278978:3278978 [0] NCCL INFO cudaDriverVersion 12020
server1: NCCL version 2.19.3+cuda12.3
server3: master25:2726734:2726734 [0] NCCL INFO cudaDriverVersion 12020
server3: master25:2726734:2726734 [0] NCCL INFO Bootstrap : Using eno1:<0>
★server3: master25:2726734:2726734 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
server1: master33:3278978:3279686 [0] NCCL INFO Failed to open libibverbs.so[.1]
server1: master33:3278978:3279686 [0] NCCL INFO NET/Socket : Using [0]eno1:<0> [1]br-efcde779becf:<0> [2]kube-ipvs0:<0> [3]flannel.1:<0> [4]cni0:<0>
server1: master33:3278978:3279686 [0] NCCL INFO Using non-device net plugin version 0
server1: master33:3278978:3279686 [0] NCCL INFO Using network Socket
server3: master25:2726734:2726907 [0] NCCL INFO Failed to open libibverbs.so[.1]
server3: master25:2726734:2726907 [0] NCCL INFO NET/Socket : Using [0]eno1:<0> [1]kube-ipvs0:<0> [2]flannel.1:<0> [3]cni0:<0>
server3: master25:2726734:2726907 [0] NCCL INFO Using non-device net plugin version 0
server3: master25:2726734:2726907 [0] NCCL INFO Using network Socket
server3: master25:2726734:2726907 [0] NCCL INFO comm 0x90982a0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x8b900e73d3d05138 - Init START
server1: master33:3278978:3279686 [0] NCCL INFO comm 0xa167a40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x8b900e73d3d05138 - Init START
server1: master33:3278978:3279686 [0] NCCL INFO Setting affinity for GPU 0 to 55,55555555
server3: master25:2726734:2726907 [0] NCCL INFO Setting affinity for GPU 0 to 55,55555555
server1: master33:3278978:3279686 [0] NCCL INFO Channel 00/04 :    0   1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 01/04 :    0   1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 02/04 :    0   1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 03/04 :    0   1
server1: master33:3278978:3279686 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
server3: master25:2726734:2726907 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1
server3: master25:2726734:2726907 [0] NCCL INFO P2P Chunksize set to 131072
server1: master33:3278978:3279686 [0] NCCL INFO P2P Chunksize set to 131072
server1: master33:3278978:3279686 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/Socket/1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/Socket/2
server1: master33:3278978:3279686 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [receive] via NET/Socket/1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [receive] via NET/Socket/2
server1: master33:3278978:3279686 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Socket/2
server3: master25:2726734:2726907 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/Socket/1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [send] via NET/Socket/1
server3: master25:2726734:2726907 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/Socket/3
server1: master33:3278978:3279686 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [send] via NET/Socket/2
server3: master25:2726734:2726907 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [receive] via NET/Socket/1
server3: master25:2726734:2726907 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [receive] via NET/Socket/3
server3: master25:2726734:2726907 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/Socket/1
server3: master25:2726734:2726907 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/Socket/3
server3: master25:2726734:2726907 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [send] via NET/Socket/1
server3: master25:2726734:2726907 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [send] via NET/Socket/3
server1: master33:3278978:3279712 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to<56199> failed : Software caused connection abort
server1: master33:3278978:3279712 [0] NCCL INFO misc/socket.cc:565 -> 2
server1: master33:3278978:3279712 [0] NCCL INFO misc/socket.cc:587 -> 2
server1: master33:3278978:3279712 [0] NCCL INFO transport/net_socket.cc:338 -> 2
server1: master33:3278978:3279712 [0] NCCL INFO transport/net.cc:677 -> 2
server1: master33:3278978:3279686 [0] NCCL INFO transport/net.cc:304 -> 2
server1: master33:3278978:3279686 [0] NCCL INFO transport.cc:148 -> 2
server1: master33:3278978:3279686 [0] NCCL INFO init.cc:1117 -> 2
server1: master33:3278978:3279686 [0] NCCL INFO init.cc:1396 -> 2
server1: master33:3278978:3279686 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
server1: master33:3278978:3278978 [0] NCCL INFO group.cc:418 -> 2
server1: master33:3278978:3278978 [0] NCCL INFO group.cc:95 -> 2
server3: master25:2726734:2726908 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to<58433> failed : Software caused connection abort
server3: master25:2726734:2726908 [0] NCCL INFO misc/socket.cc:565 -> 2
server3: master25:2726734:2726908 [0] NCCL INFO misc/socket.cc:587 -> 2
server3: master25:2726734:2726908 [0] NCCL INFO transport/net_socket.cc:338 -> 2
server3: master25:2726734:2726908 [0] NCCL INFO transport/net.cc:677 -> 2
server3: master25:2726734:2726907 [0] NCCL INFO transport/net.cc:304 -> 2
server3: master25:2726734:2726907 [0] NCCL INFO transport.cc:148 -> 2
server3: master25:2726734:2726907 [0] NCCL INFO init.cc:1117 -> 2
server3: master25:2726734:2726907 [0] NCCL INFO init.cc:1396 -> 2
server3: master25:2726734:2726907 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
server3: master25:2726734:2726734 [0] NCCL INFO group.cc:418 -> 2
server3: master25:2726734:2726734 [0] NCCL INFO group.cc:95 -> 2
server3: master25:2726734:2726908 [0] proxy.cc:1523 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
server3: master25:2726734:2726908 [0] NCCL INFO misc/socket.cc:806 -> 3
server3: master25:2726734:2726908 [0] proxy.cc:1533 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
server3: master25:2726734:2726908 [0] proxy.cc:1557 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
server1: master33:3278978:3279712 [0] NCCL INFO misc/socket.cc:47 -> 3
server1: master33:3278978:3279712 [0] NCCL INFO misc/socket.cc:58 -> 3
server1: master33:3278978:3279712 [0] NCCL INFO misc/socket.cc:773 -> 3
server1: master33:3278978:3279712 [0] NCCL INFO proxy.cc:1374 -> 3
server1: master33:3278978:3279712 [0] NCCL INFO proxy.cc:1415 -> 3
server1: master33:3278978:3279712 [0] proxy.cc:1557 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
server3: Traceback (most recent call last):
server3:   File "/data1/tong/GLM/ChatGLM2-6B/ptuning/deploy_demo_ds.py", line 146, in <module>
server3:     model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
server3:     return model_class.from_pretrained(
server3:   File "/root/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2675, in from_pretrained
server3:     model = cls(config, *model_args, **model_kwargs)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
server3:     f(module, *args, **kwargs)
server3:   File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 854, in __init__
server3:     self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
server3:     f(module, *args, **kwargs)
server3:   File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 741, in __init__
server3:     self.embedding = init_method(Embedding, config, **init_kwargs)
server3:   File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 48, in default_init
server3:     return cls(*args, **kwargs)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
server3:     f(module, *args, **kwargs)
server3:   File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 711, in __init__
server3:     self.word_embeddings = nn.Embedding(
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 441, in wrapper
server3:     self._post_init_method(module)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 915, in _post_init_method
server3:     dist.broadcast(param, 0, self.get_dp_process_group())
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 116, in log_wrapper
server3:     return func(*args, **kwargs)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 216, in broadcast
server3:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 188, in broadcast
server3:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
server3:     return func(*args, **kwargs)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
server3:     work = default_pg.broadcast([tensor], opts)
server3: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
server3: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
server3: Last error:
server3: socketStartConnect: Connect to<58433> failed : Software caused connection abort
  • python -c "import torch;print(torch.cuda.nccl.version())" //(2, 19, 3)
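  • NCCL stands for NVIDIA Collective Communications Library. It implements collective communication primitives (all-gather, reduce, broadcast) across multiple GPUs, and NVIDIA has optimized it heavily to reach high throughput over PCIe, NVLink, and InfiniBand. NCCL 1.x supported only single-node multi-GPU setups, with GPUs communicating via PCIe, NVLink, or GPU Direct P2P. NCCL 2.x added multi-node multi-GPU support, with nodes communicating via sockets (Ethernet) or InfiniBand with GPU Direct RDMA.
  • NCCL_DEBUG=info  NCCL_SOCKET_IFNAME=eno1 NCCL_IB_DISABLE=1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr= --master_port=29500 deploy_demo_ds.py  (from a referenced article)
  • Firewall issue: check that the firewall is not blocking traffic between the nodes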
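To narrow down a suspected firewall problem, a plain TCP connection test between the nodes can help. This is a minimal sketch using only the standard library; the function name is illustrative, and the localhost call at the end is a stand-in for the real master address, which is not recorded in these notes. Note that the NCCL log above fails on ports such as 56199 and 58433, so the rendezvous port 29500 being reachable is necessary but not sufficient.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Stand-in target: replace 127.0.0.1 with the master node's address and run this
# on a worker node while the job is waiting on the master.
print(can_connect("127.0.0.1", 29500))
```

A `False` result from a worker while the master is listening points at the network path or a firewall rule rather than at DeepSpeed itself.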

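DeepSpeed + ChatGLM2-6B test conclusions:

1. ChatGLM2-6B-int4, single node, direct inference deployment: succeeded, about 4.5 GB of GPU memory used
2. ChatGLM2-6B, single node, direct inference deployment: OOM
3. ChatGLM2-6B, single node, DeepSpeed without offload: OOM
4. ChatGLM2-6B, single node, DeepSpeed ZeRO-3 with CPU offload: succeeded; about 4518 MiB (of 12288 MiB) GPU memory used, CPU usage rose noticeably, host memory usage around 16 GB
5. ChatGLM2-6B, multi-node (test environment: two 3080 Ti GPUs), DeepSpeed ZeRO-3 without offload: succeeded; GPU memory peaked at 11502 MiB (of 12288 MiB per card), network traffic about 11.5 MB/s (the maximum the existing hardware can deliver), took about 3 hours
6. ChatGLM2-6B, multi-node (test environment: two 3080 Ti GPUs), DeepSpeed ZeRO-3 with CPU offload: succeeded; GPU memory reached 8080 MiB (the highest value observed, not necessarily the true peak), network traffic about 11.5 MB/s (the maximum the existing hardware can deliver), took about 1 hour 40 minutes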


(base) [root@master33 ptuning]$hostnamectl 
   Static hostname: master33
         Icon name: computer-server
           Chassis: server
        Machine ID: 61badc2da433498cb35d496f6d7b34a8
           Boot ID: 41aee70b71564c7fbca8e2e8e62a76e4
  Operating System: Ubuntu 20.04.6 LTS
            Kernel: Linux 5.15.0-92-generic
      Architecture: x86-64

(base) [root@master33 ptuning]$pip list|grep -E "torch|deepspeed|trans|accelerate|peft|bitsandbytes"
accelerate                                   0.27.2
bitsandbytes                                 0.42.0
deepspeed                                    0.10.0
peft                                         0.9.0
s3transfer                                   0.7.0
torch                                        2.2.1
transformers                                 4.30.2


(base) [root@master33 ptuning]$cat deploy_demo_ds.py 
#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
# First you need to install deepspeed: pip install deepspeed
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
# To deploy on 1 gpu:
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
# To deploy on 2 gpus:
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py

from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

#model_name = "bigscience/T0_3B"
model_name = "THUDM/chatglm2-6b"
#model_name = "THUDM/chatglm2-6b-int4"

config = AutoConfig.from_pretrained(model_name,trust_remote_code=True)
#model_hidden_size = config.d_model
model_hidden_size = 2048

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
# - want CPU offload
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
# - which params should remain on gpus - the larger the value the smaller the offload size
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# This second assignment replaces the config above: bf16 is enabled and a checkpoint section is added.
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "checkpoint": {
        "use_node_local_storage": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    #text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    text_in = "你叫什么名字"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")
(base) [root@master25 ptuning]$deepspeed --num_gpus 1 deploy_demo_ds.py
[2024-03-07 16:05:56,460] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 16:05:57,890] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-03-07 16:05:57,890] [INFO] [runner.py:555:main] cmd = /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr= --master_port=29500 --enable_each_rank_log=None deploy_demo_ds.py
[2024-03-07 16:05:59,380] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 16:06:00,776] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-03-07 16:06:00,776] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-03-07 16:06:00,776] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-03-07 16:06:00,776] [INFO] [launch.py:163:main] dist_world_size=1
[2024-03-07 16:06:00,776] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-03-07 16:06:03,385] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 16:06:03,631] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-07 16:06:03,632] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-03-07 16:06:03,632] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-07 16:06:13,327] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:13<00:00,  1.89s/it]
[2024-03-07 16:06:26,561] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
[2024-03-07 16:06:26,571] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-03-07 16:06:26,572] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2024-03-07 16:06:26,666] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-03-07 16:06:26,667] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-03-07 16:06:26,667] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 18.56 GB, percent = 14.8%
Parameter Offload: Total persistent parameters: 362496 in 85 params
[2024-03-07 16:06:26,775] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-03-07 16:06:26,776] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-03-07 16:06:26,776] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 18.56 GB, percent = 14.8%
[2024-03-07 16:06:26,777] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   amp_enabled .................. False
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   amp_params ................... False
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fac0e4130d0>
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   communication_data_type ...... None
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   disable_allgather ............ False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   dump_state ................... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   elasticity_enabled ........... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   fp16_enabled ................. False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   global_rank .................. 0
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   loss_scale ................... 1.0
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   memory_breakdown ............. False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   mics_shard_size .............. -1
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   optimizer_name ............... None
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   optimizer_params ............. None
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   pld_enabled .................. False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   pld_params ................... False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   prescale_gradients ........... False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   scheduler_name ............... None
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   scheduler_params ............. None
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   sparse_attention ............. None
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   steps_per_print .............. 2000
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   train_batch_size ............. 1
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   use_node_local_storage ....... True
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   world_size ................... 1
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   zero_enabled ................. True
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
[2024-03-07 16:06:26,779] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": false
    }, 
    "bf16": {
        "enabled": true
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "reduce_bucket_size": 4.194304e+06, 
        "stage3_prefetch_bucket_size": 3.774874e+06, 
        "stage3_param_persistence_threshold": 2.048000e+04
    }, 
    "checkpoint": {
        "use_node_local_storage": true
    }, 
    "steps_per_print": 2.000000e+03, 
    "train_batch_size": 1, 
    "train_micro_batch_size_per_gpu": 1, 
    "wall_clock_breakdown": false
}
/root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  out=你好,我是人工智能助手。 根据问题,我们需要回答“人工智能助手”的定义
[2024-03-07 16:06:48,830] [INFO] [launch.py:347:main] Process 459826 exits successfully.
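
Note on the `--world_info` argument in the launcher command above: the value decodes as Base64-encoded JSON that maps each host to its GPU indices, which matches the `WORLD INFO DICT` lines in the logs. A minimal sketch for decoding it, using only the standard library; `decode_world_info` is a helper name made up for this note, not a DeepSpeed function, and the second string is copied from the two-node launcher command further down.

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a --world_info value (assumed here: Base64-encoded JSON)."""
    return json.loads(base64.urlsafe_b64decode(encoded).decode("utf-8"))


# Value copied from the single-node launcher command in the log above.
single_node = decode_world_info("eyJsb2NhbGhvc3QiOiBbMF19")
print(single_node)

# Value copied from the two-node launcher command.
two_nodes = decode_world_info("eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0=")
print(two_nodes)

# One process is started per listed GPU index, so the world size is the total count.
world_size = sum(len(gpus) for gpus in two_nodes.values())
print(world_size)
```

The decoded two-node value lists GPU 0 on both `server1` and `server3`, which is consistent with `dist_world_size=2` in the two-node log.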

Two-node ZeRO-3 without offload ran successfully; the script and log follow. It was slow: using 89.1 and 89.3, the run took nearly 3 hours (2 hours 53 minutes).
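
Part of the slowness is visible directly in the log timestamps. A small sketch, using timestamps copied from the two logs in these notes (the first launcher line and the first `DeepSpeed info` line of each run), compares how long each run needed before the engine was created; `startup_seconds` is a helper name invented for this note.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # e.g. 2024-03-07 16:05:56,460


def startup_seconds(first_stamp: str, engine_stamp: str) -> float:
    """Seconds between two timestamps written in the DeepSpeed log format."""
    start = datetime.strptime(first_stamp, LOG_TIME_FORMAT)
    end = datetime.strptime(engine_stamp, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# Single GPU with CPU offload: first log line -> "DeepSpeed info" line.
single_gpu = startup_seconds("2024-03-07 16:05:56,460", "2024-03-07 16:06:26,561")

# Two nodes without offload: first log line -> "DeepSpeed info" line on server1.
two_nodes = startup_seconds("2024-03-10 16:38:39,797", "2024-03-10 17:22:58,082")

print(f"single GPU: {single_gpu:.1f} s")
print(f"two nodes:  {two_nodes / 60:.1f} min")
```

In the two-node log, most of that interval lies between the `finished initializing model` line and the end of `Loading checkpoint shards` (over 42 minutes on each rank), so checkpoint loading rather than generation seems to account for this part of the delay.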

 cat deploy_demo_ds.py
#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
# First you need to install deepspeed: pip install deepspeed
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
# To deploy on 1 gpu:
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
# To deploy on 2 gpus:
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py

from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

#model_name = "bigscience/T0_3B"
model_name = "THUDM/chatglm2-6b"
#model_name = "THUDM/chatglm2-6b-int4"

config = AutoConfig.from_pretrained(model_name,trust_remote_code=True)
#model_hidden_size = config.d_model
model_hidden_size = 2048

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
# - set offload_param.device to "none" or completely remove the `offload_param` section if you
#   don't want CPU offload
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to
#   control which params should remain on gpus - the larger the value the smaller the offload size
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
# Variant with CPU parameter offload; it is overridden by the second ds_config below.
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}

# Variant without offload; this later assignment is the one in effect for this run.
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "checkpoint": {
        "use_node_local_storage": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    #text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    text_in = "你叫什么名字"  # Chinese prompt: "What is your name?"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")
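
Note on the size settings in `ds_config`: the three ZeRO-3 values are plain arithmetic on `model_hidden_size`. A small sketch of that arithmetic for the hard-coded value 2048 from the script; `zero3_sizes` is a name made up here, not a DeepSpeed API.

```python
def zero3_sizes(hidden_size: int) -> dict:
    """Evaluate the size expressions used in ds_config for a given hidden size."""
    return {
        "reduce_bucket_size": hidden_size * hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * hidden_size * hidden_size,
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }


sizes = zero3_sizes(2048)  # 2048 is the value hard-coded in the script
for key, value in sizes.items():
    print(f"{key}: {value}")

# The prefetch size is a float; truncating it gives an integer number of elements.
prefetch_elements = int(sizes["stage3_prefetch_bucket_size"])
print(prefetch_elements)
```

These results agree with the `zero_config` line of the single-GPU log earlier in these notes (`reduce_bucket_size=4194304`, `prefetch_bucket_size=3774873`, `param_persistence_threshold=20480`), where the fractional prefetch value appears as a truncated integer.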
(base) [root@master33 ptuning]$NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-03-10 16:38:39,797] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-10 16:38:41,812] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-03-10 16:38:41,813] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr= --master_port=29500 deploy_demo_ds.py
server3: [2024-03-10 16:38:43,741] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-10 16:38:43,880] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-03-10 16:38:47,756] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-10 16:38:48,029] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-03-10 16:38:48,029] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-10 16:38:48,574] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-10 16:38:48,775] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-03-10 16:38:48,775] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-10 16:38:48,775] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-03-10 16:39:34,659] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [42:07<00:00, 361.11s/it]   
Loading checkpoint shards: 100%|██████████| 7/7 [43:23<00:00, 371.91s/it]   
server1: [2024-03-10 17:22:58,082] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-03-10 17:22:58,092] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-03-10 17:22:58,093] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-03-10 17:22:58,194] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-03-10 17:22:58,195] [INFO] [utils.py:786:see_memory_usage] MA 5.82 GB         Max_MA 6.81 GB         CA 7.99 GB         Max_CA 8 GB 
server1: [2024-03-10 17:22:58,195] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 5.82 GB, percent = 4.6%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-03-10 17:22:58,310] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-03-10 17:22:58,311] [INFO] [utils.py:786:see_memory_usage] MA 5.82 GB         Max_MA 5.82 GB         CA 7.99 GB         Max_CA 8 GB 
server1: [2024-03-10 17:22:58,311] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 5.82 GB, percent = 4.6%
server1: [2024-03-10 17:22:58,312] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-03-10 17:22:58,312] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-03-10 17:22:58,312] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-03-10 17:22:58,312] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-03-10 17:22:58,312] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fd8c7460eb0>
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
server1: /data1/common/python/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server1:   warnings.warn(
server3: /root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server3:   warnings.warn(
server1: rank0:
server1:    in=你叫什么名字
server1:   out=你叫什么名字? 
server1: 答: 我是一个名为 ChatGLM2-6
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever I have ever
server3: [2024-03-10 19:31:37,252] [INFO] [launch.py:347:main] Process 496465 exits successfully.
server1: [2024-03-10 19:31:37,559] [INFO] [launch.py:347:main] Process 86940 exits successfully.
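
The batch settings printed in the engine configuration above agree with the relation DeepSpeed expects between them: `train_batch_size` = `train_micro_batch_size_per_gpu` × `gradient_accumulation_steps` × `world_size`. Below is a minimal sketch of that arithmetic using the logged values. The helper name `effective_batch_size` is made up for illustration and is not a DeepSpeed API.

```python
def effective_batch_size(micro_batch_per_gpu: int,
                         grad_accum_steps: int,
                         world_size: int) -> int:
    """Global batch size implied by the three per-run settings."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


# Values copied from the engine configuration printed in the log.
logged = {
    "train_batch_size": 2,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "world_size": 2,
}

computed = effective_batch_size(
    logged["train_micro_batch_size_per_gpu"],
    logged["gradient_accumulation_steps"],
    logged["world_size"],
)

# True when the logged global batch size matches the product.
print(computed == logged["train_batch_size"])
```

With one GPU on each of the two servers and a micro batch of 1, each rank processes a single prompt per step, which matches the one `in=`/`out=` pair printed per rank.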

With CPU Offload configured, the run took about 1 hour 40 minutes (09:37:12 to 11:17:14 in the log). The log follows:

(base) [root@master33 ptuning]$NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-03-10 09:37:12,374] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-10 09:37:14,564] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-03-10 09:37:14,565] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr= --master_port=29500 deploy_demo_ds.py
server3: [2024-03-10 09:37:16,677] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-10 09:37:16,713] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-03-10 09:37:20,699] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-10 09:37:20,947] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-03-10 09:37:20,947] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-10 09:37:21,360] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-10 09:37:21,553] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-03-10 09:37:21,553] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-10 09:37:21,553] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-03-10 09:55:05,084] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [27:14<00:00, 233.48s/it]
server1: [2024-03-10 10:22:19,454] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-03-10 10:22:19,464] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-03-10 10:22:19,465] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-03-10 10:22:19,564] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-03-10 10:22:19,565] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-03-10 10:22:19,565] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 13.09 GB, percent = 10.4%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-03-10 10:22:19,680] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-03-10 10:22:19,681] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-03-10 10:22:19,681] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 13.09 GB, percent = 10.4%
server1: [2024-03-10 10:22:19,682] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-03-10 10:22:19,682] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-03-10 10:22:19,682] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-03-10 10:22:19,682] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-03-10 10:22:19,682] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fdd809406d0>
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "offload_param": {
server1:             "device": "cpu", 
server1:             "pin_memory": true
server1:         }, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
server1: /data1/common/python/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server1:   warnings.warn(
Loading checkpoint shards: 100%|██████████| 7/7 [27:14<00:00, 233.56s/it]
server3: /root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server3:   warnings.warn(
server1: rank0:
server1:    in=请介绍下戈尔迪之结的来历
server1:   out=请介绍下戈尔迪之结的来历。 戈尔迪之
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever I have ever
server3: [2024-03-10 11:17:13,508] [INFO] [launch.py:347:main] Process 1625240 exits successfully.
server1: [2024-03-10 11:17:14,176] [INFO] [launch.py:347:main] Process 1894897 exits successfully.
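
Comparing the two `print_user_config` dumps, the only difference between the CPU Offload run and the previous run is the `offload_param` block under `zero_optimization`. Below is a sketch of both configs as Python dicts, with values copied from the logs as printed. How `deploy_demo_ds.py` actually builds or passes its config is not shown in these notes, so the variable names `base_config` and `offload_config` are assumptions for illustration.

```python
import copy
import json

# User config of the run without offload, as printed by print_user_config.
# Numeric values are kept in the scientific notation shown in the log.
base_config = {
    "fp16": {"enabled": False},
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": 4.194304e+06,
        "stage3_prefetch_bucket_size": 3.774874e+06,
        "stage3_param_persistence_threshold": 2.048000e+04,
    },
    "checkpoint": {"use_node_local_storage": True},
    "steps_per_print": 2.000000e+03,
    "train_batch_size": 2,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
}

# CPU Offload run: the same config plus an offload_param block.
offload_config = copy.deepcopy(base_config)
offload_config["zero_optimization"]["offload_param"] = {
    "device": "cpu",
    "pin_memory": True,
}

# Keys present only in the offload run's zero_optimization section.
added = set(offload_config["zero_optimization"]) - set(base_config["zero_optimization"])
print(sorted(added))
print(json.dumps(offload_config["zero_optimization"]["offload_param"], indent=4))
```

In the engine's `zero_config` line this block shows up as `offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', ...)`, while `offload_optimizer` stays `None`, so only parameters are offloaded in this run.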


(base) [root@master33 ptuning]$NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-03-26 15:34:20,559] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-26 15:34:22,553] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-03-26 15:34:22,554] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr= --master_port=29500 deploy_demo_ds.py
server3: [2024-03-26 15:34:24,513] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-26 15:34:24,706] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-26 15:34:25,922] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-03-26 15:34:25,922] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-03-26 15:34:25,922] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-03-26 15:34:25,922] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-03-26 15:34:25,923] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-03-26 15:34:25,923] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-03-26 15:34:28,563] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-26 15:34:28,808] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-03-26 15:34:28,808] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-26 15:34:29,578] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-26 15:34:29,800] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-03-26 15:34:29,801] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-26 15:34:29,801] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-03-26 15:52:13,213] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [27:15<00:00, 233.68s/it]
server1: [2024-03-26 16:19:29,029] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-03-26 16:19:29,040] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-03-26 16:19:29,041] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-03-26 16:19:29,146] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-03-26 16:19:29,147] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-03-26 16:19:29,147] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.01 GB, percent = 12.8%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-03-26 16:19:29,265] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-03-26 16:19:29,266] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-03-26 16:19:29,266] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.01 GB, percent = 12.8%
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fc8ea1b26a0>
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "offload_param": {
server1:             "device": "cpu", 
server1:             "pin_memory": true
server1:         }, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
Loading checkpoint shards: 100%|██████████| 7/7 [27:15<00:00, 233.68s/it]
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever.
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever. The food is disgusting and the service is terrible. I will never go back. negative
server1: rank0:
server1:    in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy.
server1:   out=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy. It is made of high quality materials and is well made. The non-stick coating is perfect and makes the skillet easy to use. The price is a little high but it is worth it. I would definitely recommend this skillet to anyone looking
server3: rank1:
server3:    in=中国古代经典小说《红楼梦》的作者是谁?
server3:   out=中国古代经典小说《红楼梦》的作者是谁? 曹雪芹。
server3: before while
server3: start...
server1: rank0:
server1:    in=你好,请做下自我介绍!
server1:   out=你好,请做下自我介绍! 我是人工智能助手 ChatGLM2-6B,一个基于语言模型的人工智能助手。我的任务是针对用户的问题和要求提供适当的答复和支持。
server1: before while
server1: start...
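
The bucket sizes in the config dump above can be reproduced by hand. A minimal sketch, assuming `model_hidden_size = 2048` (the value hard-coded in the script used for these runs) and the three expressions that script puts into `zero_optimization`. The `int(...)` truncation is only my guess at why the engine line shows `prefetch_bucket_size=3774873` while the user JSON shows `3.774874e+06`.

```python
# Assumption: model_hidden_size = 2048, as hard-coded in the script for these runs.
model_hidden_size = 2048

# Same expressions as in the script's ds_config.
reduce_bucket_size = model_hidden_size * model_hidden_size
stage3_prefetch_bucket_size = 0.9 * model_hidden_size * model_hidden_size  # a float
stage3_param_persistence_threshold = 10 * model_hidden_size

print("reduce_bucket_size:", reduce_bucket_size)
print("stage3_prefetch_bucket_size (float):", stage3_prefetch_bucket_size)
print("stage3_prefetch_bucket_size (truncated):", int(stage3_prefetch_bucket_size))
print("stage3_param_persistence_threshold:", stage3_param_persistence_threshold)
```

These values line up with `reduce_bucket_size=4194304`, `prefetch_bucket_size=3774873` and `param_persistence_threshold=20480` in the `zero_config` line of the log.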

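No CPU offload: I ran inference twice and added timestamp prints. Conclusion: each inference takes roughly 2 hours (the first took 2 h 08 min, the second 1 h 32 min)!!! (The two nodes are connected through a 100 Mbps 迈威 switch.) The code, logs, and screenshots follow.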
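
A rough back-of-the-envelope estimate of why a 100 Mbps link may produce timings like these. This is a sketch under stated assumptions, not a measurement: it assumes ZeRO stage 3 splits the parameters evenly across ranks, that every forward pass has to fetch all parameters held by the other ranks, that weights are 2 bytes each (bf16), and it ignores protocol overhead and any overlap of communication with compute. The function name `zero3_forward_transfer_seconds` is made up for illustration; the 6.24B parameter count is taken from the log.

```python
def zero3_forward_transfer_seconds(num_params, bytes_per_param, world_size, link_mbps):
    """Rough lower bound on network time per forward pass when ZeRO-3 shards are remote."""
    total_bytes = num_params * bytes_per_param
    # Each rank already holds 1/world_size of the weights and must fetch the rest.
    remote_bytes = total_bytes * (world_size - 1) / world_size
    link_bytes_per_s = link_mbps * 1e6 / 8  # megabits per second -> bytes per second
    return remote_bytes / link_bytes_per_s


# 6.24B parameters (from the log), bf16, 2 ranks, 100 Mbps switch.
seconds = zero3_forward_transfer_seconds(6.24e9, 2, 2, 100)
print(f"about {seconds:.0f} s (~{seconds / 60:.1f} min) of transfer per forward pass")
```

Since `generate` runs one forward pass per generated token, a few minutes of transfer per pass would add up to hours for even a short answer under these assumptions.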


#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
# First you need to install deepspeed: pip install deepspeed
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
# To deploy on 1 gpu:
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
# To deploy on 2 gpus:
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py

from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os,datetime
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

#model_name = "bigscience/T0_3B"
model_name = "THUDM/chatglm2-6b"
#model_name = "THUDM/chatglm2-6b-int4"

config = AutoConfig.from_pretrained(model_name,trust_remote_code=True)
#model_hidden_size = config.d_model
model_hidden_size = 2048

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
# - want CPU offload
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
# - which params should remain on gpus - the larger the value the smaller the offload size
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
# Variant 1: ZeRO-3 with CPU parameter offload. Kept for reference only;
# it is overridden by the second assignment below.
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# Variant 2: ZeRO-3 without CPU offload, bf16 enabled. This is the config in effect
# for the no-offload run (it matches the user config printed in the log).
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "checkpoint": {
        "use_node_local_storage": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    #text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    text_in = "你叫什么名字"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    # text_in = "你叫什么名字"
elif rank == 1:
    # text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
    text_in = "安徽的省会是哪里?"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")
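
The runs were timed by printing timestamps, but the script as pasted does not contain those prints. One way to add them is sketched below, using a hypothetical helper `timed_call` (not part of DeepSpeed or transformers). In the real script the wrapped call would be the `ds_engine.module.generate(...)` line; a trivial stand-in function is used here so the sketch runs on its own.

```python
import datetime
import time


def timed_call(label, fn, *args, **kwargs):
    """Run fn(*args, **kwargs), print the start timestamp and the elapsed time."""
    print(f"{label} start: {datetime.datetime.now()}")
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    print(f"{label} took {datetime.timedelta(seconds=round(elapsed))}")
    return result, elapsed


# Stand-in for the generate call, only so this sketch is self-contained.
def fake_generate(x):
    return x * 2


result, elapsed = timed_call("first inference", fake_generate, 21)
```

With several ranks, each process prints its own timestamps, so the lines show up once per server in the merged log.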


(base) [root@master25 ptuning]$ NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-04-08 09:35:45,863] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-08 09:35:47,642] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-04-08 09:35:47,643] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr= --master_port=29500 deploy_demo_ds.py
server3: [2024-04-08 09:35:49,534] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-08 09:35:49,712] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-04-08 09:35:53,587] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-08 09:35:53,911] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-04-08 09:35:53,911] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-08 09:35:54,391] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-08 09:35:54,571] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-04-08 09:35:54,571] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-08 09:35:54,571] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-04-08 09:36:40,589] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [42:10<00:00, 361.53s/it]   
server3: 第一次推理时间:2024-04-08 10:18:51.831791
Loading checkpoint shards: 100%|██████████| 7/7 [43:26<00:00, 372.32s/it]   
server1: [2024-04-08 10:20:06,846] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-04-08 10:20:06,857] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-04-08 10:20:06,858] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-04-08 10:20:06,960] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-04-08 10:20:06,960] [INFO] [utils.py:786:see_memory_usage] MA 5.82 GB         Max_MA 6.81 GB         CA 7.99 GB         Max_CA 8 GB 
server1: [2024-04-08 10:20:06,961] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.84 GB, percent = 7.0%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-04-08 10:20:07,077] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-04-08 10:20:07,077] [INFO] [utils.py:786:see_memory_usage] MA 5.82 GB         Max_MA 5.82 GB         CA 7.99 GB         Max_CA 8 GB 
server1: [2024-04-08 10:20:07,078] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.84 GB, percent = 7.0%
server1: [2024-04-08 10:20:07,078] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f41eaef8e80>
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-04-08 10:20:07,081] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-04-08 10:20:07,081] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-04-08 10:20:07,081] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
server1: 第一次推理时间:2024-04-08 10:20:07.082254
server1: /data1/common/python/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server1:   warnings.warn(
server3: /root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server3:   warnings.warn(
server1: rank0:
server1:    in=你叫什么名字
server1:   out=你叫什么名字? 
server1: 答: 我是一个名为 ChatGLM2-6
server1: 第2次推理时间:2024-04-08 12:28:50.366853
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever I have ever
server3: 第2次推理时间:2024-04-08 12:28:50.366859
server1: Input length of input_ids is 23, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
server3: rank1:
server3:    in=安徽的省会是哪里?
server3:   out=安徽的省会是哪里? 
server3: 安徽的省会是合肥。
server1: rank0:
server1:    in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy
server1:   out=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy This
server1: [2024-04-08 14:00:15,574] [INFO] [launch.py:347:main] Process 1265565 exits successfully.
server3: [2024-04-08 14:00:15,790] [INFO] [launch.py:347:main] Process 3665211 exits successfully.
(base) [root@master25 ptuning]$
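Note on the truncated replies: both ranks warn that `max_length` falls back to its default of 20, and one prompt is already 23 tokens long. That is probably why the replies above stop mid-sentence ("ChatGLM2-6", "I have ever"). Below is a minimal sketch of the fix the warning itself recommends, passing `max_new_tokens`. It assumes `deploy_demo_ds.py` calls `generate` in the usual Hugging Face way, which the log does not show. The helper name `generate_reply`, the `device` argument, and the budget of 64 tokens are illustrative assumptions.

```python
def generate_reply(model, tokenizer, prompt, max_new_tokens=64, device=None):
    """Generate a reply whose length budget does not depend on the prompt length.

    Hypothetical helper: `model` and `tokenizer` are assumed to be the usual
    Hugging Face objects; how deploy_demo_ds.py builds them is not in the log.
    """
    # Tokenize the prompt into PyTorch tensors (input_ids, attention_mask, ...).
    inputs = tokenizer(prompt, return_tensors="pt")
    if device is not None:
        # Move the tensors to the device the model runs on, e.g. "cuda:0".
        inputs = inputs.to(device)
    # max_new_tokens counts only the generated tokens, so a 23-token prompt
    # no longer collides with the default max_length of 20.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The returned sequence holds the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With this change the generation budget is independent of the prompt, so a long review prompt and a short question get the same room for an answer.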


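Note on the configuration: the run above prints a ZeRO stage 3 user config without any offload section, while the run that follows prints the same config plus `zero_optimization.offload_param` (device `cpu`, pinned memory). The sketch below rebuilds both configs as Python dicts from the values echoed in the log. The helper names `with_param_offload` and `batch_size_is_consistent` are hypothetical, and how `deploy_demo_ds.py` actually passes the config to DeepSpeed is not shown in the log.

```python
import copy
import json

# ZeRO-3 user config as echoed by print_user_config in the log. The JSON echo
# prints numbers as floats (e.g. 4.194304e+06); the integer values used here
# are the ones shown on the zero_config line.
base_config = {
    "fp16": {"enabled": False},
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": 4194304,
        "stage3_prefetch_bucket_size": 3774873,
        "stage3_param_persistence_threshold": 20480,
    },
    "checkpoint": {"use_node_local_storage": True},
    "steps_per_print": 2000,
    "train_batch_size": 2,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
}


def with_param_offload(config, device="cpu", pin_memory=True):
    """Return a copy of the config with ZeRO-3 parameter offload enabled."""
    cfg = copy.deepcopy(config)
    cfg["zero_optimization"]["offload_param"] = {
        "device": device,
        "pin_memory": pin_memory,
    }
    return cfg


def batch_size_is_consistent(config, world_size, gradient_accumulation_steps=1):
    """Check train_batch_size == micro batch per GPU * accumulation steps * world size."""
    expected = (
        config["train_micro_batch_size_per_gpu"]
        * gradient_accumulation_steps
        * world_size
    )
    return config["train_batch_size"] == expected


offload_config = with_param_offload(base_config)
print(json.dumps(offload_config["zero_optimization"], indent=4))
print(batch_size_is_consistent(offload_config, world_size=2))
```

The consistency check mirrors the values in the log: micro batch 1, gradient accumulation 1, and world size 2 give a train batch size of 2.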

 NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-04-08 19:20:54,006] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-08 19:20:55,803] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-04-08 19:20:55,803] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr= --master_port=29500 deploy_demo_ds.py
server3: [2024-04-08 19:20:57,684] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-08 19:20:57,936] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-04-08 19:20:59,566] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-04-08 19:20:59,567] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-04-08 19:20:59,567] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-04-08 19:20:59,567] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-04-08 19:20:59,567] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-04-08 19:20:59,567] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-04-08 19:21:01,718] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-08 19:21:01,968] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-04-08 19:21:01,968] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-08 19:21:02,862] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-08 19:21:03,061] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-04-08 19:21:03,061] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-08 19:21:03,061] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-04-08 19:38:47,403] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [27:14<00:00, 233.45s/it]
server1: [2024-04-08 20:06:01,577] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-04-08 20:06:01,588] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-04-08 20:06:01,589] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-04-08 20:06:01,695] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-04-08 20:06:01,696] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-04-08 20:06:01,696] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.18 GB, percent = 12.9%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-04-08 20:06:01,817] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-04-08 20:06:01,818] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-04-08 20:06:01,818] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.17 GB, percent = 12.9%
server1: [2024-04-08 20:06:01,818] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f092dda06a0>
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "offload_param": {
server1:             "device": "cpu", 
server1:             "pin_memory": true
server1:         }, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
server1: 第一次推理时间:2024-04-08 20:06:01.822467
server1: /data1/common/python/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server1:   warnings.warn(
Loading checkpoint shards: 100%|██████████| 7/7 [27:14<00:00, 233.45s/it]
server3: 第一次推理时间:2024-04-08 20:06:02.258538
server3: /root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server3:   warnings.warn(
server1: rank0:
server1:    in=你叫什么名字
server1:   out=你叫什么名字? 
server1: 答: 我是一个名为 ChatGLM2-6
server1: 第2次推理时间:2024-04-08 22:14:02.114204
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever I have ever
server3: 第2次推理时间:2024-04-08 22:14:02.118180
server1: Input length of input_ids is 23, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
server3: rank1:
server3:    in=安徽的省会是哪里?
server3:   out=安徽的省会是哪里? 
server3: 安徽的省会是合肥。
server1: rank0:
server1:    in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy
server1:   out=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy This
server3: [2024-04-08 23:45:29,034] [INFO] [launch.py:347:main] Process 119763 exits successfully.
server1: [2024-04-08 23:45:29,933] [INFO] [launch.py:347:main] Process 3562202 exits successfully.
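Note on timing: rank 0 prints a timestamp before its first and before its second inference in each run. The sketch below subtracts the two for the run without offload and for the run with `offload_param` set to `cpu`. The timestamps are copied from the log; the run labels and the helper `elapsed` are my own naming.

```python
from datetime import datetime

# Timestamp format used by the inference-time lines in the log.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed(start, end):
    """Return the time between two log timestamps as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# (first inference start, second inference start) on rank 0, copied from the log.
runs = {
    "ZeRO-3, no offload": (
        "2024-04-08 10:20:07.082254",
        "2024-04-08 12:28:50.366853",
    ),
    "ZeRO-3, offload_param=cpu": (
        "2024-04-08 20:06:01.822467",
        "2024-04-08 22:14:02.114204",
    ),
}

for name, (start, end) in runs.items():
    print(f"{name}: {elapsed(start, end)}")
# ZeRO-3, no offload: 2:08:43.284599
# ZeRO-3, offload_param=cpu: 2:08:00.291737
```

Both runs need a little over two hours for a single 20-token generation. That suggests, without proving it, that parameter offload is not the dominant cost in this two-node setup.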


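Note on `--world_info`: the value passed to `deepspeed.launcher.launch` in the pdsh command is base64-encoded JSON that maps each host to its local GPU slots. The sketch below decodes the string from the log and derives global ranks by numbering the slots host by host. That numbering rule is an assumption that happens to match the `global_rank_mapping` line in the log; the helper names are hypothetical.

```python
import base64
import json


def decode_world_info(encoded):
    """Decode the launcher's --world_info argument into a host -> slots dict."""
    return json.loads(base64.urlsafe_b64decode(encoded))


def global_rank_mapping(world_info):
    """Number the slots host by host, in the order the hosts are listed."""
    mapping = {}
    next_rank = 0
    for host, slots in world_info.items():
        mapping[host] = list(range(next_rank, next_rank + len(slots)))
        next_rank += len(slots)
    return mapping


# Value copied from the pdsh command in the log.
world_info = decode_world_info("eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0=")
print(world_info)                       # {'server1': [0], 'server3': [0]}
print(global_rank_mapping(world_info))  # {'server1': [0], 'server3': [1]}
```

The decoded dict matches the `WORLD INFO DICT` line, which is a quick way to confirm that the hostfile was read as intended before waiting for the model to load.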

 NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-04-09 09:20:20,297] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-09 09:20:22,159] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-04-09 09:20:22,159] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr= --master_port=29500 deploy_demo_ds.py
server3: [2024-04-09 09:20:24,252] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-09 09:20:24,289] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-09 09:20:25,693] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-04-09 09:20:25,693] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-04-09 09:20:25,694] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-04-09 09:20:25,694] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-04-09 09:20:25,694] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-04-09 09:20:25,694] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-04-09 09:20:28,333] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-09 09:20:28,579] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-04-09 09:20:28,579] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-09 09:20:29,050] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-09 09:20:29,266] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-04-09 09:20:29,266] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-09 09:20:29,266] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-04-09 09:38:33,975] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [30:51<00:00, 264.55s/it]
server1: [2024-04-09 10:09:25,847] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-04-09 10:09:25,858] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-04-09 10:09:25,860] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-04-09 10:09:25,959] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-04-09 10:09:25,960] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-04-09 10:09:25,960] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.17 GB, percent = 12.9%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-04-09 10:09:26,081] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-04-09 10:09:26,081] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-04-09 10:09:26,082] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.17 GB, percent = 12.9%
server1: [2024-04-09 10:09:26,082] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f8d905956d0>
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-04-09 10:09:26,085] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-04-09 10:09:26,085] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-04-09 10:09:26,085] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-04-09 10:09:26,085] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "offload_param": {
server1:             "device": "cpu", 
server1:             "pin_memory": true
server1:         }, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
server1: First inference start time: 2024-04-09 10:09:26.086356
server1: --1
server1: --3
server1: --3
server1: --4
server1: --5
server1: /data1/common/python/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server1:   warnings.warn(
Loading checkpoint shards: 100%|██████████| 7/7 [30:51<00:00, 264.56s/it]
server3: First inference start time: 2024-04-09 10:09:26.592210
server3: --1
server3: --3
server3: --3
server3: --4
server3: --5
server3: /root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server3:   warnings.warn(
server1: --6
server3: --6
server1: rank0:
server1:    in=你叫什么名字
server1:   out=你叫什么名字? 
server1: 答: 我是一个名为 ChatGLM2-6
server1: First inference took 8297201.61986351 ms (about 2 h 18 min)
server1: Second inference start time: 2024-04-09 12:27:43.288030
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever I have ever
server3: First inference took 8296694.418668747 ms
server3: Second inference start time: 2024-04-09 12:27:43.286708
server1: Input length of input_ids is 23, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
server3: rank1:
server3:    in=安徽的省会是哪里?
server3:   out=安徽的省会是哪里? 
server3: 安徽的省会是合肥。
server3: Second inference took 5734279.165506363 ms (about 1 h 36 min)
server1: rank0:
server1:    in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy
server1:   out=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy This
server1: Second inference took 5734281.681060791 ms
server3: [2024-04-09 14:03:19,782] [INFO] [launch.py:347:main] Process 1048854 exits successfully.
server1: [2024-04-09 14:03:21,414] [INFO] [launch.py:347:main] Process 250383 exits successfully.
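
The hour-and-minute annotations next to the millisecond timings in the log above can be reproduced with a small helper. This is only an illustrative sketch: `format_elapsed_ms` is a hypothetical name and is not part of deploy_demo_ds.py.

```python
def format_elapsed_ms(elapsed_ms: float) -> str:
    """Convert a duration in milliseconds to an 'H h M min' string.

    The value is rounded to the nearest whole minute, which is how the
    annotations in these notes appear to have been derived.
    """
    total_minutes = round(elapsed_ms / 60_000)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes} min"


# Timings copied from the rank 0 / rank 1 log lines above.
print(format_elapsed_ms(8297201.61986351))
print(format_elapsed_ms(5734279.165506363))
```

Rounding to the nearest minute is a choice made for this sketch; truncating instead would report the second run one minute shorter.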

/root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.

  • To avoid the warning above and the truncated answers, pass max_new_tokens explicitly: outputs = ds_engine.module.generate(inputs, synced_gpus=True, max_new_tokens=100)
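
Why the first run cut its answers short can be sketched with a simplified model of the two length settings. According to the warnings in the log, the default max_length is 20 and counts the prompt tokens as well, while max_new_tokens only counts generated tokens. The function below is a hypothetical illustration of that bookkeeping, not the Transformers implementation.

```python
from typing import Optional


def new_token_budget(prompt_len: int,
                     max_length: int = 20,
                     max_new_tokens: Optional[int] = None) -> int:
    """Return how many tokens may still be generated after the prompt.

    Simplified model: max_length caps prompt + generated tokens together,
    whereas max_new_tokens caps the generated tokens alone and takes
    precedence when it is given. A result <= 0 means the prompt alone
    already uses up the whole budget.
    """
    if max_new_tokens is not None:
        return max_new_tokens
    return max_length - prompt_len


# The log reports a 23-token prompt against the default max_length of 20.
print(new_token_budget(23))
# With max_new_tokens=100 the prompt length no longer eats into the budget.
print(new_token_budget(23, max_new_tokens=100))
```

A non-positive budget corresponds to the "Input length of input_ids is 23, but max_length is set to 20" message in the log.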



NCCL_SOCKET_IFNAME=eno1 /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=0 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-04-17 19:37:55,128] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-17 19:37:56,885] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
[2024-04-17 19:37:56,885] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
[2024-04-17 19:37:56,885] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
[2024-04-17 19:37:56,885] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
[2024-04-17 19:37:56,885] [INFO] [launch.py:163:main] dist_world_size=2
[2024-04-17 19:37:56,885] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-17 19:38:00,151] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-17 19:38:00,362] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-17 19:38:00,362] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-04-17 19:38:00,362] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-04-17 19:55:48,239] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 7/7 [27:14<00:00, 233.55s/it]
[2024-04-17 20:23:03,149] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
[2024-04-17 20:23:03,159] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-04-17 20:23:03,161] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2024-04-17 20:23:03,264] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-04-17 20:23:03,265] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-04-17 20:23:03,265] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.86 GB, percent = 13.4%
Parameter Offload: Total persistent parameters: 362496 in 85 params
[2024-04-17 20:23:03,385] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-04-17 20:23:03,386] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-04-17 20:23:03,386] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.85 GB, percent = 13.4%
[2024-04-17 20:23:03,387] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   amp_enabled .................. False
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   amp_params ................... False
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   bfloat16_enabled ............. False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f9edfefab80>
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   communication_data_type ...... None
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   disable_allgather ............ False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   dump_state ................... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   elasticity_enabled ........... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   fp16_auto_cast ............... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   fp16_enabled ................. True
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   global_rank .................. 0
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 65536
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   loss_scale ................... 0
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   memory_breakdown ............. False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   mics_shard_size .............. -1
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   optimizer_name ............... None
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   optimizer_params ............. None
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   pld_enabled .................. False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   pld_params ................... False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   prescale_gradients ........... False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   scheduler_name ............... None
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   scheduler_params ............. None
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   sparse_attention ............. None
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   steps_per_print .............. 2000
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   train_batch_size ............. 2
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   use_node_local_storage ....... True
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   world_size ................... 2
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   zero_enabled ................. True
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
[2024-04-17 20:23:03,389] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": true
    }, 
    "bf16": {
        "enabled": false
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "reduce_bucket_size": 4.194304e+06, 
        "stage3_prefetch_bucket_size": 3.774874e+06, 
        "stage3_param_persistence_threshold": 2.048000e+04
    }, 
    "checkpoint": {
        "use_node_local_storage": true
    }, 
    "steps_per_print": 2.000000e+03, 
    "train_batch_size": 2, 
    "train_micro_batch_size_per_gpu": 1, 
    "wall_clock_breakdown": false
}
Current time: 2024-04-17 20:23:03.391038
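
The batch-size values printed in the configuration dump above are tied together by the usual DeepSpeed relation: train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size. The following sketch checks that relation against the numbers from this two-node run. The function name and the dictionary are hypothetical and only serve the illustration.

```python
def expected_train_batch_size(micro_batch_per_gpu: int,
                              grad_accum_steps: int,
                              world_size: int) -> int:
    """Global batch size implied by the per-GPU settings."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


# Values copied from the configuration dump of the two-node run.
logged = {
    "train_batch_size": 2,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "world_size": 2,
}

is_consistent = logged["train_batch_size"] == expected_train_batch_size(
    logged["train_micro_batch_size_per_gpu"],
    logged["gradient_accumulation_steps"],
    logged["world_size"],
)
print(is_consistent)
```

If the number of nodes changes, train_batch_size has to change with it for the three values to stay consistent.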

1 2024-04-17 20:26:53.509871...text_in:你好,介绍下你自己
2 2024-04-17 20:26:53.561594...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-17 20:26:53.578529...inputs:tensor([[64790, 64792, 36474, 54591, 31123, 32025, 54578, 39376]],
4 2024-04-18 01:01:06.943339...outputs:tensor([[64790, 64792, 36474, 54591, 31123, 32025, 54578, 39376, 35416, 33031,
         31718, 54746, 31645, 31155,    13,    13, 33030, 32132, 32914, 54940,
         54645, 30932, 38628, 34797, 42481, 31155, 54546, 37893, 31799, 32330,
         31940, 31668, 30932, 31934, 32006, 36295, 31639, 31201]],

This inference round took 16453436.004161835 ms

Current time: 2024-04-18 01:01:06.945884

1 2024-04-18 08:51:18.150903...text_in:安徽的省会是哪里?
2 2024-04-18 08:51:18.213205...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-18 08:51:18.230416...inputs:tensor([[64790, 64792, 30910, 33240, 54530, 54833, 45798, 33120, 31514]],
4 2024-04-18 13:25:48.953237...outputs:tensor([[64790, 64792, 30910, 33240, 54530, 54833, 45798, 33120, 31514, 30910,
            13,    13, 33240, 54530, 54833, 45798, 35606, 31155,     2]],

This inference round took 16470804.099321365 ms

Current time: 2024-04-18 13:25:48.954983

1 2024-04-18 13:45:12.674692...text_in:写一首关于月亮的四言绝句
2 2024-04-18 13:45:12.738515...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-18 13:45:12.755618...inputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 37752, 41807, 54994, 55384,
         55390]], device='cuda:0')
4 2024-04-18 18:24:06.553752...outputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 37752, 41807, 54994, 55384,
         55390,    13,    13,    13,    13, 37752, 54589, 55990, 56786, 58259,
         54538, 30932, 59731, 55822, 54627, 55312, 55364, 54902, 56513, 31155,
           265,    13, 54852, 56060, 49447, 34313, 54655, 30932, 56181, 55786,
         54586]], device='cuda:0')

This inference round took 16733881.56747818 ms

Current time: 2024-04-18 18:24:06.556266
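
The --world_info argument in the launch commands is a base64-encoded JSON object that maps each hostname to its local GPU indices. That is what the launcher echoes as "WORLD INFO DICT". A minimal sketch for decoding it by hand follows. `decode_world_info` is a hypothetical helper, and the URL-safe base64 alphabet is an assumption.

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a base64-encoded JSON mapping of hostname -> local GPU indices."""
    return json.loads(base64.urlsafe_b64decode(encoded).decode("utf-8"))


# Value copied from the two-node launch command in these notes.
two_nodes = decode_world_info("eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0=")
print(two_nodes)

# The total number of processes is the sum of GPUs over all hosts.
print(sum(len(gpus) for gpus in two_nodes.values()))
```

Decoding the argument is a quick way to confirm which hosts and GPUs a launch command targets before waiting for the model to load.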


NCCL_SOCKET_IFNAME=eno1 /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=1 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-04-17 19:38:01,000] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-17 19:38:02,408] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
[2024-04-17 19:38:02,408] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
[2024-04-17 19:38:02,408] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
[2024-04-17 19:38:02,408] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
[2024-04-17 19:38:02,408] [INFO] [launch.py:163:main] dist_world_size=2
[2024-04-17 19:38:02,408] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-17 19:38:05,049] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-17 19:38:05,289] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-17 19:38:05,289] [INFO] [comm.py:616:init_distributed] cdb=None
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 7/7 [27:15<00:00, 233.66s/it]
Current time: 2024-04-17 20:23:04.450904

1 2024-04-17 20:27:13.342097...text_in:红楼梦的作者是谁?
2 2024-04-17 20:27:13.392741...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-17 20:27:13.409613...inputs:tensor([[64790, 64792, 30910, 42470, 53422, 36289, 31514]], device='cuda:0')
4 2024-04-18 01:01:06.940493...outputs:tensor([[64790, 64792, 30910, 42470, 53422, 36289, 31514,    13,    13, 42470,
         53422, 54532, 37502, 33172, 54561, 56307, 55534, 57772, 31155,     2]],

This inference round took 16433600.287437439 ms

Current time: 2024-04-18 01:01:06.942412

1 2024-04-18 08:51:56.450500...text_in:新约和旧约有什么区别?
2 2024-04-18 08:51:56.512916...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-18 08:51:56.530356...inputs:tensor([[64790, 64792, 33640, 55064, 54542, 55659, 55064, 33277, 34213, 31514]],
4 2024-04-18 13:25:48.957868...outputs:tensor([[64790, 64792, 33640, 55064, 54542, 55659, 55064, 33277, 34213, 31514,
         30910,    13,    13, 54575, 55064, 54542, 55659, 55064, 54532, 38046,
         54538, 38411, 31911, 31726, 30932, 32542, 32695, 34213, 31685, 33947,
         31155,    13,    13, 55659, 55064, 30946, 20866, 14891, 30945, 31779]],


旧约(Old Testament)包括
This inference round took 16432509.668111801 ms

Current time: 2024-04-18 13:25:48.960165

1 2024-04-18 13:46:02.574377...text_in:写一首关于亲情的七言律诗
2 2024-04-18 13:46:02.637360...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-18 13:46:02.655157...inputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 55113, 34684, 55254, 54994,
         55134, 55475]], device='cuda:0')
4 2024-04-18 18:24:06.555363...outputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 55113, 34684, 55254, 54994,
         55134, 55475,    13,    13, 41437, 54625, 55364, 39671, 56786, 54683,
         30932,    13, 55100, 55994, 54579, 48893, 40014, 31155,    13, 34365,
         39003, 54908, 55169, 55169, 30932,    13, 34089, 41561, 54623, 54664,
         55994, 31155]], device='cuda:0')

This inference round took 16683983.394861221 ms

Current time: 2024-04-18 18:24:06.557781
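
To put the multi-hour timings into perspective, the cost per generated token can be estimated from the logged tensors: generated tokens are the output ids minus the prompt ids. The sketch below uses the first round of this rank. `seconds_per_new_token` is a hypothetical helper, and it assumes the reported elapsed time covers the generate call only.

```python
def seconds_per_new_token(prompt_ids, output_ids, elapsed_ms: float) -> float:
    """Average wall-clock seconds spent per newly generated token."""
    new_tokens = len(output_ids) - len(prompt_ids)
    if new_tokens <= 0:
        raise ValueError("output must be longer than the prompt")
    return elapsed_ms / 1000.0 / new_tokens


# Token ids copied from the first inference round of this rank's log.
prompt_ids = [64790, 64792, 30910, 42470, 53422, 36289, 31514]
output_ids = [64790, 64792, 30910, 42470, 53422, 36289, 31514, 13, 13, 42470,
              53422, 54532, 37502, 33172, 54561, 56307, 55534, 57772, 31155, 2]

rate = seconds_per_new_token(prompt_ids, output_ids, 16433600.287437439)
print(f"{rate:.1f} s per generated token")
```

With synced_gpus=True every rank keeps stepping until the slowest rank has finished, so the per-rank figure is an estimate rather than an exact measurement.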




 Node 1: master node (the one chosen is 89.3)
(base) [root@master25 ptuning]$NCCL_SOCKET_IFNAME=eno1 /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMiI6IFswXSwgInNlcnZlcjMiOiBbMF19 --node_rank=0 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-04-19 16:58:23,476] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:58:24,880] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
[2024-04-19 16:58:24,880] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server2': [0], 'server3': [0]}
[2024-04-19 16:58:24,880] [INFO] [launch.py:151:main] nnodes=3, num_local_procs=1, node_rank=0
[2024-04-19 16:58:24,880] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server2': [1], 'server3': [2]})
[2024-04-19 16:58:24,880] [INFO] [launch.py:163:main] dist_world_size=3
[2024-04-19 16:58:24,880] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-19 16:58:27,484] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:58:27,720] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-19 16:58:27,720] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-04-19 16:58:27,720] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-04-19 17:17:53,327] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [31:17<00:00, 268.27s/it]
[2024-04-19 17:49:11,244] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
[2024-04-19 17:49:11,254] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-04-19 17:49:11,255] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2024-04-19 17:49:11,348] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-04-19 17:49:11,349] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-04-19 17:49:11,349] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 11.85 GB, percent = 9.4%
Parameter Offload: Total persistent parameters: 362496 in 85 params
[2024-04-19 17:49:11,457] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-04-19 17:49:11,458] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-04-19 17:49:11,458] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 11.85 GB, percent = 9.4%
[2024-04-19 17:49:11,459] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2024-04-19 17:49:11,459] [INFO] [config.py:964:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-04-19 17:49:11,459] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-04-19 17:49:11,459] [INFO] [config.py:964:print]   amp_enabled .................. False
[2024-04-19 17:49:11,459] [INFO] [config.py:964:print]   amp_params ................... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   bfloat16_enabled ............. False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f245e7bbcd0>
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   communication_data_type ...... None
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   disable_allgather ............ False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   dump_state ................... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   elasticity_enabled ........... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   fp16_auto_cast ............... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   fp16_enabled ................. True
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   global_rank .................. 0
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 65536
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   loss_scale ................... 0
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   memory_breakdown ............. False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   mics_shard_size .............. -1
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   optimizer_name ............... None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   optimizer_params ............. None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   pld_enabled .................. False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   pld_params ................... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   prescale_gradients ........... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   scheduler_name ............... None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   scheduler_params ............. None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   sparse_attention ............. None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   steps_per_print .............. 2000
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   train_batch_size ............. 3
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   use_node_local_storage ....... True
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   world_size ................... 3
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   zero_enabled ................. True
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
[2024-04-19 17:49:11,462] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": true
    }, 
    "bf16": {
        "enabled": false
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "reduce_bucket_size": 4.194304e+06, 
        "stage3_prefetch_bucket_size": 3.774874e+06, 
        "stage3_param_persistence_threshold": 2.048000e+04
    }, 
    "checkpoint": {
        "use_node_local_storage": true
    }, 
    "steps_per_print": 2.000000e+03, 
    "train_batch_size": 3, 
    "train_micro_batch_size_per_gpu": 1, 
    "wall_clock_breakdown": false
}
当前时间:2024-04-19 17:49:11.463137

1 2024-04-19 17:56:19.307686...text_in:你好
2 2024-04-19 17:56:19.357040...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-19 17:56:19.374337...inputs:tensor([[64790, 64792, 36474, 54591]], device='cuda:0')
4 2024-04-20 00:03:12.646182...outputs:tensor([[64790, 64792, 36474, 54591, 31123, 33030, 22011, 10461, 30944, 30943,
         30941, 30978, 30949, 31123, 30910, 32347, 54565, 32093, 42481, 31155,
         54546, 34161, 34941, 34030, 54532, 12980, 30944, 30943, 30941, 30978,
         30949, 31123, 30910, 32288]], device='cuda:0')
  out=你好,我是 ChatGLM2-6B, 一个人工智能助手。我背后使用的模型是 GLM2-6B, 是一种
本轮推理耗时22013340.56162834 ms

当前时间:2024-04-20 00:03:12.648256

1 2024-04-20 00:12:26.084211...text_in:写一首关于月亮的诗
2 2024-04-20 00:12:26.144560...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-20 00:12:26.161506...inputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 37752, 45113]],
4 2024-04-20 06:19:27.499328...outputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 37752, 45113, 30910, 41881,
         54589, 55990, 54614, 35171, 30932,    13, 59731, 55822, 54627, 55312,
         55364, 54902, 55193, 31155,    13, 54595, 54578, 54852, 40921, 55459,
         55406, 30932,    13, 33961, 39861, 58423, 33263, 31155]],
  out=写一首关于月亮的诗 明月高挂天空中,
本轮推理耗时22021417.449235916 ms

当前时间:2024-04-20 06:19:27.501668
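
The DeepSpeedEngine configuration printed in the node 1 log above reports `train_batch_size=3`, `train_micro_batch_size_per_gpu=1`, `gradient_accumulation_steps=1` and `world_size=3`. DeepSpeed requires the global batch size to equal the product of the other three, so the values can be checked with a few lines. The helper name `expected_train_batch_size` is only an illustration, not a DeepSpeed function.

```python
# Values copied from the DeepSpeedEngine configuration in the log above.
train_batch_size = 3
train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 1
world_size = 3  # three nodes with one GPU each (dist_world_size=3)


def expected_train_batch_size(micro_batch: int, grad_accum: int, world: int) -> int:
    """Global batch size implied by the per-GPU settings (hypothetical helper)."""
    return micro_batch * grad_accum * world


implied = expected_train_batch_size(
    train_micro_batch_size_per_gpu, gradient_accumulation_steps, world_size
)

# DeepSpeed rejects a config in which these two numbers disagree.
assert implied == train_batch_size
print(f"train_batch_size={train_batch_size}, implied by per-GPU settings={implied}")
```

With one sample per GPU and no gradient accumulation, each of the three ranks handles exactly one prompt per round, which matches the logs: every node answers a different `text_in` and all three finish at almost the same timestamp.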

NCCL_SOCKET_IFNAME=eno1 /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMiI6IFswXSwgInNlcnZlcjMiOiBbMF19 --node_rank=1 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-04-19 16:58:59,531] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:59:00,932] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
[2024-04-19 16:59:00,932] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server2': [0], 'server3': [0]}
[2024-04-19 16:59:00,933] [INFO] [launch.py:151:main] nnodes=3, num_local_procs=1, node_rank=1
[2024-04-19 16:59:00,933] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server2': [1], 'server3': [2]})
[2024-04-19 16:59:00,933] [INFO] [launch.py:163:main] dist_world_size=3
[2024-04-19 16:59:00,933] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-19 16:59:03,797] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:59:04,017] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-19 16:59:04,017] [INFO] [comm.py:616:init_distributed] cdb=None
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [31:17<00:00, 268.27s/it]
当前时间:2024-04-19 17:49:11.795627

1 2024-04-19 17:56:29.232194...text_in:安徽的省会是哪里?
2 2024-04-19 17:56:29.285311...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-19 17:56:29.302017...inputs:tensor([[64790, 64792, 30910, 33240, 54530, 54833, 45798, 33120, 31514]],
4 2024-04-20 00:03:12.643464...outputs:tensor([[64790, 64792, 30910, 33240, 54530, 54833, 45798, 33120, 31514, 30910,
            13,    13, 33240, 54530, 54833, 45798, 35606, 31155,     2]],

本轮推理耗时22003413.06900978 ms

当前时间:2024-04-20 00:03:12.645291

1 2024-04-20 00:12:34.545304...text_in:写一首关于太阳的诗
2 2024-04-20 00:12:34.611573...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-20 00:12:34.628495...inputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 33146, 45113]],
4 2024-04-20 06:19:27.496629...outputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 33146, 45113, 30910, 33146,
         55674, 31123, 34607, 36022, 54530, 57227, 55190, 30910, 31822, 42059,
         49447, 54666, 35196, 30910, 31822, 33027, 46276, 54666, 36505, 30910,
         31822, 32824, 31123, 32067, 35818, 30910, 31822, 33219]],
  out=写一首关于太阳的诗 太阳啊,你是天空的瑰宝 你的光芒照耀着大地 你的温暖滋润着万物 你的美丽,无法形容 你的伟大
本轮推理耗时22012953.52959633 ms

当前时间:2024-04-20 06:19:27.498841

NCCL_SOCKET_IFNAME=eno1 /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMiI6IFswXSwgInNlcnZlcjMiOiBbMF19 --node_rank=2 --master_addr= --master_port=29500 deploy_demo_ds.py
[2024-04-19 16:58:30,385] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:58:31,942] [INFO] [launch.py:138:main] 2 NCCL_SOCKET_IFNAME=eno1
[2024-04-19 16:58:31,942] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server2': [0], 'server3': [0]}
[2024-04-19 16:58:31,942] [INFO] [launch.py:151:main] nnodes=3, num_local_procs=1, node_rank=2
[2024-04-19 16:58:31,942] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server2': [1], 'server3': [2]})
[2024-04-19 16:58:31,942] [INFO] [launch.py:163:main] dist_world_size=3
[2024-04-19 16:58:31,942] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-19 16:58:35,273] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:58:35,484] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-19 16:58:35,484] [INFO] [comm.py:616:init_distributed] cdb=None
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [31:17<00:00, 268.28s/it]
当前时间:2024-04-19 17:49:11.861933

1 2024-04-19 17:56:41.513744...text_in:红楼梦的作者是谁?
2 2024-04-19 17:56:41.562707...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-19 17:56:41.579732...inputs:tensor([[64790, 64792, 30910, 42470, 53422, 36289, 31514]], device='cuda:0')
4 2024-04-20 00:03:12.647159...outputs:tensor([[64790, 64792, 30910, 42470, 53422, 36289, 31514,    13,    13, 42470,
         53422, 54532, 37502, 33172, 54561, 56307, 55534, 57772, 31155,     2]],

本轮推理耗时21991135.238409042 ms

当前时间:2024-04-20 00:03:12.648993

1 2024-04-20 00:12:59.758721...text_in:工业革命为什么发生在欧洲
2 2024-04-20 00:12:59.840548...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-20 00:12:59.858234...inputs:tensor([[64790, 64792, 30910, 32068, 32123, 32148, 37537, 32857]],
4 2024-04-20 06:19:27.495783...outputs:tensor([[64790, 64792, 30910, 32068, 32123, 32148, 37537, 32857, 33740, 31722,
         31808, 31514, 30910,    13,    13, 32068, 32123, 34369, 54534, 32857,
         33740, 31722, 31808, 31763, 30932, 34024, 31949, 32857, 34892, 31201,
         31832, 43805, 31993, 54626, 39305, 31155,    13,    13]],


本轮推理耗时21987739.241361618 ms

当前时间:2024-04-20 06:19:27.497969
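
The `--world_info` argument in the three launch commands above is base64-encoded JSON describing which GPU slots each host contributes. A minimal sketch of decoding it and deriving the rank mapping that `launch.py` logs is shown below. The rank assignment loop is a simplified reconstruction that assumes ranks are handed out in host order; it is not DeepSpeed's own code.

```python
import base64
import json

# Copied verbatim from the launch commands above.
world_info_b64 = "eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMiI6IFswXSwgInNlcnZlcjMiOiBbMF19"

# Decode base64 -> JSON text -> dict of {hostname: [local GPU ids]}.
world_info = json.loads(base64.urlsafe_b64decode(world_info_b64))

# Assign global ranks in host order, one rank per local GPU slot.
global_rank_mapping = {}
next_rank = 0
for host, local_gpus in world_info.items():
    global_rank_mapping[host] = list(range(next_rank, next_rank + len(local_gpus)))
    next_rank += len(local_gpus)

print("world info:         ", world_info)
print("global rank mapping:", global_rank_mapping)
print("world size:         ", next_rank)
```

The decoded dictionary matches the `WORLD INFO DICT` line in the logs, and the derived mapping and world size match `global_rank_mapping` and `dist_world_size=3`.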



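Finetune Wenzhong2.0-GPT2-3.5B-chinese runs out of GPU memory; offload_param does not seem to take effect #111

Distributed multi-node, multi-GPU training hangs and reports an error after timing out #123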


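  • Judging from the code, the model has already been partitioned; GPU memory ran out while gathering parameters during the optimizer step. As a first attempt, try lowering stage3_max_live_parameters further.
  • DeepSpeed partitions parameters automatically, and the partitioning does not necessarily follow layer boundaries. Internally, DeepSpeed checks whether a layer's parameters are used anywhere outside forward; if they are, that layer cannot be partitioned. This depends on how the model is implemented. At the moment it looks like at least 13 GB of GPU memory is required; we will look into optimizing this further later.
  • What is actually called here is estimate_zero3_model_states_mem_needs_all_cold. all_live assumes that every layer can be partitioned, so it gives the theoretical minimum. In practice some layers cannot be partitioned, and the out-of-memory error above occurred exactly while gathering the parameters of those unpartitionable layers.
  • activation checkpointing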
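
The two suggestions above (a smaller `stage3_max_live_parameters` and activation checkpointing) can be written into a DeepSpeed config. The sketch below builds such a config as a Python dict. The key names are those that appear in the DeepSpeed configuration dump earlier in these notes; the concrete values are hypothetical starting points, not settings taken from the issue threads. Whether the `activation_checkpointing` section has any effect depends on the model code using DeepSpeed's checkpointing API, which is an assumption here.

```python
import json

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        # The log above shows max_live_parameters=1,000,000,000.
        # 100M is a hypothetical smaller value to try when gathering runs out of memory.
        "stage3_max_live_parameters": 100_000_000,
    },
    # Section name and keys as printed under activation_checkpointing_config in the log.
    # Enabling them is an illustrative choice, not a tested recommendation.
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Serialize so the result can be saved as a JSON config file.
config_text = json.dumps(ds_config, indent=4)
print(config_text)
```

Lowering `stage3_max_live_parameters` trades speed for memory: fewer gathered parameters stay resident on the GPU, so they have to be fetched again more often.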


















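[Advanced] Transformer architecture explained: model training and backpropagation [Very rewarding; explains the principles behind the whole training process]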






※, Tongyi Qianwen Qwen-14B-Chat



To check whether the GPUs support NVLink, use the command:

nvidia-smi topo -p2p n


python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("THUDM/chatglm2-6b"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)'

(base) [root@master33 ptuning]$python -c 'from transformers import AutoModel; \
> from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
> model = AutoModel.from_pretrained("THUDM/chatglm2-6b"); \
> estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)'
[2024-03-11 11:20:24,517] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading THUDM/chatglm2-6b requires to execute some code in that repo, you can inspect the content of the repository at https://hf.co/THUDM/chatglm2-6b. You can dismiss this prompt by passing `trust_remote_code=True`.
Do you accept? [y/N] y
Loading THUDM/chatglm2-6b requires to execute some code in that repo, you can inspect the content of the repository at https://hf.co/THUDM/chatglm2-6b. You can dismiss this prompt by passing `trust_remote_code=True`.
Do you accept? [y/N] y
Loading checkpoint shards:   0%|                                                                                                           | 0/7 [00:00<?, ?it/s]Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00,  1.39s/it]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 6243M total params, 266M largest layer params.
  per CPU  |  per GPU |   Options
  157.00GB |   0.99GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  157.00GB |   0.99GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  139.55GB |  12.62GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  139.55GB |  12.62GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    1.49GB | 105.66GB | offload_param=none, offload_optimizer=none, zero_init=1
   34.89GB | 105.66GB | offload_param=none, offload_optimizer=none, zero_init=0
(base) [root@master33 ptuning]$
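
The numbers in the table above can be reproduced with simple bytes-per-parameter arithmetic, which makes it easier to see where the memory goes. The sketch below is a reconstruction, not DeepSpeed's source: the factors (18 bytes per parameter for fp16 weights, gradients and optimizer states together, 16 bytes when the fp16 weights stay on the GPU, 4 bytes per parameter of the largest layer as GPU working space, and a 1.5x CPU buffer factor) are assumptions chosen because they match the printed table. The function name `estimate_zero3_mem` is hypothetical. The parameter counts are the rounded 6243M and 266M from the output, so results agree only to about two decimals.

```python
GIB = 2 ** 30


def estimate_zero3_mem(total_params, largest_layer_params, offload_param,
                       offload_optimizer, zero_init,
                       num_gpus_per_node=1, num_nodes=1, buffer_factor=1.5):
    """Return (cpu_gib, gpu_gib) per node/GPU under the assumed byte factors."""
    total_gpus = num_gpus_per_node * num_nodes
    largest_layer_bytes = 4 * largest_layer_params  # gathered working copy on GPU

    if offload_optimizer:
        if offload_param:
            gpu = largest_layer_bytes
            per_param = 18  # everything lives in CPU memory
        else:
            gpu = largest_layer_bytes + 2 * total_params / total_gpus  # fp16 weights on GPU
            per_param = 16
        if zero_init:
            cpu = total_params * per_param / num_nodes * buffer_factor
        else:
            cpu = total_params * max(4 * num_gpus_per_node, per_param / num_nodes) * buffer_factor
    else:
        gpu = largest_layer_bytes + 18 * total_params / total_gpus
        if zero_init:
            cpu = largest_layer_params * 4 * num_gpus_per_node * buffer_factor
        else:
            cpu = total_params * 4 * num_gpus_per_node * buffer_factor
    return cpu / GIB, gpu / GIB


TOTAL = 6243e6    # "6243M total params" from the output above
LARGEST = 266e6   # "266M largest layer params"

# (offload_param, offload_optimizer, zero_init) in the same order as the table rows.
rows = [(True, True, True), (True, True, False), (False, True, True),
        (False, True, False), (False, False, True), (False, False, False)]

results = [estimate_zero3_mem(TOTAL, LARGEST, *row) for row in rows]
for row, (cpu_gib, gpu_gib) in zip(rows, results):
    print(f"{cpu_gib:8.2f}GB | {gpu_gib:7.2f}GB | offload_param={row[0]}, "
          f"offload_optimizer={row[1]}, zero_init={int(row[2])}")
```

Read this way, the 0.99 GB per-GPU figure with full offload is just the largest layer (266M parameters at 4 bytes each), and the 105.66 GB figure without offload is 18 bytes for each of the 6.24B parameters plus that same working space.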


posted on 2023-12-14 16:32  everest33