Machine Learning Study Notes

※, Large models, GPUs, and so on

※, One article to understand AI, machine learning, and deep learning:

https://zhuanlan.zhihu.com/p/86794447

https://cloud.tencent.com/developer/article/1617263

※, ★ Deep learning introductory notes series

https://blog.csdn.net/TeFuirnever/article/details/100669859

※, Data parallelism vs. model parallelism

Data parallelism: multiple GPUs hold identical copies of the model, but each trains on a different batch of data.

Model parallelism: multiple GPUs process the same batch of data, but each trains a different part of the model.
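
A minimal PyTorch sketch of the two ideas (assumes two visible GPUs; MyNet is a hypothetical toy model, not from any of the linked posts):

import torch
import torch.nn as nn

class MyNet(nn.Module):  # hypothetical two-layer model
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Data parallelism: every GPU holds a full replica of the model; the input
# batch is split across the replicas.
dp_model = nn.DataParallel(MyNet().cuda())

# Model parallelism: different layers live on different GPUs; the same batch
# flows through both devices.
class SplitNet(MyNet):
    def __init__(self):
        super().__init__()
        self.fc1.to("cuda:0")
        self.fc2.to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.fc1(x.to("cuda:0")))
        return self.fc2(x.to("cuda:1"))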

※, Learning rate

※, ★ Gradient descent and loss functions: https://blog.csdn.net/qq_38890412/article/details/109193294
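
A minimal gradient-descent sketch for intuition (plain Python; one parameter, squared loss; lr is the learning rate, i.e., the step size):

# minimize loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, lr = 0.0, 0.1
for _ in range(50):
    grad = 2 * (w - 3)
    w -= lr * grad   # step against the gradient, scaled by the learning rate
print(w)             # approaches the minimum at w = 3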

※, Model convergence: https://www.zhihu.com/question/517057079

  • Convergence means the resulting model is stable and highly likely to be reproducible. If training diverges, every run gives a different result, and what would the experimental results even mean?!
  • A model should produce a deterministic output when its input is unchanged. When we say a model has converged, we usually mean the training and validation loss curves show no large fluctuations, and that as the number of training epochs grows, the remaining fluctuation stays within an acceptable tolerance. My own understanding: convergence means the system is stable, i.e., a small change in one of the model's weight parameters does not cause the output to change violently and the system to break down (so-called divergence). Conversely, divergence means that a tiny change in the model's parameters changes the output enough to make the loss swing wildly; such a model is not adequately trained.


※, Neural-network forward and backward propagation: https://zhuanlan.zhihu.com/p/226080271

※, epoch, batch, batch_size, step, iteration

https://www.zhihu.com/question/358739684

epoch: one complete pass through every sample in the training dataset (each sample seen exactly once). Within an epoch, the training algorithm feeds all samples through the model in the configured order for forward propagation, loss computation, backpropagation, and parameter updates. One epoch usually consists of multiple steps.

batch: a group of samples fed into the model at once. Training data is usually plentiful, tens of thousands or even hundreds of thousands of samples; feeding all of them into the model at once would demand far too much of the hardware and of the model's capacity to learn. Instead, the training data is split into multiple batches, and each batch is fed into the model in turn for forward propagation, loss computation, backpropagation, and parameter updates. Note that the word batch itself is not used that often; in most cases people care only about the batch size.

batch size: the number of samples in one batch, i.e., how many samples are fed into the model at a time. As noted above, training data is split into multiple batches; how many samples each batch contains is exactly what batch size specifies.

step: one parameter update within an epoch. Informally, every time the model finishes training on one batch of data, one step is complete. In many cases, step and iteration mean the same thing.

iteration: in most cases, one step of the training process. An iteration covers the forward propagation, loss computation, backpropagation, and parameter update of one step. In some contexts there is a subtle distinction: iteration may refer to completing one forward plus backward pass, while step refers to the optimizer applying one parameter update. In the vast majority of cases, though, treating the two as identical is fine.

The definitions above become much clearer with a concrete example.

Suppose we have a training dataset (not counting the test set) with 1500 samples. Training on all 1500 samples once is one epoch. Because the dataset is large (1500 samples is certainly not large by neural-network standards; this is just an example), we want to split it into multiple batches and train batch by batch. If each batch holds 100 samples, it takes 15 batches to cover all the data; here the batch size is 100 and the number of batches is 15. And since, as noted above, finishing one batch completes one step, step and iteration are also both 15.

That covers training the dataset once (1 epoch); in practice we usually train multiple epochs. Suppose we train for 3 epochs, i.e., the 1500 samples are each trained 3 times. Then step and iteration change with the number of epochs: both become 45, because 15 * 3 = 45. The number of batches, however, stays 15, because it is counted within a single epoch and does not depend on how many epochs there are.
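
The arithmetic above in a few lines of Python (numbers taken from the example):

dataset_size, batch_size, epochs = 1500, 100, 3
batches_per_epoch = dataset_size // batch_size   # 15 (= steps/iterations per epoch)
total_steps = batches_per_epoch * epochs         # 45 (= total iterations)
print(batches_per_epoch, total_steps)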

※, Activation functions

https://zhuanlan.zhihu.com/p/364620596

https://zhuanlan.zhihu.com/p/428448728 [ReLU activation function]
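
A quick sketch of two common activation functions in PyTorch (assumes torch is installed):

import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(torch.relu(x))     # -> tensor([0., 0., 0., 1.5]): negatives clipped to 0
print(torch.sigmoid(x))  # squashes each value into (0, 1)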

※, Convolutional neural networks (CNN)

https://blog.csdn.net/TeFuirnever/article/details/100057229

※, PyTorch

https://blog.csdn.net/TeFuirnever/article/details/100034274

Official PyTorch Chinese documentation: here

★, Setting up the DeepSpeed environment on Win10 (2024-03-06 11:05:44)

  • Open cmd as the SYSTEM account
  • NameError: name '_C' is not defined: install the latest torch (2.2.1)
  • [ERROR]  Unable to pre-compile async_io

 

★,Tokenizer

A Tokenizer is a class that vectorizes text by converting it into sequences. A computer cannot understand the meaning of text directly, so each token (in Chinese, a single character or a word) is usually mapped to a positive integer, which turns a text into a sequence of integers; the sequence is then vectorized, and the vectorized data is fed into the model.
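
A minimal sketch of this pipeline with the Hugging Face AutoTokenizer (assumes transformers is installed; bert-base-chinese is just an illustrative model name):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
ids = tokenizer.encode("你好,世界")  # text -> list of integer token ids
print(ids)                            # e.g. [101, ..., 102]
print(tokenizer.decode(ids))          # ids -> text (round trip)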

★, Interpreting nvidia-smi

GPU utilization and GPU-memory usage in deep-learning training; setting num_workers & batch_size

Typical GPU task execution flow (flowchart from the linked post, not reproduced here).

On GPU Memory-Usage (GPU memory occupancy)

The most direct factors affecting Memory-Usage are the model size and the batch size. The model's contribution comes from its parameter count (network depth, width, etc.), but the model structure is usually fixed during training and rarely changed. In practice, therefore, we control Memory-Usage mainly through the batch size: for example, with batch size 12 Memory-Usage was 40%; with 24 it was 80%, close to a 2x relationship with little deviation. So with a fixed model structure, set the batch size as large as possible to make full use of GPU memory. (The GPU finishes computing the data you feed it very quickly; the main bottleneck on training time is the CPU's data throughput.)

On Volatile GPU-Util (GPU utilization)

Whether the GPU is used well depends mainly on how well it cooperates with the CPU, and also on memory bandwidth and similar factors. If the hand-off is smooth, utilization will be decent.

This is what the Volatile GPU-Util column shows. When the CPU worker count is not set well, this value jumps back and forth: 0%, 20%, 70%, 95%, 0%, pausing 1-2 seconds and then repeating. What is actually happening is that the GPU is waiting for data to arrive from the CPU; once the data crosses the bus into the GPU, the GPU starts computing and utilization jumps up, but the GPU is so powerful that it finishes the batch in roughly 0.5 s, so utilization drops again while it waits for the next batch. The GPU-utilization bottleneck is therefore memory bandwidth, memory type, and CPU performance. The best fix, naturally, is newer-generation or otherwise faster memory paired with a stronger CPU.

Another approach is to tune the data loading in PyTorch's DataLoader, including num_workers (the number of worker processes) and pin_memory=True, which improves speed by easing the data-transfer bandwidth bottleneck and the GPU's idle time. TensorFlow has equivalent data-loading settings.

To raise utilization, first set num_workers sensibly; 4, 8, and 16 are common choices. In my own tests, very large values (24, 32, etc.) actually reduced efficiency, because the data has to be split, preprocessed, and dispatched across that many workers, so setting it too high backfires. Setting it to 1 means a single CPU worker handles all preprocessing and transfer to the GPU, which is also slow. Second, if the server or PC has ample, fast RAM, enable pin_memory=True: the data then goes into page-locked memory that maps directly to the GPU's transfer path, skipping the extra staging copy in pageable RAM and saving some transfer time.
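
A minimal sketch of these DataLoader settings (assumes an existing train_dataset Dataset object; the numbers are illustrative):

from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,
    batch_size=64,     # larger batches raise Memory-Usage (see above)
    num_workers=8,     # 4 / 8 / 16 are common starting points
    pin_memory=True,   # page-locked host memory for faster host-to-GPU copies
    shuffle=True,
)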

※, ChatGLM2-6B

ChatGLM2-6B needs about 6 GB of GPU memory in int4 and about 13 GB in fp16.

ChatGLM-6B source-code walkthrough: cli_demo.py


  • model = AutoModel.from_pretrained("../ChatGLM-Tuning-master/chatglm-6b", trust_remote_code=True).half().cuda(): load the pretrained model and move it to the GPU, using half-precision floating point to speed up computation.


(89.51, centos7) RuntimeError: Library cudart is not initialized: see this article

  • Model quantization depends on cpm-kernels, and cpm-kernels calls libcudart.so. Check whether libcudart.so can be found with the following code:

    python -c "import ctypes.util; print(ctypes.util.find_library('cudart'))"

    If it returns None, install cudatoolkit manually; you may also need to adjust the LD_LIBRARY_PATH environment variable.

  • Unresolved

 

 

※, Multi-node multi-GPU

Quite a few new materials and tutorials appeared in December 2023 and early 2024.

The pain point of large-model training is that the parameter counts are huge, often tens of billions; training on a single GPU is essentially impossible, so multi-GPU or distributed training is needed. DeepSpeed, an open-source deep-learning optimization library developed by Microsoft, has become standard equipment for large-model training!

InfiniBand is a high-bandwidth network fabric for communication between NVIDIA GPU nodes.

NVLink: point-to-point GPU interconnect, much faster than PCIe. If the GPUs do not support NVLink, training or fine-tuning a large model will be slow, but it will not cause OOM.
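
To check how the GPUs inside a node are interconnected (NVLink vs. PCIe), nvidia-smi can print a topology matrix:

nvidia-smi topo -m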

Trading space for time.

★, DeepSpeed

  • Training: DeepSpeed ZeRO training supports the complete ZeRO stages 1, 2 and 3, plus CPU/Disk offload of optimizer states, gradients and parameters.

    • Stage 1: shard optimizer states across the data-parallel workers/GPUs.
    • Stage 2: shard optimizer states + gradients across the data-parallel workers/GPUs.
    • Stage 3: shard optimizer states + gradients + model parameters across the data-parallel workers/GPUs.
    • Optimizer Offload: offload optimizer states + gradients to CPU/Disk, building on ZeRO Stage 2.
    • Param Offload: offload model parameters to CPU/Disk, building on ZeRO Stage 3.

    Note: for Disk offload the disk should be NVMe for decent speed, although it technically works on any disk.

  • Inference: DeepSpeed ZeRO Inference supports ZeRO Stage 3 and ZeRO-Infinity. It uses the same ZeRO protocol as training, but it does not use the optimizer or the lr scheduler.
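
A minimal ZeRO config sketch matching the stages above (stage 3 with parameter and optimizer offload to CPU; the values are illustrative, and the full tested config appears later in these notes):

ds_config = {
    "zero_optimization": {
        "stage": 3,  # shard optimizer states + gradients + parameters
        "offload_param":     {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
}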

Some notes and lessons learned about DeepSpeed

 

★, Problems, key points, and other notes

Deploying the GPU version of ChatGLM-6B requires a CUDA build of torch. Check whether your torch is correct with the following command (Python code):

import torch
print(torch.cuda.is_available())
If the code above prints True, congratulations: you have the CUDA build of torch installed.

Environment consistency issues

you appear to be running an x server please exit x before installing

  /etc/init.d/lightdm stop


libcudart.so.11.0: cannot open shared object file: No such file or directory

  • conda install cudatoolkit  ❌
  • export LD_LIBRARY_PATH=

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType. (Found by debugging the source; caused by a missing file.)


 


Importing a module from the parent directory in Python:

# To import a module from the parent directory, add it to sys.path:
import sys
sys.path.append("..")
import xxx

How do you convert a model into the torch.nn.Module type?

<class 'transformers_modules.chatglm2-6b-int4.modeling_chatglm.ChatGLMForConditionalGeneration'>

How to convert a transformers model into a DeepSpeed model
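
Note that Hugging Face transformers models (including the ChatGLMForConditionalGeneration above) already subclass torch.nn.Module, so the first question needs no conversion at all; for the second, deepspeed.initialize wraps the module into a DeepSpeed engine. A minimal sketch (assumes deepspeed is installed and a ds_config dict like the one used elsewhere in these notes):

import deepspeed
from transformers import AutoModel

model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
# initialize returns (engine, optimizer, dataloader, lr_scheduler); keep the engine
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # ds_engine.module is still the original nn.Module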

cannot import name 'HfDeepSpeedConfig' from 'transformers.integrations'

 

Theory: PyTorch / Transformers / DeepSpeed

Integrating DeepSpeed into the Transformers framework

 

https://developer.baidu.com/article/details/2687080

In deep-learning training, as models and datasets grow, the memory and compute limits of a single GPU become the bottleneck on training speed. An effective remedy is model-parallel training: distribute different parts of the model across multiple compute devices so the computation proceeds in parallel, speeding up training.
1. Basic concepts of model-parallel training
Model-parallel training distributes different parts of a deep-learning model across different compute devices, each responsible for part of the computation, so that the work proceeds in parallel. The devices can be multiple GPUs, CPUs, or distributed compute nodes. This lets us exploit multi-GPU and multi-node resources to speed up training.
2. How PyTorch implements parallel training
In PyTorch, parallel training comes mainly in two flavors: data parallelism and model parallelism.

Data parallelism
Data parallelism splits the dataset into subsets, trains each subset on a different device, and then aggregates the results. In PyTorch this is provided by the DataParallel class: it splits the input into chunks, assigns each chunk to one GPU, runs a replica of the model on each GPU's chunk, then gathers the per-GPU results to produce the final output. torch.nn.parallel.DistributedDataParallel is the multi-process variant of the same idea: every process holds a full copy of the model's parameters, and gradients are averaged across processes after the backward pass. (Despite sometimes being presented as model parallelism, DDP is data parallelism; see the torchrun note below.)
Model parallelism
Model parallelism places different parts of the model on different devices. In PyTorch this is typically done by assigning submodules to different devices (e.g., some layers on cuda:0 and others on cuda:1) and moving activations between devices inside forward().
3. Application scenarios
Parallel training applies to all kinds of deep-learning models and tasks, especially those that demand a lot of compute and memory: large models such as Transformers in NLP and ResNets in computer vision, and data-heavy tasks such as speech recognition and recommender systems, can all be accelerated this way.
4. Caveats
When training in parallel, distribute the model and data across devices sensibly to keep the load balanced and performance high.
Because multiple devices must communicate and synchronize, parallel training introduces some extra overhead; in practice, weigh and tune this against the gains for the specific scenario.
Also note that the code and environment must be identical on every device, to avoid inconsistency problems.
Summary:
PyTorch's parallel-training facilities let us exploit multi-GPU and multi-node resources to accelerate deep-learning training. Choose the form of parallelism that fits the scenario, and pay attention to load balancing, performance tuning, and environment consistency.

★,torchrun is a python console script to the main module torch.distributed.run declared in the entry_points configuration in setup.py. It is equivalent to invoking python -m torch.distributed.run.

For two machines, do you fill in each machine's own IP or each other's? Both must use the same IP, e.g., both use machine 1's IP (the machine with node_rank=0).

export CUDA_VISIBLE_DEVICES="1,3,2"

--nnodes=1:4 (elastic launch: minimum 1 node, maximum 4)

The DDP (and DP) in torchrun/PyTorch is actually data parallelism. The "model parallelism" it mentions does not truly split the model's parameters across the GPUs of the different nodes; in essence every GPU still runs a full-parameter copy of the network, and the "model parallelism" refers merely to running these full-parameter replicas "in parallel".
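
A launch example for two machines (both commands point at machine 1's IP, per the note above; the script name train.py and one GPU per node are illustrative assumptions):

# On machine 1 (192.168.89.1):
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=192.168.89.1 --master_port=29500 train.py
# On machine 2:
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=192.168.89.1 --master_port=29500 train.py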

Driver problems, environment problems, version problems, network problems

Network bandwidth requirements

List of error messages:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 134.00 MiB. GPU 0 has a total capacty of 11.76 GiB of which 130.00 MiB is free. Including non-PyTorch memory, this process has 11.63 GiB memory in use. Of the allocated memory 11.36 GiB is allocated by PyTorch, and 22.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
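
For the fragmentation case mentioned at the end of this message, a common first mitigation is to tune the CUDA caching allocator before launching the process (the 128 MiB value is illustrative):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128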


RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer. This may indicate a possible application crash on rank 0 or a network set up issue.


server2: OSError: libcudart.so.11.0: cannot open shared object file: No such file or directory


server1: RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).


server3: torch.distributed.DistNetworkError: Connection reset by peer

  • Check that the versions of deepspeed, transformers, accelerate, torch, peft, and bitsandbytes are exactly the same on both machines

  • Changing the Python version from 3.8 to 3.10 fixed it (note: not necessarily that Python 3.8 cannot work; one of the three nodes was in fact running Python 3.8)

nvidia-smi driver error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

  • See this article. Verified working; the fix is as follows:
sudo apt install dkms
sudo dkms install -m nvidia -v 418.87.00
Here 418.87.00 is the version of the previously installed nvidia driver, which can be found with:
ls /usr/src | grep nvidia

Traceback (most recent call last):
  File "/data1/tong/GLM/ChatGLM2-6B/ptuning/deploy_demo_ds.py", line 55, in <module>
    model_hidden_size = config.d_model
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/configuration_utils.py", line 261, in __getattribute__
    return super().__getattribute__(key)
AttributeError: 'ChatGLMConfig' object has no attribute 'd_model'

  • ChatGLMConfig has no d_model field; the deploy script below works around this by hardcoding model_hidden_size = 2048 instead of reading config.d_model.


Loading THUDM/chatglm2-6b requires to execute some code in that repo, you can inspect the content of the repository at https://hf.co/THUDM/chatglm2-6b. You can dismiss this prompt by passing `trust_remote_code=True`.
Do you accept? [y/N] y
Traceback (most recent call last):
  File "/data1/tong/GLM/ChatGLM2-6B/ptuning/deploy_demo_ds.py", line 145, in <module>
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2670, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 797, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 769, in __init__
    self._configure_train_batch_size()
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 941, in _configure_train_batch_size
    self._set_batch_related_parameters()
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 909, in _set_batch_related_parameters
    grad_acc = train_batch // micro_batch
TypeError: unsupported operand type(s) for //: 'int' and 'str'

  • model = AutoModel.from_pretrained(model_name)
  • grad_acc = train_batch // micro_batch: one of the two config values was 1 and the other was "auto", hence this error. Make the types consistent.
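
An illustration of the mismatch (hypothetical config fragments; DeepSpeed computes gradient accumulation as train_batch // micro_batch):

# reproduces the TypeError above: int // str
bad  = {"train_batch_size": 1, "train_micro_batch_size_per_gpu": "auto"}
# fix: give both settings the same (numeric) type
good = {"train_batch_size": 1, "train_micro_batch_size_per_gpu": 1}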

[2024-03-05 15:09:11,736] [WARNING] [partition_parameters.py:921:_post_init_method] param `weight` in Embedding not on GPU so was not broadcasted from rank 0
[2024-03-05 15:09:12,121] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 0.27B parameters
Traceback (most recent call last):
  File "/data1/tong/GLM/ChatGLM2-6B/ptuning/deploy_demo_ds.py", line 145, in <module>
    model = AutoModel.from_pretrained(model_name,trust_remote_code=True)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2675, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 854, in __init__
    self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 741, in __init__
    self.embedding = init_method(Embedding, config, **init_kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/utils/init.py", line 53, in skip_init
    return module_cls(*args, **kwargs).to_empty(device=final_device)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 711, in __init__
    self.word_embeddings = nn.Embedding(
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 441, in wrapper
    self._post_init_method(module)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 924, in _post_init_method
    param.partition()
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1157, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1296, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1381, in _partition_param
    param.ds_tensor.copy_(src_tensor)
NotImplementedError: Cannot copy out of meta tensor; no data!

  • model = AutoModel.from_pretrained(xxx, empty_init=False) // from this article

[2024-03-05 15:42:56,108] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented


Traceback (most recent call last):
  File "/data1/tong/GLM/ChatGLM2-6B/ptuning/deploy_demo_ds.py", line 161, in <module>
    tokenizer = AutoTokenizer.from_pretrained(model_name)
  File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 688, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class ChatGLMTokenizer does not exist or is not currently imported.

  • tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

 


server3: master25:2726734:2726734 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation

The full error log follows:

server1: master33:3278978:3278978 [0] NCCL INFO Bootstrap : Using eno1:192.168.89.1<0>
server1: master33:3278978:3278978 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
server1: master33:3278978:3278978 [0] NCCL INFO cudaDriverVersion 12020
server1: NCCL version 2.19.3+cuda12.3
server3: master25:2726734:2726734 [0] NCCL INFO cudaDriverVersion 12020
server3: master25:2726734:2726734 [0] NCCL INFO Bootstrap : Using eno1:192.168.89.3<0>
★server3: master25:2726734:2726734 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
server1: master33:3278978:3279686 [0] NCCL INFO Failed to open libibverbs.so[.1]
server1: master33:3278978:3279686 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.89.1<0> [1]br-efcde779becf:172.21.0.1<0> [2]kube-ipvs0:10.96.0.1<0> [3]flannel.1:10.244.0.0<0> [4]cni0:10.244.0.1<0>
server1: master33:3278978:3279686 [0] NCCL INFO Using non-device net plugin version 0
server1: master33:3278978:3279686 [0] NCCL INFO Using network Socket
server3: master25:2726734:2726907 [0] NCCL INFO Failed to open libibverbs.so[.1]
server3: master25:2726734:2726907 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.89.3<0> [1]kube-ipvs0:10.107.9.123<0> [2]flannel.1:10.244.2.0<0> [3]cni0:10.244.2.1<0>
server3: master25:2726734:2726907 [0] NCCL INFO Using non-device net plugin version 0
server3: master25:2726734:2726907 [0] NCCL INFO Using network Socket
server3: master25:2726734:2726907 [0] NCCL INFO comm 0x90982a0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x8b900e73d3d05138 - Init START
server1: master33:3278978:3279686 [0] NCCL INFO comm 0xa167a40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x8b900e73d3d05138 - Init START
server1: master33:3278978:3279686 [0] NCCL INFO Setting affinity for GPU 0 to 55,55555555
server3: master25:2726734:2726907 [0] NCCL INFO Setting affinity for GPU 0 to 55,55555555
server1: master33:3278978:3279686 [0] NCCL INFO Channel 00/04 :    0   1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 01/04 :    0   1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 02/04 :    0   1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 03/04 :    0   1
server1: master33:3278978:3279686 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
server3: master25:2726734:2726907 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1
server3: master25:2726734:2726907 [0] NCCL INFO P2P Chunksize set to 131072
server1: master33:3278978:3279686 [0] NCCL INFO P2P Chunksize set to 131072
server1: master33:3278978:3279686 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/Socket/1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/Socket/2
server1: master33:3278978:3279686 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [receive] via NET/Socket/1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [receive] via NET/Socket/2
server1: master33:3278978:3279686 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Socket/2
server3: master25:2726734:2726907 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/Socket/1
server1: master33:3278978:3279686 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [send] via NET/Socket/1
server3: master25:2726734:2726907 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/Socket/3
server1: master33:3278978:3279686 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [send] via NET/Socket/2
server3: master25:2726734:2726907 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [receive] via NET/Socket/1
server3: master25:2726734:2726907 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [receive] via NET/Socket/3
server3: master25:2726734:2726907 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/Socket/1
server3: master25:2726734:2726907 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/Socket/3
server3: master25:2726734:2726907 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [send] via NET/Socket/1
server3: master25:2726734:2726907 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [send] via NET/Socket/3
server1: 
server1: master33:3278978:3279712 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 10.107.9.123<56199> failed : Software caused connection abort
server1: master33:3278978:3279712 [0] NCCL INFO misc/socket.cc:565 -> 2
server1: master33:3278978:3279712 [0] NCCL INFO misc/socket.cc:587 -> 2
server1: master33:3278978:3279712 [0] NCCL INFO transport/net_socket.cc:338 -> 2
server1: master33:3278978:3279712 [0] NCCL INFO transport/net.cc:677 -> 2
server1: master33:3278978:3279686 [0] NCCL INFO transport/net.cc:304 -> 2
server1: master33:3278978:3279686 [0] NCCL INFO transport.cc:148 -> 2
server1: master33:3278978:3279686 [0] NCCL INFO init.cc:1117 -> 2
server1: master33:3278978:3279686 [0] NCCL INFO init.cc:1396 -> 2
server1: master33:3278978:3279686 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
server1: master33:3278978:3278978 [0] NCCL INFO group.cc:418 -> 2
server1: master33:3278978:3278978 [0] NCCL INFO group.cc:95 -> 2
server3: 
server3: master25:2726734:2726908 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 10.96.0.1<58433> failed : Software caused connection abort
server3: master25:2726734:2726908 [0] NCCL INFO misc/socket.cc:565 -> 2
server3: master25:2726734:2726908 [0] NCCL INFO misc/socket.cc:587 -> 2
server3: master25:2726734:2726908 [0] NCCL INFO transport/net_socket.cc:338 -> 2
server3: master25:2726734:2726908 [0] NCCL INFO transport/net.cc:677 -> 2
server3: master25:2726734:2726907 [0] NCCL INFO transport/net.cc:304 -> 2
server3: master25:2726734:2726907 [0] NCCL INFO transport.cc:148 -> 2
server3: master25:2726734:2726907 [0] NCCL INFO init.cc:1117 -> 2
server3: master25:2726734:2726907 [0] NCCL INFO init.cc:1396 -> 2
server3: master25:2726734:2726907 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
server3: master25:2726734:2726734 [0] NCCL INFO group.cc:418 -> 2
server3: master25:2726734:2726734 [0] NCCL INFO group.cc:95 -> 2
server3: 
server3: master25:2726734:2726908 [0] proxy.cc:1523 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
server3: master25:2726734:2726908 [0] NCCL INFO misc/socket.cc:806 -> 3
server3: 
server3: master25:2726734:2726908 [0] proxy.cc:1533 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
server3: 
server3: master25:2726734:2726908 [0] proxy.cc:1557 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
server1: master33:3278978:3279712 [0] NCCL INFO misc/socket.cc:47 -> 3
server1: master33:3278978:3279712 [0] NCCL INFO misc/socket.cc:58 -> 3
server1: master33:3278978:3279712 [0] NCCL INFO misc/socket.cc:773 -> 3
server1: master33:3278978:3279712 [0] NCCL INFO proxy.cc:1374 -> 3
server1: master33:3278978:3279712 [0] NCCL INFO proxy.cc:1415 -> 3
server1: 
server1: master33:3278978:3279712 [0] proxy.cc:1557 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
server3: Traceback (most recent call last):
server3:   File "/data1/tong/GLM/ChatGLM2-6B/ptuning/deploy_demo_ds.py", line 146, in <module>
server3:     model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
server3:     return model_class.from_pretrained(
server3:   File "/root/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2675, in from_pretrained
server3:     model = cls(config, *model_args, **model_kwargs)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
server3:     f(module, *args, **kwargs)
server3:   File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 854, in __init__
server3:     self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
server3:     f(module, *args, **kwargs)
server3:   File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 741, in __init__
server3:     self.embedding = init_method(Embedding, config, **init_kwargs)
server3:   File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 48, in default_init
server3:     return cls(*args, **kwargs)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 434, in wrapper
server3:     f(module, *args, **kwargs)
server3:   File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 711, in __init__
server3:     self.word_embeddings = nn.Embedding(
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 441, in wrapper
server3:     self._post_init_method(module)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 915, in _post_init_method
server3:     dist.broadcast(param, 0, self.get_dp_process_group())
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 116, in log_wrapper
server3:     return func(*args, **kwargs)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 216, in broadcast
server3:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 188, in broadcast
server3:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
server3:     return func(*args, **kwargs)
server3:   File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
server3:     work = default_pg.broadcast([tensor], opts)
server3: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
server3: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
server3: Last error:
server3: socketStartConnect: Connect to 10.96.0.1<58433> failed : Software caused connection abort
  • python -c "import torch;print(torch.cuda.nccl.version())" //(2, 19, 3)
  • NCCL (NVIDIA Collective Communications Library) implements multi-GPU collective communication (all-gather, reduce, broadcast), with many NVIDIA optimizations to achieve high throughput over PCIe, NVLink, and InfiniBand. NCCL 1.0 supported only single-node multi-GPU, with the GPUs communicating via PCIe, NVLink, or GPU Direct P2P. NCCL 2.0 supports multi-node multi-GPU, with inter-node communication via Sockets (Ethernet) or InfiniBand with GPU Direct RDMA.
  • NCCL_DEBUG=info NCCL_SOCKET_IFNAME=eno1 NCCL_IB_DISABLE=1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py // see this article
  • Firewall problems
 

DeepSpeed + ChatGLM2-6B test conclusions:

Conclusions
1. ChatGLM2-6B-int4, single machine, direct deployment for inference: success; ~4.5 GB of GPU memory.
2. ChatGLM2-6B, single machine, direct deployment for inference: OOM.
3. ChatGLM2-6B, single machine, DeepSpeed without offload: OOM.
4. ChatGLM2-6B, single machine, DeepSpeed ZeRO-3 with CPU offload: success; ~4518 MiB of GPU memory (/12288 MiB); CPU usage clearly higher; about 16 GB of RAM.
5. ChatGLM2-6B, multi-machine (2 GPU 3080 Ti test machines), DeepSpeed ZeRO-3 without offload: success; peak GPU memory 11502 MiB (of 12288 MiB per card); network traffic around 11.5 MB/s (the maximum the existing hardware could sustain); took about 3 hours.
6. ChatGLM2-6B, multi-machine (2 GPU 3080 Ti), DeepSpeed ZeRO-3 with CPU offload: success; highest observed GPU memory 8080 MiB (not certain this was the peak); network traffic around 11.5 MB/s (hardware maximum); took about 1 hour 40 minutes.

Environment: DeepSpeed + ChatGLM2-6B, RTX 3080 Ti

(base) [root@master33 ptuning]$hostnamectl 
   Static hostname: master33
         Icon name: computer-server
           Chassis: server
        Machine ID: 61badc2da433498cb35d496f6d7b34a8
           Boot ID: 41aee70b71564c7fbca8e2e8e62a76e4
  Operating System: Ubuntu 20.04.6 LTS
            Kernel: Linux 5.15.0-92-generic
      Architecture: x86-64

(base) [root@master33 ptuning]$pip list|grep -E "torch|deepspeed|trans|accelerate|peft|bitsandbytes"
accelerate                                   0.27.2
bitsandbytes                                 0.42.0
deepspeed                                    0.10.0
peft                                         0.9.0
s3transfer                                   0.7.0
torch                                        2.2.1
transformers                                 4.30.2


Single-machine run succeeded; script and log below. Fast: about 2 minutes using 89.1 or 89.3.

The script:
(base) [root@master33 ptuning]$cat deploy_demo_ds.py 
#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py


from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

#model_name = "bigscience/T0_3B"
model_name = "THUDM/chatglm2-6b"
#model_name = "THUDM/chatglm2-6b-int4"

config = AutoConfig.from_pretrained(model_name,trust_remote_code=True)
#model_hidden_size = config.d_model
model_hidden_size = 2048

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
# - want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
# - which params should remain on gpus - the larger the value the smaller the offload size
#
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "checkpoint": {
        "use_node_local_storage": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    #text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    text_in = "你叫什么名字"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")
The log:
(base) [root@master25 ptuning]$deepspeed --num_gpus 1 deploy_demo_ds.py
[2024-03-07 16:05:56,460] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 16:05:57,890] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-03-07 16:05:57,890] [INFO] [runner.py:555:main] cmd = /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None deploy_demo_ds.py
[2024-03-07 16:05:59,380] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 16:06:00,776] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-03-07 16:06:00,776] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-03-07 16:06:00,776] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-03-07 16:06:00,776] [INFO] [launch.py:163:main] dist_world_size=1
[2024-03-07 16:06:00,776] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-03-07 16:06:03,385] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 16:06:03,631] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-07 16:06:03,632] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-03-07 16:06:03,632] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-07 16:06:13,327] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:13<00:00,  1.89s/it]
[2024-03-07 16:06:26,561] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
[2024-03-07 16:06:26,571] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-03-07 16:06:26,572] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2024-03-07 16:06:26,666] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-03-07 16:06:26,667] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-03-07 16:06:26,667] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 18.56 GB, percent = 14.8%
Parameter Offload: Total persistent parameters: 362496 in 85 params
[2024-03-07 16:06:26,775] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-03-07 16:06:26,776] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-03-07 16:06:26,776] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 18.56 GB, percent = 14.8%
[2024-03-07 16:06:26,777] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   amp_enabled .................. False
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   amp_params ................... False
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
[2024-03-07 16:06:26,777] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fac0e4130d0>
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   communication_data_type ...... None
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   disable_allgather ............ False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   dump_state ................... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   elasticity_enabled ........... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   fp16_enabled ................. False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   global_rank .................. 0
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   loss_scale ................... 1.0
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   memory_breakdown ............. False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   mics_shard_size .............. -1
[2024-03-07 16:06:26,778] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   optimizer_name ............... None
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   optimizer_params ............. None
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   pld_enabled .................. False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   pld_params ................... False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   prescale_gradients ........... False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   scheduler_name ............... None
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   scheduler_params ............. None
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   sparse_attention ............. None
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   steps_per_print .............. 2000
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   train_batch_size ............. 1
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   use_node_local_storage ....... True
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   world_size ................... 1
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   zero_enabled ................. True
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
[2024-03-07 16:06:26,779] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
[2024-03-07 16:06:26,779] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": false
    }, 
    "bf16": {
        "enabled": true
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "reduce_bucket_size": 4.194304e+06, 
        "stage3_prefetch_bucket_size": 3.774874e+06, 
        "stage3_param_persistence_threshold": 2.048000e+04
    }, 
    "checkpoint": {
        "use_node_local_storage": true
    }, 
    "steps_per_print": 2.000000e+03, 
    "train_batch_size": 1, 
    "train_micro_batch_size_per_gpu": 1, 
    "wall_clock_breakdown": false
}
/root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
rank0:
   in=你好
  out=你好,我是人工智能助手。 根据问题,我们需要回答“人工智能助手”的定义
[2024-03-07 16:06:48,830] [INFO] [launch.py:347:main] Process 459826 exits successfully.

Two machines, ZeRO-3 without offload, ran successfully; script and log below. Fairly slow: using 89.1 and 89.3 it took nearly 3 hours (2 hours 53 minutes).

 cat deploy_demo_ds.py
#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py


from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

#model_name = "bigscience/T0_3B"
model_name = "THUDM/chatglm2-6b"
#model_name = "THUDM/chatglm2-6b-int4"

config = AutoConfig.from_pretrained(model_name,trust_remote_code=True)
#model_hidden_size = config.d_model
model_hidden_size = 2048

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
# - want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
# - which params should remain on gpus - the larger the value the smaller the offload size
#
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "checkpoint": {
        "use_node_local_storage": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    #text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    text_in = "你叫什么名字"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")
(base) [root@master33 ptuning]$NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
[2024-03-10 16:38:39,797] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-10 16:38:41,812] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-03-10 16:38:41,813] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
server3: [2024-03-10 16:38:43,741] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-10 16:38:43,880] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-03-10 16:38:45,146] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-03-10 16:38:45,407] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-03-10 16:38:47,756] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-10 16:38:48,029] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-03-10 16:38:48,029] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-10 16:38:48,574] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-10 16:38:48,775] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-03-10 16:38:48,775] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-10 16:38:48,775] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-03-10 16:39:34,659] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [42:07<00:00, 361.11s/it]   
Loading checkpoint shards: 100%|██████████| 7/7 [43:23<00:00, 371.91s/it]   
server1: [2024-03-10 17:22:58,082] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-03-10 17:22:58,092] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-03-10 17:22:58,093] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-03-10 17:22:58,194] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-03-10 17:22:58,195] [INFO] [utils.py:786:see_memory_usage] MA 5.82 GB         Max_MA 6.81 GB         CA 7.99 GB         Max_CA 8 GB 
server1: [2024-03-10 17:22:58,195] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 5.82 GB, percent = 4.6%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-03-10 17:22:58,310] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-03-10 17:22:58,311] [INFO] [utils.py:786:see_memory_usage] MA 5.82 GB         Max_MA 5.82 GB         CA 7.99 GB         Max_CA 8 GB 
server1: [2024-03-10 17:22:58,311] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 5.82 GB, percent = 4.6%
server1: [2024-03-10 17:22:58,312] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-03-10 17:22:58,312] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-03-10 17:22:58,312] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-03-10 17:22:58,312] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-03-10 17:22:58,312] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fd8c7460eb0>
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-03-10 17:22:58,313] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-03-10 17:22:58,314] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
server1: /data1/common/python/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server1:   warnings.warn(
server3: /root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server3:   warnings.warn(
server1: rank0:
server1:    in=你叫什么名字
server1:   out=你叫什么名字? 
server1: 答: 我是一个名为 ChatGLM2-6
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever I have ever
server3: [2024-03-10 19:31:37,252] [INFO] [launch.py:347:main] Process 496465 exits successfully.
server1: [2024-03-10 19:31:37,559] [INFO] [launch.py:347:main] Process 86940 exits successfully.
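
Note that both answers above are cut off after roughly 20 tokens: that is the default max_length the UserWarning in the log complains about. As the warning itself suggests, passing max_new_tokens to generate lengthens the output; a hypothetical one-line change to the script would be:

outputs = ds_engine.module.generate(inputs, max_new_tokens=128, synced_gpus=True)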

With CPU offload configured, the run took about 1 hour 40 minutes. The only change relative to the previous ds_config is an offload_param block inside zero_optimization, excerpted just below; it matches the config dump echoed in the log that follows.
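
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",      # park the ZeRO-3 parameter partitions in host RAM
            "pin_memory": True    # pinned host memory speeds up CPU<->GPU copies
        },
        # overlap_comm, contiguous_gradients and the bucket sizes stay unchanged
    },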

(base) [root@master33 ptuning]$NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
[2024-03-10 09:37:12,374] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-10 09:37:14,564] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-03-10 09:37:14,565] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
server3: [2024-03-10 09:37:16,677] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-10 09:37:16,713] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-03-10 09:37:18,094] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-03-10 09:37:18,249] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-03-10 09:37:20,699] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-10 09:37:20,947] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-03-10 09:37:20,947] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-10 09:37:21,360] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-10 09:37:21,553] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-03-10 09:37:21,553] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-10 09:37:21,553] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-03-10 09:55:05,084] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [27:14<00:00, 233.48s/it]
server1: [2024-03-10 10:22:19,454] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-03-10 10:22:19,464] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-03-10 10:22:19,465] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-03-10 10:22:19,564] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-03-10 10:22:19,565] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-03-10 10:22:19,565] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 13.09 GB, percent = 10.4%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-03-10 10:22:19,680] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-03-10 10:22:19,681] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-03-10 10:22:19,681] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 13.09 GB, percent = 10.4%
server1: [2024-03-10 10:22:19,682] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-03-10 10:22:19,682] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-03-10 10:22:19,682] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-03-10 10:22:19,682] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-03-10 10:22:19,682] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fdd809406d0>
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-03-10 10:22:19,683] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-03-10 10:22:19,684] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "offload_param": {
server1:             "device": "cpu", 
server1:             "pin_memory": true
server1:         }, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
server1: /data1/common/python/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server1:   warnings.warn(
Loading checkpoint shards: 100%|██████████| 7/7 [27:14<00:00, 233.56s/it]
server3: /root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server3:   warnings.warn(
server1: rank0:
server1:    in=请介绍下戈尔迪之结的来历
server1:   out=请介绍下戈尔迪之结的来历。 戈尔迪之
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever I have ever
server3: [2024-03-10 11:17:13,508] [INFO] [launch.py:347:main] Process 1625240 exits successfully.
server1: [2024-03-10 11:17:14,176] [INFO] [launch.py:347:main] Process 1894897 exits successfully.

One more run with the same CPU-offload configuration; log below:

(base) [root@master33 ptuning]$NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
[2024-03-26 15:34:20,559] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-26 15:34:22,553] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-03-26 15:34:22,554] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
server3: [2024-03-26 15:34:24,513] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-26 15:34:24,706] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-26 15:34:25,922] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-03-26 15:34:25,922] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-03-26 15:34:25,922] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-03-26 15:34:25,922] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-03-26 15:34:25,923] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-03-26 15:34:25,923] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-03-26 15:34:26,283] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-03-26 15:34:28,563] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-03-26 15:34:28,808] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-03-26 15:34:28,808] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-26 15:34:29,578] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-03-26 15:34:29,800] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-03-26 15:34:29,801] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-03-26 15:34:29,801] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-03-26 15:52:13,213] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [27:15<00:00, 233.68s/it]
server1: [2024-03-26 16:19:29,029] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-03-26 16:19:29,040] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-03-26 16:19:29,041] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-03-26 16:19:29,146] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-03-26 16:19:29,147] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-03-26 16:19:29,147] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.01 GB, percent = 12.8%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-03-26 16:19:29,265] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-03-26 16:19:29,266] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-03-26 16:19:29,266] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.01 GB, percent = 12.8%
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fc8ea1b26a0>
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-03-26 16:19:29,267] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-03-26 16:19:29,268] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-03-26 16:19:29,269] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "offload_param": {
server1:             "device": "cpu", 
server1:             "pin_memory": true
server1:         }, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
Loading checkpoint shards: 100%|██████████| 7/7 [27:15<00:00, 233.68s/it]
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever.
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever. The food is disgusting and the service is terrible. I will never go back. negative
server1: rank0:
server1:    in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy.
server1:   out=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy. It is made of high quality materials and is well made. The non-stick coating is perfect and makes the skillet easy to use. The price is a little high but it is worth it. I would definitely recommend this skillet to anyone looking
server3: rank1:
server3:    in=中国古代经典小说《红楼梦》的作者是谁?
server3:   out=中国古代经典小说《红楼梦》的作者是谁? 曹雪芹。
server3: before while
server3: start...
server3: 
server1: rank0:
server1:    in=你好,请做下自我介绍!
server1:   out=你好,请做下自我介绍! 我是人工智能助手 ChatGLM2-6B,一个基于语言模型的人工智能助手。我的任务是针对用户的问题和要求提供适当的答复和支持。
server1: before while
server1: start...

No CPU offload this time: two inference passes, with timing prints added. Conclusion: each pass takes roughly 2 hours (the first took 2 hours 08 minutes, the second 1 hour 32 minutes)!!! (over a 100 Mbps Maiwei switch). A minimal sketch of the assumed timing instrumentation comes first, then the code, logs, and screenshots.
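
The exact timing code isn't reproduced in the listing below, so the following is only a sketch of what such a print might look like, reusing the datetime module the script already imports:

t0 = datetime.datetime.now()
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
print(f"rank{rank}: inference took {datetime.datetime.now() - t0}")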

Code:

#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py


from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os, datetime
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

#model_name = "bigscience/T0_3B"
model_name = "THUDM/chatglm2-6b"
#model_name = "THUDM/chatglm2-6b-int4"

config = AutoConfig.from_pretrained(model_name,trust_remote_code=True)
#model_hidden_size = config.d_model
model_hidden_size = 2048

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
# - want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
# - which params should remain on gpus - the larger the value the smaller the offload size
#
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "checkpoint": {
        "use_node_local_storage": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
print(f"第一次推理时间:{datetime.datetime.now()}")
rank = torch.distributed.get_rank()
if rank == 0:
    #text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    text_in = "你叫什么名字"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

print(f"第2次推理时间:{datetime.datetime.now()}")
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    # text_in = "你叫什么名字"
elif rank == 1:
    # text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
    text_in = "安徽的省会是哪里?"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

Run log:

 (base) [root@master25 ptuning]$NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
[2024-04-08 09:35:45,863] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-08 09:35:47,642] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-04-08 09:35:47,643] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
server3: [2024-04-08 09:35:49,534] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-08 09:35:49,712] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-04-08 09:35:50,917] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-04-08 09:35:51,248] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-04-08 09:35:53,587] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-08 09:35:53,911] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-04-08 09:35:53,911] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-08 09:35:54,391] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-08 09:35:54,571] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-04-08 09:35:54,571] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-08 09:35:54,571] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-04-08 09:36:40,589] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [42:10<00:00, 361.53s/it]   
server3: 第一次推理时间:2024-04-08 10:18:51.831791
Loading checkpoint shards: 100%|██████████| 7/7 [43:26<00:00, 372.32s/it]   
server1: [2024-04-08 10:20:06,846] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-04-08 10:20:06,857] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-04-08 10:20:06,858] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-04-08 10:20:06,960] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-04-08 10:20:06,960] [INFO] [utils.py:786:see_memory_usage] MA 5.82 GB         Max_MA 6.81 GB         CA 7.99 GB         Max_CA 8 GB 
server1: [2024-04-08 10:20:06,961] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.84 GB, percent = 7.0%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-04-08 10:20:07,077] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-04-08 10:20:07,077] [INFO] [utils.py:786:see_memory_usage] MA 5.82 GB         Max_MA 5.82 GB         CA 7.99 GB         Max_CA 8 GB 
server1: [2024-04-08 10:20:07,078] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 8.84 GB, percent = 7.0%
server1: [2024-04-08 10:20:07,078] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f41eaef8e80>
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-04-08 10:20:07,079] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-04-08 10:20:07,080] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-04-08 10:20:07,081] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-04-08 10:20:07,081] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-04-08 10:20:07,081] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
server1: 第一次推理时间:2024-04-08 10:20:07.082254
server1: /data1/common/python/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server1:   warnings.warn(
server3: /root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server3:   warnings.warn(
server1: rank0:
server1:    in=你叫什么名字
server1:   out=你叫什么名字? 
server1: 答: 我是一个名为 ChatGLM2-6
server1: 第2次推理时间:2024-04-08 12:28:50.366853
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever I have ever
server3: 第2次推理时间:2024-04-08 12:28:50.366859
server1: Input length of input_ids is 23, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
server3: rank1:
server3:    in=安徽的省会是哪里?
server3:   out=安徽的省会是哪里? 
server3: 
server3: 安徽的省会是合肥。
server1: rank0:
server1:    in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy
server1:   out=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy This
server1: [2024-04-08 14:00:15,574] [INFO] [launch.py:347:main] Process 1265565 exits successfully.
server3: [2024-04-08 14:00:15,790] [INFO] [launch.py:347:main] Process 3665211 exits successfully.
(base) [root@master25 ptuning]$
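Both logs show the same UserWarning ("Using `max_length`'s default (20) ..."), and the answers are visibly cut off ("我是一个名为 ChatGLM2-6", "...ever buy This"). A minimal fix sketch: pass an explicit max_new_tokens to generate(), keeping synced_gpus=True so that under ZeRO-3 both ranks keep executing forward passes in lockstep until the longest sequence finishes:

# Sketch: give generate() an explicit token budget so outputs are not truncated
# at the default max_length=20. On this 100 Mbps setup, run time grows roughly
# linearly with the number of generated tokens, so budget accordingly.
with torch.no_grad():
    outputs = ds_engine.module.generate(
        inputs,
        max_new_tokens=128,   # budget for newly generated tokens; tune as needed
        synced_gpus=True,     # ZeRO-3: all ranks must run the same number of forwards
    )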

With CPU offload, printing the inference times: it looks about the same as without CPU offload (which does not match the earlier conclusion...). Recording it here for later reference.

Code:

 #!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py


from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os,datetime
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

#model_name = "bigscience/T0_3B"
model_name = "THUDM/chatglm2-6b"
#model_name = "THUDM/chatglm2-6b-int4"

config = AutoConfig.from_pretrained(model_name,trust_remote_code=True)
#model_hidden_size = config.d_model
model_hidden_size = 2048

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
# - want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
# - which params should remain on gpus - the larger the value the smaller the offload size
#
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "checkpoint": {
        "use_node_local_storage": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
print(f"第一次推理时间:{datetime.datetime.now()}")
rank = torch.distributed.get_rank()
if rank == 0:
    #text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    text_in = "你叫什么名字"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

print(f"第2次推理时间:{datetime.datetime.now()}")
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    # text_in = "你叫什么名字"
elif rank == 1:
    # text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
    text_in = "安徽的省会是哪里?"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

Log:

 (base) [root@master25 ptuning]$NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
[2024-04-08 19:20:54,006] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-08 19:20:55,803] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-04-08 19:20:55,803] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
server3: [2024-04-08 19:20:57,684] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-08 19:20:57,936] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-04-08 19:20:59,096] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-04-08 19:20:59,566] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-04-08 19:20:59,567] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-04-08 19:20:59,567] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-04-08 19:20:59,567] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-04-08 19:20:59,567] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-04-08 19:20:59,567] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-04-08 19:21:01,718] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-08 19:21:01,968] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-04-08 19:21:01,968] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-08 19:21:02,862] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-08 19:21:03,061] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-04-08 19:21:03,061] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-08 19:21:03,061] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-04-08 19:38:47,403] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [27:14<00:00, 233.45s/it]
server1: [2024-04-08 20:06:01,577] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-04-08 20:06:01,588] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-04-08 20:06:01,589] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-04-08 20:06:01,695] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-04-08 20:06:01,696] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-04-08 20:06:01,696] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.18 GB, percent = 12.9%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-04-08 20:06:01,817] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-04-08 20:06:01,818] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-04-08 20:06:01,818] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.17 GB, percent = 12.9%
server1: [2024-04-08 20:06:01,818] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f092dda06a0>
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-04-08 20:06:01,819] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-04-08 20:06:01,820] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-04-08 20:06:01,821] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "offload_param": {
server1:             "device": "cpu", 
server1:             "pin_memory": true
server1:         }, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
server1: 第一次推理时间:2024-04-08 20:06:01.822467
server1: /data1/common/python/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server1:   warnings.warn(
Loading checkpoint shards: 100%|██████████| 7/7 [27:14<00:00, 233.45s/it]
server3: 第一次推理时间:2024-04-08 20:06:02.258538
server3: /root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server3:   warnings.warn(
server1: rank0:
server1:    in=你叫什么名字
server1:   out=你叫什么名字? 
server1: 答: 我是一个名为 ChatGLM2-6
server1: 第2次推理时间:2024-04-08 22:14:02.114204
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever I have ever
server3: 第2次推理时间:2024-04-08 22:14:02.118180
server1: Input length of input_ids is 23, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
server3: rank1:
server3:    in=安徽的省会是哪里?
server3:   out=安徽的省会是哪里? 
server3: 
server3: 安徽的省会是合肥。
server1: rank0:
server1:    in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy
server1:   out=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy This
server3: [2024-04-08 23:45:29,034] [INFO] [launch.py:347:main] Process 119763 exits successfully.
server1: [2024-04-08 23:45:29,933] [INFO] [launch.py:347:main] Process 3562202 exits successfully.
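Computing the per-inference durations from the timestamps printed in the two logs backs up the "about the same" impression. A quick sketch with the values copied verbatim from the logs above (the end of the second inference is approximated by the process-exit timestamps):

from datetime import datetime

def span(a, b):
    fmt = "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(b, fmt) - datetime.strptime(a, fmt)

# No CPU offload (morning run)
print(span("2024-04-08 10:20:07", "2024-04-08 12:28:50"))  # 1st inference: 2:08:43
print(span("2024-04-08 12:28:50", "2024-04-08 14:00:15"))  # 2nd inference: 1:31:25

# With CPU offload (evening run)
print(span("2024-04-08 20:06:01", "2024-04-08 22:14:02"))  # 1st inference: 2:08:01
print(span("2024-04-08 22:14:02", "2024-04-08 23:45:29"))  # 2nd inference: 1:31:27

The durations match almost to the second, which again points at the 100 Mbps interconnect, rather than parameter placement, as the bottleneck.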

With CPU offload, one more run log.

Code:

 #!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py


from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os,datetime
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

#model_name = "bigscience/T0_3B"
model_name = "THUDM/chatglm2-6b"
#model_name = "THUDM/chatglm2-6b-int4"

config = AutoConfig.from_pretrained(model_name,trust_remote_code=True)
#model_hidden_size = config.d_model
model_hidden_size = 2048

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you don't
# - want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to control
# - which params should remain on gpus - the larger the value the smaller the offload size
#
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "checkpoint": {
        "use_node_local_storage": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
print(f"第一次推理时间:{datetime.datetime.now()}")
rank = torch.distributed.get_rank()
if rank == 0:
    #text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    text_in = "你叫什么名字"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

print(f"第2次推理时间:{datetime.datetime.now()}")
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
    # text_in = "你叫什么名字"
elif rank == 1:
    # text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
    text_in = "安徽的省会是哪里?"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

Run log:

 (base) [root@master25 ptuning]$NCCL_SOCKET_IFNAME=eno1 deepspeed --hostfile=hostfile --num_nodes 2 --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
[2024-04-09 09:20:20,297] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-09 09:20:22,159] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: server1,server3
[2024-04-09 09:20:22,159] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w server1,server3 export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/data1/tong/GLM/ChatGLM2-6B/ptuning;  cd /data1/tong/GLM/ChatGLM2-6B/ptuning; /root/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=%n --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
server3: [2024-04-09 09:20:24,252] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-09 09:20:24,289] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-09 09:20:25,693] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
server3: [2024-04-09 09:20:25,693] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server3: [2024-04-09 09:20:25,694] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
server3: [2024-04-09 09:20:25,694] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server3: [2024-04-09 09:20:25,694] [INFO] [launch.py:163:main] dist_world_size=2
server3: [2024-04-09 09:20:25,694] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:163:main] dist_world_size=2
server1: [2024-04-09 09:20:25,863] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
server3: [2024-04-09 09:20:28,333] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server3: [2024-04-09 09:20:28,579] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server3: [2024-04-09 09:20:28,579] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-09 09:20:29,050] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
server1: [2024-04-09 09:20:29,266] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
server1: [2024-04-09 09:20:29,266] [INFO] [comm.py:616:init_distributed] cdb=None
server1: [2024-04-09 09:20:29,266] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
server1: [2024-04-09 09:38:33,975] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████| 7/7 [30:51<00:00, 264.55s/it]
server1: [2024-04-09 10:09:25,847] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
server1: [2024-04-09 10:09:25,858] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
server1: [2024-04-09 10:09:25,860] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
server1: [2024-04-09 10:09:25,959] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
server1: [2024-04-09 10:09:25,960] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-04-09 10:09:25,960] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.17 GB, percent = 12.9%
server1: Parameter Offload: Total persistent parameters: 362496 in 85 params
server1: [2024-04-09 10:09:26,081] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
server1: [2024-04-09 10:09:26,081] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
server1: [2024-04-09 10:09:26,082] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.17 GB, percent = 12.9%
server1: [2024-04-09 10:09:26,082] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   activation_checkpointing_config  {
server1:     "partition_activations": false, 
server1:     "contiguous_memory_optimization": false, 
server1:     "cpu_checkpointing": false, 
server1:     "number_checkpoints": null, 
server1:     "synchronize_checkpoint_boundary": false, 
server1:     "profile": false
server1: }
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   amp_enabled .................. False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   amp_params ................... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   autotuning_config ............ {
server1:     "enabled": false, 
server1:     "start_step": null, 
server1:     "end_step": null, 
server1:     "metric_path": null, 
server1:     "arg_mappings": null, 
server1:     "metric": "throughput", 
server1:     "model_info": null, 
server1:     "results_dir": "autotuning_results", 
server1:     "exps_dir": "autotuning_exps", 
server1:     "overwrite": true, 
server1:     "fast": true, 
server1:     "start_profile_step": 3, 
server1:     "end_profile_step": 5, 
server1:     "tuner_type": "gridsearch", 
server1:     "tuner_early_stopping": 5, 
server1:     "tuner_num_trials": 50, 
server1:     "model_info_path": null, 
server1:     "mp_size": 1, 
server1:     "max_train_batch_size": null, 
server1:     "min_train_batch_size": 1, 
server1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
server1:     "min_train_micro_batch_size_per_gpu": 1, 
server1:     "num_tuning_micro_batch_sizes": 3
server1: }
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f8d905956d0>
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   communication_data_type ...... None
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   disable_allgather ............ False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   dump_state ................... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
server1: [2024-04-09 10:09:26,083] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   elasticity_enabled ........... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   flops_profiler_config ........ {
server1:     "enabled": false, 
server1:     "recompute_fwd_factor": 0.0, 
server1:     "profile_step": 1, 
server1:     "module_depth": -1, 
server1:     "top_modules": 1, 
server1:     "detailed": true, 
server1:     "output_file": null
server1: }
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   fp16_enabled ................. False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   global_rank .................. 0
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   loss_scale ................... 1.0
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   memory_breakdown ............. False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   mics_shard_size .............. -1
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   nebula_config ................ {
server1:     "enabled": false, 
server1:     "persistent_storage_path": null, 
server1:     "persistent_time_interval": 100, 
server1:     "num_of_version_in_retention": 2, 
server1:     "enable_nebula_load": true, 
server1:     "load_path": null
server1: }
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   optimizer_name ............... None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   optimizer_params ............. None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   pld_enabled .................. False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   pld_params ................... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   prescale_gradients ........... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   scheduler_name ............... None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   scheduler_params ............. None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   sparse_attention ............. None
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   steps_per_print .............. 2000
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   train_batch_size ............. 2
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   use_node_local_storage ....... True
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   world_size ................... 2
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
server1: [2024-04-09 10:09:26,084] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
server1: [2024-04-09 10:09:26,085] [INFO] [config.py:964:print]   zero_enabled ................. True
server1: [2024-04-09 10:09:26,085] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
server1: [2024-04-09 10:09:26,085] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
server1: [2024-04-09 10:09:26,085] [INFO] [config.py:950:print_user_config]   json = {
server1:     "fp16": {
server1:         "enabled": false
server1:     }, 
server1:     "bf16": {
server1:         "enabled": true
server1:     }, 
server1:     "zero_optimization": {
server1:         "stage": 3, 
server1:         "offload_param": {
server1:             "device": "cpu", 
server1:             "pin_memory": true
server1:         }, 
server1:         "overlap_comm": true, 
server1:         "contiguous_gradients": true, 
server1:         "reduce_bucket_size": 4.194304e+06, 
server1:         "stage3_prefetch_bucket_size": 3.774874e+06, 
server1:         "stage3_param_persistence_threshold": 2.048000e+04
server1:     }, 
server1:     "checkpoint": {
server1:         "use_node_local_storage": true
server1:     }, 
server1:     "steps_per_print": 2.000000e+03, 
server1:     "train_batch_size": 2, 
server1:     "train_micro_batch_size_per_gpu": 1, 
server1:     "wall_clock_breakdown": false
server1: }
server1: 第一次推理时间:2024-04-09 10:09:26.086356
server1: --1
server1: --3
server1: --3
server1: --4
server1: --5
server1: /data1/common/python/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server1:   warnings.warn(
Loading checkpoint shards: 100%|██████████| 7/7 [30:51<00:00, 264.56s/it]
server3: 第一次推理时间:2024-04-09 10:09:26.592210
server3: --1
server3: --3
server3: --3
server3: --4
server3: --5
server3: /root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
server3:   warnings.warn(
server1: --6
server3: --6
server1: rank0:
server1:    in=你叫什么名字
server1:   out=你叫什么名字? 
server1: 答: 我是一个名为 ChatGLM2-6
server1: 第一次推理耗时8297201.61986351 ms (2小时18分)
server1: 第2次推理时间:2024-04-09 12:27:43.288030
server3: rank1:
server3:    in=Is this review positive or negative? Review: this is the worst restaurant ever
server3:   out=Is this review positive or negative? Review: this is the worst restaurant ever I have ever
server3: 第一次推理耗时8296694.418668747 ms
server3: 第2次推理时间:2024-04-09 12:27:43.286708
server1: Input length of input_ids is 23, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
server3: rank1:
server3:    in=安徽的省会是哪里?
server3:   out=安徽的省会是哪里? 
server3: 
server3: 安徽的省会是合肥。
server3: 第二次推理耗时5734279.165506363 ms(1小时36分钟)
server1: rank0:
server1:    in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy
server1:   out=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy This
server1: 第二次推理耗时5734281.681060791 ms
server3: [2024-04-09 14:03:19,782] [INFO] [launch.py:347:main] Process 1048854 exits successfully.
server1: [2024-04-09 14:03:21,414] [INFO] [launch.py:347:main] Process 250383 exits successfully.

The UserWarning below (also visible in the run log) is emitted because generate() falls back to the default max_length of 20; it is resolved by passing max_new_tokens explicitly:

/root/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:1353: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.

  • outputs = ds_engine.module.generate(inputs, synced_gpus=True, max_new_tokens=100)

To run this as an interactive service: pdsh cannot forward Python's interactive input(), so the deepspeed runner (which is invoked on a single machine and fans out over pdsh) cannot be used. Instead, launch the process on every node yourself, PyTorch-launcher style; the logs below invoke deepspeed.launcher.launch directly on each node.
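
As an aside, the long --world_info value in those per-node launch commands is just base64-encoded JSON mapping each hostname to the list of local GPU indices it contributes. A minimal sketch (standard library only; the encoding choice mirrors what the DeepSpeed runner appears to use) that reproduces the value from the logs:

import base64
import json

world_info = {"server1": [0], "server3": [0]}  # one GPU per node, as in this two-node run
encoded = base64.urlsafe_b64encode(json.dumps(world_info).encode()).decode()
print(encoded)  # eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= - matches the launch commands below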

The script:

#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here the upstream example uses a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so
# 1 largish or 2 small GPUs can handle it, or 1 small GPU and a lot of CPU memory. (This copy loads
# "THUDM/chatglm2-6b" instead; see model_name below.)
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py


from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig  # note: moved to transformers.integrations in newer transformers releases
import deepspeed
import os,datetime,time
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

#model_name = "bigscience/T0_3B"
model_name = "THUDM/chatglm2-6b"
#model_name = "THUDM/chatglm2-6b-int4"

config = AutoConfig.from_pretrained(model_name,trust_remote_code=True)
#model_hidden_size = config.d_model  # T0 exposes d_model; chatglm2's config does not
model_hidden_size = 2048  # hardcoded for this run; only used to size the ZeRO buckets below

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you
#   don't want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to
#   control which params should remain on gpus - the larger the value the smaller the offload size
#
# For in-depth info on the Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
'''
ds_config = {
    "fp16": {
        "enabled": True
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "checkpoint": {
        "use_node_local_storage": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time, which is
# less efficient and may fail when there is little CPU RAM
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference


# inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
# with torch.no_grad():
#     outputs = ds_engine.module.generate(inputs, synced_gpus=True)
# text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
# print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

print(f"开始交互式推理")
while True:
    print(f"当前时间:{datetime.datetime.now()}")
    rank = torch.distributed.get_rank()
    if rank == 0:
        # text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
        text_in = input("\n用户0:")
    elif rank == 1:
        # text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
        text_in = input("\n用户1:")

    if text_in.strip() == "stop":
        break
    t4 = time.time()
    print(f"1 {datetime.datetime.now()}...text_in:{text_in}")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    print(f"2 {datetime.datetime.now()}...tokenizer:{tokenizer}")
    inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
    print(f"3 {datetime.datetime.now()}...inputs:{inputs}")
    with torch.no_grad():
        outputs = ds_engine.module.generate(inputs, synced_gpus=True, max_new_tokens=30)
    print(f"4 {datetime.datetime.now()}...outputs:{outputs}")
    text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"5...")
    print(f"用户{rank}:\n   in={text_in}\n  out={text_out}")
    t5 = time.time()
    print(f"本轮推理耗时{(t5 - t4)*1000} ms\n")

Run log:

Node 1 (the master node):
(base) [root@master33 ptuning]$NCCL_SOCKET_IFNAME=eno1 /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=0 --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
[2024-04-17 19:37:55,128] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-17 19:37:56,885] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
[2024-04-17 19:37:56,885] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
[2024-04-17 19:37:56,885] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
[2024-04-17 19:37:56,885] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
[2024-04-17 19:37:56,885] [INFO] [launch.py:163:main] dist_world_size=2
[2024-04-17 19:37:56,885] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-17 19:38:00,151] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-17 19:38:00,362] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-17 19:38:00,362] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-04-17 19:38:00,362] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-04-17 19:55:48,239] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 7/7 [27:14<00:00, 233.55s/it]
[2024-04-17 20:23:03,149] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
[2024-04-17 20:23:03,159] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-04-17 20:23:03,161] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2024-04-17 20:23:03,264] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-04-17 20:23:03,265] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-04-17 20:23:03,265] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.86 GB, percent = 13.4%
Parameter Offload: Total persistent parameters: 362496 in 85 params
[2024-04-17 20:23:03,385] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-04-17 20:23:03,386] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-04-17 20:23:03,386] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 16.85 GB, percent = 13.4%
[2024-04-17 20:23:03,387] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   amp_enabled .................. False
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   amp_params ................... False
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-04-17 20:23:03,387] [INFO] [config.py:964:print]   bfloat16_enabled ............. False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f9edfefab80>
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   communication_data_type ...... None
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   disable_allgather ............ False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   dump_state ................... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   elasticity_enabled ........... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   fp16_auto_cast ............... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   fp16_enabled ................. True
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   global_rank .................. 0
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 65536
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   loss_scale ................... 0
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   memory_breakdown ............. False
[2024-04-17 20:23:03,388] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   mics_shard_size .............. -1
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   optimizer_name ............... None
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   optimizer_params ............. None
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   pld_enabled .................. False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   pld_params ................... False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   prescale_gradients ........... False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   scheduler_name ............... None
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   scheduler_params ............. None
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   sparse_attention ............. None
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   steps_per_print .............. 2000
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   train_batch_size ............. 2
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   use_node_local_storage ....... True
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   world_size ................... 2
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   zero_enabled ................. True
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
[2024-04-17 20:23:03,389] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
[2024-04-17 20:23:03,389] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": true
    }, 
    "bf16": {
        "enabled": false
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "reduce_bucket_size": 4.194304e+06, 
        "stage3_prefetch_bucket_size": 3.774874e+06, 
        "stage3_param_persistence_threshold": 2.048000e+04
    }, 
    "checkpoint": {
        "use_node_local_storage": true
    }, 
    "steps_per_print": 2.000000e+03, 
    "train_batch_size": 2, 
    "train_micro_batch_size_per_gpu": 1, 
    "wall_clock_breakdown": false
}
开始交互式推理
当前时间:2024-04-17 20:23:03.391038

用户0:你好,介绍下你自己
1 2024-04-17 20:26:53.509871...text_in:你好,介绍下你自己
2 2024-04-17 20:26:53.561594...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-17 20:26:53.578529...inputs:tensor([[64790, 64792, 36474, 54591, 31123, 32025, 54578, 39376]],
       device='cuda:0')
4 2024-04-18 01:01:06.943339...outputs:tensor([[64790, 64792, 36474, 54591, 31123, 32025, 54578, 39376, 35416, 33031,
         31718, 54746, 31645, 31155,    13,    13, 33030, 32132, 32914, 54940,
         54645, 30932, 38628, 34797, 42481, 31155, 54546, 37893, 31799, 32330,
         31940, 31668, 30932, 31934, 32006, 36295, 31639, 31201]],
       device='cuda:0')
5...
用户0:
   in=你好,介绍下你自己
  out=你好,介绍下你自己和你公司的产品或服务。

我是来自中国的张三,是一名人工智能助手。我可以用自然语言处理技术,帮助人们解答问题、
本轮推理耗时16453436.004161835 ms

当前时间:2024-04-18 01:01:06.945884

用户0:安徽的省会是哪里?
1 2024-04-18 08:51:18.150903...text_in:安徽的省会是哪里?
2 2024-04-18 08:51:18.213205...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-18 08:51:18.230416...inputs:tensor([[64790, 64792, 30910, 33240, 54530, 54833, 45798, 33120, 31514]],
       device='cuda:0')
4 2024-04-18 13:25:48.953237...outputs:tensor([[64790, 64792, 30910, 33240, 54530, 54833, 45798, 33120, 31514, 30910,
            13,    13, 33240, 54530, 54833, 45798, 35606, 31155,     2]],
       device='cuda:0')
5...
用户0:
   in=安徽的省会是哪里?
  out=安徽的省会是哪里? 

安徽的省会是合肥。
本轮推理耗时16470804.099321365 ms

当前时间:2024-04-18 13:25:48.954983

用户0:写一首关于月亮的四言绝句
1 2024-04-18 13:45:12.674692...text_in:写一首关于月亮的四言绝句
2 2024-04-18 13:45:12.738515...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-18 13:45:12.755618...inputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 37752, 41807, 54994, 55384,
         55390]], device='cuda:0')
4 2024-04-18 18:24:06.553752...outputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 37752, 41807, 54994, 55384,
         55390,    13,    13,    13,    13, 37752, 54589, 55990, 56786, 58259,
         54538, 30932, 59731, 55822, 54627, 55312, 55364, 54902, 56513, 31155,
           265,    13, 54852, 56060, 49447, 34313, 54655, 30932, 56181, 55786,
         54586]], device='cuda:0')
5...
用户0:
   in=写一首关于月亮的四言绝句
  out=写一首关于月亮的四言绝句



月亮高挂碧霄中,皎洁如银似白莲。  
清辉照耀天地间,祥瑞之
本轮推理耗时16733881.56747818 ms

当前时间:2024-04-18 18:24:06.556266

用户0:



Node 2:
(base) [root@master25 ptuning]$NCCL_SOCKET_IFNAME=eno1 /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMyI6IFswXX0= --node_rank=1 --master_addr=192.168.89.1 --master_port=29500 deploy_demo_ds.py
[2024-04-17 19:38:01,000] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-17 19:38:02,408] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
[2024-04-17 19:38:02,408] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server3': [0]}
[2024-04-17 19:38:02,408] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
[2024-04-17 19:38:02,408] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server3': [1]})
[2024-04-17 19:38:02,408] [INFO] [launch.py:163:main] dist_world_size=2
[2024-04-17 19:38:02,408] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-17 19:38:05,049] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-17 19:38:05,289] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-17 19:38:05,289] [INFO] [comm.py:616:init_distributed] cdb=None
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 7/7 [27:15<00:00, 233.66s/it]
开始交互式推理
当前时间:2024-04-17 20:23:04.450904

用户1:红楼梦的作者是谁?
1 2024-04-17 20:27:13.342097...text_in:红楼梦的作者是谁?
2 2024-04-17 20:27:13.392741...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-17 20:27:13.409613...inputs:tensor([[64790, 64792, 30910, 42470, 53422, 36289, 31514]], device='cuda:0')
4 2024-04-18 01:01:06.940493...outputs:tensor([[64790, 64792, 30910, 42470, 53422, 36289, 31514,    13,    13, 42470,
         53422, 54532, 37502, 33172, 54561, 56307, 55534, 57772, 31155,     2]],
       device='cuda:0')
5...
用户1:
   in=红楼梦的作者是谁?
  out=红楼梦的作者是谁?

红楼梦的作者是清代小说家曹雪芹。
本轮推理耗时16433600.287437439 ms

当前时间:2024-04-18 01:01:06.942412

用户1:新约和旧约有什么区别?
1 2024-04-18 08:51:56.450500...text_in:新约和旧约有什么区别?
2 2024-04-18 08:51:56.512916...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-18 08:51:56.530356...inputs:tensor([[64790, 64792, 33640, 55064, 54542, 55659, 55064, 33277, 34213, 31514]],
       device='cuda:0')
4 2024-04-18 13:25:48.957868...outputs:tensor([[64790, 64792, 33640, 55064, 54542, 55659, 55064, 33277, 34213, 31514,
         30910,    13,    13, 54575, 55064, 54542, 55659, 55064, 54532, 38046,
         54538, 38411, 31911, 31726, 30932, 32542, 32695, 34213, 31685, 33947,
         31155,    13,    13, 55659, 55064, 30946, 20866, 14891, 30945, 31779]],
       device='cuda:0')
5...
用户1:
   in=新约和旧约有什么区别?
  out=新约和旧约有什么区别? 

新约和旧约是基督教中非常重要的两个部分,它们之间的区别非常显著。

旧约(Old Testament)包括
本轮推理耗时16432509.668111801 ms

当前时间:2024-04-18 13:25:48.960165

用户1:写一首关于亲情的七言律诗
1 2024-04-18 13:46:02.574377...text_in:写一首关于亲情的七言律诗
2 2024-04-18 13:46:02.637360...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-18 13:46:02.655157...inputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 55113, 34684, 55254, 54994,
         55134, 55475]], device='cuda:0')
4 2024-04-18 18:24:06.555363...outputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 55113, 34684, 55254, 54994,
         55134, 55475,    13,    13, 41437, 54625, 55364, 39671, 56786, 54683,
         30932,    13, 55100, 55994, 54579, 48893, 40014, 31155,    13, 34365,
         39003, 54908, 55169, 55169, 30932,    13, 34089, 41561, 54623, 54664,
         55994, 31155]], device='cuda:0')
5...
用户1:
   in=写一首关于亲情的七言律诗
  out=写一首关于亲情的七言律诗

亲情长似蓝天碧水,
血浓于水流不息。
生日聚会乐融融,
岁月流转情更浓。
本轮推理耗时16683983.394861221 ms

当前时间:2024-04-18 18:24:06.557781

用户1:

Three-node inference: each round of inference takes roughly 6 hours. The script and logs follow:

Three-node inference script

#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here the upstream example uses a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so
# 1 largish or 2 small GPUs can handle it, or 1 small GPU and a lot of CPU memory. (This copy loads
# "THUDM/chatglm2-6b" instead; see model_name below.)
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py


from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig  # note: moved to transformers.integrations in newer transformers releases
import deepspeed
import os,datetime,time
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

#model_name = "bigscience/T0_3B"
model_name = "THUDM/chatglm2-6b"
#model_name = "THUDM/chatglm2-6b-int4"

config = AutoConfig.from_pretrained(model_name,trust_remote_code=True)
#model_hidden_size = config.d_model  # T0 exposes d_model; chatglm2's config does not
model_hidden_size = 2048  # hardcoded for this run; only used to size the ZeRO buckets below

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you
#   don't want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to
#   control which params should remain on gpus - the larger the value the smaller the offload size
#
# For in-depth info on the Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
'''
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
'''
ds_config = {
    "fp16": {
        "enabled": True
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "checkpoint": {
        "use_node_local_storage": True
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time, which is
# less efficient and may fail when there is little CPU RAM
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name,trust_remote_code=True, empty_init=False)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference


# inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
# with torch.no_grad():
#     outputs = ds_engine.module.generate(inputs, synced_gpus=True)
# text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
# print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

print(f"开始交互式推理")
while True:
    print(f"当前时间:{datetime.datetime.now()}")
    rank = torch.distributed.get_rank()
    if rank == 0:
        # text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
        text_in = input("\n用户0:")
    elif rank == 1:
        # text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
        text_in = input("\n用户1:")
    elif rank == 2:
        # text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
        text_in = input("\n用户2:")

    if text_in.strip() == "stop":
        break
    t4 = time.time()
    print(f"1 {datetime.datetime.now()}...text_in:{text_in}")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    print(f"2 {datetime.datetime.now()}...tokenizer:{tokenizer}")
    inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
    print(f"3 {datetime.datetime.now()}...inputs:{inputs}")
    with torch.no_grad():
        outputs = ds_engine.module.generate(inputs, synced_gpus=True, max_new_tokens=30)
    print(f"4 {datetime.datetime.now()}...outputs:{outputs}")
    text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"5...")
    print(f"用户{rank}:\n   in={text_in}\n  out={text_out}")
    t5 = time.time()
    print(f"本轮推理耗时{(t5 - t4)*1000} ms\n")

Three-node inference log:

Node 1, the master node (192.168.89.3 was chosen):
(base) [root@master25 ptuning]$NCCL_SOCKET_IFNAME=eno1 /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMiI6IFswXSwgInNlcnZlcjMiOiBbMF19 --node_rank=0 --master_addr=192.168.89.3 --master_port=29500 deploy_demo_ds.py
[2024-04-19 16:58:23,476] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:58:24,880] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eno1
[2024-04-19 16:58:24,880] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server2': [0], 'server3': [0]}
[2024-04-19 16:58:24,880] [INFO] [launch.py:151:main] nnodes=3, num_local_procs=1, node_rank=0
[2024-04-19 16:58:24,880] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server2': [1], 'server3': [2]})
[2024-04-19 16:58:24,880] [INFO] [launch.py:163:main] dist_world_size=3
[2024-04-19 16:58:24,880] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-19 16:58:27,484] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:58:27,720] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-19 16:58:27,720] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-04-19 16:58:27,720] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-04-19 17:17:53,327] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.24B parameters
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [31:17<00:00, 268.27s/it]
[2024-04-19 17:49:11,244] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
[2024-04-19 17:49:11,254] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-04-19 17:49:11,255] [INFO] [logging.py:96:log_dist] [Rank 0] Creating ZeRO Offload
[2024-04-19 17:49:11,348] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-04-19 17:49:11,349] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.99 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-04-19 17:49:11,349] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 11.85 GB, percent = 9.4%
Parameter Offload: Total persistent parameters: 362496 in 85 params
[2024-04-19 17:49:11,457] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-04-19 17:49:11,458] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 1.49 GB         Max_CA 1 GB 
[2024-04-19 17:49:11,458] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 11.85 GB, percent = 9.4%
[2024-04-19 17:49:11,459] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2024-04-19 17:49:11,459] [INFO] [config.py:964:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2024-04-19 17:49:11,459] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-04-19 17:49:11,459] [INFO] [config.py:964:print]   amp_enabled .................. False
[2024-04-19 17:49:11,459] [INFO] [config.py:964:print]   amp_params ................... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   bfloat16_enabled ............. False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f245e7bbcd0>
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   communication_data_type ...... None
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   disable_allgather ............ False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   dump_state ................... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   elasticity_enabled ........... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   fp16_auto_cast ............... False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   fp16_enabled ................. True
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   global_rank .................. 0
[2024-04-19 17:49:11,460] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 1
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   gradient_clipping ............ 0.0
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 65536
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   loss_scale ................... 0
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   memory_breakdown ............. False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   mics_shard_size .............. -1
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   optimizer_name ............... None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   optimizer_params ............. None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   pld_enabled .................. False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   pld_params ................... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   prescale_gradients ........... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   scheduler_name ............... None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   scheduler_params ............. None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   sparse_attention ............. None
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   steps_per_print .............. 2000
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   train_batch_size ............. 3
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   use_node_local_storage ....... True
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   world_size ................... 3
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  False
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=4194304 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=3774873 param_persistence_threshold=20480 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   zero_enabled ................. True
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
[2024-04-19 17:49:11,461] [INFO] [config.py:964:print]   zero_optimization_stage ...... 3
[2024-04-19 17:49:11,462] [INFO] [config.py:950:print_user_config]   json = {
    "fp16": {
        "enabled": true
    }, 
    "bf16": {
        "enabled": false
    }, 
    "zero_optimization": {
        "stage": 3, 
        "offload_param": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "reduce_bucket_size": 4.194304e+06, 
        "stage3_prefetch_bucket_size": 3.774874e+06, 
        "stage3_param_persistence_threshold": 2.048000e+04
    }, 
    "checkpoint": {
        "use_node_local_storage": true
    }, 
    "steps_per_print": 2.000000e+03, 
    "train_batch_size": 3, 
    "train_micro_batch_size_per_gpu": 1, 
    "wall_clock_breakdown": false
}
开始交互式推理
当前时间:2024-04-19 17:49:11.463137

用户0:你好
1 2024-04-19 17:56:19.307686...text_in:你好
2 2024-04-19 17:56:19.357040...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-19 17:56:19.374337...inputs:tensor([[64790, 64792, 36474, 54591]], device='cuda:0')
4 2024-04-20 00:03:12.646182...outputs:tensor([[64790, 64792, 36474, 54591, 31123, 33030, 22011, 10461, 30944, 30943,
         30941, 30978, 30949, 31123, 30910, 32347, 54565, 32093, 42481, 31155,
         54546, 34161, 34941, 34030, 54532, 12980, 30944, 30943, 30941, 30978,
         30949, 31123, 30910, 32288]], device='cuda:0')
5...
用户0:
   in=你好
  out=你好,我是 ChatGLM2-6B, 一个人工智能助手。我背后使用的模型是 GLM2-6B, 是一种
本轮推理耗时22013340.56162834 ms

当前时间:2024-04-20 00:03:12.648256

用户0:写一首关于月亮的诗
1 2024-04-20 00:12:26.084211...text_in:写一首关于月亮的诗
2 2024-04-20 00:12:26.144560...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-20 00:12:26.161506...inputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 37752, 45113]],
       device='cuda:0')
4 2024-04-20 06:19:27.499328...outputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 37752, 45113, 30910, 41881,
         54589, 55990, 54614, 35171, 30932,    13, 59731, 55822, 54627, 55312,
         55364, 54902, 55193, 31155,    13, 54595, 54578, 54852, 40921, 55459,
         55406, 30932,    13, 33961, 39861, 58423, 33263, 31155]],
       device='cuda:0')
5...
用户0:
   in=写一首关于月亮的诗
  out=写一首关于月亮的诗 明月高挂天空中,
皎洁如银似白龙。
月下清风吹绿树,
一片宁静渲心中。
本轮推理耗时22021417.449235916 ms

当前时间:2024-04-20 06:19:27.501668

用户0:
-----------------
Node 2 (192.168.89.2)
(base) [root@master34 ptuning]$NCCL_SOCKET_IFNAME=eno1 /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMiI6IFswXSwgInNlcnZlcjMiOiBbMF19 --node_rank=1 --master_addr=192.168.89.3 --master_port=29500 deploy_demo_ds.py
[2024-04-19 16:58:59,531] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:59:00,932] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=eno1
[2024-04-19 16:59:00,932] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server2': [0], 'server3': [0]}
[2024-04-19 16:59:00,933] [INFO] [launch.py:151:main] nnodes=3, num_local_procs=1, node_rank=1
[2024-04-19 16:59:00,933] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server2': [1], 'server3': [2]})
[2024-04-19 16:59:00,933] [INFO] [launch.py:163:main] dist_world_size=3
[2024-04-19 16:59:00,933] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-19 16:59:03,797] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:59:04,017] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-19 16:59:04,017] [INFO] [comm.py:616:init_distributed] cdb=None
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [31:17<00:00, 268.27s/it]
开始交互式推理
当前时间:2024-04-19 17:49:11.795627

用户1:安徽的省会是哪里?
1 2024-04-19 17:56:29.232194...text_in:安徽的省会是哪里?
2 2024-04-19 17:56:29.285311...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-19 17:56:29.302017...inputs:tensor([[64790, 64792, 30910, 33240, 54530, 54833, 45798, 33120, 31514]],
       device='cuda:0')
4 2024-04-20 00:03:12.643464...outputs:tensor([[64790, 64792, 30910, 33240, 54530, 54833, 45798, 33120, 31514, 30910,
            13,    13, 33240, 54530, 54833, 45798, 35606, 31155,     2]],
       device='cuda:0')
5...
用户1:
   in=安徽的省会是哪里?
  out=安徽的省会是哪里? 

安徽的省会是合肥。
本轮推理耗时22003413.06900978 ms

当前时间:2024-04-20 00:03:12.645291

用户1:写一首关于太阳的诗
1 2024-04-20 00:12:34.545304...text_in:写一首关于太阳的诗
2 2024-04-20 00:12:34.611573...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-20 00:12:34.628495...inputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 33146, 45113]],
       device='cuda:0')
4 2024-04-20 06:19:27.496629...outputs:tensor([[64790, 64792, 30910, 55172, 38572, 31809, 33146, 45113, 30910, 33146,
         55674, 31123, 34607, 36022, 54530, 57227, 55190, 30910, 31822, 42059,
         49447, 54666, 35196, 30910, 31822, 33027, 46276, 54666, 36505, 30910,
         31822, 32824, 31123, 32067, 35818, 30910, 31822, 33219]],
       device='cuda:0')
5...
用户1:
   in=写一首关于太阳的诗
  out=写一首关于太阳的诗 太阳啊,你是天空的瑰宝 你的光芒照耀着大地 你的温暖滋润着万物 你的美丽,无法形容 你的伟大
本轮推理耗时22012953.52959633 ms

当前时间:2024-04-20 06:19:27.498841

用户1:
------------------------------
Node 3 (192.168.89.1)
(base) [root@master33 ptuning]$NCCL_SOCKET_IFNAME=eno1 /data1/common/python/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyJzZXJ2ZXIxIjogWzBdLCAic2VydmVyMiI6IFswXSwgInNlcnZlcjMiOiBbMF19 --node_rank=2 --master_addr=192.168.89.3 --master_port=29500 deploy_demo_ds.py
[2024-04-19 16:58:30,385] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:58:31,942] [INFO] [launch.py:138:main] 2 NCCL_SOCKET_IFNAME=eno1
[2024-04-19 16:58:31,942] [INFO] [launch.py:145:main] WORLD INFO DICT: {'server1': [0], 'server2': [0], 'server3': [0]}
[2024-04-19 16:58:31,942] [INFO] [launch.py:151:main] nnodes=3, num_local_procs=1, node_rank=2
[2024-04-19 16:58:31,942] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'server1': [0], 'server2': [1], 'server3': [2]})
[2024-04-19 16:58:31,942] [INFO] [launch.py:163:main] dist_world_size=3
[2024-04-19 16:58:31,942] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-19 16:58:35,273] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-19 16:58:35,484] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-19 16:58:35,484] [INFO] [comm.py:616:init_distributed] cdb=None
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [31:17<00:00, 268.28s/it]
开始交互式推理
当前时间:2024-04-19 17:49:11.861933

用户2:红楼梦的作者是谁?
1 2024-04-19 17:56:41.513744...text_in:红楼梦的作者是谁?
2 2024-04-19 17:56:41.562707...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-19 17:56:41.579732...inputs:tensor([[64790, 64792, 30910, 42470, 53422, 36289, 31514]], device='cuda:0')
4 2024-04-20 00:03:12.647159...outputs:tensor([[64790, 64792, 30910, 42470, 53422, 36289, 31514,    13,    13, 42470,
         53422, 54532, 37502, 33172, 54561, 56307, 55534, 57772, 31155,     2]],
       device='cuda:0')
5...
用户2:
   in=红楼梦的作者是谁?
  out=红楼梦的作者是谁?

红楼梦的作者是清代小说家曹雪芹。
本轮推理耗时21991135.238409042 ms

当前时间:2024-04-20 00:03:12.648993

用户2:工业革命为什么发生在欧洲
1 2024-04-20 00:12:59.758721...text_in:工业革命为什么发生在欧洲
2 2024-04-20 00:12:59.840548...tokenizer:ChatGLMTokenizer(name_or_path='THUDM/chatglm2-6b', vocab_size=64794, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='left', truncation_side='right', special_tokens={}, clean_up_tokenization_spaces=False)
3 2024-04-20 00:12:59.858234...inputs:tensor([[64790, 64792, 30910, 32068, 32123, 32148, 37537, 32857]],
       device='cuda:0')
4 2024-04-20 06:19:27.495783...outputs:tensor([[64790, 64792, 30910, 32068, 32123, 32148, 37537, 32857, 33740, 31722,
         31808, 31514, 30910,    13,    13, 32068, 32123, 34369, 54534, 32857,
         33740, 31722, 31808, 31763, 30932, 34024, 31949, 32857, 34892, 31201,
         31832, 43805, 31993, 54626, 39305, 31155,    13,    13]],
       device='cuda:0')
5...
用户2:
   in=工业革命为什么发生在欧洲
  out=工业革命为什么发生在欧洲而不是其他地区? 

工业革命之所以在欧洲而不是其他地区发生,主要是由于欧洲的社会、政治和经济条件所导致的。


本轮推理耗时21987739.241361618 ms

当前时间:2024-04-20 06:19:27.497969

用户2:

Fine-tuning and training case references:

Finetuning Wenzhong2.0-GPT2-3.5B-chinese blows up GPU memory; offload_param seems not to take effect #111

Distributed multi-node multi-GPU training hangs, then errors out after a timeout #123

Key points:

  • From the code, the model has already been partitioned; GPU memory blows up while gathering parameters during the optimizer step. As a first step, try lowering stage3_max_live_parameters (see the sketch after this list).
  • DeepSpeed partitions parameters automatically, and the split is not necessarily strictly per layer: internally it checks whether a layer's parameters are used anywhere outside forward, and if so that layer is treated as non-partitionable. This depends on the model implementation; for now the minimum appears to be about 13 GB of GPU memory, and further optimization is planned.
  • What is actually called here is estimate_zero3_model_states_mem_needs_all_cold; the all_live variant assumes every layer is partitionable and therefore gives a theoretical lower bound. In practice some layers are not, and the OOM above happens precisely while gathering the parameters of those non-partitionable layers.
  • activation checkpointing
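
To make the first and last points concrete, here is a minimal, untested sketch (my own, not taken from the issues above) of the two ds_config knobs involved; the values are placeholders, not tuned recommendations:

# Sketch only. stage3_max_live_parameters caps how many fully-gathered parameter
# elements ZeRO-3 keeps resident per GPU at any one time (default 1e9); shrinking
# it lowers the peak memory of the gathers at the cost of extra communication.
ds_config_memory_patch = {
    "zero_optimization": {
        "stage": 3,
        "stage3_max_live_parameters": 1e8,  # placeholder; reduce further if gathers still OOM
    },
    # Activation checkpointing trades recompute for activation memory. Note that
    # this config block alone does not enable it: the model must actually
    # checkpoint (e.g. HF gradient checkpointing or DeepSpeed's checkpointing API).
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
    },
}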


※, Machine Learning Systems: Design and Implementation (《机器学习系统:设计和实现》)

Tutorial: https://openmlsys.github.io/chapter_preface/index.html

Other resources:

https://github.com/openmlsys/openmlsys-zh

Online LaTeX editing (for writing a book online): https://www.overleaf.com/home-2

[Advanced] Transformer architecture explained: model training and backpropagation [learned a great deal here; it walks through the principles of the entire training process]

Mainstream machine learning frameworks: TensorFlow, PyTorch, PaddlePaddle, MindSpore

MindSpore multi-node multi-GPU example: see this article

Huawei MindSpore (昇思): the official site has fairly detailed examples, but I never got them to run; the common feedback is that its ecosystem is weaker than PyTorch's

★,

★,

※,通义千问 Qwen-14B-Chat

※,Huggingface

★,DeepSpeed

To check whether your GPUs support NVLink, use the command

nvidia-smi topo -p2p n
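
Alternatively, nvidia-smi topo -m prints the full topology matrix; NVLink connections appear there as NV1/NV2/... entries.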

Estimate how much memory the model needs:

python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("THUDM/chatglm2-6b"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)'

(base) [root@master33 ptuning]$python -c 'from transformers import AutoModel; \
> from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
> model = AutoModel.from_pretrained("THUDM/chatglm2-6b"); \
> estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)'
[2024-03-11 11:20:24,517] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading THUDM/chatglm2-6b requires to execute some code in that repo, you can inspect the content of the repository at https://hf.co/THUDM/chatglm2-6b. You can dismiss this prompt by passing `trust_remote_code=True`.
Do you accept? [y/N] y
Loading THUDM/chatglm2-6b requires to execute some code in that repo, you can inspect the content of the repository at https://hf.co/THUDM/chatglm2-6b. You can dismiss this prompt by passing `trust_remote_code=True`.
Do you accept? [y/N] y
Loading checkpoint shards:   0%|                                                                                                           | 0/7 [00:00<?, ?it/s]Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00,  1.39s/it]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 6243M total params, 266M largest layer params.
  per CPU  |  per GPU |   Options
  157.00GB |   0.99GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  157.00GB |   0.99GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  139.55GB |  12.62GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  139.55GB |  12.62GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    1.49GB | 105.66GB | offload_param=none, offload_optimizer=none, zero_init=1
   34.89GB | 105.66GB | offload_param=none, offload_optimizer=none, zero_init=0
(base) [root@master33 ptuning]$
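
My reading of the table above: as the header says, the estimates cover parameters, optimizer states and gradients, i.e. a full fine-tuning setup, which is why the CPU-offload rows need about 157 GB of host RAM. For the pure-inference deployment earlier in this section only the parameter memory matters, which is consistent with offload_param=cpu bringing per-GPU usage down to roughly 1 GB.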

 
