[PyTorch] Transformer Engine error: RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.

Problem Description

One day, while running MoE-related code with Megatron-LM, I hit this error:

Exception: The specified pointer resides on host memory and is not registered with any CUDA device.
Traceback (most recent call last):
  File "/workspace/userdata/moelb/Megatron-LM/tests/unit_tests/transformer/moe/xxx.py", line 309, in perf_test
    output, _ = self.model(hidden_states)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/userdata/moelb/Megatron-LM/megatron/core/distributed/data_parallel_base.py", line 22, in forward
    return self.module(*inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
    return inner()
           ^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1805, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/userdata/moelb/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 251, in forward
    output, mlp_bias = custom_forward(hidden_states)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/userdata/moelb/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 217, in custom_forward
    expert_output, mlp_bias = self.experts(
                              ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
    return inner()
           ^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1805, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/userdata/moelb/Megatron-LM/megatron/core/transformer/moe/experts.py", line 780, in forward
    intermediate_parallel, bias_parallel = self.linear_fc1(
                                           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
    return inner()
           ^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1805, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/userdata/moelb/Megatron-LM/megatron/core/extensions/transformer_engine.py", line 1059, in forward
    out = super().forward(x, m_splits, is_first_microbatch=_is_first_microbatch)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 749, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/module/grouped_linear.py", line 654, in forward
    out = linear_fn(*args)
          ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 578, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/module/grouped_linear.py", line 158, in forward
    _ = general_grouped_gemm(
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/gemm.py", line 206, in general_grouped_gemm
    bias = tex.te_general_grouped_gemm(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.

Analysis

  • Taken literally, the message suggests that during the GroupedMLP computation some tensor resides on the CPU rather than the GPU. But in experts.py, right before the linear_fc1 computation, I printed the device of the input tensor and of the parameter tensors: they were all cuda, and their contents looked perfectly normal.

  • Maybe a missing synchronization? I added torch.cuda.current_stream().synchronize() beforehand, but the error persisted. Printing the tensor contents also showed nothing wrong. (A sketch of these checks follows this list.)

  • Even more bizarrely, the error only appeared once sequence_length * batch_size reached 16384; at 8192 it never showed up. I also tried toggling many other options (such as moe_permute_fusion) to see whether they had any effect, but the error occurred in every case.

  • In short, I was out of ideas and suspected a bug inside transformer-engine itself.
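
For reference, below is a minimal sketch of the checks described above. The helper name debug_check and its exact placement are my own illustration; hidden_states and linear_fc1 correspond to the objects visible in the traceback.

import torch

# Hypothetical debugging helper (not part of Megatron-LM), intended to be called
# right before the linear_fc1 computation in
# megatron/core/transformer/moe/experts.py. It prints the device, dtype, and shape
# of the input and of every parameter of the layer, then synchronizes the current
# CUDA stream so that any pending asynchronous error surfaces at this point.
def debug_check(hidden_states: torch.Tensor, linear_fc1: torch.nn.Module) -> None:
    print("input:", hidden_states.device, hidden_states.dtype, tuple(hidden_states.shape))
    for name, param in linear_fc1.named_parameters():
        print(f"param {name}:", param.device, param.dtype, tuple(param.shape))
    # Force the current stream to finish; in my case the error still occurred afterwards.
    torch.cuda.current_stream().synchronize()

In my runs these checks all came back clean, which is what pointed me toward a bug inside the library rather than in the calling code.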

Solution

When the problem occurred, I was using NVIDIA's official PyTorch container, version 25.03-py3, which ships transformer-engine 2.1.0. After upgrading the container to 25.05-py3, which ships transformer-engine 2.3.0, the error disappeared. So this was indeed, most likely, a bug inside transformer-engine.
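
To confirm which library versions a given container actually ships, here is a small sketch; the __version__ attribute is read defensively, since I have not verified that every transformer-engine build exposes it:

# Quick check of the versions installed in the current container.
import torch
import transformer_engine as te

print("torch:", torch.__version__)
print("cuda (torch build):", torch.version.cuda)
print("transformer_engine:", getattr(te, "__version__", "unknown"))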

posted @ 2025-06-24 14:50  CQzhangyu