Practical Guide: How ROCm Determines P2P Capability Between GPUs

2025-12-25 11:38 tlnshuju

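Overview

P2P (Peer-to-Peer) capability determines whether GPUs can directly access each other's memory, which is critical for multi-GPU cooperation and high-performance computing. This document walks through how ROCm determines that capability, from the HIP API down to the kernel driver topology.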

1. Entry point for P2P capability queries

HIP API layer (`hip_peer.cpp`)

hipError_t hipDeviceCanAccessPeer(int* canAccess, int deviceId, int peerDeviceId) {
  HIP_INIT_API(hipDeviceCanAccessPeer, canAccess, deviceId, peerDeviceId);
  HIP_RETURN(canAccessPeer(canAccess, deviceId, peerDeviceId));
}

Core decision function:

hipError_t canAccessPeer(int* canAccessPeer, int deviceId, int peerDeviceId) {
  // 1. Basic check
  if (deviceId == peerDeviceId) {
    *canAccessPeer = 0;  // a device is not reported as its own peer
    return hipSuccess;
  }
  // 2. Get the device objects
  amd::Device* device = g_devices[deviceId]->devices()[0];
  amd::Device* peer_device = g_devices[peerDeviceId]->devices()[0];
  // 3. Key check: look up the p2pDevices_ list
  *canAccessPeer = static_cast<int>(
      std::find(device->p2pDevices_.begin(), device->p2pDevices_.end(),
                as_cl(peer_device)) != device->p2pDevices_.end());
  return hipSuccess;
}

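Key points:

The P2P decision depends entirely on the device->p2pDevices_ list
If peer_device is in that list, P2P access is supported
The list is already fixed at device initialization time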
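The lookup above can be sketched in standalone C++. This is a simplified model rather than the ROCclr code: `DeviceModel` and `can_access_peer` are hypothetical names, and plain integer IDs stand in for the `cl_device_id` handles stored in `p2pDevices_`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for amd::Device: only the peer list is modeled.
struct DeviceModel {
  int id;
  std::vector<int> p2p_devices;  // IDs of the peers this device can access
};

// Mirrors the three steps of canAccessPeer(): self check, device lookup,
// membership test on the peer list.
bool can_access_peer(const std::vector<DeviceModel>& devices, int device_id,
                     int peer_id) {
  assert(device_id >= 0 && device_id < static_cast<int>(devices.size()));
  assert(peer_id >= 0 && peer_id < static_cast<int>(devices.size()));
  if (device_id == peer_id) return false;  // a device is never its own peer
  const std::vector<int>& peers = devices[device_id].p2p_devices;
  return std::find(peers.begin(), peers.end(), peer_id) != peers.end();
}
```

Note that the relation is directional in this model: device 0 listing device 1 says nothing about whether device 1 lists device 0.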

2. Building the P2P capability list (`rocdevice.cpp`)

2.1 When it is initialized

In Device::init(), once all devices have been created, the P2P topology is built in a single pass:

bool Device::init() {
  // ... HSA initialization ...
  if (devices.size() > 0) {
    bool p2p_available = false;
    // Walk all devices and build the P2P relationship graph
    for (auto device1 : devices) {
      // Find all agents that can access device1
      for (auto agent : static_cast<Device*>(device1)->p2pAgents()) {
        // Find the cl_device_id that corresponds to this agent
        for (auto device2 : devices) {
          if (agent.handle == static_cast<Device*>(device2)->getBackendDevice().handle) {
            // device2 can access device1
            device2->p2pDevices_.push_back(as_cl(device1));
            device1->p2p_access_devices_.push_back(device2);
            p2p_available = true;
          }
        }
      }
    }
  }
  // ...
}

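Key data structures:

p2pDevices_: all peer devices that the current device can access
p2p_access_devices_: all devices that can access the current device
p2p_agents_: the HSA agents that can access the current device's memory pool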
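To make the direction of these lists concrete, here is a minimal sketch of the same triple loop over plain data. `GpuModel` and `build_p2p_topology` are hypothetical names, 64-bit integers stand in for `hsa_agent_t::handle`, and device indices replace the `cl_device_id` and `Device*` entries of the real lists.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct GpuModel {
  std::uint64_t handle;                   // stands in for hsa_agent_t::handle
  std::vector<std::uint64_t> p2p_agents;  // agents allowed to access this GPU's VRAM
  std::vector<int> p2p_devices;           // output: peers this GPU can access
  std::vector<int> p2p_access_devices;    // output: GPUs that can access this one
};

// Inverts the per-GPU "who can access me" agent lists into the two device
// lists. Returns true if at least one P2P pair was found.
bool build_p2p_topology(std::vector<GpuModel>& gpus) {
  bool p2p_available = false;
  for (std::size_t target = 0; target < gpus.size(); ++target) {
    for (std::uint64_t agent : gpus[target].p2p_agents) {
      // A GPU's own agent is excluded when p2p_agents is collected.
      assert(agent != gpus[target].handle);
      for (std::size_t accessor = 0; accessor < gpus.size(); ++accessor) {
        if (gpus[accessor].handle != agent) continue;
        // accessor can reach target's memory
        gpus[accessor].p2p_devices.push_back(static_cast<int>(target));
        gpus[target].p2p_access_devices.push_back(static_cast<int>(accessor));
        p2p_available = true;
      }
    }
  }
  return p2p_available;
}
```

The point of the inversion is that the HSA query is phrased from the memory owner's side ("which agents may access my pool?"), while `hipDeviceCanAccessPeer` asks from the accessor's side.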

2.2 Discovering the P2P agents of a single device

Device::populateOCLDeviceConstants() determines each device's p2p_agents_:

bool Device::populateOCLDeviceConstants() {
  // ... other initialization ...
  // Walk all GPU agents in the system
  for (auto agent : gpu_agents_) {
    if (agent.handle != bkendDevice_.handle) {  // skip ourselves
      hsa_amd_memory_pool_access_t access;
      // Ask whether this agent can access the current device's VRAM pool
      err = Hsa::agent_memory_pool_get_info(
          agent,
          gpuvm_segment_,  // VRAM memory pool of the current device
          HSA_AMD_AGENT_MEMORY_POOL_INFO_ACCESS,
          &access);
      if (err != HSA_STATUS_SUCCESS) {
        continue;
      }
      // Accessible: allowed by default, or disallowed by default but can be enabled
      if (HSA_AMD_MEMORY_POOL_ACCESS_ALLOWED_BY_DEFAULT == access ||
          HSA_AMD_MEMORY_POOL_ACCESS_DISALLOWED_BY_DEFAULT == access) {
        // This agent can access the current device's memory
        p2p_agents_.push_back(agent);
      }
    }
  }
  // Build the complete P2P agents array (including ourselves)
  p2p_agents_list_ = new hsa_agent_t[1 + p2p_agents_.size()];
  p2p_agents_list_[0] = getBackendDevice();  // the first entry is this device
  for (size_t i = 0; i < p2p_agents_.size(); ++i) {
    p2p_agents_list_[1 + i] = p2p_agents_[i];
  }
  // ...
}

3. P2P capability at the HSA layer

3.1 Querying memory pool access rights

Key HSA API:

hsa_status_t hsa_amd_agent_memory_pool_get_info(
    hsa_agent_t agent,                           // the agent being queried
    hsa_amd_memory_pool_t memory_pool,           // the target memory pool
    hsa_amd_agent_memory_pool_info_t attribute,  // the attribute to query
    void* value                                  // the returned value
);

Access type enum:

typedef enum {
  // Access is never allowed (no hardware support)
  HSA_AMD_MEMORY_POOL_ACCESS_NEVER_ALLOWED = 0,
  // Access is allowed by default (no extra configuration needed)
  HSA_AMD_MEMORY_POOL_ACCESS_ALLOWED_BY_DEFAULT = 1,
  // Disallowed by default, but can be enabled through agents_allow_access
  HSA_AMD_MEMORY_POOL_ACCESS_DISALLOWED_BY_DEFAULT = 2
} hsa_amd_memory_pool_access_t;

3.2 Decision logic

Can GPU A access GPU B's memory?
    ↓
Query: agent_memory_pool_get_info(GPU_A_agent, GPU_B_memory_pool, ACCESS)
    ↓
Result = ALLOWED_BY_DEFAULT        → P2P supported, no configuration needed
Result = DISALLOWED_BY_DEFAULT     → P2P supported, agents_allow_access must be called first
Result = NEVER_ALLOWED             → P2P not supported (hardware limitation)
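The three-way decision can be written as a small pure function. This is only a sketch: `PoolAccess`, `P2PDecision` and `decide_p2p` are hypothetical names, and the enum is a local mirror of `hsa_amd_memory_pool_access_t` so that the example compiles without the HSA headers.

```cpp
#include <cassert>

// Local mirror of hsa_amd_memory_pool_access_t, using the values listed above.
enum class PoolAccess {
  NeverAllowed = 0,
  AllowedByDefault = 1,
  DisallowedByDefault = 2
};

struct P2PDecision {
  bool supported;       // can the agent ever reach the pool?
  bool needs_enabling;  // must access be granted explicitly first?
};

P2PDecision decide_p2p(PoolAccess access) {
  switch (access) {
    case PoolAccess::AllowedByDefault:
      return {true, false};
    case PoolAccess::DisallowedByDefault:
      return {true, true};
    case PoolAccess::NeverAllowed:
      return {false, false};
  }
  assert(false && "unknown access value");
  return {false, false};
}
```

Both "supported" outcomes put the agent into `p2p_agents_`; the `needs_enabling` distinction only matters later, when access is actually granted.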

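4. Underlying hardware basis (KFD/Topology)

4.1 Where the topology information comes from

The HSA runtime obtains P2P capability information through the following mechanism:

KFD driver interface (libhsakmt)
- Reads the topology information under /sys/class/kfd/kfd/topology/nodes/
- Each node's io_links/ directory describes the links from that node to other nodes

Key files:

/sys/class/kfd/kfd/topology/nodes/<node_id>/io_links/<link_id>/
  └── properties                      # one "key value" pair per line, including:
        type                          # link type as a numeric code (PCIe, XGMI, etc.)
        node_to                       # destination node ID
        weight                        # link weight (relative distance, lower is closer)
        flags                         # flags for supported operations
        min_bandwidth, max_bandwidth  # bandwidth (MB/s)

Effect of the link type:
- XGMI/Infinity Fabric: high-performance direct P2P with low latency
- PCIe: traffic goes through the PCIe switch, lower performance
- No Link: P2P is not supported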
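As an illustration of how such link information can be consumed, the sketch below parses link attributes from text. It assumes that each io_links entry exposes a `properties` file made of `key value` lines, and that the numeric codes 2 and 11 denote PCIe and XGMI (the `HSA_IOLINKTYPE_*` values in the thunk headers); verify both assumptions against your kernel and headers. `parse_properties` and `link_type_name` are hypothetical helpers, not part of libhsakmt.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Parses "key value" lines into a map; lines that do not match are skipped.
std::map<std::string, std::uint64_t> parse_properties(const std::string& text) {
  std::map<std::string, std::uint64_t> props;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string key;
    std::uint64_t value = 0;
    if (fields >> key >> value) props[key] = value;
  }
  return props;
}

// Assumed mapping from numeric link type to a readable name.
std::string link_type_name(std::uint64_t type) {
  switch (type) {
    case 2:
      return "PCIe";
    case 11:
      return "XGMI";
    default:
      return "other";
  }
}
```

In a real tool the input string would be the contents of one link's properties file; keeping the parser separate from file access makes it easy to test without a GPU.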

4.2 Memory pool attributes

During initialization, the runtime inspects each GPU memory pool and records its access attributes:

hsa_status_t Device::iterateGpuMemoryPoolCallback(hsa_amd_memory_pool_t pool, void* data) {
  // Query the segment type of the memory pool
  Hsa::memory_pool_get_info(pool, HSA_AMD_MEMORY_POOL_INFO_SEGMENT, &segment_type);
  switch (segment_type) {
    case HSA_REGION_SEGMENT_GLOBAL: {
      // Query the global flags
      Hsa::memory_pool_get_info(pool, HSA_AMD_MEMORY_POOL_INFO_GLOBAL_FLAGS, &global_flag);
      if (global_flag & HSA_REGION_GLOBAL_FLAG_COARSE_GRAINED) {
        // VRAM: this is the target memory pool for P2P
        dev->gpuvm_segment_ = pool;
        // Check whether the CPU agent can access it (Large BAR support)
        hsa_amd_memory_pool_access_t tmp{};
        Hsa::agent_memory_pool_get_info(dev->cpu_agent_info_->agent, pool,
                                        HSA_AMD_AGENT_MEMORY_POOL_INFO_ACCESS, &tmp);
        if (tmp == HSA_AMD_MEMORY_POOL_ACCESS_NEVER_ALLOWED) {
          dev->info_.largeBar_ = false;  // Large BAR not supported
        } else {
          dev->info_.largeBar_ = ROC_ENABLE_LARGE_BAR;
        }
      }
    }
  }
}

5. Querying P2P link information

5.1 API

hipError_t hipExtGetLinkTypeAndHopCount(int device1, int device2,
                                        uint32_t* linktype,
                                        uint32_t* hopcount);
hipError_t hipDeviceGetP2PAttribute(int* value, hipDeviceP2PAttr attr,
                                    int srcDevice, int dstDevice);

5.2 Link attribute enum

enum hipDeviceP2PAttr {
  hipDevP2PAttrPerformanceRank,         // performance rank (derived from linktype)
  hipDevP2PAttrAccessSupported,         // whether P2P access is supported
  hipDevP2PAttrNativeAtomicSupported,   // whether native atomics are supported
  hipDevP2PAttrHipArrayAccessSupported  // whether HIP array access is supported
};

5.3 Retrieving link information

bool Device::findLinkInfo(const hsa_amd_memory_pool_t& pool,
                          std::vector<LinkAttrType>* link_attrs) {
  // Query the number of hops between the two devices
  int32_t hops = 0;
  hsa_status_t status = Hsa::agent_memory_pool_get_info(
      bkendDevice_, pool,
      HSA_AMD_AGENT_MEMORY_POOL_INFO_NUM_LINK_HOPS,
      &hops);
  if (status != HSA_STATUS_SUCCESS || hops < 0) {
    return false;  // query failed or the pool is unreachable
  }
  if (hops == 0) {
    // Same device, no link
    link_attr.second = -1;  // linktype is meaningless here
  } else {
    // Query link type, bandwidth, and so on
    // (obtained from the KFD topology)
  }
  // ...
}

6. Enabling and disabling P2P access

6.1 Enabling P2P

hipError_t hipDeviceEnablePeerAccess(int peerDeviceId, unsigned int flags) {
  int deviceId = hip::getCurrentDevice()->deviceId();
  int canAccess = 0;
  // 1. Check that the hardware supports it
  if ((hipSuccess != canAccessPeer(&canAccess, deviceId, peerDeviceId)) ||
      (canAccess == 0)) {
    HIP_RETURN(hipErrorInvalidDevice);
  }
  // 2. Enable P2P at the HSA layer
  amd::Device* device = g_devices[deviceId]->asContext()->devices()[0];
  amd::Device* peer_device = g_devices[peerDeviceId]->asContext()->devices()[0];
  peer_device->enableP2P(device);
  // 3. Record the enabled state at the HIP layer
  HIP_RETURN(hip::getCurrentDevice()->EnablePeerAccess(peerDeviceId));
}

Underlying HSA operation:

bool Device::enableP2P(amd::Device* ptrDev) {
  // For memory pools that are DISALLOWED_BY_DEFAULT,
  // call hsa_amd_agents_allow_access to grant access
  hsa_status_t stat = Hsa::agents_allow_access(
      1,  // grant access to one agent
      &(static_cast<roc::Device*>(ptrDev)->getBackendDevice()),
      nullptr,
      ptr  // address of the memory being granted
  );
  // ...
}

6.2 Disabling P2P

hipError_t hipDeviceDisablePeerAccess(int peerDeviceId) {
  // Revoke the access permission
  peer_device->disableP2P(device);
  HIP_RETURN(hip::getCurrentDevice()->DisablePeerAccess(peerDeviceId));
}
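The enable/disable bookkeeping can be modeled as a small state machine. This is a sketch under assumptions, not the HIP implementation: `PeerAccessState` and `PeerStatus` are hypothetical, and the `AlreadyEnabled` and `NotEnabled` outcomes are modeled on the HIP error codes `hipErrorPeerAccessAlreadyEnabled` and `hipErrorPeerAccessNotEnabled`, which the excerpts above do not show.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

enum class PeerStatus { Success, InvalidDevice, AlreadyEnabled, NotEnabled };

// Hypothetical per-device bookkeeping; capable_ plays the role of p2pDevices_.
class PeerAccessState {
 public:
  explicit PeerAccessState(std::vector<int> capable_peers)
      : capable_(std::move(capable_peers)) {}

  PeerStatus enable(int peer) {
    // Step 1: capability check, as in hipDeviceEnablePeerAccess.
    if (!contains(capable_, peer)) return PeerStatus::InvalidDevice;
    if (contains(enabled_, peer)) return PeerStatus::AlreadyEnabled;
    // Steps 2 and 3: grant access and record the enabled state.
    enabled_.push_back(peer);
    return PeerStatus::Success;
  }

  PeerStatus disable(int peer) {
    auto it = std::find(enabled_.begin(), enabled_.end(), peer);
    if (it == enabled_.end()) return PeerStatus::NotEnabled;
    enabled_.erase(it);
    return PeerStatus::Success;
  }

 private:
  static bool contains(const std::vector<int>& v, int x) {
    return std::find(v.begin(), v.end(), x) != v.end();
  }
  std::vector<int> capable_;  // peers that passed the capability check
  std::vector<int> enabled_;  // peers for which access is currently enabled
};
```

The model keeps capability and enablement separate: capability is fixed at initialization, while enablement changes as the application calls the enable and disable APIs.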

7. Complete flow chart

System startup
    ↓
HSA initialization (hsa_init)
    ↓
Enumerate all agents (hsa_iterate_agents)
    ↓
For each GPU agent:
    ├─ Enumerate memory pools (hsa_amd_agent_iterate_memory_pools)
    │   └─ Find the VRAM memory pool (gpuvm_segment_)
    │
    └─ For every other GPU agent in the system:
        └─ Query access rights (agent_memory_pool_get_info)
            ├─ ALLOWED_BY_DEFAULT → add to p2p_agents_
            ├─ DISALLOWED_BY_DEFAULT → add to p2p_agents_
            └─ NEVER_ALLOWED → skip
    ↓
Build the global P2P topology:
    For each device1:
        For each p2pAgent in device1.p2pAgents():
            Find the matching device2
            device2.p2pDevices_.push_back(device1)
    ↓
Application calls hipDeviceCanAccessPeer
    ↓
Look up the device->p2pDevices_ list
    ↓
Return the result

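8. Factors that affect P2P capability

8.1 Hardware

GPU interconnect technology
- XGMI/Infinity Fabric: full-speed P2P support
- Direct PCIe connection: partially supported, with limited performance
- PCIe switch: supported, but with higher latency
- No physical connection: not supported
Large BAR support
- With Large BAR: the CPU can directly access all of VRAM
- Without it: the CPU can only access a small VRAM window (typically 256 MB)
Topology
- GPUs under the same socket: best P2P performance
- GPUs on different sockets: traffic must cross the CPU interconnect (UPI/GMI)

8.2 Software

KFD driver version: must support the P2P-related ioctls
HSA runtime version: must report the topology information correctly
Kernel configuration: the IOMMU-related options need to be enabled
System settings:
- /sys/module/amdgpu/parameters/noretry: can affect P2P stability
- IOMMU settings: may affect P2P performance

8.3 Runtime

Memory allocation location: the memory must be in VRAM to be accessed over P2P
Explicit enablement: hipDeviceEnablePeerAccess must be called
Memory alignment: some architectures require a specific alignment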

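9. Typical P2P topology examples

9.1 Single node, 4 GPUs connected by XGMI

GPU0 ←XGMI→ GPU1
  ↑           ↑
 XGMI       XGMI
  ↓           ↓
GPU2 ←XGMI→ GPU3

Characteristics: every GPU can reach every other GPU over XGMI, so all pairs support high-performance P2P. As drawn, the diagonal pairs (GPU0/GPU3 and GPU1/GPU2) have no direct link; a true full mesh would need two more links.

9.2 Two GPUs connected by PCIe

    CPU
     ↓
  PCIe Switch
   ↙     ↘
GPU0    GPU1

Characteristics: traffic goes through the PCIe switch; P2P is supported, but performance is lower than with XGMI

9.3 Cross-socket configuration

Socket0        Socket1
  ↓              ↓
 GPU0   ←UPI→   GPU2
  ↓              ↓
 GPU1   ←UPI→   GPU3

Characteristics: high performance within a socket, reduced performance across sockets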

10. Debugging and verification

10.1 Inspecting the topology

# Show the gpu_id of every node (CPU nodes report 0)
cat /sys/class/kfd/kfd/topology/nodes/*/gpu_id
# Show the links between nodes
for node in /sys/class/kfd/kfd/topology/nodes/*; do
  echo "Node: $(basename "$node")"
  cat "$node/gpu_id"
  for link in "$node"/io_links/*; do
    echo "  Link: $(basename "$link")"
    # type, node_to and weight are fields of the link's properties file
    grep -E '^(type|node_to|weight) ' "$link/properties" | sed 's/^/    /'
  done
done

10.2 Using rocminfo

rocminfo | grep -A 20 "Link Type Info"

10.3 Using rocm-smi (see the companion article on rocm-smi GPU topology usage)

rocm-smi --showtopo --showtopoaccess

10.4 Programmatic verification

#include <hip/hip_runtime.h>
#include <stdio.h>

int main() {
  int deviceCount = 0;
  if (hipGetDeviceCount(&deviceCount) != hipSuccess) {
    fprintf(stderr, "hipGetDeviceCount failed\n");
    return 1;
  }
  printf("P2P Capability Matrix:\n");
  printf("     ");
  for (int i = 0; i < deviceCount; i++) {
    printf(" GPU%d", i);
  }
  printf("\n");
  for (int i = 0; i < deviceCount; i++) {
    printf("GPU%d:", i);
    for (int j = 0; j < deviceCount; j++) {
      int canAccess = 0;
      // Treat a failed query the same as "no access"
      if (i != j && hipDeviceCanAccessPeer(&canAccess, i, j) != hipSuccess) {
        canAccess = 0;
      }
      printf("  %s ", canAccess ? "Y" : "N");
    }
    printf("\n");
  }
  return 0;
}

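11. Performance considerations

11.1 P2P transfer performance ranking

These are rough figures; actual throughput depends on the GPU generation, the number of links, and the platform:

Direct XGMI: ~200-400 GB/s (fastest)
PCIe 4.0 x16: ~32 GB/s (theoretical, per direction)
PCIe 3.0 x16: ~16 GB/s (theoretical, per direction)
Via a staging buffer: <10 GB/s (slowest)

11.2 Optimization tips

Prefer GPU pairs that are connected by XGMI
Avoid frequent small P2P transfers
Consider data locality to reduce cross-GPU accesses
Use asynchronous copies to hide latency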
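The advice against small transfers follows from a simple cost model in which every transfer pays a fixed latency plus its size divided by the link bandwidth. The sketch below is a back-of-the-envelope estimate, not a measurement: `transfer_seconds` is a hypothetical helper, and the latency value used in the usage note is an assumption chosen only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Estimated seconds to move total_bytes in chunks of chunk_bytes, where each
// chunk pays a fixed latency and the payload moves at the link bandwidth.
double transfer_seconds(double total_bytes, double chunk_bytes,
                        double bandwidth_bytes_per_s, double latency_s) {
  assert(chunk_bytes > 0 && bandwidth_bytes_per_s > 0);
  const double chunks = std::ceil(total_bytes / chunk_bytes);
  return chunks * latency_s + total_bytes / bandwidth_bytes_per_s;
}
```

For example, moving 1 GB over a 32 GB/s link with an assumed 10 µs per-transfer latency takes about 0.031 s as one transfer, but about 2.5 s when split into 4 KiB chunks, because the latency term dominates.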

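Summary

Determining P2P capability is a multi-layer process:

Hardware layer: the physical connection between GPUs (XGMI/PCIe) determines the underlying capability
Driver layer: the KFD driver reads the topology information and exposes it to user space
Runtime layer: the HSA runtime queries memory pool access rights
Application layer: ROCclr builds the P2P device list, and HIP provides the access API

The capability lists are built once, during runtime initialization; afterwards, applications can query and control P2P access through the HIP API.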
