
Setting Up and Testing a gem5-Based ROCm Development Environment

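As the title of "I can't afford an AMD MI300X accelerator card; can I still learn ROCm?" suggests, not everyone has an AMD GPU accelerator available for ROCm development. A virtual development environment is a convenient and inexpensive alternative.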

As "Full System AMD GPU model" points out, gem5 provides full-system GPU simulation. The CPU side is simulated with the KVM CPU, so the host must support KVM.

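With a virtualized environment built on gem5, you can create a heterogeneous platform that simulates an AMD CPU and GPU and, together with the ROCm disk image from gem5-resources, run HIP programs on the simulated GPU. gem5-resources builds the Ubuntu and kernel images as needed, and these run inside the virtual machine that gem5 creates. A diod service shares files between the host and the gem5 virtual machine. You log in to the gem5 virtual machine with gem5term to compile and test ROCm code.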


1 Building gem5 and gem5-resources

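gem5 is a modular, discrete-event-driven computer system architecture simulator that is widely used in academic research and industrial development. It supports:

- Cycle-level CPU models for multiple instruction sets
- Timing models of the memory subsystem
- An extensible interconnect simulation framework

Its companion repository, gem5-resources, provides:
1. Sources and build scripts for Linux disk images and kernels
2. Benchmark suites (for example SPEC CPU2017 and NAS Parallel Benchmarks)
3. Test workloads (for example the GPU programs under src/gpu used in section 3.1)

By standardizing the research environment, the repository lowers the barrier to simulating heterogeneous computing architectures.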

Install the tools required for the build:

sudo apt update
sudo apt install build-essential git m4 scons zlib1g zlib1g-dev qemu-system-x86 libpng-dev libcapstone-dev diod \
    libprotobuf-dev protobuf-compiler libprotoc-dev libgoogle-perftools-dev \
    python3-dev python3-pip python3-setuptools libboost-all-dev pkg-config \
    cmake flex bison libhdf5-dev libxml2-dev libnuma-dev numactl \
    python3-venv
sudo pip3 install scons pybind11

Clone the repository and build it with scons from inside the gem5 directory:

git clone https://github.com/gem5/gem5.git
(cd gem5 && scons build/VEGA_X86/gem5.opt -j "$(nproc)")

Download gem5-resources and build the resources:

git clone https://github.com/gem5/gem5-resources.git
(cd gem5-resources/src/x86-ubuntu-gpu-ml && ./build.sh)

The gem5-resources build may fail because QEMU cannot access KVM:

2025/08/25 19:50:25 packer-plugin-qemu_v1.1.3_x5.0_linux_amd64 plugin: 2025/08/25 19:50:25 Started Qemu. Pid: 229293
2025/08/25 19:50:25 packer-plugin-qemu_v1.1.3_x5.0_linux_amd64 plugin: 2025/08/25 19:50:25 Qemu stderr: Could not access KVM kernel module: Permission denied
2025/08/25 19:50:25 packer-plugin-qemu_v1.1.3_x5.0_linux_amd64 plugin: 2025/08/25 19:50:25 Qemu stderr: qemu-system-x86_64: failed to initialize kvm: Permission denied

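Fix it as follows:

# Add the current user to the kvm group
sudo usermod -aG kvm $USER

# Verify that the user is now in the kvm group
groups $USER

# Log out and back in, or run the following to apply the change in a new shell
newgrp kvm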
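
A quick way to tell whether the fix took effect is to check access to the KVM device node from the same shell that will run the build. The following is a minimal sketch; the function name kvm_access_status and its three return values are illustrative choices, not part of gem5 or QEMU.

```python
import os


def kvm_access_status(path="/dev/kvm"):
    """Classify access to the KVM device node as 'missing', 'denied' or 'ok'."""
    if not os.path.exists(path):
        # The KVM module is not loaded or virtualization is disabled.
        return "missing"
    if not os.access(path, os.R_OK | os.W_OK):
        # Typically the current user is not (yet) in the kvm group.
        return "denied"
    return "ok"


if __name__ == "__main__":
    print(kvm_access_status())
```

A result of "denied" after running usermod usually means the current shell has not picked up the new group yet.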

2 Simulating an AMD CPU/GPU in gem5, with terminal login and host file sharing

2.1 Starting the file sharing service

Start the diod service:

rm -f /tmp/gem5_9p.sock && sudo diod -f -o "trans=unix,path=/tmp/gem5_9p.sock,port=0" -e /home/lbq/data/rocm
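
Before starting gem5 it can help to confirm that the socket path from the command above really exists and is a Unix domain socket. The sketch below only uses the Python standard library; the helper name is_unix_socket is an illustrative choice.

```python
import os
import stat


def is_unix_socket(path):
    """Return True if path exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Missing path or no permission to stat it.
        return False


if __name__ == "__main__":
    # Path taken from the diod command line above.
    print(is_unix_socket("/tmp/gem5_9p.sock"))
```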

To provide file sharing over 9P, modify gem5/configs/example/gpufs/runfs.py:

diff --git a/configs/example/gpufs/runfs.py b/configs/example/gpufs/runfs.py
index db2282808a..7c553c8d3d 100644
--- a/configs/example/gpufs/runfs.py
+++ b/configs/example/gpufs/runfs.py
@@ -51,7 +51,7 @@ from ruby import Ruby

 # GPU FS related
 from system.system import makeGpuFSSystem
-
+from m5.objects import PciVirtIO, VirtIO9PDiod, PciHost

 def addRunFSOptions(parser):
     parser.add_argument(
@@ -321,6 +321,44 @@ def runGpuFSSystem(args):
     )


+_real_instantiate = m5.instantiate
+
+def _instantiate_with_9p(*args,**kwargs):
+    root = m5.objects.Root.getInstance()
+    sys  = getattr(root, "system", None)
+    if sys and not hasattr(sys, "_virtio9p_added"):
+        viopci = PciVirtIO()
+        viopci.vio = VirtIO9PDiod()
+        viopci.vio.root = "/home/lbq/data/rocm"
+        viopci.vio.socketPath = "/tmp/gem5_9p.sock"
+        sys.viopci = viopci
+
+        host_bridge = next(obj for obj in sys.descendants()
+                        if isinstance(obj, PciHost))
+
+
+        viopci.host = host_bridge
+        viopci.pci_bus = 0
+        viopci.pci_dev = 2
+        viopci.pci_func = 0
+        viopci.pio =  sys.iobus.mem_side_ports
+        viopci.dma = sys.iobus.cpu_side_ports
+
+        viopci.VendorID = 0x1AF4
+        viopci.DeviceID = 0x1009
+        viopci.SubClassCode = 0x80
+        viopci.ClassCode = 0xFF
+        viopci.Revision = 0x00
+        viopci.SubsystemID = 0x09
+        viopci.InterruptPin = 1
+        viopci.InterruptLine = 11
+
+        sys._virtio9p_added = True
+    return _real_instantiate(*args, **kwargs)
+
+m5.instantiate = _instantiate_with_9p
+
+
 if __name__ == "__m5_main__":
     # Add gpufs, common, ruby, amdgpu, and gpu tlb args
     parser = argparse.ArgumentParser()

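The main components are:

Component      Function                     Interface
DIOD           Host file-sharing daemon     9P protocol (Unix socket /tmp/gem5_9p.sock in this setup)
PciHost        PCI host controller          PCI bus interface
PciVirtIO      Emulated PCI device          Virtual PCI configuration space
VirtIO9PDiod   9P transport engine          VirtIO queues (request/response)
Guest Kernel   Guest 9P client driver       9P filesystem interface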

[Figure: flow diagram of the file-sharing path]

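2.2 Installing /dev/gem5_bridge on the host

/dev/gem5_bridge is a special device interface created by the gem5_bridge kernel module. The m5 library uses it to perform address-mode ops; as the login message in section 2.4 notes, without the driver those ops require sudo access.

Go to gem5/util/gem5_bridge and run make to produce gem5_bridge.ko. On x86, run the following command to create /dev/gem5_bridge: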

sudo insmod gem5_bridge.ko \
    gem5_bridge_baseaddr=0x7f000000 \
    gem5_bridge_rangesize=0x1000

Check for success with ls /dev/gem5_bridge, or look at the kernel log:

[  209.473510] gem5_bridge: loading out-of-tree module taints kernel.
[  209.473515] gem5_bridge: module verification failed: signature and/or required key missing - tainting kernel
[  209.474615] gem5_bridge_init: SUCCESS!
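
The log check can also be scripted. The sketch below searches kernel log text for the success message shown above; the function name gem5_bridge_loaded is an illustrative choice, and the sample text is copied from the log lines in this section.

```python
def gem5_bridge_loaded(kernel_log):
    """Return True if the kernel log contains the gem5_bridge success message."""
    return any(
        "gem5_bridge_init: SUCCESS!" in line
        for line in kernel_log.splitlines()
    )


sample_log = """\
[  209.473510] gem5_bridge: loading out-of-tree module taints kernel.
[  209.474615] gem5_bridge_init: SUCCESS!
"""

print(gem5_bridge_loaded(sample_log))
```

In practice the text would come from the kernel log, for example the output of dmesg, which may need elevated privileges on some systems.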

2.3 Starting gem5 to run the ROCm environment

To keep the environment from exiting automatically after the ROCm workload finishes, modify gem5/configs/example/gpufs/mi300.py:

diff --git a/configs/example/gpufs/mi300.py b/configs/example/gpufs/mi300.py
index 08dce8f4c2..85f369c28e 100644
--- a/configs/example/gpufs/mi300.py
+++ b/configs/example/gpufs/mi300.py
@@ -153,7 +153,7 @@ def runMI300GPUFS(
         )
         b64file.write(runscriptStr)

-    args.script = tempRunscript
+#    args.script = tempRunscript

     # Defaults for CPU
     args.cpu_type = "X86KvmCPU"

Write a simple test program, pytorch_test.py, to verify the environment:

#!/usr/bin/env python3
import torch
import subprocess
import os

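def get_gpu_info():
    """Return the first 'Marketing Name' line reported by rocminfo."""
    try:
        rocm_info = subprocess.check_output(["rocminfo"]).decode()
        return [line for line in rocm_info.split('\n') if "Marketing Name" in line][0]
    except Exception:
        # rocminfo is missing, failed, or printed no matching line
        return "N/A"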

print("=== Extended ROCm verification ===")
print(f"PyTorch version: {torch.__version__}")
print(f"Torch HIP version: {torch.version.hip}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU device count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")
print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory/1024**3:.2f} GB")
print(f"ROCm marketing name: {get_gpu_info()}")

# Tensor operation check
x = torch.rand(5,3).to('cuda')
y = torch.rand(3,5).to('cuda')
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
z = (x @ y).norm()
end.record()
torch.cuda.synchronize()

print("\nCompute verification:")
print(f"Tensor device: {x.device}")
print(f"Matrix product norm: {z.item():.6f}")
print(f"CUDA event elapsed time: {start.elapsed_time(end):.2f} ms")

 

Start gem5:

./gem5/build/VEGA_X86/gem5.opt ./gem5/configs/example/gpufs/mi300.py --disk-image gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-gpu-ml --kernel gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-gpu-ml --app ./pytorch_test.py

The log below shows that gem5 listens for serial (port 3456) and gdb (port 7000) connections:

gem5 Simulator System.  https://www.gem5.org
gem5 is copyrighted software; use the --copyright option for details.

gem5 version DEVELOP-FOR-25.1
gem5 compiled Aug 25 2025 20:39:59
gem5 started Aug 26 2025 11:07:38
gem5 executing on lbq-hp, pid 7369
command line: ./gem5/build/VEGA_X86/gem5.opt ./gem5/configs/example/gpufs/mi300.py --disk-image gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-gpu-ml --kernel gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-gpu-ml --app ./pytorch_test.py

warn: Physical memory size specified is 8GiB which is greater than 3GiB.  Twice the number of memory controllers would be created.
Global frequency set at 1000000000000 ticks per second
warn: system.workload.acpi_description_table_pointer.rsdt adopting orphan SimObject param 'entries'
warn: No dot file generated. Please install pydot to generate the dot file and pdf.
src/mem/dram_interface.cc:690: warn: DRAM device capacity (8192 Mbytes) does not match the address range assigned (4096 Mbytes)
src/sim/kernel_workload.cc:46: info: kernel located at: gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-gpu-ml
src/base/statistics.hh:279: warn: One of the stats is a legacy stat. Legacy stat is a stat that does not belong to any statistics::Group. Legacy stat is deprecated.
src/cpu/kvm/base.cc:113: info: Using KVM CPU without perf. The stats related to the number of cycles and instructions executed by the KVM CPU will not be updated. The stats should not be used for performance evaluation.
src/base/statistics.hh:279: warn: One of the stats is a legacy stat. Legacy stat is a stat that does not belong to any statistics::Group. Legacy stat is deprecated.
src/mem/dram_interface.cc:690: warn: DRAM device capacity (128 Mbytes) does not match the address range assigned (16384 Mbytes)
src/base/statistics.hh:279: warn: One of the stats is a legacy stat. Legacy stat is a stat that does not belong to any statistics::Group. Legacy stat is deprecated.
      0: system.pc.south_bridge.cmos.rtc: Real-time clock set to Sun Jan  1 00:00:00 2012
system.pc.com_1.device: Listening for connections on port 3456
src/base/statistics.hh:279: warn: One of the stats is a legacy stat. Legacy stat is a stat that does not belong to any statistics::Group. Legacy stat is deprecated.
system.remote_gdb: Listening for connections on port 7000
src/dev/intel_8254_timer.cc:128: warn: Reading current count from inactive timer.
Running the simulation
src/cpu/kvm/base.cc:169: info: KVM: Coalesced MMIO disabled by config.
src/cpu/kvm/base.cc:591: hack: Pretending totalOps is equivalent to totalInsts()
src/arch/x86/kvm/x86_cpu.cc:1688: warn: kvm-x86: MSR (0x3a) unsupported by gem5. Skipping.
src/arch/x86/kvm/x86_cpu.cc:1688: warn: kvm-x86: MSR (0x48) unsupported by gem5. Skipping.
...
src/arch/x86/kvm/x86_cpu.cc:1688: warn: kvm-x86: MSR (0xc0010015) unsupported by gem5. Skipping.
src/arch/x86/kvm/x86_cpu.cc:1688: warn: kvm-x86: MSR (0x4b564d05) unsupported by gem5. Skipping.
src/dev/pci/host.cc:171: warn: 00:1f.1: Write to config space on non-existent PCI device
src/dev/pci/host.cc:171: warn: 00:1f.1: Write to config space on non-existent PCI device
src/dev/x86/pc.cc:117: warn: Don't know what interrupt to clear for console.
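
The port numbers can differ between runs (for example when another gem5 instance is already listening), so it is convenient to read them from the log rather than hard-code them. The following sketch extracts them with a regular expression that matches the two "Listening for connections" lines above; the function name listening_ports is an illustrative choice.

```python
import re

# Matches lines such as:
#   system.pc.com_1.device: Listening for connections on port 3456
PORT_RE = re.compile(r"^(\S+): Listening for connections on port (\d+)$")


def listening_ports(log_text):
    """Map each listening gem5 object name to its TCP port."""
    ports = {}
    for line in log_text.splitlines():
        match = PORT_RE.match(line.strip())
        if match:
            ports[match.group(1)] = int(match.group(2))
    return ports


sample_log = """\
system.pc.com_1.device: Listening for connections on port 3456
src/base/statistics.hh:279: warn: One of the stats is a legacy stat.
system.remote_gdb: Listening for connections on port 7000
"""

print(listening_ports(sample_log))
```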

2.4 Logging in to the ROCm environment over the serial port

Copy gem5/util/term/gem5term to /usr/bin, then connect to the serial port:

gem5term localhost 3456

After logging in:

Ubuntu 24.04.2 LTS gem5 ttyS0

gem5 login: root (automatic login)


The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

Can't open /dev/gem5_bridge: No such file or directory
--> Make sure the gem5_bridge device driver has been properly inserted into the kernel. Otherwise, sudo access required to perform address-mode ops when linking against m5 library.
root@gem5:~# lspci
00:02.0 Unassigned class [ff80]: Red Hat, Inc. Virtio filesystem
00:04.0 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE
00:08.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X]

2.5 Mounting the external directory

Inside the gem5 ROCm guest, mount the shared host directory:

mkdir /root/share
mount -t 9p -o trans=virtio,version=9p2000.L,aname=/home/lbq/data/rocm gem5 /root/share

The mount then appears in the df -h output:

root@gem5:~/share# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        54G   39G   12G  77% /
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           1.6G  492K  1.6G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           795M   16K  795M   1% /run/user/0
gem5            703G  106G  561G  16% /root/share
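
The same check can be done programmatically inside the guest by looking for 9p entries in /proc/mounts. This is a minimal sketch: the function name nine_p_mounts is an illustrative choice, and the sample line is a hypothetical entry written in the standard /proc/mounts field order (source, mount point, filesystem type, options), not output captured from this system.

```python
def nine_p_mounts(mounts_text):
    """Return (source, mount_point) pairs for 9p entries in /proc/mounts format."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 3 of each /proc/mounts line is the filesystem type.
        if len(fields) >= 3 and fields[2] == "9p":
            result.append((fields[0], fields[1]))
    return result


# Hypothetical sample in /proc/mounts format, for illustration only.
sample_mounts = """\
/dev/root / ext4 rw,relatime 0 0
gem5 /root/share 9p rw,relatime,trans=virtio,version=9p2000.L 0 0
"""

print(nine_p_mounts(sample_mounts))
```

On the guest, the input would be the contents of /proc/mounts, read with open("/proc/mounts").read().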

2.6 Loading the amdgpu driver module

Load the amdgpu driver with the following commands:

export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH
export HSA_ENABLE_INTERRUPT=0
export HCC_AMDGPU_TARGET=gfx942
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128  # Load the MI300X firmware ROM
sh /home/gem5/load_amdgpu.sh

Check the GPU status with rocm-smi:

============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK  MCLK  Fan  Perf  PwrCap       VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)
==========================================================================================================================
0       1     0x74a1,   36215  N/A     N/A    N/A, SPX, 0         None  None  0%   n/a   Unsupported  1%     Unsupported
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================

Check the hardware information and driver installation status with rocminfo:

ROCk module version 6.12.12 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.15
Runtime Ext Version:     1.7
System Timestamp Freq.:  0.001000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
XNACK enabled:           NO
DMAbuf Support:          YES
VMM Support:             YES

==========
HSA Agents
==========
*******
Agent 1
*******...
*******
Agent 2
*******
  Name:                    gfx942
  Uuid:                    GPU-XX
  Marketing Name:          AMD Instinct MI300X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           2(0x2)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   100
  BDFID:                   64
  Internal Node ID:        1
  Compute Unit:            320
  SIMDs per CU:            4
  Shader Engines:          32
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 177
  SDMA engine uCode::      24
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16760832(0xffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2...
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc-:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
    ISA 2...
*** Done ***
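
The rocminfo output is long, so a small parser helps when only a few fields matter. The sketch below collects a handful of fields per agent and keeps only GPU agents. The function name gpu_agent_summary and the choice of fields are illustrative, and the sample text is an abbreviated copy of the Agent 2 output above.

```python
def gpu_agent_summary(rocminfo_text):
    """Return selected fields for each GPU agent found in rocminfo output."""
    wanted = ("Name", "Marketing Name", "Device Type", "Compute Unit")
    agents = []
    current = None
    for line in rocminfo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Agent "):
            # A new agent section begins.
            current = {}
            agents.append(current)
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            key = key.strip()
            # Keep the first occurrence, so the later ISA "Name" does not overwrite it.
            if key in wanted and key not in current:
                current[key] = value.strip()
    return [agent for agent in agents if agent.get("Device Type") == "GPU"]


sample = """\
*******
Agent 2
*******
  Name:                    gfx942
  Marketing Name:          AMD Instinct MI300X
  Device Type:             GPU
  Compute Unit:            320
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc-:xnack-
"""

print(gpu_agent_summary(sample))
```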

3 Testing ROCm in the gem5 simulated environment

3.1 HIP testing with gem5-resources

Go to gem5-resources/src/gpu/square and modify the Makefile:

diff --git a/src/gpu/square/Makefile b/src/gpu/square/Makefile
index 0e0cf02b..13dcbef4 100644
--- a/src/gpu/square/Makefile
+++ b/src/gpu/square/Makefile
@@ -1,4 +1,4 @@
-HIP_PATH?= /opt/rocm/hip
+HIP_PATH?= /opt/rocm
 HIPCC=$(HIP_PATH)/bin/hipcc

 BIN_DIR?= ./bin
@@ -6,7 +6,7 @@ BIN_DIR?= ./bin
 square: $(BIN_DIR)/square

 $(BIN_DIR)/square: square.cpp $(BIN_DIR)
-       $(HIPCC) --amdgpu-target=gfx900,gfx902 $(CXXFLAGS) square.cpp -o $(BIN_DIR)/square
+       $(HIPCC) --amdgpu-target=gfx942 --save-temps $(CXXFLAGS) square.cpp -o $(BIN_DIR)/square

 $(BIN_DIR):
        mkdir -p $(BIN_DIR)

Build output:

make: Warning: File 'Makefile' has modification time 66360 s in the future
mkdir -p ./bin
/opt/rocm/bin/hipcc --amdgpu-target=gfx942  square.cpp -o ./bin/square
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
square.cpp:77:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   77 |     hipDeviceSynchronize();
      |     ^~~~~~~~~~~~~~~~~~~~
1 warning generated when compiling for gfx942.
square.cpp:77:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   77 |     hipDeviceSynchronize();
      |     ^~~~~~~~~~~~~~~~~~~~
1 warning generated when compiling for host.
make: warning:  Clock skew detected.  Your build may be incomplete.

Run it:

./bin/square

The result is as follows:

info: running on device AMD Instinct MI300X
info: allocate host and device mem (  7.63 MB)
info: launch 'vector_square' kernel
info: check result
PASSED!

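Disassemble the device object (one of the intermediate files kept by --save-temps) to inspect the instructions: /opt/rocm-6.4.0/llvm/bin/llvm-objdump -d square-hip-amdgcn-amd-amdhsa-gfx942.o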

square-hip-amdgcn-amd-amdhsa-gfx942.o:  file format elf64-amdgpu

Disassembly of section .text:

0000000000000000 <.text>:
        s_nop 0                                                    // 000000000000: BF800000
        s_nop 0                                                    // 000000000004: BF800000
...
        s_nop 0                                                    // 0000000003F8: BF800000
        s_nop 0                                                    // 0000000003FC: BF800000

Disassembly of section .text._Z13vector_squareIfEvPT_PKS0_m:

0000000000000000 <_Z13vector_squareIfEvPT_PKS0_m>:
        s_load_dword s3, s[0:1], 0x24                              // 000000000000: C00200C0 00000024
        s_load_dwordx2 s[8:9], s[0:1], 0x10                        // 000000000008: C0060200 00000010
        s_add_u32 s10, s0, 24                                      // 000000000010: 800A9800
        s_addc_u32 s11, s1, 0                                      // 000000000014: 820B8001
        v_mov_b32_e32 v1, 0                                        // 000000000018: 7E020280
        s_waitcnt lgkmcnt(0)                                       // 00000000001C: BF8CC07F
        s_and_b32 s3, s3, 0xffff                                   // 000000000020: 8603FF03 0000FFFF
        s_mul_i32 s2, s2, s3                                       // 000000000028: 92020302
        v_add_u32_e32 v0, s2, v0                                   // 00000000002C: 68000002
        v_cmp_gt_u64_e32 vcc, s[8:9], v[0:1]                       // 000000000030: 7DD80008
        s_and_saveexec_b64 s[4:5], vcc                             // 000000000034: BE84206A
        s_cbranch_execz 29                                         // 000000000038: BF88001D <_Z13vector_squareIfEvPT_PKS0_m+0xb0>
        s_load_dword s2, s[10:11], 0x0                             // 00000000003C: C0020085 00000000
        s_load_dwordx4 s[4:7], s[0:1], 0x0                         // 000000000044: C00A0100 00000000
        s_mov_b32 s1, 0                                            // 00000000004C: BE810080
        v_lshlrev_b64 v[2:3], 2, v[0:1]                            // 000000000050: D28F0002 00020082
        s_mov_b64 s[10:11], 0                                      // 000000000058: BE8A0180
        s_waitcnt lgkmcnt(0)                                       // 00000000005C: BF8CC07F
        s_mul_i32 s0, s2, s3                                       // 000000000060: 92000302
        s_lshl_b64 s[2:3], s[0:1], 2                               // 000000000064: 8E828200
        v_lshl_add_u64 v[4:5], s[6:7], 0, v[2:3]                   // 000000000068: D2080004 04090006
        global_load_dword v6, v[4:5], off                          // 000000000070: DC508000 067F0004
        v_lshl_add_u64 v[0:1], v[0:1], 0, s[0:1]                   // 000000000078: D2080000 00010100
        v_cmp_le_u64_e32 vcc, s[8:9], v[0:1]                       // 000000000080: 7DD60008
        v_lshl_add_u64 v[4:5], s[4:5], 0, v[2:3]                   // 000000000084: D2080004 04090004
        v_lshl_add_u64 v[2:3], v[2:3], 0, s[2:3]                   // 00000000008C: D2080002 00090102
        s_or_b64 s[10:11], vcc, s[10:11]                           // 000000000094: 878A0A6A
        s_waitcnt vmcnt(0)                                         // 000000000098: BF8C0F70
        v_mul_f32_e32 v6, v6, v6                                   // 00000000009C: 0A0C0D06
        global_store_dword v[4:5], v6, off                         // 0000000000A0: DC708000 007F0604
        s_andn2_b64 exec, exec, s[10:11]                           // 0000000000A8: 89FE0A7E
        s_cbranch_execnz 65518                                     // 0000000000AC: BF89FFEE <_Z13vector_squareIfEvPT_PKS0_m+0x68>
        s_endpgm                                                   // 0000000000B0: BF810000
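
Read top to bottom, the disassembly looks like a grid-stride loop: compute a global index from the block and thread indices, compare it with the element count, then repeatedly load one float, multiply it by itself, store it, and advance by the total number of threads. The Python sketch below emulates that control flow on the CPU. It is an interpretation of the listing above, not a translation of square.cpp, and the parameter names block_dim and grid_dim are assumptions.

```python
def vector_square(a, block_dim, grid_dim):
    """Emulate a grid-stride vector square, one simulated thread at a time."""
    n = len(a)
    c = [0.0] * n
    stride = block_dim * grid_dim              # total number of threads
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Global index (s_mul_i32 followed by v_add_u32 in the listing).
            i = block_idx * block_dim + thread_idx
            # Bounds check and loop back-edge (v_cmp / s_cbranch).
            while i < n:
                # global_load_dword, v_mul_f32, global_store_dword.
                c[i] = a[i] * a[i]
                i += stride
    return c


print(vector_square([1.0, 2.0, 3.0, 4.0, 5.0], block_dim=2, grid_dim=1))
```

With fewer threads than elements, as in the call above, each simulated thread handles several elements, which is the case the loop exists for.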

[Figure: dependency graph of square]

 

3.2 Testing with hip-tests

hip-tests is located at rocm-systems/projects/hip-tests. Build it by following rocm-systems/projects/hip-tests/README-doc.md:

cd rocm-systems/projects/hip-tests
mkdir -p build; cd build
cmake ../catch/ -DHIP_PLATFORM=amd
make -j$(nproc) build_tests
ctest # run tests

A single HIP Catch2 test can also be built standalone (the paths below are relative to the hip-tests root directory):

hipcc ./catch/unit/memory/hipPointerGetAttributes.cc -I ./catch/include ./catch/hipTestMain/standalone_main.cc -I ./catch/external/Catch2 -o hipPointerGetAttributes
./hipPointerGetAttributes

 

Posted on 2025-08-30 23:59 by ArnoldLu
