rocm-examples
The ROCm/rocm-examples repository is a valuable resource for learning and working with the AMD ROCm heterogeneous computing platform. It provides a large collection of high-quality code samples, from introductory to advanced, covering the core component of the ROCm software stack (HIP) as well as the major acceleration libraries. With its clear structure and support for multiple build systems and platforms, it is an essential reference for getting started with ROCm quickly, understanding its capabilities, and developing applications. Whether you want to learn basic HIP syntax, explore high-performance computing with libraries such as rocBLAS and rocFFT, or develop on Windows, this repository offers direct, practical help.
1 Core Content and Structure
The repository contains a large number of well-organized examples, grouped into the following categories:
- HIP basics (HIP-Basic):
- Covers core HIP concepts: kernel launches (hello_world), device queries (device_query), memory management (global, shared, constant, and unified memory), streams (streams), events (events), atomic operations, and math functions (math); a minimal kernel-launch sketch appears at the end of this overview.
- Includes practical techniques: runtime compilation (runtime_compilation), loading precompiled modules (load_module), inline assembly (inline_assembly), handling different GPU architectures (gpu_arch), and error handling (error_handling).
- Demonstrates interoperability with graphics APIs: OpenGL (opengl_interop) and Vulkan (vulkan_interop, vulkan_interop_mipmap).
- Includes examples of porting CUDA code to HIP (hipify) and of cross-platform (ROCm/CUDA) builds (hello_world_cuda).
- ROCm library examples (Libraries):
- rocBLAS: BLAS operations at every level (level_1, level_2, level_3), such as AXPY (axpy), DOT (dot), GEMV (gemv), GEMM (gemm), strided batched GEMM (gemm_strided_batched), and HER (her).
- rocFFT / hipFFT: various FFT transforms (complex-to-complex complex_complex, real-to-complex real_complex, complex-to-real complex_real), multi-GPU FFT (multi_gpu), and callbacks (callback).
- rocRAND: pseudorandom number generation (pseudorandom_generations) and sampling from simple distributions (simple_distributions_cpp).
- rocPRIM / hipCUB: basic parallel primitives such as reduction (device_sum, block_sum), scan (prefix_sum), and sorting (bitonic_sort).
- rocSOLVER / hipSOLVER: linear algebra solvers, including LU factorization (getf2, getrf), singular value decomposition (gesvd), Cholesky factorization (potrf), QR factorization (geqrf), eigenvalue/eigenvector computation (syevd, syevj, syevdx, sygvd), linear system solving (gels), and matrix inversion (getri).
- rocSPARSE / hipSPARSE: sparse matrix operations at every level (level_1, level_2, level_3) and across storage formats (CSR, COO, ELL, BSR, GEBSR): SpMV, SpMM, triangular solves (spsv, csrsv, bsrsv), preconditioners (bsric0, csrilu0), and format-specific solvers (gtsv, gpsv).
- rocThrust: examples for the STL-style parallel algorithms library (saxpy, reduce_sum, norm, vectors, remove_points, device_ptr).
- Application examples (Applications):
- GPU-accelerated implementations of common algorithms and applications: convolution (convolution), Floyd-Warshall shortest paths (floyd_warshall), histogram (histogram), and Monte Carlo estimation of π (monte_carlo_pi).
- AI examples (AI):
- Currently contains mainly an example of executing quantized models with MIGraphX, ROCm's inference engine (MIGraphX/Quantization).
- Tutorial code (HIP-Doc & Tutorials):
- Contains the companion code used in the official HIP tutorials (the HIP documentation and the "SAXPY - Hello, HIP" tutorial, HIP 6.4.43484 documentation), convenient for learning and reference.
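As referenced in the HIP-Basic overview above, the sketch below shows the basic pattern that examples such as hello_world and the memory-management samples build on: allocate device memory, copy inputs to the device, launch a kernel, and copy the result back. This is a hand-written illustration, not code from the repository; the kernel and variable names are made up, and error checking is omitted for brevity.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Illustrative kernel: element-wise addition of two vectors.
__global__ void vector_add(const float* x, const float* y, float* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
    {
        out[i] = x[i] + y[i];
    }
}

int main()
{
    constexpr int n = 1 << 20;
    std::vector<float> h_x(n, 1.0f), h_y(n, 2.0f), h_out(n);

    // Allocate device buffers and copy the inputs over.
    float *d_x, *d_y, *d_out;
    hipMalloc(&d_x, n * sizeof(float));
    hipMalloc(&d_y, n * sizeof(float));
    hipMalloc(&d_out, n * sizeof(float));
    hipMemcpy(d_x, h_x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int block_size = 256;
    const int grid_size  = (n + block_size - 1) / block_size;
    vector_add<<<grid_size, block_size>>>(d_x, d_y, d_out, n);

    // Copy the result back; hipMemcpy synchronizes with the default stream.
    hipMemcpy(h_out.data(), d_out, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("out[0] = %f (expected 3.0)\n", h_out[0]);

    hipFree(d_x);
    hipFree(d_y);
    hipFree(d_out);
    return 0;
}
```

On ROCm this can be compiled with hipcc (for example, hipcc vector_add.cpp -o vector_add); the same source also builds for NVIDIA GPUs when HIP is configured for the CUDA backend.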
2 Building the Examples
On Linux, build with CMake:
$ git clone https://github.com/ROCm/rocm-examples.git
$ cd rocm-examples
$ cmake -S . -B build                       # on ROCm
$ cmake -S . -B build -D GPU_RUNTIME=CUDA   # or, on CUDA
$ cmake --build build
$ cmake --install build --prefix install
Or build with GNU Make:
$ git clone https://github.com/ROCm/rocm-examples.git
$ cd rocm-examples
$ make                      # on ROCm
$ make GPU_RUNTIME=CUDA     # or, on CUDA
3 Detailed Description
- AI Showcases the functionality for executing quantized models using Torch-MIGraphX.
- Applications groups a number of examples ... .
- bitonic_sort: Showcases how to order an array of n elements using a GPU implementation of the bitonic sort.
- convolution: A simple GPU implementation for the calculation of discrete convolutions.
- floyd_warshall: Showcases a GPU implementation of the Floyd-Warshall algorithm for finding shortest paths in certain types of graphs.
- histogram: Histogram over a byte array with memory bank optimization.
- monte_carlo_pi: Monte Carlo estimation of π using hipRAND for random number generation and hipCUB for evaluation (a simplified sketch follows after this group).
- prefix_sum: Showcases a GPU implementation of a prefix sum with a 2-kernel scan algorithm.
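The monte_carlo_pi entry above uses hipRAND for random number generation and hipCUB for the reduction. The sketch below is a deliberately simplified stand-in, not the repository's implementation: it generates the random points on the host with std::mt19937 and counts the hits on the device with atomicAdd, which is enough to illustrate the estimation itself.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <random>
#include <vector>

// Count how many (x, y) samples fall inside the unit circle.
__global__ void count_inside(const float* xs, const float* ys, unsigned int* hits, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n && xs[i] * xs[i] + ys[i] * ys[i] <= 1.0f)
    {
        atomicAdd(hits, 1u);
    }
}

int main()
{
    constexpr int n = 1 << 22;

    // Host-side RNG stands in for hipRAND in this simplified sketch.
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> h_x(n), h_y(n);
    for(int i = 0; i < n; ++i)
    {
        h_x[i] = dist(rng);
        h_y[i] = dist(rng);
    }

    float*        d_x;
    float*        d_y;
    unsigned int* d_hits;
    hipMalloc(&d_x, n * sizeof(float));
    hipMalloc(&d_y, n * sizeof(float));
    hipMalloc(&d_hits, sizeof(unsigned int));
    hipMemcpy(d_x, h_x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemset(d_hits, 0, sizeof(unsigned int));

    count_inside<<<(n + 255) / 256, 256>>>(d_x, d_y, d_hits, n);

    unsigned int hits = 0;
    hipMemcpy(&hits, d_hits, sizeof(unsigned int), hipMemcpyDeviceToHost);
    std::printf("pi ~= %f\n", 4.0 * hits / n);

    hipFree(d_x);
    hipFree(d_y);
    hipFree(d_hits);
    return 0;
}
```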
- Common contains common utility functionality shared between the examples.
- HIP-Basic hosts self-contained recipes showcasing HIP runtime functionality.
- assembly_to_executable: Program and accompanying build systems that show how to manually compile and link a HIP application from host and device code.
- bandwidth: Program that measures memory bandwidth from host to device, device to host, and device to device.
- bit_extract: Program that showcases how to use HIP built-in bit extract.
- device_globals: Showcases how to set global variables on the device from the host.
- device_query: Program that showcases how properties from the device may be queried.
- dynamic_shared: Program that showcases how to use dynamic shared memory with the help of a simple matrix transpose kernel.
- events: Measuring execution time and synchronizing with HIP events (see the timing sketch after this list).
- gpu_arch: Program that showcases how to implement GPU architecture-specific code.
- hello_world: Simple program that showcases launching kernels and printing from the device.
- hello_world_cuda: Simple HIP program that showcases setting up CMake to target the CUDA platform.
- hipify: Simple program and build definitions that showcase automatically converting a CUDA .cu source into portable HIP .hip source.
- llvm_ir_to_executable: Shows how to create a HIP executable from LLVM IR.
- inline_assembly: Program that showcases how to use inline assembly in a portable manner.
- matrix_multiplication: Multiply two dynamically sized matrices utilizing shared memory.
- module_api: Shows how to load and execute a HIP module at runtime.
- moving_average: Simple program that demonstrates parallel computation of a moving average of one-dimensional data.
- multi_gpu_data_transfer: Performs two matrix transposes on two different devices (one on each) to showcase how to use peer-to-peer communication among devices.
- occupancy: Shows how to find optimal configuration parameters for a kernel launch with maximum occupancy.
- opengl_interop: Showcases how to share resources and computation between HIP and OpenGL.
- runtime_compilation: Simple program that showcases how to use HIP runtime compilation (hipRTC) to compile a kernel and launch it on a device.
- saxpy: Implements the y[i] = a * x[i] + y[i] kernel and explains basic HIP functionality.
- shared_memory: Showcases how to use static shared memory by implementing a simple matrix transpose kernel.
- static_device_library: Shows how to create a static library containing device functions, and how to link it with an executable.
- static_host_library: Shows how to create a static library containing HIP host functions, and how to link it with an executable.
- streams: Program that showcases usage of multiple streams each with their own tasks.
- texture_management: Shows the usage of texture memory.
- vulkan_interop: Showcases how to share resources and computation between HIP and Vulkan via buffer memory.
- vulkan_interop_mipmap: Showcases how to share resources and computation between HIP and Vulkan via mipmapped memory (only available on Windows).
- warp_shuffle: Uses a simple matrix transpose kernel to showcase how to use warp shuffle operations.
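The events entry in the list above measures execution time with HIP events. The following is a minimal hand-written sketch of that pattern, not the repository's example: it brackets a placeholder kernel launch with hipEventRecord calls and reads the elapsed time with hipEventElapsedTime.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Placeholder kernel so there is something to time.
__global__ void scale_kernel(float* data, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
    {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}

int main()
{
    constexpr int n = 1 << 24;
    float* d_data;
    hipMalloc(&d_data, n * sizeof(float));
    hipMemset(d_data, 0, n * sizeof(float));

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    // Record events around the launch on the default (null) stream.
    hipEventRecord(start, nullptr);
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    hipEventRecord(stop, nullptr);

    // Wait for the stop event, then read the elapsed time in milliseconds.
    hipEventSynchronize(stop);
    float milliseconds = 0.0f;
    hipEventElapsedTime(&milliseconds, start, stop);
    std::printf("kernel took %.3f ms\n", milliseconds);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipFree(d_data);
    return 0;
}
```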
- HIP-Doc hosts the HIP documentation's example codes. These are mainly intended for CI purposes but also serve as standalone examples.
- Programming-Guide contains the examples from the HIP documentation's Programming Guide section.
- HIP-C++-Language-Extensions contains the examples from the HIP C++ language extensions page.
- calling_global_functions: Shows how to call __global__ functions (kernels).
- extern_shared_memory: Shows how to dynamically allocate __shared__ memory.
- launch_bounds: Shows how to specify launch bounds for a kernel.
- set_constant_memory: Shows how to initialize __constant__ memory.
- template_warp_size_reduction: Shows how to perform a reduction while relying on the warp size as a compile-time constant.
- timer: Shows how to read the device-side timer.
- warp_size_reduction: Shows how to perform a reduction while relying on the warp size as an early-folded constant.
- HIP-Porting-Guide contains the examples from the HIP porting guide page.
- device_code_feature_identification: Shows how to query the device's compute features in device code.
- host_code_feature_identification: Shows how to query the device's compute features in host code.
- identifying_compilation_target_platform: Shows how to distinguish between AMD and NVIDIA target platforms in code.
- identifying_host_device_compilation_pass: Shows how to distinguish between host and device compilation passes in code.
- Introduction-to-the-HIP-Programming-Model contains the examples from the Introduction to the HIP programming model page.
- add_kernel: Shows how to perform a vector addition with a GPU kernel.
- Porting-CUDA-Driver-API contains the examples from the Porting CUDA driver API page.
- address_retrieval: Shows how to obtain the address of a HIP runtime function.
- load_module: Shows how to load precompiled code objects from disk and execute the contained kernel(s).
- load_module_ex: Shows how to load precompiled code objects from memory and execute the contained kernel(s).
- load_module_ex_cuda: Shows how to load CUDA PTX objects from disk and execute the contained kernel(s).
- per_thread_default_stream: Shows how to manage streams on a per-thread basis.
- pointer_memory_type: Shows how to query a pointer's memory type.
- Programming-for-HIP-Runtime-Compiler contains the examples from the Programming for HIP runtime compiler (RTC) page.
- compilation_apis: Shows how to compile a kernel at runtime using the HIPRTC API (a condensed sketch appears after the HIP-Doc listing).
- linker_apis: Shows how to link an LLVM bitcode object (stored in memory) at runtime using the HIPRTC API.
- linker_apis_file: Shows how to link an LLVM bitcode object (stored in a file on disk) at runtime using the HIPRTC API.
- linker_apis_options: Shows how to link an LLVM bitcode object (stored in memory) at runtime using the HIPRTC API. During the link stage, an array of linker options is passed to the runtime linker.
- lowered_names: Shows how to obtain the lowered (mangled) names of kernels and device variables using the HIPRTC API.
- rtc_error_handling: Shows how to check the HIPRTC API calls for errors.
- Using-HIP-Runtime-API contains the examples from the Using HIP runtime API subsection.
- Asynchronous-Concurrent-Execution contains the examples from the Asynchronous concurrent execution page.
- async_kernel_execution: Shows how to execute HIP operations and kernels asynchronously with regard to the host.
- event_based_synchronization: Shows how to execute HIP operations and kernels asynchronously with regard to the host and how to synchronize the host and the device by using HIP events.
- sequential_kernel_execution: Shows how to execute HIP operations and kernels sequentially.
- Call-Stack contains the examples from the Call stack page.
- call_stack_management: Shows how to adjust the device's call stack size.
- device_recursion: Shows how to hit the device's stack limit on purpose.
- Error-Handling contains the examples from the Error handling page.
- error_handling: Shows how to handle HIP runtime errors without creating too much code overhead.
- HIP-Graphs contains the examples from the HIP graphs page.
- graph_capture: Shows how to capture HIP streams with the HIP graph API.
- graph_creation: Shows how to explicitly create HIP graphs.
- Initialization contains the examples from the Initialization page.
- simple_device_query: Shows how the number of HIP-capable devices in the system can be determined, as well as how properties from the device may be queried.
- Memory-Management contains the examples from the Memory management subsubsection.
- Device-Memory contains the examples from the Device memory page.
- constant_memory: Shows how to transfer bytes between the host and the device's constant memory space.
- dynamic_shared_memory: Shows how to dynamically allocate shared memory on the host.
- explicit_copy: Shows how to transfer bytes between the host and the device's global memory space.
- kernel_memory_allocation: Shows how to allocate global device memory inside a kernel.
- static_shared_memory: Shows how to statically allocate shared memory inside a kernel.
- Host-Memory contains the examples from the Host memory page.
- pageable_host_memory: Shows how to allocate pageable memory on the host and transfer its contents to the device.
- pinned_host_memory: Shows how to allocate pinned memory on the host and transfer its contents to the device.
- SOMA contains the examples from the Stream Ordered Memory Allocator page.
- memory_pool: Shows how to use the stream ordered memory allocation (SOMA) API to set up and manage a memory pool.
- memory_pool_resource_usage_statistics: Shows how to query resource usage statistics for a memory pool.
- memory_pool_threshold: Shows how to use the stream ordered memory allocation (SOMA) API to set up and manage a memory pool, while defining a threshold to specify an amount of memory to reserve.
- memory_pool_trim: Shows how to trim a memory pool to a new size.
- ordinary_memory_allocation: Shows an ordinary memory allocation.
- stream_ordered_memory_allocation: Shows how to use stream ordered memory allocations.
- Unified-Memory-Management contains the examples from the Unified memory management page.
- data_prefetching: Shows how to prefetch data in the unified memory space before it is actually needed.
- dynamic_unified_memory: Shows how to dynamically allocate unified memory and use it from both the host and the device.
- explicit_memory: Shows how to perform explicit memory management by allocating memory on the device and transferring bytes between the host and the device.
- memory_range_attributes: Shows how to query attributes of a given memory range.
- standard_unified_memory: Shows how to dynamically allocate unified memory with standard C++ facilities and use it from both the host and the device.
- static_unified_memory: Shows how to statically allocate unified memory and use it from both the host and the device.
- unified_memory_advice: Shows how to set unified memory runtime hints.
- Virtual-Memory-Management contains the examples from the Virtual memory management page.
- virtual_memory: Shows how to use HIP's virtual memory management API.
- Multi-Device-Management contains the examples from the Multi-device management page.
- device_enumeration: Shows how to query the number of devices in the system and how to access them.
- device_selection: Shows how to switch between the different devices in the system and assign work to them.
- multi_device_sychronization: Shows how to synchronize multiple devices using HIP events and streams.
- p2p_memory_access: Shows how to copy data between devices by adding peer-to-peer accesses to the device selection example.
- p2p_memory_access_failed: Shows how to copy data between devices by adding peer-to-peer accesses to the device selection example, but explicitly does not enable peer-to-peer access for the devices.
- Reference hosts the examples from the HIP documentation's Reference section.
- CUDA-to-HIP-API-Function-Comparison contains the examples from the CUDA to HIP API Function Comparison page.
- block_reduction: Shows a block-reduction kernel written in CUDA.
- HIP-Complex-Math-API contains the examples from the HIP complex math API page.
- complex_math: Shows how to use HIP's complex math API to compute the DFT.
- HIP-Math-API contains the examples from the HIP math API page.
- math: Shows how to use HIP's math API to compute the ULP difference.
- Low-Precision-Floating-Point-Types contains the examples from the Low precision floating point types page.
- low_precision_float_fp8: Shows how to convert a single-precision float value to an 8-bit floating-point type and back.
- low_precision_float_fp16: Shows how to perform an addition of two 16-bit __half values and store the result as a single-precision float.
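The compilation_apis entry in the HIP-Doc listing above compiles a kernel at runtime with HIPRTC. Below is a condensed hand-written sketch of that flow, not the documentation's example itself: compile a source string with HIPRTC, load the resulting code object through the module API, and launch the kernel. The kernel and variable names are illustrative, and most error checks are omitted for brevity.

```cpp
#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <cstdio>
#include <string>
#include <vector>

// Kernel source kept as a string; extern "C" avoids name mangling.
static const char* kernel_source = R"(
extern "C" __global__ void scale(float* data, float factor, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) data[i] *= factor;
}
)";

int main()
{
    // 1. Compile the source with HIPRTC.
    hiprtcProgram prog;
    hiprtcCreateProgram(&prog, kernel_source, "scale.cpp", 0, nullptr, nullptr);
    if(hiprtcCompileProgram(prog, 0, nullptr) != HIPRTC_SUCCESS)
    {
        size_t log_size = 0;
        hiprtcGetProgramLogSize(prog, &log_size);
        std::string log(log_size, '\0');
        hiprtcGetProgramLog(prog, &log[0]);
        std::fprintf(stderr, "HIPRTC compilation failed:\n%s\n", log.c_str());
        return 1;
    }

    // 2. Extract the compiled code object.
    size_t code_size = 0;
    hiprtcGetCodeSize(prog, &code_size);
    std::vector<char> code(code_size);
    hiprtcGetCode(prog, code.data());
    hiprtcDestroyProgram(&prog);

    // 3. Load the code object as a module and look up the kernel.
    hipModule_t   module;
    hipFunction_t kernel;
    hipModuleLoadData(&module, code.data());
    hipModuleGetFunction(&kernel, module, "scale");

    // 4. Launch through the module API.
    constexpr int n = 1024;
    float* d_data;
    hipMalloc(&d_data, n * sizeof(float));
    hipMemset(d_data, 0, n * sizeof(float));
    float factor = 2.0f;
    int   count  = n;
    void* args[] = {&d_data, &factor, &count};
    hipModuleLaunchKernel(kernel, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
    hipDeviceSynchronize();

    hipModuleUnload(module);
    hipFree(d_data);
    return 0;
}
```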
- Dockerfiles hosts Dockerfiles with ready-to-use environments for the various samples. See Dockerfiles/README.md for details.
- Docs
- CONTRIBUTING.md contains information on how to contribute to the examples.
- Libraries
- hipBLAS
- gemm_strided_batched: Showcases the general matrix product operation with strided and batched matrices.
- her: Showcases a rank-2 update of a Hermitian matrix with complex values.
- scal: Simple program that showcases vector scaling (SCAL) operation.
- hipCUB
- device_radix_sort: Simple program that showcases hipcub::DeviceRadixSort::SortPairs.
- device_sum: Simple program that showcases hipcub::DeviceReduce::Sum.
- hipSOLVER
- gels: Solve a linear system of the form A×X=B.
- geqrf: Program that showcases how to obtain a QR decomposition with the hipSOLVER API.
- gesvd: Program that showcases how to obtain a singular value decomposition with the hipSOLVER API.
- getrf: Program that showcases how to perform an LU factorization with hipSOLVER.
- potrf: Perform Cholesky factorization and solve linear system with result.
- syevd: Program that showcases how to calculate the eigenvalues of a matrix using a divide-and-conquer algorithm in hipSOLVER.
- syevdx: Shows how to compute a subset of the eigenvalues and the corresponding eigenvectors of a real symmetric matrix A using the Compatibility API of hipSOLVER.
- sygvd: Showcases how to obtain a solution (X,Λ) for a generalized symmetric-definite eigenvalue problem of the form A⋅X=B⋅X⋅Λ.
- syevj: Calculates the eigenvalues and eigenvectors from a real symmetric matrix using the Jacobi method.
- syevj_batched: Showcases how to compute the eigenvalues and eigenvectors (via Jacobi method) of each matrix in a batch of real symmetric matrices.
- sygvj: Calculates the generalized eigenvalues and eigenvectors from a pair of real symmetric matrices using the Jacobi method.
- rocBLAS
- level_1: Operations between vectors and vectors.
- level_2: Operations between vectors and matrices.
- level_3: Operations between matrices and matrices.
- gemm: Showcases the general matrix product operation.
- gemm_strided_batched: Showcases the general matrix product operation with strided and batched matrices.
- rocFFT
- callback: Program that showcases the use of rocFFT callback functionality.
- complex_complex: Program that showcases a Fast Fourier Transform from complex to complex numbers.
- complex_real: Program that showcases a Fast Fourier Transform from complex to real numbers.
- multi_gpu: Program that showcases the use of rocFFT multi-GPU functionality.
- real_complex: Program that showcases a Fast Fourier Transform from real to complex numbers.
- rocPRIM
- block_sum: Simple program that showcases rocprim::block_reduce with an addition operator.
- device_sum: Simple program that showcases rocprim::reduce with an addition operator.
- hipFFT
- rocRAND
- c_cpp_api: rocRAND's C/C++ API examples.
- simple_distributions_cpp: A command-line app to compare random number generation on the CPU and on the GPU with rocRAND.
- device_api: rocRAND's device API examples.
- pseudorandom_generations: Simple program that shows how to generate random values with rocRAND's pseudorandom generators.
- rocSOLVER
- getf2: Program that showcases how to perform an LU factorization with rocSOLVER.
- getri: Program that showcases matrix inversion by LU-decomposition using rocSOLVER.
- syev: Shows how to compute the eigenvalues and eigenvectors from a symmetrical real matrix.
- syev_batched: Shows how to compute the eigenvalues and eigenvectors for each matrix in a batch of real symmetric matrices.
- syev_strided_batched: Shows how to compute the eigenvalues and eigenvectors for multiple symmetric real matrices that are stored with an arbitrary stride.
- rocSPARSE
- level_1: Operations between sparse vectors and dense vectors.
- axpyi: Showcases how to scale a sparse vector and add it to a dense vector.
- doti: Showcases a dot product of a sparse vector with a dense vector.
- gthr: Showcases how to gather elements from a dense vector and store them into a sparse vector.
- roti: Showcases a Givens rotation to a dense and a sparse vector.
- sctr: Showcases how to scatter elements in a sparse vector into a dense vector.
- level_2: Operations between sparse matrices and dense vectors.
- bsrmv: Showcases a sparse matrix-vector multiplication using BSR storage format.
- bsrxmv: Showcases a masked sparse matrix-vector multiplication using BSR storage format.
- bsrsv: Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix.
- coomv: Showcases a sparse matrix-vector multiplication using COO storage format.
- csritsv: Showcases how to find an iterative solution with the Jacobi method for a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
- csrmv: Showcases a sparse matrix-vector multiplication using CSR storage format.
- csrsv: Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
- ellmv: Showcases a sparse matrix-vector multiplication using ELL storage format.
- gebsrmv: Showcases a sparse matrix-dense vector multiplication using GEBSR storage format.
- gemvi: Showcases a dense matrix-sparse vector multiplication.
- spitsv: Showcases how to iteratively solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
- spmv: Showcases a general sparse matrix-dense vector multiplication.
- spsv: Showcases how to solve a linear system of equations whose coefficients are stored in a sparse triangular matrix.
- level_3: Operations between sparse and dense matrices.
- bsrmm: Showcases a sparse matrix-matrix multiplication using BSR storage format.
- bsrsm: Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix, with solution and right-hand side stored in dense matrices.
- csrmm: Showcases a sparse matrix-matrix multiplication using CSR storage format.
- csrsm: Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix, with solution and right-hand side stored in dense matrices.
- gebsrmm: Showcases a sparse matrix-matrix multiplication using GEBSR storage format.
- gemmi: Showcases a dense matrix sparse matrix multiplication using CSR storage format.
- sddmm: Showcases a sampled dense-dense matrix multiplication using CSR storage format.
- spmm: Showcases a sparse matrix-dense matrix multiplication.
- spsm: Showcases a sparse triangular linear system solver using CSR storage format.
- preconditioner: Manipulations on sparse matrices to obtain sparse preconditioner matrices.
- bsric0: Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse BSR matrix.
- bsrilu0: Showcases how to obtain the incomplete LU decomposition of a sparse BSR square matrix.
- csric0: Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse CSR matrix.
- csrilu0: Showcases how to obtain the incomplete LU decomposition of a sparse CSR square matrix.
- csritilu0: Showcases how to obtain iteratively the incomplete LU decomposition of a sparse CSR square matrix.
- gpsv: Shows how to compute the solution of a pentadiagonal linear system.
- gtsv: Shows how to compute the solution of a tridiagonal linear system.
- rocThrust
- device_ptr: Simple program that showcases the usage of the thrust::device_ptr template.
- norm: An example that computes the Euclidean norm of a thrust::device_vector.
- reduce_sum: An example that computes the sum of a thrust::device_vector integer vector using the thrust::reduce() generalized summation and the thrust::plus operator.
- remove_points: Simple program that demonstrates the usage of the thrust random number generation, host vector, generation, tuple, zip iterator, and conditional removal templates. It generates a number of random points in a unit square and then removes all of them outside the unit circle.
- saxpy: Simple program that implements the SAXPY operation (y[i] = a * x[i] + y[i]) using rocThrust and showcases the usage of the vector and functor templates and of the thrust::fill and thrust::transform operations (a minimal sketch follows after the Libraries listing).
- vectors: Simple program that showcases the host_vector and the device_vector of rocThrust.
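As referenced in the rocThrust listing above, the sketch below is a minimal hand-written version of the SAXPY idea using thrust::device_vector and thrust::transform with a user-defined functor. It is not the repository's example; names and values are illustrative.

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <cstdio>

// Functor implementing y[i] = a * x[i] + y[i].
struct saxpy_functor
{
    float a;
    __host__ __device__ float operator()(float x, float y) const
    {
        return a * x + y;
    }
};

int main()
{
    const int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f);
    thrust::device_vector<float> y(n, 2.0f);

    // Apply the functor element-wise; the result overwrites y.
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor{2.0f});

    const float first = y[0]; // expected 2 * 1 + 2 = 4
    std::printf("y[0] = %f\n", first);
    return 0;
}
```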
- Tutorials: Showcases HIP Documentation Tutorials.
- reduction: Showcases a reduction tutorial for the HIP documentation (a minimal shared-memory reduction sketch follows below).
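The reduction tutorial entry above lends itself to a worked example. The following hand-written sketch shows the classic shared-memory tree reduction that such tutorials typically build on: each block reduces its tile in __shared__ memory, and the per-block partial sums are combined on the host. It is illustrative only and not the tutorial's actual code.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <numeric>
#include <vector>

// Each block reduces BlockSize consecutive elements into one partial sum.
template<unsigned int BlockSize>
__global__ void block_reduce_sum(const float* in, float* block_sums, unsigned int n)
{
    __shared__ float cache[BlockSize];
    const unsigned int tid = threadIdx.x;
    const unsigned int i   = blockIdx.x * BlockSize + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for(unsigned int stride = BlockSize / 2; stride > 0; stride /= 2)
    {
        if(tid < stride)
        {
            cache[tid] += cache[tid + stride];
        }
        __syncthreads();
    }

    if(tid == 0)
    {
        block_sums[blockIdx.x] = cache[0];
    }
}

int main()
{
    constexpr unsigned int block_size = 256;
    constexpr unsigned int n          = 1 << 20;
    const unsigned int     grid_size  = (n + block_size - 1) / block_size;

    std::vector<float> h_in(n, 1.0f);
    float *d_in, *d_block_sums;
    hipMalloc(&d_in, n * sizeof(float));
    hipMalloc(&d_block_sums, grid_size * sizeof(float));
    hipMemcpy(d_in, h_in.data(), n * sizeof(float), hipMemcpyHostToDevice);

    block_reduce_sum<block_size><<<grid_size, block_size>>>(d_in, d_block_sums, n);

    // Combine the per-block partial sums on the host.
    std::vector<float> partial(grid_size);
    hipMemcpy(partial.data(), d_block_sums, grid_size * sizeof(float), hipMemcpyDeviceToHost);
    const float total = std::accumulate(partial.begin(), partial.end(), 0.0f);
    std::printf("sum = %f (expected %u)\n", total, n);

    hipFree(d_in);
    hipFree(d_block_sums);
    return 0;
}
```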