rocm-examples
The ROCm/rocm-examples repository is a valuable resource for learning and working with the AMD ROCm heterogeneous computing platform. It provides a large collection of high-quality code samples, from introductory to advanced, covering the core component of the ROCm software stack (HIP) as well as the major acceleration libraries. With its clear structure and support for multiple build systems and platforms, it is an essential reference for getting started with ROCm quickly, understanding its capabilities, and developing applications. Whether you want to learn basic HIP syntax, explore high-performance computing with libraries such as rocBLAS and rocFFT, or develop on Windows, this repository offers direct, practical help.
1 Core Content and Structure
The repository contains a large number of well-organized examples, grouped into the following categories:
- HIP basics (HIP-Basic):
- Covers core HIP concepts: kernel launches (hello_world), device queries (device_query), memory management (global, shared, constant, and unified memory), streams (streams), events (events), atomic operations, and math functions (math); a minimal kernel-launch sketch appears at the end of this overview.
- Includes practical techniques: runtime compilation (runtime_compilation), loading precompiled modules (load_module), inline assembly (inline_assembly), handling different GPU architectures (gpu_arch), and error handling (error_handling).
- Demonstrates interoperability with graphics APIs: OpenGL (opengl_interop) and Vulkan (vulkan_interop, vulkan_interop_mipmap).
- Includes examples of porting CUDA code to HIP (hipify) and of cross-platform (ROCm/CUDA) builds (hello_world_cuda).
- ROCm library examples (Libraries):
- rocBLAS: BLAS operations at every level (level_1, level_2, level_3), such as AXPY (axpy), DOT (dot), GEMV (gemv), GEMM (gemm), strided batched GEMM (gemm_strided_batched), and HER (her).
- rocFFT / hipFFT: various FFT transforms (complex-to-complex complex_complex, real-to-complex real_complex, complex-to-real complex_real), multi-GPU FFT (multi_gpu), and callbacks (callback).
- rocRAND: pseudorandom number generation (pseudorandom_generations) and sampling from simple distributions (simple_distributions_cpp).
- rocPRIM / hipCUB: basic parallel primitives such as reduction (device_sum, block_sum), scan (prefix_sum), and sorting (bitonic_sort).
- rocSOLVER / hipSOLVER: linear algebra solvers, including LU factorization (getf2, getrf), singular value decomposition (gesvd), Cholesky factorization (potrf), QR factorization (geqrf), eigenvalue/eigenvector computation (syevd, syevj, syevdx, sygvd), linear system solving (gels), and matrix inversion (getri).
- rocSPARSE / hipSPARSE: sparse matrix operations at every level (level_1, level_2, level_3) and across storage formats (CSR, COO, ELL, BSR, GEBSR): SpMV, SpMM, triangular solves (spsv, csrsv, bsrsv), preconditioners (bsric0, csrilu0), and format-specific solvers (gtsv, gpsv).
- rocThrust: examples for the STL-style parallel algorithms library (saxpy, reduce_sum, norm, vectors, remove_points, device_ptr).
- Application examples (Applications):
- GPU-accelerated implementations of common algorithms and applications: convolution (convolution), Floyd-Warshall shortest paths (floyd_warshall), histogram (histogram), and Monte Carlo estimation of π (monte_carlo_pi).
- AI examples (AI):
- Currently contains mainly an example of executing quantized models with MIGraphX, ROCm's inference engine (MIGraphX/Quantization).
- Tutorial code (HIP-Doc & Tutorials):
- Contains the companion code used in the official HIP tutorials (the HIP documentation and the "SAXPY - Hello, HIP" tutorial, HIP 6.4.43484 documentation), convenient for learning and reference.
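As referenced in the HIP-Basic overview above, the sketch below shows the basic pattern that examples such as hello_world and the memory-management samples build on: allocate device memory, copy inputs to the device, launch a kernel, and copy the result back. This is a hand-written illustration, not code from the repository; the kernel and variable names are made up, and error checking is omitted for brevity.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Illustrative kernel: element-wise addition of two vectors.
__global__ void vector_add(const float* x, const float* y, float* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
    {
        out[i] = x[i] + y[i];
    }
}

int main()
{
    constexpr int n = 1 << 20;
    std::vector<float> h_x(n, 1.0f), h_y(n, 2.0f), h_out(n);

    // Allocate device buffers and copy the inputs over.
    float *d_x, *d_y, *d_out;
    hipMalloc(&d_x, n * sizeof(float));
    hipMalloc(&d_y, n * sizeof(float));
    hipMalloc(&d_out, n * sizeof(float));
    hipMemcpy(d_x, h_x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int block_size = 256;
    const int grid_size  = (n + block_size - 1) / block_size;
    vector_add<<<grid_size, block_size>>>(d_x, d_y, d_out, n);

    // Copy the result back; hipMemcpy synchronizes with the default stream.
    hipMemcpy(h_out.data(), d_out, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("out[0] = %f (expected 3.0)\n", h_out[0]);

    hipFree(d_x);
    hipFree(d_y);
    hipFree(d_out);
    return 0;
}
```

On ROCm this can be compiled with hipcc (for example, hipcc vector_add.cpp -o vector_add); the same source also builds for NVIDIA GPUs when HIP is configured for the CUDA backend.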
2 Building the Examples
On Linux, build with CMake:
$ git clone https://github.com/ROCm/rocm-examples.git
$ cd rocm-examples
$ cmake -S . -B build                       # on ROCm
$ cmake -S . -B build -D GPU_RUNTIME=CUDA   # or, on CUDA
$ cmake --build build
$ cmake --install build --prefix install
Or build with GNU Make:
$ git clone https://github.com/ROCm/rocm-examples.git
$ cd rocm-examples
$ make                      # on ROCm
$ make GPU_RUNTIME=CUDA     # or, on CUDA
3 Detailed Description
- AI Showcases the functionality for executing quantized models using Torch-MIGraphX.
- Applications groups a number of examples ... .
- bitonic_sort: Showcases how to order an array of n elements using a GPU implementation of the bitonic sort.
- convolution: A simple GPU implementation for the calculation of discrete convolutions.
- floyd_warshall: Showcases a GPU implementation of the Floyd-Warshall algorithm for finding shortest paths in certain types of graphs.
- histogram: Histogram over a byte array with memory bank optimization.
- monte_carlo_pi: Monte Carlo estimation of π using hipRAND for random number generation and hipCUB for evaluation (a simplified sketch follows after this group).
- prefix_sum: Showcases a GPU implementation of a prefix sum with a 2-kernel scan algorithm.
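The monte_carlo_pi entry above uses hipRAND for random number generation and hipCUB for the reduction. The sketch below is a deliberately simplified stand-in, not the repository's implementation: it generates the random points on the host with std::mt19937 and counts the hits on the device with atomicAdd, which is enough to illustrate the estimation itself.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <random>
#include <vector>

// Count how many (x, y) samples fall inside the unit circle.
__global__ void count_inside(const float* xs, const float* ys, unsigned int* hits, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n && xs[i] * xs[i] + ys[i] * ys[i] <= 1.0f)
    {
        atomicAdd(hits, 1u);
    }
}

int main()
{
    constexpr int n = 1 << 22;

    // Host-side RNG stands in for hipRAND in this simplified sketch.
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> h_x(n), h_y(n);
    for(int i = 0; i < n; ++i)
    {
        h_x[i] = dist(rng);
        h_y[i] = dist(rng);
    }

    float*        d_x;
    float*        d_y;
    unsigned int* d_hits;
    hipMalloc(&d_x, n * sizeof(float));
    hipMalloc(&d_y, n * sizeof(float));
    hipMalloc(&d_hits, sizeof(unsigned int));
    hipMemcpy(d_x, h_x.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemset(d_hits, 0, sizeof(unsigned int));

    count_inside<<<(n + 255) / 256, 256>>>(d_x, d_y, d_hits, n);

    unsigned int hits = 0;
    hipMemcpy(&hits, d_hits, sizeof(unsigned int), hipMemcpyDeviceToHost);
    std::printf("pi ~= %f\n", 4.0 * hits / n);

    hipFree(d_x);
    hipFree(d_y);
    hipFree(d_hits);
    return 0;
}
```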
- Common contains common utility functionality shared between the examples.
- HIP-Basic hosts self-contained recipes showcasing HIP runtime functionality.
- assembly_to_executable: Program and accompanying build systems that show how to manually compile and link a HIP application from host and device code.
- bandwidth: Program that measures memory bandwidth from host to device, device to host, and device to device.
- bit_extract: Program that showcases how to use HIP built-in bit extract.
- device_globals: Showcases how to set global variables on the device from the host.
- device_query: Program that showcases how properties from the device may be queried.
- dynamic_shared: Program that showcases how to use dynamic shared memory with the help of a simple matrix transpose kernel.
- events: Measuring execution time and synchronizing with HIP events (see the timing sketch after this list).
- gpu_arch: Program that showcases how to implement GPU architecture-specific code.
- hello_world: Simple program that showcases launching kernels and printing from the device.
- hello_world_cuda: Simple HIP program that showcases setting up CMake to target the CUDA platform.
- hipify: Simple program and build definitions that showcase automatically converting a CUDA .cu source into portable HIP .hip source.
- llvm_ir_to_executable: Shows how to create a HIP executable from LLVM IR.
- inline_assembly: Program that showcases how to use inline assembly in a portable manner.
- matrix_multiplication: Multiply two dynamically sized matrices utilizing shared memory.
- module_api: Shows how to load and execute a HIP module at runtime.
- moving_average: Simple program that demonstrates parallel computation of a moving average of one-dimensional data.
- multi_gpu_data_transfer: Performs two matrix transposes on two different devices (one on each) to showcase how to use peer-to-peer communication among devices.
- occupancy: Shows how to find optimal configuration parameters for a kernel launch with maximum occupancy.
- opengl_interop: Showcases how to share resources and computation between HIP and OpenGL.
- runtime_compilation: Simple program that showcases how to use HIP runtime compilation (hipRTC) to compile a kernel and launch it on a device.
- saxpy: Implements the y[i] = a * x[i] + y[i] kernel and explains basic HIP functionality.
- shared_memory: Showcases how to use static shared memory by implementing a simple matrix transpose kernel.
- static_device_library: Shows how to create a static library containing device functions, and how to link it with an executable.
- static_host_library: Shows how to create a static library containing HIP host functions, and how to link it with an executable.
- streams: Program that showcases usage of multiple streams each with their own tasks.
- texture_management: Shows the usage of texture memory.
- vulkan_interop: Showcases how to share resources and computation between HIP and Vulkan via buffer memory.
- vulkan_interop_mipmap: Showcases how to share resources and computation between HIP and Vulkan via mipmapped memory (only available on Windows).
- warp_shuffle: Uses a simple matrix transpose kernel to showcase how to use warp shuffle operations.
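The events entry in the list above measures execution time with HIP events. The following is a minimal hand-written sketch of that pattern, not the repository's example: it brackets a placeholder kernel launch with hipEventRecord calls and reads the elapsed time with hipEventElapsedTime.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Placeholder kernel so there is something to time.
__global__ void scale_kernel(float* data, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
    {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}

int main()
{
    constexpr int n = 1 << 24;
    float* d_data;
    hipMalloc(&d_data, n * sizeof(float));
    hipMemset(d_data, 0, n * sizeof(float));

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    // Record events around the launch on the default (null) stream.
    hipEventRecord(start, nullptr);
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    hipEventRecord(stop, nullptr);

    // Wait for the stop event, then read the elapsed time in milliseconds.
    hipEventSynchronize(stop);
    float milliseconds = 0.0f;
    hipEventElapsedTime(&milliseconds, start, stop);
    std::printf("kernel took %.3f ms\n", milliseconds);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipFree(d_data);
    return 0;
}
```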
- HIP-Doc hosts the HIP documentation's example codes. These are mainly intended for CI purposes but also serve as standalone examples.
- Programming-Guide contains the examples from the HIP documentation's Programming Guide section.
- HIP-C++-Language-Extensions contains the examples from the HIP C++ language extensions page.
- calling_global_functions: Shows how to call __global__ functions (kernels).
- extern_shared_memory: Shows how to dynamically allocate __shared__ memory.
- launch_bounds: Shows how to specify launch bounds for a kernel.
- set_constant_memory: Shows how to initialize __constant__ memory.
- template_warp_size_reduction: Shows how to perform a reduction while relying on the warp size as a compile-time constant.
- timer: Shows how to read the device-side timer.
- warp_size_reduction: Shows how to perform a reduction while relying on the warp size as an early-folded constant.
- HIP-Porting-Guide contains the examples from the HIP porting guide page.
- device_code_feature_identification: Shows how to query the device's compute features in device code.
- host_code_feature_identification: Shows how to query the device's compute features in host code.
- identifying_compilation_target_platform: Shows how to distinguish between AMD and NVIDIA target platforms in code.
- identifying_host_device_compilation_pass: Shows how to distinguish between host and device compilation passes in code.
- Introduction-to-the-HIP-Programming-Model contains the examples from the Introduction to the HIP programming model page.
- add_kernel: Shows how to perform a vector addition with a GPU kernel.
- Porting-CUDA-Driver-API contains the examples from the Porting CUDA driver API page.
- address_retrieval: Shows how to obtain the address of a HIP runtime function.
- load_module: Shows how to load precompiled code objects from disk and execute the contained kernel(s).
- load_module_ex: Shows how to load precompiled code objects from memory and execute the contained kernel(s).
- load_module_ex_cuda: Shows how to load CUDA PTX objects from disk and execute the contained kernel(s).
- per_thread_default_stream: Shows how to manage streams on a per-thread basis.
- pointer_memory_type: Shows how to query a pointer's memory type.
- Programming-for-HIP-Runtime-Compiler contains the examples from the Programming for HIP runtime compiler (RTC) page.
- compilation_apis: Shows how to compile a kernel at runtime using the HIPRTC API (a condensed sketch appears after the HIP-Doc listing).
- linker_apis: Shows how to link an LLVM bitcode object (stored in memory) at runtime using the HIPRTC API.
- linker_apis_file: Shows how to link an LLVM bitcode object (stored in a file on disk) at runtime using the HIPRTC API.
- linker_apis_options: Shows how to link an LLVM bitcode object (stored in memory) at runtime using the HIPRTC API. During the link stage, an array of linker options is passed to the runtime linker.
- lowered_names: Shows how to obtain the lowered (mangled) names of kernels and device variables using the HIPRTC API.
- rtc_error_handling: Shows how to check the HIPRTC API calls for errors.
- Using-HIP-Runtime-API contains the examples from the Using HIP runtime API subsection.
- Asynchronous-Concurrent-Execution contains the examples from the Asynchronous concurrent execution page.
- async_kernel_execution: Shows how to execute HIP operations and kernels asynchronously with regard to the host.
- event_based_synchronization: Shows how to execute HIP operations and kernels asynchronously with regard to the host and how to synchronize the host and the device by using HIP events.
- sequential_kernel_execution: Shows how to execute HIP operations and kernels sequentially.
- Call-Stack contains the examples from the Call stack page.
- call_stack_management: Shows how to adjust the device's call stack size.
- device_recursion: Shows how to hit the device's stack limit on purpose.
- Error-Handling contains the examples from the Error handling page.
- error_handling: Shows how to handle HIP runtime errors without creating too much code overhead.
- HIP-Graphs contains the examples from the HIP graphs page.
- graph_capture: Shows how to capture HIP streams with the HIP graph API.
- graph_creation: Shows how to explicitly create HIP graphs.
- Initialization contains the examples from the Initialization page.
- simple_device_query: Shows how the number of HIP-capable devices in the system can be determined, as well as how properties from the device may be queried.
- Memory-Management contains the examples from the Memory management subsubsection.
- Device-Memory contains the examples from the Device memory page.
- constant_memory: Shows how to transfer bytes between the host and the device's constant memory space.
- dynamic_shared_memory: Shows how to dynamically allocate shared memory on the host.
- explicit_copy: Shows how to transfer bytes between the host and the device's global memory space.
- kernel_memory_allocation: Shows how to allocate global device memory inside a kernel.
- static_shared_memory: Shows how to statically allocate shared memory inside a kernel.
- Host-Memory contains the examples from the Host memory page.
- pageable_host_memory: Shows how to allocate pageable memory on the host and transfer its contents to the device.
- pinned_host_memory: Shows how to allocate pinned memory on the host and transfer its contents to the device.
- SOMA contains the examples from the Stream Ordered Memory Allocator page.
- memory_pool: Shows how to use the stream ordered memory allocation (SOMA) API to set up and manage a memory pool.
- memory_pool_resource_usage_statistics: Shows how to query resource usage statistics for a memory pool.
- memory_pool_threshold: Shows how to use the stream ordered memory allocation (SOMA) API to set up and manage a memory pool, while defining a threshold to specify an amount of memory to reserve.
- memory_pool_trim: Shows how to trim a memory pool to a new size.
- ordinary_memory_allocation: Shows an ordinary memory allocation.
- stream_ordered_memory_allocation: Shows how to use stream ordered memory allocations.
- Unified-Memory-Management contains the examples from the Unified memory management page.
- data_prefetching: Shows how to prefetch data in the unified memory space before it is actually needed.
- dynamic_unified_memory: Shows how to dynamically allocate unified memory and use it from both the host and the device.
- explicit_memory: Shows how to perform explicit memory management by allocating memory on the device and transferring bytes between the host and the device.
- memory_range_attributes: Shows how to query attributes of a given memory range.
- standard_unified_memory: Shows how to dynamically allocate unified memory with standard C++ facilities and use it from both the host and the device.
- static_unified_memory: Shows how to statically allocate unified memory and use it from both the host and the device.
- unified_memory_advice: Shows how to set unified memory runtime hints.
- Virtual-Memory-Management contains the examples from the Virtual memory management page.
- virtual_memory: Shows how to use HIP's virtual memory management API.
- Multi-Device-Management contains the examples from the Multi-device management page.
- device_enumeration: Shows how to query the number of devices in the system and how to access them.
- device_selection: Shows how to switch between the different devices in the system and assign work to them.
- multi_device_sychronization: Shows how to synchronize multiple devices using HIP events and streams.
- p2p_memory_access: Shows how to copy data between devices by adding peer-to-peer accesses to the device selection example.
- p2p_memory_access_failed: Shows how to copy data between devices by adding peer-to-peer accesses to the device selection example, but explicitly does not enable peer-to-peer access for the devices.
- Reference hosts the examples from the HIP documentation's Reference section.
- CUDA-to-HIP-API-Function-Comparison contains the examples from the CUDA to HIP API Function Comparison page.
- block_reduction: Shows a block-reduction kernel written in CUDA.
- HIP-Complex-Math-API contains the examples from the HIP complex math API page.
- complex_math: Shows how to use HIP's complex math API to compute the DFT.
- HIP-Math-API contains the examples from the HIP math API page.
- math: Shows how to use HIP's math API to compute the ULP difference.
- Low-Precision-Floating-Point-Types contains the examples from the Low precision floating point types page.
- low_precision_float_fp8: Shows how to convert a single-precision float value to an 8-bit floating-point type and back.
- low_precision_float_fp16: Shows how to perform an addition of two 16-bit __half values and store the result as a single-precision float.
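The compilation_apis entry in the HIP-Doc listing above compiles a kernel at runtime with HIPRTC. Below is a condensed hand-written sketch of that flow, not the documentation's example itself: compile a source string with HIPRTC, load the resulting code object through the module API, and launch the kernel. The kernel and variable names are illustrative, and most error checks are omitted for brevity.

```cpp
#include <hip/hip_runtime.h>
#include <hip/hiprtc.h>
#include <cstdio>
#include <string>
#include <vector>

// Kernel source kept as a string; extern "C" avoids name mangling.
static const char* kernel_source = R"(
extern "C" __global__ void scale(float* data, float factor, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) data[i] *= factor;
}
)";

int main()
{
    // 1. Compile the source with HIPRTC.
    hiprtcProgram prog;
    hiprtcCreateProgram(&prog, kernel_source, "scale.cpp", 0, nullptr, nullptr);
    if(hiprtcCompileProgram(prog, 0, nullptr) != HIPRTC_SUCCESS)
    {
        size_t log_size = 0;
        hiprtcGetProgramLogSize(prog, &log_size);
        std::string log(log_size, '\0');
        hiprtcGetProgramLog(prog, &log[0]);
        std::fprintf(stderr, "HIPRTC compilation failed:\n%s\n", log.c_str());
        return 1;
    }

    // 2. Extract the compiled code object.
    size_t code_size = 0;
    hiprtcGetCodeSize(prog, &code_size);
    std::vector<char> code(code_size);
    hiprtcGetCode(prog, code.data());
    hiprtcDestroyProgram(&prog);

    // 3. Load the code object as a module and look up the kernel.
    hipModule_t   module;
    hipFunction_t kernel;
    hipModuleLoadData(&module, code.data());
    hipModuleGetFunction(&kernel, module, "scale");

    // 4. Launch through the module API.
    constexpr int n = 1024;
    float* d_data;
    hipMalloc(&d_data, n * sizeof(float));
    hipMemset(d_data, 0, n * sizeof(float));
    float factor = 2.0f;
    int   count  = n;
    void* args[] = {&d_data, &factor, &count};
    hipModuleLaunchKernel(kernel, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
    hipDeviceSynchronize();

    hipModuleUnload(module);
    hipFree(d_data);
    return 0;
}
```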
- Dockerfiles hosts Dockerfiles with ready-to-use environments for the various samples. See Dockerfiles/README.md for details.
- Docs
- CONTRIBUTING.md contains information on how to contribute to the examples.
- Libraries
- hipBLAS
- gemm_strided_batched: Showcases the general matrix product operation with strided and batched matrices.
- her: Showcases a rank-2 update of a Hermitian matrix with complex values.
- scal: Simple program that showcases vector scaling (SCAL) operation.
- hipCUB
- device_radix_sort: Simple program that showcases hipcub::DeviceRadixSort::SortPairs.
- device_sum: Simple program that showcases hipcub::DeviceReduce::Sum.
- hipSOLVER
- gels: Solve a linear system of the form A×X=B.
- geqrf: Program that showcases how to obtain a QR decomposition with the hipSOLVER API.
- gesvd: Program that showcases how to obtain a singular value decomposition with the hipSOLVER API.
- getrf: Program that showcases how to perform an LU factorization with hipSOLVER.
- potrf: Perform Cholesky factorization and solve linear system with result.
- syevd: Program that showcases how to calculate the eigenvalues of a matrix using a divide-and-conquer algorithm in hipSOLVER.
- syevdx: Shows how to compute a subset of the eigenvalues and the corresponding eigenvectors of a real symmetric matrix A using the Compatibility API of hipSOLVER.
- sygvd: Showcases how to obtain a solution (X,Λ) for a generalized symmetric-definite eigenvalue problem of the form A⋅X=B⋅X⋅Λ.
- syevj: Calculates the eigenvalues and eigenvectors from a real symmetric matrix using the Jacobi method.
- syevj_batched: Showcases how to compute the eigenvalues and eigenvectors (via Jacobi method) of each matrix in a batch of real symmetric matrices.
- sygvj: Calculates the generalized eigenvalues and eigenvectors from a pair of real symmetric matrices using the Jacobi method.
- rocBLAS
- level_1: Operations between vectors and vectors.
- level_2: Operations between vectors and matrices.
- level_3: Operations between matrices and matrices.
- gemm: Showcases the general matrix product operation.
- gemm_strided_batched: Showcases the general matrix product operation with strided and batched matrices.
- rocFFT
- callback: Program that showcases the use of rocFFT callback functionality.
- complex_complex: Program that showcases a Fast Fourier Transform from complex to complex numbers.
- complex_real: Program that showcases a Fast Fourier Transform from complex to real numbers.
- multi_gpu: Program that showcases the use of rocFFT multi-GPU functionality.
- real_complex: Program that showcases a Fast Fourier Transform from real to complex numbers.
- rocPRIM
- block_sum: Simple program that showcases rocprim::block_reduce with an addition operator.
- device_sum: Simple program that showcases rocprim::reduce with an addition operator.
- hipFFT
- rocRAND
- c_cpp_api: rocRAND's C/C++ API examples.
- simple_distributions_cpp: A command-line app to compare random number generation on the CPU and on the GPU with rocRAND.
- device_api: rocRAND's device API examples.
- pseudorandom_generations: Simple program that shows how to generate random values with rocRAND's pseudorandom generators.
- rocSOLVER
- getf2: Program that showcases how to perform an LU factorization with rocSOLVER.
- getri: Program that showcases matrix inversion by LU-decomposition using rocSOLVER.
- syev: Shows how to compute the eigenvalues and eigenvectors from a symmetrical real matrix.
- syev_batched: Shows how to compute the eigenvalues and eigenvectors for each matrix in a batch of real symmetric matrices.
- syev_strided_batched: Shows how to compute the eigenvalues and eigenvectors for multiple symmetric real matrices that are stored with an arbitrary stride.
- rocSPARSE
- level_1: Operations between sparse vectors and dense vectors.
- axpyi: Showcases how to scale a sparse vector and add it to a dense vector.
- doti: Showcases a dot product of a sparse vector with a dense vector.
- gthr: Showcases how to gather elements from a dense vector and store them into a sparse vector.
- roti: Showcases a Givens rotation to a dense and a sparse vector.
- sctr: Showcases how to scatter elements in a sparse vector into a dense vector.
- level_2: Operations between sparse matrices and dense vectors.
- bsrmv: Showcases a sparse matrix-vector multiplication using BSR storage format.
- bsrxmv: Showcases a masked sparse matrix-vector multiplication using BSR storage format.
- bsrsv: Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix.
- coomv: Showcases a sparse matrix-vector multiplication using COO storage format.
- csritsv: Showcases how to find an iterative solution with the Jacobi method for a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
- csrmv: Showcases a sparse matrix-vector multiplication using CSR storage format.
- csrsv: Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
- ellmv: Showcases a sparse matrix-vector multiplication using ELL storage format.
- gebsrmv: Showcases a sparse matrix-dense vector multiplication using GEBSR storage format.
- gemvi: Showcases a dense matrix-sparse vector multiplication.
- spitsv: Showcases how to iteratively solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
- spmv: Showcases a general sparse matrix-dense vector multiplication.
- spsv: Showcases how to solve a linear system of equations whose coefficients are stored in a sparse triangular matrix.
- level_3: Operations between sparse and dense matrices.
- bsrmm: Showcases a sparse matrix-matrix multiplication using BSR storage format.
- bsrsm: Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix, with solution and right-hand side stored in dense matrices.
- csrmm: Showcases a sparse matrix-matrix multiplication using CSR storage format.
- csrsm: Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix, with solution and right-hand side stored in dense matrices.
- gebsrmm: Showcases a sparse matrix-matrix multiplication using GEBSR storage format.
- gemmi: Showcases a dense matrix sparse matrix multiplication using CSR storage format.
- sddmm: Showcases a sampled dense-dense matrix multiplication using CSR storage format.
- spmm: Showcases a sparse matrix-dense matrix multiplication.
- spsm: Showcases a sparse triangular linear system solver using CSR storage format.
- preconditioner: Manipulations on sparse matrices to obtain sparse preconditioner matrices.
- bsric0: Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse BSR matrix.
- bsrilu0: Showcases how to obtain the incomplete LU decomposition of a sparse BSR square matrix.
- csric0: Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse CSR matrix.
- csrilu0: Showcases how to obtain the incomplete LU decomposition of a sparse CSR square matrix.
- csritilu0: Showcases how to obtain iteratively the incomplete LU decomposition of a sparse CSR square matrix.
- gpsv: Shows how to compute the solution of a pentadiagonal linear system.
- gtsv: Shows how to compute the solution of a tridiagonal linear system.
- rocThrust
- device_ptr: Simple program that showcases the usage of the thrust::device_ptr template.
- norm: An example that computes the Euclidean norm of a thrust::device_vector.
- reduce_sum: An example that computes the sum of a thrust::device_vector integer vector using the thrust::reduce() generalized summation and the thrust::plus operator.
- remove_points: Simple program that demonstrates the usage of the thrust random number generation, host vector, generation, tuple, zip iterator, and conditional removal templates. It generates a number of random points in a unit square and then removes all of them outside the unit circle.
- saxpy: Simple program that implements the SAXPY operation (y[i] = a * x[i] + y[i]) using rocThrust and showcases the usage of the vector and functor templates and of the thrust::fill and thrust::transform operations (a minimal sketch follows after the Libraries listing).
- vectors: Simple program that showcases the host_vector and the device_vector of rocThrust.
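As referenced in the rocThrust listing above, the sketch below is a minimal hand-written version of the SAXPY idea using thrust::device_vector and thrust::transform with a user-defined functor. It is not the repository's example; names and values are illustrative.

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <cstdio>

// Functor implementing y[i] = a * x[i] + y[i].
struct saxpy_functor
{
    float a;
    __host__ __device__ float operator()(float x, float y) const
    {
        return a * x + y;
    }
};

int main()
{
    const int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f);
    thrust::device_vector<float> y(n, 2.0f);

    // Apply the functor element-wise; the result overwrites y.
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor{2.0f});

    const float first = y[0]; // expected 2 * 1 + 2 = 4
    std::printf("y[0] = %f\n", first);
    return 0;
}
```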
- Tutorials: Showcases HIP Documentation Tutorials.
- reduction: Showcases a reduction tutorial for the HIP documentation (a minimal shared-memory reduction sketch follows below).
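The reduction tutorial entry above lends itself to a worked example. The following hand-written sketch shows the classic shared-memory tree reduction that such tutorials typically build on: each block reduces its tile in __shared__ memory, and the per-block partial sums are combined on the host. It is illustrative only and not the tutorial's actual code.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <numeric>
#include <vector>

// Each block reduces BlockSize consecutive elements into one partial sum.
template<unsigned int BlockSize>
__global__ void block_reduce_sum(const float* in, float* block_sums, unsigned int n)
{
    __shared__ float cache[BlockSize];
    const unsigned int tid = threadIdx.x;
    const unsigned int i   = blockIdx.x * BlockSize + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for(unsigned int stride = BlockSize / 2; stride > 0; stride /= 2)
    {
        if(tid < stride)
        {
            cache[tid] += cache[tid + stride];
        }
        __syncthreads();
    }

    if(tid == 0)
    {
        block_sums[blockIdx.x] = cache[0];
    }
}

int main()
{
    constexpr unsigned int block_size = 256;
    constexpr unsigned int n          = 1 << 20;
    const unsigned int     grid_size  = (n + block_size - 1) / block_size;

    std::vector<float> h_in(n, 1.0f);
    float *d_in, *d_block_sums;
    hipMalloc(&d_in, n * sizeof(float));
    hipMalloc(&d_block_sums, grid_size * sizeof(float));
    hipMemcpy(d_in, h_in.data(), n * sizeof(float), hipMemcpyHostToDevice);

    block_reduce_sum<block_size><<<grid_size, block_size>>>(d_in, d_block_sums, n);

    // Combine the per-block partial sums on the host.
    std::vector<float> partial(grid_size);
    hipMemcpy(partial.data(), d_block_sums, grid_size * sizeof(float), hipMemcpyDeviceToHost);
    const float total = std::accumulate(partial.begin(), partial.end(), 0.0f);
    std::printf("sum = %f (expected %u)\n", total, n);

    hipFree(d_in);
    hipFree(d_block_sums);
    return 0;
}
```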