
rocm-examples

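The ROCm/rocm-examples repository is a valuable resource for learning and using AMD's ROCm heterogeneous computing platform. It provides a large collection of high-quality code examples, from introductory to advanced, covering the core component of the ROCm software stack (HIP) and its main acceleration libraries. It is clearly organized and supports several build systems and platforms, which makes it a useful reference for developers who want to get started with ROCm quickly, understand its features, and build applications. Whether you want to learn basic HIP syntax, explore high-performance computing with libraries such as rocBLAS and rocFFT, or develop on Windows, this repository offers direct help.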

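1 Core content and structure

The repository contains a large number of well-organized examples, grouped into the following categories: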

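  1. HIP basics (HIP-Basic):
    • Covers the core HIP concepts: kernel launch (hello_world), device query (device_query), memory management (global, shared, constant, and unified memory), streams (streams), events (events), atomic operations, math functions (math), and more.
    • Includes practical techniques: runtime compilation (runtime_compilation), loading precompiled modules (module_api), inline assembly (inline_assembly), handling different GPU architectures (gpu_arch), and error handling (error_handling).
    • Demonstrates interoperability with graphics APIs: OpenGL (opengl_interop) and Vulkan (vulkan_interop, vulkan_interop_mipmap).
    • Includes examples of migrating from CUDA to HIP (hipify) and of cross-platform (ROCm/CUDA) builds (hello_world_cuda).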
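  2. ROCm library examples (Libraries):
    • rocBLAS: BLAS operations at every level (level_1, level_2, level_3), such as AXPY (axpy), DOT (dot), GEMV (gemv), GEMM (gemm), strided batched GEMM (gemm_strided_batched), and HER (her).
    • rocFFT / hipFFT: various FFT transforms (complex-to-complex complex_complex, real-to-complex real_complex, complex-to-real complex_real), multi-GPU FFT (multi_gpu), and callbacks (callback).
    • rocRAND: pseudorandom number generation (pseudorandom_generations) and distribution generation (simple_distributions_cpp).
    • rocPRIM / hipCUB: basic parallel primitives, such as reduction (device_sum, block_sum) and radix sort (device_radix_sort).
    • rocSOLVER / hipSOLVER: linear algebra solvers, such as LU factorization (getf2, getrf), singular value decomposition (gesvd), Cholesky factorization (potrf), QR factorization (geqrf), eigenvalue/eigenvector computation (syevd, syevj, syevdx, sygvd), linear system solving (gels), and matrix inversion (getri).
    • rocSPARSE / hipSPARSE: sparse matrix operations at every level (level_1, level_2, level_3), including SPMV and SPMM in various storage formats (CSR, COO, ELL, BSR, GEBSR), triangular solves (spsv, csrsv, bsrsv), preconditioners (bsric0, csrilu0), and tridiagonal and pentadiagonal solvers (gtsv, gpsv).
    • rocThrust: examples for the STL-style parallel algorithm library (saxpy, reduce_sum, norm, vectors, remove_points, device_ptr).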
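  3. Application examples (Applications):
    • GPU-accelerated implementations of common algorithms and applications: bitonic sort (bitonic_sort), convolution (convolution), Floyd-Warshall shortest paths (floyd_warshall), histogram (histogram), Monte Carlo estimation of π (monte_carlo_pi), and prefix sum (prefix_sum).
  4. AI examples (AI):
    • Currently consists mainly of examples that execute quantized models with MIGraphX, ROCm's inference engine (MIGraphX/Quantization).
  5. Tutorial code (HIP-Doc & Tutorials):
    • Example code from the HIP documentation and its tutorials.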
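To make one of the application examples listed above more concrete, here is a minimal host-only C++ sketch of the Floyd-Warshall recurrence. It is not code from the repository: the names `floyd_warshall_reference` and `kNoEdge` are hypothetical, and the sketch runs on the CPU only. A GPU version would typically keep the outer loop over k on the host and parallelize the two inner loops, but that is an assumption about the general approach rather than a description of the floyd_warshall example itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sentinel for "no edge". Half of the maximum value so that
// adding two sentinels cannot overflow an unsigned int.
constexpr unsigned int kNoEdge = std::numeric_limits<unsigned int>::max() / 2;

// dist is an n x n row-major matrix of edge weights; it is updated in place
// so that dist[i * n + j] ends up holding the shortest distance from i to j.
void floyd_warshall_reference(std::vector<unsigned int>& dist, std::size_t n)
{
    assert(dist.size() == n * n);
    for (std::size_t k = 0; k < n; ++k)          // intermediate vertex
        for (std::size_t i = 0; i < n; ++i)      // source vertex
            for (std::size_t j = 0; j < n; ++j)  // destination vertex
                dist[i * n + j] = std::min(dist[i * n + j],
                                           dist[i * n + k] + dist[k * n + j]);
}
```

For a three-vertex graph with edges 0→1 (weight 4), 1→2 (weight 1), and 0→2 (weight 10), the call lowers the distance from 0 to 2 to 5, while unreachable pairs keep the `kNoEdge` value.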

2 Building the code

Build with CMake on Linux:

git clone https://github.com/ROCm/rocm-examples.git
cd rocm-examples
cmake -S . -B build                          # on ROCm
# or, on CUDA: cmake -S . -B build -D GPU_RUNTIME=CUDA
cmake --build build
cmake --install build --prefix install

Or build with make:

git clone https://github.com/ROCm/rocm-examples.git
cd rocm-examples
make                                         # on ROCm
# or, on CUDA: make GPU_RUNTIME=CUDA

3 Detailed description

  • AI Showcases the functionality for executing quantized models using Torch-MIGraphX.
  • Applications groups a number of examples ... .
    • bitonic_sort: Showcases how to order an array of n elements using a GPU implementation of the bitonic sort.
    • convolution: A simple GPU implementation for the calculation of discrete convolutions.
    • floyd_warshall: Showcases a GPU implementation of the Floyd-Warshall algorithm for finding shortest paths in certain types of graphs.
    • histogram: Histogram over a byte array with memory bank optimization.
    • monte_carlo_pi: Monte Carlo estimation of π using hipRAND for random number generation and hipCUB for evaluation.
    • prefix_sum: Showcases a GPU implementation of a prefix sum with a 2-kernel scan algorithm.
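The idea behind a two-pass prefix sum, as mentioned for prefix_sum above, can be sketched on the host. The following is a generic blocked scan under stated assumptions, not the repository's kernel code: the function name `blocked_inclusive_scan` and the `block_size` parameter are hypothetical, and each loop over blocks stands in for what would be one kernel launch on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only sketch of a two-phase blocked inclusive scan.
std::vector<int> blocked_inclusive_scan(const std::vector<int>& in,
                                        std::size_t block_size)
{
    assert(block_size > 0);
    std::vector<int> out(in);
    const std::size_t n          = in.size();
    const std::size_t num_blocks = (n + block_size - 1) / block_size;
    std::vector<int> block_totals(num_blocks, 0);

    // Phase 1: scan each block independently and record its total.
    for (std::size_t b = 0; b < num_blocks; ++b)
    {
        const std::size_t begin = b * block_size;
        const std::size_t end   = std::min(begin + block_size, n);
        for (std::size_t i = begin + 1; i < end; ++i)
            out[i] += out[i - 1];
        block_totals[b] = out[end - 1];
    }

    // Phase 2: add the sum of all preceding blocks to every element.
    int offset = 0;
    for (std::size_t b = 1; b < num_blocks; ++b)
    {
        offset += block_totals[b - 1];
        const std::size_t begin = b * block_size;
        const std::size_t end   = std::min(begin + block_size, n);
        for (std::size_t i = begin; i < end; ++i)
            out[i] += offset;
    }
    return out;
}
```

With the input {1, 2, 3, 4, 5, 6, 7} and a block size of 3, phase 1 produces the per-block scans {1, 3, 6}, {4, 9, 15}, {7}, and phase 2 turns them into the full result {1, 3, 6, 10, 15, 21, 28}.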
  • Common contains common utility functionality shared between the examples.
  • HIP-Basic hosts self-contained recipes showcasing HIP runtime functionality.
    • assembly_to_executable: Program and accompanying build systems that show how to manually compile and link a HIP application from host and device code.
    • bandwidth: Program that measures memory bandwidth from host to device, device to host, and device to device.
    • bit_extract: Program that showcases how to use HIP built-in bit extract.
    • device_globals: Showcases how to set global variables on the device from the host.
    • device_query: Program that showcases how properties from the device may be queried.
    • dynamic_shared: Program that showcases how to use dynamic shared memory with the help of a simple matrix transpose kernel.
    • events: Measuring execution time and synchronizing with HIP events.
    • gpu_arch: Program that showcases how to implement GPU architecture-specific code.
    • hello_world: Simple program that showcases launching kernels and printing from the device.
    • hello_world_cuda: Simple HIP program that showcases setting up CMake to target the CUDA platform.
    • hipify: Simple program and build definitions that showcase automatically converting a CUDA .cu source into portable HIP .hip source.
    • llvm_ir_to_executable: Shows how to create a HIP executable from LLVM IR.
    • inline_assembly: Program that showcases how to use inline assembly in a portable manner.
    • matrix_multiplication: Multiply two dynamically sized matrices utilizing shared memory.
    • module_api: Shows how to load and execute a HIP module in runtime.
    • moving_average: Simple program that demonstrates parallel computation of a moving average of one-dimensional data.
    • multi_gpu_data_transfer: Performs two matrix transposes on two different devices (one on each) to showcase how to use peer-to-peer communication among devices.
    • occupancy: Shows how to find optimal configuration parameters for a kernel launch with maximum occupancy.
    • opengl_interop: Showcases how to share resources and computation between HIP and OpenGL.
    • runtime_compilation: Simple program that showcases how to use HIP runtime compilation (hipRTC) to compile a kernel and launch it on a device.
    • saxpy: Implements the y_i = a·x_i + y_i kernel and explains basic HIP functionality.
    • shared_memory: Showcases how to use static shared memory by implementing a simple matrix transpose kernel.
    • static_device_library: Shows how to create a static library containing device functions, and how to link it with an executable.
    • static_host_library: Shows how to create a static library containing HIP host functions, and how to link it with an executable.
    • streams: Program that showcases usage of multiple streams each with their own tasks.
    • texture_management: Shows the usage of texture memory.
    • vulkan_interop: Showcases how to share resources and computation between HIP and Vulkan via buffer memory.
    • vulkan_interop_mipmap: Showcases how to share resources and computation between HIP and Vulkan via mipmapped memory (only available on Windows).
    • warp_shuffle: Uses a simple matrix transpose kernel to showcase how to use warp shuffle operations.
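Several of the HIP-Basic entries above, saxpy in particular, rely on the usual mapping from a (block, thread) pair to a global element index. The sketch below emulates that mapping on the host in plain C++ so it runs without a GPU. It is an illustration only: `saxpy_thread` and `saxpy_emulated_launch` are hypothetical names, and the nested loops merely stand in for a real kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work done by one logical thread: compute its global index, then update
// one element. The bounds check covers the last, partially filled block.
void saxpy_thread(std::size_t block_idx, std::size_t block_dim,
                  std::size_t thread_idx, float a,
                  const std::vector<float>& x, std::vector<float>& y)
{
    const std::size_t i = block_idx * block_dim + thread_idx;
    if (i < x.size())
        y[i] = a * x[i] + y[i];
}

// Emulated launch: enough blocks to cover all elements, every thread of
// every block executed sequentially on the host.
void saxpy_emulated_launch(float a, const std::vector<float>& x,
                           std::vector<float>& y, std::size_t block_dim = 256)
{
    assert(x.size() == y.size() && block_dim > 0);
    const std::size_t grid_dim = (x.size() + block_dim - 1) / block_dim;
    for (std::size_t b = 0; b < grid_dim; ++b)
        for (std::size_t t = 0; t < block_dim; ++t)
            saxpy_thread(b, block_dim, t, a, x, y);
}
```

With five elements and a block size of 4, two blocks are needed, and the three surplus threads of the second block are filtered out by the bounds check.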
  • HIP-Doc hosts the HIP documentation's example codes. These are mainly intended for CI purposes but also serve as standalone examples.
  • Dockerfiles hosts Dockerfiles with ready-to-use environments for the various samples. See Dockerfiles/README.md for details.
  • Docs
    • CONTRIBUTING.md contains information on how to contribute to the examples.
  • Libraries
    • hipBLAS
      • gemm_strided_batched: Showcases the general matrix product operation with strided and batched matrices.
      • her: Showcases a rank-2 update of a Hermitian matrix with complex values.
      • scal: Simple program that showcases vector scaling (SCAL) operation.
    • hipCUB
      • device_radix_sort: Simple program that showcases hipcub::DeviceRadixSort::SortPairs.
      • device_sum: Simple program that showcases hipcub::DeviceReduce::Sum.
    • hipSOLVER
      • gels: Solve a linear system of the form A×X=B.
      • geqrf: Program that showcases how to obtain a QR decomposition with the hipSOLVER API.
      • gesvd: Program that showcases how to obtain a singular value decomposition with the hipSOLVER API.
      • getrf: Program that showcases how to perform a LU factorization with hipSOLVER.
      • potrf: Perform Cholesky factorization and solve linear system with result.
      • syevd: Program that showcases how to calculate the eigenvalues of a matrix using a divide-and-conquer algorithm in hipSOLVER.
      • syevdx: Shows how to compute a subset of the eigenvalues and the corresponding eigenvectors of a real symmetric matrix A using the Compatibility API of hipSOLVER.
      • sygvd: Showcases how to obtain a solution (X,Λ) for a generalized symmetric-definite eigenvalue problem of the form A⋅X=B⋅X⋅Λ.
      • syevj: Calculates the eigenvalues and eigenvectors from a real symmetric matrix using the Jacobi method.
      • syevj_batched: Showcases how to compute the eigenvalues and eigenvectors (via Jacobi method) of each matrix in a batch of real symmetric matrices.
      • sygvj: Calculates the generalized eigenvalues and eigenvectors from a pair of real symmetric matrices using the Jacobi method.
    • rocBLAS
      • level_1: Operations between vectors and vectors.
        • axpy: Simple program that showcases the AXPY operation.
        • dot: Simple program that showcases dot product.
        • nrm2: Simple program that showcases Euclidean norm of a vector.
        • scal: Simple program that showcases vector scaling (SCAL) operation.
        • swap: Showcases exchanging elements between two vectors.
      • level_2: Operations between vectors and matrices.
        • her: Showcases a rank-1 update of a Hermitian matrix with complex values.
        • gemv: Showcases the general matrix-vector product operation.
      • level_3: Operations between matrices and matrices.
        • gemm: Showcases the general matrix product operation.
        • gemm_strided_batched: Showcases the general matrix product operation with strided and batched matrices.
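As a point of reference for the level_3 entries above, the general matrix product computes C = alpha·A·B + beta·C. The host-only sketch below spells that out for BLAS-style column-major storage with leading dimensions. It is a plain triple loop for checking results, not rocBLAS code: the name `gemm_reference` and its parameter order are assumptions made for this illustration, and transposed operands are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference for C = alpha * A * B + beta * C (no transposes).
// A is m x k, B is k x n, C is m x n, all column-major:
// element (row, col) of A is stored at A[row + col * lda].
void gemm_reference(std::size_t m, std::size_t n, std::size_t k,
                    float alpha,
                    const std::vector<float>& A, std::size_t lda,
                    const std::vector<float>& B, std::size_t ldb,
                    float beta,
                    std::vector<float>& C, std::size_t ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    assert(A.size() >= lda * k && B.size() >= ldb * n && C.size() >= ldc * n);
    for (std::size_t col = 0; col < n; ++col)
    {
        for (std::size_t row = 0; row < m; ++row)
        {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[row + p * lda] * B[p + col * ldb];
            C[row + col * ldc] = alpha * acc + beta * C[row + col * ldc];
        }
    }
}
```

A strided batched variant would repeat the same computation for each matrix in the batch, with the operands of batch b starting at a fixed stride from those of batch b-1.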
    • rocFFT
      • callback: Program that showcases the use of rocFFT callback functionality.
      • complex_complex: Program that showcases a Fast Fourier Transform from complex to complex numbers.
      • complex_real: Program that showcases a Fast Fourier Transform from complex to real numbers.
      • multi_gpu: Program that showcases the use of rocFFT multi-GPU functionality.
      • real_complex: Program that showcases a Fast Fourier Transform from real to complex numbers.
    • rocPRIM
      • block_sum: Simple program that showcases rocprim::block_reduce with an addition operator.
      • device_sum: Simple program that showcases rocprim::reduce with an addition operator.
    • hipFFT
      • multi_gpu: Program that showcases the use of hipFFT multi-GPU functionality.
      • plan_d2z: Forward fast Fourier transform for 1D, 2D, and 3D real input using a simple plan in hipFFT.
      • plan_z2z: Forward fast Fourier transform for 1D, 2D, and 3D complex input using a simple plan in hipFFT.
    • rocRAND
    • rocSOLVER
      • getf2: Program that showcases how to perform a LU factorization with rocSOLVER.
      • getri: Program that showcases matrix inversion by LU-decomposition using rocSOLVER.
      • syev: Shows how to compute the eigenvalues and eigenvectors from a symmetrical real matrix.
      • syev_batched: Shows how to compute the eigenvalues and eigenvectors for each matrix in a batch of real symmetric matrices.
      • syev_strided_batched: Shows how to compute the eigenvalues and eigenvectors for multiple symmetric real matrices that are stored with an arbitrary stride.
    • rocSPARSE
      • level_1: Operations between sparse vectors and dense vectors.
        • axpyi: Showcases how to scale a sparse vector and add it to a dense vector.
        • doti: Showcases a dot product of a sparse vector with a dense vector.
        • gthr: Showcases how to gather elements from a dense vector and store them into a sparse vector.
        • roti: Showcases a Givens rotation to a dense and a sparse vector.
        • sctr: Showcases how to scatter elements in a sparse vector into a dense vector.
      • level_2: Operations between sparse matrices and dense vectors.
        • bsrmv: Showcases a sparse matrix-vector multiplication using BSR storage format.
        • bsrxmv: Showcases a masked sparse matrix-vector multiplication using BSR storage format.
        • bsrsv: Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix.
        • coomv: Showcases a sparse matrix-vector multiplication using COO storage format.
        • csritsv: Showcases how to find an iterative solution with the Jacobi method for a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
        • csrmv: Showcases a sparse matrix-vector multiplication using CSR storage format.
        • csrsv: Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
        • ellmv: Showcases a sparse matrix-vector multiplication using ELL storage format.
        • gebsrmv: Showcases a sparse matrix-dense vector multiplication using GEBSR storage format.
        • gemvi: Showcases a dense matrix-sparse vector multiplication.
        • spitsv: Showcases how to solve iteratively a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix.
        • spmv: Showcases a general sparse matrix-dense vector multiplication.
        • spsv: Showcases how to solve a linear system of equations whose coefficients are stored in a sparse triangular matrix.
      • level_3: Operations between sparse and dense matrices.
        • bsrmm: Showcases a sparse matrix-matrix multiplication using BSR storage format.
        • bsrsm: Showcases how to solve a linear system of equations whose coefficients are stored in a BSR sparse triangular matrix, with solution and right-hand side stored in dense matrices.
        • csrmm: Showcases a sparse matrix-matrix multiplication using CSR storage format.
        • csrsm: Showcases how to solve a linear system of equations whose coefficients are stored in a CSR sparse triangular matrix, with solution and right-hand side stored in dense matrices.
        • gebsrmm: Showcases a sparse matrix-matrix multiplication using GEBSR storage format.
        • gemmi: Showcases a dense matrix-sparse matrix multiplication using CSR storage format.
        • sddmm: Showcases a sampled dense-dense matrix multiplication using CSR storage format.
        • spmm: Showcases a sparse matrix-dense matrix multiplication.
        • spsm: Showcases a sparse triangular linear system solver using CSR storage format.
      • preconditioner: Manipulations on sparse matrices to obtain sparse preconditioner matrices.
        • bsric0: Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse BSR matrix.
        • bsrilu0: Showcases how to obtain the incomplete LU decomposition of a sparse BSR square matrix.
        • csric0: Shows how to compute the incomplete Cholesky decomposition of a Hermitian positive-definite sparse CSR matrix.
        • csrilu0: Showcases how to obtain the incomplete LU decomposition of a sparse CSR square matrix.
        • csritilu0: Showcases how to obtain iteratively the incomplete LU decomposition of a sparse CSR square matrix.
        • gpsv: Shows how to compute the solution of a pentadiagonal linear system.
        • gtsv: Shows how to compute the solution of a tridiagonal linear system.
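Many of the rocSPARSE entries above work on matrices in CSR format. As a rough illustration of how that format is traversed, here is a host-only sketch of a CSR sparse matrix-vector product y = A·x. It assumes zero-based indices and fixes the scaling factors to alpha = 1 and beta = 0. The function name `csr_spmv_reference` is hypothetical and is not part of rocSPARSE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference for y = A * x with A stored in CSR format:
//   row_ptr[r] .. row_ptr[r + 1] - 1 are the positions of row r's nonzeros,
//   col_ind[p] is the column of nonzero p, val[p] is its value.
std::vector<double> csr_spmv_reference(const std::vector<int>&    row_ptr,
                                       const std::vector<int>&    col_ind,
                                       const std::vector<double>& val,
                                       const std::vector<double>& x)
{
    assert(!row_ptr.empty());
    assert(col_ind.size() == val.size());
    const std::size_t rows = row_ptr.size() - 1;
    std::vector<double> y(rows, 0.0);
    for (std::size_t r = 0; r < rows; ++r)
        for (int p = row_ptr[r]; p < row_ptr[r + 1]; ++p)
            y[r] += val[p] * x[col_ind[p]];
    return y;
}
```

For the 3×3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] the CSR arrays are row_ptr = {0, 2, 3, 5}, col_ind = {0, 2, 1, 0, 2}, and val = {1, 2, 3, 4, 5}. Multiplying by x = {1, 2, 3} gives y = {7, 6, 19}. On a GPU the rows are independent, so one natural mapping is one thread (or one group of threads) per row.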
    • rocThrust
      • device_ptr: Simple program that showcases the usage of the thrust::device_ptr template.
      • norm: An example that computes the Euclidean norm of a thrust::device_vector.
      • reduce_sum: An example that computes the sum of a thrust::device_vector integer vector using the thrust::reduce() generalized summation and the thrust::plus operator.
      • remove_points: Simple program that demonstrates the usage of the thrust random number generation, host vector, generation, tuple, zip iterator, and conditional removal templates. It generates a number of random points in a unit square and then removes all of them outside the unit circle.
      • saxpy: Simple program that implements the SAXPY operation (y[i] = a * x[i] + y[i]) using rocThrust and showcases the usage of the vector and functor templates and of thrust::fill and thrust::transform operations.
      • vectors: Simple program that showcases the host_vector and the device_vector of rocThrust.
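Because rocThrust follows the STL style, the saxpy and reduce_sum entries above have close standard-library analogues. The sketch below uses only `std::transform` and `std::accumulate` on host vectors, so it runs without a GPU. The names `saxpy_functor`, `saxpy_stl`, and `reduce_sum_stl` are made up for this illustration. In the rocThrust examples the corresponding calls would be `thrust::transform` and `thrust::reduce` on device vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Binary functor computing a * x + y for one pair of elements.
struct saxpy_functor
{
    float a;
    float operator()(float x, float y) const { return a * x + y; }
};

// y[i] = a * x[i] + y[i], expressed as an element-wise transform.
void saxpy_stl(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    std::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor{a});
}

// Sum of all elements, expressed as a reduction with a plus operator.
int reduce_sum_stl(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0, std::plus<int>());
}
```

For example, with a = 2, x = {1, 2, 3}, and y = {1, 1, 1}, `saxpy_stl` leaves y = {3, 5, 7}, and `reduce_sum_stl` on {1, 2, 3, 4} returns 10.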
  • Tutorials: Showcases HIP Documentation Tutorials.
    • reduction: Showcases a reduction tutorial for HIP Documentation.
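A common way to reduce an array in parallel is a tree-shaped reduction, in which the number of active elements halves at every step. The host-only sketch below emulates that pattern in plain C++. It is a generic illustration under stated assumptions, not the tutorial's code: `tree_reduce_sum` is a hypothetical name, and the inner loop stands in for threads that would run concurrently on a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only emulation of a tree reduction with a halving stride.
// The input is taken by value because it is used as scratch space.
int tree_reduce_sum(std::vector<int> data)
{
    if (data.empty())
        return 0;

    // Pad with the identity element (0) up to the next power of two so
    // that every step pairs elements cleanly.
    std::size_t n = 1;
    while (n < data.size())
        n *= 2;
    assert((n & (n - 1)) == 0);
    data.resize(n, 0);

    // At each step, "thread" i adds the element one stride away.
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)
            data[i] += data[i + stride];

    return data[0];
}
```

For the input {1, 2, 3, 4, 5} the buffer is padded to eight elements and reduced in three steps, and the function returns 15. On a GPU each step would need a synchronization between the threads of a block before the next stride begins.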

 

posted on 2025-09-06 23:59  ArnoldLu
