CUDA & CUTLASS (2025/2/23)

CUDA

C TO CUDA

在 CUDA 编程中
可以简单的理解为用 CPU 调控计算，用 GPU 执行计算
因此有了 HOST 和 DEVICE 的概念
HOST 可以简单理解为 CPU ，DEVICE 为 GPU
因此一个 CUDA 代码的经典结构是：
首先在主机端分配和检查 GPU
分配主机端和设备端使用的内存
指定好线程块 block 和网格 grid
执行核函数，在设备端进行运算
返回运算结果到主机端，释放内存
CUDA 提供了常用的接口，这部分查文档即可

线程与并行计算

CUDA 运算高效的基础在于，通过分配线程并行计算
首先梳理一下，block 和 grid 的逻辑结构和物理结构
对于逻辑结构：
一个 block 可以是一个 1 到 3 维的线程矩阵，每一个元素是一个线程
grid 同样是一个 dim3 类型的 block 矩阵，每一个元素是一个 block
通过计算在当前线程在 block 和 grid 中的坐标，可以唯一确定每一个线程的位置
对于物理结构：
首先，最基础的运算单元是 CUDA 核心
若干个 CUDA 核心和其他单元组成一个 SM ，即流多处理器
若干个 SM 组成整个 GPU
大体上，一个线程对应 CUDA 核心，一个 block 对应 SM，grid对应 GPU
一个 block 只能在分配好的一个 SM 上，而 SM 则有可能对应若干个 block
需要注意的是，并发和并行不一样
并发可以理解成交替的进行两个 TASK
并行才是真正意义上的同时计算
因此，在实际上， block 中每 32 个线程构成一个 warp 线程束
线程束在物理结构上才是真正并行的
这也是 block 大小必须为 32 整倍数的原因

内存管理

从访问速度从快到慢
分别为寄存器、共享内存、本地内存、常量内存、纹理内存和全局内存
共享内存为线程块内共享
寄存器和本地内存为线程独享
常量内存和纹理内存可与 HOST 数据交换，对 DEVICE 是只读
全局内存可被所有线程共享
寄存器和共享均为 on chip，寄存器为线程独有，共享的生命周期与 block 对应
寄存器溢出：核函数所需寄存器数量超出硬件设备支持则保存至本地内存

why CUDA

计组课上学过
内存的局部性原理：访问相邻连续的内存效率更高
这一定程度上与内存的物理结构有关
因为在抽象为二维数组的内存中，访存过程分为两步
首先是取出一行的 1024 个字节，进入放大器中放大信号
然后按照列找到需要访问的地址
而对于随机访问，则多了放大器写回等过程
效率相比连续访问低了一个数量级
而与直觉相反的是，计算的瓶颈通常就在访存效率上
因此 CUDA 的作用，就是实现硬件的充分利用，提升效率
之前提到，一个 warp 的大小是 32 个线程，一个 SM 中 4 个 warp
一个线程对应 8 个字节，相乘为 1024 B 正好就是 amplifier 的 1024 个字节
CUDA 正是充分利用这一点提高内存访问效率的
简而言之，就是充分利用一切硬件资源

CUTLASS

CUDA TO CUTLASS

CUTLASS 是一个算子库
具有与高性能、高解耦的优点
CUTLASS 为不同种场景提供了多种 template ，使用时传入参数进行实例化
与 CUTLASS 相对应的是 cublas
cublas 是英伟达的闭源库，效率较高但扩展性差
具体而言，使用 CUTLASS 执行矩阵乘法主要分为三个步骤：
prologue、tensorcore、epilogue
prologue 主要包括了从 global memory 到 shared memory 的 overhead
tensorcore 则是实际进行运算的阶段
epilogue 包括了对矩阵乘法的一些额外操作，比如乘系数、加 bias 等等，也包括了写回到 global 的过程

mainloop

for (int cta_n = 0; cta_n < GemmN; cta_n += CtaTileN) {
  for (int cta_m = 0; cta_m < GemmM; cta_m += CtaTileM) {
    
    for (int cta_k = 0; cta_k < GemmK; cta_k += CtaTileK) {
      
      
      for (int warp_n = 0; warp_n < CtaTileN; warp_n += WarpTileN) {
        for (int wapr_m = 0; warp_m < CtaTileM; warp_m += WarpTileM) {
          
          for (int wapr_k = 0; warp_k < CtaTileK; warp_k += WarpTileK) {
            
            
            for (int mma_k = 0; mma_k < WarpTileK; mma_k += MmaK) {
              for (int mma_n = 0; mma_n < WarpTileN; mma_n += MmaN) {
                for (int mma_m = 0; mma_m < WarpTileM; mma_m += MmaM) {
                  
                  mma_instruction(d, a, b, c);
                  
                }
              }
            }
          }
        }
      }
    }
  }
}

这是 gemm 实现的代码框架
具体的操作：从大到小，对矩阵进行分块
最外面一层 cta 对应的是 block 的大小
block 中再对 warp 进行分块，warp 中再对寄存器分块
实现了充分利用局部性，达到高访存效率
block 对应 global memory 到 shared memory 的过程
而 shared memory 的内部还有到寄存器的过程
最终内部的函数实现的是用寄存器进行的矩阵乘法

prefetch & latency

对于朴素的想法，即每次迭代到新的 block 时直接到 global memory 访存
这样势必导致不必要的 global to shared 的访存开销
于是有了 prefetch 的方法以达到掩盖 latency 的目的
具体而言就是多开一倍的 buffer ，每次提前到 global 中读取下一次迭代的数据存入 buffer
而读取过程往往用时很长，在等待数据取回的同时对另一半的 buffer 中数据进行计算
这是在 shared 层次的数据预取，同理在 register 层次也是一样，多开一倍寄存器即可
或者等效成将矩阵规模减小一半也可达到类似效果
当然，一次读取的也不一定是 2 倍，具体由 stage 大小决定

2.X TO 3.X

CUTLASS 推出了 3.0 版本以适应 Hopper 及之后的架构
具体的实现细节比较复杂，这里只简单介绍
大量的 Argument 从手动指定改为了自动推断
在 3.0 之前，每个 warp 还都是 prologue tensorcore epilogue 的相同流程
这种 schedule 被称为 ampere style
这样的弊端在于，prologue 和 epilogue 的 latency 都暴露在外，造成不必要的开销
为了解决这一浪费， 3.0 首先引入 kernel scheduler 过程中的 kernel TMA 机制
TMA 实现了在一个 warp 中执行 prologue 使得其余 warp 不必重复执行
因为保持了 ampere style 此时 prologue 仍然 exposed
然后介绍新的 warp specialized 机制：不同 warp 得以分别执行不同阶段
于是我们得以让部分 warp 执行 prologue ，其余进行 tensorcore 和 epilogue
这种 schedule 被成为 hopper style，此时虽然仍然将 latency 暴露在外，且造成了寄存器浪费，但使得优化成为了可能
cooperative 使用了多一倍的 warp 去执行 tensorcore，利用满了寄存器，但是由于是同步执行，没有解决 latency 过程
最后是 pingpong 的引入，这也是 hopper 中实际使用的方法
通过 pingpong 通过两个 group 交替执行计算，实现了 latency 的隐藏
具体而言，一个 group 执行完 tensorcore 进入 epilogue 时，切换另一组 group 进行下一个 tile 的 tensorcore
实际 hopper 中常用的是 coop 与 pingpong，一般 pingpong 更优
顺带一提，TMA 的使用条件是数据类型为 16bit
因此 lowbit 只能用传统方法，遗憾离场了
其余新架构的细节未完待续

examples

00_basic_gemm
这是一个最基础的 gemm 示例
一般来说一个 cutlass gemm.cu 主要分为三步：
定义 operator ，定义参数 args ，实例化后使用参数调用
分别对应以下三段代码

  using ColumnMajor = cutlass::layout::ColumnMajor;

  using CutlassGemm = cutlass::gemm::device::Gemm<float,        // Data-type of A matrix
                                                  ColumnMajor,  // Layout of A matrix
                                                  float,        // Data-type of B matrix
                                                  ColumnMajor,  // Layout of B matrix
                                                  float,        // Data-type of C matrix
                                                  ColumnMajor>; // Layout of C matrix

  // Define a CUTLASS GEMM type
  CutlassGemm gemm_operator;

  int M, N, K;
  float alpha, float beta;
  int lda = M;
  int ldb = K;
  int ldc = M;
  // Construct the CUTLASS GEMM arguments object.
  CutlassGemm::Arguments args({M , N, K},  // Gemm Problem dimensions
                              {A, lda},    // Tensor-ref for source matrix A
                              {B, ldb},    // Tensor-ref for source matrix B
                              {C, ldc},    // Tensor-ref for source matrix C
                              {C, ldc},    // Tensor-ref for destination matrix D (may be different memory than source C matrix)
                              {alpha, beta}); // Scalars used in the Epilogue

  //
  // Launch the CUTLASS GEMM kernel.
  //
  cutlass::Status status = gemm_operator(args);
  //
  // Return a cudaError_t if the CUTLASS GEMM operator returned an error code.
  //
  if (status != cutlass::Status::kSuccess) {
    return cudaErrorUnknown;
  }

07_volta_tensorop_gemm
这是一个添加了 linear 计算到 epilogue 中的 gemm 示例
由于矩阵乘法之后经常会在其后添加一些线性运算
这里以 \(\alpha \times C \ + \ \beta \times D\) 为例
因此通过集成到 epilogue 中的方式
可以有效降低不必要的访存 latency，以及 epilogue latency 的掩盖

using ElementAccumulator = float;                   // <- data type of accumulator
using ElementComputeEpilogue = ElementAccumulator;  // <- data type of epilogue operations
using ElementInputA = cutlass::half_t;              // <- data type of elements in input matrix A
using ElementInputB = cutlass::half_t;              // <- data type of elements in input matrix B
using ElementOutput = float;                        // <- data type of elements in output matrix D

// The code section below describes matrix layout of input and output matrices. Column Major for
// Matrix A, Row Major for Matrix B and Row Major for Matrix C
using LayoutInputA = cutlass::layout::ColumnMajor;
using LayoutInputB = cutlass::layout::RowMajor;
using LayoutOutput = cutlass::layout::RowMajor;

// This code section describes whether you want to use tensor cores or regular SIMT cores on GPU SM
using MMAOp = cutlass::arch::OpClassTensorOp;

// This code section describes CUDA SM architecture number
using SmArch = cutlass::arch::Sm70;

// This code section describes the tile size a thread block will compute
using ShapeMMAThreadBlock =
    cutlass::gemm::GemmShape<128, 128, 32>;  // <- threadblock tile M = 128, N = 128, K = 32
// This code section describes tile size a warp will compute
using ShapeMMAWarp = cutlass::gemm::GemmShape<64, 64, 32>;  // <- warp tile M = 64, N = 64, K = 32 
// This code section describes the size of MMA op
using ShapeMMAOp = cutlass::gemm::GemmShape<8, 8, 4>;  // <- MMA Op tile M = 8, N = 8, K = 4

// This code section describes how threadblocks are scheduled on GPU
using SwizzleThreadBlock = cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>;  // <- ??

// This code section describes ?
using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
    ElementOutput,                                     // <- data type of output matrix
    128 / cutlass::sizeof_bits<ElementOutput>::value,  // <- this is the number of elements per
                                                       // vectorized memory access. For half
                                                       // precision, it's 8 elements. This becomes
                                                       // the vector width of math instructions in
                                                       // epilogue too
    ElementAccumulator,                                // <- data type of accumulator
    ElementComputeEpilogue>;  // <- data type for alpha/beta in linear combination function

// Number of pipelines you want to use
constexpr int NumStages = 2;

using Gemm = cutlass::gemm::device::Gemm<ElementInputA,
                                         LayoutInputA,
                                         ElementInputB,
                                         LayoutInputB,
                                         ElementOutput,
                                         LayoutOutput,
                                         ElementAccumulator,
                                         MMAOp,
                                         SmArch,
                                         ShapeMMAThreadBlock,
                                         ShapeMMAWarp,
                                         ShapeMMAOp,
                                         EpilogueOp,
                                         SwizzleThreadBlock,
                                         NumStages>;

  const int length_m = 5120;
  const int length_n = 4096;
  const int length_k = 4096;

  // Create a tuple of problem size for matrix multiplication
  cutlass::gemm::GemmCoord problem_size(length_m, length_n, length_k);

  // Initialize tensors using CUTLASS helper functions
  cutlass::HostTensor<ElementInputA, LayoutInputA> tensor_a(
      problem_size.mk());  // <- Create matrix A with dimensions M x K
  cutlass::HostTensor<ElementInputB, LayoutInputB> tensor_b(
      problem_size.kn());  // <- Create matrix B with dimensions K x N
  cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_c(
      problem_size.mn());  // <- Create matrix C with dimensions M x N
  cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_d(
      problem_size.mn());  // <- Create matrix D with dimensions M x N used to store output from
                           // CUTLASS kernel
  cutlass::HostTensor<ElementOutput, LayoutOutput> tensor_ref_d(
      problem_size.mn());  // <- Create matrix D with dimensions M x N used to store output from
                           // reference kernel

  // Fill input and output matrices on host using CUTLASS helper functions
  cutlass::reference::host::TensorFillRandomUniform(
      tensor_a.host_view(),
      1,
      ElementInputA(4),
      ElementInputA(-4),
      0);  // <- Fill matrix A on host with uniform-distribution random data
  cutlass::reference::host::TensorFillRandomUniform(
      tensor_b.host_view(),
      1,
      ElementInputB(4),
      ElementInputB(-4),
      0);  // <- Fill matrix B on host with uniform-distribution random data
  cutlass::reference::host::TensorFillRandomUniform(
      tensor_c.host_view(),
      1,
      ElementOutput(4),
      ElementOutput(-4),
      0);  // <- Fill matrix C on host with uniform-distribution random data
  cutlass::reference::host::TensorFill(
      tensor_d.host_view());  // <- fill matrix D on host with zeros
  cutlass::reference::host::TensorFill(
      tensor_ref_d.host_view());  // <- fill matrix D for reference on host with zeros

  // Copy data from host to GPU
  tensor_a.sync_device();
  tensor_b.sync_device();
  tensor_c.sync_device();
  tensor_d.sync_device();
  tensor_ref_d.sync_device();

  // Initialize alpha and beta for dot product computation
  ElementComputeEpilogue alpha = ElementComputeEpilogue(1);
  ElementComputeEpilogue beta = ElementComputeEpilogue(0);

  // Create a tuple of gemm kernel arguments. This is later passed as arguments to launch
  // instantiated CUTLASS kernel
  typename Gemm::Arguments arguments{problem_size,  // <- problem size of matrix multiplication
                                     tensor_a.device_ref(),  // <- reference to matrix A on device
                                     tensor_b.device_ref(),  // <- reference to matrix B on device
                                     tensor_c.device_ref(),  // <- reference to matrix C on device
                                     tensor_d.device_ref(),  // <- reference to matrix D on device
                                     {alpha, beta},          // <- tuple of alpha and beta
                                     split_k_slices};        // <- k-dimension split factor


  // Using the arguments, query for extra workspace required for matrix multiplication computation
  size_t workspace_size = Gemm::get_workspace_size(arguments);

  // Allocate workspace memory
  cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);

  // Instantiate CUTLASS kernel depending on templates
  Gemm gemm_op;

  // Check the problem size is supported or not 
  cutlass::Status status = gemm_op.can_implement(arguments);
  CUTLASS_CHECK(status);

  // Initialize CUTLASS kernel with arguments and workspace pointer
  status = gemm_op.initialize(arguments, workspace.get());
  CUTLASS_CHECK(status);

  // Launch initialized CUTLASS kernel
  status = gemm_op();
  CUTLASS_CHECK(status);

posted @ 2025-02-17 17:09 actypedef 阅读(422) 评论(1) 收藏举报

刷新页面返回顶部

actypedef

CUDA & CUTLASS (2025/2/23)

CUDA

C TO CUDA

线程与并行计算

内存管理

why CUDA

CUTLASS

CUDA TO CUTLASS

mainloop

prefetch & latency

2.X TO 3.X

examples

公告