MOJO🔥 Practice on LeetGPU
Original post: https://www.cnblogs.com/BobHuang/p/18890009
Chris Lattner's recent writing on democratizing AI compute is a great read, and in "Modular's bet to break out of the Matrix" he makes a strong pitch for MOJO🔥, the language he is building.
To learn MOJO you can read the official docs, the code of the open-source libraries, and the open-source goodie pack. In real development the appeal is the slogan "Write CPU+GPU code together, unlocking incredible performance"; as someone still learning kernels, I'm just here to get a feel for it.
For now these are all naive implementations; I'll explore further when I have time. It feels a lot like writing CUDA, and I still think Triton and CuTeDSL have the edge. The code below has been pushed to my LeetGPU GitHub repo, which is easier to browse. This series also includes a LeetGPU beginner's tutorial (CUDA guide best practices).
一、About Chris Lattner
Chris Lattner is the founder of the LLVM project. Between 2000 and 2005 he developed the early versions of LLVM largely on his own and wrote most of the initial code. He later served as chief architect of Apple's compiler team and created Swift, Apple's own programming language. I first came across clang and Swift on the Mac back in middle school: LLVM was first used on the Mac for low-level optimization (such as OpenGL shader compilation) around 10.4, shipped as an experimental option in 10.5, and clang became a selectable compiler in 10.6; the specific Mac versions are shown below.

In 2018 Chris Lattner led the development of MLIR at Google Brain, and it was open-sourced in April 2019. Thanks to Chris for keeping us fed — I personally think this compiler infrastructure is well designed, and it helped give rise to Triton. Lattner's stated motivation for MLIR was to fix the fragmentation of the AI compiler ecosystem, yet somehow everything still ends up shaped around Nvidia. That Triton was open-sourced at all is genuinely heartening, and it leaves others a bit of room. Modular does not look like it will open-source its low-level compiler (they have a business to run); the model is Python-style kernel authoring with an MLIR-based backend.

二、Vector Addition
Vector C = vector A + vector B. Writing it is not much different from CUDA: this is thread-level programming, and the language reads like Python. Mojo is statically typed and does not support implicit conversions, so comparing the computed idx against N requires an explicit cast. Note the use of @parameter; the template comment describes the decorated function as "The function that will be launched and distributed across GPU threads." — in effect it marks the kernel so it can be passed to enqueue_function as a compile-time constant parameter.
from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from memory import UnsafePointer
from math import ceildiv

@parameter
fn vector_add_kernel(A: UnsafePointer[Float32], B: UnsafePointer[Float32], C: UnsafePointer[Float32], N: Int32):
    var idx = block_dim.x * block_idx.x + thread_idx.x
    if Int32(idx) < N:
        C[idx] = A[idx] + B[idx]

# A, B, C are device pointers (i.e. pointers to memory on the GPU)
@export
def solve(A: UnsafePointer[Float32], B: UnsafePointer[Float32], C: UnsafePointer[Float32], N: Int32):
    var BLOCK_SIZE: Int32 = 128
    var ctx = DeviceContext()
    var num_blocks = ceildiv(N, BLOCK_SIZE)
    ctx.enqueue_function[vector_add_kernel](
        A, B, C, N,
        grid_dim = num_blocks,
        block_dim = BLOCK_SIZE
    )
    ctx.synchronize()
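On LeetGPU the judge calls solve for you; to poke at the kernel locally you need a small harness of your own. The sketch below is only an assumption-laden illustration (not from the repo above): it presumes the kernel above is in the same file and that the current Mojo GPU API exposes DeviceContext.enqueue_create_buffer, DeviceBuffer.map_to_host and unsafe_ptr as in the official GPU examples; those names may drift between releases.

from gpu.host import DeviceContext
from math import ceildiv

def main():
    alias n = 4096
    alias BLOCK_SIZE = 128
    var ctx = DeviceContext()
    # Device buffers, filled through a temporary host mapping.
    var a = ctx.enqueue_create_buffer[DType.float32](n)
    var b = ctx.enqueue_create_buffer[DType.float32](n)
    var c = ctx.enqueue_create_buffer[DType.float32](n)
    with a.map_to_host() as a_host:
        for i in range(n):
            a_host[i] = 1.0
    with b.map_to_host() as b_host:
        for i in range(n):
            b_host[i] = 2.0
    # Launch the vector_add_kernel defined above on raw device pointers.
    ctx.enqueue_function[vector_add_kernel](
        a.unsafe_ptr(), b.unsafe_ptr(), c.unsafe_ptr(), Int32(n),
        grid_dim = ceildiv(n, BLOCK_SIZE),
        block_dim = BLOCK_SIZE
    )
    ctx.synchronize()
    with c.map_to_host() as c_host:
        print(c_host[0], c_host[n - 1])  # expect 3.0 3.0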
三、Matrix Multiplication
C = A × B, where A is M×N, B is N×K, and the result C is M×K. C[m][k] is the dot product of row m of A (the elements A[m][i]) with column k of B (the elements B[i][k]). With the loops ordered K, M, N, the serial code looks like this.
#include <string.h>

void matmul(const float* A, const float* B, float* C, int M, int N, int K) {
    memset(C, 0, M * K * sizeof(float));
    for (int col = 0; col < K; col++) {
        for (int row = 0; row < M; row++) {
            for (int i = 0; i < N; i++) {
                C[row * K + col] += A[row * N + i] * B[i * K + col];
            }
        }
    }
}
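In flattened row-major indexing the inner loop accumulates \(C[row \cdot K + col] = \sum_{i=0}^{N-1} A[row \cdot N + i] \cdot B[i \cdot K + col]\), and the GPU version below simply assigns one thread to each (row, col) pair to evaluate that sum.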
Now let's rewrite it following the CUDA approach.
from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from memory import UnsafePointer
from math import ceildiv

@parameter
fn matrix_multiplication_kernel(A: UnsafePointer[Float32], B: UnsafePointer[Float32], C: UnsafePointer[Float32], M: Int32, N: Int32, K: Int32):
    var col = block_idx.x * block_dim.x + thread_idx.x
    var row = block_idx.y * block_dim.y + thread_idx.y
    if Int32(row) < M and Int32(col) < K:
        var sum: Float32 = 0.0
        for i in range(N):
            sum += A[row * N + i] * B[i * K + col]
        C[row * K + col] = sum

# A, B, C are device pointers (i.e. pointers to memory on the GPU)
@export
def solve(A: UnsafePointer[Float32], B: UnsafePointer[Float32], C: UnsafePointer[Float32], M: Int32, N: Int32, K: Int32):
    var BLOCK_SIZE: Int32 = 16
    var ctx = DeviceContext()
    var grid_dim_x = ceildiv(K, BLOCK_SIZE)
    var grid_dim_y = ceildiv(M, BLOCK_SIZE)
    ctx.enqueue_function[matrix_multiplication_kernel](
        A, B, C, M, N, K,
        grid_dim = (grid_dim_x, grid_dim_y),
        block_dim = (BLOCK_SIZE, BLOCK_SIZE)
    )
    ctx.synchronize()
The open-source kernels go much further — see max/kernels/src/linalg/matmul_sm90.mojo; I'll try that out over the weekend as well.
四、Matrix Transpose
The naive approach just swaps the indices; I copied my earlier CUDA version.
from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from memory import UnsafePointer
from math import ceildiv

fn matrix_transpose_kernel(input: UnsafePointer[Float32], output: UnsafePointer[Float32], rows: Int32, cols: Int32):
    var col = block_idx.x * block_dim.x + thread_idx.x
    var row = block_idx.y * block_dim.y + thread_idx.y
    if Int32(col) < cols and Int32(row) < rows:
        output[col * rows + row] = input[row * cols + col]

# input, output are device pointers (i.e. pointers to memory on the GPU)
@export
def solve(input: UnsafePointer[Float32], output: UnsafePointer[Float32], rows: Int32, cols: Int32):
    var BLOCK_SIZE: Int32 = 32
    var ctx = DeviceContext()
    var grid_dim_x = ceildiv(cols, BLOCK_SIZE)
    var grid_dim_y = ceildiv(rows, BLOCK_SIZE)
    ctx.enqueue_function[matrix_transpose_kernel](
        input, output, rows, cols,
        grid_dim = (grid_dim_x, grid_dim_y),
        block_dim = (BLOCK_SIZE, BLOCK_SIZE)
    )
    ctx.synchronize()
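A quick sanity check of the index mapping with my own toy sizes: for rows = 2 and cols = 3, the element at (r, c) = (0, 2) sits at input[0·3 + 2] = input[2] and is written to output[2·2 + 0] = output[4], i.e. row 2, column 0 of the 3×2 transposed result.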
五、Softmax
The softmax formula is \(\sigma(\mathbf{z})_i \;=\; \frac{\exp\bigl(z_i - \max_k z_k\bigr)} {\displaystyle\sum_{j=1}^{K} \exp\bigl(z_j - \max_k z_k\bigr)}\). Surely Mojo lets me enjoy some higher-level reduction machinery here — and indeed it does.
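Before the kernel, a quick sanity check of the max-subtraction trick with toy numbers of my own: for \(z = (1, 2, 3)\) we have \(\max_k z_k = 3\), the shifted exponentials are \(e^{-2} \approx 0.135\), \(e^{-1} \approx 0.368\), \(e^{0} = 1\), their sum is about 1.503, and the result is roughly \((0.090, 0.245, 0.665)\) — identical to the unshifted softmax, but with no risk of overflow inside exp.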
from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from gpu import barrier
from memory import UnsafePointer
from buffer import NDBuffer
from math import ceildiv, exp
from gpu.memory import AddressSpace
from algorithm._gpu.reduction import block_reduce

@parameter
fn softmax_kernel(input: UnsafePointer[Float32], output: UnsafePointer[Float32], N: Int32):
    alias BLOCK_SIZE: Int = 1024
    var tid = thread_idx.x
    if tid == 0:
        output[0] = 1
    var max_buf = NDBuffer[
        DType.float32, 1, MutableAnyOrigin, 1, address_space = AddressSpace.SHARED
    ].stack_allocation()
    var sum_buf = NDBuffer[
        DType.float32, 1, MutableAnyOrigin, 1, address_space = AddressSpace.SHARED
    ].stack_allocation()
    # Step 1: compute max
    # Seed with input[0] (always valid) and stride from tid, so threads with
    # tid >= N never read past the end of the input.
    var local_max = Scalar[DType.float32](input[0])
    for i in range(tid, N, BLOCK_SIZE):
        local_max = max(local_max, input[i])
    @parameter
    @always_inline
    fn _max[
        type: DType, width: Int
    ](x: SIMD[type, width], y: SIMD[type, width]) -> SIMD[type, width]:
        return max(x, y)

    var block_max = block_reduce[BLOCK_SIZE, _max](local_max, 0)
    if tid == 0:
        max_buf[0] = block_max
    barrier()
    # Step 2: out[i] = exp(in[i] - max) and compute sum of out[i]
    var local_sum = Scalar[DType.float32](0)
    for i in range(tid, N, BLOCK_SIZE):
        local_sum += exp(input[i] - max_buf[0])

    @parameter
    @always_inline
    fn _sum[
        type: DType, width: Int
    ](x: SIMD[type, width], y: SIMD[type, width]) -> SIMD[type, width]:
        return x + y

    var block_sum = block_reduce[BLOCK_SIZE, _sum](local_sum, 0)
    if tid == 0:
        sum_buf[0] = block_sum
    barrier()
    # Step 3: Normalize output
    for i in range(tid, N, BLOCK_SIZE):
        output[i] = exp(input[i] - max_buf[0]) / sum_buf[0]

@export
def solve(input: UnsafePointer[Float32], output: UnsafePointer[Float32], N: Int32):
    var BLOCK_SIZE: Int32 = 1024
    var ctx = DeviceContext()
    var num_blocks = ceildiv(N, BLOCK_SIZE)
    ctx.enqueue_function[softmax_kernel](
        input, output, N,
        grid_dim = num_blocks,
        block_dim = BLOCK_SIZE
    )
    ctx.synchronize()
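For checking results outside the judge, a plain CPU reference is handy. This is just my own sketch (the function name softmax_cpu is made up and not part of the LeetGPU template), using the same max-subtraction formula:

from math import exp
from memory import UnsafePointer

fn softmax_cpu(input: UnsafePointer[Float32], output: UnsafePointer[Float32], N: Int32):
    # Pass 1: global max for numerical stability.
    var m = input[0]
    for i in range(N):
        m = max(m, input[i])
    # Pass 2: shifted exponentials and their sum.
    var s: Float32 = 0.0
    for i in range(N):
        output[i] = exp(input[i] - m)
        s += output[i]
    # Pass 3: normalize.
    for i in range(N):
        output[i] = output[i] / s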
六、Performance comparison with CUDA
Since writing Mojo feels this close to CUDA, I used the same naive approach in both and compared them with identical algorithms, all on an H100, looking only at the reported time.
1、Vector Addition

2、Matrix Multiplication

3、Matrix Transpose

4、Softmax
Using the high-level reduction feature made it a bit faster. Triton, which has no cross-block reduction there, takes only 0.52937 ms, but with a fairly large BLOCK_SIZE.

