2. CUDA--Heterogeneous data parallel computing

Using CUDA C to develop a simple data parallel program

2.1 Data parallelism

The chapter illustrates data parallelism with an image computation: each output pixel can be computed independently of every other pixel, which is exactly what makes the problem data parallel.
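
A minimal host-side sketch of that idea (my own illustration, not the book's code; the function name and luminance weights are assumptions): every loop iteration touches a different pixel, so the iterations can run in any order, or all at once.

    // Convert an n-pixel RGB image to grayscale. Each output pixel depends
    // only on its own input pixel, so all n iterations are independent.
    void colorToGray(unsigned char* gray, const unsigned char* rgb, int n) {
        for (int i = 0; i < n; ++i) {
            gray[i] = (unsigned char)(0.21f * rgb[3 * i]       // red
                                    + 0.72f * rgb[3 * i + 1]   // green
                                    + 0.07f * rgb[3 * i + 2]); // blue
        }
    }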

2.2 CUDA C program structure

CUDA C code consists of two parts: host code, which runs on the CPU, and device code, which runs on the GPU.

Execution flow of a CUDA program: Host ⇒ Device ⇒ wait for the device to finish ⇒ Host

2.3 A vector addition kernel

Device-side names are conventionally given the suffix _d.

Host-side names are conventionally given the suffix _h.

The sequential (host-only) version of the code is as follows:

// Compute vector sum C_h = A_h + B_h
void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
	for (int i = 0; i < n; ++i) {
		C_h[i] = A_h[i] + B_h[i];
	}
} 
int main() { 
	// Memory allocation for arrays A, B, and C
	// I/O to read A and B, N elements each
	... 
	vecAdd(A, B, C, N); 
} 

Revised for parallel execution, the outline is as follows:

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // Part 1: Allocate device memory for A, B, and C
    // Copy A and B to device memory
    ...

    // Part 2: Call kernel – to launch a grid of threads
    // to perform the actual vector addition
    ...

    // Part 3: Copy C from the device memory
    // Free device vectors
}

In other words: allocate device memory ⇒ copy host memory to device memory ⇒ call the kernel ⇒ copy the result back from device memory ⇒ free the memory (device, and host if needed).

2.4 Device global memory and data transfer

The memory we traditionally talk about is the host's main memory, while the GPU's counterpart is called global memory (commonly known as "video memory"); the different names distinguish the two kinds of storage.

Transferring data from host to device means copying it from the host's main memory to the device's global memory.

cudaMalloc(param_1, size)

  Its purpose is similar to that of malloc().

  The first parameter is the address of a pointer variable.

  The second parameter is the size of the allocation in bytes (a size_t).

Notice:

  • The address must be cast to void**

    This parameter allows the cudaMalloc function to write the address of the allocated memory into the provided pointer variable regardless of its type.

  • cudaMalloc() is different from malloc(): malloc() takes only one parameter and returns a pointer to the allocated object, whereas cudaMalloc() writes the address of the allocated memory through its first argument and returns an error code instead.

cudaFree(param_1): serves the same purpose as free(); pass the pointer value as the argument.

  • Example of using cudaMalloc & cudaFree

    float *A_d;
    size_t size = n * sizeof(float);
    cudaMalloc((void **) &A_d, size);
    ...
    cudaFree(A_d);
    

cudaMemcpy():

  The first parameter is a pointer to the destination location for the data object to be copied.

  The second parameter points to the source location.

  The third parameter specifies the number of bytes to be copied.

  The fourth parameter indicates the types of memory involved in the copy: from host to host, from host to device, from device to host, from device to device.

  • Complete version of the vecAdd code

    void vecAdd(float* A_h, float* B_h, float* C_h, int n) { 
        int size = n * sizeof(float); 
        float *A_d, *B_d, *C_d; 
    
        cudaMalloc((void **) &A_d, size); 
        cudaMalloc((void **) &B_d, size); 
        cudaMalloc((void **) &C_d, size); 
    
        cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice); 
        cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice); 
    
        // Kernel invocation code – to be shown later 
        ... 
    
        cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost); 
    
        cudaFree(A_d); 
        cudaFree(B_d); 
        cudaFree(C_d); 
    }
    
  • Error Checking

    You mostly need to write these checks yourself; for example, the code below reports an error if the allocation exceeds the available memory:

    cudaError_t err = cudaMalloc((void **) &A_d, size);
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err),
               __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }
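
    Since every CUDA runtime call can fail, a common convenience is to fold this check into a macro (a sketch of my own, not from the book):

    // Wrap any CUDA runtime call; on failure, print where it happened and exit.
    #define CUDA_CHECK(call)                                             \
        do {                                                             \
            cudaError_t err = (call);                                    \
            if (err != cudaSuccess) {                                    \
                printf("%s in %s at line %d\n", cudaGetErrorString(err), \
                       __FILE__, __LINE__);                              \
                exit(EXIT_FAILURE);                                      \
            }                                                            \
        } while (0)

    // Usage: CUDA_CHECK(cudaMalloc((void **) &A_d, size));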

2.5 Kernel functions and threading

Note that SPMD is not the same as SIMD (single instruction multiple data) [Flynn 1972]. In an SPMD system the parallel processing units execute the same program on multiple parts of the data. However, these processing units do not need to be executing the same instruction at the same time. In an SIMD system, all processing units are executing the same instruction at any instant.

Thread ⇒ Block ⇒ Grid

All blocks of a grid are of the same size. Each block can contain up to 1024 threads on current systems.

The number of threads in a block is best chosen as a multiple of 32 (the warp size).

dim3 : a struct with three unsigned integer fields (x, y, and z)

threadIdx : the unique index of a thread within its block

blockIdx : the unique index of a block within the grid

There is no gridIdx.

The book gives a telephone analogy based in Illinois: threadIdx is like a phone number and blockIdx is like an area code. Just as a call dialed without an area code defaults to a number within the caller's own area, a threadIdx on its own identifies a thread only within its own block.

__global__ :This keyword indicates that the function is a kernel and that it can be called to generate a grid of threads on a device.

  • Extending vecAdd into a kernel

    // Compute vector sum C = A + B 
    // Each thread performs one pair-wise addition 
    __global__ 
    void vecAddKernel(float* A, float* B, float* C, int n) { 
        int i = threadIdx.x + blockDim.x * blockIdx.x; 
        if (i < n) { 
            C[i] = A[i] + B[i]; 
        } 
    }
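
    A quick worked example of the index expression: with 256 threads per block (blockDim.x = 256), thread 3 of block 2 computes i = 3 + 256 * 2 = 515, so every thread handles exactly one element. The if (i < n) guard disables the surplus threads in the last block when n is not a multiple of the block size.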
    

CUDA C keywords for function declaration

Qualifier keyword     Callable from    Executed on    Executed by
__host__ (default)    Host             Host           Caller host thread
__global__            Host             Device         New grid of device threads
__device__            Device           Device         Caller device thread

Note:

  • By default, all functions in a CUDA program are host functions if they do not have any of the CUDA keywords in their declaration.
  • A function declaration can use both __host__ and __device__; the compiler then generates a host version and a device version of the function (see the sketch below).
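
A minimal sketch of the combined qualifiers (my own example, not from the book): nvcc compiles one host copy and one device copy of the same function, so it is callable from either side.

    // Callable both from host code and from kernels.
    __host__ __device__ float square(float x) {
        return x * x;
    }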

2.6 Calling kernel functions

  • Calling a kernel function

    void vecAdd(float* A, float* B, float* C, int n) {
        // A_d, B_d, C_d allocations and copies omitted
        ...
        // Launch ceil(n/256) blocks of 256 threads each
        vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
    }
    
  • Complete version of vecAdd

    void vecAdd(float* A, float* B, float* C, int n) { 
        float *A_d, *B_d, *C_d; 
        int size = n * sizeof(float); 
    
        cudaMalloc((void **) &A_d, size); 
        cudaMalloc((void **) &B_d, size); 
        cudaMalloc((void **) &C_d, size); 
    
        cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice); 
        cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice); 
    
        vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n); 
    
        cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost); 
    
        cudaFree(A_d); 
        cudaFree(B_d); 
        cudaFree(C_d); 
    }
    

function<<<grid, block>>>(param);

The first configuration parameter gives the number of blocks in the grid.

The second specifies the number of threads in each block.
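
The configuration can equivalently be written with explicit dim3 values; for a 1D problem the unused y and z fields are set to 1 (a sketch consistent with the launch above):

    dim3 dimGrid(ceil(n / 256.0), 1, 1);   // number of blocks in the grid
    dim3 dimBlock(256, 1, 1);              // number of threads per block
    vecAddKernel<<<dimGrid, dimBlock>>>(A_d, B_d, C_d, n);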

2.7 Compilation

nvcc, the CUDA compiler. Its usage is similar to gcc's; see the documentation NVIDIA provides for details.
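
A typical invocation might look like this (illustrative; the source file name is an assumption):

    nvcc vecadd.cu -o vecadd    # host code and device code are compiled together
    ./vecadd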


2.8 Summary

A bit of advice: read the official CUDA documentation frequently. Beyond that, this section simply summarizes what the chapter covered.

Exercises

https://github.com/EternalBady/CUDA_freshMan/blob/main/Part-I%20%20Fundamental%20Concepts/Chapter-2.md
