猿代码: Traditional High-Performance Optimization Techniques

High-performance algorithms

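Installing LAPACK: the LAPACK source tree includes both BLAS and LAPACK, so one build covers both. I ran into quite a few problems while downloading it and finally solved them by reading a Zhihu comment section. A CMake usage guide still needs to be added.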

cd lapack-3.11
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=RELEASE -DBUILD_SHARED_LIBS=ON
make

CMake usage guide (to be added)


Downloading and building PETSc

./configure --prefix=../petsc_install --with-mpi-dir=/thfs1/software/mpich/mpi-n-gcc9.3.0 --with-blas-lapack-dir=../lapack-3.11/build
make PETSC_DIR=/thfs1/home/monkeycode/training_system/zjk/petsc-3.18.1 PETSC_ARCH=arch-linux-c-debug all
make PETSC_DIR=/thfs1/home/monkeycode/training_system/zjk/petsc-3.18.1 PETSC_ARCH=arch-linux-c-debug install
mpicc ex1.c -o ex1 -I./petsc_install/include -L./petsc_install/lib -Wl,-rpath=./petsc_install/lib -lpetsc
srun -p thcp1 -n 1 ex1

KSP Object: 1 MPI process
  type: gmres
    restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    happy breakdown tolerance 1e-30
  maximum iterations=10000, initial guess is zero
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using PRECONDITIONED norm type for convergence test
PC Object: 1 MPI process
  type: jacobi
    type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 1 MPI process
    type: seqaij
    rows=10, cols=10
    total: nonzeros=28, allocated nonzeros=50
    total number of mallocs used during MatSetValues calls=0
      not using I-node routines
Norm of error 2.41202e-15, Iterations 5


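Program performance analysis

Static analysis: use Understand for static analysis.

Understand is mainly used to analyze a program's control flow and call relationships.

**Dynamic analysis**: use gprof for dynamic analysis.

Besides the call relationships between functions, gprof also reports how the running time is distributed across them.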

g++ -pg main.cpp -o main
srun -N 1 -n 1 -p thcp1 ./main
gprof main gmon.out >output.txt
chmod +x gprof2dot.py
./gprof2dot.py output.txt | dot -Tpng -o output.png # render the call graph as an image with gprof2dot

Timing

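#include <stdio.h>
#include <time.h>

clock_t start, end;
start = clock();
/* ... code to be timed ... */
end = clock();
/* CLOCKS_PER_SEC converts clock ticks to seconds */
printf("%f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);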

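Other analysis tools: valgrind + QCachegrind

Building and running serial HPCG: our school previously required getting HPL to run, and by comparison HPCG is clearly much simpler.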

cd setup
cp Make.Linux_Serial ../
# Edit Make.Linux_MPI and fill in the MPI path
cd .. # back to the HPCG top-level directory, where configure lives
mkdir build
cd build
../configure Linux_Serial
vim makefile # add the -pg flag
make
cd bin
srun -n 1 -N 1 -p thcp1 ./xhpcg
gprof xhpcg gmon.out >output.txt

Profiling the Jacobi program with gprof

The resulting graph is shown here:

![img](file:///C:/Users/10235/AppData/Local/Packages/Microsoft.Windows.Photos_8wekyb3d8bbwe/TempState/ShareServiceTempFolder/output.jpeg)

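Traditional performance optimization

From the architecture perspective

(1) Higher clock frequency

(2) Caches

(3) Pipelining

(4) Parallelism (superscalar execution)

Common loop optimization techniques

(1) Loop fusion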

before

int i;
for(i = 0; i < n; i++) x[i] = a[i] + b[i];
for(i = 0; i < n; i++) y[i] = a[i] - b[i];

after

int i;
for(i = 0; i < n; i++) {
    x[i] = a[i] + b[i];
    y[i] = a[i] - b[i];
}


(2) Loop unrolling

before

int i = 0;
for(i = 0; i < N; i++) A[i] = A[i] + B[i];

after

int i = 0;
/* Assumes N is a multiple of 4; otherwise a cleanup loop is needed. */
for(i = 0; i < N; i+=4) {
    A[i] = A[i] + B[i];
    A[i + 1] = A[i + 1] + B[i + 1];
    A[i + 2] = A[i + 2] + B[i + 2];
    A[i + 3] = A[i + 3] + B[i + 3];
}


(3) Loop interchange

before

int j, k, i;
for(j = 0; j < N; j++)
    for(k = 0; k < N; k++)
        for(i = 0; i < N; i++)
            A[i][j] += B[i][k] + C[k][j];

after

int j, k, i;
for(j = 0; j < N; j++)
    for(i = 0; i < N; i++)
        for(k = 0; k < N; k++)
            A[i][j] += B[i][k] + C[k][j];


(4) Loop distribution

before

int i;
for(i = 0; i < N; i++) {
    A[i] = i;
    B[i] = 2 + B[i];
    C[i] = 3 + C[i - 1];
}

after

int i;
for(i = 0; i < N; i++) {
    A[i] = i;
    B[i] = 2 + B[i];
}
for(i = 0; i < N; i++) C[i] = 3 + C[i - 1];


(5) Loop-invariant code motion

before

for(i = 0; i < N; i++)
    for(j = 0; j < M; j++) 
        U[i] += W[i] * W[i] * D[j] / (dt * dt);

after

T1 = dt * dt;
for(i = 0; i < N; i++) {
    T2 = W[i] * W[i];
    for(j = 0; j < M; j++) {
        U[i] += T2 * D[j] / T1;
    }
}


Optimizing the Jacobi example

initial

image-20240504193625645

loop fusion

image-20240504201045234

loop interchange

image-20240504201731059

loop-invariant code motion

image-20240504203515040

posted @ 2024-05-04 21:03  Hock