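Yuandaima (猿代码): Traditional High-Performance Optimization Techniques

Traditional high-performance optimization techniques

High-performance algorithms

Installing LAPACK: the LAPACK source tree ships both BLAS and LAPACK, so a single build is fairly convenient. Downloading it caused me a lot of trouble, though; I finally solved it with help from a Zhihu comment thread. A CMake usage guide still needs to be added.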

cd lapack-3.11
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=RELEASE -DBUILD_SHARED_LIBS=ON
make

CMake usage guide (to be added)

Downloading and building PETSc

./configure --prefix=../petsc_install --with-mpi-dir=/thfs1/software/mpich/mpi-n-gcc9.3.0 --with-blas-lapack-dir=../lapack-3.11/build
make PETSC_DIR=/thfs1/home/monkeycode/training_system/zjk/petsc-3.18.1 PETSC_ARCH=arch-linux-c-debug all
make PETSC_DIR=/thfs1/home/monkeycode/training_system/zjk/petsc-3.18.1 PETSC_ARCH=arch-linux-c-debug install

mpicc ex1.c -o ex1 -I./petsc_install/include -L./petsc_install/lib -Wl,-rpath=./petsc_install/lib -lpetsc
srun -p thcp1 -n 1 ex1

KSP Object: 1 MPI process
  type: gmres
    restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization with no iterative refinement
    happy breakdown tolerance 1e-30
  maximum iterations=10000, initial guess is zero
  tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
  left preconditioning
  using PRECONDITIONED norm type for convergence test
PC Object: 1 MPI process
  type: jacobi
    type DIAGONAL
  linear system matrix = precond matrix:
  Mat Object: 1 MPI process
    type: seqaij
    rows=10, cols=10
    total: nonzeros=28, allocated nonzeros=50
    total number of mallocs used during MatSetValues calls=0
      not using I-node routines
Norm of error 2.41202e-15, Iterations 5

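Program performance analysis

**Static analysis**: use Understand for static analysis

Understand is mainly used to analyze a program's control flow and call relationships.

**Dynamic analysis**: use gprof for dynamic analysis

Besides the call relationships between functions, gprof also reports how the execution time is distributed across them.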

g++ -pg main.cpp -o main
srun -N 1 -n 1 -p thcp1 ./main
gprof main gmon.out > output.txt
chmod +x gprof2dot.py
./gprof2dot.py output.txt | dot -Tpng -o output.png # generate the call-graph image with gprof2dot

Timing

#include <stdio.h>
#include <time.h>

clock_t start, end;
start = clock();
/* ... code to be timed ... */
end = clock();
printf("%f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);
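A minimal sketch of how the timing snippet above could be wrapped in a reusable helper. The names `elapsed_seconds`, `busy_sum`, and `time_busy_sum` are hypothetical and exist only for this illustration. Note that `clock()` reports processor time consumed by the program rather than wall-clock time, so the figure can differ from a stopwatch reading for multi-threaded or I/O-heavy code.

```c
#include <time.h>

/* Convert a pair of clock() readings into seconds of CPU time. */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical workload: sum of the integers 0 .. n-1. */
long long busy_sum(long long n)
{
    long long s = 0;
    for (long long i = 0; i < n; i++)
        s += i;
    return s;
}

/* Time one call to busy_sum(); the sum is stored through *result
   and the CPU time in seconds is returned. */
double time_busy_sum(long long n, long long *result)
{
    clock_t start = clock();
    *result = busy_sum(n);
    clock_t end = clock();
    return elapsed_seconds(start, end);
}
```

A caller would declare `long long r;` and print `time_busy_sum(100000000LL, &r)` with the `"%f seconds\n"` format used above. With optimization enabled the compiler may simplify such a trivial loop, so a real measurement should time the actual kernel.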

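Other analysis tools: valgrind + QCachegrind

Compiling and running serial HPCG. My school previously required us to get HPL running; by comparison, HPCG is clearly much simpler.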

cd setup
cp Make.Linux_Serial ../
# edit Make.Linux_MPI and fill in the MPI path
mkdir build
cd build
../configure Linux_Serial
vim makefile # add the -pg flag
make
cd bin
srun -n 1 -N 1 -p thcp1 ./xhpcg
gprof xhpcg gmon.out > output.txt

Profiling the Jacobi program with gprof

The resulting figure:

![gprof result for the Jacobi program](file:///C:/Users/10235/AppData/Local/Packages/Microsoft.Windows.Photos_8wekyb3d8bbwe/TempState/ShareServiceTempFolder/output.jpeg)

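Traditional performance optimization

From the architecture perspective

(1) Higher clock frequency

(2) Caches

(3) Pipelining

(4) Parallelism (superscalar execution)

Common loop optimization techniques

(1) Loop fusion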

before

int i;
for(i = 0; i < n; i++) x[i] = a[i] + b[i];
for(i = 0; i < n; i++) y[i] = a[i] - b[i];

after

int i;
for(i = 0; i < n; i++) {
    x[i] = a[i] + b[i];
    y[i] = a[i] - b[i];
}
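The before/after pair above can be packaged as two functions to check that fusion preserves the results. This is a sketch under one stated assumption: the output arrays `x` and `y` do not overlap `a` or `b`. The function names are hypothetical. The fused form reads `a[i]` and `b[i]` once per iteration instead of twice across two passes, which mainly saves memory traffic when the arrays do not fit in cache.

```c
#include <stddef.h>

/* Unfused: two separate passes over a[] and b[]. */
void add_sub_separate(const double *a, const double *b,
                      double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) x[i] = a[i] + b[i];
    for (size_t i = 0; i < n; i++) y[i] = a[i] - b[i];
}

/* Fused: one pass; a[i] and b[i] are loaded once and used twice. */
void add_sub_fused(const double *a, const double *b,
                   double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = a[i] + b[i];
        y[i] = a[i] - b[i];
    }
}
```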

(2) Loop unrolling

before

int i = 0;
for(i = 0; i < N; i++) A[i] = A[i] + B[i];

after

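int i = 0;
/* main loop, unrolled by 4 */
for(i = 0; i + 3 < N; i+=4) {
    A[i] = A[i] + B[i];
    A[i + 1] = A[i + 1] + B[i + 1];
    A[i + 2] = A[i + 2] + B[i + 2];
    A[i + 3] = A[i + 3] + B[i + 3];
}
/* cleanup loop for the last N % 4 elements */
for(; i < N; i++) A[i] = A[i] + B[i];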
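One detail worth making explicit is what happens when the trip count is not a multiple of the unroll factor. The sketch below (hypothetical function names) pairs a 4-way unrolled main loop with a cleanup loop, so it is safe for any `n`, including `n < 4`.

```c
#include <stddef.h>

/* Reference version: one element per iteration. */
void add_simple(double *A, const double *B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[i] = A[i] + B[i];
}

/* Unrolled by 4, followed by a cleanup loop for the n % 4 leftovers. */
void add_unrolled4(double *A, const double *B, size_t n)
{
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        A[i]     = A[i]     + B[i];
        A[i + 1] = A[i + 1] + B[i + 1];
        A[i + 2] = A[i + 2] + B[i + 2];
        A[i + 3] = A[i + 3] + B[i + 3];
    }
    for (; i < n; i++)          /* at most 3 remaining elements */
        A[i] = A[i] + B[i];
}
```

Unrolling reduces the number of loop-control branches and gives the compiler more independent operations to schedule. Optimizing compilers often unroll on their own, so it is worth measuring before doing it by hand.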

(3) Loop interchange

before

int j, k, i;
for(j = 0; j < N; j++)
    for(k = 0; k < N; k++)
        for(i = 0; i < N; i++)
            A[i][j] += B[i][k] + C[k][j];

after

int j, k, i;
for(j = 0; j < N; j++)
    for(i = 0; i < N; i++)
        for(k = 0; k < N; k++)
            A[i][j] += B[i][k] + C[k][j];
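Why the interchange helps: C stores two-dimensional arrays row by row. With `i` innermost, both `A[i][j]` and `B[i][k]` jump a full row between consecutive iterations. With `k` innermost, `B[i][k]` is read contiguously and `A[i][j]` stays fixed, so it can live in a register; only `C[k][j]` is still strided. The sketch below uses a hypothetical fixed size `N` and hypothetical function names to show that both orders compute the same values.

```c
#define N 64   /* hypothetical matrix size for the illustration */

/* Original order j, k, i: the innermost loop walks down columns. */
void update_jki(double A[N][N], double B[N][N], double C[N][N])
{
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                A[i][j] += B[i][k] + C[k][j];
}

/* Interchanged order j, i, k: B[i][k] is now read with stride 1. */
void update_jik(double A[N][N], double B[N][N], double C[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++)
                A[i][j] += B[i][k] + C[k][j];
}
```

For each fixed `(i, j)` both versions add the terms in increasing `k`, so the floating-point results match exactly; only the memory access pattern changes.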

(4) Loop distribution (loop fission)

before

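int i;
/* start at i = 1 so that C[i - 1] never reads before the array */
for(i = 1; i < N; i++) {
    A[i] = i;
    B[i] = 2 + B[i];
    C[i] = 3 + C[i - 1];
}

after

int i;
/* no loop-carried dependence: this loop can be vectorized */
for(i = 1; i < N; i++) {
    A[i] = i;
    B[i] = 2 + B[i];
}
/* the recurrence on C stays in its own sequential loop */
for(i = 1; i < N; i++) C[i] = 3 + C[i - 1];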
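The point of distributing this loop is the statement on `C`: each `C[i]` depends on `C[i - 1]` from the previous iteration, which forces sequential execution. Moving it into its own loop leaves the `A` and `B` updates free of loop-carried dependences. A sketch with hypothetical function names follows; the loops start at `i = 1` as an assumption of this example, so that `C[i - 1]` stays inside the array.

```c
/* Single loop: the recurrence on C serializes the whole body. */
void update_combined(int *A, int *B, int *C, int n)
{
    for (int i = 1; i < n; i++) {
        A[i] = i;
        B[i] = 2 + B[i];
        C[i] = 3 + C[i - 1];
    }
}

/* Distributed: independent updates first, recurrence in its own loop. */
void update_distributed(int *A, int *B, int *C, int n)
{
    for (int i = 1; i < n; i++) {   /* iterations are independent */
        A[i] = i;
        B[i] = 2 + B[i];
    }
    for (int i = 1; i < n; i++)     /* loop-carried dependence on C */
        C[i] = 3 + C[i - 1];
}
```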

(5) Loop-invariant code motion

before

for(i = 0; i < N; i++)
    for(j = 0; j < M; j++) 
        U[i] += W[i] * W[i] * D[j] / (dt * dt);

after

T1 = dt * dt;
for(i = 0; i < N; i++) {
    T2 = W[i] * W[i];
    for(j = 0; j < M; j++) {
        U[i] += T2 * D[j] / T1;
    }
}
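The hoisting above can be checked with a small pair of functions (hypothetical names). `dt * dt` does not depend on either loop index and `W[i] * W[i]` does not depend on `j`, so both can be computed outside the loops in which they are invariant. The hoisted form performs the same operations in the same order, so the results are expected to match.

```c
/* Direct form: W[i] * W[i] and dt * dt are recomputed M times per i. */
void accumulate_direct(double *U, const double *W, const double *D,
                       double dt, int n, int m)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            U[i] += W[i] * W[i] * D[j] / (dt * dt);
}

/* Hoisted form: each invariant is computed once at the outermost
   level where its operands are fixed. */
void accumulate_hoisted(double *U, const double *W, const double *D,
                        double dt, int n, int m)
{
    const double T1 = dt * dt;              /* invariant in both loops */
    for (int i = 0; i < n; i++) {
        const double T2 = W[i] * W[i];      /* invariant in the j loop */
        for (int j = 0; j < m; j++)
            U[i] += T2 * D[j] / T1;
    }
}
```

Optimizing compilers frequently perform this transformation themselves, so the manual version matters most when the compiler cannot prove the expression is invariant, for example because of possible pointer aliasing.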

Example: optimizing Jacobi

initial

loop fusion

loop interchange

loop-invariant code motion
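As a rough illustration of how the steps listed above might apply to a Jacobi sweep, here is a sketch on a small 2D Poisson-type stencil. It is not the course's Jacobi program: the grid size `NX`, the right-hand side `f`, the mesh width `h`, and both function names are assumptions made for this example.

```c
#define NX 16   /* hypothetical grid size, boundary points included */

/* Baseline sweep: column-wise traversal, a second pass for the
   convergence measure, and h * h recomputed at every grid point. */
double jacobi_sweep_initial(double u[NX][NX], double unew[NX][NX],
                            double f[NX][NX], double h)
{
    double diff = 0.0;
    for (int j = 1; j < NX - 1; j++)
        for (int i = 1; i < NX - 1; i++)
            unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                 u[i][j - 1] + u[i][j + 1] +
                                 h * h * f[i][j]);
    for (int j = 1; j < NX - 1; j++)
        for (int i = 1; i < NX - 1; i++) {
            double d = unew[i][j] - u[i][j];
            if (d < 0.0) d = -d;
            if (d > diff) diff = d;
        }
    return diff;   /* largest change made by this sweep */
}

/* Same arithmetic after the three transformations. */
double jacobi_sweep_optimized(double u[NX][NX], double unew[NX][NX],
                              double f[NX][NX], double h)
{
    const double h2 = h * h;             /* loop-invariant code motion */
    double diff = 0.0;
    for (int i = 1; i < NX - 1; i++)     /* interchange: rows outermost */
        for (int j = 1; j < NX - 1; j++) {
            unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                 u[i][j - 1] + u[i][j + 1] +
                                 h2 * f[i][j]);
            double d = unew[i][j] - u[i][j];   /* fusion: same pass */
            if (d < 0.0) d = -d;
            if (d > diff) diff = d;
        }
    return diff;
}
```

Both functions write the same values into `unew` and return the same maximum change; the optimized one touches each grid point once per sweep and walks memory in row order.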

posted @ 2024-05-04 21:02 Hock
