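Python profiling and a ctypes-based solution for optimizing performance by calling C interfaces

Python runs slower than C++. In deep learning work, for example, large-scale data processing can be a bottleneck even with multiprocess Python, so performance optimization matters. The following is a ctypes-based optimization workflow: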

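I. Performance analysis

The first step is to find out which parts of the code are slow and what share of the total runtime each part takes. The line-profiler tool is used for this.

Install: pip install line-profiler

Usage:

(1) No import is required: kernprof injects the profile decorator into builtins when the script is run with -l.

(2) Add the "@profile" decorator in front of each function to be profiled.

(3) Run: kernprof -l xxxxx.py

-l enables line-by-line timing. The results are written automatically to xxxxx.py.lprof in the current working directory. To print them to the terminal right away, add the -v flag.

(The output table lists, for each line, the hit count, the time spent, the time per hit, and the percentage of total time.)

(4) Run: python -m line_profiler xxxxx.py.lprof to view the .lprof file.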
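Steps (2) to (4) can be tried on a small script such as the sketch below. The function name sum_of_squares and the fallback decorator are illustrative assumptions, not part of the original workflow; the fallback only exists so the same file also runs under plain python, where kernprof has not injected profile.

```python
import builtins

# `kernprof -l` injects `profile` into builtins. Under plain `python` it is
# absent, so fall back to a no-op decorator (an assumption for convenience).
profile = getattr(builtins, "profile", lambda func: func)


@profile
def sum_of_squares(n):
    """Toy workload: the loop body is what the line profiler will time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    print(sum_of_squares(100000))
```

Saved as xxxxx.py, it can be profiled with kernprof -l -v xxxxx.py, which prints the per-line table for sum_of_squares.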

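II. ctypes-based performance optimization

1. Purpose: ctypes lets Python call C interfaces. A module implemented in C++ can be compiled into a .dll or .so, and Python then calls the functions exported by that library to speed things up.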
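Before moving to a custom library, the calling pattern can be seen in isolation with a function every platform already ships. The sketch below is an illustration only: it loads the C runtime (msvcrt on Windows, libc elsewhere, both assumptions about the host system) and calls strlen. The helper name c_strlen is hypothetical.

```python
import ctypes
import ctypes.util
import sys

# Load the C runtime. On POSIX, find_library may return None, in which case
# CDLL(None) exposes the symbols already linked into the running process.
if sys.platform == "win32":
    libc = ctypes.CDLL("msvcrt")
else:
    libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of a NUL-terminated byte string using C's strlen."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"hello"))
```

Declaring argtypes and restype up front is the same step the example in the next part performs for its own exported functions.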

2. Example (the goal is to move the cv2.imread() read step into C++):

(1) C++ code:

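#include <opencv2/opencv.hpp>
#include <cstdlib>
#include <cstring>

// DLLEXPORT marks a function for export from the DLL with C linkage (no name mangling).
#define DLLEXPORT extern "C" __declspec(dllexport)

typedef struct Ret {
    int* buffer_args;   // {rows, cols, channels}
    uchar* buffer_img;  // pixel data, rows * cols * channels bytes
} Ret, *Ret_p;

DLLEXPORT Ret_p imread(const char* img_path) {
    cv::Mat img = cv::imread(img_path);
    if (img.empty()) {
        return NULL;  // file missing or could not be decoded
    }
    int img_h = img.rows;
    int img_w = img.cols;
    int img_c = img.channels();
    size_t n_bytes = (size_t)img_h * img_w * img_c;

    Ret_p ret_p = (Ret_p)malloc(sizeof(Ret));

    // Copy the pixels out of the cv::Mat, which is destroyed when this function returns.
    uchar* buffer_img = (uchar*)malloc(n_bytes);
    memcpy(buffer_img, img.data, n_bytes);
    ret_p->buffer_img = buffer_img;

    // Pass the image shape back alongside the pixel buffer.
    int* buffer_args = (int*)malloc(sizeof(int) * 3);
    int args[3] = { img_h, img_w, img_c };
    memcpy(buffer_args, args, 3 * sizeof(int));
    ret_p->buffer_args = buffer_args;
    return ret_p;
}

// Free everything allocated by imread: both buffers and the struct itself.
DLLEXPORT void release(Ret_p ret_p) {
    if (ret_p == NULL) {
        return;
    }
    free(ret_p->buffer_img);
    free(ret_p->buffer_args);
    free(ret_p);
}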

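As the code above shows, the C++ side runs the module's logic, stores the output in heap memory, and Python then reads that memory.

Configure the project to build a .dll.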

(2) Python code:

import ctypes

import cv2
import numpy as np

c_uint8_p = ctypes.POINTER(ctypes.c_ubyte)
c_int_p = ctypes.POINTER(ctypes.c_int)  # ctypes has no predefined int pointer type, so build one


class Ret(ctypes.Structure):
    # Field order and types must mirror the C++ struct Ret.
    _fields_ = [("args", c_int_p),
                ("img_p", c_uint8_p)]


def main():
    src_img_path = "./template.jpg"
    # Load the DLL whose functions are called below.
    # extern "C" exports use the cdecl convention, so CDLL is used rather than WinDLL.
    dll = ctypes.CDLL(r"./match.dll")

    # Input side: declare ctypes argument types that match the C++ signature.
    dll.imread.argtypes = [ctypes.c_char_p]
    # Output side: imread returns a pointer to the custom struct.
    dll.imread.restype = ctypes.POINTER(Ret)
    dll.release.argtypes = [ctypes.POINTER(Ret)]
    dll.release.restype = None

    # Call imread in the DLL; the result is a ctypes pointer to the struct.
    pointer = dll.imread(src_img_path.encode("utf-8"))
    if not pointer:
        raise IOError("imread failed for " + src_img_path)

    # np.fromiter pulls values from an iterator, here by stepping through
    # the memory the pointer refers to. c_int is 32-bit, hence np.int32.
    ret_args = np.fromiter(pointer.contents.args, dtype=np.int32, count=3)
    print(ret_args)
    h, w, c = (int(v) for v in ret_args)

    img = np.fromiter(pointer.contents.img_p, dtype=np.uint8, count=h * w * c)
    img = img.reshape((h, w, c))
    cv2.imwrite("./1.jpg", img)

    # img is already a copy, so the native memory can be handed back to the DLL to free.
    dll.release(pointer)


if __name__ == "__main__":
    main()

Note: using ctypes involves converting data between three sides, C++, ctypes, and Python. For the type correspondence, see: https://blog.csdn.net/wchoclate/article/details/11684905
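A few of the most common correspondences can be checked directly in Python. The sketch below uses only standard ctypes types; the variable names and values are arbitrary illustrations and do not come from the linked table.

```python
import ctypes

# C int            <-> ctypes.c_int     <-> Python int
n = ctypes.c_int(42)

# C double         <-> ctypes.c_double  <-> Python float
x = ctypes.c_double(1.5)

# C char*          <-> ctypes.c_char_p  <-> Python bytes
s = ctypes.c_char_p(b"abc")

# C unsigned char[3] <-> (ctypes.c_ubyte * 3) <-> sequence of Python ints
arr = (ctypes.c_ubyte * 3)(1, 2, 3)

# C unsigned char* <-> ctypes.POINTER(ctypes.c_ubyte), indexable like a C pointer
p = ctypes.cast(arr, ctypes.POINTER(ctypes.c_ubyte))

if __name__ == "__main__":
    print(n.value, x.value, s.value, list(arr), p[2])
```

The .value attribute converts a simple ctypes object back to the Python type, while arrays and pointers are read by indexing.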

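3. Other notes:

(1) In an actual test, native cv2.imread() turned out to be faster than this ctypes approach. Although the read itself happens in C++ (which stores the image in memory as a one-dimensional array), Python still has to pull the data out of memory with np.fromiter and reshape it. That step is very slow because it reads memory one element at a time through the iterator.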
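One possible way around the per-element iteration, which the original experiment did not cover, is np.ctypeslib.as_array: it wraps the memory behind a ctypes pointer as a NumPy array without looping in Python. The sketch below is a hedged illustration under stated assumptions: it fakes the native buffer with a ctypes array instead of loading match.dll, and the helper names and the 4 x 5 x 3 shape are hypothetical.

```python
import ctypes

import numpy as np


def image_via_fromiter(ptr, h, w, c):
    """Per-element copy, as in the example above (slow for large images)."""
    return np.fromiter(ptr, dtype=np.uint8, count=h * w * c).reshape((h, w, c))


def image_via_as_array(ptr, h, w, c):
    """Wrap the pointed-to memory as an array, then copy it in one block.

    The copy makes the result independent of the native buffer, so the
    buffer can be freed afterwards.
    """
    return np.ctypeslib.as_array(ptr, shape=(h, w, c)).copy()


# Stand-in for the buffer a native library would return (assumption).
h, w, c = 4, 5, 3
buf = (ctypes.c_ubyte * (h * w * c))(*range(h * w * c))
ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_ubyte))

slow = image_via_fromiter(ptr, h, w, c)
fast = image_via_as_array(ptr, h, w, c)

if __name__ == "__main__":
    print(np.array_equal(slow, fast))
```

Whether this closes the gap to native cv2.imread() would need to be measured; the sketch only shows that both routes produce the same array.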

posted @ 2021-01-07 17:59 outthinker
