Learning the Python pycparser Library

ReadMe

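Introduction

GitHub: [GitHub - eliben/pycparser: :snake: Complete C99 parser in pure Python](https://github.com/eliben/pycparser)

pycparser is a parser for the C language, written in pure Python. It is a module designed to be easily integrated into applications that need to parse C source code.

pycparser is suitable for anything that needs C code to be parsed. Some of its uses:

• C code obfuscators
• Front ends for various specialized C compilers
• Static code checkers
• Automatic unit-test discovery
• Adding specialized extensions to the C language

One of the most popular users of pycparser is the [CFFI](https://cffi.readthedocs.io/en/latest/index.html) library, which uses it to parse declarations of C functions and types in order to auto-generate FFIs (foreign function interfaces).

pycparser is unique in that it is written in pure Python. Anyone familiar with Lex and Yacc will find its code easy to follow. It also has no external dependencies (only a Python interpreter is needed), so installation and deployment are very simple.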

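Installation

Prerequisites

pycparser has been tested with Python 3.8+ on Linux, macOS and Windows.
pycparser has no external dependencies. The only module it uses outside the standard library is PLY, which is bundled in pycparser/ply.

Note: pycparser (and PLY) use docstrings for grammar specifications. If the Python installation strips docstrings (for example, with the -OO option), pycparser cannot be instantiated or used. You can try to work around this by pre-generating the PLY parse tables in normal mode, but this is not an officially supported mode of operation.

Installation steps

The recommended way to install pycparser is with pip: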

pip install pycparser

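Usage guide

Interaction with the C preprocessor

C code must be preprocessed by the C preprocessor (cpp) before it can be compiled. cpp handles preprocessing directives such as #include and #define, removes comments, and performs other preparatory tasks.

For all but the most trivial snippets of C code, pycparser must receive preprocessed C code in order to work correctly. If you import the top-level parse_file function from the pycparser package and call it with use_cpp=True, it will invoke cpp for you, as long as cpp is on your PATH or you provide its path.

Note: gcc -E or clang -E can be used instead of cpp. See the using_gcc_E_libc.py example for details.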
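As a rough illustration of the note above, the following sketch preprocesses a small file with gcc -E before parsing it. The file name table.c, its contents and the helper parse_with_gcc are made up for this example. The block assumes that gcc may or may not be installed and simply skips the parse when it is missing.

```python
import os
import shutil
import tempfile

from pycparser import parse_file

# Hypothetical input: a tiny C file that needs preprocessing
# (it contains a macro and a comment).
C_SOURCE = "#define N 4\n/* length of the table */\nint table[N];\n"


def parse_with_gcc(path):
    """Run `gcc -E` on `path` and hand the result to pycparser."""
    return parse_file(path, use_cpp=True, cpp_path="gcc", cpp_args=["-E"])


ast = None
if shutil.which("gcc"):  # only attempt the parse when gcc is on PATH
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "table.c")
        with open(path, "w") as f:
            f.write(C_SOURCE)
        ast = parse_with_gcc(path)
    ast.show()
else:
    print("gcc not found on PATH; skipping the example")
```

For real projects, an include path such as -Iutils/fake_libc_include would typically be added to cpp_args, as the using_gcc_E_libc.py example does.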

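About standard C library headers

C code almost always uses #include to pull in standard library headers such as stdio.h. pycparser can parse the standard headers of any C compiler (with some extra effort), but it is recommended to use the "fake" C11 standard headers provided in utils/fake_libc_include instead. These headers contain only the bare minimum needed to parse files that depend on them and, because they are so minimal, they can noticeably speed up the parsing of large files.

The key point is that pycparser does not care about the semantics of types. It only needs to know whether an identifier in the source is a previously defined type. This is essential for parsing C correctly.

See this blog post for details: <https://eli.thegreenplace.net/2015/on-parsing-c-type-declarations-and-fake-headers>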
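A small sketch of why type names matter: the same statement text is parsed differently depending on whether foo_t is known to be a type. The names foo_t, f and p are invented for this illustration.

```python
from pycparser import c_parser

BODY = "void f(void) { foo_t *p; }"

# Without a typedef, foo_t is an ordinary identifier, so `foo_t *p;`
# is read as the expression `foo_t * p` (a multiplication).
plain = c_parser.CParser().parse(BODY)
stmt_plain = plain.ext[0].body.block_items[0]

# With the typedef in scope, the same text is a pointer declaration.
typed = c_parser.CParser().parse("typedef int foo_t;\n" + BODY)
stmt_typed = typed.ext[1].body.block_items[0]

print(type(stmt_plain).__name__)
print(type(stmt_typed).__name__)
```

This is the reason the fake headers only need to declare the type names: the parser needs to know that a name is a type, not what the type actually is.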

Note: the fake headers are not included in the pip package and are not installed by setup.py (see issue #224, <https://github.com/eliben/pycparser/issues/224>).

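Basic usage

See the examples directory of the distribution for usage examples. Most real-world C code must be run through the preprocessor before it is passed to pycparser, as described above.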
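For code that needs no preprocessing, a string can be parsed directly. The following is a minimal sketch; the function add and the file name <demo> are placeholders chosen for this example.

```python
from pycparser import c_parser

SOURCE = """
int add(int a, int b)
{
    return a + b;
}
"""

parser = c_parser.CParser()
ast = parser.parse(SOURCE, filename="<demo>")

func = ast.ext[0]  # first top-level node, here a FuncDef
params = [p.name for p in func.decl.type.args.params]
print(func.decl.name, params)

ast.show()  # print the tree in pycparser's indented text form
```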

Advanced usage

The public interface of pycparser is documented in detail in the comments of pycparser/c_parser.py. For the AST nodes created by the parser, see pycparser/_c_ast.cfg.

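Modifying pycparser

Keep the following in mind when modifying pycparser:

• The AST node code of pycparser is generated automatically from the configuration file _c_ast.cfg by _ast_gen.py. If you modify the AST configuration, regenerate the code by running the _build_tables.py script from the pycparser directory.
• Make sure you understand pycparser's optimized mode; it is described in the docstring of the CParser class. During development, disable optimization so that the Yacc and Lex tables are regenerated whenever you change the grammar.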

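Package contents

After unpacking the pycparser package you will see the following files and directories:

  • README.rst: the README file
  • LICENSE: the license file
  • setup.py: the installation script
  • examples/: directory of usage examples
  • pycparser/: the module source code
  • tests/: unit tests
  • utils/fake_libc_include: minimal standard C library headers, sufficient for parsing arbitrary C code. Note that these headers contain C11 content, so they may be incompatible if the preprocessor is configured for an earlier standard (such as -std=c99).
  • utils/internal/: internal utilities that you normally do not need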

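Official examples

Path: examples

How to run: from the pycparser root directory

Prerequisite: most real-world C examples must be run through the C preprocessor before the code is passed to pycparser

Test environment: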

$ gcc -v
gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC) 

$ python --version
Python 2.7.5

using_gcc_E_libc.py

Command:

python examples/using_gcc_E_libc.py

The C file being parsed:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void convert(int thousands, int hundreds, int tens, int ones)
{
char *num[] = {"", "One", "Two", "Three", "Four", "Five", "Six",
	       "Seven", "Eight", "Nine"};

char *for_ten[] = {"", "", "Twenty", "Thirty", "Forty", "Fifty", "Sixty",
		   "Seventy", "Eighty", "Ninety"};

char *af_ten[] = {"Ten", "Eleven", "Twelve", "Thirteen", "Fourteen",
		  "Fifteen", "Sixteen", "Seventeen", "Eighteen", "Ninteen"};

  printf("\nThe year in words is:\n");

  printf("%s thousand", num[thousands]);
  if (hundreds != 0)
    printf(" %s hundred", num[hundreds]);

  if (tens != 1)
    printf(" %s %s", for_ten[tens], num[ones]);
  else
    printf(" %s", af_ten[ones]);
}


int main()
{
int year;
int n1000, n100, n10, n1;

  printf("\nEnter the year (4 digits): ");
  scanf("%d", &year);

  if (year > 9999 || year < 1000)
  {
    printf("\nError !! The year must contain 4 digits.");
    exit(EXIT_FAILURE);
  }

  n1000 = year/1000;
  n100 = ((year)%1000)/100;
  n10 = (year%100)/10;
  n1 = ((year%10)%10);

  convert(n1000, n100, n10, n1);

return 0;
}

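The resulting AST, annotated with notes on some of the node types:

  • The convert function
  FuncDef:                                 # function definition (declaration plus body)
    Decl: convert, [], [], [], []          # declaration of the function; its name is convert
      FuncDecl:                            # details of the function declaration
        ParamList:                         # parameter list
          Decl: thousands, [], [], [], []  # each Decl is one parameter, here thousands
            TypeDecl: thousands, [], None  # type declaration of the parameter; [] is the qualifier list (const, volatile, etc., empty here), None is the alignment
              IdentifierType: ['int']      # concrete type of the parameter
          Decl: hundreds, [], [], [], []
            TypeDecl: hundreds, [], None
              IdentifierType: ['int']
          Decl: tens, [], [], [], []
            TypeDecl: tens, [], None
              IdentifierType: ['int']
          Decl: ones, [], [], [], []
            TypeDecl: ones, [], None
              IdentifierType: ['int']
        TypeDecl: convert, [], None        # return type declaration
          IdentifierType: ['void']         # concrete return type
    Compound:                              # compound statement, usually a function body or a block
      Decl: num, [], [], [], []            # variable declaration; the variable is named num
        ArrayDecl: []                      # array declaration
          PtrDecl: []                      # pointer declaration
            TypeDecl: num, [], None        # type declaration
              IdentifierType: ['char']     # concrete type of the variable
        InitList:                          # initializer list, used to initialize the array
          Constant: string, ""             # each Constant is one initializer value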
          Constant: string, "One"
          Constant: string, "Two"
          Constant: string, "Three"
          Constant: string, "Four"
          Constant: string, "Five"
          Constant: string, "Six"
          Constant: string, "Seven"
          Constant: string, "Eight"
          Constant: string, "Nine"
      Decl: for_ten, [], [], [], []
        ArrayDecl: []
          PtrDecl: []
            TypeDecl: for_ten, [], None
              IdentifierType: ['char']
        InitList: 
          Constant: string, ""
          Constant: string, ""
          Constant: string, "Twenty"
          Constant: string, "Thirty"
          Constant: string, "Forty"
          Constant: string, "Fifty"
          Constant: string, "Sixty"
          Constant: string, "Seventy"
          Constant: string, "Eighty"
          Constant: string, "Ninety"
      Decl: af_ten, [], [], [], []
        ArrayDecl: []
          PtrDecl: []
            TypeDecl: af_ten, [], None
              IdentifierType: ['char']
        InitList: 
          Constant: string, "Ten"
          Constant: string, "Eleven"
          Constant: string, "Twelve"
          Constant: string, "Thirteen"
          Constant: string, "Fourteen"
          Constant: string, "Fifteen"
          Constant: string, "Sixteen"
          Constant: string, "Seventeen"
          Constant: string, "Eighteen"
          Constant: string, "Ninteen"
      FuncCall:                                          # function call
        ID: printf                                       # name of the called function
        ExprList:                                        # argument list
          Constant: string, "\nThe year in words is:\n"  # "string" is the constant's type, followed by its value
      FuncCall: 
        ID: printf
        ExprList: 
          Constant: string, "%s thousand"
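          ArrayRef:           # an array reference
            ID: num           # array name
            ID: thousands     # array index
      If:                     # if statement
        BinaryOp: !=          # condition: a binary operation with the operator !=
          ID: hundreds        # left operand
          Constant: int, 0    # right operand, the constant 0
        FuncCall:             # statement executed when the condition is true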
          ID: printf
          ExprList: 
            Constant: string, " %s hundred"
            ArrayRef: 
              ID: num
              ID: hundreds
      If: 
        BinaryOp: !=
          ID: tens
          Constant: int, 1
        FuncCall:             # statement executed when the condition is true
          ID: printf
          ExprList: 
            Constant: string, " %s %s"
            ArrayRef: 
              ID: for_ten
              ID: tens
            ArrayRef: 
              ID: num
              ID: ones
        FuncCall:             # statement executed when the condition is false
          ID: printf
          ExprList: 
            Constant: string, " %s"
            ArrayRef: 
              ID: af_ten
              ID: ones
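To check which attribute each position in such a dump corresponds to, the tree can be printed with attribute names. This is a small sketch on a one-line declaration (chosen for brevity, not taken from the example file); it relies on the attrnames and buf parameters of Node.show.

```python
import io

from pycparser import c_parser

ast = c_parser.CParser().parse("void convert(int thousands);")

buf = io.StringIO()
ast.show(buf=buf, attrnames=True)  # prefix every value with its attribute name
dump = buf.getvalue()
print(dump)
```

With attribute names enabled, each value in a line such as "TypeDecl: thousands, [], None" is labeled, which makes it easier to tell qualifiers, storage classes and other fields apart.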

serialize_ast.py

Purpose: serializes the AST. The parsed AST is pickled with the pickle module and then loaded back.
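A minimal sketch of the same round trip, done in memory instead of through a file. The function square is a placeholder; comparing the regenerated C code is just one convenient way to check that the restored tree matches.

```python
import pickle

from pycparser import c_generator, c_parser

ast = c_parser.CParser().parse("int square(int x) { return x * x; }")

# AST nodes use __slots__, so a pickle protocol of 2 or higher is needed.
data = pickle.dumps(ast, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(data)

gen = c_generator.CGenerator()
same = gen.visit(ast) == gen.visit(restored)
print(same)
```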

rewrite_ast.py

Purpose: modifies the value of a node in the AST
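The idea can be sketched as follows: locate a node, change its attributes, and regenerate C code to see the effect. The source string is a made-up one-liner.

```python
from pycparser import c_generator, c_parser

ast = c_parser.CParser().parse("void func(void) { x = 1; }")

assign = ast.ext[0].body.block_items[0]  # the Assignment node for `x = 1`
assign.lvalue.name = "y"                 # rename the assignment target
assign.rvalue.value = "2"                # Constant values are stored as strings

code = c_generator.CGenerator().visit(ast)
print(code)
```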

func_defs.py

Purpose: uses pycparser to print all functions defined in a C file

Output:

$ python examples/func_defs.py 
memmgr_init at examples/c_files/memmgr.c:46:6
get_mem_from_pool at examples/c_files/memmgr.c:55:22
memmgr_alloc at examples/c_files/memmgr.c:90:7
memmgr_free at examples/c_files/memmgr.c:159:6
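A sketch of how such a listing can be produced with a NodeVisitor subclass. It parses an inline string rather than memmgr.c; the class name, the source text and the file name demo.c are placeholders.

```python
from pycparser import c_ast, c_parser


class FuncDefVisitor(c_ast.NodeVisitor):
    """Collect and print every function definition (not bare prototypes)."""

    def __init__(self):
        self.found = []

    def visit_FuncDef(self, node):
        self.found.append(node.decl.name)
        print(f"{node.decl.name} at {node.decl.coord}")


SOURCE = """
int helper(void);
int helper(void) { return 1; }
int main(void) { return helper(); }
"""

ast = c_parser.CParser().parse(SOURCE, filename="demo.c")
visitor = FuncDefVisitor()
visitor.visit(ast)
```

The prototype on the first line is a Decl, not a FuncDef, so it is not reported.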

func_defs_add_param.py

Purpose: adds an int _hidden parameter to every function

func_calls.py

Purpose: prints all function calls

Output:

$ python examples/func_calls.py 
foo called at examples/c_files/basic.c:4:3
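A sketch of a visitor that records every call. Unlike the official script, it does not filter by function name; the class name and the source text are placeholders.

```python
from pycparser import c_ast, c_parser


class FuncCallVisitor(c_ast.NodeVisitor):
    """Record the name of every directly called function."""

    def __init__(self):
        self.calls = []

    def visit_FuncCall(self, node):
        if isinstance(node.name, c_ast.ID):
            self.calls.append(node.name.name)
            print(f"{node.name.name} called at {node.name.coord}")
        # Descend into the arguments so that nested calls are found too.
        if node.args:
            self.visit(node.args)


SOURCE = "int main(void) { foo(bar(1)); return 0; }"
ast = c_parser.CParser().parse(SOURCE, filename="demo.c")
visitor = FuncCallVisitor()
visitor.visit(ast)
```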

explore_ast.py

Purpose: shows how to explore the AST returned by pycparser, for example how to get at function declarations and function bodies

construct_ast_from_scratch.py

Purpose: builds an AST from scratch and converts it to C code

c-to-c.py

Purpose: parses a C file into an AST and converts the AST back to C code

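Extensions

Example 1: parsing a header file

Purpose: parse a header file and print the various kinds of definitions it contains: structs, enums, unions, function declarations and function pointers.

Prerequisite: pycparser parses preprocessed files, but the header in this example is not preprocessed, so it must not contain macros or comments.

For example, given the following header: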

typedef char uint8_t;
typedef unsigned int uint32_t;
typedef uint8_t uint8_arr8[8];
typedef uint8_t uint8_arr128[128];
typedef uint8_t* uint8_p;
typedef uint8_t** uint8_pp;
typedef unsigned char byte;
typedef unsigned long ulong;

typedef union union_a0_u
{
    uint32_t a;
    uint32_t b;
} union_a0;

typedef union union_a1_u
{
    uint8_p c;
    uint8_arr8 d;
} union_a1;

typedef enum enum_b0_e {
    NONE = 0,
    NUM0 = 100,
} enum_b0;

typedef enum enum_b1_e {
    NUM2 = -10,
    NUM3 = 12,
} enum_b1;

typedef struct struct_c_s {
    char type;
    union u_member_u{
        uint32_t u0;
        uint32_t u1;
    } u_member;
    struct s_member0_s{
        int sa;
        int sb;
        struct {
            int se;
            int sf;
        }sub1;
    } s_member0;
    struct {
        int *c;
        int **d;
    } s_member1;
} struct_c;

int func_a(void);
void* func_b(char a, char *b, int a, int *b);
int func_c(struct_c c, enum_b0 b, union_a0 a);

typedef uint32_t (*func_p_0)(uint32_t a, uint8_arr128 b);
typedef void (*func_p_1)(const char *file, uint32_t line, const char *fmt, ...);

The output is as follows. EllipsisParam denotes a variadic parameter:

typedef information:
------------------------------------
            uint8_t :: char
           uint32_t :: unsigned int
         uint8_arr8 :: uint8_t[8]
       uint8_arr128 :: uint8_t[128]
            uint8_p :: uint8_t*
           uint8_pp :: uint8_t**
               byte :: unsigned char
              ulong :: unsigned long
------------------------------------

enum information:
------------------------------
name: enum_b0_e:enum_b0
    ['NONE', '0']
    ['NUM0', '100']
++++++++++++++++++++++++++++++
name: enum_b1_e:enum_b1
    ['NUM2', '-10']
    ['NUM3', '12']
++++++++++++++++++++++++++++++
------------------------------

union information:
------------------------------
name: union_a0:union_a0_u
    ['a', 'uint32_t']
    ['b', 'uint32_t']
++++++++++++++++++++++++++++++
name: union_a1:union_a1_u
    ['c', 'uint8_p']
    ['d', 'uint8_arr8']
++++++++++++++++++++++++++++++
------------------------------

function pointer information:
------------------------------
name: func_p_0:uint32_t
    ['uint32_t', 'a']
    ['uint8_arr128', 'b']
++++++++++++++++++++++++++++++
name: func_p_1:void
    ['const char*', 'file']
    ['uint32_t', 'line']
    ['const char*', 'fmt']
    ['EllipsisParam', 'EllipsisParam']
++++++++++++++++++++++++++++++
------------------------------

function declare information:
------------------------------
name: func_a:int
    ['void', None]
++++++++++++++++++++++++++++++
name: func_b:void*
    ['char', 'a']
    ['char*', 'b']
    ['int', 'a']
    ['int*', 'b']
++++++++++++++++++++++++++++++
name: func_c:int
    ['struct_c', 'c']
    ['enum_b0', 'b']
    ['union_a0', 'a']
++++++++++++++++++++++++++++++
------------------------------

structure information:
------------------------------
name: struct_c
    ['char', 'type']
    union u_member_u, u_member
        ['uint32_t', 'u0']
        ['uint32_t', 'u1']
    struct s_member0_s, s_member0
        ['int', 'sa']
        ['int', 'sb']
        anonymous_struct, sub1
            ['int', 'se']
            ['int', 'sf']
    anonymous_struct, s_member1
        ['int*', 'c']
        ['int**', 'd']
++++++++++++++++++++++++++++++
------------------------------

Source code:

import re
import logging
from pycparser import c_ast, parse_file

logging.basicConfig(level=logging.NOTSET, format='[%(filename)s:%(lineno)d]-%(levelname)s %(message)s')

class c_visitor(object):
    def __init__(self, file_path):
        self.typedef_dict = {}
        self.struct_dict = {}
        self.union_dict = {}
        self.func_p_dict = {}
        self.enum_dict = {}
        self.func_declare_dict = {}
        self.ast = parse_file(file_path)
        with open('pycparser.txt', 'w+') as f:  # save the AST to a file for easy inspection
            f.write(str(self.ast))
        self.visit_struct()
        self.visit_typedef()
        self.visit_enum()
        self.visit_union()
        self.visit_func_pointer()
        self.visit_func_declare()


    def _get_type_name(self, node):
        """Recursively build the type name, handling array and pointer types."""
        if isinstance(node, c_ast.TypeDecl):
            quals_str = ''
            if len(node.quals) != 0:
                for qual_str in node.quals:
                    quals_str = quals_str + ' ' + qual_str
                quals_str = quals_str.strip() + ' '
            return quals_str + self._get_type_name(node.type)
        elif isinstance(node, c_ast.IdentifierType):
            return ' '.join(node.names)
        elif isinstance(node, c_ast.ArrayDecl):
            if node.dim:
                if isinstance(node.dim, c_ast.ID):
                    dim = node.dim.name
                elif isinstance(node.dim, c_ast.FuncCall):
                    tmp_type_str = ''
                    for item in node.type.type.names:
                        tmp_type_str = tmp_type_str + ' ' + item
                    tmp_type_str = tmp_type_str.strip()
                    return f"{tmp_type_str}[{node.dim.name.name}({node.dim.args.exprs[0].name})]"
                else:
                    dim = node.dim.value
            else:
                dim = ''
            return f"{self._get_type_name(node.type)}[{dim}]"
        elif isinstance(node, c_ast.PtrDecl):
            return f"{self._get_type_name(node.type)}*"
        elif isinstance(node, c_ast.Constant):
            return node.value
        elif isinstance(node, c_ast.Struct):
            return 'struct ' + node.name
        elif isinstance(node, c_ast.UnaryOp):   # signed value such as -10
            return node.op + f"{self._get_type_name(node.expr)}"
        return ''


    def resolve_type(self, node):
        """Unwrap TypeDecl/PtrDecl/ArrayDecl layers and return the innermost type node."""
        while isinstance(node, (c_ast.TypeDecl, c_ast.PtrDecl, c_ast.ArrayDecl)):
            node = node.type
        return node


    def get_type_definition(self, decl):
        """Return the base type of a declaration (TypeDecl and PtrDecl layers removed)."""
        base_type = decl.type
        while isinstance(base_type, (c_ast.TypeDecl, c_ast.PtrDecl)):
            base_type = base_type.type
        return base_type


    def process_member(self, decl):
        """Process a single struct member."""
        # resolve the base type
        base_type = self.get_type_definition(decl)
        # handle nested structs/unions, including anonymous ones
        if isinstance(base_type, (c_ast.Struct, c_ast.Union)):
            members = []
            if base_type.decls != None:
                for child in base_type.decls:
                    members.append(self.process_member(child))
                if isinstance(base_type, c_ast.Struct):
                    # logging.debug(base_type)
                    if base_type.name == None:
                        tag_type = 'anonymous_struct'
                    else:
                        tag_type = 'struct ' + str(base_type.name)
                elif isinstance(base_type, c_ast.Union):
                    if base_type.name == None:
                        tag_type = 'anonymous_union'
                    else:
                        tag_type = 'union ' + str(base_type.name)
                return [tag_type, decl.name, members]
            else:
                # the member refers to a struct defined elsewhere, e.g. struct s *a;
                if isinstance(base_type, c_ast.Struct):
                    # _get_type_name already yields "struct s*"; avoid repeating "struct" and the tag
                    return [self._get_type_name(decl.type), decl.name, '']
                elif isinstance(base_type, c_ast.Union):
                    return ['union ' + base_type.name, decl.name, '']
        # handle ordinary types
        type_str = []
        node = decl.type
        while isinstance(node, (c_ast.TypeDecl, c_ast.PtrDecl, c_ast.ArrayDecl)):
            if isinstance(node, c_ast.ArrayDecl):
                if node.dim:
                    if isinstance(node.dim, c_ast.ID):
                        type_str.append(f"[{node.dim.name}]")
                    elif isinstance(node.dim, c_ast.FuncCall):
                        tmp_type_str = ''
                        for item in node.type.type.names:
                            tmp_type_str = tmp_type_str + ' ' + item
                        logging.debug(f"{tmp_type_str}[{node.dim.name.name}({node.dim.args.exprs[0].name})]")
                        type_str.append(f"[{node.dim.name.name}({node.dim.args.exprs[0].name})]")
                    else:
                        type_str.append(f"[{node.dim.value}]")
                else:
                    type_str.append('[]')  # must be a string: ''.join() below rejects a list
                node = node.type
            elif isinstance(node, c_ast.PtrDecl):
                type_str.append('*')
                node = node.type
            else:
                node = node.type
        type_name = self._get_type_name(decl.type)
        type_str = type_name + ''.join(reversed(type_str))
        return [type_name, decl.name]


    def visit_struct(self):
        for item in self.ast.ext:
            if isinstance(item, c_ast.Typedef) and isinstance(self.resolve_type(item.type), c_ast.Struct):
                struct_name = item.name
                struct_def = self.resolve_type(item.type)
                members_list = []
                if struct_def.decls == None:
                    continue
                for decl in struct_def.decls:
                    members_list.append(self.process_member(decl))
                self.struct_dict[struct_name] = members_list


    def visit_typedef(self):
        for decl in self.ast.ext:
            if isinstance(decl, c_ast.Typedef):
                if isinstance(decl.type.type, (c_ast.Struct, c_ast.Enum, c_ast.Union, c_ast.FuncDecl)):
                    continue
                type_name = decl.name
                base_type = self._get_type_name(decl.type)
                self.typedef_dict[type_name] = base_type


    def visit_enum(self):
        for item in self.ast.ext:
            if isinstance(item, c_ast.Typedef) and isinstance(self.resolve_type(item.type), c_ast.Enum):
                enum_declname = item.type.declname  # name introduced by the typedef
                enum_name = item.type.type.name     # enum tag name
                members_list = []
                for enum_member in item.type.type.values.enumerators:
                    member_name = enum_member.name
                    member_value = enum_member.value
                    if member_value != None:
                        member_value = self._get_type_name(member_value)
                    members_list.append([member_name, member_value])
                self.enum_dict[f"{enum_name}:{enum_declname}"] = members_list


    def visit_union(self):
        for item in self.ast.ext:
            if isinstance(item, c_ast.Typedef) and isinstance(self.resolve_type(item.type), c_ast.Union):
                union_declname = item.type.declname
                union_name = item.type.type.name
                members_list = []
                for union_member in item.type.type.decls:
                    member_name = union_member.name
                    member_value = self._get_type_name(union_member.type)
                    members_list.append([member_name, member_value])
                self.union_dict[f"{union_declname}:{union_name}"] = members_list


    def visit_func_pointer(self):
        for item in self.ast.ext:
            if isinstance(item, c_ast.Typedef) and isinstance(self.resolve_type(item.type), c_ast.FuncDecl):
                funcP_declname = item.name
                return_type = self._get_type_name(item.type.type.type)
                members_list = []
                for funcP_member in item.type.type.args:
                    if isinstance(funcP_member, c_ast.EllipsisParam):  # variadic parameter (...)
                        param_type = 'EllipsisParam'
                        param_name = 'EllipsisParam'
                    else:
                        param_type = self._get_type_name(funcP_member.type)
                        param_name = funcP_member.name
                    members_list.append([param_type, param_name])
                self.func_p_dict[f"{funcP_declname}:{return_type}"] = members_list


    def visit_func_declare(self):
        for item in self.ast.ext:
            if isinstance(item, c_ast.Decl) and isinstance(self.resolve_type(item.type), c_ast.FuncDecl):
                func_declname = item.name
                return_type = self._get_type_name(item.type.type)
                # logging.debug([func_declname, return_type])
                members_list = []
                for param_member in item.type.args.params:
                    if isinstance(param_member, c_ast.EllipsisParam):  # variadic parameter (...)
                        param_type = 'EllipsisParam'
                        param_name = 'EllipsisParam'
                    else:
                        param_type = self._get_type_name(param_member.type)
                        param_name = param_member.name
                    members_list.append([param_type, param_name])
                self.func_declare_dict[f"{func_declname}:{return_type}"] = members_list


    def print_visit(self):
        typedef_dict = self.typedef_dict
        struct_dict = self.struct_dict
        enum_dict = self.enum_dict
        union_dict = self.union_dict
        func_p_dict = self.func_p_dict
        func_declare_dict = self.func_declare_dict

        print ('typedef information:')
        print ('------------------------------------')
        for typedef_key in typedef_dict:
            print ("    %15s :: %s" % (typedef_key, typedef_dict[typedef_key]))
        print ('------------------------------------\n')

        print ('enum information:')
        print ('------------------------------')
        for enum_key in enum_dict:
            print (f'name: {enum_key}')
            for item in enum_dict[enum_key]:
                print (f"    {item}")
            print ('++++++++++++++++++++++++++++++')
        print ('------------------------------\n')

        print ('union information:')
        print ('------------------------------')
        for union_key in union_dict:
            print (f'name: {union_key}')
            for item in union_dict[union_key]:
                print (f"    {item}")
            print ('++++++++++++++++++++++++++++++')
        print ('------------------------------\n')

        print ('function pointer information:')
        print ('------------------------------')
        for funcP_key in func_p_dict:
            print (f'name: {funcP_key}')
            for item in func_p_dict[funcP_key]:
                print (f"    {item}")
            print ('++++++++++++++++++++++++++++++')
        print ('------------------------------\n')

        print ('function declare information:')
        print ('------------------------------')
        for funcD_key in func_declare_dict:
            print (f'name: {funcD_key}')
            for item in func_declare_dict[funcD_key]:
                print (f"    {item}")
            print ('++++++++++++++++++++++++++++++')
        print ('------------------------------\n')

        print ('structure information:')
        print ('------------------------------')
        for struct_key in struct_dict:
            print (f'name: {struct_key}')
            for struct_data in struct_dict[struct_key]:
                if len (struct_data) > 2:
                    struct_union_type = struct_data[0]
                    struct_union_name = struct_data[1]
                    print (f'    {struct_union_type}, {struct_union_name}')
                    for item in struct_data[2:][0]:
                        if len (item) > 2:
                            struct_union_type = item[0]
                            struct_union_name = item[1]
                            print (f'        {struct_union_type}, {struct_union_name}')
                            for item0 in item[2:][0]:
                                print (f'            {item0}')
                        else:
                            print (f'        {item}')
                else:
                    print (f'    {struct_data}')
            print ('++++++++++++++++++++++++++++++')
        print ('------------------------------\n')


if __name__ == '__main__':
    file_path = "test_code.h"
    header_visitor = c_visitor(file_path)
    header_visitor.print_visit()

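Example 2: traversing the AST automatically

How it works

In pycparser, the NodeVisitor class walks the AST and, for each node, calls the visit_* method that matches the node's type. You never call visit_Typedef yourself; it is invoked through the following process:

  1. Structure of the AST
    When C code is parsed, the resulting AST contains many kinds of nodes, for example:
    Typedef (a typedef statement)
    Struct (a struct definition)
    Union (a union definition)
    Decl (a variable or member declaration)
  2. Recursive traversal by NodeVisitor
    When you call visitor.visit(ast), traversal starts at the root of the AST. For each node:
    • The node's type is checked (for example, Typedef).
    • If the visitor class defines a visit_Typedef method, that method is called. It must call self.generic_visit(node) itself if the node's children should be visited as well.
    • If visit_Typedef is not defined, the generic generic_visit method is called, which by default continues into the child nodes.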
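The dispatch described above can be sketched in plain Python, without pycparser. The Node classes and MiniVisitor below are simplified stand-ins written for this illustration, not the library's actual code, although the lookup by "visit_" plus the class name follows the same idea.

```python
class Node:
    """Minimal stand-in for an AST node (hypothetical)."""

    def __init__(self, *children):
        self.children = list(children)


class FileAST(Node):
    pass


class Typedef(Node):
    pass


class TypeDecl(Node):
    pass


class MiniVisitor:
    def visit(self, node):
        # Look up visit_<ClassName>; fall back to generic_visit.
        method = getattr(self, "visit_" + node.__class__.__name__,
                         self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        for child in node.children:
            self.visit(child)


class Recorder(MiniVisitor):
    def __init__(self):
        self.seen = []

    def visit_Typedef(self, node):
        self.seen.append("Typedef")
        self.generic_visit(node)  # keep descending into the children

    def visit_TypeDecl(self, node):
        self.seen.append("TypeDecl")


tree = FileAST(Typedef(TypeDecl()), Typedef(TypeDecl()))
recorder = Recorder()
recorder.visit(tree)
print(recorder.seen)
```

If visit_Typedef did not call generic_visit, the TypeDecl children would never be reached, which is a common source of confusion with custom visitors.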

If this is still unclear, you can add debugging code that prints the type of every node:

from pycparser import c_ast, c_parser

class DebugVisitor(c_ast.NodeVisitor):
    def visit(self, node):
        print(f"Visiting node type: {node.__class__.__name__}")
        super().visit(node)

if __name__ == '__main__':
    text = '''
typedef unsigned char uint8_t;
'''
    parser = c_parser.CParser()
    ast = parser.parse(text)
    # use the debug visitor
    debug_visitor = DebugVisitor()
    debug_visitor.visit(ast)
    print(ast)

The output lists the type of every node visited, for example:

Visiting node type: FileAST
Visiting node type: Typedef
Visiting node type: TypeDecl
Visiting node type: IdentifierType

The AST at this point is:

FileAST(ext=[Typedef(name='uint8_t',
                     quals=[
                           ],
                     storage=['typedef'
                             ],
                     type=TypeDecl(declname='uint8_t',
                                   quals=[
                                         ],
                                   align=None,
                                   type=IdentifierType(names=['unsigned',
                                                              'char'
                                                             ]
                                                       )
                                   )
                     )
            ]
        )

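Implementation

Based on the mechanism described above, here is a test program that collects all typedef definitions.

Prerequisite: pycparser parses preprocessed files, but the header in this example is not preprocessed, so it must not contain macros or comments.

The header file to parse: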

typedef char uint8_t;
typedef unsigned int uint32_t;
typedef uint8_t uint8_arr8[8];
typedef uint8_t uint8_arr128[128];
typedef uint8_t* uint8_p;
typedef uint8_t** uint8_pp;
typedef unsigned char byte;
typedef unsigned long ulong;

Source code:

import logging
from pycparser import c_ast, parse_file

logging.basicConfig(level=logging.NOTSET, format='[%(filename)s:%(lineno)d]-%(levelname)s %(message)s')

class c_visitor(c_ast.NodeVisitor):
    def __init__(self):
        self.typedef_dict = {}
        self.typedef_struct_map = {}    # maps each struct tag to its typedef name

    def _get_type_name(self, type_node):
        if isinstance(type_node, c_ast.IdentifierType):
            return ' '.join(type_node.names)
        elif isinstance(type_node, c_ast.TypeDecl):
            return self._get_type_name(type_node.type)
        elif isinstance(type_node, c_ast.ArrayDecl):
            dim = self._get_dim_value(type_node.dim)
            return f'{self._get_type_name(type_node.type)}[{dim}]'
        elif isinstance(type_node, c_ast.PtrDecl):
            if isinstance(type_node.type, c_ast.FuncDecl):
                return 'function pointer'
            return f'{self._get_type_name(type_node.type)}*'
        elif isinstance(type_node, c_ast.Struct):
            return f'struct {type_node.name}' if type_node.name else 'anonymous_struct'
        elif isinstance(type_node, c_ast.Union):
            return f'union {type_node.name}' if type_node.name else 'anonymous_union'
        elif isinstance(type_node, c_ast.Enum):
            return f'enum {type_node.name}' if type_node.name else 'anonymous_enum'
        else:
            return type_node.__class__.__name__


    def _get_dim_value(self, dim_node):
        if isinstance(dim_node, c_ast.Constant):
            return dim_node.value
        elif isinstance(dim_node, c_ast.UnaryOp):
            return self._eval_unaryop(dim_node)
        return '?'


    def _eval_unaryop(self, node):
        return f"{node.op}{node.expr.value}" if node.op == '-' else '?'


    # handle typedef nodes
    def visit_Typedef(self, node):
        if isinstance(node.type.type, c_ast.Struct):
            struct_def = node.type.type
            typedef_name = node.name
            self.typedef_struct_map[struct_def.name] = typedef_name
        elif isinstance(node.type.type, c_ast.Enum):        # skip enums (checked before the generic case)
            pass
        elif (isinstance(node.type, c_ast.PtrDecl) and      # skip function pointers
              isinstance(node.type.type, c_ast.FuncDecl)):
            pass
        else:
            original_type = self._get_type_name(node.type)
            self.typedef_dict[node.name] = original_type
        self.generic_visit(node)


    def print_visit(self):
        typedef_dict = self.typedef_dict
        print ('typedef information:')
        print ('------------------------------------')
        for typedef_key in typedef_dict:
            print ("    %15s :: %s" % (typedef_key, typedef_dict[typedef_key]))
        print ('------------------------------------\n')


if __name__ == '__main__':
    file_path = "test_code.h"
    ast = parse_file(file_path)
    with open('./pycparser.txt', 'w+') as f:
        f.write(str(ast))
    visitor = c_visitor()
    visitor.visit(ast)
    visitor.print_visit()

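Output (the union_a0 and union_a1 rows appear only when the parsed header also contains the union typedefs from Example 1):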

typedef information:
------------------------------------
            uint8_t :: char
           uint32_t :: unsigned int
         uint8_arr8 :: uint8_t[8]
       uint8_arr128 :: uint8_t[128]
            uint8_p :: uint8_t*
           uint8_pp :: uint8_t**
               byte :: unsigned char
              ulong :: unsigned long
           union_a0 :: union union_a0_u
           union_a1 :: union union_a1_u
------------------------------------
posted @ 2025-03-24 14:43  Mrlayfolk