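Junk instructions in Python pyc files

This post walks through the topic using a challenge from HWS as the example.

It focuses on Python 3.

I. pyc file format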

typedef struct {
    PyObject_HEAD
    int co_argcount;            /* #arguments, except *args */
    int co_posonlyargcount;     /* #positional only arguments */
    int co_kwonlyargcount;      /* #keyword only arguments */
    int co_nlocals;             /* #local variables */
    int co_stacksize;           /* #entries needed for evaluation stack */
    int co_flags;               /* CO_..., see below */
    int co_firstlineno;         /* first source line number */
    PyObject *co_code;          /* instruction opcodes */
    PyObject *co_consts;        /* list (constants used) */
    PyObject *co_names;         /* list of strings (names used) */
    PyObject *co_varnames;      /* tuple of strings (local variable names) */
    PyObject *co_freevars;      /* tuple of strings (free variable names) */
    PyObject *co_cellvars;      /* tuple of strings (cell variable names) */
    /* The rest aren't used in either hash or comparisons, except for co_name,
       used in both. This is done to preserve the name and line number
       for tracebacks and debuggers; otherwise, constant de-duplication
       would collapse identical functions/lambdas defined on different lines.
    */
    Py_ssize_t *co_cell2arg;    /* Maps cell vars which are arguments. */
    PyObject *co_filename;      /* unicode (where it was loaded from) */
    PyObject *co_name;          /* unicode (name, for reference) */
    PyObject *co_lnotab;        /* string (encoding addr<->lineno mapping) See
                                   Objects/lnotab_notes.txt for details. */
    void *co_zombieframe;       /* for optimization only (see frameobject.c) */
    PyObject *co_weakreflist;   /* to support weakrefs to code objects */
    /* Scratch space for extra data relating to the code object.
       Type is a void* to keep the format private in codeobject.c to force
       people to go through the proper APIs. */
    void *co_extra;

    /* Per opcodes just-in-time cache
     *
     * To reduce cache size, we use indirect mapping from opcode index to
     * cache object:
     *   cache = co_opcache[co_opcache_map[next_instr - first_instr] - 1]
     */

    // co_opcache_map is indexed by (next_instr - first_instr).
    //  * 0 means there is no cache for this opcode.
    //  * n > 0 means there is cache in co_opcache[n-1].
    unsigned char *co_opcache_map;
    _PyOpcache *co_opcache;
    int co_opcache_flag;  // used to determine when create a cache.
    unsigned char co_opcache_size;  // length of co_opcache.
} PyCodeObject;

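The PyCodeObject struct above is defined in Include/code.h in the CPython source tree (the fields shown match CPython 3.8; from 3.9 on the definition lives in Include/cpython/code.h).

The overall layout of a Python 2 pyc file is:
The first 8 bytes (offsets 0-7) are the magic number and the timestamp.
The byte at offset 8 is the type tag TYPE_CODE, 0x63 in hex.
The four 4-byte integers after it are co_argcount, co_nlocals, co_stacksize and co_flags.
The next byte is a TYPE_STRING tag, 0x73.
The following four bytes give the length of the code block's data; call it x.
The next x bytes are the bytecode itself. This is the bytecode of the top-level code object (the module body), not the bytecode of the whole file.
After those x bytes comes TYPE_LIST ....
Further on, TYPE_CODE appears again for each nested code object, such as a class or function, followed by that function's own bytecode.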

The PyObject struct is defined as follows (Include/object.h):

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

PyCodeObject is the top-level structure of a pyc file: after the header, the file body is a single marshalled code object.

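Although the layout described above is for Python 2, it is still good enough as a guide: the same kind of walk through a Python 3 pyc lets us locate the bytecode of every function, all the variables, and so on.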
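For Python 3.7+ the same walk starts with a 16-byte header instead of 8 bytes: a 4-byte magic number, a 4-byte flags field, then either mtime plus source size or an 8-byte source hash. The sketch below is only an illustration under that assumption; parse_pyc_header and the in-memory demo file are hypothetical names, not part of the original challenge.

```python
import importlib.util
import marshal
import struct


def parse_pyc_header(data):
    """Split a Python 3.7+ pyc into its 16-byte header fields and code object."""
    magic, flags, field1, field2 = struct.unpack("<4sIII", data[:16])
    body = data[16:]
    return {
        "magic": magic,
        "flags": flags,                # 0 means timestamp-based invalidation
        "mtime_or_hash_lo": field1,
        "size_or_hash_hi": field2,
        # marshal may set its FLAG_REF bit (0x80) on the tag, so 0x63 can
        # show up as 0xE3 in a hex editor; mask the flag off here.
        "type_byte": body[0] & 0x7F,
        "code": marshal.loads(body),
    }


# Build a tiny pyc in memory so the sketch runs without the challenge file.
demo = (importlib.util.MAGIC_NUMBER
        + struct.pack("<III", 0, 0, 0)
        + marshal.dumps(compile("x = 1", "demo.py", "exec")))
info = parse_pyc_header(demo)
print(hex(info["type_byte"]), info["code"].co_name)
```

marshal.loads only understands code objects written by the same interpreter version, so a real pyc has to be loaded with the Python version that produced it.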

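Some other resources that help with parsing the pyc structure:

https://github.com/gmodena/pycdump

This article covers the Python 2 pyc format; read it alongside a Python 3 pyc when analysing one:

https://kdr2.com/tech/main/1012-pyc-format.html

Python/marshal.c: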

https://github.com/python/cpython/blob/main/Python/marshal.c

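II. Relevant APIs and struct members

import dis
import marshal
with open(r"./new2.pyc", "rb") as fp:
    data = fp.read()
# Deserialize the code object. Skip the header first: 16 bytes on Python 3.7+, 8 bytes on Python 2.
Pyobj = marshal.loads(data[16:])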

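1. dis.dis(Pyobj)

dis.dis disassembles all of the bytecode in the pyc; on Python 3.7+ it also recurses into nested code objects.

2. dis.opmap

opmap is a dict in the dis module that maps each opcode name to its numeric value, i.e. the byte that encodes it.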
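A minimal sketch of the lookup in both directions. opcode_byte is a hypothetical helper; dis.opmap and dis.opname are the real tables. The numeric values differ between Python versions, so run this with the interpreter version that produced the pyc.

```python
import dis


def opcode_byte(name):
    """Return the single byte that encodes an opcode name on this interpreter."""
    return bytes([dis.opmap[name]])


# Name -> byte (for searching in a hex editor) and number -> name.
print("EXTENDED_ARG is encoded as", opcode_byte("EXTENDED_ARG").hex())
print("opcode", dis.opmap["NOP"], "is", dis.opname[dis.opmap["NOP"]])
```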

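3. dis.Bytecode

bytecode = dis.Bytecode(Pyobj)
for instr in bytecode:
    print(instr)
    print(instr.opcode)         # numeric opcode (instr.opname is the name)
    print(instr.offset)         # byte offset of the instruction

This prints each instruction in more detail, but used directly it only covers the top-level code object (normally the module body, i.e. "main"), not the functions nested inside it.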
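One way around that limitation is to recurse through co_consts, where nested functions and classes are stored as code objects. A minimal sketch; all_instructions is a hypothetical helper and the compiled snippet stands in for the challenge file:

```python
import dis
import types


def all_instructions(code):
    """Yield (code_name, instruction) for a code object and everything nested in it."""
    for instr in dis.Bytecode(code):
        yield code.co_name, instr
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from all_instructions(const)


demo_code = compile("def f():\n    return 1\n", "demo.py", "exec")
for owner, instr in all_instructions(demo_code):
    print(owner, instr.offset, instr.opname, instr.argrepr)
```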

4. dis.get_instructions

# Each item looks like:
# Instruction(opname='STORE_NAME', opcode=90, arg=3, argval='t', argrepr='t', offset=26, starts_line=None, is_jump_target=False)
for instr in dis.get_instructions(Pyobj):
    print(instr)

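5. Pyobj.co_code

This is the raw bytecode, as a bytes object, of the top-level code object (the module body) only.

Since Python 3.6 every instruction is two bytes long, so the number of instructions can be computed as follows: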

len(Pyobj.co_code) // 2

III. Some other background

1. In a Python 3 (3.6+) pyc, every instruction is 2 bytes

The first byte is the opcode and the second byte is its argument.
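That fixed width makes raw co_code easy to split by hand, which helps when dis itself fails on junk bytes. A minimal sketch; iter_wordcode is a hypothetical helper, and the sample bytes are made up from the opcode bytes 6e (JUMP_FORWARD on CPython 3.8, an assumption) and 90 (EXTENDED_ARG):

```python
def iter_wordcode(co_code):
    """Yield (offset, opcode, arg) for every 2-byte instruction in co_code."""
    for offset in range(0, len(co_code) - 1, 2):
        yield offset, co_code[offset], co_code[offset + 1]


sample = bytes([0x6E, 0x24, 0x90, 0xAF])   # illustrative bytes, not from the challenge file
for offset, opcode, arg in iter_wordcode(sample):
    print(offset, hex(opcode), arg)
```

On Python 3.11+ co_code also contains inline CACHE entries; they are 2 bytes each too, so the split itself still holds.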

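2. JUMP_FORWARD

JUMP_FORWARD 154 (to 392)

The argument 154 is a relative distance, not an absolute offset: the jump goes 154 bytes forward, counted from the start of the next instruction. This JUMP_FORWARD sits at offset 236 and is 2 bytes long, so the target is 236 + 2 + 154 == 392. (From Python 3.10 on, jump arguments count instructions rather than bytes.)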
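The same arithmetic as a tiny function, assuming the byte-based jump arguments of CPython 3.6 to 3.9; jump_forward_target is a hypothetical helper name:

```python
def jump_forward_target(offset, arg, instr_size=2):
    """Absolute target of a JUMP_FORWARD located at `offset` with byte delta `arg`."""
    return offset + instr_size + arg


print(jump_forward_target(236, 154))  # 392
```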

 
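IV. Removing the junk instructions

Use the dis.opmap dict mentioned above to work out the bytes for EXTENDED_ARG 175, which are 90 AF, and search for them in the file. Note the instruction right before it: JUMP_FORWARD 36 (to 194). In other words, execution jumps 36 bytes forward from the start of the next instruction, so those 36 bytes are never executed and we can simply NOP them out.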
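The patch can also be scripted instead of done in a hex editor. The sketch below is one possible approach, not necessarily how the original fix was made. It assumes CPython 3.8 opcode values (NOP = 9, JUMP_FORWARD = 110), byte-based jump deltas, and no EXTENDED_ARG prefix on the jump itself; nop_dead_bytes is a hypothetical helper.

```python
NOP = 9             # assumption: opcode value on CPython 3.8
JUMP_FORWARD = 110  # assumption: opcode value on CPython 3.8


def nop_dead_bytes(co_code, jump_offset):
    """Overwrite the bytes skipped by the JUMP_FORWARD at jump_offset with NOPs."""
    if co_code[jump_offset] != JUMP_FORWARD:
        raise ValueError("no JUMP_FORWARD at offset %d" % jump_offset)
    delta = co_code[jump_offset + 1]       # number of bytes the jump skips
    if delta % 2:
        raise ValueError("odd jump delta, not 2-byte wordcode")
    start = jump_offset + 2                # first skipped byte
    patched = bytearray(co_code)
    patched[start:start + delta] = bytes([NOP, 0]) * (delta // 2)
    return bytes(patched)


# Illustrative bytes: a jump over two junk EXTENDED_ARG instructions.
junk = bytes([110, 4, 144, 175, 144, 175, 100, 0])
print(nop_dead_bytes(junk, 0).hex())
```

On Python 3.8+ the patched bytes can be put back with Pyobj.replace(co_code=...), and a new file can be written as the original 16-byte header followed by marshal.dumps of the new code object.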
 
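Run dis.dis again and it still fails.

There is a second junk-instruction site; handle it the same way as above.
After both sites are patched, dis.dis disassembles the bytecode successfully.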
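When there are several sites, a scan can list candidates instead of hunting for each one by hand. This is only a heuristic sketch built on the pattern seen here (a JUMP_FORWARD immediately followed by EXTENDED_ARG); the opcode values are the CPython 3.8 ones, hard-coded as an assumption, and find_junk_sites is a hypothetical helper. Legitimate code can match the pattern too, so check every hit before patching.

```python
JUMP_FORWARD = 110  # assumption: opcode value on CPython 3.8
EXTENDED_ARG = 144  # assumption: opcode value on CPython 3.8 (0x90)


def find_junk_sites(co_code):
    """Return offsets of JUMP_FORWARD instructions directly followed by EXTENDED_ARG."""
    sites = []
    for offset in range(0, len(co_code) - 2, 2):
        if co_code[offset] == JUMP_FORWARD and co_code[offset + 2] == EXTENDED_ARG:
            sites.append(offset)
    return sites


# Illustrative bytes: one suspicious jump at offset 2.
sample_code = bytes([100, 0, 110, 4, 144, 175, 1, 0, 83, 0])
print(find_junk_sites(sample_code))
```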
 
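Decompiling back to source is harder, though: pycdc only recovers part of it, and the rest still has to be read by hand... I don't know whether anyone has found a way to recover the complete source.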