Debugging Tips for Memory Leaks in Android Native Processes


The code discussed here is based on Android 2.3.x.

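Android gives Java programs convenient memory-leak information and tools (such as MAT) that make leaks easy to track down. For native processes written purely in C/C++, memory leaks are much harder to find. Traditional C/C++ programs can use valgrind or static code-checking tools. Fortunately, Google's bionic libc provides a very handy API for finding memory leaks: get_malloc_leak_info. It returns the backtrace of each outstanding allocation, which makes it easy to locate the code that is suspected of leaking.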

How the code works

You can set the malloc debug level (debug_level) with adb shell setprop libc.debug.malloc 1. The comment in bionic/libc/bionic/malloc_debug_common.c explains each level in more detail:

/* Handle to shared library where actual memory allocation is implemented.
 * This library is loaded and memory allocation calls are redirected there
 * when libc.debug.malloc environment variable contains value other than
 * zero:
 * 1  - For memory leak detections.
 * 5  - For filling allocated / freed memory with patterns defined by
 *      CHK_SENTINEL_VALUE, and CHK_FILL_FREE macros.
 * 10 - For adding pre-, and post- allocation stubs in order to detect
 *      buffer overruns.
 * Note that emulator's memory allocation instrumentation is not controlled by
 * libc.debug.malloc value, but rather by emulator, started with -memcheck
 * option. Note also, that if emulator has started with -memcheck option,
 * emulator's instrumented memory allocation will take over value saved in
 * libc.debug.malloc. In other words, if emulator has started with -memcheck
 * option, libc.debug.malloc value is ignored.
 * Actual functionality for debug levels 1-10 is implemented in
 * libc_malloc_debug_leak.so, while functionality for emultor's instrumented
 * allocations is implemented in libc_malloc_debug_qemu.so and can be run inside
 * the emulator only.
 */

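Each debug level makes libc's memory-management function handles point to a different set of implementations. As a result, allocation and release run through a different version of the functions at each debug level. The process works as follows.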

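During startup, __libc_init calls malloc_debug_init to perform initialization. That function in turn calls malloc_init_impl. Within a process, pthread_once guarantees that malloc_init_impl runs only once.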

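malloc_init_impl opens the debug library that matches the level and resolves the symbol malloc_debug_initialize from it (line 366 of the source). It then calls that function (line 373).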

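When debug_level is 1, 5, or 10, the library "/system/lib/libc_malloc_debug_leak.so" is opened. malloc_debug_initialize for this library is implemented in bionic/libc/bionic/malloc_debug_leak.c, but it is only an empty function that returns 0. When debug_level is 20, "/system/lib/libc_malloc_debug_qemu.so" is opened instead.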
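The following plain C code sketches the level-to-library selection described above. It is a simplified illustration and is not bionic's malloc_init_impl. The names debug_lib_for_level, init_debug_level, and get_debug_level are hypothetical. The level is read from a hypothetical environment variable, MALLOC_DEBUG_LEVEL, instead of the libc.debug.malloc system property. No library is actually loaded with dlopen(). The sketch shows only the pthread_once pattern and how levels 1/5/10 and 20 map to the two libraries.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: map a debug level to the library Android 2.3 would load. */
static const char* debug_lib_for_level(int level)
{
    switch (level) {
    case 1:   /* leak detection */
    case 5:   /* fill allocated/freed memory with patterns */
    case 10:  /* guard stubs around allocations */
        return "/system/lib/libc_malloc_debug_leak.so";
    case 20:  /* emulator-instrumented allocations */
        return "/system/lib/libc_malloc_debug_qemu.so";
    default:  /* 0 or unknown: keep the default allocator */
        return NULL;
    }
}

static pthread_once_t g_init_once = PTHREAD_ONCE_INIT;
static int g_debug_level = 0;

/* Runs exactly once per process, however many threads race to call it. */
static void init_debug_level(void)
{
    /* Assumption: an environment variable stands in for the system property. */
    const char* value = getenv("MALLOC_DEBUG_LEVEL");
    g_debug_level = (value != NULL) ? atoi(value) : 0;
}

static int get_debug_level(void)
{
    pthread_once(&g_init_once, init_debug_level);
    return g_debug_level;
}
```

With this pattern, even if several threads call malloc for the first time concurrently, the initialization still runs exactly once. That is presumably why the real code uses pthread_once here.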

Next, for each debug_level, the corresponding implementations of malloc/free/calloc/realloc/memalign are resolved.

For debug levels 1, 5, and 10, the various versions of malloc/free/calloc/realloc/memalign are implemented in bionic/libc/bionic/malloc_debug_leak.c. At debug_level 5, for example, malloc fills newly allocated memory with 0xeb, and free fills released memory with 0xef.
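Below is a sketch of the level-5 fill behavior. It is not bionic's code. The real level-5 free can ask dlmalloc for the block's usable size, but portable C has no such call. This sketch therefore keeps the requested size in a small header in front of the block. The names fill_malloc_sketch, fill_free_sketch, and FillHeader are hypothetical. The fill values 0xeb and 0xef are the ones given above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FILL_ON_MALLOC 0xeb  /* value the text gives for freshly allocated memory */
#define FILL_ON_FREE   0xef  /* value the text gives for freed memory */

/* Hypothetical header that remembers the requested size so free can fill it. */
typedef struct {
    size_t size;
} FillHeader;

static void* fill_malloc_sketch(size_t bytes)
{
    FillHeader* header = malloc(sizeof(FillHeader) + bytes);
    if (header == NULL)
        return NULL;
    header->size = bytes;
    void* user = header + 1;               /* caller's memory starts after the header */
    memset(user, FILL_ON_MALLOC, bytes);   /* mark as "allocated, never written" */
    return user;
}

static void fill_free_sketch(void* mem)
{
    if (mem == NULL)
        return;
    FillHeader* header = (FillHeader*)mem - 1;
    memset(mem, FILL_ON_FREE, header->size);  /* poison before releasing */
    free(header);
}
```

In a crash dump, a value such as 0xebebebeb typically points to memory that was never initialized. A value such as 0xefefefef typically points to a use after free.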

At debug_level 1, which is used for memory-leak debugging, malloc records a backtrace for every allocation:

void* leak_malloc(size_t bytes)
{
    // Allocate enough space in front of the allocation to store the pointer
    // for the alloc structure. This makes freeing the structure really fast!

    // 1. allocate enough memory and include our header
    // 2. set the base pointer to be right after our header

    void* base = dlmalloc(bytes + sizeof(AllocationEntry));
    if (base != NULL) {
        pthread_mutex_lock(&gAllocationsMutex);

        intptr_t backtrace[BACKTRACE_SIZE];
        size_t numEntries = get_backtrace(backtrace, BACKTRACE_SIZE);

        AllocationEntry* header = (AllocationEntry*)base;
        header->entry = record_backtrace(backtrace, numEntries, bytes);
        header->guard = GUARD;

        // now increment base to point to after our header.
        // this should just work since our header is 8 bytes.
        base = (AllocationEntry*)base + 1;

        pthread_mutex_unlock(&gAllocationsMutex);
    }

    return base;
}

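This malloc allocates an extra AllocationEntry header in front of the requested bytes. Once the allocation succeeds, it declares a local array of BACKTRACE_SIZE (32) intptr_t elements. It calls get_backtrace to save the current call stack, that is, the return addresses of the calling functions, into the array backtrace. It then calls record_backtrace to record that call stack and stores the returned pointer in the header's entry member.

record_backtrace looks the call stack up by its hash value in the global call-stack table, gHashTable. If the call stack is not found, it creates a new entry and adds it to the table. Finally, base is advanced past the header, and that address is what the caller receives as the allocated memory.

In other words, this version of malloc also records the call stack. The header in front of each block stores the entry that links the block to its call stack in the hash table.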
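The following code shows the header-in-front technique on its own. SketchHeader, tracked_malloc, tracked_free, header_of, and the SKETCH_GUARD value are hypothetical stand-ins for AllocationEntry, leak_malloc, leak_free, and GUARD. The hash-table bookkeeping is left out, and the caller passes the entry pointer directly.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_GUARD 0x48151642u  /* arbitrary marker; bionic's GUARD value may differ */

/* Hypothetical stand-in for AllocationEntry: bookkeeping stored before the block. */
typedef struct {
    void* entry;     /* would point at the HashEntry for this call stack */
    uint32_t guard;  /* lets free() check that the header is ours */
} SketchHeader;

static void* tracked_malloc(size_t bytes, void* entry)
{
    SketchHeader* header = malloc(sizeof(SketchHeader) + bytes);
    if (header == NULL)
        return NULL;
    header->entry = entry;
    header->guard = SKETCH_GUARD;
    return header + 1;   /* pointer arithmetic skips exactly one header */
}

static SketchHeader* header_of(void* mem)
{
    return (SketchHeader*)mem - 1;   /* O(1): no lookup needed on free */
}

static void tracked_free(void* mem)
{
    if (mem == NULL)
        return;
    SketchHeader* header = header_of(mem);
    assert(header->guard == SKETCH_GUARD);  /* a bad guard means a foreign pointer */
    free(header);
}
```

The header sits at a fixed offset, so free() finds the bookkeeping with one pointer subtraction instead of a search. That is the speedup the comment in leak_malloc refers to.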

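Before analyzing record_backtrace, look at the structures it uses (from malloc_debug_common.h):

typedef struct HashEntry HashEntry;
struct HashEntry {
    size_t slot;            // index into HashTable.slots
    HashEntry* prev;        // previous entry in the same slot's chain
    HashEntry* next;        // next entry in the same slot's chain
    size_t numEntries;      // number of return addresses in backtrace
    // fields above "size" are NOT sent to the host
    size_t size;            // size of each allocation made from this call stack
    size_t allocations;     // number of live allocations with this call stack
    intptr_t backtrace[0];  // the call stack (variable length)
};

typedef struct HashTable HashTable;
struct HashTable {
    size_t count;                      // total number of HashEntry items
    HashEntry* slots[HASHTABLE_SIZE];  // HASHTABLE_SIZE = 1543
};

Each process has one global variable, gHashTable, of type HashTable. It records the call stacks that allocated memory via malloc. Its member slots is an array of 1543 (HASHTABLE_SIZE) HashEntry pointers, and count records how many HashEntry items the table currently holds.

A hash value is computed from backtrace and numEntries. Taking it modulo HASHTABLE_SIZE gives the slot index, so an entry's position can be found quickly from its own contents.

HashEntry has prev and next pointers, so each slot holds a doubly linked list. Entries whose hashes fall into the same slot are chained together, and record_backtrace inserts each new entry at the head of the chain. The fields of HashEntry are:
- slot: the entry's own index in the array, derived from the hash.
- backtrace: the last member, which holds the return addresses.
- numEntries: how many return addresses backtrace holds.
- size: the allocation size.
- allocations: the number of live allocations that share the same call path.

The relationship between the two structures is shown in the figure below.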
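Here is a minimal sketch that makes the slot computation concrete. The multiply-by-33 hash over the return addresses is an assumption chosen for illustration and may not match bionic's get_hash exactly. What matters is that the slot is the remainder of the hash divided by HASHTABLE_SIZE, so identical call stacks always land in the same slot. sketch_hash and sketch_slot are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASHTABLE_SIZE 1543   /* the slot count quoted above */

/* Hypothetical hash over the return addresses of one call stack. */
static size_t sketch_hash(const intptr_t* backtrace, size_t numEntries)
{
    assert(backtrace != NULL || numEntries == 0);
    size_t hash = 0;
    for (size_t i = 0; i < numEntries; i++) {
        /* Drop the low bits, which carry little information for code addresses. */
        hash = hash * 33 + ((uintptr_t)backtrace[i] >> 2);
    }
    return hash;
}

/* The slot is the remainder, not the quotient, of hash / HASHTABLE_SIZE. */
static size_t sketch_slot(const intptr_t* backtrace, size_t numEntries)
{
    return sketch_hash(backtrace, numEntries) % HASHTABLE_SIZE;
}
```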

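When leak_malloc calls record_backtrace to record a call stack, record_backtrace proceeds as follows:
1. It computes a hash from backtrace and numEntries and takes it modulo HASHTABLE_SIZE to get the slot index in gHashTable.
2. It checks whether a matching HashEntry already exists, that is, one with the same allocation size, the same number of frames, and the same call path.
3. If a match exists, it increments that entry's allocations.
4. Otherwise it creates a new entry. It allocates memory for the HashEntry (line 151 of the source). The size depends on numEntries, because the last member, backtrace, is variable-length. It then copies the call stack into backtrace, links the entry into its slot, and increments the table's count.

In short, record_backtrace either adds a new HashEntry to the global table or increments the allocations count of an existing one.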

static HashEntry* record_backtrace(intptr_t* backtrace, size_t numEntries, size_t size)
{
    size_t hash = get_hash(backtrace, numEntries);  // hash of the call stack
    size_t slot = hash % HASHTABLE_SIZE;            // remainder = index into gHashTable.slots

    if (size & SIZE_FLAG_MASK) {
        debug_log("malloc_debug: allocation %zx exceeds bit width\n", size);
        abort();
    }

    if (gMallocLeakZygoteChild)
        size |= SIZE_FLAG_ZYGOTE_CHILD;

    // look for an existing entry with the same call stack and size
    HashEntry* entry = find_entry(&gHashTable, slot, backtrace, numEntries, size);

    if (entry != NULL) {
        entry->allocations++;                       // known call path: count one more
    } else {
        // create a new entry; backtrace[0] is variable-length, so reserve
        // numEntries * sizeof(intptr_t) extra bytes for the frames
        entry = (HashEntry*)dlmalloc(sizeof(HashEntry) + numEntries*sizeof(intptr_t));
        if (!entry)
            return NULL;
        entry->allocations = 1;
        entry->slot = slot;
        entry->prev = NULL;
        entry->next = gHashTable.slots[slot];       // insert at the head of the chain
        entry->numEntries = numEntries;
        entry->size = size;

        // copy the frames into the memory right after the struct
        memcpy(entry->backtrace, backtrace, numEntries * sizeof(intptr_t));

        gHashTable.slots[slot] = entry;             // the new entry becomes the slot's head

        if (entry->next != NULL) {
            entry->next->prev = entry;
        }

        // we just added an entry, increase the size of the hashtable
        gHashTable.count++;
    }

    return entry;
}
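The find-or-insert logic of record_backtrace can be reproduced as a standalone sketch. Entry, Table, hash_frames, and record are hypothetical names. The sketch makes these substitutions:
- malloc replaces dlmalloc.
- A C99 flexible array member replaces backtrace[0].
- There is no mutex.
- find_entry is inlined as a loop over the slot's chain that compares size, frame count, and frames.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS 1543

typedef struct Entry Entry;
struct Entry {
    Entry* prev;
    Entry* next;
    size_t numEntries;
    size_t size;
    size_t allocations;
    intptr_t backtrace[];    /* flexible array member instead of [0] */
};

typedef struct {
    size_t count;
    Entry* slots[SLOTS];
} Table;

static size_t hash_frames(const intptr_t* bt, size_t n)
{
    size_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = h * 33 + ((uintptr_t)bt[i] >> 2);
    return h;
}

/* Find-or-insert: the same stack with the same size shares one entry. */
static Entry* record(Table* t, const intptr_t* bt, size_t n, size_t size)
{
    assert(t != NULL);
    size_t slot = hash_frames(bt, n) % SLOTS;
    for (Entry* e = t->slots[slot]; e != NULL; e = e->next) {
        if (e->size == size && e->numEntries == n &&
                memcmp(e->backtrace, bt, n * sizeof(intptr_t)) == 0) {
            e->allocations++;          /* known call site: just count it */
            return e;
        }
    }
    Entry* e = malloc(sizeof(Entry) + n * sizeof(intptr_t));
    if (e == NULL)
        return NULL;
    e->prev = NULL;
    e->next = t->slots[slot];          /* push at the head of the chain */
    e->numEntries = n;
    e->size = size;
    e->allocations = 1;
    memcpy(e->backtrace, bt, n * sizeof(intptr_t));
    if (e->next != NULL)
        e->next->prev = e;
    t->slots[slot] = e;
    t->count++;                        /* only new entries grow the table */
    return e;
}
```

Recording the same stack twice with the same size returns the same entry with allocations == 2. The same stack with a different size creates a second entry at the head of the same chain.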

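leak_free decrements the block's entry and removes it from the global hash table once its last allocation is freed (line 550 of the source):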

void leak_free(void* mem)
{
    if (mem != NULL) {
        pthread_mutex_lock(&gAllocationsMutex);

        // check the guard to make sure it is valid
        AllocationEntry* header = (AllocationEntry*)mem - 1;

        if (header->guard != GUARD) {
            // could be a memaligned block
            if (((void**)mem)[-1] == MEMALIGN_GUARD) {
                mem = ((void**)mem)[-2];
                header = (AllocationEntry*)mem - 1;
            }
        }

        if (header->guard == GUARD || is_valid_entry(header->entry)) {
            // decrement the allocations
            HashEntry* entry = header->entry;
            entry->allocations--;
            if (entry->allocations <= 0) {
                remove_entry(entry);
                dlfree(entry);
            }

            // now free the memory!
            dlfree(header);
        } else {
            debug_log("WARNING bad header guard: '0x%x'! and invalid entry: %p\n",
                    header->guard, header->entry);
        }

        pthread_mutex_unlock(&gAllocationsMutex);
    }
}

Therefore, the entries left in the global table are the call stacks of malloc calls whose memory was allocated but never freed.
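The following code shows, without the hash table, why only leaked call sites survive. Site, site_alloc, site_free, count_suspects, and g_live are hypothetical names. Each Site stands in for one HashEntry, and a string name stands in for its backtrace. site_free unlinks a site when its count reaches zero, as remove_entry does in leak_free.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Site Site;
struct Site {
    Site* prev;
    Site* next;
    const char* name;      /* stands in for the recorded backtrace */
    size_t allocations;
};

static Site* g_live = NULL;    /* list of call sites with live allocations */

static void site_alloc(Site* s)
{
    if (s->allocations++ == 0) {   /* first live allocation: link it in */
        s->prev = NULL;
        s->next = g_live;
        if (g_live != NULL)
            g_live->prev = s;
        g_live = s;
    }
}

static void site_free(Site* s)
{
    assert(s->allocations > 0);
    if (--s->allocations == 0) {   /* last one gone: unlink, like remove_entry */
        if (s->prev != NULL)
            s->prev->next = s->next;
        else
            g_live = s->next;
        if (s->next != NULL)
            s->next->prev = s->prev;
    }
}

/* Whatever is still linked is a leak suspect. */
static size_t count_suspects(void)
{
    size_t n = 0;
    for (const Site* s = g_live; s != NULL; s = s->next)
        n++;
    return n;
}
```

Suppose there is one allocation from open_file and two from decode_frame. If one of each is then freed, only decode_frame remains in the list. That is the kind of entry get_malloc_leak_info reports.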

get_malloc_leak_info

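get_malloc_leak_info retrieves the memory-leak information. Call stacks are recorded on allocation and removed on release, so whatever remains is quite likely the source of a leak. malloc_debug_common.c provides get_malloc_leak_info to read this global call-stack hash table. The function takes five parameters. Each is the address of a variable that the function fills in. As its code comments describe:

- *info points to one block of memory allocated inside the function, overallSize bytes in total.
- The block consists of fixed-size records, each infoSize bytes long. Each record has the same layout as HashEntry from the size member onward:
  - the allocation size;
  - allocations, the number of live allocations that share this call stack;
  - backtrace, with room for 32 (BACKTRACE_SIZE) pointers.
- The block therefore holds overallSize / infoSize records.
- In HashEntry, the backtrace array is sized to the actual number of frames. Here, every record reserves 32 slots, and if a stack has fewer than 32 frames the remaining slots are zeroed.
- *totalMemory is the total size of all memory allocated by malloc that has not yet been freed.
- The last parameter, *backtraceSize, is set to 32 (BACKTRACE_SIZE).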

get_malloc_leak_info first checks that its arguments are valid and that the global table contains any call-stack entries:

void get_malloc_leak_info(uint8_t** info, size_t* overallSize,
        size_t* infoSize, size_t* totalMemory, size_t* backtraceSize)
{
    // don't do anything if we have invalid arguments
    if (info == NULL || overallSize == NULL || infoSize == NULL ||
            totalMemory == NULL || backtraceSize == NULL) {
        return;
    }
    *totalMemory = 0;

    pthread_mutex_lock(&gAllocationsMutex);

    if (gHashTable.count == 0) {    // nothing recorded: report an empty result
        *info = NULL;
        *overallSize = 0;
        *infoSize = 0;
        *backtraceSize = 0;
        goto done;
    }

Next, it allocates an array with one pointer for each HashEntry in gHashTable. It walks every slot and chain to fill that array, and along the way adds up the memory that has been allocated but not yet freed into *totalMemory for the caller. It then sets the record size and the total size. The last output, the number of return addresses per call stack, is fixed at BACKTRACE_SIZE, i.e. 32.

    void** list = (void**)dlmalloc(sizeof(void*) * gHashTable.count);

    // get the entries into an array to be sorted
    int index = 0;
    int i;
    for (i = 0 ; i < HASHTABLE_SIZE ; i++) {       // walk every slot of gHashTable
        HashEntry* entry = gHashTable.slots[i];
        while (entry != NULL) {                     // put each live entry into list
            list[index] = entry;
            *totalMemory = *totalMemory +           // accumulate total live memory
                ((entry->size & ~SIZE_FLAG_MASK) * entry->allocations);
            index++;
            entry = entry->next;                    // next entry in the same slot
        }
    }
    // after this loop, list holds pointers to every call-stack entry in the table

    // XXX: the protocol doesn't allow variable size for the stack trace (yet)
    // each record: size and allocations (two size_t from HashEntry), then 32 pointers
    *infoSize = (sizeof(size_t) * 2) + (sizeof(intptr_t) * BACKTRACE_SIZE);
    *overallSize = *infoSize * gHashTable.count;    // memory needed for all records
    *backtraceSize = BACKTRACE_SIZE;

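Finally, the function allocates the buffer that *info points to, large enough for all the records. It sorts list, then copies each entry's call-stack information from gHashTable into that buffer. For each HashEntry in list, the copied part is the members from size onward.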

    // now get A byte array big enough for this
    *info = (uint8_t*)dlmalloc(*overallSize);       // one buffer for all records

    if (*info == NULL) {                            // allocation failed: out of memory
        *overallSize = 0;
        goto out_nomem_info;
    }

    qsort((void*)list, gHashTable.count, sizeof(void*), hash_entry_compare);  // sort the entries

    uint8_t* head = *info;
    const int count = gHashTable.count;
    for (i = 0 ; i < count ; i++) {
        HashEntry* entry = list[i];
        size_t entrySize = (sizeof(size_t) * 2) + (sizeof(intptr_t) * entry->numEntries);
        if (entrySize < *infoSize) {
            /* we're writing less than a full entry, clear out the rest */
            memset(head + entrySize, 0, *infoSize - entrySize);  // unused frame slots
        } else {
            /* make sure the amount we're copying doesn't exceed the limit */
            entrySize = *infoSize;
        }
        // copy size, allocations, and the frames into the record
        memcpy(head, &(entry->size), entrySize);
        head += *infoSize;                          // advance to the next record
    }

out_nomem_info:
    dlfree(list);

done:
    pthread_mutex_unlock(&gAllocationsMutex);
}
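The record layout that get_malloc_leak_info produces can be sketched on its own. Record, record_info_size, and pack_records are hypothetical names. calloc replaces dlmalloc plus memset and zero-fills the unused frame slots in one step. Stacks longer than 32 frames are truncated, as in the original.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BACKTRACE_SIZE 32

/* Hypothetical in-memory record: the part of HashEntry from size onward. */
typedef struct {
    size_t size;
    size_t allocations;
    size_t numEntries;
    const intptr_t* backtrace;
} Record;

/* Size of one exported record: two size_t fields plus a fixed 32-slot stack. */
static size_t record_info_size(void)
{
    return sizeof(size_t) * 2 + sizeof(intptr_t) * BACKTRACE_SIZE;
}

/* Pack records into one buffer of fixed-size slots; the caller frees it. */
static uint8_t* pack_records(const Record* recs, size_t count, size_t* overallSize)
{
    assert(recs != NULL || count == 0);
    const size_t infoSize = record_info_size();
    uint8_t* info = calloc(count, infoSize);   /* zero-filled: short stacks end in 0s */
    if (info == NULL) {
        *overallSize = 0;
        return NULL;
    }
    uint8_t* head = info;
    for (size_t i = 0; i < count; i++) {
        size_t frames = recs[i].numEntries < BACKTRACE_SIZE ? recs[i].numEntries
                                                            : BACKTRACE_SIZE;
        memcpy(head, &recs[i].size, sizeof(size_t));
        memcpy(head + sizeof(size_t), &recs[i].allocations, sizeof(size_t));
        memcpy(head + 2 * sizeof(size_t), recs[i].backtrace, frames * sizeof(intptr_t));
        head += infoSize;                      /* every record occupies infoSize bytes */
    }
    *overallSize = infoSize * count;
    return info;
}
```

Fixed-size records let a reader walk the buffer with a constant stride. That is what the consumer code in the next section does.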

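When a program finishes, all of its memory should normally have been freed. At that point we can call get_malloc_leak_info to obtain the call stacks that were never released; in principle, these are the leaks. In practice, some memory may legitimately still be in use at the moment get_malloc_leak_info is called. The process under investigation may also be a daemon that never exits, in which case some memory is meant to stay allocated for its whole lifetime.

In that case, pick a particular state and take a snapshot each time the process is in it. For example, call get_malloc_leak_info while the process is idle right after startup. Then perform some operations and call it again once the process has returned to idle. Compare the two results: any call stack that appears only in the second snapshot is a suspected leak. Sample code using get_malloc_leak_info: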

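#define LOG_TAG "MemLeakDump"
#include <stdint.h>
#include <stdio.h>
#include <utils/Log.h>
#include <utils/String8.h>

using namespace android;

// Exported by bionic (malloc_debug_common.c); not declared in a public header.
extern "C" void get_malloc_leak_info(uint8_t** info, size_t* overallSize,
        size_t* infoSize, size_t* totalMemory, size_t* backtraceSize);
extern "C" void free_malloc_leak_info(uint8_t* info);

typedef struct {
    size_t size;          // allocated size
    size_t dups;          // number of live allocations with this backtrace
    intptr_t* backtrace;  // points into the info buffer
} AllocEntry;

void dumpLeakInfo(String8& result)
{
    const size_t SIZE = 256;
    char buffer[SIZE];
    uint8_t* info = NULL;
    size_t overallSize = 0, infoSize = 0, totalMemory = 0, backtraceSize = 0;

    get_malloc_leak_info(&info, &overallSize, &infoSize, &totalMemory, &backtraceSize);
    LOGI("returned from get_malloc_leak_info, info=%p, overallSize=%zu, infoSize=%zu, "
         "totalMemory=%zu, backtraceSize=%zu",
         info, overallSize, infoSize, totalMemory, backtraceSize);
    if (info == NULL || infoSize == 0) {
        return;   // nothing recorded, e.g. libc.debug.malloc is not 1
    }

    uint8_t* ptr = info;
    size_t count = overallSize / infoSize;
    snprintf(buffer, SIZE, " Allocation count %zu\n", count);
    result.append(buffer);
    snprintf(buffer, SIZE, " Total memory %zu\n", totalMemory);
    result.append(buffer);

    AllocEntry* entries = new AllocEntry[count];
    for (size_t i = 0; i < count; i++) {
        // Fill the AllocEntry array from the returned buffer.
        // Each record is laid out as size_t, size_t, intptr_t[backtraceSize].
        AllocEntry* e = &entries[i];
        e->size = *reinterpret_cast<size_t*>(ptr);
        ptr += sizeof(size_t);
        e->dups = *reinterpret_cast<size_t*>(ptr);
        ptr += sizeof(size_t);
        e->backtrace = reinterpret_cast<intptr_t*>(ptr);
        ptr += sizeof(intptr_t) * backtraceSize;
    }

    // ... format entries[] into result here ...

    delete[] entries;
    free_malloc_leak_info(info);  // info was allocated with dlmalloc inside libc
}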
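After the records are parsed into AllocEntry-like structures, two snapshots can also be compared in code rather than with a textual diff. This is a sketch under the following assumptions. Snap, same_site, growth, and find_suspects are hypothetical names. A call site is identified by its size plus its 32-slot backtrace. A site counts as suspicious when its dups value grew between the two snapshots.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FRAMES 32

typedef struct {
    size_t size;
    size_t dups;
    intptr_t backtrace[FRAMES];   /* zero-padded, as in the exported records */
} Snap;

static int same_site(const Snap* a, const Snap* b)
{
    return a->size == b->size &&
           memcmp(a->backtrace, b->backtrace, sizeof(a->backtrace)) == 0;
}

/* How many allocations from this site are new compared with `before`. */
static size_t growth(const Snap* site, const Snap* before, size_t nBefore)
{
    for (size_t i = 0; i < nBefore; i++) {
        if (same_site(site, &before[i]))
            return site->dups > before[i].dups ? site->dups - before[i].dups : 0;
    }
    return site->dups;   /* the site did not exist in the first snapshot */
}

/* Writes indices of suspicious entries of `after` into out; returns their count. */
static size_t find_suspects(const Snap* before, size_t nBefore,
                            const Snap* after, size_t nAfter, size_t* out)
{
    assert(out != NULL || nAfter == 0);
    size_t n = 0;
    for (size_t i = 0; i < nAfter; i++) {
        if (growth(&after[i], before, nBefore) > 0)
            out[n++] = i;
    }
    return n;
}
```

Unlike a line-based diff, this comparison is unaffected by the order in which entries appear in the two snapshots.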

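Debugging steps: follow http://freepine.blogspot.com/2010/02/analyze-memory-leak-of-android-native.html. Download the patch and the Python tools from that post and apply the patch to frameworks/base in the Android source tree. Rebuild the system image and flash it onto the device. The device then has a binary, /system/bin/memorydumper. The patch adds a service to the mediaserver process, and memorydumper talks to that service over Binder IPC. The service calls get_malloc_leak_info to obtain information about unreleased memory.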

Step 1: set the debug level and restart the mediaserver process.

adb shell setprop libc.debug.malloc 1
adb shell ps mediaserver
adb shell kill <mediaserver_pid>

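This makes the mediaserver process use the leak_malloc versions of the allocation functions. After the debug level is set, kill mediaserver and Android restarts it automatically. The new process reloads libc, and its memory-allocation handles now point to leak_malloc, leak_free, and so on.

Step 2: in an initial state, for example before starting the Camera app, run memorydumper to record the memory that is currently unreleased:

$ adb shell /system/bin/memorydumper
$ adb pull /data/memstatus_<mediaserver_pid>.0 .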

Step 3: perform some operations, such as taking pictures, recording video, or playing a few songs, and then exit those applications.

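Step 4: run memorydumper again to record the currently unreleased memory. Use a diff tool to compare the result with the one from Step 2. The differences are the suspected leaks. Some memory still held at the time of the first snapshot was simply not due for release yet, and the comparison filters those allocations out.

$ adb shell /system/bin/memorydumper
$ adb pull /data/memstatus_<mediaserver_pid>.1 .
$ diff memstatus_<mediaserver_pid>.0 memstatus_<mediaserver_pid>.1 >diff_0_1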

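Step 5: fetch the maps file. It gives the address range that each .so library occupies, which is needed to resolve the call-stack addresses to function symbols.

$ adb pull /proc/<mediaserver_pid>/maps your_path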

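Step 6: run the Python script from the referenced post:

./addr2func.py --root-dir=~/u8500-android-2.3_v4.30 --maps-file=maps --product=u8500 diff_0_1 >memleak.backtrace

The script parses the maps file to find the address range occupied by each library and works out which library each call-stack address belongs to. It then runs the following command to get the compiler-mangled function name, the source file, and the line number:

[root-dir]/prebuilt/linux-x86/toolchain/arm-eabi-4.4.0/bin/arm-eabi-addr2line -f -e [root-dir]/out/target/product/[product]/symbols/[libname] callstack_address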

Finally, [root-dir]/prebuilt/linux-x86/toolchain/arm-eabi-4.4.0/bin/arm-eabi-c++filt demangles the function names so that they match the source code and are easier to recognize.
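The address-to-library step that the script performs can be sketched in C. This is not the script's actual logic. MapRange, parse_maps_line, and find_mapping are hypothetical, and the sample maps line in the comment is illustrative. Which value to pass to addr2line depends on the library. For a library prelinked at a fixed address, as system libraries commonly were in Android 2.3, the absolute address may be usable directly. For a position-independent library, the offset from the start of its executable mapping is needed. Treat that distinction as an assumption to check against the script.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    unsigned long start;
    unsigned long end;
    char path[256];
} MapRange;

/* Parses one /proc/<pid>/maps line such as
 * "afd00000-afd41000 r-xp 00000000 1f:00 409 /system/lib/libc.so".
 * Returns 1 on success, 0 for lines without a path (e.g. anonymous mappings). */
static int parse_maps_line(const char* line, MapRange* out)
{
    char perms[5];
    unsigned long offset;
    out->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %lx %*s %*s %255s",
                   &out->start, &out->end, perms, &offset, out->path);
    (void)offset;   /* file offset; unused in this sketch */
    return n == 5;
}

/* Finds the mapping that contains addr and reports addr's offset within it. */
static const MapRange* find_mapping(const MapRange* maps, size_t count,
                                    unsigned long addr, unsigned long* rel)
{
    for (size_t i = 0; i < count; i++) {
        if (addr >= maps[i].start && addr < maps[i].end) {
            *rel = addr - maps[i].start;
            return &maps[i];
        }
    }
    return NULL;
}
```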

An example snapshot: the screenshot below shows the call-stack addresses obtained from the first memorydumper run: