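Analyzing a Memory Spike in a .NET MES Host-Computer Image Capture System
I: Background
1. The story
A friend from my training camp reached out to say that their system occasionally suffers from memory spikes. He could not track down the cause himself and asked me to take a look, handing over a dump file of more than 20 GB. That is a big file. My personal recommendation is to keep dumps under 10 GB where possible, because anything larger is a real struggle to analyze in WinDbg.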
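II: Memory spike analysis
1. Why did memory spike?
Same approach as always: use !address -summary to look at the committed memory, then !eeheap -gc to see how much of it belongs to the managed heap. The output is as follows: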
0:000> !address -summary
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 1870 5ff8`c8447000 ( 95.972 TB) 74.98%
<unknown> 1064 2005`7faca000 ( 32.021 TB) 99.98% 25.02%
Heap 3594 1`56a34000 ( 5.354 GB) 0.02% 0.00%
Image 4747 0`35dfb000 ( 861.980 MB) 0.00% 0.00%
Stack 522 0`2b440000 ( 692.250 MB) 0.00% 0.00%
Other 314 0`00313000 ( 3.074 MB) 0.00% 0.00%
TEB 174 0`0015c000 ( 1.359 MB) 0.00% 0.00%
PEB 1 0`00001000 ( 4.000 kB) 0.00% 0.00%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE 1870 5ff8`c8447000 ( 95.972 TB) 74.98%
MEM_RESERVE 2326 2001`b95a7000 ( 32.007 TB) 99.93% 25.01%
MEM_COMMIT 8090 5`7e602000 ( 21.975 GB) 0.07% 0.02%
0:000> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x0000013e0f5919d8
generation 1 starts at 0x0000013e0f49a8b0
generation 2 starts at 0x0000013e09f21000
ephemeral segment allocation context: none
segment begin allocated size
0000013e09f20000 0000013e09f21000 0000013e0fb15b20 0x5bf4b20(96422688)
Large object heap starts at 0x0000013e19f21000
segment begin allocated size
0000013e19f20000 0000013e19f21000 0000013e211b6f50 0x7295f50(120151888)
...
00000143d6850000 00000143d6851000 00000143db009118 0x47b8118(75202840)
Total Size: Size: 0x33bd0f148 (13888450888) bytes.
------------------------------
GC Heap Size: Size: 0x33bd0f148 (13888450888) bytes.
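From the output, committed memory (MEM_COMMIT) is 21.9 GB, the native Heap is 5.3 GB, and the managed heap is 13,888,450,888 bytes, roughly 13.9 GB in decimal units or 12.9 GB in the binary units WinDbg uses. The managed heap accounts for more than half of the committed memory, so it is the place to start.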
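As a quick sanity check on those figures, the sketch below converts the hex sizes from the two command outputs into bytes and works out what share of committed memory each heap takes. It is only illustrative arithmetic in Python; the helper name windbg_hex_to_bytes is made up for this example and is not part of any debugger tooling.

```python
def windbg_hex_to_bytes(text):
    """Convert a WinDbg size such as '5`7e602000' or '0x33bd0f148' to an int."""
    return int(text.replace("`", ""), 16)


GIB = 1024 ** 3  # WinDbg's "GB" column uses binary units

# Values copied from the outputs above.
committed = windbg_hex_to_bytes("5`7e602000")      # MEM_COMMIT, !address -summary
native_heap = windbg_hex_to_bytes("1`56a34000")    # Heap, !address -summary
managed_heap = windbg_hex_to_bytes("0x33bd0f148")  # GC Heap Size, !eeheap -gc

managed_share = managed_heap / committed
native_share = native_heap / committed

print(f"committed    : {committed / GIB:7.3f} GiB")
print(f"managed heap : {managed_heap / GIB:7.3f} GiB ({managed_share:.1%} of committed)")
print(f"native heap  : {native_heap / GIB:7.3f} GiB ({native_share:.1%} of committed)")
```

With these inputs the managed heap comes out at roughly 59% of committed memory and the native heap at roughly 24%, which is why the managed side is investigated first.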
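2. What is wrong with the managed heap?
To see what is occupying managed memory, the powerful PerfView gives a quick read on which GC roots retain the most. Screenshot below: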

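The screenshot clearly shows that FinalizerQueue holds almost all of the managed memory. If you know how the finalizer queue works, the next step is obvious.
Next, use the !fq command to inspect the finalizer queue. The output looks like this: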
0:000> !fq
SyncBlocks to be cleaned up: 0
Free-Threaded Interfaces to be released: 0
MTA Interfaces to be released: 0
STA Interfaces to be released: 0
----------------------------------
generation 0 has 2722 finalizable objects (0000013f4c737e08->0000013f4c73d318)
generation 1 has 73 finalizable objects (0000013f4c737bc0->0000013f4c737e08)
generation 2 has 20328 finalizable objects (0000013f4c710080->0000013f4c737bc0)
Ready for finalization 34482 objects (0000013f4c73d318->0000013f4c7808a8)
Statistics for all finalizable objects (including all objects ready for finalization):
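The "Ready for finalization" line above is the f-reachable part of the finalizer queue, which is where the finalizer thread picks up its work. 34,482 objects are backed up there, compared with 23,123 finalizable objects still tracked across generations 0 to 2, which indicates that something is wrong with the finalizer thread.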
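The address ranges printed by !fq can be cross-checked against the reported counts. The sketch below assumes the finalization queue is a contiguous array of object pointers, 8 bytes each in a 64-bit process, so (end - start) / 8 should equal the object count of each segment. The function and variable names are illustrative only.

```python
POINTER_SIZE = 8  # bytes per queue slot in a 64-bit process (assumption)

# (label, range start, range end, count reported by !fq), copied from the output above
FQ_SEGMENTS = [
    ("generation 0", 0x0000013F4C737E08, 0x0000013F4C73D318, 2722),
    ("generation 1", 0x0000013F4C737BC0, 0x0000013F4C737E08, 73),
    ("generation 2", 0x0000013F4C710080, 0x0000013F4C737BC0, 20328),
    ("ready for finalization", 0x0000013F4C73D318, 0x0000013F4C7808A8, 34482),
]


def slots_in_range(start, end, pointer_size=POINTER_SIZE):
    """Number of pointer-sized slots between two addresses."""
    return (end - start) // pointer_size


def summarize(segments):
    """Return (total slots, rows of (label, derived count, reported count, share))."""
    total = sum(slots_in_range(start, end) for _, start, end, _ in segments)
    rows = []
    for label, start, end, reported in segments:
        derived = slots_in_range(start, end)
        rows.append((label, derived, reported, derived / total))
    return total, rows


total, rows = summarize(FQ_SEGMENTS)
for label, derived, reported, share in rows:
    print(f"{label:24s} derived={derived:6d} reported={reported:6d} share={share:.1%}")
print(f"total tracked objects: {total}")
```

The derived counts match the reported ones for all four segments, and the "ready for finalization" segment alone holds close to 60% of everything the queue tracks.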
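3. What is wrong with the finalizer thread?
To find the finalizer thread, list the threads with !t, look for the one marked (Finalizer), switch to it with ~~[708]s using its OSID, and then inspect its call stack with k and !clrstack.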
0:000> !t
ThreadCount: 104
UnstartedThread: 0
BackgroundThread: 40
PendingThread: 0
DeadThread: 63
Hosted Runtime: no
Lock
ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
0 1 3854 0000013e082beb60 26020 Preemptive 0000013E0F6E63A0:0000013E0F6E79D8 0000013e08293ef0 0 STA
5 2 708 0000013e082e7bd0 2b220 Preemptive 0000000000000000:0000000000000000 0000013e08293ef0 0 MTA (Finalizer)
0:000> ~~[708]s
win32u!NtUserMessageCall+0x14:
00007ff8`6b151124 c3 ret
0:005> k
# Child-SP RetAddr Call Site
00 00000029`dbdfea38 00007ff8`6cce1082 win32u!NtUserMessageCall+0x14
01 00000029`dbdfea40 00007fff`9879b2d0 user32!SendMessageTimeoutW+0x102
02 00000029`dbdfead0 00007fff`985c4dc7 halcon!IOWIN32DumpToTexture+0xc90
03 00000029`dbdfef60 00007fff`974bff0e halcon!IPGenImaMask+0xae7
04 00000029`dbdfefd0 00007fff`9739d0ca halcon!HHandleClear+0x10e
05 00000029`dbdff050 00007ff7`f5d5a1a2 halcon!HLIClearHandle+0x2a
06 00000029`dbdff090 00007ff7`f5d5b571 halcondotnet!HalconDotNet.HHandleBase.ClearHandleInternal+0x92
07 00000029`dbdff140 00007ff7`f5ddf865 halcondotnet!HalconDotNet.HHandleBase.Dispose+0x21
08 00000029`dbdff180 00007ff8`542d67b6 halcondotnet!HalconDotNet.HHandleBase.Finalize+0x15
09 00000029`dbdff1c0 00007ff8`544934a1 clr!FastCallFinalizeWorker+0x6
0a 00000029`dbdff1f0 00007ff8`54493429 clr!FastCallFinalize+0x55
0b 00000029`dbdff240 00007ff8`54493358 clr!MethodTable::CallFinalizer+0xb5
0c 00000029`dbdff290 00007ff8`5449318b clr!CallFinalizer+0x5e
0d 00000029`dbdff2d0 00007ff8`544930a4 clr!FinalizerThread::DoOneFinalization+0x95
0e 00000029`dbdff3b0 00007ff8`544923fa clr!FinalizerThread::FinalizeAllObjects+0xbf
0f 00000029`dbdff3f0 00007ff8`542d7be8 clr!FinalizerThread::FinalizerThreadWorker+0xba
10 00000029`dbdff440 00007ff8`542d7b53 clr!ManagedThreadBase_DispatchInner+0x40
11 00000029`dbdff480 00007ff8`542d7a92 clr!ManagedThreadBase_DispatchMiddle+0x6c
12 00000029`dbdff580 00007ff8`5441c316 clr!ManagedThreadBase_DispatchOuter+0x4c
13 00000029`dbdff5f0 00007ff8`542dbcc5 clr!FinalizerThread::FinalizerThreadStart+0x116
14 00000029`dbdff690 00007ff8`6b3a7374 clr!Thread::intermediateThreadProc+0x8b
15 00000029`dbdff750 00007ff8`6d35cc91 kernel32!BaseThreadInitThunk+0x14
16 00000029`dbdff780 00000000`00000000 ntdll!RtlUserThreadStart+0x21
0:005> !clrstack
OS Thread Id: 0x708 (5)
Child SP IP Call Site
00000029dbdff0b8 00007ff86b151124 [InlinedCallFrame: 00000029dbdff0b8] HalconDotNet.HalconAPI.ClearHandle(IntPtr)
00000029dbdff0b8 00007ff7f5d5a1a2 [InlinedCallFrame: 00000029dbdff0b8] HalconDotNet.HalconAPI.ClearHandle(IntPtr)
00000029dbdff090 00007ff7f5d5a1a2 HalconDotNet.HHandleBase.ClearHandleInternal()
00000029dbdff140 00007ff7f5d5b571 HalconDotNet.HHandleBase.Dispose(Boolean)
00000029dbdff180 00007ff7f5ddf865 HalconDotNet.HHandleBase.Finalize()
00000029dbdff5d0 00007ff8542d67b6 [DebuggerU2MCatchHandlerFrame: 00000029dbdff5d0]
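Judging from the stack, this is a nasty trap: releasing a halcon handle actually has to talk to a window. HHandleBase.Finalize ends up in user32!SendMessageTimeoutW and then win32u!NtUserMessageCall. The window handle is in the rcx register (the same value also appears in r15). The output is as follows: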
0:005> r
rax=0000000000001007 rbx=00000029dbdfef10 rcx=00000000000f3736
rdx=000000000000c258 rsi=000000000000c258 rdi=0000000000000000
rip=00007ff86b151124 rsp=00000029dbdfea38 rbp=00007fff985c4ed0
r8=0000000000000015 r9=0000000000000000 r10=00007fff96d40000
r11=0000000000000000 r12=00000029dbdfefb0 r13=00007fff985c4ed0
r14=0000000000000e20 r15=00000000000f3736
iopl=0 nv up ei pl zr na po nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
win32u!NtUserMessageCall+0x14:
00007ff8`6b151124 c3 ret
The next question is which window that rcx value belongs to. For that you need the powerful Spy++, which I have covered in earlier articles. Screenshot below:
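Before switching to Spy++, it can help to pull the handle out of the register dump mechanically. The following is a small Python sketch that parses the output of r shown above, checks that rcx and r15 carry the same value, and prints the handle zero-padded to 8 hex digits, which is the form I would expect to search for in Spy++ (treat that formatting as an assumption). The names parse_registers and format_hwnd are hypothetical helpers for this example.

```python
import re

# General-purpose register lines copied from the `r` output above.
REGISTER_DUMP = """\
rax=0000000000001007 rbx=00000029dbdfef10 rcx=00000000000f3736
rdx=000000000000c258 rsi=000000000000c258 rdi=0000000000000000
rip=00007ff86b151124 rsp=00000029dbdfea38 rbp=00007fff985c4ed0
 r8=0000000000000015  r9=0000000000000000 r10=00007fff96d40000
r11=0000000000000000 r12=00000029dbdfefb0 r13=00007fff985c4ed0
r14=0000000000000e20 r15=00000000000f3736
"""


def parse_registers(dump):
    """Map register names (rax, r8, rip, ...) to integer values."""
    pairs = re.findall(r"\b(r[a-z0-9]{1,2})=([0-9a-f]+)", dump)
    return {name: int(value, 16) for name, value in pairs}


def format_hwnd(value):
    """Format a handle value as 8 uppercase hex digits."""
    return f"{value:08X}"


registers = parse_registers(REGISTER_DUMP)
hwnd = registers["rcx"]

print("candidate window handle:", format_hwnd(hwnd))
print("rcx matches r15:", registers["rcx"] == registers["r15"])
```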

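At this point the whole chain of events is clear: a window stopped responding, the finalizer thread got stuck waiting on it, and the consequences were catastrophic. I advised my friend to focus on halcon and to use Spy++ to investigate the window.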
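To make the failure mode concrete, here is a toy Python model of it, not CLR code. It assumes a single worker thread draining a queue in order, which is how the finalizer thread behaves in this dump, and it stands in for the hung window with an event that is never signalled while the measurement is taken. All names (simulate_finalizer_backlog, window_responds, and so on) are invented for the illustration, and the indefinite wait is a simplification of the real SendMessageTimeoutW call.

```python
import queue
import threading


def simulate_finalizer_backlog(total_objects, blocking_index):
    """Return (finalized_count, backlog) observed once the single worker is stuck."""
    freachable = queue.Queue()
    window_responds = threading.Event()  # stays unset while the window is hung
    worker_stuck = threading.Event()     # signals that the worker hit the bad object
    finalized = []

    def finalizer_thread():
        while True:
            obj_id = freachable.get()
            if obj_id is None:  # sentinel: stop the worker
                return
            if obj_id == blocking_index:
                worker_stuck.set()
                window_responds.wait()  # stands in for the blocked window message
            finalized.append(obj_id)

    worker = threading.Thread(target=finalizer_thread, daemon=True)
    worker.start()

    # The GC keeps handing over dead objects regardless of the worker's state.
    for obj_id in range(total_objects):
        freachable.put(obj_id)

    if not worker_stuck.wait(timeout=5):
        raise RuntimeError("worker never reached the blocking finalizer")
    finalized_count = len(finalized)
    backlog = total_objects - finalized_count

    # Unblock and shut down so the example exits cleanly.
    window_responds.set()
    freachable.put(None)
    worker.join(timeout=5)
    return finalized_count, backlog


done, backlog = simulate_finalizer_backlog(total_objects=1000, blocking_index=10)
print(f"finalized before the hang: {done}, still queued: {backlog}")
```

One blocked finalizer is enough to stall everything queued behind it. In the real process the effect is worse than a long queue, because every object waiting there, along with whatever it references, stays alive until its finalizer has run.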
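III: Summary
As a debugging practitioner, make good use of multiple analysis tools. It often gets you twice the result for half the effort when solving a problem.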
