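Case Notes: Analyzing a Memory Surge in a .NET MES Host-PC Image Capture System

I: Background

1. The story

A friend from my training camp reached out to me. Their system suffers from occasional memory surges, they could not work out the cause on their own, and they asked me to take a look. They sent a dump file of more than 20 GB. That is a big file; I generally recommend keeping dumps under 10 GB, otherwise WinDbg struggles to analyze them.

II: Analyzing the memory surge

1. Why did memory surge

As usual, start with !address -summary to look at the committed memory. The output: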


0:000> !address -summary

--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free                                   1870     5ff8`c8447000 (  95.972 TB)           74.98%
<unknown>                              1064     2005`7faca000 (  32.021 TB)  99.98%   25.02%
Heap                                   3594        1`56a34000 (   5.354 GB)   0.02%    0.00%
Image                                  4747        0`35dfb000 ( 861.980 MB)   0.00%    0.00%
Stack                                   522        0`2b440000 ( 692.250 MB)   0.00%    0.00%
Other                                   314        0`00313000 (   3.074 MB)   0.00%    0.00%
TEB                                     174        0`0015c000 (   1.359 MB)   0.00%    0.00%
PEB                                       1        0`00001000 (   4.000 kB)   0.00%    0.00%

--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE                               1870     5ff8`c8447000 (  95.972 TB)           74.98%
MEM_RESERVE                            2326     2001`b95a7000 (  32.007 TB)  99.93%   25.01%
MEM_COMMIT                             8090        5`7e602000 (  21.975 GB)   0.07%    0.02%

0:000> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x0000013e0f5919d8
generation 1 starts at 0x0000013e0f49a8b0
generation 2 starts at 0x0000013e09f21000
ephemeral segment allocation context: none
         segment             begin         allocated              size
0000013e09f20000  0000013e09f21000  0000013e0fb15b20  0x5bf4b20(96422688)
Large object heap starts at 0x0000013e19f21000
         segment             begin         allocated              size
0000013e19f20000  0000013e19f21000  0000013e211b6f50  0x7295f50(120151888)
...
00000143d6850000  00000143d6851000  00000143db009118  0x47b8118(75202840)
Total Size:              Size: 0x33bd0f148 (13888450888) bytes.
------------------------------
GC Heap Size:            Size: 0x33bd0f148 (13888450888) bytes.

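The output shows 21.9 GB of committed memory (MEM_COMMIT), 5.3 GB in the NT heap (the Heap row), and a managed heap of about 13.9 GB (13,888,450,888 bytes). Since the managed heap accounts for more than half of the committed memory, that is the place to start.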
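The "more than half" claim can be checked with a few lines of arithmetic. The sketch below only reuses the hexadecimal sizes printed by !address -summary and !eeheap -gc; the constant and function names are made up for illustration. Note that WinDbg's "GB" column is in binary units (1 GB = 2^30 bytes), so the GC heap is converted the same way before comparing.

```python
# Sizes copied from the WinDbg output above (hexadecimal byte counts).
MEM_COMMIT_BYTES = 0x57E602000   # MEM_COMMIT row of !address -summary
NT_HEAP_BYTES = 0x156A34000      # "Heap" row of !address -summary
GC_HEAP_BYTES = 0x33BD0F148      # "GC Heap Size" from !eeheap -gc

GIB = 1 << 30  # WinDbg prints "GB" in binary units


def gib(n: int) -> float:
    """Convert a byte count to binary gigabytes."""
    return n / GIB


managed_share = GC_HEAP_BYTES / MEM_COMMIT_BYTES
nt_heap_share = NT_HEAP_BYTES / MEM_COMMIT_BYTES

print(f"committed   : {gib(MEM_COMMIT_BYTES):7.3f} GiB")
print(f"managed heap: {gib(GC_HEAP_BYTES):7.3f} GiB ({managed_share:.1%} of commit)")
print(f"NT heap     : {gib(NT_HEAP_BYTES):7.3f} GiB ({nt_heap_share:.1%} of commit)")
```

In the same units the managed heap is a little under 13 GiB, which is still well over half of the committed memory, so the direction of the investigation does not change.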

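2. What happened to the managed heap

To see where the managed memory goes, the powerful PerfView gives a quick read on which GC roots hold the most memory.

[Screenshot: PerfView GC root view]

The PerfView view clearly shows that FinalizerQueue holds almost all of the managed memory. If you know how FinalizerQueue works, you already know where to look next.

Next, use the !fq command to inspect the finalizer queue. Reference output: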


0:000> !fq
SyncBlocks to be cleaned up: 0
Free-Threaded Interfaces to be released: 0
MTA Interfaces to be released: 0
STA Interfaces to be released: 0
----------------------------------
generation 0 has 2722 finalizable objects (0000013f4c737e08->0000013f4c73d318)
generation 1 has 73 finalizable objects (0000013f4c737bc0->0000013f4c737e08)
generation 2 has 20328 finalizable objects (0000013f4c710080->0000013f4c737bc0)
Ready for finalization 34482 objects (0000013f4c73d318->0000013f4c7808a8)
Statistics for all finalizable objects (including all objects ready for finalization):

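The Ready for finalization line above is the freachable section of the finalizer queue, the place the finalizer thread pulls its work from. 34,482 objects are backed up in this section, which indicates that something is wrong with the finalizer thread.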
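As a quick sanity check on those numbers, the sketch below parses the four range lines of the !fq output and compares each reported count with the number of pointer-sized slots in its address range. It assumes a 64-bit dump (8-byte pointers) and one slot per object; the helper names are hypothetical, not part of SOS.

```python
import re
from typing import NamedTuple

# The four range lines copied from the !fq output above.
FQ_OUTPUT = """\
generation 0 has 2722 finalizable objects (0000013f4c737e08->0000013f4c73d318)
generation 1 has 73 finalizable objects (0000013f4c737bc0->0000013f4c737e08)
generation 2 has 20328 finalizable objects (0000013f4c710080->0000013f4c737bc0)
Ready for finalization 34482 objects (0000013f4c73d318->0000013f4c7808a8)
"""

POINTER_SIZE = 8  # assumption: 64-bit process, one pointer slot per object
PATTERN = re.compile(r"(\d+) (?:finalizable )?objects \(([0-9a-f]+)->([0-9a-f]+)\)")


class Section(NamedTuple):
    label: str
    count: int   # count reported by !fq
    slots: int   # slots implied by the address range


def parse_fq(text: str) -> list:
    """Extract one Section per line that carries a count and an address range."""
    sections = []
    for line in text.splitlines():
        m = PATTERN.search(line)
        if m is None:
            continue
        start, end = int(m.group(2), 16), int(m.group(3), 16)
        sections.append(
            Section(line[: m.start()].strip(), int(m.group(1)), (end - start) // POINTER_SIZE)
        )
    return sections


sections = parse_fq(FQ_OUTPUT)
for s in sections:
    print(f"{s.label:<24} count={s.count:>6} slots={s.slots:>6}")

ready = sections[-1].count
pending = sum(s.count for s in sections[:-1])
print(f"ready for finalization: {ready}, not yet ready: {pending}")
```

The ranges line up with the counts, and the backlog waiting for the finalizer thread is larger than all the finalizable objects that are still alive, which is consistent with a thread that has stopped draining the queue.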

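3. What happened to the finalizer thread

To find the finalizer thread, list the threads with !t, switch to the one marked (Finalizer), and then look at its call stack.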


0:000> !t
ThreadCount:      104
UnstartedThread:  0
BackgroundThread: 40
PendingThread:    0
DeadThread:       63
Hosted Runtime:   no
                                                                                                        Lock  
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1 3854 0000013e082beb60    26020 Preemptive  0000013E0F6E63A0:0000013E0F6E79D8 0000013e08293ef0 0     STA 
   5    2  708 0000013e082e7bd0    2b220 Preemptive  0000000000000000:0000000000000000 0000013e08293ef0 0     MTA (Finalizer) 

0:000> ~~[708]s
win32u!NtUserMessageCall+0x14:
00007ff8`6b151124 c3              ret
0:005> k
 # Child-SP          RetAddr               Call Site
00 00000029`dbdfea38 00007ff8`6cce1082     win32u!NtUserMessageCall+0x14
01 00000029`dbdfea40 00007fff`9879b2d0     user32!SendMessageTimeoutW+0x102
02 00000029`dbdfead0 00007fff`985c4dc7     halcon!IOWIN32DumpToTexture+0xc90
03 00000029`dbdfef60 00007fff`974bff0e     halcon!IPGenImaMask+0xae7
04 00000029`dbdfefd0 00007fff`9739d0ca     halcon!HHandleClear+0x10e
05 00000029`dbdff050 00007ff7`f5d5a1a2     halcon!HLIClearHandle+0x2a
06 00000029`dbdff090 00007ff7`f5d5b571     halcondotnet!HalconDotNet.HHandleBase.ClearHandleInternal+0x92
07 00000029`dbdff140 00007ff7`f5ddf865     halcondotnet!HalconDotNet.HHandleBase.Dispose+0x21
08 00000029`dbdff180 00007ff8`542d67b6     halcondotnet!HalconDotNet.HHandleBase.Finalize+0x15
09 00000029`dbdff1c0 00007ff8`544934a1     clr!FastCallFinalizeWorker+0x6
0a 00000029`dbdff1f0 00007ff8`54493429     clr!FastCallFinalize+0x55
0b 00000029`dbdff240 00007ff8`54493358     clr!MethodTable::CallFinalizer+0xb5
0c 00000029`dbdff290 00007ff8`5449318b     clr!CallFinalizer+0x5e
0d 00000029`dbdff2d0 00007ff8`544930a4     clr!FinalizerThread::DoOneFinalization+0x95
0e 00000029`dbdff3b0 00007ff8`544923fa     clr!FinalizerThread::FinalizeAllObjects+0xbf
0f 00000029`dbdff3f0 00007ff8`542d7be8     clr!FinalizerThread::FinalizerThreadWorker+0xba
10 00000029`dbdff440 00007ff8`542d7b53     clr!ManagedThreadBase_DispatchInner+0x40
11 00000029`dbdff480 00007ff8`542d7a92     clr!ManagedThreadBase_DispatchMiddle+0x6c
12 00000029`dbdff580 00007ff8`5441c316     clr!ManagedThreadBase_DispatchOuter+0x4c
13 00000029`dbdff5f0 00007ff8`542dbcc5     clr!FinalizerThread::FinalizerThreadStart+0x116
14 00000029`dbdff690 00007ff8`6b3a7374     clr!Thread::intermediateThreadProc+0x8b
15 00000029`dbdff750 00007ff8`6d35cc91     kernel32!BaseThreadInitThunk+0x14
16 00000029`dbdff780 00000000`00000000     ntdll!RtlUserThreadStart+0x21

0:005> !clrstack
OS Thread Id: 0x708 (5)
        Child SP               IP Call Site
00000029dbdff0b8 00007ff86b151124 [InlinedCallFrame: 00000029dbdff0b8] HalconDotNet.HalconAPI.ClearHandle(IntPtr)
00000029dbdff0b8 00007ff7f5d5a1a2 [InlinedCallFrame: 00000029dbdff0b8] HalconDotNet.HalconAPI.ClearHandle(IntPtr)
00000029dbdff090 00007ff7f5d5a1a2 HalconDotNet.HHandleBase.ClearHandleInternal()
00000029dbdff140 00007ff7f5d5b571 HalconDotNet.HHandleBase.Dispose(Boolean)
00000029dbdff180 00007ff7f5ddf865 HalconDotNet.HHandleBase.Finalize()
00000029dbdff5d0 00007ff8542d67b6 [DebuggerU2MCatchHandlerFrame: 00000029dbdff5d0] 

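Judging from the stacks, this is a nasty trap: releasing a halcon handle has to talk to some window, which is the NtUserMessageCall at the bottom of the stack, reached through user32!SendMessageTimeoutW. The window handle is recorded in the rcx register. The output: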


0:005> r
rax=0000000000001007 rbx=00000029dbdfef10 rcx=00000000000f3736
rdx=000000000000c258 rsi=000000000000c258 rdi=0000000000000000
rip=00007ff86b151124 rsp=00000029dbdfea38 rbp=00007fff985c4ed0
 r8=0000000000000015  r9=0000000000000000 r10=00007fff96d40000
r11=0000000000000000 r12=00000029dbdfefb0 r13=00007fff985c4ed0
r14=0000000000000e20 r15=00000000000f3736
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
win32u!NtUserMessageCall+0x14:
00007ff8`6b151124 c3              ret
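Pulling the handle value out of the register dump above can be sketched as follows. The snippet only parses the text printed by the r command; the variable names are illustrative, and the 8-digit uppercase form is just one convenient way to write the value when searching for it in a window inspection tool.

```python
import re

# General-purpose registers copied from the `r` output above.
R_OUTPUT = """\
rax=0000000000001007 rbx=00000029dbdfef10 rcx=00000000000f3736
rdx=000000000000c258 rsi=000000000000c258 rdi=0000000000000000
rip=00007ff86b151124 rsp=00000029dbdfea38 rbp=00007fff985c4ed0
 r8=0000000000000015  r9=0000000000000000 r10=00007fff96d40000
r11=0000000000000000 r12=00000029dbdfefb0 r13=00007fff985c4ed0
r14=0000000000000e20 r15=00000000000f3736
"""

# Map register name -> integer value for every "name=16 hex digits" pair.
regs = {
    name: int(value, 16)
    for name, value in re.findall(r"\b(r\w+)=([0-9a-f]{16})\b", R_OUTPUT)
}

hwnd = regs["rcx"]
print(f"candidate window handle: {hwnd:#x} ({hwnd:08X})")
print(f"r15 holds the same value: {regs['r15'] == hwnd}")
```

In this dump r15 carries the same value as rcx, which adds some confidence that 0xf3736 really is the handle being messaged.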

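The next question is which window the handle in rcx belongs to. The powerful Spy++ can track that down, as covered in my earlier articles.

[Screenshot: Spy++ window lookup]

Now the whole chain is clear: an unresponsive window blocks the finalizer thread, and that in turn leads to disastrous consequences. In the end I asked my friend to focus on halcon and on what Spy++ reveals about that window.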
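The chain of cause and effect can be illustrated with a small simulation. This is a toy model in Python, not CLR or halcon code: a single worker thread stands in for the finalizer thread, a queue stands in for the freachable section, and an event stands in for the window answering the message. All names are hypothetical, and the indefinite wait is a simplification, since the dump alone does not show how long each real call waits.

```python
import queue
import threading

freachable = queue.Queue()              # stand-in for the freachable section
window_responds = threading.Event()     # unset while the window is unresponsive
first_call_started = threading.Event()  # lets the main thread observe the stall
finalized = []


def finalize(obj_id: int) -> None:
    """Stand-in for a finalizer that must message a window before returning."""
    first_call_started.set()
    window_responds.wait()              # blocks for as long as the window is hung
    finalized.append(obj_id)


def finalizer_thread() -> None:
    """Single consumer: one stuck finalizer stalls every object behind it."""
    while True:
        obj_id = freachable.get()
        if obj_id is None:              # sentinel: stop the worker
            return
        finalize(obj_id)


worker = threading.Thread(target=finalizer_thread, daemon=True)
worker.start()

# The application keeps producing finalizable objects while the window is hung.
for obj_id in range(1000):
    freachable.put(obj_id)

first_call_started.wait(timeout=5)
backlog_while_hung = freachable.qsize()
finalized_while_hung = len(finalized)
print(f"while hung: backlog={backlog_while_hung}, finalized={finalized_while_hung}")

# Once the window answers, the queue drains normally.
window_responds.set()
freachable.put(None)
worker.join(timeout=5)
print(f"after recovery: backlog={freachable.qsize()}, finalized={len(finalized)}")
```

Because there is only one consumer, nothing behind the stuck object is ever processed, and everything those objects reference stays alive. That is the same shape as the 34,482-object backlog seen in the dump.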

III: Summary

As a debugger, make good use of multiple analysis tools; combining them often gets twice the result with half the effort.

posted @ 2026-01-07 09:20  一线码农