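A Thoroughly Satisfying Debugging Session (How the C++ Standard Library Implements Exceptions)
PS: Please credit the source when reposting; all rights reserved by the author.
PS: This reflects only my own understanding.
If it conflicts with your principles or views, please bear with me and be kind.
Environment
None
Preface
While integrating and customizing the llama.cpp project I did a lot of work and ran into many problems. Most were solved quickly and a few took some time, but two related problems left the deepest impression. I am writing this article to document them and dig into their root cause. While writing it I also submitted a related PR to llama.cpp (https://github.com/ggml-org/llama.cpp/pull/17653).
Consider the following code sample: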
try{
{
// ... ...
if (!std::filesystem::exists("/bugreports"))
// ... ...
}
{
std::filesystem::directory_iterator dir_it("/", std::filesystem::directory_options::skip_permission_denied);
for (const auto & entry : dir_it) {
// ... ...
}
// ... ...
}
return ;
}
catch (const std::exception& e){
printf("exception: %s\n", e.what());
return ;
}
catch(...){
printf("Fatal Error, Unkown exception\n");
return ;
}
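With this code, on the same runtime environment (software and hardware) but under different build settings, all three code blocks (the try body and both catch blocks) end up being taken, which gave me quite a headache. The output of the two catch blocks is: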
exception: filesystem error: in posix_stat: failed to determine attributes for the specified path: Permission denied ["/bugreports"]
Fatal Error, Unkown exception
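These three code blocks correspond to the following questions:
- Why does the same code on the same device take three different branches under different conditions? In particular, when does it run normally and when does it throw?
- When do std::filesystem::exists and std::filesystem::directory_iterator throw?
- For the exceptions they throw, why do the catch paths differ (that is, whether the filesystem error is caught as std::exception)?
The sections below analyze these questions in turn, using std::filesystem::exists as the example.
Initial analysis
Why does the same code on the same device run normally or throw depending on build settings?
In my case, testing showed that two build.gradle configurations, [compileSdk = 34, minSdk = 26, ndk = 26] and [compileSdk = 34, minSdk = 34, ndk = 26], produce different results: with minSdk = 26 the code throws, and with minSdk = 34 it runs normally.
From this we can form a guess (very likely correct): the NDK version is the same, so the standard library implementation is the same. The difference must therefore come from the build settings causing the POSIX API to return a different result when we access /bugreports.
I am not familiar with Android's low-level internals, so I did not dig into why the POSIX API returns differently. Maybe another time.
Next, let's look at the standard library implementation of std::filesystem::exists to see where the exception comes from.
When does std::filesystem::exists throw?
The reference at https://en.cppreference.com/w/cpp/filesystem/exists.html gives these declarations: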
bool exists( std::filesystem::file_status s ) noexcept; (1) (since C++17)
bool exists( const std::filesystem::path& p ); (2) (since C++17)
bool exists( const std::filesystem::path& p, std::error_code& ec ) noexcept; (3) (since C++17)
/*
Exceptions
Any overload not marked noexcept may throw std::bad_alloc if memory allocation fails.
2) Throws std::filesystem::filesystem_error on underlying OS API errors, constructed with p as the first path argument and the OS error code as the error code argument.
*/
So for the usage above (overload 2), an error from the underlying OS API results in an exception, which is exactly what the standard specifies.
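As a small illustration of the difference between overloads (2) and (3), here is a minimal sketch of a non-throwing check. The helper name exists_no_throw is hypothetical (it is not from llama.cpp or libc++); it only forwards to the standard error_code overload.

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>

// Hypothetical helper: wraps overload (3) of std::filesystem::exists.
// OS errors such as "Permission denied" are reported through `ec`
// instead of being thrown as std::filesystem::filesystem_error.
bool exists_no_throw(const std::filesystem::path& p, std::error_code& ec) noexcept {
    return std::filesystem::exists(p, ec);
}
```

A caller can then inspect ec (for example, compare it with std::errc::permission_denied) and decide what to do, rather than relying on a catch block.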
Here is the actual implementation of exists in libcxx:
inline _LIBCPP_HIDE_FROM_ABI bool exists(const path& __p) { return exists(__status(__p)); }
_LIBCPP_EXPORTED_FROM_ABI file_status __status(const path&, error_code* __ec = nullptr);
file_status __status(const path& p, error_code* ec) { return detail::posix_stat(p, ec); }
inline file_status posix_stat(path const& p, error_code* ec) {
StatT path_stat;
return posix_stat(p, path_stat, ec);
}
inline file_status posix_stat(path const& p, StatT& path_stat, error_code* ec) {
error_code m_ec;
if (detail::stat(p.c_str(), &path_stat) == -1)
m_ec = detail::capture_errno();
return create_file_status(m_ec, p, path_stat, ec);
}
namespace detail {
using ::stat; //<sys/stat.h>
} // end namespace detail
inline file_status create_file_status(error_code& m_ec, path const& p, const StatT& path_stat, error_code* ec) {
if (ec)
*ec = m_ec;
if (m_ec && (m_ec.value() == ENOENT || m_ec.value() == ENOTDIR)) {
return file_status(file_type::not_found);
} else if (m_ec) {
ErrorHandler<void> err("posix_stat", ec, &p);
err.report(m_ec, "failed to determine attributes for the specified path");
return file_status(file_type::none);
}
// ... ... other code
}
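So the root cause of exists() throwing is that detail::stat fails with Permission denied. create_file_status maps only ENOENT and ENOTDIR to "not found"; any other error goes to ErrorHandler::report, which throws because no error_code was supplied.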
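The decision made in create_file_status can be sketched with plain POSIX calls. This is a simplified model, not the libc++ code: the enum StatOutcome and the function classify_path are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/stat.h>

// Hypothetical three-way result mirroring the branches in create_file_status.
enum class StatOutcome { Found, NotFound, Error };

// Calls ::stat and classifies the result the same way the excerpt above does:
// ENOENT/ENOTDIR mean "not found"; every other errno is a real error
// (this is the branch where libc++ throws if no error_code was passed).
StatOutcome classify_path(const char* path, int* saved_errno = nullptr) {
    struct stat st;
    if (::stat(path, &st) == 0)
        return StatOutcome::Found;
    const int err = errno;  // capture immediately, before any other call
    if (saved_errno)
        *saved_errno = err;
    if (err == ENOENT || err == ENOTDIR)
        return StatOutcome::NotFound;
    return StatOutcome::Error;  // e.g. EACCES ("Permission denied")
}
```

On the device from this article, /bugreports would presumably land in the Error branch under the minSdk = 26 build; that is an inference from the log output above, not something this sketch verifies.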
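Why is the std::filesystem::filesystem_error caught in different places?
Going back over the whole build with the minimal test code above, I found the following:
- On Android, when the code lives in a shared object and -Wl,--version-script is enabled, the vtable and typeinfo symbols are not exported.
- Building the same sample on x86 with -Wl,--version-script enabled, the vtable and typeinfo symbols are still exported by default.
This puzzled me. After checking the compiler, the linker, the compile and link flags, and the symbols, I finally found something odd in one place:
# readelf -sW build/libnativelib.so|grep fs10filesystem16filesystem_errorE
# This .so catches the exception in catch (const std::exception& e); nm -CD also lists the fs10filesystem16filesystem_errorE symbols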
12: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND _ZTINSt6__ndk14__fs10filesystem16filesystem_errorE
18: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND _ZTVNSt6__ndk14__fs10filesystem16filesystem_errorE
235: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND _ZTINSt6__ndk14__fs10filesystem16filesystem_errorE
241: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND _ZTVNSt6__ndk14__fs10filesystem16filesystem_errorE
# This .so can only catch the exception in catch(...); nm -CD lists no fs10filesystem16filesystem_errorE symbols
393: 0000000000036340 24 OBJECT LOCAL DEFAULT 17 _ZTINSt6__ndk14__fs10filesystem16filesystem_errorE
395: 0000000000036318 40 OBJECT LOCAL DEFAULT 17 _ZTVNSt6__ndk14__fs10filesystem16filesystem_errorE
410: 000000000000ad5a 47 OBJECT LOCAL DEFAULT 11 _ZTSNSt6__ndk14__fs10filesystem16filesystem_errorE
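From this we can see that in the working .so the typeinfo/vtable symbols are GLOBAL and undefined (UND), so their definitions should come from libc++.so or libstdc++.so. In the failing .so the typeinfo/vtable symbols are LOCAL and already defined.
After some searching, the difference turned out to be that ANDROID_STL defaults to c++_static under CMake (https://developer.android.com/ndk/guides/cpp-support?hl=zh-cn#selecting_a_c_runtime). In that mode the C++ standard library is linked into my .so as a static library, so its symbols are local. Switching to c++_shared resolved the inconsistent catch paths.
In addition, when I kept building with c++_static but manually made the typeinfo/vtable symbols resolve against libc++.so or libstdc++.so, the exception was also caught correctly.
So far we have only located where the problem comes from. We have not answered why only catch(...) catches the exception when nm -CD shows no typeinfo/vtable symbols for fs10filesystem16filesystem_errorE. To answer that we need a first look at how the C++ exception mechanism is implemented.
A brief look at how C++ exceptions are implemented
To stay close to my scenario and keep debugging easy, and because exception implementations differ between ABIs, the analysis below uses clang on x64 (Itanium C++ ABI).
First, what assembly does throwing an exception turn into?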
extern "C" __attribute__((visibility("default"))) void pp()
{
throw std::runtime_error("test_exception");
}
0x00007ffff7f9a380 <+0>: push %rbp
0x00007ffff7f9a381 <+1>: mov %rsp,%rbp
0x00007ffff7f9a384 <+4>: sub $0x20,%rsp
0x00007ffff7f9a388 <+8>: mov $0x10,%edi
=> 0x00007ffff7f9a38d <+13>: call 0x7ffff7fb48e0 <__cxa_allocate_exception>
0x00007ffff7f9a392 <+18>: mov %rax,%rdi
0x00007ffff7f9a395 <+21>: mov %rdi,%rax
0x00007ffff7f9a398 <+24>: mov %rax,-0x18(%rbp)
0x00007ffff7f9a39c <+28>: lea -0x902d(%rip),%rsi # 0x7ffff7f91376
0x00007ffff7f9a3a3 <+35>: call 0x7ffff7fb5e80 <_ZNSt13runtime_errorC2EPKc>
0x00007ffff7f9a3a8 <+40>: jmp 0x7ffff7f9a3ad <pp()+45>
0x00007ffff7f9a3ad <+45>: mov -0x18(%rbp),%rdi
0x00007ffff7f9a3b1 <+49>: lea 0x1d158(%rip),%rsi # 0x7ffff7fb7510 <_ZTISt13runtime_error>
0x00007ffff7f9a3b8 <+56>: lea 0xb1(%rip),%rdx # 0x7ffff7f9a470 <_ZNSt15underflow_errorD2Ev>
0x00007ffff7f9a3bf <+63>: call 0x7ffff7fb4b00 <__cxa_throw>
0x00007ffff7f9a3c4 <+68>: mov -0x18(%rbp),%rdi
0x00007ffff7f9a3c8 <+72>: mov %rax,%rcx
0x00007ffff7f9a3cb <+75>: mov %edx,%eax
0x00007ffff7f9a3cd <+77>: mov %rcx,-0x8(%rbp)
0x00007ffff7f9a3d1 <+81>: mov %eax,-0xc(%rbp)
0x00007ffff7f9a3d4 <+84>: call 0x7ffff7fb49c0 <__cxa_free_exception>
0x00007ffff7f9a3d9 <+89>: mov -0x8(%rbp),%rdi
0x00007ffff7f9a3dd <+93>: call 0x7ffff7fb6160 <_Unwind_Resume@plt>
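The code first calls __cxa_allocate_exception to allocate storage for the exception object away from the stack (so that unwinding the stack cannot disturb it), then constructs the std::runtime_error in that storage (in effect a placement new), and finally calls __cxa_throw to start stack unwinding and search for a handler. The standard's description of this process is summarized at https://en.cppreference.com/w/cpp/language/throw.html.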
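The three steps in the disassembly can be written out by hand through the declarations in <cxxabi.h>. This is only a sketch for an Itanium-ABI toolchain (gcc or clang on Linux); the function names manual_throw, destroy_runtime_error and caught_as_runtime_error are hypothetical, and real compiler output additionally calls __cxa_free_exception if the constructor throws.

```cpp
#include <cassert>
#include <cstring>
#include <cxxabi.h>
#include <new>
#include <stdexcept>
#include <typeinfo>

// Destructor thunk handed to __cxa_throw (the third argument in the listing).
static void destroy_runtime_error(void* p) {
    static_cast<std::runtime_error*>(p)->~runtime_error();
}

// Hand-written equivalent of `throw std::runtime_error(msg);`.
void manual_throw(const char* msg) {
    // 1. Storage for the exception object, owned by the C++ runtime.
    void* storage = abi::__cxa_allocate_exception(sizeof(std::runtime_error));
    // 2. Construct the object in place.
    new (storage) std::runtime_error(msg);
    // 3. Hand object, typeinfo and destructor to the runtime; does not return.
    abi::__cxa_throw(storage,
                     const_cast<std::type_info*>(&typeid(std::runtime_error)),
                     destroy_runtime_error);
}

// Checks that an ordinary handler catches the manually thrown object.
bool caught_as_runtime_error() {
    try {
        manual_throw("test_exception");
    } catch (const std::runtime_error& e) {
        return std::strcmp(e.what(), "test_exception") == 0;
    } catch (...) {
        return false;
    }
    return false;
}
```

Note that the second argument is the address of a typeinfo object, which is the same piece of data that later turns out to matter for handler matching.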
Next, let's read the source of __cxa_throw to see how libc++ implements unwinding.
libcxxabi\src\cxa_exception.cpp
void
__cxa_throw(void *thrown_object, std::type_info *tinfo, void (_LIBCXXABI_DTOR_FUNC *dest)(void *)) {
__cxa_eh_globals *globals = __cxa_get_globals();
__cxa_exception* exception_header = cxa_exception_from_thrown_object(thrown_object);
exception_header->unexpectedHandler = std::get_unexpected();
exception_header->terminateHandler = std::get_terminate();
exception_header->exceptionType = tinfo;
exception_header->exceptionDestructor = dest;
setOurExceptionClass(&exception_header->unwindHeader);
exception_header->referenceCount = 1; // This is a newly allocated exception, no need for thread safety.
globals->uncaughtExceptions += 1; // Not atomically, since globals are thread-local
exception_header->unwindHeader.exception_cleanup = exception_cleanup_func;
#if __has_feature(address_sanitizer)
// Inform the ASan runtime that now might be a good time to clean stuff up.
__asan_handle_no_return();
#endif
#ifdef __USING_SJLJ_EXCEPTIONS__
_Unwind_SjLj_RaiseException(&exception_header->unwindHeader);
#else
_Unwind_RaiseException(&exception_header->unwindHeader);
#endif
// This only happens when there is no handler, or some unexpected unwinding
// error happens.
failed_throw(exception_header);
}
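The three parameters are the std::runtime_error object just constructed, the typeinfo of the exception object, and the destructor for std::runtime_error. The function then starts unwinding the stack using whichever unwinding implementation is in use. One detail is worth noting: exceptionType is clearly related to the problem in this article. If the corresponding typeinfo is not exported, the exception may well fail to match elsewhere.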
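Because the runtime records the typeinfo in the exception header, a catch(...) block can ask which type is actually in flight. The sketch below uses abi::__cxa_current_exception_type and abi::__cxa_demangle from <cxxabi.h>; the helper names current_exception_type_name and probe_runtime_error are hypothetical. A check like this would be one way to confirm that an exception landing in catch(...) really is a filesystem_error.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <stdexcept>
#include <string>
#include <typeinfo>

// Returns the demangled type name of the exception currently being handled,
// or "<none>" when called outside a handler.
std::string current_exception_type_name() {
    std::type_info* ti = abi::__cxa_current_exception_type();
    if (ti == nullptr)
        return "<none>";
    int status = 0;
    char* demangled = abi::__cxa_demangle(ti->name(), nullptr, nullptr, &status);
    std::string result = (status == 0 && demangled != nullptr) ? demangled : ti->name();
    std::free(demangled);  // __cxa_demangle allocates with malloc
    return result;
}

// Example use from inside a catch-all handler.
std::string probe_runtime_error() {
    try {
        throw std::runtime_error("x");
    } catch (...) {
        return current_exception_type_name();
    }
}
```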
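One more detail: there are roughly three common exception models, SJLJ (setjmp/longjmp), DWARF, and SEH (Windows). Linux-like systems currently use the DWARF-based model.
Following the execution flow, we look at the implementation of _Unwind_RaiseException next.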
libunwind\src\UnwindLevel1.c
/// Called by __cxa_throw. Only returns if there is a fatal error.
_LIBUNWIND_EXPORT _Unwind_Reason_Code
_Unwind_RaiseException(_Unwind_Exception *exception_object) {
_LIBUNWIND_TRACE_API("_Unwind_RaiseException(ex_obj=%p)",
static_cast<void *>(exception_object));
unw_context_t uc;
unw_cursor_t cursor;
__unw_getcontext(&uc);
// This field for is for compatibility with GCC to say this isn't a forced
// unwind. EHABI #7.2
exception_object->unwinder_cache.reserved1 = 0;
// phase 1: the search phase
_Unwind_Reason_Code phase1 = unwind_phase1(&uc, &cursor, exception_object);
if (phase1 != _URC_NO_REASON)
return phase1;
// phase 2: the clean up phase
return unwind_phase2(&uc, &cursor, exception_object, false);
}
So unwinding is split into two phases, phase1 and phase2, which the comments describe as search and cleanup. Let's look at unwind_phase1 first, followed by unwind_phase2.
libunwind\src\UnwindLevel1.c
static _Unwind_Reason_Code
unwind_phase1(unw_context_t *uc, unw_cursor_t *cursor, _Unwind_Exception *exception_object) {
__unw_init_local(cursor, uc);
// Walk each frame looking for a place to stop.
while (true) {
// Ask libunwind to get next frame (skip over first which is
// _Unwind_RaiseException).
int stepResult = __unw_step(cursor);
// ... ...
// See if frame has code to run (has personality routine).
unw_proc_info_t frameInfo;
unw_word_t sp;
if (__unw_get_proc_info(cursor, &frameInfo) != UNW_ESUCCESS) {
// ... ...
}
// ... ...
// If there is a personality routine, ask it if it will want to stop at
// this frame.
if (frameInfo.handler != 0) {
_Unwind_Personality_Fn p =
(_Unwind_Personality_Fn)(uintptr_t)(frameInfo.handler);
_LIBUNWIND_TRACE_UNWINDING(
"unwind_phase1(ex_ojb=%p): calling personality function %p",
(void *)exception_object, (void *)(uintptr_t)p);
_Unwind_Reason_Code personalityResult =
(*p)(1, _UA_SEARCH_PHASE, exception_object->exception_class,
exception_object, (struct _Unwind_Context *)(cursor));
switch (personalityResult) {
case _URC_HANDLER_FOUND:
// found a catch clause or locals that need destructing in this frame
// stop search and remember stack pointer at the frame
__unw_get_reg(cursor, UNW_REG_SP, &sp);
exception_object->private_2 = (uintptr_t)sp;
_LIBUNWIND_TRACE_UNWINDING(
"unwind_phase1(ex_ojb=%p): _URC_HANDLER_FOUND",
(void *)exception_object);
return _URC_NO_REASON;
case _URC_CONTINUE_UNWIND:
_LIBUNWIND_TRACE_UNWINDING(
"unwind_phase1(ex_ojb=%p): _URC_CONTINUE_UNWIND",
(void *)exception_object);
// continue unwinding
break;
default:
// something went wrong
_LIBUNWIND_TRACE_UNWINDING(
"unwind_phase1(ex_ojb=%p): _URC_FATAL_PHASE1_ERROR",
(void *)exception_object);
return _URC_FATAL_PHASE1_ERROR;
}
}
}
return _URC_NO_REASON;
}
static _Unwind_Reason_Code
unwind_phase2(unw_context_t *uc, unw_cursor_t *cursor, _Unwind_Exception *exception_object) {
__unw_init_local(cursor, uc);
_LIBUNWIND_TRACE_UNWINDING("unwind_phase2(ex_ojb=%p)",
(void *)exception_object);
// uc is initialized by __unw_getcontext in the parent frame. The first stack
// frame walked is unwind_phase2.
unsigned framesWalked = 1;
// Walk each frame until we reach where search phase said to stop.
while (true) {
// Ask libunwind to get next frame (skip over first which is
// _Unwind_RaiseException).
int stepResult = __unw_step(cursor);
// ... ...
// Get info about this frame.
unw_word_t sp;
unw_proc_info_t frameInfo;
__unw_get_reg(cursor, UNW_REG_SP, &sp);
if (__unw_get_proc_info(cursor, &frameInfo) != UNW_ESUCCESS) {
// ... ...
}
// ... ...
++framesWalked;
// If there is a personality routine, tell it we are unwinding.
if (frameInfo.handler != 0) {
_Unwind_Personality_Fn p =
(_Unwind_Personality_Fn)(uintptr_t)(frameInfo.handler);
_Unwind_Action action = _UA_CLEANUP_PHASE;
if (sp == exception_object->private_2) {
// Tell personality this was the frame it marked in phase 1.
action = (_Unwind_Action)(_UA_CLEANUP_PHASE | _UA_HANDLER_FRAME);
}
_Unwind_Reason_Code personalityResult =
(*p)(1, action, exception_object->exception_class, exception_object,
(struct _Unwind_Context *)(cursor));
switch (personalityResult) {
case _URC_CONTINUE_UNWIND:
// Continue unwinding
_LIBUNWIND_TRACE_UNWINDING(
"unwind_phase2(ex_ojb=%p): _URC_CONTINUE_UNWIND",
(void *)exception_object);
if (sp == exception_object->private_2) {
// Phase 1 said we would stop at this frame, but we did not...
_LIBUNWIND_ABORT("during phase1 personality function said it would "
"stop here, but now in phase2 it did not stop here");
}
break;
case _URC_INSTALL_CONTEXT:
_LIBUNWIND_TRACE_UNWINDING(
"unwind_phase2(ex_ojb=%p): _URC_INSTALL_CONTEXT",
(void *)exception_object);
// Personality routine says to transfer control to landing pad.
// We may get control back if landing pad calls _Unwind_Resume().
if (_LIBUNWIND_TRACING_UNWINDING) {
unw_word_t pc;
__unw_get_reg(cursor, UNW_REG_IP, &pc);
__unw_get_reg(cursor, UNW_REG_SP, &sp);
_LIBUNWIND_TRACE_UNWINDING("unwind_phase2(ex_ojb=%p): re-entering "
"user code with ip=0x%" PRIxPTR
", sp=0x%" PRIxPTR,
(void *)exception_object, pc, sp);
}
__unw_phase2_resume(cursor, framesWalked);
// __unw_phase2_resume() only returns if there was an error.
return _URC_FATAL_PHASE2_ERROR;
default:
// Personality routine returned an unknown result code.
_LIBUNWIND_DEBUG_LOG("personality function returned unknown result %d",
personalityResult);
return _URC_FATAL_PHASE2_ERROR;
}
}
}
// Clean up phase did not resume at the frame that the search phase
// said it would...
return _URC_FATAL_PHASE2_ERROR;
}
The code is clear: it obtains the information for the current stack frame, converts frameInfo.handler to an _Unwind_Personality_Fn, and calls it. There are two cases:
- unwind_phase1: with action=_UA_SEARCH_PHASE, the personality function searches for a catch block. When a handler is found it returns _URC_HANDLER_FOUND, and the frame's stack pointer is saved in exception_object->private_2 for use in phase 2.
- unwind_phase2: when exception_object->private_2 == sp, action=(_UA_CLEANUP_PHASE | _UA_HANDLER_FRAME). The personality function sets up the corresponding catch block and returns _URC_INSTALL_CONTEXT, after which __unw_phase2_resume transfers control to the handler.
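The two cases above can be modelled with a toy that has nothing to do with real stack frames. Everything here (ToyFrame, ToyUnwindResult, toy_raise) is hypothetical and only mirrors the control flow: phase 1 searches without touching anything, and phase 2 runs cleanups only up to the frame that phase 1 selected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One simulated stack frame.
struct ToyFrame {
    bool has_handler;  // frame contains a catch clause that matches
    bool has_cleanup;  // frame has locals whose destructors must run
};

struct ToyUnwindResult {
    bool handled;                           // did phase 1 find a handler?
    std::size_t handler_frame;              // index chosen in phase 1
    std::vector<std::size_t> cleanups_run;  // frames cleaned up in phase 2
};

// frames[0] is the innermost (throwing) frame.
ToyUnwindResult toy_raise(const std::vector<ToyFrame>& frames) {
    ToyUnwindResult result{false, 0, {}};
    // Phase 1 (search): walk outward until some frame claims the exception.
    std::size_t found = 0;
    for (; found < frames.size(); ++found) {
        if (frames[found].has_handler)
            break;
    }
    if (found == frames.size())
        return result;  // no handler anywhere: phase 2 never starts
    result.handled = true;
    result.handler_frame = found;  // plays the role of private_2
    // Phase 2 (cleanup): walk again, running cleanups below the handler frame.
    for (std::size_t i = 0; i < found; ++i) {
        if (frames[i].has_cleanup)
            result.cleanups_run.push_back(i);
    }
    return result;
}
```

The toy also shows a consequence of the two-phase design: when phase 1 finds no handler, no cleanup in any frame has run yet.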
In addition, __unw_init_local performs a very important step: it locates .eh_frame. A brief look at the code path:
inline bool LocalAddressSpace::findUnwindSections(pint_t targetAddr,
UnwindInfoSections &info) {
// ... ...
info.dso_base = 0;
// Bare metal is statically linked, so no need to ask the dynamic loader
info.dwarf_section_length = (size_t)(&__eh_frame_end - &__eh_frame_start);
info.dwarf_section = (uintptr_t)(&__eh_frame_start);
// ... ...
}
template <typename A, typename R>
void UnwindCursor<A, R>::setInfoBasedOnIPRegister(bool isReturnAddress) {
// ... ...
// Ask address space object to find unwind sections for this pc.
UnwindInfoSections sects;
if (_addressSpace.findUnwindSections(pc, sects))
// ... ...
}
// template <typename A, typename R>
// int UnwindCursor<A, R>::step() {
// // ... ...
// this->setInfoBasedOnIPRegister(true);
// // ... ...
// }
_LIBUNWIND_HIDDEN int __unw_init_local(unw_cursor_t *cursor,
unw_context_t *context) {
// ... ...
// Use "placement new" to allocate UnwindCursor in the cursor buffer.
new (reinterpret_cast<UnwindCursor<LocalAddressSpace, REGISTER_KIND> *>(cursor))
UnwindCursor<LocalAddressSpace, REGISTER_KIND>(
context, LocalAddressSpace::sThisAddressSpace);
#undef REGISTER_KIND
AbstractUnwindCursor *co = (AbstractUnwindCursor *)cursor;
co->setInfoBasedOnIPRegister();
return UNW_ESUCCESS;
}
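The same frame-by-frame walk is also reachable from user code through _Unwind_Backtrace in <unwind.h>, which relies on the same unwind tables. The sketch below only counts frames; count_frame and count_stack_frames are hypothetical names, and the exact count depends on the build (inlining, optimization level).

```cpp
#include <cassert>
#include <unwind.h>

// Callback invoked once per stack frame by _Unwind_Backtrace.
static _Unwind_Reason_Code count_frame(struct _Unwind_Context* context, void* arg) {
    int* counter = static_cast<int*>(arg);
    if (_Unwind_GetIP(context) != 0)  // skip frames without a usable IP
        ++*counter;
    return _URC_NO_REASON;            // keep walking
}

// Counts the frames the unwinder can see from the current call site.
int count_stack_frames() {
    int frames = 0;
    _Unwind_Backtrace(count_frame, &frames);
    return frames;
}
```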
The _Unwind_Personality_Fn interface is defined by the Itanium C++ ABI; see https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html#cxx-throw. Its job is the C++-specific part of stack unwinding. In gcc/clang the function is called __gxx_personality_v0, so let's go straight to its source.
libcxxabi\src\cxa_personality.cpp
#if !defined(_LIBCXXABI_ARM_EHABI)
#if defined(__SEH__) && !defined(__USING_SJLJ_EXCEPTIONS__)
static _Unwind_Reason_Code __gxx_personality_imp
#else
_LIBCXXABI_FUNC_VIS _Unwind_Reason_Code
#ifdef __USING_SJLJ_EXCEPTIONS__
__gxx_personality_sj0
#elif defined(__MVS__)
__zos_cxx_personality_v2
#else
__gxx_personality_v0
#endif
#endif
(int version, _Unwind_Action actions, uint64_t exceptionClass,
_Unwind_Exception* unwind_exception, _Unwind_Context* context)
{
if (version != 1 || unwind_exception == 0 || context == 0)
return _URC_FATAL_PHASE1_ERROR;
bool native_exception = (exceptionClass & get_vendor_and_language) ==
(kOurExceptionClass & get_vendor_and_language);
scan_results results;
// Process a catch handler for a native exception first.
if (actions == (_UA_CLEANUP_PHASE | _UA_HANDLER_FRAME) &&
native_exception) {
// Reload the results from the phase 1 cache.
__cxa_exception* exception_header =
(__cxa_exception*)(unwind_exception + 1) - 1;
results.ttypeIndex = exception_header->handlerSwitchValue;
results.actionRecord = exception_header->actionRecord;
results.languageSpecificData = exception_header->languageSpecificData;
results.landingPad =
reinterpret_cast<uintptr_t>(exception_header->catchTemp);
results.adjustedPtr = exception_header->adjustedPtr;
// Jump to the handler.
set_registers(unwind_exception, context, results);
// Cache base for calculating the address of ttype in
// __cxa_call_unexpected.
if (results.ttypeIndex < 0) {
#if defined(_AIX)
exception_header->catchTemp = (void *)_Unwind_GetDataRelBase(context);
#else
exception_header->catchTemp = 0;
#endif
}
return _URC_INSTALL_CONTEXT;
}
// In other cases we need to scan LSDA.
scan_eh_tab(results, actions, native_exception, unwind_exception, context);
if (results.reason == _URC_CONTINUE_UNWIND ||
results.reason == _URC_FATAL_PHASE1_ERROR)
return results.reason;
if (actions & _UA_SEARCH_PHASE)
{
// Phase 1 search: All we're looking for in phase 1 is a handler that
// halts unwinding
assert(results.reason == _URC_HANDLER_FOUND);
if (native_exception) {
// For a native exception, cache the LSDA result.
__cxa_exception* exc = (__cxa_exception*)(unwind_exception + 1) - 1;
exc->handlerSwitchValue = static_cast<int>(results.ttypeIndex);
exc->actionRecord = results.actionRecord;
exc->languageSpecificData = results.languageSpecificData;
exc->catchTemp = reinterpret_cast<void*>(results.landingPad);
exc->adjustedPtr = results.adjustedPtr;
}
return _URC_HANDLER_FOUND;
}
assert(actions & _UA_CLEANUP_PHASE);
assert(results.reason == _URC_HANDLER_FOUND);
set_registers(unwind_exception, context, results);
// Cache base for calculating the address of ttype in __cxa_call_unexpected.
if (results.ttypeIndex < 0) {
__cxa_exception* exception_header =
(__cxa_exception*)(unwind_exception + 1) - 1;
#if defined(_AIX)
exception_header->catchTemp = (void *)_Unwind_GetDataRelBase(context);
#else
exception_header->catchTemp = 0;
#endif
}
return _URC_INSTALL_CONTEXT;
}
Looking at this code as a whole, both phase1 and phase2 end up here:
- phase1, action=_UA_SEARCH_PHASE: calls scan_eh_tab to find a catch block and returns _URC_HANDLER_FOUND.
- phase2, action=(_UA_CLEANUP_PHASE | _UA_HANDLER_FRAME): uses set_registers to set up the catch block and returns _URC_INSTALL_CONTEXT; __unw_phase2_resume then runs the catch block.
scan_eh_tab is the core of this implementation: it is where the handler search and matching happen. Its source is:
static void scan_eh_tab(scan_results &results, _Unwind_Action actions,
bool native_exception,
_Unwind_Exception *unwind_exception,
_Unwind_Context *context) {
// Initialize results to found nothing but an error
results.ttypeIndex = 0;
results.actionRecord = 0;
results.languageSpecificData = 0;
results.landingPad = 0;
results.adjustedPtr = 0;
results.reason = _URC_FATAL_PHASE1_ERROR;
// Check for consistent actions
// ... ...
// Start scan by getting exception table address.
const uint8_t *lsda = (const uint8_t *)_Unwind_GetLanguageSpecificData(context);
if (lsda == 0)
{
// There is no exception table
results.reason = _URC_CONTINUE_UNWIND;
return;
}
results.languageSpecificData = lsda;
#if defined(_AIX)
uintptr_t base = _Unwind_GetDataRelBase(context);
#else
uintptr_t base = 0;
#endif
// Get the current instruction pointer and offset it before next
// instruction in the current frame which threw the exception.
uintptr_t ip = _Unwind_GetIP(context) - 1;
// Get beginning current frame's code (as defined by the
// emitted dwarf code)
uintptr_t funcStart = _Unwind_GetRegionStart(context);
#ifdef __USING_SJLJ_EXCEPTIONS__
if (ip == uintptr_t(-1))
{
// no action
results.reason = _URC_CONTINUE_UNWIND;
return;
}
else if (ip == 0)
call_terminate(native_exception, unwind_exception);
// ip is 1-based index into call site table
#else // !__USING_SJLJ_EXCEPTIONS__
uintptr_t ipOffset = ip - funcStart;
#endif // !defined(_USING_SLJL_EXCEPTIONS__)
const uint8_t* classInfo = NULL;
// Note: See JITDwarfEmitter::EmitExceptionTable(...) for corresponding
// dwarf emission
// Parse LSDA header.
uint8_t lpStartEncoding = *lsda++;
const uint8_t* lpStart =
(const uint8_t*)readEncodedPointer(&lsda, lpStartEncoding, base);
if (lpStart == 0)
lpStart = (const uint8_t*)funcStart;
uint8_t ttypeEncoding = *lsda++;
if (ttypeEncoding != DW_EH_PE_omit)
{
// Calculate type info locations in emitted dwarf code which
// were flagged by type info arguments to llvm.eh.selector
// intrinsic
uintptr_t classInfoOffset = readULEB128(&lsda);
classInfo = lsda + classInfoOffset;
}
// Walk call-site table looking for range that
// includes current PC.
uint8_t callSiteEncoding = *lsda++;
#ifdef __USING_SJLJ_EXCEPTIONS__
(void)callSiteEncoding; // When using SjLj exceptions, callSiteEncoding is never used
#endif
uint32_t callSiteTableLength = static_cast<uint32_t>(readULEB128(&lsda));
const uint8_t* callSiteTableStart = lsda;
const uint8_t* callSiteTableEnd = callSiteTableStart + callSiteTableLength;
const uint8_t* actionTableStart = callSiteTableEnd;
const uint8_t* callSitePtr = callSiteTableStart;
while (callSitePtr < callSiteTableEnd)
{
// There is one entry per call site.
#ifndef __USING_SJLJ_EXCEPTIONS__
// The call sites are non-overlapping in [start, start+length)
// The call sites are ordered in increasing value of start
uintptr_t start = readEncodedPointer(&callSitePtr, callSiteEncoding);
uintptr_t length = readEncodedPointer(&callSitePtr, callSiteEncoding);
uintptr_t landingPad = readEncodedPointer(&callSitePtr, callSiteEncoding);
uintptr_t actionEntry = readULEB128(&callSitePtr);
if ((start <= ipOffset) && (ipOffset < (start + length)))
#else // __USING_SJLJ_EXCEPTIONS__
// ip is 1-based index into this table
uintptr_t landingPad = readULEB128(&callSitePtr);
uintptr_t actionEntry = readULEB128(&callSitePtr);
if (--ip == 0)
#endif // __USING_SJLJ_EXCEPTIONS__
{
// Found the call site containing ip.
#ifndef __USING_SJLJ_EXCEPTIONS__
if (landingPad == 0)
{
// No handler here
results.reason = _URC_CONTINUE_UNWIND;
return;
}
landingPad = (uintptr_t)lpStart + landingPad;
#else // __USING_SJLJ_EXCEPTIONS__
++landingPad;
#endif // __USING_SJLJ_EXCEPTIONS__
results.landingPad = landingPad;
if (actionEntry == 0)
{
// Found a cleanup
results.reason = actions & _UA_SEARCH_PHASE
? _URC_CONTINUE_UNWIND
: _URC_HANDLER_FOUND;
return;
}
// Convert 1-based byte offset into
const uint8_t* action = actionTableStart + (actionEntry - 1);
bool hasCleanup = false;
// Scan action entries until you find a matching handler, cleanup, or the end of action list
while (true)
{
const uint8_t* actionRecord = action;
int64_t ttypeIndex = readSLEB128(&action);
if (ttypeIndex > 0)
{
// Found a catch, does it actually catch?
// First check for catch (...)
const __shim_type_info* catchType =
get_shim_type_info(static_cast<uint64_t>(ttypeIndex),
classInfo, ttypeEncoding,
native_exception, unwind_exception,
base);
if (catchType == 0)
{
// Found catch (...) catches everything, including
// foreign exceptions. This is search phase, cleanup
// phase with foreign exception, or forced unwinding.
assert(actions & (_UA_SEARCH_PHASE | _UA_HANDLER_FRAME |
_UA_FORCE_UNWIND));
results.ttypeIndex = ttypeIndex;
results.actionRecord = actionRecord;
results.adjustedPtr =
get_thrown_object_ptr(unwind_exception);
results.reason = _URC_HANDLER_FOUND;
return;
}
// Else this is a catch (T) clause and will never
// catch a foreign exception
else if (native_exception)
{
__cxa_exception* exception_header = (__cxa_exception*)(unwind_exception+1) - 1;
void* adjustedPtr = get_thrown_object_ptr(unwind_exception);
const __shim_type_info* excpType =
static_cast<const __shim_type_info*>(exception_header->exceptionType);
if (adjustedPtr == 0 || excpType == 0)
{
// Something very bad happened
call_terminate(native_exception, unwind_exception);
}
if (catchType->can_catch(excpType, adjustedPtr))
{
// Found a matching handler. This is either search
// phase or forced unwinding.
assert(actions &
(_UA_SEARCH_PHASE | _UA_FORCE_UNWIND));
results.ttypeIndex = ttypeIndex;
results.actionRecord = actionRecord;
results.adjustedPtr = adjustedPtr;
results.reason = _URC_HANDLER_FOUND;
return;
}
}
// Scan next action ...
}
else if (ttypeIndex < 0)
{
// Found an exception specification.
if (actions & _UA_FORCE_UNWIND) {
// Skip if forced unwinding.
} else if (native_exception) {
// Does the exception spec catch this native exception?
__cxa_exception* exception_header = (__cxa_exception*)(unwind_exception+1) - 1;
void* adjustedPtr = get_thrown_object_ptr(unwind_exception);
const __shim_type_info* excpType =
static_cast<const __shim_type_info*>(exception_header->exceptionType);
if (adjustedPtr == 0 || excpType == 0)
{
// Something very bad happened
call_terminate(native_exception, unwind_exception);
}
if (exception_spec_can_catch(ttypeIndex, classInfo,
ttypeEncoding, excpType,
adjustedPtr,
unwind_exception, base))
{
// Native exception caught by exception
// specification.
assert(actions & _UA_SEARCH_PHASE);
results.ttypeIndex = ttypeIndex;
results.actionRecord = actionRecord;
results.adjustedPtr = adjustedPtr;
results.reason = _URC_HANDLER_FOUND;
return;
}
} else {
// foreign exception caught by exception spec
results.ttypeIndex = ttypeIndex;
results.actionRecord = actionRecord;
results.adjustedPtr =
get_thrown_object_ptr(unwind_exception);
results.reason = _URC_HANDLER_FOUND;
return;
}
// Scan next action ...
} else {
hasCleanup = true;
}
const uint8_t* temp = action;
int64_t actionOffset = readSLEB128(&temp);
if (actionOffset == 0)
{
// End of action list. If this is phase 2 and we have found
// a cleanup (ttypeIndex=0), return _URC_HANDLER_FOUND;
// otherwise return _URC_CONTINUE_UNWIND.
results.reason = hasCleanup && actions & _UA_CLEANUP_PHASE
? _URC_HANDLER_FOUND
: _URC_CONTINUE_UNWIND;
return;
}
// Go to next action
action += actionOffset;
} // there is no break out of this loop, only return
}
#ifndef __USING_SJLJ_EXCEPTIONS__
else if (ipOffset < start)
{
// There is no call site for this ip
// Something bad has happened. We should never get here.
// Possible stack corruption.
call_terminate(native_exception, unwind_exception);
}
#endif // !__USING_SJLJ_EXCEPTIONS__
} // there might be some tricky cases which break out of this loop
// It is possible that no eh table entry specify how to handle
// this exception. By spec, terminate it immediately.
call_terminate(native_exception, unwind_exception);
}
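The core here is to fetch the LSDA (_Unwind_GetLanguageSpecificData, the .gcc_except_table section) and match the thrown exception, passed in through the context, against it. If a handler matches, the scan returns it so it can be run. If nothing in this frame matches, the scan reports _URC_CONTINUE_UNWIND and the unwinder moves on to the next frame; call_terminate is reached only when something is wrong, for example when no call-site entry covers the current ip.
Since parsing the LSDA is what locates the catch block, we need a rough picture of its layout: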
/*
Exception Handling Table Layout:
+-----------------+--------+
| lpStartEncoding | (char) |
+---------+-------+--------+---------------+-----------------------+
| lpStart | (encoded with lpStartEncoding) | defaults to funcStart |
+---------+-----+--------+-----------------+---------------+-------+
| ttypeEncoding | (char) | Encoding of the type_info table |
+---------------+-+------+----+----------------------------+----------------+
| classInfoOffset | (ULEB128) | Offset to type_info table, defaults to null |
+-----------------++--------+-+----------------------------+----------------+
| callSiteEncoding | (char) | Encoding for Call Site Table |
+------------------+--+-----+-----+------------------------+--------------------------+
| callSiteTableLength | (ULEB128) | Call Site Table length, used to find Action table |
+---------------------+-----------+---------------------------------------------------+
+---------------------+-----------+------------------------------------------------+
| Beginning of Call Site Table The current ip is a 1-based index into |
| ... this table. Or it is -1 meaning no |
| action is needed. Or it is 0 meaning |
| terminate. |
| +-------------+---------------------------------+------------------------------+ |
| | landingPad | (ULEB128) | offset relative to lpStart | |
| | actionEntry | (ULEB128) | Action Table Index 1-based | |
| | | | actionEntry == 0 -> cleanup | |
| +-------------+---------------------------------+------------------------------+ |
| ... |
+----------------------------------------------------------------------------------+
+---------------------------------------------------------------------+
| Beginning of Action Table ttypeIndex == 0 : cleanup |
| ... ttypeIndex > 0 : catch |
| ttypeIndex < 0 : exception spec |
| +--------------+-----------+--------------------------------------+ |
| | ttypeIndex | (SLEB128) | Index into type_info Table (1-based) | |
| | actionOffset | (SLEB128) | Offset into next Action Table entry | |
| +--------------+-----------+--------------------------------------+ |
| ... |
+---------------------------------------------------------------------+-----------------+
| type_info Table, but classInfoOffset does *not* point here! |
| +----------------+------------------------------------------------+-----------------+ |
| | Nth type_info* | Encoded with ttypeEncoding, 0 means catch(...) | ttypeIndex == N | |
| +----------------+------------------------------------------------+-----------------+ |
| ... |
| +----------------+------------------------------------------------+-----------------+ |
| | 1st type_info* | Encoded with ttypeEncoding, 0 means catch(...) | ttypeIndex == 1 | |
| +----------------+------------------------------------------------+-----------------+ |
| +---------------------------------------+-----------+------------------------------+ |
| | 1st ttypeIndex for 1st exception spec | (ULEB128) | classInfoOffset points here! | |
| | ... | (ULEB128) | | |
| | Mth ttypeIndex for 1st exception spec | (ULEB128) | | |
| | 0 | (ULEB128) | | |
| +---------------------------------------+------------------------------------------+ |
| ... |
| +---------------------------------------+------------------------------------------+ |
| | 0 | (ULEB128) | throw() | |
| +---------------------------------------+------------------------------------------+ |
| ... |
| +---------------------------------------+------------------------------------------+ |
| | 1st ttypeIndex for Nth exception spec | (ULEB128) | | |
| | ... | (ULEB128) | | |
| | Mth ttypeIndex for Nth exception spec | (ULEB128) | | |
| | 0 | (ULEB128) | | |
| +---------------------------------------+------------------------------------------+ |
+---------------------------------------------------------------------------------------+
*/
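So the essence of LSDA processing is: walk the Call Site Table to get an Action Table index, read the ttypeIndex from the Action Table, and use that ttypeIndex to look up the type_info Table and check whether the thrown object matches the catch type. If it matches, return; if not, follow the chain of actions in the Action Table until it is exhausted.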
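Most fields in the layout above are stored as ULEB128 or SLEB128 values. As a reference for reading the tables by hand, here is a minimal sketch of both decoders. The names read_uleb128 and read_sleb128 are my own and the code is not copied from libcxxabi, though it follows the same pointer-advancing style as readULEB128/readSLEB128 in the listing.

```cpp
#include <cassert>
#include <cstdint>

// Decodes an unsigned LEB128 value and advances *data past it.
// Each byte contributes 7 payload bits; bit 7 means "more bytes follow".
std::uint64_t read_uleb128(const std::uint8_t** data) {
    std::uint64_t result = 0;
    unsigned shift = 0;
    std::uint8_t byte;
    do {
        byte = *(*data)++;
        result |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    return result;
}

// Decodes a signed LEB128 value and advances *data past it.
// Bit 6 of the final byte is the sign bit and is extended upward.
std::int64_t read_sleb128(const std::uint8_t** data) {
    std::uint64_t result = 0;
    unsigned shift = 0;
    std::uint8_t byte;
    do {
        byte = *(*data)++;
        result |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    if (shift < 64 && (byte & 0x40))
        result |= ~std::uint64_t{0} << shift;  // sign-extend
    return static_cast<std::int64_t>(result);
}
```

For example, the bytes E5 8E 26 decode to 624485 as ULEB128, and C0 BB 78 decode to -123456 as SLEB128; the sign handling is why ttypeIndex can be negative for exception specifications.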
Why the exception was caught differently in this case
Based on the analysis above, the problem must lie in the LSDA's Action Table and type_info Table.
int main(int argc, char* argv[])
{
try{
p();
}
catch(std::exception& e){
printf("std::exception: %s\n", e.what());
}
catch(...){
printf("unkown exception\n");
}
return 0;
}
# objdump -d --disassemble=main ./build/test
# In this build the std exception is caught normally
0000000000001a70 <main>:
1a70: 55 push %rbp
1a71: 48 89 e5 mov %rsp,%rbp
1a74: 48 83 ec 30 sub $0x30,%rsp
1a78: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
1a7f: 89 7d f8 mov %edi,-0x8(%rbp)
1a82: 48 89 75 f0 mov %rsi,-0x10(%rbp)
1a86: e8 35 01 00 00 call 1bc0 <p@plt>
1a8b: e9 00 00 00 00 jmp 1a90 <main+0x20>
1a90: e9 51 00 00 00 jmp 1ae6 <main+0x76>
1a95: 48 89 c1 mov %rax,%rcx
1a98: 89 d0 mov %edx,%eax
1a9a: 48 89 4d e8 mov %rcx,-0x18(%rbp)
1a9e: 89 45 e4 mov %eax,-0x1c(%rbp)
1aa1: 8b 45 e4 mov -0x1c(%rbp),%eax
1aa4: b9 02 00 00 00 mov $0x2,%ecx
1aa9: 39 c8 cmp %ecx,%eax
1aab: 0f 85 3d 00 00 00 jne 1aee <main+0x7e>
1ab1: 48 8b 7d e8 mov -0x18(%rbp),%rdi
1ab5: e8 16 01 00 00 call 1bd0 <__cxa_begin_catch@plt>
1aba: 48 89 45 d8 mov %rax,-0x28(%rbp)
1abe: 48 8b 7d d8 mov -0x28(%rbp),%rdi
1ac2: 48 8b 07 mov (%rdi),%rax
1ac5: 48 8b 40 10 mov 0x10(%rax),%rax
1ac9: ff d0 call *%rax
1acb: 48 89 c6 mov %rax,%rsi
1ace: 48 8d 3d a1 ed ff ff lea -0x125f(%rip),%rdi # 876 <_IO_stdin_used+0x16>
1ad5: 31 c0 xor %eax,%eax
1ad7: e8 04 01 00 00 call 1be0 <printf@plt>
1adc: e9 00 00 00 00 jmp 1ae1 <main+0x71>
1ae1: e8 0a 01 00 00 call 1bf0 <__cxa_end_catch@plt>
1ae6: 31 c0 xor %eax,%eax
1ae8: 48 83 c4 30 add $0x30,%rsp
1aec: 5d pop %rbp
1aed: c3 ret
1aee: 48 8b 7d e8 mov -0x18(%rbp),%rdi
1af2: e8 d9 00 00 00 call 1bd0 <__cxa_begin_catch@plt>
1af7: 48 8d 3d 66 ed ff ff lea -0x129a(%rip),%rdi # 864 <_IO_stdin_used+0x4>
1afe: 31 c0 xor %eax,%eax
1b00: e8 db 00 00 00 call 1be0 <printf@plt>
1b05: e9 00 00 00 00 jmp 1b0a <main+0x9a>
1b0a: e8 e1 00 00 00 call 1bf0 <__cxa_end_catch@plt>
1b0f: e9 d2 ff ff ff jmp 1ae6 <main+0x76>
1b14: 48 89 c1 mov %rax,%rcx
1b17: 89 d0 mov %edx,%eax
1b19: 48 89 4d e8 mov %rcx,-0x18(%rbp)
1b1d: 89 45 e4 mov %eax,-0x1c(%rbp)
1b20: e8 cb 00 00 00 call 1bf0 <__cxa_end_catch@plt>
1b25: e9 00 00 00 00 jmp 1b2a <main+0xba>
1b2a: e9 1b 00 00 00 jmp 1b4a <main+0xda>
1b2f: 48 89 c1 mov %rax,%rcx
1b32: 89 d0 mov %edx,%eax
1b34: 48 89 4d e8 mov %rcx,-0x18(%rbp)
1b38: 89 45 e4 mov %eax,-0x1c(%rbp)
1b3b: e8 b0 00 00 00 call 1bf0 <__cxa_end_catch@plt>
1b40: e9 00 00 00 00 jmp 1b45 <main+0xd5>
1b45: e9 00 00 00 00 jmp 1b4a <main+0xda>
1b4a: 48 8b 7d e8 mov -0x18(%rbp),%rdi
1b4e: e8 ad 00 00 00 call 1c00 <_Unwind_Resume@plt>
1b53: 48 89 c7 mov %rax,%rdi
1b56: e8 05 00 00 00 call 1b60 <__clang_call_terminate>
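When the exception is caught correctly, eax at cmp %ecx,%eax is 2 and execution enters the std::exception handler. When it is not, eax is 1 and the jne jumps to the catch(...) handler. This means that in the failing case get_shim_type_info (in scan_eh_tab) ends up returning 0. (Note: the first lookup does find a type, but it does not match; the loop then moves to the next action in the list, which matches catch(...).)
That was our guess. To confirm it, I rebuilt debug versions of libcxx/libcxxabi, rebuilt the test program against them, and obtained the key results shown in the figures below inside scan_eh_tab: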


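From this we can see that the different build methods left the C++ runtime unable to dynamic_cast between the two class types, so the match failed and control went to the catch(...) block. Readers who are interested can trace the underlying implementation of dynamic_cast: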
__dynamic_cast(const void *static_ptr, const __class_type_info *static_type,
const __class_type_info *dst_type,
std::ptrdiff_t src2dst_offset)
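In other words, the root cause is that __class_type_info objects with identical definitions exist as separate symbols, one in libc++.so and one in libuser.so, depending on static versus shared linking. Because their addresses differ, the cast cannot succeed, which is reasonable.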
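The effect can be reproduced in a toy model of the action-chain walk. This is a deliberately simplified sketch: ToyTypeInfo, same_by_address, same_by_name and select_handler are hypothetical, base-class search is ignored, and the assumption that the runtime compares type descriptors by identity is taken from the debugging session above rather than from the ABI text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for a typeinfo object: two copies may carry the same mangled name.
struct ToyTypeInfo {
    const char* name;
};

// Identity comparison: distinct copies never match (assumed runtime behaviour).
bool same_by_address(const ToyTypeInfo* a, const ToyTypeInfo* b) {
    return a == b;
}

// Name comparison: copies with equal names match (shown only for contrast).
bool same_by_name(const ToyTypeInfo* a, const ToyTypeInfo* b) {
    return std::strcmp(a->name, b->name) == 0;
}

using Matcher = bool (*)(const ToyTypeInfo*, const ToyTypeInfo*);

// Walks the catch clauses in source order; nullptr stands for catch(...).
// Returns the index of the first clause that catches, or -1 if none does.
int select_handler(const std::vector<const ToyTypeInfo*>& clauses,
                   const ToyTypeInfo* thrown, Matcher same) {
    for (std::size_t i = 0; i < clauses.size(); ++i) {
        if (clauses[i] == nullptr)
            return static_cast<int>(i);  // catch(...) accepts anything
        if (same(clauses[i], thrown))
            return static_cast<int>(i);  // typed clause matches
    }
    return -1;
}
```

With the clauses {typed, catch(...)}, throwing with the very same descriptor selects the typed clause, while throwing with a second copy of the same name falls through to catch(...) under identity comparison. That is the same fall-through seen in the disassembly, where the selector changes from 2 to 1.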
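Afterword
Overall, this article answers two questions:
- Why is an exception thrown at all? Probably because the build settings change how the Android system controls access to certain APIs (this remains a guess).
- Why, with all symbols present, does the exception take a different catch path? The core reason is that the typeinfo objects cannot be matched through dynamic_cast.
This investigation deepened my understanding of stl_static/stl_shared, of how C++ is implemented at the low level, and of the internal structure of compilers such as gcc/clang.
Having worked through the llvm source this time, digging into C++ internals again will be much quicker and easier next time.
References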
- https://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html#cxx-throw
- https://en.cppreference.com/w/cpp/filesystem/exists.html
- https://developer.android.com/ndk/guides/cpp-support?hl=zh-cn#selecting_a_c_runtime
- https://en.cppreference.com/w/cpp/language/throw.html

PS: Please respect original work, and be kind if it is not to your taste.
PS: If you have questions, leave a comment and I will reply as soon as I see it.