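A static library overriding same-named classes and methods in a dynamic library
One day a customer reported that, in their test environment, the audit information reported by the assistant side (the client) could not be inserted into the Oracle database. They had recently upgraded Oracle and asked whether the upgrade was the cause. However, around the same time we had delivered a new package for a new requirement, so the bug could just as well have come from our own code changes rather than from the upgrade.
We asked an on-site colleague to collect the stack traces (this site cannot be accessed remotely) and extracted the useful parts: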
Thread 16 (Thread 0x7f7cf07e0700 (LWP 5519)):
#0 0x00007f7db0d4b54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f7db0d46e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007f7db0d46d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000012b6c31 in HIEUtil::RecMutex::lock (
this=0x7f7db4c43548 <HOTLDBMgrOracle::sta_ins+936>)
at ../HIERecMutex.cpp:249
#4 0x00007f7db49bd89a in HOTLDBMgrOracle::get_connect (
this=0x7f7db4c431a0 <HOTLDBMgrOracle::sta_ins>,
ui_id=@0x7f7cf07df9bc: 32637, str_exec_sql=...)
at ../../../../../../BlowSnow_PUFA/source/DBOracle/HOTLDBMgrOracle.h:1991
#5 0x00007f7db49bbff7 in HOTLDBMgrOracle::do_exec_sql (
this=0x7f7db4c431a0 <HOTLDBMgrOracle::sta_ins>, str_sql=...,
i_buffer_size=4096)
at ../../../../../../BlowSnow_PUFA/source/DBOracle/HOTLDBMgrOracle.h:1631
#6 0x00007f7db49c460c in DBBatInsertAuditInfoCommon::insert_general_config_warningInfo (
lst_audit_infos=Python Exception <type 'exceptions.ValueError'> Cannot find type const std::list<std::pair<AuditCommonInfo, LVPReportUserActionMonVT>, std::allocator<std::pair<AuditCommonInfo, LVPReportUserActionMonVT> > >::_Node:
std::list) at DBBatInsertAuditInfoCenterOracle.cpp:947
#7 0x0000000000cba179 in DBBatInsertAuditInfoCommon::insert_audit_info(std::list<std::pair<AuditCommonInfo, LVPReportUserActionMonVT>, std::allocator<std::pair<AuditCommonInfo, LVPReportUserActionMonVT> > > const&) ()
#8 0x0000000000bdbf4c in DBBatInsertAuditInfoThread<std::pair<AuditCommonInfo, LVPReportUserActionMonVT> >::run() ()
#9 0x00000000012bb0b7 in HThreadReal::run (this=0x7f7d0c002440)
at ../HThreadPool.cpp:213
#10 0x0000000001290827 in startHook (arg=0x7f7d0c002440)
at ../HIEThread.cpp:606
#11 0x00007f7db0d44ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f7db0a6db0d in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f7cfc7f8700 (LWP 6881)):
#0 0x00007f7db0d4b54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f7db0d46e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007f7db0d46d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000012b6c31 in HIEUtil::RecMutex::lock (
this=0x7f7db4c43548 <HOTLDBMgrOracle::sta_ins+936>)
at ../HIERecMutex.cpp:249
#4 0x00007f7db49bd89a in HOTLDBMgrOracle::get_connect (
this=0x7f7db4c431a0 <HOTLDBMgrOracle::sta_ins>,
ui_id=@0x7f7cfc7f76dc: 32637, str_exec_sql=...)
at ../../../../../../BlowSnow_PUFA/source/DBOracle/HOTLDBMgrOracle.h:1991
#5 0x00007f7db49bbff7 in HOTLDBMgrOracle::do_exec_sql (
this=0x7f7db4c431a0 <HOTLDBMgrOracle::sta_ins>, str_sql=...,
i_buffer_size=4096)
at ../../../../../../BlowSnow_PUFA/source/DBOracle/HOTLDBMgrOracle.h:1631
#6 0x00007f7db49c2c70 in DBBatInsertAuditInfoCommon::insert_cloud_monitor (
lst_audit_infos=Python Exception <type 'exceptions.ValueError'> Cannot find type const std::list<std::pair<AuditCommonInfo, LVPReportCoudMonitoringVT>, std::allocator<std::pair<AuditCommonInfo, LVPReportCoudMonitoringVT> > >::_Node:
std::list) at DBBatInsertAuditInfoCenterOracle.cpp:701
#7 0x0000000000cba0e9 in DBBatInsertAuditInfoCommon::insert_audit_info(std::list<std::pair<AuditCommonInfo, LVPReportCoudMonitoringVT>, std::allocator<std::pair<AuditCommonInfo, LVPReportCoudMonitoringVT> > > const&) ()
#8 0x0000000000bdd85c in DBBatInsertAuditInfoThread<std::pair<AuditCommonInfo, LVPReportCoudMonitoringVT> >::run() ()
#9 0x00000000012bb0b7 in HThreadReal::run (this=0x7f7d0c03a9c0)
at ../HThreadPool.cpp:213
#10 0x00000000012907d0 in startHook (arg=0x7f7d0c03a9c0)
at ../HIEThread.cpp:585
#11 0x00007f7db0d44ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f7db0a6db0d in clone () from /lib64/libc.so.6
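Seeing this, the natural guess is that some other code path is holding the lock, so these threads cannot acquire it and block. However, apart from these stacks there is no other thread that holds or requests the lock. The on-site colleague collected the stacks several times, and they always looked like this.
Since the SQL execution interface is not being blocked by some other holder of the lock, the mutex's memory may have been corrupted.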
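One way to probe a hypothesis like this without a debugger is to replace the blocking lock with a non-blocking attempt and log the return code. The sketch below only shows what `pthread_mutex_trylock` reports for a mutex that is genuinely held; the function name `trylock_on_held_mutex` is made up for illustration. It assumes a plain non-recursive mutex so the demo can stay single-threaded, whereas the `HIEUtil::RecMutex` in the stack is presumably recursive, in which case the attempt would have to come from a different thread than the owner.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: returns what pthread_mutex_trylock reports
// for a mutex that is already locked.
int trylock_on_held_mutex() {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);  // non-recursive

    pthread_mutex_t m;
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    int rc_lock = pthread_mutex_lock(&m);    // first acquisition succeeds
    assert(rc_lock == 0);
    (void)rc_lock;                           // silence warning under NDEBUG

    int rc_try = pthread_mutex_trylock(&m);  // never blocks; EBUSY when held

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc_try;
}
```

On Linux `EBUSY` has the value 16. If the same code is seen when no thread can be shown to own the mutex, the lock state itself is suspect.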
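Looking at the backend logs again, we noticed something odd. The backend keeps a log that records SQL execution and, on exceptions, the related information. Every time the hang occurred, we could see a similar SQL statement: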
[1][2022-07-01][11:01:07][T2784966400][HOTLDBMgrOracle.h ][1489][E]Msg:ORA-00903: invalid table name
, STM:update set = -1 where uidrecordid = '064420220628tNiyhZuTdpPCGl80', State:, Var:, id 0
We searched the code using the corresponding State and id:
// Handling logic inside the catch block
for (int i = 0; i < mvt_connector.size(); i++)
{
    if (mvt_connector[i].i_status == HOTLCS_CON_ED)
    {
        mvt_connector[i].i_status = HOTLCS_CON_NO;
        if (!mb_thread_run)
        {
            mb_thread_run = true;
            HThreadDBOracleMMaintain* pt = new HThreadDBOracleMMaintain;
            pt->mdm = this;
            pt->start().detach();
        }
    }
}
// Following the call further
class HThreadDBOracleMMaintain : public HIEUtil::Thread
{
public:
    HOTLDBMgrOracle* mdm;
    virtual void run()
    {
        mdm->run();
    }
};
void HOTLDBMgrOracle::run()
{
    HLog(HGET_INFO << L"run start");
    while(1)
    {
        bool b_need = need_recon();
        if (b_need)
        {
            HLog(HGET_INFO << L"need_recon");
            oracle::otl_connect* p_connect = get_new_connect();
            HLog(HGET_INFO << L"get_new_connect ret " << (__int64)p_connect);
            if (p_connect == 0)
            {
                HEnvironment::Sleep(1000 * 20);
                continue;
            }
            add_new(p_connect);
        }
        else
        {
            HEnvironment::Sleep(1000 * 2);
        }
    }
}
The code above is in HOTLDBMgrOracle.h.
[1351][2022-07-01][11:01:07][T2835322624][HOTLDBMgr.cpp ][0879][I]run start
[1352][2022-07-01][11:01:07][T2835322624][HOTLDBMgr.cpp ][0887][I]need_recon
[1353][2022-07-01][11:01:07][T2835322624][HOTLDBMgr.cpp ][0747][I]mstr_user scmsbusinessdb mstr_dsn OracleDB
[1354][2022-07-01][11:01:07][T2835322624][HOTLDBMgr.cpp ][1648][I]rlogon take time 58 ms
[1355][2022-07-01][11:01:07][T2835322624][HOTLDBMgr.cpp ][0891][I]get_new_connect ret 140437713089200
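However, the log lines "run start", "need_recon" and "get_new_connect ret" are all attributed to HOTLDBMgr.cpp.
The log itself cannot be wrong, so why did the call end up in HOTLDBMgr.cpp when it should have been in HOTLDBMgrOracle.h? We went through the code in that file: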
class HThreadDBMMaintain : public HIEUtil::Thread
{
public:
    HOTLDBMgr* mdm;
    virtual void run()
    {
        mdm->run();
    }
};
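The two files contain a class with the same name. On reflection, shouldn't the build have reported an error?
After looking up references and writing a demo of our own to test it, we found that when a dynamic library and a static library define a class and methods with the same name, calls to the dynamic library's version do get overridden by the static library's version.
(The precondition, of course, is that the same-named class is actually used somewhere in the code. If it is merely defined, nothing is overridden even though the names collide, and each side still calls its own.)
HOTLDBMgr.cpp and its .h were built into a static library.
HOTLDBMgrOracle.h was built into a dynamic library.
Both were then linked into our program.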
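So we changed the names here so that they no longer collide and handed the build to the site for verification. The failure to insert into the database no longer occurred.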
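Renaming is the fix used here. As an aside, and not something done in this fix, C++ also lets a file-private helper class live in an unnamed namespace, which gives it internal linkage so that its symbols cannot collide with a same-named class in another library. The sketch below is a minimal illustration under that assumption; `MaintainTask` and `run_oracle_maintenance` are hypothetical names, and the string return value exists only to make the behavior observable.

```cpp
#include <cassert>
#include <string>

namespace {  // unnamed namespace: everything inside has internal linkage

// Hypothetical stand-in for a per-file maintenance thread class.
class MaintainTask {
public:
    virtual ~MaintainTask() = default;
    virtual std::string run() { return "oracle maintenance loop"; }
};

}  // namespace

// Public entry point of this translation unit; only this name is exported.
std::string run_oracle_maintenance() {
    MaintainTask task;
    std::string result = task.run();
    assert(!result.empty());
    return result;
}
```

Because the class is private to its translation unit, another library can define its own `MaintainTask` without either definition replacing the other at link or load time.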
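The reason is that the same-named class handed the address of the HOTLDBMgrOracle instance to class HOTLDBMgr, and HOTLDBMgr then used that address for other work. Without remote access we cannot know the exact details, but a reasonable guess is that assigning an instance of a different class and operating on it modified the mutex member inside HOTLDBMgrOracle, so lock blocked and nothing could be inserted. (We later changed lock to trylock and found that it returned 16, meaning the lock was already held, which makes it fairly certain that the relevant variable in memory had been corrupted.)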
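To make the guess concrete, the sketch below compares two made-up class layouts and checks at compile time whether an ordinary field of one occupies the same bytes as the mutex of the other. The real member lists of HOTLDBMgr and HOTLDBMgrOracle are not shown in this post, so `MgrGeneric`, `MgrOracle`, and all of their fields are assumptions chosen purely for illustration.

```cpp
#include <cstddef>
#include <pthread.h>

// Hypothetical layouts; the real classes are not shown in the post.
struct MgrGeneric {            // plays the role of HOTLDBMgr
    void* connectors[4];
    long  reconnect_count;     // a field the generic run() loop might update
};

struct MgrOracle {             // plays the role of HOTLDBMgrOracle
    void* connectors[4];
    pthread_mutex_t lock;      // starts where MgrGeneric keeps reconnect_count
};

// True when byte ranges [a, a + an) and [b, b + bn) intersect.
constexpr bool ranges_overlap(std::size_t a, std::size_t an,
                              std::size_t b, std::size_t bn) {
    return a < b + bn && b < a + an;
}

// If code compiled against MgrGeneric is handed a MgrOracle object,
// a write to reconnect_count lands inside the mutex.
constexpr bool kFieldHitsMutex = ranges_overlap(
    offsetof(MgrGeneric, reconnect_count), sizeof(MgrGeneric::reconnect_count),
    offsetof(MgrOracle, lock), sizeof(pthread_mutex_t));
```

With layouts like these, a single store through the wrong type is enough to leave the mutex looking locked with no owning thread, which matches the symptom of every caller waiting in `pthread_mutex_lock`.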