A Small Tragedy Caused by Big Data (Part 1)

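A few days ago, a customer reported that one feature of MonitorServer did not work on site, so I started tracking it down right away.

The feature does the following: it receives configuration parameters from the upper layer (the BS system) and, depending on how the system is running, forwards them to the designated lower-layer systems (some go to embedded devices, others to other programs).

Local testing had passed without any problem. So why did it fail at the customer site?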

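So I ran two tests:

(1) Retested with local data: the result was normal.

(2) Brought the relevant data back from the customer site and tested with it: sure enough, the feature failed.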

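Tracing showed that the problem occurred when MonitorServer forwarded the parameters to a lower-layer program, monitord: no response came back from monitord, so the forward failed.

So I looked at monitord and found that it had crashed, which of course meant it could not send a response to MonitorServer.

The question was: why did monitord crash? All the earlier tests had passed.

Analyzing the data revealed one thing: the customer site forwards far more data than the local tests did.

Next I looked at the sending code in MonitorServer and the receiving code in monitord, shown below: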

// MonitorServer sending code

esmonitor_cfg_t m_monitord_prms;

...

tmp = sock.CreateSock(pConf->m_monitordServerList[i].c_str(), ES_MONITOR_PORT, IPPROTO_TCP, CLIENT);

...

// The whole structure is sent every time, regardless of i_cfg_num
if (sock.SendData((__int8*)&m_monitord_prms, sizeof(m_monitord_prms)) < 0)
{
    printf("setAlarmPrmTask::Send2Monitord().  SendData failed.\n");
    sock.closeHandle();
    nRet = -1;
    continue;
}

// Wait for the response from monitord
esmonitor_cfg_resp_t resp;
if (sock.ReceiveData((__int8*)&resp, sizeof(resp)) < 0)
{
    printf("setAlarmPrmTask::Send2Monitord().  ReceiveData failed.\n");
    sock.closeHandle();
    nRet = -1;
    continue;
}


// Definition of the structure being sent

#define MAX_CFG_NUM 1024
typedef struct _esmonitor_cfg_t
{
    _esmonitor_cfg_t()
    {
        memset(&header, 0, sizeof(header_t));
        i_cfg_num = 0;
        memset(&threshold, 0, sizeof(threshold_t)*MAX_CFG_NUM);

        header.i_sync = htonl(0x12345678);
        header.i_vession = htonl(0x1);
        header.i_type = htonl(ES_MONITOR_CFG_ADD);
    }

    header_t    header;
    uint32_t    i_cfg_num;                 // number of valid entries in threshold[]
    threshold_t    threshold[MAX_CFG_NUM];
}esmonitor_cfg_t;

The monitord receiving code:

static uint8_t p_recv_buf[1500];
while (1)
{
    sock_accept = (SOCKET)accept(sock, (SOCKADDR*)&addr_from, &i_len);
    if (sock_accept > 0)
    {
        i_recv_size = recv(sock_accept, &p_recv_buf, sizeof(p_recv_buf), 0);
        if (i_recv_size > 0)
        {
            p_header = p_recv_buf;
            p_header->i_type = ntohl(p_header->i_type);

            ...
        }
    }
}

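Spot the problem?

monitord's receive buffer is only 1500 bytes, while the structure MonitorServer sends is far larger! Measured, sizeof (m_monitord_prms) comes to more than 6000 bytes!

Then why was there no crash when the amount of data was small, and a crash only when it was large?

Let's analyze.

First, in the sender's esmonitor_cfg_t structure, the first few fields have a fixed size and are followed by an array of 1024 elements (each element holds one set of configuration parameters); the i_cfg_num field gives the number of elements actually in use. So every send transmits sizeof (m_monitord_prms) bytes, about 6000 (let's assume 6000), no matter how many elements are valid.

Next, the receive buffer defined on the receiving side is uint8_t p_recv_buf[1500], that is, 1500 bytes.

As a result, the receiver takes in only the first 1500 of the 6000 bytes the sender transmits.

After receiving those 1500 bytes, monitord processes them as follows: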

for (i = 0; i < p_cfg->i_cfg_num; i++)
{
    p_threshold = p_cfg->threshold + i;
    p_threshold->i_alarm_delay = ntohl(p_threshold->i_alarm_delay);
    p_threshold->i_alarm_id = ntohl(p_threshold->i_alarm_id);
    ...
}

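When the i_cfg_num set by the sender is small, monitord has received only part of the data, but it never touches the part that is missing.

But once i_cfg_num covers entries that lie beyond the 1500 bytes received, p_threshold walks past the end of the receive buffer. It becomes a dangerous "wild pointer", and reading and writing through it crashes the program.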
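The out-of-bounds condition can be made concrete with a small check that runs before the loop. This is only a sketch in C: the post does not show header_t or threshold_t, so the layouts below are hypothetical stand-ins (and the sizes they produce will not match the 6000+ bytes measured above), cfg_fits is an illustrative name rather than part of monitord, and i_cfg_num is assumed to be in host byte order already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CFG_NUM 1024

/* Hypothetical layouts: the real header_t and threshold_t are not shown in the post. */
typedef struct { uint32_t i_sync; uint32_t i_vession; uint32_t i_type; } header_t;
typedef struct { uint32_t i_alarm_id; uint32_t i_alarm_delay; } threshold_t;

typedef struct {
    header_t    header;
    uint32_t    i_cfg_num;
    threshold_t threshold[MAX_CFG_NUM];
} esmonitor_cfg_t;

/* Returns 1 when every entry announced by i_cfg_num lies inside the
 * `received` bytes that actually arrived, 0 otherwise.
 * Assumes i_cfg_num is already in host byte order. */
static int cfg_fits(const esmonitor_cfg_t *p_cfg, size_t received)
{
    const size_t off = offsetof(esmonitor_cfg_t, threshold);

    assert(p_cfg != NULL);
    if (received < off)                    /* fixed-size fields incomplete */
        return 0;
    if (p_cfg->i_cfg_num > MAX_CFG_NUM)    /* count larger than the array */
        return 0;
    return (size_t)p_cfg->i_cfg_num * sizeof(threshold_t) <= received - off;
}
```

With these stand-in layouts the array starts at byte 16 and each entry takes 8 bytes, so a 1500-byte buffer holds 185 complete entries; a message announcing 186 or more would be rejected instead of being walked through.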

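With the cause identified, the fix is straightforward: enlarge monitord's receive buffer so that it is at least as large as the structure the sender transmits.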
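One way to sketch this fix is to size the receiving storage from the structure definition itself rather than from a hard-coded number. A related point the post does not raise: TCP is a byte stream, so a single recv call may return fewer bytes than requested even when the buffer is large enough, and a receiver usually loops until the whole message has arrived. The C sketch below assumes POSIX sockets (monitord appears to use Winsock, where the handle type is SOCKET and recv returns int); the header_t and threshold_t layouts are hypothetical, and recv_all and recv_cfg are illustrative names, not functions from monitord.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

#define MAX_CFG_NUM 1024

/* Hypothetical layouts: the real header_t and threshold_t are not shown in the post. */
typedef struct { uint32_t i_sync; uint32_t i_vession; uint32_t i_type; } header_t;
typedef struct { uint32_t i_alarm_id; uint32_t i_alarm_delay; } threshold_t;

typedef struct {
    header_t    header;
    uint32_t    i_cfg_num;
    threshold_t threshold[MAX_CFG_NUM];
} esmonitor_cfg_t;

/* Read exactly len bytes from fd into dst.
 * Returns 0 on success, -1 on error or if the peer closes early. */
static int recv_all(int fd, void *dst, size_t len)
{
    uint8_t *p = dst;

    assert(dst != NULL || len == 0);
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n < 0 && errno == EINTR)
            continue;                 /* interrupted by a signal: retry */
        if (n <= 0)
            return -1;                /* error, or connection closed */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive one full message; the storage is sized by the shared struct,
 * so it cannot be smaller than what the sender transmits. */
static int recv_cfg(int fd, esmonitor_cfg_t *out)
{
    return recv_all(fd, out, sizeof *out);
}
```

Receiving straight into an esmonitor_cfg_t ties the receiver's size to the same definition the sender uses, which only helps if both programs are built from one shared header.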

----------------------------------------------------------------------------------

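PS: MonitorServer and monitord are maintained by different people, who had never coordinated with each other, which is how a problem like this came about.

I couldn't help roaring: what a trap...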

posted @ 2012-04-28 17:38 楚
