mongodump fails and takes down the mongod service [root cause: a corrupted WiredTiger .wt file]

======================================================

The problem in the title was an intermediate step in a larger problem I was trying to solve.

The original task: convert an existing standalone MongoDB deployment into a replica set. The initial sync would get about halfway through this collection and then bring down the source MongoDB instance.

For how to set up a MongoDB replica set, see another post of mine: http://www.cnblogs.com/zhzhang/p/6783425.html

======================================================

Server: a virtual machine with 6 cores and 8 GB of RAM.

Problem description:

MongoDB version 3.2.7 (installed via yum).

I needed to mongodump a single collection, like so:

mongodump --collection abc --db db

The collection abc holds close to 200 million documents, roughly 200 bytes each.

Every run of mongodump got to about 52.5%, then failed with the error below, and the mongod service died and had to be restarted:

2017-05-02T17:08:51.663+0800    [############............]  db.abc  91363661/177602822  (51.4%)
2017-05-02T17:08:54.663+0800    [############............]  db.abc  91744632/177602822  (51.7%)
2017-05-02T17:08:57.663+0800    [############............]  db.abc  92279192/177602822  (52.0%)
2017-05-02T17:09:00.663+0800    [############............]  db.abc  92629211/177602822  (52.2%)
2017-05-02T17:09:03.663+0800    [############............]  db.abc  93112828/177602822  (52.4%)
2017-05-02T17:09:05.619+0800    [############............]  db.abc  93288043/177602822  (52.5%)
2017-05-02T17:09:09.823+0800    Failed: error reading collection: EOF
You have mail in /var/spool/mail/admin

[admin@syslog-1.dev.abc-inc.com /abc_log_nas]
$ps aux | grep mongo
admin    30931  0.0  0.0 103244   860 pts/2    S+   17:14   0:00 grep mongo
You have mail in /var/spool/mail/admin

[root@syslog-1.dev.abc-inc.com /home/admin/bin]
#/etc/init.d/mongod status
mongod dead but subsys locked

[root@syslog-1.dev.abc-inc.com /home/admin/bin]
#/etc/init.d/mongod restart
Stopping mongod:                                           [  OK  ]
Starting mongod:                                           [  OK  ]

[root@syslog-1.dev.abc-inc.com /home/admin/bin]
#tail -n 10 /var/spool/mail/admin 
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=admin>
X-Cron-Env: <USER=admin>
Message-Id: <20160601115558.741D2601FD@syslog-1.dev.abc-inc.com>
Date: Mon, 11 Apr 2016 05:16:11 +0800 (CST)

ssh: Could not resolve hostname syslog-1: Temporary failure in name resolution
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]

rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]

rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]

I have since found a solution. Many thanks to the anonymous experts scattered around the web. Here is the write-up, in the hope that it helps someone else.

=====================================================================

Attempt 1 (no effect): raise the connection limit. See http://www.cnblogs.com/zhzhang/p/6762239.html for how to configure the open-file limit.

Attempt 2 (no effect): increase oplogSize from 1 GB to 10 GB, following the MongoDB manual:

https://docs.mongodb.com/manual/tutorial/change-oplog-size/

Attempt 3 (no effect): since the collection only had the default _id index, I tried building an additional index; that build also killed mongod at around 50%.

Worse, on restart MongoDB resumes the interrupted index build by default, so you must explicitly disable that in the config file:

storage:
  indexBuildRetry: false

Only with that setting does a restart escape the endless loop of index build... crash... restart... index build...
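If you launch mongod by hand rather than through the init script, the equivalent command-line switch should be --noIndexBuildRetry (to the best of my knowledge; verify against your version's docs):

mongod -f /etc/mongod.conf --noIndexBuildRetry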

Attempt 4 (no effect): I looked for a way to split the collection apart. It has no user-created indexes... wait, indexes? Suddenly I spotted what looked like a lifeline: the _id index MongoDB creates automatically on ObjectId.

Start a mongod instance on this or another server, and use that built-in index (an ObjectId occupies 12 bytes, printed as two hex digits per byte, i.e. a 24-character string, with the leading 4 bytes being a Unix timestamp) to clone out several sub-collections partitioned by time.
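To pick boundary values like the ones used below, you can build an ObjectId whose embedded timestamp is a chosen date. A minimal mongo-shell sketch (objectIdFromDate is a hypothetical helper of mine, not part of MongoDB):

// The leading 8 hex chars of an ObjectId are the Unix timestamp in seconds;
// zero out the remaining 16 to get a range boundary for that instant.
function objectIdFromDate(d) {
    var hexSeconds = Math.floor(d.getTime() / 1000).toString(16);
    return ObjectId(hexSeconds + "0000000000000000");
}
objectIdFromDate(new Date("2016-12-01"))  // boundary ObjectId for that date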

Clone procedure: on the newly started mongo instance:

db.runCommand({cloneCollection: "db.abc", from: "syslog-1:27017", query: {"_id": {$gt: ObjectId("583aa21d382653813be7c18d"),$lte: ObjectId("587aa21d382653813be7c18d")}}})

db.getCollection("abc").renameCollection("abc_587")

Then, back on the original instance (note that cloneCollection runs on the destination and pulls from the host named in `from`; it cannot rename, hence the renameCollection above):

db.runCommand({cloneCollection: "db.abc_587", from: "syslog-3:37017", query: {}})

After verifying the copy, the corresponding data can be deleted from the huge collection:

db.abc.remove({"_id": {$gt: ObjectId("583aa21d382653813be7c18d"),$lte: ObjectId("587aa21d382653813be7c18d")}})

I have to complain here: bulk-deleting documents from a MongoDB collection is painfully slow. Tens of millions of documents looked like it would take dozens of hours, averaging around 50,000 deletions per minute (though when the machine was lightly loaded it seemed to reach 30,000-40,000 per second).
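One thing that may soften the impact (my own aside, not something I did at the time; whether it is actually faster depends on your load) is deleting in bounded batches so the server yields between rounds:

// Hypothetical batched delete over the same _id range as above.
var bound = {$gt: ObjectId("583aa21d382653813be7c18d"),
             $lte: ObjectId("587aa21d382653813be7c18d")};
var n;
do {
    // Grab a page of _ids, then remove exactly those documents.
    var ids = db.abc.find({_id: bound}, {_id: 1})
                    .limit(10000)
                    .toArray()
                    .map(function (doc) { return doc._id; });
    n = db.abc.remove({_id: {$in: ids}}).nRemoved;
    print("removed " + n);
} while (n > 0);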

So: wait, wait, and wait some more...

Attempt 5 (confirmed effective, and it exposed the root cause):

Take a closer look at the detailed error in the mongod log:

2017-05-04T19:14:06.533+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56728 #35 (11 connections now open)
2017-05-04T19:14:06.545+0800 I NETWORK  [conn35] end connection 127.0.0.1:56728 (10 connections now open)
2017-05-04T19:14:06.550+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56730 #36 (11 connections now open)
2017-05-04T19:14:06.563+0800 I NETWORK  [conn36] end connection 127.0.0.1:56730 (10 connections now open)
2017-05-04T19:14:06.818+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56731 #37 (11 connections now open)
2017-05-04T19:14:06.831+0800 I NETWORK  [conn37] end connection 127.0.0.1:56731 (10 connections now open)
2017-05-04T19:14:06.837+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56732 #38 (11 connections now open)
2017-05-04T19:14:06.870+0800 I NETWORK  [conn38] end connection 127.0.0.1:56732 (10 connections now open)
2017-05-04T19:14:24.465+0800 I COMMAND  [conn24] query service.client_agent query: { $query: {}, $orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } cursorid:51274086361 ntoreturn:0 ntoskip:11350867 keysExamined:11350968 docsExamined:11350968 keyUpdates:0 writeConflicts:0 numYields:88679 nreturned:101 reslen:12747 locks:{ Global: { acquireCount: { r: 177360 } }, Database: { acquireCount: { r: 88680 } }, Collection: { acquireCount: { r: 88680 } } } 10756ms
2017-05-04T19:14:24.531+0800 E STORAGE  [conn24] WiredTiger (0) [1493896464:531510][14803:0x7f40712c3700], file:collection-4--1812812328855925336.wt, WT_CURSOR.search: read checksum error for 8192B block at offset 3412144128: block header checksum of 943205936 doesn't match expected checksum of 3037857471
2017-05-04T19:14:24.531+0800 E STORAGE  [conn24] WiredTiger (0) [1493896464:531635][14803:0x7f40712c3700], file:collection-4--1812812328855925336.wt, WT_CURSOR.search: collection-4--1812812328855925336.wt: encountered an illegal file format or internal value
2017-05-04T19:14:24.531+0800 E STORAGE  [conn24] WiredTiger (-31804) [1493896464:531656][14803:0x7f40712c3700], file:collection-4--1812812328855925336.wt, WT_CURSOR.search: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-05-04T19:14:24.531+0800 I -        [conn24] Fatal Assertion 28558
2017-05-04T19:14:24.531+0800 I -        [conn24] 

***aborting after fassert() failure


2017-05-04T19:14:24.599+0800 I -        [WTJournalFlusher] Fatal Assertion 28559
2017-05-04T19:14:24.599+0800 I -        [WTJournalFlusher] 

***aborting after fassert() failure


2017-05-04T19:14:24.609+0800 F -        [conn24] Got signal: 6 (Aborted).

 0x1304482 0x13033a9 0x1303bb2 0x7f40828dd7e0 0x7f408256c625 0x7f408256de05 0x128a472 0x1072bb3 0x1a7945c 0x1a7991d 0x1a79d04 0x19acfb7 0x19c9c85 0x19cf380 0x19f0207 0x19ba7a8 0x1a0c71c 0x1067a83 0xbdc2c9 0xb9916e 0xbbf2b5 0xdee255 0xdee919 0xdaaf72 0xdab66d 0xc82da9 0xc89075 0x94ed5c 0x12aea65 0x7f40828d5aa1 0x7f408262293d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"F04482","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"F033A9"},{"b":"400000","o":"F03BB2"},{"b":"7F40828CE000","o":"F7E0"},{"b":"7F408253A000","o":"32625","s":"gsignal"},{"b":"7F408253A000","o":"33E05","s":"abort"},{"b":"400000","o":"E8A472","s":"_ZN5mongo13fassertFailedEi"},{"b":"400000","o":"C72BB3"},{"b":"400000","o":"167945C","s":"__wt_eventv"},{"b":"400000","o":"167991D","s":"__wt_err"},{"b":"400000","o":"1679D04","s":"__wt_panic"},{"b":"400000","o":"15ACFB7","s":"__wt_bm_read"},{"b":"400000","o":"15C9C85","s":"__wt_bt_read"},{"b":"400000","o":"15CF380","s":"__wt_page_in_func"},{"b":"400000","o":"15F0207","s":"__wt_row_search"},{"b":"400000","o":"15BA7A8","s":"__wt_btcur_search"},{"b":"400000","o":"160C71C"},{"b":"400000","o":"C67A83","s":"_ZN5mongo21WiredTigerRecordStore6Cursor9seekExactERKNS_8RecordIdE"},{"b":"400000","o":"7DC2C9","s":"_ZN5mongo16WorkingSetCommon5fetchEPNS_16OperationContextEPNS_10WorkingSetEmNS_11unowned_ptrINS_20SeekableRecordCursorEEE"},{"b":"400000","o":"79916E","s":"_ZN5mongo10FetchStage4workEPm"},{"b":"400000","o":"7BF2B5","s":"_ZN5mongo9SkipStage4workEPm"},{"b":"400000","o":"9EE255","s":"_ZN5mongo12PlanExecutor11getNextImplEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE"},{"b":"400000","o":"9EE919","s":"_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE"},{"b":"400000","o":"9AAF72"},{"b":"400000","o":"9AB66D","s":"_ZN5mongo7getMoreEPNS_16OperationContextEPKcixPbS4_"},{"b":"400000","o":"882DA9","s":"_ZN5mongo15receivedGetMoreEPNS_16OperationContextERNS_10DbResponseERNS_7MessageERNS_5CurOpE"},{"b":"400000","o":"889075","s":"_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE"},{"b":"400000","o":"54ED5C","s":"_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE"},{"b":"400000","o":"EAEA65","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"7F40828CE000","o":"7AA1"},{"b":"7F408253A000","o":"E893D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.7", "gitVersion" : "4249c1d2b5999ebbf1fdf3bc0e0e3b3ff5c0aaf2", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "2.6.32-504.el6.x86_64", "version" : "#1 SMP Wed Oct 15 04:27:16 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "384A822B93AE1E4CFE393F0CECA08575DF6EB381" }, { "b" : "7FFF67CB8000", "elfType" : 3, "buildId" : "08E42C6C3D2CD1E5D68A43B717C9EB3D310F2DF0" }, { "b" : "7F4083775000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "B84C31B86733DE212F6886FE6F55630FE56180A9" }, { "b" : "7F4083391000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "A30A68D2F579614CBEA988BDAAC20CD56D8C48FC" }, { "b" : "7F4083189000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "95159178F1A4A3DBDC7819FBEA2C80E5FCDD6BAC" }, { "b" : "7F4082F85000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "29B61382141595ECBA6576232E44F2310C3AAB72" }, { "b" : "7F4082D01000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "989FE3A42CA8CEBDCC185A743896F23A0CF537ED" }, { "b" : "7F4082AEB000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "9350579A4970FA47F3144AD8F40B183B0954497D" }, { "b" : "7F40828CE000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "C56DD1B811FC0D9263248EBB308C73FCBCD80FC1" }, { "b" : "7F408253A000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "8E6FA4C4B0594C355C1B90C1D49990368C81A040" }, { "b" : "7F40839E1000", "path" : 
"/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "959C5E10A47EE8A633E7681B64B4B9F74E242ED5" }, { "b" : "7F40822F6000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "441FA45097A11508E50D55A3D1FF169BF2BE7C62" }, { "b" : "7F408200F000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "F62622218875795666E08B92D176A50791183EEC" }, { "b" : "7F4081E0B000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "152E2C18A7A2145021A8A879A01A82EE134E3946" }, { "b" : "7F4081BDF000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "B8DEDADC140347276164C729418C7A37B7224135" }, { "b" : "7F40819C9000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "b" : "7F40817BE000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "4BDFC7A19C1F328EB4FCFBCE7A1E27606928610D" }, { "b" : "7F40815BB000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "b" : "7F40813A1000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "C39D7FFB49DFB1B55AD09D1D711AD802123F6623" }, { "b" : "7F4081182000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "E6798A06BEE17CF102BBA44FD512FF8B805CEAF1" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x1304482]
 mongod(+0xF033A9) [0x13033a9]
 mongod(+0xF03BB2) [0x1303bb2]
 libpthread.so.0(+0xF7E0) [0x7f40828dd7e0]
 libc.so.6(gsignal+0x35) [0x7f408256c625]
 libc.so.6(abort+0x175) [0x7f408256de05]
 mongod(_ZN5mongo13fassertFailedEi+0x82) [0x128a472]
 mongod(+0xC72BB3) [0x1072bb3]
 mongod(__wt_eventv+0x42C) [0x1a7945c]
 mongod(__wt_err+0x8D) [0x1a7991d]
 mongod(__wt_panic+0x24) [0x1a79d04]
 mongod(__wt_bm_read+0x77) [0x19acfb7]
 mongod(__wt_bt_read+0x85) [0x19c9c85]
 mongod(__wt_page_in_func+0x180) [0x19cf380]
 mongod(__wt_row_search+0x677) [0x19f0207]
 mongod(__wt_btcur_search+0xB08) [0x19ba7a8]
 mongod(+0x160C71C) [0x1a0c71c]
 mongod(_ZN5mongo21WiredTigerRecordStore6Cursor9seekExactERKNS_8RecordIdE+0x53) [0x1067a83]
 mongod(_ZN5mongo16WorkingSetCommon5fetchEPNS_16OperationContextEPNS_10WorkingSetEmNS_11unowned_ptrINS_20SeekableRecordCursorEEE+0x99) [0xbdc2c9]
 mongod(_ZN5mongo10FetchStage4workEPm+0x2FE) [0xb9916e]
 mongod(_ZN5mongo9SkipStage4workEPm+0x45) [0xbbf2b5]
 mongod(_ZN5mongo12PlanExecutor11getNextImplEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0x275) [0xdee255]
 mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x39) [0xdee919]
 mongod(+0x9AAF72) [0xdaaf72]
 mongod(_ZN5mongo7getMoreEPNS_16OperationContextEPKcixPbS4_+0x52D) [0xdab66d]
 mongod(_ZN5mongo15receivedGetMoreEPNS_16OperationContextERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1A9) [0xc82da9]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xE35) [0xc89075]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0xEC) [0x94ed5c]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x325) [0x12aea65]
 libpthread.so.0(+0x7AA1) [0x7f40828d5aa1]
 libc.so.6(clone+0x6D) [0x7f408262293d]
-----  END BACKTRACE  -----

What?! The file is corrupted? When on earth did I ever touch that file...

Nothing for it but to keep fixing...

Now for the key steps of the actual fix.

1. Download and build the required software

wget http://source.wiredtiger.com/releases/wiredtiger-2.7.0.tar.bz2
tar xvf wiredtiger-2.7.0.tar.bz2
cd wiredtiger-2.7.0
sudo apt-get install libsnappy-dev build-essential
./configure --enable-snappy
make
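One mismatch worth flagging: this server is CentOS 6 (MongoDB was installed via yum), while the apt-get line above comes from an Ubuntu-oriented guide. On CentOS the rough equivalent should be the following (package names are my assumption; check your repos):

sudo yum install snappy-devel gcc gcc-c++ make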

2. Copy the damaged .wt file out to a safe working directory. (To find which file backs the collection, check the collection's stats, as below.)

db.abc.stats().wiredTiger.uri
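If memory serves, this returns a string along the lines of the following; strip the statistics:table: prefix and append .wt to get the file name under the dbPath:

statistics:table:collection-4--1812812328855925336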

3. Salvage the damaged collection (wt salvage rebuilds the table, discarding any blocks it cannot read):

./wt -v -h ../mongo-bak -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R salvage collection-2657--1723320556100349955.wt

4. Dump the salvaged table with wt dump; it will be loaded back into a MongoDB collection in step 6. (Note that wt takes the table name, without the .wt extension.)

./wt -v -h ../data -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R dump -f ../collection.dump collection-2657--1723320556100349955

5. Start a fresh mongod instance. The goal is to obtain an empty collection whose backing table we can overwrite with the dump (the insert/remove pair below forces WiredTiger to actually create the table file):

mongod --dbpath tmp-mongo --storageEngine wiredTiger --nojournal

Then, in a mongo shell connected to that instance:

use Recovery
db.borkedCollection.insert({test: 1})
db.borkedCollection.remove({})
db.borkedCollection.stats()

6. Load the collection.dump file from step 4 into the data directory of the newly created instance (shut that mongod down first). The -r option renames the dumped table to the target table name.

The target collection-* name is that of the collection created in step 5: run db.borkedCollection.stats().wiredTiger.uri on it.

./wt -v -h ../data -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R load -f ../collection.dump -r collection-2-880383588247732034

7. Start the mongod instance again and connect; the document count shows up as zero:

db.borkedCollection.count()
0

8. But a query like the one below shows the data really is there (count() reads the cached collection metadata, which the raw wt load never updated, while find() scans the actual table):

db.borkedCollection.find({}, {_id: 1})
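As an aside (my addition; behavior worth verifying on your version), a full validate is supposed to recompute the stored count so it matches reality:

db.borkedCollection.validate(true)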

9. Use mongodump to dump the collection data out:

mongodump

10. Use mongorestore to load the data back into the original MongoDB instance:

mongorestore --drop
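For the record, spelled-out forms of these two steps might look like this (ports and paths are assumptions for illustration; the recovery instance here is imagined on port 27018):

mongodump --port 27018 --db Recovery --collection borkedCollection
mongorestore --drop --port 27017 --db db --collection abc dump/Recovery/borkedCollection.bson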

11. Log in to MongoDB and confirm the data is back.

Because the data file was corrupted, some documents may be lost. In my case about 100,000 out of ~170 million went missing, which was acceptable.

 

This fix is based on: http://www.alexbevi.com/blog/2016/02/10/recovering-a-wiredtiger-collection-from-a-corrupt-mongodb-installation/

 

I hope this helps.

 

PS: troubleshooting like this really is a great way to learn.

Through this investigation I picked up quite a bit about MongoDB...

 
