mongodump fails and takes the mongod service down [root cause: a corrupted .wt file]
======================================================
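The problem in the title is an intermediate step of the problem I was actually trying to solve.
The original problem: I needed to turn a previously standalone MongoDB database into a replica set, and found that syncing this collection got about halfway and then brought the MongoDB instance down.
For how to set up a MongoDB replica set, see my other post: http://www.cnblogs.com/zhzhang/p/6783425.html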
======================================================
Server (virtual machine): 6 cores, 8 GB RAM.
Problem description:
MongoDB version 3.2.7 (installed via yum)
I needed to mongodump a single collection, as follows
mongodump --collection abc --db db
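abc holds close to 200 million documents, roughly 200 B each
Every time the mongodump command runs, it fails at 52.5% with the error below, and the mongod service dies and has to be restarted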
2017-05-02T17:08:51.663+0800 [############............] db.abc 91363661/177602822 (51.4%)
2017-05-02T17:08:54.663+0800 [############............] db.abc 91744632/177602822 (51.7%)
2017-05-02T17:08:57.663+0800 [############............] db.abc 92279192/177602822 (52.0%)
2017-05-02T17:09:00.663+0800 [############............] db.abc 92629211/177602822 (52.2%)
2017-05-02T17:09:03.663+0800 [############............] db.abc 93112828/177602822 (52.4%)
2017-05-02T17:09:05.619+0800 [############............] db.abc 93288043/177602822 (52.5%)
2017-05-02T17:09:09.823+0800 Failed: error reading collection: EOF
You have mail in /var/spool/mail/admin
[admin@syslog-1.dev.abc-inc.com /abc_log_nas]
$ps aux | grep mongo
admin 30931 0.0 0.0 103244 860 pts/2 S+ 17:14 0:00 grep mongo
You have mail in /var/spool/mail/admin
[root@syslog-1.dev.abc-inc.com /home/admin/bin]
#/etc/init.d/mongod status
mongod dead but subsys locked
[root@syslog-1.dev.abc-inc.com /home/admin/bin]
#/etc/init.d/mongod restart
Stopping mongod: [ OK ]
Starting mongod: [ OK ]
[root@syslog-1.dev.abc-inc.com /home/admin/bin]
#tail -n 10 /var/spool/mail/admin
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=admin>
X-Cron-Env: <USER=admin>
Message-Id: <20160601115558.741D2601FD@syslog-1.dev.abc-inc.com>
Date: Mon, 11 Apr 2016 05:16:11 +0800 (CST)
ssh: Could not resolve hostname syslog-1: Temporary failure in name resolution
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]
rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]
I have since found a fix. Many thanks to the anonymous experts around the web. I am writing this up in the hope that it helps others.
=====================================================================
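Attempt 1 (did not help): raise the connection limit. See http://www.cnblogs.com/zhzhang/p/6762239.html for how to configure the open-file limit.
Attempt 2 (did not help): increase oplogSize from the previous 1 GB to 10 GB. See the MongoDB documentation: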
https://docs.mongodb.com/manual/tutorial/change-oplog-size/
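Attempt 3 (did not help): this collection only has the _id index, so I tried building another index. That build also crashed at around 50%.
On top of that, MongoDB by default resumes the interrupted index build on restart, so you have to set the following explicitly in the config file
storage:
  indexBuildRetry: false
before restarting. Otherwise you end up in an endless loop of build index, crash, restart, build index.
Attempt 4 (did not help): I started thinking about how to split the collection up. It happens to have no user-created indexes. Wait, indexes? That pointed me to what looked like a lifeline: the built-in ObjectId (_id) index.
Start a mongod instance on this server or on another one, then use the built-in index to clone the data out into several collections split by time. (An ObjectId takes 12 bytes of storage; written with two hexadecimal digits per byte, it is a 24-character string.)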
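Splitting by time works because the first 4 bytes of an ObjectId are its creation time in seconds since the Unix epoch. A minimal sketch in plain JavaScript (no driver needed), assuming the ObjectId is available as its 24-character hex string; the helper names `objectIdSeconds` and `objectIdLowerBound` are made up for illustration:

```javascript
// Read the creation time (seconds since the Unix epoch) stored in the
// first 4 bytes, i.e. the first 8 hex characters, of an ObjectId.
function objectIdSeconds(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  return parseInt(hex.slice(0, 8), 16);
}

// Build the smallest ObjectId hex string for a given date:
// 4 timestamp bytes followed by 8 zero bytes.
function objectIdLowerBound(date) {
  const seconds = Math.floor(date.getTime() / 1000);
  return seconds.toString(16).padStart(8, "0") + "0".repeat(16);
}

// The two boundary values used in the clone query of this post.
const rangeStart = objectIdSeconds("583aa21d382653813be7c18d");
const rangeEnd = objectIdSeconds("587aa21d382653813be7c18d");

console.log(new Date(rangeStart * 1000).toISOString());
console.log(new Date(rangeEnd * 1000).toISOString());
console.log(((rangeEnd - rangeStart) / 86400).toFixed(1) + " days covered");
```

The string returned by `objectIdLowerBound` can be wrapped in `ObjectId("...")` in the mongo shell to build `$gt` / `$lte` boundaries for a date of your choosing.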
How to clone: in the newly created mongo instance
db.runCommand({cloneCollection: "db.abc", from: "syslog-1:27017", query: {"_id": {$gt: ObjectId("583aa21d382653813be7c18d"),$lte: ObjectId("587aa21d382653813be7c18d")}}})
db.getCollection("abc").renameCollection("abc_587")
In the original instance:
db.runCommand({cloneCollection: "db.abc_587", from: "syslog-3:37017", query: {}})
Once you have checked that everything is fine, you can delete the corresponding data from the huge collection:
db.abc.remove({"_id": {$gt: ObjectId("583aa21d382653813be7c18d"),$lte: ObjectId("587aa21d382653813be7c18d")}})
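I have to complain: bulk-deleting documents from a mongo collection is really slow. Deleting tens of millions of documents will probably take dozens of hours, at about 50,000 documents per minute on average.
When the machine is not under heavy load, it seems to reach 30,000 to 40,000 per second
Waiting, waiting, waiting...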
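To get a feel for the waiting time, here is a small back-of-the-envelope sketch. The rate of 50,000 documents per minute is the figure quoted above; the document count of 50 million is a hypothetical stand-in for "tens of millions", and `estimateDeleteHours` is an illustrative helper name:

```javascript
// Estimate how many hours a bulk delete takes at a steady rate.
function estimateDeleteHours(docCount, docsPerMinute) {
  if (docsPerMinute <= 0) {
    throw new Error("docsPerMinute must be positive");
  }
  return docCount / docsPerMinute / 60;
}

// Hypothetical count of 50 million documents at the observed 50,000 per minute.
const hours = estimateDeleteHours(50000000, 50000);
console.log(hours.toFixed(1) + " hours");
```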
Attempt 5 (verified to work, and also the root cause of the problem):
Take another look at the detailed error log
2017-05-04T19:14:06.533+0800 I NETWORK [initandlisten] connection accepted from 127.0.0.1:56728 #35 (11 connections now open)
2017-05-04T19:14:06.545+0800 I NETWORK [conn35] end connection 127.0.0.1:56728 (10 connections now open)
2017-05-04T19:14:06.550+0800 I NETWORK [initandlisten] connection accepted from 127.0.0.1:56730 #36 (11 connections now open)
2017-05-04T19:14:06.563+0800 I NETWORK [conn36] end connection 127.0.0.1:56730 (10 connections now open)
2017-05-04T19:14:06.818+0800 I NETWORK [initandlisten] connection accepted from 127.0.0.1:56731 #37 (11 connections now open)
2017-05-04T19:14:06.831+0800 I NETWORK [conn37] end connection 127.0.0.1:56731 (10 connections now open)
2017-05-04T19:14:06.837+0800 I NETWORK [initandlisten] connection accepted from 127.0.0.1:56732 #38 (11 connections now open)
2017-05-04T19:14:06.870+0800 I NETWORK [conn38] end connection 127.0.0.1:56732 (10 connections now open)
2017-05-04T19:14:24.465+0800 I COMMAND [conn24] query service.client_agent query: { $query: {}, $orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } cursorid:51274086361 ntoreturn:0 ntoskip:11350867 keysExamined:11350968 docsExamined:11350968 keyUpdates:0 writeConflicts:0 numYields:88679 nreturned:101 reslen:12747 locks:{ Global: { acquireCount: { r: 177360 } }, Database: { acquireCount: { r: 88680 } }, Collection: { acquireCount: { r: 88680 } } } 10756ms
2017-05-04T19:14:24.531+0800 E STORAGE [conn24] WiredTiger (0) [1493896464:531510][14803:0x7f40712c3700], file:collection-4--1812812328855925336.wt, WT_CURSOR.search: read checksum error for 8192B block at offset 3412144128: block header checksum of 943205936 doesn't match expected checksum of 3037857471
2017-05-04T19:14:24.531+0800 E STORAGE [conn24] WiredTiger (0) [1493896464:531635][14803:0x7f40712c3700], file:collection-4--1812812328855925336.wt, WT_CURSOR.search: collection-4--1812812328855925336.wt: encountered an illegal file format or internal value
2017-05-04T19:14:24.531+0800 E STORAGE [conn24] WiredTiger (-31804) [1493896464:531656][14803:0x7f40712c3700], file:collection-4--1812812328855925336.wt, WT_CURSOR.search: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-05-04T19:14:24.531+0800 I - [conn24] Fatal Assertion 28558
2017-05-04T19:14:24.531+0800 I - [conn24] ***aborting after fassert() failure
2017-05-04T19:14:24.599+0800 I - [WTJournalFlusher] Fatal Assertion 28559
2017-05-04T19:14:24.599+0800 I - [WTJournalFlusher] ***aborting after fassert() failure
2017-05-04T19:14:24.609+0800 F - [conn24] Got signal: 6 (Aborted).
0x1304482 0x13033a9 0x1303bb2 0x7f40828dd7e0 0x7f408256c625 0x7f408256de05 0x128a472 0x1072bb3 0x1a7945c 0x1a7991d 0x1a79d04 0x19acfb7 0x19c9c85 0x19cf380 0x19f0207 0x19ba7a8 0x1a0c71c 0x1067a83 0xbdc2c9 0xb9916e 0xbbf2b5 0xdee255 0xdee919 0xdaaf72 0xdab66d 0xc82da9 0xc89075 0x94ed5c 0x12aea65 0x7f40828d5aa1 0x7f408262293d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"F04482","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"F033A9"},{"b":"400000","o":"F03BB2"},{"b":"7F40828CE000","o":"F7E0"},{"b":"7F408253A000","o":"32625","s":"gsignal"},{"b":"7F408253A000","o":"33E05","s":"abort"},{"b":"400000","o":"E8A472","s":"_ZN5mongo13fassertFailedEi"},{"b":"400000","o":"C72BB3"},{"b":"400000","o":"167945C","s":"__wt_eventv"},{"b":"400000","o":"167991D","s":"__wt_err"},{"b":"400000","o":"1679D04","s":"__wt_panic"},{"b":"400000","o":"15ACFB7","s":"__wt_bm_read"},{"b":"400000","o":"15C9C85","s":"__wt_bt_read"},{"b":"400000","o":"15CF380","s":"__wt_page_in_func"},{"b":"400000","o":"15F0207","s":"__wt_row_search"},{"b":"400000","o":"15BA7A8","s":"__wt_btcur_search"},{"b":"400000","o":"160C71C"},{"b":"400000","o":"C67A83","s":"_ZN5mongo21WiredTigerRecordStore6Cursor9seekExactERKNS_8RecordIdE"},{"b":"400000","o":"7DC2C9","s":"_ZN5mongo16WorkingSetCommon5fetchEPNS_16OperationContextEPNS_10WorkingSetEmNS_11unowned_ptrINS_20SeekableRecordCursorEEE"},{"b":"400000","o":"79916E","s":"_ZN5mongo10FetchStage4workEPm"},{"b":"400000","o":"7BF2B5","s":"_ZN5mongo9SkipStage4workEPm"},{"b":"400000","o":"9EE255","s":"_ZN5mongo12PlanExecutor11getNextImplEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE"},{"b":"400000","o":"9EE919","s":"_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE"},{"b":"400000","o":"9AAF72"},{"b":"400000","o":"9AB66D","s":"_ZN5mongo7getMoreEPNS_16OperationContextEPKcixPbS4_"},{"b":"400000","o":"882DA9","s":"_ZN5mongo15receivedGetMoreEPNS_16OperationContextERNS_10DbResponseERNS_7MessageERNS_5CurOpE"},{"b":"400000","o":"889075","s":"_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE"},{"b":"400000","o":"54ED5C","s":"_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE"},{"b":"400000","o":"EAEA65","s":"_ZN5mongo17PortMessageServer17handleIncomingMsgEPv"},{"b":"7F40828CE000","o":"7AA1"},{"b":"7F408253A000","o":"E893D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.7", "gitVersion" : "4249c1d2b5999ebbf1fdf3bc0e0e3b3ff5c0aaf2", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "2.6.32-504.el6.x86_64", "version" : "#1 SMP Wed Oct 15 04:27:16 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "384A822B93AE1E4CFE393F0CECA08575DF6EB381" }, { "b" : "7FFF67CB8000", "elfType" : 3, "buildId" : "08E42C6C3D2CD1E5D68A43B717C9EB3D310F2DF0" }, { "b" : "7F4083775000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "B84C31B86733DE212F6886FE6F55630FE56180A9" }, { "b" : "7F4083391000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "A30A68D2F579614CBEA988BDAAC20CD56D8C48FC" }, { "b" : "7F4083189000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "95159178F1A4A3DBDC7819FBEA2C80E5FCDD6BAC" }, { "b" : "7F4082F85000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "29B61382141595ECBA6576232E44F2310C3AAB72" }, { "b" : "7F4082D01000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "989FE3A42CA8CEBDCC185A743896F23A0CF537ED" }, { "b" : "7F4082AEB000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "9350579A4970FA47F3144AD8F40B183B0954497D" }, { "b" : "7F40828CE000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "C56DD1B811FC0D9263248EBB308C73FCBCD80FC1" }, { "b" : "7F408253A000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "8E6FA4C4B0594C355C1B90C1D49990368C81A040" }, { "b" : "7F40839E1000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "959C5E10A47EE8A633E7681B64B4B9F74E242ED5" }, { "b" : "7F40822F6000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "441FA45097A11508E50D55A3D1FF169BF2BE7C62" }, { "b" : "7F408200F000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "F62622218875795666E08B92D176A50791183EEC" }, { "b" : "7F4081E0B000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "152E2C18A7A2145021A8A879A01A82EE134E3946" }, { "b" : "7F4081BDF000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "B8DEDADC140347276164C729418C7A37B7224135" }, { "b" : "7F40819C9000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "b" : "7F40817BE000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "4BDFC7A19C1F328EB4FCFBCE7A1E27606928610D" }, { "b" : "7F40815BB000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "b" : "7F40813A1000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "C39D7FFB49DFB1B55AD09D1D711AD802123F6623" }, { "b" : "7F4081182000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "E6798A06BEE17CF102BBA44FD512FF8B805CEAF1" } ] }}
mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x1304482]
mongod(+0xF033A9) [0x13033a9]
mongod(+0xF03BB2) [0x1303bb2]
libpthread.so.0(+0xF7E0) [0x7f40828dd7e0]
libc.so.6(gsignal+0x35) [0x7f408256c625]
libc.so.6(abort+0x175) [0x7f408256de05]
mongod(_ZN5mongo13fassertFailedEi+0x82) [0x128a472]
mongod(+0xC72BB3) [0x1072bb3]
mongod(__wt_eventv+0x42C) [0x1a7945c]
mongod(__wt_err+0x8D) [0x1a7991d]
mongod(__wt_panic+0x24) [0x1a79d04]
mongod(__wt_bm_read+0x77) [0x19acfb7]
mongod(__wt_bt_read+0x85) [0x19c9c85]
mongod(__wt_page_in_func+0x180) [0x19cf380]
mongod(__wt_row_search+0x677) [0x19f0207]
mongod(__wt_btcur_search+0xB08) [0x19ba7a8]
mongod(+0x160C71C) [0x1a0c71c]
mongod(_ZN5mongo21WiredTigerRecordStore6Cursor9seekExactERKNS_8RecordIdE+0x53) [0x1067a83]
mongod(_ZN5mongo16WorkingSetCommon5fetchEPNS_16OperationContextEPNS_10WorkingSetEmNS_11unowned_ptrINS_20SeekableRecordCursorEEE+0x99) [0xbdc2c9]
mongod(_ZN5mongo10FetchStage4workEPm+0x2FE) [0xb9916e]
mongod(_ZN5mongo9SkipStage4workEPm+0x45) [0xbbf2b5]
mongod(_ZN5mongo12PlanExecutor11getNextImplEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0x275) [0xdee255]
mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x39) [0xdee919]
mongod(+0x9AAF72) [0xdaaf72]
mongod(_ZN5mongo7getMoreEPNS_16OperationContextEPKcixPbS4_+0x52D) [0xdab66d]
mongod(_ZN5mongo15receivedGetMoreEPNS_16OperationContextERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1A9) [0xc82da9]
mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xE35) [0xc89075]
mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0xEC) [0x94ed5c]
mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x325) [0x12aea65]
libpthread.so.0(+0x7AA1) [0x7f40828d5aa1]
libc.so.6(clone+0x6D) [0x7f408262293d]
----- END BACKTRACE -----
What? The file itself is corrupted. When on earth did I ever touch this file...
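The lines that matter in that wall of output are the WiredTiger "read checksum error" entries, since they name the damaged file. A minimal sketch for pulling those details out of a mongod log line, assuming the message has the same shape as the one above; the regular expression and the helper name `parseChecksumError` are illustrative and not part of any MongoDB tool:

```javascript
// Matches the WiredTiger checksum message as it appears in the log above:
// "file:<name>, <operation>: read checksum error for <N>B block at offset <M>"
const CHECKSUM_ERROR =
  /file:([^,\s]+), \S+: read checksum error for (\d+)B block at offset (\d+)/;

// Return the damaged file, block size and offset, or null if the line
// is not a checksum error.
function parseChecksumError(line) {
  const match = CHECKSUM_ERROR.exec(line);
  if (match === null) {
    return null;
  }
  return {
    file: match[1],
    blockBytes: Number(match[2]),
    offset: Number(match[3]),
  };
}

const sample =
  "2017-05-04T19:14:24.531+0800 E STORAGE [conn24] WiredTiger (0) " +
  "[1493896464:531510][14803:0x7f40712c3700], " +
  "file:collection-4--1812812328855925336.wt, WT_CURSOR.search: " +
  "read checksum error for 8192B block at offset 3412144128: " +
  "block header checksum of 943205936 doesn't match expected checksum of 3037857471";

console.log(parseChecksumError(sample));
```

Running this over the whole log (one call per line) gives a quick list of which .wt files need salvaging.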
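Nothing for it but to keep fixing the problem...
The core steps of the fix come next...
1. Download and install the required software
wget http://source.wiredtiger.com/releases/wiredtiger-2.7.0.tar.bz2
tar xvf wiredtiger-2.7.0.tar.bz2
cd wiredtiger-2.7.0
sudo apt-get install libsnappy-dev build-essential
./configure --enable-snappy
make
2. Make a copy of the problematic .wt file (to find out which file backs a collection, query the collection as follows)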
db.abc.stats().wiredTiger.uri
3. Salvage the broken collection
./wt -v -h ../mongo-bak -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R salvage collection-2657--1723320556100349955.wt
4. Import the .wt file into a MongoDB collection via dump/load (dump first)
./wt -v -h ../data -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R dump -f ../collection.dump collection-2657--1723320556100349955
5. Create a new mongo instance. The goal is to obtain an empty collection that the dumped file can be loaded into
mongod --dbpath tmp-mongo --storageEngine wiredTiger --nojournal
use Recovery
db.borkedCollection.insert({test: 1})
db.borkedCollection.remove({})
db.borkedCollection.stats()
6. Load the collection.dump file generated in step 4 into the data directory of the mongo instance just started (that instance must be stopped first)
For the target collection-****** name, use the wiredTiger.uri of the collection created in step 5, i.e. db.borkedCollection.stats().wiredTiger.uri
./wt -v -h ../data -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R load -f ../collection.dump -r collection-2-880383588247732034
7. Start the mongo instance and log in. The collection appears to be empty
db.borkedCollection.count()
0
8. Yet the following query shows that the content really is there:
db.borkedCollection.find({}, {_id: 1})
9. Use mongodump to dump the collection data
mongodump
10. Use mongorestore to import the collection data into mongo
mongorestore --drop
11. Log in to MongoDB and confirm that the data has been recovered
Because the data file itself was corrupted, some data may be lost. In my case about 100,000 of the 170 million documents were missing, which was acceptable
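Steps 2 and 6 both rely on mapping the `wiredTiger.uri` value from `stats()` to a name on disk. A minimal sketch of that mapping, assuming the value has the form `statistics:table:collection-...` (the shape I would expect from MongoDB 3.2; verify it against your own output). The helper names are made up for illustration:

```javascript
const URI_PREFIX = "statistics:table:";

// Table name as passed to `wt dump` / `wt load -r` (no ".wt" suffix).
function wtTableFromUri(uri) {
  if (!uri.startsWith(URI_PREFIX)) {
    throw new Error("unexpected wiredTiger.uri format: " + uri);
  }
  return uri.slice(URI_PREFIX.length);
}

// File name inside the dbpath, as passed to `wt salvage`.
function wtFileFromUri(uri) {
  return wtTableFromUri(uri) + ".wt";
}

// Example with the damaged file name from the error log in this post.
const uri = "statistics:table:collection-4--1812812328855925336";
console.log(wtTableFromUri(uri));
console.log(wtFileFromUri(uri));
```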
Reference used to solve this problem: http://www.alexbevi.com/blog/2016/02/10/recovering-a-wiredtiger-collection-from-a-corrupt-mongodb-installation/
I hope this is of some help to everyone.
PS: Troubleshooting really is a good way to learn.
Working through this problem taught me quite a lot about MongoDB...