悬崖边上的舞者,记7.2生产数据库灾难事件
懸崖邊上的舞者,記7.2生產數據庫災難事件
作者:張子良
版權所有,轉載請注明出處
引子:出事了
7月2日是一個難得的大晴天,一段時間以來桂林一直在下雨,一直下,害的我減肥的計劃一再的泡湯,因為下雨每天早上的跑步變成了觀雨。和往常一樣,準時來到單位,卻不準時的得到一個消息:昨晚前置生產數據庫出問題了,連接不上。本能的反應,日志滿了,或者服務器交換空間不足。急忙趕到運維間,經詢問夜間值班人員,發現后續的情況是:logsegment日志空間確實是滿了,發現問題后,運維人員采用dump日志的方式進行處理,卻發現dump根本不被執行,無奈最后只得采用增加log設備空間的方式,進行了緊急處理。現在的問題是到底是什么原因引起的log空間的陡然間增加呢?
一、日志分析:發現了問題,卻無法定位?
取了數據庫運行日志,展開分析,發現了以下問題:
00:06:00000:00912:2013/07/02 03:31:53.98 server? Space available in the log segment has fallen critically low in database 'XXXDB'.? All future modifications to this database will be suspended until the log is successfully dumped and space becomes available.
00:01:00000:00046:2013/07/02 03:31:54.08 server? Error: 1105, Severity: 17, State: 4
00:01:00000:00046:2013/07/02 03:31:54.09 server? Can't allocate space for object 'syslogs' in database 'XXXDB' because 'logsegment' segment is full/has no free extents. If you ran out of space in syslogs, dump the transaction log. Otherwise, use ALTER DATABASE to increase the size of the segment.
00:01:00000:00046:2013/07/02 03:31:54.09 server? DUMP TRANSACTION for database 'XXXDB' could not truncate the log. Either extend the log using ALTER DATABASE ... LOG ON command or eliminate the oldest active transaction in database 'frtdb' shown in syslogshold table.
00:01:00000:00046:2013/07/02 03:31:54.09 server? DUMP TRANSACTION for database 'XXXDB' could not truncate the log. Either extend the log using ALTER DATABASE ... LOG ON command or eliminate the oldest active transaction in database 'XXXDB' shown in syslogshold table.
00:01:00000:00046:2013/07/02 03:31:54.09 server? background task message: LOG DUMP:'logsegment' for 'XXXDB' dumped!
問題很明顯:2013/07/02 03:31:53就已經出現了空間不足的情況,2013/07/02 03:31:54.08出現了log空間已滿的情況。2013/07/02 03:31:54.09執行了Dump操作,但是失敗了,無奈那么為什么會失敗呢?又是什么操作導致了日志空間的劇烈增長呢?繼續跟蹤日志。
二、發現死鎖:死鎖會導致日志空間滿嗎
繼續向上追溯發現在此之前曾經發現出現過死鎖的情況,具體日志如下:
00:02:00000:00796:2013/07/01 07:13:24.33 server? Deadlock Id 1 detected
Deadlock Id 1 detected. 1 deadlock chain(s) involved.
Deadlock Id 1: Process (Familyid 0, Spid 1117, Suid 4) was executing a UPDATE command in the procedure 'insprep0' at line 1 at nested level 1.
Deadlock Id 1: Process 1117 was involved in application ''.
Deadlock Id 1: Process 1117 was involved on host name ''.
Deadlock Id 1: Process 1117 was involved in transaction '$user_transaction'.
SQL Text: ?
Executing procedure: *00111700009969_3e59fe
Subordinate SQL Text:XXXXX
Deadlock Id 1: Process (Familyid 0, Spid 796, Suid 4) was executing a FETCH CURSOR command at line -1.
Deadlock Id 1: Process 796 was involved in application ''.
Deadlock Id 1: Process 796 was involved on host name ''.
SQL Text:? XXXXXX
Deadlock Id 1: Process (Familyid 0, Spid 796) was waiting for a 'shared page' lock on page 60607 of table 't_merch_det' in database 'frtdb' but process (Familyid 0, Spid 1117) already held a 'exclusive page' lock on it.
Deadlock Id 1: Process (Familyid 0, Spid 1117) was waiting for a 'exclusive page' lock on page 5866141 of table 't_merch_det' in database 'frtdb' but process (Familyid 0, Spid 796) already held a 'shared page' lock on it.
Deadlock Id 1: Process (Familyid 0, Spid 796) was chosen as the victim.
Victim process host = `', user = `sysadmin' program name = `' host processes = `13697416' .
End of deadlock information.
00:00:00000:01262:2013/07/01 14:49:42.27 server? Deadlock Id 2 detected
Deadlock Id 2 detected. 1 deadlock chain(s) involved.
Deadlock Id 2: Process (Familyid 0, Spid 813, Suid 4) was executing a FETCH CURSOR command at line -1.
Deadlock Id 2: Process 813 was involved in application ''.
Deadlock Id 2: Process 813 was involved on host name ''.
Deadlock Id 2: Process 813 was involved in transaction '$user_transaction'.
SQL Text: XXXXXX
Deadlock Id 2: Process (Familyid 0, Spid 1262, Suid 4) was executing a UPDATE command in the procedure 'insprep0' at line 1 at nested level 1.
Deadlock Id 2: Process 1262 was involved in application ''.
Deadlock Id 2: Process 1262 was involved on host name ''.
Deadlock Id 2: Process 1262 was involved in transaction '$user_transaction'.
SQL Text: ?
Executing procedure: *00126200040493_8f416a
Subordinate SQL Text:? XXXXXXX
Deadlock Id 2: Process (Familyid 0, Spid 1262) was waiting for a 'exclusive row' lock on row 21 page 10073 of the 'XXXXXX' table in database 'frtdb' but process (Familyid 0, Spid 813) already held a 'exclusive row' lock on it.
Deadlock Id 2: Process (Familyid 0, Spid 813) was waiting for a 'shared row' lock on row 11 page 18705629 of the 'XXXXXX' table in database 'frtdb' but process (Familyid 0, Spid 1262) already held a 'exclusive row' lock on it.
Deadlock Id 2: Process (Familyid 0, Spid 1262) was chosen as the victim.
Victim process host = `', user = `sysadmin' program name = `' host processes = `13500790' .
End of deadlock information.
根據分析發現,死鎖的時間發生在2013/07/01 14:49:42.27,而發生死鎖的進程也被終止,如果是由此原因導致的問題的話,爆發時間對不上。根據日志清理的定時計劃,再次階段出現的問題,日志應該已經被清理,所以可以排除掉這種情況。
三、數據庫變更:重啟過,都變更了什么?
繼續向上追溯,發現6.30日數據庫曾經被重啟過,發現數據庫參數也曾經發生過變化:
00:00:00000:00013:2013/06/30 01:10:56.60 server? Completed filling free space info for database 'XXXDB'.
00:00:00000:00013:2013/06/30 01:10:56.61 server? Started cleaning up the default data cache for database 'XXXDB'.
00:00:00000:00013:2013/06/30 01:10:57.04 server? Completed cleaning up the default data cache for database 'XXXDB'.
00:00:00000:00013:2013/06/30 01:10:57.04 server? Checking external objects.
00:00:00000:00013:2013/06/30 01:10:57.46 server? The transaction log in the database 'XXXDB' will use I/O size of 8 Kb.
00:00:00000:00013:2013/06/30 01:10:57.46 server? Database 'XXXDB' is now online.
00:00:00000:00001:2013/06/30 01:10:57.47 server? Recovery has restored the value of 'local async prefetch limit' for '64K' pool in 'default data cache' from '80' to 'DEFAULT'.
00:00:00000:00001:2013/06/30 01:10:57.47 server? Recovery has restored the value of 'local async prefetch limit' for '8K' pool in 'default data cache' from '80' to 'DEFAULT'.
00:00:00000:00001:2013/06/30 01:10:57.57 server? Recovery has restored the original size for '64K' pool and '8K' pool in 'default data cache'.
00:00:00000:00001:2013/06/30 01:10:57.58 server? Recovery complete.
00:00:00000:00001:2013/06/30 01:10:57.59 server? ASE's default unicode sort order is 'binary'.
00:00:00000:00001:2013/06/30 01:10:57.60 server? ASE's default sort order is:
00:00:00000:00001:2013/06/30 01:10:57.60 server ??? ?'nocase_cp936' (ID = 52)
00:00:00000:00001:2013/06/30 01:10:57.60 server? on top of default character set:
00:00:00000:00001:2013/06/30 01:10:57.60 server ??? ?'cp936' (ID = 171).
那么是否是數據庫參數變更導致的問題,這個地方是我重點懷疑的地方,但是始終沒有找到數據支撐,變更這些參數,是否就是導致問題的根源呢?我需要更多的數據的支撐。
四、性能監控:陡然增加的日志空間占用,究竟發生了什么?
? 好在部署了數據庫性能監控監控軟件,通過分析性能監控圖表,發現7.2凌晨2點至3點半,數據庫log空間出現了激增,空間占用率呈直線60度上升,直至100%。那么可以肯定的是,在2點的時候,一定是有什么大事務的數據庫操作?那么究竟是什么操作呢?
五、柳暗花明:日志,還是日志
既然數據庫本身的日志無法滿足分析的需要,那么前置平臺日志,肯定會有相關的記錄,跟蹤前置日志,終于發現,7.2日凌晨2點的-3點半期間確實有一個大的數據庫事務發生,那就是從當前交易流水表復制交易記錄到歷史交易記錄表,記錄數1100w條,單表需要數據庫存儲空間5G,執行完成插入操作后,需要進行后續的刪除操作,而恰在此時時出現了日志空間不足的情況。新的疑問就出現了,為何這個問題之前沒有發生呢?
六、歷史問題:為何以前沒有發生問題呢
通過跟蹤6月30日以前的日志發現,執行當前表到歷史表的數據拷貝的操作,都是失敗的。直接返回-1,失敗的原因是由于當前數據庫可用空間不足或者其他原因,正在跟蹤分析中。。。
?
?
總結
以上是生活随笔為你收集整理的悬崖边上的舞者,记7.2生产数据库灾难事件的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: Java学习笔记50:JSONObjec
- 下一篇: Windows2003 SQL2005解