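A crash tool version problem on SUSE 11 SP1
Over the past few years I have investigated quite a few crashes and thought I had seen most kinds. On this one, though, I took detours I should not have taken, which looks rather amateurish in hindsight. I am writing it down to keep myself from repeating the mistake and as a reminder to others.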
The kernel is SUSE 11 SP1:

    uname -a
    Linux Ftp1 2.6.32.59-0.7-default #1 SMP 2012-07-13 15:50:56 +0200 x86_64 x86_64 x86_64 GNU/Linux

The crash directory contains three files:

    README.txt  vmcore  vmlinux-2.6.32.59-0.7-default

The routine first step is to build vmlinux and then open the dump with crash:
    A10111916:~ # crash /home/caq/vmlinux /home/zxin11/vmcore

    crash 4.0-7.6          ---------- old version
    Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008  Red Hat, Inc.
    Copyright (C) 2004, 2005, 2006  IBM Corporation
    Copyright (C) 1999-2006  Hewlett-Packard Co
    Copyright (C) 2005, 2006  Fujitsu Limited
    Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
    Copyright (C) 2005  NEC Corporation
    Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
    Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
    This program is free software, covered by the GNU General Public License,
    and you are welcome to change it and/or distribute copies of it under
    certain conditions.  Enter "help copying" to see the conditions.
    This program has absolutely no warranty.  Enter "help warranty" for details.

    GNU gdb 6.1
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "x86_64-unknown-linux-gnu"...

    crash: invalid structure size: x8664_pda
           FILE: x86_64.c  LINE: 561  FUNCTION: x86_64_cpu_pda_init()

    [/usr/bin/crash] error trace: 535689 => 4569bd => 4cb321 => 4e5300

      4e5300: SIZE_verify+224
      4cb321: x86_64_init+1681
      4569bd: main_loop+93
      535689: (undetermined)

My first thought was that the vmcore had been damaged in transit, so I compared the vmcore on the production machine with the copy: same size, same md5. Next I checked the vmlinux I had built, mainly the .config file and whether the gcc version in the build environment matched the gcc used for the production kernel. Nothing wrong there either. Only after a good while did I start to suspect the version of crash itself. To test that idea I copied vmlinux to the production machine, where crash is version 5.0.1, and there it did not report the error. So the problem really was tied to the crash version. Lesson learned: there are only three pieces involved (crash, vmlinux, vmcore). If parsing fails while vmlinux is built correctly and vmcore is intact, check the crash version carefully.
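The size-and-md5 comparison of the two vmcore copies described above could be scripted roughly as follows. This is a minimal sketch: the function names are illustrative and not part of any tool, and reading in 1 MiB chunks is an assumption made so that a multi-gigabyte vmcore is never loaded into memory at once.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_content(path_a, path_b):
    """True if both files have the same size and the same MD5 digest."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # cheap check first; no need to hash
    return file_md5(path_a) == file_md5(path_b)
```

A matching digest only rules out a damaged copy. As this case shows, it says nothing about whether vmlinux or the crash version fits the dump.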
    crash 5.0.1          ---------- version shipped with the OS
    Copyright (C) 2002-2010  Red Hat, Inc.
    Copyright (C) 2004, 2005, 2006  IBM Corporation
    Copyright (C) 1999-2006  Hewlett-Packard Co
    Copyright (C) 2005, 2006  Fujitsu Limited
    Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
    Copyright (C) 2005  NEC Corporation
    Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
    Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
    This program is free software, covered by the GNU General Public License,
    and you are welcome to change it and/or distribute copies of it under
    certain conditions.  Enter "help copying" to see the conditions.
    This program has absolutely no warranty.  Enter "help warranty" for details.

    GNU gdb (GDB) 7.0
    Copyright (C) 2009 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-unknown-linux-gnu"...

          KERNEL: vmlinux
        DUMPFILE: vmcore
            CPUS: 48
            DATE: Thu Feb 28 23:10:39 2019
          UPTIME: 71 days, 17:26:09
    LOAD AVERAGE: 0.08, 0.13, 0.10
           TASKS: 866
        NODENAME: Ftp1
         RELEASE: 2.6.32.59-0.7-default
         VERSION: #1 SMP 2012-07-13 15:50:56 +0200
         MACHINE: x86_64  (1861 Mhz)
          MEMORY: 32 GB
           PANIC: "[6186227.149497] Oops: 0000 [#1] SMP " (check log for details)
             PID: 0
         COMMAND: "swapper"
            TASK: ffffffff8180c020  (1 of 48)  [THREAD_INFO: ffffffff81800000]
             CPU: 0
           STATE: TASK_RUNNING (ACTIVE)
    WARNING: panic task not found

Surprisingly, it reports "panic task not found". A typical crash session looks like the following, whereas this one carries an extra warning:

           PANIC: "[335750.721156] Oops: 0002 [#1] SMP " (check log for details)
             PID: 6879
         COMMAND: "bash"
            TASK: ffff88031b886380  [THREAD_INFO: ffff880319958000]
             CPU: 1
           STATE: TASK_RUNNING (PANIC)

Still, this was progress compared with crash 4.0-7.6, which I took as a good sign. I downloaded the crash 5.0.1 source to check and concluded that the warning did not matter much. But I made a fatal mistake: I did not read this line carefully:

           PANIC: "[6186227.149497] Oops: 0000 [#1] SMP " (check log for details)

With somewhat newer kernels, dmesg.txt is written out as a separate file, and that file is at least enough to identify the panic stack quickly. Here the PANIC line explicitly asks me to look at the output of the log command, and I did not. Because there were not many tasks, I went straight to reading the stack of each process, which led to another detour. Two stacks looked suspicious:
    PID: 44451  TASK: ffff88067bbc6080  CPU: 8  COMMAND: "SMSvr"
     #0 [ffff88067dbf5dc8] schedule at ffffffff813923c4
     #1 [ffff88067dbf5de0] sys_reboot at ffffffff8105e00d
     #2 [ffff88067dbf5e60] do_notify_resume at ffffffff810028c5
     #3 [ffff88067dbf5f30] sys_rt_sigreturn at ffffffff81002aa8
     #4 [ffff88067dbf5f50] ptregscall_common at ffffffff81003216
        RIP: 00007f017b6e8efd  RSP: 00007f0178b71dc0  RFLAGS: 00000293
        RAX: fffffffffffffdfc  RBX: 0000000000000000  RCX: ffffffffffffffff
        RDX: 0000000000000000  RSI: 00007f0178b71df0  RDI: 00007f0178b71df0
        RBP: 00007f0178b71e00   R8: fefefefefefeffff   R9: 0000000000000001
        R10: 0000000000000800  R11: 0000000000000293  R12: 00007fff853b82f0
        R13: 00007f0178b72000  R14: 0000000000000003  R15: 0000000000001000
        ORIG_RAX: 0000000000000023  CS: 0033  SS: 002b

This backtrace contains a call to sys_reboot, of all things. I had never seen a reboot cause a panic. Not ready to give up, I disassembled sys_reboot, with this result:
    crash> dis -l sys_reboot
    /home/caq/usr/src/linux-2.6.32.59-0.7/kernel/sys.c: 362
    0xffffffff8105df50 <sys_reboot>:        test   %edi,0x1(%rbx)
    0xffffffff8105df53 <sys_reboot+3>:      add    %al,(%rax)
    0xffffffff8105df55 <sys_reboot+5>:      cmp    $0x1f,%ebx
    0xffffffff8105df58 <sys_reboot+8>:      jg     0xffffffff8105df69 <sys_reboot+25>
    0xffffffff8105df5a <sys_reboot+10>:     lea    -0x1(%rbx),%ecx
    0xffffffff8105df5d <sys_reboot+13>:     mov    $0x8430000,%eax
    /home/caq/usr/src/linux-2.6.32.59-0.7/kernel/sys.c: 367
    0xffffffff8105df62 <sys_reboot+18>:     shr    %cl,%rax
    0xffffffff8105df65 <sys_reboot+21>:     test   $0x1,%al
    /home/caq/usr/src/linux-2.6.32.59-0.7/kernel/sys.c: 362

This disassembly is obviously wrong. The reboot code compares its arguments against a number of magic values, one per case, yet there is no cmp against any of them here, and the function does not even set up a stack frame on entry. I immediately became suspicious of this crash's output again. In principle, crash takes the address of a symbol from vmlinux and then displays whatever is at that address in vmcore, so garbage here means that vmcore and vmlinux still do not correspond. Yet the crash tool gave no hint of that (I have seen it report such mismatches before, with something like "WARNING: kernel version inconsistency between vmlinux and dumpfile").

To verify this, I looked up sys_reboot in the vmlinux I had built:

    linux-h9c2:/home/caq # objdump -d vmlinux >caq.txt
    linux-h9c2:/home/caq # grep sys_reboot caq.txt
    ffffffff8105df50 <sys_reboot>:
    linux-h9c2:/home/caq # nm vmlinux |grep -i sys_reboot
    ffffffff8105df50 T sys_reboot

The address is ffffffff8105df50. crash used exactly this address to look up sys_reboot, yet what it printed is not the disassembly of sys_reboot. The crash tool cannot be making such a basic mistake, so vmlinux and vmcore still do not match.
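One way to cross-check whether a vmlinux and a vmcore belong to the same build is to compare the "Linux version ..." banner strings embedded in each file. The sketch below is an illustration under stated assumptions, not something taken from the crash tool: the helper names are hypothetical, and it assumes the dump is uncompressed so that the banner is present as plain ASCII. For a compressed dump the scan may simply find nothing.

```python
import re

# Printable ASCII run starting with the kernel banner prefix.
BANNER_RE = re.compile(rb"Linux version [ -~]{10,400}")


def kernel_banners(path, chunk_size=1 << 20, overlap=512):
    """Collect 'Linux version ...' strings found in a binary file.

    The file is scanned in chunks; `overlap` bytes are carried over so a
    banner split across two chunks is still seen whole. `overlap` must be
    larger than the longest possible match (414 bytes here).
    """
    found = set()
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            data = tail + chunk
            for m in BANNER_RE.finditer(data):
                # A match touching the end of the buffer may be cut off;
                # skip it now, it is rescanned in full on the next round.
                if m.end() == len(data) and not at_eof:
                    continue
                found.add(m.group().decode("ascii"))
            if at_eof:
                break
            tail = data[-overlap:]
    return found


def banners_match(vmlinux_path, vmcore_path):
    """True if the two files share at least one identical kernel banner."""
    return bool(kernel_banners(vmlinux_path) & kernel_banners(vmcore_path))
```

Because the banner normally includes the compiler version and build timestamp, two builds of the same source tree made at different times would be expected to differ here even though their release strings are identical.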
A reboot call and a panic should have nothing to do with each other anyway, so I abandoned this path. Since sys_reboot was wrong, every backtrace might be wrong as well.
That left pid 38021.
    crash> bt -f 38021
    PID: 38021  TASK: ffff88003531c340  CPU: 2  COMMAND: "sh"
     #0 [ffff880476051de8] schedule at ffffffff813923c4
        ffff880476051df0: 0000000000000000 0000000000000000
        ffff880476051e00: 0000000000000000 0000000000000000
        ffff880476051e10: 0000000000000000 0000000000000000
        ffff880476051e20: 0000000000000000 0000000000000000
        ffff880476051e30: 0000000000000000 0000000000000000
        ffff880476051e40: 0000000000000000 0000000000000000
        ffff880476051e50: 0000000000000000 0000000000000000
        ffff880476051e60: 0000000000000000 0000000000000000
        ffff880476051e70: 0000000000000000 0000000000000000
        ffff880476051e80: 0000000000000000 0000000000000000
        ffff880476051e90: 0000000000000000 0000000000000000
        ffff880476051ea0: 0000000000000000 0000000000000000
        ffff880476051eb0: 0000000000000000 0000000000000000
        ffff880476051ec0: 0000000000000000 0000000000000000
        ffff880476051ed0: 0000000000000000 0000000000000000
        ffff880476051ee0: 0000000000000000 0000000000000000
        ffff880476051ef0: 0000000000000000 0000000000000000
        ffff880476051f00: 0000000000000000 0000000000000000
        ffff880476051f10: 0000000000000000 0000000000000000
        ffff880476051f20: 0000000000000000 0000000000000000
        ffff880476051f30: 00000000006c9870 ffff88027dd62480
        ffff880476051f40: ffff88084c3a8d40 0000000000000000
        ffff880476051f50: 00000000006a0dd0 00007fffbc69e690
        ffff880476051f60: 0000000000000441 00000000006d3040
        ffff880476051f70: 0000000000000003 00000000006d3ba0
        ffff880476051f80: ffffffff81002f7b
     #1 [ffff880476051f80] auditsys at ffffffff81002f7b
        RIP: 00007fb09a95b4f0  RSP: 00007fffbc69e6c0  RFLAGS: 00010202
        RAX: 0000000000000002  RBX: ffffffff81002f7b  RCX: 0000000000000000
        RDX: 00000000000001b6  RSI: 0000000000000441  RDI: 00000000006d3040
        RBP: 00000000006d3ba0   R8: 0000000000000020   R9: 6c6568732f6d732f
        R10: 0000000000000020  R11: 0000000000000246  R12: 0000000000000003
        R13: 00000000006d3040  R14: 0000000000000441  R15: 00007fffbc69e690
        ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

This stack does not look right either. auditsys is not a system call entry point. The outermost kernel frame should normally be the familiar system_call_fastpath. I looked the address up directly on the production machine:
    Ftp1:/home # grep ffffffff81002f /proc/kallsyms
    ffffffff81002f00 T system_call_after_swapgs
    ffffffff81002f65 t system_call_fastpath
    ffffffff81002f80 t ret_from_sys_call
    ffffffff81002f85 t sysret_check
    ffffffff81002fd8 t sysret_careful
    ffffffff81002fe8 t sysret_signal

So ffffffff81002f7b actually falls inside the address range of system_call_fastpath.
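The lookup done by eye above (find the last symbol whose address is not greater than the target) can be sketched as follows. The function names are illustrative; the symbol table is the /proc/kallsyms excerpt quoted in this post.

```python
import bisect


def parse_kallsyms(text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return (name, offset) for the closest symbol at or below address."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    base, name = symbols[index]
    return name, address - base


KALLSYMS = """\
ffffffff81002f00 T system_call_after_swapgs
ffffffff81002f65 t system_call_fastpath
ffffffff81002f80 t ret_from_sys_call
ffffffff81002f85 t sysret_check
ffffffff81002fd8 t sysret_careful
ffffffff81002fe8 t sysret_signal
"""

name, offset = resolve(parse_kallsyms(KALLSYMS), 0xFFFFFFFF81002F7B)
print(f"{name}+{offset:#x}")  # system_call_fastpath+0x16
```

A tool working from a symbol table that does not match the dumped kernel performs the same search on the wrong addresses, which is one plausible way to end up with a label such as auditsys for this return address.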
So this crash tool is unusable: its symbol mapping is wrong. I found a newer crash, version 7.0.9:

    crash 7.0.9          ---------- newer version
    Copyright (C) 2002-2014  Red Hat, Inc.
    Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
    Copyright (C) 1999-2006  Hewlett-Packard Co
    Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
    Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
    Copyright (C) 2005, 2011  NEC Corporation
    Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
    Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
    This program is free software, covered by the GNU General Public License,
    and you are welcome to change it and/or distribute copies of it under
    certain conditions.  Enter "help copying" to see the conditions.
    This program has absolutely no warranty.  Enter "help warranty" for details.

    crash: vmlinux: no .gnu_debuglink section
    GNU gdb (GDB) 7.6
    Copyright (C) 2013 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-unknown-linux-gnu"...

    WARNING: kernel version inconsistency between vmlinux and dumpfile     ---------- a warning this time

          KERNEL: vmlinux
        DUMPFILE: vmcore
            CPUS: 48
            DATE: Thu Feb 28 23:10:39 2019
          UPTIME: 71 days, 17:26:09
    LOAD AVERAGE: 0.08, 0.13, 0.10
           TASKS: 867
        NODENAME: Ftp1
         RELEASE: 2.6.32.59-0.7-default
         VERSION: #1 SMP 2012-07-13 15:50:56 +0200
         MACHINE: x86_64  (1861 Mhz)
          MEMORY: 32 GB
           PANIC: "[6186227.149497] Oops: 0000 [#1] SMP " (check log for details)
             PID: 38021
         COMMAND: "sh"     ---------- the panic task is found; more reliable than the previous version
            TASK: ffff88003531c340  [THREAD_INFO: ffff880476050000]
             CPU: 2
           STATE: TASK_RUNNING (PANIC)

With crash 7.0.9 in place, I ran the log command. The relevant part of the log shows:
    [6186227.149460] BUG: unable to handle kernel NULL pointer dereference at (null)
    [6186227.149479] IP: [<ffffffff811e7752>] strlen+0x2/0x30
    [6186227.149492] PGD 47b9be067 PUD 42e601067 PMD 0
    [6186227.149497] Oops: 0000 [#1] SMP
    [6186227.149502] last sysfs file: /sys/devices/pci0000:40/0000:40:07.0/0000:45:00.1/host4/rport-4:0-0/target4:0:0/4:0:0:0/state
    [6186227.149510] CPU 2
    [6186227.149513] Modules linked in: secureProof(N) iptable_filter ip_tables x_tables dm_round_robin dm_multipath scsi_dh ipv6 bonding microcode fuse loop dm_mod tpm_tis dcdbas(X) tpm qla2xxx usbhid tpm_bios hid iTCO_wdt scsi_transport_fc iTCO_vendor_support serio_raw sr_mod scsi_tgt ses cdrom pcspkr enclosure bnx2 sg rtc_cmos rtc_core rtc_lib wmi power_meter button uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan processor ide_pci_generic ide_core ata_generic ata_piix libata megaraid_sas thermal thermal_sys hwmon mpdh(N) mpdt(N) scsi_mod [last unloaded: secureProof]
    [6186227.149571] Supported: Yes
    [6186227.149577] Pid: 38021, comm: sh Tainted: G NX 2.6.32.59-0.7-default #1 PowerEdge R910
    [6186227.149582] RIP: 0010:[<ffffffff811e7752>]  [<ffffffff811e7752>] strlen+0x2/0x30
    [6186227.149588] RSP: 0018:ffff880476051280  EFLAGS: 00010246
    [6186227.149592] RAX: 0000000000000000 RBX: ffff8805f94ec000 RCX: 0000000000000000
    [6186227.149596] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
    [6186227.149600] RBP: 0000000000000000 R08: ffff8804760511f8 R09: ffffffff81539570
    [6186227.149604] R10: 0000000000000020 R11: 0000000000000fff R12: ffff880476051d38
    [6186227.149608] R13: ffff88067bbc6080 R14: ffffffffa03d8f79 R15: 0000000000000000
    [6186227.149612] FS:  00007fb09b21e700(0000) GS:ffff880487400000(0000) knlGS:0000000000000000
    [6186227.149617] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [6186227.149621] CR2: 0000000000000000 CR3: 0000000473f4d000 CR4: 00000000000006e0
    [6186227.149625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [6186227.149629] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [6186227.149634] Process sh (pid: 38021, threadinfo ffff880476050000, task ffff88003531c340)
    [6186227.149638] Stack:
    [6186227.149640]  ffffffffa03d5768 ffffffffa03d8e4c ffff88067bbc6080 ffff880476051d38
    [6186227.149645] <0> ffff8804760514c8 0000000000000019 ffffffffa03d5b53 ffff880476051d58
    [6186227.149650] <0> ffffffffa03d8e4c 787a2f656d6f682f 7374642f30316e69 76534d532f6d732f
    [6186227.149657] Call Trace:
    [6186227.149678]  [<ffffffffa03d5768>] getprocpath+0xa8/0x150 [secureProof]
    [6186227.149701]  [<ffffffffa03d5b53>] checkTrustProc+0x83/0x270 [secureProof]
    [6186227.149710]  [<ffffffffa03d66ca>] checkProcAndFile+0x3da/0x890 [secureProof]
    [6186227.149720]  [<ffffffffa03d7aba>] our_sys_open+0xfa/0x1d0 [secureProof]     ---------- open, as taken over by our module
    [6186227.149736]  [<ffffffff81002f7b>] system_call_fastpath+0x16/0x1b
    [6186227.149745]  [<00007fb09a95b4f0>] 0x7fb09a95b4f0
    [6186227.149749] Code: 00 48 83 c7 01 0f b6 07 84 c0 74 0c 0f b6 c0 f6 80 a0 08 85 81 20 75 e9 48 89 f8 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 31 c0 <80> 3f 00 48 89 fa 74 15 66 0f 1f 44 00 00 48 83 c2 01 80 3a 00
    [6186227.149779] RIP  [<ffffffff811e7752>] strlen+0x2/0x30
    [6186227.149784]  RSP <ffff880476051280>
    [6186227.149787] CR2: 0000000000000000

This output agrees with the panic task that crash found: the sh process, pid 38021.
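The consistency check between the log and the crash summary (does the PID and command in the oops agree with the panic task crash reports?) can be sketched like this. The helper names are hypothetical, and the regular expressions assume the line formats quoted in this post, including a command name without spaces.

```python
import re


def panic_task_from_log(log_text):
    """Extract (pid, comm) from the 'Pid: N, comm: NAME' line of an oops."""
    m = re.search(r"Pid: (\d+), comm: (\S+)", log_text)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)


def panic_task_from_crash(summary_text):
    """Extract (pid, command) from the PID:/COMMAND: lines of a crash summary."""
    pid = re.search(r"^\s*PID:\s*(\d+)", summary_text, re.MULTILINE)
    cmd = re.search(r'^\s*COMMAND:\s*"([^"]*)"', summary_text, re.MULTILINE)
    if pid is None or cmd is None:
        return None
    return int(pid.group(1)), cmd.group(1)


oops_line = "[6186227.149577] Pid: 38021, comm: sh Tainted: G NX 2.6.32.59-0.7-default #1"
crash_501 = 'PID: 0\nCOMMAND: "swapper"\n'      # summary printed by crash 5.0.1
crash_709 = 'PID: 38021\nCOMMAND: "sh"\n'       # summary printed by crash 7.0.9

print(panic_task_from_log(oops_line) == panic_task_from_crash(crash_501))  # False
print(panic_task_from_log(oops_line) == panic_task_from_crash(crash_709))  # True
```

A disagreement such as the one with crash 5.0.1 is a cheap early signal that the tool is not reading the dump correctly.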
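Then I looked at the code of strlen:

    /home/caq/usr/src/linux-2.6.32.59-0.7/lib/string.c: 379
    0xffffffff811e1750 <strlen>:    ljmpq  *(%rcx)
    0xffffffff811e1752 <strlen+2>:  icebp

From this I concluded that the fault was caused by rcx being NULL: the business code path is flawed and dereferences a null pointer directly, which crashed the kernel.

One caveat about that reading: this disassembly does not appear to match the dump either. It places strlen at ffffffff811e1750, while the oops reports strlen+0x2 at ffffffff811e7752, and ljmpq/icebp is not plausible code for strlen. The Code: bytes in the log around the faulting instruction, 31 c0 <80> 3f 00, decode to xor %eax,%eax followed by cmpb $0x0,(%rdi), and the register dump shows RDI = 0, so the NULL pointer is most likely the string argument passed in RDI. Either way, strlen was handed a NULL pointer by the module's getprocpath.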
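When the disassembly from the tool cannot be trusted, the Code: line of the oops is an independent source for the bytes around the faulting instruction, because it is taken from the running kernel at the time of the fault. The sketch below only splits that line at the `<..>` marker, which flags the first byte at the faulting RIP. The function name is illustrative, and no decoding of instructions is attempted.

```python
import re


def split_oops_code(code_line):
    """Split an oops 'Code:' line into (bytes_before_rip, bytes_from_rip).

    The byte wrapped in <..> is the first byte of the faulting instruction.
    """
    text = code_line.split("Code:", 1)[-1]
    before, after = bytearray(), bytearray()
    seen_marker = False
    # Match two hex digits at a time, optionally wrapped as <xx>.
    for m in re.finditer(r"(<)?([0-9a-fA-F]{2})>?", text):
        if m.group(1):
            seen_marker = True
        (after if seen_marker else before).append(int(m.group(2), 16))
    if not seen_marker:
        raise ValueError("no <..> marker found in Code: line")
    return bytes(before), bytes(after)


CODE_LINE = (
    "[6186227.149749] Code: 00 48 83 c7 01 0f b6 07 84 c0 74 0c 0f b6 c0 f6 "
    "80 a0 08 85 81 20 75 e9 48 89 f8 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 "
    "00 31 c0 <80> 3f 00 48 89 fa 74 15 66 0f 1f 44 00 00 48 83 c2 01 80 3a 00"
)

before, after = split_oops_code(CODE_LINE)
print(before[-2:].hex(" "))  # 31 c0
print(after[:3].hex(" "))    # 80 3f 00
```

The oops places the fault at strlen+0x2, so the two bytes before the marker are the start of strlen. In x86-64 encoding, 31 c0 is xor %eax,%eax and 80 3f 00 is cmpb $0x0,(%rdi), which is consistent with RDI and CR2 both being zero in the register dump.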
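To summarize:
1. When analyzing a crash, use as recent a version of crash as you can. When a given crash binary produces output that does not add up, switch to another one without hesitation, and take every warning it prints seriously.
2. Even veterans slip up. The gcc used to build vmlinux should preferably be the same version as the gcc that built the running kernel.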
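As a small aid for the first point, the diagnostics crash prints at startup can be pulled out of a captured session so that they are not lost in the copyright banner. This is only a sketch: the function name is made up, and the two prefixes are simply the ones that appear in the sessions quoted in this post.

```python
def crash_diagnostics(session_output):
    """Return lines of captured crash output that look like warnings or errors."""
    prefixes = ("WARNING:", "crash:")
    return [
        line.strip()
        for line in session_output.splitlines()
        if line.strip().startswith(prefixes)
    ]


SESSION = """\
crash 7.0.9
crash: vmlinux: no .gnu_debuglink section
GNU gdb (GDB) 7.6
WARNING: kernel version inconsistency between vmlinux and dumpfile
      KERNEL: vmlinux
"""

for line in crash_diagnostics(SESSION):
    print(line)
```

Note that the version line "crash 7.0.9" is not reported, because it has no colon after "crash".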
Reposted from: https://www.cnblogs.com/10087622blog/p/10609159.html