kprobe分析内核kworker占用CPU 100%问题总结
kprobe分析內(nèi)核kworker占用CPU 100%問題總結(jié)
Create by Billow.Jen,2020.3.8
前言
利用linux kernel 動(dòng)態(tài)追蹤技術(shù),排查問題本身就可能會(huì)變成一個(gè)非常有趣的過程,讓我們遇到線上的詭異問題就感到興奮,就仿佛好不容易又逮著機(jī)會(huì),可以去解一道迷人的謎題。
一、系統(tǒng)問題說明
背景說明
負(fù)載均衡模塊獨(dú)立主機(jī)部署,運(yùn)行BIRD、quagga、nftbl、contiv、ovs等進(jìn)程,實(shí)現(xiàn)網(wǎng)絡(luò)NAT功能。
內(nèi)核版本:centos7-4.19.8
問題描述
在調(diào)用nftlb寫入nft規(guī)則的過程中,發(fā)現(xiàn)部分主機(jī)性能低(因操作系統(tǒng)為centos且內(nèi)核進(jìn)行了升級,紅帽服務(wù)不支持。)
過程如下:
1、執(zhí)行命令:
time ip netns exec vpc067871207-lbalance curl -H "Key: spidernet" -X POST http://127.0.0.1:06787/farms --data '{"farms":[{"name":"b1","family":"ipv4","virtual-addr":"XX.XX.XX.XX","virtual-ports":"1001","mode":"snat","protocol":"tcp","scheduler":"weight","state":"up","backends":[{"name":"bck01","ip-addr":"192.168.0.1","port":"80","weight":"5","mark":"0x00000001","priority":"1","state":"up"}]}]}'
該命令正常應(yīng)該在0.2ms內(nèi)完成,但在部分主機(jī)上耗時(shí)到2、3s。
2、利用strace命令跟蹤:
strace -tt -T -v -f -s 1024 -p 2727846
發(fā)現(xiàn)耗時(shí)在sendmsg(AF_NETLINK…)函數(shù)上,有長有短,短的在幾十us,但有四個(gè)長在0.5s左右的sendmsg,如:
[外鏈圖片轉(zhuǎn)存失敗,源站可能有防盜鏈機(jī)制,建議將圖片保存下來直接上傳(img-eAfNbx6U-1584208688223)(en-resource://database/1982:1)]
3、繼續(xù)分析主機(jī)進(jìn)程,發(fā)現(xiàn)一個(gè)進(jìn)程kworker一直100%
[外鏈圖片轉(zhuǎn)存失敗,源站可能有防盜鏈機(jī)制,建議將圖片保存下來直接上傳(img-QxeEl9NA-1584208688224)(en-resource://database/1984:1)]
4、觀察主機(jī)性能,nmon如下:[外鏈圖片轉(zhuǎn)存失敗,源站可能有防盜鏈機(jī)制,建議將圖片保存下來直接上傳(img-ewVOFHNl-1584208688225)(en-resource://database/1990:1)]
此時(shí)主機(jī)內(nèi)存、cpu、網(wǎng)絡(luò)、存儲(chǔ)都很空閑,但中斷比正常主機(jī)高1~2倍。
5、用perf工具查看kworker耗時(shí)
[外鏈圖片轉(zhuǎn)存失敗,源站可能有防盜鏈機(jī)制,建議將圖片保存下來直接上傳(img-CEyDfmjJ-1584208688226)(en-resource://database/1988:1)]
6、網(wǎng)卡中斷,經(jīng)確認(rèn)(/proc/interrupts)網(wǎng)卡中斷分布在每個(gè)cpu上,網(wǎng)絡(luò)中斷均衡
7、繼續(xù)ftrace,分析事件:
cd /sys/kernel/debug/tracing
echo wakeup_rt > current_tracer
echo 0 > options/function-trace
echo 1 > events/enable
echo 1 > tracing_on
echo 0 > tracing_max_latency
chrt -f 5 sleep 1
echo 0 > tracing_on
cat trace
結(jié)果沒有針對性,不具備進(jìn)一步分析條件。初步問題分析完了,下步進(jìn)行內(nèi)核態(tài)深度分析。
小結(jié)
內(nèi)核該kworker進(jìn)程的性能影響了work處理效率,導(dǎo)致內(nèi)核較慢,響應(yīng)sendmsg的work延遲,導(dǎo)致curl耗時(shí)長。
二、內(nèi)核性能分析、跟蹤工具簡述
ftrace、kprobe、perf、operf/oprofile、systemtap都是跟蹤內(nèi)核的好工具,但有所區(qū)別。
Linux內(nèi)核提供的基礎(chǔ)設(shè)施:
- tarcepoints => 靜態(tài)探測點(diǎn)
- kprobe => 內(nèi)核態(tài)動(dòng)態(tài)探測點(diǎn)(kernel/kprobe.c, example:sample/kprobe)
- uprobe => 用戶態(tài)動(dòng)態(tài)探測點(diǎn)(kernel/events/uprobe.c)
- ftrace是后面工具的基礎(chǔ),但使用比較繁瑣
- perf可圖形化展示函數(shù)耗時(shí),但不能自定義跟蹤調(diào)試。
- operf/oprofile依賴vmlinux,并且得帶有debug信息。
- systemtap依賴太多,kernel-debuginfo、內(nèi)核編譯參數(shù)等等,生產(chǎn)環(huán)境基本不可用。
kprobe作為輕量級內(nèi)核調(diào)試工具,在診斷內(nèi)核bug時(shí)有著先天獨(dú)厚的優(yōu)勢,相關(guān)其他工具,kprobe有如下優(yōu)點(diǎn):
1、不用更新內(nèi)核
2、可以以模塊的形式加載進(jìn)內(nèi)核,用完后直接卸載即可,不會(huì)對內(nèi)核造成污染
3、動(dòng)態(tài)跟蹤,自由構(gòu)造,靈巧輕便。
經(jīng)過大量實(shí)驗(yàn)探索,依賴少并可深入自定義的,kprobe勝出。
三、利用kprobe分析kworker行為
調(diào)試源碼見附錄,kprobe使用及原理自行學(xué)習(xí),本文不做基礎(chǔ)介紹,重點(diǎn)在思路、步驟,主要展示方法論的形成過程。
準(zhǔn)備工作,說明下kprobe的參數(shù)示例:
struct file *file = (struct file *)regs->di;
//因?yàn)閤86的參數(shù)傳遞規(guī)則是di,si,dx,cx,r8,r9,所以di就是vfs_write的第一個(gè)參數(shù)。arm默認(rèn)是r0,r1,r2,r3,相應(yīng)的取r0
看看kworker在做什么:
[root@04b280305 kprobe]# cat /proc/352560/stack
[<0>] insert_work+0x6e/0xa0
[<0>] __queue_work+0x131/0x3b0
[<0>] queue_work_on+0x28/0x40
[<0>] rht_deferred_worker+0x8b/0x3e0
[<0>] process_one_work+0x179/0x390
[<0>] worker_thread+0x4f/0x3e0
[<0>] kthread+0x105/0x140
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
可知,rht_deferred_worker是特定任務(wù),需要重點(diǎn)分析。
寫kprobe代碼對rht_deferred_worker函數(shù)加探針,分析rht_deferred_worker事件(/var/log/messages,當(dāng)時(shí)dmesg沒記錄被覆蓋了,所以從messages中直接看):
[root@04b280305 kprobe]# grep rht_deferred_worker /var/log/messages |tail
Mar 9 21:35:14 04b280305 kernel: <rht_deferred_worker> pre_handler: 813f99c0
Mar 9 21:35:14 04b280305 kernel: <rht_deferred_worker> post_handler: 813f99c0
Mar 9 21:35:14 04b280305 kernel: <rht_deferred_worker> pre_handler: 813f99c0
Mar 9 21:35:14 04b280305 kernel: <rht_deferred_worker> post_handler: 813f99c0
Mar 9 21:35:14 04b280305 kernel: <rht_deferred_worker> pre_handler: 813f99c0
看看813f99c0是什么work:
[root@04b280305 kprobe]# cat /proc/kallsyms |grep 813f99c0
ffffffff813f99c0 t rht_deferred_worker
分析rht_deferred_worker源碼,得知在重新分配或者收縮失敗時(shí),會(huì)觸發(fā)插入work:rht_deferred_worker,此處是關(guān)鍵。
static void rht_deferred_worker(struct work_struct *work)
{
struct rhashtable *ht;
struct bucket_table *tbl;
int err = 0;
ht = container_of(work, struct rhashtable, run_work);
mutex_lock(&ht->mutex);
tbl = rht_dereference(ht->tbl, ht);
tbl = rhashtable_last_table(ht, tbl);
if (rht_grow_above_75(ht, tbl))
err = rhashtable_rehash_alloc(ht, tbl, tbl->size * 2);
else if (ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl))
err = rhashtable_shrink(ht);
else if (tbl->nest)
err = rhashtable_rehash_alloc(ht, tbl, tbl->size);
if (!err)//出錯(cuò)時(shí),返回非0,執(zhí)行不到
err = rhashtable_rehash_table(ht);
mutex_unlock(&ht->mutex);
if (err)
schedule_work(&ht->run_work);
/*上行是關(guān)鍵,造成了死循環(huán)*/
}
確認(rèn)下是alloc失敗還是shrink失?。?br /> 寫kprobe代碼對 rhashtable_rehash_alloc函數(shù)加探針,查看結(jié)果:
Mar 9 22:50:03 04b280305 kernel: Planted kprobe at rhashtable_rehash_alloc+0x0/0x50
Mar 9 22:50:07 04b280305 kernel: kprobe at rhashtable_rehash_alloc+0x0/0x50 unregistered
只執(zhí)行了一次,說明不是alloc的問題,繼續(xù)分析shrink,rhashtable_shrink不是系統(tǒng)導(dǎo)出符號,不能kprobe,看下源碼:
static int rhashtable_shrink(struct rhashtable *ht)
{
struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
unsigned int nelems = atomic_read(&ht->nelems);
unsigned int size = 0;
if (nelems)
size = roundup_pow_of_two(nelems * 3 / 2);
if (size < ht->p.min_size)
size = ht->p.min_size;
if (old_tbl->size <= size)
return 0;
if (rht_dereference(old_tbl->future_tbl, ht))
return -EEXIST;
return rhashtable_rehash_alloc(ht, old_tbl, size);
}
寫了個(gè)test_shirnk的探測函數(shù),一安裝服務(wù)器崩了…
去看看高版本內(nèi)核,從5.1版本解決了此問題:
static void rht_deferred_worker(struct work_struct *work)
{
struct rhashtable *ht;
struct bucket_table *tbl;
int err = 0;
ht = container_of(work, struct rhashtable, run_work);
mutex_lock(&ht->mutex);
tbl = rht_dereference(ht->tbl, ht);
tbl = rhashtable_last_table(ht, tbl);
if (rht_grow_above_75(ht, tbl))
err = rhashtable_rehash_alloc(ht, tbl, tbl->size * 2);
else if (ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl))
err = rhashtable_shrink(ht);
else if (tbl->nest)
err = rhashtable_rehash_alloc(ht, tbl, tbl->size);
/*此塊代碼,修復(fù)了BUG*/
if (!err || err == -EEXIST) {
int nerr;
nerr = rhashtable_rehash_table(ht);
err = err ?: nerr;
}
mutex_unlock(&ht->mutex);
if (err)
schedule_work(&ht->run_work);
}
BUG說明:
rhashtable: Still do rehash when we get EEXIST
As it stands if a shrink is delayed because of an outstanding
rehash, we will go into a rescheduling loop without ever doing
the rehash.
This patch fixes this by still carrying out the rehash and then
rescheduling so that we can shrink after the completion of the
rehash should it still be necessary.
The return value of EEXIST captures this case and other cases
(e.g., another thread expanded/rehashed the table at the same
time) where we should still proceed with the rehash.
Fixes: da20420 (“rhashtable: Add nested tables”)
英文不知所云,我自己的理解是:
哈希表收縮時(shí)(rhashtable_shrink),會(huì)去判斷該表是否被間接引用,如果存在則不縮容,返回-EEXIST。
v5.1之前的版本,該返回值會(huì)觸發(fā)重新生成一個(gè)rht_deferred_worker的work,這樣就形成了遞歸,如果間接引用一直存在則形成死循環(huán),導(dǎo)致CPU 100%。
v5.1之后的版本,該返回值會(huì)觸發(fā)重新哈希,若此期間間接引用還存在則返回0,不再會(huì)觸發(fā)產(chǎn)生新rht_deferred_worker的work,以此來解決BUG。
探測完注意盡快卸載驅(qū)動(dòng),影響性能,會(huì)打印大量日志,并且可能會(huì)把服務(wù)器搞崩
四、打patch,修復(fù)BUG
從https://www.kernel.org下載patch文件,升級解決。
$ diff -up linux-4.19.8-1/lib/rhashtable.c linux-4.19.8-2/lib/rhashtable.c > rht_patch
$ patch -p1 < rht_patch
五、總結(jié)回顧
遇到操作系統(tǒng)性能或功能問題,可采用如下步驟定位:
- 通過top工具,確定問題進(jìn)程
- 使用strace分析系統(tǒng)調(diào)用耗時(shí)或阻塞情況
- 使用nmon工具,查看系統(tǒng)中斷、負(fù)載、cpu、內(nèi)存、IO、net等情況,分析是否正常,初步判斷可能存在的問題
- 使用perf工具,分析函數(shù)棧性能
- 查詢資料、閱讀相關(guān)源碼,初步判斷位置
- 若問題可在測試環(huán)境復(fù)現(xiàn),使用systemtap工具進(jìn)行跟蹤分析
- 若問題在生產(chǎn)環(huán)境上可遇不可求,使用kprobe工具進(jìn)行動(dòng)態(tài)分析
- 定位問題后,查看高版本是否修復(fù)
- 若修復(fù)則通過kernel官網(wǎng)patch升級;未修復(fù)時(shí),若自己可修,則自改自測,否則git上提交bug
內(nèi)核調(diào)試博大精深,本文非常有局限性,具體問題還得具體分析
推薦幾個(gè)鏈接:
https://zhuanlan.zhihu.com/p/71437161
https://www.csdndoc.com/article/8015807
https://blog.csdn.net/yuntongsf/article/details/78707576
附kprobe探測代碼
Talk is cheap,show you the code!
kprobe_kworker.c
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#define MAX_SYMBOL_LEN 64
static char symbol[MAX_SYMBOL_LEN] = "rht_deferred_worker";
//rht_deferred_worker可根據(jù)需要改成要探測的函數(shù)
module_param_string(symbol, symbol, sizeof(symbol), 0644);
/* For each probe you need to allocate a kprobe structure */
static struct kprobe kp = {
.symbol_name = symbol,
};
/* kprobe pre_handler: called just before the probed instruction is executed */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
#ifdef CONFIG_X86
pr_info("<%s>,pid = %ld\n",current->comm, current->pid);
struct work_struct *work = (struct work_struct *)regs->di;
pr_info("<%s> pre_handler: %x\n",p->symbol_name,work->func);
#endif
#ifdef CONFIG_ARM64
pr_info("<%s> pre_handler: p->addr = %pF, pc = 0x%lx,"
" pstate = 0x%lx\n",
p->symbol_name, p->addr, (long)regs->pc, (long)regs->pstate);
#endif
/* A dump_stack() here will give a stack backtrace */
return 0;
}
/* kprobe post_handler: called after the probed instruction is executed */
static void handler_post(struct kprobe *p, struct pt_regs *regs,
unsigned long flags)
{
#ifdef CONFIG_X86
struct work_struct *work = (struct work_struct *)regs->di;
pr_info("<%s> post_handler: %x\n",p->symbol_name,work->func);
// pr_info("<%s> post_handler: p->addr = %pF, flags = 0x%lx\n",
// p->symbol_name, p->addr, regs->flags);
#endif
#ifdef CONFIG_ARM64
pr_info("<%s> post_handler: p->addr = %pF, pstate = 0x%lx\n",
p->symbol_name, p->addr, (long)regs->pstate);
#endif
}
/*
* fault_handler: this is called if an exception is generated for any
* instruction within the pre- or post-handler, or when Kprobes
* single-steps the probed instruction.
*/
static int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
{
pr_info("fault_handler: p->addr = %pF, trap #%dn", p->addr, trapnr);
/* Return 0 because we don't handle the fault. */
return 0;
}
static int __init kprobe_init(void)
{
int ret;
kp.pre_handler = handler_pre;
kp.post_handler = handler_post;
kp.fault_handler = handler_fault;
ret = register_kprobe(&kp);
if (ret < 0) {
pr_err("register_kprobe failed, returned %d\n", ret);
return ret;
}
pr_info("Planted kprobe at %pF\n", kp.addr);
return 0;
}
static void __exit kprobe_exit(void)
{
unregister_kprobe(&kp);
pr_info("kprobe at %pF unregistered\n", kp.addr);
}
module_init(kprobe_init)
module_exit(kprobe_exit)
MODULE_LICENSE("GPL");
Makefile
obj-m := kprobe_kworker.o
KBUILD_EXTRA_SYMBOLS:=/mod_a/Module.symvers
CROSS_COMPILE=''
KDIR := /lib/modules/4.19.8-1.el7.elrepo.x86_64/build
all:
make -C $(KDIR) M=$(PWD) modules
clean:
rm -f *.ko *.o *.mod.o *.mod.c .*.cmd *.symvers modul*
執(zhí)行命令
make
insmod kprobe_kworker.ko
rmmod kprobe_kworker.ko
dmesg查看結(jié)果
dmesg
總結(jié)
以上是生活随笔為你收集整理的kprobe分析内核kworker占用CPU 100%问题总结的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 英雄联盟s11总决赛RGE战队参赛的都有
- 下一篇: JavaScript 里 window,