Linux Performance Optimization -- CPU
- Preface
- Load
- CPU utilization
- proc
- perf
- Some links
- `perf list`
- Useful events
- `perf stat`
- `perf record`
- Profiling
- Static Tracing
- `perf report`
- perf probe
- MISC
- Using perf to inspect programs inside Docker
- `perf top`
- Optimizing with perf
- Generating flame graphs with perf
- Flame graph generation methods from Brendan Gregg's blog
- Generating flame graphs with SystemTap
- perf script
- perf usage notes
- Off-CPU analysis
- Measuring off-CPU time
- Caveats
- Request-Synchronous Context
- Scheduler Latency
- Involuntary Context Switching
- IO
- Off-CPU
- wake-up
- Chain Graphs
- Overview of the bcc tool suite
- Single Purpose Tools:
- Multi Tools: Kernel Dynamic Tracing
- Multi Tools: User Level Dynamic Tracing
- Multi Tools: Kernel Static Tracing
- Multi Tools: User Statically Defined Tracing (USDT)
- perf-tools
- References
Preface
This article is my working notes from day-to-day CPU performance tuning. The structure is somewhat messy and will be reorganized in the future.
In my view, the most effective ways to learn performance optimization are, first, reading the articles of the grandmaster Brendan Gregg, and second, practicing.
Load
- What is a reasonable load average?
- When the load average rises above 70% of the number of CPUs, you should start investigating what is driving the load.
- Three situations push the load average up:
- CPU-bound processes: heavy CPU usage raises the load average.
- I/O-bound processes: waiting on I/O also raises the load average, though CPU utilization is not necessarily high.
- Many processes waiting for the CPU: a large number of runnable processes raises the load average, and in this case CPU utilization is high as well.
- cswch: the number of voluntary context switches per second (see the pidstat sketch after this list); its counterpart is nvcswch.
- A voluntary context switch happens when a process cannot obtain a resource it needs, for example when I/O or memory is insufficient.
- nvcswch: the number of non-voluntary (involuntary) context switches per second.
- An involuntary context switch happens when the kernel forcibly reschedules a process, for example because its time slice expired. When many processes compete for the CPU, involuntary context switches become frequent.
- A context switch costs roughly 3-5 microseconds even on a very smooth system, which is slower than lock contention or cache synchronization.
- Steps in Context Switching explains what actually happens during a context switch.
- How many context switches are "normal" (as a function of CPU cores or other factors)?
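As a quick check, a minimal sketch for watching the load average and per-process context switches (assumes the sysstat package is installed for pidstat; the 70% rule above is a guideline, not a hard limit):

```bash
# Load average vs. CPU count: if the 1/5/15-minute averages stay
# above ~70% of the core count, start digging.
uptime
nproc

# Per-process context switches, refreshed every 5 seconds:
#   cswch/s   - voluntary switches (blocked on I/O, memory, locks, ...)
#   nvcswch/s - involuntary switches (preempted, time slice expired)
pidstat -w 5
```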
CPU utilization
- Linux exposes internal kernel state to user space through the /proc virtual filesystem; /proc/stat provides system-wide CPU and task statistics (a sketch of reading it directly follows this list).
- user (usually abbreviated us): user-mode CPU time. Note that it excludes nice time but includes guest time.
- nice (ni): user-mode CPU time of low-priority processes, i.e. processes whose nice value is in the 1-19 range. Note that nice ranges from -20 to 19, and a larger value means a lower priority.
- system (sys): kernel-mode CPU time.
- idle (id): idle time. Note that it does not include time spent waiting for I/O (iowait).
- irq (hi): CPU time spent servicing hard interrupts.
- softirq (si): CPU time spent servicing soft interrupts.
- steal (st): when the system runs inside a virtual machine, CPU time taken by other virtual machines.
- guest: CPU time spent running other operating systems under virtualization, i.e. time spent running virtual machines.
- guest_nice (gnice): time spent running virtual machines at low priority.
- Performance tools report CPU utilization averaged over a sampling interval, so pay attention to how the interval is set.
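A minimal sketch of reading /proc/stat directly and computing overall utilization from two samples one second apart (field order as in the list above; a plain-bash approximation, not a replacement for top or mpstat):

```bash
# First line of /proc/stat: cumulative jiffies since boot, in the order
#   cpu user nice system idle iowait irq softirq steal guest guest_nice
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat; sleep 1
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat

# busy = user + nice + system + irq + softirq + steal deltas
busy=$(( (u2+n2+s2+q2+sq2+st2) - (u1+n1+s1+q1+sq1+st1) ))
total=$(( busy + (i2+w2) - (i1+w1) ))
echo "CPU utilization over the last second: $(( 100 * busy / total ))%"
```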
proc
perf
Some links
When facing a problematic program, it is best to work top-down: first get an overall picture of the program's run-time statistics, then drill into the details in a specific direction. Diving straight into the minutiae will only obscure the big picture.
Some programs are slow because they simply compute a lot; they spend most of their time on the CPU and are called CPU bound. Others are slow because of excessive I/O; their CPU utilization stays low and they are called I/O bound. Tuning a CPU-bound program differs from tuning an I/O-bound one; a quick way to tell them apart is sketched below.
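A rough, hedged way to distinguish the two with standard tools (vmstat and pidstat come from procps/sysstat; the thresholds are rules of thumb, and pidstat's %wait column only appears in newer sysstat releases):

```bash
# System-wide view, one sample per second:
#   us/sy high, wa low            -> likely CPU bound
#   us/sy low, wa high, b > 0     -> likely I/O bound
vmstat 1

# Per-process breakdown (%usr, %system, %wait on newer sysstat):
pidstat -u -p <pid> 1
```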
perf list
Use perf list to enumerate every event that perf can sample; perf list 'sched:*' lists just the scheduler-related tracepoints.
The output differs from system to system. As of kernel 2.6.35 the list is already quite long, but however many entries there are, the events fall into a few classes (hardware events, software events, and tracepoints). Commonly used tracepoint subsystems include (a filtering sketch follows the list):
- block: block device I/O
- ext4: file system operations
- kmem: kernel memory allocation events
- random: kernel random number generator events
- sched: CPU scheduler events
- syscalls: system call enter and exits
- task: task events
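A few filter forms accepted by perf list (event-type keywords and glob patterns; the exact output depends on kernel version and permissions):

```bash
perf list hw                      # hardware events (cycles, instructions, cache-misses, ...)
perf list sw                      # software events (context-switches, page-faults, ...)
perf list tracepoint              # all tracepoints
perf list 'sched:*'               # scheduler tracepoints only
perf list 'syscalls:sys_enter_*'  # syscall entry tracepoints
```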
Useful events
perf stat
```bash
# CPU counter statistics for the specified command (run perf over the command and show CPU-related counters):
perf stat command
# Detailed CPU counter statistics (includes extras) for the specified command (same as above, but more detailed):
perf stat -d command
# CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -p PID
# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5
# Various basic CPU statistics, system wide, for 10 seconds:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command
# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Using raw PMC counters, eg, counting unhalted core cycles:
perf stat -e r003c -a sleep 5
# PMCs: counting cycles and frontend stalls via raw specification:
perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5
# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a
# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID
# Count system calls by type for the entire system, for 5 seconds:
perf stat -e 'syscalls:sys_enter_*' -a sleep 5
# Count scheduler events for the specified PID, until Ctrl-C:
perf stat -e 'sched:*' -p PID
# Count scheduler events for the specified PID, for 10 seconds:
perf stat -e 'sched:*' -p PID sleep 10
# Count ext4 events for the entire system, for 10 seconds:
perf stat -e 'ext4:*' -a sleep 10
# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10
# Count all vmscan events, printing a report every second:
perf stat -e 'vmscan:*' -a -I 1000
```
perf record
- perf record saves the sampled data to a file (perf.data); the saved data is then parsed and displayed with perf report.
- perf is often run with the -g option, which records call-graph information so the call chain can be inspected.
- -p <pid> specifies the process ID.
Profiling
```bash
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
# Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
perf record -F 99 -p PID
# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10
# Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10
# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
perf record -F 99 -ag -- sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
perf record -F 99 -g -- sleep 10
# If the previous command didn't work, try forcing perf to use the cpu-clock event:
perf record -F 99 -e cpu-clock -ag -- sleep 10
# Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10
# Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -a --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
perf record -F 99 -a --call-graph lbr sleep 10
# Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
# Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5
# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5
# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5
# Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
perf record -e cycles:up -a -- sleep 5
# Perform branch tracing (needs HW support), for 1 second:
perf record -b -a sleep 1
```
Static Tracing
```bash
# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a
# Sample (take a subset of) context-switches, until Ctrl-C:
perf record -e context-switches -a
# Trace all context-switches, until Ctrl-C:
# -c 1 means the sample_period (shown by -vv) is 1, i.e. capture every event
perf record -e context-switches -c 1 -a
# Include raw settings used (see: man perf_event_open):
# -vv prints the detailed settings
perf record -vv -e context-switches -a
# Trace all context-switches via sched tracepoint, until Ctrl-C:
perf record -e sched:sched_switch -a
# Sample context-switches with stack traces, until Ctrl-C:
perf record -e context-switches -ag
# Sample context-switches with stack traces, for 10 seconds:
perf record -e context-switches -ag -- sleep 10
# Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
perf record -e context-switches -ag -T
# Sample CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag
# Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_accept* -ag
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag
# Sample at most 100 block device requests per second, until Ctrl-C:
perf record -F 100 -e block:block_rq_insert -a
# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
# Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
perf record -e minor-faults -ag
# Trace all minor faults with stack traces, until Ctrl-C:
perf record -e minor-faults -c 1 -ag
# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
# Trace kswapd wakeup events, until Ctrl-C:
perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag
# Add Node.js USDT probes (Linux 4.10+):
perf buildid-cache --add `which node`
# Trace the node http__server__request USDT event (Linux 4.10+):
perf record -e sdt_node:http__server__request -a
```
perf report
```bash
# Show perf.data in an ncurses browser (TUI) if possible:
perf report
# Show perf.data with a column for sample count:
perf report -n
# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio
# Report, with stacks in folded format: one line per stack (needs 4.4):
perf report --stdio -n -g folded
# List all events from perf.data:
perf script
# List all perf.data events, with data header (newer kernels; was previously default):
perf script --header
# List all perf.data events, with customized fields (< Linux 4.1):
perf script -f time,event,trace
# List all perf.data events, with customized fields (>= Linux 4.1):
perf script -F time,event,trace
# List all perf.data events, with my recommended fields (needs record -a; newer kernels):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso
# List all perf.data events, with my recommended fields (needs record -a; older kernels):
perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso
# Dump raw contents from perf.data as hex (for debugging):
perf script -D
# Disassemble and annotate instructions with percentages (needs some debuginfo):
perf annotate --stdio
```
perf probe
Dynamic Tracing
```bash
# Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use "--del"):
perf probe -d tcp_sendmsg
# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'
# Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
perf probe -V tcp_sendmsg
# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg
# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81
# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'
# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
perf probe 'tcp_sendmsg size'
# Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
# Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
# Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
# Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
perf probe 'tcp_sendmsg:81 seglen'
# Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
perf probe 'do_sys_open filename:string'
# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry
# List currently available dynamic probes:
perf probe -l
```
MISC
Mixed
```bash
# Trace system calls by process, showing a summary refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm
# Trace sent network packets by on-CPU process, rolling output (no clear):
stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings
# Sample stacks at 99 Hertz, and, context switches:
perf record -F99 -e cpu-clock -e cs -a -g
# Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g
```
Special
```bash
# Record cacheline events (Linux 4.10+):
perf c2c record -a -- sleep 10
# Report cacheline events from previous recording (Linux 4.10+):
perf c2c report
```
Other subcommands
- perf c2c (Linux 4.10+): cache-2-cache and cacheline false sharing analysis.
- perf kmem: kernel memory allocation analysis.
- perf kvm: KVM virtual guest analysis.
- perf lock: lock analysis.
- perf mem: memory access analysis.
- perf sched: kernel scheduler statistics (a short sketch follows this list).
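A minimal perf sched sketch (standard subcommands; the workload and duration here are arbitrary):

```bash
# Record scheduler events system-wide for 5 seconds:
perf sched record -- sleep 5
# Per-task scheduling latency summary (runtime, switches, max delay):
perf sched latency
# Timeline of what ran on each CPU:
perf sched map
```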
Using perf to inspect programs inside Docker
- The simplest approach is to run perf record on the host, outside Docker, and then copy the data file into the container for analysis.
- First run perf record -g -p <pid>, let it run for a while (say 15 seconds), then press Ctrl+C to stop.
- Then copy the generated perf.data file into the container and analyze it there (see the sketch below).
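A minimal sketch of that workflow; the container name app and the paths are placeholders, and perf must be installed inside the container image:

```bash
# On the host: sample the containerized process for ~15 seconds
perf record -g -p <pid> -- sleep 15

# Copy the profile into the container, where the binaries and symbols match
docker cp perf.data app:/tmp/perf.data

# Analyze inside the container
docker exec -it app perf report -i /tmp/perf.data
```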
perf top
- It shows, in real time, the functions or instructions consuming the most CPU cycles, which makes it useful for finding hot functions.
- perf is often run with the -g option, which records call-graph information so the call chain can be inspected.
- -p <pid> specifies the process ID.
- Each row has four columns:
- The first column, Overhead, is the percentage of all samples attributed to that symbol.
- The second column, Shared, is the dynamic shared object (DSO) the function or instruction belongs to: the kernel, a process name, a shared library, a kernel module, and so on.
- The third column, Object, is the type of the DSO; for example [.] means a user-space executable or shared library, while [k] means kernel space.
- The last column, Symbol, is the symbol (function) name; when the name is unknown, a hexadecimal address is shown instead.
Optimizing with perf
Generating flame graphs with perf
```bash
# 1. Sample
# -g tells perf record to also capture the call-graph (caller/callee) relationships
# -e cpu-clock makes perf record sample on the cpu-clock event
# -p specifies the pid to record
# -a: across all CPUs
sudo perf record -a -e cpu-clock -p <pid> -F 1000 -g sleep 60
sudo perf record -a -e cpu-clock -F 1000 -g -p <pid>
sudo perf record --call-graph dwarf -p pid
# 2. Parse perf.data with perf script
perf script -i perf.data &> perf.unfold
# The following steps use the FlameGraph repo: https://github.com/brendangregg/FlameGraph.git
# 3. Fold the symbols in perf.unfold
xxx/stackcollapse-perf.pl perf.unfold &> perf.folded
# 4. Generate the SVG
xxx/flamegraph.pl perf.folded > perf.svg
```
Flame graph generation methods from Brendan Gregg's blog
Once stackcollapse-perf.pl has produced the folded file, grep can be used to generate exactly the flame graph we want:
```bash
perf script | ./stackcollapse-perf.pl > out.perf-folded
# Filter out the cpu_idle threads, to focus on where CPU is really being spent
grep -v cpu_idle out.perf-folded | ./flamegraph.pl > nonidle.svg
grep ext4 out.perf-folded | ./flamegraph.pl > ext4internals.svg
egrep 'system_call.*sys_(read|write)' out.perf-folded | ./flamegraph.pl > rw.svg
```
I frequently elide the cpu_idle threads in this way, to focus on the real threads that are consuming CPU resources. If I miss this step, the cpu_idle threads can often dominate the flame graph, squeezing the interesting code paths.
Note that it would be a little more efficient to process the output of perf report instead of perf script; better still, perf report could have a report style (eg, “-g folded”) that output folded stacks directly, obviating the need for stackcollapse-perf.pl. There could even be a perf mode that output the SVG directly (which wouldn’t be the first one; see perf-timechart), although, that would miss the value of being able to grep the folded stacks (which I use frequently).
There are more examples of perf_events CPU flame graphs on the CPU flame graph page, including a summary of these instructions. I have also shared an example of using perf for a Block Device I/O Flame Graph.
Generating flame graphs with SystemTap
SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land. The data collected and output generated can be customized much further via its scripting language.
The timer.profile probe was used, which samples all CPUs at 100 Hertz
```bash
# -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 keep stack traces from being truncated
stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
         probe end { foreach (i in s+) { print_stack(i); printf("\t%d\n", @count(s[i])); } }
         probe timer.s(60) { exit(); }' \
    > out.stap-stacks
./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg
```
perf script
Converts the file written by perf record into readable per-event records.
perf usage notes
Off-CPU analysis
Off-CPU analysis looks at the time a process spends blocked off the CPU, typically inside system calls that have entered the kernel; the goal is to measure which calls the process blocks in and how much time it spends there.
As shown below, off-CPU analysis mostly targets blocking system calls (sys CPU is not necessarily high), where the process is waiting on some resource such as I/O.
(Off-CPU Tracing timeline, after Brendan Gregg: threads A and B run in user-land, enter the kernel through a syscall, block and go off-CPU, then resume when the block completes or an interrupt wakes them.)
Measuring off-CPU time
The simplest tool is the time command, which shows how much time a process spends in user and sys. In the example below the command took about 50 seconds of wall-clock time, but user plus sys add up to only about 12.7 seconds; roughly 38 seconds are unaccounted for, and that is the off-CPU time (here, spent in disk I/O).
```bash
$ time tar cf archive.tar linux-4.15-rc2

real    0m50.798s
user    0m1.048s
sys     0m11.627s
```
bcc commands
```bash
# Off-CPU time per process, output as a histogram
/usr/share/bcc/tools/cpudist -O -p <pid>

# Off-CPU time, with separate user- and kernel-mode accounting; shows user and kernel stack frames
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg

# Count malloc() calls for a pid. Because BPF aggregates in the kernel and stackcount only
# prints its summary on Ctrl+C, this avoids a kernel/user transition on every malloc() call.
# Functions can use * wildcards, e.g. c:*map matches 18 functions.
/usr/share/bcc/tools/stackcount -p 19183 -U c:malloc > out.stacks
# Turn the output into a flame graph
./stackcollapse.pl < out.stacks | ./flamegraph.pl --color=mem \
    --title="malloc() Flame Graph" --countname="calls" > out.svg
```
Caveats
Request-Synchronous Context
Many of the services we deploy have threads that are mostly waiting (network threads waiting for packets, worker threads waiting for tasks). This noise gets in the way of straightforward off-CPU analysis. What we actually want is the off-CPU time a thread accumulates while handling a request, so we can select only the relevant stack frames from the offcputime output above for analysis (a grep-based sketch follows).
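A hedged sketch of that filtering: offcputime -f emits one folded stack per line, so we can grep for the request-handling entry function before rendering. dispatch_command here is only an illustrative function name (MySQL's request dispatcher); substitute whatever marks request context in your own service:

```bash
# Keep only stacks that passed through the (illustrative) request-handling function
grep dispatch_command out.stacks | \
    ./flamegraph.pl --color=io --title="Off-CPU Time (request context)" --countname=us > req.svg
```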
Scheduler Latency
Something that’s missing from these stacks is if the off-CPU time includes time spent waiting on a CPU run queue. This time is known as scheduler latency, run queue latency, or dispatcher queue latency. If the CPUs are running at saturation, then any time a thread blocks, it may endure additional time waiting its turn on a CPU after being woken up. That time will be included in the off-CPU time.
You can use extra trace events to tease apart off-CPU time into time blocked vs scheduler latency, but in practice, CPU saturation is pretty easy to spot, so you are unlikely to be spending much time studying off-CPU time when you have a known CPU saturation issue to deal with.
In other words: if the CPUs are saturated, part of the off-CPU time is actually spent waiting in the run queue (scheduler latency).
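If you want to look at run-queue latency on its own, bcc ships a dedicated tool; a minimal sketch (runqlat prints a power-of-2 latency histogram; interval and count arguments as in other bcc tools):

```bash
# Run queue (scheduler) latency histogram, one 10-second interval
/usr/share/bcc/tools/runqlat 10 1
# Per-process variant
/usr/share/bcc/tools/runqlat -p <pid> 10 1
```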
Involuntary Context Switching
Involuntary context switches also show up as off-CPU time, but they are usually not the problem we are trying to pin down. The following flag restricts offcputime to threads blocked in the D state (TASK_UNINTERRUPTIBLE):
```bash
/usr/share/bcc/tools/offcputime -p `pgrep -nx mysqld` --state 2
```
On Linux, involuntary context switches occur for state TASK_RUNNING (0), whereas the blocking events we're usually interested in are in TASK_INTERRUPTIBLE (1) or TASK_UNINTERRUPTIBLE (2), which offcputime can match on using --state. I used this feature in my Linux Load Averages: Solving the Mystery post.
IO
An increase in synchronous I/O latency shows up directly as increased service latency. We need a way to record the time of every synchronous I/O system call so that it can be turned into a flame graph. Brendan wrote a tool for this, fileiostacks.py. Once we have established that sync I/O is behind a performance problem, this is a good way to analyze it.
Used as below, it produces a flame graph containing only the stacks of synchronous I/O through the filesystem:
```bash
/usr/share/bcc/tools/fileiostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="File I/O Time Flame Graph" --countname=us < out.stacks > out.svg
```
However, filesystem I/O is not the only kind of I/O: there is also block-device I/O and network I/O. Below is a block-device example.
Using perf
```bash
perf record -e block:block_rq_insert -a -g -- sleep 30
perf script --header > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=io \
    --title="Block I/O Flame Graph" --countname="I/O" > out.svg
```
Using bcc
```bash
/usr/share/bcc/tools/biostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Block I/O Time Flame Graph" --countname=us < out.stacks > out.svg
```
Off-CPU
Unlike the above, here we want to analyze all off-CPU events.
```bash
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
```
wake-up
This flame graph focuses on the waker side: it shows which thread woke up a given process. That is quite useful; for example, when lock overhead is high, it lets us see which processes are doing the waking.
For instance, the wmb thundering-herd problem previously seen in the kernel should be diagnosable this way.
```bash
/usr/share/bcc/tools/wakeuptime -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=wakeup --title="Wakeup Time Flame Graph" --countname=us < out.stacks > out.svg
```
Chain Graphs
后之覽者,亦將有感于斯文。
Overview of the bcc tool suite
cd /usr/share/bcc/tools/
Single Purpose Tools:
```bash
# Trace new processes:
execsnoop
# Trace file opens with process and filename:
opensnoop
# Summarize block I/O (disk) latency as a power-of-2 distribution by disk:
biolatency -D
# Summarize block I/O size as a power-of-2 distribution by program name:
bitesize
# Trace common ext4 file system operations slower than 1 millisecond:
ext4slower 1
# Trace TCP active connections (connect()) with IP address and ports:
tcpconnect
# Trace TCP passive connections (accept()) with IP address and ports:
tcpaccept
# Trace TCP connections to local port 80, with session duration:
tcplife -L 80
# Trace TCP retransmissions with IP addresses and TCP state (only retransmitted packets are printed):
tcpretrans
# Sample stack traces at 49 Hertz for 10 seconds, emit folded format (for flame graphs):
profile -fd -F 49 10
# Trace details and latency of resolver DNS lookups:
gethostlatency
# Trace commands issued in all running bash shells:
bashreadline
```
Multi Tools: Kernel Dynamic Tracing
# Count "tcp_send*" kernel function, print output every second: funccount -i 1 'tcp_send*'# Count "vfs_*" calls for PID 185: funccount -p 185 'vfs_*'# Trace file names opened, using dynamic tracing of the kernel do_sys_open() function: trace 'p::do_sys_open "%s", arg2'# Same as before ("p:: is assumed if not specified): trace 'do_sys_open "%s", arg2'# Trace the return of the kernel do_sys_open() funciton, and print the retval: trace 'r::do_sys_open "ret: %d", retval'# Trace do_nanosleep() kernel function and the second argument (mode), with kernel stack traces: trace -K 'do_nanosleep "mode: %d", arg2'# Trace do_nanosleep() mode by providing the prototype (no debuginfo required): trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "mode: %d", mode'# Trace do_nanosleep() with the task address (may be NULL), noting the dereference: trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "task: %x", t->task'# Frequency count tcp_sendmsg() size: argdist -C 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'# Summarize tcp_sendmsg() size as a power-of-2 histogram: argdist -H 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'# Frequency count stack traces that lead to the submit_bio() function (disk I/O issue): stackcount submit_bio# Summarize the latency (time taken) by the vfs_read() function for PID 181: # 統(tǒng)計(jì)函數(shù)延時(shí),我們也可以用來(lái)做自己進(jìn)程的性能優(yōu)化 funclatency -p 181 -u vfs_readMulti Tools: User Level Dynamic Tracing
```bash
# Trace the libc library function nanosleep() and print the requested sleep details:
trace 'p:c:nanosleep(struct timespec *req) "%d sec %d nsec", req->tv_sec, req->tv_nsec'
# Count the libc write() call for PID 181 by file descriptor:
argdist -p 181 -C 'p:c:write(int fd):int:fd'
# Summarize the latency (time taken) by libc getaddrinfo(), as a power-of-2 histogram in microseconds:
funclatency.py -u 'c:getaddrinfo'
```
Multi Tools: Kernel Static Tracing
```bash
# Count stack traces that led to issuing block I/O, tracing its kernel tracepoint:
stackcount t:block:block_rq_insert
```
Multi Tools: User Statically Defined Tracing (USDT)
```bash
# Trace the pthread_create USDT probe, and print arg1 as hex:
trace 'u:pthread:pthread_create "%x", arg1'
```
perf-tools
perf-tools
Some of its tools have since been absorbed into bcc, but a few, such as iosnoop, are still only available in perf-tools.
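A small hedged sketch of iosnoop from perf-tools (run from a checkout of the repo; option names follow its usage text and may differ between versions):

```bash
# Trace block I/O for 5 seconds: prints pid, comm, device, size and latency per I/O
./iosnoop 5
# Restrict to one process and include start/end timestamps
./iosnoop -ts -p <pid> 5
```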
References