Linux Performance Optimization -- CPU
- Preface
- Load
- CPU utilization
- proc
- perf
- Some links
- `perf list`
- Useful events
- `perf stat`
- `perf record`
- Profiling
- Static Tracing
- `perf report`
- perf probe
- MISC
- Using perf to inspect programs inside Docker
- `perf top`
- Optimizing with perf
- Generating flame graphs with perf
- Flame graph generation methods from Brendan Gregg's blog
- Generating flame graphs with SystemTap
- perf script
- perf usage notes
- Off-CPU analysis
- Measuring off-CPU time
- Caveats
- Request-Synchronous Context
- Scheduler Latency
- Involuntary Context Switching
- IO
- Off-CPU
- wake-up
- Chain Graphs
- Overview of the bcc tool suite
- Single Purpose Tools:
- Multi Tools: Kernel Dynamic Tracing
- Multi Tools: User Level Dynamic Tracing
- Multi Tools: Kernel Static Tracing
- Multi Tools: User Statically Defined Tracing (USDT)
- perf-tools
- References
Preface
This article is my working notes from day-to-day CPU performance tuning. The structure is somewhat messy and will be reorganized in the future.
In my view, the most effective ways to learn performance optimization are, first, reading the articles of the grandmaster Brendan Gregg, and second, practicing.
Load
- What is a reasonable load average?
- When the load average rises above 70% of the number of CPUs, you should start investigating what is driving the load.
- Three situations push the load average up:
- CPU-bound processes: heavy CPU usage raises the load average.
- I/O-bound processes: waiting on I/O also raises the load average, though CPU utilization is not necessarily high.
- Many processes waiting for the CPU: a large number of runnable processes raises the load average, and in this case CPU utilization is high as well.
- cswch: the number of voluntary context switches per second (see the pidstat sketch after this list); its counterpart is nvcswch.
- A voluntary context switch happens when a process cannot obtain a resource it needs, for example when I/O or memory is insufficient.
- nvcswch: the number of non-voluntary (involuntary) context switches per second.
- An involuntary context switch happens when the kernel forcibly reschedules a process, for example because its time slice expired. When many processes compete for the CPU, involuntary context switches become frequent.
- A context switch costs roughly 3-5 microseconds even on a very smooth system, which is slower than lock contention or cache synchronization.
- Steps in Context Switching explains what actually happens during a context switch.
- How many context switches are "normal" (as a function of CPU cores or other factors)?
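As a quick check, a minimal sketch for watching the load average and per-process context switches (assumes the sysstat package is installed for pidstat; the 70% rule above is a guideline, not a hard limit):

```bash
# Load average vs. CPU count: if the 1/5/15-minute averages stay
# above ~70% of the core count, start digging.
uptime
nproc

# Per-process context switches, refreshed every 5 seconds:
#   cswch/s   - voluntary switches (blocked on I/O, memory, locks, ...)
#   nvcswch/s - involuntary switches (preempted, time slice expired)
pidstat -w 5
```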
CPU utilization
- Linux exposes internal kernel state to user space through the /proc virtual filesystem; /proc/stat provides system-wide CPU and task statistics (a sketch of reading it directly follows this list).
- user (usually abbreviated us): user-mode CPU time. Note that it excludes nice time but includes guest time.
- nice (ni): user-mode CPU time of low-priority processes, i.e. processes whose nice value is in the 1-19 range. Note that nice ranges from -20 to 19, and a larger value means a lower priority.
- system (sys): kernel-mode CPU time.
- idle (id): idle time. Note that it does not include time spent waiting for I/O (iowait).
- irq (hi): CPU time spent servicing hard interrupts.
- softirq (si): CPU time spent servicing soft interrupts.
- steal (st): when the system runs inside a virtual machine, CPU time taken by other virtual machines.
- guest: CPU time spent running other operating systems under virtualization, i.e. time spent running virtual machines.
- guest_nice (gnice): time spent running virtual machines at low priority.
- Performance tools report CPU utilization averaged over a sampling interval, so pay attention to how the interval is set.
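A minimal sketch of reading /proc/stat directly and computing overall utilization from two samples one second apart (field order as in the list above; a plain-bash approximation, not a replacement for top or mpstat):

```bash
# First line of /proc/stat: cumulative jiffies since boot, in the order
#   cpu user nice system idle iowait irq softirq steal guest guest_nice
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat; sleep 1
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat

# busy = user + nice + system + irq + softirq + steal deltas
busy=$(( (u2+n2+s2+q2+sq2+st2) - (u1+n1+s1+q1+sq1+st1) ))
total=$(( busy + (i2+w2) - (i1+w1) ))
echo "CPU utilization over the last second: $(( 100 * busy / total ))%"
```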
proc
perf
Some links
When facing a problematic program, it is best to work top-down: first get an overall picture of the program's run-time statistics, then drill into the details in a specific direction. Diving straight into the minutiae will only obscure the big picture.
Some programs are slow because they simply compute a lot; they spend most of their time on the CPU and are called CPU bound. Others are slow because of excessive I/O; their CPU utilization stays low and they are called I/O bound. Tuning a CPU-bound program differs from tuning an I/O-bound one; a quick way to tell them apart is sketched below.
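A rough, hedged way to distinguish the two with standard tools (vmstat and pidstat come from procps/sysstat; the thresholds are rules of thumb, and pidstat's %wait column only appears in newer sysstat releases):

```bash
# System-wide view, one sample per second:
#   us/sy high, wa low            -> likely CPU bound
#   us/sy low, wa high, b > 0     -> likely I/O bound
vmstat 1

# Per-process breakdown (%usr, %system, %wait on newer sysstat):
pidstat -u -p <pid> 1
```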
perf list
Use perf list to enumerate every event that perf can sample; perf list 'sched:*' lists just the scheduler-related tracepoints.
The output differs from system to system. As of kernel 2.6.35 the list is already quite long, but however many entries there are, the events fall into a few classes (hardware events, software events, and tracepoints). Commonly used tracepoint subsystems include (a filtering sketch follows the list):
- block: block device I/O
- ext4: file system operations
- kmem: kernel memory allocation events
- random: kernel random number generator events
- sched: CPU scheduler events
- syscalls: system call enter and exits
- task: task events
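A few filter forms accepted by perf list (event-type keywords and glob patterns; the exact output depends on kernel version and permissions):

```bash
perf list hw                      # hardware events (cycles, instructions, cache-misses, ...)
perf list sw                      # software events (context-switches, page-faults, ...)
perf list tracepoint              # all tracepoints
perf list 'sched:*'               # scheduler tracepoints only
perf list 'syscalls:sys_enter_*'  # syscall entry tracepoints
```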
Useful events
perf stat
```bash
# CPU counter statistics for the specified command (run perf over the command and show CPU-related counters):
perf stat command
# Detailed CPU counter statistics (includes extras) for the specified command (same as above, but more detailed):
perf stat -d command
# CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -p PID
# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5
# Various basic CPU statistics, system wide, for 10 seconds:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command
# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Using raw PMC counters, eg, counting unhalted core cycles:
perf stat -e r003c -a sleep 5
# PMCs: counting cycles and frontend stalls via raw specification:
perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5
# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a
# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID
# Count system calls by type for the entire system, for 5 seconds:
perf stat -e 'syscalls:sys_enter_*' -a sleep 5
# Count scheduler events for the specified PID, until Ctrl-C:
perf stat -e 'sched:*' -p PID
# Count scheduler events for the specified PID, for 10 seconds:
perf stat -e 'sched:*' -p PID sleep 10
# Count ext4 events for the entire system, for 10 seconds:
perf stat -e 'ext4:*' -a sleep 10
# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10
# Count all vmscan events, printing a report every second:
perf stat -e 'vmscan:*' -a -I 1000
```
perf record
- perf record saves the sampled data to a file (perf.data); the saved data is then parsed and displayed with perf report.
- perf is often run with the -g option, which records call-graph information so the call chain can be inspected.
- -p <pid> specifies the process ID.
Profiling
```bash
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
# Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
perf record -F 99 -p PID
# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10
# Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10
# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
perf record -F 99 -ag -- sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
perf record -F 99 -g -- sleep 10
# If the previous command didn't work, try forcing perf to use the cpu-clock event:
perf record -F 99 -e cpu-clock -ag -- sleep 10
# Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10
# Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -a --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
perf record -F 99 -a --call-graph lbr sleep 10
# Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
# Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5
# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5
# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5
# Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
perf record -e cycles:up -a -- sleep 5
# Perform branch tracing (needs HW support), for 1 second:
perf record -b -a sleep 1
```
Static Tracing
```bash
# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a
# Sample (take a subset of) context-switches, until Ctrl-C:
perf record -e context-switches -a
# Trace all context-switches, until Ctrl-C:
# -c 1 means the sample_period (shown by -vv) is 1, i.e. capture every event
perf record -e context-switches -c 1 -a
# Include raw settings used (see: man perf_event_open):
# -vv prints the detailed settings
perf record -vv -e context-switches -a
# Trace all context-switches via sched tracepoint, until Ctrl-C:
perf record -e sched:sched_switch -a
# Sample context-switches with stack traces, until Ctrl-C:
perf record -e context-switches -ag
# Sample context-switches with stack traces, for 10 seconds:
perf record -e context-switches -ag -- sleep 10
# Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
perf record -e context-switches -ag -T
# Sample CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag
# Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_accept* -ag
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag
# Sample at most 100 block device requests per second, until Ctrl-C:
perf record -F 100 -e block:block_rq_insert -a
# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
# Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
perf record -e minor-faults -ag
# Trace all minor faults with stack traces, until Ctrl-C:
perf record -e minor-faults -c 1 -ag
# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
# Trace kswapd wakeup events, until Ctrl-C:
perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag
# Add Node.js USDT probes (Linux 4.10+):
perf buildid-cache --add `which node`
# Trace the node http__server__request USDT event (Linux 4.10+):
perf record -e sdt_node:http__server__request -a
```
perf report
```bash
# Show perf.data in an ncurses browser (TUI) if possible:
perf report
# Show perf.data with a column for sample count:
perf report -n
# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio
# Report, with stacks in folded format: one line per stack (needs 4.4):
perf report --stdio -n -g folded
# List all events from perf.data:
perf script
# List all perf.data events, with data header (newer kernels; was previously default):
perf script --header
# List all perf.data events, with customized fields (< Linux 4.1):
perf script -f time,event,trace
# List all perf.data events, with customized fields (>= Linux 4.1):
perf script -F time,event,trace
# List all perf.data events, with my recommended fields (needs record -a; newer kernels):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso
# List all perf.data events, with my recommended fields (needs record -a; older kernels):
perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso
# Dump raw contents from perf.data as hex (for debugging):
perf script -D
# Disassemble and annotate instructions with percentages (needs some debuginfo):
perf annotate --stdio
```
perf probe
Dynamic Tracing
```bash
# Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use "--del"):
perf probe -d tcp_sendmsg
# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'
# Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
perf probe -V tcp_sendmsg
# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg
# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81
# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'
# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
perf probe 'tcp_sendmsg size'
# Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
# Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
# Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
# Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
perf probe 'tcp_sendmsg:81 seglen'
# Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
perf probe 'do_sys_open filename:string'
# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry
# List currently available dynamic probes:
perf probe -l
```
MISC
Mixed
```bash
# Trace system calls by process, showing a summary refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm
# Trace sent network packets by on-CPU process, rolling output (no clear):
stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings
# Sample stacks at 99 Hertz, and, context switches:
perf record -F99 -e cpu-clock -e cs -a -g
# Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g
```
Special
```bash
# Record cacheline events (Linux 4.10+):
perf c2c record -a -- sleep 10
# Report cacheline events from previous recording (Linux 4.10+):
perf c2c report
```
Other subcommands
- perf c2c (Linux 4.10+): cache-2-cache and cacheline false sharing analysis.
- perf kmem: kernel memory allocation analysis.
- perf kvm: KVM virtual guest analysis.
- perf lock: lock analysis.
- perf mem: memory access analysis.
- perf sched: kernel scheduler statistics (a short sketch follows this list).
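A minimal perf sched sketch (standard subcommands; the workload and duration here are arbitrary):

```bash
# Record scheduler events system-wide for 5 seconds:
perf sched record -- sleep 5
# Per-task scheduling latency summary (runtime, switches, max delay):
perf sched latency
# Timeline of what ran on each CPU:
perf sched map
```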
Using perf to inspect programs inside Docker
- The simplest approach is to run perf record on the host, outside Docker, and then copy the data file into the container for analysis.
- First run perf record -g -p <pid>, let it run for a while (say 15 seconds), then press Ctrl+C to stop.
- Then copy the generated perf.data file into the container and analyze it there (see the sketch below).
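A minimal sketch of that workflow; the container name app and the paths are placeholders, and perf must be installed inside the container image:

```bash
# On the host: sample the containerized process for ~15 seconds
perf record -g -p <pid> -- sleep 15

# Copy the profile into the container, where the binaries and symbols match
docker cp perf.data app:/tmp/perf.data

# Analyze inside the container
docker exec -it app perf report -i /tmp/perf.data
```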
perf top
- It shows, in real time, the functions or instructions consuming the most CPU cycles, which makes it useful for finding hot functions.
- perf is often run with the -g option, which records call-graph information so the call chain can be inspected.
- -p <pid> specifies the process ID.
- Each row has four columns:
- The first column, Overhead, is the percentage of all samples attributed to that symbol.
- The second column, Shared, is the dynamic shared object (DSO) the function or instruction belongs to: the kernel, a process name, a shared library, a kernel module, and so on.
- The third column, Object, is the type of the DSO; for example [.] means a user-space executable or shared library, while [k] means kernel space.
- The last column, Symbol, is the symbol (function) name; when the name is unknown, a hexadecimal address is shown instead.
Optimizing with perf
Generating flame graphs with perf
```bash
# 1. Sample
# -g tells perf record to also capture the call-graph (caller/callee) relationships
# -e cpu-clock makes perf record sample on the cpu-clock event
# -p specifies the pid to record
# -a: across all CPUs
sudo perf record -a -e cpu-clock -p <pid> -F 1000 -g sleep 60
sudo perf record -a -e cpu-clock -F 1000 -g -p <pid>
sudo perf record --call-graph dwarf -p pid
# 2. Parse perf.data with perf script
perf script -i perf.data &> perf.unfold
# The following steps use the FlameGraph repo: https://github.com/brendangregg/FlameGraph.git
# 3. Fold the symbols in perf.unfold
xxx/stackcollapse-perf.pl perf.unfold &> perf.folded
# 4. Generate the SVG
xxx/flamegraph.pl perf.folded > perf.svg
```
Flame graph generation methods from Brendan Gregg's blog
Once stackcollapse-perf.pl has produced the folded file, grep can be used to generate exactly the flame graph we want:
```bash
perf script | ./stackcollapse-perf.pl > out.perf-folded
# Filter out the cpu_idle threads, to focus on where CPU is really being spent
grep -v cpu_idle out.perf-folded | ./flamegraph.pl > nonidle.svg
grep ext4 out.perf-folded | ./flamegraph.pl > ext4internals.svg
egrep 'system_call.*sys_(read|write)' out.perf-folded | ./flamegraph.pl > rw.svg
```
I frequently elide the cpu_idle threads in this way, to focus on the real threads that are consuming CPU resources. If I miss this step, the cpu_idle threads can often dominate the flame graph, squeezing the interesting code paths.
Note that it would be a little more efficient to process the output of perf report instead of perf script; better still, perf report could have a report style (eg, “-g folded”) that output folded stacks directly, obviating the need for stackcollapse-perf.pl. There could even be a perf mode that output the SVG directly (which wouldn’t be the first one; see perf-timechart), although, that would miss the value of being able to grep the folded stacks (which I use frequently).
There are more examples of perf_events CPU flame graphs on the CPU flame graph page, including a summary of these instructions. I have also shared an example of using perf for a Block Device I/O Flame Graph.
Generating flame graphs with SystemTap
SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land. The data collected and output generated can be customized much further via its scripting language.
The timer.profile probe was used, which samples all CPUs at 100 Hertz
```bash
# -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 keep stack traces from being truncated
stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
         probe end { foreach (i in s+) { print_stack(i); printf("\t%d\n", @count(s[i])); } }
         probe timer.s(60) { exit(); }' \
    > out.stap-stacks
./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg
```
perf script
Converts the file written by perf record into readable per-event records.
perf usage notes
Off-CPU analysis
Off-CPU analysis looks at the time a process spends blocked off the CPU, typically inside system calls that have entered the kernel; the goal is to measure which calls the process blocks in and how much time it spends there.
As shown below, off-CPU analysis mostly targets blocking system calls (sys CPU is not necessarily high), where the process is waiting on some resource such as I/O.
(Off-CPU Tracing timeline, after Brendan Gregg: threads A and B run in user-land, enter the kernel through a syscall, block and go off-CPU, then resume when the block completes or an interrupt wakes them.)
Measuring off-CPU time
The simplest tool is the time command, which shows how much time a process spends in user and sys. In the example below the command took about 50 seconds of wall-clock time, but user plus sys add up to only about 12.7 seconds; roughly 38 seconds are unaccounted for, and that is the off-CPU time (here, spent in disk I/O).
```bash
$ time tar cf archive.tar linux-4.15-rc2

real    0m50.798s
user    0m1.048s
sys     0m11.627s
```
bcc commands
```bash
# Off-CPU time per process, output as a histogram
/usr/share/bcc/tools/cpudist -O -p <pid>

# Off-CPU time, with separate user- and kernel-mode accounting; shows user and kernel stack frames
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg

# Count malloc() calls for a pid. Because BPF aggregates in the kernel and stackcount only
# prints its summary on Ctrl+C, this avoids a kernel/user transition on every malloc() call.
# Functions can use * wildcards, e.g. c:*map matches 18 functions.
/usr/share/bcc/tools/stackcount -p 19183 -U c:malloc > out.stacks
# Turn the output into a flame graph
./stackcollapse.pl < out.stacks | ./flamegraph.pl --color=mem \
    --title="malloc() Flame Graph" --countname="calls" > out.svg
```
Caveats
Request-Synchronous Context
Many of the services we deploy have threads that are mostly waiting (network threads waiting for packets, worker threads waiting for tasks). This noise gets in the way of straightforward off-CPU analysis. What we actually want is the off-CPU time a thread accumulates while handling a request, so we can select only the relevant stack frames from the offcputime output above for analysis (a grep-based sketch follows).
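A hedged sketch of that filtering: offcputime -f emits one folded stack per line, so we can grep for the request-handling entry function before rendering. dispatch_command here is only an illustrative function name (MySQL's request dispatcher); substitute whatever marks request context in your own service:

```bash
# Keep only stacks that passed through the (illustrative) request-handling function
grep dispatch_command out.stacks | \
    ./flamegraph.pl --color=io --title="Off-CPU Time (request context)" --countname=us > req.svg
```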
Scheduler Latency
Something that’s missing from these stacks is if the off-CPU time includes time spent waiting on a CPU run queue. This time is known as scheduler latency, run queue latency, or dispatcher queue latency. If the CPUs are running at saturation, then any time a thread blocks, it may endure additional time waiting its turn on a CPU after being woken up. That time will be included in the off-CPU time.
You can use extra trace events to tease apart off-CPU time into time blocked vs scheduler latency, but in practice, CPU saturation is pretty easy to spot, so you are unlikely to be spending much time studying off-CPU time when you have a known CPU saturation issue to deal with.
In other words: if the CPUs are saturated, part of the off-CPU time is actually spent waiting in the run queue (scheduler latency).
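If you want to look at run-queue latency on its own, bcc ships a dedicated tool; a minimal sketch (runqlat prints a power-of-2 latency histogram; interval and count arguments as in other bcc tools):

```bash
# Run queue (scheduler) latency histogram, one 10-second interval
/usr/share/bcc/tools/runqlat 10 1
# Per-process variant
/usr/share/bcc/tools/runqlat -p <pid> 10 1
```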
Involuntary Context Switching
Involuntary context switches also show up as off-CPU time, but they are usually not the problem we are trying to pin down. The following flag restricts offcputime to threads blocked in the D state (TASK_UNINTERRUPTIBLE):
```bash
/usr/share/bcc/tools/offcputime -p `pgrep -nx mysqld` --state 2
```
On Linux, involuntary context switches occur for state TASK_RUNNING (0), whereas the blocking events we're usually interested in are in TASK_INTERRUPTIBLE (1) or TASK_UNINTERRUPTIBLE (2), which offcputime can match on using --state. I used this feature in my Linux Load Averages: Solving the Mystery post.
IO
An increase in synchronous I/O latency shows up directly as increased service latency. We need a way to record the time of every synchronous I/O system call so that it can be turned into a flame graph. Brendan wrote a tool for this, fileiostacks.py. Once we have established that sync I/O is behind a performance problem, this is a good way to analyze it.
Used as below, it produces a flame graph containing only the stacks of synchronous I/O through the filesystem:
```bash
/usr/share/bcc/tools/fileiostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="File I/O Time Flame Graph" --countname=us < out.stacks > out.svg
```
However, filesystem I/O is not the only kind of I/O: there is also block-device I/O and network I/O. Below is a block-device example.
Using perf
```bash
perf record -e block:block_rq_insert -a -g -- sleep 30
perf script --header > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=io \
    --title="Block I/O Flame Graph" --countname="I/O" > out.svg
```
Using bcc
```bash
/usr/share/bcc/tools/biostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Block I/O Time Flame Graph" --countname=us < out.stacks > out.svg
```
Off-CPU
Unlike the above, here we want to analyze all off-CPU events.
```bash
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
```
wake-up
This flame graph focuses on the waker side: it shows which thread woke up a given process. That is quite useful; for example, when lock overhead is high, it lets us see which processes are doing the waking.
For instance, the wmb thundering-herd problem previously seen in the kernel should be diagnosable this way.
```bash
/usr/share/bcc/tools/wakeuptime -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=wakeup --title="Wakeup Time Flame Graph" --countname=us < out.stacks > out.svg
```
Chain Graphs
后之覽者,亦將有感于斯文。
Overview of the bcc tool suite
cd /usr/share/bcc/tools/
Single Purpose Tools:
```bash
# Trace new processes:
execsnoop
# Trace file opens with process and filename:
opensnoop
# Summarize block I/O (disk) latency as a power-of-2 distribution by disk:
biolatency -D
# Summarize block I/O size as a power-of-2 distribution by program name:
bitesize
# Trace common ext4 file system operations slower than 1 millisecond:
ext4slower 1
# Trace TCP active connections (connect()) with IP address and ports:
tcpconnect
# Trace TCP passive connections (accept()) with IP address and ports:
tcpaccept
# Trace TCP connections to local port 80, with session duration:
tcplife -L 80
# Trace TCP retransmissions with IP addresses and TCP state (only retransmitted packets are printed):
tcpretrans
# Sample stack traces at 49 Hertz for 10 seconds, emit folded format (for flame graphs):
profile -fd -F 49 10
# Trace details and latency of resolver DNS lookups:
gethostlatency
# Trace commands issued in all running bash shells:
bashreadline
```
Multi Tools: Kernel Dynamic Tracing
# Count "tcp_send*" kernel function, print output every second: funccount -i 1 'tcp_send*'# Count "vfs_*" calls for PID 185: funccount -p 185 'vfs_*'# Trace file names opened, using dynamic tracing of the kernel do_sys_open() function: trace 'p::do_sys_open "%s", arg2'# Same as before ("p:: is assumed if not specified): trace 'do_sys_open "%s", arg2'# Trace the return of the kernel do_sys_open() funciton, and print the retval: trace 'r::do_sys_open "ret: %d", retval'# Trace do_nanosleep() kernel function and the second argument (mode), with kernel stack traces: trace -K 'do_nanosleep "mode: %d", arg2'# Trace do_nanosleep() mode by providing the prototype (no debuginfo required): trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "mode: %d", mode'# Trace do_nanosleep() with the task address (may be NULL), noting the dereference: trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "task: %x", t->task'# Frequency count tcp_sendmsg() size: argdist -C 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'# Summarize tcp_sendmsg() size as a power-of-2 histogram: argdist -H 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'# Frequency count stack traces that lead to the submit_bio() function (disk I/O issue): stackcount submit_bio# Summarize the latency (time taken) by the vfs_read() function for PID 181: # 統(tǒng)計(jì)函數(shù)延時(shí),我們也可以用來(lái)做自己進(jìn)程的性能優(yōu)化 funclatency -p 181 -u vfs_readMulti Tools: User Level Dynamic Tracing
```bash
# Trace the libc library function nanosleep() and print the requested sleep details:
trace 'p:c:nanosleep(struct timespec *req) "%d sec %d nsec", req->tv_sec, req->tv_nsec'
# Count the libc write() call for PID 181 by file descriptor:
argdist -p 181 -C 'p:c:write(int fd):int:fd'
# Summarize the latency (time taken) by libc getaddrinfo(), as a power-of-2 histogram in microseconds:
funclatency.py -u 'c:getaddrinfo'
```
Multi Tools: Kernel Static Tracing
```bash
# Count stack traces that led to issuing block I/O, tracing its kernel tracepoint:
stackcount t:block:block_rq_insert
```
Multi Tools: User Statically Defined Tracing (USDT)
```bash
# Trace the pthread_create USDT probe, and print arg1 as hex:
trace 'u:pthread:pthread_create "%x", arg1'
```
perf-tools
perf-tools
Some of its tools have since been absorbed into bcc, but a few, such as iosnoop, are still only available in perf-tools.
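A small hedged sketch of iosnoop from perf-tools (run from a checkout of the repo; option names follow its usage text and may differ between versions):

```bash
# Trace block I/O for 5 seconds: prints pid, comm, device, size and latency per I/O
./iosnoop 5
# Restrict to one process and include start/end timestamps
./iosnoop -ts -p <pid> 5
```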
References