irqbalance
http://www.bubuko.com/infodetail-1129360.html
irqbalance, in theory:
Enabling the irqbalance service should both improve performance and reduce power consumption.
irqbalance optimizes interrupt distribution: it automatically collects system data to analyze usage patterns, and based on the system load it puts itself into either Performance mode or Power-save mode.
In Performance mode, irqbalance spreads interrupts as evenly as possible across the CPU cores, making full use of all cores to improve performance.
In Power-save mode, irqbalance concentrates interrupts on the first CPU, so the other idle CPUs can stay asleep longer, reducing power consumption.
In practice, however, it often disturbs balanced CPU usage, so disabling it is recommended in server environments.
http://www.it165.net/os/html/201301/4427.html
The homepage of the irqbalance project is here.
On RHEL distributions this daemon is enabled at boot by default. How do we check its status?
# service irqbalance status
irqbalance (pid PID) is running...
In practice, though, our dedicated applications are usually pinned to specific CPUs, so we often don't need it at all. If it is already running, we can stop it with:
# service irqbalance stop
Stopping irqbalance:                                       [  OK  ]
Or simply disable it at boot:
# chkconfig irqbalance off
Now let's dig into how irqbalance works, so we know exactly when to use it and when not to.
Since irqbalance optimizes interrupt distribution, let's start from interrupts themselves. The article is long; take a deep breath, here we go!
Key excerpt:
SMP affinity is controlled by manipulating files in the /proc/irq/ directory.
In /proc/irq/ are directories that correspond to the IRQs present on your
system (not all IRQs may be available). In each of these directories is
the “smp_affinity” file, and this is where we will work our magic.
In plain terms: write the mask of the CPUs you want the IRQ bound to into /proc/irq/N/smp_affinity. (See also: how to set IRQ affinity manually.)
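As a minimal sketch of that (the IRQ number and CPU list here are made up for illustration), this is how one could build the hex mask for a set of CPUs; each CPU n contributes bit 1 << n:

```shell
#!/bin/sh
# Build an smp_affinity-style mask from CPU indices:
# each CPU n sets bit (1 << n) in the mask.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 0 3    # bits 0 and 3 -> prints "9"

# Then, as root, one would pin a (hypothetical) IRQ 98 to CPUs 0 and 3:
# echo 9 > /proc/irq/98/smp_affinity
```

On boxes with more than 32 CPUs the kernel formats the mask as comma-separated 32-bit words, so real masks can be longer than this sketch suggests.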
Next, some background. Let's look at CPU topology, starting with how the parts of an Intel CPU relate to each other:
[figure: Intel CPU topology]
A NUMA node contains one or more Sockets and the local memory attached to them. A multi-core Socket has multiple Cores. If the CPU supports HT, the OS also treats each Core as 2 Logical Processors.
Many tools can display the topology; lscpu or Intel's cpu_topology64 tool both work.
Using likwid-topology from the Likwid toolkit we introduced recently, we can see:
./likwid-topology
CPU topology is something you must understand for any kind of CPU affinity pinning on high-performance servers. Getting a feel for it?
With all the background and terminology above in place, we can now investigate how irqbalance works:
//irqbalance.c
int main(int argc, char **argv)
{
        /* ... */
        while (keep_going) {
                sleep_approx(SLEEP_INTERVAL); /* #define SLEEP_INTERVAL 10 */
                /* ... */
                clear_work_stats();
                parse_proc_interrupts();
                parse_proc_stat();
                /* ... */
                calculate_placement();
                activate_mappings();
                /* ... */
        }
        /* ... */
}
The program's main loop makes the logic clear. Until it exits, every 10 seconds it does the following:
1. Clear the statistics
2. Parse the interrupts (/proc/interrupts)
3. Parse the interrupt load (/proc/stat)
4. Calculate, from the load, how to balance the interrupts
5. Apply the IRQ affinity changes
OK, a quick look at how irqbalance is used:
man irqbalance
--oneshot
    Causes irqbalance to be run once, after which the daemon exits
--debug
    Causes irqbalance to run in the foreground and extra debug information to be printed
Running irqbalance in debug mode gives us a lot of detailed information:
# ./irqbalance --oneshot --debug
Have a sip of water; now let's analyze each step in detail:
First, the distribution of interrupts across the CPUs:
$ cat /proc/interrupts | tr -s ' ' '\t' | cut -f 1-3
        CPU0    CPU1
0:      2622846291
1:      7
4:      234
8:      1
9:      0
12:     4
50:     6753
66:     228
90:     497
98:     31
209:    2       0
217:    0       0
225:    29      556
233:    0       0
NMI:    7395302 4915439
LOC:    2622846035      2622833187
ERR:    0
MIS:    0
The first column of the output is the IRQ number; the two columns after it are the interrupt counts on CPU0 and CPU1.
But how do we know what kind of device an interrupt, say IRQ 98, belongs to? No more talk, on to the code!
//classify.c
char *classes[] = {
        "other",
        "legacy",
        "storage",
        "timer",
        "ethernet",
        "gbit-ethernet",
        "10gbit-ethernet",
        0
};

#define MAX_CLASS 0x12
/*
 * Class codes lifted from pci spec, appendix D.
 * and mapped to irqbalance types here
 */
static short class_codes[MAX_CLASS] = {
        IRQ_OTHER,
        IRQ_SCSI,
        IRQ_ETH,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_LEGACY,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_LEGACY,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_LEGACY,
        IRQ_ETH,
        IRQ_SCSI,
        IRQ_OTHER,
        IRQ_OTHER,
        IRQ_OTHER,
};
int map_class_to_level[7] =
{ BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CACHE, BALANCE_NONE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE };
irqbalance divides interrupts into 7 types, and different types are balanced over different scopes: some at the PACKAGE level, some at the CACHE level, some at the CORE level.
So where does the type information come from? On to the code!
//#define SYSDEV_DIR "/sys/bus/pci/devices"
static struct irq_info *add_one_irq_to_db(const char *devpath, int irq, struct user_irq_policy *pol)
{
        ...
        sprintf(path, "%s/class", devpath);

        fd = fopen(path, "r");

        if (!fd) {
                perror("Can't open class file: ");
                goto get_numa_node;
        }

        rc = fscanf(fd, "%x", &class);
        fclose(fd);

        if (!rc)
                goto get_numa_node;

        /*
         * Restrict search to major class code
         */
        class >>= 16;

        if (class >= MAX_CLASS)
                goto get_numa_node;

        new->class = class_codes[class];
        if (pol->level >= 0)
                new->level = pol->level;
        else
                new->level = map_class_to_level[class_codes[class]];
get_numa_node:
        numa_node = -1;
        sprintf(path, "%s/numa_node", devpath);
        fd = fopen(path, "r");
        if (!fd)
                goto assign_node;

        rc = fscanf(fd, "%d", &numa_node);
        fclose(fd);

assign_node:
        new->numa_node = get_numa_node(numa_node);

        sprintf(path, "%s/local_cpus", devpath);
        fd = fopen(path, "r");
        if (!fd) {
                cpus_setall(new->cpumask);
                goto assign_affinity_hint;
        }
        lcpu_mask = NULL;
        ret = getline(&lcpu_mask, &blen, fd);
        fclose(fd);
        if (ret <= 0) {
                cpus_setall(new->cpumask);
        } else {
                cpumask_parse_user(lcpu_mask, ret, new->cpumask);
        }
        free(lcpu_mask);

assign_affinity_hint:
        cpus_clear(new->affinity_hint);
        sprintf(path, "/proc/irq/%d/affinity_hint", irq);
        fd = fopen(path, "r");
        if (!fd)
                goto out;
        lcpu_mask = NULL;
        ret = getline(&lcpu_mask, &blen, fd);
        fclose(fd);
        if (ret <= 0)
                goto out;
        cpumask_parse_user(lcpu_mask, ret, new->affinity_hint);
        free(lcpu_mask);
out:
        ...
}
The C code above translates into roughly the following script:
$ cat > x.sh
SYSDEV_DIR="/sys/bus/pci/devices/"
for dev in `ls $SYSDEV_DIR`
do
    IRQ=`cat $SYSDEV_DIR$dev/irq`
    CLASS=$(((`cat $SYSDEV_DIR$dev/class`)>>16))
    printf "irq %s: class[%s] " $IRQ $CLASS
    if [ -f "/proc/irq/$IRQ/affinity_hint" ]; then
        printf "affinity_hint[%s] " `cat /proc/irq/$IRQ/affinity_hint`
    fi
    if [ -f "$SYSDEV_DIR$dev/local_cpus" ]; then
        printf "local_cpus[%s] " `cat $SYSDEV_DIR$dev/local_cpus`
    fi
    if [ -f "$SYSDEV_DIR$dev/numa_node" ]; then
        printf "numa_node[%s]" `cat $SYSDEV_DIR$dev/numa_node`
    fi
    echo
done
CTRL+D

$ tree /sys/bus/pci/devices
/sys/bus/pci/devices
|-- 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
|-- 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0
|-- 0000:00:03.0 -> ../../../devices/pci0000:00/0000:00:03.0
|-- 0000:00:07.0 -> ../../../devices/pci0000:00/0000:00:07.0
|-- 0000:00:09.0 -> ../../../devices/pci0000:00/0000:00:09.0
|-- 0000:00:13.0 -> ../../../devices/pci0000:00/0000:00:13.0
|-- 0000:00:14.0 -> ../../../devices/pci0000:00/0000:00:14.0
|-- 0000:00:14.1 -> ../../../devices/pci0000:00/0000:00:14.1
|-- 0000:00:14.2 -> ../../../devices/pci0000:00/0000:00:14.2
|-- 0000:00:14.3 -> ../../../devices/pci0000:00/0000:00:14.3
|-- 0000:00:1a.0 -> ../../../devices/pci0000:00/0000:00:1a.0
|-- 0000:00:1a.7 -> ../../../devices/pci0000:00/0000:00:1a.7
|-- 0000:00:1d.0 -> ../../../devices/pci0000:00/0000:00:1d.0
|-- 0000:00:1d.1 -> ../../../devices/pci0000:00/0000:00:1d.1
|-- 0000:00:1d.2 -> ../../../devices/pci0000:00/0000:00:1d.2
|-- 0000:00:1d.7 -> ../../../devices/pci0000:00/0000:00:1d.7
|-- 0000:00:1e.0 -> ../../../devices/pci0000:00/0000:00:1e.0
|-- 0000:00:1f.0 -> ../../../devices/pci0000:00/0000:00:1f.0
|-- 0000:00:1f.2 -> ../../../devices/pci0000:00/0000:00:1f.2
|-- 0000:00:1f.3 -> ../../../devices/pci0000:00/0000:00:1f.3
|-- 0000:00:1f.5 -> ../../../devices/pci0000:00/0000:00:1f.5
|-- 0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
|-- 0000:01:00.1 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.1
|-- 0000:04:00.0 -> ../../../devices/pci0000:00/0000:00:09.0/0000:04:00.0
`-- 0000:05:00.0 -> ../../../devices/pci0000:00/0000:00:1e.0/0000:05:00.0

$ chmod +x x.sh
$ ./x.sh | grep 98
irq 98: class[2] local_cpus[00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000]
A quick read of the numbers: class_codes[2] = IRQ_ETH, which tells us this interrupt belongs to a network card.
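As a sanity check on that lookup, here is a minimal sketch of the decoding: the value in a device's sysfs class file packs major class, subclass and prog-if as 0xCCSSPP, and irqbalance keeps only the major class (value >> 16); 0x02 is PCI's network-controller class. The sample value below is made up for illustration:

```shell
#!/bin/sh
# PCI class register layout: 0xCCSSPP (major class, subclass, prog-if).
# irqbalance indexes class_codes[] with the major class byte only.
class=0x020000              # sample value for an Ethernet controller
echo $(( class >> 16 ))     # prints 2, and class_codes[2] == IRQ_ETH
```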
So how is the interrupt load calculated? Back to the code!
//procinterrupts.c
void parse_proc_stat(void)
{
        ...
        file = fopen("/proc/stat", "r");
        if (!file) {
                log(TO_ALL, LOG_WARNING, "WARNING cant open /proc/stat.  balacing is broken\n");
                return;
        }

        /* first line is the header we don't need; nuke it */
        if (getline(&line, &size, file)==0) {
                free(line);
                log(TO_ALL, LOG_WARNING, "WARNING read /proc/stat. balancing is broken\n");
                fclose(file);
                return;
        }
        cpucount = 0;
        while (!feof(file)) {
                if (getline(&line, &size, file)==0)
                        break;

                if (!strstr(line, "cpu"))
                        break;

                cpunr = strtoul(&line[3], NULL, 10);

                if (cpu_isset(cpunr, banned_cpus))
                        continue;

                rc = sscanf(line, "%*s %*d %*d %*d %*d %*d %d %d", &irq_load, &softirq_load);
                if (rc < 2)
                        break;

                cpu = find_cpu_core(cpunr);

                if (!cpu)
                        break;

                cpucount++;
                /*
                 * For each cpu add the irq and softirq load and propagate that
                 * all the way up the device tree
                 */
                if (cycle_count) {
                        cpu->load = (irq_load + softirq_load) - (cpu->last_load);
                        /*
                         * the [soft]irq_load values are in jiffies, which are
                         * units of 10ms, multiply by 1000 to convert that to
                         * 1/10 milliseconds.  This give us a better integer
                         * distribution of load between irqs
                         */
                        cpu->load *= 1000;
                }
                cpu->last_load = (irq_load + softirq_load);
        }
        ...
}
This is roughly equivalent to the following command:
$ grep cpu15 /proc/stat
cpu15 30068830 85841 22995655 3212064899 536154 91145 2789328 0
Let's learn the format of /proc/stat!
The relevant excerpt about the cpu lines:
cpu — Measures the number of jiffies (1/100 of a second for x86 systems) that the system has been in user mode, user mode with low priority (nice), system mode, idle task, I/O wait, IRQ (hardirq), and softirq respectively. The IRQ (hardirq) is the direct response to a hardware event. The IRQ takes minimal work for queuing the “heavy” work up for the softirq to execute. The softirq runs at a lower priority than the IRQ and therefore may be interrupted more frequently. The total for all CPUs is given at the top, while each individual CPU is listed below with its own statistics. The following example is a 4-way Intel Pentium Xeon configuration with multi-threading enabled, therefore showing four physical processors and four virtual processors totaling eight processors.
From this we can see that the 7th and 8th fields of the line are the jiffies spent in hardirq and softirq handling; their sum is what irqbalance treats as the CPU's interrupt load.
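That extraction can be sketched in shell against a captured /proc/stat line (the numbers are the ones from the grep above, so this is illustrative rather than live data):

```shell
#!/bin/sh
# A /proc/stat cpu line: cpuN user nice system idle iowait irq softirq ...
# Like irqbalance's sscanf, skip the label and the first five counters,
# then sum the irq and softirq jiffies.
line="cpu15 30068830 85841 22995655 3212064899 536154 91145 2789328 0"
set -- $line                # split into positional fields
irq=$7
softirq=$8
echo $(( irq + softirq ))   # prints 2880473
```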
This matches the interrupt situation reported by irqbalance, as the figure shows:
A bit dizzy yet? Have a sip of water!
Next let's look at how irqbalance computes load at the Package level. Combining the figure below with the CPU topology from earlier, it is easy to see that:
Each CORE's load is the sum of the loads of the interrupts attached to it, each DOMAIN's load is the sum of its COREs, and each PACKAGE's load is the sum of its DOMAINs; the computation rolls up level by level like a tree.
Once the load of every CORE, DOMAIN and PACKAGE is known, what remains is to find the least-loaded object within the balancing scope of the interrupt's type and migrate the interrupt to it.
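The level-by-level summation can be sketched with awk over made-up per-IRQ loads (the package/core layout and numbers below are hypothetical):

```shell
#!/bin/sh
# Sum interrupt load up the hierarchy (irq -> core -> package), the same
# bottom-up accumulation irqbalance performs before picking the
# least-loaded object in scope. Columns: package core load.
printf '%s\n' \
    "pkg0 core0 100" \
    "pkg0 core1 300" \
    "pkg1 core2 50" |
awk '{ core[$2] += $3; pkg[$1] += $3 }
     END { for (p in pkg) print p, pkg[p] }' | sort
# prints:
# pkg0 400
# pkg1 50
```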
The migration scope is determined by exactly the table we saw earlier:
int map_class_to_level[7] =
{ BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CACHE, BALANCE_NONE, BALANCE_CORE, BALANCE_CORE, BALANCE_CORE };
Enough water; a short break, and back to it!
Finally, how does irqbalance actually apply the IRQ affinity change? More code:
// activate.c
static void activate_mapping(struct irq_info *info, void *data __attribute__((unused)))
{
        ...
        if ((hint_policy == HINT_POLICY_EXACT) &&
            (!cpus_empty(info->affinity_hint))) {
                applied_mask = info->affinity_hint;
                valid_mask = 1;
        } else if (info->assigned_obj) {
                applied_mask = info->assigned_obj->mask;
                valid_mask = 1;
                if ((hint_policy == HINT_POLICY_SUBSET) &&
                    (!cpus_empty(info->affinity_hint)))
                        cpus_and(applied_mask, applied_mask, info->affinity_hint);
        }

        /*
         * only activate mappings for irqs that have moved
         */
        if (!info->moved && (!valid_mask || check_affinity(info, applied_mask)))
                return;

        if (!info->assigned_obj)
                return;

        sprintf(buf, "/proc/irq/%i/smp_affinity", info->irq);
        file = fopen(buf, "w");
        if (!file)
                return;

        cpumask_scnprintf(buf, PATH_MAX, applied_mask);
        fprintf(file, "%s", buf);
        fclose(file);
        info->moved = 0; /* migration is done */
}

void activate_mappings(void)
{
        for_each_irq(NULL, activate_mapping, NULL);
}
The code above boils down to this shell command:
# echo MASK > /proc/irq/N/smp_affinity
Of course, if the user's policy is HINT_POLICY_EXACT, the mask is taken from /proc/irq/N/affinity_hint; if the policy is HINT_POLICY_SUBSET, the mask written is applied_mask restricted to /proc/irq/N/affinity_hint (a bitwise AND, per the cpus_and call in the code).
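A quick numeric sketch of the SUBSET case, with made-up masks: if the balancer chose CPUs 0-3 (0xf) but the driver's affinity_hint only covers CPUs 0 and 2 (0x5), the mask actually written is the intersection of the two:

```shell
#!/bin/sh
# HINT_POLICY_SUBSET: the final mask is the balancer's choice restricted
# to the driver's affinity_hint (the cpus_and in activate.c).
applied_mask=0x0f       # balancer's pick: CPUs 0-3
affinity_hint=0x05      # driver hint: CPUs 0 and 2
printf '%x\n' $(( applied_mask & affinity_hint ))   # prints 5
```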
Good, the analysis is finally complete!
Summary:
Based on the system's interrupt load, irqbalance automatically migrates interrupts to keep them balanced, while also factoring in power saving. In real-time systems, however, it makes interrupts drift between CPUs on their own, an unpredictable factor for performance; in high-performance scenarios, disabling it is recommended.