Using eBPF to Analyze a Container Network dup-Packet Problem in Depth
| Introduction: In our daily work we all run into kernel networking problems — broken connectivity, packet loss, retransmissions, performance issues and more — that are hard to troubleshoot and fix, and containers add yet more complexity: layer-2/layer-3 forwarding, iptables, IPVS, namespaces and so on. Even a network expert gets a headache facing such an environment. Based on eBPF we developed the skbtracer tool: without modifying kernel code or writing a kernel module, it turns the Linux host network into a complete white box, letting you see the entire journey of a packet from creation to transmission — including, but not limited to, data-flow, namespace and routing tracing, drop reasons, netfilter processing and latency at key points — so that ordinary users can troubleshoot complex network problems and understand how the network works. Taking the "container network dup-packet problem" that has long troubled the Kubernetes community as an example, this article shows how to troubleshoot with the tool and explains in detail how the duplicate packets arise.
Source: Tencent Cloud Native
Background
In the wave of cloud computing, the network has become an indispensable bridge to the cloud. It brings convenience, but also all kinds of strange troubles. What makes them strange is not only that you feel helpless when facing them, but also that when you do solve them, directly or indirectly, you often still do not know why the fix worked. At the root of it, we simply have not truly figured out what is going on underneath.
But becoming truly proficient in networking is not easy. The kernel has a very high technical barrier, and the networking subsystem is its most complex part; with container networking added, it now involves layer-2 forwarding, layer-3 forwarding, iptables, IPVS and crisscrossing namespaces. Facing such a complex environment, even a professional network expert gets a headache when troubleshooting. On top of that, rapid business growth makes network problems more frequent and harder, and with the barrier so high, developers and operators can spend a lot of effort without curing the root cause; more often they can only work around problems by tweaking various kernel parameters.
Since studying the black-box kernel thoroughly is next to impossible, can we instead build a tool that turns the Linux host network into a complete white box for ordinary developers and operators? The answer is yes, and this article uses exactly such a tool to analyze a completely opaque container network problem.
1 Problem Description
While using TKE, a user found that when Pod1 accesses Pod2 on the same node through a Service (ClusterIP), every message Pod1 pushes over UDP shows up twice on Pod2. Capturing packets on each interface along the path Pod1 eth0 <—> veth1 <—> cbr0 <—> veth2 <—> Pod2 eth0 shows only one packet on Pod1 eth0 and veth1, but two identical packets on cbr0, veth2 and Pod2 eth0.
Extending the scenario further, duplicate packets appear whenever all of the following conditions hold:
1. Pod1 and Pod2 are on the same node.
2. Pod1 accesses Pod2 through a Service.
3. The container network is in bridge mode and the bridge has promiscuous mode enabled, as cbr0 does in TKE networking.
These conditions are easy to satisfy, and as a Kubernetes issue (1) shows, the problem has existed since as early as version 1.2.4.
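For reference, the duplication can be observed with nothing more than a pair of UDP sockets. The sketch below is a hypothetical reproduction helper (not part of the original report); it reuses the ubuntu-server ClusterIP 172.18.253.6:6666 that appears later in section 4.1. Run it with the argument "server" inside the server pod and with a message argument inside the client pod; with cbr0 in promiscuous mode the receiver prints every message twice.

#!/usr/bin/env python
# Hypothetical reproduction helper, not taken from the original report.
# server mode: bind a UDP socket and print every datagram received.
# client mode: send one datagram to the Service ClusterIP.
import socket
import sys

SERVICE_ADDR = ("172.18.253.6", 6666)   # ubuntu-server ClusterIP from section 4.1

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("0.0.0.0", 6666))
    while True:
        data, peer = s.recvfrom(2048)
        # With cbr0 in promiscuous mode this line fires twice per message sent.
        print("got %r from %s:%d" % (data, peer[0], peer[1]))

def client(msg):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.sendto(msg.encode(), SERVICE_ADDR)

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[1] if len(sys.argv) > 1 else "hello")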
2 Approach
In practice the problem reproduces 100% of the time, regardless of kernel version. Web searches show that others have hit it and asked for help, but unfortunately nobody had worked out the root cause or a fix:
Duplicate UDP packets between pods https://github.com/kubernetes/kubernetes/issues/26255
Duplication of packets seen in bridge-promiscuous mode when accessing a pod on the same node through service ip https://github.com/kubernetes/kubernetes/issues/25793
Resolve cbr0 hairpin/promiscuous mode duplicate packet issue https://github.com/kubernetes/kubernetes/issues/28650
The only apparent workaround we found (2) likewise does not explain the cause; it simply drops the packet at layer 2 with an ebtables rule.
Since what we knew so far did not reveal what happens inside the kernel black box, the only option was to dig into the kernel itself. When analyzing the details of how the kernel processes a packet, there are several possible approaches:
1. Modify the kernel code and trace the processing details along the packet's path.
Pros: straightforward to implement.
Cons: inflexible — requires repeated modification and kernel reboots, and demands deep familiarity with the protocol stack to know the key paths a packet takes.
2. Insert a kernel module and hook the key functions.
Pros: no need to modify or reboot the kernel.
Cons: inflexible. Many internal functions cannot be called directly, so this approach ends up with a lot of code and is error-prone.
3. Use eBPF, provided by the Linux kernel, to hook functions on the key paths.
Pros: lightweight, safe, easy to write and debug.
Cons: eBPF does not support loops, so some features require fairly tricky workarounds; eBPF is still a relatively new technology, and many features are only complete on newer kernels — see "BPF Features by Linux Kernel Version" (3).
With these approaches in mind, the author implemented skbtracer, a container network analysis tool based on bcc (4), aiming to troubleshoot network problems (host network and container network alike) in one shot.
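To make approach 3 concrete before introducing the tool, here is a minimal bcc sketch of the same idea — not skbtracer itself, just an illustration: attach a kprobe to one key function (ip_rcv here) and print what it sees. It assumes bcc is installed and that ip_rcv is kprobe-able, which holds on the 4.15 test kernel used later in this article.

#!/usr/bin/env python
# Minimal illustration of approach 3 (a sketch, not skbtracer itself):
# hook one key stack function with eBPF via bcc and print what it sees.
from bcc import BPF

prog = r"""
#include <linux/skbuff.h>
#include <linux/ip.h>

// bcc auto-attaches this to a kprobe on ip_rcv(); the first argument is the skb.
int kprobe__ip_rcv(struct pt_regs *ctx, struct sk_buff *skb)
{
    unsigned char *head;
    u16 nh_off;
    struct iphdr iph = {};

    // Read the skb members we need, then the IP header itself.
    bpf_probe_read(&head, sizeof(head), &skb->head);
    bpf_probe_read(&nh_off, sizeof(nh_off), &skb->network_header);
    bpf_probe_read(&iph, sizeof(iph), head + nh_off);

    bpf_trace_printk("ip_rcv: saddr=%x daddr=%x proto=%d\n",
                     iph.saddr, iph.daddr, iph.protocol);
    return 0;
}
"""

b = BPF(text=prog)
print("Tracing ip_rcv()... hit Ctrl-C to stop")
b.trace_print()

skbtracer applies the same technique to many more functions and adds filtering and formatting on top.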
3 The Tool
skbtracer is built on the bcc framework and consists of two files, skbtracer.py and skbtracer.c. The C file is the actual eBPF code that captures and filters kernel events; the Python file handles simple argument parsing and prints the captured events. The tool currently supports tracing routing information, packet drops and netfilter processing; data-flow tracing is still being completed, and the tool will keep improving as it is used on real problems, with the goal of letting ordinary users observe the kernel protocol stack as a white box.
The skbtracer presented here requires kernel 4.15 or later; Ubuntu 18.04 or newer is recommended. You can try the tool as follows:
1. Run the image:

docker run --privileged=true -it ccr.ccs.tencentyun.com/monkeyyang/bpftool:v0.0.1

2. Run the tool:

root@14bd5f0406fd:~# ./skbtracer.py -h
usage: skbtracer.py [-h] [-H IPADDR] [--proto PROTO] [--icmpid ICMPID]
                    [-c CATCH_COUNT] [-P PORT] [-p PID] [-N NETNS]
                    [--dropstack] [--callstack] [--iptable] [--route] [--keep]
                    [-T] [-t]

Trace any packet through TCP/IP stack

optional arguments:
  -h, --help            show this help message and exit
  -H IPADDR, --ipaddr IPADDR
                        ip address
  --proto PROTO         tcp|udp|icmp|any
  --icmpid ICMPID       trace imcp id
  -c CATCH_COUNT, --catch-count CATCH_COUNT
                        catch and print count
  -P PORT, --port PORT  udp or tcp port
  -p PID, --pid PID     trace this PID only
  -N NETNS, --netns NETNS
                        trace this Network Namespace only
  --dropstack           output kernel stack trace when drop packet
  --callstack           output kernel stack trace
  --iptable             output iptable path
  --route               output route path
  --keep                keep trace packet all lifetime
  -T, --time            show HH:MM:SS timestamp
  -t, --timestamp       show timestamp in seconds at us resolution

examples:
  skbtracer.py                                          # trace all packets
  skbtracer.py --proto=icmp -H 1.2.3.4 --icmpid 22      # trace icmp packet with addr=1.2.3.4 and icmpid=22
  skbtracer.py --proto=tcp -H 1.2.3.4 -P 22             # trace tcp packet with addr=1.2.3.4:22
  skbtracer.py --proto=udp -H 1.2.3.4 -P 22             # trace udp packet wich addr=1.2.3.4:22
  skbtracer.py -t -T -p 1 --debug -P 80 -H 127.0.0.1 --proto=tcp --kernel-stack --icmpid=100 -N 10000

3. Capture 10 UDP packets:

root@14bd5f0406fd:~# ./skbtracer.py --proto=udp -c10
time NETWORK_NS INTERFACE DEST_MAC PKT_INFO TRACE_INFO
[12:51:01 ][4026531993] nil 00a8177f422e U:10.0.2.96:60479->183.60.83.19:53 ffff949a8df47700.0:ip_output
[12:51:01 ][4026531993] eth0 00a8177f422e U:10.0.2.96:60479->183.60.83.19:53 ffff949a8df47700.0:ip_finish_output
[12:51:01 ][4026531993] eth0 00a8177f422e U:10.0.2.96:60479->183.60.83.19:53 ffff949a8df47700.0:__dev_queue_xmit
[12:51:01 ][4026531993] nil 09c67eff9b77 U:10.0.2.96:56790->183.60.83.19:53 ffff949a8e655000.0:ip_output
[12:51:01 ][4026531993] eth0 09c67eff9b77 U:10.0.2.96:56790->183.60.83.19:53 ffff949a8e655000.0:ip_finish_output
[12:51:01 ][4026531993] eth0 09c67eff9b77 U:10.0.2.96:56790->183.60.83.19:53 ffff949a8e655000.0:__dev_queue_xmit
[12:51:01 ][4026531993] eth0 5254006c498f U:183.60.83.19:53->10.0.2.96:56790 ffff949a2151bd00.0:napi_gro_receive
[12:51:01 ][4026531993] eth0 5254006c498f U:183.60.83.19:53->10.0.2.96:56790 ffff949a2151bd00.0:__netif_receive_skb
[12:51:01 ][4026531993] eth0 5254006c498f U:183.60.83.19:53->10.0.2.96:56790 ffff949a2151bd00.0:ip_rcv
[12:51:01 ][4026531993] eth0 5254006c498f U:183.60.83.19:53->10.0.2.96:56790 ffff949a2151bd00.0:ip_rcv_finish

Output columns:
Column 1: time at which eBPF captured the kernel event
Column 2: inode number of the namespace the skb is currently in
Column 3: the device pointed to by skb->dev
Column 4: destination MAC address of the packet when the event was captured
Column 5: packet info — layer-4 protocol + layer-3 addresses + layer-4 ports (T for TCP, U for UDP, I for ICMP; other protocols are printed as their protocol number)
Column 6: trace info — skb memory address + skb->pkt_type + name of the hooked function (for netfilter events: pf number + table + chain + verdict)

In column 6, skb->pkt_type has the following meaning (include/uapi/linux/if_packet.h):

/* Packet types */
#define PACKET_HOST         0   /* To us                */
#define PACKET_BROADCAST    1   /* To all               */
#define PACKET_MULTICAST    2   /* To group             */
#define PACKET_OTHERHOST    3   /* To someone else      */
#define PACKET_OUTGOING     4   /* Outgoing of any type */
#define PACKET_LOOPBACK     5   /* MC/BRD frame looped back */
#define PACKET_USER         6   /* To user space        */
#define PACKET_KERNEL       7   /* To kernel space      */
/* Unused, PACKET_FASTROUTE and PACKET_LOOPBACK are invisible to user space */
#define PACKET_FASTROUTE    6   /* Fastrouted frame     */

In column 6, the pf number has the following meaning (include/uapi/linux/netfilter.h):

enum {
    NFPROTO_UNSPEC   =  0,
    NFPROTO_INET     =  1,
    NFPROTO_IPV4     =  2,
    NFPROTO_ARP      =  3,
    NFPROTO_NETDEV   =  5,
    NFPROTO_BRIDGE   =  7,
    NFPROTO_IPV6     = 10,
    NFPROTO_DECNET   = 12,
    NFPROTO_NUMPROTO,
};

4 Analysis
4.1 Environment
OS: Ubuntu 18.04, kernel 4.15.0-54-generic

Pod info:

role     node-side veth   pod eth0 MAC        pod IP        node eth0 IP
client   vethdfe06191     42:0f:47:45:67:ff   172.18.0.15   10.0.2.96
server   veth19678f6d     7a:47:5e:bb:26:b0   172.18.0.12   10.0.2.96
server2  vethe9f5bdcc     52:f5:ce:8f:62:f6   172.18.0.66   10.0.2.138

Node info:

node    eth0 IP      eth0 MAC            cbr0 IP       cbr0 MAC
node1   10.0.2.96    52:54:00:6c:49:8f   172.18.0.1    2e:74:df:86:96:4b
node2   10.0.2.138   52:54:00:b8:05:ae   172.18.0.65   6e:41:27:22:69:0e

Service info:

ubuntu@VM-2-138-ubuntu:~$ kubectl get svc
NAME             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
kubernetes       ClusterIP   172.18.252.1     <none>        443/TCP    10d
ubuntu-server    ClusterIP   172.18.253.6     <none>        6666/UDP   10d
ubuntu-server2   ClusterIP   172.18.253.158   <none>        6667/UDP   5h41m

ubuntu-server has a single backend pod, named server; ubuntu-server2 has a single backend pod, named server2.

Next, skbtracer is used to dump the packet-processing details for each of the following scenarios:
1. cbr0 promiscuous mode on: client accesses a Pod on the same node through a Service
2. cbr0 promiscuous mode on: client accesses a Pod on a different node through a Service
3. cbr0 promiscuous mode on: client accesses a Pod on the same node directly
4. cbr0 promiscuous mode off: client accesses a Pod on the same node through a Service
5. cbr0 promiscuous mode off: client accesses a Pod on a different node through a Service
6. cbr0 promiscuous mode off: client accesses a Pod on the same node directly
4.2 Per-Scenario Processing Details
4.2.1. cbr0 promiscuous mode on: client accesses a Pod on the same node through a Service
root@14bd5f0406fd:~# ./skbtracer.py --route --iptable -H 172.18.0.15 --proto udp
time NETWORK_NS INTERFACE DEST_MAC PKT_INFO TRACE_INFO
[02:05:54 ][4026532282] nil 000000000000 U:172.18.0.15:48655->172.18.253.6:6666 ffff949a8d639a00.0:ip_output
[02:05:54 ][4026532282] eth0 000000000000 U:172.18.0.15:48655->172.18.253.6:6666 ffff949a8d639a00.0:ip_finish_output
[02:05:54 ][4026532282] eth0 000000000000 U:172.18.0.15:48655->172.18.253.6:6666 ffff949a8d639a00.0:__dev_queue_xmit
[02:05:54 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:48655->172.18.253.6:6666 ffff949a8d639a00.3:__netif_receive_skb
[02:05:54 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:48655->172.18.253.6:6666 ffff949a8d639a00.3:br_handle_frame
[02:05:54 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:48655->172.18.253.6:6666 ffff949a8d639a00.0:br_nf_pre_routing
[02:05:54 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:48655->172.18.253.6:6666 ffff949a8d639a00.0:2.nat.PREROUTING.ACCEPT
[02:05:54 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:br_nf_pre_routing_finish
[02:05:54 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:br_handle_frame_finish
[02:05:54 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:deliver_clone
[02:05:54 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:__br_forward
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:br_nf_forward_ip
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:2.filter.FORWARD.ACCEPT
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:br_nf_forward_finish
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:br_forward_finish
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:br_nf_post_routing
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:2.nat.POSTROUTING.ACCEPT
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:br_nf_dev_queue_xmit
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:__dev_queue_xmit
[02:05:54 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:br_pass_frame_up
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:br_netif_receive_skb
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:__netif_receive_skb
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:ip_rcv
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:ip_rcv_finish
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:2.filter.FORWARD.ACCEPT
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:ip_output
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:ip_finish_output
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:__dev_queue_xmit
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:deliver_clone
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639000.0:__br_forward
[02:05:54 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639000.0:br_forward_finish
[02:05:54 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639000.0:br_nf_post_routing
[02:05:54 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639000.0:__dev_queue_xmit
[02:05:54 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:__br_forward
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:br_forward_finish
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:br_nf_post_routing
[02:05:54 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:__dev_queue_xmit
[02:05:54 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:__netif_receive_skb
[02:05:54 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:ip_rcv
[02:05:54 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639400.0:ip_rcv_finish
[02:05:54 ][4026532282] eth0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639000.3:__netif_receive_skb
[02:05:54 ][4026532282] eth0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639000.3:ip_rcv
[02:05:54 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:__netif_receive_skb
[02:05:54 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:ip_rcv
[02:05:54 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:48655->172.18.0.12:6666 ffff949a8d639a00.0:ip_rcv_finish

4.2.2. cbr0 promiscuous mode on: client accesses a Pod on a different node through a Service
root@14bd5f0406fd:~# ./skbtracer.py --route --iptable -H 172.18.0.15 --proto udp
time NETWORK_NS INTERFACE DEST_MAC PKT_INFO TRACE_INFO
[02:06:41 ][4026532282] nil 005ea63e937c U:172.18.0.15:53947->172.18.253.158:6667 ffff949a8d639400.0:ip_output
[02:06:41 ][4026532282] eth0 005ea63e937c U:172.18.0.15:53947->172.18.253.158:6667 ffff949a8d639400.0:ip_finish_output
[02:06:41 ][4026532282] eth0 005ea63e937c U:172.18.0.15:53947->172.18.253.158:6667 ffff949a8d639400.0:__dev_queue_xmit
[02:06:41 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:53947->172.18.253.158:6667 ffff949a8d639400.3:__netif_receive_skb
[02:06:41 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:53947->172.18.253.158:6667 ffff949a8d639400.3:br_handle_frame
[02:06:41 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:53947->172.18.253.158:6667 ffff949a8d639400.0:br_nf_pre_routing
[02:06:41 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:53947->172.18.253.158:6667 ffff949a8d639400.0:2.nat.PREROUTING.ACCEPT
[02:06:41 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:br_nf_pre_routing_finish
[02:06:41 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:br_handle_frame_finish
[02:06:41 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:br_pass_frame_up
[02:06:41 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:br_netif_receive_skb
[02:06:41 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:__netif_receive_skb
[02:06:41 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:ip_rcv
[02:06:41 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:ip_rcv_finish
[02:06:41 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:2.filter.FORWARD.ACCEPT
[02:06:41 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:ip_output
[02:06:41 ][4026531993] eth0 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:2.nat.POSTROUTING.ACCEPT
[02:06:41 ][4026531993] eth0 2e74df86964b U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:ip_finish_output
[02:06:41 ][4026531993] eth0 feee1e1bcbe0 U:172.18.0.15:53947->172.18.0.66:6667 ffff949a8d639400.0:__dev_queue_xmit

4.2.3. cbr0 promiscuous mode on: client accesses a Pod on the same node directly
root@14bd5f0406fd:~# ./skbtracer.py --route --iptable -H 172.18.0.15 --proto udp
time NETWORK_NS INTERFACE DEST_MAC PKT_INFO TRACE_INFO
[02:07:21 ][4026532282] nil 000000000000 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.0:ip_output
[02:07:21 ][4026532282] eth0 000000000000 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.0:ip_finish_output
[02:07:21 ][4026532282] eth0 000000000000 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.0:__dev_queue_xmit
[02:07:21 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.3:__netif_receive_skb
[02:07:21 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.3:br_handle_frame
[02:07:21 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.3:br_nf_pre_routing
[02:07:21 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.0:2.nat.PREROUTING.ACCEPT
[02:07:21 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.0:br_nf_pre_routing_finish
[02:07:21 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.3:br_handle_frame_finish
[02:07:21 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.3:br_forward
[02:07:21 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.3:deliver_clone
[02:07:21 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.3:__br_forward
[02:07:21 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.3:br_nf_forward_ip
[02:07:21 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.0:2.filter.FORWARD.ACCEPT
[02:07:21 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.0:br_nf_forward_finish
[02:07:21 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.3:br_forward_finish
[02:07:21 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.3:br_nf_post_routing
[02:07:21 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.0:2.nat.POSTROUTING.ACCEPT
[02:07:21 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.0:br_nf_dev_queue_xmit
[02:07:21 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.0:__dev_queue_xmit
[02:07:21 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.3:br_pass_frame_up
[02:07:21 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.3:br_netif_receive_skb
[02:07:21 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.3:__netif_receive_skb
[02:07:21 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64b00.3:ip_rcv
[02:07:21 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.0:__netif_receive_skb
[02:07:21 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.0:ip_rcv
[02:07:21 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:37729->172.18.0.12:6666 ffff949a8eb64900.0:ip_rcv_finish

4.2.4. cbr0 promiscuous mode off: client accesses a Pod on the same node through a Service
root@14bd5f0406fd:~# ./skbtracer.py --route --iptable -H 172.18.0.15 --proto udp
time NETWORK_NS INTERFACE DEST_MAC PKT_INFO TRACE_INFO
[02:08:32 ][4026532282] nil 3dc486eebf45 U:172.18.0.15:53281->172.18.253.6:6666 ffff949a2151b100.0:ip_output
[02:08:32 ][4026532282] eth0 3dc486eebf45 U:172.18.0.15:53281->172.18.253.6:6666 ffff949a2151b100.0:ip_finish_output
[02:08:32 ][4026532282] eth0 3dc486eebf45 U:172.18.0.15:53281->172.18.253.6:6666 ffff949a2151b100.0:__dev_queue_xmit
[02:08:32 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:53281->172.18.253.6:6666 ffff949a2151b100.3:__netif_receive_skb
[02:08:32 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:53281->172.18.253.6:6666 ffff949a2151b100.3:br_handle_frame
[02:08:32 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:53281->172.18.253.6:6666 ffff949a2151b100.0:br_nf_pre_routing
[02:08:32 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:53281->172.18.253.6:6666 ffff949a2151b100.0:2.nat.PREROUTING.ACCEPT
[02:08:32 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:br_nf_pre_routing_finish
[02:08:32 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:br_handle_frame_finish
[02:08:32 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:br_forward
[02:08:32 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:__br_forward
[02:08:32 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:br_nf_forward_ip
[02:08:32 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:2.filter.FORWARD.ACCEPT
[02:08:32 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:br_nf_forward_finish
[02:08:32 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:br_forward_finish
[02:08:32 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:br_nf_post_routing
[02:08:32 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:2.nat.POSTROUTING.ACCEPT
[02:08:32 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:br_nf_dev_queue_xmit
[02:08:32 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:__dev_queue_xmit
[02:08:32 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:__netif_receive_skb
[02:08:32 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:ip_rcv
[02:08:32 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:53281->172.18.0.12:6666 ffff949a2151b100.0:ip_rcv_finish

4.2.5. cbr0 promiscuous mode off: client accesses a Pod on a different node through a Service
root@14bd5f0406fd:~# ./skbtracer.py --route --iptable -H 172.18.0.15 --proto udp
time NETWORK_NS INTERFACE DEST_MAC PKT_INFO TRACE_INFO
[02:09:08 ][4026532282] nil 000000000000 U:172.18.0.15:44436->172.18.253.158:6667 ffff949a8e655b00.0:ip_output
[02:09:08 ][4026532282] eth0 000000000000 U:172.18.0.15:44436->172.18.253.158:6667 ffff949a8e655b00.0:ip_finish_output
[02:09:08 ][4026532282] eth0 000000000000 U:172.18.0.15:44436->172.18.253.158:6667 ffff949a8e655b00.0:__dev_queue_xmit
[02:09:08 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:44436->172.18.253.158:6667 ffff949a8e655b00.3:__netif_receive_skb
[02:09:08 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:44436->172.18.253.158:6667 ffff949a8e655b00.3:br_handle_frame
[02:09:08 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:44436->172.18.253.158:6667 ffff949a8e655b00.0:br_nf_pre_routing
[02:09:08 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:44436->172.18.253.158:6667 ffff949a8e655b00.0:2.nat.PREROUTING.ACCEPT
[02:09:08 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:br_nf_pre_routing_finish
[02:09:08 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:br_handle_frame_finish
[02:09:08 ][4026531993] vethdfe06191 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:br_pass_frame_up
[02:09:08 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:br_netif_receive_skb
[02:09:08 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:__netif_receive_skb
[02:09:08 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:ip_rcv
[02:09:08 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:ip_rcv_finish
[02:09:08 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:2.filter.FORWARD.ACCEPT
[02:09:08 ][4026531993] cbr0 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:ip_output
[02:09:08 ][4026531993] eth0 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:2.nat.POSTROUTING.ACCEPT
[02:09:08 ][4026531993] eth0 2e74df86964b U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:ip_finish_output
[02:09:08 ][4026531993] eth0 feee1e1bcbe0 U:172.18.0.15:44436->172.18.0.66:6667 ffff949a8e655b00.0:__dev_queue_xmit

4.2.6. cbr0 promiscuous mode off: client accesses a Pod on the same node directly
root@14bd5f0406fd:~# ./skbtracer.py --route --iptable -H 172.18.0.15 --proto udp
time NETWORK_NS INTERFACE DEST_MAC PKT_INFO TRACE_INFO
[02:09:36 ][4026532282] nil 000000000000 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:ip_output
[02:09:36 ][4026532282] eth0 000000000000 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:ip_finish_output
[02:09:36 ][4026532282] eth0 000000000000 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:__dev_queue_xmit
[02:09:36 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.3:__netif_receive_skb
[02:09:36 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.3:br_handle_frame
[02:09:36 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.3:br_nf_pre_routing
[02:09:36 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:2.nat.PREROUTING.ACCEPT
[02:09:36 ][4026531993] cbr0 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:br_nf_pre_routing_finish
[02:09:36 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.3:br_handle_frame_finish
[02:09:36 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.3:br_forward
[02:09:36 ][4026531993] vethdfe06191 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.3:__br_forward
[02:09:36 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.3:br_nf_forward_ip
[02:09:36 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:2.filter.FORWARD.ACCEPT
[02:09:36 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:br_nf_forward_finish
[02:09:36 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.3:br_forward_finish
[02:09:36 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.3:br_nf_post_routing
[02:09:36 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:2.nat.POSTROUTING.ACCEPT
[02:09:36 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:br_nf_dev_queue_xmit
[02:09:36 ][4026531993] veth19678f6d 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:__dev_queue_xmit
[02:09:36 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:__netif_receive_skb
[02:09:36 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:ip_rcv
[02:09:36 ][4026532360] eth0 7a475ebb26b0 U:172.18.0.15:41067->172.18.0.12:6666 ffff949a8d570300.0:ip_rcv_finish

5 Explaining the Observations: Processing Flow Combined with the Kernel Code
Comparing the processing flows of the different scenarios shows that the key steps inside br_handle_frame_finish produce different results in different scenarios, so let's look at this function first:
/* note: already called with rcu_read_lock */
int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    struct net_bridge_port *p = br_port_get_rcu(skb->dev);
    enum br_pkt_type pkt_type = BR_PKT_UNICAST;
    struct net_bridge_fdb_entry *dst = NULL;
    struct net_bridge_mdb_entry *mdst;
    bool local_rcv, mcast_hit = false;
    const unsigned char *dest;
    struct net_bridge *br;
    u16 vid = 0;

    if (!p || p->state == BR_STATE_DISABLED)
        goto drop;

    if (!br_allowed_ingress(p->br, nbp_vlan_group_rcu(p), skb, &vid))
        goto out;

    nbp_switchdev_frame_mark(p, skb);

    /* insert into forwarding database after filtering to avoid spoofing */
    br = p->br;
    if (p->flags & BR_LEARNING)
        br_fdb_update(br, p, eth_hdr(skb)->h_source, vid, false);

    // Key point: if the bridge device is in promiscuous mode, local_rcv is set to true
    local_rcv = !!(br->dev->flags & IFF_PROMISC);
    dest = eth_hdr(skb)->h_dest;
    if (is_multicast_ether_addr(dest)) {
        /* by definition the broadcast is also a multicast address */
        if (is_broadcast_ether_addr(dest)) {
            pkt_type = BR_PKT_BROADCAST;
            local_rcv = true;
        } else {
            pkt_type = BR_PKT_MULTICAST;
            if (br_multicast_rcv(br, p, skb, vid))
                goto drop;
        }
    }

    if (p->state == BR_STATE_LEARNING)
        goto drop;

    BR_INPUT_SKB_CB(skb)->brdev = br->dev;

    if (IS_ENABLED(CONFIG_INET) &&
        (skb->protocol == htons(ETH_P_ARP) ||
         skb->protocol == htons(ETH_P_RARP))) {
        br_do_proxy_suppress_arp(skb, br, vid, p);
    } else if (IS_ENABLED(CONFIG_IPV6) &&
               skb->protocol == htons(ETH_P_IPV6) &&
               br->neigh_suppress_enabled &&
               pskb_may_pull(skb, sizeof(struct ipv6hdr) +
                             sizeof(struct nd_msg)) &&
               ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
            struct nd_msg *msg, _msg;

            msg = br_is_nd_neigh_msg(skb, &_msg);
            if (msg)
                br_do_suppress_nd(skb, br, vid, p, msg);
    }

    switch (pkt_type) {
    // In the scenarios discussed here pkt_type can only be BR_PKT_UNICAST
    case BR_PKT_MULTICAST:
        mdst = br_mdb_get(br, skb, vid);
        if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
            br_multicast_querier_exists(br, eth_hdr(skb))) {
            if ((mdst && mdst->host_joined) ||
                br_multicast_is_router(br)) {
                local_rcv = true;
                br->dev->stats.multicast++;
            }
            mcast_hit = true;
        } else {
            local_rcv = true;
            br->dev->stats.multicast++;
        }
        break;
    case BR_PKT_UNICAST:
        // Look up which port the packet should go to, based on its destination MAC address
        dst = br_fdb_find_rcu(br, dest, vid);
    default:
        break;
    }

    if (dst) {
        unsigned long now = jiffies;

        if (dst->is_local) // destination MAC is the bridge device (cbr0): pass the packet up to layer 3
            return br_pass_frame_up(skb);

        if (now != dst->used)
            dst->used = now;
        br_forward(dst->dst, skb, local_rcv, false); // destination MAC is another bridge port: forward directly on the bridge
    } else {
        if (!mcast_hit)
            br_flood(br, skb, pkt_type, local_rcv, false);
        else
            br_multicast_flood(mdst, skb, local_rcv, false);
    }

    if (local_rcv) // Key point: with promiscuous mode on, the packet is passed up to layer 3 again here
        return br_pass_frame_up(skb);

out:
    return 0;
drop:
    kfree_skb(skb);
    goto out;
}

5.1 Why does the client see a dup packet when accessing a Pod on the same node through a Service?
Comparing the kernel functions br_handle_frame_finish and br_forward with the tool output in 4.2.1, the flow is as follows:
By the time br_handle_frame_finish runs, DNAT has already been applied and the destination MAC address is that of the backend Pod on the same node; from here the packet takes two parallel paths (a quick way to confirm that both fire is sketched after the two paths below).
Path 1: br_forward() —> deliver_clone() —> __br_forward() —> sent directly to the Service's backend Pod.
Path 2: br_pass_frame_up() —> __netif_receive_skb() (cbr0) —> ip_rcv() —> ip_output() —> sent to the Service's backend Pod.
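A quick way to confirm that both branches really fire for a single request is to count calls to the functions on each path. The rough, unfiltered bcc sketch below is only an illustration (it is not part of skbtracer); br_pass_frame_up, deliver_clone and __br_forward are internal bridge symbols, but the traces above show they are hookable on this kernel. Run it on an otherwise quiet node, send one UDP message from the client pod, then interrupt it: with cbr0 in promiscuous mode all three counters go up for the same packet.

#!/usr/bin/env python
# Rough, unfiltered sketch: count hits on the functions of the two bridge paths.
from bcc import BPF
import time

prog = r"""
BPF_HASH(hits, u32, u64);

int kprobe__br_pass_frame_up(struct pt_regs *ctx) { u32 k = 0; hits.increment(k); return 0; }
int kprobe__deliver_clone(struct pt_regs *ctx)    { u32 k = 1; hits.increment(k); return 0; }
int kprobe____br_forward(struct pt_regs *ctx)     { u32 k = 2; hits.increment(k); return 0; }
"""

names = ["br_pass_frame_up", "deliver_clone", "__br_forward"]
b = BPF(text=prog)
print("Counting... send one message from the client pod, then hit Ctrl-C")
try:
    time.sleep(3600)
except KeyboardInterrupt:
    pass
for k, v in sorted(b["hits"].items(), key=lambda kv: kv[0].value):
    print("%-20s %d" % (names[k.value], v.value))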
5.2 Why is there no dup packet when the client accesses a Pod on the same node directly?
A careful comparison shows that the processing in 4.2.3 is similar to 4.2.1 (access via a Service): the packet again takes two paths, the first path is identical, and the second is almost identical — the difference is the value of skb->pkt_type when ip_rcv() runs:
in 4.2.1, skb->pkt_type == PACKET_HOST
in 4.2.3, skb->pkt_type == PACKET_OTHERHOST
However, ip_rcv() drops packets with skb->pkt_type == PACKET_OTHERHOST right at its entry, which cuts the second path short.
The following snippet shows the kernel function ip_rcv() dropping packets whose skb->pkt_type == PACKET_OTHERHOST:
/*
 *  Main IP Receive routine.
 */
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
{
    const struct iphdr *iph;
    struct net *net;
    u32 len;

    /* When the interface is in promisc. mode, drop all the crap
     * that it receives, do not try to analyse it.
     */
    if (skb->pkt_type == PACKET_OTHERHOST)
        goto drop;

So why does skb->pkt_type differ between 4.2.1 and 4.2.3? The reason is that br_handle_frame reassigns skb->pkt_type based on the packet's destination MAC address (the short sketch after the two cases below shows one way to verify this):
If the destination MAC address is cbr0 (the Pod's default gateway), then skb->pkt_type = PACKET_HOST;
otherwise, skb->pkt_type = PACKET_OTHERHOST.
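The sketch below is a hedged illustration (not part of skbtracer) of how to observe this: it prints skb->pkt_type and the receiving device every time ip_rcv() runs. Because pkt_type is a 3-bit bitfield, it reads the byte at the __pkt_type_offset marker in struct sk_buff and masks the low bits, which assumes a little-endian layout. In scenario 4.2.1 the copy arriving on cbr0 reports 0 (PACKET_HOST); in 4.2.3 it reports 3 (PACKET_OTHERHOST) and is dropped immediately afterwards.

#!/usr/bin/env python
# Hedged sketch: print skb->pkt_type and the receiving device at ip_rcv().
from bcc import BPF

prog = r"""
#include <linux/skbuff.h>
#include <linux/netdevice.h>

int kprobe__ip_rcv(struct pt_regs *ctx, struct sk_buff *skb,
                   struct net_device *dev)
{
    // pkt_type is a bitfield; read the byte at the __pkt_type_offset marker
    // and keep the low three bits (little-endian layout assumed).
    u8 byte = 0;
    bpf_probe_read(&byte, 1,
                   (char *)skb + offsetof(struct sk_buff, __pkt_type_offset));

    char devname[IFNAMSIZ] = {0};
    bpf_probe_read(&devname, IFNAMSIZ, dev->name);

    bpf_trace_printk("ip_rcv on %s pkt_type=%d\n", devname, byte & 0x7);
    return 0;
}
"""

b = BPF(text=prog)
print("Tracing ip_rcv()... hit Ctrl-C to stop")
b.trace_print()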
5.3 Which difference between promiscuous mode on and off causes the dup packet?
The kernel contains the following comment:
static int br_pass_frame_up(struct sk_buff *skb)
{
    ....
    /* Bridge is just like any other port.  Make sure the
     * packet is allowed except in promisc modue when someone
     * may be running packet capture.
     */
    .....

With promiscuous mode enabled on cbr0, the kernel's logic assumes someone may be capturing packets on cbr0, so the packet must be allowed to reach cbr0: a clone of the packet is handed to the cbr0 device via br_pass_frame_up (the original packet is forwarded directly to the corresponding port). When cbr0 receives the clone, it climbs from layer 2 into layer 3 and reaches ip_rcv (which, as explained in 5.2, keeps processing when skb->pkt_type == PACKET_HOST); layer 3 then performs IP forwarding and sends this cloned packet to the destination Pod once more.
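Before comparing traces it can help to check which state your bridge is actually in. The bit tested by br_handle_frame_finish (IFF_PROMISC in br->dev->flags) is visible from user space; a small sketch, assuming the bridge is named cbr0 as in this article's environment:

#!/usr/bin/env python
# Quick check of the bridge's promiscuous flag -- the same IFF_PROMISC bit
# that br_handle_frame_finish tests via br->dev->flags.
IFF_PROMISC = 0x100                      # from include/uapi/linux/if.h

with open("/sys/class/net/cbr0/flags") as f:
    flags = int(f.read().strip(), 16)    # the file holds a hex string such as 0x1103

print("cbr0 promiscuous mode:", "on" if flags & IFF_PROMISC else "off")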
5.4 Why does skb->dev suddenly change at the netfilter processing points?
The kernel has the following comment:
/* Direct IPv6 traffic to br_nf_pre_routing_ipv6.
 * Replicate the checks that IPv4 does on packet reception.
 * Set skb->dev to the bridge device (i.e. parent of the
 * receiving device) to make netfilter happy, the REDIRECT
 * target in particular.  Save the original destination IP
 * address to be able to detect DNAT afterwards. */
static unsigned int br_nf_pre_routing(unsigned int hook, struct sk_buff *skb,
                                      const struct net_device *in,
                                      const struct net_device *out,
                                      int (*okfn)(struct sk_buff *))

The comment explains the sudden change of the skb dev at the netfilter points: when iptables rules are invoked in bridge mode, skb->dev is set to the bridge device so that the rules see cbr0 as the interface currently handling the packet.
5.5 Why does a capture sometimes show two copies of the packet, and sometimes only one?
We will not analyze specific capture scenarios here. The way capturing works is this: when a capture is started in user space, the tpacket_rcv callback is registered with the stack at run time; inside the callback, a BPF filter checks whether the packet matches the capture conditions, and if so a clone of the skb is placed into the ring buffer. This callback is invoked at the following two points:
dev_hard_start_xmit (the capture point for packets an interface sends)
__netif_receive_skb_core (the capture point for packets an interface receives)
Whenever a packet passes through these two functions, tcpdump can capture it; a rough way to watch the capture callback fire is sketched below.
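The rough bcc sketch below counts how many times tpacket_rcv runs while a capture is active; tpacket_rcv is an internal (static) symbol of af_packet.c, so this assumes it appears in /proc/kallsyms and is kprobe-able on your kernel, and that your tcpdump/libpcap uses the TPACKET ring (modern versions do).

#!/usr/bin/env python
# Rough sketch: count invocations of the packet-socket capture callback.
# Start tcpdump on one interface, send a message, then interrupt this script.
from bcc import BPF
import time

b = BPF(text=r"""
BPF_HASH(hits, u32, u64);
int kprobe__tpacket_rcv(struct pt_regs *ctx) { u32 k = 0; hits.increment(k); return 0; }
""")

print("Counting tpacket_rcv()... hit Ctrl-C to stop")
try:
    time.sleep(3600)
except KeyboardInterrupt:
    pass
for k, v in b["hits"].items():
    print("tpacket_rcv calls:", v.value)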
That covers the analysis of all the suspicious points. If you still have doubts after reading this article, feel free to raise them so we can analyze and discuss them together.
6 Solution
The analysis above shows that with promiscuous mode enabled, the destination receives two copies of each UDP datagram (TCP actually receives two as well, but the receiving TCP stack handles it and treats the second as a retransmission). The crux of the problem is that, in order for cbr0 to be able to see the packet, br_handle_frame_finish clones an extra skb, which travels up to the node's ip_rcv.
So to fix the problem we only need to drop that extra packet early, before it reaches the destination Pod. Someone in the Kubernetes community has proposed dropping it with ebtables rules (5): install ebtables rules on the veth side that drop packets whose source address is in the Pod subnet and whose source MAC is cbr0's:
ebtables -t filter -N CNI-DEDUP
ebtables -t filter -A OUTPUT -j CNI-DEDUP
ebtables -t filter -A CNI-DEDUP -p IPv4 -s 2e:74:df:86:96:4b -o veth+ --ip-src 172.18.0.1 -j ACCEPT
ebtables -t filter -A CNI-DEDUP -p IPv4 -s 2e:74:df:86:96:4b -o veth+ --ip-src 172.18.0.1/24 -j DROP

Testing shows that this effectively prevents the destination Pod from receiving two copies. You may wonder: if this patch was merged into the Kubernetes master branch long ago, why does the problem show up again now? The reason is that Kubernetes later introduced the CNI plugin mechanism and moved the networking functionality originally implemented in kubenet into CNI plugins, and the corresponding plugin did not carry this behavior over.
Summary
Using the eBPF tool skbtracer, this article analyzed the root cause of the dup-packet problem in container bridge mode and discussed a solution. The tool produces a large amount of output, but it greatly simplifies what would otherwise be a complicated analysis. This article only covers the conclusions relevant to this particular problem and does not walk through every detail of packet processing, but cross-comparing the tool's output will reward you with plenty of "aha, so that's how it works" moments.
References
The links listed here correspond to the items marked with 『 』 in the text; you can follow them for further reading.
- END -