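Checksum in Linux Kernel
Calculating the IP/TCP/UDP checksum
Put simply, the data to be checksummed is summed in 16-bit units (with the carries wrapped back into the sum), and the result is then bitwise inverted.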
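The rule above can be sketched in user space as follows. This is a minimal illustration, not the kernel's implementation: the names `csum_add_words`, `csum_fold16` and `inet_checksum` are made up for this example, and the 32-bit accumulator is only safe for buffers up to roughly 64 KiB.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the buffer to sum as big-endian 16-bit words. */
static uint32_t csum_add_words(const uint8_t *data, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1) /* an odd trailing byte is padded with a zero byte */
        sum += (uint32_t)(data[len - 1] << 8);
    return sum;
}

/* Wrap the carries back into the low 16 bits, then invert. */
static uint16_t csum_fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Checksum of a whole buffer, as a host-order value. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    return csum_fold16(csum_add_words(data, len, 0));
}
```

A useful property: running the same function over data that already contains a correct checksum yields 0, which is how receivers verify it.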
When TCP receives a packet, it checks the checksum:
static __sum16 tcp_v4_checksum_init(struct sk_buff *skb)
{
    const struct iphdr *iph = ip_hdr(skb);

    if (skb->ip_summed == CHECKSUM_COMPLETE) {
        if (!tcp_v4_check(skb->len, iph->saddr, /// add the TCP/UDP pseudo-header and verify
                          iph->daddr, skb->csum)) {
            skb->ip_summed = CHECKSUM_UNNECESSARY;
            return 0;
        }
    }

    skb->csum = csum_tcpudp_nofold(iph->saddr, iph->daddr,
                                   skb->len, IPPROTO_TCP, 0); /// calc pseudo-header checksum

    if (skb->len <= 76) {
        return __skb_checksum_complete(skb); /// starting from the pseudo-header sum, checksum the whole packet
    }
    return 0;
}
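csum_tcpudp_nofold computes the (unfolded) pseudo-header sum, and __skb_checksum_complete uses that pseudo-header sum (skb->csum) as the starting value to compute the checksum of the whole skb.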
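The two steps can be sketched in user space as below. This is only an illustration under simplifying assumptions: addresses are passed as host-order integers (for example 0xc0a80001 for 192.168.0.1), and `sum_be16`, `fold16`, `pseudo_hdr_sum` and `l4_checksum` are hypothetical names, not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

#define PROTO_TCP 6 /* same numeric value as IPPROTO_TCP */

/* Add the buffer to sum as big-endian 16-bit words (odd tail zero-padded). */
static uint32_t sum_be16(const uint8_t *p, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((p[i] << 8) | p[i + 1]);
    if (len & 1)
        sum += (uint32_t)(p[len - 1] << 8);
    return sum;
}

/* Fold the carries into 16 bits, without inverting. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Unfolded pseudo-header sum: source, destination, protocol, L4 length. */
static uint32_t pseudo_hdr_sum(uint32_t saddr, uint32_t daddr,
                               uint8_t proto, uint16_t l4_len)
{
    return (saddr >> 16) + (saddr & 0xffff) +
           (daddr >> 16) + (daddr & 0xffff) + proto + l4_len;
}

/* Checksum of an L4 segment, seeded with the pseudo-header sum.
 * Over a segment whose checksum field is already filled in, the
 * result is 0 when the segment is intact. */
static uint16_t l4_checksum(uint32_t saddr, uint32_t daddr, uint8_t proto,
                            const uint8_t *seg, size_t len)
{
    uint32_t sum = pseudo_hdr_sum(saddr, daddr, proto, (uint16_t)len);

    return (uint16_t)~fold16(sum_be16(seg, len, sum));
}
```

Seeding the data sum with the pseudo-header sum mirrors the role skb->csum plays in the function above.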
net_device->features
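The net_device->features field describes the capabilities of the device. Some of its bits describe the hardware checksum offload capability:
#define NETIF_F_HW_CSUM     __NETIF_F(HW_CSUM)
#define NETIF_F_IP_CSUM     __NETIF_F(IP_CSUM) /// IPv4 + TCP/UDP
#define NETIF_F_IPV6_CSUM   __NETIF_F(IPV6_CSUM)
NETIF_F_IP_CSUM means the hardware can compute the L4 checksum, but only for TCP and UDP over IPv4. Some devices extend this support to VXLAN and NVGRE.
NETIF_F_IP_CSUM is a protocol-aware way of offloading the checksum. Concretely, the upper layer provides two checksum parameters (csum_start and csum_offset).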
NETIF_F_HW_CSUM is a protocol agnostic method to offload the transmit checksum. In this method the host
provides checksum related parameters in a transmit descriptor for a packet. These parameters include the
starting offset of data to checksum and the offset in the packet where the computed checksum is to be written. The
length of data to checksum is implicitly the length of the packet minus the starting offset.
It is worth noting that igb/ixgbe use NETIF_F_IP_CSUM.
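To make the protocol-agnostic method concrete, here is a rough user-space model of what a device does with the two parameters. It is a sketch only: `region_sum16` and `hw_csum_offload` are invented names, and offsets are taken from the start of the packet buffer rather than from skb->head.

```c
#include <stddef.h>
#include <stdint.h>

/* Folded (not inverted) 16-bit sum of pkt[start .. len). */
static uint16_t region_sum16(const uint8_t *pkt, size_t start, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = start; i + 1 < len; i += 2)
        sum += (uint32_t)((pkt[i] << 8) | pkt[i + 1]);
    if ((len - start) & 1) /* odd trailing byte, zero-padded */
        sum += (uint32_t)(pkt[len - 1] << 8);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Model of a protocol-agnostic transmit offload: checksum everything
 * from csum_start to the end of the packet and store the inverted
 * result at csum_start + csum_offset. Whatever the host left in the
 * checksum field (for example a pseudo-header seed) is part of the sum. */
static void hw_csum_offload(uint8_t *pkt, size_t len,
                            size_t csum_start, size_t csum_offset)
{
    uint16_t csum = (uint16_t)~region_sum16(pkt, csum_start, len);

    pkt[csum_start + csum_offset] = (uint8_t)(csum >> 8);
    pkt[csum_start + csum_offset + 1] = (uint8_t)(csum & 0xff);
}
```

The device needs no knowledge of TCP or UDP here, which is why the length to checksum is simply the packet length minus the starting offset.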
sk_buff
The meaning of skb->csum and skb->ip_summed differs depending on whether the skb is a received packet or a packet being transmitted.
/*
*@csum: Checksum (must include start/offset pair)
*@csum_start: Offset from skb->head where checksumming should start
*@csum_offset: Offset from csum_start where checksum should be stored
*@ip_summed: Driver fed us an IP checksum
*/
struct sk_buff {
    /* ... */
    union {
        __wsum csum;
        struct {
            __u16 csum_start;
            __u16 csum_offset;
        };
    };
    __u8 local_df:1,
         cloned:1,
         ip_summed:2,
         nohdr:1,
         nfctinfo:3;
    /* ... */
};
Typical values of skb->ip_summed:
/* Don't change this without changing skb_csum_unnecessary! */
#define CHECKSUM_NONE 0
#define CHECKSUM_UNNECESSARY 1 ///hardware verified the checksums
#define CHECKSUM_COMPLETE 2
#define CHECKSUM_PARTIAL 3 ///only the pseudo-header sum is computed; the L4 header and data are left to the device
Checksum on receive
For a received packet, skb->csum may contain the L4 checksum, and skb->ip_summed describes the state of the L4 checksum:
(1) CHECKSUM_UNNECESSARY
CHECKSUM_UNNECESSARY means the underlying hardware has already verified the checksum. Taking the igb driver as an example:
igb_poll -> igb_clean_rx_irq -> igb_process_skb_fields -> igb_rx_checksum:
static inline void igb_rx_checksum(struct igb_ring *ring,
                                   union e1000_adv_rx_desc *rx_desc,
                                   struct sk_buff *skb)
{
    ///...
    /* Rx checksum disabled via ethtool */
    if (!(ring->netdev->features & NETIF_F_RXCSUM)) /// RXCSUM is turned off
        return;

    /* TCP/UDP checksum error bit is set */
    if (igb_test_staterr(rx_desc,
                         E1000_RXDEXT_STATERR_TCPE |
                         E1000_RXDEXT_STATERR_IPE)) {
        /* work around errata with sctp packets where the TCPE aka
         * L4E bit is set incorrectly on 64 byte (60 byte w/o crc)
         * packets, (aka let the stack check the crc32c)
         */
        if (!((skb->len == 60) &&
              test_bit(IGB_RING_FLAG_RX_SCTP_CSUM, &ring->flags))) {
            u64_stats_update_begin(&ring->rx_syncp);
            ring->rx_stats.csum_err++;
            u64_stats_update_end(&ring->rx_syncp);
        }
        /* let the stack verify checksum errors */
        return;
    }
    /* It must be a TCP or UDP packet with a valid checksum */
    if (igb_test_staterr(rx_desc, E1000_RXD_STAT_TCPCS |
                                  E1000_RXD_STAT_UDPCS))
        skb->ip_summed = CHECKSUM_UNNECESSARY; /// the stack does not need to verify again
}
When the TCP layer receives the packet and sees that skb->ip_summed is CHECKSUM_UNNECESSARY, it does not check the checksum again:
int tcp_v4_rcv(struct sk_buff *skb)
{
    ///...
    /* An explanation is required here, I think.
     * Packet length and doff are validated by header prediction,
     * provided case of th->doff==0 is eliminated.
     * So, we defer the checks. */
    if (!skb_csum_unnecessary(skb) && tcp_v4_checksum_init(skb))
        goto csum_error;
    ///...
}

static inline int skb_csum_unnecessary(const struct sk_buff *skb)
{
    return skb->ip_summed & CHECKSUM_UNNECESSARY;
}
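The warning "Don't change this without changing skb_csum_unnecessary!" makes more sense once the bit test is tried against all four values. The sketch below redefines the constants in user space with the values listed earlier; `csum_unnecessary` is a stand-in for the kernel helper, taking the field value directly.

```c
/* Same numeric values as the ip_summed constants listed above. */
enum {
    CHECKSUM_NONE        = 0,
    CHECKSUM_UNNECESSARY = 1,
    CHECKSUM_COMPLETE    = 2,
    CHECKSUM_PARTIAL     = 3
};

/* Stand-in for skb_csum_unnecessary(): a bit test, not an equality test. */
static int csum_unnecessary(unsigned int ip_summed)
{
    return ip_summed & CHECKSUM_UNNECESSARY;
}
```

With these values the test is nonzero for both CHECKSUM_UNNECESSARY (1) and CHECKSUM_PARTIAL (3), and zero for CHECKSUM_NONE and CHECKSUM_COMPLETE, so the result depends on the exact numbering.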
(2) CHECKSUM_NONE
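The checksum in csum is not valid. Possible reasons include:
The device does not support hardware checksum computation;
The device computed the hardware checksum but found that the frame is corrupted. In that case the driver could simply drop the frame. However, some drivers (such as e1000/igb/ixgbe) do not drop it; they set ip_summed to CHECKSUM_NONE and hand the frame to the upper-layer protocol stack, which recomputes the checksum and handles the error.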
(3) CHECKSUM_COMPLETE
This means the NIC has already computed the checksum over the L4 header and payload and stored it in skb->csum. The L4 receiver only needs to add the pseudo-header and verify the result. Taking TCP as an example:
static __sum16 tcp_v4_checksum_init(struct sk_buff *skb)
{
    const struct iphdr *iph = ip_hdr(skb);

    if (skb->ip_summed == CHECKSUM_COMPLETE) {
        if (!tcp_v4_check(skb->len, iph->saddr, /// add the TCP/UDP pseudo-header and verify
                          iph->daddr, skb->csum)) {
            skb->ip_summed = CHECKSUM_UNNECESSARY;
            return 0;
        }
    }
    ///...
}
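It is worth noting that igb/ixgbe do not use CHECKSUM_COMPLETE; they use CHECKSUM_UNNECESSARY instead.
Note the difference between CHECKSUM_COMPLETE and CHECKSUM_UNNECESSARY: with the former, the upper layer still has to compute the pseudo-header checksum and then verify the result (see tcp_v4_check). In earlier kernel versions this value was called CHECKSUM_HW.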
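The division of work for CHECKSUM_COMPLETE can be modelled roughly as follows. The function names are hypothetical, the device is assumed to report a folded, non-inverted 16-bit sum over the L4 bytes, and addresses are host-order integers for readability.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold the carries into 16 bits, without inverting. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Device side (assumed model): raw sum over L4 header and payload,
 * including the checksum field as it arrived on the wire. */
static uint16_t device_raw_sum(const uint8_t *l4, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((l4[i] << 8) | l4[i + 1]);
    if (len & 1)
        sum += (uint32_t)(l4[len - 1] << 8);
    return fold16(sum);
}

/* Stack side: add the pseudo-header to the device sum and fold.
 * Returns 0 when the packet is intact. */
static uint16_t complete_verify(uint16_t dev_sum, uint32_t saddr,
                                uint32_t daddr, uint8_t proto, uint16_t l4_len)
{
    uint32_t sum = dev_sum;

    sum += (saddr >> 16) + (saddr & 0xffff);
    sum += (daddr >> 16) + (daddr & 0xffff);
    sum += (uint32_t)proto + l4_len;
    return (uint16_t)~fold16(sum);
}
```

The stack never touches the payload in this path; it only folds a few pseudo-header words into the value the device supplied.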
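The veth bug
The veth device changes CHECKSUM_NONE to CHECKSUM_UNNECESSARY. As a result, when the hardware receives a corrupted frame and the frame is forwarded to veth, it becomes CHECKSUM_UNNECESSARY, and the upper-layer protocol stack (TCP) no longer computes and checks the packet's checksum.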
static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
{
    ///...
    /* don't change ip_summed == CHECKSUM_PARTIAL, as that
     * will cause bad checksum on forwarded packets
     */
    if (skb->ip_summed == CHECKSUM_NONE &&
        rcv->features & NETIF_F_RXCSUM)
        skb->ip_summed = CHECKSUM_UNNECESSARY;
}
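veth was originally a device for local communication, and locally generated frames are generally unlikely to be corrupted. On transmit, if the protocol stack has already computed the checksum, it sets skb->ip_summed to CHECKSUM_NONE. So for local traffic over veth, the receiving side has no need to compute the checksum again. In container virtualization scenarios, however, packets reaching veth may come from the network; if veth still rewrites the flag in this way, corrupted frames can be delivered to the application layer.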
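A toy model of the problem, assuming the `veth_xmit` logic shown above and the bit test used by `skb_csum_unnecessary`. The helper names are invented for this sketch, and the whole receive path is reduced to the value of ip_summed.

```c
#include <stdbool.h>

enum {
    CHECKSUM_NONE        = 0,
    CHECKSUM_UNNECESSARY = 1,
    CHECKSUM_COMPLETE    = 2,
    CHECKSUM_PARTIAL     = 3
};

/* Model of the rewrite in veth_xmit: CHECKSUM_NONE is upgraded when the
 * peer device has RXCSUM enabled; every other value passes through. */
static unsigned int veth_rewrite_ip_summed(unsigned int ip_summed,
                                           bool peer_rxcsum)
{
    if (ip_summed == CHECKSUM_NONE && peer_rxcsum)
        return CHECKSUM_UNNECESSARY;
    return ip_summed;
}

/* Would TCP verify the checksum in software for this value? */
static bool stack_verifies(unsigned int ip_summed)
{
    return !(ip_summed & CHECKSUM_UNNECESSARY);
}
```

A frame that a physical NIC flagged as CHECKSUM_NONE because of a checksum error would be verified by the stack if delivered directly, but skips verification once it has crossed veth with RXCSUM enabled on the peer.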
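Checksum on transmit
Likewise, for a packet being transmitted, skb->ip_summed describes the state of the L4 checksum and tells the underlying NIC whether it still has to handle the checksum:
(1) CHECKSUM_NONE
Here CHECKSUM_NONE means the protocol stack has already computed the checksum and the device does not need to do anything.
(2) CHECKSUM_PARTIAL
CHECKSUM_PARTIAL means hardware checksumming is used: the protocol stack has already computed the L4 pseudo-header checksum and stored it in the check field (th->check for TCP, uh->check for UDP), so the device only needs to compute the checksum over the L4 header and payload.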
int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                size_t size)
{
    ///...
    /*
     * Check whether we can use HW checksum.
     */
    if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
        skb->ip_summed = CHECKSUM_PARTIAL;
}

static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
                            gfp_t gfp_mask)
{
    ///...
    icsk->icsk_af_ops->send_check(sk, skb); ///tcp_v4_send_check
}

static void __tcp_v4_send_check(struct sk_buff *skb,
                                __be32 saddr, __be32 daddr)
{
    struct tcphdr *th = tcp_hdr(skb);

    if (skb->ip_summed == CHECKSUM_PARTIAL) { ///HW CSUM
        th->check = ~tcp_v4_check(skb->len, saddr, daddr, 0); ///add IPv4 pseudo header checksum
        skb->csum_start = skb_transport_header(skb) - skb->head;
        skb->csum_offset = offsetof(struct tcphdr, check);
    } else {
        th->check = tcp_v4_check(skb->len, saddr, daddr,
                                 csum_partial(th,
                                              th->doff << 2,
                                              skb->csum)); ///ip_summed == CHECKSUM_NONE
    }
}

/* This routine computes an IPv4 TCP checksum. */
void tcp_v4_send_check(struct sock *sk, struct sk_buff *skb)
{
    const struct inet_sock *inet = inet_sk(sk);

    __tcp_v4_send_check(skb, inet->inet_saddr, inet->inet_daddr);
}
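The following user-space sketch compares the two branches of `__tcp_v4_send_check`: seeding the check field and letting a (modelled) device finish, versus computing everything in software. All function names are hypothetical; the pseudo-header sum is passed in as a precomputed value.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold the carries into 16 bits, without inverting. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Add the buffer to sum as big-endian 16-bit words (odd tail zero-padded). */
static uint32_t sum_be16(const uint8_t *p, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((p[i] << 8) | p[i + 1]);
    if (len & 1)
        sum += (uint32_t)(p[len - 1] << 8);
    return sum;
}

static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* CHECKSUM_PARTIAL, stack side: store the folded, non-inverted
 * pseudo-header sum in the check field. */
static void tx_seed_partial(uint8_t *seg, size_t check_off, uint32_t pseudo_sum)
{
    put_be16(seg + check_off, fold16(pseudo_sum));
}

/* CHECKSUM_PARTIAL, device side: sum the segment (seed included),
 * invert, and overwrite the check field. */
static void tx_device_complete(uint8_t *seg, size_t len, size_t check_off)
{
    put_be16(seg + check_off, (uint16_t)~fold16(sum_be16(seg, len, 0)));
}

/* CHECKSUM_NONE: the stack computes the full checksum itself. */
static void tx_software(uint8_t *seg, size_t len, size_t check_off,
                        uint32_t pseudo_sum)
{
    put_be16(seg + check_off, 0);
    put_be16(seg + check_off,
             (uint16_t)~fold16(sum_be16(seg, len, pseudo_sum)));
}
```

Both paths should leave the same value in the check field, because the seed carries the pseudo-header contribution into the device's sum.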
dev_queue_xmit
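Finally, at transmit time in dev_queue_xmit, if the device turns out not to support hardware checksumming, the checksum is still computed in software (is this path actually taken?):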
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
                        struct netdev_queue *txq)
{
    ///...
    if (netif_needs_gso(skb, features)) {
        if (unlikely(dev_gso_segment(skb, features))) ///GSO (software offload)
            goto out_kfree_skb;
        if (skb->next)
            goto gso;
    } else { ///hardware offload
        if (skb_needs_linearize(skb, features) &&
            __skb_linearize(skb))
            goto out_kfree_skb;

        /* If packet is not checksummed and device does not
         * support checksumming for this protocol, complete
         * checksumming here.
         */
        if (skb->ip_summed == CHECKSUM_PARTIAL) { ///only the pseudo-header sum has been computed
            if (skb->encapsulation)
                skb_set_inner_transport_header(skb,
                        skb_checksum_start_offset(skb));
            else
                skb_set_transport_header(skb,
                        skb_checksum_start_offset(skb));
            if (!(features & NETIF_F_ALL_CSUM) && ///check whether the hardware supports checksum offload
                skb_checksum_help(skb)) ///hardware does not support it: compute in software
                goto out_kfree_skb;
        }
    }
}
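ip_summed == CHECKSUM_PARTIAL means the protocol stack has not finished computing the checksum: it has only computed the pseudo-header part and left the transport-layer header and data to the hardware. If the underlying hardware does not support checksum offload, skb_checksum_help completes the checksum in software.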
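The decision in the excerpt above reduces to a small function. The sketch below is a simplified model: the `DEMO_F_*` bit values and the composition of `DEMO_F_ALL_CSUM` are illustrative assumptions, not the kernel's definitions, and `tx_csum_decide` is a made-up name.

```c
#include <stdint.h>

enum {
    CHECKSUM_NONE        = 0,
    CHECKSUM_UNNECESSARY = 1,
    CHECKSUM_COMPLETE    = 2,
    CHECKSUM_PARTIAL     = 3
};

/* Illustrative feature bits (assumed values for this sketch only). */
#define DEMO_F_IP_CSUM   (1u << 0)
#define DEMO_F_HW_CSUM   (1u << 1)
#define DEMO_F_IPV6_CSUM (1u << 2)
#define DEMO_F_ALL_CSUM  (DEMO_F_IP_CSUM | DEMO_F_HW_CSUM | DEMO_F_IPV6_CSUM)

enum tx_csum_action {
    TX_CSUM_NOTHING,  /* checksum already final, or not requested */
    TX_CSUM_DEVICE,   /* leave the packet for the hardware to finish */
    TX_CSUM_SOFTWARE  /* fall back, as skb_checksum_help() does */
};

/* Who finishes the checksum of an outgoing packet? */
static enum tx_csum_action tx_csum_decide(unsigned int ip_summed,
                                          uint32_t features)
{
    if (ip_summed != CHECKSUM_PARTIAL)
        return TX_CSUM_NOTHING;
    if (features & DEMO_F_ALL_CSUM)
        return TX_CSUM_DEVICE;
    return TX_CSUM_SOFTWARE;
}
```

Only the CHECKSUM_PARTIAL case on a device without any checksum feature ends up in the software fallback.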
Remote checksum
TODO:
Related material