The TCP/IP Protocol Stack with LwIP (Part 6): Network Transport Management, the TCP Protocol

Published: 2024/3/7

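Contents

1. Introduction to TCP
- 1.1 Positive Acknowledgement and Timeout Retransmission
- 1.2 Connection Management and Keepalive
- 1.3 Sliding Window and Buffering
- 1.4 Flow Control and Congestion Control
- 1.5 Other Mechanisms for Improving Network Utilization
2. TCP Implementation
- 2.1 TCP Segment Format
- 2.2 TCP Data Structures
- 2.3 TCP State Machine
- 2.4 TCP Segment Operations
  - 2.4.1 TCP Segment Output Processing
  - 2.4.2 TCP Segment Input Processing
  - 2.4.3 TCP Timers
- 2.5 SYN Attacks
More articles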

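1. Introduction to TCP

Among transport-layer protocols, UDP is a connectionless protocol without complex control mechanisms. It hands part of the control work over to the application and provides only the most basic transport-layer functions. TCP, by contrast, is a protocol that controls transmission, sending, and communication.

TCP (Transmission Control Protocol) differs greatly from UDP (User Datagram Protocol). It implements a full set of control functions for data transfer: it retransmits lost packets and reorders segments that arrive out of order, neither of which UDP does. In addition, as a connection-oriented protocol, TCP sends data only after confirming that the peer exists, which avoids wasting traffic. These mechanisms make highly reliable communication possible even over a connectionless network such as IP.

Reliable transfer over IP datagrams has to deal with many problems, such as data corruption, packet loss, duplication, and out-of-order segments. TCP achieves reliability through checksums, sequence numbers, acknowledgements, retransmission control, connection management, and window control.

1.1 Positive Acknowledgement and Timeout Retransmission

In TCP, when data from the sender reaches the receiving host, the receiver returns a notification that the data has arrived. This message is called an acknowledgement (ACK). TCP achieves reliable transfer through positive acknowledgement: after sending data, the sender waits for the peer's ACK. If the ACK arrives, the data reached the peer; if not, the data was very likely lost. When no ACK arrives within a certain time, the sender assumes the data was lost and retransmits it, so data still reaches the peer even when packets are dropped.

A missing ACK does not necessarily mean the data was lost. The peer may have received the data while the ACK itself was lost on the way back, in which case the sender still concludes that the data did not arrive and retransmits it. An ACK may also simply be delayed, and it is not unusual for it to arrive after the source host has already retransmitted. The source host just retransmits as the mechanism dictates, but the destination host then receives the same data repeatedly and must discard the duplicates in order to give the upper layer a reliable stream. This requires a mechanism that can tell whether data has already been received and whether it still needs to be accepted.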

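Acknowledgement handling, retransmission control, and duplicate detection are all implemented with sequence numbers. A sequence number is assigned, in order, to every byte of data sent. The receiver reads the sequence number in the TCP header of the incoming segment together with the data length, and returns the sequence number it expects next as the acknowledgement number. With sequence numbers and acknowledgement numbers TCP achieves reliable transfer. The whole process is shown in the figure below: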
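The receiver side of this rule can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: `struct rx_state` and `rx_segment` are hypothetical names, out-of-order data is not buffered, and partially overlapping segments are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receiver state: the next in-order byte we expect. */
struct rx_state {
    uint32_t rcv_nxt;
};

/* Process one segment and return the acknowledgement number to send back.
 * An in-order segment advances rcv_nxt by its length. A duplicate or
 * out-of-order segment leaves rcv_nxt unchanged, so the same ACK repeats. */
static uint32_t rx_segment(struct rx_state *rx, uint32_t seqno, uint16_t len)
{
    assert(rx != NULL);
    if (seqno == rx->rcv_nxt) {
        rx->rcv_nxt += len;   /* unsigned arithmetic wraps modulo 2^32 */
    }
    return rx->rcv_nxt;
}
```

A retransmitted copy of an already accepted segment therefore produces the same acknowledgement number again, which is how the sender learns that nothing new is needed.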

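As mentioned above, the sender retransmits when no ACK arrives within a certain time. The interval it waits before retransmitting is called the retransmission timeout (RTO). How is its length determined?

Ideally we would find the smallest time within which the ACK is guaranteed to return. That time, however, varies with the network path the packets take, for example with distance, bandwidth, and congestion. TCP is expected to deliver high performance in any network environment and to keep doing so however congestion changes. To that end it measures the round-trip time (RTT) and its deviation each time it sends a packet. The RTO is set to a value slightly larger than the sum of the RTT and the deviation. The RTT measurement and the evolution of the RTO are shown in the figure below: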
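One common way to turn "RTT plus deviation" into a number is the smoothed estimator standardized in RFC 6298. The sketch below assumes that estimator (gains 1/8 and 1/4, RTO = SRTT + 4 * RTTVAR); the text above does not say which constants a given stack uses, and `struct rtt_est` and `rtt_update` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimator state, all times in milliseconds. */
struct rtt_est {
    double srtt;       /* smoothed round-trip time */
    double rttvar;     /* smoothed mean deviation */
    int have_sample;   /* 0 until the first measurement arrives */
};

/* Feed one RTT measurement and return the resulting RTO.
 * Assumption: RFC 6298 style gains, alpha = 1/8 and beta = 1/4. */
static double rtt_update(struct rtt_est *e, double sample_ms)
{
    assert(e != NULL && sample_ms >= 0.0);
    if (!e->have_sample) {
        e->srtt = sample_ms;
        e->rttvar = sample_ms / 2.0;
        e->have_sample = 1;
    } else {
        double err = sample_ms - e->srtt;
        if (err < 0.0) {
            err = -err;
        }
        /* The deviation is updated first, using the old srtt. */
        e->rttvar = 0.75 * e->rttvar + 0.25 * err;
        e->srtt = 0.875 * e->srtt + 0.125 * sample_ms;
    }
    return e->srtt + 4.0 * e->rttvar;
}
```

Real stacks usually do this in fixed-point ticks rather than floating point, and clamp the result to a minimum and maximum RTO.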

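In BSD Unix and Windows systems, timeouts have traditionally been managed in units of 0.5 seconds, so the RTO is an integer multiple of 0.5 seconds. Because the round-trip time is not yet known for the very first packet, its RTO is usually set to about 6 seconds.

If no ACK arrives after a retransmission, the data is sent again, and the wait for the ACK grows exponentially, to 2 times, then 4 times the previous value. Retransmission does not go on forever. If there is still no ACK after a certain number of attempts, TCP concludes that the network or the peer host has failed, forcibly closes the connection, and notifies the application that communication was terminated abnormally.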
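The doubling and the give-up rule can be sketched as follows. The function name, the cap on the timeout, and the convention of returning 0 for "abort the connection" are all assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Return the timeout to use for retransmission number `attempt`
 * (0 = first transmission), doubling each time up to cap_ms.
 * Returns 0 once max_attempts is reached, meaning the caller should
 * abort the connection and report an error to the application. */
static uint32_t backoff_rto(uint32_t base_ms, unsigned attempt,
                            unsigned max_attempts, uint32_t cap_ms)
{
    assert(base_ms > 0 && cap_ms >= base_ms);
    if (attempt >= max_attempts) {
        return 0;
    }
    uint32_t rto = base_ms;
    for (unsigned i = 0; i < attempt && rto < cap_ms; i++) {
        /* Compare against cap_ms / 2 so the doubling cannot overflow. */
        rto = (rto > cap_ms / 2) ? cap_ms : rto * 2;
    }
    return rto;
}
```

With a base of 1000 ms this yields 1000, 2000, 4000, and so on until the cap is reached.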

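1.2 Connection Management and Keepalive

TCP provides connection-oriented communication. Before exchanging any useful data, the two ends must establish a connection and initialize the control information associated with it. UDP, being connectionless, sends its datagrams without checking whether the peer can communicate. TCP does the opposite: before data transfer it sends a SYN segment (SYN is a flag in the TCP header) as a connection request and waits for an acknowledgement. If the peer acknowledges, data transfer can begin; if no acknowledgement arrives, no data is sent. The process by which the two ends establish a connection following the client-server model is called the "three-way handshake", illustrated below:

TCP provides a full-duplex connection service. Either side can close data transfer in its own direction, and when one direction has been terminated, the other direction can continue to carry data. When a side has finished sending, it sends a segment with the FIN flag set to terminate that direction. When the other end receives this FIN, it must notify its application that the peer has stopped sending. Sending a FIN is usually the result of the application closing the connection. Receiving a FIN means no more data will flow in that direction, although data can still be sent in the opposite direction; the connection is then half-closed. Closing a connection completely takes four segments, so the teardown is called the "four-way handshake", illustrated below:

While a TCP connection is being established, the unit in which data is sent, the maximum segment size (MSS), can also be determined. Ideally the MSS is exactly the largest data length that IP can carry without fragmentation.

When TCP transfers a large amount of data it splits the data into MSS-sized pieces, and retransmission is also done in units of MSS. The MSS is worked out between the two hosts during the three-way handshake: each host writes an MSS option into the TCP header of its connection request, telling the peer the MSS its interface can handle, and the smaller of the two values is used. The process is illustrated below: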
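As a rough sketch of this rule: the MSS a host can offer is its link MTU minus the 20-byte IP header and the 20-byte TCP header (both without options), and the value actually used is the smaller of the two offers. The helper names are hypothetical, and the fallback of 536 for a peer that sends no MSS option is the conventional default rather than something negotiated.

```c
#include <assert.h>
#include <stdint.h>

#define IP_HDR_LEN   20u   /* IPv4 header without options */
#define TCP_HDR_LEN  20u   /* TCP header without options */
#define DEFAULT_MSS  536u  /* used when the peer sends no MSS option */

/* Largest TCP payload that fits in one unfragmented IP datagram. */
static uint16_t mss_from_mtu(uint16_t mtu)
{
    assert(mtu > IP_HDR_LEN + TCP_HDR_LEN);
    return (uint16_t)(mtu - IP_HDR_LEN - TCP_HDR_LEN);
}

/* peer_mss_opt == 0 means the peer's SYN carried no MSS option. */
static uint16_t negotiate_mss(uint16_t local_mss, uint16_t peer_mss_opt)
{
    uint16_t peer = peer_mss_opt ? peer_mss_opt : (uint16_t)DEFAULT_MSS;
    return (local_mss < peer) ? local_mss : peer;
}
```

For an Ethernet MTU of 1500 bytes, `mss_from_mtu` gives 1460.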

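Once a TCP connection is established, if neither side has data to send, nothing at all is exchanged over it. In many cases, though, each side wants to know whether the other is still alive, and TCP provides a keepalive timer for this purpose.

Keepalive mainly benefits server applications: a server usually wants to know the state of its client hosts so that it can manage the resources they occupy. If a connection has been idle for two hours (a common default), the server sends the client a keepalive probe. If the client host is still running and reachable, the server application does not notice the probe at all; the keepalive work is done by TCP and is invisible to the application. If the client host has crashed or is unreachable, the server application receives an error from the TCP layer (for example connection timed out, connection reset by peer, or route timed out), and the server closes the connection and releases its resources.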

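1.3 Sliding Window and Buffering

If TCP acknowledged one segment at a time and waited for each ACK before sending the next segment, performance would fall as the round-trip time grew. To solve this, TCP introduced the window, which limits the loss of performance even when the round-trip time is long. With send and receive windows, acknowledgement is no longer done per segment but in larger units, and the overall transfer time is shortened considerably.

The window size is the maximum amount of data that can be sent without waiting for an acknowledgement. The mechanism relies on large buffers and acknowledges several segments at once. If part of the data is lost before the ACKs for the whole window arrive, the sender is still responsible for retransmitting it, so the sending host must keep that data in a buffer until it is acknowledged. The structure of the sliding window is shown below:

The sliding window can be seen as a window defined over the data buffer, which holds data handed down by the application for sending. Data outside the window is either not yet sent or already acknowledged by the peer. Once sent data is acknowledged as expected it no longer needs to be retransmitted and can be removed from the buffer. When an ACK arrives, the window slides forward to the sequence number in that ACK. This allows several segments to be sent back to back, which improves performance, and is called sliding window control.

Sliding window control achieves good flow control and congestion control; in essence both come down to adjusting the send window sensibly. Every segment is acknowledged, and an acknowledged sequence number means that all data before it has been received, so nothing needs to be retransmitted even if some ACKs are lost. If a segment really is lost, ACKs carrying the same sequence number are returned repeatedly (when the receiver does not get the sequence number it expects, it acknowledges the data received so far). When the sender receives the same ACK three times in a row, it retransmits the corresponding data. This is more efficient than the timeout retransmission described earlier and is therefore called fast retransmit. The process is shown below: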
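The duplicate-ACK rule can be sketched as a small counter. This is an illustration only: `struct dupack_state` and `on_ack` are hypothetical names, and the sketch ignores stale ACKs and sequence-number wraparound, which a real stack has to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sender-side state for duplicate-ACK detection. */
struct dupack_state {
    uint32_t lastack;   /* highest acknowledgement number seen so far */
    uint8_t  dupacks;   /* how many times lastack has been repeated */
};

/* Returns 1 exactly when the third duplicate ACK arrives, which is the
 * moment fast retransmit should resend the segment starting at lastack. */
static int on_ack(struct dupack_state *s, uint32_t ackno)
{
    assert(s != NULL);
    if (ackno == s->lastack) {
        if (s->dupacks < UINT8_MAX) {
            s->dupacks++;
        }
        return s->dupacks == 3;
    }
    /* Assumption: any other value acknowledges new data. */
    s->lastack = ackno;
    s->dupacks = 0;
    return 0;
}
```

Because the function fires only on the third repeat, further duplicates of the same ACK do not trigger additional retransmissions.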

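To receive data, the receiver must likewise maintain a receive window over its receive buffer. It has to place data into the buffer and put it in order (segments may arrive out of order from the lower layer, so they must be reassembled into an ordered byte stream and duplicates removed), and it advertises its receive window to the sender, telling it: this is how many more bytes I can accept. The sender should adjust its send window, and thereby its sending rate, according to this advertised value.

Note that TCP is full-duplex and that data transfer in the two directions is independent. Either side can act as both sender and receiver, so each side maintains two windows per connection, one for receiving and one for sending. A complete TCP connection therefore has four windows in total.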

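1.4 Flow Control and Congestion Control

The sender transmits according to its own situation, but the receiver may be unable to keep up because its buffers are exhausted or it is busy with other work. If the receiver drops data it should have accepted, retransmission is triggered and network traffic is wasted for nothing. To prevent this, TCP lets the sender limit the amount of data it sends according to the receiver's actual capacity. This is flow control.

The TCP header has a dedicated field for advertising the receive window. The receiving host puts the amount of buffer space it has available into this field, and the sender sends no more than this limit. The larger the value, the higher the achievable throughput. When the receive buffer is about to overflow, the receiver advertises a smaller window to reduce the amount of data sent. The figure below shows how the sender controls its sending volume according to the receiver's instructions:

When the receive buffer is full, the receiver has to stop accepting data (the receive window is 0), and communication resumes only after the sender receives a window update. If that window update is lost in transit, communication could stall indefinitely. To avoid this, the sender periodically (the period is managed by the persist timer) sends a segment called a window probe, which carries a single byte of data and serves to obtain the latest window size.

With window control, hosts can send large amounts of data back to back without acknowledging each segment. Computer networks are shared environments, though, and traffic between other hosts may already have congested the network. Sending a large burst right at the start of communication could bring the whole network down. To prevent this, TCP limits the amount of data sent at the beginning of communication using a value derived from an algorithm called slow start.

To let the sender regulate how much it sends, a congestion window is defined. During slow start the congestion window is set to one segment (1 MSS), and it is increased by one segment each time an ACK is received. When sending, the congestion window is compared with the window advertised by the receiver, and the smaller value is used as the actual send window. These mechanisms effectively reduce congestion caused by back-to-back sending at the start of communication.

However, with each round trip the congestion window grows exponentially (1, 2, 4, 8, and so on): it increases by 1 per ACK, so after a full window of ACKs it has doubled. Traffic may then surge and cause congestion. To prevent this, TCP introduces the slow start threshold. Once the congestion window exceeds the threshold, each ACK may increase it only by the reciprocal of the congestion window, so that a full window of ACKs increases it by one segment. From this point on the congestion window grows linearly. The change is shown in the figure below: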
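The two growth phases can be sketched in bytes rather than segments. The function names are hypothetical; the byte-based increment `mss * mss / cwnd` is a common way of expressing "one segment per window of ACKs" and is assumed here, not quoted from the text.

```c
#include <assert.h>
#include <stdint.h>

/* New congestion window (in bytes) after one ACK for new data.
 * Below ssthresh: slow start, grow by one MSS per ACK (doubles per RTT).
 * At or above ssthresh: congestion avoidance, grow by about one MSS
 * per full window of ACKs. */
static uint32_t cwnd_on_ack(uint32_t cwnd, uint32_t ssthresh, uint32_t mss)
{
    assert(mss > 0 && mss <= 65535 && cwnd > 0);
    if (cwnd < ssthresh) {
        return cwnd + mss;
    }
    uint32_t inc = mss * mss / cwnd;    /* mss <= 65535, so no overflow */
    return cwnd + (inc > 0 ? inc : 1);  /* always make some progress */
}

/* The amount the sender may actually have in flight is limited by both
 * the congestion window and the receiver's advertised window. */
static uint32_t send_window(uint32_t cwnd, uint32_t rcv_wnd)
{
    return (cwnd < rcv_wnd) ? cwnd : rcv_wnd;
}
```

With an MSS of 1000 bytes and a threshold of 4000 bytes, the window grows 1000, 2000, 3000, 4000 per ACK and then by 250 bytes or less per ACK.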

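At the start of a TCP connection no slow start threshold is set. It is set only when a retransmission timeout occurs, to half the congestion window at that time.

Fast retransmit triggered by duplicate ACKs is handled somewhat differently from an ordinary timeout retransmission. It is triggered only after at least three duplicate ACKs have reached the sender, which means the network is less congested than in the timeout case. So on a fast retransmit the slow start threshold is set to half the current window, and the send window is then set to that threshold plus three segments. This effectively skips slow start and goes straight to congestion avoidance, and is called fast recovery.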
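The two reactions can be put side by side in a short sketch. Two details are assumptions that go beyond the text above: the lower bound of 2 MSS on the threshold, and resetting the congestion window to 1 MSS after a timeout, both of which follow common practice (RFC 5681). `struct cc_state` and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical congestion state, in bytes. */
struct cc_state {
    uint32_t cwnd;
    uint32_t ssthresh;
};

/* Half the current window, but never less than two segments (assumed). */
static uint32_t half_window(uint32_t cwnd, uint32_t mss)
{
    uint32_t half = cwnd / 2;
    return (half < 2 * mss) ? 2 * mss : half;
}

/* Retransmission timeout: heavy congestion, restart from slow start. */
static void cc_on_timeout(struct cc_state *c, uint32_t mss)
{
    assert(c != NULL && mss > 0);
    c->ssthresh = half_window(c->cwnd, mss);
    c->cwnd = mss;
}

/* Third duplicate ACK: lighter congestion, enter fast recovery and
 * continue from ssthresh + 3 segments instead of from 1 segment. */
static void cc_on_fast_retransmit(struct cc_state *c, uint32_t mss)
{
    assert(c != NULL && mss > 0);
    c->ssthresh = half_window(c->cwnd, mss);
    c->cwnd = c->ssthresh + 3 * mss;
}
```

The extra three segments account for the three packets that must have left the network to generate the duplicate ACKs.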

1.5 Other Mechanisms for Improving Network Utilization

Nagle's Algorithm

To improve network utilization, TCP often uses Nagle's algorithm: even when the sender has data to send, it delays sending if the amount is small. Specifically, data is sent only when all previously sent data has been acknowledged or when a full MSS of data can be sent. If neither condition holds, the sender waits for a while before sending.

The algorithm improves utilization but can introduce some delay. Applications with strict latency requirements that use TCP often disable it.
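The decision described above fits in one small function. This is a sketch with hypothetical parameter names; the `nodelay` argument stands for the per-connection switch that applications use to turn the algorithm off.

```c
#include <assert.h>
#include <stdint.h>

/* Decide whether queued data may be sent now under Nagle's algorithm.
 *   unacked_bytes: data already sent but not yet acknowledged
 *   pending_bytes: data queued and waiting to be sent
 *   nodelay:       non-zero if the application disabled the algorithm */
static int nagle_can_send(uint32_t unacked_bytes, uint32_t pending_bytes,
                          uint32_t mss, int nodelay)
{
    assert(mss > 0);
    if (pending_bytes == 0) {
        return 0;                 /* nothing to send */
    }
    if (nodelay) {
        return 1;                 /* algorithm disabled: send at once */
    }
    if (unacked_bytes == 0) {
        return 1;                 /* everything sent so far is acknowledged */
    }
    return pending_bytes >= mss;  /* otherwise wait for a full segment */
}
```

A single keystroke in an interactive session is therefore sent at once if the connection is idle, but held back while an earlier small segment is still unacknowledged.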

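Delayed Acknowledgement

If the receiving host acknowledged every segment immediately, it might advertise a small window. The sender would then send at most that much, lowering utilization again. The remedy is to hold the ACK back for a while after receiving data (until 2 MSS of data has been received, with a maximum delay of 0.5 seconds).

Because TCP uses a sliding window, having fewer ACKs is generally harmless. In TCP file transfers, an ACK is returned for every two segments in the vast majority of cases.

Piggybacked Acknowledgement

Depending on the application protocol, the peer processes the data it receives and returns a reply. To improve utilization, the TCP ACK and the reply data can be sent in a single packet. This is called a piggybacked acknowledgement.

It takes some time for the received data to be passed to the application and for the reply to be generated. For the ACK to be piggybacked it has to wait for the reply data, so piggybacking is impossible without delayed acknowledgement. Delayed acknowledgement is a good mechanism that improves utilization and thereby reduces processing load.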

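2. TCP Implementation

2.1 TCP Segment Format

TCP has its own data format. A TCP packet is called a segment here, and segments are sent encapsulated in IP datagrams. A segment consists of a TCP header and a TCP data area. The header contains all the flags and control information for connection setup and teardown, acknowledgement, window advertisement, and data transmission. The segment structure is shown in the figure below:

The TCP header is far more complex than the UDP header. TCP has no field for packet length or data length; the TCP packet length is obtained from the IP layer, and the data length follows from it. The header is 20 to 60 bytes long. Without options it is 20 bytes, the same as an IP header without options. The data part may be empty (for example during connection setup or teardown).

As in UDP, the source port and destination port fields identify the ports bound by the sending and receiving application processes. The 32-bit sequence number identifies the position in the byte stream from sender to receiver; its value is the sequence number of the first data byte in the segment. The 32-bit acknowledgement number is valid only when the ACK flag is set. It holds the next sequence number the host expects to receive (the sequence number of the last byte successfully received, plus 1). Acknowledgements are often piggybacked on data flowing in the reverse direction. Together the sequence and acknowledgement numbers support positive acknowledgement, timeout retransmission, and in-order reassembly.

The 4-bit header length gives the length of the TCP header in units of 4 bytes; with no options it is 5 (5 * 4 = 20 bytes). The following 6 bits are reserved for future use. (RFC 3168 later assigned two of these bits to the ECE and CWR flags, which is why the LwIP header in section 2.2 also defines TCP_ECE and TCP_CWR.) Next come 6 flag bits that tell the receiver how to interpret the segment: some segments carry acknowledgements, some carry urgent data, some carry a request to open or close a connection, and so on. The meaning of the six flags is given in the table below: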
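A short sketch of how the header length and the six classic flags can be pulled out of the 16-bit word that holds them. The flag values match the bit positions defined by the protocol; the `FLAG_*` names and helper functions are hypothetical, and the input is assumed to be already converted to host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* The six classic flag bits, from least to most significant. */
#define FLAG_FIN 0x01u  /* sender has finished sending */
#define FLAG_SYN 0x02u  /* synchronize sequence numbers (open) */
#define FLAG_RST 0x04u  /* reset the connection */
#define FLAG_PSH 0x08u  /* push data to the application promptly */
#define FLAG_ACK 0x10u  /* acknowledgement number field is valid */
#define FLAG_URG 0x20u  /* urgent pointer field is valid */

/* Header length in bytes: the top 4 bits count 32-bit words. */
static unsigned tcp_header_bytes(uint16_t hdrlen_flags)
{
    unsigned words = hdrlen_flags >> 12;
    assert(words >= 5);   /* a valid header is at least 20 bytes */
    return words * 4;
}

/* The low 6 bits hold the classic flags. */
static unsigned tcp_flag_bits(uint16_t hdrlen_flags)
{
    return hdrlen_flags & 0x3Fu;
}
```

For example, the value 0x5012 describes a 20-byte header with SYN and ACK set, which is the second segment of the three-way handshake.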

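When TCP sends a segment, it fills in the window field to tell the peer how much buffer space it has available (in bytes), and the receiver of the segment adjusts its send window accordingly. The window field is the key to flow control: when the receiver advertises a window of 0, the sender is stopped from sending data altogether.

The 16-bit checksum is computed in the same way, and on the same principle, as the UDP checksum described in the previous chapter. In UDP the checksum is optional, but in TCP it is mandatory. The TCP checksum covers three parts: a pseudo-header, the TCP header, and the TCP data. The pseudo-header is the same concept as in UDP, except that its protocol field is 6, the value for TCP.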
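The calculation can be sketched as follows for IPv4. This is a plain, unoptimized illustration with hypothetical function names, not the routine LwIP uses. The segment is passed as raw bytes in network order, with the checksum field set to zero when computing a checksum to send.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add `len` bytes to a running sum of big-endian 16-bit words.
 * An odd trailing byte is padded with a zero byte on the right. */
static uint32_t sum16(const uint8_t *data, size_t len, uint32_t acc)
{
    while (len > 1) {
        acc += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1) {
        acc += (uint32_t)data[0] << 8;
    }
    return acc;
}

/* One's-complement checksum over pseudo-header + TCP header + data. */
static uint16_t tcp_checksum(const uint8_t src[4], const uint8_t dst[4],
                             const uint8_t *seg, uint16_t seg_len)
{
    assert(seg != NULL);
    /* Pseudo-header: source IP, destination IP, zero, protocol 6, length. */
    uint8_t pseudo[12] = {
        src[0], src[1], src[2], src[3],
        dst[0], dst[1], dst[2], dst[3],
        0, 6, (uint8_t)(seg_len >> 8), (uint8_t)(seg_len & 0xFF)
    };
    uint32_t acc = sum16(pseudo, sizeof pseudo, 0);
    acc = sum16(seg, seg_len, acc);
    while (acc >> 16) {               /* fold the carries back in */
        acc = (acc & 0xFFFF) + (acc >> 16);
    }
    return (uint16_t)~acc;
}
```

A receiver can run the same function over a segment whose checksum field is filled in; a result of 0 means the segment passed the check.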

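The 16-bit urgent pointer is valid only when the URG flag is set, in which case the segment contains urgent data. Urgent data is always placed at the start of the segment's data, and the urgent pointer marks where it ends in the data area; adding it to the sequence number field gives the sequence number of the last byte of urgent data. A segment with URG set tells the receiver: this data is urgent, you may read it first without placing it in the receive buffer (that is, the segment is not handled as part of the normal data stream).

The TCP header may contain zero or more options, up to 40 bytes in total, which pass additional information to the peer. Each option has three parts: a 1-byte option kind, a 1-byte total option length, and the option data (the single-byte End of Option List and No-Operation options are the exceptions). Representative options are listed in the table below:

The option with kind 2 is the maximum segment size (MSS). A connection normally specifies it in its first segment (the handshake segment carrying the SYN flag) to tell the peer the largest segment it can accept. If it is not specified, the default MSS of 536 is used. The MSS negotiation between client and server mentioned earlier is done with this option.

The option with kind 3 is the window scale option, which lets both ends advertise larger windows. The window field in the header is 16 bits, so the receive window is at most 65535 bytes. On many high-speed links this is too small and limits the sender's rate. With this option a larger window can be advertised: the advertised window N equals the window field value W multiplied by 2 to the power of the window scale factor A (N = W * 2^A).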
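The formula N = W * 2^A is a left shift in code. In this sketch the function name is hypothetical; the upper limit of 14 on the shift count comes from the window scale specification (RFC 7323) and keeps the result below 2^30.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WND_SHIFT 14u   /* largest shift count the option allows */

/* Effective window in bytes for a 16-bit window field W and scale A. */
static uint32_t scaled_window(uint16_t wnd_field, uint8_t shift)
{
    if (shift > MAX_WND_SHIFT) {
        shift = MAX_WND_SHIFT;   /* larger values are clamped to 14 */
    }
    uint32_t wnd = (uint32_t)wnd_field << shift;
    assert(wnd <= 0x3FFFFFFFu);
    return wnd;
}
```

With the maximum shift, a full 16-bit field advertises a window of roughly 1 GiB.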

2.2 TCP Data Structures

The TCP header is more complex than the UDP header, so the data structures describing TCP are naturally more complex too. LwIP describes the TCP header with the following data structure:

// rt-thread\components\net\lwip-1.4.1\src\include\lwip\tcp_impl.h

/* Fields are (of course) in network byte order.
 * Some fields are converted to host byte order in tcp_input().
 */
PACK_STRUCT_BEGIN
struct tcp_hdr {
  PACK_STRUCT_FIELD(u16_t src);
  PACK_STRUCT_FIELD(u16_t dest);
  PACK_STRUCT_FIELD(u32_t seqno);
  PACK_STRUCT_FIELD(u32_t ackno);
  PACK_STRUCT_FIELD(u16_t _hdrlen_rsvd_flags);
  PACK_STRUCT_FIELD(u16_t wnd);
  PACK_STRUCT_FIELD(u16_t chksum);
  PACK_STRUCT_FIELD(u16_t urgp);
} PACK_STRUCT_STRUCT;
PACK_STRUCT_END

#define TCP_FIN 0x01U
#define TCP_SYN 0x02U
#define TCP_RST 0x04U
#define TCP_PSH 0x08U
#define TCP_ACK 0x10U
#define TCP_URG 0x20U
#define TCP_ECE 0x40U
#define TCP_CWR 0x80U

#define TCPH_HDRLEN(phdr) (ntohs((phdr)->_hdrlen_rsvd_flags) >> 12)
#define TCPH_FLAGS(phdr)  (ntohs((phdr)->_hdrlen_rsvd_flags) & TCP_FLAGS)

#define TCPH_HDRLEN_SET(phdr, len) (phdr)->_hdrlen_rsvd_flags = htons(((len) << 12) | TCPH_FLAGS(phdr))
#define TCPH_FLAGS_SET(phdr, flags) (phdr)->_hdrlen_rsvd_flags = (((phdr)->_hdrlen_rsvd_flags & PP_HTONS((u16_t)(~(u16_t)(TCP_FLAGS)))) | htons(flags))
#define TCPH_HDRLEN_FLAGS_SET(phdr, len, flags) (phdr)->_hdrlen_rsvd_flags = htons(((len) << 12) | (flags))

#define TCPH_SET_FLAG(phdr, flags ) (phdr)->_hdrlen_rsvd_flags = ((phdr)->_hdrlen_rsvd_flags | htons(flags))
#define TCPH_UNSET_FLAG(phdr, flags) (phdr)->_hdrlen_rsvd_flags = htons(ntohs((phdr)->_hdrlen_rsvd_flags) | (TCPH_FLAGS(phdr) & ~(flags)) )

#define TCP_TCPLEN(seg) ((seg)->len + ((TCPH_FLAGS((seg)->tcphdr) & (TCP_FIN | TCP_SYN)) != 0))

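The flags in the TCP header are defined as macros, along with macros for manipulating the header fields.

As with UDP, the TCP implementation uses a dedicated data structure to describe a connection, called the TCP control block or transmission control block. The TCP control block holds the information the two ends need for basic communication, such as the send window, receive window, and data buffers, as well as all fields related to the connection's performance guarantees, such as timers, congestion control, and sliding window control. Implementing TCP essentially means operating on the fields of the control block. When a TCP segment is received, all control blocks are searched for the one matching the segment's destination, and the functions registered on that control block are called to process the segment. The TCP core also maintains periodic timer events whose handlers process all control blocks, for example retransmitting timed-out segments or deleting out-of-order segments in certain control blocks. The TCP control block is the heart of the TCP protocol and the largest data structure in the whole stack. LwIP describes it with the following data structure: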

// rt-thread\components\net\lwip-1.4.1\src\include\lwip\tcp.h

/* the TCP protocol control block */
struct tcp_pcb {
/** common PCB members */
  IP_PCB;
/** protocol specific PCB members */
  TCP_PCB_COMMON(struct tcp_pcb);

  /* ports are in host byte order */
  u16_t remote_port;

  u8_t flags;
#define TF_ACK_DELAY   ((u8_t)0x01U)   /* Delayed ACK. */
#define TF_ACK_NOW     ((u8_t)0x02U)   /* Immediate ACK. */
#define TF_INFR        ((u8_t)0x04U)   /* In fast recovery. */
#define TF_TIMESTAMP   ((u8_t)0x08U)   /* Timestamp option enabled */
#define TF_RXCLOSED    ((u8_t)0x10U)   /* rx closed by tcp_shutdown */
#define TF_FIN         ((u8_t)0x20U)   /* Connection was closed locally (FIN segment enqueued). */
#define TF_NODELAY     ((u8_t)0x40U)   /* Disable Nagle algorithm */
#define TF_NAGLEMEMERR ((u8_t)0x80U)   /* nagle enabled, memerr, try to output to prevent delayed ACK to happen */

  /* the rest of the fields are in host byte order
     as we have to do some math with them */

  /* Timers */
  u8_t polltmr, pollinterval;
  u8_t last_timer;
  u32_t tmr;

  /* receiver variables */
  u32_t rcv_nxt;   /* next seqno expected */
  u16_t rcv_wnd;   /* receiver window available */
  u16_t rcv_ann_wnd; /* receiver window to announce */
  u32_t rcv_ann_right_edge; /* announced right edge of window */

  /* Retransmission timer. */
  s16_t rtime;

  u16_t mss;   /* maximum segment size */

  /* RTT (round trip time) estimation variables */
  u32_t rttest; /* RTT estimate in 500ms ticks */
  u32_t rtseq;  /* sequence number being timed */
  s16_t sa, sv; /* @todo document this */

  s16_t rto;    /* retransmission time-out */
  u8_t nrtx;    /* number of retransmissions */

  /* fast retransmit/recovery */
  u8_t dupacks;
  u32_t lastack; /* Highest acknowledged seqno. */

  /* congestion avoidance/control variables */
  u16_t cwnd;
  u16_t ssthresh;

  /* sender variables */
  u32_t snd_nxt;   /* next new seqno to be sent */
  u32_t snd_wl1, snd_wl2; /* Sequence and acknowledgement numbers of last
                             window update. */
  u32_t snd_lbb;       /* Sequence number of next byte to be buffered. */
  u16_t snd_wnd;   /* sender window */
  u16_t snd_wnd_max; /* the maximum sender window announced by the remote host */

  u16_t acked;

  u16_t snd_buf;   /* Available buffer space for sending (in bytes). */
#define TCP_SNDQUEUELEN_OVERFLOW (0xffffU-3)
  u16_t snd_queuelen; /* Available buffer space for sending (in tcp_segs). */

  /* These are ordered by sequence number: */
  struct tcp_seg *unsent;   /* Unsent (queued) segments. */
  struct tcp_seg *unacked;  /* Sent but unacknowledged segments. */
  struct tcp_seg *ooseq;    /* Received out of sequence segments. */

  struct pbuf *refused_data; /* Data previously received but not yet taken by upper layer */

  /* Function to be called when more send buffer space is available. */
  tcp_sent_fn sent;
  /* Function to be called when (in-sequence) data has arrived. */
  tcp_recv_fn recv;
  /* Function to be called when a connection has been set up. */
  tcp_connected_fn connected;
  /* Function which is called periodically. */
  tcp_poll_fn poll;
  /* Function to be called whenever a fatal error occurs. */
  tcp_err_fn errf;

  /* idle time before KEEPALIVE is sent */
  u32_t keep_idle;

  /* Persist timer counter */
  u8_t persist_cnt;
  /* Persist timer back-off */
  u8_t persist_backoff;

  /* KEEPALIVE counter */
  u8_t keep_cnt_sent;
};

struct tcp_pcb_listen {
/* Common members of all PCB types */
  IP_PCB;
/* Protocol specific PCB members */
  TCP_PCB_COMMON(struct tcp_pcb_listen);
};

/**
 * members common to struct tcp_pcb and struct tcp_listen_pcb
 */
#define TCP_PCB_COMMON(type) \
  type *next; /* for the linked list */ \
  void *callback_arg; \
  /* the accept callback for listen- and normal pcbs, if LWIP_CALLBACK_API */ \
  DEF_ACCEPT_CALLBACK \
  enum tcp_state state; /* TCP state */ \
  u8_t prio; \
  /* ports are in host byte order */ \
  u16_t local_port

#define DEF_ACCEPT_CALLBACK  tcp_accept_fn accept;

enum tcp_state {
  CLOSED      = 0,
  LISTEN      = 1,
  SYN_SENT    = 2,
  SYN_RCVD    = 3,
  ESTABLISHED = 4,
  FIN_WAIT_1  = 5,
  FIN_WAIT_2  = 6,
  CLOSE_WAIT  = 7,
  CLOSING     = 8,
  LAST_ACK    = 9,
  TIME_WAIT   = 10
};

/* This structure represents a TCP segment on the unsent, unacked and ooseq queues */
struct tcp_seg {
  struct tcp_seg *next;    /* used when putting segements on a queue */
  struct pbuf *p;          /* buffer containing data + TCP header */
  u16_t len;               /* the TCP length of this segment */
  u8_t  flags;
#define TF_SEG_OPTS_MSS         (u8_t)0x01U /* Include MSS option. */
#define TF_SEG_OPTS_TS          (u8_t)0x02U /* Include timestamp option. */
#define TF_SEG_DATA_CHECKSUMMED (u8_t)0x04U /* ALL data (not the header) is
                                               checksummed into 'chksum' */
  struct tcp_hdr *tcphdr;  /* the TCP header */
};

/** Function prototype for tcp accept callback functions. Called when a new
 * connection can be accepted on a listening pcb.
 * @param arg Additional argument to pass to the callback function (@see tcp_arg())
 * @param newpcb The new connection pcb
 * @param err An error code if there has been an error accepting.
 *            Only return ERR_ABRT if you have called tcp_abort from within the
 *            callback function!
 */
typedef err_t (*tcp_accept_fn)(void *arg, struct tcp_pcb *newpcb, err_t err);

/** Function prototype for tcp receive callback functions. Called when data has
 * been received.
 * @param arg Additional argument to pass to the callback function (@see tcp_arg())
 * @param tpcb The connection pcb which received data
 * @param p The received data (or NULL when the connection has been closed!)
 * @param err An error code if there has been an error receiving
 *            Only return ERR_ABRT if you have called tcp_abort from within the
 *            callback function!
 */
typedef err_t (*tcp_recv_fn)(void *arg, struct tcp_pcb *tpcb,
                             struct pbuf *p, err_t err);

/** Function prototype for tcp sent callback functions. Called when sent data has
 * been acknowledged by the remote side. Use it to free corresponding resources.
 * This also means that the pcb has now space available to send new data.
 * @param arg Additional argument to pass to the callback function (@see tcp_arg())
 * @param tpcb The connection pcb for which data has been acknowledged
 * @param len The amount of bytes acknowledged
 * @return ERR_OK: try to send some data by calling tcp_output
 *            Only return ERR_ABRT if you have called tcp_abort from within the
 *            callback function!
 */
typedef err_t (*tcp_sent_fn)(void *arg, struct tcp_pcb *tpcb,
                             u16_t len);

/** Function prototype for tcp poll callback functions. Called periodically as
 * specified by @see tcp_poll.
 * @param arg Additional argument to pass to the callback function (@see tcp_arg())
 * @param tpcb tcp pcb
 * @return ERR_OK: try to send some data by calling tcp_output
 *            Only return ERR_ABRT if you have called tcp_abort from within the
 *            callback function!
 */
typedef err_t (*tcp_poll_fn)(void *arg, struct tcp_pcb *tpcb);

/** Function prototype for tcp error callback functions. Called when the pcb
 * receives a RST or is unexpectedly closed for any other reason.
 * @note The corresponding pcb is already freed when this callback is called!
 * @param arg Additional argument to pass to the callback function (@see tcp_arg())
 * @param err Error code to indicate why the pcb has been closed
 *            ERR_ABRT: aborted through tcp_abort or by a TCP timer
 *            ERR_RST: the connection was reset by the remote host
 */
typedef void  (*tcp_err_fn)(void *arg, err_t err);

/** Function prototype for tcp connected callback functions. Called when a pcb
 * is connected to the remote side after initiating a connection attempt by
 * calling tcp_connect().
 * @param arg Additional argument to pass to the callback function (@see tcp_arg())
 * @param tpcb The connection pcb which is connected
 * @param err An unused error code, always ERR_OK currently ;-) TODO!
 *            Only return ERR_ABRT if you have called tcp_abort from within the
 *            callback function!
 * @note When a connection attempt fails, the error callback is currently called!
 */
typedef err_t (*tcp_connected_fn)(void *arg, struct tcp_pcb *tpcb, err_t err);

/* The TCP PCB lists. */

/** List of all TCP PCBs bound but not yet (connected || listening) */
struct tcp_pcb *tcp_bound_pcbs;
/** List of all TCP PCBs in LISTEN state */
union tcp_listen_pcbs_t tcp_listen_pcbs;
/** List of all TCP PCBs that are in a state in which
 * they accept or send data. */
struct tcp_pcb *tcp_active_pcbs;
/** List of all TCP PCBs in TIME-WAIT state */
struct tcp_pcb *tcp_tw_pcbs;

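The tcp_pcb control block above looks large, but its members can be grouped: each TCP mechanism involves only a few fields, which can be understood and handled together. Besides tcp_pcb there is also tcp_pcb_listen, which describes a connection in the LISTEN state. Such a connection records only local port information and no remote port information, and is typically used on a server to open a port for clients. A control block in the LISTEN state does not correspond to any active connection and does not send data or perform handshakes, so the structure describing it is smaller than tcp_pcb and saves memory.

The generic fields that describe a connection (remote port, local port, remote IP address, local IP address, control block priority, and so on) need no further explanation. The flags field deserves attention: it describes properties of the control block, such as whether an ACK should be sent immediately and whether Nagle's algorithm is enabled. These flags are key to TCP transfer performance.

The TCP control block maintains three buffer queues, whose heads are the fields unsent, unacked, and ooseq. unsent links segments that have not yet been sent, unacked links segments that have been sent but not yet acknowledged, and ooseq links segments received out of order. These three queues provide simple management of all the connection's segments. Each segment is described by a tcp_seg structure, and the structures are linked into queues. A tcp_seg holds not only a pointer p to the pbuf carrying the segment but also a pointer tcphdr to the TCP header (struct tcp_hdr) within it. The organization of the segment queues is shown in the figure below:

To organize all TCP control blocks in the system, the stack defines four linked lists that hold control blocks in different states; TCP operations usually involve searching these lists. The code defining the four lists is given above. tcp_bound_pcbs links newly created control blocks that have been bound to a local port; such control blocks can be regarded as being in the CLOSED state. tcp_listen_pcbs links control blocks in the LISTEN state, where a local connection is described by struct tcp_pcb_listen. tcp_tw_pcbs links control blocks in the TIME_WAIT state. tcp_active_pcbs links control blocks in all other states of the TCP state diagram; the figure above shows control blocks on this list.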

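2.3 TCP State Machine

The state field of the control block records the state of a connection as it changes over the course of communication. How does the state of a TCP connection change?

As described under connection management, establishing a TCP connection requires a three-way handshake. First, the client sends a connection request with SYN set and moves from CLOSED to SYN_SENT. The server receives the request, replies with a segment with both SYN and ACK set, and moves from LISTEN to SYN_RCVD. The client receives the server's SYN+ACK, replies with a segment with ACK set, and moves to ESTABLISHED. When the server receives that ACK it also moves to ESTABLISHED, and the connection is established.

Closing a TCP connection requires a four-way handshake. First, the client sends the server a segment with FIN set and moves from ESTABLISHED to FIN_WAIT_1. The server receives the FIN, replies with a segment with ACK set, and moves from ESTABLISHED to CLOSE_WAIT; the client receives that ACK and moves from FIN_WAIT_1 to FIN_WAIT_2. The server notifies its upper layer of the close, sends the client a segment with FIN set, and moves from CLOSE_WAIT to LAST_ACK. The client receives the server's FIN, replies with a segment with ACK set, and moves from FIN_WAIT_2 to TIME_WAIT; the server receives that ACK and moves from LAST_ACK to CLOSED.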
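The two paths just described can be written down as a transition function. This is a deliberately reduced sketch, not the LwIP state machine: the `ST_*` and `EV_*` names are hypothetical, only the normal open and close paths are covered, and simultaneous close (CLOSING), resets, and the TIME_WAIT timer are left out.

```c
#include <assert.h>

enum conn_state {
    ST_CLOSED, ST_LISTEN, ST_SYN_SENT, ST_SYN_RCVD, ST_ESTABLISHED,
    ST_FIN_WAIT_1, ST_FIN_WAIT_2, ST_CLOSE_WAIT, ST_LAST_ACK, ST_TIME_WAIT
};

enum conn_event {
    EV_ACTIVE_OPEN,  /* application calls connect, SYN is sent */
    EV_RX_SYN,       /* SYN received, SYN+ACK is sent */
    EV_RX_SYNACK,    /* SYN+ACK received, ACK is sent */
    EV_RX_ACK,       /* ACK of our SYN or FIN received */
    EV_APP_CLOSE,    /* application closes, FIN is sent */
    EV_RX_FIN        /* FIN received, ACK is sent */
};

/* Return the next state, or the current state if the event does not
 * apply on the normal open and close paths. */
static enum conn_state next_state(enum conn_state s, enum conn_event e)
{
    assert(s <= ST_TIME_WAIT);
    switch (s) {
    case ST_CLOSED:      if (e == EV_ACTIVE_OPEN) return ST_SYN_SENT;    break;
    case ST_LISTEN:      if (e == EV_RX_SYN)      return ST_SYN_RCVD;    break;
    case ST_SYN_SENT:    if (e == EV_RX_SYNACK)   return ST_ESTABLISHED; break;
    case ST_SYN_RCVD:    if (e == EV_RX_ACK)      return ST_ESTABLISHED; break;
    case ST_ESTABLISHED: if (e == EV_APP_CLOSE)   return ST_FIN_WAIT_1;
                         if (e == EV_RX_FIN)      return ST_CLOSE_WAIT;  break;
    case ST_FIN_WAIT_1:  if (e == EV_RX_ACK)      return ST_FIN_WAIT_2;  break;
    case ST_FIN_WAIT_2:  if (e == EV_RX_FIN)      return ST_TIME_WAIT;   break;
    case ST_CLOSE_WAIT:  if (e == EV_APP_CLOSE)   return ST_LAST_ACK;    break;
    case ST_LAST_ACK:    if (e == EV_RX_ACK)      return ST_CLOSED;      break;
    default: break;
    }
    return s;
}
```

Feeding the client the events active open, SYN+ACK, close, ACK, FIN walks it from CLOSED to TIME_WAIT; feeding the server SYN, ACK, FIN, close, ACK walks it from LISTEN back to CLOSED.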

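With connection setup and teardown understood, the TCP state transition diagram is easier to read. TCP defines 11 states for each connection (the code is given above). The state transition diagram is shown below:

The diagram looks complicated, but not every connection follows every path in it. Two paths are the classic ones, and the vast majority of TCP state transitions happen on them. The first describes a client opening and closing a connection, shown as the dashed line in the diagram. The second describes a server accepting a client's connection request and close request, shown as the thick solid line. Read together with the three-way handshake for connection setup and the four-way handshake for teardown described above, the state transitions of a TCP connection should be easier to follow.

The state machine function that implements these transitions is shown below: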

// rt-thread\components\net\lwip-1.4.1\src\core\tcp_in.c

/**
 * Implements the TCP state machine. Called by tcp_input. In some
 * states tcp_receive() is called to receive data. The tcp_seg
 * argument will be freed by the caller (tcp_input()) unless the
 * recv_data pointer in the pcb is set.
 * @param pcb the tcp_pcb for which a segment arrived
 * @note the segment which arrived is saved in global variables, therefore only the pcb
 *       involved is passed as a parameter to this function
 */
static err_t
tcp_process(struct tcp_pcb *pcb)
{
  struct tcp_seg *rseg;
  u8_t acceptable = 0;
  err_t err;

  err = ERR_OK;

  /* Process incoming RST segments. */
  if (flags & TCP_RST) {
    /* First, determine if the reset is acceptable. */
    if (pcb->state == SYN_SENT) {
      if (ackno == pcb->snd_nxt) {
        acceptable = 1;
      }
    } else {
      if (TCP_SEQ_BETWEEN(seqno, pcb->rcv_nxt, pcb->rcv_nxt+pcb->rcv_wnd)) {
        acceptable = 1;
      }
    }

    if (acceptable) {
      recv_flags |= TF_RESET;
      pcb->flags &= ~TF_ACK_DELAY;
      return ERR_RST;
    } else {
      return ERR_OK;
    }
  }

  if ((flags & TCP_SYN) && (pcb->state != SYN_SENT && pcb->state != SYN_RCVD)) {
    /* Cope with new connection attempt after remote end crashed */
    tcp_ack_now(pcb);
    return ERR_OK;
  }

  if ((pcb->flags & TF_RXCLOSED) == 0) {
    /* Update the PCB (in)activity timer unless rx is closed (see tcp_shutdown) */
    pcb->tmr = tcp_ticks;
  }
  pcb->keep_cnt_sent = 0;

  tcp_parseopt(pcb);

  /* Do different things depending on the TCP state. */
  switch (pcb->state) {
  case SYN_SENT:
    /* received SYN ACK with expected sequence number? */
    if ((flags & TCP_ACK) && (flags & TCP_SYN)
        && ackno == ntohl(pcb->unacked->tcphdr->seqno) + 1) {
      pcb->snd_buf++;
      pcb->rcv_nxt = seqno + 1;
      pcb->rcv_ann_right_edge = pcb->rcv_nxt;
      pcb->lastack = ackno;
      pcb->snd_wnd = tcphdr->wnd;
      pcb->snd_wnd_max = tcphdr->wnd;
      pcb->snd_wl1 = seqno - 1; /* initialise to seqno - 1 to force window update */
      pcb->state = ESTABLISHED;

#if TCP_CALCULATE_EFF_SEND_MSS
      pcb->mss = tcp_eff_send_mss(pcb->mss, &(pcb->remote_ip));
#endif /* TCP_CALCULATE_EFF_SEND_MSS */

      /* Set ssthresh again after changing pcb->mss (already set in tcp_connect
       * but for the default value of pcb->mss) */
      pcb->ssthresh = pcb->mss * 10;

      pcb->cwnd = ((pcb->cwnd == 1) ? (pcb->mss * 2) : pcb->mss);
      --pcb->snd_queuelen;
      rseg = pcb->unacked;
      pcb->unacked = rseg->next;
      tcp_seg_free(rseg);

      /* If there's nothing left to acknowledge, stop the retransmit
         timer, otherwise reset it to start again */
      if(pcb->unacked == NULL)
        pcb->rtime = -1;
      else {
        pcb->rtime = 0;
        pcb->nrtx = 0;
      }

      /* Call the user specified function to call when sucessfully
       * connected. */
      TCP_EVENT_CONNECTED(pcb, ERR_OK, err);
      if (err == ERR_ABRT) {
        return ERR_ABRT;
      }
      tcp_ack_now(pcb);
    }
    /* received ACK? possibly a half-open connection */
    else if (flags & TCP_ACK) {
      /* send a RST to bring the other side in a non-synchronized state. */
      tcp_rst(ackno, seqno + tcplen, ip_current_dest_addr(), ip_current_src_addr(),
        tcphdr->dest, tcphdr->src);
    }
    break;
  case SYN_RCVD:
    if (flags & TCP_ACK) {
      /* expected ACK number? */
      if (TCP_SEQ_BETWEEN(ackno, pcb->lastack+1, pcb->snd_nxt)) {
        u16_t old_cwnd;
        pcb->state = ESTABLISHED;
        /* Call the accept function. */
        TCP_EVENT_ACCEPT(pcb, ERR_OK, err);
        if (err != ERR_OK) {
          /* If the accept function returns with an error, we abort
           * the connection. */
          /* Already aborted? */
          if (err != ERR_ABRT) {
            tcp_abort(pcb);
          }
          return ERR_ABRT;
        }
        old_cwnd = pcb->cwnd;
        /* If there was any data contained within this ACK,
         * we'd better pass it on to the application as well. */
        tcp_receive(pcb);

        /* Prevent ACK for SYN to generate a sent event */
        if (pcb->acked != 0) {
          pcb->acked--;
        }

        pcb->cwnd = ((old_cwnd == 1) ? (pcb->mss * 2) : pcb->mss);

        if (recv_flags & TF_GOT_FIN) {
          tcp_ack_now(pcb);
          pcb->state = CLOSE_WAIT;
        }
      } else {
        /* incorrect ACK number, send RST */
        tcp_rst(ackno, seqno + tcplen, ip_current_dest_addr(), ip_current_src_addr(),
                tcphdr->dest, tcphdr->src);
      }
    } else if ((flags & TCP_SYN) && (seqno == pcb->rcv_nxt - 1)) {
      /* Looks like another copy of the SYN - retransmit our SYN-ACK */
      tcp_rexmit(pcb);
    }
    break;
  case CLOSE_WAIT:
    /* FALLTHROUGH */
  case ESTABLISHED:
    tcp_receive(pcb);
    if (recv_flags & TF_GOT_FIN) { /* passive close */
      tcp_ack_now(pcb);
      pcb->state = CLOSE_WAIT;
    }
    break;
  case FIN_WAIT_1:
    tcp_receive(pcb);
    if (recv_flags & TF_GOT_FIN) {
      if ((flags & TCP_ACK) && (ackno == pcb->snd_nxt)) {
        tcp_ack_now(pcb);
        tcp_pcb_purge(pcb);
        TCP_RMV_ACTIVE(pcb);
        pcb->state = TIME_WAIT;
        TCP_REG(&tcp_tw_pcbs, pcb);
      } else {
        tcp_ack_now(pcb);
        pcb->state = CLOSING;
      }
    } else if ((flags & TCP_ACK) && (ackno == pcb->snd_nxt)) {
      pcb->state = FIN_WAIT_2;
    }
    break;
  case FIN_WAIT_2:
    tcp_receive(pcb);
    if (recv_flags & TF_GOT_FIN) {
      tcp_ack_now(pcb);
      tcp_pcb_purge(pcb);
      TCP_RMV_ACTIVE(pcb);
      pcb->state = TIME_WAIT;
      TCP_REG(&tcp_tw_pcbs, pcb);
    }
    break;
  case CLOSING:
    tcp_receive(pcb);
    if (flags & TCP_ACK && ackno == pcb->snd_nxt) {
      tcp_pcb_purge(pcb);
      TCP_RMV_ACTIVE(pcb);
      pcb->state = TIME_WAIT;
      TCP_REG(&tcp_tw_pcbs, pcb);
    }
    break;
  case LAST_ACK:
    tcp_receive(pcb);
    if (flags & TCP_ACK && ackno == pcb->snd_nxt) {
      /* bugfix #21699: don't set pcb->state to CLOSED here or we risk leaking segments */
      recv_flags |= TF_CLOSED;
    }
    break;
  default:
    break;
  }
  return ERR_OK;
}

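That is the TCP state machine code; it is easier to follow alongside the state transition diagram.

2.4 TCP Segment Operations

TCP has many input and output functions with fairly complex call relationships. The overall call flow below shows how they relate:

2.4.1 TCP Segment Output Processing

The TCP Raw API was introduced earlier. An application can build a segment through TCP API functions such as tcp_connect and tcp_write. The segment may be a handshake segment for connection setup or teardown, or a data segment exchanged between the two ends. Handshake segments are built by tcp_enqueue_flags and placed on the control block's send queue. Data segments are built directly by tcp_write, which fills in the TCP data and some of the header fields and uses tcp_seg structures to organize the segments on the send queue (one tcp_seg describes one independently sendable segment). When tcp_output is called, it takes segments off the control block's send queue one by one and sends them. Its only job is to check whether a segment lies within the permitted send window and then call tcp_output_segment to send it; once a segment has been sent, tcp_output places it on the control block's unacked queue. tcp_output_segment fills in the remaining header fields, including the acknowledgement number, advertised window, and options. Most importantly, it calls the IP layer's ip_route function to obtain the source IP address for the pseudo-header, then computes and fills in the TCP checksum. Finally the IP layer's output function ip_output is called to assemble and send the IP datagram.

The flowchart of tcp_write, which builds data segments, is given below. The implementation is fairly complex, and readers can follow the source with the flowchart. tcp_enqueue_flags, which builds handshake segments, is much simpler than tcp_write; readers can read its source directly with the flowchart below:

The function that sends segments is tcp_output. Its only parameter is pcb, a pointer to the TCP control block of a connection. It either sends the segments on the control block's unsent queue or sends only an ACK segment (when the unsent queue is empty or the send window does not currently allow data to be sent). After a segment has actually been sent by tcp_output_segment, tcp_output must put it on the control block's unacked queue (keeping all segments in the queue ordered by sequence number) so that it can be retransmitted later. When the first segment on the unsent queue has been handled, tcp_output processes the remaining segments in the same way until all data has been sent or the send window is full. The key part of tcp_write and the implementation of tcp_output are shown below: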

// rt-thread\components\net\lwip-1.4.1\src\core\tcp_out.c

/**
 * Write data for sending (but does not send it immediately).
 *
 * It waits in the expectation of more data being sent soon (as
 * it can send them more efficiently by combining them together).
 * To prompt the system to send data now, call tcp_output() after
 * calling tcp_write().
 *
 * @param pcb Protocol control block for the TCP connection to enqueue data for.
 * @param arg Pointer to the data to be enqueued for sending.
 * @param len Data length in bytes
 * @param apiflags combination of following flags :
 * - TCP_WRITE_FLAG_COPY (0x01) data will be copied into memory belonging to the stack
 * - TCP_WRITE_FLAG_MORE (0x02) for TCP connection, PSH flag will be set on last segment sent,
 * @return ERR_OK if enqueued, another err_t on error
 */
err_t
tcp_write(struct tcp_pcb *pcb, const void *arg, u16_t len, u8_t apiflags)
{
  ......
  /*
   * Finally update the pcb state.
   */
  pcb->snd_lbb += len;
  pcb->snd_buf -= len;
  pcb->snd_queuelen = queuelen;

  /* Set the PSH flag in the last segment that we enqueued. */
  if (seg != NULL && seg->tcphdr != NULL && ((apiflags & TCP_WRITE_FLAG_MORE)==0)) {
    TCPH_SET_FLAG(seg->tcphdr, TCP_PSH);
  }
  ......
}

/**
 * Find out what we can send and send it
 *
 * @param pcb Protocol control block for the TCP connection to send data
 * @return ERR_OK if data has been sent or nothing to send
 *         another err_t on error
 */
err_t
tcp_output(struct tcp_pcb *pcb)
{
  struct tcp_seg *seg, *useg;
  u32_t wnd, snd_nxt;

  /* First, check if we are invoked by the TCP input processing
     code. If so, we do not output anything. Instead, we rely on the
     input processing code to call us when input processing is done
     with. */
  if (tcp_input_pcb == pcb) {
    return ERR_OK;
  }

  wnd = LWIP_MIN(pcb->snd_wnd, pcb->cwnd);

  seg = pcb->unsent;

  /* If the TF_ACK_NOW flag is set and no data will be sent (either
   * because the ->unsent queue is empty or because the window does
   * not allow it), construct an empty ACK segment and send it.
   *
   * If data is to be sent, we will just piggyback the ACK (see below).
   */
  if (pcb->flags & TF_ACK_NOW &&
     (seg == NULL ||
      ntohl(seg->tcphdr->seqno) - pcb->lastack + seg->len > wnd)) {
     return tcp_send_empty_ack(pcb);
  }

  /* useg should point to last segment on unacked queue */
  useg = pcb->unacked;
  if (useg != NULL) {
    for (; useg->next != NULL; useg = useg->next);
  }

  /* data available and window allows it to be sent? */
  while (seg != NULL &&
         ntohl(seg->tcphdr->seqno) - pcb->lastack + seg->len <= wnd) {
    /* Stop sending if the nagle algorithm would prevent it
     * Don't stop:
     * - if tcp_write had a memory error before (prevent delayed ACK timeout) or
     * - if FIN was already enqueued for this PCB (SYN is always alone in a segment -
     *   either seg->next != NULL or pcb->unacked == NULL;
     *   RST is no sent using tcp_write/tcp_output.
     */
    if((tcp_do_output_nagle(pcb) == 0) &&
      ((pcb->flags & (TF_NAGLEMEMERR | TF_FIN)) == 0)){
      break;
    }

    pcb->unsent = seg->next;

    if (pcb->state != SYN_SENT) {
      TCPH_SET_FLAG(seg->tcphdr, TCP_ACK);
      pcb->flags &= ~(TF_ACK_DELAY | TF_ACK_NOW);
    }

    tcp_output_segment(seg, pcb);
    snd_nxt = ntohl(seg->tcphdr->seqno) + TCP_TCPLEN(seg);
    if (TCP_SEQ_LT(pcb->snd_nxt, snd_nxt)) {
      pcb->snd_nxt = snd_nxt;
    }
    /* put segment on unacknowledged list if length > 0 */
    if (TCP_TCPLEN(seg) > 0) {
      seg->next = NULL;
      /* unacked list is empty? */
      if (pcb->unacked == NULL) {
        pcb->unacked = seg;
        useg = seg;
      /* unacked list is not empty? */
      } else {
        /* In the case of fast retransmit, the packet should not go to the tail
         * of the unacked queue, but rather somewhere before it. We need to check for
         * this case. -STJ Jul 27, 2004 */
        if (TCP_SEQ_LT(ntohl(seg->tcphdr->seqno), ntohl(useg->tcphdr->seqno))) {
          /* add segment to before tail of unacked list, keeping the list sorted */
          struct tcp_seg **cur_seg = &(pcb->unacked);
          while (*cur_seg &&
            TCP_SEQ_LT(ntohl((*cur_seg)->tcphdr->seqno), ntohl(seg->tcphdr->seqno))) {
              cur_seg = &((*cur_seg)->next );
          }
          seg->next = (*cur_seg);
          (*cur_seg) = seg;
        } else {
          /* add segment to tail of unacked list */
          useg->next = seg;
          useg = useg->next;
        }
      }
    /* do not queue empty segments on the unacked list */
    } else {
      tcp_seg_free(seg);
    }
    seg = pcb->unsent;
  }

  pcb->flags &= ~TF_NAGLEMEMERR;
  return ERR_OK;
}

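Looking at the send path as a whole, tcp_output only checks whether each segment meets the conditions for transmission and then calls tcp_output_segment to send it. The latter fills in the remaining mandatory fields of the TCP header and then calls the IP-layer output function ip_output to transmit the segment. tcp_output_segment plays a role similar to udp_sendto in the UDP implementation; readers can compare the two in the source.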
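The sending condition that tcp_output tests in its while loop can be isolated into a few lines. The following is a minimal sketch, not lwIP code: the struct send_state type and the function segment_fits_window are hypothetical names introduced here, and only the three fields the check needs are modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the PCB fields tcp_output() consults. */
struct send_state {
    uint32_t lastack;  /* highest sequence number acknowledged by the peer */
    uint16_t snd_wnd;  /* window advertised by the peer */
    uint16_t cwnd;     /* congestion window */
};

/* Returns true when the segment [seqno, seqno + len) fits in the usable
 * window, mirroring the test
 *   ntohl(seg->tcphdr->seqno) - pcb->lastack + seg->len <= wnd
 * with wnd = min(snd_wnd, cwnd). */
bool segment_fits_window(const struct send_state *s, uint32_t seqno, uint16_t len)
{
    uint32_t wnd = s->snd_wnd < s->cwnd ? s->snd_wnd : s->cwnd;
    /* Unsigned subtraction stays correct across sequence-number wraparound. */
    return (seqno - s->lastack) + len <= wnd;
}
```

With lastack = 1000, snd_wnd = 4096 and cwnd = 2048, a 1460-byte segment starting at 1588 still fits (588 + 1460 = 2048), while one starting at 2460 does not, so it would stay on the unsent queue until the window opens.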

2.4.2 TCP Segment Input Processing

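As the TCP function call flowchart above shows, five functions are involved in TCP input. The IP layer hands each TCP segment to tcp_input, which is effectively the top-level input function of the TCP layer. It looks for a TCP control block (PCB) that matches the segment and, depending on the state of that control block, calls tcp_timewait_input, tcp_listen_input, or tcp_process. The key function here is tcp_process, which implements the TCP state machine described earlier (its source was also given earlier): it drives the connection state transitions based on the segment, and when the segment carries data it calls tcp_receive. The hard part of the whole path is tcp_receive, which handles data reception and reassembly and also implements most of TCP's performance algorithms.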

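When the IP layer receives a datagram, ip_input examines the protocol field of the IP header and passes TCP segments up to the TCP layer through tcp_input. tcp_input dispatches the segment to the matching control block and waits for the result of processing; depending on that result it either delivers data to the user or sends a response segment to the peer. For each segment being processed, tcp_input records the segment's information in a set of global variables, which the other functions read and modify directly. These variables are defined as follows: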

// rt-thread\components\net\lwip-1.4.1\src\core\tcp_in.c

/* These variables are global to all functions involved in the input
   processing of TCP segments. They are set by the tcp_input()
   function. */
static struct tcp_seg inseg;
static struct tcp_hdr *tcphdr;
static struct ip_hdr *iphdr;
static u32_t seqno, ackno;
static u8_t flags;
static u16_t tcplen;

static u8_t recv_flags;
static struct pbuf *recv_data;

struct tcp_pcb *tcp_input_pcb;

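tcp_input first performs some basic checks on the segment handed up by the IP layer, such as dropping broadcast or multicast datagrams and verifying the checksum, and extracts the TCP header fields into the global variables above. It then uses the four values that identify a connection (source and destination IP address, source and destination port) to search the PCB lists. In lwIP 1.4.1 the lists searched are tcp_active_pcbs, tcp_tw_pcbs and tcp_listen_pcbs, in that order; the fourth list, tcp_bound_pcbs, holds PCBs that are not yet connected or listening and is not consulted on input. Whichever list yields a matching control block determines the function that continues processing. The original post illustrates the flow of tcp_input with a flowchart.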
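The lookup step can be sketched as follows. This is an illustration under assumptions, not the lwIP implementation: struct conn, match_full and demux are hypothetical names, IPv4 addresses are plain 32-bit integers, and the search order (active, then TIME_WAIT, then listeners) reflects my reading of lwIP 1.4.1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the addressing fields of struct tcp_pcb. */
struct conn {
    uint32_t local_ip, remote_ip;    /* local_ip == 0 means "any" for listeners */
    uint16_t local_port, remote_port;
    struct conn *next;
};

enum handler { TO_PROCESS, TO_TIMEWAIT, TO_LISTEN, TO_NONE };

/* Full four-tuple match, as used for connected PCBs. */
struct conn *match_full(struct conn *list, uint32_t src_ip, uint16_t src_port,
                        uint32_t dst_ip, uint16_t dst_port)
{
    for (struct conn *c = list; c != NULL; c = c->next) {
        if (c->remote_port == src_port && c->local_port == dst_port &&
            c->remote_ip == src_ip && c->local_ip == dst_ip) {
            return c;
        }
    }
    return NULL;
}

/* Decide which input routine would handle the segment. */
enum handler demux(struct conn *active, struct conn *timewait, struct conn *listeners,
                   uint32_t src_ip, uint16_t src_port,
                   uint32_t dst_ip, uint16_t dst_port)
{
    if (match_full(active, src_ip, src_port, dst_ip, dst_port) != NULL) {
        return TO_PROCESS;      /* would go to tcp_process() */
    }
    if (match_full(timewait, src_ip, src_port, dst_ip, dst_port) != NULL) {
        return TO_TIMEWAIT;     /* would go to tcp_timewait_input() */
    }
    for (struct conn *l = listeners; l != NULL; l = l->next) {
        /* A listener only fixes the local side of the connection. */
        if (l->local_port == dst_port && (l->local_ip == 0 || l->local_ip == dst_ip)) {
            return TO_LISTEN;   /* would go to tcp_listen_input() */
        }
    }
    return TO_NONE;             /* no match: a real stack answers with RST */
}
```

Searching connected PCBs before listeners matters: a segment for an established connection on port 80 must reach that connection rather than the listener that originally accepted it.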

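The code of tcp_process was given earlier. Below are the more important parts of tcp_input (the function is too long to show in full; read it alongside the flowchart), followed by the implementations of tcp_listen_input and tcp_timewait_input: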

// rt-thread\components\net\lwip-1.4.1\src\core\tcp_in.c

/**
 * The initial input processing of TCP. It verifies the TCP header, demultiplexes
 * the segment between the PCBs and passes it on to tcp_process(), which implements
 * the TCP finite state machine. This function is called by the IP layer (in
 * ip_input()).
 *
 * @param p received TCP segment to process (p->payload pointing to the IP header)
 * @param inp network interface on which this segment was received
 */
void tcp_input(struct pbuf *p, struct netif *inp)
{
  ......
  tcp_input_pcb = pcb;
  err = tcp_process(pcb);
  /* A return value of ERR_ABRT means that tcp_abort() was called
     and that the pcb has been freed. If so, we don't do anything. */
  if (err != ERR_ABRT) {
    if (recv_flags & TF_RESET) {
      /* TF_RESET means that the connection was reset by the other
         end. We then call the error callback to inform the
         application that the connection is dead before we
         deallocate the PCB. */
      TCP_EVENT_ERR(pcb->errf, pcb->callback_arg, ERR_RST);
      tcp_pcb_remove(&tcp_active_pcbs, pcb);
      memp_free(MEMP_TCP_PCB, pcb);
    } else if (recv_flags & TF_CLOSED) {
      /* The connection has been closed and we will deallocate the
         PCB. */
      if (!(pcb->flags & TF_RXCLOSED)) {
        /* Connection closed although the application has only shut down the
           tx side: call the PCB's err callback and indicate the closure to
           ensure the application doesn't continue using the PCB. */
        TCP_EVENT_ERR(pcb->errf, pcb->callback_arg, ERR_CLSD);
      }
      tcp_pcb_remove(&tcp_active_pcbs, pcb);
      memp_free(MEMP_TCP_PCB, pcb);
    } else {
      err = ERR_OK;
      /* If the application has registered a "sent" function to be
         called when new send buffer space is available, we call it
         now. */
      if (pcb->acked > 0) {
        TCP_EVENT_SENT(pcb, pcb->acked, err);
        if (err == ERR_ABRT) {
          goto aborted;
        }
      }

      if (recv_data != NULL) {
        if (pcb->flags & TF_RXCLOSED) {
          /* received data although already closed -> abort (send RST) to
             notify the remote host that not all data has been processed */
          pbuf_free(recv_data);
          tcp_abort(pcb);
          goto aborted;
        }

        /* Notify application that data has been received. */
        TCP_EVENT_RECV(pcb, recv_data, ERR_OK, err);
        if (err == ERR_ABRT) {
          goto aborted;
        }

        /* If the upper layer can't receive this data, store it */
        if (err != ERR_OK) {
          pcb->refused_data = recv_data;
        }
      }

      /* If a FIN segment was received, we call the callback
         function with a NULL buffer to indicate EOF. */
      if (recv_flags & TF_GOT_FIN) {
        if (pcb->refused_data != NULL) {
          /* Delay this if we have refused data. */
          pcb->refused_data->flags |= PBUF_FLAG_TCP_FIN;
        } else {
          /* correct rcv_wnd as the application won't call tcp_recved()
             for the FIN's seqno */
          if (pcb->rcv_wnd != TCP_WND) {
            pcb->rcv_wnd++;
          }
          TCP_EVENT_CLOSED(pcb, err);
          if (err == ERR_ABRT) {
            goto aborted;
          }
        }
      }

      tcp_input_pcb = NULL;
      /* Try to send something out. */
      tcp_output(pcb);
    }
  }
  ......
}

/**
 * Called by tcp_input() when a segment arrives for a listening
 * connection (from tcp_input()).
 *
 * @param pcb the tcp_pcb_listen for which a segment arrived
 * @return ERR_OK if the segment was processed
 *         another err_t on error
 *
 * @note the return value is not (yet?) used in tcp_input()
 * @note the segment which arrived is saved in global variables, therefore only the pcb
 *       involved is passed as a parameter to this function
 */
static err_t tcp_listen_input(struct tcp_pcb_listen *pcb)
{
  struct tcp_pcb *npcb;
  err_t rc;

  if (flags & TCP_RST) {
    /* An incoming RST should be ignored. Return. */
    return ERR_OK;
  }

  /* In the LISTEN state, we check for incoming SYN segments,
     creates a new PCB, and responds with a SYN|ACK. */
  if (flags & TCP_ACK) {
    /* For incoming segments with the ACK flag set, respond with a
       RST. */
    tcp_rst(ackno, seqno + tcplen, ip_current_dest_addr(),
      ip_current_src_addr(), tcphdr->dest, tcphdr->src);
  } else if (flags & TCP_SYN) {
    npcb = tcp_alloc(pcb->prio);
    /* If a new PCB could not be created (probably due to lack of memory),
       we don't do anything, but rely on the sender will retransmit the
       SYN at a time when we have more memory available. */
    if (npcb == NULL) {
      return ERR_MEM;
    }
    /* Set up the new PCB. */
    ip_addr_copy(npcb->local_ip, current_iphdr_dest);
    npcb->local_port = pcb->local_port;
    ip_addr_copy(npcb->remote_ip, current_iphdr_src);
    npcb->remote_port = tcphdr->src;
    npcb->state = SYN_RCVD;
    npcb->rcv_nxt = seqno + 1;
    npcb->rcv_ann_right_edge = npcb->rcv_nxt;
    npcb->snd_wnd = tcphdr->wnd;
    npcb->snd_wnd_max = tcphdr->wnd;
    npcb->ssthresh = npcb->snd_wnd;
    npcb->snd_wl1 = seqno - 1;/* initialise to seqno-1 to force window update */
    npcb->callback_arg = pcb->callback_arg;
    npcb->accept = pcb->accept;
    /* inherit socket options */
    npcb->so_options = pcb->so_options & SOF_INHERITED;
    /* Register the new PCB so that we can begin receiving segments
       for it. */
    TCP_REG_ACTIVE(npcb);

    /* Parse any options in the SYN. */
    tcp_parseopt(npcb);
    npcb->mss = tcp_eff_send_mss(npcb->mss, &(npcb->remote_ip));

    /* Send a SYN|ACK together with the MSS option. */
    rc = tcp_enqueue_flags(npcb, TCP_SYN | TCP_ACK);
    if (rc != ERR_OK) {
      tcp_abandon(npcb, 0);
      return rc;
    }
    return tcp_output(npcb);
  }
  return ERR_OK;
}

/**
 * Called by tcp_input() when a segment arrives for a connection in
 * TIME_WAIT.
 *
 * @param pcb the tcp_pcb for which a segment arrived
 *
 * @note the segment which arrived is saved in global variables, therefore only the pcb
 *       involved is passed as a parameter to this function
 */
static err_t tcp_timewait_input(struct tcp_pcb *pcb)
{
  /* RFC 1337: in TIME_WAIT, ignore RST and ACK FINs + any 'acceptable' segments */
  /* RFC 793 3.9 Event Processing - Segment Arrives:
   * - first check sequence number - we skip that one in TIME_WAIT (always
   *   acceptable since we only send ACKs)
   * - second check the RST bit (... return) */
  if (flags & TCP_RST) {
    return ERR_OK;
  }
  /* - fourth, check the SYN bit, */
  if (flags & TCP_SYN) {
    /* If an incoming segment is not acceptable, an acknowledgment
       should be sent in reply */
    if (TCP_SEQ_BETWEEN(seqno, pcb->rcv_nxt, pcb->rcv_nxt+pcb->rcv_wnd)) {
      /* If the SYN is in the window it is an error, send a reset */
      tcp_rst(ackno, seqno + tcplen, ip_current_dest_addr(), ip_current_src_addr(),
        tcphdr->dest, tcphdr->src);
      return ERR_OK;
    }
  } else if (flags & TCP_FIN) {
    /* - eighth, check the FIN bit: Remain in the TIME-WAIT state.
         Restart the 2 MSL time-wait timeout.*/
    pcb->tmr = tcp_ticks;
  }

  if ((tcplen > 0)) {
    /* Acknowledge data, FIN or out-of-window SYN */
    pcb->flags |= TF_ACK_NOW;
    return tcp_output(pcb);
  }
  return ERR_OK;
}

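Inside the TCP core, reception and processing of the data in incoming segments is done by tcp_receive, arguably the longest and hardest function in the whole stack. As the state machine function tcp_process shows, tcp_receive is called from several places to handle segment data. Its work can be summarized as follows. First it checks whether the acknowledgment number in the segment acknowledges data on the unacked queue; if so, it frees the acknowledged segments and sets the acked field so that tcp_input can invoke the user's sent callback. If the segment carries data and that data is in order, the data is recorded in recv_data for the application. If segments already waiting on the control block's ooseq queue become in-order because of the new segment, their data is chained onto recv_data as well, and tcp_input delivers everything to the application after the function returns. If the new segment is out of order, it is inserted into the ooseq queue and the reference count of its pbuf is incremented so that it is not freed elsewhere. Several other jobs are also done here: if the acknowledgment covers the segment currently being timed for RTT estimation, the RTT is computed; if duplicate ACKs arrive, the function may trigger fast retransmit; and so on. The original post illustrates the whole tcp_receive flow with a flowchart that readers can follow while reading the source.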
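Two of the decisions described above, "does this ACK cover new data?" and "is this segment in order?", both rest on wraparound-safe sequence comparison. The sketch below shows the idea with hypothetical helper names (seq_lt, newly_acked, classify_segment); seq_lt is modeled on the way lwIP's TCP_SEQ_LT macro compares 32-bit sequence numbers. It deliberately ignores the trimming of partially overlapping segments that the real tcp_receive performs.

```c
#include <stdint.h>

/* True when sequence number a is "before" b, even across 2^32 wraparound. */
int seq_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Bytes newly acknowledged by ackno, or 0 if the ACK is old, a duplicate,
 * or acknowledges data that was never sent (lastack < ackno <= snd_nxt). */
uint16_t newly_acked(uint32_t lastack, uint32_t snd_nxt, uint32_t ackno)
{
    if (seq_lt(lastack, ackno) && !seq_lt(snd_nxt, ackno)) {
        return (uint16_t)(ackno - lastack);
    }
    return 0;
}

enum seg_class { SEG_IN_ORDER, SEG_OUT_OF_ORDER, SEG_OLD };

/* Compare the segment's first sequence number with the next expected one. */
enum seg_class classify_segment(uint32_t rcv_nxt, uint32_t seqno)
{
    if (seqno == rcv_nxt) {
        return SEG_IN_ORDER;       /* data can go straight to recv_data */
    }
    if (seq_lt(rcv_nxt, seqno)) {
        return SEG_OUT_OF_ORDER;   /* a hole precedes it: queue on ooseq */
    }
    return SEG_OLD;                /* starts before rcv_nxt: retransmission */
}
```

The cast to a signed type is what makes a sequence number just past the wrap point (for example 0x10) compare as later than one just before it (for example 0xFFFFFFF0).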

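Earlier sections described how TCP provides a reliable transport service: timeout retransmission and RTT estimation, keepalive, fast retransmit and fast recovery, slow start and congestion avoidance, zero-window probing, the Nagle algorithm, delayed and piggybacked acknowledgments, and so on. The code for these features is spread across the functions introduced above. Because space is limited and no single feature lives in just one function, they are not all listed here; readers can study the source to understand each one. As examples, the following excerpt shows the parts of tcp_receive that implement zero-window probing, fast retransmit and fast recovery, slow start and congestion avoidance, and RTT (Round-Trip Time) estimation with RTO (Retransmission Timeout) update: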

// rt-thread\components\net\lwip-1.4.1\src\core\tcp_in.c

/**
 * Called by tcp_process. Checks if the given segment is an ACK for outstanding
 * data, and if so frees the memory of the buffered data. Next, is places the
 * segment on any of the receive queues (pcb->recved or pcb->ooseq). If the segment
 * is buffered, the pbuf is referenced by pbuf_ref so that it will not be freed until
 * it has been removed from the buffer.
 *
 * If the incoming segment constitutes an ACK for a segment that was used for RTT
 * estimation, the RTT is estimated here as well.
 *
 * Called from tcp_process().
 */
static void tcp_receive(struct tcp_pcb *pcb)
{
  struct tcp_seg *next;
  struct tcp_seg *prev, *cseg;
  struct pbuf *p;
  s32_t off;
  s16_t m;
  u32_t right_wnd_edge;
  u16_t new_tot_len;
  int found_dupack = 0;

  if (flags & TCP_ACK) {
    right_wnd_edge = pcb->snd_wnd + pcb->snd_wl2;

    /* Update window. */
    if (TCP_SEQ_LT(pcb->snd_wl1, seqno) ||
       (pcb->snd_wl1 == seqno && TCP_SEQ_LT(pcb->snd_wl2, ackno)) ||
       (pcb->snd_wl2 == ackno && tcphdr->wnd > pcb->snd_wnd)) {
      pcb->snd_wnd = tcphdr->wnd;
      /* keep track of the biggest window announced by the remote host to calculate
         the maximum segment size */
      if (pcb->snd_wnd_max < tcphdr->wnd) {
        pcb->snd_wnd_max = tcphdr->wnd;
      }
      pcb->snd_wl1 = seqno;
      pcb->snd_wl2 = ackno;
      if (pcb->snd_wnd == 0) {
        if (pcb->persist_backoff == 0) {
          /* start persist timer */
          pcb->persist_cnt = 0;
          pcb->persist_backoff = 1;
        }
      } else if (pcb->persist_backoff > 0) {
        /* stop persist timer */
        pcb->persist_backoff = 0;
      }
    }

    /* (From Stevens TCP/IP Illustrated Vol II, p970.) Its only a
     * duplicate ack if:
     * 1) It doesn't ACK new data
     * 2) length of received packet is zero (i.e. no payload)
     * 3) the advertised window hasn't changed
     * 4) There is outstanding unacknowledged data (retransmission timer running)
     * 5) The ACK is == biggest ACK sequence number so far seen (snd_una)
     *
     * If it passes all five, should process as a dupack:
     * a) dupacks < 3: do nothing
     * b) dupacks == 3: fast retransmit
     * c) dupacks > 3: increase cwnd
     *
     * If it only passes 1-3, should reset dupack counter (and add to
     * stats, which we don't do in lwIP)
     *
     * If it only passes 1, should reset dupack counter
     */

    /* Clause 1 */
    if (TCP_SEQ_LEQ(ackno, pcb->lastack)) {
      pcb->acked = 0;
      /* Clause 2 */
      if (tcplen == 0) {
        /* Clause 3 */
        if (pcb->snd_wl2 + pcb->snd_wnd == right_wnd_edge){
          /* Clause 4 */
          if (pcb->rtime >= 0) {
            /* Clause 5 */
            if (pcb->lastack == ackno) {
              found_dupack = 1;
              if ((u8_t)(pcb->dupacks + 1) > pcb->dupacks) {
                ++pcb->dupacks;
              }
              if (pcb->dupacks > 3) {
                /* Inflate the congestion window, but not if it means that
                   the value overflows. */
                if ((u16_t)(pcb->cwnd + pcb->mss) > pcb->cwnd) {
                  pcb->cwnd += pcb->mss;
                }
              } else if (pcb->dupacks == 3) {
                /* Do fast retransmit */
                tcp_rexmit_fast(pcb);
              }
            }
          }
        }
      }
      /* If Clause (1) or more is true, but not a duplicate ack, reset
       * count of consecutive duplicate acks */
      if (!found_dupack) {
        pcb->dupacks = 0;
      }
    } else if (TCP_SEQ_BETWEEN(ackno, pcb->lastack+1, pcb->snd_nxt)){
      /* We come here when the ACK acknowledges new data. */

      /* Reset the "IN Fast Retransmit" flag, since we are no longer
         in fast retransmit. Also reset the congestion window to the
         slow start threshold. */
      if (pcb->flags & TF_INFR) {
        pcb->flags &= ~TF_INFR;
        pcb->cwnd = pcb->ssthresh;
      }

      /* Reset the number of retransmissions. */
      pcb->nrtx = 0;

      /* Reset the retransmission time-out. */
      pcb->rto = (pcb->sa >> 3) + pcb->sv;

      /* Update the send buffer space. Diff between the two can never exceed 64K? */
      pcb->acked = (u16_t)(ackno - pcb->lastack);

      pcb->snd_buf += pcb->acked;

      /* Reset the fast retransmit variables. */
      pcb->dupacks = 0;
      pcb->lastack = ackno;

      /* Update the congestion control variables (cwnd and
         ssthresh). */
      if (pcb->state >= ESTABLISHED) {
        if (pcb->cwnd < pcb->ssthresh) {
          if ((u16_t)(pcb->cwnd + pcb->mss) > pcb->cwnd) {
            pcb->cwnd += pcb->mss;
          }
        } else {
          u16_t new_cwnd = (pcb->cwnd + pcb->mss * pcb->mss / pcb->cwnd);
          if (new_cwnd > pcb->cwnd) {
            pcb->cwnd = new_cwnd;
          }
        }
      }
      ......
    /* RTT estimation calculations. This is done by checking if the
       incoming segment acknowledges the segment we use to take a
       round-trip time measurement. */
    if (pcb->rttest && TCP_SEQ_LT(pcb->rtseq, ackno)) {
      /* diff between this shouldn't exceed 32K since this are tcp timer ticks
         and a round-trip shouldn't be that long... */
      m = (s16_t)(tcp_ticks - pcb->rttest);

      /* This is taken directly from VJs original code in his paper */
      m = m - (pcb->sa >> 3);
      pcb->sa += m;
      if (m < 0) {
        m = -m;
      }
      m = m - (pcb->sv >> 2);
      pcb->sv += m;
      pcb->rto = (pcb->sa >> 3) + pcb->sv;

      pcb->rttest = 0;
    }
  ......
}
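To make the arithmetic in the excerpt above easier to follow, here is a small stand-alone sketch of two of its pieces: the fixed-point RTT estimator (sa holds 8 times the smoothed RTT, sv holds 4 times the variance estimate, all in 500 ms ticks) and the congestion window growth on a new ACK. The names struct rtt_state, rtt_update and cwnd_on_new_ack are hypothetical, and the starting values used in the example are chosen for illustration only.

```c
#include <stdint.h>

/* Hypothetical container for the three PCB fields the estimator touches. */
struct rtt_state {
    int16_t sa;   /* 8 x smoothed RTT, in slow-timer ticks */
    int16_t sv;   /* 4 x RTT variance estimate, in ticks */
    int16_t rto;  /* retransmission timeout, in ticks */
};

/* Van Jacobson's estimator, following the statement order in tcp_receive(). */
void rtt_update(struct rtt_state *r, int16_t measured_ticks)
{
    int16_t m = (int16_t)(measured_ticks - (r->sa >> 3));  /* error vs. SRTT */
    r->sa = (int16_t)(r->sa + m);        /* SRTT += error / 8 (scaled by 8) */
    if (m < 0) {
        m = (int16_t)-m;
    }
    m = (int16_t)(m - (r->sv >> 2));     /* |error| minus current variance */
    r->sv = (int16_t)(r->sv + m);        /* variance += diff / 4 (scaled by 4) */
    r->rto = (int16_t)((r->sa >> 3) + r->sv);
}

/* cwnd growth when an ACK covers new data; assumes cwnd > 0. */
uint16_t cwnd_on_new_ack(uint16_t cwnd, uint16_t ssthresh, uint16_t mss)
{
    if (cwnd < ssthresh) {
        /* Slow start: one MSS per ACK, skipped if it would overflow 16 bits. */
        if ((uint16_t)(cwnd + mss) > cwnd) {
            cwnd = (uint16_t)(cwnd + mss);
        }
    } else {
        /* Congestion avoidance: roughly one MSS per round trip. */
        uint16_t grown = (uint16_t)(cwnd + (uint32_t)mss * mss / cwnd);
        if (grown > cwnd) {
            cwnd = grown;
        }
    }
    return cwnd;
}
```

Starting from sa = 0 and sv = 6, a measured round trip of 2 ticks gives sa = 2, sv = 7 and rto = 7 ticks (3.5 s). With mss = 1460 and ssthresh = 4380, a window of 1460 doubles to 2920 in slow start, while a window of 4380 grows only to 4866 in congestion avoidance.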

2.4.3 TCP Timers

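In the overall TCP call flow, the segment output function tcp_output is called periodically from the timer function tcp_tmr. Many other TCP features depend on timers as well: the poll callback needs timer support, and so do retransmission, keepalive and others. In total, TCP maintains seven timers per connection: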

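Connection establishment timer: started when a server answers a SYN and tries to set up a new connection. At this point the server has sent its own SYN+ACK and sits in SYN_RCVD waiting for the peer's ACK. If no response arrives within the timeout (75 seconds in the classic BSD implementation; the lwIP 1.4.1 macro TCP_SYN_RCVD_TIMEOUT quoted below is 20 seconds), the connection attempt is aborted. This is also an effective way for a server to deal with SYN attacks;
Retransmission timer: set when TCP sends a segment. If the timer expires before the peer's acknowledgment arrives, TCP retransmits the segment. The retransmission interval is computed dynamically from the RTT estimate and depends on how many times the segment has already been retransmitted;
Assemble timer: active while the out-of-sequence queue ooseq is not empty. If the connection has had no data exchange for a long time but out-of-sequence segments are still queued on ooseq, those segments are removed from the queue;
Persist timer: set when the peer advertises a receive window of 0 and thereby stops TCP from sending data. When it expires, TCP sends 1 byte of data to the peer to find out whether the receive window has opened;
Keepalive timer: takes effect when the SOF_KEEPALIVE option is set in the so_options field of the TCP control block. If the connection has been idle for more than 2 hours, the keepalive timer expires and TCP sends a keepalive probe to force the peer to respond. If the expected response arrives, TCP knows the peer is working and resets the keepalive timer; if not, TCP closes the connection, releases its resources and notifies the application that the peer is gone;
FIN_WAIT_2 timer: started when a connection moves from FIN_WAIT_1 to FIN_WAIT_2 and can no longer receive any new data. When it expires, the connection is closed;
TIME_WAIT timer: also commonly called the 2MSL (Maximum Segment Lifetime) timer. It starts when a connection enters TIME_WAIT, that is, on the side that actively closed the connection. When it expires, the TCP control block is deleted and the port number can be reused. Similarly, a server that is closing a connection sits in LAST_ACK waiting for the peer's ACK; if no response arrives within 2MSL in that state, the connection is also closed immediately.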

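Of the seven timers, the retransmission timer counts in the rtime field and the persist timer counts in the persist_cnt field. The other five all use the tmr field: each compares the time elapsed since tmr was last set against its own timeout limit to decide whether it has expired, and then performs its own timeout handling. Because these timers are used while the connection is in different states, they can share the tmr field without interfering with one another. Their timeout limits are defined by the following macros: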

// rt-thread\components\net\lwip-1.4.1\src\include\lwip\tcp_impl.h

#define TCP_TMR_INTERVAL       250                   /* The TCP timer interval in milliseconds. */
#define TCP_FAST_INTERVAL      TCP_TMR_INTERVAL      /* the fine grained timeout in milliseconds */
#define TCP_SLOW_INTERVAL      (2*TCP_TMR_INTERVAL)  /* the coarse grained timeout in milliseconds */

#define TCP_FIN_WAIT_TIMEOUT   20000  /* milliseconds */
#define TCP_SYN_RCVD_TIMEOUT   20000  /* milliseconds */

#define TCP_OOSEQ_TIMEOUT      6U       /* x RTO */
#define TCP_MSL                60000UL  /* The maximum segment lifetime in milliseconds */

/* Keepalive values, compliant with RFC 1122. Don't change this unless you know what you're doing */
#define TCP_KEEPIDLE_DEFAULT   7200000UL  /* Default KEEPALIVE timer in milliseconds */
#define TCP_KEEPINTVL_DEFAULT  75000UL    /* Default Time between KEEPALIVE probes in milliseconds */
#define TCP_KEEPCNT_DEFAULT    9U         /* Default Counter for KEEPALIVE probes */
#define TCP_MAXIDLE            TCP_KEEPCNT_DEFAULT * TCP_KEEPINTVL_DEFAULT  /* Maximum KEEPALIVE probe time */
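Since tcp_ticks advances once per slow-timer period, every millisecond limit above has to be converted to ticks before it is compared with the elapsed time. The sketch below shows that comparison in isolation; SLOW_INTERVAL_MS and timer_expired are hypothetical names introduced for this example, and the comparison form is modeled on the checks in tcp_slowtmr.

```c
#include <stdint.h>

#define SLOW_INTERVAL_MS 500u  /* mirrors TCP_SLOW_INTERVAL (2 * 250 ms) */

/* Returns 1 when more than timeout_ms has elapsed since start_ticks.
 * The unsigned subtraction keeps working after the tick counter wraps. */
int timer_expired(uint32_t now_ticks, uint32_t start_ticks, uint32_t timeout_ms)
{
    return (uint32_t)(now_ticks - start_ticks) > timeout_ms / SLOW_INTERVAL_MS;
}
```

With these numbers, the 20000 ms SYN_RCVD and FIN_WAIT limits correspond to 40 ticks, and 2 * TCP_MSL (120000 ms) corresponds to 240 ticks, so a TIME_WAIT control block is removed on the 241st tick after it entered that state.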

Nearly all of the seven timers above, and with them most of TCP's reliability guarantees, are handled in the slow-timer function tcp_slowtmr, which is implemented as follows:

// rt-thread\components\net\lwip-1.4.1\src\core\tcp.c

/* Incremented every coarse grained timer shot (typically every 500 ms). */
u32_t tcp_ticks;
const u8_t tcp_backoff[13] = { 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7};
/* Times per slowtmr hits */
const u8_t tcp_persist_backoff[7] = { 3, 6, 12, 24, 48, 96, 120 };

/* The TCP PCB lists. */

/** List of all TCP PCBs bound but not yet (connected || listening) */
struct tcp_pcb *tcp_bound_pcbs;
/** List of all TCP PCBs in LISTEN state */
union tcp_listen_pcbs_t tcp_listen_pcbs;
/** List of all TCP PCBs that are in a state in which
 * they accept or send data. */
struct tcp_pcb *tcp_active_pcbs;
/** List of all TCP PCBs in TIME-WAIT state */
struct tcp_pcb *tcp_tw_pcbs;

/**
 * Called every 500 ms and implements the retransmission timer and the timer that
 * removes PCBs that have been in TIME-WAIT for enough time. It also increments
 * various timers such as the inactivity timer in each PCB.
 *
 * Automatically called from tcp_tmr().
 */
void tcp_slowtmr(void)
{
  struct tcp_pcb *pcb, *prev;
  u16_t eff_wnd;
  u8_t pcb_remove;      /* flag if a PCB should be removed */
  u8_t pcb_reset;       /* flag if a RST should be sent when removing */
  err_t err;

  err = ERR_OK;

  ++tcp_ticks;
  ++tcp_timer_ctr;

tcp_slowtmr_start:
  /* Steps through all of the active PCBs. */
  prev = NULL;
  pcb = tcp_active_pcbs;
  while (pcb != NULL) {
    if (pcb->last_timer == tcp_timer_ctr) {
      /* skip this pcb, we have already processed it */
      pcb = pcb->next;
      continue;
    }
    pcb->last_timer = tcp_timer_ctr;

    pcb_remove = 0;
    pcb_reset = 0;

    if (pcb->state == SYN_SENT && pcb->nrtx == TCP_SYNMAXRTX) {
      ++pcb_remove;
    }
    else if (pcb->nrtx == TCP_MAXRTX) {
      ++pcb_remove;
    } else {
      if (pcb->persist_backoff > 0) {
        /* If snd_wnd is zero, use persist timer to send 1 byte probes
         * instead of using the standard retransmission mechanism. */
        pcb->persist_cnt++;
        if (pcb->persist_cnt >= tcp_persist_backoff[pcb->persist_backoff-1]) {
          pcb->persist_cnt = 0;
          if (pcb->persist_backoff < sizeof(tcp_persist_backoff)) {
            pcb->persist_backoff++;
          }
          tcp_zero_window_probe(pcb);
        }
      } else {
        /* Increase the retransmission timer if it is running */
        if(pcb->rtime >= 0) {
          ++pcb->rtime;
        }

        if (pcb->unacked != NULL && pcb->rtime >= pcb->rto) {
          /* Double retransmission time-out unless we are trying to
           * connect to somebody (i.e., we are in SYN_SENT). */
          if (pcb->state != SYN_SENT) {
            pcb->rto = ((pcb->sa >> 3) + pcb->sv) << tcp_backoff[pcb->nrtx];
          }

          /* Reset the retransmission timer. */
          pcb->rtime = 0;

          /* Reduce congestion window and ssthresh. */
          eff_wnd = LWIP_MIN(pcb->cwnd, pcb->snd_wnd);
          pcb->ssthresh = eff_wnd >> 1;
          if (pcb->ssthresh < (pcb->mss << 1)) {
            pcb->ssthresh = (pcb->mss << 1);
          }
          pcb->cwnd = pcb->mss;

          /* The following needs to be called AFTER cwnd is set to one
             mss - STJ */
          tcp_rexmit_rto(pcb);
        }
      }
    }
    /* Check if this PCB has stayed too long in FIN-WAIT-2 */
    if (pcb->state == FIN_WAIT_2) {
      /* If this PCB is in FIN_WAIT_2 because of SHUT_WR don't let it time out. */
      if (pcb->flags & TF_RXCLOSED) {
        /* PCB was fully closed (either through close() or SHUT_RDWR):
           normal FIN-WAIT timeout handling. */
        if ((u32_t)(tcp_ticks - pcb->tmr) >
            TCP_FIN_WAIT_TIMEOUT / TCP_SLOW_INTERVAL) {
          ++pcb_remove;
        }
      }
    }

    /* Check if KEEPALIVE should be sent */
    if(ip_get_option(pcb, SOF_KEEPALIVE) &&
       ((pcb->state == ESTABLISHED) ||
        (pcb->state == CLOSE_WAIT))) {
      if((u32_t)(tcp_ticks - pcb->tmr) >
         (pcb->keep_idle + TCP_KEEP_DUR(pcb)) / TCP_SLOW_INTERVAL)
      {
        ++pcb_remove;
        ++pcb_reset;
      }
      else if((u32_t)(tcp_ticks - pcb->tmr) >
              (pcb->keep_idle + pcb->keep_cnt_sent * TCP_KEEP_INTVL(pcb))
              / TCP_SLOW_INTERVAL)
      {
        tcp_keepalive(pcb);
        pcb->keep_cnt_sent++;
      }
    }

    /* If this PCB has queued out of sequence data, but has been
       inactive for too long, will drop the data (it will eventually
       be retransmitted). */
    if (pcb->ooseq != NULL &&
        (u32_t)tcp_ticks - pcb->tmr >= pcb->rto * TCP_OOSEQ_TIMEOUT) {
      tcp_segs_free(pcb->ooseq);
      pcb->ooseq = NULL;
    }

    /* Check if this PCB has stayed too long in SYN-RCVD */
    if (pcb->state == SYN_RCVD) {
      if ((u32_t)(tcp_ticks - pcb->tmr) >
          TCP_SYN_RCVD_TIMEOUT / TCP_SLOW_INTERVAL) {
        ++pcb_remove;
      }
    }

    /* Check if this PCB has stayed too long in LAST-ACK */
    if (pcb->state == LAST_ACK) {
      if ((u32_t)(tcp_ticks - pcb->tmr) > 2 * TCP_MSL / TCP_SLOW_INTERVAL) {
        ++pcb_remove;
      }
    }

    /* If the PCB should be removed, do it. */
    if (pcb_remove) {
      struct tcp_pcb *pcb2;
      tcp_err_fn err_fn;
      void *err_arg;
      tcp_pcb_purge(pcb);
      /* Remove PCB from tcp_active_pcbs list. */
      if (prev != NULL) {
        prev->next = pcb->next;
      } else {
        /* This PCB was the first. */
        tcp_active_pcbs = pcb->next;
      }

      if (pcb_reset) {
        tcp_rst(pcb->snd_nxt, pcb->rcv_nxt, &pcb->local_ip, &pcb->remote_ip,
          pcb->local_port, pcb->remote_port);
      }

      err_fn = pcb->errf;
      err_arg = pcb->callback_arg;
      pcb2 = pcb;
      pcb = pcb->next;
      memp_free(MEMP_TCP_PCB, pcb2);

      tcp_active_pcbs_changed = 0;
      TCP_EVENT_ERR(err_fn, err_arg, ERR_ABRT);
      if (tcp_active_pcbs_changed) {
        goto tcp_slowtmr_start;
      }
    } else {
      /* get the 'next' element now and work with 'prev' below (in case of abort) */
      prev = pcb;
      pcb = pcb->next;

      /* We check if we should poll the connection. */
      ++prev->polltmr;
      if (prev->polltmr >= prev->pollinterval) {
        prev->polltmr = 0;
        tcp_active_pcbs_changed = 0;
        TCP_EVENT_POLL(prev, err);
        if (tcp_active_pcbs_changed) {
          goto tcp_slowtmr_start;
        }
        /* if err == ERR_ABRT, 'prev' is already deallocated */
        if (err == ERR_OK) {
          tcp_output(prev);
        }
      }
    }
  }

  /* Steps through all of the TIME-WAIT PCBs. */
  prev = NULL;
  pcb = tcp_tw_pcbs;
  while (pcb != NULL) {
    pcb_remove = 0;

    /* Check if this PCB has stayed long enough in TIME-WAIT */
    if ((u32_t)(tcp_ticks - pcb->tmr) > 2 * TCP_MSL / TCP_SLOW_INTERVAL) {
      ++pcb_remove;
    }

    /* If the PCB should be removed, do it. */
    if (pcb_remove) {
      struct tcp_pcb *pcb2;
      tcp_pcb_purge(pcb);
      /* Remove PCB from tcp_tw_pcbs list. */
      if (prev != NULL) {
        prev->next = pcb->next;
      } else {
        /* This PCB was the first. */
        tcp_tw_pcbs = pcb->next;
      }
      pcb2 = pcb;
      pcb = pcb->next;
      memp_free(MEMP_TCP_PCB, pcb2);
    } else {
      prev = pcb;
      pcb = pcb->next;
    }
  }
}

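It is easy to see that the tmr-based timers are all implemented through the difference between the global tcp_ticks and the tmr field. When TCP enters a given state, the control block's tmr field is set to the current value of the global clock tcp_ticks, so the difference effectively measures how long the connection has been in that state. The timeout handling is also very similar across timers: the variable pcb_remove is incremented. pcb_remove is the central variable of timeout processing. Once all timeout checks for a control block are done, the function examines pcb_remove; a nonzero value means a timeout has occurred on that control block, which is then purged, unlinked from its list and freed.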
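The retransmission branch of tcp_slowtmr shown above combines exponential backoff of the RTO with a collapse of the congestion window. The following sketch condenses that branch into one function. The names struct cc_state, backoff_shift and on_rto_expired are hypothetical, the shift table copies the tcp_backoff values from the listing, and the sketch increments the retransmission counter itself purely for the sake of the example.

```c
#include <stdint.h>

/* Same values as tcp_backoff[] in the listing: shift counts, not multipliers. */
static const uint8_t backoff_shift[13] = { 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7, 7 };

/* Hypothetical subset of the PCB fields involved in an RTO expiry. */
struct cc_state {
    int16_t  sa, sv, rto;  /* RTT estimator state and current RTO, in ticks */
    uint8_t  nrtx;         /* retransmissions of the current segment so far */
    uint16_t cwnd, ssthresh, snd_wnd, mss;
};

void on_rto_expired(struct cc_state *c)
{
    uint8_t idx = c->nrtx < 13 ? c->nrtx : 12;   /* stay inside the table */
    uint16_t eff_wnd = c->cwnd < c->snd_wnd ? c->cwnd : c->snd_wnd;

    /* Back off: the base RTO shifted left by a growing amount. */
    c->rto = (int16_t)(((c->sa >> 3) + c->sv) << backoff_shift[idx]);

    /* Halve the effective window into ssthresh, but keep at least 2 MSS. */
    c->ssthresh = (uint16_t)(eff_wnd >> 1);
    if (c->ssthresh < (uint16_t)(c->mss << 1)) {
        c->ssthresh = (uint16_t)(c->mss << 1);
    }

    /* Restart from slow start with a single segment. */
    c->cwnd = c->mss;
    c->nrtx++;
}
```

For a base RTO of 8 ticks, the first expiry yields 16 ticks and the second 32 ticks. After the first expiry with cwnd = 8760, snd_wnd = 5840 and mss = 1460, ssthresh becomes 2920 and cwnd drops to 1460.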

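LwIP has two timer functions. The first is the slow timer function tcp_slowtmr described above, with a period of 500 ms, which implements essentially all of the timing that TCP requires. The second is the fast timer function tcp_fasttmr, with a period of 250 ms. Its main job is to send out any ACKs that have been delayed on a connection; data that could not be delivered to the application earlier is also delivered here. tcp_fasttmr is implemented as follows: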

// rt-thread\components\net\lwip-1.4.1\src\core\tcp.c

/**
 * Is called every TCP_FAST_INTERVAL (250 ms) and process data previously
 * "refused" by upper layer (application) and sends delayed ACKs.
 *
 * Automatically called from tcp_tmr().
 */
void tcp_fasttmr(void)
{
  struct tcp_pcb *pcb;

  ++tcp_timer_ctr;

tcp_fasttmr_start:
  pcb = tcp_active_pcbs;

  while(pcb != NULL) {
    if (pcb->last_timer != tcp_timer_ctr) {
      struct tcp_pcb *next;
      pcb->last_timer = tcp_timer_ctr;
      /* send delayed ACKs */
      if (pcb->flags & TF_ACK_DELAY) {
        tcp_ack_now(pcb);
        tcp_output(pcb);
        pcb->flags &= ~(TF_ACK_DELAY | TF_ACK_NOW);
      }

      next = pcb->next;

      /* If there is data which was previously "refused" by upper layer */
      if (pcb->refused_data != NULL) {
        tcp_active_pcbs_changed = 0;
        tcp_process_refused_data(pcb);
        if (tcp_active_pcbs_changed) {
          /* application callback has changed the pcb list: restart the loop */
          goto tcp_fasttmr_start;
        }
      }
      pcb = next;
    }
  }
}

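For TCP to work, these two timer functions must be called periodically. In LwIP the core has to call tcp_tmr every 250 ms, and tcp_tmr takes care of calling tcp_slowtmr and tcp_fasttmr. To make user programs easier to write, the core wraps tcp_tmr and all other periodically called functions inside sys_check_timeouts. Without an operating system emulation layer, the application should therefore call sys_check_timeouts at least once every 250 ms to keep the core's mechanisms working. The implementation of tcp_tmr is shown below: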

// rt-thread\components\net\lwip-1.4.1\src\core\tcp.c

/** Timer counter to handle calling slow-timer from tcp_tmr() */
static u8_t tcp_timer;

/**
 * Called periodically to dispatch TCP timers.
 */
void tcp_tmr(void)
{
  /* Call tcp_fasttmr() every 250 ms */
  tcp_fasttmr();

  if (++tcp_timer & 1) {
    /* Call tcp_slowtmr() every 500 ms, i.e., every other time
       tcp_tmr() is called. */
    tcp_slowtmr();
  }
}
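The every-other-call dispatch in tcp_tmr is easy to check with a small stand-alone model. In this sketch the functions fake_fasttmr, fake_slowtmr and fake_tcp_tmr and the two counters are hypothetical stand-ins that only count invocations; they do not touch any lwIP state.

```c
#include <stdint.h>

unsigned fast_calls;   /* how often the fast timer stand-in ran */
unsigned slow_calls;   /* how often the slow timer stand-in ran */
uint8_t tick_parity;   /* plays the role of the static tcp_timer counter */

void fake_fasttmr(void) { fast_calls++; }
void fake_slowtmr(void) { slow_calls++; }

/* Same structure as tcp_tmr(): the fast timer runs on every call, the slow
 * timer on every call where the incremented counter is odd. */
void fake_tcp_tmr(void)
{
    fake_fasttmr();
    if (++tick_parity & 1) {
        fake_slowtmr();
    }
}
```

Calling fake_tcp_tmr eight times, which models two seconds of 250 ms ticks, runs the fast stand-in eight times and the slow one four times. Because the counter is incremented before the test, the slow timer fires on the very first call.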

2.5 SYN Attacks

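SYN flooding is a widely used TCP-based DDoS attack technique. The victims are usually TCP servers that provide a fixed service on the network; because their port numbers and IP addresses are easy to obtain, they are easy targets. The attack can be explained with the behavior of tcp_listen_input described earlier. When the server receives a connection request, it cannot judge whether the client is legitimate. At the same time, it has to allocate a control block for the new connection, return a SYN+ACK segment to the peer, and wait for the peer's handshake ACK. If the request came from an attacker, that ACK never arrives (the source IP address in the SYN is forged), and the server must keep the half-open connection for a fairly long time before it can discard it as invalid.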

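If an attacker controls a large number of machines and has them all send SYN requests to the server at the same time, the server ties up a large amount of memory and time waiting for ACKs that will never come. Once the number of such connections grows large enough, the system has no resources left to answer new connections, legitimate users can no longer establish TCP connections, and the server stops providing normal service. This weakness in TCP's connection establishment handshake means that TCP servers on a network are inherently vulnerable to SYN attacks.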
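The resource exhaustion described above can be reproduced with a toy model of a fixed-size control block pool. Everything in this sketch is hypothetical (POOL_SIZE, struct half_open, on_syn, reap_half_open); the 40-tick limit corresponds to a 20000 ms SYN_RCVD timeout at 500 ms per tick, and the model ignores any strategy a real stack may use to recycle older control blocks under memory pressure.

```c
#include <stdint.h>

#define POOL_SIZE 4u         /* hypothetical; stands in for a small PCB pool */
#define SYN_RCVD_TICKS 40u   /* 20000 ms / 500 ms per slow-timer tick */

struct half_open {
    int in_use;
    uint32_t started;        /* tick at which the SYN was answered */
};

struct half_open pool[POOL_SIZE];

/* A SYN arrives: reserve a slot, or return -1 when the pool is exhausted. */
int on_syn(uint32_t now)
{
    for (unsigned i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            pool[i].started = now;
            return (int)i;
        }
    }
    return -1;
}

/* Slow-timer pass: release half-open entries whose handshake ACK never came. */
unsigned reap_half_open(uint32_t now)
{
    unsigned freed = 0;
    for (unsigned i = 0; i < POOL_SIZE; i++) {
        if (pool[i].in_use && (uint32_t)(now - pool[i].started) > SYN_RCVD_TICKS) {
            pool[i].in_use = 0;
            freed++;
        }
    }
    return freed;
}
```

Four forged SYNs at tick 0 fill the pool, so a legitimate SYN at tick 1 is refused. Nothing is reclaimed until tick 41, which is why an attacker who keeps sending SYNs faster than the timeout frees slots can hold the server unavailable.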