Spatial As Deep: Spatial CNN for Traffic Scene Understanding (Paper Translation)
Abstract
Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer-by-layer. Although CNN has shown strong capability to extract semantics from raw pixels, its capacity to capture spatial relationships of pixels across rows and columns of an image is not fully explored. These relationships are important for learning semantic objects with strong shape priors but weak appearance coherence, such as traffic lanes, which are often occluded or not even painted on the road surface, as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, thus enabling message passing between pixels across rows and columns in a layer. Such SCNN is particularly suitable for long continuous shape structures or large objects with strong spatial relationships but fewer appearance clues, such as traffic lanes, poles, and walls. We apply SCNN on a newly released, very challenging traffic lane detection dataset and the Cityscapes dataset¹. The results show that SCNN can learn the spatial relationship for structured output and significantly improves the performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6% respectively. Moreover, our SCNN won the 1st place on the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.
Introduction
In recent years, autonomous driving has received much attention in both academia and industry. One of the most challenging tasks of autonomous driving is traffic scene understanding, which comprises computer vision tasks like lane detection and semantic segmentation. Lane detection helps to guide vehicles and could be used in driving assistance systems (Urmson et al. 2008), while semantic segmentation provides more detailed positions of surrounding objects like vehicles or pedestrians. In real applications, however, these tasks can be very challenging given the many harsh scenarios, including bad weather conditions and dim or dazzling light. Another challenge of traffic scene understanding is that in many cases, especially in lane detection, we need to tackle objects with strong structure priors but few appearance clues, like lane markings and poles, which have long continuous shapes and might be occluded. For instance, in the first example in Fig. 1 (a), the car at the right side fully occludes the rightmost lane marking.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

¹ Code is available at https://github.com/XingangPan/SCNN
Figure 1: Comparison between CNN and SCNN in (a) lane detection and (b) semantic segmentation. For each example, from left to right are: input image, output of CNN, output of SCNN. It can be seen that SCNN can better capture the long continuous shape prior of lane markings and poles and fix the disconnected parts in the CNN output.
Although CNN based methods (Krizhevsky, Sutskever, and Hinton 2012; Long, Shelhamer, and Darrell 2015) have pushed scene understanding to a new level thanks to their strong representation learning ability, they still do not perform well on objects that have long structure regions and could be occluded, such as the lane markings and poles shown in the red bounding boxes in Fig. 1. Humans, however, can easily infer their positions and fill in the occluded parts from the context, i.e., the visible parts.
To address this issue, we propose Spatial CNN (SCNN), a generalization of deep convolutional neural networks to a rich spatial level. In a layer-by-layer CNN, a convolution layer receives input from the former layer, applies convolution and nonlinear activation, and sends the result to the next layer. This process is done sequentially. Similarly, SCNN views rows or columns of feature maps as layers and applies convolution, nonlinear activation, and sum operations sequentially, which forms a deep neural network. In this way, information can be propagated between neurons in the same layer. This is particularly useful for structured objects such as lanes, poles, or trucks with occlusions, since the spatial information can be reinforced via inter-layer propagation. As shown in Fig. 1, in cases where the CNN output is discontinuous or messy, SCNN can well preserve the smoothness and continuity of lane markings and poles. In our experiments, SCNN significantly outperforms other RNN or MRF/CRF based methods, and also gives better results than the much deeper ResNet-101 (He et al. 2016).
Figure 2: (a) Dataset examples for different scenarios. (b) Proportion of each scenario.
Related Work. For lane detection, most existing algorithms are based on hand-crafted low-level features (Aly 2008; Son et al. 2015; Jung, Youn, and Sull 2016), limiting their capability to deal with harsh conditions. Only Huval et al. (2015) made a preliminary attempt to adopt deep learning in lane detection, but without a large and general dataset. For semantic segmentation, CNN based methods have become mainstream and achieved great success (Long, Shelhamer, and Darrell 2015; Chen et al. 2017).
There have been some other attempts to utilize spatial information in neural networks. Visin et al. (2015) and Bell et al. (2016) used recurrent neural networks to pass information along each row or column; thus, in one RNN layer each pixel position could only receive information from the same row or column. Liang et al. (2016a; 2016b) proposed variants of LSTM to exploit contextual information in semantic object parsing, but such models are computationally expensive. Researchers have also attempted to combine CNN with graphical models like MRF or CRF, in which message passing is realized by convolution with large kernels (Liu et al. 2015; Tompson et al. 2014; Chu et al. 2016). SCNN has three advantages over these aforementioned methods: (1) the sequential message passing scheme is much more computationally efficient than traditional dense MRF/CRF, (2) the messages are propagated as residuals, making SCNN easy to train, and (3) SCNN is flexible and can be applied to any level of a deep neural network.
Spatial Convolutional Neural Network
Lane Detection Dataset
In this paper, we present a large-scale challenging dataset for traffic lane detection. Despite the importance and difficulty of traffic lane detection, existing datasets are either too small or too simple, and a large public annotated benchmark is needed to compare different methods (Bar Hillel et al. 2014). KITTI (Fritsch, Kuhnl, and Geiger 2013) and CamVid (Brostow et al. 2008) contain pixel-level annotations for lanes/lane markings, but have merely hundreds of images, too small for deep learning methods. The Caltech Lanes Dataset (Aly 2008) and the recently released TuSimple Benchmark Dataset (TuSimple 2017) consist of 1224 and 6408 images with annotated lane markings respectively, but the traffic is in a constrained scenario, with light traffic and clear lane markings. Besides, none of these datasets annotates lane markings that are occluded or unseen because of abrasion, although such lane markings can be inferred by humans and are of high value in real applications. To collect data, we mounted cameras on six different vehicles driven by different drivers and recorded videos during driving in Beijing on different days. More than 55 hours of videos were collected and 133,235 frames were extracted, more than 20 times the size of the TuSimple dataset. We divided the dataset into 88,880 images for the training set, 9,675 for the validation set, and 34,680 for the test set. These images were undistorted using the tools in (Scaramuzza, Martinelli, and Siegwart 2006) and have a resolution of . Fig. 2 (a) shows some examples, which comprise urban, rural, and highway scenes. As one of the largest and most crowded cities in the world, Beijing provides many challenging traffic scenarios for lane detection. We divided the test set into a normal category and 8 challenging categories, which correspond to the 9 examples in Fig. 2 (a). Fig. 2 (b) shows the proportion of each scenario. It can be seen that the 8 challenging scenarios account for most (72.3%) of the dataset.
For each frame, we manually annotate the traffic lanes with cubic splines. As mentioned earlier, in many cases lane markings are occluded by vehicles or are unseen. In real applications it is important that lane detection algorithms can estimate lane positions from the context even in these challenging scenarios, which occur frequently. Therefore, for these cases we still annotate the lanes according to the context, as shown in Fig. 2 (a) (2)(4). We also hope that our algorithm can distinguish barriers on the road, like the one in Fig. 2 (a) (1); thus the lanes on the other side of the barrier are not annotated. In this paper we focus our attention on the detection of four lane markings, which receive the most attention in real applications. Other lane markings are not annotated.
Spatial CNN
Traditional methods to model spatial relationships are based on Markov Random Fields (MRF) or Conditional Random Fields (CRF) (Krähenbühl and Koltun 2011). Recent works (Zheng et al. 2015; Liu et al. 2015; Chen et al. 2017) that combine them with CNN all follow the pipeline of Fig. 3 (a), where the mean field algorithm can be implemented with neural networks. Specifically, the procedure is: (1) Normalize: the output of CNN is viewed as unary potentials and is normalized by the Softmax operation; (2) Message Passing, which can be realized by channel-wise convolution with large kernels (for dense CRF, the kernel size would cover the whole image and the kernel weights depend on the input image); (3) Compatibility Transform, which can be implemented with a convolution layer; and (4) Adding unary potentials. This process is iterated N times to give the final output.
Figure 3: (a) MRF/CRF based method. (b) Our implementation of Spatial CNN. MRF/CRF are theoretically applied to unary potentials, whose channel number equals the number of classes to be classified, while SCNN can be applied to the top hidden layers, which carry richer information.
It can be seen that in the message passing process of traditional methods, each pixel receives information from all other pixels, which is computationally expensive and hard to use in real-time tasks such as autonomous driving. For MRF, the large convolution kernel is hard to learn and usually requires careful initialization (Tompson et al. 2014; Liu et al. 2015). Moreover, these methods are applied to the output of CNN, while the top hidden layer, which comprises richer information, might be a better place to model spatial relationships.
To address these issues, and to more efficiently learn the spatial relationship and the smooth, continuous prior of lane markings, or of other structured objects in the driving scenario, we propose Spatial CNN. Note that 'spatial' here is not the same as in 'spatial convolution', but denotes propagating spatial information via a specially designed CNN structure.
As shown in the 'SCNN_D' module of Fig. 3 (b), consider a SCNN applied on a 3-D tensor of size C × H × W, where C, H, and W denote the number of channels, rows, and columns respectively. The tensor is split into H slices, and the first slice is then sent into a convolution layer with C kernels of size C × w, where w is the kernel width. In a traditional CNN the output of a convolution layer would be fed into the next layer, while here the output is added to the next slice to provide a new slice. The new slice is then sent to the next convolution layer, and this process continues until the last slice is updated.
Specifically, assume we have a 3-D kernel tensor K, with element K_{i,j,k} denoting the weight between an element in channel i of the last slice and an element in channel j of the current slice, with an offset of k columns between the two elements. Also denote the element of the input 3-D tensor X as X_{i,j,k}, where i, j, and k indicate the indexes of channel, row, and column respectively. Then the forward computation of SCNN is:

$$X'_{i,j,k} = \begin{cases} X_{i,j,k}, & j = 1 \\ X_{i,j,k} + f\left(\sum_{m}\sum_{n} X'_{m,\,j-1,\,k+n-1} \cdot K_{m,i,n}\right), & j = 2, 3, \ldots, H \end{cases} \tag{1}$$
where f is a nonlinear activation function such as ReLU. The X with superscript ′ denotes an element that has been updated. Note that the convolution kernel weights are shared across all slices; thus SCNN is a kind of recurrent neural network. Also note that SCNN has directions: in Fig. 3 (b), the four 'SCNN' modules with suffixes 'D', 'U', 'R', and 'L' denote SCNN that propagates downward, upward, rightward, and leftward respectively.
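To make the slice-by-slice recurrence of Eq. (1) concrete, here is a minimal pure-Python sketch of the downward pass (SCNN_D). The nested-list layout, zero padding at the image borders, and centering of the w-wide kernel are our assumptions for illustration; a real implementation would operate on GPU feature maps.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def scnn_downward(X, K):
    """Downward Spatial CNN pass, a sketch of Eq. (1).

    X: input tensor as nested lists of shape [C][H][W].
    K: kernel tensor of shape [C][C][w]; K[m][i][n] weights channel m
       of the previous (already updated) slice against channel i of the
       current slice at column offset n (kernel assumed centered).
    Returns the updated tensor X'.
    """
    C, H, W = len(X), len(X[0]), len(X[0][0])
    w = len(K[0][0])
    pad = w // 2
    # Copy the input; the first row slice (j = 0) is left unchanged.
    Xp = [[row[:] for row in chan] for chan in X]
    for j in range(1, H):              # propagate top -> bottom
        for i in range(C):             # output channel
            for k in range(W):         # column
                s = 0.0
                for m in range(C):     # channel of the previous slice
                    for n in range(w): # position within kernel width
                        col = k + n - pad
                        if 0 <= col < W:  # zero padding outside borders
                            s += Xp[m][j - 1][col] * K[m][i][n]
                # the message is added as a residual after the nonlinearity
                Xp[i][j][k] = X[i][j][k] + relu(s)
    return Xp
```

Because each row reads the already-updated previous row, a value injected at the top row can reach every row below it in a single pass, which is exactly the sequential propagation the text describes.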
Analysis
Spatial CNN has three main advantages over traditional methods, summarized as follows.
(1) Computational efficiency. As shown in Fig. 4, in dense MRF/CRF each pixel receives messages from all other pixels directly, which can have much redundancy, while in SCNN message passing is realized in a sequential propagation scheme. Specifically, assume a tensor with H rows and W columns; then in dense MRF/CRF there is message passing between every two of the W × H pixels. For n_iter iterations, the number of message passings is n_iter W²H². In SCNN, each pixel only receives information from w pixels, thus the number of message passings is n_dir WHw, where n_dir and w denote the number of propagation directions in SCNN and the kernel width of SCNN respectively. n_iter could range from 10 to 100, while n_dir in this paper is set to 4, corresponding to the 4 directions, and w is usually no larger than 10 (as in the example in Fig. 4 (b)). It can be seen that for images with hundreds of rows and columns, SCNN saves much computation, while each pixel can still receive messages from all other pixels via message propagation along the 4 directions.
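The two message budgets above can be compared with a quick back-of-the-envelope calculation (the feature-map size below is an illustrative choice of ours, not a value from the paper):

```python
def dense_crf_messages(W, H, n_iter):
    """Dense MRF/CRF: in each of the n_iter mean-field iterations,
    every pair among the W*H pixels exchanges a message."""
    return n_iter * (W * H) ** 2

def scnn_messages(W, H, n_dir, w):
    """SCNN: each pixel receives messages from only w pixels,
    along each of the n_dir propagation directions."""
    return n_dir * W * H * w
```

For a 100 × 60 feature map with n_iter = 10, n_dir = 4, and w = 9, dense CRF needs 360,000,000 messages while SCNN needs only 216,000, and the gap grows quadratically with image size.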
Figure 4: Message passing directions in (a) dense MRF/CRF and (b) Spatial CNN (rightward). For (a), only messages passed to the inner 4 pixels are shown, for clarity.
(2) Message as residual. In MRF/CRF, message passing is achieved via a weighted sum over all pixels, which, as discussed in the former paragraph, is computationally expensive. And recurrent neural network based methods might suffer from vanishing or exploding gradients (Pascanu, Mikolov, and Bengio 2013), considering so many rows or columns. However, deep residual learning (He et al. 2016) has shown its capability to ease the training of very deep neural networks. Similarly, in our deep SCNN messages are propagated as residuals, i.e., the output of the ReLU in Eq. (1). Such residuals can also be viewed as a kind of modification to the original neurons. As our experiments will show, such a message passing scheme achieves better results than LSTM based methods.
(3) Flexibility. Thanks to the computational efficiency of SCNN, it can easily be incorporated into any part of a CNN, rather than just the output. Usually, the top hidden layer contains information that is both rich and highly semantic, and thus is an ideal place to apply SCNN. Fig. 3 shows our implementation of SCNN on the LargeFOV (Chen et al. 2017) model. SCNNs in the four spatial directions are added sequentially right after the top hidden layer (the 'fc7' layer) to introduce spatial message propagation.
Experiment
We evaluate SCNN on our lane detection dataset and Cityscapes (Cordts et al. 2016). In both tasks, we train the models using standard SGD with batch size 12, base learning rate 0.01, momentum 0.9, and weight decay 0.0001. The learning rate policy is "poly", with the power and maximum iteration number set to 0.9 and 60K respectively. Our models are modified from the LargeFOV model in (Chen et al. 2017). The initial weights of the first 13 convolution layers are copied from VGG16 (Simonyan and Zisserman 2015) trained on ImageNet (Deng et al. 2009). All experiments are implemented on the Torch7 (Collobert, Kavukcuoglu, and Farabet 2011) framework.
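The "poly" learning rate policy mentioned above is the standard schedule used in DeepLab; a sketch with variable names of our choosing:

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """'Poly' policy: the learning rate decays from base_lr to 0
    following base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# Settings from the paper: base_lr = 0.01, power = 0.9, max_iter = 60K.
```

The rate starts at the base value, decays slowly early in training, and reaches exactly zero at the final iteration.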
Figure 5: (a) Training model, (b) Lane prediction process. 'Conv', 'HConv', and 'FC' denote convolution layer, atrous convolution layer (Chen et al. 2017), and fully connected layer respectively. 'c', 'w', and 'h' denote the number of output channels, the kernel width, and the 'rate' for atrous convolution respectively.
Lane Detection
Lane detection model. Unlike common object detection tasks that only require bounding boxes, lane detection requires precise prediction of curves. A natural idea is that the model should output probability maps (probmaps) of these curves; thus we generate pixel-level targets to train the networks, as in semantic segmentation tasks. Instead of viewing different lane markings as one class and clustering afterwards, we want the neural network to distinguish different lane markings by itself, which could be more robust. Thus the four lanes are viewed as different classes. Moreover, the probmaps are then sent to a small network that predicts the existence of each lane marking.
During testing, we still need to go from probmaps to curves. As shown in Fig. 5 (b), for each lane marking whose existence value is larger than 0.5, we search the corresponding probmap every 20 rows for the position with the highest response. These positions are then connected by cubic splines, which are the final predictions.
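The prediction step above can be sketched as follows. The row stride of 20 and the 0.5 existence threshold come from the text; the data layout is our assumption, and the final cubic-spline fit through the sampled points is omitted:

```python
def probmap_to_points(probmap, existence, row_stride=20, exist_thr=0.5):
    """For each lane whose existence score exceeds exist_thr, scan its
    probability map every row_stride rows and keep the column with the
    highest response.

    probmap: nested lists of shape [num_lanes][H][W];
    existence: list of num_lanes scores in [0, 1].
    Returns, per lane, a list of (row, col) points to be connected by
    cubic splines afterwards, or None if the lane is absent."""
    lanes = []
    for lane_map, score in zip(probmap, existence):
        if score <= exist_thr:
            lanes.append(None)        # lane judged not to exist
            continue
        points = []
        for r in range(0, len(lane_map), row_stride):
            row = lane_map[r]
            best = max(range(len(row)), key=row.__getitem__)
            points.append((r, best))  # strongest response in this row
        lanes.append(points)
    return lanes
```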
As shown in Fig. 5 (a), the detailed differences between our baseline model and LargeFOV are: (1) the output channel number of the 'fc7' layer is set to 128, (2) the 'rate' for the atrous convolution layer of 'fc6' is set to 4, (3) batch normalization (Ioffe and Szegedy 2015) is added before each ReLU layer, and (4) a small network is added to predict the existence of lane markings. During training, the line width of the targets is set to 16 pixels, and the input and target images are rescaled to . Considering the imbalanced labels between background and lane markings, the loss of the background is multiplied by 0.4.
Figure 6: Evaluation based on IoU. Green lines denote ground truth, while blue and red lines denote TP and FP respectively.
Evaluation. In order to judge whether a lane marking is successfully detected, we view lane markings as lines with widths equal to 30 pixels and calculate the intersection-over-union (IoU) between the ground truth and the prediction. Predictions whose IoUs are larger than a certain threshold are viewed as true positives (TP), as shown in Fig. 6. Here we consider 0.3 and 0.5 thresholds, corresponding to loose and strict evaluations. We then employ the F-measure as the final evaluation index:

$$F = \frac{(1+\beta^2)\,\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Precision} = \frac{TP}{TP+FP}, \qquad \mathrm{Recall} = \frac{TP}{TP+FN}.$$

Here β is set to 1, corresponding to the harmonic mean (F1-measure).
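Under our reading, the metric can be sketched as below, assuming the 30-pixel-wide lane lines have already been rasterized elsewhere into binary masks:

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 0.0

def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from TP/FP/FN counts; beta = 1 gives the harmonic
    mean of precision and recall (the F1-measure)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

A predicted lane counts as a TP when its `mask_iou` against the ground-truth lane exceeds the chosen threshold (0.3 or 0.5); the TP/FP/FN counts over the test set then feed `f_measure`.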
Ablation Study. In section 2.2 we proposed Spatial CNN to enable spatial message propagation. To verify our method, we make detailed ablation studies in this subsection. Our implementation of SCNN follows that shown in Fig. 3.

(1) Effectiveness of multidirectional SCNN. Firstly, we investigate the effects of directions in SCNN. We try SCNN with different direction implementations; the results are shown in Table 1. Here the kernel width w of SCNN is set to 5. It can be seen that the performance increases as more directions are added. To prove that the improvement does not result from more parameters but from the message passing scheme brought about by SCNN, we add an extra convolution layer with kernel width after the top hidden layer of the baseline model and compare it with our method. From the results we can see that an extra convolution layer brings merely little improvement, which verifies the effectiveness of SCNN.
(2) Effects of kernel width w. We further try SCNN with different kernel widths based on the "SCNN_DULR" model, as shown in Table 2. Here the kernel width denotes the number of pixels that a pixel can receive messages from, and the w = 1 case is similar to the methods in (Visin et al. 2015; Bell et al. 2016). The results show that larger w is beneficial, and w = 9 gives a satisfactory result, which surpasses the baseline by significant margins of 8.4% and 3.2% under the two IoU thresholds.
(3) Spatial CNN on different positions. As mentioned earlier, SCNN can be added to any place of a neural network. Here we consider the SCNN_DULR model applied on (1) the output and (2) the top hidden layer, which correspond to Fig. 3. The results in Table 3 indicate that the top hidden layer, which comprises richer information than the output, turns out to be a better position to apply SCNN.
(4) Effectiveness of sequential propagation. In our SCNN, information is propagated in a sequential way, i.e., a slice does not pass information to the next slice until it has received information from former slices. To verify the effectiveness of this scheme, we compare it with parallel propagation, i.e., each slice passes information to the next slice simultaneously, before being updated. For this parallel case, the superscript ′ on the X in the right part of Eq. (1) is removed. As Table 4 shows, the sequential message passing scheme outperforms the parallel scheme significantly. This result indicates that in SCNN a pixel is not merely affected by nearby pixels, but indeed receives information from farther positions.
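The difference between the two schemes can be seen in a one-channel, one-column toy example of our own construction: with sequential updates a value at the top row reaches every later row in one pass, while parallel updates only move it one row.

```python
def propagate(rows, weight, sequential=True):
    """1-D toy of downward propagation: each row adds
    ReLU(weight * previous_row). The sequential scheme reads the
    already-updated previous row; the parallel scheme reads the
    original (un-updated) values."""
    src = rows[:]   # snapshot of the original values
    out = rows[:]
    for j in range(1, len(rows)):
        prev = out[j - 1] if sequential else src[j - 1]
        out[j] = rows[j] + max(0.0, weight * prev)
    return out

# A single activation at the top row:
# propagate([5.0, 0.0, 0.0], 1.0, sequential=True)   -> [5.0, 5.0, 5.0]
# propagate([5.0, 0.0, 0.0], 1.0, sequential=False)  -> [5.0, 5.0, 0.0]
```

The sequential variant carries the top-row signal all the way down, matching the observation that a pixel receives information from positions far beyond its immediate neighbors.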
(5) Comparison with state-of-the-art methods. To further verify the effectiveness of SCNN in lane detection, we compare it with several methods: the RNN-based ReNet (Visin et al. 2015), the MRF-based MRFNet, DenseCRF (Krähenbühl and Koltun 2011), and very deep residual networks (He et al. 2016). For ReNet, which is based on LSTM, we replace the 'SCNN' layers in Fig. 3 with two ReNet layers: one layer to pass horizontal information and the other to pass vertical information. For DenseCRF, we use dense CRF as post-processing and employ 10 mean-field iterations as in (Chen et al. 2017). For MRFNet, we use the implementation in Fig. 3 (a), with the iteration number and message passing kernel size set to 10 and 20 respectively. The main difference between the MRF here and CRF is that the weights of the message passing kernels are learned during training rather than depending on the input image. For ResNet, our implementation is the same as (Chen et al. 2017) except that we do not use the ASPP module. For SCNN, we add the SCNN_DULR module to the baseline, and the kernel width w is 9. The test results on different scenarios are shown in Table 5, and visualizations are given in Fig. 7.
Figure 7: Comparison between probmaps of baseline, ReNet, MRFNet, ResNet-101, and SCNN.
From the results, we can see that the performance of ReNet is not even comparable with SCNN_DULR with w = 1, indicating the effectiveness of our residual message passing scheme. Interestingly, DenseCRF leads to worse results here, because lane markings usually have few appearance clues, so that dense CRF cannot distinguish lane markings from background. In contrast, with kernel weights learned from data, MRFNet can to some extent smooth the results and improve performance, as Fig. 7 shows, but is still not very satisfactory. Furthermore, our method even outperforms the much deeper ResNet-50 and ResNet-101. Despite its over one hundred layers and very large receptive field, ResNet-101 still gives messy or discontinuous outputs in challenging cases, while our method, with only 16 convolution layers plus 4 SCNN layers, preserves the smoothness and continuity of lane lines better. This demonstrates the much stronger capability of SCNN to capture the structure prior of objects compared with traditional CNN.
(6) Computational efficiency over other methods. In the Analysis section we give a theoretical analysis of the computational efficiency of SCNN over dense CRF. To verify this, we compare their runtimes experimentally. The results are shown in Table 6, where the runtime of the LSTM in ReNet is also given. Here the runtime does not include that of the backbone network. For SCNN, we test both the practical case and the case with the same setting as dense CRF. In the practical case, SCNN is applied on the top hidden layer, thus the input has more channels but smaller height and width. In the fair comparison case, the input size is modified to be the same as that in dense CRF, and both methods are tested on CPU. The results show that even in the fair comparison case, SCNN is over 4 times faster than dense CRF, despite the efficient implementation of dense CRF in (Krähenbühl and Koltun 2011). This is because SCNN significantly reduces redundancy in message passing, as in Fig. 4. Also, SCNN is more efficient than LSTM, whose gate mechanism requires more computation.
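The redundancy argument can be made concrete with rough multiply-add counts. The sketch below compares SCNN's directional passes against a naive fully-connected CRF message-passing step; the feature-map sizes and label count are illustrative assumptions, not the paper's exact settings, and real dense CRF implementations such as (Krähenbühl and Koltun 2011) avoid the quadratic cost with high-dimensional filtering.

```python
def scnn_flops(h, w, c, kw, n_dir=4):
    """Multiply-adds for SCNN on a (c, h, w) feature map: each of the n_dir
    directional passes applies a shared (c, c, kw) kernel once per output
    element, so the cost is linear in the number of pixels."""
    return n_dir * h * w * c * c * kw

def dense_crf_flops(h, w, n_labels, n_iter=10):
    """Naive fully-connected CRF message passing: every pixel exchanges a
    message with every other pixel, per label, per iteration -- quadratic
    in the number of pixels."""
    n = h * w
    return n_iter * n * n * n_labels

# Illustrative comparison: an 8x-downsampled 288x800 map with 128 channels
# for SCNN vs a naive dense CRF over the full-resolution label map.
print(scnn_flops(36, 100, 128, 9))   # ~2.1e9 multiply-adds
print(dense_crf_flops(288, 800, 5))  # ~2.7e12 -- orders of magnitude more
```

The gap shows why passing messages only between adjacent slices, as SCNN does, removes most of the redundancy of all-pairs message passing.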
² Intel Core i7-4790K CPU  ³ GeForce GTX TITAN Black
Figure 8: Visual improvements on the Cityscapes validation set. For each example, from left to right: input image, ground truth, result of LargeFOV, result of LargeFOV+SCNN.
Semantic Segmentation on Cityscapes
To demonstrate the generality of our method, we also evaluate Spatial CNN on Cityscapes (Cordts et al. 2016). Cityscapes is a standard benchmark dataset for semantic segmentation on urban traffic scenes. It contains 5000 finely annotated images, including 2975 for training, 500 for validation, and 1525 for testing. 19 categories are defined, including both stuff and objects. We use two classic models, the LargeFOV and ResNet-101 variants in DeepLab (Chen et al. 2017), as the baselines. Batch normalization layers (Ioffe and Szegedy 2015) are added to LargeFOV to enable faster convergence. For both models, the channel number of the top hidden layer is modified to 128 to make them more compact.
We add SCNN to the baseline models in the same way as in lane detection. The comparisons between the baselines and those combined with the SCNN_DULR models with kernel width are shown in Table 7. It can be seen that SCNN could also improve semantic segmentation results. With SCNNs added, the IoUs for all classes are at least comparable to the baselines, while the "wall", "pole", "truck", "bus", "train", and "motor" categories achieve significant improvement. This is because for long-shaped objects like train and pole, SCNN could capture their continuous structure and connect the disconnected parts, as shown in Fig. 8. And for wall, truck, and bus, which can occupy a large image area, the diffusion effect of SCNN could correct the parts that are misclassified according to the context. This shows that SCNN is useful not only for long thin structures, but also for large objects which require global information to be classified correctly. There is another interesting phenomenon: the head of the vehicle at the bottom of the images, whose label is ignored during training, is in a mess in LargeFOV, while with SCNN added it is classified as road. This is also due to the diffusion effect of SCNN, which passes the information of the road to the vehicle head area.
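The diffusion effect described above comes from chaining all four directional passes on the top hidden layer. The sketch below mimics that wiring: one downward pass is reused for the other directions by flipping and transposing the feature map. The scalar weight is a stand-in for the real learned (C, C, w) slice convolution, so this illustrates the data flow, not the trained module, and the ordering of the four passes is one plausible reading of the DULR naming.

```python
import numpy as np

def pass_down(x, weight):
    # One directional pass: each row residually adds ReLU(previous row),
    # scaled by a shared weight (stand-in for the learned slice convolution).
    out = x.copy()
    for i in range(1, out.shape[1]):
        out[:, i, :] += np.maximum(weight * out[:, i - 1, :], 0.0)
    return out

def scnn_dulr(x, weight=1.0):
    """Apply four directional passes to a (C, H, W) feature map by
    flipping/transposing so one downward implementation serves all."""
    x = pass_down(x, weight)                        # downward
    x = pass_down(x[:, ::-1, :], weight)[:, ::-1]   # upward
    x = x.transpose(0, 2, 1)                        # columns become rows
    x = pass_down(x, weight)                        # rightward
    x = pass_down(x[:, ::-1, :], weight)[:, ::-1]   # leftward
    return x.transpose(0, 2, 1)

# A single activation in one corner diffuses to every position of the map --
# the effect that reconnects broken poles and trains, and relabels the
# ignored vehicle-head region from surrounding road context.
x = np.zeros((1, 2, 2))
x[0, 0, 0] = 1.0
y = scnn_dulr(x)
```

After the four passes, every position of `y` is nonzero even though the input had a single active pixel, which is exactly the global-context propagation the segmentation results rely on.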
To compare our method with other MRF/CRF based methods, we evaluate LargeFOV+SCNN on the Cityscapes test set, and compare with methods that also use VGG16 (Simonyan and Zisserman 2015) as the backbone network. The results are shown in Table 8. Here LargeFOV, DPN, and our method use dense CRF, dense MRF, and SCNN respectively, and share nearly the same base CNN part. The results show that our method achieves significantly better performance.
Conclusion
In this paper, we propose Spatial CNN, a CNN-like scheme to achieve effective information propagation at the spatial level. SCNN can be easily incorporated into deep neural networks and trained end-to-end. It is evaluated on two tasks in traffic scene understanding: lane detection and semantic segmentation. The results show that SCNN can effectively preserve the continuity of long thin structures, while in semantic segmentation its diffusion effects are also proved to be beneficial for large objects. Specifically, by introducing SCNN into the LargeFOV model, our 20-layer network outperforms ReNet, MRF, and the very deep ResNet-101 in lane detection. Last but not least, we believe that the large challenging lane detection dataset we presented will push forward research on autonomous driving.
該翻譯選自
http://tongtianta.site/
總結
以上是生活随笔為你收集整理的Spatial As Deep: Spatial CNN for Traffic Scene Understanding论文翻译的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 卷积神经网络原理图文详解
- 下一篇: Photoshop抠图、污点处理等常用功