Object Detection with Deep Learning: 12 Papers You Should Read to Understand Object Detection in the Deep Learning Era
Foreword
As the second article in the “Papers You Should Read” series, we are going to walk through both the history and some recent developments in a more difficult area of computer vision research: object detection. Before the deep learning era, hand-crafted features like HOG and feature pyramids were used pervasively to capture localization signals in an image. However, those methods usually don’t extend well to generic object detection, so most applications were limited to face or pedestrian detection. With the power of deep learning, we can train a network to learn which features to capture, as well as what coordinates to predict for an object. This eventually led to a boom of applications based on visual perception, such as commercial face recognition systems and autonomous vehicles. In this article, I picked 12 must-read papers for newcomers who want to study object detection. Although the most challenging part of building an object detection system hides in the implementation details, reading these papers can still give you a good high-level understanding of where the ideas come from and how object detection might evolve in the future.
As a prerequisite for reading this article, you need to know the basic idea of the convolutional neural network and common optimization methods such as gradient descent with back-propagation. It’s also highly recommended to read my previous article “10 Papers You Should Read to Understand Image Classification in the Deep Learning Era” first, because many cool ideas in object detection originate from more fundamental image classification research.
2013: OverFeat
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
Inspired by the early success of AlexNet in the 2012 ImageNet competition, where CNN-based feature extraction defeated all hand-crafted feature extractors, OverFeat quickly brought CNNs into the object detection area as well. The idea is very straightforward: if we can classify one image using a CNN, what about greedily sliding windows of different sizes over the whole image, and regressing and classifying them one by one using a CNN? This leverages the power of CNNs for feature extraction and classification, and also bypasses the hard region proposal problem by using pre-defined sliding windows. Also, since neighboring windows share part of their convolution results, there is no need to recompute convolutions for the overlapping area, which reduces the cost a lot. OverFeat is a pioneer of the one-stage object detector. It tried to combine feature extraction, location regression, and region classification in the same CNN. Unfortunately, such a one-stage approach also suffers from relatively poor accuracy because it uses less prior knowledge. Thus, OverFeat failed to start a wave of one-stage detector research until a much more elegant solution came out two years later.
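To make the pre-defined sliding windows concrete, here is a minimal sketch that only enumerates the candidate boxes. The window sizes and stride are illustrative assumptions, not OverFeat’s actual settings, and the sketch does not model the shared convolution trick; it only shows why no separate region proposal step is needed.

```python
def sliding_windows(img_w, img_h, window_sizes=(64, 128), stride=32):
    """Yield square candidate boxes (x1, y1, x2, y2) over an image.

    window_sizes and stride are hypothetical values chosen for illustration.
    """
    for size in window_sizes:
        # Slide each window size over every position that fits in the image.
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield (x, y, x + size, y + size)


windows = list(sliding_windows(128, 128))
print(len(windows), windows[0], windows[-1])
```

Every window would then be classified and regressed by the same CNN, which is what makes the naive version expensive and the shared computation important.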
2013: R-CNN
Region-based Convolutional Networks for Accurate Object Detection and Segmentation
Also proposed in 2013, R-CNN came a bit later than OverFeat. However, this region-based approach eventually led to a big wave of object detection research with its two-stage framework, i.e., a region proposal stage followed by a region classification and refinement stage.
[Figure: from “Region-based Convolutional Networks for Accurate Object Detection and Segmentation”]
In the above diagram, R-CNN first extracts potential regions of interest from an input image by using a technique called selective search. Selective search doesn’t really try to understand the foreground object. Instead, it groups similar pixels by relying on a heuristic: similar pixels usually belong to the same object. Therefore, the results of selective search have a very high probability of containing something meaningful. Next, R-CNN warps these region proposals into fixed-size images with some padding, and feeds these images into the second stage of the network for more fine-grained recognition. Unlike older methods built on selective search, R-CNN replaced HOG with a CNN to extract features from all region proposals in its second stage. One caveat of this approach is that many region proposals don’t really contain a full object, so R-CNN needs to learn not only to classify the right classes, but also to reject the negative ones. To solve this problem, R-CNN treated all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives, and the rest as negatives.
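The IoU-based labeling rule can be sketched in a few lines. This is a simplified illustration, assuming boxes are given as (x1, y1, x2, y2) corner tuples; the function names are made up for this example, and only the 0.5 threshold comes from the text above.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_boxes, pos_thresh=0.5):
    """Label a proposal 1 if it overlaps any ground-truth box by >= pos_thresh, else 0."""
    return [
        1 if any(iou(p, g) >= pos_thresh for g in gt_boxes) else 0
        for p in proposals
    ]


gt = [(0, 0, 10, 10)]
print(label_proposals([(0, 0, 10, 8), (20, 20, 30, 30)], gt))
```

A real training pipeline would also record which ground-truth box (and therefore which class) each positive proposal matched; the sketch keeps only the positive/negative decision.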
Region proposals from selective search depend heavily on the similarity assumption, so they can only provide a rough estimate of location. To further improve localization accuracy, R-CNN borrowed an idea from “Deep Neural Networks for Object Detection” (aka DetectorNet) and introduced an additional bounding box regression to predict the center coordinates, width, and height of a box. This regressor is widely used in later object detectors.
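As a rough illustration of what such a regressor learns, the sketch below uses the commonly cited parameterization: center offsets are normalized by the proposal size, and width and height are predicted in log space. The function names are hypothetical, and boxes are assumed to be (center x, center y, width, height).

```python
import math


def encode_box(proposal, gt):
    """Regression targets (tx, ty, tw, th) that map a proposal onto a ground-truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def decode_box(proposal, deltas):
    """Apply predicted (tx, ty, tw, th) to a proposal to get the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))


proposal = (50.0, 50.0, 20.0, 40.0)
target = encode_box(proposal, (55.0, 60.0, 40.0, 20.0))
print(target, decode_box(proposal, target))
```

Normalizing by the proposal size keeps the targets in a similar numeric range for small and large boxes, which is one reason this encoding has been reused so widely.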
However, a two-stage detector like R-CNN suffers from two big issues: 1) It’s not fully convolutional, because selective search is a separate step that is not end-to-end (E2E) trainable. 2) The region proposal stage is usually very slow compared with one-stage detectors like OverFeat, and running the CNN on each region proposal separately makes it even slower. Later, we will see how R-CNN evolved over time to address these two issues.
2015: Fast R-CNN
Fast R-CNN
A quick follow-up to R-CNN was to reduce the duplicate convolution over multiple region proposals. Since these region proposals all come from one image, it’s natural to improve R-CNN by running the CNN over the entire image once and sharing the computation among many region proposals. However, different region proposals have different sizes, which results in different output feature map sizes if we use the same CNN feature extractor. These feature maps of various sizes prevent us from using fully connected (FC) layers for further classification and regression, because an FC layer only works with a fixed-size input.
Fortunately, a paper called “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition” (SPPNet) had already solved the dynamic scale issue for FC layers. In SPPNet, a spatial pyramid pooling layer is introduced between the convolution layers and the FC layers to create a bag-of-words style feature vector. This vector has a fixed size and encodes features from different scales, so the convolution layers can now take images of any size as input without worrying about the incompatibility of the FC layer. Inspired by this, Fast R-CNN proposed a similar layer called the ROI Pooling layer. This pooling layer downsamples feature maps of different sizes into a fixed-size vector. By doing so, we can now use the same FC layers for classification and box regression, no matter how large or small the ROI is.
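To see how a region of any size can end up as a fixed-size output, here is a minimal single-channel ROI max pooling sketch in plain Python. It assumes the ROI is already expressed in feature-map coordinates with an exclusive lower-right corner, and the bin-splitting rule (floor for the start, ceil for the end) is one reasonable choice rather than the exact rule of any particular implementation.

```python
import math


def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (x1, y1, x2, y2) of a 2D feature map to out_h x out_w.

    x2 and y2 are exclusive; the ROI must be at least 1 pixel wide and tall.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range covered by output bin i (bins may overlap by one row).
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 4, 4)))  # full map
print(roi_max_pool(fmap, (1, 1, 4, 3)))  # a 3x2 region, same 2x2 output shape
```

Both calls return a 2x2 grid even though the regions differ in size, which is exactly the property the FC layers need.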
With a shared feature extractor and the scale-invariant ROI pooling layer, Fast R-CNN can reach a similar localization accuracy while training 10~20x faster and running inference 100~200x faster. The near real-time inference and an easier E2E training protocol for the detection part made Fast R-CNN a popular choice in the industry as well.
YOLO’s dense prediction over the entire image can be computationally expensive, so YOLO borrowed the bottleneck structure from GoogLeNet to avoid this issue. Another problem of YOLO is that two objects might fall into the same coarse grid cell, so it doesn’t work well with small objects such as a flock of birds. Despite lower accuracy, YOLO’s straightforward design and real-time inference ability made one-stage object detection popular again in research, and also a go-to solution for the industry.
2015: Faster R-CNN
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
As we introduced above, in early 2015, Ross Girshick proposed an improved version of R-CNN called Fast R-CNN by using a shared feature extractor for proposed regions. Just a few months later, Ross and his team came back with another improvement. This new network, Faster R-CNN, is not only faster than previous versions but also marks a milestone for object detection with deep learning.
With Fast R-CNN, the only non-convolutional piece of the network is the selective search region proposal. As of 2015, researchers started to realize that deep neural networks are so magical that they can learn anything given enough data. So, is it possible to also train a neural network to propose regions, instead of relying on a heuristic, hand-crafted approach like selective search? Faster R-CNN followed this line of thinking and successfully created the Region Proposal Network (RPN). Simply put, RPN is a CNN that takes an image as input and outputs a set of rectangular object proposals, each with an objectness score. The paper originally used VGG, but other backbone networks such as ResNet became more widespread later. To generate region proposals, a 3x3 sliding window is applied over the CNN feature map output to generate 2 scores (foreground and background) and 4 coordinates for each candidate box at each location. In practice, this sliding window is implemented with a 3x3 convolution followed by two sibling 1x1 convolutions, one for the scores and one for the coordinates.
Although the sliding window has a fixed size, our objects may appear at different scales. Therefore, Faster R-CNN introduced a technique called the anchor box. Anchor boxes are pre-defined prior boxes with different aspect ratios and sizes that share the same central location. In Faster R-CNN there are k=9 anchors for each sliding window location, covering 3 aspect ratios at each of 3 scales. Repeating these anchor boxes over every location brings nice translation-invariance and scale-invariance properties to the network while sharing the outputs of the same feature map. Note that the bounding box regression is computed relative to these anchor boxes instead of the whole image.
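A small sketch of how the k=9 anchors at one location could be generated. The scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the defaults I believe the paper uses, so treat them as assumptions; the function name is made up for this example.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 anchors (x1, y1, x2, y2) centered at (cx, cy).

    Each ratio is height / width; every anchor of a given scale s has area s * s.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


for anchor in make_anchors(300, 300):
    print([round(v, 1) for v in anchor])
```

The RPN then predicts one objectness score pair and one set of 4 offsets per anchor, so each feature map location produces 9 candidate proposals.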
[Figure: from “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”]
So far, we have discussed the new Region Proposal Network that replaces the old selective search region proposal. To make the final detection, Faster R-CNN uses the same detection head as Fast R-CNN to do classification and fine-grained localization. Do you remember that Fast R-CNN also uses a shared CNN feature extractor? Now that RPN itself is also a feature extraction CNN, we can just share it with the detection head, as in the diagram above. This sharing design does bring some trouble, though. If we train the RPN and the Fast R-CNN detector together, we treat RPN proposals as a constant input to ROI pooling and inevitably ignore the gradients with respect to the RPN’s bounding box proposals. One workaround is called alternating training, where you train the RPN and Fast R-CNN in turns. Later, the paper “Instance-aware Semantic Segmentation via Multi-task Network Cascades” showed that the ROI pooling layer can also be made differentiable w.r.t. the proposed box coordinates.
2015: YOLO v1
You Only Look Once: Unified, Real-Time Object Detection
While the R-CNN series started a big hype over two-stage object detection in the research community, its complicated implementation brought many headaches for the engineers who maintained it. Does object detection need to be so cumbersome? If we are willing to sacrifice a bit of accuracy, can we trade it for much faster speed? With these questions, Joseph Redmon submitted a network called YOLO to arxiv.org only four days after Faster R-CNN’s submission, and finally brought popularity back to one-stage object detection two years after OverFeat’s debut.
Unlike R-CNN, YOLO decided to tackle region proposal and region classification together in the same CNN. In other words, it treats object detection as a regression problem, instead of a classification problem relying on region proposals. The general idea is to split the input into an SxS grid and have each cell directly regress the bounding box location and the confidence score if the object center falls into that cell. Because objects may have different sizes, there is more than one bounding box regressor per cell. During training, the regressor with the highest IoU is assigned to the ground-truth label, so regressors at the same location learn to handle different scales over time. Meanwhile, each cell also predicts C class probabilities, conditioned on the grid cell containing an object (high confidence score). This approach was later described as dense prediction because YOLO tries to predict classes and bounding boxes for all possible locations in an image. In contrast, R-CNN relies on region proposals to filter out background regions, hence its final predictions are much more sparse.
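The grid assignment can be illustrated with a short sketch that finds the cell responsible for a ground-truth box and the regression targets inside that cell. S=7 and the 448-pixel input in the usage line are assumed here for illustration, and the function name is hypothetical.

```python
def responsible_cell(box, img_w, img_h, S=7):
    """Find the grid cell that owns a box given as (x1, y1, x2, y2) in pixels.

    Returns (row, col, x_offset, y_offset, w_norm, h_norm): the offsets are the
    box center position inside the cell in [0, 1), and the width and height
    are normalized by the image size.
    """
    cx = (box[0] + box[2]) / 2 / img_w * S  # center in grid units
    cy = (box[1] + box[3]) / 2 / img_h * S
    col = min(int(cx), S - 1)  # clamp centers that sit on the right/bottom edge
    row = min(int(cy), S - 1)
    w_norm = (box[2] - box[0]) / img_w
    h_norm = (box[3] - box[1]) / img_h
    return row, col, cx - col, cy - row, w_norm, h_norm


print(responsible_cell((100, 200, 200, 300), 448, 448))
```

Because only one cell owns each object center, two objects whose centers land in the same cell compete for the same few regressors.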
2015: SSD
SSD: Single Shot MultiBox Detector
YOLO v1 demonstrated the potential of one-stage detection, but the performance gap from two-stage detection was still noticeable. In YOLO v1, multiple objects could be assigned to the same grid cell. This was a big challenge when detecting small objects, and became a critical problem to solve in order to bring a one-stage detector’s performance on par with two-stage detectors. SSD is such a challenger and attacks this problem from three angles.
First, the anchor box technique from Faster R-CNN can alleviate this problem. Objects in the same area usually have different aspect ratios when they are all visible. Introducing anchor boxes not only increased the number of objects that each cell can detect, but also helped the network better differentiate overlapping small objects based on this aspect ratio assumption.
Second, SSD went further down this road by aggregating multi-scale features before detection. This is a very common approach to pick up fine-grained local features while preserving coarse global features in a CNN. For example, FCN, the pioneer of CNN semantic segmentation, also merged features from multiple levels to refine the segmentation boundary. Besides, multi-scale feature aggregation can be easily performed on all common classification networks, so it’s very convenient to swap out the backbone for another network.
Finally, SSD leveraged a large amount of data augmentation, especially targeted at small objects. For example, images are randomly expanded to a much larger size before random cropping, which brings a zoom-out effect to the training data to simulate small objects. Also, large bounding boxes are usually easy to learn. To avoid these easy examples dominating the loss, SSD adopted a hard negative mining technique to pick the examples with the highest loss for each anchor box.
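Hard negative mining can be sketched as a simple selection over per-anchor losses: keep all positives and only the highest-loss negatives. The 3:1 negative-to-positive cap is the ratio I recall from the SSD paper, so treat it as an assumption; the function name and sample values are invented for this example.

```python
def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Return a boolean mask of anchors that should contribute to the loss.

    All positive anchors are kept; negatives are ranked by loss and only the
    top (neg_pos_ratio * number_of_positives) are kept.
    """
    num_pos = sum(is_positive)
    neg_idx = [i for i, pos in enumerate(is_positive) if not pos]
    neg_idx.sort(key=lambda i: losses[i], reverse=True)  # hardest negatives first
    keep_neg = set(neg_idx[: neg_pos_ratio * num_pos])
    return [pos or (i in keep_neg) for i, pos in enumerate(is_positive)]


losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.3, 0.9]
is_positive = [False, True, False, False, False, False, False]
print(hard_negative_mining(losses, is_positive))
```

With one positive anchor, only the three negatives with the largest losses survive, so the many easy background anchors no longer dominate the gradient.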
2016: FPN
Feature Pyramid Networks for Object Detection
With the launch of Faster R-CNN, YOLO, and SSD in 2015, it seemed like the general structure of an object detector had been determined. Researchers started to look at improving each individual part of these networks. Feature Pyramid Networks (FPN) is an attempt to improve the detection head by using features from different layers to form a feature pyramid. The feature pyramid idea isn’t novel in computer vision research. Back when features were still manually designed, the feature pyramid was already a very effective way to recognize patterns at different scales. Using a feature pyramid in deep learning is also not a new idea: SPPNet, FCN, and SSD all demonstrated the benefit of aggregating multiple-layer features before classification. However, how to share the feature pyramid between the RPN and the region-based detector was yet to be determined.
[Figure: from “Feature Pyramid Networks for Object Detection”]
First, to rebuild the RPN with an FPN structure like the diagram above, we need to run region proposal on multiple scales of feature output. Also, we only need 3 anchors with different aspect ratios per location now, because objects of different sizes are handled by different levels of the feature pyramid. Next, to use an FPN structure in the Fast R-CNN detector, we need to adapt it to detect on multiple scales of feature maps as well. Since region proposals might have different scales too, each one should be pooled from the corresponding level of the FPN. In short, if Faster R-CNN is a pair of RPN and region-based detector running on one scale, FPN converts it into multiple parallel branches running on different scales and collects the results from all branches at the end.
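Picking the “corresponding level” for a region proposal can be sketched with the assignment rule I recall from the FPN paper, k = floor(k0 + log2(sqrt(w·h) / 224)), where a 224x224 proposal maps to level k0 = 4. The clamping range of levels 2 to 5 and the function name are assumptions for this example.

```python
import math


def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Choose the pyramid level whose feature map a w x h proposal is pooled from."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))  # clamp to the levels that actually exist


for size in (32, 112, 224, 448, 1000):
    print(size, fpn_level(size, size))
```

Small proposals land on the high-resolution levels and large proposals on the coarse levels, which is how one detector head can serve all scales.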
2016: YOLO v2
YOLO9000: Better, Faster, Stronger
While Kaiming He, Ross Girshick, and their team kept improving their two-stage R-CNN detectors, Joseph Redmon was busy improving his one-stage YOLO detector. The initial version of YOLO suffered from many shortcomings: predictions based on a coarse grid brought lower localization accuracy, and two scale-agnostic regressors per grid cell made it difficult to recognize small, densely packed objects. Fortunately, we saw many great innovations in 2015 across computer vision. YOLO v2 just needed to find a way to integrate them all to become better, faster, and stronger. Here are some highlights of the modifications:
- YOLO v2 added Batch Normalization layers from a paper called “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”.
- Just like SSD, YOLO v2 also adopted Faster R-CNN’s idea of anchor boxes for bounding box regression, but with some customization. Instead of predicting unconstrained offsets to anchor boxes, YOLO v2 constrains the object center regression tx and ty to lie within the responsible grid cell, which stabilizes early training. Also, anchor sizes are determined by K-means clustering on the target dataset to better align with object shapes.
- A new backbone network called Darknet-19 is used for feature extraction. It is inspired by “Network in Network” and GoogLeNet’s bottleneck structure.
- To improve the detection of small objects, YOLO v2 added a passthrough layer to merge features from an earlier layer. This part can be seen as a simplified version of SSD.
- Last but not least, Joseph realized that input resolution is a silver bullet for small object detection. YOLO v2 not only doubled the backbone’s input from 224x224 to 448x448 but also introduced a multi-scale training scheme, which uses different input resolutions at different periods of training.
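The constrained center regression from the anchor box item in the list above can be sketched as follows. It follows the decoding I recall from the YOLO9000 paper: a sigmoid keeps the center inside the responsible cell, and the anchor size is scaled exponentially. All values are in grid-cell units, and the function names are made up for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_yolo_v2(t, cell, anchor):
    """Turn raw outputs t = (tx, ty, tw, th) into a box (bx, by, bw, bh).

    cell is the (cx, cy) top-left corner of the responsible grid cell and
    anchor is the (pw, ph) prior size, all measured in grid cells.
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx  # sigmoid output is in (0, 1): center stays in the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)  # width and height scale the anchor prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh


print(decode_yolo_v2((0.0, 0.0, 0.0, 0.0), (3, 4), (1.5, 2.0)))
print(decode_yolo_v2((10.0, -10.0, 0.0, 0.0), (3, 4), (1.5, 2.0)))
```

Even with extreme raw outputs like tx = 10, the predicted center cannot leave cell (3, 4), which is the stabilizing effect described in the list.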
Note that YOLO v2 also experimented with a version trained on a hierarchical dataset of 9,000 classes, which also represents an early trial of multi-label classification in an object detector.
2017: RetinaNet
Focal Loss for Dense Object Detection
To understand why one-stage detectors are usually not as good as two-stage detectors, RetinaNet investigated the foreground-background class imbalance that comes from a one-stage detector’s dense predictions. Take YOLO as an example: it tries to predict classes and bounding boxes for all possible locations at the same time, so most of the outputs are matched to the negative class during training. SSD addressed this issue with online hard example mining. YOLO used an objectness score to implicitly train a foreground classifier in the early stage of training. RetinaNet argues that neither of them got to the key of the problem, so it invented a new loss function called Focal Loss to help the network learn what’s important.
Focal Loss multiplies the Cross-Entropy loss by a factor (1 - p)^γ, where p is the predicted probability of the true class and the power γ is what the authors call the focusing parameter. Naturally, as the confidence score becomes higher, the loss value becomes much lower than a normal Cross-Entropy. The α parameter is used to balance such a focusing effect.
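A minimal sketch of the binary focal loss for a single prediction, written as FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The defaults γ = 2 and α = 0.25 are the values I believe the paper recommends; treat them as assumptions. The function takes a probability rather than a logit to keep the example short.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction.

    p is the predicted foreground probability (0 < p < 1), y is the label (1 or 0).
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Compare with alpha-weighted cross-entropy (gamma = 0) for an easy and a hard example.
for p in (0.9, 0.6):
    print(p, focal_loss(p, 1), focal_loss(p, 1, gamma=0.0))
```

With γ = 0 the function reduces to α-weighted Cross-Entropy; with γ = 2 the well-classified example (p = 0.9) is down-weighted by a factor of 100, while the harder one (p = 0.6) is only down-weighted by about 6.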
This idea is so simple that even a primary school student can understand it. So to further justify their work, the authors adapted the FPN model they had previously proposed and created a new one-stage detector called RetinaNet. It is composed of a ResNet backbone, an FPN detection neck to channel features at different scales, and two subnets for classification and box regression as the detection head. Similar to SSD and YOLO v3, RetinaNet uses anchor boxes to cover targets of various scales and aspect ratios.
As a bit of a digression, RetinaNet used the COCO accuracy of a ResNeXt-101 variant with 800 input resolution to contrast with YOLO v2, which only has a lightweight Darknet-19 backbone and a 448 input resolution. This insincerity shows the team’s emphasis on getting better benchmark results, rather than solving a practical issue like the speed-accuracy trade-off. And it might be part of the reason that RetinaNet didn’t take off after its release.
2018: YOLO v3
YOLOv3: An Incremental Improvement
YOLO v3 is the last version of the official YOLO series. Following YOLO v2’s tradition, YOLO v3 borrowed more ideas from previous research and became an incredibly powerful one-stage detector, a real monster. YOLO v3 balances speed, accuracy, and implementation complexity pretty well, and it became really popular in the industry because of its fast speed and simple components. If you are interested, I wrote a very detailed explanation of how YOLO v3 works in my previous article “Dive Really Deep into YOLO v3: A Beginner’s Guide”.
Simply put, YOLO v3’s success comes from its more powerful backbone feature extractor and a RetinaNet-like detection head with an FPN neck. The new backbone network Darknet-53 leverages ResNet’s skip connections to achieve an accuracy that’s on par with ResNet-50 while being much faster. Also, YOLO v3 ditched v2’s passthrough layers and fully embraced FPN’s multi-scale prediction design. With that, YOLO v3 finally reversed people’s impression of its poor performance on small objects.
Besides, there are a few fun facts about YOLO v3. It dissed the COCO mAP 0.5:0.95 metric, and also reported that Focal Loss did not help when using a conditioned dense prediction. The author Joseph even decided to quit computer vision research altogether a year later, because of his concerns about military usage.
2019: Objects As Points
Although the image classification area has become less active recently, object detection research is still far from mature. In 2018, a paper called “CornerNet: Detecting Objects as Paired Keypoints” provided a new perspective on detector training. Since preparing anchor box targets is quite a cumbersome job, is it really necessary to use them as a prior? This new trend of ditching anchor boxes is called “anchor-free” object detection.
Inspired by the use of heat-maps in the Hourglass network for human pose estimation, CornerNet uses heat-maps generated from box corners to supervise the bounding box regression. To learn more about how heat-maps are used in the Hourglass network, you can read my previous article “Human Pose Estimation with Stacked Hourglass Network and TensorFlow”.
Objects As Points, aka CenterNet, took a step further. It uses heat-map peaks to represent object centers, and the network regresses the box width and height directly from these box centers. Essentially, CenterNet uses every pixel as a grid cell. With a Gaussian-distributed heat-map, training also converges more easily than in previous attempts that tried to regress bounding box size directly.
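Reading object centers off a heat-map amounts to finding local maxima. Here is a minimal sketch, assuming a single-class heat-map stored as a list of rows and an illustrative score threshold of 0.5; real implementations typically do the same comparison with a 3x3 max pooling on the GPU.

```python
def find_peaks(heatmap, threshold=0.5):
    """Return (row, col, score) for every pixel that is the maximum of its 3x3 neighborhood."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            # The neighborhood includes the pixel itself and is clipped at the borders.
            neighborhood = [heatmap[ny][nx]
                            for ny in range(max(0, y - 1), min(h, y + 2))
                            for nx in range(max(0, x - 1), min(w, x + 2))]
            if v >= max(neighborhood):
                peaks.append((y, x, v))
    return peaks


heatmap = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.2, 0.0, 0.1],
    [0.1, 0.2, 0.1, 0.3, 0.6],
    [0.0, 0.0, 0.1, 0.6, 0.8],
]
print(find_peaks(heatmap))
```

Each surviving peak is one detection; the width and height would then be read from the size regression output at the same pixel.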
The elimination of anchor boxes also has another useful side effect. Previously, we rely on IOU ( such as > 0.7) between the anchor box and the ground truth box to assign training targets. By doing so, a few neighboring anchors may get all assigned a positive target for the same object. And the network will learn to predict multiple positive boxes for the same object too. The common way to fix this issue is to use a technique called Non-maximum Suppression (NMS). It’s a greedy algorithm to filter out boxes that are too close together. Now that anchors are gone and we only have one peak per object in the heat-map, there’s no need to use NMS any more. Since NMS is sometimes hard to implement and slow to run, getting rid of NMS is a big benefit for the applications that run in various environments with limited resources.
2019: EfficientDet
EfficientDet: Scalable and Efficient Object Detection
[Figure from “EfficientDet: Scalable and Efficient Object Detection”]
At the recent CVPR’20, EfficientDet showed us some more exciting developments in object detection. The FPN structure has proven to be a powerful technique for improving a detection network’s performance on objects at different scales. Famous detection networks such as RetinaNet and YOLO v3 both adopted an FPN neck before box regression and classification. Later, NAS-FPN and PANet (please refer to the Read More section) both demonstrated that a plain multi-layer FPN structure can benefit from further design optimization. EfficientDet continued exploring in this direction and eventually created a new neck called BiFPN. Basically, BiFPN features additional cross-layer connections to encourage feature aggregation back and forth. To justify the “efficient” part of its name, BiFPN also removes some less useful connections from the original PANet design. Another innovative improvement over the FPN structure is weighted feature fusion: BiFPN adds learnable weights to feature aggregation so that the network can learn the importance of different branches.
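The weighted fusion can be illustrated with the “fast normalized fusion” variant described in the EfficientDet paper, where each input feature is multiplied by a non-negative weight and the result is divided by the sum of the weights. The sketch below uses flat Python lists in place of feature maps, and the function name and `eps` value are my own assumptions; in a real network the weights would be trainable parameters.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equally sized feature vectors with normalized non-negative weights."""
    # ReLU keeps every weight non-negative so the normalization stays stable.
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped) + eps
    length = len(features[0])
    return [
        sum(w * f[i] for w, f in zip(clipped, features)) / total
        for i in range(length)
    ]


# Two hypothetical branches; the second one is weighted three times as much.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0])
```

With weights 1 and 3, the fused values land close to 2.5 and 3.5, i.e. three quarters of the way toward the second branch.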
[Formula from “EfficientDet: Scalable and Efficient Object Detection”]
Moreover, just like what we saw in the image classification network EfficientNet, EfficientDet also introduced a principled way to scale an object detection network. The φ parameter in the above formula controls the width (channels) and depth (layers) of both the BiFPN neck and the detection head.
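As a rough reference, the compound scaling rules as I recall them from the EfficientDet paper can be written as a small function of φ. Treat the constants below as my reading of the paper rather than an authoritative configuration: the released models round the BiFPN width to hardware-friendly values and adjust some input resolutions by hand, so the raw numbers here will not match every published variant.

```python
def efficientdet_scaling(phi):
    """Raw (unrounded) compound scaling values for a given phi.

    Assumed rules, recalled from the paper:
      BiFPN width      = 64 * 1.35 ** phi
      BiFPN depth      = 3 + phi
      head depth       = 3 + floor(phi / 3)
      input resolution = 512 + 128 * phi
    """
    return {
        "bifpn_width": 64 * (1.35 ** phi),
        "bifpn_depth": 3 + phi,
        "head_depth": 3 + phi // 3,
        "input_resolution": 512 + 128 * phi,
    }


for phi in range(4):
    print(phi, efficientdet_scaling(phi))
```

A single knob growing the neck, the heads, and the input size together is what makes the D0 to D7 family easy to derive from one baseline.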
[Figure from “EfficientDet: Scalable and Efficient Object Detection”]
This new parameter results in 8 different variants of EfficientDet, from D0 to D7. The lightweight D0 variant achieves accuracy similar to YOLO v3 with far fewer FLOPs. The heavyweight D7 variant, with a monstrous 1536x1536 input, reaches 53.7 AP on COCO, dwarfing all other contenders.
Read More
From R-CNN and YOLO to the recent CenterNet and EfficientDet, we have witnessed most of the major innovations in object detection research in the deep learning era. Aside from the above papers, I’ve also provided a list of additional papers for you to keep reading to get a deeper understanding. They either provide a different perspective on object detection or extend this area with more powerful features.
2009: DPM
Object Detection with Discriminatively Trained Part Based Models
By matching many HOG features for each deformable part, DPM was one of the most efficient object detection models before the deep learning era. Taking pedestrian detection as an example, it uses a star structure to recognize the general person pattern first, then recognizes parts with different sub-filters and calculates an overall score. Even today, after the switch from HOG features to CNN features, the idea of recognizing objects through deformable parts is still popular.
2012: Selective Search
Selective Search for Object Recognition
Like DPM, Selective Search is not a product of the deep learning era. However, this method combines many classical computer vision approaches and was also used in the early R-CNN detector. The core idea of Selective Search is inspired by semantic segmentation, where pixels are grouped by similarity. Selective Search uses different similarity criteria, such as color space and SIFT-based texture, to iteratively merge similar areas together. These merged areas serve as foreground predictions and are followed by an SVM classifier for object recognition.
2016: R-FCN
R-FCN: Object Detection via Region-based Fully Convolutional Networks
Faster R-CNN finally combined RPN and ROI feature extraction and improved the speed a lot. However, for each region proposal, we still need fully connected layers to compute the class and bounding box separately. If we have 300 ROIs, we need to repeat this 300 times, which is also the origin of the major speed difference between one-stage and two-stage detectors. R-FCN borrowed the idea from FCN for semantic segmentation, but instead of computing a class mask, R-FCN computes position-sensitive score maps. These maps predict the probability of the object appearing at each location, and all locations vote (by averaging) to decide the final class and bounding box. Besides, R-FCN also uses atrous convolution in its ResNet backbone, which was originally proposed in the DeepLab semantic segmentation network. To understand what atrous convolution is, please see my previous article “Witnessing the Progression in Semantic Segmentation: DeepLab Series from V1 to V3+”.
2017: Soft-NMS
Improving Object Detection With One Line of Code
Non-maximum suppression (NMS) is widely used in anchor-based object detection networks to reduce duplicate positive proposals that are close to each other. More specifically, NMS iteratively eliminates candidate boxes that have a high IOU with a more confident candidate box. This can lead to unexpected behavior when two objects of the same class really are very close to each other. Soft-NMS makes a small change: instead of eliminating the overlapping candidate boxes, it only scales down their confidence scores with a parameter. This scaling parameter gives us more control when tuning localization performance, and also leads to better precision when high recall is needed.
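The difference from hard NMS is easiest to see in code. Below is a minimal sketch of the Gaussian variant of Soft-NMS, where an overlapping box has its score multiplied by exp(-IOU² / sigma) instead of being dropped. The function names, the `sigma=0.5` default, and the final score cut-off are assumptions chosen for this illustration.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (index, decayed_score) pairs in the order boxes are selected."""
    scores = list(scores)  # work on a copy so the caller's list is untouched
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break  # everything left has decayed to a negligible score
        keep.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            # Decay instead of delete: heavier overlap means a stronger penalty.
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
result = soft_nms(boxes, [0.9, 0.8, 0.7])
```

Here the heavily overlapping second box is not discarded; it is only pushed down the ranking, behind the distant third box, so a later score threshold decides whether it survives.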
2017: Cascade R-CNN
Cascade R-CNN: Delving into High Quality Object Detection
While FPN explored how to design a better R-CNN neck to use backbone features, Cascade R-CNN investigated a redesign of the R-CNN classification and regression head. The underlying assumption is simple yet insightful: the higher the IOU criterion we use when preparing positive targets, the fewer false positive predictions the network will learn to make. However, we can’t simply raise the IOU threshold from the commonly used 0.5 to a more aggressive 0.7, because far fewer proposals would qualify as positives and negative examples would overwhelm training. Cascade R-CNN’s solution is to chain multiple detection heads together, each relying on the bounding box proposals from the previous detection head. Only the first detection head uses the original RPN proposals. This effectively simulates an increasing IOU threshold for the later heads.
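A toy simulation can show why the cascade helps. In the sketch below, `refine` is a hypothetical stand-in for a trained regression head (it simply moves each box halfway toward the ground truth), and the thresholds 0.5, 0.6 and 0.7 are assumed stage settings. Nothing here is the actual Cascade R-CNN training code; it only counts how many proposals would qualify as positives at each stage.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def refine(box, gt, step=0.5):
    """Hypothetical regressor: move every coordinate part of the way to the target."""
    return tuple(b + step * (g - b) for b, g in zip(box, gt))


def cascade_positive_counts(proposals, gt, thresholds=(0.5, 0.6, 0.7)):
    """Count positives per stage when each stage consumes refined boxes."""
    counts = []
    boxes = list(proposals)
    for threshold in thresholds:
        counts.append(sum(1 for b in boxes if iou(b, gt) >= threshold))
        boxes = [refine(b, gt) for b in boxes]  # output feeds the next stage
    return counts


gt = (0, 0, 10, 10)
proposals = [(2, 2, 12, 12), (3, 0, 13, 10), (6, 6, 16, 16)]
counts = cascade_positive_counts(proposals, gt)  # -> [1, 2, 2]
direct = sum(1 for p in proposals if iou(p, gt) >= 0.7)  # -> 0
```

Applying the 0.7 threshold directly to the raw proposals leaves no positives at all, while the third stage of the cascade still sees two, because the earlier stages have already improved the boxes.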
2017: Mask R-CNN
Mask R-CNN
Mask R-CNN is not a typical object detection network. It was designed to solve the challenging instance segmentation task, i.e., creating a mask for each object in the scene. However, Mask R-CNN proved to be a great extension of the Faster R-CNN framework, and in turn inspired object detection research. The main idea is to add a binary mask prediction branch after ROI pooling, alongside the existing bounding box and classification branches. Besides, to address the quantization error of the original ROI Pooling layer, Mask R-CNN also proposed a new ROI Align layer that uses bilinear interpolation under the hood. Unsurprisingly, both multi-task training (segmentation + detection) and the new ROI Align layer contribute to some improvement on the bounding box benchmark.
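The key ingredient of ROI Align is sampling the feature map at fractional coordinates instead of rounding them. A minimal sketch of that bilinear sampling step is shown below; the function name and the nested-list feature map are assumptions for illustration, and a full ROI Align layer would additionally average several such samples per output bin.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a fractional (row, col) location."""
    rows, cols = len(feature), len(feature[0])
    # Clamp to the valid range so border samples stay well defined.
    y = min(max(y, 0.0), rows - 1)
    x = min(max(x, 0.0), cols - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, rows - 1), min(x0 + 1, cols - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two surrounding rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


feature = [[0.0, 10.0], [20.0, 30.0]]
center_value = bilinear_sample(feature, 0.5, 0.5)  # -> 15.0
```

ROI Pooling would snap the point (0.5, 0.5) to a single cell and return 0, 10, 20 or 30; the interpolated 15.0 keeps the sub-pixel position, which matters most for small objects and masks.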
2018: PANet
Path Aggregation Network for Instance Segmentation
Instance segmentation has a close relationship with object detection, so a new instance segmentation network often benefits object detection research indirectly. PANet aims at boosting information flow in the FPN neck of Mask R-CNN by adding an additional bottom-up path after the original top-down path. To visualize this change, we have a ↑↓ structure in the original FPN neck, and PANet makes it more like a ↑↓↑ structure before pooling features from multiple layers. Also, instead of having separate pooling for each feature layer, PANet added an “adaptive feature pooling” layer after Mask R-CNN’s ROIAlign to merge (element-wise max or sum) multi-scale features.
2019: NAS-FPN
NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection
PANet’s success in adapting the FPN structure drew attention from a group of NAS researchers. They used a reinforcement learning method similar to the one behind the image classification network NASNet and focused on searching for the best combination of merging cells. Here, a merging cell is the basic building block of an FPN that merges any two input feature layers into one output feature layer. The final results confirmed that FPN could benefit from further optimization, but the complex computer-searched structure is too difficult for humans to understand.
Conclusion
Object detection is still an active research area. Although the general landscape of this field is well shaped by two-stage detectors like R-CNN and one-stage detectors such as YOLO, our best detectors are still far from saturating the benchmark metrics, and also miss many targets in complicated backgrounds. At the same time, anchor-free detectors like CenterNet show us a promising future where object detection networks can become as simple as image classification networks. Other directions of object detection, such as few-shot recognition and NAS, are still at an early stage, and we will see how they go in the next few years. Nevertheless, as object detection technology becomes more mature, we need to be very cautious about its adoption by the military and police. A dystopia where Terminators hunt and shoot humans with a YOLO detector is the last thing we want to see in our lifetime.
Originally published at http://yanjia.li on Aug 9, 2020
Translated from: https://towardsdatascience.com/12-papers-you-should-read-to-understand-object-detection-in-the-deep-learning-era-3390d4a28891