[Daily Paper] You Only Look Once: Unified, Real-Time Object Detection
Abstract
We present YOLO, a unified pipeline for object detection.
Prior work on object detection repurposes classifiers to perform detection.
Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
Our unified architecture is also extremely fast; YOLO processes images in real-time at 45 frames per second, hundreds to thousands of times faster than existing detection systems.
Our system uses global image context to detect and localize objects, making it less prone to background errors than top detection systems like R-CNN.
By itself, YOLO detects objects at unprecedented speeds with moderate accuracy.
When combined with state-of-the-art detectors, YOLO boosts performance by 2-3 mAP points.
1 Introduction
Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact.
The human visual system is fast and accurate, allowing us to perform complex tasks like driving or grocery shopping with little conscious thought.
Fast, accurate algorithms for object detection would allow computers to drive cars in any weather without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.
Convolutional neural networks (CNNs) achieve impressive performance on classification tasks at real-time speeds [13].
Yet top object detection systems like R-CNN take seconds to process individual images and hallucinate objects in background noise.
We believe these shortcomings result from how these systems approach object detection.
Current detection systems repurpose classifiers to perform detection.
To detect an object these systems take a classifier for that object and evaluate it at various locations and scales in a test image.
Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [7].
More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes.
After classification, post-processing is used to refine the bounding box, eliminate duplicate detections, and rescore the box based on other objects in the scene [9].
These region proposal techniques typically generate a few thousand potential boxes per image.
Selective Search, the most common region proposal method, takes 1-2 seconds per image to generate these boxes [26].
The classifier then takes additional time to evaluate the proposals.
The best performing systems require 2-40 seconds per image and even those optimized for speed do not achieve real-time performance.
Additionally, even a highly accurate classifier will produce false positives when faced with so many proposals.
When viewed out of context, small sections of background can resemble actual objects, causing detection errors.
Finally, these detection pipelines rely on independent techniques at every stage that cannot be optimized jointly.
A typical pipeline uses Selective Search for region proposals, a convolutional network for feature extraction, a collection of one-versus-all SVMs for classification, non-maximal suppression to reduce duplicates, and a linear model to adjust the final bounding box coordinates.
Selective Search tries to maximize recall while the SVMs optimize for single class accuracy and the linear model learns from localization error.
Our system is refreshingly simple, see Figure 1.
Figure 1: The YOLO Detection System.
Processing images with YOLO is simple and straightforward.
Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.
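As a rough illustration of these three steps, the sketch below assumes a hypothetical `model` callable that maps a 448 × 448 image to per-box confidences and coordinates; it is not the authors' released code.

```python
import numpy as np
from PIL import Image

def detect(image_path, model, conf_threshold=0.2):
    """Minimal sketch of the YOLO pipeline: resize, single forward pass, threshold.

    `model` is a hypothetical callable returning (boxes, confidences), where
    boxes is an array of shape (N, 4) and confidences has shape (N,).
    """
    # (1) Resize the input image to 448 x 448.
    img = Image.open(image_path).resize((448, 448))
    x = np.asarray(img, dtype=np.float32) / 255.0

    # (2) Run a single convolutional network on the image.
    boxes, confidences = model(x)

    # (3) Threshold the resulting detections by the model's confidence.
    keep = confidences > conf_threshold
    return boxes[keep], confidences[keep]
```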
A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.
We train our network on full images and directly optimize detection performance.
Context matters in object detection.
Our network uses global image features to predict detections which drastically reduces its errors from background detections.
At test time, a single network evaluation of the full image produces detections of multiple objects in multiple categories without any pre or post-processing.
Our training and testing code are open source and available online at [redacted]. A variety of pre-trained models are also available to download.
2 Unified Detection
We unify the separate components of object detection into a single neural network.
Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
Our network uses features from the entire image to predict each bounding box.
It also predicts all bounding boxes for an image simultaneously.
This means our network reasons globally about the full image and all the objects in the image.
The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
2.1 Design
Our system divides the input image into a 7 × 7 grid.
If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Each grid cell predicts a bounding box and class probabilities associated with that bounding box, see Figure 2.
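A small sketch of this cell-assignment rule, assuming object center coordinates normalized to [0, 1); the grid size S = 7 follows the text, everything else is illustrative.

```python
def responsible_cell(x_center, y_center, S=7):
    """Return the (row, col) of the grid cell containing an object's center.

    x_center, y_center are normalized to [0, 1) relative to the image size.
    """
    col = min(int(x_center * S), S - 1)
    row = min(int(y_center * S), S - 1)
    return row, col

# Example: an object centered at (0.52, 0.31) falls into cell (2, 3).
print(responsible_cell(0.52, 0.31))
```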
Figure 2: The Model.
Our system models detection as a regression problem to a 7 × 7 × 24 tensor.
This tensor encodes bounding boxes and class probabilities for all objects in the image.
We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [6].
The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.
Our network architecture is inspired by the GoogLeNet model for image classification [25].
Our network has 24 convolutional layers followed by 2 fully connected layers.
However, instead of the inception modules used by GoogLeNet we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [17].
We also replace maxpooling layers with strided convolutions.
The full network is shown in Figure 3.
Figure 3: The Architecture.
Our detection network has 24 convolutional layers followed by 2 fully connected layers.
The network uses strided convolutional layers to downsample the feature space instead of maxpooling layers.
Alternating 1 × 1 convolutional layers reduce the feature space from preceding layers.
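To make the layer pattern concrete, here is a minimal PyTorch sketch of one "1 × 1 reduction followed by 3 × 3 convolution" block, with strided downsampling in place of maxpooling. The channel counts and number of blocks are placeholders rather than the paper's exact configuration, and the standard leaky ReLU with slope 0.1 is used as a stand-in for the activation defined in Section 2.2.

```python
import torch
import torch.nn as nn

def reduction_block(in_ch, mid_ch, out_ch, downsample=False):
    """One 1x1 reduction layer followed by a 3x3 convolution.

    When downsample=True the 3x3 convolution uses stride 2, which replaces
    a maxpooling layer as described in the text. Channel sizes are illustrative.
    """
    stride = 2 if downsample else 1
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),          # 1x1 reduction
        nn.LeakyReLU(0.1),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.LeakyReLU(0.1),
    )

# Example: stacking a few blocks shrinks a 448x448 input toward the 7x7 grid.
x = torch.randn(1, 3, 448, 448)
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),  # 448 -> 224
    nn.LeakyReLU(0.1),
    reduction_block(64, 32, 128, downsample=True),          # 224 -> 112
    reduction_block(128, 64, 256, downsample=True),         # 112 -> 56
)
print(net(x).shape)  # torch.Size([1, 256, 56, 56])
```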
We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224×224 input image) and then double the resolution for detection.
The final output of our network is a 7 × 7 grid of predictions.
Each grid cell predicts 20 conditional class probabilities, and 4 bounding box coordinates.
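Reading the output as a 7 × 7 × 24 tensor, each cell's 24 values therefore split into 20 conditional class probabilities and 4 box coordinates. A sketch of that slicing follows; the channel ordering is an assumption made for illustration.

```python
import numpy as np

S, C = 7, 20                          # grid size and number of VOC classes
output = np.random.rand(S, S, C + 4)  # stand-in for the network's 7x7x24 output

class_probs = output[..., :C]         # 20 conditional class probabilities per cell
boxes       = output[..., C:]         # x, y, w, h per cell

print(class_probs.shape, boxes.shape)  # (7, 7, 20) (7, 7, 4)
```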
2.2 Training
We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [22].
For pretraining we use the first 20 convolutional layers from Figure 3 followed by a maxpooling layer and two fully connected layers.
We train this network for approximately a week and achieve a top-5 accuracy of 86% on the ImageNet 2012 validation set.
We then adapt the model to perform detection.
Ren et al. show that adding both convolutional and connected layers to pretrained networks can benefit performance [21].
Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights.
Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
Our final layer predicts both class probabilities and bounding box coordinates.
We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1.
We parameterize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
We use a logistic activation function to reflect these constraints on the final layer.
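A sketch of this parameterization for a single ground-truth box, assuming pixel-space inputs; the helper below is illustrative, not the released training code.

```python
def encode_box(x, y, w, h, img_w, img_h, S=7):
    """Encode a ground-truth box (center x, y and size w, h in pixels)
    into YOLO's normalized targets.

    Width and height are normalized by the image size; x and y become
    offsets within the responsible grid cell, so all four targets lie in [0, 1].
    """
    # normalize the center to [0, 1) relative to the image
    cx, cy = x / img_w, y / img_h
    col, row = int(cx * S), int(cy * S)
    # offsets within the responsible cell
    x_off = cx * S - col
    y_off = cy * S - row
    return row, col, (x_off, y_off, w / img_w, h / img_h)

print(encode_box(300, 150, 100, 80, img_w=448, img_h=448))
```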
All other layers use the following leaky rectified linear activation:

$$\varphi(x)=\begin{cases}1.1x, & \text{if } x>0\\ 0.1x, & \text{otherwise}\end{cases}$$
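A direct sketch of this activation, applied elementwise to a layer's outputs:

```python
import numpy as np

def leaky_activation(x):
    """phi(x) = 1.1*x for x > 0 and 0.1*x otherwise, applied elementwise."""
    x = np.asarray(x, dtype=np.float32)
    return np.where(x > 0, 1.1 * x, 0.1 * x)

print(leaky_activation([-2.0, 0.0, 3.0]))  # [-0.2  0.   3.3]
```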
We optimize for sum-squared error in the output of our model.
We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision.
It weights localization error equally with classification error which may not be ideal.
To remedy this, we use a scaling factor λ to adjust the weight given to error from coordinate predictions versus error from class probabilities.
In our final model we use the scaling factor λ = 4.
Sum-squared error also equally weights errors in large boxes and small boxes.
Our error metric should reflect that small deviations in large boxes matter less than in small boxes.
To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.
If cell i predicts class probabilities $\hat{p}_i(\text{airplane})$, $\hat{p}_i(\text{bicycle})$, ... and the bounding box $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$, then our full loss function for an example is:

$$\sum^{48}_{i=0}\left(\lambda\,\mathbb{1}^{obj}_{i}\left((x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right)+\sum_{c\in classes}(p_{i}(c)-\hat{p}_{i}(c))^2\right)$$

where $\mathbb{1}^{obj}_{i}$ encodes whether any object appears in cell i.
Note that if there is no object in a cell we do not consider any loss from the bounding box coordinates predicted by that cell.
In this case, there is no ground truth bounding box so we only penalize the associated probabilities with that region.
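A numpy sketch of this loss for one image, with λ = 4 as stated above and a per-cell object mask; the target and prediction layouts are assumptions made for illustration.

```python
import numpy as np

def yolo_v1_loss(pred_boxes, pred_probs, true_boxes, true_probs, obj_mask, lam=4.0):
    """Sum-squared error loss over the 49 grid cells.

    pred_boxes, true_boxes: (49, 4) arrays of (x, y, w, h), with w and h in [0, 1]
    pred_probs, true_probs: (49, 20) class probabilities
    obj_mask: (49,) array, 1 if an object's center falls in the cell, else 0
    """
    # coordinate loss only for cells that contain an object;
    # width and height are compared through their square roots
    xy_err = np.sum((pred_boxes[:, :2] - true_boxes[:, :2]) ** 2, axis=1)
    wh_err = np.sum((np.sqrt(pred_boxes[:, 2:]) - np.sqrt(true_boxes[:, 2:])) ** 2, axis=1)
    coord_loss = np.sum(obj_mask * lam * (xy_err + wh_err))

    # class probability loss for every cell
    prob_loss = np.sum((pred_probs - true_probs) ** 2)

    return coord_loss + prob_loss
```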
We train the network for about 120 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012 as well as the test set from 2007, a total of 21k images.
Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.
We use two learning rates during training: 10^-2 and 10^-3.
Training diverges if we use the higher learning rate, 10^-2, from the start.
We use the lower rate, 10^-3, for one epoch so that the randomly initialized weights in the final layers can settle to reasonable values.
Then we train with the following learning rate schedule: 10^-2 for 80 epochs, and 10^-3 for 40 epochs.
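As a sketch, this schedule can be written as a simple lookup by epoch, with the warm-up epoch at the lower rate coming first as described above:

```python
def learning_rate(epoch):
    """Piecewise-constant schedule described in the text: one warm-up epoch
    at 1e-3, then 80 epochs at 1e-2, then 40 epochs at 1e-3."""
    if epoch < 1:
        return 1e-3   # let randomly initialized final layers settle
    if epoch < 81:
        return 1e-2
    return 1e-3

print([learning_rate(e) for e in (0, 1, 50, 85, 119)])
# [0.001, 0.01, 0.01, 0.001, 0.001]
```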
To avoid overfitting we use dropout and extensive data augmentation.
A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [14].
For data augmentation we introduce random scaling and translations of up to 10% of the original image size.
We also randomly adjust the exposure and saturation of the image by up to a factor of 2.
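One way to sample these augmentation parameters per image is sketched below; the uniform ranges are an interpretation of "up to 10%" and "up to a factor of 2", not the released implementation.

```python
import random

def sample_augmentation():
    """Randomly sample YOLO-style augmentation parameters for one image."""
    scale = 1.0 + random.uniform(-0.1, 0.1)   # scaling up to 10% of image size
    shift_x = random.uniform(-0.1, 0.1)       # translation up to 10%
    shift_y = random.uniform(-0.1, 0.1)
    exposure = random.uniform(0.5, 2.0)       # adjust by up to a factor of 2
    saturation = random.uniform(0.5, 2.0)
    return scale, shift_x, shift_y, exposure, saturation

print(sample_augmentation())
```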
2.2.1 Parameterizing Class Probabilities
Each grid cell predicts class probabilities for that area of the image.
There are 49 cells with a possible 20 classes each yielding 980 predicted probabilities per image.
Most of these probabilities will be zero since only a few objects appear in any given image.
Left unchecked, this imbalance pushes all of the probabilities to zero, leading to divergence during training.
To overcome this, we add an extra variable to each grid location, the probability that any object exists in that location regardless of class.
Thus instead of 20 class probabilities we have 1 “objectness” probability, Pr(Object), and 20 conditional probabilities: Pr(Airplane|Object), Pr(Bicycle|Object), etc.
To get the unconditional probability for an object class at a given location we simply multiply the “objectness” probability by the conditional class probability:

$$Pr(\text{Dog}) = Pr(\text{Object}) \cdot Pr(\text{Dog} \mid \text{Object})$$
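In vectorized form, the unconditional class probabilities for every cell are just an elementwise product; a short sketch follows, where the array shapes are assumptions for illustration.

```python
import numpy as np

S, C = 7, 20
objectness = np.random.rand(S, S)       # Pr(Object) per grid cell
cond_probs = np.random.rand(S, S, C)    # Pr(class | Object) per cell

# Pr(class) = Pr(Object) * Pr(class | Object), broadcast over the 20 classes
class_probs = objectness[..., None] * cond_probs
print(class_probs.shape)  # (7, 7, 20)
```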
We can optimize these probabilities independently or jointly using a novel “detection layer” in our convolutional network.
During the initial stages of training we optimize them independently to improve model stability.
We update the “objectness” probabilities at every location however we only update the conditional probabilities at locations that actually contain an object.
This means there are far fewer probabilities getting pushed towards zero.
During later stages of training we optimize the unconditioned probabilities by performing the required multiplications in the network and calculating error based on the result.
2.2.2 Predicting IOU
Like most detection systems, our network has trouble precisely localizing small objects.
While it may correctly predict that an object is present in an area of the image, if it does not predict a precise enough bounding box the detection is counted as a false positive.
We want YOLO to have some notion of uncertainty in its probability predictions.
Instead of predicting 1-0 probabilities we can scale the target class probabilities by the IOU of the predicted bounding box with the ground truth box for a region.
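Computing the IOU used to scale these targets is straightforward; a sketch for axis-aligned boxes given as (x1, y1, x2, y2) corners, a format assumed here for clarity:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# the target class probability is scaled by the overlap with the ground truth box
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.142857...
```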
When YOLO predicts good bounding boxes it is also encouraged to predict high class probabilities.
For poor bounding box predictions it learns to predict lower confidence probabilities.
We do not train to predict IOU from the beginning, only during the second stage of training.
It is not necessary for good performance but it does boost our mean average precision by 3-4%.
2.3 Inference
Just like in training, predicting detections for a test image only requires one network evaluation.
The network predicts 49 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation unlike classifier-based methods.
The grid design enforces spatial diversity in the bounding box predictions.
Often it is clear which grid cell an object falls into and the network only predicts one box for each object.
However, some large objects or objects near the border of multiple cells can be well localized by multiple cells.
Non-maximal suppression can be used to fix these multiple detections.
While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
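A minimal sketch of greedy non-maximal suppression over the 49 boxes, reusing the `iou` helper sketched in Section 2.2.2; the 0.5 overlap threshold is an illustrative choice, not a value from the paper.

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much.

    boxes: list of (x1, y1, x2, y2); scores: list of confidences.
    Returns the indices of the kept boxes.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[i], boxes[best]) < iou_threshold]
    return keep
```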
2.4 Limitations of YOLO
YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts one box.
This spatial constraint limits the number of nearby objects that our model can predict.
If two objects fall into the same cell our model can only predict one of them.
Our model struggles with small objects that appear in groups, such as flocks of birds.
Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations.
Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.
Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes.
A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU.
Our main source of error is incorrect localizations.
3 Comparison to Other Detection Systems
Object detection is a core problem in computer vision.
Detection pipelines generally start by extracting a set of robust features from input images (Haar [19], SIFT [18], HOG [2], convolutional features [3]).
Then, classifiers [27, 16, 9, 7] or localizers [1, 23] are used to identify objects in the feature space.
These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [26, 11, 28].
We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.
Deformable parts models.
Deformable parts models (DPM) use a sliding window approach to object detection [7].
DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc.
Our system replaces all of these disparate parts with a single convolutional neural network.
The network performs feature extraction, bounding box prediction, non-maximal suppression, and contextual reasoning all concurrently.
Instead of static features, the network trains the features in-line and optimizes them for the detection task.
Our unified architecture leads to a faster, more accurate model than DPM.
Table 1: PASCAL VOC 2012 Leaderboard.
YOLO compared with the full comp4 (outside data allowed) public leaderboard as of June 6th, 2015.
Mean average precision and per-class average precision are shown for a variety of detection methods.
YOLO is the top detection method that is not based on the R-CNN detection framework.
Fast R-CNN + YOLO is the second highest scoring method, with a 2% boost over Fast R-CNN.
R-CNN.
R-CNN and its variants use region proposals instead of sliding windows to find objects in images.
These systems use region proposal methods like Selective Search [26] to generate potential bounding boxes in an image.
Instead of scanning through every region in the window, now the classifier only has to score a small subset of potential regions in an image.
Good region proposal methods maintain high recall despite greatly limiting the search space.
This performance comes at a cost.
Selective Search, even in “fast mode” takes about 2 seconds to propose regions for an image.
R-CNN shares many design aspects with DPM.
After region proposal, R-CNN uses the same multistage pipeline of feature extraction (using CNNs instead of HOG), SVM scoring, non-maximal suppression, and bounding box prediction using a linear model [9].
YOLO shares some similarities with R-CNN.
Each grid cell proposes a potential bounding box and then scores that bounding box using convolutional features.
However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object.
Our system also proposes far fewer bounding boxes, only 49 per image compared to about 2000 from Selective Search.
Finally, our system combines these individual components into a single, jointly optimized model.
Deep MultiBox.
Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [5] instead of using Selective Search.
MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction.
However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification.
Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection pipeline.
OverFeat.
Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [23].
OverFeat efficiently performs sliding window detection but it is still a disjoint system.
OverFeat optimizes for localization, not detection performance.
Like DPM, the localizer only sees local information when making a prediction.
OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.
MultiGrasp.
Our work is similar in design to work on grasp detection by Redmon et al [20].
Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps.
4 Experiments
We present detection results for the PASCAL VOC 2012 dataset and compare our mean average precision (mAP) and runtime to other top detection methods.
We also perform error analysis on the VOC 2007 dataset.
We compare our results to Fast R-CNN, one of the highest performing versions of R-CNN [10].
We use publicly available runs of Fast R-CNN available on GitHub.
Finally we show that a combination of our method with Fast R-CNN gives a significant performance boost.
4.1 VOC 2012 Results
On the VOC 2012 test set we achieve 54.5 mAP.
This is lower than the current state of the art, closer to R-CNN based methods that use AlexNet, see Table 1.
Our system struggles with small objects compared to its closest competitors.
On categories like bottle, sheep, and tv/monitor YOLO scores 8-10 percentage points lower than R-CNN or Feature Edit.
However, on other categories like cat and horse YOLO achieves significantly higher performance.
We further investigate the source of these performance disparities in Section 4.3.
4.2 Speed
At test time YOLO processes images at 45 frames per second on an Nvidia Titan X GPU.
It is considerably faster than classifier-based methods with similar mAP.
Normal R-CNN using AlexNet or the small VGG network takes 400-500x longer to process images.
The recently proposed Fast R-CNN shares convolutional features between the bounding boxes but still relies on Selective Search for bounding box proposals which accounts for the bulk of its processing time.
YOLO is still around 100x faster than Fast R-CNN.
Table 2 shows a full comparison between multiple R-CNN and Fast R-CNN variants and YOLO.
Table 2: Prediction Timing.
mAP and timing information for R-CNN, Fast R-CNN, and YOLO on the VOC 2007 test set.
Timing information is given both as frames per second and the time each method takes to process the full 4952 image set.
The final column shows the relative speed of YOLO compared to that method.
4.3 VOC 2007 Error Analysis
An object detector must have high recall for objects in the test set to obtain high performance.
Our model imposes spatial constraints on bounding box predictions which limits recall on small objects that are close together.
We examine how detrimental this is in practice by calculating our highest possible recall assuming perfect coordinate prediction.
Under this assumption, our model can achieve a 93.1% recall for objects in the VOC 2007 test set.
This is lower than Selective Search (98.0% [26]) but still relatively high.
Using the methodology and tools of Hoiem et al. [15] we analyze our performance on the VOC 2007 test set.
We compare YOLO to Fast R-CNN using VGG-16, one of the highest performing object detectors.
Figure 4 compares frequency of localization and background errors between Fast R-CNN and YOLO.
Figure 4: Error Analysis: Fast R-CNN vs. YOLO
These charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category).
A detection is considered a localization error if it overlaps a true positive in the image but by less than the required 50% IOU.
A detection is a background error if the box does not overlap any objects of any class in the image.
Fast R-CNN makes around the same number of localization errors and background errors.
Over all 20 classes in the top N detections 13.6% are localization errors and 8.6% are background errors.
YOLO makes far more localization errors but relatively few background errors.
Averaged across all classes, of its top N detections 24.7% are localization errors and a mere 4.3% are background errors.
This is about twice the number of localization errors but half the number of background detections.
YOLO uses global context to evaluate detections while R-CNN only sees small portions of the image.
Many of the background detections made by R-CNN are obviously not objects when shown in context.
YOLO and R-CNN are good at different parts of object detection.
Since their main sources of error are orthogonal, combining them should produce a model that is better across the board.
4.4 Combining Fast R-CNN and YOLO
YOLO makes far fewer background mistakes than Fast R-CNN.
By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance.
For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box.
If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.
This reorders the detections to favor those predicted by both systems.
Since we still use Fast R-CNN’s bounding box predictions we do not introduce any localization error.
Thus we take advantage of the best aspects of both systems.
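The paper does not spell out the exact rescoring formula, so the sketch below is only one plausible reading: each Fast R-CNN detection is boosted by the best-matching YOLO box's probability weighted by its overlap, reusing the `iou` helper from Section 2.2.2.

```python
def rescore_with_yolo(rcnn_dets, yolo_dets, min_iou=0.5):
    """Boost Fast R-CNN detections that YOLO also predicts.

    rcnn_dets, yolo_dets: lists of (box, score) with box = (x1, y1, x2, y2).
    The boost term (yolo_score * overlap) is an illustrative choice; the
    combined model keeps Fast R-CNN's boxes and only adjusts their scores.
    """
    rescored = []
    for box, score in rcnn_dets:
        best = 0.0
        for ybox, yscore in yolo_dets:
            overlap = iou(box, ybox)
            if overlap >= min_iou:
                best = max(best, yscore * overlap)
        rescored.append((box, score + best))
    # reorder detections to favor those predicted by both systems
    return sorted(rescored, key=lambda d: d[1], reverse=True)
```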
The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set.
When combined with YOLO, its mAP increases by 2.9% to 74.7%.
We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN.
Those ensembles produced small increases in mAP between .3 and .6%, see Table 3 for details.
Table 3: Model combination experiments on VOC 2007.
We examine the effect of combining various models with the best version of Fast R-CNN.
The model’s base mAP is listed as well as its mAP when combined with the top model on VOC 2007.
Other versions of Fast R-CNN provide only a small marginal benefit while combining with YOLO results in a significant performance boost.
Thus, the benefit from combining Fast R-CNN with YOLO is unique, not a general property of combining models in this way.
Using this combination strategy we achieve a significant boost on the VOC 2012 and 2010 test sets as well, around 2%.
The combined Fast R-CNN + YOLO model is currently the second highest performing model on the VOC 2012 leaderboard.
5 Conclusion
We introduce YOLO, a unified pipeline for object detection.
Our model is simple to construct and can be trained directly on full images.
Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and every piece of the pipeline can be trained jointly.
The network reasons about the entire image during inference which makes it less likely to predict background false positives than sliding window or proposal region detectors.
Moreover, it predicts detections with only a single network evaluation, making it extremely fast.
YOLO achieves impressive performance on standard benchmarks considering it is 2-3 orders of magnitude faster than existing detection methods.
It can also be used to rescore the bounding boxes produced by state-of-the-art methods like R-CNN for a significant boost in performance.
總結(jié)
以上是生活随笔為你收集整理的《每日论文》You Only Look Once: Unified, Real-Time Object Detection的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: tqdm: ‘module‘ objec
- 下一篇: 《每日一题》49. Group Anag