
YOLO Series Reading (1): Reading the Original YOLOv1 Paper, You Only Look Once: Unified, Real-Time Object Detection

Published: 2025/4/5


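0. Abstract
- 0.1 Translation
  - Paragraph 1 (how this work differs from prior work)
  - Paragraph 2 (fast; more localization errors, but fewer false positives on background)
- 0.2 Summary
1. Introduction
- 1.1 Translation
  - Paragraph 1 (why the work matters: 1. it mirrors human vision; 2. it helps solve many practical problems)
  - Paragraph 2 (drawbacks of splitting detection into proposal and classification)
  - Paragraph 3 (YOLO's advantage)
  - Paragraph 4 (YOLO produces results directly)
  - Paragraph 5 (first, YOLO is real-time and far more accurate than other real-time systems)
  - Paragraph 6 (second, YOLO reasons globally about the image, so it makes fewer background errors)
  - Paragraph 7 (third, it generalizes well)
  - Paragraph 8 (limitations: lower overall accuracy, weak on small objects)
  - Paragraph 9
- 1.2 Summary
2. Unified Detection
- 2.0 Overview
  - 2.0.1 Translation
    - Paragraph 1 (proposal and classification solved in one pass; global reasoning; end-to-end)
    - Paragraph 2 (what the network predicts)
    - Paragraph 3 (the confidence score)
    - Paragraph 4 (what each box prediction contains)
    - Paragraph 5 (the class predictions in detail)
    - Paragraph 6 (why the design of the scores is logically sound)
    - Paragraph 7 (the output format)
    - Figure 2
  - 2.0.2 Summary
- 2.1. Network Design
  - 2.1.1 Sentence-by-sentence translation
    - Paragraph 1 (the dataset and the overall model)
    - Paragraph 2 (the network in more detail)
    - Paragraph 3 (a brief introduction to Fast YOLO)
    - Paragraph 4 (the output)
  - 2.1.2 Summary
- 2.2. Training
  - Sentence-by-sentence translation
    - Paragraph 1 (how the authors set up pretraining)
    - Paragraph 2 (adding layers to the pretrained network and raising the input resolution)
    - Paragraph 3 (normalizing the outputs)
    - Paragraph 4 (torch.nn.LeakyReLU())
    - Paragraph 5 (one error covers both coordinate error and classification error)
    - Paragraph 6 (how the unequal weighting is implemented)
    - Paragraph 7 (balancing errors in large and small boxes)
    - Paragraph 8 (assigning one predictor to each object)
    - Paragraph 9 (the loss function)
    - Paragraph 10 (each loss term penalizes one factor on its own)
    - Paragraph 11 (practical training details)
    - Paragraph 12 (choosing the learning rate)
    - Paragraph 13 (data augmentation)
- 2.3. Inference
  - 2.3.1 Sentence-by-sentence translation
    - Paragraph 1 (testing, like training, needs only one pass)
    - Paragraph 2 (non-maximal suppression)
  - 2.3.2 Summary
- 2.4. Limitations of YOLO
  - 2.4.1 Sentence-by-sentence translation
    - Paragraph 1 (objects close to each other are hard for the model)
    - Paragraph 2 (sensitive to unusual input shapes; coarse features)
    - Paragraph 3 (large versus small bounding boxes)
  - 2.4.2 Summary: limitations of the model
3. Comparison to Other Detection Systems
- 3.1 Sentence-by-sentence translation
  - Paragraph 1 (earlier systems all split detection into separate stages)
  - Paragraph 2 (DPM)
4. Experiments
- 4.0 Preface
  - 4.0.1 Sentence-by-sentence translation
  - 4.0.2 Summary
- 4.1 Comparison to Other Real-Time Systems (compared with recent work)
  - Paragraph 1 (few earlier systems are truly real-time; systems that are not real-time are also compared, to judge whether the mAP given up for speed is worthwhile)
  - Paragraph 2 (Fast YOLO shows how fast YOLO is)
  - Paragraph 3 (the VGG-16 variant is more accurate but slower; the standard YOLO network is lighter than VGG-16)
  - Paragraph 4 (R-CNN minus R falls short)
  - Paragraph 5 (Fast R-CNN speeds up classification but is still far from real-time)
  - Paragraph 6 (Faster R-CNN: the VGG-16 version is more accurate than YOLO but 6 times slower; the Zeiler-Fergus version is 2.5 times slower and less accurate)
- 4.2. VOC 2007 Error Analysis
  - Paragraph 1
- 4.6 Summary
5. Apology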

0. Abstract

0.1 Translation

Paragraph 1 (how this work differs from prior work)

We present YOLO, a new approach to object detection.

Prior work on object detection repurposes classifiers to perform detection.
(That is, earlier detectors work in two stages: candidate regions are produced first, then a classifier is run on them.)

Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
(Roughly: a whole image goes in, and the detection result is regressed directly. The "detection result" is, for each of the S × S grid cells, B tuples (x, y, w, h, c) plus one vector of class probabilities. S × S × B is only the number of predicted boxes; the values inside the tuples are what is actually regressed.)

A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.

Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
(I read "pipeline" as the full chain of processing steps from image to detections; in earlier systems these steps were separate, fixed components.)
("End-to-end" roughly means going directly from an input image to the predictions, with the whole chain trained jointly.)

Paragraph 2 (fast; more localization errors, but fewer false positives on background)

Our unified architecture is extremely fast.
Our base YOLO model processes images in real-time at 45 frames per second.

A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.

Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background.

Finally, YOLO learns very general representations of objects.

It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

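0.2 Summary

1. The network only needs to look at the image once.
2. Because it looks only once, it is fast.
3. Because it looks only once, it makes somewhat more errors overall, mainly localization errors.
4. Even so, it makes fewer mistakes on background.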

1. Introduction

1.1 Translation

Paragraph 1 (why the work matters: 1. it mirrors human vision; 2. it helps solve many practical problems)

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact.
(One glance tells a person what is there, where it is, and how things are moving or relating to each other.)

The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought.

Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Paragraph 2 (drawbacks of splitting detection into proposal and classification)

Current detection systems repurpose classifiers to perform detection.

To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image.

Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes.
(Sliding windows scan the whole image; region proposals scan only a selection of likely regions.)

After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13].

These complex pipelines are slow and hard to optimize because each individual component must be trained separately.
(I am reminded of the PointNet-style pipelines I read earlier, where an intermediate label-generation step had to be optimized on its own.)

Paragraph 3 (YOLO's advantage)

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.

Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

Paragraph 4 (YOLO produces results directly)

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.

YOLO trains on full images and directly optimizes detection performance.

This unified model has several benefits over traditional methods of object detection.

Paragraph 5 (first, YOLO is real-time and far more accurate than other real-time systems)

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline.

We simply run our neural network on a new image at test time to predict detections.

Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps.

This means we can process streaming video in real-time with less than 25 milliseconds of latency.

Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems.

For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.
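
The frame rates quoted above can be turned into per-frame processing times with one division. The short sketch below only does that arithmetic; the function name is my own, and real end-to-end latency would also include capture and display time, which the paper does not break down.

```python
def frame_time_ms(fps: float) -> float:
    """Per-frame processing time in milliseconds implied by a frame rate."""
    return 1000.0 / fps


base_ms = frame_time_ms(45)    # base YOLO, 45 frames per second
fast_ms = frame_time_ms(155)   # Fast YOLO, 155 frames per second (abstract)

print(f"base YOLO: {base_ms:.1f} ms per frame")
print(f"Fast YOLO: {fast_ms:.1f} ms per frame")
```

At 45 frames per second each frame takes about 22 ms, which is consistent with the "less than 25 milliseconds" figure in the text.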

Paragraph 6 (second, YOLO reasons globally about the image, so it makes fewer background errors)

Second, YOLO reasons globally about the image when making predictions.

Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance.

Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context.

YOLO makes less than half the number of background errors compared to Fast R-CNN.

Paragraph 7 (third, it generalizes well)

Third, YOLO learns generalizable representations of objects.

When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin.

Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

Paragraph 8 (limitations: lower overall accuracy, weak on small objects)

YOLO still lags behind state-of-the-art detection systems in accuracy.

While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones.

We examine these tradeoffs further in our experiments.

Paragraph 9

All of our training and testing code is open source. A variety of pretrained models are also available to download.

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.
(Calling the score "confidence" is a good choice: it sidesteps the question of whether the value is a true probability.)

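1.2 Summary

1. Real-time performance is valuable: processing image data faster matters in practice.
2. YOLO achieves real-time performance.
3. Because YOLO takes the whole image into account, it does well at not mistaking background for objects.
4. It generalizes well.
5. It is weak on small objects (the authors partly mitigate this by rescaling box sizes in the loss).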

2. Unified Detection

2.0 Overview

2.0.1 Translation

Paragraph 1 (proposal and classification solved in one pass; global reasoning; end-to-end)

We unify the separate components of object detection into a single neural network.
(That is, the former proposal and classification stages are merged into one.)

Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image.

The YOLO design enables end-to-end training and realtime speeds while maintaining high average precision.

Paragraph 2 (what the network predicts)

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
(This is the core idea of YOLO: the image is divided into grid cells, and each cell is responsible for B bounding boxes. Those B boxes share a single set of class probabilities, so they must all belong to the same class. This is why YOLO performs poorly on objects of different classes that lie close together.)

Paragraph 3 (the confidence score)

Each grid cell predicts B bounding boxes and confidence scores for those boxes.
(A cell predicts these boxes because their centers fall inside that cell.)

These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts.
(The score reflects two things: whether there is an object, and how accurate the box is. As the definition shows, the two are multiplied together.)

Formally we define confidence as

$$\Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

(Pr(Object) is 1 if the cell contains an object and 0 otherwise. IOU is the familiar intersection over union between the predicted box and the ground truth.)

If no object exists in that cell, the confidence scores should be zero.
(In that case Pr(Object) is simply 0.)

Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
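
To make the confidence target concrete, here is a minimal pure-Python sketch. It assumes boxes are given as corner coordinates (x_min, y_min, x_max, y_max); that format and the function names `iou` and `confidence_target` are my own choices for illustration, not something the paper prescribes.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(has_object, pred_box, truth_box):
    """Target confidence: the IOU with the ground truth if the cell has an object, else 0."""
    return iou(pred_box, truth_box) if has_object else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap area 1, union 4 + 4 - 1 = 7
empty = confidence_target(False, (0, 0, 2, 2), (1, 1, 3, 3))
```

With these two example boxes the IOU is 1/7, and a cell with no object gets a target of 0 regardless of the predicted box.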

Paragraph 4 (what each box prediction contains)

Each bounding box consists of 5 predictions: x, y, w, h, and confidence.

The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell.

The width and height are predicted relative to the whole image.

Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
(So when an object really is present, the confidence is simply the IOU.)

Paragraph 5 (the class predictions in detail)

Each grid cell also predicts C conditional class probabilities, Pr(Class_i|Object).
(The probability computed here is conditional on the cell already containing an object.)

These probabilities are conditioned on the grid cell containing an object.

We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
(B, as introduced earlier, is the number of candidate bounding boxes per grid cell. So the B boxes of a cell must belong to the same class, which is why YOLOv1 is weak at detecting objects of different classes that lie close together.)

Paragraph 6 (why the design of the scores is logically sound)

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

(The product collapses to the probability of the class itself times the IOU. The authors are presumably also showing that the design is mathematically consistent.)

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
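
The test-time multiplication described in this paragraph can be sketched for a single grid cell as follows. The numbers are made up for illustration, and `class_scores` is a hypothetical helper name.

```python
def class_scores(class_probs, box_confidences):
    """Class-specific confidence scores for every box of one grid cell.

    class_probs: the C conditional class probabilities shared by the cell.
    box_confidences: the B confidence values, one per box.
    Returns a B x C nested list.
    """
    return [[p * conf for p in class_probs] for conf in box_confidences]


# One cell with C = 3 classes and B = 2 boxes (illustrative values only).
scores = class_scores([0.7, 0.2, 0.1], [0.8, 0.1])
```

Because both boxes are multiplied by the same class probabilities, they always agree on the most likely class; only the overall scale of the scores differs between the two boxes.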

Paragraph 7 (the output format)

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20.
Our final prediction is a 7 × 7 × 30 tensor.
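
The 7 × 7 × 30 figure follows from S, B and C: each cell holds B boxes with 5 values each, plus C class probabilities. A small sketch of that arithmetic, with function names of my own choosing:

```python
def output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, each with B * 5 + C values."""
    return (S, S, B * 5 + C)


def boxes_per_image(S, B):
    """Total number of bounding boxes predicted for one image."""
    return S * S * B


voc_shape = output_shape(S=7, B=2, C=20)
voc_boxes = boxes_per_image(S=7, B=2)
print(voc_shape, voc_boxes)
```

For the PASCAL VOC settings this gives a 7 × 7 × 30 tensor and 98 boxes per image, the same box count the paper quotes in its Inference section.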

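Figure 2

2.0.2 Summary

1. In short, each grid cell is responsible for part of the detections, and the scoring of each output is defined explicitly. One notable point: a classification task usually ends in a softmax over the classes, whereas here the network also outputs a confidence for each detection, which judges whether the box is drawn correctly.
2. Everything is therefore organized around the grid cell. Although each cell scores the confidence of B boxes, it produces only one set of class outputs (C values for C classes), so the B boxes of a cell all share the same class prediction.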

2.1. Network Design

2.1.1 Sentence-by-sentence translation

Paragraph 1 (the dataset and the overall model)

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9].

The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.
(Before fully convolutional networks (FCN) appeared, fully connected layers were the usual choice for this final mapping. They have strong learning capacity but also overfit easily.)

Paragraph 2 (the network in more detail)

Our network architecture is inspired by the GoogLeNet model for image classification [34].

Our network has 24 convolutional layers followed by 2 fully connected layers.

Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

(I find the 1 × 1 convolution to be an effective feature extractor in its own right. Here it is used to change the number of channels: it reduces the channel count before each 3 × 3 convolution, which then expands it again. If this is still unclear, the code shows that the layers simply alternate 1 × 1, 3 × 3, 1 × 1, and so on.)
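
One way to see why a 1 × 1 reduction layer helps is to count weights. The sketch below compares a direct 3 × 3 convolution with a 1 × 1 reduction followed by a 3 × 3 convolution. The channel counts (512 in, 256 after reduction, 512 out) are illustrative assumptions, and biases are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


# Direct 3x3 convolution, 512 -> 512 channels.
direct = conv_params(3, 512, 512)

# 1x1 reduction to 256 channels, then 3x3 convolution back up to 512.
reduced = conv_params(1, 512, 256) + conv_params(3, 256, 512)

print(direct, reduced, reduced / direct)
```

Under these assumed channel counts the reduced pair uses roughly 56% of the weights of the direct convolution while still ending with a 3 × 3 layer of the same output width.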

Paragraph 3 (a brief introduction to Fast YOLO)

We also train a fast version of YOLO designed to push the boundaries of fast object detection.
("Push the boundaries" roughly means advancing the state of the art in this area.)

Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers.
(Filters are the individual convolution kernels within a layer; see the related post on how the number of kernels relates to the input and output channel counts.)

Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

Paragraph 4 (the output)

The final output of our network is the 7 × 7 × 30 tensor of predictions.

2.1.2 Summary

This section describes the structure of the network and the format of its output.
Roughly:

1. The network uses a GoogLeNet-like structure (a product of its time) to extract a 7 × 7 feature map in which each position corresponds to one grid cell. (Note that each position mainly carries information from the image region around its own cell.)
2. The feature map is flattened and passed through fully connected layers, which mix information from all cells. (This is where global information comes in.)
3. The result is reshaped back to the 7 × 7 grid to produce the final predictions.

2.2. Training

Sentence-by-sentence translation

Paragraph 1 (how the authors set up pretraining)

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30].
(Pretraining the lower layers helps the overall performance of the network.)

For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer.
(Shallow layers are hard to optimize fully during the later detection training.)

We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24].
(So pretraining already reaches good accuracy.)

We use the Darknet framework for all training and inference [26].
(Darknet is a framework written earlier by the same author.)

Paragraph 2 (adding layers to the pretrained network and raising the input resolution)

We then convert the model to perform detection.

Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29].

Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights.
(The randomly initialized weights stand in contrast to the pretrained weights of the earlier layers.)

Detection often requires fine grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Paragraph 3 (normalizing the outputs)

Our final layer predicts both class probabilities and bounding box coordinates.

We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1.

We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
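
The parametrization in this paragraph can be sketched as a small encoding function. It assumes the ground-truth box is given by its center and size in pixels; the function name `encode_box` and the return layout are hypothetical choices for this example.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a box (center and size in pixels) to YOLO-style training targets.

    Returns (row, col, x, y, w_norm, h_norm), where x and y are offsets
    inside the responsible grid cell and w_norm, h_norm are fractions of
    the image size. All four values lie between 0 and 1.
    """
    cell_w = img_w / S
    cell_h = img_h / S
    # The cell containing the box center is responsible for the object.
    col = min(int(cx / cell_w), S - 1)
    row = min(int(cy / cell_h), S - 1)
    x = cx / cell_w - col
    y = cy / cell_h - row
    return row, col, x, y, w / img_w, h / img_h


# A 112 x 224 box centered at (224, 100) in a 448 x 448 image.
target = encode_box(224, 100, 112, 224, 448, 448)
print(target)
```

In this example each cell is 64 pixels wide, so the center falls in row 1, column 3, with offsets (0.5, 0.5625) inside that cell and a normalized size of (0.25, 0.5).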

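Paragraph 4 (torch.nn.LeakyReLU())

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$$

(In PyTorch this can be created directly with torch.nn.LeakyReLU(negative_slope=0.1). The slope has to be passed explicitly, because the default negative_slope is 0.01 rather than the 0.1 used in the paper.)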
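
For reference, here is a dependency-free sketch of the leaky rectified linear activation, assuming the slope of 0.1 for non-positive inputs that the paper uses. The parameter name `negative_slope` mirrors the PyTorch argument, but this is a plain-Python illustration and not the PyTorch implementation.

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: identity for positive inputs, a small slope otherwise."""
    return x if x > 0 else negative_slope * x


positive = leaky_relu(2.0)    # passes through unchanged
negative = leaky_relu(-3.0)   # scaled by 0.1
print(positive, negative)
```

Unlike a plain ReLU, negative inputs still produce a small non-zero output, so their gradient does not vanish entirely.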

Paragraph 5 (one error covers both coordinate error and classification error)

We optimize for sum-squared error in the output of our model.
(The sum-squared error combines localization error, that is, the coordinate error of the bounding boxes, and classification error. Coordinates and classes come out of the same forward pass, so both are evaluated together.)

We use sum squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision.

It weights localization error equally with classification error which may not be ideal.
(There are many class outputs, so lumping everything together with equal weight is not appropriate.)
Also, in every image many grid cells do not contain any object.

This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects.
(Cells without objects are in the majority. If all cells are weighted equally, the cells without objects are optimized well, while the cells that actually predict objects are under-trained.)

This can lead to model instability, causing training to diverge early on.

Paragraph 6 (how the unequal weighting is implemented)

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects.
("Coordinate predictions" here means the predicted box coordinates themselves.)

We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord = 5 and λnoobj = .5.

Paragraph 7 (balancing errors in large and small boxes)

Sum-squared error also equally weights errors in large boxes and small boxes.

Our error metric should reflect that small deviations in large boxes matter less than in small boxes.

To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

(One question is worth working through here: the width and height were already normalized to lie between 0 and 1, so why does box size still need special treatment?
To answer it, a few points need to be clear:

1. What exactly is normalized?
Normalization appears in two places.
First, when the dataset labels are prepared, the width and height of each box are scaled to lie between 0 and 1 before training.
Second, the network outputs are in the same 0-to-1 range and are mapped back to pixel sizes to obtain the final boxes.
2. What does the loss compare?
As the paper writes the loss, it compares the square roots of the normalized width and height (a particular implementation may convert to other units first). Normalization does not make all boxes the same size: a small box may have width 0.05 and a large one 0.8, so the same absolute error is a far larger relative error for the small box. Taking square roots makes a given deviation cost more for small boxes than for large ones.
3. Why normalize at all?
It plays the same role as any other normalization: regressing values that differ widely in magnitude is hard, and targets between 0 and 1 are easier to predict.
)
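
A small numeric sketch shows what the square root does. It applies the same absolute deviation of 0.05 to a small box and a large box, once with plain squared error and once with squared error on the square roots. The widths are made-up normalized values and `size_error` is a hypothetical helper.

```python
import math


def size_error(true_size, pred_size, use_sqrt):
    """Squared error on a width or height, optionally on the square roots."""
    if use_sqrt:
        return (math.sqrt(true_size) - math.sqrt(pred_size)) ** 2
    return (true_size - pred_size) ** 2


# Same absolute deviation (0.05) for a small box and a large box.
small_plain = size_error(0.05, 0.10, use_sqrt=False)
large_plain = size_error(0.80, 0.85, use_sqrt=False)
small_sqrt = size_error(0.05, 0.10, use_sqrt=True)
large_sqrt = size_error(0.80, 0.85, use_sqrt=True)

print(small_plain, large_plain)   # essentially identical
print(small_sqrt, large_sqrt)     # the small box is penalized far more
```

With plain squared error both boxes receive the same penalty. With square roots, the small box's penalty is more than ten times that of the large box, which is the direction the paper asks for. As the paper says, this only partially addresses the problem.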

Paragraph 8 (assigning one predictor to each object)

YOLO predicts multiple bounding boxes per grid cell.

At training time we only want one bounding box predictor to be responsible for each object.

We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth.
(The idea is sound, but it raises a question: how can the IOU be known before the network has predicted anything? It is computed on the fly. In each training step the current predictions are compared with the ground truth, and the predictor with the highest IOU is made responsible for that step.)

This leads to specialization between the bounding box predictors.

Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
(Recall is the ability to find the objects that are present.)

For recall, see the companion post: What are Precision and Recall?
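
The assignment of a "responsible" predictor can be sketched as an argmax over IOU values within one grid cell. The corner-coordinate box format and the function names are assumptions made for this example; the selection would be repeated at every training step with the network's current predictions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(pred_boxes, truth_box):
    """Index of the predicted box with the highest IOU against the ground truth."""
    ious = [iou(p, truth_box) for p in pred_boxes]
    return max(range(len(ious)), key=ious.__getitem__)


# Two predictions from one cell (B = 2) and one ground-truth box.
preds = [(0, 0, 2, 2), (0, 0, 4, 3)]
truth = (0, 0, 4, 4)
chosen = responsible_predictor(preds, truth)
print(chosen)
```

Here the first box has IOU 0.25 and the second 0.75, so the second predictor is made responsible and only it receives the coordinate loss for this object.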

Paragraph 9 (the loss function)

During training we optimize the following, multi-part loss function:

(Discussed in detail later.)
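
The equation itself did not survive in these notes. For reference, the loss as I understand it from the YOLOv1 paper is reproduced below in LaTeX. Here \mathbb{1}_{i}^{\text{obj}} marks cells that contain an object, \mathbb{1}_{ij}^{\text{obj}} marks the j-th predictor in cell i being responsible for the object, C_i is the confidence, and p_i(c) are the class probabilities; hats denote predictions.

```latex
\begin{aligned}
& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The five terms are, in order: box center error, box size error (on square roots), confidence error for boxes with objects, confidence error for boxes without objects (down-weighted by λnoobj), and classification error for cells that contain an object.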

Paragraph 10 (each loss term penalizes one factor on its own)

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier).
(To unpack this: the classification term only penalizes errors in the conditional class probabilities. Whether an object is present at all is penalized separately, by the confidence terms. The five terms of the loss are cleanly separated.)

It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

(So although these quantities are related, each is penalized by its own term in the loss, and the terms do not interfere with one another.)

Paragraph 11 (practical training details)

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012.

When testing on 2012 we also include the VOC 2007 test data for training.

Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.
(Worth noting down.)

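Paragraph 12 (choosing the learning rate)

Training starts with a small learning rate, raises it, and then lowers it again in stages. Starting with a high learning rate tends to make training diverge, because the gradients are unstable early on.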
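
The notes above give only the shape of the schedule. As I recall the paper, the rate is raised slowly from 10^-3 to 10^-2 over the first epochs, held at 10^-2 until epoch 75, then set to 10^-3 for 30 epochs and 10^-4 for the final 30. The sketch below encodes that; the linear ramp and the `warmup_epochs=5` value are my own assumptions, since the paper does not state how long the warm-up lasts.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Piecewise learning-rate schedule over 135 epochs (warm-up length is assumed)."""
    if epoch < warmup_epochs:
        # Assumed linear ramp from 1e-3 up to 1e-2.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2
    if epoch < 105:
        return 1e-3
    return 1e-4


for epoch in (0, 50, 80, 120):
    print(epoch, learning_rate(epoch))
```

The stages add up to the roughly 135 epochs mentioned in the previous paragraph (75 + 30 + 30).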

Paragraph 13 (data augmentation)

To avoid overfitting we use dropout and extensive data augmentation.

A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers.

For data augmentation we introduce random scaling and translations of up to 20% of the original image size.
("Translations" here means shifting the image. This also ties into an important idea in image recognition, translation invariance.)

We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

2.3. Inference

2.3.1 Sentence-by-sentence translation

Paragraph 1 (testing, like training, needs only one pass)

Just like in training, predicting detections for a test image only requires one network evaluation.

On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

(A loose analogy is PyTorch's train and eval modes: the same forward pass is used in both.)

Paragraph 2 (non-maximal suppression)

The grid design enforces spatial diversity in the bounding box predictions.
(Each cell is tied to its own region of the image, so the predicted boxes are spread across the whole image rather than clustering in one place.)

Often it is clear which grid cell an object falls in to and the network only predicts one box for each object.

However, some large objects or objects near the border of multiple cells can be well localized by multiple cells.

Non-maximal suppression can be used to fix these multiple detections.
(Non-maximal suppression keeps the strongest detection and suppresses the rest. For a large object, several grid cells may each believe they can localize it, so the same object is detected several times. We keep the detection with the highest confidence and discard the others that overlap it.)

While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
(mAP stands for mean Average Precision.)

For background, see the companion post: What is mAP?
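
A minimal greedy version of non-maximal suppression can be sketched in pure Python. It assumes detections are (score, box) pairs with boxes as (x_min, y_min, x_max, y_max), and it uses an IOU threshold of 0.5, which is a common default and not a value taken from the paper. In practice the procedure is usually run separately for each class.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box); returns the kept detections, best first."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop every remaining box that overlaps the kept box too strongly.
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),     # overlaps the first box heavily
    (0.7, (20, 20, 30, 30)),   # a separate object
]
kept = non_max_suppression(detections)
print(kept)
```

In this example the second detection overlaps the first with an IOU of about 0.68 and is suppressed, while the third, which does not overlap, is kept.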

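2.3.2 Summary

This section covers:
1. Testing is as simple as training: one pass through the network.
2. Non-maximal suppression: an object may be predicted by several boxes, and the overlapping boxes that do not have the highest confidence are suppressed.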

2.4. Limitations of YOLO

2.4.1 Sentence-by-sentence translation

Paragraph 1 (objects close to each other are hard for the model)

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class.
(In the code, each grid cell has B = 2 sets of (x, y, w, h, c), so in principle it can output two boxes. But there is only one set of 20 class probabilities per cell, so both boxes must belong to the same class.
In practice, a cell therefore usually ends up predicting a single object.
)

This spatial constraint limits the number of nearby objects that our model can predict.
(That is, when several objects lie close together, some of them may be missed.)

Our model struggles with small objects that appear in groups, such as flocks of birds.
("Struggles with" means "has difficulty with". A flock of birds puts many small objects into the same grid cell, but a cell can output at most two boxes of a single class, so most of the birds are missed even though they all belong to one class.)

Paragraph 2 (sensitive to unusual input shapes; coarse features)

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations.

Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.
(This concerns the loss of resolution caused by downsampling. It is probably unrelated to DeepLabv2, which was published after YOLO.)

Paragraph 3 (large versus small bounding boxes)

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes.

A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU.

Our main source of error is incorrect localizations.

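2.4.2 Summary: limitations of the model

The limitations are:

1. Objects close to each other are predicted poorly. This follows from the output structure: each grid cell can hold B bounding boxes, but they must all belong to one class, which is a strong restriction in practice.
2. Because box shapes are learned from data, the model is affected by unusual aspect ratios, for example in stretched images.
3. Large and small bounding boxes differ in how sensitive they are to a deviation of the same size. The authors already compensate for this in the loss function; the limitation stated in the paper is that this compensation is not fully effective, so localization error remains the main source of error.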

3. Comparison to Other Detection Systems

3.1 Sentence-by-sentence translation

Paragraph 1 (earlier systems all split detection into separate stages)

Object detection is a core problem in computer vision.
(Computer vision is what we usually abbreviate as CV.)

Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]).
(The works cited in the parentheses all take this approach.)

Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space.

These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39].

We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.
(In other words, this section compares YOLO with these systems to see what is similar and what differs.)

Paragraph 2 (DPM)

Deformable parts models (DPM) use a sliding window approach to object detection [10].

DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc.
Our system replaces all of these disparate parts with a single convolutional neural network.
The network performs feature extraction, bounding box prediction, nonmaximal suppression, and contextual reasoning all concurrently.
Instead of static features, the network trains the features in-line and optimizes them for the detection task.
Our unified architecture leads to a faster, more accurate model than DPM.
(These are all earlier works. I will come back to them later and focus on YOLO itself first.)

4. Experiments

4.0 Preface

4.0.1 Sentence-by-sentence translation

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007.

To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14].

Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost.

We also present VOC 2012 results and compare mAP to current state-of-the-art methods.

Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

4.0.2 Summary

1. The experiments use the VOC 2007 and VOC 2012 datasets.
2. Since YOLO does not match state-of-the-art accuracy, the authors frame the result shrewdly: YOLO makes fewer errors on background, which makes it more useful in practice.

4.1 Comparison to Other Real-Time Systems (compared with recent work)

Paragraph 1 (few earlier systems are truly real-time; systems that are not real-time are also compared, to judge whether the mAP given up for speed is worthwhile)

Many research efforts in object detection focus on making standard detection pipelines fast. [5] [38] [31] [14] [17] [28]
(Earlier work sped up existing models without breaking away from the original architecture.)

However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31].

We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz.
(30Hz and 100Hz are the speeds of the two DPM variants, that is, 30 and 100 frames per second. YOLO is compared against both.)

While the other efforts don’t reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

Paragraph 2 (Fast YOLO shows how fast YOLO is)

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector.

With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection.

YOLO pushes mAP to 63.4% while still maintaining real-time performance.

Paragraph 3 (the VGG-16 variant is more accurate but slower; the standard YOLO network is lighter than VGG-16)

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO.

It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.
("The rest of the paper" is singular: it refers to this paper.)

(A look at the YOLOv1 code shows the same building blocks as VGG-16: convolution, normalization, activation and pooling. The standard YOLO network is cheaper to run than VGG-16, even though it has more convolutional layers, 24 against 13.)

Paragraph 4 (R-CNN minus R falls short)

R-CNN minus R replaces Selective Search with static bounding box proposals [20].

While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.

Paragraph 5 (Fast R-CNN speeds up classification but is still far from real-time)

Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals.

Thus it has high mAP but at 0.5 fps it is still far from realtime.

Paragraph 6 (Faster R-CNN: the VGG-16 version is more accurate than YOLO but 6 times slower; the Zeiler-Fergus version is 2.5 times slower and less accurate)

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8]

In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps.

The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO.

The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.

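4.2. VOC 2007 Error Analysis

Paragraph 1

4.6 Summary

The experiments fall into several parts:
1. Compared with recent work, YOLO may give up some accuracy, but it is fast.

5. Apology

Other urgent projects have come up, so this study is paused for now and the reading of YOLOv1 is not finished.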
