10 Papers You Should Read to Understand Image Classification in the Deep Learning Era
Foreword
Computer vision is the subject of converting images and videos into machine-understandable signals. With these signals, programmers can further control the behavior of a machine based on this high-level understanding. Among the many computer vision tasks, image classification is one of the most fundamental. Not only is it used in lots of real products like Google Photos tagging and AI content moderation, it also opens the door to many more advanced vision tasks, such as object detection and video understanding. Because of how quickly this field has changed since the breakthrough of deep learning, beginners often find it overwhelming to learn. Unlike typical software engineering subjects, there are not many great books about image classification with DCNNs, and the best way to understand this field is through reading academic papers. But which papers should you read? Where do you start? In this article, I'm going to introduce the 10 best papers for beginners to read. Through these papers, we can see how this field evolved and how researchers built new ideas on top of previous research outcomes. Even if you have already worked in this area for a while, it is still helpful for sorting out the big picture. So, let's get started.
1998: LeNet
Gradient-based Learning Applied to Document Recognition
Introduced in 1998, LeNet set the foundation for future image classification research using convolutional neural networks. Many classical CNN techniques, such as pooling layers, fully connected layers, padding, and activation layers, are used to extract features and make a classification. With a mean squared error loss function and 20 epochs of training, the network can achieve 99.05% accuracy on the MNIST test set. Even after 20 years, many state-of-the-art classification networks still follow this pattern in general.
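To make the pattern concrete, here is a minimal PyTorch sketch of a LeNet-style network. It only approximates the general conv → pool → conv → pool → dense pattern described above; it is not a faithful reproduction of the 1998 architecture (the original used different sub-sampling, activation, and output layers).

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    """A LeNet-style network: conv -> pool -> conv -> pool -> dense layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Usage: a batch of eight 32x32 grayscale digits -> logits of shape (8, 10)
logits = LeNetStyle()(torch.randn(8, 1, 32, 32))
```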
2012: AlexNet
ImageNet Classification with Deep Convolutional Neural Networks
Although LeNet achieved a great result and showed the potential of CNNs, development in this area stagnated for a decade due to limited computing power and limited data. It looked like CNNs could only solve easy tasks such as digit recognition; for more complex features like faces and objects, a Haar cascade or SIFT feature extractor with an SVM classifier was the preferred approach.
However, at the 2012 ImageNet Large Scale Visual Recognition Challenge, Alex Krizhevsky proposed a CNN-based solution and drastically increased ImageNet test set top-5 accuracy from 73.8% to 84.7%. Their approach inherits the multi-layer CNN idea from LeNet but increases the size of the network considerably. As you can see from the diagram above, the input is now 224x224 compared with LeNet's 32x32, and many convolution kernels have 192 channels compared with LeNet's 6. Although the design isn't changed much, with hundreds of times more parameters, the network's ability to capture and represent complex features also improved hundreds of times. To train such a big model, Alex used two GTX 580 GPUs with 3GB of RAM each, which pioneered the trend of GPU training. The use of the ReLU non-linearity also helped to reduce the computation cost.
In addition to bringing many more parameters to the network, it also addressed the overfitting issue that comes with a larger network by using a dropout layer. Its Local Response Normalization method didn't gain much popularity afterward, but it inspired other important normalization techniques, such as BatchNorm, that combat the gradient saturation issue. To sum up, AlexNet defined the de facto classification network framework for the next 10 years: a combination of convolution, ReLU non-linear activation, max pooling, and dense layers.
2014: VGG
Very Deep Convolutional Networks for Large-Scale Image Recognition
With such great success in using CNNs for visual recognition, the entire research community blew up, and everyone started to look into why this neural network works so well. For example, in "Visualizing and Understanding Convolutional Networks" from 2013, Matthew Zeiler discussed how CNNs pick up features and visualized the intermediate representations. And suddenly everyone started to realize that CNNs were the future of computer vision, starting from 2014. Among all those immediate followers, the VGG network from the Visual Geometry Group is the most eye-catching one. It achieved a remarkable 93.2% top-5 accuracy and 76.3% top-1 accuracy on the ImageNet test set.
Following AlexNet's design, the VGG network has two major updates: 1) VGG not only used a wide network like AlexNet, but also a deeper one: VGG-19 has 19 convolution layers, compared with 5 in AlexNet. 2) VGG also demonstrated that a few small 3x3 convolution filters can replace a single 7x7 or even 11x11 filter from AlexNet, achieving better performance while reducing the computation cost. Because of this elegant design, VGG also became the backbone network of many pioneering networks in other computer vision tasks, such as FCN for semantic segmentation and Faster R-CNN for object detection.
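A quick way to see why stacked 3x3 filters are attractive: three 3x3 convolutions cover the same 7x7 receptive field as a single 7x7 convolution, but with far fewer parameters and two extra non-linearities in between. The channel count below is an arbitrary example, not a figure from the paper.

```python
import torch.nn as nn

def param_count(module):
    return sum(p.numel() for p in module.parameters())

channels = 256
# One 7x7 convolution vs. a stack of three 3x3 convolutions:
# both cover a 7x7 receptive field, but the stack is cheaper.
conv7 = nn.Conv2d(channels, channels, kernel_size=7, padding=3)
stack3 = nn.Sequential(
    nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
    nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
    nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
)
print(param_count(conv7))   # ~3.2M parameters
print(param_count(stack3))  # ~1.8M parameters
```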
With a deeper network, gradient vanishing from multi-layer back-propagation becomes a bigger problem. To deal with it, VGG also discussed the importance of pre-training and weight initialization. This problem limits researchers from adding more and more layers; otherwise, the network would be really hard to converge. But we will see a better solution for this two years later.
2014: GoogLeNet
Going Deeper with Convolutions
VGG has a good-looking and easy-to-understand structure, but its performance wasn't the best among all the finalists of the ImageNet 2014 competition. GoogLeNet, aka InceptionV1, won the final prize. Just like VGG, one of the main contributions of GoogLeNet is pushing the limit of network depth with a 22-layer structure. This demonstrated again that going deeper and wider is indeed the right direction to improve accuracy.
Unlike VGG, GoogLeNet tried to address the computation and diminishing-gradient issues head-on, instead of proposing a workaround with a better pre-training scheme and weight initialization.
First, it explored the idea of asymmetric network design with a module called Inception (see the diagram above). Ideally, they would have liked to pursue sparse convolution and dense layers to improve feature efficiency, but modern hardware design wasn't tailored to this case. So they believed that sparsity at the network topology level could also help the fusion of features while leveraging existing hardware capabilities.
Second, it attacks the high computation cost problem by borrowing an idea from a paper called "Network in Network". Basically, a 1x1 convolution filter is introduced to reduce the dimension of features before they go through heavy computing operations like a 5x5 convolution kernel. This structure was later called the "bottleneck" and is widely used in many subsequent networks (a small sketch of such a module appears after the third point below). Similar to "Network in Network", it also used an average pooling layer to replace the final fully connected layer to further reduce cost.
Third, to help gradients flow to deeper layers, GoogLeNet also used supervision on some intermediate layer outputs, or auxiliary outputs. This design wasn't very popular in later image classification networks because of its complexity, but it became more popular in other areas of computer vision, such as the Hourglass network in pose estimation.
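To make the first two points concrete, here is an illustrative PyTorch sketch of an Inception-style module with 1x1 bottleneck convolutions in front of the more expensive 3x3 and 5x5 branches. The branch widths are an example configuration chosen for illustration, not necessarily the exact GoogLeNet settings.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Illustrative Inception-style module: parallel branches with 1x1
    "bottleneck" convolutions reducing channels before the expensive
    3x3/5x5 kernels, concatenated along the channel axis."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, 1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 96, 1), nn.ReLU(),   # 1x1 bottleneck
            nn.Conv2d(96, 128, 3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, 1), nn.ReLU(),   # 1x1 bottleneck
            nn.Conv2d(16, 32, 5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# A 192-channel 28x28 feature map goes in, a 256-channel map comes out.
out = InceptionBlock(192)(torch.randn(1, 192, 28, 28))
```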
As a follow-up, this Google team wrote more papers for the Inception series. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" stands for InceptionV2. "Rethinking the Inception Architecture for Computer Vision" in 2015 stands for InceptionV3. And "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning" in 2016 stands for InceptionV4. Each paper added more improvements over the original Inception network and achieved a better result.
2015: Batch Normalization
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
The Inception network helped researchers reach superhuman accuracy on the ImageNet dataset. However, as a statistical learning method, a CNN is very much constrained by the statistical nature of a specific training dataset. Therefore, to achieve better accuracy, we usually need to pre-calculate the mean and standard deviation of the entire dataset and use them to normalize the input first, to ensure most of the layer inputs in the network stay in a close range, which translates to better activation responsiveness. This approximate approach is very cumbersome and sometimes doesn't work at all for a new network structure or a new dataset, so deep learning models were still viewed as difficult to train. To address this problem, Sergey Ioffe and Christian Szegedy (the creator of GoogLeNet) decided to invent something smarter, called Batch Normalization.
The idea of batch normalization is not hard: we can use the statistics of a series of mini-batches to approximate the statistics of the whole dataset, as long as we train for a long enough time. Also, instead of manually calculating the statistics, we can introduce two more learnable parameters, "scale" and "shift", to let the network learn how to normalize each layer by itself.
The diagram above shows the process of calculating the batch normalization values. As we can see, we take the mean of the whole mini-batch and calculate the variance as well. Next, we normalize the input with this mini-batch mean and variance. Finally, with the scale and shift parameters, the network learns to adapt the batch-normalized result to best fit the following layers, usually a ReLU. One caveat is that we don't have mini-batch information during inference, so a workaround is to calculate a moving-average mean and variance during training and then use these moving averages in the inference path. This little innovation is so impactful that all later networks started to use it right away.
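A minimal NumPy sketch of the forward pass described above: mini-batch statistics plus the learnable scale/shift during training, and moving averages at inference time. The momentum value is an arbitrary choice for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    if training:
        mean = x.mean(axis=0)                # mini-batch mean
        var = x.var(axis=0)                  # mini-batch variance
        # track moving averages for use at inference time
        running_mean[:] = (1 - momentum) * running_mean + momentum * mean
        running_var[:] = (1 - momentum) * running_var + momentum * var
    else:
        mean, var = running_mean, running_var  # no mini-batch at inference
    x_hat = (x - mean) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta                # learnable scale and shift

x = np.random.randn(64, 100)                   # a mini-batch of 64 samples
gamma, beta = np.ones(100), np.zeros(100)
running_mean, running_var = np.zeros(100), np.ones(100)
y = batch_norm(x, gamma, beta, running_mean, running_var)
```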
2015: ResNet
Deep Residual Learning for Image Recognition
2015 may be the best year for computer vision in a decade; we've seen so many great ideas popping out not only in image classification but in all sorts of computer vision tasks such as object detection, semantic segmentation, etc. The biggest advancement of 2015 belongs to a new network called ResNet, or residual networks, proposed by a group of Chinese researchers from Microsoft Research Asia.
As we discussed earlier for the VGG network, the biggest hurdle to getting even deeper is the gradient vanishing issue: derivatives become smaller and smaller when back-propagated through deeper layers, and eventually reach a point that modern computer architecture can't really represent meaningfully. GoogLeNet tried to attack this with auxiliary supervision and the asymmetric Inception module, but it only alleviates the problem to a small extent. If we want to use 50 or even 100 layers, is there a better way for the gradient to flow through the network? The answer from ResNet is to use a residual module.
ResNet added an identity shortcut to the output, so that each residual module can at least predict whatever the input is, without getting lost in the wild. Even more importantly, instead of hoping each layer fits the desired feature mapping directly, the residual module tries to learn the difference between output and input, which makes the task much easier because the information gain needed is smaller. Imagine that you are learning mathematics: for each new problem, you are given the solution to a similar problem, so all you need to do is extend this solution and make it work. This is much easier than thinking of a brand new solution for every problem you run into. Or as Newton said, we can stand on the shoulders of giants, and the identity input is that giant for the residual module.
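In code, the idea boils down to a single addition: the convolutional branch learns the residual F(x), and the shortcut adds the input back so the output is F(x) + x. A minimal PyTorch sketch (channel count and layer details are illustrative, following the basic-block pattern):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the layers learn F(x), and the identity
    shortcut adds x back, so the output is F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # F(x) + identity shortcut

# Output keeps the same shape as the input feature map.
y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
```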
In addition to identity mapping, ResNet also borrowed the bottleneck and Batch Normalization from the Inception networks. Eventually, it managed to build a network with 152 convolution layers and achieved 80.72% top-1 accuracy on ImageNet. The residual approach also became a default option for many later networks, such as Xception, Darknet, etc. Also, thanks to its simple and beautiful design, it's still widely used in many production visual recognition systems today.
Many more variants came out following the hype of the residual network. In "Identity Mappings in Deep Residual Networks", the original authors of ResNet tried to put activation before the residual module and achieved a better result; this design is called ResNetV2. Also, in the 2016 paper "Aggregated Residual Transformations for Deep Neural Networks", researchers proposed ResNeXt, which added parallel branches to residual modules to aggregate the outputs of different transforms.
2016: Xception
Xception: Deep Learning with Depthwise Separable Convolutions
With the release of ResNet, it looked like most of the low-hanging fruit in image classification had already been grabbed. Researchers started to think about the internal mechanism behind the magic of CNNs. Since cross-channel convolution usually introduces a ton of parameters, the Xception network chose to investigate this operation to understand the full picture of its effect.
As its name suggests, Xception originates from the Inception network. In the Inception module, multiple branches of different transformations are aggregated together to achieve topological sparsity. But why does this sparsity work? The author of Xception, who is also the author of the Keras framework, extended this idea to an extreme case where each 3x3 convolution filter corresponds to one output channel before a final concatenation. In this case, these parallel convolution kernels actually form a new operation called depthwise convolution.
Depth-wise Convolution and Depth-wise Separable Convolution
As shown in the diagram above, unlike traditional convolution where all channels are included in one computation, depthwise convolution only computes the convolution for each channel separately and then concatenates the outputs together. This cuts the feature exchange among channels, but also removes a lot of connections, hence resulting in a layer with fewer parameters. However, this operation outputs the same number of channels as the input (or a smaller number of channels if you group two or more channels together). Therefore, once the channel outputs are merged, we need another regular 1x1 filter, or pointwise convolution, to increase or reduce the number of channels, just like a regular convolution does.
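In PyTorch, the depthwise step can be expressed as a grouped convolution with groups equal to the number of input channels, followed by a 1x1 pointwise convolution. The sketch below also compares parameter counts against a regular 3x3 convolution; the channel numbers are arbitrary.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one 3x3 filter per input channel, groups=in_ch)
    followed by a 1x1 pointwise conv that mixes channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def count_params(m):
    return sum(p.numel() for p in m.parameters())

regular = nn.Conv2d(128, 256, 3, padding=1)
separable = DepthwiseSeparableConv(128, 256)
print(count_params(regular), count_params(separable))  # ~295K vs ~34K parameters
```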
This idea is not originally from Xception. It's described in a paper called "Learning visual representations at scale" and was also used occasionally in InceptionV2. Xception took a step further and replaced almost all convolutions with this new type. The experiment results turned out to be pretty good: it surpassed ResNet and InceptionV3 and became the new SOTA method for image classification. This also proved that the mapping of cross-channel correlations and spatial correlations in a CNN can be entirely decoupled. In addition, sharing the same virtue as ResNet, Xception has a simple and beautiful design as well, so its idea was used in lots of later research such as MobileNet, DeepLabV3, etc.
2017: MobileNet
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Xception achieved 79% top-1 accuracy and 94.5% top-5 accuracy on ImageNet, but that's only a 0.8% and 0.4% improvement respectively compared with the previous SOTA, InceptionV3. The marginal gain of a new image classification network was becoming smaller, so researchers started to shift their focus to other areas. MobileNet led a significant push for image classification in resource-constrained environments.
MobileNet module from "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications"
Similar to Xception, MobileNet used the same depthwise separable convolution module shown above, with an emphasis on high efficiency and fewer parameters.
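The original figure with the cost ratio is not reproduced here. Assuming it matches the expression given in the MobileNet paper, the ratio of the depthwise separable cost to the regular convolution cost is:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$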
The numerator in the above formula is the total number of parameters required by a depthwise separable convolution, and the denominator is the total number of parameters of an equivalent regular convolution. Here D[K] is the size of the convolution kernel, D[F] is the size of the feature map, M is the number of input channels, and N is the number of output channels. Since we separated the calculation of channel and spatial features, we can turn a multiplication into an addition, which is a magnitude smaller. Even better, as we can see from this ratio, the larger the number of output channels, the more calculation we save by using this new convolution.
Another contribution from MobileNet is the width and resolution multipliers. The MobileNet team wanted to find a canonical way to shrink the model size for mobile devices, and the most intuitive way is to reduce the number of input and output channels as well as the input image resolution. To control this behavior, a ratio alpha is multiplied with the channels, and a ratio rho is multiplied with the input resolution (which also affects the feature map size). So the total number of parameters can be represented by the following formula:
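The original formula figure is not shown here; assuming it follows the form in the MobileNet paper, the cost with width multiplier α and resolution multiplier ρ becomes:

$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$$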
Although this change seems naive in terms of innovation, it has great engineering value, because it's the first time researchers concluded a canonical approach for adjusting the network to different resource constraints. Also, it kind of summarized the ultimate solution for improving a neural network: wider and higher-resolution input leads to better accuracy, while thinner and lower-resolution input leads to poorer accuracy.
Later, in 2018 and 2019, the MobileNet team also released "MobileNetV2: Inverted Residuals and Linear Bottlenecks" and "Searching for MobileNetV3". In MobileNetV2, an inverted residual bottleneck structure is used. And in MobileNetV3, the team started to search for the optimal architecture combination using Neural Architecture Search technology, which we will cover next.
2017: NASNet
Learning Transferable Architectures for Scalable Image Recognition
Just like image classification for resource-constrained environments, neural architecture search is another field that emerged around 2017. With ResNet, Inception, and Xception, it seemed like we had reached an optimal network topology that humans can understand and design, but what if there's a better and much more complex combination that far exceeds human imagination? A paper in 2016 called "Neural Architecture Search with Reinforcement Learning" proposed the idea of searching for the optimal combination within a pre-defined search space using reinforcement learning. As we know, reinforcement learning is a method to find the best solution given a clear goal and a reward for the search agent. However, limited by computing power, this paper only discussed the application on the small CIFAR dataset.
With the goal of finding an optimal structure for a large dataset like ImageNet, NASNet made a search space tailored for ImageNet. It hopes to design a special search space so that the result searched on CIFAR can also work well on ImageNet. First, NASNet assumes that common hand-crafted modules from good networks like ResNet and Xception are still useful when searching. So instead of searching for random connections and operations, NASNet searches for combinations of these modules, which have already been proved useful on ImageNet. Second, the actual searching is still performed on the CIFAR dataset with 32x32 resolution, so NASNet only searches for modules that are not affected by the input size. To make the second point work, NASNet predefined two types of module templates: Reduction and Normal. A Reduction cell can have a reduced feature map size compared with its input, while a Normal cell keeps it the same.
Although NASNet achieves better metrics than manually designed networks, it also suffers from a few drawbacks. The cost of searching for an optimal structure is very high, affordable only by big companies like Google and Facebook. Also, the final structure doesn't make too much sense to humans, and is hence harder to maintain and improve in a production environment. Later, in 2018, "MnasNet: Platform-Aware Neural Architecture Search for Mobile" further extended this NASNet idea by limiting the search step with a pre-defined chained-blocks structure. Also, by defining a weight factor, MnasNet gave a more systematic way to search for a model given specific resource constraints, instead of just evaluating based on FLOPs.
2019: EfficientNet
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
In 2019, it looked like there were no more exciting ideas for supervised image classification with CNNs. A drastic change in network structure usually offers only a small accuracy improvement. Even worse, when the same network is applied to different datasets and tasks, previously claimed tricks don't seem to work, which led to critiques of whether those improvements were just overfitting on the ImageNet dataset. On the other side, there's one trick that never fails our expectations: using higher-resolution input, adding more channels to convolution layers, and adding more layers. Although it's a very brute-force approach, there seems to be a principled way to scale the network on demand. MobileNetV1 sort of suggested this in 2017, but the focus shifted to better network design afterward.
After NASNet and MnasNet, researchers realized that even with help from a computer, a change in architecture doesn't yield that much benefit. So they started to fall back to scaling the network. EfficientNet is built on top of this assumption. On one hand, it uses the optimal building block from MnasNet to make sure it has a good foundation to start with. On the other hand, it defines three parameters alpha, beta, and gamma to control the depth, width, and resolution of the network correspondingly. By doing so, even without a large GPU pool to search for an optimal structure, engineers can still rely on these principled parameters to tune the network based on their different requirements. In the end, EfficientNet provided 8 different variants with different width, depth, and resolution ratios and got good performance for both small and big models. In other words, if you want high accuracy, go for the 600x600, 66M-parameter EfficientNet-B7. If you want low latency and a smaller model, go for the 224x224, 5.3M-parameter EfficientNet-B0. Problem solved.
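For reference, the compound scaling rule in the EfficientNet paper ties all three factors to a single coefficient φ; restated here from the paper, with depth d, width w, and resolution r:

$$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}, \qquad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \ \ \alpha, \beta, \gamma \ge 1$$

Intuitively, increasing φ by one roughly doubles the total FLOPs while splitting the extra budget between depth, width, and resolution in a fixed proportion.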
Read More
If you finish reading the above 10 papers, you should have a pretty good grasp of the history of image classification with CNNs. If you would like to keep learning this area, I've also listed some other interesting papers to read. Although not included in the top-10 list, these papers are all famous in their own areas and inspired many other researchers in the world.
2014: SPPNet
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
SPPNet borrows the idea of the feature pyramid from traditional computer vision feature extraction. This pyramid forms a bag of words of features at different scales, so it can adapt to different input sizes and get rid of the fixed-size fully connected layer. This idea further inspired the ASPP module of DeepLab, as well as FPN for object detection.
2016: DenseNet
Densely Connected Convolutional Networks
DenseNet from Cornell further extends the idea of ResNet. It not only provides skip connections between layers, but also adds skip connections from all previous layers.
2017: SENet
Squeeze-and-Excitation Networks
The Xception network demonstrated that cross-channel correlation doesn't have much to do with spatial correlation. However, as the champion of the last ImageNet competition, SENet devised a Squeeze-and-Excitation block and told a different story. The SE block first squeezes all channels into fewer channels using global pooling, applies a fully connected transform, and then "excites" them back to the original number of channels using another fully connected layer. So essentially, the FC layers help the network learn attention over the input feature map.
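A minimal PyTorch sketch of the squeeze-and-excitation operation described above; the reduction ratio of 16 follows the paper's default, and other details are simplified for illustration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-pool each channel to one number
    ("squeeze"), pass it through a small FC bottleneck, then rescale the
    original feature map channel-wise ("excite")."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: (b, c)
        w = self.fc(w).view(b, c, 1, 1)  # excite: per-channel attention weights
        return x * w                     # reweight the input feature map

# Same output shape as the input, with channels reweighted.
y = SEBlock(256)(torch.randn(2, 256, 14, 14))
```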
2017: ShuffleNet
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
Built on top of MobileNetV2's inverted bottleneck module, ShuffleNet argues that the pointwise convolution in depthwise separable convolution sacrifices accuracy in exchange for less computation. To compensate for this, ShuffleNet added an additional channel shuffle operation to make sure the pointwise convolution is not always applied to the same "points". In ShuffleNetV2, this channel shuffle mechanism is further extended to the ResNet identity mapping branch as well, so that part of the identity features is also used for shuffling.
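The channel shuffle operation itself is just a reshape–transpose–reshape over the channel axis; a short sketch:

```python
import torch

def channel_shuffle(x, groups):
    """Rearrange channels so that subsequent group/pointwise convolutions
    see channels coming from every group, not always the same ones."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(b, c, h, w)

# 8 channels split into 2 groups and interleaved; shape is unchanged.
y = channel_shuffle(torch.randn(1, 8, 4, 4), groups=2)
```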
2018: Bag of Tricks
Bag of Tricks for Image Classification with Convolutional Neural Networks
Bag of Tricks focuses on common tricks used in the image classification field. It serves as a good reference for engineers when they need to improve benchmark performance. Interestingly, tricks such as mixup augmentation and the cosine learning rate schedule can sometimes achieve a far better improvement than a new network architecture.
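To give a flavor of what such tricks look like in practice, here are small NumPy sketches of mixup and a cosine learning-rate schedule. The hyperparameter values are illustrative, not the paper's exact recipe, and the labels are assumed to be one-hot vectors.

```python
import numpy as np

def mixup(x, y, alpha=0.2):
    """Mixup augmentation: blend random pairs of images and their labels."""
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(len(x))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y + (1 - lam) * y[idx]   # y as one-hot / soft labels
    return x_mix, y_mix

def cosine_lr(step, total_steps, base_lr=0.1):
    """Cosine learning-rate schedule decaying from base_lr to 0."""
    return 0.5 * base_lr * (1 + np.cos(np.pi * step / total_steps))
```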
Conclusion
With the release of EfficientNet, it looks like the ImageNet classification benchmark is coming to an end. With the existing deep learning approach, there will never be a day we can reach 99.999% accuracy on ImageNet unless another paradigm shift happens. Hence, researchers are actively looking at some novel areas, such as self-supervised or semi-supervised learning for large-scale visual recognition. In the meantime, with existing methods, it has become more a question for engineers and entrepreneurs of finding real-world applications for this imperfect technology. In the future, I will also write a survey analyzing the real-world computer vision applications powered by image classification, so please stay tuned! If you think there are other important papers to read for image classification, please leave a comment below and let us know.
Originally published at http://yanjia.li on July 31, 2020
Translated from: https://towardsdatascience.com/10-papers-you-should-read-to-understand-image-classification-in-the-deep-learning-era-4b9d792f45a7