Deep Learning Object Detection in Detail, with a Mask R-CNN Example
This article walks through the evolution from R-CNN to the end-to-end Faster R-CNN, and then to a representative follow-up, the Mask R-CNN model. How do these algorithms become faster and stronger?
How can detection be made faster? There are two main lines of thought:
Make the accurate methods faster!
As discussed earlier, the key idea from R-CNN to Faster R-CNN is to avoid wasted feature computation: move the ConvNet feature extraction to the front so it runs only once, and push the per-region operations to the back. We also noted that Faster R-CNN still runs part of the ConvNet after the ROI step. Can the computation above the ROI be moved forward as well? R-FCN (Region-based Fully Convolutional Networks) is built on exactly this idea and achieves it, making it faster still, in some sense an even faster Faster R-CNN.
R-FCN
Make the fast methods more accurate!
We mentioned earlier that OverFeat performs only moderately and fails on many overlapping objects. How can the regression-based approach be pushed toward the accuracy of region proposals? YOLO brings in the divide-and-conquer and IOU ideas, and SSD brings in multi-scale Anchor Boxes.
Besides speed, what else matters? Naturally, making the models better and stronger.
Faster R-CNN has three main components: the RPN for region proposals; ROI Pooling, which works like a feature pyramid to handle large differences in scale and overlap; and the classification and box regression heads trained with log loss plus smoothed L1 loss to refine localization. How can these be made better and stronger?
Can we do better than the RPN?
We noted earlier that the RPN can match the quality of Selective Search. If we want to do even better, how?
AttractioNet exploits NMS (non-maximum suppression), while AttentionNet uses a weak attention-narrowing mechanism.
Can we do better than ROI Pooling?
We noted earlier that ROI Pooling achieves an SPM-like effect comparable to HOG pyramids and the spatial constraints of DPM. If we want to do even better, how? ION (Inside-Outside Net) introduces four-direction context, and FPN introduces the Feature Pyramid Network.
Can we do more than ROI Pooling?
ROI Pooling is built on ROIs, which correspond to region proposals. How can we go further and align them down to the pixel level?
Mask R-CNN proposes ROI Align and, in the loss, adds a pixel-level Mask Branch term on top of the classification and box regression terms.
So what exactly are FCN (Fully Convolutional Networks), IOU, NMS, weak attention narrowing, ION, FPN, ROI Align, and the Mask Branch? Once you understand these ideas, the figure below will no longer look unfamiliar!
R-CNN, OverFeat, DetectionNet, DeepMultiBox, SPP-net, Fast R-CNN, MR-CNN, SSD, YOLO, YOLOv2, G-CNN, AttractioNet, Mask R-CNN, R-FCN, RPN, FPN, Faster R-CNN…
Now, on to the second half of the journey!
R-FCN
As mentioned earlier, when Faster R-CNN connected the front and back ends into an end-to-end model, the ConvNet backbone was also switched to VGG-16. In GoogLeNet and ResNet, however, there is only one fully connected (FC) layer, the last one, and it exists solely to feed the softmax classifier.
So to use GoogLeNet or ResNet inside Faster R-CNN, that last FC layer has to go (it is only there for classification), and a new head network is needed that supports both classification and box regression.
Looking again at how ROI Pooling is used, the FC layers after ROI Pooling would then have to become convolutional layers as well. But that leaves the convolutional stack cut in two by the ROI Pooling layer, and the ConvNet layers after ROI Pooling get recomputed for every ROI.
One question: can we simply throw away the ConvNet that follows ROI Pooling? No, the accuracy would drop. Why? As explained when Fast R-CNN inherited SPPNet's SPM idea and turned it into ROI Pooling, ROI Pooling only fixes the finest level of the spatial division; the coarser-grained levels are effectively provided by the FC layers that follow. Remove them and you lose the pyramid structure, or equivalently the depth.
So how can the convolution after the ROI also be moved to the front? That is exactly what R-FCN solves: keep the spatial constraints while still providing some feature hierarchy. R-FCN introduces Position-Sensitive ROI Pooling.
The idea of Position-Sensitive ROI Pooling is precisely to parallelize over positions, an attention mechanism that bakes in spatial information: each small bin draws from a ConvNet feature map bound to one specific relative position.
Once features are bound to positions, the computation changes from a single center-point view into a set of feature maps, each looking at the object from a different sub-region (top, bottom, left, right), which are then combined, implicitly encoding the spatial layout. In other words, first inspect the mountain from each side, then stitch the views together to check whether it is really the same mountain. The position-specific features are gathered, assembled into position-confirmed features, and pooled; the pooling acts as a vote.
Figure 3: Visualization of R-FCN (k×k = 3×3) for the person category.
The net effect is that the feature computation happens up front, while the position information is stitched together and voted on at the very end, instead of first cropping features by position, fusing the position-aware features, and then classifying and regressing. Here the vote is taken directly over positions. Note that PS ROI Pooling and ROI Pooling are not the same pooling operation.
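To make this concrete, here is a minimal NumPy sketch of position-sensitive ROI pooling. It is a simplification under several assumptions: a single ROI, one score map per relative position rather than the k²·(C+1) maps of the real R-FCN head, and integer bin boundaries.

import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    # score_maps: (k*k, H, W), one map per relative position
    #             (top-left, top-center, ..., bottom-right).
    # roi: (x1, y1, x2, y2) in feature-map coordinates.
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / k
    bin_h = (y2 - y1) / k
    votes = np.zeros((k, k))
    for i in range(k):          # row of the sub-region (top ... bottom)
        for j in range(k):      # column of the sub-region (left ... right)
            ys = int(round(y1 + i * bin_h))
            ye = max(ys + 1, int(round(y1 + (i + 1) * bin_h)))
            xs = int(round(x1 + j * bin_w))
            xe = max(xs + 1, int(round(x1 + (j + 1) * bin_w)))
            # the (i, j) bin only looks at the (i, j)-th position-sensitive map
            votes[i, j] = score_maps[i * k + j, ys:ye, xs:xe].mean()
    return votes.mean()         # pooling as voting: average the k*k position scores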
R-FCN advantages:
1) Makes the link between the speed gain and ConvNet feature sharing explicit.
2) Computes position-specific, attention-like features in parallel and then uses pooling to vote, removing the heavy computation that used to follow ROI Pooling.
3) A recommended choice when balancing speed against accuracy.
R-FCN problems:
1) Still cannot reach real-time video rates (24 frames per second).
2) Does not address pixel-level instance segmentation.
YOLO
As mentioned earlier, one big reason OverFeat underperforms is that it has no region proposal mechanism dedicated to improving recall.
And an important reason Faster R-CNN, which does have RPN proposals, is slow is that the RPN alone costs roughly as much computation as all of OverFeat; hence it is a two-stage model.
OverFeat started the single-stage, end-to-end story, but its accuracy was poor. Without a region proposal mechanism, relying only on classification and regression, how can recall be improved?
How can the sliding-window approach be improved?
1) Divide and conquer for class prediction
2) Divide and conquer for box prediction
3) Merge the boxes across classes
This raises the question of how to choose among boxes, which is where IOU (intersection over union) comes in. There are two steps (a minimal IOU sketch follows the list):
1) First predict boxes per class; for example, with 3 objects (dog, bicycle, car) there will be 3 corresponding boxes.
2) Which box should an object be assigned to? The ratio of the intersection to the union decides which box to use.
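For reference, a minimal IOU helper; the (x1, y1, x2, y2) corner format is an assumption here, since box conventions vary.

def iou(box_a, box_b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0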
An error analysis on the VOC 2007 data leads to the following categories:
Correct: correct class and IOU > 0.5
Localization: correct class, 0.1 < IOU < 0.5
Similar: class is similar, IOU > 0.1
Other: class is wrong, IOU > 0.1
Background: IOU < 0.1 for any object
YOLO's loss function therefore accounts for (see the sketch after this list):
(1) Box regression
(2) Whether an object is present
(3) Which object class it is
(4) The predictor most responsible for the object in a region
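For intuition only, here is a rough NumPy sketch of how those four terms can be assembled, in the spirit of the YOLOv1 loss. It assumes targets are already matched to the responsible grid cell and predictor, and it omits details such as the square-root width/height encoding; treat it as an illustration, not the exact published loss.

import numpy as np

def yolo_loss_sketch(pred_box, true_box, pred_conf, has_obj, pred_cls, true_cls,
                     lambda_coord=5.0, lambda_noobj=0.5):
    # pred_box, true_box: (N, 4) boxes for the responsible predictors
    # pred_conf: (N,) objectness scores; has_obj: (N,) 1 if an object is present
    # pred_cls: (N, C) class probabilities; true_cls: (N,) class indices
    # (1) box regression, only where an object is present
    coord = lambda_coord * np.sum(has_obj[:, None] * (pred_box - true_box) ** 2)
    # (2) objectness: push confidence toward 1 for objects, toward 0 otherwise
    obj = np.sum(has_obj * (pred_conf - 1.0) ** 2)
    noobj = lambda_noobj * np.sum((1 - has_obj) * pred_conf ** 2)
    # (3) which class, only where an object is present
    onehot = np.eye(pred_cls.shape[1])[true_cls]
    cls = np.sum(has_obj[:, None] * (pred_cls - onehot) ** 2)
    # (4) "responsibility" is handled by the matching assumed above
    return coord + obj + noobj + cls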
As noted earlier, Faster R-CNN is already fast but not real time, since video needs at least 24 frames per second. YOLO is fast, so it is a natural fit for video, and for video it can be optimized further into the even faster Fast YOLO. Faster through sharing: the class probability map is shared and corrected rather than relearned from scratch, hence the extra speed.
YOLO advantages:
1) A typical regression-plus-classification model built on a single CNN
2) The divide-and-conquer (grid) idea works well
3) Good real-time performance, essentially reaching the 24 fps mark
4) Far fewer boxes than Selective Search (region proposals put more weight on recall)
YOLO problems:
1) Accuracy is lower than Faster R-CNN and R-FCN
2) Poor at small and irregularly shaped objects
3) Limited localization precision
YOLOv2
How can YOLO's accuracy be improved further? Remember the aspect-ratio priors that the RPN encoded as Anchor Boxes? Roughly 5 box shapes already cover about 60% of the cases.
So the plain box prediction becomes box prediction with priors: the width and height carry a prior distribution.
A series of further improvements make YOLOv2 clearly better than YOLOv1. The largest single gain comes from the dimension priors: computing a prior distribution over box sizes helps a great deal!
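As a hedged illustration of how such dimension priors can be obtained, the sketch below runs k-means over ground-truth box shapes with 1 - IOU as the distance, treating all boxes as centered at the origin so that only width and height matter. This follows the recipe described for YOLOv2, but the details (initialization, iteration count) are simplified assumptions.

import numpy as np

def wh_iou(wh, centroids):
    # IOU between boxes that share a common center; wh: (2,), centroids: (k, 2)
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=100):
    # boxes_wh: (N, 2) ground-truth widths and heights; returns k anchor shapes
    boxes_wh = np.asarray(boxes_wh, dtype=float)
    centroids = boxes_wh[np.random.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to the closest anchor, using 1 - IOU as the distance
        assign = np.array([np.argmin(1.0 - wh_iou(wh, centroids)) for wh in boxes_wh])
        for c in range(k):
            members = boxes_wh[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids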
YOLOv2 also switches to the Darknet-19 network, which makes it even faster.
YOLO9000 implements hierarchical object labels as a WordTree.
Hierarchical object class labels: WordTree
YOLOv2 advantages:
1) Batch normalization (BN) (+2% mAP)
2) Higher-resolution input (448×448), improving small-object recognition (+4% mAP)
3) Finer grid cells (13×13) (+1% mAP)
4) Anchor boxes chosen by k-means (recall improves from 81% to 88%)
YOLO9000 advantages:
1) Hierarchical class labels combining COCO and ImageNet
YOLOv2 problems:
1) No instance segmentation.
SSD
In line with the Anchor Box and pyramid ideas, SSD introduces multiple scales and multiple default aspect ratios.
The multi-scale CNN adopts a layered-output scheme similar in spirit to GoogLeNet's.
Putting these together gives the SSD network below:
As the SSD network shows, the multiple scales are realized in parallel.
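For illustration, a small sketch of how SSD-style default (anchor) boxes can be generated at several feature-map scales. The scale formula follows the SSD paper, while the aspect-ratio set and the omission of the extra intermediate-scale box are simplifying assumptions.

import numpy as np

def ssd_default_boxes(fmap_sizes, s_min=0.2, s_max=0.9, ratios=(1.0, 2.0, 0.5)):
    # fmap_sizes: feature-map sizes from fine to coarse, e.g. [38, 19, 10, 5, 3, 1]
    # returns (cx, cy, w, h) default boxes in relative [0, 1] coordinates
    m = len(fmap_sizes)
    boxes = []
    for k, f in enumerate(fmap_sizes, start=1):
        s_k = s_min + (s_max - s_min) * (k - 1) / max(m - 1, 1)  # scale of this layer
        for i in range(f):
            for j in range(f):
                cx, cy = (j + 0.5) / f, (i + 0.5) / f            # cell center
                for ar in ratios:
                    boxes.append((cx, cy, s_k * np.sqrt(ar), s_k / np.sqrt(ar)))
    return np.array(boxes)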
SSD advantages:
1) Builds on YOLO by adding multi-scale feature maps, implemented as parallel ConvNet branches
2) Introduces the Anchor Box mechanism
3) Better accuracy and higher speed than YOLO
SSD problems:
1) Accuracy struggles to surpass R-FCN and Faster R-CNN
AttentionNet
The attention-narrowing idea is fairly simple:
Compared with region proposals, it has certain advantages:
The attention-shifting process can be read as the top-left and bottom-right corner points moving toward each other as closely as possible:
The whole process iterates until the detection is precise enough.
To work on multiple targets, each attention-shifting process must be tied to a specific target:
Different classes can be arranged as parallel branches.
In that setup, every object instance has its own attention-shifting process, and the instances can be handled in parallel.
A two-stage procedure is then used: first find a rough box for each instance, and second, refine it into an accurate box.
AttentionNet advantages:
1) A brand-new way of searching for regions
2) Better results than R-CNN
AttentionNet problems:
1) Handling multiple instances is complicated
2) The iterative box movement is costly
AttractioNet
AttractioNet ((Act)ive Box Proposal Generation via (I)n-(O)ut Localization Network) asks: how can the boxes be refined?
1) Concentrate the attention more tightly!
2) Localize more finely.
How to localize more finely? By estimating the object's probability distribution along the horizontal and vertical axes and cropping accordingly.
The corresponding module is the ARN (Attend & Refine Network); it is applied repeatedly, and the results are finally merged with NMS. Doesn't this resemble the iteration of an RPN followed by ROI Pooling? The difference is that every ARN's box proposals are kept and then merged with NMS.
The ARN also differs from the RPN in that it refines the horizontal and vertical axes separately, defined through in-out maximum likelihood, which is exactly the refinement diagram shown earlier.
Having explained the ARN, what is NMS? It is simply a local maximum search!
The NMS step picks the most suitable box out of many candidates, somewhat like voting.
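A minimal greedy NMS sketch, reusing the iou helper shown earlier and the same (x1, y1, x2, y2) box format:

def nms(boxes, scores, thresh=0.5):
    # keep the highest-scoring box, drop neighbours whose IOU exceeds thresh, repeat
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep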
AttractioNet advantages:
1) Introduces the idea of iteratively refining regions
2) Better proposals than Selective Search
3) CNN-based region proposals
AttractioNet problems:
1) Repeated iteration reduces running speed
2) The network structure is more complex than the RPN
G-CNN
G-CNN (Grid-CNN) absorbs YOLO's divide-and-conquer idea and merges regions starting from a grid.
The merging is not a simple union; it follows an iterative refinement scheme.
This process is quite different from NMS: boxes are improved through repeated IOU computation and iterative optimization.
To avoid recomputing features, feature extraction is carried out once as a global step, while the part that is optimized repeatedly is the box regression.
The movement of the regression boxes over the iterations can be seen below:
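Conceptually, the iterative part can be pictured as the loop below. This is only a schematic sketch: regress_step stands for a learned regressor applied to features cropped at the current box, and is a hypothetical placeholder rather than G-CNN's actual module.

def refine_boxes(feature_map, boxes, regress_step, n_iters=3):
    # features are computed once (feature_map); only the cheap regression repeats
    for _ in range(n_iters):
        boxes = [regress_step(feature_map, b) for b in boxes]  # each step nudges a box
    return boxes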
G-CNN advantages:
1) Replaces simple NMS-style merging with iterative refinement
2) Slightly better accuracy than Faster R-CNN
3) Divide and conquer makes it somewhat faster than Faster R-CNN
G-CNN problems:
1) Still too slow to be practical.
ION
ION (Inside-Outside Net) proposes an RNN-based contextual approach to object detection: the image is traversed in each of the four directions (up, down, left, right), the traversal is encoded with an RNN, and the encoding is treated as the context in that direction.
This realizes four-directional RNN context for extracting contextual features.
A stack of RNN layers is used to strengthen context at different granularities.
This resembles how R-FCN encodes spatial constraints, except that the positions are not hand-divided sub-boxes; an IRNN encodes them directly.
Comparing the network with and without this context shows that the IRNN brings roughly a 2-point mAP improvement.
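A rough sketch of the four-direction IRNN idea follows: a plain ReLU RNN whose recurrent weight is the identity is swept over the feature map left-to-right, right-to-left, top-to-bottom and bottom-to-top, and the four outputs are concatenated as context features. This is a simplification of ION's stacked IRNN layers; W_in and b stand for learned projection weights and are assumptions.

import numpy as np

def irnn_sweep(x, W_in, b, axis, reverse=False):
    # x: (H, W, C) feature map; sweep a ReLU RNN along one axis.
    # Recurrent weight is the identity (the "I" in IRNN): h = relu(x @ W_in.T + h_prev + b)
    x = np.flip(x, axis=axis) if reverse else x
    H, W, _ = x.shape
    hidden = W_in.shape[0]
    out = np.zeros((H, W, hidden))
    h = np.zeros((W if axis == 0 else H, hidden))
    for t in range(H if axis == 0 else W):
        xt = x[t] if axis == 0 else x[:, t]
        h = np.maximum(0.0, xt @ W_in.T + h + b)   # identity recurrence
        if axis == 0:
            out[t] = h
        else:
            out[:, t] = h
    return np.flip(out, axis=axis) if reverse else out

def four_direction_context(x, W_in, b):
    sweeps = [irnn_sweep(x, W_in, b, axis, rev)
              for axis in (0, 1) for rev in (False, True)]
    return np.concatenate(sweeps, axis=-1)         # (H, W, 4 * hidden) context features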
ION advantages:
1) Introduces the notion of RNN-based context
2) Improves recognition of small objects
3) Better results than R-CNN
ION problems:
1) The RNN adds computation, so it is slower
FPN
How can a feature pyramid be fused into the network itself so that repeated computation is avoided?
FPN answers this by building a feature pyramid network out of convolutions and feature merging.
What does the pyramid buy us? Objects of different sizes can be handled at different scales.
At each pyramid level, targets of a similar relative size can then be found.
This makes the detector compatible across scales:
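A schematic sketch of the FPN top-down pathway: nearest-neighbour upsampling plus lateral 1×1 merging. Here conv1x1 and conv3x3 are placeholders for learned convolution layers and are assumptions, not FPN's exact implementation.

import numpy as np

def upsample2x(x):
    # nearest-neighbour 2x upsampling of an (H, W, C) map
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_top_down(c_feats, conv1x1, conv3x3):
    # c_feats: backbone maps [C2, C3, C4, C5], coarsest last
    p = conv1x1(c_feats[-1])                 # P5 from the coarsest map
    pyramid = [p]
    for c in reversed(c_feats[:-1]):         # C4, C3, C2
        p = conv1x1(c) + upsample2x(p)       # lateral connection + top-down signal
        pyramid.append(p)
    pyramid = [conv3x3(p) for p in pyramid]  # smooth the upsampling artifacts
    return pyramid[::-1]                     # [P2, P3, P4, P5]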
FPN advantages:
1) Jointly handles multiple scales and small objects
2) Balances speed and accuracy
3) Combines readily with other models to improve their results
FPN problems:
1) Multi-level computation increases the cost
Mask R-CNN
Recall that the ROI was first introduced in R-CNN,
and ROI Pooling was first introduced in Fast R-CNN.
What does Mask R-CNN add? It proposes ROI Align, which lets the newly added Mask Branch map precisely down to pixels.
What is a mask?
With masks, instance segmentation becomes possible.
So what is the difference between ROI Pooling and ROI Align? How can we map precisely back to pixel boundaries? The key is that the pooling bins must not be snapped to the feature-grid edges; they must follow the exact, scaled pixel coordinates.
Plain pooling introduces a quantization offset, and on the pixel-level mask that offset shows up as misplaced boundaries; earlier work used RoI Warping with interpolation to correct for it.
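A minimal sketch of the difference: ROI Align samples the feature map at exact, non-integer locations with bilinear interpolation instead of snapping bin boundaries to the grid. It is simplified to one sample point per bin; the real layer takes several samples per bin and averages them.

import numpy as np

def bilinear(feat, y, x):
    # bilinearly sample feat (H, W) at a real-valued point (y, x)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align(feat, roi, out_size=7):
    # roi = (x1, y1, x2, y2) in (possibly fractional) feature-map coordinates
    x1, y1, x2, y2 = roi
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # sample at the bin center: no rounding, unlike ROI Pooling
            out[i, j] = bilinear(feat, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
    return out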
Mask prediction, classification, and box regression can then all be learned on top of FPN or ROI Align features.
Predecessors of the mask computation:
1) The RoI Warping of MNC (Multi-task Network Cascade), which estimates masks by interpolation
2) The position-aware sliding masks of FCIS (Fully Convolutional Instance Segmentation)
RoI Align performs much better than these segment-based alternatives!
With small changes, the same framework can also predict the 17 keypoints used for human pose estimation.
Mask R-CNN advantages:
1) Moves from ROI Pooling to ROI Align (inspired by RoI Warping)
2) Mask prediction (inspired by MNC and FCIS)
3) State-of-the-art results
4) With minor adjustments it can also do human pose estimation
Mask R-CNN problems:
1) Not fast enough
2) Pixel-level prediction requires large amounts of training data
Mask X R-CNN
Mask R-CNN with Weight Transfer Learning.
It further improves the results:
Summary
A quick takeaway:
1) Speed first: SSD
2) Balanced speed and accuracy: R-FCN
3) Accuracy first: Faster R-CNN, Mask R-CNN
4) One network, many uses: Mask R-CNN
Next is a worked deep learning object detection example:
Example algorithm model: Mask R-CNN, principles and implementation
Mask R-CNN for Object Detection and Segmentation
This is an implementation of Mask R-CNN on Python 3, Keras, and TensorFlow. The model generates bounding boxes and segmentation masks for each instance of an object in the image. It’s based on Feature Pyramid Network (FPN) and a ResNet101 backbone.
The repository includes:
Source code of Mask R-CNN built on FPN and ResNet101.
Training code for MS COCO
Pre-trained weights for MS COCO
Jupyter notebooks to visualize the detection pipeline at every step
ParallelModel class for multi-GPU training
Evaluation on MS COCO metrics (AP)
Example of training on your own dataset
The code is documented and designed to be easy to extend. If you use it in your research, please consider citing this repository (bibtex below). If you work on 3D vision, you might find our recently released Matterport3D dataset useful as well.
This dataset was created from 3D-reconstructed spaces captured by our customers who agreed to make them publicly available for academic use. You can see more examples here.
Getting Started
demo.ipynb is the easiest way to start. It shows an example of using a model pre-trained on MS COCO to segment objects in your own images.
It includes code to run object detection and instance segmentation on arbitrary images.
train_shapes.ipynb shows how to train Mask R-CNN on your own dataset. This notebook introduces a toy dataset (Shapes) to demonstrate training on a new dataset.
(model.py, utils.py, config.py): These files contain the main Mask RCNN implementation.
inspect_data.ipynb. This notebook visualizes the different pre-processing steps to prepare the training data.
inspect_model.ipynb This notebook goes in depth into the steps performed to detect and segment objects. It provides visualizations of every step of the pipeline.
inspect_weights.ipynb This notebook inspects the weights of a trained model and looks for anomalies and odd patterns.
Step by Step Detection
To help with debugging and understanding the model, there are 3 notebooks (inspect_data.ipynb, inspect_model.ipynb, inspect_weights.ipynb) that provide a lot of visualizations and allow running the model step by step to inspect the output at each point. Here are a few examples:
1. Anchor sorting and filtering
Visualizes every step of the first stage Region Proposal Network and displays positive and negative anchors along with anchor box refinement.
2. Bounding Box Refinement
This is an example of final detection boxes (dotted lines) and the refinement applied to them (solid lines) in the second stage.
3. Mask Generation
Examples of generated masks. These then get scaled and placed on the image in the right location.
4. Layer activations
Often it's useful to inspect the activations at different layers to look for signs of trouble (all zeros or random noise).
5. Weight Histograms
Another useful debugging tool is to inspect the weight histograms. These are included in the inspect_weights.ipynb notebook.
6. Logging to TensorBoard
TensorBoard is another great debugging and visualization tool. The model is configured to log losses and save weights at the end of every epoch.
7. Composing the different pieces into a final result
Training on MS COCO
We’re providing pre-trained weights for MS COCO to make it easier to start. You can
use those weights as a starting point to train your own variation on the network.
Training and evaluation code is in samples/coco/coco.py. You can import this
module in Jupyter notebook (see the provided notebooks for examples) or you
can run it directly from the command line as such:
# Train a new model starting from pre-trained COCO weights
python3 samples/coco/coco.py train --dataset=/path/to/coco/ --model=coco

# Train a new model starting from ImageNet weights
python3 samples/coco/coco.py train --dataset=/path/to/coco/ --model=imagenet

# Continue training a model that you had trained earlier
python3 samples/coco/coco.py train --dataset=/path/to/coco/ --model=/path/to/weights.h5

# Continue training the last model you trained. This will find
# the last trained weights in the model directory.
python3 samples/coco/coco.py train --dataset=/path/to/coco/ --model=last

You can also run the COCO evaluation code with:

# Run COCO evaluation on the last trained model
python3 samples/coco/coco.py evaluate --dataset=/path/to/coco/ --model=last
The training schedule, learning rate, and other parameters should be set in samples/coco/coco.py.
Training on Your Own Dataset
Start by reading this blog post about the balloon color splash sample. It covers the process starting from annotating images to training to using the results in a sample application.
In summary, to train the model on your own dataset you’ll need to extend two classes:
Config
This class contains the default configuration. Subclass it and modify the attributes you need to change.
Dataset
This class provides a consistent way to work with any dataset.
It allows you to use new datasets for training without having to change
the code of the model. It also supports loading multiple datasets at the
same time, which is useful if the objects you want to detect are not
all available in one dataset.
See examples in samples/shapes/train_shapes.ipynb, samples/coco/coco.py, samples/balloon/balloon.py, and samples/nucleus/nucleus.py.
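For orientation, a minimal sketch of what the two subclasses typically look like, modeled on the repo's balloon sample. Attribute and method names such as NAME, IMAGES_PER_GPU, NUM_CLASSES, add_class and load_mask follow the repo's conventions, but treat this as an illustrative outline rather than a complete working example (the import path may also differ in older layouts of the repo).

from mrcnn.config import Config
from mrcnn import utils

class BalloonConfig(Config):
    NAME = "balloon"
    IMAGES_PER_GPU = 2
    NUM_CLASSES = 1 + 1          # background + balloon
    STEPS_PER_EPOCH = 100

class BalloonDataset(utils.Dataset):
    def load_balloon(self, dataset_dir, subset):
        self.add_class("balloon", 1, "balloon")
        # ... read your annotations here and call self.add_image(...) per image

    def load_mask(self, image_id):
        # return (masks of shape [H, W, N], class_ids of shape [N]) for this image
        ...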
Differences from the Official Paper
This implementation follows the Mask RCNN paper for the most part, but there are a few cases where we deviated in favor of code simplicity and generalization. These are some of the differences we’re aware of. If you encounter other differences, please do let us know.
Image Resizing: To support training multiple images per batch we resize all images to the same size. For example, 1024x1024px on MS COCO. We preserve the aspect ratio, so if an image is not square we pad it with zeros. In the paper the resizing is done such that the smallest side is 800px and the largest is trimmed at 1000px.
Bounding Boxes: Some datasets provide bounding boxes and some provide masks only. To support training on multiple datasets we opted to ignore the bounding boxes that come with the dataset and generate them on the fly instead. We pick the smallest box that encapsulates all the pixels of the mask as the bounding box. This simplifies the implementation and also makes it easy to apply image augmentations that would otherwise be harder to apply to bounding boxes, such as image rotation.
To validate this approach, we compared our computed bounding boxes to those provided by the COCO dataset.
We found that ~2% of bounding boxes differed by 1px or more, ~0.05% differed by 5px or more,
and only 0.01% differed by 10px or more.
Learning Rate: The paper uses a learning rate of 0.02, but we found that to be
too high, and often causes the weights to explode, especially when using a small batch
size. It might be related to differences between how Caffe and TensorFlow compute
gradients (sum vs mean across batches and GPUs). Or, maybe the official model uses gradient
clipping to avoid this issue. We do use gradient clipping, but don’t set it too aggressively.
We found that smaller learning rates converge faster anyway so we go with that.
Citation
Use this bibtex to cite this repository:
@misc{matterport_maskrcnn_2017,
title={Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow},
author={Waleed Abdulla},
year={2017},
publisher={Github},
journal={GitHub repository},
howpublished={\url{https://github.com/matterport/Mask_RCNN}},
}
Contributing
Contributions to this repository are welcome. Examples of things you can contribute:
Speed Improvements. Like re-writing some Python code in TensorFlow or Cython.
Training on other datasets.
Accuracy Improvements.
Visualizations and examples.
You can also join our team and help us build even more projects like this one.
Requirements
Python 3.4, TensorFlow 1.3, Keras 2.0.8 and other common packages listed in requirements.txt.
MS COCO Requirements:
To train or test on MS COCO, you’ll also need:
pycocotools (installation instructions below)
MS COCO Dataset
Download the 5K minival and the 35K validation-minus-minival subsets. More details in the original Faster R-CNN implementation.
If you use Docker, the code has been verified to work on this Docker container.
Installation
Clone this repository
Install dependencies
pip3 install -r requirements.txt
Run setup from the repository root directory
python3 setup.py install
Download pre-trained COCO weights (mask_rcnn_coco.h5) from the releases page.
(Optional) To train or test on MS COCO install pycocotools from one of these repos. They are forks of the original pycocotools with fixes for Python3 and Windows (the official repo doesn’t seem to be active anymore).
Linux: https://github.com/waleedka/coco
Windows: https://github.com/philferriere/cocoapi.
You must have the Visual C++ 2015 build tools on your path (see the repo for additional details)
Projects Using this Model
If you extend this model to other datasets or build projects that use it, we’d love to hear from you.
4K Video Demo by Karol Majek.
Images to OSM: Improve OpenStreetMap by adding baseball, soccer, tennis, football, and basketball fields.
Splash of Color. A blog post explaining how to train this model from scratch and use it to implement a color splash effect.
Segmenting Nuclei in Microscopy Images. Built for the 2018 Data Science Bowl
Code is in the samples/nucleus directory.
Detection and Segmentation for Surgery Robots by the NUS Control & Mechatronics Lab.
Reconstructing 3D buildings from aerial LiDAR
A proof of concept project by Esri, in collaboration with Nvidia and Miami-Dade County. Along with a great write up and code by Dmitry Kudinov, Daniel Hedges, and Omar Maher.
Usiigaci: Label-free Cell Tracking in Phase Contrast Microscopy
A project from Japan to automatically track cells in a microfluidics platform. Paper is pending, but the source code is released.
Characterization of Arctic Ice-Wedge Polygons in Very High Spatial Resolution Aerial Imagery
Research project to understand the complex processes between degradations in the Arctic and climate change. By Weixing Zhang, Chandi Witharana, Anna Liljedahl, and Mikhail Kanevskiy.
Mask-RCNN Shiny
A computer vision class project by HU Shiyu to apply the color pop effect on people with beautiful results.
Mapping Challenge: Convert satellite imagery to maps for use by humanitarian organisations.
GRASS GIS Addon to generate vector masks from geospatial imagery. Based on a Master's thesis by Ondřej Pešek.