Dive into CV - Object Detection Tutorial 4: Model Structure
3.4 Model Structure
This article comes from the introductory object detection tutorial created by the CV group of the open-source organization DataWhale 🐳.
It corresponds to Chapter 3 of the open-source project Dive into CV with PyTorch (动手学CV-Pytorch), and the code covered in this tutorial can also be found in that project. More high-quality content will keep being added, and you are welcome to follow and star the project.
If you use content or images from our tutorial, please credit our GitHub homepage in a prominent place in your article: https://github.com/datawhalechina/dive-into-cv-pytorch
The network introduced in this chapter, which we will call Tiny_Detector from here on, is a network designed specifically for this tutorial rather than a classic object detection network. If its origins must be traced, the code was adapted from an open-source SSD tutorial written abroad, so many of its details are close to the SSD network. It can be regarded as a simplified version of SSD, intended to help you get started more easily.
Below, we introduce the model structure of Tiny_Detector.
3.4.1 VGG16 as the backbone
To keep the structure simple and easy to understand, we use VGG16 as the backbone; that is, we adopt the full VGG16 architecture as the feature extraction module and only remove the two fully connected layers fc6 and fc7, as shown in Figure 3-17:
Figure 3-17 The backbone of Tiny-Detector
As for the network's input size: since the ImageNet-pretrained VGG16 model was trained on 224x224 images, we also fix our network input at 224x224. Keeping the same scale as the pretrained model lets it play its role better. Generally speaking, such an input size is still on the small side for a detection network; after you finish this chapter, feel free to try a larger input size and see whether it gives better results.
The feature extraction module is defined by the VGGBase class in model.py:
```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class VGGBase(nn.Module):
    """
    VGG base convolutions to produce feature maps.
    The full VGG16 architecture is used as the feature extraction module, dropping the fc6 and fc7 fully connected layers.
    Since the ImageNet-pretrained VGG16 model was trained with 224x224 inputs, our network input is also fixed at 224x224.
    """

    def __init__(self):
        super(VGGBase, self).__init__()

        # Standard convolutional layers in VGG16
        self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stride = 1, by default
        self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)  # 224->112

        self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)  # 112->56

        self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)  # 56->28

        self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
        self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2)  # 28->14

        self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool5 = nn.MaxPool2d(kernel_size=2, stride=2)  # 14->7

        # Load pretrained weights on ImageNet
        self.load_pretrained_layers()

    def forward(self, image):
        """
        Forward propagation.

        :param image: images, a tensor of dimensions (N, 3, 224, 224)
        :return: feature maps pool5
        """
        out = F.relu(self.conv1_1(image))  # (N, 64, 224, 224)
        out = F.relu(self.conv1_2(out))    # (N, 64, 224, 224)
        out = self.pool1(out)              # (N, 64, 112, 112)

        out = F.relu(self.conv2_1(out))    # (N, 128, 112, 112)
        out = F.relu(self.conv2_2(out))    # (N, 128, 112, 112)
        out = self.pool2(out)              # (N, 128, 56, 56)

        out = F.relu(self.conv3_1(out))    # (N, 256, 56, 56)
        out = F.relu(self.conv3_2(out))    # (N, 256, 56, 56)
        out = F.relu(self.conv3_3(out))    # (N, 256, 56, 56)
        out = self.pool3(out)              # (N, 256, 28, 28)

        out = F.relu(self.conv4_1(out))    # (N, 512, 28, 28)
        out = F.relu(self.conv4_2(out))    # (N, 512, 28, 28)
        out = F.relu(self.conv4_3(out))    # (N, 512, 28, 28)
        out = self.pool4(out)              # (N, 512, 14, 14)

        out = F.relu(self.conv5_1(out))    # (N, 512, 14, 14)
        out = F.relu(self.conv5_2(out))    # (N, 512, 14, 14)
        out = F.relu(self.conv5_3(out))    # (N, 512, 14, 14)
        out = self.pool5(out)              # (N, 512, 7, 7)

        # return the 7x7 feature map
        return out

    def load_pretrained_layers(self):
        """
        We use a VGG-16 pretrained on the ImageNet task as the base network.
        There's one available in PyTorch, see
        https://pytorch.org/docs/stable/torchvision/models.html#torchvision.models.vgg16
        We copy these parameters into our network. It's straightforward for conv1 to conv5.
        """
        # Current state of base
        state_dict = self.state_dict()
        param_names = list(state_dict.keys())

        # Pretrained VGG base
        pretrained_state_dict = torchvision.models.vgg16(pretrained=True).state_dict()
        pretrained_param_names = list(pretrained_state_dict.keys())

        # Transfer conv. parameters from pretrained model to current model
        for i, param in enumerate(param_names):
            state_dict[param] = pretrained_state_dict[pretrained_param_names[i]]

        self.load_state_dict(state_dict)
        print("\nLoaded base model.\n")
```

Thus, the feature extraction layers of our Tiny_Detector output a 7x7 feature map (a quick shape check is sketched below). Next, we need to place the corresponding prior boxes, also called anchors, on this feature map.
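To make the output size concrete, here is a minimal usage sketch. It assumes the VGGBase class above is saved in model.py and that the torchvision pretrained weights can be downloaded; it simply pushes a random 224x224 image through the backbone and prints the shape of the resulting feature map.

```python
import torch
from model import VGGBase  # assumes the VGGBase class above lives in model.py

backbone = VGGBase()                       # downloads/loads the ImageNet-pretrained VGG16 weights
dummy_image = torch.randn(1, 3, 224, 224)  # a fake batch with a single 224x224 RGB image
feats = backbone(dummy_image)
print(feats.shape)                         # expected: torch.Size([1, 512, 7, 7])
```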
The concept of prior boxes was introduced in the previous section. In this experiment, the anchors are configured as follows:
- The original image is divided evenly into 7x7 cells
- 3 different scales are used: 0.2, 0.4, 0.6
- 3 different aspect ratios are used: 1:1, 1:2, 2:1
Therefore, we place 7x7x9 anchors on this 7x7 feature map, 9 anchors per cell, as shown in Figure 3-18:
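The project builds these priors in model.py; purely as an illustration, the sketch below shows one way such a 441x4 prior tensor could be generated from the configuration above, using center-size coordinates normalized to [0, 1] and the common SSD-style convention w = s*sqrt(r), h = s/sqrt(r). The function name and details here are assumptions for illustration, not necessarily identical to the project's own code.

```python
import torch
from math import sqrt

def make_priors(fmap_size=7, scales=(0.2, 0.4, 0.6), ratios=(1.0, 2.0, 0.5)):
    """Generate fmap_size*fmap_size*9 prior boxes in (cx, cy, w, h) form, normalized to [0, 1]."""
    priors = []
    for i in range(fmap_size):            # row index -> cy
        for j in range(fmap_size):        # column index -> cx
            cx = (j + 0.5) / fmap_size    # cell centers
            cy = (i + 0.5) / fmap_size
            for s in scales:
                for r in ratios:
                    priors.append([cx, cy, s * sqrt(r), s / sqrt(r)])
    return torch.FloatTensor(priors).clamp_(0., 1.)

priors = make_priors()
print(priors.shape)  # torch.Size([441, 4]) -> 7 x 7 cells, 9 anchors each
```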
For each anchor we need to predict two kinds of information: the class of the anchor, and the bounding box of the object. See Figure 3-19:
In our experiment, the class information consists of scores over 21 classes (the 20 VOC classes plus one background class), and the model finally takes the class with the highest predicted score as the class of the box.
The bounding box information describes how, given that the current anchor roughly contains an object, the anchor should be fine-tuned so that the object's bbox can ultimately be predicted accurately.
Figure 3-19 Example of Tiny-Detector's outputs
These two kinds of predictions are produced by what we call the classification head and the regression head. So how are the classification and regression predictions obtained?
In fact, we only need to attach two 3x3 convolutional layers after the 7x7 feature map to produce the classification and regression predictions, respectively.
Below we look at the classification head and the regression head in more detail.
3.4.2 Classification head and regression head
3.4.2.1 Bounding box encoding and decoding
Tiny_Detector does not predict the target box directly. Instead, it regresses how much the anchor needs to be adjusted in order to predict the box location more accurately, so our goal is to find a way to quantify this offset.
Figure 3-21 shows an example of the ground-truth box and a prior box for a dog:
Figure 3-21 Example of a target box and a prior box
Our model predicts the offset between the anchor and the target box, and this offset is normalized in a certain way; this process is called bounding box encoding.
Here we use exactly the same encoding method as SSD. The formulas are as follows:
$$g_{cx} = \frac{c_x - \hat{c}_x}{\hat{w}}$$

$$g_{cy} = \frac{c_y - \hat{c}_y}{\hat{h}}$$

$$g_w = \log\left(\frac{w}{\hat{w}}\right)$$

$$g_h = \log\left(\frac{h}{\hat{h}}\right)$$
The model predicts and outputs these encoded offsets $(g_{cx}, g_{cy}, g_w, g_h)$; in the end, we only need to invert the formulas to decode them and obtain the predicted target box.
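As a quick sanity check on the formulas, the sketch below encodes one hypothetical anchor/target pair and then inverts the encoding; the numbers are made up purely for illustration. Note that the actual utils.py implementation shown next additionally scales the offsets by the empirical SSD 'variances' (10 and 5).

```python
import math

# Hypothetical prior (anchor) and ground-truth box, both in center-size form (cx, cy, w, h)
cx_hat, cy_hat, w_hat, h_hat = 0.50, 0.50, 0.20, 0.20   # prior
cx, cy, w, h = 0.55, 0.48, 0.30, 0.25                   # target

# Encode: offsets of the target w.r.t. the prior
g_cx = (cx - cx_hat) / w_hat     # 0.25
g_cy = (cy - cy_hat) / h_hat     # -0.1
g_w = math.log(w / w_hat)        # ~0.405
g_h = math.log(h / h_hat)        # ~0.223

# Decode: inverting the formulas recovers the original target box
assert abs(g_cx * w_hat + cx_hat - cx) < 1e-9
assert abs(g_cy * h_hat + cy_hat - cy) < 1e-9
assert abs(math.exp(g_w) * w_hat - w) < 1e-9
assert abs(math.exp(g_h) * h_hat - h) < 1e-9
```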
The encoding and decoding of target boxes are implemented in utils.py as follows:
```python
import torch


def cxcy_to_gcxgcy(cxcy, priors_cxcy):
    """
    Encode bounding boxes (that are in center-size form) w.r.t. the corresponding prior boxes (that are in center-size form).

    For the center coordinates, find the offset with respect to the prior box, and scale by the size of the prior box.
    For the size coordinates, scale by the size of the prior box, and convert to the log-space.

    In the model, we are predicting bounding box coordinates in this encoded form.

    :param cxcy: bounding boxes in center-size coordinates, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes with respect to which the encoding must be performed, a tensor of size (n_priors, 4)
    :return: encoded bounding boxes, a tensor of size (n_priors, 4)
    """
    # The 10 and 5 below are referred to as 'variances' in the original SSD Caffe repo, completely empirical
    # They are for some sort of numerical conditioning, for 'scaling the localization gradient'
    # See https://github.com/weiliu89/caffe/issues/155
    return torch.cat([(cxcy[:, :2] - priors_cxcy[:, :2]) / (priors_cxcy[:, 2:] / 10),  # g_c_x, g_c_y
                      torch.log(cxcy[:, 2:] / priors_cxcy[:, 2:]) * 5], 1)  # g_w, g_h


def gcxgcy_to_cxcy(gcxgcy, priors_cxcy):
    """
    Decode bounding box coordinates predicted by the model, since they are encoded in the form mentioned above.

    They are decoded into center-size coordinates.

    This is the inverse of the function above.

    :param gcxgcy: encoded bounding boxes, i.e. output of the model, a tensor of size (n_priors, 4)
    :param priors_cxcy: prior boxes with respect to which the encoding is defined, a tensor of size (n_priors, 4)
    :return: decoded bounding boxes in center-size form, a tensor of size (n_priors, 4)
    """
    return torch.cat([gcxgcy[:, :2] * priors_cxcy[:, 2:] / 10 + priors_cxcy[:, :2],  # c_x, c_y
                      torch.exp(gcxgcy[:, 2:] / 5) * priors_cxcy[:, 2:]], 1)  # w, h
```

3.4.2.2 Classification head and regression head predictions
As introduced above, for each prior box on the output 7x7 feature map we want to predict:
1) A set of 21 class scores for the box, covering the 20 VOC classes plus one background class.
2) The encoded offsets of the box, $(g_{cx}, g_{cy}, g_w, g_h)$.
To obtain the classes and offsets we want to predict, we attach two convolutional layers to the feature map:
1) A classification prediction convolutional layer with 3x3 kernels, padding and stride both set to 1. Each anchor needs 21 kernels, and each position has 9 anchors, so 21x9 kernels are needed.
2) A localization prediction convolutional layer, also with 3x3 kernels, padding and stride both set to 1. Each anchor needs 4 kernels, so 4x9 kernels are needed.
Let's take an intuitive look at the outputs of these convolutions, shown in Figure 3-22 below:
Figure 3-22 Example of Tiny-Detector's outputs
The outputs of the regression head and the classification head are shown in blue and yellow, respectively. The spatial size of the feature map stays at 7x7; what we really care about is the number of channels in the third dimension, which is expanded in Figure 3-23 below:
Figure 3-23 The 9 anchors in each cell predict encoded offsets
In other words, the regression head finally outputs 36 channels: every 4 values correspond to the predicted encoded offsets of one anchor, and there are 9 such groups of 4, so the channel count is 36.
The classification head can be understood in the same way, as shown in Figure 3-24 below:
Figure 3-24 The 9 anchors in each cell predict class scores
The classification head and regression head are defined by the PredictionConvolutions class in model.py, as follows:
```python
import torch.nn as nn


class PredictionConvolutions(nn.Module):
    """
    Convolutions to predict class scores and bounding boxes using feature maps.

    The bounding boxes (locations) are predicted as encoded offsets w.r.t each of the 441 prior (default) boxes.
    See 'cxcy_to_gcxgcy' in utils.py for the encoding definition.
    The coordinate encoding predicted here follows the SSD definition exactly.

    The class scores represent the scores of each object class in each of the 441 bounding boxes located.
    A high score for 'background' = no object.
    """

    def __init__(self, n_classes):
        """
        :param n_classes: number of different types of objects
        """
        super(PredictionConvolutions, self).__init__()
        self.n_classes = n_classes

        # Number of prior-boxes we are considering per position in the feature map
        # 9 prior-boxes implies we use 9 different aspect ratios, etc.
        n_boxes = 9

        # Localization prediction convolutions (predict offsets w.r.t prior-boxes)
        self.loc_conv = nn.Conv2d(512, n_boxes * 4, kernel_size=3, padding=1)

        # Class prediction convolutions (predict classes in localization boxes)
        self.cl_conv = nn.Conv2d(512, n_boxes * n_classes, kernel_size=3, padding=1)

        # Initialize convolutions' parameters
        self.init_conv2d()

    def init_conv2d(self):
        """
        Initialize convolution parameters.
        """
        for c in self.children():
            if isinstance(c, nn.Conv2d):
                nn.init.xavier_uniform_(c.weight)
                nn.init.constant_(c.bias, 0.)

    def forward(self, pool5_feats):
        """
        Forward propagation.

        :param pool5_feats: pool5 feature map, a tensor of dimensions (N, 512, 7, 7)
        :return: 441 locations and class scores (i.e. w.r.t each prior box) for each image
        """
        batch_size = pool5_feats.size(0)

        # Predict localization boxes' bounds (as offsets w.r.t prior-boxes)
        l_conv = self.loc_conv(pool5_feats)  # (N, n_boxes * 4, 7, 7)
        l_conv = l_conv.permute(0, 2, 3, 1).contiguous()  # (N, 7, 7, n_boxes * 4), to match prior-box order (after .view())
        # (.contiguous() ensures it is stored in a contiguous chunk of memory, needed for .view() below)
        locs = l_conv.view(batch_size, -1, 4)  # (N, 441, 4), there are a total 441 boxes on this feature map

        # Predict classes in localization boxes
        c_conv = self.cl_conv(pool5_feats)  # (N, n_boxes * n_classes, 7, 7)
        c_conv = c_conv.permute(0, 2, 3, 1).contiguous()  # (N, 7, 7, n_boxes * n_classes), to match prior-box order (after .view())
        classes_scores = c_conv.view(batch_size, -1, self.n_classes)  # (N, 441, n_classes), there are a total 441 boxes on this feature map

        return locs, classes_scores
```

According to the description above, the shapes of the model outputs should be:
- Classification head: batch_size x 7 x 7 x 189
- Regression head: batch_size x 7 x 7 x 36
However, for the convenience of later processing, we would prefer each anchor's predictions to occupy their own dimension, that is:
- Classification head: batch_size x 441 x 21
- Regression head: batch_size x 441 x 4
The 441 comes from the fact that our model defines a total of 441 = 7x7x9 prior boxes. This conversion corresponds to these two lines of code:
```python
locs = l_conv.view(batch_size, -1, 4)
classes_scores = c_conv.view(batch_size, -1, self.n_classes)
```
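As a quick check of these shapes, the sketch below feeds a dummy pool5 feature map through the prediction head; it assumes PredictionConvolutions, as defined above, is importable from model.py.

```python
import torch
from model import PredictionConvolutions  # assumes the class above lives in model.py

head = PredictionConvolutions(n_classes=21)
dummy_feats = torch.randn(2, 512, 7, 7)    # a fake pool5 output for a batch of 2 images

locs, classes_scores = head(dummy_feats)
print(locs.shape)                          # torch.Size([2, 441, 4])
print(classes_scores.shape)                # torch.Size([2, 441, 21])
```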
This process is visualized in Figure 3-25.
Figure 3-25 Classification and regression results produced from the pool5 output
3.4.3 Summary
At this point we have covered everything related to the model's forward inference: we understand the model's structure and what it will finally output.
In the next section, we will move on to model training and see how, by defining a loss function and using a few related training tricks, we can make the model learn in the right direction and predict the results we want.