Problems Encountered When Training Faster R-CNN on a Custom Dataset

Published: 2024/9/21


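Recently I ran Faster R-CNN on some of my own data, a remote sensing image dataset called RSOD. Its annotation format is close to pascal_voc, but the labeling was done by a student team and contains some errors, which set a lot of traps for the training later on. Below are the problems I ran into while training on my own dataset. One thing to be very careful about: until you are sure the whole pipeline runs end to end, set the iteration count to a small value (for example 100) to save debugging time.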

Error index:

1. ./tools/train_faster_rcnn_alt_opt.py is not found

2. assert (boxes[:, 2] >= boxes[:, 0]).all()

3. 'module' object has no attribute 'text_format'

4. TypeError: slice indices must be integers or None or have an __index__ method

5. TypeError: 'numpy.float64' object cannot be interpreted as an index

6. error == cudaSuccess (2 vs. 0) out of memory

7. loss_bbox = nan, result: Mean AP = 0.000

8. AttributeError: 'NoneType' object has no attribute 'astype'

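Error 1: Running sudo ./train_faster_rcnn_alt_opt.sh 0 ZF pascal_voc fails with: ./tools/train_faster_rcnn_alt_opt.py is not found

Fix: the .sh script was launched from the wrong directory. It refers to ./tools/ relative to the current working directory, so go back to the py-faster-rcnn root and run sudo ./experiments/scripts/train_faster_rcnn_alt_opt.sh 0 ZF pascal_voc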

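Error 2: assert (boxes[:, 2] >= boxes[:, 0]).all() fails when append_flipped_images is called.

According to what I found online, this is mainly caused by annotation errors in your own dataset. With a custom dataset an x coordinate can be 0, whereas pascal_voc annotations count from 1, so the Faster R-CNN code converts them to 0-based by subtracting 1 from Xmin, Xmax, Ymin and Ymax. That subtraction can overflow: if x = 0, the result wraps around to 65535. Worse still, some annotated coordinates are negative or lie outside the image. The main fixes are: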

(1) Edit lib/datasets/imdb.py and insert the following after boxes[:, 2] = widths[i] - oldx1 - 1:

    for b in range(len(boxes)):
        if boxes[b][2] < boxes[b][0]:
            boxes[b][0] = 0

This only treats the symptom, and it assumes that only boxes[b][0] can overflow. As I found out later, boxes[b][2] can overflow too. Not recommended.
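The wraparound itself is easy to reproduce in isolation. The sketch below is only an illustration: it assumes, as the value 65535 suggests, that box coordinates end up in an unsigned 16-bit array, and the helper name `to_zero_based_uint16` and all sample numbers are made up for the demo.

```python
import numpy as np


def to_zero_based_uint16(coord):
    """Subtract 1 and store the result as uint16 (hypothetical helper)."""
    # An integer cast to uint16 wraps modulo 2**16, so -1 becomes 65535.
    return int(np.array([coord - 1], dtype=np.int64).astype(np.uint16)[0])


width = 1044                         # sample image width
old_x1 = to_zero_based_uint16(0)     # annotation was already 0-based, xmin = 0
old_x2 = to_zero_based_uint16(201)   # an ordinary xmax

# Horizontal flip, using the same formulas as append_flipped_images
new_x1 = width - old_x2 - 1
new_x2 = width - old_x1 - 1

flip_is_valid = new_x2 >= new_x1     # the condition the assert checks
print(old_x1, new_x1, new_x2, flip_is_valid)
```

Here the wrapped x1 turns into a hugely negative x2 after the flip, so clamping only boxes[b][0] would not repair the box.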

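(2) Edit the _load_pascal_annotation function in lib/datasets/pascal_voc.py, which reads pascal_voc-format annotation files, and remove every -1 in the lines below. pascal_voc annotations are 1-based, so the -1 is needed to convert them to 0-based; if your annotations are already 0-based, subtracting 1 again can overflow, so it has to go. If the 0-based issue is your only problem (no negative coordinates and none outside the image), this should be enough.

    x1 = float(bbox.find('xmin').text)  # - 1
    y1 = float(bbox.find('ymin').text)  # - 1
    x2 = float(bbox.find('xmax').text)  # - 1
    y2 = float(bbox.find('ymax').text)  # - 1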

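(3) Annotated rectangles that cross the image boundary

I applied both steps above and still got the error when running "Stage 1 RPN, init from ImageNet model". So the problem was probably not just x = 0; the annotations themselves might be wrong, for example a ground-truth box with x1 < 0 or x2 > imageWidth. I decided to first find out which image was at fault. In lib/datasets/imdb.py, just before the line

    assert (boxes[:, 2] >= boxes[:, 0]).all()

add:

    print(self.image_index[i])

This prints the name of the image currently being processed. The last name printed before the error is the offending image; check its Annotation file for a rectangle that crosses the image boundary. Sure enough, I found one object whose x1 had been annotated as -2.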

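After correcting that annotation, just when I thought I was finally done, the error came back. I gritted my teeth and told myself, "I am patient." This time it appeared in the stage "Stage 1 Fast R-CNN using RPN proposals, init from ImageNet model", which means append_flipped_images was now processing the proposals produced by the RPN rather than the ground truth from the annotation files. That made no sense: if the ground truth is fine, how can the proposals overflow? Conclusion: I had not deleted the cache! Delete all files in py-faster-rcnn/data/cache and in py-faster-rcnn/data/VOCdevkit2007/annotations_cache. Another blog post gave me this hint. Before that, I spent some effort obsessing over annotation errors. If you only want the fix, there is no need to read further, but the reasoning is worth recording as a way of analyzing the problem: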

First I wanted to see which proposal was at fault, which again means finding the image. In lib/datasets/imdb.py, just before the line

    assert (boxes[:, 2] >= boxes[:, 0]).all()

add:

    print("num_image:%d" % (i))

Then run it. This prints the index of each image in the training set (the image name is not needed this time). Find the last index printed before the failure; in my case it was 320. The next step is to check whether all proposals on that image look normal, so again insert the following before the assert:

    if i == 320:
        print(self.image_index[i])
        for z in xrange(len(boxes)):
            print('x2:%d  x1:%d' % (boxes[z][2], boxes[z][0]))
            if boxes[z][2] < boxes[z][0]:
                print("here is the bad point!!!")

Running again and reading the log, "here is the bad point!!!" appeared after the pair "x2=-64491  x1=1011". My image width is 1044, and 1044 - 65535 = -64491, so it is x2 that is out of range. Since boxes[:, 2] = widths[i] - oldx1 - 1, this means that the pre-flip oldx1 = 65534 had overflowed. Why would a proposal from the RPN overflow? Normally RPN proposals never exceed the image bounds, unless the annotated ground truth itself does! And if the ground truth were still wrong, the "Stage 1 RPN, init from ImageNet model" stage should already have failed. So it had to be the cache.
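Since a stale cache is easy to forget, the clean-up can be scripted. This is a minimal sketch, assuming the two cache directories named above live under the repository root; the function name `clear_cache_dirs` and its parameters are my own and are not part of py-faster-rcnn.

```python
import shutil
from pathlib import Path

# Cache locations mentioned in the text, relative to the py-faster-rcnn root.
DEFAULT_CACHE_DIRS = (
    "data/cache",
    "data/VOCdevkit2007/annotations_cache",
)


def clear_cache_dirs(repo_root, relative_dirs=DEFAULT_CACHE_DIRS):
    """Delete everything inside each cache directory; return the removed paths."""
    removed = []
    for rel in relative_dirs:
        cache_dir = Path(repo_root) / rel
        if not cache_dir.is_dir():
            continue  # nothing has been cached there yet
        for entry in cache_dir.iterdir():
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            removed.append(str(entry))
    return removed
```

Calling `clear_cache_dirs("/path/to/py-faster-rcnn")` after every annotation fix keeps the directories themselves and removes only their contents.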

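Error 3: pb2.text_format(...) fails with 'module' object has no attribute 'text_format'.

Fix: add import google.protobuf.text_format in ./lib/fast_rcnn/train.py. Some people online suggest rolling protobuf back to 2.5.0, but that causes a new problem when building caffe, "cannot import name symbol database", and you then have to fetch the missing file from GitHub, so I do not recommend it.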

Error 4: lib/rpn/proposal_target_layer.py fails with TypeError: slice indices must be integers or None or have an __index__ method

Fix: the cause is that numpy no longer accepts float indices from version 1.12.0 onward. Many people online say to downgrade numpy to 1.11.0, but when I did that I got a new error: ImportError: numpy.core.multiarray failed to import. The right fix is to leave numpy alone (if you have already downgraded, just upgrade to the latest version) and change two places in lib/rpn/proposal_target_layer.py.

After line 126 add:

    start = int(start)
    end = int(end)

After line 166 add:

    fg_rois_per_this_image = int(fg_rois_per_this_image)
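The behaviour behind this error can be checked outside the project. The snippet below is a small stand-alone sketch with made-up values; `rois`, `start` and `end` only mimic the names used in the patch.

```python
import numpy as np

rois = np.arange(10)
# Float bounds, as produced by arithmetic on numpy.float64 values
start, end = np.float64(2.0), np.float64(6.0)

try:
    rois[start:end]
    float_slice_failed = False
except TypeError:
    # NumPy >= 1.12 refuses float slice bounds
    float_slice_failed = True

# Same fix as in the patch: cast to int before slicing.
start, end = int(start), int(end)
selected = rois[start:end]
print(float_slice_failed, selected)
```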

Error 5: the _sample_rois function in py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py fails with TypeError: 'numpy.float64' object cannot be interpreted as an index

Fix: this is really the same problem as Error 4, caused by the numpy version. Again, I do not support the many answers online that say to downgrade; patching the project code is the safer route. The fix below follows a solution found online. Note that np.int, used in the commonly posted version of this fix, was only an alias for the built-in int and was removed in NumPy 1.24, so plain int is used here. Edit minibatch.py as follows.

Line 26:

    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)

change to:

    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image).astype(int)

Line 173:

    cls = clss[ind]

change to:

    cls = int(clss[ind])

Three more lines need .astype(int) appended:

    # lib/datasets/ds_utils.py line 12:
    hashes = np.round(boxes * scale).dot(v)
    # lib/fast_rcnn/test.py line 129:
    hashes = np.round(blobs['rois'] * cfg.DEDUP_BOXES).dot(v)
    # lib/rpn/proposal_target_layer.py line 60:
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)
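The reason the cast is needed is that np.round returns a float, and NumPy will not accept a float where it expects a count or an index. A minimal sketch with stand-in values (`FG_FRACTION` and `rois_per_image` here are sample constants, not read from the project's cfg):

```python
import numpy as np

FG_FRACTION = 0.25      # stand-in for cfg.TRAIN.FG_FRACTION
rois_per_image = 128    # sample value

fg_float = np.round(FG_FRACTION * rois_per_image)   # numpy.float64
fg_int = fg_float.astype(int)                        # integer count

try:
    np.zeros(fg_float)          # a float used as a size
    float_size_failed = False
except TypeError:
    float_size_failed = True

labels = np.zeros(fg_int)       # works once the value is an integer
print(type(fg_float).__name__, int(fg_int), labels.shape)
```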

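Error 6: error == cudaSuccess (2 vs. 0) out of memory

The GPU has run out of memory. There are two possible causes: (1) the batch size is too large; (2) other processes are using too much of the GPU.

Fix: first look at GPU usage with watch -n 1 nvidia-smi, which refreshes the usage every second, and start the training program to see how the usage changes. If other programs are clearly taking most of the GPU, stop them with kill -9 PID. If it is our own training program that uses too much, reduce the batch size.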

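Error 7: lib/fast_rcnn/bbox_transform.py reports RuntimeWarning: invalid value encountered in log at targets_dw = np.log(gt_widths / ex_widths), then loss_bbox = nan, and the final Mean AP = 0.000

Many people online say to lower the learning rate. That treats the symptom rather than the cause and merely postpones the failure; besides, a learning rate that is too low carries its own considerable risk of getting stuck in a local optimum.

After some analysis and debugging I found that this, too, was caused by out-of-bounds annotations in my own dataset! There are six ways a box can be out of bounds: x1 < 0; x2 > width; x2 < x1; y1 < 0; y2 > height; y2 < y1. Unfortunately the original code was written for the pascal_voc data and never considers the possibility of annotation errors. The released code only has assert (boxes[:, 2] >= boxes[:, 0]).all() in append_flipped_images, that is, it only asserts x2 >= x1 after the horizontal flip. A failure there may indicate an error in the x annotations (see Error 2 above), but errors in the y annotations are not checked at all.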
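A single malformed box is enough to produce the nan. The sketch below uses made-up boxes and computes widths as x2 - x1 + 1, which is how I understand bbox_transform to derive them; treat that detail as an assumption and check it against your copy of the file.

```python
import numpy as np

# Two example boxes as [x1, y1, x2, y2]; the second has x2 < x1.
ex_rois = np.array([[10.0, 10.0, 59.0, 39.0],
                    [80.0, 10.0, 20.0, 39.0]])
gt_widths = np.array([50.0, 50.0])

ex_widths = ex_rois[:, 2] - ex_rois[:, 0] + 1.0   # second width is negative

# Silence the RuntimeWarning so the demo output stays clean.
with np.errstate(invalid="ignore"):
    targets_dw = np.log(gt_widths / ex_widths)    # log of a negative ratio is nan

loss = np.abs(targets_dw).sum()                   # nan spreads to any sum over the targets
print(ex_widths, targets_dw, loss)
```

Lowering the learning rate cannot help here, because the nan comes from the regression targets and not from the optimizer.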

Analysis: I started with the file that raised the warning, lib/fast_rcnn/bbox_transform.py, in the function bbox_transform. Before the line

    targets_dw = np.log(gt_widths / ex_widths)

add:

    print(gt_widths)
    print(ex_widths)
    print(gt_heights)
    print(ex_heights)
    assert (gt_widths > 0).all()
    assert (gt_heights > 0).all()
    assert (ex_widths > 0).all()
    assert (ex_heights > 0).all()

Running this, the AssertionError was raised at assert (ex_heights > 0).all(), meaning that some anchor had a negative height. Height corresponds to the y direction of the annotations, so the y annotations were the likely culprit. As with Error 2, I went back to lib/datasets/imdb.py and added checks for the y annotations to append_flipped_images. Here is the code:

    # The original code has no helper for image heights, so add one.
    def _get_heights(self):
        return [PIL.Image.open(self.image_path_at(i)).size[1]
                for i in xrange(self.num_images)]

    def append_flipped_images(self):
        num_images = self.num_images
        widths = self._get_widths()
        heights = self._get_heights()  # added: image heights
        for i in xrange(num_images):
            boxes = self.roidb[i]['boxes'].copy()
            oldx1 = boxes[:, 0].copy()
            oldx2 = boxes[:, 2].copy()
            print(self.image_index[i])  # print the image name
            assert (boxes[:, 1] <= boxes[:, 3]).all()  # ymin <= ymax
            assert (boxes[:, 1] >= 0).all()            # ymin >= 0 (0-based)
            assert (boxes[:, 3] < heights[i]).all()    # ymax < height (0-based)
            assert (oldx2 < widths[i]).all()           # xmax < width (0-based)
            assert (oldx1 >= 0).all()                  # xmin >= 0 (0-based)
            assert (oldx2 >= oldx1).all()              # xmax >= xmin
            boxes[:, 0] = widths[i] - oldx2 - 1
            boxes[:, 2] = widths[i] - oldx1 - 1
            # print("num_image:%d" % (i))
            assert (boxes[:, 2] >= boxes[:, 0]).all()
            entry = {'boxes': boxes,
                     'gt_overlaps': self.roidb[i]['gt_overlaps'],
                     'gt_classes': self.roidb[i]['gt_classes'],
                     'flipped': True}
            self.roidb.append(entry)
        self._image_index = self._image_index * 2

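Then run it. Wherever a y annotation is wrong, an AssertionError is raised; look at the last image name printed in the log, open the corresponding Annotation file, and fix the bad box. Do not forget to delete the py-faster-rcnn/data/cache cache afterwards. Then run again, fix the annotation of the next image that triggers an AssertionError, delete the cache again, and repeat until every annotation error has been found. After that it finally worked, and the mAP was no longer 0.000!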
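The fix-and-rerun loop can be shortened by checking the annotation files directly before training. This is a rough sketch rather than part of py-faster-rcnn: the function `find_bad_boxes` is my own, and it assumes each annotation XML carries a VOC-style size element with width and height (the code above reads the sizes from the image files instead). The sample XML is made up.

```python
import xml.etree.ElementTree as ET


def find_bad_boxes(xml_text, zero_based=True):
    """Return (object name, problems) for every invalid box in one annotation."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.find("width").text)
    height = int(size.find("height").text)
    lo = 0 if zero_based else 1                    # smallest legal coordinate
    x_hi = width - 1 if zero_based else width      # largest legal x
    y_hi = height - 1 if zero_based else height    # largest legal y

    bad = []
    for obj in root.iter("object"):
        bbox = obj.find("bndbox")
        x1, y1, x2, y2 = (int(float(bbox.find(tag).text))
                          for tag in ("xmin", "ymin", "xmax", "ymax"))
        checks = [
            (x1 < lo, "x1 out of range"),
            (x2 > x_hi, "x2 out of range"),
            (x2 < x1, "x2 < x1"),
            (y1 < lo, "y1 out of range"),
            (y2 > y_hi, "y2 out of range"),
            (y2 < y1, "y2 < y1"),
        ]
        problems = [message for failed, message in checks if failed]
        if problems:
            bad.append((obj.find("name").text, problems))
    return bad


SAMPLE = """<annotation>
  <size><width>1044</width><height>915</height></size>
  <object><name>aircraft</name>
    <bndbox><xmin>-2</xmin><ymin>10</ymin><xmax>60</xmax><ymax>50</ymax></bndbox>
  </object>
  <object><name>aircraft</name>
    <bndbox><xmin>100</xmin><ymin>20</ymin><xmax>180</xmax><ymax>90</ymax></bndbox>
  </object>
</annotation>"""

print(find_bad_boxes(SAMPLE))
```

Running this over every file in the Annotations folder lists all bad boxes in one pass; the cache still has to be deleted after the corrections.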

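Error 8: Training succeeded with mAP = 0.66, so it was time to test. The details are covered well in another blog post. Running demo.py failed at im_orig = im.astype(np.float32, copy=True) with AttributeError: 'NoneType' object has no attribute 'astype'.

Fix: im is None because the image could not be read, which happens when the path is wrong. Check the paths and file names carefully, and go through everything path-related in demo.py.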
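One way to make this failure easier to diagnose is to validate the path before reading the image. A minimal sketch, where `check_image_path` is a hypothetical helper and not something demo.py provides:

```python
import os


def check_image_path(path):
    """Fail early with a clear message instead of later on a None image."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            "image not found: %s (current directory: %s)" % (path, os.getcwd())
        )
    return path


# Intended use (commented out so the sketch does not depend on OpenCV):
# im = cv2.imread(check_image_path(im_file))
```

cv2.imread returns None rather than raising when it cannot read a file, which is why the error only surfaces at the later astype call.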

That's all.
