Foreground Image Segmentation with FgSegNet
Introduction
A tough challenge that remains to be solved robustly is foreground image segmentation (or, viewed from a different perspective, background subtraction). The task may sound trivial: create a binary mask where only the pixels of a moving/important object are marked. However, this can become particularly difficult when real-world variabilities are introduced into the picture (no pun intended). For example, a truly robust image segmentation model must account for all of the following:
- Subtle changes in the (background) scenery
- Ignoring moving trees, leaves, snow, rain, shadows, etc.
- Handling scenarios with bad lighting
- Dealing with camera jitter and/or motion
- Camouflaged objects in ambiguous areas within the camera’s field of view
The list goes on…
One of the first approaches to this task that was reasonably robust (in its time) is statistical in nature. Specifically, it involved the use of multiple Gaussian models to map the distribution of each color’s values (i.e. RGB) for each pixel of the input. If a pixel’s color values did not match its Gaussian distributions in a particular frame, the pixel could be classified as holding a foreground object. This method was still very susceptible to the challenges above, but it was nonetheless a breakthrough for robust image segmentation in its time (1999) [2].
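To make the statistical idea concrete, here is a minimal NumPy sketch of a per-pixel Gaussian background model. Note that this is a simplification: it keeps a single Gaussian per pixel per channel, whereas the 1999 method [2] maintains a mixture of several Gaussians per pixel, and all parameter values here (learning rate, threshold, initial variance) are illustrative assumptions rather than values from that paper.

```python
import numpy as np

class GaussianBackgroundModel:
    """Simplified per-pixel background model: one Gaussian per pixel per
    channel (the 1999 method [2] keeps a mixture of several Gaussians)."""

    def __init__(self, first_frame, alpha=0.05, threshold=2.5):
        self.mean = first_frame.astype(np.float64)
        self.var = np.full_like(self.mean, 50.0)   # initial variance guess
        self.alpha = alpha          # learning rate for the running updates
        self.threshold = threshold  # distance cutoff in standard deviations

    def apply(self, frame):
        frame = frame.astype(np.float64)
        # A pixel is foreground if it lies too many standard deviations
        # from its background mean in any channel.
        dist = np.abs(frame - self.mean) / np.sqrt(self.var)
        fg_mask = (dist > self.threshold).any(axis=-1)
        # Update the model only where the pixel matched the background.
        bg = ~fg_mask[..., None]
        self.mean = np.where(bg, (1 - self.alpha) * self.mean + self.alpha * frame, self.mean)
        self.var = np.where(bg, (1 - self.alpha) * self.var + self.alpha * (frame - self.mean) ** 2, self.var)
        return fg_mask

np.random.seed(0)
# Static gray background; a bright square appears in the final frame.
background = np.full((32, 32, 3), 100.0)
model = GaussianBackgroundModel(background)
for _ in range(20):                      # warm up on noisy background frames
    model.apply(background + np.random.normal(0, 2, background.shape))
frame = background.copy()
frame[8:16, 8:16] = 250.0                # the "foreground object"
mask = model.apply(frame)                # True only inside the square
```

Pixels that drift slightly (sensor noise, subtle lighting changes) stay within their Gaussians and are absorbed into the running mean/variance, while the bright square falls far outside them and is flagged as foreground.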
Fast-forward to recent times, and we now have enough computing power and data for Convolutional Neural Networks (CNNs) and other complex models to operate rather accurately, let alone simple feedforward networks. Unsurprisingly, there are now more deep learning-based methods of background subtraction than ever.
We’ll take a look at the Foreground Segmentation Network, or FgSegNet, a recently proposed, top-performing neural network architecture that uses multiple CNNs and a Transposed CNN (TCNN) to achieve background subtraction [1].
Theory
CNNs and TCNNs
FgSegNet uses 3 Convolutional Neural Networks (CNNs) and a single Transposed CNN (TCNN) within its architecture. Specifically, the architecture uses a pre-trained VGG-16 model for each of its CNNs.
As a recap, CNNs are widely used for image feature extraction and thus work well in image classification. A convolution layer operates via kernels (small 2D matrices with initialized, but learnable, values/weights) that slide across the image input and aggregate the products between the kernel’s values and the corresponding input pixels. In other words, kernels convolve across the input as the layer operates, hence the C in CNN. Convolutional layers are best understood when seen in action, so here’s a sweet GIF of the kernel operation.
Example of a CNN kernel. The teal grid is the complete output, the blue grid is the image input, and the shaded outline is the kernel. The kernel multiplies each of its values with the corresponding input values and sums the products to produce one output pixel. [3]

So what’s a Transposed CNN? Think of it as the exact opposite of what a CNN is supposed to do. A CNN naturally produces output that is generally smaller than its input; the reverse holds for TCNNs, since their kernels perform the inverse of the CNN kernel operation.
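The shrinking/expanding relationship between the two operations can be sketched in a few lines of NumPy. This is a bare-bones illustration (single channel, stride 1, no padding, fixed kernel), not FgSegNet's actual layers:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution: slide the kernel over the input and sum the
    element-wise products; the output is SMALLER than the input."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def transposed_conv2d(fmap, kernel):
    """Transposed convolution: each input value 'stamps' a scaled copy of
    the kernel onto the output; the output is LARGER than the input."""
    kh, kw = kernel.shape
    oh, ow = fmap.shape[0] + kh - 1, fmap.shape[1] + kw - 1
    out = np.zeros((oh, ow))
    for i in range(fmap.shape[0]):
        for j in range(fmap.shape[1]):
            out[i:i+kh, j:j+kw] += fmap[i, j] * kernel
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.ones((3, 3))
small = conv2d(image, kernel)            # 4x4 -> 2x2: the encoder direction
big = transposed_conv2d(small, kernel)   # 2x2 -> 4x4: the decoder direction
print(small.shape, big.shape)            # (2, 2) (4, 4)
```

The kernel weights here are fixed for clarity; in a real CNN/TCNN they are learned during training.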
From the architecture diagram above, FgSegNet feeds the input into three CNNs first, concatenates the three outputs, and feeds that as an input to the TCNN. The final output is a binary mask.
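The fusion step can be sketched at the level of array shapes. The spatial and channel dimensions below are illustrative assumptions, not the paper's exact numbers:

```python
import numpy as np

# Hypothetical feature maps produced by the three parallel CNN branches
# (assumed to share a common spatial size at the fusion point).
h, w, c = 30, 40, 64
branch_1 = np.random.rand(h, w, c)
branch_2 = np.random.rand(h, w, c)
branch_3 = np.random.rand(h, w, c)

# FgSegNet-style fusion: concatenate along the channel axis, producing the
# single feature volume that the TCNN decoder then upsamples into a mask.
fused = np.concatenate([branch_1, branch_2, branch_3], axis=-1)
print(fused.shape)  # (30, 40, 192)
```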
Why “autoencoder-like”?
The paper calls FgSegNet an “encoder-decoder type network model”, and rightfully so. For starters, the architecture resembles that of an autoencoder: the input is first funneled/bottlenecked into a smaller feature map (encoding), then expanded back to its original shape by the second half of the model (decoding). CNNs are encoders by nature too, since they extract features via their kernel operations. With that said, it would also be fair to say that TCNNs (being the opposite of CNNs) act as decoders.
To put it in layman’s terms, I presume the point of having encoders in a task such as foreground extraction is to pinpoint the “important” features of the image frame that are subject to change, and to use that condensed information to output a mask with the TCNN. Having 3 CNNs of different shapes working in parallel also supports this notion, and allows the model to be more versatile with different sizes of foreground objects.
Performance
The Dataset
One prominent dataset for foreground extractor models is the CDnet2014 dataset. CDnet2014 includes 11 different challenging scenarios for the model (i.e. bad weather, camera jitter, night videos, etc.), each scenario containing 4 to 6 video sequences. The dataset includes ground-truth images/masks, marking all foreground objects and shadows for each frame. FgSegNet was tested on multiple image datasets, one of which is CDnet2014.
Model Implementation
FgSegNet was built using the Keras and TensorFlow frameworks. All of its layers (except the last) use the ReLU activation function, and multiple pooling layers of the VGG-16 CNNs are replaced with dropout layers. In addition to dropout, the model utilizes L2 weight regularization. RMSProp was used as the optimizer, together with a learning rate reducer that activates when the validation loss stops improving for 6 epochs. Finally, the model allows the choice to train with 50 or 200 image frames.
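The learning rate reducer behaves like Keras's `ReduceLROnPlateau` callback. Here is a minimal re-implementation of that logic, showing just the "reduce when validation loss stalls for 6 epochs" rule; the reduction factor and the minimum learning rate below are illustrative assumptions, not values from the paper:

```python
class LearningRateReducer:
    """Minimal ReduceLROnPlateau-style logic: if validation loss has not
    improved for `patience` epochs, multiply the learning rate by `factor`.
    (factor, min_lr, and the initial lr here are illustrative only.)"""

    def __init__(self, lr=1e-4, factor=0.1, patience=6, min_lr=1e-7):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def on_epoch_end(self, val_loss):
        if val_loss < self.best:      # loss improved: reset the counter
            self.best = val_loss
            self.wait = 0
        else:                         # loss stalled: count the epoch
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

reducer = LearningRateReducer()
losses = [0.5, 0.4, 0.39] + [0.39] * 6   # val loss plateaus for 6 epochs
for loss in losses:
    lr = reducer.on_epoch_end(loss)
print(lr)  # 1e-05: reduced once after the 6-epoch plateau
```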
To learn more about the implementation of the model, see its paper [1].
Evaluation
FgSegNet was among the best-performing models during evaluation with the CDnet2014 dataset. When trained with 200 frames, the average F-score achieved was 0.9734 (scores range from 0 [worst] to 1 [best]), and 0.9545 with 50 frames. Below is an example of the categories of scenarios, along with the ground truths and the masks generated by the model.
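For reference, the F-score on a pair of binary masks is just the harmonic mean of precision and recall over foreground pixels. A minimal NumPy sketch (the example masks are made up for illustration):

```python
import numpy as np

def f_score(pred, gt):
    """F1 score between a predicted binary mask and a ground-truth mask:
    the harmonic mean of precision and recall over foreground pixels."""
    tp = np.logical_and(pred, gt).sum()    # correctly flagged foreground
    fp = np.logical_and(pred, ~gt).sum()   # background flagged as foreground
    fn = np.logical_and(~pred, gt).sum()   # foreground that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = np.zeros((10, 10), dtype=bool)
gt[2:6, 2:6] = True                  # 16 true foreground pixels
pred = np.zeros((10, 10), dtype=bool)
pred[2:6, 2:7] = True                # predicts 20 pixels, 16 of them correct
print(round(f_score(pred, gt), 4))   # 0.8889
```

An F-score of 0.9734 therefore means the predicted masks agree with the ground truths almost pixel-for-pixel across the dataset.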
(a) the raw images, (b) the ground truths, (c) the masks generated by FgSegNet; the rows show, from top to bottom, the raw images, the ground truths, and the model-generated masks. [1]

As you can see, the model-generated masks are quite impressive, especially for the three classes following baseline.
On the left, you can find some instances where the model performed unexpectedly poorly. However, these samples look quite troublesome even for a human to segment.
Overall, FgSegNet is very impressive and is a great take on the foreground extraction task.
If you’d like to take a look, here’s the GitHub repo with the FgSegNet source code:
Translated from: https://towardsdatascience.com/foreground-image-segmentation-with-fgsegnet-9ecbe3d194ab