点云网络的论文理解(七)-Frustum PointNets for 3D Object Detection from RGB-D Data
名詞解釋
RGB:就是彩色圖像。
RGB-D:就是彩色圖像外加一個深度,這個深度就是攝像頭到那個東西的距離。
單目RGB-D:就是用一個攝像頭采集RGB-D數據。
雙目RGB-D:就是兩個攝像頭一起采集RGB-D數據,這樣類似于兩個眼睛的效果,可以更加有效地推算出位置。
0.Abstract
0.1逐句翻譯
In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes.
在本工作中,我們研究了室內和室外場景下的RGB-D數據的三維目標檢測。(這是目標檢測任務)
While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans.
雖然以前的方法專注于圖像或3D體素,往往模糊了3D數據的自然模式和不變性,但我們通過把RGB-D掃描“彈出”(pop up)為三維點雲,直接在原始點雲上進行操作。(和PointNet類似都是直接作用在點雲上)
However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes(region proposal).
然而,該方法的一個關鍵挑戰是如何有效地定位大規模場景點云中的目標(區域方案)。(就是在大規模的場景中定位出這個物體存在很大的問題)
Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization,achieving efficiency as well as high recall for even small objects.
我們的方法不是完全依賴于3D提議,而是同時利用成熟的2D目標檢測器和先進的3D深度學習來進行目標定位,即使對于小目標也能兼顧效率和高召回率。
對recall不了解可以參考:Precision和Recall
Benefited from learning directly in raw point clouds,our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points.
受益于直接在原始點雲上學習,即使在強遮擋或點非常稀疏的情況下,我們的方法也能夠精確估計三維包圍盒。(大約就是抗干擾能力好)
(這里的三維包圍盒,其實就類似于我們在2d圖形識別當中,我們弄一個小的框來把那個東西框起來)
Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability.
在KITTI和SUN RGB-D 3D檢測基準上進行評估,我們的方法在具有實時能力的同時顯著優于現有技術水平。
0.2總結
這里的總結一下:
- 1.解決什么總體問題:解決三維空間中的目標識別,最終結果是用一個小盒子也就是bounding box把這個東西框起來。
- 2.解決什么痛點 :1.原有的方案中忽略了三維最原始的特性。2.在大的空間當中,小的常常被忽略。3.遮擋的問題。 4.點稀疏的問題。
- 3.怎么解決:1.直接使用點云數據,減少手工處理的影響,2.不是嚴格的使用3D或是2D的方案,是在3D的框架下,結合2D的出色經驗。
1.Introduction
1.1逐句翻譯
第一段(三維數據處理的需求很大,但是做的不多,本文也只是作了其中的一部分)
Recently, great progress has been made on 2D image understanding tasks, such as object detection [10] and instance segmentation [11].
近年來,二維圖像理解任務如目標檢測[10]和實例分割[11]取得了很大的進展。
However, beyond getting 2D bounding boxes or pixel masks, 3D understanding is eagerly in demand in many applications such as autonomous driving and augmented reality (AR).
然而,除了獲得2D邊框或像素掩模外,自動駕駛和增強現實(AR)等許多應用都迫切需要3D理解。
With the popularity of 3D sensors deployed on mobile devices and autonomous vehicles, more and more 3D data is captured and processed.
隨著3D傳感器部署在移動設備和自動駕駛汽車上的普及,越來越多的3D數據被捕獲和處理。
In this work, we study one of the most important 3D perception tasks –3D object detection, which classifies the object category and estimates oriented 3D bounding boxes of physical objects from 3D sensor data.
在本研究中,我們研究了最重要的三維感知任務之一:三維物體檢測,即從三維傳感器數據中對物體類別進行分類,并估計物體的有向(oriented)三維邊界框。
第二段(引出處理點的集合的問題,之后就引出來點云的事情)
While 3D sensor data is often in the form of point clouds,how to represent point cloud and what deep net architectures to use for 3D object detection remains an open problem.
雖然三維傳感器數據通常以點云的形式出現,但如何表示點云以及使用什么樣的深網架構進行三維目標檢測仍然是一個開放性的問題。
Most existing works convert 3D point clouds to images by projection [30, 21] or to volumetric grids by quantization [33, 18, 21] and then apply convolutional networks.
現有的研究大多是通過投影[30,21]或量化[33,18,21]將三維點云轉化為圖像,然后再應用卷積網絡。(這里不提PointNet,可能是想凸顯自己的創新點吧,但是后面也提了)
This data representation transformation, however, may obscure natural 3D patterns and invariances of the data.
然而,這種數據表示轉換可能會掩蓋自然的3D模式和數據的不變性。(老問題了)
Recently, a number of papers have proposed to process point clouds directly without converting them to other formats.
最近,一些論文提出直接處理點云,而不將它們轉換為其他格式。
For example, [20, 22] proposed new types of deep net architectures, called PointNets, which have shown superior performance and efficiency in several 3D understanding tasks such as object classification and semantic segmentation.
例如,[20,22]提出了一種名為PointNets的新型深度網絡架構,該架構在諸如對象分類和語義分割等幾個3D理解任務中表現出了卓越的性能和效率。
第三段(點云只做了分類和語義分割,剩下對象檢測還沒做,所以我們做一下)
While PointNets are capable of classifying a whole point cloud or predicting a semantic class for each point in a point cloud, it is unclear how this architecture can be used for instance-level 3D object detection.
雖然PointNets能夠對整個點云進行分類,或者預測點云中每個點的語義類,但目前還不清楚該架構如何用于實例級3D對象檢測。
Towards this goal, we have to address one key challenge: how to efficiently propose possible locations of 3D objects in a 3D space.
為了實現這個目標,我們必須解決一個關鍵的挑戰:如何有效地在3D空間中提出3D對象的可能位置。
Imitating the practice in image detection, it is straightforward to enumerate candidate 3D boxes by sliding windows [7] or by 3D region proposal networks such as [27].
模仿圖像檢測的實踐,通過滑動窗口[7]或3D區域建議網絡(如[27])枚舉候選3D盒子是很簡單的。
However, the computational complexity of 3D search typically grows cubically with respect to resolution and becomes too expensive for large scenes or real-time applications such as autonomous driving.
然而,3D搜索的計算復雜度通常隨分辨率呈立方增長,對于大型場景或自動駕駛等實時應用來說代價過高。
第四段(使用的方法的簡單介紹)
Instead, in this work, we reduce the search space following the dimension reduction principle: we take the advantage of mature 2D object detectors (Fig. 1).
相反,在這項工作中,我們遵循降維原則減少搜索空間:我們利用成熟的2D目標檢測器(圖1)。
First, we extract the 3D bounding frustum of an object by extruding 2D bounding boxes from image detectors.
首先,通過把圖像檢測器給出的二維邊界框“擠出”(extrude),提取物體的三維邊界截錐體。(也就是把2D框沿視線方向延伸成一個三維的視錐區域)
Then, within the 3D space trimmed by each of the 3D frustums, we consecutively perform 3D object instance segmentation and amodal 3D bounding box regression using two variants of PointNet.
然後,在每個三維截錐體裁剪出的三維空間內,我們使用PointNet的兩個變體依次進行三維對象實例分割和非模態(amodal)3D包圍盒回歸。(之後在這個小範圍內再進行就會節省一些計算量)
The segmentation network predicts the 3D mask of the object of interest (i.e. instance segmentation);
分割網絡預測感興趣對象的三維掩碼(即實例分割);
and the regression network estimates the amodal 3D bounding box (covering the entire object even if only part of it is visible).
回歸網絡估計非模態(amodal)3D包圍盒(即使只有一部分可見,也覆蓋整個物體)。
第五段(三維數據要求我們有更優秀的搜索方法)
In contrast to previous work that treats RGB-D data as 2D maps for CNNs, our method is more 3D-centric as we lift depth maps to 3D point clouds and process them using 3D tools.
與之前把RGB-D數據當作CNN的2D圖來處理的工作相比,我們的方法更以3D為中心:我們將深度圖提升(lift)為3D點雲,并使用3D工具對其進行處理。
This 3D-centric view enables new capabilities for exploring 3D data in a more effective manner.
這種以3D為中心的視圖支持以更有效的方式探索3D數據的新功能。
First, in our pipeline, a few transformations are applied successively on 3D coordinates, which align point clouds into a sequence of more constrained and canonical frames.
首先,在我們的管道中,對3D坐標依次施加若干變換,將點雲對齊到一系列約束更強、更規範的坐標系中。
These alignments factor out pose variations in data, and thus make 3D geometry pattern more evident, leading to an easier job of 3D learners.
這些對齊操作消除了數據中的姿態變化,使三維幾何模式更加明顯,從而讓3D學習器的任務更容易。
Second, learning in 3D space can better exploits the geometric and topological structure of 3D space.
其次,三維空間學習可以更好地利用三維空間的幾何和拓撲結構。
In principle, all objects live in 3D space; therefore, we believe that many geometric structures, such as repetition, planarity, and symmetry, are more naturally parameterized and captured by learners that directly operate in 3D space.
原則上,所有物體都存在于三維空間中;因此,我們認為重復、平面性、對稱性等許多幾何結構,能被直接在三維空間中操作的學習器更自然地參數化和捕獲。(三維空間更加貼近我們的生活,自然而然也將有更多的應用)
The usefulness of this 3D-centric network design philosophy has been supported by much recent experimental evidence.
這種以3d為中心的網絡設計理念的有效性已經得到了許多最新實驗證據的支持。
第六段(介紹網絡的效果好)
Our method achieves leading positions on KITTI 3D object detection [1] and bird’s eye view detection [2] benchmarks.
該方法在KITTI三維目標檢測[1]和鳥瞰圖檢測[2]基準上均取得領先。
Compared with the previous state of the art [5], our method is 8.04% better on 3D car AP with high efficiency (running at 5 fps).
與之前的最先進方法[5]相比,我們的方法在3D汽車AP上高出8.04%,同時保持高效率(以每秒5幀的速度運行)。
Our method also fits well to indoor RGB-D data where we have achieved 8.9% and 6.4% better 3D mAP than [13] and [24] on SUN-RGBD while running one to three orders of magnitude faster.
我們的方法也很適合室內RGB-D數據:在SUN-RGBD上,我們的3D mAP分別比[13]和[24]高出8.9%和6.4%,同時運行速度快一到三個數量級。
第七段-作者自己的總結
The key contributions of our work are as follows:
我們工作的主要貢獻如下:
- We propose a novel framework for RGB-D data based 3D object detection called Frustum PointNets.
我們提出了一個基于RGB-D數據的3D對象檢測的新框架,稱為Frustum PointNets。
- We show how we can train 3D object detectors under our framework and achieve state-of-the-art performance on standard 3D object detection benchmarks.
我們展示了如何在我們的框架下訓練3D對象檢測器,并在標準3D對象檢測基準上實現最先進的性能。
- We provide extensive quantitative evaluations to validate our design choices as well as rich qualitative results for understanding the strengths and limitations of our method.
我們提供了廣泛的定量評估,以驗證我們的設計選擇,以及豐富的定性結果,以理解我們的方法的優勢和局限性。
1.2總結
1.三維空間的應用十分重要。
2.三維空間中PointNet的效果很好,但是還沒有解決三維空間中目標檢測的問題。
3.所以本文嘗試去做這個事情。
4.因為三維空間的計算量比較大,所以我們先借助二維空間縮小范圍形成 3D bounding frustum ,之后我們再在縮小范圍的空間進行處理。
5.取得了很好的效果
2.Related Work
2.1逐句翻譯
2.1.第一部分:3D目標檢測的相關工作
第一段(明確問題)
3D Object Detection from RGB-D Data:
Researchers have approached the 3D detection problem by taking various ways to represent RGB-D data.
基于RGB-D數據的三維物體檢測:
研究人員通過各種方法來表示RGB-D數據來解決三維物體檢測問題。
第二段(介紹之前的方法,我們的方法比之前的好)
Front view image based methods:
基于前視圖圖像的方法
[3, 19, 34] take monocular RGB images and shape priors or occlusion patterns to infer 3D bounding boxes.
[3, 19, 34]采用單目RGB圖像以及形狀先驗或遮擋模式來推斷3D邊界框。
[15, 6] represent depth data as 2D maps and apply CNNs to localize objects in 2D image.
[15, 6]將深度數據表示為2D圖,利用CNN在2D圖像中定位對象。
In comparison we represent depth as a point cloud and use advanced 3D deep networks (PointNets) that can exploit 3D geometry more effectively.
相比之下,我們將深度表示為點云,并使用先進的3D深度網絡(PointNets),可以更有效地利用3D幾何。
第三段(鳥瞰圖投影的方法沒有辦法檢測小物體)
Bird’s eye view based methods: MV3D [5] projects LiDAR point cloud to bird’s eye view and trains a region proposal network (RPN [23]) for 3D bounding box proposal.
基于鳥瞰圖的方法:MV3D[5]將LiDAR點云投影到鳥瞰圖上,訓練區域提議網絡(RPN[23])進行3D包圍盒提議。(鳥瞰圖:雖然看起來立體但是他歸根結底還是一個二維的圖像,不太好滿足我們的要求)
However, the method lags behind in detecting small objects, such as pedestrians and cyclists and cannot easily adapt to scenes with multiple objects in vertical direction.
但是,該方法在檢測行人、騎車人等小目標時表現落後,而且難以適應垂直方向上有多個目標的場景。
第四段(介紹之前的方法,但是三維空間的計算開銷比較大)
3D based methods: [31, 28] train 3D object classifiers by SVMs on hand-designed geometry features extracted from point cloud and then localize objects using sliding-window search.
基于三維的方法:[31,28]根據從點云中提取的手工設計的幾何特征,利用支持向量機訓練三維對象分類器,然后使用滑動窗口搜索對對象進行定位。
[7] extends [31] by replacing SVM with 3D CNN on voxelized 3D grids. [24] designs new geometric features for 3D object detection in a point cloud.
[7]擴展了[31],在體素化的3D網格上用3D CNN代替SVM。[24]為點云中的三維物體檢測設計了新的幾何特征。
[29, 14] convert a point cloud of the entire scene into a volumetric grid and use 3D volumetric CNN for object proposal and classification.
[29,14]將整個場景的點云轉化為體積網格,使用3D體積CNN進行對象提案和分類。
Computation cost for those method is usually quite high due to the expensive cost of 3D convolutions and large 3D search space.
由于三維卷積代價昂貴,三維搜索空間大,這些方法的計算成本通常相當高。
(所以,有人提出了2D驅動的辦法,但是使用的是手工特徵,這不好)
Recently, [13] proposes a 2D-driven 3D object detection method that is similar to ours in spirit.
最近,[13]提出了一種2d驅動的3D目標檢測方法,這種方法與我們的精神類似。
However, they use hand-crafted features (based on histogram of point coordinates) with simple fully connected networks to regress 3D box location and pose, which is sub-optimal in both speed and performance.
然而,他們使用手工制作的特征(基于點坐標直方圖)和簡單的全連接網絡回歸3D盒的位置和姿態,這在速度和性能上都不是最優的。
In contrast, we propose a more flexible and effective solution with deep 3D feature learning (PointNets).
相比之下,我們提出了一個更靈活和有效的解決方案,深度三維特征學習(PointNets)。
2.1.第二部分:PointNet的相關工作
第五段(原有網絡仍需要一定的轉換)
Deep Learning on Point Clouds :
Most existing works convert point clouds to images or volumetric forms before feature learning.
大多數現有的作品在特征學習之前將點云轉換為圖像或體積形式。
[33, 18, 21] voxelize point clouds into volumetric grids and generalize image CNNs to 3D CNNs.
[33, 18, 21]將點云體素化為體積網格,將圖像cnn泛化為3D cnn。(也就是轉化為三維的圖像)
[16, 25, 32, 7] design more efficient 3D CNN or neural network architectures that exploit sparsity in point cloud.
[16, 25, 32,7]設計更高效的3D CNN或利用點云稀疏性的神經網絡架構。
However, these CNN based methods still require quantitization of point clouds with certain voxel resolution.
然而,這些基于CNN的方法仍然需要對具有一定體素分辨率的點云進行量化。
Recently, a few works [20, 22] propose a novel type of network architectures (PointNets) that directly consumes raw point clouds without converting them to other formats.
最近,一些工作[20,22]提出了一種新型的網絡架構(PointNets),直接使用原始點云,而不將它們轉換為其他格式。
While PointNets have been applied to single object classification and semantic segmentation, our work explores how to extend the architecture for the purpose of 3D object detection.
雖然PointNets已經被應用于單個物體分類和語義分割,我們的工作則探索如何將該架構擴展到3D目標檢測。
2.2總結
1.之前的研究者嘗試使用各種辦法將三維空間的內容轉化為二維空間,但是都會造成一定程度的信息丟失。所以本文提出了新的方法。
2.PointNet直接使用點云,避免了使用各種轉化導致信息丟失
3. Problem Definition
3.1逐句翻譯
Given RGB-D data as input, our goal is to classify and localize objects in 3D space.
以RGB-D數據作為輸入,我們的目標是對3D空間中的對象進行分類和定位。
The depth data, obtained from LiDAR or indoor depth sensors, is represented as a point cloud in RGB camera coordinates. The projection matrix is also known so that we can get a 3D frustum from a 2D image region.
深度數據由激光雷達或室內深度傳感器獲得,表示為RGB相機坐標系下的點雲。投影矩陣也是已知的,因此我們可以從2D圖像區域得到一個3D截錐體。
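為了更直觀地理解“深度數據表示為RGB相機坐標系下的點雲”,下面給出一段極簡的示意代碼(Python/NumPy),假設採用針孔相機模型;fx、fy、cx、cy 等內參名稱以及示例數值均為說明用的假設,並非論文給出的接口:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """把深度圖(單位:米)反投影為相機坐標系下的 N x 3 點雲。
    針孔模型:x = (u - cx) * z / fx, y = (v - cy) * z / fy。"""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # 丟棄無效深度(z <= 0)的點

# 用法示意:points = depth_to_point_cloud(depth_map, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
```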
Each object is represented by a class (one among k predefined classes) and an amodal 3D bounding box.
每個對象由一個類別(k個預定義類別之一)和一個非模態(amodal)3D邊界框表示。
The amodal box bounds the complete object even if part of the object is occluded or truncated.
非模態框包住完整的對象,即使對象的一部分被遮擋或截斷。
The 3D box is parameterized by its size h, w, l, center cx, cy, cz, and orientation θ, φ, ψ relative to a predefined canonical pose for each category.
三維框由其尺寸h, w, l、中心cx, cy, cz,以及相對于每個類別預定義標準姿態的朝向θ, φ, ψ來參數化。
In our implementation, we only consider the heading angle θ around the up-axis for orientation
在我們的實現中,朝向只考慮繞豎直軸(up-axis)的航向角θ。
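結合上面的定義,可以用一個很小的數據結構來表示這種參數化(僅為示意草圖,字段名是假設的;按論文實現,朝向只保留繞豎直軸的航向角θ):

```python
from dataclasses import dataclass

@dataclass
class AmodalBox3D:
    """非模態3D包圍盒:中心 (cx, cy, cz)、尺寸 (h, w, l)、繞豎直軸的航向角 theta。"""
    cx: float
    cy: float
    cz: float
    h: float
    w: float
    l: float
    theta: float  # 弧度,相對于該類別預定義的標準姿態
```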
3.2總結
1.說明了什么是RGB-d
2.說明了什么是包圍盒.
4. 3D Detection with Frustum PointNets(三維檢測網絡Frustum PointNets)
4.0總體介紹
4.0.1逐句翻譯
As shown in Fig. 2, our system for 3D object detection consists of three modules: frustum proposal, 3D instance segmentation, and 3D amodal bounding box estimation.
如圖2所示,我們的三維目標檢測系統由三個模塊組成:截錐體提議、三維實例分割和三維非模態(amodal)邊界框估計。
We will introduce each module in the following subsections.
我們將在下面的小節中介紹每個模塊。
We will focus on the pipeline and functionality of each module, and refer readers to supplementary for specific architectures of the deep networks involved.
我們將重點介紹每個模塊的管道和功能,所涉及深度網絡的具體架構請讀者參考補充材料。
4.0.2總結
大約就說了一個事情:這個網絡分成三個部分,即截錐體提議、三維實例分割和三維非模態邊界框估計。
4.1Frustum Proposal
4.1.1逐句翻譯
第一段(三維的圖像分辨率不如二維圖片)
The resolution of data produced by most 3D sensors, especially real-time depth sensors, is still lower than RGB images from commodity cameras.
大多數3D傳感器,特別是實時深度傳感器,其產生數據的分辨率仍然低于普通消費級相機的RGB圖像。
Therefore, we leverage mature 2D object detector to propose 2D object regions in RGB images as well as to classify objects.
因此,我們利用成熟的2D目標檢測器提出RGB圖像中的2D目標區域,并對目標進行分類。
(這里大約是為了給后面使用)
第二段(旋轉錐體的過程)
With a known camera projection matrix, a 2D bounding box can be lifted to a frustum (with near and far planes specified by depth sensor range) that defines a 3D search space for the object.
使用已知的攝像機投影矩陣,可以將2D邊界框提升到定義對象3D搜索空間的截錐體(由深度傳感器范圍指定遠近平面)。
We then collect all points within the frustum to form a frustum point cloud.
然后我們收集截錐體內的所有點,形成一個截錐體點云。
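下面用一小段示意代碼說明“把2D框提升為截錐體並收集其中的點”這一步:先用投影矩陣把點雲投到圖像平面,再保留落在2D框內、且深度在傳感器量程內的點。這只是按正文思路寫的草圖,P、box2d 等變量名均為假設:

```python
import numpy as np

def extract_frustum_points(points, P, box2d, near=0.1, far=80.0):
    """points: N x 3 相機坐標系點雲;P: 3 x 4 投影矩陣;box2d: (xmin, ymin, xmax, ymax)。"""
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])  # 齊次坐標 N x 4
    uvw = pts_h @ P.T                                           # 投影到圖像平面
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    xmin, ymin, xmax, ymax = box2d
    mask = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax) \
           & (points[:, 2] > near) & (points[:, 2] < far)       # 近/遠平面由深度量程決定
    return points[mask]
```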
As shown in Fig 4 (a), frustums may orient towards many different directions, which result in large variation in the placement of point clouds.
如圖4 (a)所示,截錐體可能朝向多個不同的方向,導致點云的位置變化很大。
We therefore normalize the frustums by rotating them toward a center view such that the center axis of the frustum is orthogonal to the image plane.
因此,我們通過將截錐體向中心視圖旋轉,使截錐體的中心軸與成像平面正交,從而使其標準化。
This normalization helps improve the rotation-invariance of the algorithm.
這種歸一化有助于提高算法的旋轉不變性。
We call this entire procedure for extracting frustum point clouds from RGB-D data frustum proposal generation.
我們把這個從RGB-D數據中提取截錐體點雲的完整過程稱為截錐體提議生成(frustum proposal generation)。
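這一步的歸一化旋轉可以按如下思路實現:取2D框中心對應的視線方向,繞相機坐標系的豎直軸旋轉整個截錐體點雲,使該方向與圖像平面正交。以下僅是示意草圖,假設相機坐標系為 x 向右、y 向下、z 向前:

```python
import numpy as np

def rotate_frustum_to_center(points, frustum_angle):
    """frustum_angle: 截錐體中心軸在 x-z 平面內與 z 軸的夾角(弧度),
    可由2D框中心反投影出的射線方向計算:np.arctan2(x_center, z_center)。
    繞 y 軸旋轉 -frustum_angle,使中心軸對齊 z 軸(即與成像平面正交)。"""
    c, s = np.cos(-frustum_angle), np.sin(-frustum_angle)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return points @ rot_y.T
```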
第三段(2D檢測器的選擇與訓練)
While our 3D detection framework is agnostic to the exact method for 2D region proposal, we adopt a FPN [17] based model.
雖然我們的三維檢測框架并不依賴于具體的2D區域提議方法,但我們采用了基于FPN[17]的模型。
We pre-train the model weights on ImageNet classification and COCO object detection datasets and further fine-tune it on a KITTI 2D object detection dataset to classify and predict amodal 2D boxes.
我們先在ImageNet分類和COCO目標檢測數據集上預訓練模型權重,再在KITTI 2D目標檢測數據集上進一步微調,用于分類并預測非模態的2D框。(先用比較通用的數據集來進行預訓練,再用和領域比較相關的數據集進行訓練。)
More details of the 2D detector training are provided in the supplementary.
更多的2D探測器訓練細節在補充部分(supplementary.)提供。
4.1.2總結
- 1.想要進行目標檢測和識別,最關鍵的問題就是先縮小范圍,既然是一個縮小范圍的過程,其實沒必要做的特別準確,只要效率夠好就行,所以本文作者選擇在二維圖像當中先檢測到這個東西。
(這里的主要考量因素是:二維圖像的分辨率較好、二維圖像的檢測算法較成熟)
- 2.我們在二維圖像當中取出來這個東西之后,最關鍵的是我們怎么在三維點云當中選擇出對應的那部分點:只需要依據視角,把選出來的那個圖像區域沿深度方向延伸成錐體就行了,所以就產生了文章中反復提的 frustum point clouds。
- 3.這時候就產生了兩個新的問題:二維的目標檢測一定要準確(畢竟之后的推算的基于二維的目標檢測)、取出來的錐體怎么進行標準化。
3.1.為了提升二維空間的目標檢測的準確度作者首先使用通用的目標檢測數據集進行預訓練(ImageNet classification COCO object detection datasets),之后再用和領域結合的數據集(a KITTI 2D object detection dataset)進行微調。
3.2接下來就是標準化的問題。這個錐體的方向是五花八門的,所以需要對其進行歸一化。文章提出來既然已經成為了一個獨立的點的集合,那么我們對其進行適當地坐標系變換不就可以了嗎,所以就使用了坐標系變化,使得每個錐體都朝向一個相同的方向。(這里旋轉不需要使用網絡,直接變換成正的就可以了,因為這個東西有明確的方向性)
4.2. 3D Instance Segmentation
4.2.1逐句翻譯
第一段(大約就是說沒有深度因素的二維空間很難解決遮擋問題等,描述為什么不在二維空間直接解決這個問題)
Given a 2D image region (and its corresponding 3D frustum), several methods might be used to obtain 3D location of the object: One straightforward solution is to directly regress 3D object locations (e.g., by 3D bounding box) from a depth map using 2D CNNs.
給定一個2D圖像區域(及其對應的3D截錐體),有幾種方法可以用來獲取物體的3D位置:一個直接的方案是用2D CNN從深度圖直接回歸3D物體位置(例如3D邊界框)。
However, this problem is not easy as occluding objects and background clutter is common in natural scenes (as in Fig. 3), which may severely distract the 3D localization task.
然而,這一問題并不容易解決,因為自然場景中普遍存在遮擋物體和背景雜波(如圖3所示),這可能會嚴重干擾三維定位任務。
Because objects are naturally separated in physical space, segmentation in 3D point cloud is much more natural and easier than that in images where pixels from distant objects can be near-by to each other.
因為物體在物理空間上是自然分離的,所以在3D點云中分割要比在圖像中分割更自然、更容易,在圖像中,來自遙遠物體的像素可以彼此相鄰。
Having observed this fact, we propose to segment instances in 3D point cloud instead of in 2D image or depth map.
考慮到這一事實,我們建議在3D點云中分割實例,而不是在2D圖像或深度映射中。
Similar to Mask-RCNN [11], which achieves instance segmentation by binary classification of pixels in image regions, we realize 3D instance segmentation using a PointNet-based network on point clouds in frustums.
與Mask-RCNN[11]通過對圖像區域內的像素做二值分類來實現實例分割類似,我們用基于PointNet的網絡在截錐體內的點雲上實現三維實例分割。
第二段
Based on 3D instance segmentation, we are able to achieve residual based 3D localization.
在三維實例分割的基礎上,實現了基于殘差的三維定位。
That is, rather than regressing the absolute 3D location of the object whose off-set from the sensor may vary in large ranges (e.g. from 5m to beyond 50m in KITTI data),
也就是說,不是回歸物體的絕對三維位置,其與傳感器的偏差可能在很大范圍內變化(例如,KITTI數據中從5米到超過50米)
we predict the 3D bounding box center in a local coordinate system – 3D mask coordinates as shown in Fig. 4 (c).
我們在一個局部坐標系(3D mask坐標系)中預測三維包圍盒的中心,如圖4 (c)所示。
第三段(分割PointNet的輸入與輸出)
3D Instance Segmentation PointNet.
The network takes a point cloud in frustum and predicts a probability score for each point that indicates how likely the point belongs to the object of interest.
該網絡在截錐體中取一個點云,并預測每個點的概率分數,該分數表明該點屬于感興趣的對象的可能性有多大。
Note that each frustum contains exactly one object of interest.
注意,每個截錐只包含一個感興趣的對象。
Here those “other” points could be points of non-relevant areas (such as ground, vegetation) or other instances that occlude or are behind the object of interest.
在這里,這些“其他”點可以是不相關區域的點(如地面、植被)或其他遮擋或隱藏在感興趣對象后面的實例。
Similar to the case in 2D instance segmentation, depending on the position of the frustum, object points in one frustum may become cluttered or occlude points in another.
與2D實例分割的情況類似,根據截錐體的位置,一個截錐體中的物體點可能變得雜亂或遮擋另一個截錐體中的點。
Therefore, our segmentation PointNet is learning the occlusion and clutter patterns as well as recognizing the geometry for the object of a certain category.
因此,我們的分割PointNet是在學習遮擋和雜波模式的同時,識別特定類別的對象的幾何形狀。
第四段(提出提前輸入物體種類的想法)
In a multi-class detection case, we also leverage the semantics from a 2D detector for better instance segmentation.
在一個多類檢測案例中,我們還利用來自2D檢測器的語義來更好地進行實例分割。
For example, if we know the object of interest is a pedestrian, then the segmentation network can use this prior to find geometries that look like a person.
例如,如果我們知道感興趣的對象是一個行人,那么分割網絡就可以利用這一點事先找到看起來像人的幾何圖形。
Specifically, in our architecture we encode the semantic category as a one-hot class vector (k dimensional for the pre-defined k categories) and concatenate the one-hot vector to the intermediate point cloud features. More details of the specific architectures are described in the supplementary.
具體來說,在我們的體系結構中,我們將語義類別編碼為一個one-hot類別向量(對應預定義的k個類別,維度為k),并將該one-hot向量拼接到中間的點雲特徵上。具體架構的更多細節見補充材料。
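把2D檢測給出的類別作為先驗、拼接到逐點特徵上的做法大致如下(示意代碼,特徵維度與變量名均為假設):

```python
import numpy as np

def concat_class_onehot(point_features, class_id, num_classes):
    """point_features: N x C 的逐點中間特徵;class_id: 2D檢測器給出的類別索引。
    返回 N x (C + num_classes):每個點都拼接同一個 one-hot 類別向量。"""
    one_hot = np.zeros(num_classes)
    one_hot[class_id] = 1.0
    tiled = np.tile(one_hot, (point_features.shape[0], 1))
    return np.concatenate([point_features, tiled], axis=1)
```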
第五段(轉換成相同的方向可以增強平移不變性)
After 3D instance segmentation, points that are classified as the object of interest are extracted (“masking” in Fig. 2).
三維實例分割后,提取出分類為感興趣對象的點(圖2中的“masking”)。
Having obtained these segmented object points, we further normalize its coordinates to boost the translational invariance of the algorithm, following the same rationale as in the frustum proposal step.
在獲得這些分割的目標點之后,我們進一步規范化其坐標,以提高算法的平移不變性,遵循與截錐體建議步驟相同的原理。
In our implementation, we transform the point cloud into a local coordinate by subtracting XYZ values by its centroid.
在我們的實現中,我們將各點的XYZ坐標減去點雲的質心,從而把點雲變換到局部坐標系。
This is illustrated in Fig. 4 (c). Note that we intentionally do not scale the point cloud, because the bounding sphere size of a partial point cloud can be greatly affected by viewpoints and the real size of the point cloud helps the box size estimation.
如圖4 (c)所示。請注意,我們有意不對點雲進行縮放,因為部分點雲的包圍球大小會受視點很大的影響,而點雲的真實尺寸有助于估計框的大小。
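分割之后的“masking + 去質心”這一步可以寫成下面的簡短草圖(僅作示意;與正文一致,這里只做平移、不做縮放):

```python
import numpy as np

def mask_and_center(points, seg_logits):
    """points: N x 3 截錐體點雲;seg_logits: N x 2 逐點的背景/前景得分。
    取出前景點并減去其質心,得到 3D mask 坐標系下的點。"""
    fg = points[seg_logits[:, 1] > seg_logits[:, 0]]  # 被判為前景(感興趣對象)的點
    if fg.shape[0] == 0:
        fg = points                                    # 退化情況:沒有前景點時退回全部點
    centroid = fg.mean(axis=0)
    return fg - centroid, centroid                     # 質心保留下來,后續用于恢復絕對中心
```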
第六段
In our experiments, we find that coordinate transformations such as the one above and the previous frustum rotation are critical for 3D detection result as shown in Tab. 8.
在我們的實驗中,我們發現上述這樣的坐標變換以及之前的截錐體旋轉,對三維檢測結果是至關重要的,如表8所示。
4.2.2總結
- 二維的可以檢測為什么不用二維?
因為二維方法在解決遮擋問題上存在缺陷,另外兩個本身相距遙遠的點在同一張圖片當中可能是相鄰的。
- 定位采用基于殘差的回歸,而不是直接回歸離傳感器可能很遠的絕對3D位置
- 每個錐體中只能有一個需要識別的對象
- 引入之前的分類結果:
我們在二維空間完成的是一個object detect,這個其實是包括兩個結果一個是分割的結果、一個是識別的結果。我們上面利用的信息只有一個也就是分割的結果,所以思考怎么利用分類結果接成為了一個需要考慮的因素。
所以本文作者就將分類信息做成一個one-hot-vector輸入到到縮小范圍的過程中,這樣可以更好地輔助識別。
這個其實很好理解,本身是一個multi-class object detection,經過這么一處理,直接就變成了一個單獨類別的目標檢測了。找某種確定的東西顯然比漫無目的地找東西要更容易一些,顯然這樣也能讓網絡更加有針對性。
4.3. Amodal 3D Box Estimation
逐句翻譯
第一段(T-Net的中心點對其)
Given the segmented object points (in 3D mask coordinate), this module estimates the object’s amodal oriented 3D bounding box by using a box regression PointNet together with a preprocessing transformer network.
給定被分割出的對象點(在3D mask坐標系中表示),該模塊使用一個框回歸PointNet和一個預處理變換網絡(transformer network)來估計對象的非模態有向3D包圍盒。
Learning-based 3D Alignment by T-Net :基于T-Net學習的三維對齊
Even though we have aligned segmented object points according to their centroid position, we find that the origin of the mask coordinate frame (Fig. 4 (c)) may still be quite far from the amodal box center.
即使我們已經按質心位置對齊了分割出的物體點,我們發現mask坐標系的原點(圖4 (c))仍然可能離非模態框的中心相當遠。
We therefore propose to use a light-weight regression PointNet (T-Net) to estimate the true center of the complete object and then transform the coordinate such that the predicted center becomes the origin (Fig. 4 (d)).
因此,我們建議使用一個輕量級回歸點網(T-Net)來估計完整對象的真實中心,然后轉換坐標,使預測中心成為原點(圖4 (d))。
(T-Net:和PointNet中的T-Net類似,只不過這里學習的不是旋轉矩陣,而是一個物體中心點的殘差,并且這里是有監督的。也就是說這里其實是想辦法修正整個錐的中心)
第二段
Amodal 3D Box Estimation PointNet 非模態3D框估計PointNet
The box estimation network predicts amodal bounding boxes (for entire object even if part of it is unseen) for objects given an object point cloud in 3D object coordinate (Fig. 4 (d)).
給定三維物體坐標系下的物體點雲(圖4 (d)),框估計網絡為對象預測非模態包圍盒(即使物體的一部分不可見,也覆蓋整個物體)。
The network architecture is similar to that for object classification [20, 22], however the output is no longer object class scores but parameters for a 3D bounding box.
網絡結構與對象分類類似[20,22],但輸出的不再是對象類分數,而是一個三維邊界框的參數。
As stated in Sec. 3, we parameterize a 3D bounding box by its center (cx, cy, cz), size (h, w, l) and heading angle θ (along up-axis).
如第3節所述,我們通過其中心(cx, cy, cz),大小(h, w, l)和航向角度θ(沿上軸)參數化三維邊界框。
We take a “residual” approach for box center estimation. The center residual predicted by the box estimation network is combined with the previous center residual from the T-Net and the masked points’ centroid to recover an absolute center (Eq. 1).
我們采用“殘差”方法進行框中心估計。框估計網絡預測的中心殘差,與T-Net之前給出的中心殘差以及掩模點的質心相結合,恢復出絕對中心(Eq. 1)。
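論文中的 Eq. 1 大致就是把這三段中心相加(下式按上文描述整理,符號記法是示意性的):

$$
C_{pred} = C_{mask} + \Delta C_{t\text{-}net} + \Delta C_{box\text{-}net}
$$

其中 $C_{mask}$ 是掩模點的質心,$\Delta C_{t\text{-}net}$ 是 T-Net 預測的中心殘差,$\Delta C_{box\text{-}net}$ 是框估計網絡預測的中心殘差。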
For box size and heading angle, we follow previous works [23, 19] and use a hybrid of classification and regression formulations.
對于盒子大小和朝向角度,我們遵循以前的工作[23,19],并使用分類和回歸公式的混合。
Specifically we pre-define NS size templates and NH equally split angle bins.
具體來說,我們預先定義了NS個尺寸模板和NH個等分的航向角區間(bins)。
Our model will both classify size/heading (NS scores for size, NH scores for heading) to those pre-defined categories as well as predict residual numbers for each category (3×NS residual dimensions for height, width, length, NH residual angles for heading). In the end the net outputs 3 + 4 × NS + 2 × NH numbers in total.
我們的模型既會把尺寸/航向分類到這些預定義的類別中(尺寸有NS個分數,航向有NH個分數),也會為每個類別預測殘差(高、寬、長共3×NS個殘差維度,航向有NH個殘差角度)。最終網絡總共輸出3 + 4 × NS + 2 × NH個數。
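按這段描述,輸出維度與“分類 + 殘差”的解碼方式大致如下(示意代碼;NS、NH 的取值只是示例,航向角的解碼公式按正文推斷):

```python
import numpy as np

NS, NH = 8, 12                     # 預定義的尺寸模板數和航向角分箱數(示例值)
output_dim = 3 + 4 * NS + 2 * NH   # 中心殘差3 + 尺寸(NS個分數 + 3*NS個殘差) + 航向(NH個分數 + NH個殘差)

def decode_heading(heading_scores, heading_residuals):
    """heading_scores: NH 個分箱的分類得分;heading_residuals: 每個分箱對應的殘差角。"""
    bin_id = int(np.argmax(heading_scores))
    bin_size = 2 * np.pi / NH                      # 每個角度分箱的寬度
    bin_center = bin_id * bin_size
    return bin_center + heading_residuals[bin_id]  # 分箱中心 + 殘差 = 最終航向角
```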
4.4. Training with Multi-task Losses
We simultaneously optimize the three nets involved (3D instance segmentation PointNet, T-Net and amodal box estimation PointNet) with multi-task losses (as in Eq. 2).
我們用多任務損失(如Eq. 2所示)同時優化所涉及的三個網絡(三維實例分割PointNet、T-Net和非模態框估計PointNet)。
(也就是說,我們這里雖然是一個流程走下來的,但是中間有很多步驟的結果是有標簽的,所以這個和我們傳統的監督學習不一樣了)
這個位置只能直接看原文了,大約就是作者自己定義了一個損失函數
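按原文的描述,Eq. 2 的多任務損失大致是下面這種形式(根據論文整理,λ、γ 等權重符號的記法可能與原文略有出入):

$$
L_{multi\text{-}task} = L_{seg} + \lambda \big( L_{c1\text{-}reg} + L_{c2\text{-}reg} + L_{h\text{-}cls} + L_{h\text{-}reg} + L_{s\text{-}cls} + L_{s\text{-}reg} + \gamma L_{corner} \big)
$$

其中 $L_{seg}$ 是實例分割損失,$L_{c1\text{-}reg}$ 和 $L_{c2\text{-}reg}$ 分別對應 T-Net 與框估計網絡的中心回歸損失,下標 h、s 分別表示航向和尺寸的分類/回歸損失,$L_{corner}$ 即下文的角點損失。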
Corner Loss for Joint Optimization of Box Parameters(就是這里的幾個東西要同時達到最佳,所以不能使用簡單的損失函數)
While our 3D bounding box parameterization is compact and complete, learning is not optimized for final 3D box accuracy – center, size and heading have separate loss terms.
雖然我們的三維包圍盒參數化是緊湊且完備的,但這樣的學習并沒有針對最終的三維框精度做優化:中心、尺寸和朝向各有單獨的損失項。
Imagine cases where center and size are accurately predicted but heading angle is off – the 3D IoU with ground truth box will then be dominated by the angle error. Ideally all three terms (center,size,heading) should be jointly optimized for best 3D box estimation (under IoU metric).
設想中心和尺寸都預測準確、但航向角偏離的情況:此時與真值框的3D IoU將被角度誤差主導。理想情況下,三個量(中心、尺寸、朝向)應該被聯合優化,以獲得(在IoU度量下)最佳的3D框估計。
To resolve this problem we propose a novel regularization loss, the corner loss:
為了解決這一問題,我們提出了一種新的正則化損失,即角點損失:
第三段( corner loss很好用)
In essence, the corner loss is the sum of the distances between the eight corners of a predicted box and a ground truth box.
本質上,角點損失是預測框的八個角點與真值框對應角點之間的距離之和。
Since corner positions are jointly determined by center, size and heading, the corner loss is able to regularize the multi-task training for those parameters.
由于角點位置是由中心、尺寸和朝向共同確定的,角點損失能夠對這些參數的多任務訓練起到正則化作用。
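角點損失可以寫成如下形式(按原文含義整理:$P_k^{ij}$ 表示第 i 個尺寸模板、第 j 個航向分箱下預測框的第 k 個角點,$P_k^{*}$ 是真值框角點,$P_k^{**}$ 是把真值框繞豎直軸翻轉180°后的角點,$\delta_{ij}$ 指示真值所落的尺寸/航向組合):

$$
L_{corner} = \sum_{i=1}^{NS} \sum_{j=1}^{NH} \delta_{ij} \, \min \Big\{ \sum_{k=1}^{8} \lVert P_k^{ij} - P_k^{*} \rVert , \; \sum_{k=1}^{8} \lVert P_k^{ij} - P_k^{**} \rVert \Big\}
$$

取與翻轉框距離的較小值,是為了避免朝向恰好相差180°時受到過大的懲罰。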
總結