Paper Reading: LGPMA (ICDAR 2021, Hikvision), a Table Recognition Algorithm, with a Walkthrough of the Official Source Code

Published: 2023/12/15

- Introduction
- 2022-06-08 update
- Overall structure of LGPMA
- Training stage
  - Aligned Bounding Box Detection
  - Local Pyramid Mask Alignment (LPMA)
  - Global Pyramid Mask Alignment (GPMA)
- Inference stage
  - Inference: Aligned Bounding Box Refine
  - Inference: Table Structure Recovery
- Summary

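Introduction

LGPMA: Complicated Table Structure Recognition with Local and Global Pyramid Mask Alignment is Hikvision's winning solution in the ICDAR 2021 Table Recognition track. The source code for the method has been open-sourced, which is a generous move.
PDF | Code
The method belongs to the family of deep-learning table recognition approaches that frame the task as object detection. For background, see: OCR之表格結構識別綜述 (a survey of table structure recognition in OCR).
The code is integrated into Hikvision's DAVAR OCR toolbox. The base backbone used by LGPMA is implemented on top of MMDet and MMCV, and the deep learning framework is PyTorch.
This post pairs each part of the paper with the code that implements it, so that the key points of the paper and the way they are implemented are easy to locate and quicker to learn.
If you read English comfortably, I recommend reading the paper itself first to get the full picture.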

2022-06-08 update

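I have open-sourced an inference-only repository, LGPMA_Infer, derived from the official LGPMA code.
In that repository I also tried converting the model to ONNX. The conversion itself succeeded, but conversion and inference consumed too much memory (even 128 GB of RAM was not enough), so I gave up on it.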

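Overall structure of LGPMA

Based on the paper and its framework diagram, the overall structure splits into five parts: three in the training stage (Aligned Bounding Box Detection, LPMA, GPMA) and two in the inference stage (Aligned Bounding Box Refine, Table Structure Recovery). I go through them one by one below.
The code for the overall structure is in demo/table_recognition/lgpma/configs/lgpma_base.py. All MMDet-family configuration files are Python files that specify settings as dictionaries. Personally, I find this less intuitive than YAML files.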

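Training stage

Aligned Bounding Box Detection

This branch directly detects the aligned bounding boxes of non-empty cells. For example, in the figure below, looking column by column, the red boxes of the cells have essentially the same width and height, and every cell contains a value. This is what "aligned bounding box for non-empty cells" means.
Why train on aligned bounding boxes?
Answer: if we can obtain aligned cells and the table has no empty cells, then the row and column coordinates of each cell make the relationships between cells easy to derive, and the whole table is then easy to reconstruct.
How are the aligned bounding box labels obtained?
Answer: from the existing text-region annotations of the table. Take the height of the tallest text box in each row as the height of the cells in that row, and the width of the widest text box in each column as the width of the cells in that column. This yields the aligned bounding box annotations.
Because the cell box labels are already aligned at training time, the boxes predicted at inference time for cells in the same column are essentially consistent.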
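The label-generation rule can be illustrated with a small sketch. It is not the official preprocessing code: the function name `aligned_boxes` is hypothetical, it assumes every text box already has a known row and column index and that no cell spans several rows or columns, and it uses the shared extent of each row and column as one way to realise the "tallest box per row, widest box per column" rule.

```python
def aligned_boxes(text_boxes, rows, cols):
    """Expand text boxes (x1, y1, x2, y2) to row/column-aligned cell boxes.

    rows[i] / cols[i] are the (assumed known) row and column index of box i.
    """
    row_top, row_bottom, col_left, col_right = {}, {}, {}, {}
    for (x1, y1, x2, y2), r, c in zip(text_boxes, rows, cols):
        # Vertical extent shared by all boxes of the same row.
        row_top[r] = min(row_top.get(r, y1), y1)
        row_bottom[r] = max(row_bottom.get(r, y2), y2)
        # Horizontal extent shared by all boxes of the same column.
        col_left[c] = min(col_left.get(c, x1), x1)
        col_right[c] = max(col_right.get(c, x2), x2)
    return [(col_left[c], row_top[r], col_right[c], row_bottom[r])
            for r, c in zip(rows, cols)]


# A 2 x 2 table with text boxes of different sizes.
text_boxes = [(10, 12, 40, 20), (60, 10, 120, 22),
              (5, 30, 45, 38), (70, 32, 100, 40)]
rows = [0, 0, 1, 1]
cols = [0, 1, 0, 1]
print(aligned_boxes(text_boxes, rows, cols))
```

After this step, boxes in the same row share their top and bottom coordinates and boxes in the same column share their left and right coordinates.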
During training, some tables contain empty cells, which makes it hard for this branch to learn useful information well. This motivates the two branches described below, LPMA and GPMA.
This part is implemented mainly through mmdetection's interfaces. The relevant section of the configuration file is quoted here; for the usage of each interface, see the mmdetection documentation.

    roi_head=dict(
        type='LGPMARoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=2,  # presumably a binary classification: text or no text
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0., 0., 0., 0.],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0)),
        mask_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        mask_head=dict(
            type='LPMAMaskHead',  # configuration of the LPMA head
            num_convs=4,
            in_channels=256,
            conv_out_channels=256,
            num_classes=2,
            loss_mask=dict(type='CrossEntropyLoss', use_mask=True, loss_weight=1.0),
            loss_lpma=dict(type='L1Loss', loss_weight=1.0))),

Local Pyramid Mask Alignment (LPMA)

This branch targets the text region inside each cell.
It has two parts:
- The first is a binary segmentation task that only decides whether a region is a text region; the corresponding term is visible in the loss.
- The second is pyramid mask regression: for each predicted bounding box, soft labels are assigned in the horizontal and the vertical direction. The idea comes from Pyramid Mask Text Detector.
- See the figure below. Put simply, inside the blue box, pixels around the red box where the text appears are given higher weights.
The reason for doing this is that soft-label segmentation may break through the limits of the proposed bounding box and predict more accurate aligned bounding boxes.
The code is in davarocr/davar_table/models/roi_heads/mask_heads/lpma_mask_head.py.
The code that generates the local pyramid mask is in davarocr/davar_table/core/mask/lp_mask_target.py, and it corresponds directly to formula (1) in the paper:
$t_{h}^{(w, h)} = \left\{ \begin{aligned} & w/x_{1} & w \leq x_{mid} \\ &\frac{W - w}{W - x_{2}} & w > x_{mid} \end{aligned} \right. \quad t_{v}^{(w, h)} = \left\{ \begin{aligned} & h/y_{1} & h \leq y_{mid} \\ &\frac{H - h}{H - y_{2}} & h > y_{mid} \end{aligned} \right.$
Here the proposed box has size $H \times W$, $(x_{1}, y_{1})$ and $(x_{2}, y_{2})$ are the top-left and bottom-right corners of the text region, $x_{mid} = \frac{x_{1} + x_{2}}{2}$ and $y_{mid} = \frac{y_{1} + y_{2}}{2}$.
An excerpt of the code:

    # davarocr/davar_table/core/mask/lp_mask_target.py#L57
    # Calculate the pyramid mask in horizontal direction
    col_np = np.arange(x_min, x_max + 1).reshape(1, -1)
    col_np_1 = (col_np[:, :middle_x - x_min] - left_col) / (middle_x - left_col)
    col_np_2 = (right_col - col_np[:, middle_x - x_min:]) / (right_col - middle_x)
    col_np = np.concatenate((col_np_1, col_np_2), axis=1)
    mask_s1[ind, y_min:y_max + 1, x_min:x_max + 1] = col_np

    # Calculate the pyramid mask in vertical direction
    row_np = np.arange(y_min, y_max + 1).reshape(-1, 1)
    row_np_1 = (row_np[:middle_y - y_min, :] - left_row) / (middle_y - left_row)
    row_np_2 = (right_row - row_np[middle_y - y_min:, :]) / (right_row - middle_y)
    row_np = np.concatenate((row_np_1, row_np_2), axis=0)
    mask_s2[ind, y_min:y_max + 1, x_min:x_max + 1] = row_np
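To see what the excerpt produces, here is a minimal pure-Python sketch of the same ramp. It follows the normalization used in the excerpt (0 at the borders, rising linearly to 1 at the middle position) rather than the exact form of formula (1). The function name `ramp_1d` is hypothetical, and the sketch assumes that `left_col` and `right_col` coincide with the first and last pixel of the span and that `left < middle < right`.

```python
def ramp_1d(left, right, middle):
    """Soft labels for integer positions left..right (inclusive).

    Values rise linearly from 0 at `left` to 1 at `middle`,
    then fall linearly back to 0 at `right`.
    """
    values = []
    for pos in range(left, right + 1):
        if pos < middle:
            values.append((pos - left) / (middle - left))
        else:
            values.append((right - pos) / (right - middle))
    return values


# Horizontal profile for a span of 7 pixels whose peak sits at x = 2.
row_profile = ramp_1d(0, 6, 2)
print(row_profile)

# The horizontal mask repeats the profile on every row of the box;
# the vertical mask would repeat a column profile on every column.
height = 3
mask_horizontal = [list(row_profile) for _ in range(height)]
```

The asymmetric slopes on the two sides of the peak are what give the mask its pyramid shape when the text is not centred in the cell.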

Global Pyramid Mask Alignment (GPMA)

Because the receptive field of the LPMA branch is quite limited, global features are added. Note that LPMA only learns from non-empty cells: empty cells contain no text region, so LPMA has nothing to learn from them.
The aspect ratio of cells varies widely, which tends to cause a severe imbalance in regression learning. GPMA therefore adopts the same strategy as LPMA and splits the work into two tasks trained simultaneously: a global segmentation task and a global pyramid mask regression task.
- The global segmentation task directly segments all aligned cells, both empty and non-empty. (Source code)
- The global pyramid mask regression task is the same as in LPMA. (Source code)
The code for this part is in davarocr/davar_table/models/seg_heads/gpma_mask_head.py. To obtain the pyramid masks, the sizes of the empty cells are filled in first, and non-empty cells are handled the same way as in LPMA. An excerpt of the code:

    # davarocr/davar_table/models/seg_heads/gpma_mask_head.py#L228
    mask_pred_ = mask_pred[i, 0, :, :]
    mask_pred_resize = mmcv.imresize(mask_pred_, (w_pad, h_pad))
    mask_pred_resize = mmcv.imresize(mask_pred_resize[:h_img, :w_img], (w_ori, h_ori))
    mask_pred_resize = (mask_pred_resize > 0.5)
    cell_region_mask.append(mask_pred_resize)

    # Pad first, then obtain the horizontal and vertical masks
    reg_pred1_ = reg_pred[i, 0, :, :]
    reg_pred2_ = reg_pred[i, 1, :, :]
    reg_pred1_resize = mmcv.imresize(reg_pred1_, (w_pad, h_pad))
    reg_pred1_resize = mmcv.imresize(reg_pred1_resize[:h_img, :w_img], (w_ori, h_ori))
    reg_pred2_resize = mmcv.imresize(reg_pred2_, (w_pad, h_pad))
    reg_pred2_resize = mmcv.imresize(reg_pred2_resize[:h_img, :w_img], (w_ori, h_ori))
    gp_mask_hor.append(reg_pred1_resize)
    gp_mask_ver.append(reg_pred2_resize)

Inference stage

This stage has two steps: first obtain the refined aligned bounding boxes, then restore the table with the structure recovery pipeline.

Inference: Aligned Bounding Box Refine

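This step exists because pyramid labels are used during training. The whole procedure follows the paper Pyramid Mask Text Detector, which explains it in detail.
To be honest, I have not fully understood this part yet. Interested readers can go straight to section 3.6 of the paper.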
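I compared formula (7) in the paper with the corresponding code. The two look different at first glance, but expanding the sum shows that they agree up to rounding.
- Formula (7):
  $x_{refine} = - \frac{1}{y_{2} - y_{1} + 1}\sum_{y_{i} = y_{1}}^{y_{2}}\frac{b y_{i} + c}{a}$
- The code that implements this step:

    def refine_x(xmin, xmax, ymin, ymax):
        """Refining left boundary or right boundary.

        Args:
            xmin(int): left boundary of original aligned bboxes.
            xmax(int): right boundary of original aligned bboxes.
            ymin(int): top boundary of original aligned bboxes.
            ymax(int): lower boundary of original aligned bboxes.

        Returns:
            int: the refined boundary.
        """
        a_sum = get_matrix(xmin, xmax, ymin, ymax)
        z_sum = get_vector(xmin, xmax, ymin, ymax, soft_mask[0])
        try:
            (a, b, c) = np.dot(a_sum.I, z_sum)
        except:
            return -1
        y_mean = (ymax + ymin) / 2
        x_refine = int((-1 * c / a - y_mean * b / a) + 0.5)
        return x_refine

- Writing the last computation in the code as a formula gives:
  $x_{refine} = \mathrm{int}\left(\frac{1}{2} - \frac{b\bar{y} + c}{a}\right) \quad (1)$
  where $\bar{y} = \frac{y_{max} + y_{min}}{2}$. Adding $\frac{1}{2}$ before calling int() rounds a non-negative value to the nearest integer pixel.
- Expanding formula (7) in the paper: the sum runs over the $y_{2} - y_{1} + 1$ integer rows from $y_{1}$ to $y_{2}$, so $\sum_{y_{i} = y_{1}}^{y_{2}} y_{i} = (y_{2} - y_{1} + 1)\frac{y_{1} + y_{2}}{2}$, and therefore
  $x_{refine} = - \frac{1}{y_{2} - y_{1} + 1} \cdot \frac{b (y_{2} - y_{1} + 1)\frac{y_{1} + y_{2}}{2} + (y_{2} - y_{1} + 1) c}{a} = - \frac{b\bar{y} + c}{a} \quad (2)$
  with $\bar{y} = \frac{y_{1} + y_{2}}{2}$, which is the same mean as in (1) when $y_{1} = y_{min}$ and $y_{2} = y_{max}$.
- Formulas (1) and (2) differ only by the rounding term, so the code is consistent with formula (7).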
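The refine step can be tried on synthetic data with the sketch below. It assumes that `get_matrix` and `get_vector` accumulate the normal equations of a least-squares plane fit $z = ax + by + c$ over the soft mask, which is what their use in `refine_x` suggests but which I have not verified against the repository. The names `det3`, `fit_plane` and `refine_boundary` are hypothetical, and the 3 x 3 system is solved with Cramer's rule instead of a matrix inverse so that the example needs no third-party library.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c to (x, y, z) samples."""
    sxx = sxy = syy = sx = sy = sxz = syz = sz = 0.0
    n = 0
    for x, y, z in points:
        sxx += x * x
        sxy += x * y
        syy += y * y
        sx += x
        sy += y
        sxz += x * z
        syz += y * z
        sz += z
        n += 1
    # Normal equations: m @ (a, b, c) = v
    m = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(n)]]
    v = [sxz, syz, sz]
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate sample set, the plane is not unique")
    coeffs = []
    for col in range(3):
        # Cramer's rule: replace one column of m by v.
        mc = [[v[r] if k == col else m[r][k] for k in range(3)]
              for r in range(3)]
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)


def refine_boundary(points, ymin, ymax):
    """x position where the fitted plane crosses z = 0 at the mean row."""
    a, b, c = fit_plane(points)
    y_mean = (ymax + ymin) / 2
    return -(b * y_mean + c) / a


# Synthetic soft mask: a ramp whose zero line is x = 10 + 0.5 * y.
samples = [(x, y, (x - (10 + 0.5 * y)) / 20)
           for y in range(0, 5) for x in range(13, 31)]
x_left = refine_boundary(samples, ymin=0, ymax=4)
print(round(x_left, 6), int(x_left + 0.5))
```

On this noiseless input the fitted plane is the ramp itself, so the recovered boundary is the zero line evaluated at the mean row; with a real predicted mask the fit averages out pixel-level noise.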

Inference: Table Structure Recovery

This part has three steps: cell matching, empty cell searching, and empty cell merging.
Cell matching. The idea is simple: if two aligned cell boxes overlap substantially along the x / y axis, we have reason to believe they are in the same column / row. The corresponding code is at davarocr/davar_table/core/post_processing/post_lgpma.py#L355-L364:

    # Calculating cell adjacency matrix according to bboxes of non-empty aligned cells
    bboxes_np = np.array(bboxes)
    adjr, adjc = bbox2adj(bboxes_np)

    # Predicting start and end row / column of each cell according to the cell adjacency matrix
    colspan = adj_to_cell(adjc, bboxes_np, 'col')
    rowspan = adj_to_cell(adjr, bboxes_np, 'row')
    cells = [[row.min(), col.min(), row.max(), col.max()]
             for col, row in zip(colspan, rowspan)]
    cells = [list(map(int, cell)) for cell in cells]
    cells_np = np.array(cells)
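The overlap test behind cell matching can be sketched as follows. This is not the official `bbox2adj`: the function names, the 0.5 threshold, and the choice to normalise the overlap by the shorter of the two intervals are all assumptions made for illustration.

```python
def overlap_ratio(a1, a2, b1, b2):
    """Overlap of intervals [a1, a2] and [b1, b2], relative to the shorter one."""
    inter = min(a2, b2) - max(a1, b1)
    shorter = min(a2 - a1, b2 - b1)
    if shorter <= 0:
        return 0.0
    return max(inter, 0) / shorter


def boxes_to_adjacency(boxes, thresh=0.5):
    """Return (row adjacency, column adjacency) for boxes (x1, y1, x2, y2)."""
    n = len(boxes)
    adj_row = [[0] * n for _ in range(n)]
    adj_col = [[0] * n for _ in range(n)]
    for i in range(n):
        xi1, yi1, xi2, yi2 = boxes[i]
        for j in range(n):
            if i == j:
                continue
            xj1, yj1, xj2, yj2 = boxes[j]
            # Large overlap along the y axis: the two cells share a row.
            if overlap_ratio(yi1, yi2, yj1, yj2) > thresh:
                adj_row[i][j] = 1
            # Large overlap along the x axis: the two cells share a column.
            if overlap_ratio(xi1, xi2, xj1, xj2) > thresh:
                adj_col[i][j] = 1
    return adj_row, adj_col


# Four aligned cells forming a 2 x 2 grid.
grid_boxes = [(5, 10, 45, 22), (60, 10, 120, 22),
              (5, 30, 45, 40), (60, 30, 120, 40)]
adj_row, adj_col = boxes_to_adjacency(grid_boxes)
print(adj_row)
print(adj_col)
```

Normalising by the shorter interval means a cell that spans two rows still counts as adjacent to the cells in each of those rows.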
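Empty cell searching. The aligned bounding boxes are treated as nodes and the relationships between them as edges. All nodes in the same row / column form a subgraph, and these subgraphs are found with the Maximum Clique Search algorithm. The question to keep in mind is when empty cells appear in a table; the answer given here is: when cells have been merged.
- The row search is used as the example. For an illustration of column and row subgraphs, see the figure below, taken from the paper Rethinking table recognition using graph neural networks. The middle part of the figure shows the row subgraphs, and the TriStar node appears in two of them.
- When a merged cell spans several rows, its node necessarily appears in several subgraphs, just like the TriStar cell in the figure.
- Sorting all row subgraphs along the y axis makes it easy to assign a row index to each node. Nodes that appear in several subgraphs are labeled with several row indices.
- From this, a node labeled with several row indices is a cell that spans several rows, and any grid position not covered by a node is an empty cell, or one piece of an empty cell (an individual cell before merging).
- The corresponding source code is at davarocr/davar_table/core/post_processing/post_lgpma.py#L25

    import numpy as np
    from networkx import Graph, find_cliques

    def adj_to_cell(adj, bboxes, mod):
        """Calculating start and end row / column of each cell according to row / column adjacent relationships

        Args:
            adj(np.array): (n x n). row / column adjacent relationships of non-empty aligned cells
            bboxes(np.array): (n x 4). bboxes of non-empty aligned cells
            mod(str): 'row' or 'col'

        Returns:
            list(np.array): start and end row of each cell if mod is 'row' / start and end col of each cell if mod is 'col'
        """
        assert mod in ('row', 'col')

        # generate graph of each non-empty aligned cells
        nodenum = adj.shape[0]
        edge_temp = np.where(adj != 0)
        edge = list(zip(edge_temp[0], edge_temp[1]))
        # uses the Graph class and the find_cliques function from networkx
        table_graph = Graph()
        table_graph.add_nodes_from(list(range(nodenum)))
        table_graph.add_edges_from(edge)

        # Find maximal clique in the graph
        clique_list = list(find_cliques(table_graph))
        # (remaining code omitted)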
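The clique-based row indexing can be illustrated without networkx. The sketch below swaps in a basic Bron-Kerbosch search (no pivoting) in place of `find_cliques`, and orders the cliques by the mean vertical centre of their boxes. That ordering rule and the names `maximal_cliques` and `row_indices` are assumptions for illustration; the omitted part of the official function may differ.

```python
def maximal_cliques(adj):
    """All maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    n = len(adj)
    neighbors = [{j for j in range(n) if j != i and adj[i][j]}
                 for i in range(n)]
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & neighbors[v],
                   excluded & neighbors[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(range(n)), set())
    return cliques


def row_indices(adj_row, boxes):
    """Map each node to the list of row indices it belongs to."""
    cliques = maximal_cliques(adj_row)

    def mean_center_y(clique):
        return sum((boxes[i][1] + boxes[i][3]) / 2 for i in clique) / len(clique)

    # Assumption: order row cliques from top to bottom by mean centre.
    cliques.sort(key=mean_center_y)
    rows = {i: [] for i in range(len(boxes))}
    for row_idx, clique in enumerate(cliques):
        for i in clique:
            rows[i].append(row_idx)
    return rows


# Cells 0 and 1 are stacked on the left; cell 2 on the right spans both rows.
span_boxes = [(0, 0, 50, 10), (0, 12, 50, 22), (60, 0, 110, 22)]
span_adj = [[0, 0, 1],
            [0, 0, 1],
            [1, 1, 0]]
print(row_indices(span_adj, span_boxes))
```

The spanning cell lands in both row cliques and therefore receives two row indices, which is the situation the TriStar example describes.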
Empty cell merging. After the steps above we know where the empty cells are. Note that these are only the smallest cells, not the merged ones.
- To merge them more reliably, the size of each empty cell is first set to the largest width and height among the cells in the same row or column.
- Next, for each pair of adjacent cells, compute the proportion of pixels predicted as 1 in the region between them, as shown by the red box in the figure below. If the proportion exceeds a set threshold, the two cells are merged into one.
Source: davarocr/davar_table/core/post_processing/post_lgpma.py#L366-L377

    # Searching empty cells and recording them through arearec
    arearec = np.zeros([cells_np[:, 2].max() + 1, cells_np[:, 3].max() + 1])
    for cellid, rec in enumerate(cells_np):
        srow, scol, erow, ecol = rec[0], rec[1], rec[2], rec[3]
        arearec[srow:erow + 1, scol:ecol + 1] = cellid + 1
    empty_index = -1  # deal with empty cell
    for row in range(arearec.shape[0]):
        for col in range(arearec.shape[1]):
            if arearec[row, col] == 0:
                cells.append([row, col, row, col])
                arearec[row, col] = empty_index
                empty_index -= 1
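The merging rule for adjacent empty cells (the proportion of pixels predicted as 1 in the region between them) can be sketched as below. The function names, the inclusive box convention and the 0.5 threshold are assumptions for illustration and are not taken from the official post-processing code.

```python
def positive_ratio(mask, box):
    """Fraction of pixels equal to 1 inside box = (x1, y1, x2, y2), inclusive."""
    x1, y1, x2, y2 = box
    total = (x2 - x1 + 1) * (y2 - y1 + 1)
    ones = sum(mask[y][x]
               for y in range(y1, y2 + 1)
               for x in range(x1, x2 + 1))
    return ones / total


def should_merge(mask, gap_box, thresh=0.5):
    """Merge two neighbouring cells if the gap between them is mostly foreground."""
    return positive_ratio(mask, gap_box) > thresh


# Two cells occupy columns 0-1 and 4-5; the gap between them is columns 2-3.
gap = (2, 0, 3, 3)
connected = [[1, 1, 1, 1, 1, 1] for _ in range(4)]   # segmentation bridges the gap
separated = [[1, 1, 0, 0, 1, 1] for _ in range(4)]   # background between the cells
print(should_merge(connected, gap), should_merge(separated, gap))
```

When the global segmentation map is mostly foreground in the gap, the two minimal cells are treated as parts of one merged cell.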
The final inference result is a dictionary, for example:

    {
        'html': '<html><body><table><thead><tr><td colspan="3"></td><td></td><td></td><td></td><td rowspan="2"></td></tr><tr><td></td><td></td><td></td><td></td><td></td><td></td></tr></thead><tbody><tr><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td></td><td></td></tbody></table></body></html>',
        'content_ann': {
            'bboxes': [[10, 9, 216, 29], [], [], [], [642, 6, 817, 80], [8, 40, 120, 80], ],
            'labels': [[0], [0], [1], [0], [1], [0]],
            'texts': ['', '', '', '', '', '']
        }
    }
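To connect the cell spans from the recovery step with the 'html' field, here is a minimal sketch that turns a list of [start_row, start_col, end_row, end_col] cells into table markup. The function name `cells_to_html` is hypothetical, and the sketch leaves out the thead / tbody split seen in the sample output; the official code may build the string differently.

```python
def cells_to_html(cells):
    """Render cells given as [start_row, start_col, end_row, end_col] (inclusive)."""
    n_rows = max(cell[2] for cell in cells) + 1
    rows = [[] for _ in range(n_rows)]
    # Emit each cell in the row where it starts, left to right.
    for srow, scol, erow, ecol in sorted(cells, key=lambda c: (c[0], c[1])):
        attrs = ""
        if erow > srow:
            attrs += ' rowspan="%d"' % (erow - srow + 1)
        if ecol > scol:
            attrs += ' colspan="%d"' % (ecol - scol + 1)
        rows[srow].append("<td%s></td>" % attrs)
    body = "".join("<tr>%s</tr>" % "".join(row) for row in rows)
    return "<html><body><table>%s</table></body></html>" % body


# A 2 x 3 grid: one cell spans two columns, another spans two rows.
demo_cells = [[0, 0, 0, 1], [0, 2, 1, 2], [1, 0, 1, 0], [1, 1, 1, 1]]
print(cells_to_html(demo_cells))
```

Cells that span a single row and column are emitted as plain `<td></td>`, matching the mostly attribute-free cells in the sample output.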

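Summary

The overall idea of LGPMA is clear and the paper is logically written, so it is well worth studying.
Because the source code is built on a modified mmdetection, setting up an environment to reproduce it is somewhat tedious, but the repository's authors maintain it promptly.
This post covers a lot of ground and inevitably misses things. If anything is wrong, please point it out.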
