Training Language Data for Tesseract 3 [Repost] http://blog.csdn.net/dragoo1/article/details/8439373

Published: 2023/12/19

Category: Open source. Posted: 2012-12-26 15:42

Note: I downloaded the source from Google Code and built LIB_Debug first, then DLL_Debug. I therefore copied tesseract-dlld.exe, unicharset_extractord.exe, mftrainingd.exe, cntrainingd.exe and combine_tessdatad.exe directly from E:\BuildFolder\tesseract-ocr\vs2008\LIB_Debug into E:\BuildFolder\tesseract-ocr\testing.

The steps are:

1.1. Make Box Files

E:\BuildFolder\tesseract-ocr\testing>tesseract-dlld ABC.Roman.exp0.tif ABC.Roman.exp0 -l eng batch.nochop makebox
Tesseract Open Source OCR Engine v3.02 with Leptonica

1.2. Fix Box

Edit the box file with CowBoxer. Read its Help first.

1.3. Run Tesseract for Training

E:\BuildFolder\tesseract-ocr\testing>tesseract-dlld ABC.Roman.exp0.tif ABC.Roman.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
   Boxes read from boxfile:      14
   Found 14 good blobs.
TRAINING ... Font name = Roman
Generated training data for 2 words

1.4. Compute the Character Set

E:\BuildFolder\tesseract-ocr\testing>unicharset_extractord ABC.Roman.exp0.box
Extracting unicharset from ABC.Roman.exp0.box
Wrote unicharset file ./unicharset.

1.5. Clustering

Before running this step, create a file named font_properties.txt. Each line describes one font in the following format:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

My file contains:

Roman 0 0 0 0 0
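The font_properties.txt line format above is easy to get wrong by hand when there are several fonts. Below is a minimal Python sketch that builds and validates such lines. The class and function names (FontProperties, parse_line) are hypothetical helpers, not part of Tesseract, and the sketch assumes each flag is written as 0 or 1 as in the example above.

```python
from dataclasses import dataclass


@dataclass
class FontProperties:
    """One line of font_properties.txt: a font name plus five 0/1 flags."""
    name: str
    italic: bool = False
    bold: bool = False
    fixed: bool = False
    serif: bool = False
    fraktur: bool = False

    def to_line(self) -> str:
        # Field order follows <fontname> <italic> <bold> <fixed> <serif> <fraktur>.
        flags = (self.italic, self.bold, self.fixed, self.serif, self.fraktur)
        return " ".join([self.name] + [str(int(flag)) for flag in flags])


def parse_line(line: str) -> FontProperties:
    """Parse one line; raise ValueError if it does not have exactly six valid fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    name, *flags = fields
    if any(flag not in ("0", "1") for flag in flags):
        raise ValueError(f"flags must be 0 or 1: {line!r}")
    return FontProperties(name, *(flag == "1" for flag in flags))


if __name__ == "__main__":
    print(FontProperties("Roman").to_line())
```

The font name must match the middle part of the training file name (Roman in ABC.Roman.exp0.tif), as the "Font name = Roman" line in the training log above indicates.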

E:\BuildFolder\tesseract-ocr\testing>mftrainingd -F font_properties.txt -U unicharset ABC.Roman.exp0.tr
Warning: No shape table file present: shapetable
Reading ABC.Roman.exp0.tr ...
Flat shape table summary: Number of shapes = 12 max unichars = 1 number with multiple unichars = 0
Done!

E:\BuildFolder\tesseract-ocr\testing>cntrainingd ABC.Roman.exp0.tr
Reading ABC.Roman.exp0.tr ...
Clustering ...

Writing normproto ...

1.6. Combine

At this point several files should have been generated in the directory. Add the prefix "Roman." to these four files: unicharset, inttemp, normproto and pffmtable. Then run:

E:\BuildFolder\tesseract-ocr\testing>combine_tessdatad Roman.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 939
Offset for type 4 is 140232
Offset for type 5 is 140335
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 141961
Offset for type 14 is -1
Offset for type 15 is -1
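The renaming done before combine_tessdatad can also be scripted. The following is a minimal Python sketch, assuming the four files named in this step sit in one directory. The function name add_prefix and its parameters are hypothetical, chosen only for illustration.

```python
from pathlib import Path

# The four files this step says to rename.
COMPONENTS = ("unicharset", "inttemp", "normproto", "pffmtable")


def add_prefix(directory, lang, names=COMPONENTS):
    """Rename each file `name` in `directory` to `<lang>.<name>`.

    Raises FileNotFoundError before renaming anything if a file is missing,
    so a failed training step is noticed early.
    """
    base = Path(directory)
    missing = [name for name in names if not (base / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing training outputs: {', '.join(missing)}")
    renamed = []
    for name in names:
        target = base / f"{lang}.{name}"
        (base / name).replace(target)  # overwrites an older <lang>.<name> if present
        renamed.append(target)
    return renamed


if __name__ == "__main__":
    for path in add_prefix(".", "Roman"):
        print(path.name)
```

Checking for missing files first avoids a half-renamed directory, which would otherwise be confusing to recover from.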

1.7. Test

Copy the generated Roman.traineddata into E:\BuildFolder\tesseract-ocr\testing\tessdata, then run:

tesseract ABC.Roman.exp0.tif result -l Roman -psm 7 nobatch

That's it.

References: http://blog.wudilabs.org/entry/f25efc5f/?lang=zh-CN

http://www.lixin.me/blog/2012/05/26/29536

http://wenku.baidu.com/view/5eafc201e87101f69e3195f4.html

http://www.84kf.com/html/22453.html

http://blog.csdn.net/fengbingchun/article/details/7022421

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The following is reposted from: http://blog.wudilabs.org/entry/f25efc5f/?lang=zh-CN

Required programs

(1) Tesseract 3.00
(2) Tesseract 3.00 Bugfix
(3) CowBoxer 1.01
(4) Universal Extractor 1.61 (optional)

Use Universal Extractor to unpack the Tesseract installer, then overwrite the original main program with the tesseract.exe from the Bugfix package. Tesseract is then ready to use. CowBoxer is a program for editing box files.

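Generating the first box file

In this walkthrough Tesseract is unpacked to E:\tesseract-ocr. A build directory is then created inside it to hold the source images and the files generated during training. There are 3 source images (test.001.tif - test.003.tif).

First generate the box file for the first image, test.001.tif, using the official eng language data for recognition: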

E:\tesseract-ocr\build>..\tesseract test.001.tif test.001 -l eng batch.nochop makebox
Tesseract Open Source OCR Engine with Leptonica
Number of found pages: 1.

After this command runs, test.001.box appears in the build directory. Open this box file in CowBoxer, which automatically finds and displays the tif file with the same name.

For how to use CowBoxer, see the notes under Help -> About. When the corrections are done, save with File -> Save box file.
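After editing, it can help to sanity-check a box file before training. The Python sketch below assumes the Tesseract 3 box line layout of a symbol followed by left, bottom, right and top pixel coordinates and a page number, with the origin at the bottom-left of the image. The names Box, parse_box_line and find_suspect_boxes are hypothetical helpers, not part of Tesseract or CowBoxer.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box line of the assumed form: <symbol> <left> <bottom> <right> <top> <page>."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    symbol, *numbers = parts
    return Box(symbol, *(int(n) for n in numbers))


def find_suspect_boxes(lines):
    """Return boxes whose width or height is zero or negative."""
    boxes = (parse_box_line(line) for line in lines if line.strip())
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]


if __name__ == "__main__":
    # Box files are UTF-8 text; test.001.box is the file produced above.
    with open("test.001.box", encoding="utf-8") as handle:
        for box in find_suspect_boxes(handle):
            print("suspect box:", box)
```

This only catches degenerate rectangles. Whether each symbol is the right character still has to be checked by eye in CowBoxer.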

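Generating an initial traineddata

Next, build a traineddata from this single box file. Using it when generating the box files for the remaining images improves recognition accuracy and reduces the number of corrections needed.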

..\tesseract test.001.tif test.001 nobatch box.train
..\training\unicharset_extractor test.001.box
..\training\mftraining -U unicharset -O test.unicharset test.001.tr
..\training\cntraining test.001.tr
rename normproto test.normproto
rename Microfeat test.Microfeat
rename inttemp test.inttemp
rename pffmtable test.pffmtable
..\training\combine_tessdata test.

Running this series of commands in the build directory produces a usable test.traineddata.

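Generating the remaining box files

Move the test.traineddata generated in the previous step into the tesseract-ocr\tessdata directory. It can then be selected with the -l test argument when generating the other box files.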

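..\tesseract test.002.tif test.002 -l test batch.nochop makebox
..\tesseract test.003.tif test.003 -l test batch.nochop makebox

Three source files are used here only as an example. When building real training data, when to generate an intermediate traineddata depends on the situation. Its only purpose is to improve recognition accuracy so that the box files generated afterwards need fewer corrections.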

Generating the final traineddata

Once all the box files have been corrected, the final traineddata can be generated.

..\tesseract test.001.tif test.001 nobatch box.train
..\tesseract test.002.tif test.002 nobatch box.train
..\tesseract test.003.tif test.003 nobatch box.train
..\training\unicharset_extractor test.001.box test.002.box test.003.box
..\training\mftraining -U unicharset -O test.unicharset test.001.tr test.002.tr test.003.tr
..\training\cntraining test.001.tr test.002.tr test.003.tr
rename normproto test.normproto
rename Microfeat test.Microfeat
rename inttemp test.inttemp
rename pffmtable test.pffmtable
..\training\combine_tessdata test.

When there are many files, a program can generate this kind of script for you.
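As one way to generate such a script, here is a minimal Python sketch that emits the same command sequence as above for any list of image stems. The function name build_training_script, its parameters and the output file name train.bat are hypothetical. The relative paths ..\tesseract and ..\training are taken from the commands above and assume the script is run from the build directory.

```python
def build_training_script(lang, stems, tesseract=r"..\tesseract", training_dir=r"..\training"):
    """Return the text of a Windows batch script that trains `lang` from the given stems.

    Each stem (for example "test.001") must have a matching .tif and corrected .box file.
    """
    boxes = " ".join(f"{stem}.box" for stem in stems)
    trs = " ".join(f"{stem}.tr" for stem in stems)

    # One box.train run per image produces the .tr feature files.
    lines = [f"{tesseract} {stem}.tif {stem} nobatch box.train" for stem in stems]
    lines += [
        rf"{training_dir}\unicharset_extractor {boxes}",
        rf"{training_dir}\mftraining -U unicharset -O {lang}.unicharset {trs}",
        rf"{training_dir}\cntraining {trs}",
    ]
    # Prefix the outputs so combine_tessdata can find them.
    lines += [
        f"rename {name} {lang}.{name}"
        for name in ("normproto", "Microfeat", "inttemp", "pffmtable")
    ]
    lines.append(rf"{training_dir}\combine_tessdata {lang}.")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    stems = [f"test.{index:03d}" for index in range(1, 4)]
    with open("train.bat", "w", newline="\r\n") as handle:
        handle.write(build_training_script("test", stems))
```

With the three example stems this reproduces the eleven commands listed above. Extending the range covers more images without retyping the file lists.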

Reposted from: https://www.cnblogs.com/songtzu/archive/2013/01/28/2880497.html
