Installing and Using Tesseract 4 on Ubuntu 18.04
QUANTRIUM GUIDES
Today, extracting information from scanned documents such as letters, write-ups, and invoices has become an integral part of many business processes. To accomplish this task, you need to set up OCR software that extracts the information from these scanned documents or PDFs.
Here we take you through the process of building and installing Tesseract 4.x on an Ubuntu 18.04 machine. There are two ways to install Tesseract 4.x:
One is to install the Tesseract 4.0.0 beta version, which is easy and takes only a couple of commands.
Alternatively, you can install Tesseract 4.1.1, the latest stable release of Tesseract at the time of writing. In this post, we show you how to install each of them on your Ubuntu 18.04 machine.
If you are not familiar with build tools and with building from GitHub repositories, installing Tesseract 4.0.0 beta is the better option for you. However, if you are experienced in building and installing applications from GitHub repositories, you can skip the next section and jump directly to the section Installing Tesseract 4.1.1 on Ubuntu 18.04.
Installing Tesseract 4.0.0 beta
Installing the Tesseract 4.0.0 beta version is quite simple and can be done with the following apt commands:
$ sudo apt install tesseract-ocr
$ sudo apt install libtesseract-dev
Once you have run these two commands, check whether tesseract was installed successfully by running the following command:
$ tesseract --version
After running this command, you should see something like this:
tesseract 4.0.0-beta.1
leptonica-1.75.3
You should see this, or something close to it, if your installation was successful. If tesseract is not installed properly, you will get errors instead. In that case, check your operating system and its version: these commands work only on Ubuntu 18.04 or higher.
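The check described above can be wrapped in a few lines of shell. This is only a sketch: have_cmd is a hypothetical helper name introduced here for illustration, and the script assumes that /etc/os-release is present, as it is on Ubuntu.

```shell
# have_cmd: hypothetical helper, succeeds if the given command is on PATH
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Show the first line of the version banner, or say that tesseract is missing
if have_cmd tesseract; then
    tesseract --version 2>&1 | head -n 1
else
    echo "tesseract not found on PATH"
fi

# Print the distribution name and release so you can confirm 18.04 or newer.
# The subshell keeps the variables from os-release out of your session.
if [ -r /etc/os-release ]; then
    ( . /etc/os-release; echo "OS: ${NAME:-unknown} ${VERSION_ID:-unknown}" )
fi
```

If the first line printed is "tesseract not found on PATH", the apt commands did not complete, and the release line tells you whether the Ubuntu version is the likely cause.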
Once your tesseract installation is successful, you can run the following command to check which languages are supported by your installed version of tesseract:
$ tesseract --list-langs
You can expect the following output:
List of available languages (2):
eng
osd
Here eng means that tesseract can recognize English, and osd means that it can detect orientation and script.
Congratulations! You have successfully installed Tesseract 4.0.0 beta on your system, and it is ready to use.
Installing Tesseract 4.1.1 on Ubuntu 18.04
In this section, we take you through the steps to build and install tesseract 4.1.1 from tesseract's GitHub repository: https://github.com/tesseract-ocr/tesseract
Before you start building tesseract 4.1.1 from source, you need to install a few dependencies. First, you have to install the leptonica library. It is a pedagogically-oriented open source library containing software that is broadly useful for image processing and image analysis applications. To learn more about leptonica, refer to Leptonica's website:
http://www.leptonica.org/
To install leptonica, use the following command:
$ sudo apt-get install -y libleptonica-dev
A further list of all the dependencies required by tesseract can be found here:
From this list, you will most likely be missing the following dependencies:
automake
pkg-config
pango-devel
cairo-devel
icu-devel
A compiler with C++11 support is also required. The gcc that comes with your Ubuntu system offers C++11 support, so this requirement is already met. You can use the following commands to install the dependencies above:
$ sudo apt-get update -y
$ sudo apt-get install automake
$ sudo apt-get install -y pkg-config
$ sudo apt-get install -y libsdl-pango-dev
$ sudo apt-get install -y libicu-dev
$ sudo apt-get install -y libcairo2-dev
$ sudo apt-get install bc
The last package, bc, is an extra dependency that is required to get tesseract 4 running on your machine.
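Before moving on, it can help to confirm that the command-line tools used in this guide are actually available. The following is a minimal sketch: missing_cmds is a hypothetical helper name, not part of tesseract or apt. It only looks for commands on PATH, so it does not cover the development libraries (libicu-dev, libcairo2-dev and so on).

```shell
# missing_cmds: hypothetical helper that prints every argument
# that cannot be found as a command on PATH, one per line
missing_cmds() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}

# Tools installed above, plus wget and unzip, which are used further down
missing_cmds automake pkg-config bc wget unzip
```

An empty result means every listed tool was found. Any name that is printed still needs to be installed with apt-get.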
Now you have to get the tesseract source code. But wait! Before cloning anything, first go to the tesseract repository at https://github.com/tesseract-ocr/tesseract
Open the file named VERSION there and you will see 5.0.0-alpha. This means that building from the current state of this repository installs tesseract 5.0.0-alpha. That is not a stable release of tesseract; the stable release is 4.1.1 at the time this post was written.
To find the download link for the latest stable release of tesseract, look in the right sidebar for a section titled "Releases". Within it you will see 4.1.1 Release.
[Figure: Tesseract GitHub Repository]
Click on the link 4.1.1 Release. There you will find an Assets section with Source code (zip) and Source code (tar.gz). Copy the link and then download the archive using the following command:
$ wget https://github.com/tesseract-ocr/tesseract/archive/4.1.1.zip
You can download either the zip or the tar.gz file. Here I have downloaded the zip file. You can unzip the file into your current directory using the unzip command:
$ unzip 4.1.1.zip
When the unzip operation completes, a folder named tesseract-4.1.1 will have been created. Change into this directory using the cd command.
$ cd tesseract-4.1.1
If you list the files in this folder, you should see something like this:
abseil          CONTRIBUTING.md     java         tessdata
appveyor.yml    cppan.yml           LICENSE      tesseract.pc.cmake
AUTHORS         doc                 m4           tesseract.pc.in
autogen.sh      docker-compose.yml  Makefile.am  test
ChangeLog       Dockerfile          README.md    unittest
cmake           googletest          snap         VERSION
CMakeLists.txt  INSTALL             src
configure.ac    INSTALL.GIT.md      sw.cpp
Now you are ready to install tesseract. The different ways to do so on various operating systems are described at this link: https://github.com/tesseract-ocr/tesseract/blob/master/INSTALL.GIT.md We are going to use the autotools method (LINUX/UNIX, msys…).
You need to run the following commands from the tesseract-4.1.1 directory to install tesseract:
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
$ make training
$ sudo make training-install
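Since all of these commands must be run from inside the unpacked source directory, a quick guard can save a confusing error. The sketch below relies only on autogen.sh and configure.ac, both of which appear in the file listing above. The function name is_tesseract_tree is made up for this example.

```shell
# is_tesseract_tree: hypothetical check that a directory contains the
# files the autotools build starts from (autogen.sh and configure.ac)
is_tesseract_tree() {
    [ -f "$1/autogen.sh" ] && [ -f "$1/configure.ac" ]
}

if is_tesseract_tree .; then
    echo "Source tree found, you can run ./autogen.sh here"
else
    echo "Not in the source tree, cd into tesseract-4.1.1 first"
fi
```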
To check that tesseract has been installed successfully, run the following command:
$ tesseract --version
You should see output something like this:
tesseract 4.1.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
If the output differs substantially from the above or you get an error, go back and check where things went wrong, or follow the steps again one by one.
The Folder tessdata
The tessdata folder in the tesseract directory is where tesseract looks for the language data it needs to perform OCR on the input document.
For tesseract to work, you need at least one language. For English, you need a data file named 'eng.traineddata'. You will also need another file named 'osd.traineddata', which is used for orientation detection and must also be in the tessdata folder.
Unfortunately, these files are not placed in this folder by default when we run the make command. You need to download them into this folder separately. You can check the contents of the tessdata folder using the ls command:
$ cd tessdata
$ ls
You will see output similar to the following:
configs            eng.user-words  Makefile.am  pdf.ttf
eng.user-patterns  Makefile        Makefile.in  tessconfigs
As you can see, both eng.traineddata and osd.traineddata are missing. Now download eng.traineddata and osd.traineddata from the tessdata repository: https://github.com/tesseract-ocr/tessdata
You can download them to your local system and then upload them to the tessdata folder or you could download them directly using the wget command:
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
$ wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata
Use the raw URLs as shown here. A blob URL would download the GitHub web page for the file rather than the data file itself. Once you have successfully downloaded these files, you need to set your TESSDATA_PREFIX environment variable to the location of your tessdata directory. Use the export command to set the variable:
$ export TESSDATA_PREFIX=/content/tesseract-4.1.1/tessdata
The path above is where the source was unpacked on my machine; adjust it to the location of tessdata on yours. Now you can list the languages available to tesseract using the following command:
$ tesseract --list-langs
You should see the following output:
List of available languages (2):
eng
osd
If you want to use other languages, you can download them to the tessdata folder and start using them.
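If the languages do not show up, a common cause is that a download saved a web page instead of the model file, or that TESSDATA_PREFIX points to the wrong folder. The following sketch checks for both. check_tessdata is a hypothetical helper written for this guide, and looking for an HTML tag near the start of the file is a heuristic, not an official validation.

```shell
# check_tessdata: hypothetical helper that verifies eng.traineddata and
# osd.traineddata exist in the given directory, are not empty, and do
# not look like an HTML page that was saved by mistake
check_tessdata() {
    dir=$1
    rc=0
    for lang in eng osd; do
        f="$dir/$lang.traineddata"
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            rc=1
        elif head -c 512 "$f" | grep -qi '<html'; then
            echo "looks like an HTML page, not model data: $f"
            rc=1
        fi
    done
    return $rc
}

# Check the folder TESSDATA_PREFIX points to, or the current one if unset
check_tessdata "${TESSDATA_PREFIX:-.}" || echo "tessdata folder needs attention"
```

If nothing is printed, both files are in place and tesseract --list-langs should list eng and osd.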
Using Tesseract from Terminal
Tesseract has various wrappers, for example the Python wrapper named pytesseract. These wrappers help you access tesseract from various programming languages. Here, we will use tesseract through the command line.
To perform OCR on an image, run the following command in the terminal with the path of the image file you want to process:
$ tesseract <path_of_image> stdout
In the above command, path_of_image is the location of the image that you want to test tesseract with. Once you do so, you should get output right in the command line that looks something like this:
Here, pardit was the text present in my image, so I was able to successfully use tesseract to extract text from my image file.
Saving Tesseract Output to a File
If you want to save the output of tesseract to a text file, you can use the following command:
$ tesseract <path_of_image> output
Here, the output will be stored in the file output.txt in your present working directory. Tesseract appends the .txt extension to the output name itself, so passing output.txt as the name would produce output.txt.txt.
Running Tesseract on Multiple Files
Sometimes we want to extract text from multiple images or documents. To accomplish this, you can give Tesseract a text file as input that contains the absolute paths of all the images you want to perform OCR on, one file per line.
For example, let's say you have two photos called handwritten_photo_1.png and handwritten_photo_2.png, with some text in them, in the /usr/share/ directory. Let's create a file named input.txt with the following content:
/usr/share/handwritten_photo_1.png
/usr/share/handwritten_photo_2.png
Suppose you want to store the contents of these two handwritten photos in a text file, say output.txt. You have to run the following command:
$ tesseract input.txt output
Tesseract adds the .txt extension itself, so the result is written to output.txt, which will have the OCR contents of both handwritten_photo_1.png and handwritten_photo_2.png, in that order. When you open and view the content of output.txt, you will see that the extracted lines are preceded by a symbol like this:
[Figure: Tesseract output of an input text file with 5 lines of image locations]
So in this case, Viral Calic is the prediction for the first image, CY am the king of the world is the prediction for the second image, Com and Serr is the prediction for the third image, and so on.
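Writing the list file by hand becomes tedious with many images. Here is a minimal sketch of generating it instead. build_image_list is a hypothetical helper name, and the sketch assumes that all images sit in a single directory and end in .png.

```shell
# build_image_list: hypothetical helper that prints the absolute path of
# every .png file in the given directory, one per line, in glob order
build_image_list() {
    dir=$(cd "$1" && pwd) || return 1
    for img in "$dir"/*.png; do
        # Skip the unexpanded pattern when the directory has no .png files
        if [ -e "$img" ]; then
            printf '%s\n' "$img"
        fi
    done
}
```

With the function defined, build_image_list /usr/share > input.txt would produce the list file, which can then be passed to tesseract in the same way as the handwritten one.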
You can explore the usage of tesseract further at the following two links:
I hope you were able to follow the guide and to install and use Tesseract on your Ubuntu 18.04 machine.
Translated from: https://medium.com/quantrium-tech/installing-tesseract-4-on-ubuntu-18-04-b6fcd0cbd78f