Extracting Text from Images or PDFs with Python (Chinese and English)

Published: 2023/12/31


Extracting text

1. Software installation
- 1.1 Installing Tesseract
- 1.2 Python module
2. Testing
- 2.1 English test image
- 2.2 Chinese test image
- 2.3 Batch recognition

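1. Software installation

Text recognition is part of OCR. OCR stands for optical character recognition, which in plain terms means reading text out of images. Tesseract is a tool for text recognition, and with a Python module on top of it this fairly complex task becomes straightforward.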

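1.1 Installing Tesseract

Download: https://digi.bib.uni-mannheim.de/tesseract/
Note: w32 in the file name means the 32-bit build; w64 means the 64-bit build.
Environment configuration
Be sure to remember the install location, because it is needed for the environment variable.
Configure the environment variable (Windows 10 as the example)
Steps: right-click My Computer / This PC -> Properties -> Advanced system settings -> Environment Variables -> Path -> Edit -> New

Paste the install path -> confirm each dialog with OK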
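To confirm that the Path entry took effect, a small check like the one below can help. It is only a sketch: it assumes the executable is named tesseract, and the helper names find_tesseract and tesseract_version are made up for this example. Open a new terminal after changing Path so the new value is picked up.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable found on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Run `<executable> --version` and return the first line it prints."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some builds print the version banner on stderr rather than stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("tesseract was not found on PATH; re-check the environment variable")
    else:
        print(exe)
        print(tesseract_version(exe))
```

If the script reports that tesseract was not found, the Path entry most likely points at the wrong folder or the terminal was opened before the change.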

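1.2 Python module

The tests were run in PyCharm. The module can be installed as follows:

pip install pytesseract

Note: the Tesseract install path also has to be written into the pytesseract.py module. This should only be necessary when the tesseract command cannot be found through Path.
I am using the Anaconda distribution, so the file to edit is D:\software\Anaconda\install\Lib\site-packages\pytesseract\pytesseract.py

Change tesseract_cmd = 'tesseract' to tesseract_cmd = r'$PATH\tesseract.exe' # where $PATH stands for the Tesseract install path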
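Editing pytesseract.py inside site-packages works, but the change is lost whenever the package is reinstalled or upgraded. pytesseract also lets the same value be set from your own script through pytesseract.pytesseract.tesseract_cmd. The sketch below shows that approach; the helper names tesseract_exe_path and configure_pytesseract are made up for this example, and the install directory you pass in is whatever you chose during installation.

```python
import os


def tesseract_exe_path(install_dir):
    """Build the path to tesseract.exe inside the given install directory."""
    return os.path.join(install_dir, "tesseract.exe")


def configure_pytesseract(install_dir):
    """Point pytesseract at the executable without editing the library file."""
    # Imported here so the path helper above works even without pytesseract.
    import pytesseract  # third-party: pip install pytesseract

    exe = tesseract_exe_path(install_dir)
    if not os.path.isfile(exe):
        raise FileNotFoundError(f"tesseract.exe not found at {exe}")
    pytesseract.pytesseract.tesseract_cmd = exe
    return exe


# Example call (the directory is a placeholder, use your own install path):
# configure_pytesseract(r"C:\path\to\Tesseract-OCR")
```

Call configure_pytesseract once at the top of a script, before the first image_to_string call.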

2. Testing

2.1 English test image

import pytesseract
from PIL import Image

# Open the test image and run OCR with the default language (English)
im = Image.open("20200807105704.png")
string = pytesseract.image_to_string(im)
print(string)

# Output:
# Do not go gentle into that good night! ?, ZackSock

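2.2 Chinese test image

import pytesseract
from PIL import Image

# lang='chi_sim' selects Simplified Chinese; the chi_sim language data
# must be installed alongside Tesseract for this to work
im = Image.open("1596771968.jpg")
string = pytesseract.image_to_string(im, lang='chi_sim')
print(string)

# Output:
# 1. 軟件安裝文學 8 是 ORC 的一部分內容 ,ORC 的意悅是光學字等識別 , 通俗誠就星文字識別、fesseract 一個用于文字河的工具 , 基于 python 橫武可以完我頁復札的代務

The result is not very good: many characters are misrecognized. This is probably related to the quality of the text image, its resolution in particular.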
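If resolution is indeed the cause, enlarging the image and converting it to grayscale before recognition is a common thing to try. The sketch below is an assumption on my part, not something tested against the image above: the helper names target_size and preprocess and the scale factor of 2 are made up for illustration, and there is no guarantee that it improves the output.

```python
def target_size(width, height, scale=2):
    """Return the (width, height) of an image after upscaling by `scale`."""
    return (int(width * scale), int(height * scale))


def preprocess(image, scale=2):
    """Convert a PIL image to grayscale and enlarge it before OCR."""
    # Imported here so target_size can be used without Pillow installed.
    from PIL import Image  # third-party: Pillow

    gray = image.convert("L")
    return gray.resize(target_size(*gray.size, scale), Image.LANCZOS)


# Example usage with the Chinese test image from this section:
# import pytesseract
# from PIL import Image
# im = preprocess(Image.open("1596771968.jpg"))
# print(pytesseract.image_to_string(im, lang='chi_sim'))
```

Try a few scale values and compare the recognized text; very large factors mostly make recognition slower.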

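2.3 Batch recognition

pytesseract can also recognize images in batches: list the image paths in a text file and pass that file to it.

Prepare the file. text.txt contains one image path per line:

sentence1.jpg
sentence2.jpg

Batch recognition

import pytesseract

# Recognize the text of every image listed in text.txt
string = pytesseract.image_to_string('text.txt', lang='chi_sim')
print(string)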

Find the images, write their paths to the file, then convert them to text

import os
import pytesseract

# Folder that holds the text images
path = 'text_img/'

# Build the list of image paths
imgs = [os.path.join(path, i) for i in os.listdir(path)]

# Write each image path to text.txt, one per line
with open('text.txt', 'w', encoding='utf-8') as f:
    for img in imgs:
        f.write(img + '\n')

# Run OCR on every image listed in text.txt
string = pytesseract.image_to_string('text.txt', lang='chi_sim')
print(string)
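The list-file approach returns all recognized text as one string, so it is hard to tell which text came from which image. When that matters, a per-image loop is an alternative. The sketch below is one way to write it; the names IMAGE_EXTENSIONS, list_images, recognize_with_tesseract and ocr_folder are made up for this example, and the extension list is an assumption about which files in the folder count as images.

```python
import os

# Assumed set of image extensions; adjust to the files you actually have.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff")


def list_images(folder):
    """Return the sorted paths of the image files directly inside `folder`."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


def recognize_with_tesseract(path, lang="chi_sim"):
    """Recognize a single image file with pytesseract."""
    # Imported here so the folder helpers work even without pytesseract.
    import pytesseract  # third-party: pip install pytesseract

    return pytesseract.image_to_string(path, lang=lang)


def ocr_folder(folder, recognize=recognize_with_tesseract):
    """Map each image path in `folder` to its recognized text."""
    return {path: recognize(path) for path in list_images(folder)}


# Example usage with the folder from this section:
# for path, text in ocr_folder("text_img/").items():
#     print(path)
#     print(text)
```

Filtering by extension also keeps stray non-image files in the folder from being handed to Tesseract.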
