
Using pytesseract in Python for CAPTCHA Image Recognition

Published: 2023/12/2 python 23 豆豆


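Image reading mainly relies on two libraries, and each returns a different kind of object:

This image recognition test has the following two prerequisites:

1. Obtaining the CAPTCHA

2. Logging in to the website

3. Image processing

4. CAPTCHA recognition test

Test description

Test code

Test results

5. Re-recognition test on successful samples

Test description

Test code

Test results

Test caveats

6. Ensemble voting model, run with multiple processes

Test description

Test code

Test results

Result of running in a single process

Behaviour and result when running in parallel

7. Re-recognition of failed samples

8. Other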

Image reading mainly relies on two libraries, and each returns a different kind of object:

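import cv2
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

# Both plt.imread and PIL.Image.open return channels in RGB order.
img = Image.open('xxxx.png')  # returns a PIL Image object
img.save('xxx.png')
# print(img.mode)
# Common modes include '1', 'L', 'P', 'RGB' and 'RGBA':
#   '1':    black-and-white (1-bit) image
#   'L':    grayscale image
#   'RGB':  colour image with R, G and B channels
#   'RGBA': colour image with R, G, B and alpha channels
img.show()                            # display the image
img_l = img.convert('L')              # convert to mode 'L' (returns a new image)
img_c = img.crop((20, 30, 300, 200))  # crop to (left, upper, right, lower)
# Image.eval(img, function)           # apply a function to every pixel/channel

# cv2.imread in OpenCV returns channels in BGR order.
# flags=0 reads in grayscale; flags=1 is the default colour mode.
# im = cv2.imread('xxxx.png', flags=0)  # returns a NumPy array
im = cv2.imread("imgCode_grey200.jpg", flags=cv2.IMREAD_GRAYSCALE)
cv2.imwrite('imgCode_grey200.jpg', im)
plt.imshow(im, cmap='gray')  # display the image
# plt.show()
# plt.close()
# cv2.imshow('im', im)       # display the image

## Comparing PIL.Image.open with cv2.imread, and converting between them
# For PNG files the two read the same pixel values;
# for JPG files the values differ,
# possibly because Image.open and cv2.imread decode JPEG slightly differently.

# Simple conversion
# im = np.array(img, np.uint8)    # np.array always copies
# im = np.asarray(img, np.uint8)  # np.asarray avoids a copy for ndarray input
# Without a numeric dtype the array may be boolean, e.g. for a binarized image
im = np.asarray(img)
# img = Image.fromarray(np.uint8(im))
img = Image.fromarray(im)

# Standard conversion
def PILImageToCV(imagePath):
    # Convert a PIL Image to OpenCV (BGR) format
    img = Image.open(imagePath).convert('RGB')  # cvtColor below needs 3 channels
    plt.imshow(img)
    img = cv2.cvtColor(np.asarray(img), cv2.COLOR_RGB2BGR)
    plt.imshow(img)
    plt.show()
    return img

def CVImageToPIL(imagePath):
    # Convert an OpenCV image to a PIL Image
    img = cv2.imread(imagePath)
    plt.imshow(img)
    img2 = Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    plt.imshow(img2)
    plt.show()
    return img2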
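The RGB/BGR difference described above comes down to reversing the channel order of every pixel. The following is a minimal sketch in plain Python, with no Pillow or OpenCV involved; `swap_rgb_bgr` is a hypothetical helper introduced only for illustration, and it assumes every pixel is a 3-tuple.

```python
def swap_rgb_bgr(pixels):
    """Reverse the channel order of every pixel in a 2-D grid of 3-tuples."""
    return [[(c, b, a) for (a, b, c) in row] for row in pixels]


# A tiny 2x2 "image" in RGB order (made-up values).
rgb = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (10, 20, 30)]]

bgr = swap_rgb_bgr(rgb)
print(bgr[0][0])                 # pure red, now written in BGR order
print(swap_rgb_bgr(bgr) == rgb)  # swapping twice restores the original
```

With NumPy arrays the same swap is usually written as `im[..., ::-1]`, which is what the `cv2.cvtColor` calls achieve for three-channel images.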

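This image recognition test has the following two prerequisites:

OCR software: tesseract.exe, which recognizes characters when invoked from the command line.

Python interface to the OCR software: pytesseract, which uses tesseract as its engine.

OCR: Optical Character Recognition

Note: PyOCR is an alternative interface whose engine can be tesseract or another tool, but the OCR software still has to be installed beforehand.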

import pytesseract

def get_result_by_imgCode_recognition(img):
    # Run OCR on the CAPTCHA image
    result = pytesseract.image_to_string(img)  # returns a string by default
    # ''.join(result.split())  # would strip all spaces, \n, \t, etc.
    result = ''.join(filter(str.isalnum, result))  # keep only letters and digits
    return result

def pass_counter(img, img_value):
    # Return 1 if the recognized text matches the expected value, else 0
    rst = get_result_by_imgCode_recognition(img)
    return 1 if rst == img_value else 0

def most_frequent(lst):
    # Return the most frequent element of a list; used for ensemble voting
    # return max(lst, key=lst.count)
    return max(set(lst), key=lst.count)
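The two text-handling steps in these helpers, stripping everything that is not a letter or digit and picking the most frequent answer, can be tried without Tesseract installed. This is a small sketch using only the standard library; `keep_alnum` and `most_frequent_stable` are hypothetical names, and the sample strings are made up. It uses `collections.Counter`, whose `most_common` breaks ties by first occurrence, whereas `max(set(lst), key=lst.count)` breaks ties in an arbitrary order.

```python
from collections import Counter


def keep_alnum(text):
    """Drop whitespace, punctuation and control characters from OCR output."""
    return "".join(ch for ch in text if ch.isalnum())


def most_frequent_stable(results):
    """Return the most common result; ties go to the earliest one in the list."""
    return Counter(results).most_common(1)[0][0]


raw_ocr_output = " a7 K-2\n\x0c"   # made-up OCR output with stray characters
print(keep_alnum(raw_ocr_output))

votes = ["ab12", "abl2", "ab12"]   # made-up results from three recognizers
print(most_frequent_stable(votes))
```

Note that `str.isalnum` also accepts non-ASCII letters and digits, so a stricter filter may be needed if the CAPTCHA alphabet is known to be ASCII only.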

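1. Obtaining the CAPTCHA

The browser developer tools show that the CAPTCHA image is delivered as a base64-encoded string; it is decoded and then written to a file.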

import base64
import requests

def fetch_imgCode():
    # Fetch a CAPTCHA
    url_imgCode = 'xxxx'
    html = requests.post(url_imgCode)
    # print(f'imgCode rsp: {html.text}')
    # imgCode rsp: {
    #   "data": {"image_buf_str": "/9j/4AAQ....KAP/9k=",
    #            "image_code": "16501881494161"},
    #   "error_code": 0,
    #   "msg": {"en-us": "Success", "zh-CN": "\u6210\u529f"},
    #   "request": "POST /public/verificationCode/imgCode"}
    html = html.json()
    image_buf_str = html['data']['image_buf_str']
    image_code = html['data']['image_code']
    # Save the base64-encoded image as an image file
    with open(f'./imgCode_png_raw/imgCode_{image_code}.png', 'wb') as f:
        f.write(base64.b64decode(image_buf_str))
    return image_code
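The decode-and-save step can be exercised offline. The sketch below builds a made-up response with the same JSON shape as the sample shown in the code above and writes the decoded bytes to a temporary directory; `sniff_image_type` and `save_captcha` are hypothetical helpers, and the image body is placeholder data rather than a real image. One detail worth noting: the sample payload starts with `/9j/`, which decodes to the bytes `FF D8 FF`, the JPEG signature, so the data appears to be JPEG even though it is saved with a `.png` extension. Pillow identifies the format from the content, so this does not affect `Image.open`.

```python
import base64
import json
import tempfile
from pathlib import Path


def sniff_image_type(data):
    """Guess the image format from its leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return "unknown"


def save_captcha(response_text, out_dir):
    """Decode the base64 image in a JSON response and write it to out_dir."""
    payload = json.loads(response_text)["data"]
    raw = base64.b64decode(payload["image_buf_str"])
    suffix = {"jpeg": ".jpg", "png": ".png"}.get(sniff_image_type(raw), ".bin")
    path = Path(out_dir) / f"imgCode_{payload['image_code']}{suffix}"
    path.write_bytes(raw)
    return path


# Made-up response: JPEG signature followed by placeholder bytes.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"not a real image body" + b"\xff\xd9"
response_text = json.dumps({
    "data": {
        "image_buf_str": base64.b64encode(fake_jpeg).decode("ascii"),
        "image_code": "16501881494161",
    },
    "error_code": 0,
})

with tempfile.TemporaryDirectory() as tmp:
    saved = save_captcha(response_text, tmp)
    saved_name = saved.name
    saved_bytes = saved.read_bytes()

print(saved_name)
```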

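2. Logging in to the website

Sending a POST request to the website logs the user in. In general:

Submitting the correct CAPTCHA value image_value for a given image_code results in a successful login.

Conversely, a successful login means that the CAPTCHA value image_value we recognized is correct.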

import json
import requests

HEADERS_PORTAL = {
    'User-Agent': 'xxxx',
    "Content-Type": "application/json",
}

def login(image_code, image_value):
    # Return True if the login succeeds, i.e. the response contains a token
    login_flag = False
    url_login = 'xxxx'
    data_login = {"account": "DEMO_Tong",
                  "password": "9xdsaGcy",
                  "image_code": image_code,
                  "captcha": image_value,
                  "nickname": "DEMO_Tong",
                  "client_type": 100}
    html = requests.post(url_login, data=json.dumps(data_login),
                         headers=HEADERS_PORTAL)
    # print(f'login info: {html.text}')
    html = html.json()
    if html.get('data'):
        if html.get('data').get('token'):
            login_flag = True
    return login_flag

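3. Image processing

Grayscale conversion, binarization, denoising, dilation and erosion, skew correction, character segmentation, normalization, and so on.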

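import cv2
import numpy as np

# Grayscale conversion and binarization
# lookup_table = [0 if i < 200 else 1 for i in range(256)]
def gray_processing(img, threshold=127):
    # Convert to grayscale mode
    img = img.convert('L')
    # Convert to binary mode. The default threshold is 127: values at or
    # above it become white, the rest black.
    # Why 127? 256 / 2 = 128, 2^8 = 256, and one byte is 8 bits.
    # img.convert('1') applies Floyd-Steinberg dithering by default,
    # so an explicit lookup table is used here instead.
    # threshold = 125
    lookup_table = [0 if i < threshold else 1 for i in range(256)]
    img = img.point(lookup_table, '1')
    return img

# Erosion / dilation
def erode_dilate(im, threshold=2):
    # im = cv2.imread('xxx.jpg', 0)
    # cv2.imshow('xxx.jpg', im)
    # (threshold, threshold) is the size of the structuring element
    kernel = np.ones((threshold, threshold), np.uint8)
    # cv2.erode shrinks the white regions, which thickens dark strokes
    # on a white background
    erosion = cv2.erode(im, kernel, iterations=1)
    # cv2.imwrite('imgCode_erosion.jpg', erosion)
    # Image.open('imgCode_erosion.jpg').show()
    # # cv2.dilate grows the white regions back (thins dark strokes)
    # eroded = cv2.dilate(erosion, kernel, iterations=1)
    # cv2.imwrite('imgCode_eroded.jpg', eroded)
    # Image.open('imgCode_eroded.jpg').show()
    return erosion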
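To see what the lookup-table binarization and a 2x2 erosion do to pixel values, the two steps can be imitated on a tiny grid in plain Python. This is a simplified illustration rather than a reimplementation of Pillow or OpenCV: `binarize` and `erode_2x2` are hypothetical helpers, the 5x5 "image" is made up, and border pixels are handled by ignoring neighbours that fall outside the grid.

```python
def binarize(gray, threshold=127):
    """Map each 0..255 value to 0 (black) or 1 (white) via a lookup table."""
    lookup = [0 if v < threshold else 1 for v in range(256)]
    return [[lookup[v] for v in row] for row in gray]


def erode_2x2(binary):
    """Minimum filter over a 2x2 window ending at each pixel."""
    h, w = len(binary), len(binary[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [binary[yy][xx]
                      for yy in (y - 1, y) for xx in (x - 1, x)
                      if 0 <= yy < h and 0 <= xx < w]
            row.append(min(window))
        out.append(row)
    return out


# Light background (250) with a single dark pixel (40) in the middle.
gray = [[250] * 5 for _ in range(5)]
gray[2][2] = 40

binary = binarize(gray, threshold=200)
eroded = erode_2x2(binary)

for row in eroded:
    print("".join("#" if v == 0 else "." for v in row))
```

The single dark pixel grows into a 2x2 dark block, which is why eroding the white background makes thin dark strokes thicker.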

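4. CAPTCHA recognition test

Test description

Recognition tests are run with different image-processing methods, both to accumulate successfully recognized samples and to observe how well each method performs. The CAPTCHAs fetched during the test are random: some are easy to recognize and some are not, but objectively they all belong to the same difficulty level.

The test is split into three groups of 10,000 images each, and correctness is verified by simulating a login to the website.

Group 1 recognizes the original image file directly; label "raw".

Group 2 recognizes the image object after grayscale conversion and binarization with a threshold of 200; label "gray".

Group 3 recognizes the image object after grayscale conversion, binarization and the erosion step; label "erosion".

Results are sorted into different folders by processing method and by whether recognition succeeded, and the recognized text is appended to the file name:

imgCode_png_raw: original images saved from the website

imgCode_png_raw_pass: original images that the raw test recognized correctly

imgCode_png_raw_fail: original images that the raw test failed to recognize

imgCode_png_raw_gray_pass: original images that the gray test recognized correctly

imgCode_png_raw_gray_fail: processed images that the gray test failed to recognize

imgCode_png_raw_gray_erosion_pass: original images that the erosion test recognized correctly

imgCode_png_raw_gray_erosion_fail: processed images that the erosion test failed to recognize

Note: the browser developer tools suggest that the font used by the CAPTCHA is element-icons.535877f5.woff

Test code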

import sys

import numpy as np
from PIL import Image
from tqdm import tqdm, trange
from tqdm.contrib import tzip
# tqdm is a progress-bar module, used here to follow the processing progress

TEST_TOTAL = 10000  # number of test images: 10,000

def test_raw():
    print('raw: ')
    pass_count = 0
    # for _ in range(TEST_TOTAL):
    for _ in trange(TEST_TOTAL):
        try:
            image_code = fetch_imgCode()
            img = Image.open(f'./imgCode_png_raw/imgCode_{image_code}.png')
            result = get_result_by_imgCode_recognition(img)
            login_flag = login(image_code, result)
            if login_flag:
                img.save(f'./imgCode_png_raw_pass/imgCode_{image_code}_{result}.png')
                pass_count += 1
            else:
                img.save(f'./imgCode_png_raw_fail/imgCode_{image_code}_{result}.png')
        except Exception:
            info = sys.exc_info()
            print(info)
    print(f'pass_rate: {pass_count/TEST_TOTAL*100}')

def test_gray():
    print('gray: ')
    pass_count = 0
    for _ in trange(TEST_TOTAL):
        try:
            image_code = fetch_imgCode()
            img = Image.open(f'./imgCode_png_raw/imgCode_{image_code}.png')
            img_gray = gray_processing(img, threshold=200)
            result = get_result_by_imgCode_recognition(img_gray)
            login_flag = login(image_code, result)
            if login_flag:
                img.save(f'./imgCode_png_raw_gray_pass/imgCode_{image_code}_{result}.png')
                pass_count += 1
            else:
                img_gray.save(f'./imgCode_png_raw_gray_fail/imgCode_{image_code}_{result}.png')
        except Exception:
            info = sys.exc_info()
            print(info)
    print(f'pass_rate: {pass_count/TEST_TOTAL*100}')

def test_erosion():
    print('erosion: ')
    pass_count = 0
    for _ in trange(TEST_TOTAL):
        try:
            image_code = fetch_imgCode()
            img = Image.open(f'./imgCode_png_raw/imgCode_{image_code}.png')
            img_gray = gray_processing(img, threshold=200)
            # After binarization the array holds only 0s and 1s,
            # which removes noise effectively
            im = np.asarray(img_gray, np.uint8)
            erosion = erode_dilate(im, threshold=2)
            # Scale to 0/255; with values of 0 and 1 the whole image looks black
            img1 = Image.fromarray(erosion*255)
            result = get_result_by_imgCode_recognition(img1)  # an array works here too
            login_flag = login(image_code, result)
            if login_flag:
                img.save(f'./imgCode_png_raw_gray_erosion_pass/imgCode_{image_code}_{result}.png')
                pass_count += 1
            else:
                img1.save(f'./imgCode_png_raw_gray_erosion_fail/imgCode_{image_code}_{result}.png')
        except Exception:
            info = sys.exc_info()
            print(info)
    print(f'pass_rate: {pass_count/TEST_TOTAL*100}')
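In the test functions above, an iteration that raises an exception (for example a failed request) is printed and skipped, but the pass rate is still divided by `TEST_TOTAL`, so such iterations count as failures. The following sketch, with a hypothetical `summarize` helper and made-up outcomes, shows how the two ways of counting differ; it is an illustration of the bookkeeping, not part of the original test.

```python
def summarize(outcomes):
    """Compute pass rates from a list of 'pass' / 'fail' / 'error' outcomes."""
    passed = outcomes.count("pass")
    failed = outcomes.count("fail")
    errors = outcomes.count("error")
    attempted = passed + failed  # iterations that reached the login check
    return {
        "rate_over_total": passed * 100 / len(outcomes),
        "rate_over_attempted": passed * 100 / attempted if attempted else 0.0,
        "errors": errors,
    }


# Made-up run: 6 correct, 2 wrong, 2 iterations that raised an exception.
outcomes = ["pass"] * 6 + ["fail"] * 2 + ["error"] * 2
summary = summarize(outcomes)
print(summary)
```

If many iterations fail for reasons unrelated to OCR, reporting the error count alongside the pass rate makes the two effects easier to tell apart.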

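Test results

5. Re-recognition test on successful samples

Test description

Samples recognized correctly in the raw, gray and erosion tests are copied into the imgCode_pass folder at a 1:1:1 ratio. Every CAPTCHA sample in this folder therefore has a known correct value, and the sample size is fixed and balanced across the three sources, so all three processing methods can be re-run on it and their recognition performance compared.

In this re-recognition test the sample ratio is 1:1:1, with 8,844 images from each source and 26,532 images in total.

Test code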
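One way to build such a 1:1:1 pool is to take the size of the smallest group and draw that many files at random from each group. This is a sketch under assumptions: the article does not say how the files were selected, `balanced_sample` is a hypothetical helper, and the file names and group sizes below are made up.

```python
import random


def balanced_sample(groups, seed=0):
    """Draw the same number of items from every group (size of the smallest)."""
    n = min(len(files) for files in groups.values())
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return {name: rng.sample(files, n) for name, files in groups.items()}


# Made-up file lists of different sizes for the three sources.
groups = {
    "raw": [f"raw_{i}.png" for i in range(12)],
    "gray": [f"gray_{i}.png" for i in range(9)],
    "erosion": [f"erosion_{i}.png" for i in range(7)],
}

picked = balanced_sample(groups)
print({name: len(files) for name, files in picked.items()})

# The totals reported in the article are consistent: 3 groups of 8,844 images.
print(3 * 8844)
```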

import os

def test_pass_raw():
    pass_list = os.listdir('./imgCode_pass')
    # The expected value is the 4 characters before '.png' in the file name
    pass_value_list = [img_file[-8:-4] for img_file in pass_list]
    pass_cnt1 = 0
    pass_amt = len(pass_list)
    print(f'pass_amt: {pass_amt}')
    # for img_file, img_value in zip(pass_list, pass_value_list):
    for img_file, img_value in tzip(pass_list, pass_value_list):
        # raw
        img = Image.open(f'./imgCode_pass/{img_file}')
        pass_cnt1 += pass_counter(img, img_value)
    print(f'raw: \npass_rate:{pass_cnt1 / pass_amt * 100}')

def test_pass_gray():
    pass_list = os.listdir('./imgCode_pass')
    pass_value_list = [img_file[-8:-4] for img_file in pass_list]
    pass_cnt2 = 0
    pass_amt = len(pass_list)
    print(f'pass_amt: {pass_amt}')
    # for img_file, img_value in zip(pass_list, pass_value_list):
    for img_file, img_value in tzip(pass_list, pass_value_list):
        # raw
        img = Image.open(f'./imgCode_pass/{img_file}')
        # raw + grey200
        img = gray_processing(img, threshold=200)
        pass_cnt2 += pass_counter(img, img_value)
    print(f'raw + grey200: \npass_rate:{pass_cnt2/pass_amt*100}')

def test_pass_erosion():
    pass_list = os.listdir('./imgCode_pass')
    pass_value_list = [img_file[-8:-4] for img_file in pass_list]
    pass_cnt3 = 0
    pass_amt = len(pass_list)
    print(f'pass_amt: {pass_amt}')
    # for img_file, img_value in zip(pass_list, pass_value_list):
    for img_file, img_value in tzip(pass_list, pass_value_list):
        # raw
        img = Image.open(f'./imgCode_pass/{img_file}')
        # raw + grey200
        img = gray_processing(img, threshold=200)
        # raw + grey200 + erosion
        # After binarization the array holds only 0s and 1s,
        # which removes noise effectively
        im = np.asarray(img, np.uint8)
        erosion = erode_dilate(im, threshold=2)
        # Scale to 0/255; with values of 0 and 1 the whole image looks black
        img1 = Image.fromarray(erosion*255)
        pass_cnt3 += pass_counter(img1, img_value)
    print(f'raw + grey200 + erosion(Image): \npass_rate:{pass_cnt3/pass_amt*100}')
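The slice `img_file[-8:-4]` assumes that the recognized value at the end of the file name is exactly four characters long. A slightly more tolerant way to read the label is to split the file stem on its last underscore. The sketch below is an alternative under that assumption about the naming scheme `imgCode_{image_code}_{result}.png`; `expected_value` is a hypothetical helper and the first file name is made up, while the second one appears in the article's code for failed samples.

```python
from pathlib import Path


def expected_value(file_name):
    """Return the text after the last underscore of the file stem."""
    return Path(file_name).stem.rsplit("_", 1)[-1]


four_chars = "imgCode_16501881494161_a7K2.png"   # made-up 4-character result
three_chars = "imgCode_16501101286728_359.png"   # 3-character result

print(expected_value(four_chars), four_chars[-8:-4])
print(expected_value(three_chars), three_chars[-8:-4])
```

For a four-character result both approaches agree; for a shorter one the fixed slice picks up the underscore as well.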

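Test results

Test caveats

The sample ratio needs particular attention in this test. If every sample comes from successful raw recognition, re-recognition with the raw method will be 100% correct. A re-recognition run on samples that mostly came from successful raw recognition showed the recognition ability of the different processing methods declining in order: the closer a model is to the raw recognition model, the better its accuracy, and the further away, the worse.

6. Ensemble voting model, run with multiple processes

Test description

Because the different models perform differently, model fusion from ensemble learning is considered here, using the voting method: the raw, gray and erosion models each produce a prediction, and the result with the most votes is taken as the result of the ensemble voting model and used for login verification.

Because the ensemble voting model has to recognize the same CAPTCHA sample three times, it is fairly slow, so the program is run in parallel with multiple processes to reduce the total running time.

Test code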
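The selection effect described above can be reproduced with a toy data set. In this sketch every sample is a pair of flags saying whether the raw method and the gray method can read it; the flags are made up purely for illustration and `rate` is a hypothetical helper. A pool collected only from raw successes makes raw look perfect, while a pool drawn equally from both sources treats the two methods evenly.

```python
RAW, GRAY = 0, 1  # index of each method's flag in a sample tuple


def rate(pool, method):
    """Percentage of samples in the pool that the given method reads correctly."""
    return sum(sample[method] for sample in pool) * 100 / len(pool)


# Samples collected from raw successes: the raw flag is True by construction.
pool_from_raw = [(True, True), (True, False), (True, False), (True, True)]
# Samples collected from gray successes: the gray flag is True by construction.
pool_from_gray = [(True, True), (False, True), (False, True), (True, True)]

balanced_pool = pool_from_raw + pool_from_gray

print("raw-only pool:", rate(pool_from_raw, RAW), rate(pool_from_raw, GRAY))
print("balanced pool:", rate(balanced_pool, RAW), rate(balanced_pool, GRAY))
```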

from multiprocessing import Pool

def test_ensemble_vote(kwargs):
    # kwargs is the task index supplied by pool.map; it is not used
    result_list = []
    image_code = fetch_imgCode()
    img = Image.open(f'./imgCode_png_raw/imgCode_{image_code}.png')
    result_list.append(get_result_by_imgCode_recognition(img))
    img_gray = gray_processing(img, threshold=200)
    result_list.append(get_result_by_imgCode_recognition(img_gray))
    # After binarization the array holds only 0s and 1s,
    # which removes noise effectively
    im = np.asarray(img_gray, np.uint8)
    erosion = erode_dilate(im, threshold=2)
    # Scale to 0/255; with values of 0 and 1 the whole image looks black
    img1 = Image.fromarray(erosion*255)
    result_list.append(get_result_by_imgCode_recognition(img1))
    # Majority vote; if all three disagree, the first (raw) result wins
    result = max(result_list, key=result_list.count)
    login_flag = login(image_code, result)
    return login_flag

def test_ensemble_vote_multi():
    print('test_ensemble_vote_multi: ')
    pool = Pool()
    pool_result_list = pool.map(test_ensemble_vote, trange(TEST_TOTAL))
    pool.close()
    pool.join()
    pass_count = pool_result_list.count(True)
    print(f'pass_rate: {pass_count/TEST_TOTAL*100}')
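The parallel pattern can be tried without a network or Tesseract by replacing the fetch, recognize and login round with a stand-in. In the sketch below, `simulated_attempt` is a hypothetical function that fakes three votes per task and reports whether the majority matches a made-up correct value; `run_parallel` then maps it over a `multiprocessing.Pool` and computes the pass rate. The vote strings and the failure pattern are invented for illustration.

```python
from multiprocessing import Pool


def simulated_attempt(i):
    """Stand-in for one round: three votes, then compare the winner."""
    # Every fourth task gets three different answers; the rest get a 2:1 vote.
    votes = ["x", "y", "z"] if i % 4 == 0 else ["ab12", "ab12", "abl2"]
    winner = max(votes, key=votes.count)
    return winner == "ab12"


def run_parallel(total, processes=2):
    """Run the attempts in a process pool and return the pass rate in percent."""
    with Pool(processes=processes) as pool:
        results = pool.map(simulated_attempt, range(total))
    return results.count(True) * 100 / total
```

`run_parallel(8)` is best called from under an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that start worker processes by spawning a fresh interpreter. The worker has to be a module-level function so that it can be pickled and sent to the workers.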

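Test results

Result of running in a single process

Behaviour and result when running in parallel

7. Re-recognition of failed samples

An ensemble voting model that recognizes with different binarization thresholds is used to re-recognize samples that the base models (raw, gray or erosion) failed on.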

def test_fail():
    ## A single image, different binarization thresholds, most frequent prediction
    # img = Image.open(f'./imgCode_fail/imgCode_16501101286728_359.png')
    # img.show()
    # result_list = []
    # for i in trange(120,200,1):
    #     img_gray = gray_processing(img, threshold=i)
    #     img_gray.show()
    #     result = get_result_by_imgCode_recognition(img_gray)
    #     result_list.append(result)
    # print(f'most_frequent(lst): {most_frequent(result_list)}')

    ## Several images, different thresholds, most frequent prediction;
    ## the aim is to find the best threshold
    fail_list = os.listdir('./imgCode_fail')
    result_list_1 = []
    for img_file in fail_list:
        img = Image.open(f'./imgCode_fail/{img_file}')
        result_list_2 = []
        for i in trange(120, 200, 10):
            img_gray = gray_processing(img, threshold=i)
            result = get_result_by_imgCode_recognition(img_gray)
            result_list_2.append(result)
        result_list_1.append(result_list_2)
    for img_file, lst in zip(fail_list, result_list_1):
        print(f'{img_file}, most_frequent(lst): {most_frequent(lst)}')
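The threshold sweep itself is independent of the OCR engine, so its control flow can be checked with a stub. In this sketch `sweep_thresholds` is a hypothetical helper that calls a recognizer once per threshold and votes on the answers, and `fake_recognize` is a made-up stand-in for `gray_processing` followed by OCR; the strings it returns are invented.

```python
from collections import Counter


def sweep_thresholds(recognize, start=120, stop=200, step=10):
    """Call recognize(threshold) for each threshold and vote on the results."""
    results = {t: recognize(t) for t in range(start, stop, step)}
    winner = Counter(results.values()).most_common(1)[0][0]
    return results, winner


def fake_recognize(threshold):
    """Stand-in recognizer whose answer depends on the threshold."""
    if threshold < 140:
        return "35g"
    if threshold < 180:
        return "3569"
    return "359"


results, winner = sweep_thresholds(fake_recognize)
print(list(results))  # thresholds that were actually tried
print(winner)
```

Because `range` excludes its stop value, a threshold of 200 is not part of the sweep. The most frequent answer is only a candidate: without a login check or a known label there is no confirmation that it is correct.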

8. Other
