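Getting Started with Keras (3): Building a CNN Model to Crack Website CAPTCHAs
Project Introduction
In the article "CNN大戰驗證碼" ("CNN vs. CAPTCHAs"), we used TensorFlow to build a simple CNN model to crack the CAPTCHAs of a certain website. The CAPTCHAs look like this:
In this article, we use Keras to build a slightly more complex CNN model to crack the same CAPTCHAs.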
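Dataset
This article does not describe the CAPTCHA image preprocessing in detail; interested readers can refer to the article "CNN大戰驗證碼".
In this project we currently have 1668 samples. Each sample is a single-character image of size 16*20. The features of a sample are the pixels of the character image, where 0 represents white and 1 represents black, so each sample has 320 features, each taking the value 0 or 1. The feature variables are named v1 to v320, and the class label of a sample is the character itself. Part of the dataset is shown below: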
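To make the feature layout concrete, here is a minimal sketch of how a flat vector of 320 values maps back to a 16*20 character image. The sample is a made-up rectangle rather than a row from the real dataset, and row-major pixel order is an assumption; `make_demo_sample` and `to_rows` are hypothetical helper names.

```python
# Hypothetical sample: 320 binary features in the order v1..v320.
# It draws a filled rectangle; real rows would come from data.csv.
WIDTH, HEIGHT = 16, 20


def make_demo_sample():
    """Build an illustrative 320-value sample (not real data)."""
    values = []
    for row in range(HEIGHT):
        for col in range(WIDTH):
            inside = 4 <= row < 16 and 5 <= col < 11
            values.append(1 if inside else 0)
    return values


def to_rows(values, width=WIDTH):
    """Split a flat feature list into rows of `width` pixels (row-major order assumed)."""
    return [values[i:i + width] for i in range(0, len(values), width)]


# Feature names as described in the text: v1, v2, ..., v320.
feature_names = ['v' + str(i + 1) for i in range(WIDTH * HEIGHT)]
sample = make_demo_sample()
rows = to_rows(sample)

# Print the image as ASCII art: '#' for 1 and '.' for 0.
for row in rows:
    print(''.join('#' if pixel else '.' for pixel in row))
```

The training script performs the same regrouping with NumPy by reshaping each sample to (20, 16, 1), that is, 20 rows by 16 columns with one channel.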
CNN Model
Keras makes it quick and convenient to build CNN models. The CNN model built in this article is as follows:
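The output shape of each block can be worked out by hand. The sketch below is a plain-Python approximation, not Keras code: it assumes the layer sizes used in this article's training script (three blocks of two 3x3 convolutions with padding='same' and 32, 64 and 128 filters, each followed by 2x2 max pooling with padding='same'). `conv_same` and `pool_same` are hypothetical helper names.

```python
import math


def conv_same(shape, filters):
    """A convolution with padding='same' keeps height and width; channels become `filters`."""
    height, width, _ = shape
    return (height, width, filters)


def pool_same(shape, pool=2):
    """Max pooling with padding='same' halves each spatial size, rounding up."""
    height, width, channels = shape
    return (math.ceil(height / pool), math.ceil(width / pool), channels)


shape = (20, 16, 1)  # height, width, channels of one character image
shapes = [shape]
for filters in (32, 64, 128):
    shape = conv_same(shape, filters)
    shape = conv_same(shape, filters)
    shape = pool_same(shape)
    shapes.append(shape)
    print('after %d-filter block:' % filters, shape)

# Number of values handed to the first fully connected layer after Flatten.
flattened = shape[0] * shape[1] * shape[2]
print('flattened features:', flattened)
```

Under these assumptions the feature map shrinks from 20x16 to 10x8, then 5x4, then 3x2, so the Flatten layer passes 3 * 2 * 128 = 768 values to the fully connected layers.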
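The dataset is split into a training set and a test set at a ratio of 7:3 (test_size=0.3, which gives 1167 training samples and 501 test samples). The training code is as follows: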
# -*- coding: utf-8 -*-
# Written for the Keras 2.x API of 2018; newer releases report the metric
# as 'val_accuracy' instead of 'val_acc'.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.callbacks import EarlyStopping
from keras.layers import Conv2D, MaxPooling2D

# Read the data
df = pd.read_csv('F://verifycode_data/data.csv')

# Map each character label to a class index
vals = range(31)
keys = ['1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G','H','J','K','L','N','P','Q','R','S','T','U','V','X','Y','Z']
label_dict = dict(zip(keys, vals))

x_data = df[['v'+str(i+1) for i in range(320)]]
y_data = pd.DataFrame({'label':df['label']})
y_data['class'] = y_data['label'].apply(lambda x: label_dict[x])

# Split the data into a training set and a test set (7:3)
X_train, X_test, Y_train, Y_test = train_test_split(x_data, y_data['class'], test_size=0.3, random_state=42)
# -1 lets NumPy infer the sample count (1167 for training, 501 for testing)
x_train = np.array(X_train).reshape((-1, 20, 16, 1))
x_test = np.array(X_test).reshape((-1, 20, 16, 1))

# One-hot encode the labels
n_classes = 31
y_train = np_utils.to_categorical(Y_train, n_classes)
y_val = np_utils.to_categorical(Y_test, n_classes)

input_shape = x_train[0].shape

# CNN model
model = Sequential()

# Convolution and pooling layers
model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape, padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(32, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

# Dropout layer
model.add(Dropout(0.25))

model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Dropout(0.25))

model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

model.add(Dropout(0.25))

model.add(Flatten())

# Fully connected layers
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dense(n_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Plot the model
plot_model(model, to_file=r'./model.png', show_shapes=True)

# Train the model
callbacks = [EarlyStopping(monitor='val_acc', patience=5, verbose=1)]
batch_size = 64
n_epochs = 100
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epochs,
                    verbose=1, validation_data=(x_test, y_val), callbacks=callbacks)

mp = 'F://verifycode_data/verifycode_Keras.h5'
model.save(mp)

# Plot the accuracy curve on the validation set
val_acc = history.history['val_acc']
plt.plot(range(len(val_acc)), val_acc, label='CNN model')
plt.title('Validation accuracy on verifycode dataset')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()
plt.show()

In the code above, we use the early stopping technique when training the model. Early stopping is a callback that halts training before the maximum number of epochs is reached. As configured here (monitor='val_acc', patience=5), training stops once the accuracy on the validation data has not improved for 5 consecutive epochs. Note that the test set also serves as the validation data in this script.
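The patience rule behind early stopping can be illustrated without Keras. The following is a simplified sketch of the idea, not the library's implementation: `early_stopping_epoch` is a hypothetical helper and the accuracy history is invented for illustration.

```python
def early_stopping_epoch(val_acc_history, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the monitored value has failed to improve on its
    best value by more than `min_delta` for `patience` consecutive epochs.
    """
    best = float('-inf')
    wait = 0
    for epoch, acc in enumerate(val_acc_history, start=1):
        if acc - min_delta > best:
            best = acc   # new best value: reset the counter
            wait = 0
        else:
            wait += 1    # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None


# Invented validation accuracies, one per epoch (not from the real run).
history = [0.50, 0.80, 0.90, 0.95, 0.97, 0.96, 0.97, 0.96, 0.95, 0.96, 0.97, 0.98]
stop_epoch = early_stopping_epoch(history, patience=5)
print('training would stop after epoch', stop_epoch)
```

In this invented history the best value is reached at epoch 5 and is never exceeded in the following five epochs, so training stops after epoch 10 and the later improvement to 0.98 is never seen. That is the trade-off a small patience value makes.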
Model Training
Running the training code above produces the following output:
......(earlier output omitted)
Epoch 22/100
64/1167 [>.............................] - ETA: 3s - loss: 0.0399 - acc: 1.0000
128/1167 [==>...........................] - ETA: 3s - loss: 0.1195 - acc: 0.9844
192/1167 [===>..........................] - ETA: 2s - loss: 0.1085 - acc: 0.9792
256/1167 [=====>........................] - ETA: 2s - loss: 0.1132 - acc: 0.9727
320/1167 [=======>......................] - ETA: 2s - loss: 0.1045 - acc: 0.9750
384/1167 [========>.....................] - ETA: 2s - loss: 0.1006 - acc: 0.9740
448/1167 [==========>...................] - ETA: 2s - loss: 0.1522 - acc: 0.9643
512/1167 [============>.................] - ETA: 1s - loss: 0.1450 - acc: 0.9648
576/1167 [=============>................] - ETA: 1s - loss: 0.1368 - acc: 0.9653
640/1167 [===============>..............] - ETA: 1s - loss: 0.1353 - acc: 0.9641
704/1167 [=================>............] - ETA: 1s - loss: 0.1280 - acc: 0.9659
768/1167 [==================>...........] - ETA: 1s - loss: 0.1243 - acc: 0.9674
832/1167 [====================>.........] - ETA: 0s - loss: 0.1577 - acc: 0.9639
896/1167 [======================>.......] - ETA: 0s - loss: 0.1488 - acc: 0.9665
960/1167 [=======================>......] - ETA: 0s - loss: 0.1488 - acc: 0.9656
1024/1167 [=========================>....] - ETA: 0s - loss: 0.1427 - acc: 0.9668
1088/1167 [==========================>...] - ETA: 0s - loss: 0.1435 - acc: 0.9669
1152/1167 [============================>.] - ETA: 0s - loss: 0.1383 - acc: 0.9688
1167/1167 [==============================] - 4s 3ms/step - loss: 0.1380 - acc: 0.9683 - val_loss: 0.0835 - val_acc: 0.9760
Epoch 00022: early stopping

As shown, training ran for 22 epochs before early stopping was triggered. After the last epoch, the accuracy was 96.83% on the training set and 97.60% on the test set. The accuracy curve on the test set is shown in the figure below:
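Model Prediction
After training the model, we use it to predict new CAPTCHAs. The 100 new CAPTCHAs are shown in the figure below:
The Python code that applies the trained CNN model to these new CAPTCHAs is as follows: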
# -*- coding: utf-8 -*-
import os
import cv2
import numpy as np

def split_picture(imagepath):
    # Read the image in grayscale mode
    gray = cv2.imread(imagepath, 0)
    # Turn the border of the image white
    height, width = gray.shape
    for i in range(width):
        gray[0, i] = 255
        gray[height-1, i] = 255
    for j in range(height):
        gray[j, 0] = 255
        gray[j, width-1] = 255
    # Median filter with a 3*3 kernel
    blur = cv2.medianBlur(gray, 3)
    # Binarize
    ret, thresh1 = cv2.threshold(blur, 200, 255, cv2.THRESH_BINARY)
    # Extract the individual characters.
    # findContours returns 3 values in OpenCV 3.x and 2 values in 2.x/4.x;
    # indexing with [-2] picks the contour list in every version.
    chars_list = []
    contours = cv2.findContours(thresh1, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)[-2]
    for cnt in contours:
        # Upright bounding rectangle of the contour
        x, y, w, h = cv2.boundingRect(cnt)
        if x != 0 and y != 0 and w*h >= 100:
            chars_list.append((x, y, w, h))
    # Order the characters from left to right
    sorted_chars_list = sorted(chars_list, key=lambda x: x[0])
    for i, item in enumerate(sorted_chars_list):
        x, y, w, h = item
        cv2.imwrite('F://test_verifycode/chars/%d.jpg' % (i+1), thresh1[y:y+h, x:x+w])

def remove_edge_picture(imagepath):
    # Delete noise images: those with at least three dark corners
    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    corner_list = [image[0, 0] < 127,
                   image[height-1, 0] < 127,
                   image[0, width-1] < 127,
                   image[height-1, width-1] < 127]
    if sum(corner_list) >= 3:
        os.remove(imagepath)

def resplit_with_parts(imagepath, parts):
    image = cv2.imread(imagepath, 0)
    os.remove(imagepath)
    height, width = image.shape
    file_name = imagepath.split('/')[-1].split(r'.')[0]
    # Split the image again into `parts` equal-width pieces
    step = width//parts  # step size
    start = 0  # starting position
    for i in range(parts):
        cv2.imwrite('F://test_verifycode/chars/%s.jpg' % (file_name+'-'+str(i)),
                    image[:, start:start+step])
        start += step

def resplit(imagepath):
    # Decide from the width how many touching characters the image contains
    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    if width >= 64:
        resplit_with_parts(imagepath, 4)
    elif width >= 48:
        resplit_with_parts(imagepath, 3)
    elif width >= 26:
        resplit_with_parts(imagepath, 2)

# Binarize the image and resize it to 16*20, overwriting the file
def convert(dir, file):
    imagepath = dir+'/'+file
    # Read the image
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
    # Save the image
    cv2.imwrite('%s/%s' % (dir, file), img)

# Read the image data and convert it to 0-1 values
def Read_Data(dir, file):
    imagepath = dir+'/'+file
    # Read the image
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # Flatten to a list of 0/1 values
    bin_values = [1 if pixel == 255 else 0 for pixel in thresh.ravel()]
    return bin_values

def predict(VerifyCodePath):
    dir = 'F://test_verifycode/chars'
    files = os.listdir(dir)
    # Clear out the existing files
    if files:
        for file in files:
            os.remove(dir + '/' + file)
    split_picture(VerifyCodePath)
    files = os.listdir(dir)
    if not files:
        print('The folder is empty!')
    else:
        # Remove noise images
        for file in files:
            remove_edge_picture(dir + '/' + file)
        # Re-split images that contain touching characters
        for file in os.listdir(dir):
            resplit(dir + '/' + file)
        # Resize all images to 16*20
        for file in os.listdir(dir):
            convert(dir, file)
        # Feature vectors of the characters, in left-to-right order
        files = sorted(os.listdir(dir))
        table = np.array([Read_Data(dir, file) for file in files]).reshape(-1, 20, 16, 1)
        # Path of the saved model
        mp = 'F://verifycode_data/verifycode_Keras.h5'
        # Load the model
        from keras.models import load_model
        cnn = load_model(mp)
        # Predict
        y_pred = cnn.predict(table)
        predictions = np.argmax(y_pred, axis=1)
        # Map class indices back to characters
        keys = range(31)
        vals = ['1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'N',
                'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z']
        label_dict = dict(zip(keys, vals))
        return ''.join([label_dict[pred] for pred in predictions])

def main():
    dir = 'F://VerifyCode/'
    correct = 0
    for i, file in enumerate(os.listdir(dir)):
        # The file name (without extension) is the true label
        true_label = file.split('.')[0]
        VerifyCodePath = dir+file
        pred = predict(VerifyCodePath)
        if true_label == pred:
            correct += 1
        print(i+1, (true_label, pred), true_label == pred, correct)
    total = len(os.listdir(dir))
    print('\nTotal images: %d\nCorrectly recognized: %d\nAccuracy: %.2f%%.'
          % (total, correct, correct*100/total))

main()

The prediction results of this CNN model are as follows:
Using TensorFlow backend.
2018-10-25 15:13:50.390130: I C:\tf_jenkins\workspace\rel-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
1 ('ZK6N', 'ZK6N') True 1
2 ('4JPX', '4JPX') True 2
3 ('5GP5', '5GP5') True 3
4 ('5RQ8', '5RQ8') True 4
5 ('5TQP', '5TQP') True 5
6 ('7S62', '7S62') True 6
7 ('8R2Z', '8R2Z') True 7
8 ('8RFV', '8RFV') True 8
9 ('9BBT', '9BBT') True 9
10 ('9LNE', '9LNE') True 10
11 ('67UH', '67UH') True 11
12 ('74UK', '74UK') True 12
13 ('A5T2', 'A5T2') True 13
14 ('AHYV', 'AHYV') True 14
15 ('ASEY', 'ASEY') True 15
16 ('B371', 'B371') True 16
17 ('CCQL', 'CCQL') True 17
18 ('CFD5', 'GFD5') False 17
19 ('CJLJ', 'CJLJ') True 18
20 ('D4QV', 'D4QV') True 19
21 ('DFQ8', 'DFQ8') True 20
22 ('DP18', 'DP18') True 21
23 ('E3HC', 'E3HC') True 22
24 ('E8VB', 'E8VB') True 23
25 ('DE1U', 'DE1U') True 24
26 ('FK1R', 'FK1R') True 25
27 ('FK91', 'FK91') True 26
28 ('FSKP', 'FSKP') True 27
29 ('FVZP', 'FVZP') True 28
30 ('GC6H', 'GC6H') True 29
31 ('GH62', 'GH62') True 30
32 ('H9FQ', 'H9FQ') True 31
33 ('H67Q', 'H67Q') True 32
34 ('HEKC', 'HEKC') True 33
35 ('HV2B', 'HV2B') True 34
36 ('J65Z', 'J65Z') True 35
37 ('JZCX', 'JZCX') True 36
38 ('KH5D', 'KH5D') True 37
39 ('KXD2', 'KXD2') True 38
40 ('1GDH', '1GDH') True 39
41 ('LCL3', 'LCL3') True 40
42 ('LNZR', 'LNZR') True 41
43 ('LZU5', 'LZU5') True 42
44 ('N5AK', 'N5AK') True 43
45 ('N5Q3', 'N5Q3') True 44
46 ('N96Z', 'N96Z') True 45
47 ('NCDG', 'NCDG') True 46
48 ('NELS', 'NELS') True 47
49 ('P96U', 'P96U') True 48
50 ('PD42', 'PD42') True 49
51 ('PECG', 'PEQG') False 49
52 ('PPZF', 'PPZF') True 50
53 ('PUUL', 'PUUL') True 51
54 ('Q2DN', 'D2DN') False 51
55 ('QCQ9', 'QCQ9') True 52
56 ('QDB1', 'QDBJ') False 52
57 ('QZUD', 'QZUD') True 53
58 ('R3T5', 'R3T5') True 54
59 ('S1YT', 'S1YT') True 55
60 ('SP7L', 'SP7L') True 56
61 ('SR2K', 'SR2K') True 57
62 ('SUP5', 'SVP5') False 57
63 ('T2SP', 'T2SP') True 58
64 ('U6V9', 'U6V9') True 59
65 ('UC9P', 'UC9P') True 60
66 ('UFYD', 'UFYD') True 61
67 ('V9NJ', 'V9NH') False 61
68 ('V35X', 'V35X') True 62
69 ('V98F', 'V98F') True 63
70 ('VD28', 'VD28') True 64
71 ('YGHE', 'YGHE') True 65
72 ('YNKD', 'YNKD') True 66
73 ('YVXV', 'YVXV') True 67
74 ('ZFBS', 'ZFBS') True 68
75 ('ET6X', 'ET6X') True 69
76 ('TKVC', 'TKVC') True 70
77 ('2UCU', '2UCU') True 71
78 ('HNBK', 'HNBK') True 72
79 ('X8FD', 'X8FD') True 73
80 ('ZGNX', 'ZGNX') True 74
81 ('LQCU', 'LQCU') True 75
82 ('JNZY', 'JNZVY') False 75
83 ('RX34', 'RX34') True 76
84 ('811E', '811E') True 77
85 ('ETDX', 'ETDX') True 78
86 ('4CPR', '4CPR') True 79
87 ('FE91', 'FE91') True 80
88 ('B7XH', 'B7XH') True 81
89 ('1RUA', '1RUA') True 82
90 ('UBCX', 'UBCX') True 83
91 ('KVT5', 'KVT5') True 84
92 ('HZ3A', 'HZ3A') True 85
93 ('3XLR', '3XLR') True 86
94 ('VC7T', 'VC7T') True 87
95 ('7PG1', '7PQ1') False 87
96 ('4F21', '4F21') True 88
97 ('3HLJ', '3HLJ') True 89
98 ('1KT7', '1KT7') True 90
99 ('1RHE', '1RHE') True 91
100 ('1TTA', '1TTA') True 92

Total images: 100
Correctly recognized: 92
Accuracy: 92.00%.

As shown, the trained CNN model recognizes 92 of the 100 new CAPTCHAs correctly, an accuracy above 90%.
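The eight failures in the output above can be broken down by character to see where the model struggles. The sketch below is an illustrative analysis that is not part of the original project; the (true, predicted) pairs are copied from the rows marked False.

```python
from collections import Counter

# The eight misrecognized CAPTCHAs from the output above, as (true, predicted).
errors = [
    ('CFD5', 'GFD5'), ('PECG', 'PEQG'), ('Q2DN', 'D2DN'), ('QDB1', 'QDBJ'),
    ('SUP5', 'SVP5'), ('V9NJ', 'V9NH'), ('JNZY', 'JNZVY'), ('7PG1', '7PQ1'),
]

confusions = Counter()
segmentation_errors = []
for true_label, pred in errors:
    if len(true_label) != len(pred):
        # A different character count suggests the image was split incorrectly,
        # so the characters cannot be compared position by position.
        segmentation_errors.append((true_label, pred))
        continue
    for true_char, pred_char in zip(true_label, pred):
        if true_char != pred_char:
            confusions[(true_char, pred_char)] += 1

for (true_char, pred_char), count in sorted(confusions.items()):
    print('%s misread as %s: %d' % (true_char, pred_char, count))
print('likely segmentation errors:', segmentation_errors)
```

Seven of the eight failures come down to a single misread character, mostly between visually similar shapes such as C/G, U/V and G/Q. The remaining one, where a five-character prediction was produced for a four-character CAPTCHA, points at the character-splitting step rather than at the classifier.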
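Summary
In the article "CNN大戰驗證碼", the author built the CNN model with TensorFlow; the code was fairly long and training took more than two hours. With Keras, the code for the model is concise, and early stopping shortens the training time while preserving the model's accuracy. This shows where the advantages of Keras lie.
This project is open source. The GitHub address is: https://github.com/percent4/C....
Note: the author now has a WeChat official account, Python爬蟲與算法 (WeChat ID: easy_web_scrape). You are welcome to follow it.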