Kaggle Competition: Quora Insincere Questions Classification - Summary and Reflections
In this Quora text classification competition, competing solo among roughly 4,000 teams, I only managed a top-20% finish on the leaderboard. Partly this was because it was my first NLP competition and I was a complete beginner, and partly because I slacked off at times along the way. So I want to write down some technical and personal takeaways as a reminder to myself.
The task was to use a text training set to predict whether a question posted on Quora is sincere or insincere. Competition link: https://www.kaggle.com/c/quora-insincere-questions-classification
Technical takeaways:
By studying the kernels of various top competitors, and by digging into the literature whenever I only half understood something, I built up a rough picture of NLP text classification and a first acquaintance with tokenization, language models, and word vectors: n-gram language models, pretrained word embeddings, the commonly used LSTM and GRU networks, various text preprocessing methods, and model components such as attention layers.
Non-technical issues:
NLP competitions are quite different from ordinary data mining competitions. In a typical data mining competition the most important thing is to engineer good features, with choosing a suitable model coming second; NLP puts much more weight on the model itself, which is why deep learning models are so widely used in NLP. During this competition I was also working on another data mining contest, and splitting my attention made me less focused. Once I got stuck somewhere for a while I started to slack off, which is another habit I need to fix.
Lessons and gains:
最重要的收獲是感覺自己NLP終于入了門,同時(shí)了解了各種前沿的論文對于NLP建模的影響,需要的論文,因?yàn)榛旧虾玫膎lp模型都是從現(xiàn)有論文中衍生的(當(dāng)然很多大神是通過比賽驗(yàn)證自己的模型然后再發(fā)Paper),這和從數(shù)據(jù)中衍生的數(shù)據(jù)挖掘出的特征真的是有很大的區(qū)別。同時(shí)這次比賽借鑒了很多別人的方案,在以后更需要做的是站在巨人的肩膀上做出自己的一些想法。在這次比賽后,發(fā)現(xiàn)nlp真是一個(gè)巨大無比的坑,還有太多需要學(xué)習(xí)的地方,繼續(xù)加油,保持危機(jī)感。
The competition code and notes follow:
Source code: https://github.com/yyhhlancelot/Kaggle_Quora_Insincere_Question_Classification
First, load the required packages:
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import keras
import os
import time
import math
import gc
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, CuDNNLSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D
from keras.layers import Bidirectional, MaxPooling1D, GlobalMaxPool1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.layers import Input, Embedding, Dense, Conv2D, MaxPool2D, concatenate
from keras.layers import Reshape, Flatten, Concatenate, Dropout, SpatialDropout1D, BatchNormalization, PReLU
from keras.layers import concatenate, add
from keras.optimizers import Adam
from keras.models import Model
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import *

Preprocessing:
Clean the punctuation:
def clean_text(x):
    puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '?', '~', '@', '£',
              '·', '_', '{', '}', '?', '^', '?', '`', '<', '→', '°', '€', '?', '?', '?', '←', '×', '§', '″', '′', '?', '█', '?', 'à', '…',
              '“', '★', '”', '–', '●', 'a', '?', '?', '¢', '2', '?', '?', '?', '↑', '±', '?', '?', '═', '|', '║', '―', '¥', '▓', '—', '?', '─',
              '?', ':', '?', '⊕', '▼', '?', '?', '■', '’', '?', '¨', '▄', '?', '☆', 'é', 'ˉ', '?', '¤', '▲', 'è', '?', '?', '?', '?', '‘', '∞',
              '?', ')', '↓', '、', '│', '(', '?', ',', '?', '╩', '╚', '3', '?', '╦', '╣', '╔', '╗', '?', '?', '?', '?', '1', '≤', '?', '√']
    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x

Clean numbers with regular expressions:
import re

def clean_numbers(x):
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x
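A quick illustrative check of clean_numbers (the example sentence is mine, not from the competition data): longer digit runs are collapsed first because the {5,} pattern is applied before the shorter ones.

# Illustrative usage of clean_numbers defined above
print(clean_numbers("founded in 2018 with 1234567 users and 95 offices"))
# -> 'founded in #### with ##### users and ## offices'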
Clean common misspellings:

def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re

mispell_dict = {'colour': 'color', 'centre': 'center', 'didnt': 'did not', 'doesnt': 'does not', 'isnt': 'is not',
                'shouldnt': 'should not', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling',
                'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization',
                'wwii': 'world war 2', 'citicise': 'criticize', 'instagram': 'social medium', 'whatsapp': 'social medium',
                'snapchat': 'social medium',
                "ain't": "is not", "aren't": "are not", "can't": "cannot", "'cause": "because", "could've": "could have",
                "couldn't": "could not", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not",
                "hasn't": "has not", "haven't": "have not", "he'd": "he would", "he'll": "he will", "he's": "he is",
                "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have", "I'm": "I am", "I've": "I have",
                "i'd": "i would", "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have", "i'm": "i am", "i've": "i have",
                "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
                "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have",
                "mightn't": "might not", "mightn't've": "might not have", "must've": "must have", "mustn't": "must not",
                "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have", "o'clock": "of the clock",
                "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not",
                "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will",
                "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not",
                "shouldn't've": "should not have", "so've": "so have", "so's": "so as", "this's": "this is",
                "that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                "there'd've": "there would have", "there's": "there is", "here's": "here is", "they'd": "they would",
                "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are",
                "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have",
                "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not",
                "what'll": "what will", "what'll've": "what will have", "what're": "what are", "what's": "what is",
                "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did",
                "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have",
                "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have",
                "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not",
                "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would", "y'all'd've": "you all would have",
                "y'all're": "you all are", "y'all've": "you all have", "you'd": "you would", "you'd've": "you would have",
                "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have",
                'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling',
                'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor',
                'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ',
                'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do',
                'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do',
                'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation',
                'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum',
                'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota',
                'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp',
                'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization'}

mispellings, mispellings_re = _get_mispell(mispell_dict)

def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]
    return mispellings_re.sub(replace, text)

Text preprocessing:
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
print("Train shape : ", train_df.shape)
print("Test shape : ", test_df.shape)

embed_size = 300      # embedding dimension
max_features = 95000  # vocabulary size
max_len = 70          # maximum input sequence length

# lower
train_df['question_text'] = train_df['question_text'].apply(lambda x: x.lower())
test_df['question_text'] = test_df['question_text'].apply(lambda x: x.lower())

# clean the text
train_df["question_text"] = train_df["question_text"].apply(lambda x: clean_text(x))
test_df["question_text"] = test_df["question_text"].apply(lambda x: clean_text(x))

# clean numbers
train_df["question_text"] = train_df["question_text"].apply(lambda x: clean_numbers(x))
test_df["question_text"] = test_df["question_text"].apply(lambda x: clean_numbers(x))

# clean spellings
train_df['question_text'] = train_df['question_text'].apply(lambda x: replace_typical_misspell(x))
test_df['question_text'] = test_df['question_text'].apply(lambda x: replace_typical_misspell(x))

# fill up the missing values
train_X = train_df['question_text'].fillna("_##_").values
test_X = test_df['question_text'].fillna("_##_").values

# tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)

# pad the sentences
train_X = pad_sequences(train_X, maxlen=max_len)
test_X = pad_sequences(test_X, maxlen=max_len)

# the target values
train_y = train_df['target'].values

# shuffle the training data
np.random.seed(666)
trn_idx = np.random.permutation(len(train_X))
train_X = train_X[trn_idx]
train_y = train_y[trn_idx]
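As a quick sanity check of what the tokenizer and padding steps produce, here is a toy example of my own (not part of the original kernel), using the same Keras APIs:

# Toy illustration of Tokenizer + pad_sequences on two made-up sentences
toy = ["why do people ask insincere questions", "how do i learn nlp"]
tok = Tokenizer(num_words=20)
tok.fit_on_texts(toy)
seqs = tok.texts_to_sequences(toy)   # integer ids by word frequency, e.g. [[2, 1, 3, 4, 5, 6], [7, 1, 8, 9, 10]]
print(pad_sequences(seqs, maxlen=8)) # zero-padded on the left to a fixed length of 8

In the real pipeline above, max_features caps the vocabulary at the 95,000 most frequent words and max_len truncates or pads every question to 70 tokens.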
Loading the pretrained word embeddings:

def load_glove(word_index):
    # EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/glove.840B.300d/glove.840B.300d.txt'

    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, 'r', encoding='UTF-8'))
    all_embs = np.stack(embeddings_index.values())
    emb_mean, emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix


def load_fasttext(word_index):
    # EMBEDDING_FILE = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/wiki-news-300d-1M/wiki-news-300d-1M.vec'

    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, 'r', encoding='UTF-8') if len(o) > 100)
    all_embs = np.stack(embeddings_index.values())
    emb_mean, emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix


def load_para(word_index):
    # EMBEDDING_FILE = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
    EMBEDDING_FILE = 'J:/Code/kaggle/Quora_Insincere_Question_Classfication/paragram_300_sl999/paragram_300_sl999.txt'

    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o) > 100)
    all_embs = np.stack(embeddings_index.values())
    emb_mean, emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    # word_index = tokenizer.word_index
    nb_words = min(max_features, len(word_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= max_features:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

Attention mechanism (attention layer):
class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        Example:
            model.add(LSTM(64, return_sequences=True))
            model.add(Attention())
        """
        self.supports_masking = True
        # self.init = initializations.get('glorot_uniform')
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        # eij = K.dot(x, self.W) TF backend doesn't support it
        # features_dim = self.W.shape[0]
        # step_dim = x._keras_shape[1]
        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                              K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)
        a = K.exp(eij)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases, especially in the early stages of training, the sum may be almost zero
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        # return input_shape[0], input_shape[-1]
        return input_shape[0], self.features_dim
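To make the computation in call() explicit: the layer scores each timestep of the input against a learned vector w (plus a per-step bias b_t), normalizes the scores with a softmax, and returns the attention-weighted sum of the timesteps. Restating the code above in LaTeX (this follows Raffel et al., as the docstring notes):

e_t = \tanh(x_t^\top w + b_t), \qquad
a_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k) + \epsilon}, \qquad
\mathrm{out} = \sum_{t=1}^{T} a_t \, x_t

so a (samples, steps, features) tensor is reduced to (samples, features), matching compute_output_shape.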
Capsule network:

def squash(x, axis=-1):
    # s_squared_norm is really small
    # s_squared_norm = K.sum(K.square(x), axis, keepdims=True) + K.epsilon()
    # scale = K.sqrt(s_squared_norm) / (0.5 + s_squared_norm)
    # return scale * x
    s_squared_norm = K.sum(K.square(x), axis, keepdims=True)
    scale = K.sqrt(s_squared_norm + K.epsilon())
    return x / scale


# A capsule implementation in pure Keras
class Capsule(Layer):
    def __init__(self, num_capsule, dim_capsule, routings=3, kernel_size=(9, 1), share_weights=True,
                 activation='default', **kwargs):
        super(Capsule, self).__init__(**kwargs)
        self.num_capsule = num_capsule
        self.dim_capsule = dim_capsule
        self.routings = routings
        self.kernel_size = kernel_size
        self.share_weights = share_weights
        if activation == 'default':
            self.activation = squash
        else:
            self.activation = Activation(activation)

    def build(self, input_shape):
        super(Capsule, self).build(input_shape)
        input_dim_capsule = input_shape[-1]
        if self.share_weights:
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(1, input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     # shape=self.kernel_size,
                                     initializer='glorot_uniform',
                                     trainable=True)
        else:
            input_num_capsule = input_shape[-2]
            self.W = self.add_weight(name='capsule_kernel',
                                     shape=(input_num_capsule, input_dim_capsule,
                                            self.num_capsule * self.dim_capsule),
                                     initializer='glorot_uniform',
                                     trainable=True)

    def call(self, u_vecs):
        if self.share_weights:
            u_hat_vecs = K.conv1d(u_vecs, self.W)
        else:
            u_hat_vecs = K.local_conv1d(u_vecs, self.W, [1], [1])

        batch_size = K.shape(u_vecs)[0]
        input_num_capsule = K.shape(u_vecs)[1]
        u_hat_vecs = K.reshape(u_hat_vecs, (batch_size, input_num_capsule,
                                            self.num_capsule, self.dim_capsule))
        u_hat_vecs = K.permute_dimensions(u_hat_vecs, (0, 2, 1, 3))
        # final u_hat_vecs.shape = [None, num_capsule, input_num_capsule, dim_capsule]

        b = K.zeros_like(u_hat_vecs[:, :, :, 0])  # shape = [None, num_capsule, input_num_capsule]
        for i in range(self.routings):
            b = K.permute_dimensions(b, (0, 2, 1))  # shape = [None, input_num_capsule, num_capsule]
            c = K.softmax(b)
            c = K.permute_dimensions(c, (0, 2, 1))
            b = K.permute_dimensions(b, (0, 2, 1))
            outputs = self.activation(tf.keras.backend.batch_dot(c, u_hat_vecs, [2, 2]))
            if i < self.routings - 1:
                b = tf.keras.backend.batch_dot(outputs, u_hat_vecs, [2, 3])

        return outputs

    def compute_output_shape(self, input_shape):
        return (None, self.num_capsule, self.dim_capsule)


def capsule():
    K.clear_session()
    inp = Input(shape=(max_len,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp)
    x = SpatialDropout1D(rate=0.2)(x)
    x = Bidirectional(CuDNNGRU(100, return_sequences=True,
                               kernel_initializer=initializers.glorot_normal(seed=12300),
                               recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(x)
    x = Capsule(num_capsule=10, dim_capsule=10, routings=4, share_weights=True)(x)
    x = Flatten()(x)
    x = Dense(100, activation="relu", kernel_initializer=initializers.glorot_normal(seed=12300))(x)
    x = Dropout(0.12)(x)
    x = BatchNormalization()(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer=Adam())
    return model


def f1_smart(y_true, y_pred):
    args = np.argsort(y_pred)
    tp = y_true.sum()
    fs = (tp - np.cumsum(y_true[args[:-1]])) / np.arange(y_true.shape[0] + tp - 1, tp, -1)
    res_idx = np.argmax(fs)
    return 2 * fs[res_idx], (y_pred[args[res_idx]] + y_pred[args[res_idx + 1]]) / 2
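One detail worth noting: the squash() used here is plain L2 normalization of each capsule vector, not the squashing non-linearity from the original capsule paper (Sabour et al., 2017); the commented-out lines correspond to a variant of the original form. Side by side:

\mathrm{squash}_{\text{here}}(x) = \frac{x}{\sqrt{\lVert x \rVert^2 + \epsilon}},
\qquad
\mathrm{squash}_{\text{Sabour}}(s) = \frac{\lVert s \rVert^2}{1 + \lVert s \rVert^2} \cdot \frac{s}{\lVert s \rVert}

Both preserve the vector's direction; the original version additionally shrinks short vectors toward zero instead of scaling everything to unit length.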
Modeling:

After comparing the models shared by various top competitors, I picked out a few that worked fairly well.
First, a bidirectional LSTM/GRU model with attention:
def model_lstm_atten(embedding_matrix):
    inp = Input(shape=(max_len,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp)
    x = SpatialDropout1D(0.1)(x)
    x = Bidirectional(CuDNNLSTM(40, return_sequences=True))(x)
    y = Bidirectional(CuDNNGRU(40, return_sequences=True))(x)

    atten_1 = Attention(max_len)(x)
    atten_2 = Attention(max_len)(y)
    avg_pool = GlobalAveragePooling1D()(y)
    max_pool = GlobalMaxPooling1D()(y)

    conc = concatenate([atten_1, atten_2, avg_pool, max_pool])
    conc = Dense(16, activation="relu")(conc)
    conc = Dropout(0.1)(conc)
    outp = Dense(1, activation="sigmoid")(conc)

    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[f1])
    return model
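The models above are compiled with a custom f1 metric that is not shown in this post (it lives in the linked repo). A minimal batch-level sketch of such a metric written against the Keras backend, used only for monitoring during training, might look like this; it is my assumption of the missing helper, not necessarily the exact function used:

def f1(y_true, y_pred):
    # Batch-level F1 approximation for monitoring only; the threshold is fixed
    # at 0.5 here, while the real evaluation searches for the best threshold.
    y_pred = K.round(y_pred)
    tp = K.sum(y_true * y_pred)
    precision = tp / (K.sum(y_pred) + K.epsilon())
    recall = tp / (K.sum(y_true) + K.epsilon())
    return 2 * precision * recall / (precision + recall + K.epsilon())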
A bidirectional LSTM/GRU model combining attention and a capsule layer:

def model_atten_capsule(embedding_matrix):
    '''0.7'''
    inp_x = Input(shape=(max_len,))
    inp_features = Input(shape=(6,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp_x)
    x = SpatialDropout1D(0.1)(x)
    lstm = Bidirectional(CuDNNLSTM(60, return_sequences=True,
                                   kernel_initializer=initializers.glorot_normal(seed=12300),
                                   recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(x)
    gru = Bidirectional(CuDNNGRU(60, return_sequences=True,
                                 kernel_initializer=initializers.glorot_normal(seed=12300),
                                 recurrent_initializer=initializers.orthogonal(gain=1.0, seed=10000)))(lstm)
    # x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(x)

    content3 = Capsule(num_capsule=10, dim_capsule=10, routings=4, share_weights=True)(gru)
    content3 = Dropout(0.1)(content3)
    # content3 = Reshape(-1, )(content3)
    content3 = Flatten()(content3)
    content3 = Dense(1, activation="relu", kernel_initializer=initializers.glorot_normal(seed=12300))(content3)  # modified content3

    atten_lstm = Attention(max_len)(lstm)
    atten_gru = Attention(max_len)(gru)
    avg_pool = GlobalAveragePooling1D()(gru)
    max_pool = GlobalMaxPooling1D()(gru)

    conc = concatenate([atten_lstm, atten_gru, content3, avg_pool, max_pool, inp_features])  # modified dense
    conc = Dense(16, activation="relu", kernel_initializer=initializers.glorot_normal(seed=12300))(conc)
    x = BatchNormalization()(conc)
    x = Dropout(0.1)(x)
    outp = Dense(1, activation="sigmoid")(x)  # sigmoid added: binary_crossentropy expects probabilities

    model = Model(inputs=[inp_x, inp_features], outputs=outp)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[f1])
    return model

CNN model:
def model_cnn(embedding_matrix):
    filter_sizes = [1, 2, 3, 5]
    num_filters = 36

    inp = Input(shape=(max_len,))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    x = Reshape((max_len, embed_size, 1))(x)

    maxpool_pool = []
    for i in range(len(filter_sizes)):
        conv = Conv2D(num_filters, kernel_size=(filter_sizes[i], embed_size),
                      kernel_initializer='he_normal', activation='elu')(x)
        maxpool_pool.append(MaxPool2D(pool_size=(max_len - filter_sizes[i] + 1, 1))(conv))

    z = Concatenate(axis=1)(maxpool_pool)
    z = Flatten()(z)
    z = Dropout(0.1)(z)
    outp = Dense(1, activation="sigmoid")(z)

    model = Model(inputs=inp, outputs=outp)
    model.summary()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

I also found an implementation of the DPCNN model (previously proposed by researchers at Tencent AI Lab) online and made some targeted modifications:
def model_dpcnn(embedding_matrix):
    filter_nr = 64
    filter_size = 3
    max_pool_size = 3
    max_pool_strides = 2
    dense_nr = 256
    spatial_dropout = 0.1
    dense_dropout = 0.2
    train_embed = False
    conv_kern_reg = regularizers.l2(0.00001)
    conv_bias_reg = regularizers.l2(0.00001)

    inp = Input(shape=(max_len,))
    emb_comment = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp)
    # emb_comment = SpatialDropout1D(0.1)(emb_comment)

    # block1
    block1 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(emb_comment)
    block1 = BatchNormalization()(block1)
    block1 = PReLU()(block1)
    block1 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block1)
    block1 = BatchNormalization()(block1)
    block1 = PReLU()(block1)

    # we pass the embedded comment through a conv1d with filter size 1 because it needs to have
    # the same shape as the block output; if you choose filter_nr = embed_size (300 in this case)
    # you don't have to do this part and can add emb_comment directly to block1_output
    resize_emb = Conv1D(filter_nr, kernel_size=1, padding='same', activation='linear',
                        kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(emb_comment)
    resize_emb = PReLU()(resize_emb)

    block1_output = add([block1, resize_emb])
    block1_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block1_output)

    # block2
    block2 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block1_output)
    block2 = BatchNormalization()(block2)
    block2 = PReLU()(block2)
    block2 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block2)
    block2 = BatchNormalization()(block2)
    block2 = PReLU()(block2)

    block2_output = add([block2, block1_output])
    block2_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block2_output)

    # block3
    block3 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block2_output)
    block3 = BatchNormalization()(block3)
    block3 = PReLU()(block3)
    block3 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block3)
    block3 = BatchNormalization()(block3)
    block3 = PReLU()(block3)

    block3_output = add([block3, block2_output])
    block3_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block3_output)

    # block4
    block4 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block3_output)
    block4 = BatchNormalization()(block4)
    block4 = PReLU()(block4)
    block4 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block4)
    block4 = BatchNormalization()(block4)
    block4 = PReLU()(block4)

    block4_output = add([block4, block3_output])
    block4_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block4_output)

    # block5
    block5 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block4_output)
    block5 = BatchNormalization()(block5)
    block5 = PReLU()(block5)
    block5 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block5)
    block5 = BatchNormalization()(block5)
    block5 = PReLU()(block5)

    block5_output = add([block5, block4_output])
    block5_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block5_output)

    # # block6
    # block6 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
    #                 kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block5_output)
    # block6 = BatchNormalization()(block6)
    # block6 = PReLU()(block6)
    # block6 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
    #                 kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block6)
    # block6 = BatchNormalization()(block6)
    # block6 = PReLU()(block6)
    # block6_output = add([block6, block5_output])
    # block6_output = MaxPooling1D(pool_size=max_pool_size, strides=max_pool_strides)(block6_output)

    # block7
    block7 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block5_output)
    block7 = BatchNormalization()(block7)
    block7 = PReLU()(block7)
    block7 = Conv1D(filter_nr, kernel_size=filter_size, padding='same', activation='linear',
                    kernel_regularizer=conv_kern_reg, bias_regularizer=conv_bias_reg)(block7)
    block7 = BatchNormalization()(block7)
    block7 = PReLU()(block7)

    block7_output = add([block7, block5_output])
    outp = GlobalMaxPooling1D()(block7_output)
    # output = block7_output

    outp = Dense(dense_nr, activation='linear')(outp)
    outp = BatchNormalization()(outp)
    outp = Dropout(0.1)(outp)
    outp = Dense(1, activation='sigmoid')(outp)

    model = Model(inputs=inp, outputs=outp)
    model.summary()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

I also tried a few other models along the same lines, combining attention, capsule layers, and bidirectional LSTM/GRU with only the architecture details differing, so I will not list them here.
Embedding preparation and training:
embedding_matrix_1 = load_glove(tokenizer.word_index)
# embedding_matrix_2 = load_fasttext(tokenizer.word_index)
embedding_matrix_3 = load_para(tokenizer.word_index)

# average the GloVe and Paragram embeddings
embedding_matrix = np.mean([embedding_matrix_1, embedding_matrix_3], axis=0)
# embedding_matrix = np.mean([embedding_matrix_1, embedding_matrix_2], axis=0)
# np.shape(embedding_matrix)
del embedding_matrix_1, embedding_matrix_3
gc.collect()

Searching for the best threshold:
def threshold_search(y_true, y_prob):
    best_thresh = 0
    best_score = 0
    for thresh in np.arange(0.1, 0.701, 0.01):
        thresh = np.round(thresh, 2)
        score = metrics.f1_score(y_true, (y_prob >= thresh).astype(int))
        print("F1 score at threshold {} is {}".format(thresh, score))
        if score > best_score:
            best_score = score
            best_thresh = thresh
    return best_thresh


def train_pred(model, dev_X, dev_y, val_X, val_y, test_X,
               dev_features=None, val_features=None, epochs=None, callback=None):
    if dev_features is None:
        model.fit(dev_X, dev_y, batch_size=512, epochs=epochs,
                  validation_data=(val_X, val_y), callbacks=callback, verbose=0)
        pred_test_y_temp = model.predict(test_X, batch_size=1024)
        # pred_test_y_temp = model.predict(np.concatenate((test_X, test_features), axis=1), batch_size=1024)
    else:
        model.fit([dev_X, dev_features], dev_y, batch_size=512, epochs=epochs,
                  validation_data=([val_X, val_features], val_y), callbacks=callback, verbose=0)
        pred_test_y_temp = model.predict([test_X, test_features], batch_size=1024)
    return pred_test_y_temp
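The training loop below passes a clr callback that is never defined in this post (the repo presumably uses some cyclic learning rate schedule). A minimal triangular cyclic learning rate callback is sketched here as my assumption of what clr might look like, not the exact implementation used:

class CyclicLR(Callback):
    """Minimal triangular cyclic learning rate schedule (assumed stand-in for the clr callback used below)."""
    def __init__(self, base_lr=0.001, max_lr=0.003, step_size=300.0):
        super(CyclicLR, self).__init__()
        self.base_lr = base_lr
        self.max_lr = max_lr
        self.step_size = step_size
        self.iterations = 0

    def clr(self):
        # triangular schedule: lr ramps from base_lr up to max_lr and back, every 2 * step_size batches
        cycle = np.floor(1 + self.iterations / (2 * self.step_size))
        x = np.abs(self.iterations / self.step_size - 2 * cycle + 1)
        return self.base_lr + (self.max_lr - self.base_lr) * max(0.0, 1 - x)

    def on_train_begin(self, logs=None):
        K.set_value(self.model.optimizer.lr, self.base_lr)

    def on_batch_end(self, batch, logs=None):
        self.iterations += 1
        K.set_value(self.model.optimizer.lr, self.clr())

clr = CyclicLR(base_lr=0.001, max_lr=0.003, step_size=300.0)

(Callback comes from keras.callbacks, imported above via the wildcard import.)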
Here I used 4-fold cross-validation and trained one of the models above:

## ADDITIONAL TRAINING: lstm_atten
output = []  # collect (predictions, threshold, val score, description) per model configuration
num_splits = 4
skf = StratifiedKFold(n_splits=num_splits, shuffle=True, random_state=2333)
pred_test_y = 0
thresh_use = 0
val_score = 0

for dev_index, val_index in skf.split(train_X, train_y):
    dev_X, val_X = train_X[dev_index, :], train_X[val_index, :]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    # dev_features, val_features = train_features[dev_index, :], train_features[val_index, :]

    model = model_lstm_atten(embedding_matrix)
    pred_test_y_temp = train_pred(model, dev_X, dev_y, val_X, val_y, test_X,
                                  dev_features=None, val_features=None, epochs=2, callback=[clr, ])
    pred_val_y = model.predict(val_X, batch_size=1024)
    best_thresh = threshold_search(val_y, pred_val_y)
    val_score_temp = metrics.f1_score(val_y, (pred_val_y > best_thresh).astype(int))
    print("val temp best f1 score is {0} and best thresh is {1}".format(val_score_temp, best_thresh))

    thresh_use += best_thresh
    pred_test_y += pred_test_y_temp
    val_score += val_score_temp
    keras.backend.clear_session()

pred_test_y /= num_splits
thresh_use /= num_splits
val_score /= num_splits
output.append([pred_test_y, thresh_use, val_score, 'lstm atten glove+para'])
Submission:

sub = pd.read_csv('../input/sample_submission.csv')
sub.prediction = (pred_test_y > thresh_use).astype(int)
sub.to_csv("submission.csv", index=False)

Because results for this competition had to be submitted through a Kaggle kernel, the run time could not exceed two hours, so the time budget had to be used carefully, and the strategy may need to be adjusted accordingly, for example the number of cross-validation folds and the number of epochs. That is essentially my whole pipeline for this competition. I hope to do better next time.
總結(jié)