
kaggle:Quora Insincere Questions Classification

Published: 2023/12/16

Problem description:

An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on and keep its platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions: those founded upon false premises, or those that intend to make a statement rather than look for helpful answers.

In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.

Here is your chance to combat online trolls at scale. Help Quora uphold its policy of "Be Nice, Be Respectful" and continue to be a place for sharing and growing the world's knowledge.

Important Note:
Be aware that this is being run as a Kernels Only Competition, which requires that all submissions be made via a Kernel output. Please read the Kernels FAQ and the data page carefully to fully understand how this is designed.

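Data Description

In this competition you will predict whether a question asked on Quora is sincere or not.

An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere: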

Has a non-neutral tone
- Has an exaggerated tone to underscore a point about a group of people
- Is rhetorical and meant to imply a statement about a group of people
Is disparaging or inflammatory
- Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
- Makes disparaging attacks/insults against a specific person or group of people
- Is based on an outlandish premise about a group of people
- Disparages a characteristic that is not fixable and not measurable
Isn't grounded in reality
- Is based on false information, or contains absurd assumptions
Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

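The training data includes the question that was asked, and whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.

Note that the distribution of questions in the dataset should not be taken to be representative of the distribution of questions asked on Quora. This is, in part, because of a combination of sampling procedures and sanitization measures that have been applied to the final dataset.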

Data fields

qid - unique question identifier
question_text - Quora question text
target - a question labeled "insincere" has a value of 1, otherwise 0
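To make the three fields concrete, here is a minimal sketch that reads a few rows in the same column layout and reports the share of rows with target = 1. The rows are invented for illustration and are not taken from the competition data; the helper name `target_rate` is likewise hypothetical.

```python
import csv
import io

# Toy rows invented for illustration; they are NOT competition data.
TOY_CSV = """qid,question_text,target
q1,How do I learn linear algebra?,0
q2,What is a good first programming language?,0
q3,Why is everyone who disagrees with me an idiot?,1
q4,How does a GRU differ from an LSTM?,0
"""

def target_rate(csv_text):
    """Return (row count, fraction of rows whose target is 1)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    positives = sum(int(row["target"]) for row in rows)
    return len(rows), positives / len(rows)

n_rows, rate = target_rate(TOY_CSV)
print(n_rows, rate)
```

On the real train.csv the same check is a quick way to see how unbalanced the two classes are before choosing a metric or a threshold.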

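This is a Kernels-only competition. The files in this Data section are downloadable for reference in Stage 1. Stage 2 files will only be available in Kernels and cannot be downloaded.

What will be available in the second stage of the competition?

In the second stage of the competition, we will re-run your selected Kernels. The following files will be swapped with new data:

test.csv - This will be swapped with the complete public and private test dataset. This file has ~56k rows in stage 1 and ~376k rows in stage 2. The public leaderboard data remains the same for both versions. The file name will be the same (both test.csv) to ensure that your code will run.
sample_submission.csv - Similar to test.csv, this will change from ~56k rows in stage 1 to ~376k rows in stage 2. The file name will remain the same.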

Embeddings

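External data sources are not allowed for this competition. However, a number of word embeddings are provided along with the dataset and can be used in the models. They are as follows: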

GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
glove.840B.300d - https://nlp.stanford.edu/projects/glove/
paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html

import os
import time
import math

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
print("Train shape : ", train_df.shape)
print("Test shape : ", test_df.shape)

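The next steps are as follows:

Split the training dataset into train and val samples. Cross-validation is a time-consuming process, so let us do a simple train/val split.
Fill the missing values in the text column with '_na_'.
Tokenize the text column and convert each question into a sequence of integer word indices.
Pad the sequences as needed: if the number of words in the text is greater than 'maxlen', truncate it to 'maxlen'; if it is smaller, fill the remaining positions with zeros.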

## split to train and val
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=2018)

## some config values
embed_size = 300  # how big is each word vector
max_features = 50000  # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100  # max number of words in a question to use

## fill up the missing values
train_X = train_df["question_text"].fillna("_na_").values
val_X = val_df["question_text"].fillna("_na_").values
test_X = test_df["question_text"].fillna("_na_").values

## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

## Pad the sentences
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

## Get the target values
train_y = train_df['target'].values
val_y = val_df['target'].values
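To show what the tokenize and pad steps do to a question, here is a simplified plain-Python sketch. It is not the Keras implementation: the helper names are hypothetical, words are split on whitespace only, and ties in frequency are broken alphabetically. It does mirror three behaviours of the Keras calls used above: indices start at 1 with 0 reserved for padding, only indices below `num_words` are kept, and padding and truncation both happen at the front of the sequence by default.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a 1-based rank by descending frequency (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Simplification: ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_pre(seq, maxlen):
    """Keep the last maxlen tokens, then left-pad with zeros up to maxlen."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["what is a gru", "what is a word embedding", "is it a good idea"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=6)
padded = [pad_pre(seq, maxlen=4) for seq in sequences]
print(padded)
```

Because rare words are dropped before padding, a question made up mostly of rare words ends up as a row that is almost entirely zeros.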

Without Pretrained Embeddings:

Now that all the necessary preprocessing steps are done, we can first train a bidirectional GRU model. We will not use any pretrained word embeddings for this model; the embeddings will be learned from scratch. Please check the model summary for details of the layers used.

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())
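The pooling layer in this model can be illustrated without Keras. With return_sequences=True, the bidirectional GRU emits one vector per time step (100 steps of 128 values here, since the forward and backward 64-unit outputs are concatenated by default), and global max pooling keeps the largest value of each feature across all steps. The following is a minimal sketch on a toy 3 x 4 input; the function name is hypothetical.

```python
def global_max_pool_1d(sequence):
    """Element-wise maximum over the time axis of a (timesteps, features) nested list."""
    return [max(column) for column in zip(*sequence)]

# Toy recurrent output: 3 time steps, 4 features.
toy = [
    [0.1, -0.5, 0.3, 0.0],
    [0.7, -0.2, 0.1, 0.4],
    [0.2, -0.9, 0.6, 0.4],
]
pooled = global_max_pool_1d(toy)
print(pooled)  # [0.7, -0.2, 0.6, 0.4]
```

The result has a fixed length regardless of how many time steps went in, which is what lets the Dense layers follow directly.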

Train the model using the train sample and monitor the metric on the validation sample. This is just a sample model that runs for 2 epochs. Changing the number of epochs, the batch_size, and the model parameters might give us a better model.

## Train the model
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

Now let us get the validation sample predictions and find the best threshold for the F1 score.

pred_noemb_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_noemb_val_y>thresh).astype(int))))
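The loop above prints one F1 score per threshold and leaves the choice to the reader. As a minimal sketch of picking the threshold programmatically, the following plain-Python version computes F1 by hand and returns the first threshold with the highest score. The labels and probabilities are toy values invented for illustration, and the helper names are hypothetical.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probs, thresholds):
    """Return (threshold, f1) for the first threshold that reaches the highest F1."""
    scored = [
        (f1_score_binary(y_true, [int(p > t) for p in probs]), t)
        for t in thresholds
    ]
    best_f1, best_t = max(scored, key=lambda item: item[0])
    return best_t, best_f1

# Same grid as the loop above: 0.10, 0.11, ..., 0.50.
thresholds = [round(0.10 + 0.01 * i, 2) for i in range(41)]

# Toy validation labels and predicted probabilities (invented).
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.05, 0.315, 0.455, 0.205, 0.125, 0.80]

best_t, best_f1 = best_threshold(y_true, probs, thresholds)
print(best_t, best_f1)
```

A threshold tuned on a single validation split can be optimistic; it is chosen on the same data that is used to report the score.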

Now let us get the test set predictions as well and keep them for later.

pred_noemb_test_y = model.predict([test_X], batch_size=1024, verbose=1)

Now that the model building is done, it is a good idea to free some memory before moving on to the next step.

del model, inp, x
import gc
gc.collect()
time.sleep(10)

So we now have a baseline GRU model without pretrained embeddings. Next, let us use the provided embeddings and rebuild the model to compare the performance.

!ls ../input/embeddings/

We have four different types of embeddings.

GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
glove.840B.300d - https://nlp.stanford.edu/projects/glove/
paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html

A very good explanation of the different types of embeddings is given in this kernel. Please refer to it for details.

GloVe Embeddings:
In this section, let us use the GloVe embeddings and rebuild the GRU model.

EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))

# np.stack needs a sequence; recent NumPy versions reject a dict view, hence list(...)
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
# start from random vectors so that words missing from the embedding file still get a value
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
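The embedding-matrix construction above can be followed on a toy example. The sketch below is a dependency-free illustration, not the kernel's code: it uses plain lists instead of NumPy arrays, a three-dimensional in-memory "file" instead of the 300-dimensional GloVe file, and an invented vocabulary. It shows the same three steps: parse each line into a word and a vector, compute the overall mean and standard deviation, then fill a randomly initialised matrix with the known vectors.

```python
import io
import random
import statistics

# Toy 3-dimensional "embedding file" (invented values).
TOY_FILE = io.StringIO(
    "cat 0.1 0.2 0.3\n"
    "dog 0.2 0.1 0.4\n"
)

def get_coefs(word, *values):
    """Turn one split line into (word, vector of floats)."""
    return word, [float(v) for v in values]

embeddings_index = dict(get_coefs(*line.rstrip().split(" ")) for line in TOY_FILE)

# Mean and standard deviation over every single coefficient in the file.
all_values = [v for vec in embeddings_index.values() for v in vec]
emb_mean = statistics.fmean(all_values)
emb_std = statistics.pstdev(all_values)  # population std, like NumPy's default
embed_size = len(next(iter(embeddings_index.values())))

# Hypothetical tokenizer vocabulary; "bird" has no pretrained vector.
word_index = {"cat": 1, "bird": 2, "dog": 3}
max_features = 4  # row 0 stays reserved for padding

rng = random.Random(0)
embedding_matrix = [
    [rng.gauss(emb_mean, emb_std) for _ in range(embed_size)]
    for _ in range(max_features)
]
for word, i in word_index.items():
    if i >= max_features:
        continue
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # overwrite the random row with the known vector

print(embedding_matrix[1], embedding_matrix[3])
```

Rows for out-of-vocabulary words such as "bird" keep their random values, drawn from the same distribution as the pretrained coefficients so that they are on a comparable scale.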

model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

pred_glove_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_glove_val_y>thresh).astype(int))))

The results seem to be better than those of the model without pretrained embeddings.

pred_glove_test_y = model.predict([test_X], batch_size=1024, verbose=1)

del word_index, embeddings_index, all_embs, embedding_matrix, model, inp, x
import gc
gc.collect()
time.sleep(10)

Wiki News FastText Embeddings:

Now let us use the FastText embeddings trained on the Wiki News corpus in place of the GloVe embeddings and rebuild the model.

EMBEDDING_FILE = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

# len(o) > 100 skips short lines such as the header at the top of the .vec file
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE) if len(o)>100)

# np.stack needs a sequence; recent NumPy versions reject a dict view, hence list(...)
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))
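The `len(o) > 100` filter in the code above is a length heuristic: a `.vec` file in the word2vec text format starts with a short header line giving the vocabulary size and the vector dimension, and that line must not be parsed as a word vector. As an alternative sketch (not what the kernel does), the header can be detected explicitly on the first line. The reader function and the toy file contents below are assumptions made for illustration.

```python
import io

def read_vectors(handle):
    """Read 'word v1 v2 ...' lines, skipping a leading '<vocab size> <dim>' header."""
    index = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        if not parts[0]:
            continue  # blank line
        if line_no == 0 and len(parts) == 2:
            continue  # header line: two fields only, no vector
        index[parts[0]] = [float(v) for v in parts[1:]]
    return index

# Toy file: header says 2 words of dimension 3 (values invented).
toy = io.StringIO("2 3\nking 0.5 0.1 0.2\nqueen 0.4 0.2 0.2\n")
vectors = read_vectors(toy)
print(sorted(vectors))
```

The explicit check does not depend on vector length, whereas the length heuristic would also drop any genuine entry whose line happens to be shorter than 100 characters.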

pred_fasttext_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_fasttext_val_y>thresh).astype(int))))

pred_fasttext_test_y = model.predict([test_X], batch_size=1024, verbose=1)

del word_index, embeddings_index, all_embs, embedding_matrix, model, inp, x
import gc
gc.collect()
time.sleep(10)

Paragram Embeddings:

In this section, we use the Paragram embeddings to build the model and make predictions.

EMBEDDING_FILE = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

# the file contains bytes that are not valid UTF-8, hence errors='ignore'
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o)>100)

# np.stack needs a sequence; recent NumPy versions reject a dict view, hence list(...)
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

pred_paragram_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_paragram_val_y>thresh).astype(int))))

pred_paragram_test_y = model.predict([test_X], batch_size=1024, verbose=1)

del word_index, embeddings_index, all_embs, embedding_matrix, model, inp, x
import gc
gc.collect()
time.sleep(10)

Observations:

Overall, pretrained embeddings seem to give better results than the model without pretrained embeddings.
The performance of the different pretrained embeddings is almost the same.

Final Blend:

Although the models with different pretrained embeddings give similar results, there is a good chance that they capture different types of information from the data. So let us blend these three models by averaging their predictions.

pred_val_y = 0.33*pred_glove_val_y + 0.33*pred_fasttext_val_y + 0.34*pred_paragram_val_y
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_val_y>thresh).astype(int))))
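The blend is a weighted average of per-question probabilities. A minimal plain-Python sketch of the same arithmetic follows, with a check that the weights sum to 1 so the result stays a probability. The `blend` helper and the toy predictions are assumptions for illustration; the kernel itself operates on NumPy arrays.

```python
def blend(predictions, weights):
    """Weighted average of several equally long lists of probabilities."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction list")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    n = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions))
        for i in range(n)
    ]

# Toy probabilities for two questions from three models (invented values).
glove = [0.2, 0.6]
fasttext = [0.4, 0.5]
paragram = [0.3, 0.7]

blended = blend([glove, fasttext, paragram], [0.33, 0.33, 0.34])
print(blended)
```

With near-equal weights this is close to a plain mean; uneven weights would only be justified if one model were clearly stronger on the validation sample.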

The result seems to be better than that of any individual pretrained model, so let us create the submission file using this blend.

pred_test_y = 0.33*pred_glove_test_y + 0.33*pred_fasttext_test_y + 0.34*pred_paragram_test_y
pred_test_y = (pred_test_y>0.35).astype(int)
out_df = pd.DataFrame({"qid":test_df["qid"].values})
out_df['prediction'] = pred_test_y
out_df.to_csv("submission.csv", index=False)
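For reference, the submission step reduces to two operations: turn each blended probability into 0 or 1 with the chosen threshold, and write one `qid,prediction` row per test question. The sketch below does this with the standard library and an in-memory buffer; the `make_submission` helper, the qids, and the probabilities are invented for illustration.

```python
import csv
import io

def make_submission(qids, probs, threshold):
    """Return CSV text with a header and one thresholded prediction per qid."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["qid", "prediction"])
    for qid, prob in zip(qids, probs):
        writer.writerow([qid, int(prob > threshold)])
    return buffer.getvalue()

csv_text = make_submission(["q1", "q2", "q3"], [0.10, 0.36, 0.90], threshold=0.35)
print(csv_text)
```

The threshold of 0.35 matches the value hard-coded in the kernel; in practice it would be taken from the validation search on the blended predictions.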
