Hands-On Transformers (Kaggle Google QUEST Q&A Labeling)
This is a 3-part series where we will be going through Transformers, BERT, and a hands-on Kaggle challenge — Google QUEST Q&A Labeling — to see Transformers in action (top 4.4% on the leaderboard). In this part (3/3) we will be looking at a hands-on project from Google on Kaggle. Since this is an NLP challenge, I’ve used transformers in this project. I have not covered transformers in much detail in this part, but if you wish you can check out part 1/3 of this series where I’ve discussed transformers in detail.
Bird’s eye view of the blog:
To make the reading easy, I’ve divided the blog into different sub-topics:
- Problem statement and evaluation metrics.
- About the data.
- Exploratory Data Analysis (EDA).
- Modeling (includes data preprocessing).
- Post-modeling analysis.
Problem statement and evaluation metrics:
Computers are really good at answering questions with single, verifiable answers. But, humans are often still better at answering questions about opinions, recommendations, or personal experiences.
Humans are better at addressing subjective questions that require a deeper, multidimensional understanding of context. Questions can take many forms — some have multi-sentence elaborations, others may be simple curiosity or a fully developed problem. They can have multiple intents, or seek advice and opinions. Some may be helpful and others interesting. Some are simply right or wrong.
Unfortunately, it’s hard to build better subjective question-answering algorithms because of a lack of data and predictive models. That’s why the CrowdSource team at Google Research, a group dedicated to advancing NLP and other types of ML science via crowdsourcing, has collected data on a number of these quality scoring aspects.
In this competition, we’re challenged to use this new dataset to build predictive algorithms for different subjective aspects of question-answering. The question-answer pairs were gathered from nearly 70 different websites, in a “common-sense” fashion. The raters received minimal guidance and training and relied largely on their subjective interpretation of the prompts. As such, each prompt was crafted in the most intuitive fashion so that raters could simply use their common-sense to complete the task.
Demonstrating that these subjective labels can be predicted reliably can shine a new light on this research area. Results from this competition will inform the way future intelligent Q&A systems get built, hopefully contributing to them becoming more human-like.
Evaluation metric: Submissions are evaluated on the mean column-wise Spearman’s correlation coefficient. The Spearman’s rank correlation is computed for each target column, and the mean of these values is calculated for the submission score.
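As a quick sketch of the metric (`mean_spearman` and `_rank` are hypothetical helper names; ties are ignored in this sketch, whereas `scipy.stats.spearmanr` would average tied ranks properly):

```python
import numpy as np

def _rank(x):
    # Rank the values 1..n (no tie handling in this sketch).
    ranks = np.empty(len(x), dtype=float)
    ranks[np.argsort(x)] = np.arange(1, len(x) + 1)
    return ranks

def mean_spearman(y_true, y_pred):
    # Spearman rho per target column = Pearson correlation of the ranks;
    # the submission score is the mean of these over the 30 columns.
    rhos = [np.corrcoef(_rank(y_true[:, c]), _rank(y_pred[:, c]))[0, 1]
            for c in range(y_true.shape[1])]
    return float(np.mean(rhos))
```

Note that only the rank order of the predictions matters for this metric, not their absolute values.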
About the data:
The data for this competition includes questions and answers from various StackExchange properties. Our task is to predict the target values of 30 labels for each question-answer pair. The list of 30 target labels is the same as the column names in the sample_submission.csv file. Target labels with the prefix question_ relate to the question_title and/or question_body features in the data. Target labels with the prefix answer_ relate to the answer feature.

Each row contains a single question and a single answer to that question, along with additional features. The training data contains rows with some duplicated questions (but with different answers). The test data does not contain any duplicated questions. Target labels can have continuous values in the range [0, 1]; therefore, predictions must also be in that range.

The files provided are:
- train.csv — the training data (target labels are the last 30 columns)
- test.csv — the test set (you must predict 30 labels for each test set row)
- sample_submission.csv — a sample submission file in the correct format; column names are the 30 target labels
You can check out the dataset using this link.
Exploratory Data Analysis (EDA)
Check out the notebook with in-depth EDA + data scraping (Kaggle link).
The training data contains 6079 listings and each listing has 41 columns. Out of these 41 columns, the first 11 columns/features are used as the input and the last 30 columns/features are the target predictions. Let’s take a look at the input and target labels:
Image by Author

The output features are all of float type with values between 0 and 1.
Let's explore the input labels one by one.
qa_id
Question answer ID represents the id of a particular data point in the given dataset. Each data point has a unique qa_id. This feature is not to be used for training and will be used later while submitting the output to Kaggle.
https://anime.stackexchange.com/questions/56789/if-naruto-loses-the-ability-he-used-on-kakashi-and-guy-after-kaguyas-seal-what

question_title
This is a string data type feature that holds the title of the question asked. For the analysis of question_title, I’ll be plotting a histogram of the number of words in this feature.
From the analysis, it is evident that:
- Most of the question_title features have a word length of around 9.
- The minimum question length is 2.
- The maximum question length is 28.
- 50% of question_title values have lengths between 6 and 11.
- 25% of question_title values have lengths between 2 and 6.
- 25% of question_title values have lengths between 11 and 28.
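A percentile-style summary like the one above can be produced with a small helper (`word_length_stats` is a hypothetical name, and its percentile is deliberately crude; a pandas `.describe()` call in the notebook would give the exact quartiles):

```python
def word_length_stats(texts):
    # Summarise the word-count distribution of a text feature
    # (e.g. question_title) for EDA.
    lengths = sorted(len(t.split()) for t in texts)
    n = len(lengths)
    pct = lambda p: lengths[min(n - 1, int(p * n))]  # crude percentile
    return {"min": lengths[0], "max": lengths[-1],
            "25%": pct(0.25), "50%": pct(0.50), "75%": pct(0.75)}
```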
question_body
This is again a string data type feature that holds the detailed text of the question asked. For the analysis of question_body, I’ll be plotting a histogram of the number of words in this feature.
From the analysis, it is evident that:
- Most of the question_body features have a word length of around 93.
- The minimum question length is 1.
- The maximum question length is 4666.
- 50% of question_body values have lengths between 55 and 165.
- 25% of question_body values have lengths between 1 and 55.
- 25% of question_body values have lengths between 165 and 4666.
The distribution looks like a power-law distribution; it can be converted to a more Gaussian-like distribution using a log transform and then used as an engineered feature.
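A minimal sketch of that engineered feature (`log_word_count` is a hypothetical helper; `log1p` is used so empty texts are safe):

```python
import numpy as np

def log_word_count(texts):
    # Word counts of question_body are heavy-tailed (power-law-like);
    # log1p compresses the tail so the feature is closer to Gaussian.
    counts = np.array([len(t.split()) for t in texts], dtype=float)
    return np.log1p(counts)  # log(1 + x), safe even for empty strings
```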
question_user_name
This is a string data type feature that denotes the name of the user who asked the question. For the analysis of question_user_name, I’ll be plotting a histogram of the number of words in this feature.
I did not find this feature of much use, therefore I won’t be using it for modeling.
question_user_page
This is a string data type feature that holds the URL to the profile page of the user who asked the question.
On the profile page, I noticed 4 useful features that could be used and should possibly contribute to good predictions. The features are:
- reputation: Denotes the reputation of the user.
- gold_score: The number of gold medals awarded.
- silver_score: The number of silver medals awarded.
- bronze_score: The number of bronze medals awarded.
answer
This is again a string data type feature that holds the detailed text of the answer to the question. For the analysis of answer, I’ll be plotting a histogram of the number of words in this feature.
From the analysis, it is evident that:
- Most of the answer features have a word length of around 143.
- The minimum answer length is 2.
- The maximum answer length is 8158.
- 50% of answer values have lengths between 48 and 170.
- 25% of answer values have lengths between 2 and 48.
- 25% of answer values have lengths between 170 and 8158.
This distribution also looks like a power-law distribution; it too can be converted to a more Gaussian-like distribution using a log transform and then used as an engineered feature.
answer_user_name
This is a string data type feature that denotes the name of the user who answered the question.
I did not find this feature of much use, therefore I won’t be using it for modeling.
answer_user_page
This is a string data type feature similar to “question_user_page” that holds the URL to the profile page of the user who answered the question.
I also used the URL in this feature to scrape the external data from the user’s profile page, similar to what I did for the feature ‘question_user_page’.
url
This feature holds the URL of the question-and-answer page on StackExchange or StackOverflow. Below I’ve printed the first 10 url data points from train.csv.
One thing to notice is that this feature lands us on the question-answer page, and that page usually contains a lot more data like comments, upvotes, other answers, etc., which can be used for generating more features if the model does not perform well due to the limited data in train.csv. Let’s see what data is present and what additional data can be scraped from the question-answer page.
Webpage source

In the snapshot attached above, Post 1 and Post 2 contain the answers, upvotes, and comments for the question asked, in decreasing order of upvotes. The post with a green tick is the one containing the answer provided in the train.csv file.
Each question may have more than one answer. We can scrape these answers and use them as additional data.
Webpage source

The snapshot above defines the anatomy of a post. We can scrape useful features like upvotes and comments and use them as additional data.
Below is the code for scraping the data from the URL page.
There are 8 new features that I’ve scraped:
- upvotes: The number of upvotes on the provided answer.
- comments_0: Comments on the provided answer.
- answer_1: Most voted answer apart from the one provided.
- comment_1: Top comment on answer_1.
- answer_2: Second most voted answer.
- comment_2: Top comment on answer_2.
- answer_3: Third most voted answer.
- comment_3: Top comment on answer_3.
category
This is a categorical feature that indicates the category of the question-answer pair. Below I’ve printed the first 10 category data points from train.csv.
Below is the code for plotting a Pie chart of category.
The chart tells us that most of the points belong to the category TECHNOLOGY and least belong to LIFE_ARTS (709 out of 6079).
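The plotting cell itself is not reproduced here; a minimal sketch of the numbers behind such a pie chart (`category_shares` is a hypothetical helper; its output could then be passed to matplotlib’s `plt.pie`):

```python
from collections import Counter

def category_shares(categories):
    # Count each category and return its percentage share, largest first;
    # these are the numbers the pie chart visualises.
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}
```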
host
This feature holds the host or domain of the question-and-answer page on StackExchange or StackOverflow. Below I’ve printed the first 10 host data points from train.csv.
Below is the code for plotting a bar graph of unique hosts.
It seems there are just 63 different subdomains present in the training data. Most of the data points are from stackoverflow.com, whereas the fewest are from meta.math.stackexchange.com.
Target values
Let’s analyze the target values that we need to predict. But first, for the sake of a better interpretation, please check out the full dataset on kaggle using this link.
Below is the code block displaying the statistical description of the target values. These are only the first 6 of the 30 features. The values of all the features are of type float and are between 0 and 1.
Notice the second code block which displays the unique values present in the dataset. There are just 25 unique values between 0 and 1. This could be useful later while fine-tuning the code.
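One way this 25-value observation could be exploited later is to snap continuous predictions onto the observed grid. The helper below is a hypothetical sketch of such post-processing, not necessarily what was done in the project:

```python
import numpy as np

def snap_to_grid(preds, grid):
    # Snap each continuous prediction to the nearest of the ~25 distinct
    # values observed in the training targets.
    grid = np.asarray(grid)
    idx = np.abs(preds[..., None] - grid).argmin(axis=-1)
    return grid[idx]
```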
Finally, let’s check the distribution of the target features and their correlation.
Histograms of the target features. Heatmap of correlation between target features.

Modeling
Image by Author

Now that we know our data better through EDA, let’s begin with modeling. Below are the subtopics that we’ll go through in this section:
Overview of the architecture: Quick rundown of the ensemble architecture and its different components.
Base learners: Overview of the base learners used in the ensemble.
Preparing the data: Data cleaning and preparation for modeling.
Ensembling: Creating models for training and prediction. Pipelining the data preparation, model training, and model prediction steps.
Getting the scores from Kaggle: Submitting the predicted target values for test data on Kaggle and generating a leaderboard score to see how well the ensemble did.
I tried various deep neural network architectures with GRU, Conv1D, and Dense layers, and with different features for the competition, but an ensemble of 8 transformers (as shown above) seems to work best. In this part, we will focus on the final architecture of the ensemble used; for the other baseline models that I experimented with, you can check out my github repo.
Overview of the architecture:
Remember, our task was: for a given question_title, question_body, and answer, we had to predict 30 target labels. Out of these 30 target labels, the first 21 are related to the question_title and question_body and have no connection to the answer, whereas the last 9 target labels are related to the answer; but out of these 9, some also take question_title and question_body into the picture. E.g., features like answer_relevance and answer_satisfaction can only be rated by looking at both the question and the answer.
With some experimentation, I found that the base-learner (BERT_base) performs exceptionally well in predicting the first 21 target features (related to questions only) but does not perform that well in predicting the last 9 target features. Taking note of this, I constructed 3 dedicated base-learners and 2 different datasets to train them.
The first base-learner was dedicated to predicting the question-related features (first 21) only. The dataset used for training this model consisted of features question_title and question_body only.
The second base-learner was dedicated to predicting the answer-related features (last 9) only. The dataset used for training this model consisted of features question_title, question_body, and answer.
The third base-learner was dedicated to predicting all the 30 features. The dataset used for training this model again consisted of features question_title, question_body, and answer.
To make the architecture even more robust, I used 3 different types of base learners — BERT, RoBERTa, and XLNet. We will go through these different transformer models later in this blog.
In the ensemble diagram above, we can see —
The 2 datasets consisting of [question_title + question_body] and [question_title + question_body + answer] being used separately to train different base learners.
Then we can see the 3 different base learners (BERT, RoBERTa, and XLNet) dedicated to predicting the question-related features only (first 21) colored in blue, using the dataset [question_title + question_body]
Next, we can see the 3 different base learners (BERT, RoBERTa, and XLNet) dedicated to predicting the answer-related features only (last 9) colored in green, using the dataset [question_title + question_body + answer].
Finally, we can see the 2 different base learners (BERT, and RoBERTa) dedicated to predicting all the 30 features colored in red, using the dataset [question_title + question_body + answer].
In the next step, the predicted data from models dedicated to predicting the question-related features only (denoted as bert_pred_q, roberta_pred_q, xlnet_pred_q) and the predicted data from models dedicated to predicting the answer-related features only (denoted as bert_pred_a, roberta_pred_a, xlnet_pred_a) is collected and concatenated column-wise which leads to a predicted data with all the 30 features. These concatenated features are denoted as xlnet_concat, roberta_concat, and bert_concat.
Similarly, the predicted data from models dedicated to predicting all the 30 features (denoted as bert_qa, roberta_qa) is collected. Notice that I’ve not used the XLNet model here for predicting all the 30 features because the scores were not up to the mark.
Finally, after collecting all the different predicted data — [xlnet_concat, roberta_concat, bert_concat, bert_qa, and roberta_qa], the final value is calculated by taking the average of all the different predicted values.
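The concatenation and averaging steps above can be sketched as follows (`average_ensemble` is a hypothetical helper name; `n` is the number of test rows):

```python
import numpy as np

def average_ensemble(question_preds, answer_preds, qa_preds):
    # question_preds: list of (n, 21) arrays (bert/roberta/xlnet *_pred_q)
    # answer_preds:   list of (n, 9)  arrays (bert/roberta/xlnet *_pred_a)
    # qa_preds:       list of (n, 30) arrays (bert_qa, roberta_qa)
    # Column-wise concat of each question model with its answer counterpart
    # gives the *_concat matrices with all 30 features.
    concat = [np.hstack([q, a]) for q, a in zip(question_preds, answer_preds)]
    # The final prediction is the simple mean over all (n, 30) matrices.
    return np.mean(np.stack(concat + list(qa_preds)), axis=0)
```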
Base learners
Now we will take a look at the 3 different transformer models that were used as base learners.
1. bert_base_uncased:
BERT was proposed by Google AI in late 2018 and has since become state-of-the-art for a wide spectrum of NLP tasks. It uses an architecture derived from transformers, pre-trained over a lot of unlabeled text data, to learn a language representation that can be fine-tuned for specific machine learning tasks. BERT outperformed the NLP state-of-the-art on several challenging tasks. This performance of BERT can be ascribed to the transformer’s encoder architecture, unconventional training objectives like the Masked Language Model (MLM) and Next Sentence Prediction (NSP), and the humongous amount of text data (all of Wikipedia and a book corpus) that it is trained on. BERT comes in different sizes, but for this challenge, I’ve used bert_base_uncased.
Image by Author

The architecture of bert_base_uncased consists of 12 encoder cells with 12 attention heads in each encoder cell. It takes an input of size 512 and returns 2 values by default: the pooled output corresponding to the first input token [CLS], which has a dimension of 768, and the sequence output corresponding to all 512 input tokens, which has a dimension of (512, 768). Apart from these, we can also access the hidden states returned by each of the 12 encoder cells by passing output_hidden_states=True as one of the parameters. BERT accepts several sets of input; for this challenge, the input I’ll be using will be of 3 types:
input_ids: The token embeddings are numerical representations of words in the input sentence. There is also something called sub-word tokenization that BERT uses to first break down larger or complex words into simpler pieces and then convert them into tokens. For example, in the above diagram look at how the word ‘playing’ was broken into ‘play’ and ‘##ing’ before generating the token embeddings. This tweak in tokenization works wonders as it utilizes the sub-word context of a complex word instead of just treating it like a new word.
token_type_ids: The segment embeddings are used to help BERT distinguish between the different sentences in a single input. The elements of this vector are all the same for the words from the same sentence, and the value changes if the sentence is different.
Let’s consider an example: suppose we want to pass the two sentences “I have a pen” and “The pen is red” to BERT. The tokenizer will first tokenize these sentences as:

[‘[CLS]’, ‘I’, ‘have’, ‘a’, ‘pen’, ‘[SEP]’, ‘the’, ‘pen’, ‘is’, ‘red’, ‘[SEP]’]

And the segment embeddings for these will look like:

[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

Notice how all the elements corresponding to words in the first sentence have the same value 0, whereas all the elements corresponding to words in the second sentence have the same value 1.
attention_mask: The mask tokens that help BERT understand which input words are relevant and which are just there for padding.
Since BERT takes a 512-dimensional input, suppose we have an input of only 10 words. To make the tokenized words compatible with the input size, we add padding of size 512 − 10 = 502 at the end. Along with the padding, we generate a mask of size 512 in which the indices corresponding to the relevant words hold 1s and the indices corresponding to the padding hold 0s.
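The three inputs can be illustrated with a toy, library-free example, padded to length 16 instead of 512 for readability (in practice a HuggingFace tokenizer builds these; the token ids below are made up):

```python
MAX_LEN = 16  # stand-in for BERT's 512
tokens = ['[CLS]', 'i', 'have', 'a', 'pen', '[SEP]',
          'the', 'pen', 'is', 'red', '[SEP]']
# Hypothetical vocabulary: one made-up id per distinct token; 0 = padding.
toy_vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens), start=1)}

pad = MAX_LEN - len(tokens)
input_ids = [toy_vocab[t] for t in tokens] + [0] * pad
attention_mask = [1] * len(tokens) + [0] * pad  # 1 = real token, 0 = padding
sep = tokens.index('[SEP]')                     # end of the first sentence
token_type_ids = [0] * (sep + 1) + [1] * (len(tokens) - sep - 1) + [0] * pad
```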
2. XLNet_base_cased:
XLNet was proposed by the Google AI Brain team and researchers at CMU in mid-2019. Its architecture is larger than BERT’s and it uses an improved methodology for training. It is trained on more data and shows better performance than BERT on many language tasks. The conceptual difference between BERT and XLNet is that while training BERT, the words are predicted in an order such that the previously predicted word contributes to the prediction of the next word, whereas XLNet learns to predict the words in an arbitrary order but in an autoregressive manner (not necessarily left-to-right).
The prediction scheme for a traditional language model: shaded words are provided as input to the model while unshaded words are masked out. An example of how a permutation language model would predict tokens for a certain permutation.

This helps the model learn bidirectional relationships and therefore better handle dependencies and relations between words. In addition to the training methodology, XLNet uses a Transformer-XL based architecture and 2 main key ideas: relative positional embeddings and the recurrence mechanism, which showed good performance even in the absence of permutation-based training. XLNet was trained on over 130 GB of textual data with 512 TPU chips running for 2.5 days, both of which are much larger than for BERT.
For XLNet, I’ll be using only input_ids and attention_mask as input.
3. RoBERTa_base:
RoBERTa was proposed by Facebook in mid-2019. It is a robustly optimized method for pretraining natural language processing (NLP) systems that improves on BERT’s self-supervised method. RoBERTa builds on BERT’s language masking strategy, wherein the system learns to predict intentionally hidden sections of text within otherwise unannotated language examples. RoBERTa modifies key hyperparameters of BERT, including removing BERT’s Next Sentence Prediction (NSP) objective and training with much larger mini-batches and learning rates. This allows RoBERTa to improve on the masked language modeling objective compared with BERT and leads to better downstream task performance. RoBERTa was also trained on more data than BERT and for a longer amount of time. The dataset used was drawn from existing unannotated NLP datasets as well as CC-News, a novel set drawn from public news articles.
For RoBERTa_base, I’ll be using only input_ids and attention_mask as input.
Finally here is the comparison of BERT, XLNet, and RoBERTa:
Source link

Preparing the data
Now that we have gained some idea about the architecture let’s see how to prepare the data for the base learners.
As a preprocessing step, I have just treated the HTML syntax present in the features. I used html.unescape() to extract the text from HTML DOM elements.
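A minimal sketch of that preprocessing step (`clean_text` is a hypothetical name; in the project this is applied to question_title, question_body, and answer):

```python
import html

def clean_text(text: str) -> str:
    # The raw StackExchange text contains escaped HTML entities
    # (e.g. "&lt;", "&amp;"); unescape them back to plain characters.
    return html.unescape(text).strip()
```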
In the code snippet below, the function get_data() reads the train and test data and applies the preprocessing to the features question_title, question_body, and answer.
The next step was to create input_ids, attention_masks, and token_type_ids from the input sentence.
In the code snippet below, the function get_tokenizer() collects a pre-trained tokenizer for each of the different base learners.
The second function fix_length() goes through the generated question tokens and answer tokens and makes sure that the maximum number of tokens is 512. The steps for fixing the number of tokens are as follows:
- If the input sentence has the number of tokens > 512, the sentence is trimmed down to 512.
-如果輸入句子的令牌數> 512,則將該句子修剪為512。
- To trim the number of tokens, 256 tokens from the beginning and 256 tokens from the end are kept and the remaining tokens are dropped.
-為了減少令牌的數量,保留從開頭開始的256個令牌和從結尾開始的256個令牌,并丟棄其余的令牌。
- For example, suppose an answer has 700 tokens. To trim this down to 512, 256 tokens from the beginning and 256 tokens from the end are taken and concatenated to make 512 tokens. The remaining [700 − (256 + 256) = 188] tokens in the middle of the answer are dropped.
- The logic makes sense because in a large text, the beginning part usually describes what the text is all about and the end part describes the conclusion of the text.
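The trimming logic above can be sketched in a few lines; this is a simplified stand-in for the notebook's fix_length(), assuming the tokens arrive as a plain Python list:

```python
def fix_length(tokens, max_len=512):
    # If the sequence already fits, keep it as-is; otherwise keep the
    # first max_len // 2 and the last max_len // 2 tokens and drop the
    # middle, since the beginning and end usually carry the key content.
    if len(tokens) <= max_len:
        return tokens
    half = max_len // 2
    return tokens[:half] + tokens[-half:]
```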
Next is the code block for generating the input_ids, attention_masks, and token_type_ids. I’ve used a condition that checks if the function needs to return the generated data for base learners relying on the dataset [question_title + question_body] or the dataset [question_title + question_body + answer].
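As an illustration of how the three arrays fit together, here is a hand-rolled sketch that assembles BERT-style inputs from already-tokenized ids. The special-token ids (101, 102, 0) and the helper name build_inputs are assumptions; real code would rely on the tokenizer's encode_plus():

```python
def build_inputs(q_ids, a_ids=None, max_len=512, cls_id=101, sep_id=102, pad_id=0):
    # Assemble [CLS] question [SEP] (answer [SEP]) and pad to max_len.
    # Truncation is assumed to have been handled upstream (fix_length).
    ids = [cls_id] + q_ids + [sep_id]
    token_type = [0] * len(ids)          # segment 0: the question
    if a_ids is not None:
        ids += a_ids + [sep_id]
        token_type += [1] * (len(a_ids) + 1)  # segment 1: the answer
    attention = [1] * len(ids)           # 1 for real tokens, 0 for padding
    pad = max_len - len(ids)
    return ids + [pad_id] * pad, attention + [0] * pad, token_type + [0] * pad
```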
Finally, here is the function that makes use of the function initialized above and generates input_ids, attention_masks, and token_type_ids for each of the instances in the provided data.
To make the model training easy, I also created a class that generates train and cross-validation data based on the fold while using K-Fold CV, with the help of the functions specified above.
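A minimal version of such a class might look like this; the class name and constructor arguments are placeholders, not the notebook's actual API:

```python
import numpy as np
from sklearn.model_selection import KFold


class FoldData:
    """Generates train / cross-validation splits per fold (5-fold CV)."""

    def __init__(self, X, y, n_splits=5, seed=42):
        self.X, self.y = np.asarray(X), np.asarray(y)
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        self.folds = list(kf.split(self.X))

    def get_fold(self, fold):
        # Returns (X_train, y_train, X_val, y_val) for the given fold index.
        tr, va = self.folds[fold]
        return self.X[tr], self.y[tr], self.X[va], self.y[va]
```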
Ensembling
After data preprocessing, let's create the model architecture starting with base learners.
The code below takes the model name as input, collects the pre-trained model and its configuration information according to the input name, and creates the base learner model. Notice that output_hidden_states=True is passed after adding the config data.
The next code block creates the ensemble architecture. The function accepts 2 parameters: name, which expects the name of the model we want to train, and model_type, which expects the type of model we want to train. The model name can be bert-base-uncased, roberta-base, or xlnet-base-cased, whereas the model type can be questions, answers, or question_answers. The function create_model() takes the model_name and model_type and generates a model that can be trained on the specified data accordingly.
Now let's create a function for calculating the evaluation metric Spearman’s correlation coefficient.
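A straightforward implementation averages Spearman's rho over the target columns (the competition metric averages rho over the 30 target labels). Skipping columns where the correlation is undefined mirrors a common workaround rather than the exact notebook code:

```python
import numpy as np
from scipy.stats import spearmanr


def spearman_metric(y_true, y_pred):
    # Mean Spearman rho across target columns; columns with an undefined
    # correlation (e.g. constant predictions) are skipped.
    rhos = []
    for col in range(y_true.shape[1]):
        rho = spearmanr(y_true[:, col], y_pred[:, col]).correlation
        if not np.isnan(rho):
            rhos.append(rho)
    return float(np.mean(rhos))
```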
Now we need a function that collects the base learner model and the data matching that model, and trains the model. I've used K-Fold cross-validation with 5 folds for training.
Now, once we have trained the models and generated the predicted values, we need a function for calculating the weighted average. Here's the code for that. *The weights in the weighted average are all 1s.
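A simple weighted-average helper, sketched with NumPy (the function name is illustrative; with all weights left at 1 it reduces to a plain mean of the base learners' predictions):

```python
import numpy as np


def weighted_average(predictions, weights=None):
    # predictions: one array per base learner, all with the same shape.
    preds = np.stack([np.asarray(p) for p in predictions])
    if weights is None:
        weights = np.ones(len(predictions))  # all weights are 1 here
    weights = np.asarray(weights, dtype=float)
    # Weighted sum over the learner axis, normalized by the total weight.
    return np.tensordot(weights, preds, axes=1) / weights.sum()
```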
Before bringing everything together, there is one more function that I used for processing the final predicted values. Remember in the EDA section there was an analysis of the target values where we noticed that the target values were only 25 unique floats between 0 and 1. To make use of that information, I calculated 61 (a hyperparameter) uniformly distributed percentile values and mapped them to the 25 unique values. This created 61 bins uniformly spaced between the upper and lower range of the target values. Now to process the predicted data, I used those bins to collect the predicted values and put them in the right place/order. This trick helped in improving the score in the final submission to the leaderboard to some extent.
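The exact notebook code isn't shown here, but one plausible reading of this trick is: build 61 uniformly spaced percentile cut points over the raw predictions, snap each cut point to the nearest of the observed target values, and then snap every prediction to the value of its nearest cut point. All names below are illustrative:

```python
import numpy as np


def postprocess(preds, train_targets, n_bins=61):
    preds = np.asarray(preds)
    # 61 uniformly spaced percentile cut points over the raw predictions.
    cuts = np.percentile(preds, np.linspace(0, 100, n_bins))
    # The ~25 unique values observed in the training targets.
    unique_vals = np.unique(train_targets)
    # Snap each cut point to its nearest observed target value...
    snapped = unique_vals[np.abs(unique_vals[None, :] - cuts[:, None]).argmin(axis=1)]
    # ...then snap each prediction to the value of its nearest cut point.
    idx = np.abs(preds[:, None] - cuts[None, :]).argmin(axis=1)
    return snapped[idx]
```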
Finally, to bring the data preprocessing, model training, and post-processing together, I created the get_predictions() function that:
- Collects the data.
- Creates the 8 base learners.
- Prepares the data for the base learners.
- Trains the base learners and collects the predicted values from them.
- Calculates the weighted average of the predicted values.
- Processes the weighted average prediction.
- Converts the final predicted values into the dataframe format requested by Kaggle for submission and returns it.
Getting the scores from Kaggle
Once the code runs successfully, it generates an output file that can be submitted to Kaggle for score calculation. The ranking on the leaderboard is generated using this score. The ensemble model got a public score of 0.43658, which puts it in the top 4.4% on the leaderboard.
Post-modeling Analysis
Check out the notebook with the complete post-modeling analysis (Kaggle link).
It's time for some post-modeling analysis!
In this section, we will go through an analysis of the train data to figure out which parts of the data the model does well on and which parts it doesn't. The main idea behind this step is to understand the capability of the trained model, and it works like a charm if applied properly for fine-tuning the model and the data. But we won't get into the fine-tuning part in this section; we will just perform some basic EDA on the train data using the predicted target values for the train data. I'll be covering the data feature by feature. Here are the top features we'll be performing analysis on:
- question_title, question_body, and answer.
- Word lengths of question_title, question_body, and answer.
- Host
- Category
First, we will have to divide the data into a spectrum of good data and bad data. Good data will be the data points on which the model achieves a good score, and bad data will be the data points on which the model achieves a bad score. Now for scoring, we will compare the actual target values of the train data with the model's predicted target values on the train data. I used mean squared error (MSE) as the scoring metric since it focuses on how close the actual and predicted values are. Remember, the higher the MSE score, the worse the data point. Calculating the MSE score is pretty simple. Here's the code:
```python
# Generating the MSE score for each data point in train data.
from sklearn.metrics import mean_squared_error

train_score = [mean_squared_error(i, j) for i, j in zip(y_pred, y_true)]

# Sorting the losses from minimum to maximum, index-wise.
train_score_args = np.argsort(train_score)
```
question_title, question_body, and answer
Starting with the first set of features, which are all text-type features, I'll be plotting word clouds using them. The plan is to segment out these features from the 5 data points that have the lowest MSE scores and from another 5 data points that have the highest MSE scores.
Let's run the code and check what the results look like.
Word lengths of question_title, question_body, and answer
The next analysis is on the word lengths of question_title, question_body, and answer. For that, I’ll be picking 30 data-points that have the lowest MSE-scores and 30 data-points that have the highest MSE-scores for each of the 3 features question_title, question_body, and answer. Next, I’ll be calculating the word lengths of these 30 data-points for all the 3 features and plot them to see the trend.
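These two helper steps — counting words and picking the extreme data points by MSE score — can be sketched as follows (function names are illustrative):

```python
import numpy as np


def word_lengths(texts):
    # Number of whitespace-separated words in each text.
    return np.array([len(str(t).split()) for t in texts])


def extremes(mse_scores, k=30):
    # Indices of the k lowest-MSE (best) and k highest-MSE (worst) points.
    order = np.argsort(mse_scores)
    return order[:k], order[-k:]
```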
If we look at the number of words in question_title, question_body, and answer, we can observe that the data that generates a high loss has a high number of words, which means those questions and answers are quite thorough. So, the model does a good job when the questions and answers are concise.
host
The next analysis is on the feature host. For this feature, I’ll be picking 100 data-points that have the lowest MSE-scores and 100 data-points that have the highest MSE-scores and select the values in the feature host. Then I’ll be plotting a histogram of this categorical feature to see the distributions.
We can see that there are a lot of data points from the domains english, biology, scifi, and physics that contribute to a lower loss value, whereas there are a lot of data points from drupal, programmers, and tex that contribute to a higher loss.
Let’s also take a look at word-clouds of the unique host values that contribute to a low score and a high score. This analysis is again done using the top and bottom 100 data-points.
Category
The final analysis is on the feature category. For this feature, I’ll be picking 100 data-points that have the lowest MSE-scores and 100 data-points that have the highest MSE-scores and select the values in the feature category. Then I’ll be plotting a pie-chart of this categorical feature to see the proportions.
We can notice that data points with technology as the category make up 50% of the data that the model could not predict well, whereas categories like LIFE_ARTS, SCIENCE, and CULTURE contribute much less to bad predictions. For the good predictions, all 5 categories contribute almost equally since there is no major difference in the proportions; still, we could say that the data points with StackOverflow as the category contribute the least.
With this, we have come to the end of this blog and the 3-part series. I hope the read was pleasant. You can check the complete notebook on Kaggle using this link and leave an upvote if you found my work useful. I would like to thank all the creators of the awesome content I referred to while writing this blog.
Reference links:
Applied AI Course: https://www.appliedaicourse.com/
https://www.kaggle.com/c/google-quest-challenge/notebooks
http://jalammar.github.io/illustrated-transformer/
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8
Final note
Thank you for reading the blog. I hope it was useful for some of you aspiring to do projects or learn some new concepts in NLP.
In part 1/3 we covered how Transformers became state-of-the-art in various modern natural language processing tasks, and how they work.
In part 2/3 we went through BERT (Bidirectional Encoder Representations from Transformers).
Kaggle in-depth EDA notebook link: https://www.kaggle.com/sarthakvajpayee/top-4-4-in-depth-eda-feature-scraping?scriptVersionId=40263047
Kaggle modeling notebook link: https://www.kaggle.com/sarthakvajpayee/top-4-4-bert-roberta-xlnet
Kaggle post-modeling notebook link: https://www.kaggle.com/sarthakvajpayee/top-4-4-post-modeling-analysis?scriptVersionId=40262842
Find me on LinkedIn: www.linkedin.com/in/sarthak-vajpayee
Find this project on Github: https://github.com/SarthakV7/Kaggle_google_quest_challenge
Peace!
Translated from: https://towardsdatascience.com/hands-on-transformers-kaggle-google-quest-q-a-labeling-affd3dad7bcb