Exploring the link between COVID-19 and depression using neural networks
The drastic changes in our lifestyles, coupled with the restrictions, quarantines, and social distancing measures introduced to combat the coronavirus outbreak, have led to an alarming rise in mental health issues all over the world. Social media is a powerful indicator of the mental state of people at a given location and time. To study the link between the coronavirus pandemic and the rising rates of depression and anxiety in the general population, I decided to explore tweets related to the coronavirus.
How is this blog organized?
In this blog post, I will first use Keras to train a neural network to recognize depressive tweets. For this, I will use a data set of 10,314 tweets divided into depressive tweets (labelled 1) and non-depressive tweets (labelled 0). This data set was created by Viridiana Romero Martinez. Here is the link to her GitHub profile: https://github.com/viritaromero
Once the network is trained, I will use it to classify tweets scraped from Twitter. To establish the link between COVID-19 and depression, I will obtain two different data sets. The first data set will consist of tweets with coronavirus-related keywords such as ‘COVID-19’, ‘quarantine’, ‘pandemic’, and ‘virus’. The second data set will consist of random tweets found using neutral keywords such as ‘and’, ‘I’, ‘the’, etc. The second data set will serve as a control that tells us the percentage of depressive tweets in a random sample of tweets. This will allow us to measure the difference in the percentage of depressive tweets between a random sample and a sample of COVID-19 specific tweets.
Preprocessing the data
Image source: https://xaltius.tech/why-is-data-cleaning-important/
Before we can start training the neural networks, we need to collect and clean the data.
Importing the libraries
To get started with the project, we first need to import all the necessary libraries and modules.
Once we have all the libraries in place, we need to get the data and pre-process it. You can download the data set from this link: https://github.com/viritaromero/Detecting-Depression-in-Tweets/blob/master/sentiment_tweets3.csv
Quick examination of the data
We can quickly check the structure of the data set by reading it into a pandas data frame.
Now we will store the text of the tweets in an array called text. The corresponding labels of the tweets will be stored in a separate array called labels.
Apologies for printing out a rather huge data set, but I did it so that we can quickly examine the overall structure. The first thing I notice is that the labels array contains many more zeroes than ones. We have roughly 3.5 times more non-depressive tweets than depressive tweets in the data set. In an ideal situation, I would like to train my neural network on a data set with an equal number of depressive and non-depressive tweets. However, to obtain equal numbers, I would have to substantially truncate my data. I think a larger, imbalanced data set is better than a very small, balanced one, so I am going to go ahead and use the data set in its original state.
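To make the imbalance concrete, here is a minimal sketch of how one might count the two classes. The toy `labels` list is a hypothetical stand-in for the real array, chosen only so that it reproduces the roughly 3.5 to 1 ratio mentioned above; it is not the author's code.

```python
from collections import Counter

# Hypothetical toy labels standing in for the real `labels` array
# (0 = non-depressive, 1 = depressive); the real data has 10,314 entries.
labels = [0] * 7 + [1] * 2

counts = Counter(labels)
ratio = counts[0] / counts[1]

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / len(labels)

print(f"non-depressive: {counts[0]}, depressive: {counts[1]}")
print(f"imbalance ratio: {ratio:.1f} to 1")
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

The baseline is worth keeping in mind: with this ratio, a model that never predicts "depressive" is already right about 78 % of the time, so any accuracy figure should be read against that number.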
Cleaning the data
The second thing you’ll notice is that the tweets contain a lot of so-called ‘stopwords’ such as ‘a’, ‘the’, ‘and’, etc. These words are not important for classifying a tweet as depressive or non-depressive, so we will remove them. We also need to remove the punctuation, since it is likewise unnecessary and would only hurt the performance of our neural network.
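The original embedded snippet is not preserved here, so the following is only a minimal sketch of the cleaning step described above. The `STOPWORDS` set and the `clean_tweet` helper are hypothetical names introduced for illustration; a real project would more likely take a complete stopword list from an NLP library.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real project would use a fuller list from an NLP library.
STOPWORDS = {"a", "an", "the", "and", "is", "to", "of", "i", "am"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(tweet: str) -> str:
    """Lowercase a tweet, strip ASCII punctuation, and drop stopwords."""
    no_punct = tweet.lower().translate(_PUNCT_TABLE)
    kept = [word for word in no_punct.split() if word not in STOPWORDS]
    return " ".join(kept)


print(clean_tweet("I am so tired of the lockdown, and the news!"))
# prints: so tired lockdown news
```

One side effect to be aware of: deleting punctuation outright turns contractions such as "don't" into "dont", and it removes the "#" and "@" markers from hashtags and mentions.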
I decided to do a quick visualization of the cleaned data using the amazing WordCloud library, and the result is shown below. Quite unsurprisingly, the most common word in depressive tweets is ‘depression’.
Figure: Visualization of tweets using WordCloud
Tokenization of the data
What on earth is tokenization?
Basically, neural networks do not understand raw text as we humans do. Therefore, to make the text more palatable to our neural network, we convert it into a series of ones and zeroes.
Image source: inboundhow.com
To tokenize text in Keras, we import the Tokenizer class. This class builds a dictionary lookup for a set number of unique words in our overall text. Using that lookup, Keras lets us create vectors in which each word is replaced by its index value in the dictionary.
We also pad the shorter tweets and truncate the longer ones so that every vector has a length of 100.
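To show what the dictionary lookup and the padding do, here is a pure-Python sketch of the same idea. It is not the Keras implementation: `fit_word_index`, `to_sequence`, and `pad` are hypothetical helpers, and the example uses a length of 5 instead of 100 to keep the output readable. Padding and truncating at the front is a choice made here to mirror what I understand the Keras `pad_sequences` defaults to be.

```python
from collections import Counter


def fit_word_index(texts, num_words):
    """Map the `num_words` most frequent words to indices 1..num_words.

    Index 0 is left free so it can be used as the padding value.
    """
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(num_words)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}


def to_sequence(text, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [word_index[w] for w in text.split() if w in word_index]


def pad(seq, maxlen):
    """Keep the last `maxlen` items and left-pad with zeros to `maxlen`."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


texts = ["feel sad today", "feel fine", "sad sad day"]
word_index = fit_word_index(texts, num_words=2)
print(word_index)

maxlen = 5  # the post uses 100
padded = [pad(to_sequence(t, word_index), maxlen) for t in texts]
print(padded)
```

Every row of `padded` now has the same length, which is what lets the tweets be stacked into a single 2D array for the network.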
You might be wondering, ‘huh, we only converted words to numbers, not ones and zeroes!’ You are right. There are two ways to take care of that: we can either convert the numbers into one-hot-encoded vectors or create an embedding matrix. One-hot-encoded vectors are usually very high-dimensional and sparse, whereas embedding matrices are lower-dimensional and dense. If you are interested, you can read more about this in the ‘Deep Learning with Python’ book by Francois Chollet. In this blog, I will be using embedding matrices, but before we initialize them, we need to take care of a few other things.
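The difference between the two representations is easier to see side by side. The sketch below is illustrative only: the vocabulary size, the embedding dimension, and the random initial values are all assumptions, and in a real network the embedding rows would be learned or loaded from pre-trained vectors.

```python
import random

vocab_size = 6      # hypothetical tiny vocabulary (index 0 = padding)
embedding_dim = 3   # hypothetical; real embeddings use many more dimensions


def one_hot(index, size):
    """Sparse representation: a vector of zeros with a single 1."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Dense representation: one small row of floats per vocabulary index.
rng = random.Random(0)
embedding_matrix = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

sequence = [0, 0, 2, 1]  # a padded, tokenized tweet

one_hot_encoded = [one_hot(i, vocab_size) for i in sequence]
embedded = [embedding_matrix[i] for i in sequence]  # a plain row lookup

print(len(one_hot_encoded[0]), "values per word with one-hot encoding")
print(len(embedded[0]), "values per word with an embedding")
```

With one-hot encoding the vector length grows with the vocabulary, while the embedding width stays fixed no matter how many words the dictionary holds.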
Shuffling the data
Photo by Sergi Viladesau on Unsplash
Another issue with the data that you might have spotted earlier is that the text array contains all the non-depressive tweets first, followed by all the depressive ones. We therefore need to shuffle the data so that random samples of tweets go into the training, validation, and test sets.
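The important detail when shuffling is that the tweets and their labels must be reordered with the same permutation. A minimal sketch, using hypothetical toy data and a fixed seed for reproducibility:

```python
import random

# Hypothetical toy data: all non-depressive tweets first, as in the real file.
texts = ["t0", "t1", "t2", "t3", "t4", "t5"]
labels = [0, 0, 0, 0, 1, 1]

# Shuffle a list of positions once, then apply it to both arrays so that
# every tweet stays paired with its own label.
indices = list(range(len(texts)))
random.Random(42).shuffle(indices)

shuffled_texts = [texts[i] for i in indices]
shuffled_labels = [labels[i] for i in indices]

print(list(zip(shuffled_texts, shuffled_labels)))
```

Shuffling the two arrays independently would silently scramble the labels, which is why the sketch shuffles indices rather than the data itself.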
Splitting the data
Now we need to split the data into the training, validation, and test sets.
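The post does not state the split proportions, so the 80/10/10 fractions below are purely an assumption for illustration, as is the `split_three_ways` helper. The sketch expects data that has already been shuffled.

```python
def split_three_ways(items, train_frac=0.8, val_frac=0.1):
    """Slice already-shuffled data into train, validation, and test parts.

    The fractions are illustrative defaults, not values from the post.
    Whatever is left after train and validation becomes the test set.
    """
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


data = list(range(10314))  # stand-in for 10,314 shuffled, padded tweets
train, val, test = split_three_ways(data)
print(len(train), len(val), len(test))
```

The same slicing has to be applied to the labels so that both arrays stay aligned.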
Phew! Finally done with all the data munging!
Making a neural network
Image source: extremetech.com
Now we can start building the model architecture.
I will be trying two different models: one with a pre-trained word embeddings layer and one with a trainable word embeddings layer.
To define the neural network architecture, you need to understand how word embeddings work. There is a wealth of information online about word embeddings. This blog post is one of my favorites:
https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
Now that you hopefully have an idea of what the embeddings layer does, I will go ahead and create it in code.
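The post does not say which pre-trained vectors were used, and the original snippet is not preserved, so the sketch below only illustrates the general technique of turning a word index plus a set of pre-trained vectors into an embedding matrix. The `pretrained` dictionary, its values, and the `build_embedding_matrix` helper are all hypothetical.

```python
embedding_dim = 4  # hypothetical; real pre-trained vectors are much wider

# Hypothetical pre-trained vectors, normally parsed from a downloaded file.
pretrained = {
    "sad": [0.1, 0.2, 0.3, 0.4],
    "happy": [0.4, 0.3, 0.2, 0.1],
}

# Word index as produced by the tokenization step (0 is reserved for padding).
word_index = {"sad": 1, "lockdown": 2, "happy": 3}


def build_embedding_matrix(word_index, pretrained, dim):
    """Return one row per index; words without a vector stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
    return matrix


embedding_matrix = build_embedding_matrix(word_index, pretrained, embedding_dim)
for row in embedding_matrix:
    print(row)
```

A matrix like this is what gets loaded into the embeddings layer as its weights, with the layer typically frozen so that training does not overwrite the pre-trained values.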
First model
For the first model, the architecture consists of a pre-trained word embeddings layer followed by two dense layers.
Figure: Accuracy and loss for training and validation sets in model 1
Here we can see that the model performs very well, with an accuracy of 98 % on the test set. Overfitting is unlikely to be an issue because the validation accuracy and loss are almost the same as the training accuracy and loss.
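Because the data set is imbalanced, it can help to look at precision and recall for the depressive class alongside accuracy. The sketch below uses hypothetical toy predictions, not the model's real output, and the `evaluate` helper is an illustrative name; it only demonstrates how a model that never flags a tweet can still score a seemingly decent accuracy.

```python
def evaluate(y_true, y_pred):
    """Accuracy plus precision and recall for the positive (depressive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical toy labels with the same 3.5 to 1 imbalance as the data set.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1]

always_zero = [0] * len(y_true)               # never predicts "depressive"
one_mistake = [0, 0, 0, 0, 0, 0, 1, 1, 1]     # one false positive

print(evaluate(y_true, always_zero))
print(evaluate(y_true, one_mistake))
```

A 98 % accuracy is well above the roughly 78 % majority-class baseline for this ratio, so the reported result is encouraging; recall on the depressive class would show how many of those tweets are actually caught.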
The second model
For the second model, I decided to exclude the pre-trained embeddings layer.
Figure: Accuracy and loss for training and validation sets in model 2
The two models are equally accurate on the test set. However, since the second model is less complex, I will use it to predict whether a tweet is depressive or not.
Obtaining data from Twitter for COVID-19 related tweets
To obtain my data sets of tweets, I used twint, which is an amazing web-scraping tool for Twitter. I prepared two different data sets of 1000 tweets each. The first one consisted of tweets containing corona-related keywords such as ‘COVID-19’, ‘quarantine’, and ‘pandemic’.
To get a control sample to compare against, I searched for tweets containing neutral keywords such as ‘the’, ‘a’, ‘and’, etc. Using 1000 tweets from this sample, I made up the second, control data set.
Figure: WordCloud of COVID-related tweets
I cleaned the data sets using a procedure similar to the one I used for the training set. After cleaning the data, I fed it to my neural network to predict the percentage of depressive tweets. The results I obtained were surprising.
I repeated the run with different batches of data obtained using the same procedure as described above and calculated the average results.
On average, my model predicted 35 % depressive tweets and 65 % non-depressive tweets in a data set obtained using neutral keywords. 35 % depressive tweets in a randomly obtained sample is an alarmingly high number. However, the share of depressive tweets with COVID-related keywords was even higher: 55 % depressive vs 45 % non-depressive. That is a 57 % relative increase in depressive tweets, or 20 percentage points!
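The arithmetic behind these figures can be sketched as follows. The 0.5 decision threshold and both helper names are assumptions for illustration, since the post does not say how the model's outputs were turned into labels; the 35 % and 55 % values are the averages reported above.

```python
def percent_depressive(probabilities, threshold=0.5):
    """Share of tweets whose predicted probability reaches the threshold.

    The 0.5 threshold is an assumed default, not a value from the post.
    """
    flagged = sum(1 for p in probabilities if p >= threshold)
    return 100.0 * flagged / len(probabilities)


def relative_increase(baseline_pct, new_pct):
    """Relative change from the baseline, expressed in percent."""
    return 100.0 * (new_pct - baseline_pct) / baseline_pct


# Hypothetical model outputs for four tweets.
print(percent_depressive([0.9, 0.2, 0.6, 0.1]))  # prints 50.0

control_pct = 35.0  # average for the neutral-keyword sample (from the post)
covid_pct = 55.0    # average for the COVID-keyword sample (from the post)

print(f"absolute difference: {covid_pct - control_pct:.0f} percentage points")
print(f"relative increase: {relative_increase(control_pct, covid_pct):.0f} %")
```

The 57 % figure is therefore a relative increase over the control sample; the absolute gap between the two samples is 20 percentage points.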
This leads to the conclusion that there is indeed a correlation between COVID-19 related keywords and depressive sentiment in tweets on Twitter.
Conclusion
I hope this post helped you learn a bit more about sentiment analysis using machine learning, and I hope you will try out a similar project yourself.
Happy coding!
Credits: Slater on GIPHY
Original article: https://towardsdatascience.com/exploring-the-link-between-covid-19-and-depression-using-neural-networks-469030112d3d