Social Media and Topic Modeling: How to Analyze Posts in Practice
Practical use of topic modeling
There is a substantial amount of data generated on the internet every second — posts, comments, photos, and videos. These different data types mean that there is a lot of ground to cover, so let’s focus on one — text.
All social conversations are based on written words — tweets, Facebook posts, comments, online reviews, and so on. Being a social media marketer, a Facebook group/profile moderator, or trying to promote your business on social media requires you to know how your audience reacts to the content you are uploading. One way is to read it all, mark hateful comments, divide them into similar topic groups, calculate statistics and… lose a big chunk of your time just to see that there are thousands of new comments to add to your calculations. Fortunately, there is another solution to this problem — machine learning. From this text you will learn:
- Why do you need specialised tools for social media analyses?
- What can you get from topic modeling, and how is it done?
- How to automatically look for hate speech in comments?
Why are social media texts unique?
Before jumping to the analyses, it is really important to understand why social media texts are so unique:
- Posts and comments are short. They mostly contain one simple sentence, or even a single word or expression. This gives us a limited amount of information to obtain from just one post.
- Emojis and smiley faces are used almost exclusively on social media. They give additional details about the author's emotions and context.
- Slang phrases make posts resemble spoken language rather than written. They make statements appear more casual.
These features make social media a whole different source of information and demand special attention when running an analysis using machine learning. In contrast, most open-source machine learning solutions are based on long, formal text, like Wikipedia articles and other website content. As a result, these models perform badly on social media data, because they don't understand the additional forms of expression it contains. This problem is called domain shift and is a typical NLP problem. Different data also require customised data preparation methods, called preprocessing. This step consists of cleaning text of uninformative tokens, like URLs or mentions, and converting it to a machine-readable format (more about how we do it at Sotrender). This is why it is crucial to use tools created especially for your data source to get the best results.
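To make the preprocessing step concrete, here is a minimal sketch in Python. The regexes and the rough emoji range are illustrative assumptions, not the actual Sotrender pipeline:

```python
import re

def preprocess(text: str) -> list[str]:
    """Clean a social media post and split it into tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs (carry no topical signal)
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    # keep word characters, whitespace, hashtags and (roughly) emoji,
    # since emojis carry emotional signal on social media
    text = re.sub(r"[^\w\s#\U0001F300-\U0001FAFF]", " ", text)
    return text.split()

print(preprocess("Loved it!! 😍 check https://example.com @brand #fun"))
# → ['loved', 'it', '😍', 'check', '#fun']
```

A real pipeline would go further (normalising slang, handling hashtags, mapping tokens to a machine-readable format), but the idea is the same: remove noise, keep the signal.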
Topic Modeling for social media
Machine learning for text analysis (Natural Language Processing) is a vast field with lots of different model types that can give insight into your data. One of the areas that can answer the question "what are the topics of given pieces of text?" is topic modeling. These models help with understanding what people are talking about in general. Topic modeling does not require any specially prepared data set with predefined topics: it finds topics, patterns hidden within the data, on its own, without supervision or help, which makes it an unsupervised machine learning method. This means that it is easy to build a model for each individual problem.
There are lots of different algorithms that can be used for this task, but the most common and widely used is LDA (Latent Dirichlet Allocation). It is based on word frequencies and the distribution of topics in texts. To put it simply, this method counts words in a given data set and groups them into topics based on their co-occurrence. Then the percentage distribution of topics in each document is calculated. This method therefore assumes that each text is a mixture of topics, which works great with long documents where every paragraph relates to a different matter.
Figure 1. LDA algorithm (Credit: Columbia University)

That's why social media texts need a different procedure. One of the newer algorithms is GSDMM (the Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model). What makes this one so different? Its authors explain it with an analogy known as the Movie Group Process:
Students are asked to write down some movie titles they liked within 2 minutes. Most students are able to list 3-5 movies within this time frame (which corresponds to the limited number of words in social media texts). Then each student is randomly assigned to a table. In the last step, every student picks a new table with two rules in mind:
- pick a table with more students, which favours bigger groups,
- or a table with the most similar movie titles, which makes groups more cohesive.
This last step is repeated multiple times. The first rule, favouring bigger groups, is crucial to ensure that groups are not excessively fragmented. Due to the limited number of movie titles (words) for each student (text), each group (topic) is bound to have members with different movies in their lists, but from the same genre.
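The two table-picking rules can be sketched in plain Python. Note that the real GSDMM samples a table probabilistically from weights derived from Dirichlet priors; the additive score and the weight of 2 below are illustrative simplifications:

```python
def score_table(student_movies, table):
    """Score one table for a student using GSDMM's two rules:
    bigger tables score higher (rule 1), and so do tables whose
    members share movies with the student (rule 2)."""
    size = len(table)                                  # rule 1: favour bigger groups
    table_movies = set().union(*table) if table else set()
    overlap = len(set(student_movies) & table_movies)  # rule 2: favour similar lists
    return size + 2 * overlap  # the weight 2 is an arbitrary illustration

student = ["Alien", "Blade Runner"]
tables = [
    [{"Titanic"}, {"Notting Hill", "Titanic"}],  # bigger table, different genre
    [{"Alien", "Predator"}],                     # smaller table, shared taste
]
scores = [score_table(student, t) for t in tables]
best = max(range(len(tables)), key=scores.__getitem__)
print(scores, best)  # → [2, 3] 1  (the similar table wins despite being smaller)
```

Iterating this choice for every student, many times over, is what gradually pulls texts with overlapping vocabulary into the same topic.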
As a result of the GSDMM algorithm, you obtain an assignment of each text to exactly one topic, as well as a list of the most important words for every topic.
Figure 3. Assignment of documents to topics and extraction of topic words

The tricky part is deciding on the number of topics (a problem of every unsupervised method), but once you do, you can gain quite a lot of insight from the data:
- Distribution of topics in your data.
- Word clouds, which let us comprehend a topic and name it. They are a quick and easy solution that can replace reading the whole set of texts and spare you hours of tedious work dividing them into sets.
- Time series analysis of topics. As we can see in the plot below, some topics gain attention over time, like number 7, while others fade away, like number 4. Looking back at how topics changed in the past is a good way to grasp what is popular now or may become popular in the future.
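Such a time series can be built directly from the model's output, i.e. (date, topic) assignments. A minimal sketch with made-up data:

```python
from collections import Counter
from datetime import date

# (publication date, assigned topic) pairs, as a topic model would output; toy data
assignments = [
    (date(2020, 3, 2), 7), (date(2020, 3, 3), 7), (date(2020, 3, 4), 4),
    (date(2020, 3, 9), 7), (date(2020, 3, 10), 7), (date(2020, 3, 11), 4),
    (date(2020, 3, 16), 7), (date(2020, 3, 17), 7), (date(2020, 3, 18), 7),
]

# count texts per (ISO week, topic) to see which topics grow and which fade
weekly = Counter((d.isocalendar()[1], topic) for d, topic in assignments)
for (week, topic), n in sorted(weekly.items()):
    print(f"week {week}  topic {topic}: {n}")
```

In this toy series topic 7 keeps growing while topic 4 disappears in the last week, mirroring the trends described above.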
Use case
In one of our recent projects, for Collegium Civitas, we analyzed 50,000 social media posts and comments and performed topic analysis on them. It allowed our client to answer questions like:
1) What was discussed on social media over a span of 2 months?
In the dataset we were able to distinguish 10 different topics, all revolving around Covid-19. Discussions covered: Covid-19 statistics and etiology, everyday life, the government's response to the pandemic, consequences of travel restrictions, the trade market and supplies, health care during the pandemic, church and politics, common knowledge and conspiracy theories about Covid-19, politics and the economy, and spam messages and ads.
2) How were the discussions influenced by the pandemic situation?
During the outbreak of the pandemic, the biggest theme was the origin and statistics of Covid-19. People talked about how the situation was changing and exchanged information about the ways the disease spreads. To read more, visit Collegium Civitas' site (Polish version only).
Hate speech recognition
Another question that can be answered with machine learning is "what kind of emotions do people express in their comments or posts?" or "is my content generating hateful comments?". There are only a few solutions for these tasks in the Polish language. That is why, at Sotrender, we built models for sentiment and hate speech recognition based on social media text. Our solutions were built in two steps.
The first step is to convert text and emojis into numerical vector representations (embeddings) to be used later in neural networks. The main goal of this step is to obtain a language model (LM) with knowledge of human language, so that vectors representing similar words are close to each other (for example: queen and king, or paragraph and article), which implies that these words have similar meanings (semantic similarity). This property is shown on the graph below.
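The closeness of embedding vectors is usually measured with cosine similarity. A sketch with toy 3-dimensional vectors (real embeddings have hundreds of dimensions, and the values below are made up):

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up 3-dimensional vectors; real embeddings have hundreds of dimensions
queen = [0.9, 0.8, 0.1]
king = [0.8, 0.9, 0.2]
pizza = [0.1, 0.0, 0.9]

print(round(cosine(queen, king), 2))   # high: semantically similar words
print(round(cosine(queen, pizza), 2))  # low: unrelated words
```

A well-trained language model places queen near king and far from pizza, which is exactly the property the graph below illustrates.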
Figure 7. The intuition behind word similarity

Training this model is similar to teaching a child how to speak by talking to them. By listening to their parents talk, children are able to grasp the meaning of words, and the more they hear, the more they understand.
Following this analogy, we have to use a huge set of social media texts to train our model to understand their language. That is why we used a set of 100 million posts and comments to train our model, so it can properly assign vectors to words as well as to emojis. Tokens vectorised with the embeddings model provide the input to the neural network.
The second step is designing a neural network for a specific task: hate speech recognition. The most important thing is the data set: the model needs examples of hateful and non-hateful texts to learn how to tell them apart. In order to get the best results, you need to experiment with different architectures and model hyperparameters.
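A minimal sketch of this second step, with a plain linear classifier standing in for the neural network and made-up 2-dimensional "embeddings" standing in for the real ones:

```python
from sklearn.linear_model import LogisticRegression

# each row stands for the averaged embedding of one comment (values made up);
# label 1 = hateful, 0 = non-hateful
X_train = [
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],  # hateful examples
    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85],  # non-hateful examples
]
y_train = [1, 1, 1, 0, 0, 0]

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[0.88, 0.12], [0.1, 0.9]]))  # → [1 0]
```

The structure is the same in the real system: labelled examples in, a learned decision boundary out; the neural network and the experimentation with architectures and hyperparameters are what make it work on real embeddings.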
As a result of the hate speech recognition model, we get another grouping of our data set. Now we can see how our audience reacts and how many hateful comments or posts it is creating. What's more, by combining this again with the publication time of each comment, we can see whether there was a specific time period when the most hateful comments were generated, as shown in the histogram below.
Figure 8. Hate speech distribution over time

Combining this distribution with recent posts or events can give you insight into the type of content that provokes people. Changes in the share of hate speech over time can also be related to changes in topic distribution. Combining all the information from the analysis can provide an in-depth picture of the dataset.
Figure 9. Weekly text count with hate speech

As the histogram above shows, most hate is connected to topics 3, 6, and 7. Knowing what makes people angry gives you the opportunity to avoid sensitive topics in the future.
The same goes for sentiment analysis. We can produce similar visualizations for positive, negative, or neutral comments and see their distribution over time or across topics. If you would like to read the whole report built on our analysis of the 8 weeks of data, you can find it here (Polish version only).
Conclusion
At Sotrender, we have models for hate speech and sentiment recognition that are constantly improved and updated for social media texts. What's more, we have experience in building topic models for individual cases. As you can see, there are a lot of benefits coming from this type of analysis:
- Getting to know your audience
- Having an in-depth look into the topics of comments
- Discovering trending themes
- Finding the source of hatred or negativity in your content
To name just a few!
Translated from: https://towardsdatascience.com/social-media-and-topic-modeling-how-to-analyze-posts-in-practice-d84fc0c613cb