Annoyingly Simple Sentence Clustering
Making the machine understand the modalities of any language is at the core of almost all NLP tasks. Understanding patterns in the data and being able to segregate data points on the basis of similarities and differences can help in several ways. Some use cases are:
- One may scale down massive data that has many unique data points which are nevertheless contextually similar in meaning. For example, in question-answer verification systems, one of the major problems is the repetition of answer statements; these data points are effectively duplicates and should be removed based on a threshold of semantic similarity.
- In classification tasks, and especially while dealing with macaronic languages, one can map a cluster of similar data points to a label by relying on the similarity metric calculated.
- Many natural language applications, sentiment and non-sentiment, such as semantic search, summarization, question answering, document classification, sentiment analysis, and plagiarism detection depend on sentence similarity.
Macaronic Languages and the peril involved:
While working with macaronic languages, where there is no grammar structure, no defined vocabulary, and no basic rules to govern sentence formation, one can only rely on numbers to tell what the data says. That is, and will always be, something AI/ML engineers have to keep in mind: feed the data in a language the machine understands, because at the end of the day machines crunch numbers to tell the story hidden in the data.
Let’s begin with an example. The sentences are in Hinglish (Hindi written using Latin characters, with English words used in between), to understand the difference between semantic and lexical similarity:
Sentence 1: “Mood kharab hai yaar aaj”
Sentence 2: “Mood kharab mat kar”
While sentence 1 has tones of sadness and disappointment, sentence 2 carries more of an angry connotation. These sentences are close in terms of lexical similarity but far apart in terms of semantic similarity.
To make the machine understand the difference between the two and map them to the respective labels, a solution can be devised using what Annoy (Approximate Nearest Neighbors Oh Yeah) has to offer (https://github.com/spotify/annoy). It is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
Let’s understand more with a task example:
Say we want to build a model that predicts emojis based on the text input given by a user. The data has been prepared in the following format:
- badhai ho 💐
- many many returns of the day happy birthday 🎂
- be the way u r always strong personality 💪
One emoji per sentence. But an issue arose when sentences with similar context, and in some cases the same sentences with minor changes in wording, were mapped to different emojis. This would confuse the model, so we tweaked the task from multi-class classification to multi-label classification. But even then, mapping sentences with similar context to a unique emoji cluster was the ideal way forward.
Strategy Adopted:
The first and foremost step is to get the word vectors of the words in our corpus. Remember, it is a macaronic language, so no pre-trained model will work here.
Word Vectors using Fasttext:
A popular idea in modern machine learning is to represent words by vectors. These vectors capture hidden information about a language, like word analogies or semantics, and are also used to improve the performance of text classifiers. A Fasttext model was trained in unsupervised mode on the gathered user chat data, with emojis and other special characters removed and only one sentence per line.
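A minimal sketch of this step with the fasttext Python package might look like the following; the corpus file name and the hyperparameters are assumptions for illustration, not the exact values used:

```python
import fasttext

# chat_data.txt: one cleaned chat sentence per line (emojis and other
# special characters already removed). The file name is an assumption.
model = fasttext.train_unsupervised(
    'chat_data.txt',
    model='skipgram',  # subword-aware skip-gram copes well with noisy macaronic spellings
    dim=40,            # assumed here to match the vector length f = 40 used with Annoy below
)
model.save_model('chat_vectors.bin')
```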
Sentence Vectors:
The word vectors were used to form a sentence vector by summing all the vectors fetched for a sentence (one vector for each word) and dividing by the number of words present, i.e. taking their average.
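As a rough sketch, assuming the Fasttext model from the previous step (sentence_vector is a hypothetical helper name, not code from the original post):

```python
import numpy as np

def sentence_vector(model, sentence):
    # One FastText vector per word, then the mean:
    # the sum of word vectors divided by the number of words present.
    words = sentence.split()
    vectors = [model.get_word_vector(w) for w in words]
    return np.mean(vectors, axis=0)
```

Note that the fasttext package also ships a built-in model.get_sentence_vector, though it L2-normalizes each word vector before averaging, which differs slightly from the plain mean described here.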
Clustering similar sentences using ANNOY:
Using ANNOY we built a forest of trees that stores the sentence vectors index-wise, so that sentences similar to a given sentence vector can be looked up. For a given query sentence, it returns a list of the k most similar sentences and their index positions in the dataset. Based on the angular distance between the query sentence and its neighbours, a threshold was decided, and as many sentences as fulfilled the threshold criterion were collected from our dataset. Each cluster of sentences was then mapped to the 5 most frequent emojis in that cluster. At times fewer than 5 emojis were found, so a minimum of one and a maximum of 5 emojis were mapped to each cluster formed under the threshold.
- The function below returns a set of indices of all the sentences found to be similar within an angular distance ≤ threshold. input_df is the dataframe holding the sentences (the returned indices point into it).
```python
def get_neighbours(index, threshold):
    # threshold: angular-distance cutoff, to be decided for the respective task
    k = 50  # number of neighbours considered per search
    sentence_vector = t.get_item_vector(index)
    ids, distance = t.get_nns_by_vector(sentence_vector, k, include_distances=True)
    similarity = distance[-1]  # distance of the farthest neighbour returned
    while similarity < threshold:
        # Even the farthest of the k neighbours lies within the threshold,
        # so double k and search for more sentences that may satisfy it.
        k = 2 * k
        ids, distance = t.get_nns_by_vector(sentence_vector, k, include_distances=True)
        similarity = distance[-1]
    indices = extract_index(ids, distance, threshold)
    return indices
```
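extract_index is a helper the post does not show; a minimal sketch consistent with how it is called might be:

```python
def extract_index(ids, distance, threshold):
    # Keep only the neighbours whose angular distance satisfies the
    # threshold; the result is the set of dataset indices in the cluster.
    return {i for i, d in zip(ids, distance) if d <= threshold}
```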
The tree is built by calling the following Python code:
```python
from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10)  # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann')  # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000))  # will find the 1000 nearest neighbors
```
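In our task the items indexed are not random vectors but the sentence vectors, added under their dataset index before building. Roughly, reusing the hypothetical sentence_vector helper from above and assuming input_df has a 'sentence' column:

```python
f = 40  # must equal the dimension of the sentence vectors
t = AnnoyIndex(f, 'angular')
for i, text in enumerate(input_df['sentence']):
    t.add_item(i, sentence_vector(model, text))
t.build(10)
```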
- E.g.: Query sentence: “happy birthday”. The closer the angular distance is to 0, the better the similarity.
Similar sentences and their respective angular distances:
birthday happy 0.0
happy birthday happy birthday 0.0
birthday happy birthday happy 0.0
happy birthday happy birthday yaar 0.14156264066696167
happy birthday happy birthday hap 0.15268754959106445
happy happy wala birthday birthday 0.16257968544960022
happy birthday maa happy birthday 0.17669659852981567
This entire cluster was mapped to this emoji mapping: [🎂, 😘, 😍, 🙏, 😂]
Similarly:
i’m very very happy today [‘😊’, ‘😍’, ‘😁’, ‘😘’]
so i’m very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i’m so very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
yes i m very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
yes am very happy today [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i also very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
im so very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i am very very hppy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i am very happy friends [‘😊’, ‘😍’, ‘😁’, ‘😘’]
oh i’m very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i am very very happ [‘😊’, ‘😍’, ‘😁’, ‘😘’]
im very happy kal [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i am very haappy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
sister l am very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
today i’m so very happpy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
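The cluster-to-emoji mapping itself can be done by counting emoji frequencies within a cluster, roughly as sketched below (map_cluster_to_emojis is a hypothetical name; the post does not show this code):

```python
from collections import Counter

def map_cluster_to_emojis(cluster_emojis, top_n=5):
    # cluster_emojis: the emoji labels attached to every sentence in a cluster.
    # Returns between 1 and top_n emojis, most frequent first.
    return [emoji for emoji, _ in Counter(cluster_emojis).most_common(top_n)]
```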
This entire exercise helped us map similar context in sentences to the same emoji vector. It is important for the model to understand that such context can be mapped to these 5 emojis, not as one class each but as different labels. Thus, it becomes a problem of multi-label classification. Multi-label classification originated from the investigation of the text categorization problem, where each document may belong to several predefined topics simultaneously. In our case each text can belong to one or several emojis.
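Turning each sentence's emoji list into multi-label training targets can then be done with a binary indicator matrix, for example with scikit-learn (a sketch; the rows below are illustrative, not actual training data):

```python
from sklearn.preprocessing import MultiLabelBinarizer

y = [
    ['😊', '😍', '😁', '😘'],        # a sentence from the "very happy" cluster
    ['🎂', '😘', '😍', '🙏', '😂'],  # a sentence from the birthday cluster
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)  # binary indicator matrix, one column per emoji
print(mlb.classes_)       # the emoji vocabulary behind the columns
```

From there, any standard multi-label classifier can be trained on the sentence vectors against Y.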