Evaluation of Sentiment Analysis: A Reflection on the Past and Future of NLP
I recently received a new paper titled "Evaluation of Sentiment Analysis in Finance: From Lexicons to Transformers," published on July 16, 2020 in IEEE Access. The authors, Kostadin Mishev, Ana Gjorgjevikj, Irena Vodenska, Lubomir T. Chitkushev, and Dimitar Trajanov, compared more than a hundred sentiment algorithms applied to two well-known financial sentiment datasets and evaluated their effectiveness. Although the purpose of the study was to test the effectiveness of different Natural Language Processing (NLP) models, the findings in the paper can tell us much more about the progress of NLP over the last decade, and especially help us understand which elements contributed the most to the sentiment prediction task.
So let's start with the definition of the sentiment prediction task. Given a collection of paragraphs, the model classifies each paragraph into one of three possible categories: positive sentiment, negative sentiment, or neutral. The model is then evaluated based on a confusion matrix (3×3) that is constructed from the counts of predicted sentiment versus the ground truths (the true labels of each paragraph).
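To make the setup concrete, here is a minimal sketch in Python of how such a 3×3 confusion matrix can be assembled. The labels and predictions are invented for illustration and are not taken from the paper.

```python
# A minimal sketch of the 3x3 evaluation setup described above.
# The example labels and predictions are invented, not from the paper.
LABELS = ["negative", "neutral", "positive"]

def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Count (true, predicted) pairs into a len(labels) x len(labels) matrix."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

y_true = ["positive", "neutral", "negative", "positive", "neutral"]
y_pred = ["positive", "negative", "negative", "neutral", "neutral"]
for label, row in zip(LABELS, confusion_matrix(y_true, y_pred)):
    print(label, row)
```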
The evaluation metric implemented by the authors is called the Matthews correlation coefficient (MCC) and serves as a measure of the quality of binary (two-class) classifications (Matthews, 1975). Although the MCC metric is only applicable to the binary case, the authors do not mention how they applied the MCC function in the multi-class case (3 sentiment classes). Did they use micro-averaging, or did they apply the generalized equation for the multi-class case?
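For reference, scikit-learn's `matthews_corrcoef` implements the multi-class generalization of MCC directly, so the snippet below is one plausible route; it is a sketch of how the 3-class case could be scored, not a claim about what the authors actually did.

```python
# scikit-learn's matthews_corrcoef handles the multi-class case
# via the generalized (K-category) form of the coefficient.
from sklearn.metrics import matthews_corrcoef

y_true = ["positive", "neutral", "negative", "positive", "neutral"]
y_pred = ["positive", "negative", "negative", "neutral", "neutral"]

# Returns a single score in [-1, 1]: 1 is perfect prediction,
# 0 is no better than chance, -1 is total disagreement.
print(matthews_corrcoef(y_true, y_pred))
```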
The authors divided the NLP models into five broad categories based on their textual representation: (1) lexicon-based knowledge, (2) statistical methods, (3) word encoders, (4) sentence encoders, and (5) transformers. Several different models were applied for each category, and their performance is reported in a table in the paper.
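To give a feel for the oldest category, here is a toy lexicon-based scorer. The word lists are invented and far smaller than the published financial lexicons the paper evaluates; it only illustrates the mechanism behind category (1).

```python
# A toy lexicon-based sentiment scorer (category 1). The word lists
# are illustrative only, not from any published financial lexicon.
POSITIVE = {"gain", "growth", "profit", "beat", "upgrade"}
NEGATIVE = {"loss", "decline", "miss", "downgrade", "default"}

def lexicon_sentiment(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("quarterly profit beat expectations"))  # positive
```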
The results in that table demonstrate the progress in sentiment analysis over the years, driven by the text representation method. The authors confirm that transformers show superior performance compared to the other evaluated approaches, and that text representation plays the main role, as it feeds the semantic meaning of words and sentences into the models.
But wait! There are perhaps more conclusions that can be drawn from this experiment regarding the future of NLP. Can we uncover clues about the elements that are still missing to make NLP much more effective in more complex tasks? What might be the next big breakthrough toward better representation of human language by language models?
In an effort to answer that exact question, I started to dig further into the models' outcomes and search for a connection between text representation, model size, and model performance, in an attempt to isolate the contributions of model size and text representation to the final performance. Based on the authors' analysis, I created the figure below, which shows the MCC score of each model as a function of the model's number of parameters. The colors represent each model's main category.
Figure 1: The improvement in sentiment classification (MCC score) as a function of the number of parameters in the models (logarithmic scale).

From my analysis, it can be seen that the progress of the sentiment prediction task consists of two phases. The first phase is mainly attributed to better text representation, while the second phase is due to the introduction of the transformer, which can handle huge corpora by increasing network size and managing millions of parameters.
The graph above highlights that text representation has gone through three major revolutions since the early 80s. The first was from lexicon representation to embedding-vector representation. The main advantage of embedding vectors is their unsupervised nature: they do not require any tagging while still capturing meaningful semantic relations between words and benefitting from a model's generalization capabilities. It is important to remember that these embedding models, such as word2vec and GloVe, are context-independent. They assign the same pretrained vector to a given word regardless of the context around it. Thus, they cannot handle polysemy or complex semantics in natural language.
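A tiny sketch makes this limitation obvious: a static lookup table holds exactly one vector per word, so a polysemous word like "bank" gets the same representation in every sentence. The vectors below are invented; in practice the table would be loaded from pretrained word2vec or GloVe files.

```python
# Context-independent embeddings: one fixed vector per word.
# The numbers are made up; real vectors come from word2vec/GloVe.
import numpy as np

embedding_table = {
    "bank":  np.array([0.2, -0.7, 0.5]),
    "river": np.array([0.9,  0.1, 0.3]),
    "loan":  np.array([-0.4, 0.8, 0.6]),
}

def embed(sentence):
    return [embedding_table[w] for w in sentence.split() if w in embedding_table]

v1 = embed("the bank approved the loan")[0]   # financial sense of "bank"
v2 = embed("the bank of the river")[0]        # geographic sense of "bank"
print(np.array_equal(v1, v2))  # True: the context is ignored entirely
```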
Then, context-sensitive word representations were introduced around 2018 with models like ELMo and GPT. In these models, the vector representation of a word depends on its context: ELMo encodes context bidirectionally, while GPT encodes context from left to right. The main contribution of these models was their ability to handle polysemy and more complex semantics.
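The sketch below illustrates the contrast using a pretrained model from the Hugging Face transformers library (assumed installed) as a convenient stand-in for a context-sensitive encoder; I use BERT here purely because its API is easy to demo, not because ELMo or GPT work this way internally. The same surface word now yields different vectors in different sentences.

```python
# Context-sensitive embeddings: the vector for "bank" now depends
# on the surrounding words. Requires: pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]  # contextual vector of "bank"

v1 = bank_vector("the bank approved the loan")
v2 = bank_vector("we walked along the bank of the river")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```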
The most recent revolution in NLP is BERT (Bidirectional Encoder Representations from Transformers), which combines bidirectional context encoding and requires minimal architecture changes for a wide range of natural-language-processing tasks. The embedding of a BERT input sequence is the sum of its token embeddings, segment embeddings, and positional embeddings. BERT and the models that followed are unique in their ability to process batches of sequences at scale, growing from about 1M parameters to the latest models that exceed 500M. From the graph, it can be seen that the number of parameters in the model is the main driver of the continuous performance improvement over the last four years.
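Here is a minimal sketch of that input representation, with tiny invented dimensions and randomly initialized weights, just to make the elementwise sum of the three embedding tables explicit:

```python
# BERT-style input embedding: token + segment + position, summed.
# Dimensions are tiny and weights random, purely for illustration.
import torch
import torch.nn as nn

vocab_size, max_len, num_segments, d_model = 100, 16, 2, 8

token_emb = nn.Embedding(vocab_size, d_model)
segment_emb = nn.Embedding(num_segments, d_model)
position_emb = nn.Embedding(max_len, d_model)

token_ids = torch.tensor([[5, 23, 9, 41]])               # one 4-token sequence
segment_ids = torch.zeros_like(token_ids)                # all from segment A
positions = torch.arange(token_ids.size(1)).unsqueeze(0) # 0, 1, 2, 3

# The input representation is the elementwise sum of the three lookups.
x = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)  # torch.Size([1, 4, 8])
```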
Although NLP models have come a long way in recent years and made substantial progress, there is still plenty of room for improvement. According to several studies (1, 2), just increasing the network size is not enough, and even today's models are in a state of overparametrization. The next breakthrough will probably come from further progress in text representation, when NLP models become better able to capture language compositionality (the ability to learn the meaning of a larger piece of text by composing the meanings of its constituents). A good place to start looking for ideas about new text representations is the domain of grammar inference. By learning controlled formal grammars, we can deepen our understanding of the elements that compositionality should handle (Solan et al., 2005), with respect to tests such as systematicity, substitutivity, productivity, and localism (Hupkes et al., 2019; Onnis & Edelman, 2019).
Bibliography
(1) Hupkes, D., Dankers, V., Mul, M., & Bruni, E. (2019). The compositionality of neural networks: integrating symbolism and connectionism. arXiv preprint arXiv:1908.08351.
(2) Kovaleva, O., Romanov, A., Rogers, A., & Rumshisky, A. (2019). Revealing the dark secrets of BERT. arXiv preprint arXiv:1908.08593.
(3) Onnis, L., & Edelman, S. (2019). Local versus global statistical learning in language.
(4) Solan, Z., Horn, D., Ruppin, E., & Edelman, S. (2005). Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences, 102(33), 11629–11634.
(5) Mishev, K., Gjorgjevikj, A., Vodenska, I., Chitkushev, L. T., & Trajanov, D. (2020). Evaluation of Sentiment Analysis in Finance: From Lexicons to Transformers. IEEE Access, 8, 131662–131682.
Source: https://towardsdatascience.com/evaluation-of-sentiment-analysis-a-reflection-on-the-past-and-future-of-nlp-ccfd98ee2adc