Legal Applications of Neural Word Embeddings
A fundamental issue with LegalTech is that words — the basic currency of all legal documentation — are a form of unstructured data that cannot be intuitively understood by machines. Therefore, in order to process textual documents, words have to be represented by vectors of real numbers.
Traditionally, methods like bag-of-words (BoW) map word tokens/n-grams to term frequency vectors, which record the number of times each word appears in the document. Under this one-hot-style encoding, each word token/n-gram in the vocabulary is assigned its own vector element, marked 0, 1, 2, etc., according to the number of times that word occurs in the document. This means that if a word from the corpus vocabulary is absent from the document, its element is marked 0; if it appears once, the element is marked 1, and so on.
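To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer; the two sample sentences are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "documents" (invented for illustration).
docs = [
    "the tenant shall pay rent to the landlord",
    "the landlord may terminate the lease",
]

vectorizer = CountVectorizer()       # unigram bag-of-words
X = vectorizer.fit_transform(docs)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())
# ['landlord' 'lease' 'may' 'pay' 'rent' 'shall' 'tenant' 'terminate' 'the' 'to']
print(X.toarray())
# [[1 0 0 1 1 1 1 0 2 1]
#  [1 1 1 0 0 0 0 1 2 0]]
```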
The problem is that this produces very sparse matrices (i.e. mostly comprising zeros) in extremely high dimensions. For instance, a corpus with 30,000 unique word tokens requires a 30,000-dimensional vector for every document, which is extremely computationally taxing. Furthermore, this method fails to capture semantic meaning, context, and word relations, as it can only show how frequently a word occurs in a document. This inability to represent semantic meaning persisted even as BoW was complemented by the TF-IDF measure: while TF-IDF represents how important a word is to a corpus (an improvement on plain BoW representations), it is still computed from the frequency with which a word token/n-gram appears in the corpus.
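As a rough illustration of that last point, a sketch using scikit-learn's TfidfVectorizer (the toy documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the tenant shall pay rent",
    "the landlord may terminate the lease",
    "rent is payable monthly",
]

# TF-IDF re-weights raw counts so that terms frequent in a document
# but rare across the corpus score highest. The weights are still
# derived purely from term frequencies, so the model has no way to
# know that e.g. "rent" and "lease" are semantically related.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(X.shape)   # (3, vocabulary_size): one sparse row per document
```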
In contrast, modern word embeddings (word2vec, GloVe, fastText, etc.) rely on neural networks to map the semantic properties of words into dense vector representations with significantly fewer dimensions.
As a preliminary note, word embeddings are premised on the distributional semantics assumption, i.e. that words used and occurring in the same contexts tend to have similar meanings. This means that the neural network learns the vector representation of each word from the contexts in which that word is found.
“Context” here is represented by a sliding context window of size n: the n words before the target word and the n words after it fall within the window (e.g. n = 2). The model then trains by using one-hot encoded context vectors to predict one-hot encoded target vectors as the window slides along the sentence.
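A minimal training sketch, assuming gensim 4.x; the toy sentences are invented, and window=2 mirrors the n = 2 example above:

```python
from gensim.models import Word2Vec

# Toy corpus of tokenised sentences (invented for illustration).
sentences = [
    ["the", "court", "dismissed", "the", "appeal"],
    ["the", "court", "allowed", "the", "appeal"],
    ["the", "tribunal", "dismissed", "the", "claim"],
]

# window=2: the two words either side of the target form its context.
# sg=0 (the default, CBOW) predicts the target word from its context,
# as described above.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["court"].shape)                    # (50,): a dense vector
print(model.wv.similarity("court", "tribunal"))   # cosine similarity
```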
In doing so, word embeddings can capture semantic associations and linguistic contexts not captured by BoW. This article will explore the impact of neural word embeddings in legal AI technologies.
Capturing Relationships between Legal Terms
An important implication of word embeddings is that they capture the semantic relationships between words.
The ability to map out the relationships between legal terms and objects has exciting implications for improving our understanding of legal reasoning. An interesting direction is vectorising judicial opinions with doc2vec to identify/cluster judges with similar belief patterns (based on the conservativeness of their opinions, the precedents they cite, etc.), as sketched below.
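A hedged sketch of what that might look like with gensim's Doc2Vec (4.x); the opinions, judge names, and cluster count are all hypothetical:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# Hypothetical data: each judge mapped to one tokenised opinion.
opinions = {
    "judge_a": ["the", "appeal", "is", "dismissed", "with", "costs"],
    "judge_b": ["the", "appeal", "is", "allowed", "in", "part"],
    "judge_c": ["the", "claim", "is", "dismissed", "with", "costs"],
    "judge_d": ["the", "claim", "is", "allowed", "in", "full"],
}

docs = [TaggedDocument(words=toks, tags=[j]) for j, toks in opinions.items()]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Cluster the opinion vectors to group judges with similar patterns.
vectors = [model.dv[j] for j in opinions]
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(dict(zip(opinions, labels)))
```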
Another function is that word embeddings can capture implicit racial and gender biases in judicial opinions, as measured by the Word Embedding Association Test (WEAT). Word embeddings are powerful because they can represent societal biases in mathematical or diagrammatic form. For instance, Bolukbasi (2016) showed that word embeddings trained on Google News articles exhibit significant gender bias, which can be captured geometrically by a direction in the embedding space (i.e. gender-neutral words are linearly separable from gender-definition words).
Source: https://www.semanticscholar.org/paper/Man-is-to-Computer-Programmer-as-Woman-is-to-Word-Bolukbasi-Chang/ccf6a69a7f33bcf052aa7def176d3b9de495beb7

As such, word embeddings may reflect vector relationships like “man is to programmer as woman is to homemaker”, since the word “man” co-occurs more frequently in the Google News corpora with words like “programmer” or “engineer”, while the word “woman” appears more frequently beside “homemaker” or “nurse”.
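The analogy can be reproduced as vector arithmetic; a sketch assuming the published pre-trained Google News vectors are available locally (the file name and exact neighbours here are illustrative, not guaranteed):

```python
from gensim.models import KeyedVectors

# Assumes the pre-trained Google News vectors have been downloaded.
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# "man is to programmer as woman is to ?" as vector arithmetic:
#   vec(programmer) - vec(man) + vec(woman) ≈ ?
print(model.most_similar(positive=["woman", "programmer"],
                         negative=["man"], topn=3))
# Bolukbasi et al. report that stereotypically female occupations
# such as "homemaker" rank highly in analogy queries like this.
```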
Applied to the legal domain, we can tabulate WEAT scores across judicial opinions, and preliminary research in this field has shown interesting trends, such as (i) male judges showing higher gender bias (i.e. higher WEAT scores) than female judges and (ii) white judges showing lower race bias than black judges. More remains to be explored in this domain.
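For reference, the WEAT statistic itself is straightforward to compute. A sketch of the effect size following Caliskan et al. (2017), where X and Y are lists of target-word vectors (e.g. male vs. female terms) and A and B are lists of attribute-word vectors (e.g. career vs. family words) taken from an embedding trained on the opinions:

```python
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size: X, Y are target-word vectors;
    A, B are attribute-word vectors."""
    def s(w):
        # Differential association of one word with A versus B.
        return (np.mean([cos(w, a) for a in A])
                - np.mean([cos(w, b) for b in B]))
    sx = [s(x) for x in X]
    sy = [s(y) for y in Y]
    # Cohen's-d-style effect size over all target words.
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy)
```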
Improving Legal Research
Source: http://mlwiki.org/index.php/Information_Retrieval

Word embeddings also have fundamental implications for improving the technology behind legal research platforms (known in machine learning parlance as Legal Information Retrieval (LIR) systems).
Currently, most LIR systems (e.g. Westlaw and LexisNexis) are still boolean-indexed systems primarily utilising keyword search functionality.
This means that the system looks for literal matches or variants of the query keywords, usually by using string-based algorithms to measure the similarity between two text strings. However, these types of searches fail to understand the intent behind the solicitor’s query, meaning that search results are often under-inclusive (i.e. missing relevant information that does not contain the keyword itself, only variants of it) or over-inclusive (i.e. returning irrelevant information that merely contains the keyword).
Word embeddings, however, enhance the potential of commercially available semantic-search LIR systems. Because they allow practitioners to capture semantic similarity mathematically, word embeddings can help LIR systems find not only exact matches of the query string, but also results that are relevant or semantically close while differing in certain words.
Source: https://www.ontotext.com/knowledgehub/fundamentals/what-is-semantic-search/

For instance, Landthaler shows that effective results can be produced by first summing the word vectors of each search phrase into a single search-phrase vector. The document is then parsed sequentially by a window of size n (where n is the number of words in the search phrase), and the cosine similarity between the search-phrase vector and each accumulated window vector is calculated. This returns not only exact keyword matches but also semantically related results, which is more intuitive for the user.
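A minimal sketch of that sliding-window search, assuming wv maps tokens to vectors (e.g. a gensim KeyedVectors object) and that all tokens are in-vocabulary:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def phrase_search(query_tokens, doc_tokens, wv, top_k=3):
    """Score every window of len(query_tokens) words in the document
    by cosine similarity against the summed query vector."""
    n = len(query_tokens)
    q = np.sum([wv[t] for t in query_tokens], axis=0)
    scored = []
    for i in range(len(doc_tokens) - n + 1):
        window = doc_tokens[i : i + n]
        w = np.sum([wv[t] for t in window], axis=0)
        scored.append((cosine(q, w), " ".join(window)))
    return sorted(scored, reverse=True)[:top_k]
```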
Source: https://www.lawtechnologytoday.org/2017/11/legal-consumer-research/

This is especially important because research shows that participants using boolean-indexed LIR systems, which search for exact matches of query terms in full-text legal documents, can have recall rates as low as 20% (i.e. only 20% of relevant documents are retrieved by the LIR). On average, however, those participants estimated their retrieval rates to be as high as 75%. This means that solicitors can overlook relevant precedents or case law that might bolster their case simply because the LIR system prioritises string similarity over semantic similarity. Word embeddings hence have the potential to significantly address this shortfall.
Conclusion and Future Directions
Overall, the field of neural word embeddings is fascinating. Not only is the ability to capture semantic context and word relations mathematically intriguing in its own right; word embeddings have also been a hugely important driver behind many LegalTech products on the market.
However, word embeddings are not without limitations, and ML practitioners sometimes turn to newer pre-trained language modelling techniques (e.g. ULMFiT, ELMo, the OpenAI transformer, and BERT) to overcome some of their inherent problems (e.g. assigning a single vector per word form, which presumes monosemy). Nevertheless, word embeddings remain one of the most fascinating NLP topics today, and the move from sparse, frequency-based vector representations to denser, semantically representative vectors is a crucial step in advancing the NLP subdomain and the field of legal AI.
Source: https://towardsdatascience.com/legal-applications-of-neural-word-embeddings-556b7515012f