How I Explained Word Embeddings to My Non-Technical Colleagues

Data Science
Word embeddings.
What are they? What do they look like? How are they made?…
I’ll break it down without lingo, with only a few optional code sketches for the curious.
This is based on my experience explaining machine learning terms to non-technical colleagues.
Understand: Machine learning models only understand numbers
You can’t feed text to an ML model.
Because, in a nutshell, math doesn’t work on text.
You must convert words/sentences/etc… to a list of numbers first.
Word embeddings are one approach to converting text to a machine-readable format.
There are several approaches, including bag-of-words, but we won’t get into those here.
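If you’re curious what “converting text to numbers” looks like at its most naive, here is a tiny sketch; the sentence and vocabulary are made up purely for illustration:

```python
# A toy illustration (made-up sentence and vocabulary): the most naive
# way to turn text into numbers is to give each word an arbitrary ID.
sentence = "the dog is shiny".split()

vocabulary = {"the": 0, "dog": 1, "is": 2, "shiny": 3}
numeric = [vocabulary[word] for word in sentence]

print(numeric)  # [0, 1, 2, 3] -- now it's something math can work on
```

Notice the IDs are arbitrary: “dog” being 1 says nothing about dogs. Word embeddings go further and give each word numbers that carry meaning.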
A word embedding converts words to sequences of numbers
Below are a few words, “bright”, “shiny” and “dog”, translated into numbers with the popular word-embedding library Word2Vec.
Notice “bright” and “shiny” are more similar to each other than to “dog”. When using pre-trained embeddings like Word2Vec, the numeric representation of each word comes included.
You simply convert your words into each given sequence of numbers, then plug them into your model.
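If you’d like to see this for yourself, below is a minimal sketch using the gensim library. I load a small pre-trained GloVe model here because it downloads quickly (an assumption for convenience); the classic pre-trained Word2Vec vectors work the same way:

```python
import gensim.downloader as api

# Load small pre-trained embeddings (GloVe here, for download speed;
# swap in "word2vec-google-news-300" for the classic Word2Vec vectors).
vectors = api.load("glove-wiki-gigaword-50")

# Each word's numeric representation comes included.
print(vectors["bright"][:5])  # first 5 of the 50 numbers for "bright"
print(vectors["dog"][:5])

# "bright" and "shiny" should score as more alike than "bright" and "dog".
print(vectors.similarity("bright", "shiny"))
print(vectors.similarity("bright", "dog"))
```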
What do the values in a word embedding mean?
In isolation, nothing!
But relative to each other, they contain a lot of information.
In the previous Word2Vec example, we can see that at each index, “bright” and “shiny” are more similar to each other than to “dog”.
We call each index a “dimension”. If we plotted D1 and D2 on a 2-D grid, we would see “bright” and “shiny” very close to each other, and “dog” further away.
In the word-embedding approach, words with a similar meaning “should” have similar numeric values.
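To make “similar numeric values” concrete, here is a toy sketch. The 2-dimensional vectors are invented purely for illustration (real embeddings have tens to hundreds of dimensions), and cosine similarity is one common way to measure how close two embeddings are:

```python
import numpy as np

# Made-up 2-D embeddings (dimensions D1 and D2), purely for illustration.
words = {
    "bright": np.array([0.90, 0.80]),
    "shiny":  np.array([0.85, 0.75]),
    "dog":    np.array([0.10, 0.60]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; lower means less alike.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(words["bright"], words["shiny"]))  # ~1.0
print(cosine_similarity(words["bright"], words["dog"]))    # noticeably lower
```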
Why do similar words have similar numeric values?
Good question.
Because a word IS its context.
“context” = “the words used around a word” in the training text used to build the word embedding.
So when deriving numeric values for “beautiful” and “gorgeous” we would likely find similar words around them, and they would get similar numeric values.
A word’s numeric values are a function of the weights in a neural network that has learned to correctly guess a word, given its context (the CBOW approach).
The neural network keeps adjusting its weights until it can take the context of a word and guess the word itself.
At that point, the current weights (think knobs and dials on a machine) become the numeric values.
Similar context is often found around words with similar meaning.
So similar neural network weights will predict similar words.
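For the curious, here is a minimal sketch of that training loop using gensim’s Word2Vec with CBOW. The three-sentence corpus is made up, so the resulting vectors will be poor; real training needs millions of words:

```python
from gensim.models import Word2Vec

# A made-up, tiny corpus -- real embeddings are trained on millions of words.
sentences = [
    ["the", "sun", "is", "bright", "today"],
    ["the", "sun", "is", "shiny", "today"],
    ["my", "dog", "chased", "the", "ball"],
]

# sg=0 selects CBOW: the network learns to guess a word from its context.
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)

# The learned weights become each word's numeric values.
print(model.wv["bright"])
print(model.wv.similarity("bright", "shiny"))
```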
Conclusion
Disclaimer: I’ve made some generalizations and skipped some small steps to make this easier to explain.
That’s it.
Word embeddings are an approach (one of several) for converting text into numbers so computers can process it.
Practical experience (successes and failures) is your best guide as to when you should actually use them, versus other approaches. While very popular, they often underperform more traditional approaches (in my anecdotal experience).
That’s why I always recommend diving in and getting your hands dirty on some code.
Remember that in machine learning, there is no free lunch. We really don’t know what works until we try it!
Translated from: https://towardsdatascience.com/how-i-explained-word-embeddings-to-my-non-technical-colleagues-52ced76cf3bb