All deep learning is statistical model building
Deep learning is often used to make predictions in data-driven analysis. But what do these predictions actually mean?
This post explains how neural networks used in deep learning provide the parameters of a statistical model describing the probability of the occurrence of events.
The occurrence of events and aleatoric uncertainty
Data, observables, events or any other way of describing the things we can see and/or collect is absolute: we roll two sixes on a pair of six-sided dice or we get some other combination of outcomes; we toss a coin 10 times and we get heads each time or we get some other mixture of heads and tails; our universe evolves some way and we observe it, or it doesn’t — and we don’t. We do not know, a priori, whether we will get two sixes with our dice roll or heads each time we toss a coin or what possible universes could exist for us to come into being and observe it. We describe the uncertainty due to this lack of knowledge as aleatoric. It is due to fundamental missing information about the generation of such data — we can never exactly know what outcome we will obtain. We can think of aleatoric uncertainty as not being able to know the random seed of some random number generating process.
We describe the probability of the occurrence of events using a function, P : d ∈ E ↦ P(d) ∈ [0, 1], i.e. the probability distribution function, P, assigns a value between 0 and 1 to any event, d, in the space of all possible events, E. If an event is impossible then P(d) = 0, whilst a certain outcome has a probability P(d) = 1. This probability is additive such that the union of all possible events d ∈ E is certain, i.e. P(E) = 1.
Using a slight abuse of notation we can write d ~ P, which means that some event, d, is drawn from the space of all possible events, E, with a probability P(d). This means that there is a 100 × P(d)% chance that event d is observed. d could be any observation, event or outcome of a process, for example, when rolling n = 2 six-sided dice and obtaining a six with both, d = (d₁ = 0, d₂ = 0, d₃ = 0, d₄ = 0, d₅ = 0, d₆ = 2). We do not know, beforehand, exactly what result we will obtain by rolling these two dice, but we know there is a certain probability that any particular outcome will be obtained. Under many repetitions of the dice roll experiment (with perfectly balanced dice and identical conditions) we should see that the probability of d occurring is P(d) ≈ 1/36. Even without performing many repetitions of the dice roll we could provide our believed estimate of the distribution of how likely we are to see particular outcomes.
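As a quick check of this frequency interpretation, here is a minimal sketch (assuming fair dice and a fixed number of repetitions chosen for illustration) that simulates many repetitions of the two-dice experiment and compares the empirical frequency of d = (0, 0, 0, 0, 0, 2) with 1/36.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixing the seed removes the "unknowable" randomness for the demo
n_repeats = 1_000_000

# roll n = 2 fair six-sided dice in each repetition
rolls = rng.integers(1, 7, size=(n_repeats, 2))

# the event d = (d1=0, ..., d6=2): both dice show a six
two_sixes = np.all(rolls == 6, axis=1)

print("empirical P(d)  :", two_sixes.mean())   # ≈ 0.0278
print("theoretical P(d):", 1 / 36)             # ≈ 0.0278
```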
Statistical models
To make statistical predictions we model the distribution of data using parameterisable distributions, Pₐ. We can think of a as defining a statistical model which contains a description of the distribution of data and any possible unobservable parameters, v ∈ Eₐ, of the model. The distribution function then attributes values of probability to the occurrence of observable/unobservable events Pₐ : (d, v) ∈ (E, Eₐ) ↦ Pₐ(d, v) ∈ [0, 1]. It is useful to note that we can write this joint probability distribution as a conditional statement, Pₐ = Lₐ · pₐ = ρₐ · eₐ. These probability distribution functions are:
The likelihood — Lₐ : (d, v) ∈ (E, Eₐ) ↦ Lₐ(d|v) ∈ [0, 1]
The prior — pₐ : v ∈ Eₐ ↦ pₐ(v) ∈ [0, 1]
The posterior — ρₐ : (d, v) ∈ (E, Eₐ) ↦ ρₐ(v|d) ∈ [0, 1]
The evidence — eₐ : d ∈ E ↦ eₐ(d) ∈ [0, 1]
The introduction of these functions allows us to interpret the probability of observing d and v as the probability of observing d given the value, v, of the model parameters, multiplied by how likely those parameter values are. Equivalently, it is the probability that the model parameters take the value v given that d is observed, multiplied by how likely d is to be observed under the model, i.e. Pₐ(d, v) = Lₐ(d|v) · pₐ(v) = ρₐ(v|d) · eₐ(d).
For the dice roll experiment we could (and do) model the distribution of data using a multinomial distribution, Pₐ(d) = n! ∏ᵢ pᵢ^dᵢ / dᵢ!, where the fixed parameters of the multinomial model are v = {p₁, p₂, p₃, p₄, p₅, p₆, n} = {pᵢ, n | i ∈ [1, 6]}, with pᵢ as the probability of obtaining value i ∈ [1, 6] from a die and n as the number of rolls. If we are considering completely unbiased dice then p₁ = p₂ = p₃ = p₄ = p₅ = p₆ = 1/6. The probability of observing two sixes, d = (d₁ = 0, d₂ = 0, d₃ = 0, d₄ = 0, d₅ = 0, d₆ = 2), in our multinomial model with n = 2 dice rolls can therefore be estimated as Pₐ(d) = 1/36. Since the model parameters, v, are fixed this is equivalent to setting the prior to pₐ = δ(pᵢ − 1/6, n − 2) for i ∈ [1, 6], such that Lₐ = 2 ∏ᵢ (1/6)^dᵢ / dᵢ! or 0.
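A quick numerical check of this model (a minimal sketch using SciPy, with the fair-die parameters assumed above): the multinomial pmf with n = 2 and pᵢ = 1/6 assigns exactly 1/36 to the outcome of two sixes.

```python
from scipy.stats import multinomial

n = 2                      # number of dice rolls
p = [1 / 6] * 6            # fixed, unbiased model parameters v
d = [0, 0, 0, 0, 0, 2]     # observed event: two sixes

model = multinomial(n, p)
print(model.pmf(d))        # 0.02777... = 1/36
```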
Of course we could build a more complex model where the values of the pᵢ depended on other factors, such as the number of surfaces which the dice could bounce off, the strength with which they were thrown, or the speed of each molecule of air which hit the dice at the exact moment that they left our hands, or an infinite number of different effects. In this case the distribution Pₐ would assign probabilities to the occurrence of data, d ~ P, dependent on interactions between unobservable parameters, v, describing such physical effects, i.e. in the multinomial model, the values, v, of model parameters would change the value of the pᵢ describing how likely d is. However, we might not know exactly what values these unobservable parameters have. Therefore Pₐ describes not only an estimation of the true distribution of data, but also its dependence on the unobservable model parameters. We call the conditional distribution function describing the probability of observing data given the unobservable parameters the likelihood, Lₐ. Since the model a describes the entire statistical model, the prior probability distribution, pₐ, of the model parameters, v, is an intrinsic property of the model.
The fact that we do not know the values, v, of the parameters in a model a (and there is even a lack of knowledge about the choice of model itself) introduces a source of uncertainty which we call epistemic — the uncertainty due to things that we could learn about via the support from observing events. So whilst there is an aleatoric uncertainty due to the true random nature of the occurrence of events d from the distribution of data, P, there is also an epistemic uncertainty which comes from modelling this distribution with Pₐ. The prior distribution, pₐ, should not be confused with the epistemic uncertainty, though, since the prior is a choice made in a particular model a. An ill-informed choice of prior distribution (built into the definition of the statistical model) could prevent the model from ever being supported by the data.
For example, for the dice rolling problem, when we build a model, we could decide that our model was certain and that the prior distribution of the possible values of the model parameters is pₐ = δ(pᵢ − 1/6, n − 2) for i ∈ [1, 6]. In this case the epistemic uncertainty would not be taken into account because there is assumed to be nothing that we can learn about. However, if the dice were weighted such that p₁ = 1 and p₂ = p₃ = p₄ = p₅ = p₆ = 0, then we would never get two sixes and there would be no support for our model from the data. Instead, if we choose a different model a′ which is a multinomial distribution but where the prior distribution on the possible values of the pᵢ is such that they can take any value from 0 to 1 under the condition that ∑ᵢ pᵢ = 1, then there is no assumed knowledge (within this particular model). There is, therefore, a very large epistemic uncertainty due to our lack of knowledge, but this uncertainty can be reduced via inference of the possible parameter values when observing the available data.
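To illustrate how observing data reduces this epistemic uncertainty, here is a minimal sketch, assuming a flat Dirichlet prior over the pᵢ (one convenient way to encode "any values between 0 and 1 that sum to 1") and an invented set of roll counts. The Dirichlet prior is conjugate to the multinomial likelihood, so the posterior is again a Dirichlet with updated concentration parameters, and its spread shrinks as more rolls are observed.

```python
import numpy as np

# flat Dirichlet prior over (p1, ..., p6): alpha_i = 1 encodes "no assumed knowledge"
alpha_prior = np.ones(6)

# hypothetical observed counts from 60 rolls of a suspicious die
counts = np.array([25, 10, 9, 8, 5, 3])

# conjugate update: the posterior is Dirichlet(alpha_prior + counts)
alpha_post = alpha_prior + counts
a0 = alpha_post.sum()

posterior_mean = alpha_post / a0
posterior_std_p1 = np.sqrt(alpha_post[0] * (a0 - alpha_post[0]) / (a0 ** 2 * (a0 + 1)))

print("posterior mean of p:", np.round(posterior_mean, 3))
print("posterior std of p1:", posterior_std_p1)   # shrinks towards 0 as more data arrive
```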
Subjective inference
We can learn about which values of the model parameters are supported by the observed data using subjective inference (often called Bayesian inference) and hence reduce the epistemic uncertainty within our choice of model. Using the two equalities of the conditional expansion of the joint distribution, Pₐ, we can calculate the posterior probability that, in a model a, the parameters have a value v when some d ~ P has been observed as ρₐ(v|d) = Lₐ(d|v) pₐ(v) / eₐ(d).
This posterior distribution could now be used as the basis of a new model, a′, with joint probability distribution Pₐ′, where pₐ′ = ρₐ, i.e. Pₐ′ = Lₐ · pₐ′. Note that the form of the model hasn’t changed, just our certainty in the model parameters due to the support by data — we can use this new model to make more informed predictions about the distribution of data.
We can use MCMC techniques, amongst a host of other methods, to characterise the posterior distribution, allowing us to reduce the epistemic uncertainty in this assumed model. However, rightly or wrongly, people are often interested in just the best-fit distribution to the data, i.e. finding the set of v for which Pₐ is most similar to P.
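As a rough illustration of what "characterising the posterior" means in practice, the sketch below runs a simple Metropolis sampler over the probability p₆ of rolling a six, using a binomial likelihood for a hypothetical observation of 3 sixes in 60 rolls and a flat prior. Real applications would use more careful proposals, convergence diagnostics and burn-in; this is only a toy example under those assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

n_rolls, n_sixes = 60, 3          # hypothetical data: 3 sixes in 60 rolls

def log_posterior(p):
    # flat prior on [0, 1] plus binomial log-likelihood (up to constants)
    if not 0.0 < p < 1.0:
        return -np.inf
    return n_sixes * np.log(p) + (n_rolls - n_sixes) * np.log(1.0 - p)

samples, p = [], 0.5              # start the chain at p = 0.5
for _ in range(20_000):
    proposal = p + 0.05 * rng.normal()                              # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(p):
        p = proposal                                                # accept
    samples.append(p)

samples = np.array(samples[5_000:])                                 # discard burn-in
print("posterior mean of p6:", samples.mean())
print("posterior std of p6 :", samples.std())
```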
Maximum likelihood and maximum a posteriori estimation
To fit the model to the true distribution of data we need a measure of distance between the two distributions. The measure most commonly used is the relative entropy (also known as the Kullback-Leibler (KL) divergence), D(P ∥ Pₐ) = E_{d~P}[ln P(d) − ln Pₐ(d, v)].
The relative entropy describes the information lost due to approximating P(d) with Pₐ(d, v). There are some interesting properties of the relative entropy which prevent it from being ideal as a distance measure. For one, it isn’t symmetric, D(P ∥ Pₐ) ≠ D(Pₐ ∥ P), and thus it cannot be used as a metric. We can take the symmetric combination of D(Pₐ ∥ P) and D(P ∥ Pₐ) but problems still remain, such as the fact that P and Pₐ have to be defined over the same domain, E. Other measures, such as the earth mover distance, may have the edge here since it is symmetric and can be defined on different coordinate systems (and can now be well approximated using neural networks when used as arbitrary functions rather than being used for predicting model parameters). However, we still most often consider the relative entropy. Rewriting the relative entropy we see that we can express the measure of similarity as two terms, D(P ∥ Pₐ) = −H(P) + H(P, Pₐ).
The first term is the negative entropy of the distribution of data, i.e. the expected amount of information which could be obtained by observing an outcome, directly analogous to entropy in statistical thermodynamics. The second term is the cross entropy, H(P, Pₐ), which quantifies the amount of information needed to distinguish the distribution P from another distribution Pₐ, i.e. how many draws of d ~ P would be needed to tell that d was drawn from P and not from Pₐ. Noticing that there is only one set of free parameters, v, in this form of the relative entropy, we can attempt to bring Pₐ as close to P as possible by minimising the relative entropy with respect to these parameters, i.e. finding argmin_v D(P ∥ Pₐ) = argmin_v H(P, Pₐ), since the entropy of P does not depend on v.
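For the discrete dice example this decomposition is easy to verify numerically. The snippet below (a minimal sketch, with an arbitrary weighted-die distribution chosen purely for illustration) computes the entropy, cross entropy and relative entropy between a "true" distribution P and the unbiased model Pₐ.

```python
import numpy as np

P  = np.array([0.25, 0.20, 0.20, 0.15, 0.12, 0.08])  # assumed "true" weighted-die distribution
Pa = np.full(6, 1 / 6)                                # model: unbiased die

entropy       = -np.sum(P * np.log(P))        # H(P)
cross_entropy = -np.sum(P * np.log(Pa))       # H(P, Pa)
kl            =  np.sum(P * np.log(P / Pa))   # D(P || Pa)

print(kl, cross_entropy - entropy)            # the two agree: D = -H(P) + H(P, Pa)
```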
How do we actually do this though? We might not have access to the entire distribution of possible data to do the integral. Instead, consider a sampling distribution, s : d ∈ S ↦ s(d) ∈ [0, 1], where s(d) is the normalised frequency of events from a sampling space, S ⊆ E, containing N conditionally independent values of d. In this case the integral becomes a sum and H(P, Pₐ) ≈ −∑ s(d) log Pₐ(d, v). Using the conditional relations we then write Pₐ = Lₐ · pₐ as before, and as such the cross entropy is H(P, Pₐ) ≈ −∑ s(d) log Lₐ(d|v) − ∑ s(d) log pₐ(v). Since the prior is independent of the data it just adds an additive constant to the cross entropy.
On a different note, we can write the likelihood as the product of probabilities given the frequency of occurrences of data from the sampling distribution, i.e. the likelihood of the whole sample is ∏ Lₐ(d|v)^(N s(d)), taken over the events d in S, so that its logarithm is N ∑ s(d) log Lₐ(d|v).
So, besides the additive constant due to the prior, the cross entropy is directly proportional to the negative logarithm of the likelihood of the data in the model. This means that maximising the logarithm of the likelihood of the data with respect to the model parameters, assuming a uniform prior for all v (or ignoring the prior), is equivalent to minimising the cross entropy, which can be interpreted as minimising the relative entropy, thus bringing Pₐ as close as possible to P. Ignoring the second term in the cross entropy provides a non-subjective maximum likelihood estimate of the parameter values (non-subjective means that we neglect any prior knowledge of the parameter values). If we take the prior into account, however, we recover the most simple form of subjective inference, maximum a posteriori (MAP) estimation, v_MAP = argmax_v [N ∑ s(d) log Lₐ(d|v) + log pₐ(v)],
which describes the set of parameter values, v, which provide a Pₐ as close as possible to P. A word of caution must be emphasised here — since we use a sampling distribution, s, maximising the likelihood (or posterior) actually provides us with the distribution Pₐ which is closest to the sampling distribution, s. If s is not representative of P, then the model will not necessarily be a good fit of P. A second word of caution — although Pₐ may be as close as possible to P (or actually s) with this set of v, the mode of the likelihood (or posterior) could actually be very far from the high density regions of the distribution and therefore not be representative of the more likely model parameter values at all. This is avoided when considering the entire posterior distribution using MCMC techniques or similar. Essentially, using maximum likelihood or maximum a posteriori estimation, the epistemic error will be massively underestimated without taking into account the bulk of the prior (or posterior) probability density.
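As a concrete example of the difference between the two point estimates (a minimal sketch, assuming a Dirichlet prior with concentration 2 for the die probabilities and a small invented set of rolls, for which the MLE and MAP estimates have simple closed forms): with few observations the prior noticeably pulls the MAP estimate away from the raw frequencies, and both remain point estimates that ignore the width of the posterior.

```python
import numpy as np

counts = np.array([3, 1, 0, 2, 0, 0])   # hypothetical data: only 6 rolls observed
N = counts.sum()
alpha = np.full(6, 2.0)                  # assumed Dirichlet prior concentration

p_mle = counts / N                                        # maximum likelihood estimate
p_map = (counts + alpha - 1) / (N + alpha.sum() - 6)      # maximum a posteriori estimate (posterior mode)

print("MLE:", np.round(p_mle, 3))   # assigns zero probability to unseen faces
print("MAP:", np.round(p_map, 3))   # the prior keeps every face possible
```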
Model comparison
Note that there is no statement so far saying whether a model a is actually any good. We can measure how good the fit of the model is to data d by calculating the evidence, which is equivalent to integrating over all possible values of the model parameters, i.e. eₐ(d) = ∫ Lₐ(d|v) pₐ(v) dv.
By choosing a different model a′ with its own set of parameters u ∈ Eₐ′ we could then come up with a criterion which describes whether model a or a′ better fits the data, d. Note that this criterion isn’t necessarily well defined. Do we prefer a model which fits the data exactly but has a semi-infinite number of parameters, or do we prefer an elegant model with few parameters but with a less good fit? Until neural networks came about we normally chose the model with the fewest parameters which fit the data well and generalised to make consistent predictions for future events — but this is still up for debate.
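To make the evidence integral concrete, the sketch below (my own illustration, reusing the dice models from earlier) compares the evidence for observing two sixes under the fixed fair-die model a, where the delta-function prior makes the evidence equal to the likelihood 1/36, with the evidence under the flexible model a′ with a flat Dirichlet prior over the pᵢ, estimating the latter integral by Monte Carlo averaging of the likelihood over prior draws.

```python
import numpy as np
from scipy.stats import dirichlet, multinomial

rng = np.random.default_rng(2)
d, n = [0, 0, 0, 0, 0, 2], 2                 # observed: two sixes in two rolls

# model a: parameters fixed at p_i = 1/6, so the evidence is just the likelihood
evidence_a = multinomial(n, [1 / 6] * 6).pmf(d)

# model a': flat Dirichlet prior over p; evidence = ∫ L(d|p) p(p) dp ≈ mean of L over prior samples.
# For this particular event the likelihood reduces to p6**2 (both rolls must be sixes).
p_samples = dirichlet([1.0] * 6).rvs(size=200_000, random_state=rng)
evidence_a_prime = np.mean(p_samples[:, 5] ** 2)

print("evidence, fair-die model a :", evidence_a)         # 1/36 ≈ 0.028
print("evidence, flexible model a':", evidence_a_prime)   # ≈ 1/21 ≈ 0.048
```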
Everything described so far is none other than the scientific method. We observe some data and want to model how likely any future observations are. So we build a parameterised model describing the observation and how likely it is to occur, learn about the possible values of the parameters of that model and then improve the model based on some criterion like Occam’s razor, or whatever.
Neural networks as statistical models
Deep learning is a way to build models of the distribution of data
No matter the objective — supervised learning, classification, regression, generation, etc — deep learning is just building models for the distribution of data. For supervised learning and other predictive methods we consider our data, d, as a pair of inputs and targets, d = (x, y). For example, our inputs could be pictures of cats and dogs, x, accompanied with labels, y. We might want to then make a prediction of label y′ for a previously unseen image x′ — this is equivalent to making a prediction of the pair of corresponding input and targets, d, given that part of d is known.
So, we want to model the distribution, P, using a neural network, f : (x, v) ∈ (E, Eₐ) ↦ g = f(x, v) ∈ G, where f is a function parameterised by weights, v, that takes an input x and outputs some values g from a space of possible network outputs G. The form of the function, f, is described by the hyperparameterisation, a, and includes the architecture, the initialisation, the optimisation routine, and most importantly, the loss or cost function, Λₐ : (d, v) ∈ (E, Eₐ) ↦ Λₐ(y|x, v) ∈ K · [0, 1]. The loss function describes an unnormalised measure for the probability of the occurrence of data, d = (x, y), with unobservable parameters, v. That is, using a neural network to make predictions for y when given x is equivalent to modelling the probability, P, of the occurrence of data, d, where the shape of the distribution is defined by the form and properties of the network a and the values of its parameters v. We often distinguish classical neural networks (which make predictions of targets) from neural density estimators (which estimate the probability of inputs, i.e. the space G = [0, 1]) — these are, however, performing the same job, but the distribution from the classical neural network can only be evaluated using the loss function (and is normally not normalised to integrate to 1 like a true probability). This illuminates the meaning of the outputs or predictions of a classical neural network — they are the values of the parameters controlling the shape of the probability distribution of data within our model (defined by the choice of hyperparameters).
As an example, when performing regression using the mean squared error as a loss function, the output of the neural network, g = f(x, v), is equivalent to the mean of a generalised normal distribution with unit variance. This means that, when fed with some input x, the network provides an estimate of the mean of the possible values of y via the values, v, of the parameters, where the possible values of y are drawn from a generalised normal with unit variance, i.e. y ~ N(g, I). Note that this model for the values of y may not be a good choice in any way. Another example is when performing classification using a softmax output, where we can interpret the outputs directly as the pᵢ of a multinomial distribution, and the unobservable parameters of the model, v, affect the value of these outputs in a similar way to how parameters in a physical model affect the probability of the occurrence of different data.
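The correspondence between common loss functions and likelihoods can be checked directly. The snippet below (a minimal sketch with invented target and output values) shows that the mean squared error differs from the unit-variance Gaussian negative log-likelihood only by a constant, and that categorical cross entropy on softmax outputs is exactly the negative log-likelihood of a single-draw multinomial model.

```python
import numpy as np

# regression: MSE vs unit-variance Gaussian negative log-likelihood
y, g = 1.3, 0.9                                   # target and network output (the predicted mean)
mse = 0.5 * (y - g) ** 2
gauss_nll = 0.5 * (y - g) ** 2 + 0.5 * np.log(2 * np.pi)
print(gauss_nll - mse)                            # a constant, independent of g

# classification: cross entropy vs multinomial log-likelihood with n = 1
logits = np.array([2.0, -1.0, 0.5])
p = np.exp(logits) / np.exp(logits).sum()         # softmax outputs = the p_i of the model
y_onehot = np.array([0, 0, 1])
cross_entropy = -np.sum(y_onehot * np.log(p))
print(cross_entropy, -np.log(p[2]))               # identical
```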
With this knowledge at hand, we can then understand the optimisation of network parameters (known as training) as modelling the distribution of data, P. Usually, when training a network classically, our choice of model allows any values of v ~ pₐ = Uniform[−∞, ∞] (although we tend to draw their initial values from some normal distribution). In essence, we ignore any prior information about the values of the weights because we do not have any prior knowledge. In this case, all the information about the distribution of data comes from the likelihood, Lₐ. So, to train, we perform maximum likelihood estimation of the network parameters, which minimises the cross entropy between the distribution of data and the estimated distribution from the neural network, and hence minimises the relative entropy. To actually evaluate the logarithm of the likelihood for classical neural networks with parameter values v, we can expand the likelihood of some observed data, d, as Lₐ(d|v) ≈ Λₐ(y|x, v) s(x), where s(x) is the sampling distribution of x, equivalent to assigning normalised frequencies to the number of times x appears in S. Evaluating Λₐ(y|x, v) at every y in the sampling distribution when given the corresponding x, taking the logarithm of this probability and summing the result therefore gives us the logarithm of the likelihood of the sampling distribution. Maximising this with respect to the network parameters, v, therefore gives us the distribution, Pₐ, which is closest to s (which should hopefully be close to P).
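To make the link between training and maximum likelihood explicit, here is a deliberately tiny sketch (my own, with synthetic data): a one-parameter "network" that predicts a constant mean is trained by gradient descent on the mean squared error, and its weight converges to the sample mean, which is exactly the maximum likelihood estimate under the unit-variance Gaussian model described above.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=2.0, scale=1.0, size=1_000)   # synthetic targets, playing the role of s(d)

v = 0.0                                          # the single network weight, g = f(x, v) = v
lr = 0.1
for _ in range(200):
    grad = np.mean(v - y)                        # gradient of the mean squared error loss
    v -= lr * grad                               # gradient descent = iterative likelihood maximisation

print("trained weight v      :", v)
print("maximum likelihood fit:", y.mean())       # the two coincide
```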
And that’s it. Once trained, a neural network provides the parameters of a statistical model which can be evaluated to find the most likely values of predictions. If the loss function gives an unnormalised likelihood, methods like MCMC can be used to obtain samples which characterise the distribution of data.
A few cautions must be considered. First, the choice of loss function defines the statistical model — if there is no way that the loss function describes the distribution of data, then the statistical model for the distribution of data will be wrong. One way of avoiding this assumption about the distribution is to consider loss functions beyond the mean squared error, categorical cross entropy or absolute error — one ideal choice would be the earth mover distance, which can be well approximated by specific types of neural networks and provides an objective which simulates the optimal transport plan between the distribution of data and the statistical model, thus providing an unassumed form for Pₐ. Another thing to note is that a statistical model using a neural network is overparameterised. A model of the evolution of the universe needs only 6 parameters (on a good day) — whilst a neural network would use millions of unidentifiable parameters for much simpler tasks. When doing model selection where model elegance is sought after, neural networks will almost always lose. Finally, networks are made to fit data, based on data — if there is any bias in the data, d ∈ S, i.e. s is not similar to P, that bias will be prominent. Physical models can avoid these biases by building in intuition. In fact, the same can be done with neural networks too, but at the expense of a lot more brain power and a lot more time spent writing code than picking something blindly.
So deep learning, using a neural network and a loss function, is equivalent to building a parameterised statistical model describing the distribution of data.
Tom Charnock is an expert in statistics and machine learning. He is currently based in Paris and working on solving outstanding issues in the statistical interpretability of models for machine learning and artificial intelligence. As an international freelance consultant, he provides practical solutions for problems related to complex data analysis, data modelling and next-generation methods for computer science. His roles include one-to-one support, global collaboration and outreach via lectures, seminars, tutorials and articles.
Translated from: https://medium.com/@tom_14692/all-deep-learning-is-statistical-model-building-fc310328f07