Cost Function
First, a question that bothered me for a long time: in Andrew Ng's machine learning course, logistic regression uses the cross-entropy cost function, but when it comes to the derivative, every slide and tutorial I googled just differentiates the log directly. This confused me for quite a while, until I learned that in English-language materials log is taken to mean ln by default. So whenever you see log in machine learning, mentally translate it to ln and read it that way.
How do we fit the parameters theta for logistic regression? In particular, I'd like to define the optimization objective, or the cost function, that we'll use to fit the parameters.
Here's the supervised learning problem of fitting a logistic regression model.
x is an (n+1)-dimensional feature vector, h(x) is the hypothesis, and the parameters of the hypothesis are the θ over here.
Because this is a classification problem, our training set has the property that every label y is either 0 or 1.
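The hypothesis here is the sigmoid function applied to the linear combination θᵀx:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$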
Back when we were developing the linear regression model, we used the following cost function.
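For reference, that squared-error cost is:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$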
Now, this cost function worked fine for linear regression, but here we're interested in logistic regression.
For logistic regression, if we plug the hypothesis into this squared-error cost, it would be a non-convex function of the parameters θ. Here is what I mean by non-convex. We have some cost function J(θ), and for logistic regression the function h here has a nonlinearity — it is the sigmoid function — so it's a pretty complicated nonlinear function. And if you take the sigmoid function and plug it in here, J(θ) looks like:
J(θ) can look like a function with many local optima, and the formal term for this is that it is a non-convex function. If you were to run gradient descent on this sort of function, it is not guaranteed to converge to the global minimum. In contrast, what we would like is a cost function J(θ) that is convex — a single bow-shaped function — so that if we run gradient descent, we are guaranteed it will converge to the global minimum. The problem with using the squared cost is that, because of the very nonlinear sigmoid function that appears in the middle, J(θ) ends up being non-convex if you define it as the squared cost. So what we would like to do instead is come up with a different cost function that is convex, so that we can apply an algorithm like gradient descent and be guaranteed to find the global minimum.
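As a quick numerical check (not from the course; a minimal sketch with made-up one-dimensional data), you can evaluate the squared-error cost of the sigmoid hypothesis over a grid of θ values and inspect its shape:

```python
import numpy as np

# Made-up 1-D data with 0/1 labels, purely for illustration.
x = np.array([-3.0, -1.0, 0.5, 2.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0])  # one deliberately "noisy" label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_cost(theta):
    """Squared-error cost with a sigmoid hypothesis (the non-convex choice)."""
    h = sigmoid(theta * x)
    return np.mean(0.5 * (h - y) ** 2)

thetas = np.linspace(-10.0, 10.0, 201)
costs = [squared_cost(t) for t in thetas]
# Plotting `costs` against `thetas` shows a curve with flat plateaus that is
# not a single convex bowl; with other data it can develop multiple local minima.
```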
------------------------------ Logistic regression ------------------
Here is the cost function that we're going to use for logistic regression.
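Per training example, the cost is:

$$
\mathrm{Cost}(h_\theta(x), y) =
\begin{cases}
-\log\left(h_\theta(x)\right) & \text{if } y = 1 \\
-\log\left(1 - h_\theta(x)\right) & \text{if } y = 0
\end{cases}
$$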
This is the cost, or the penalty, that the algorithm pays.
When y = 1, the cost is -log(h(x)): the curve falls from +∞ as h(x) approaches 0 down to 0 at h(x) = 1.
Now, this cost function has a few interesting and desirable properties. First you notice that:
If the prediction is right, the cost is 0; if the prediction is wrong, the cost grows, blowing up toward infinity as the prediction becomes confidently wrong.
If y = 1 and h(x) = 1 — in other words, if the hypothesis predicts exactly 1 and y is indeed 1 — then the cost is 0. And that's what we'd like it to be: if we correctly predict the output y, the cost is 0.
But notice also that as h(x) approaches 0 — as the output of the hypothesis approaches 0 — the cost blows up and goes to infinity. This captures the intuition: if the hypothesis outputs 0, it is saying that the probability of y = 1 is 0. That is like telling a patient, "the probability that your tumor is malignant (y = 1) is 0 — it cannot possibly be malignant." If the tumor then really does turn out to be malignant (y = 1), even though we told the patient with complete certainty that it was impossible, we penalize the learning algorithm with a very large cost; the cost goes to infinity when y = 1 but h(x) = 0.
------------------ The above is the case y = 1.
What does the cost function look like when y = 0? It is -log(1 - h(x)), which is 0 when h(x) = 0 and blows up as h(x) approaches 1.
If y turns out to be 0, but we predicted y = 1 with almost complete certainty (probability 1), then we end up paying a very large cost.
Conversely, if h(x) = 0 and y = 0, then the hypothesis nailed it: we predicted y = 0, and y did turn out to be 0, so at that point the cost is 0 (the origin in the figure above).
------------------------- The above defines the cost for a single training example. The cost function we have chosen gives us a convex optimization problem: the overall cost function J(θ) will be convex and free of local optima.
We take the cost function for a single training example and develop it further to define the cost function for the entire training set.
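Writing it out in the compressed form used in the course, the cost over all m training examples is:

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)}\log h_\theta(x^{(i)}) + \bigl(1-y^{(i)}\bigr)\log\left(1 - h_\theta(x^{(i)})\right) \right]
$$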
Conjugate gradient, BFGS, and L-BFGS are examples of more sophisticated optimization algorithms; they need a way to compute J(θ) and a way to compute the derivatives, and can then use more sophisticated strategies than gradient descent to minimize the cost function.
The details of exactly what these three algorithms do are well beyond the scope of this course.
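As an illustration only (not from the course), here is a minimal sketch of handing J(θ) and its gradient to an off-the-shelf optimizer; the toy data and the choice of SciPy's L-BFGS-B are assumptions made for the example:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Cross-entropy cost J(theta) and its gradient for logistic regression."""
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    m = len(y)
    J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    grad = X.T @ (h - y) / m
    return J, grad

# Toy data: X already contains a leading column of 1s for the intercept term.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -1.0], [1.0, 2.5]])
y = np.array([1.0, 0.0, 0.0, 1.0])

res = minimize(cost_and_grad, x0=np.zeros(X.shape[1]),
               args=(X, y), jac=True, method="L-BFGS-B")
print(res.x)  # fitted parameters theta
```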
-------- How to get logistic regression to work for multi-class classification problems --- an algorithm called one-versus-all classification.
What is a multi-class classification problem?
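Here is a minimal sketch of the one-versus-all idea (the toy data and helper names are made up for illustration): train one binary logistic regression classifier per class, treating that class as 1 and everything else as 0, then predict the class whose classifier outputs the highest probability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y_binary, lr=0.1, iters=5000):
    """Plain gradient descent on the logistic regression cost for one binary problem."""
    theta = np.zeros(X.shape[1])
    m = len(y_binary)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta -= lr * X.T @ (h - y_binary) / m
    return theta

def one_vs_all(X, y, num_classes):
    """Return one theta vector per class (class k vs. the rest)."""
    return np.array([train_binary(X, (y == k).astype(float)) for k in range(num_classes)])

def predict(X, all_theta):
    """Pick the class whose classifier is most confident."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)

# Tiny made-up example: 3 classes, two features plus an intercept column of 1s.
X = np.array([
    [1, 0.1, 0.2], [1, 0.3, 0.1],   # class 0
    [1, 2.0, 0.1], [1, 2.2, 0.3],   # class 1
    [1, 0.2, 2.1], [1, 0.1, 1.9],   # class 2
])
y = np.array([0, 0, 1, 1, 2, 2])
all_theta = one_vs_all(X, y, num_classes=3)
print(predict(X, all_theta))  # should recover [0, 0, 1, 1, 2, 2]
```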
The derivation of the logistic regression loss function, and of the derivative of that loss function, goes as follows.
Remember a few formulas:
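Presumably these are the standard identities for the sigmoid and the natural logarithm:

$$
g(z) = \frac{1}{1 + e^{-z}}, \qquad g'(z) = g(z)\bigl(1 - g(z)\bigr), \qquad \frac{d}{dz}\ln z = \frac{1}{z}
$$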
With these formulas and properties understood, the differentiation is straightforward.
As noted at the start, log here means ln by default, so there is nothing puzzling in the derivation below.
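Carrying the differentiation through with those identities gives the familiar gradient:

$$
\frac{\partial}{\partial \theta_j} J(\theta)
= -\frac{1}{m}\sum_{i=1}^{m}\left( \frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1-y^{(i)}}{1-h_\theta(x^{(i)})} \right) h_\theta(x^{(i)})\bigl(1-h_\theta(x^{(i)})\bigr)\, x_j^{(i)}
= \frac{1}{m}\sum_{i=1}^{m}\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
$$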
Summary