Gradient Descent and Chain-Linked Systems
On Broken Incentives and Deadlock.
By Shawn Jain and Blake Elias [Shawn Jain is an AI Resident at Microsoft Research. Blake Elias is a Researcher at the New England Complex Systems Institute.]
Origins
Several years ago, Blake read “Good Strategy / Bad Strategy: The Difference and Why It Matters”, a book on business and management strategy. Chapter 8, “Chain-Link Systems”, describes situations in which a system’s performance is limited by its weakest sub-unit, or “link”:
There are portions of organizations, and even of economies, that are chain-linked. When each link is managed somewhat separately, the system can get stuck in a low-effectiveness state. That is, if you are in charge of one link of the chain, there is no point in investing resources in making your link better if other link managers are not.
To make matters even more difficult, striving for higher quality in just one of the linked units may make matters worse! Higher quality in a unit requires investments in better resources and more expensive inputs, including people. Since these efforts to improve just one linked unit will not improve the overall performance of the chain-linked system, the system’s overall profit actually declines. Thus, the incentive to improve each unit is dulled. [1]
We wondered, how can this idea be made computational? If an organization were run by one or more AI agents, would it be capable of getting past these limitations?
Our intuition is that computation and AI should be able to solve problems where fragmented human organizations get stuck. On the other hand, today’s AI is quite brittle. One failure mode is the existence of local optima. It fails to generalize outside its training distribution, and does not form logical representations.
Problem Setup
For a concrete example, consider an assembly line composed of two stations, A and B. Station A takes raw materials and produces a half-finished widget, at a cost of $α per widget. Station B takes half-finished widgets from Station A and turns them into finished widgets, at a cost of $β per widget. The factory earns $γ per finished widget.
The owner wants to earn as much profit as possible. To align incentives, he tries tying station managers’ bonuses to the factory’s profits. The owner hopes to optimize the rate of production at each station to maximize his profit. (Note: assume the cost per widget at each station is fixed.)
The managers can raise or lower only their station’s output, incurring a cost of $α or $β per widget. They do not coordinate.
But there’s one problem. If the manager of station A increases his station’s output, while station B’s output remains the same, the firm actually loses money. This is because station A spent more to increase its output, but the factory overall produced the same total number of widgets (because station B couldn’t keep up). This is what the author calls a “chain-linked system”, because the output is determined by the weakest link. If Station B is behind, there’s no point in Station A increasing its output — in fact, that would bring a net loss to the firm. Stations A and B have to improve jointly in order for profits to increase. But as a manager, if you only direct your single station, you’re not incentivized to improve at all (since you’re just spending the company’s money with no return) — and so deadlock occurs.
Could the owner set production rates directly instead of relying on this bonus plan? Let’s see if optimization can solve this problem.
Define the following variables:
a: the output of station A
b: the output of station B
α: cost per unit output of station A
β: cost per unit output of station B
γ: sale price for a completed widget
We produce min(a, b) completed widgets (since every widget requires a unit of output from both stations A and B). Since we sell each widget for a price of γ, we will bring in revenue according to:

R(a, b) = min(a, b) * γ,
Our cost will be:
C(a, b) = a * α + b * β
We are then trying to maximize profit:
P(a, b) = R(a, b) - C(a, b) = min(a, b) * γ - [a * α + b * β]
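To make the deadlock concrete, take the values used in the experiments below (α = β = 1, γ = 3) and start from (a, b) = (1, 1), where profit is P(1, 1) = 3 - 2 = 1. If station A alone raises output to a = 2, profit falls to P(2, 1) = 3 - 3 = 0, because revenue is still capped at min(a, b) = 1 widget. A joint increase is what pays off: P(2, 2) = 6 - 4 = 2.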
We then define a “loss” function,
L(a, b) = -P(a, b)
to convert the problem into a minimization that can be solved with standard optimizers (e.g. gradient descent).
Note that increasing a alone, or b alone, decreases profit. But, if we increase a and b together, profit increases. Consider the gradient dL/da. This assumes all variables except a are held constant. The same holds for dL/db — we only get to vary b. However, we need to increase both a and b simultaneously to improve the loss function. Since SGD is based on gradient computations only, will it successfully optimize in this scenario? In other words, how does SGD deal with co-varying parameters?
This is also not a traditional supervised learning problem; although we have two input variables, a and b, we do not have a target output variable. We simply seek to minimize the loss by setting a and b to the best values. Therefore, this is purely an optimization problem — more along the lines of an operations research (OR) problem than a machine learning (ML) problem.
Experimental Setup
We translate the above problem setup into PyTorch code as follows. We apply ReLU to a and b to enforce them to be non-negative. (Without this, the optimizer pushes a and b towards negative infinity, to incur negative cost, as a “loophole” allowing profit.) Here, we take α = β = 1, and γ = 3.
Compute gradients using autograd & implement SGD from scratch:
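A minimal sketch of this setup follows (the learning rate, step count, and initialization are our assumptions, not values taken from the post; the authors’ notebook, linked below, has the exact version):

```python
import torch

alpha, beta, gamma = 1.0, 1.0, 3.0  # costs and sale price from above

# Production rates for stations A and B, as learnable parameters.
a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

lr = 0.01  # assumed learning rate

def loss(a, b):
    # ReLU keeps the effective outputs non-negative, closing the
    # "negative cost" loophole described above.
    a_pos, b_pos = torch.relu(a), torch.relu(b)
    revenue = torch.min(a_pos, b_pos) * gamma
    cost = a_pos * alpha + b_pos * beta
    return -(revenue - cost)  # L(a, b) = -P(a, b)

for step in range(500):
    L = loss(a, b)
    L.backward()                # autograd computes dL/da and dL/db
    with torch.no_grad():       # SGD update written from scratch
        a -= lr * a.grad
        b -= lr * b.grad
        a.grad.zero_()
        b.grad.zero_()
```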
Our notebook is here.
Results
We show the loss landscape as a background colormap, and the trajectory the optimizer took (in red).
Figure 1. Trajectory when initialized to (a, b) = (1, 1).

Figure 2. Trajectory when initialized to (a, b) = (3, 5). [Note that the Adam optimizer would prevent the ‘waviness’ because momentum would ‘cancel out’ the lateral movement.]

Figure 3. Trajectory when initialized to (a, b) = (50, 2). Notice the non-intuitive behavior — SGD simultaneously decreases a while increasing b. Figures 1 and 2 match our intuitions more closely — simultaneously increasing a and b.

Discussion
The optimizer worked, in that it discovered the correct pattern of increasing both a and b. It was not obvious that this would work — we had a concern that the optimizer might get stuck when a = b, resulting in deadlock. Let’s analyze the underlying math, and see why gradient descent is able to avoid this scenario.
What would gradient descent do?

To remind ourselves, we are optimizing:

min over (a, b) of L(a, b)

where:

L(a, b) = -P(a, b) = a * α + b * β - min(a, b) * γ

The gradients are:

dL/da = α - γ, if a < b; α, otherwise
dL/db = β - γ, if b < a; β, otherwise
What about when a = b = 1: will there be deadlock? The equations above would give dL/da = α = 1 and dL/db = β = 1. This would indicate that we should decrease both a and b by the same amount, since we step in the direction of the negative gradient. If we were to do this, we would see a and b continually shrink together, and eventually both reach 0.
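A tiny simulation of this shrink-to-zero dynamic, using the hand-derived piecewise gradients above (our own illustration, not PyTorch’s behavior):

```python
alpha, beta, gamma = 1.0, 1.0, 3.0
a, b, lr = 1.0, 1.0, 0.1

for _ in range(20):
    # Hand-derived piecewise gradients: the a == b tie falls into the
    # "otherwise" branch for both stations, so each sees only its cost.
    dL_da = alpha - gamma if a < b else alpha
    dL_db = beta - gamma if b < a else beta
    # Clamp at zero, mirroring the ReLU in the experimental setup.
    a = max(0.0, a - lr * dL_da)
    b = max(0.0, b - lr * dL_db)

print(a, b)  # both have shrunk together to 0.0
```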
What would PyTorch do?
PyTorch’s evaluation on (a, b) = (1, 1) gives dL/da = 1 (i.e. dL/da = α, as expected). But, it gives dL/db = -2 (i.e. β - γ, taken from the b < a case, when in fact b = a). Although torch.min(a, b) is not technically differentiable, PyTorch ‘breaks the tie’ — avoiding the deadlock scenario.
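A quick autograd check of this tie-breaking (the gradient values are those reported above; note that PyTorch’s handling of ties in torch.min may differ across versions):

```python
import torch

alpha, beta, gamma = 1.0, 1.0, 3.0
a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

L = a * alpha + b * beta - torch.min(a, b) * gamma
L.backward()

print(a.grad.item(), b.grad.item())   # reported above: 1.0 and -2.0
print((a.grad + b.grad).item())       # alpha + beta - gamma = -1.0
```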
What would we do intuitively?
How would we intuitively solve this problem? Suppose we start at v = <a, b> = <1, 1>. The best possible update would seem to be: v’ = v + u*<1, 1> (increase a and b equally). Let’s compute the directional derivative from this update. Writing û = <1, 1>/√2 for the unit vector in this direction, a step along û increases a, b, and min(a, b) at the same rate, so:

D_û L = (α + β - γ)/√2 = (1 + 1 - 3)/√2 = -1/√2 < 0

That is, the loss decreases along this joint direction.
Does this match gradient descent?
Let’s compare our intuitive answer with our gradient descent calculation. Using an equation for the directional derivative, we would hope to have:

D_û L = ?L · û

We only have this guarantee when L is differentiable at v = <a, b>. But when a = b, L is not technically differentiable. However, if we use the “tie-breaking” that PyTorch does, we get the following:

?L · û = <α, β - γ> · <1, 1>/√2 = (α + β - γ)/√2 = -1/√2

which indeed holds.
While the components of ?L (i.e., dL/da and dL/db) both have a case-analysis embedded in their definition, we note that their sum is always α + β - γ, as long as a ≠ b. And with PyTorch’s implementation for the gradient of the min function, we get dL/da + dL/db = α + β - γ everywhere, even when a = b.
We knew ahead of time that jointly increasing a and b (i.e. v’ = v + u*<1, 1>) was the best direction to step in, given a = b. And the derivation above shows that taking separate gradients with respect to a and b, then linearly combining these gradients, achieves the same result. This is why, even though the optimizer operates in the (a, b) space, it can correctly optimize this problem.
Why can’t the managers in our story solve this problem? Let’s say each manager computes his gradient; dL/da or dL/db. Can they simply each take a step according to their gradient? The managers get stuck in the “otherwise” case shown in the equations for dL/da and dL/db — i.e., where a = b. PyTorch has symmetry-breaking which solves this problem. But this does not map cleanly to the motivating problem.
Note: We have not considered alternate initializations (i.e. a ≠ b), nor how the direction of the derivative changes with different α and β values (Hint: SGD will take a step in the -<α, β> direction iff a = b).
An unanswered question: Over the course of a single trajectory (e.g. Figure 2), why does the period of the wave appear to grow larger?
Future Work & Conclusion
We found this investigation interesting, and became curious about several further directions:
Trajectory & Connection to RL
What if we are not just interested in the end result that the optimizer converges to? We might be interested in the path it took, because this corresponds to excess inventory and profit/loss over time. There is a connection to the RL problem setup. In the RL problem, we care about the trajectory to maximize overall reward; in vanilla optimization, we care only about the end result.
In the current set-up, we are optimizing for profit/loss without consideration for inventory (time dimension). What about changing the trajectory to temporarily over-produce to work through accumulated inventory and maximize profit over time?
Fixed Cost of Increasing Production
What if the cost function also includes fixed costs to scale up or scale down production at each station? How would that change the trajectory taken, and how would that impact overproduction to work through accumulated inventory?
Alternate Optimizers
As noted in the Results section, using a momentum-based optimizer (e.g. Adam), rather than plain SGD, would reduce the “waviness” of the trajectory in Figure 2. This prompts the question as to whether the accumulated momentum in Adam is akin to the “stored inventory” from workers who produced excess widgets, as described above. In this case, the optimizer — and not only the loss function — would have an important role in determining the dynamics.
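A hedged sketch of that swap, reusing the loss from earlier (the optimizer hyperparameters and initialization are our assumptions):

```python
import torch

alpha, beta, gamma = 1.0, 1.0, 3.0
a = torch.tensor(3.0, requires_grad=True)  # Figure 2's initialization
b = torch.tensor(5.0, requires_grad=True)

optimizer = torch.optim.Adam([a, b], lr=0.01)  # assumed learning rate

for step in range(1000):
    optimizer.zero_grad()
    a_pos, b_pos = torch.relu(a), torch.relu(b)
    L = a_pos * alpha + b_pos * beta - torch.min(a_pos, b_pos) * gamma
    L.backward()
    optimizer.step()  # momentum terms smooth the lateral zig-zag
```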
Open Ended
How does the field of Operations Research treat these problems? Are there other loss functions which have non-differentiable regions, where a similar issue could come up? In these settings, could the optimizer still work?
Conclusion
We have seen how optimizers can successfully solve problems in which the input variables have strong co-dependence, while still operating in the input space.
Footnotes
[1]: The book further motivates this concept with a real world example:
For instance, the various problems at General Motors from 1980 to 2008 had strong chain-link features. Increasing the quality of an automobile transmission does little good if the knobs fall off the dashboard and door panels continue to rattle. Improving fit and finish, along with the drivetrain, may offer little overall improvement as long as the designers continue to produce pedestrian designs. Improving the look of the automobiles may only increase costs unless the complex technology of design for manufacturability is mastered. And so on.
Source: https://medium.com/@blakeelias/gradient-descent-and-chain-linked-systems-74ef5ec8444