PPO: A Reinforcement Learning Algorithm

Published: 2024/9/15

Contents

- Quick Facts
- Key Equations
- Exploration vs. Exploitation
- Pseudocode
- Documentation

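PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance to collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old ones. PPO methods are significantly simpler to implement, and empirically they seem to perform at least as well as TRPO.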

There are two primary variants of PPO: PPO-Penalty and PPO-Clip.

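PPO-Penalty approximately solves a KL-constrained update like TRPO, but it penalizes the KL-divergence in the objective function instead of making it a hard constraint, and it automatically adjusts the penalty coefficient over the course of training so that it is scaled appropriately.
PPO-Clip has no KL-divergence term in the objective and no constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to move far from the old policy.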
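To make the "automatically adjusts the penalty coefficient" idea concrete, here is a minimal sketch of one possible adaptation rule, assuming the double-or-halve scheme described in the PPO paper. The function names, the `tolerance` of 1.5, the `factor` of 2, and the sample KL values are illustrative assumptions, not part of the implementation discussed in this article.

```python
def adapt_kl_penalty(beta, observed_kl, target_kl, tolerance=1.5, factor=2.0):
    """Return an updated KL penalty coefficient (illustrative rule, assumed values)."""
    if observed_kl > target_kl * tolerance:
        return beta * factor   # policy moved too far: penalize KL more strongly
    if observed_kl < target_kl / tolerance:
        return beta / factor   # policy barely moved: relax the penalty
    return beta                # within the tolerated band: leave it unchanged


def penalized_objective(surrogate, kl, beta):
    """Surrogate advantage minus a soft KL penalty (no hard constraint)."""
    return surrogate - beta * kl


beta = 1.0
for kl in (0.05, 0.05, 0.001, 0.01):   # hypothetical per-update KL measurements
    beta = adapt_kl_penalty(beta, kl, target_kl=0.01)
    print(f"observed KL = {kl} -> beta = {beta}")
```

The coefficient grows while the measured KL is too large and shrinks when it is too small, so the penalty tracks the target without a hard constraint.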

Here we focus on PPO-Clip (the variant used at OpenAI).

Quick Facts

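PPO is an on-policy algorithm.
PPO can be used for environments with either discrete or continuous action spaces.
The Spinning Up implementation of PPO supports parallelization with MPI.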

Key Equations

PPO-Clip updates policies via $\theta_{k+1} = \arg \max_{\theta} \underset{s,a \sim \pi_{\theta_k}}{{\mathrm E}}\left[ L(s,a,\theta_k, \theta)\right],$ typically taking multiple steps of (usually minibatch) SGD to maximize the objective. Here $L$ is given by $L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a), \;\; \text{clip}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, 1 - \epsilon, 1+\epsilon \right) A^{\pi_{\theta_k}}(s,a) \right),$ where $\epsilon$ is a small hyperparameter that roughly says how far the new policy is allowed to move from the old one.

This is a fairly complex expression, and it is hard to tell at first glance what it does or how it helps keep the new policy close to the old policy. As it turns out, there is a considerably simplified version [1] of this objective that is easier to work with (and is also the version we implement in our code): $L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a), \;\; g(\epsilon, A^{\pi_{\theta_k}}(s,a)) \right),$ where $g(\epsilon, A) = \left\{ \begin{array}{ll} (1 + \epsilon) A & A \geq 0 \\ (1 - \epsilon) A & A < 0. \end{array} \right.$ To build intuition from this, let us look at a single state-action pair $(s, a)$ and consider the two cases.
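The claim that the clipped form and the simplified form with $g$ agree can be checked numerically. The following is a minimal sketch in plain Python; the function names and the grid of ratios and advantages are illustrative assumptions.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


def l_clip(ratio, adv, eps):
    """Original form: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(ratio * adv, clip(ratio, 1.0 - eps, 1.0 + eps) * adv)


def g(eps, adv):
    """(1 + eps) * A for A >= 0, (1 - eps) * A for A < 0."""
    return (1.0 + eps) * adv if adv >= 0 else (1.0 - eps) * adv


def l_simplified(ratio, adv, eps):
    """Simplified form: min(r * A, g(eps, A))."""
    return min(ratio * adv, g(eps, adv))


eps = 0.2
ratios = [0.1 * i for i in range(1, 31)]   # probability ratios from 0.1 to 3.0
advantages = [-2.0, -0.5, 0.0, 0.5, 2.0]

# Largest disagreement between the two forms over the whole grid.
max_gap = max(
    abs(l_clip(r, a, eps) - l_simplified(r, a, eps))
    for r in ratios
    for a in advantages
)
print("largest difference between the two forms:", max_gap)
```

Here `ratio` stands for $\pi_{\theta}(a|s) / \pi_{\theta_k}(a|s)$ and `adv` for $A^{\pi_{\theta_k}}(s,a)$.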

Advantage is positive: Suppose the advantage for that state-action pair is positive, in which case its contribution to the objective reduces to $L(s,a,\theta_k,\theta) = \min\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, (1 + \epsilon) \right) A^{\pi_{\theta_k}}(s,a).$ Because the advantage is positive, the objective increases if the action becomes more likely, that is, if $\pi_\theta(a|s)$ increases. But the min in this term limits how much the objective can increase. Once $\pi_\theta(a|s) > (1+\epsilon)\pi_{\theta_k}(a|s)$, the min kicks in and this term hits a ceiling of $(1+\epsilon)A^{\pi_{\theta_k}}(s,a)$. Thus the new policy does not benefit from moving far away from the old policy.
Advantage is negative: Suppose the advantage for that state-action pair is negative, in which case its contribution to the objective reduces to $L(s,a,\theta_k,\theta) = \max\left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, (1 - \epsilon) \right) A^{\pi_{\theta_k}}(s,a).$ Because the advantage is negative, the objective increases if the action becomes less likely, that is, if $\pi_\theta(a|s)$ decreases. But the max in this term limits how much the objective can increase. Once $\pi_\theta(a|s) < (1-\epsilon)\pi_{\theta_k}(a|s)$, the max kicks in and this term hits a ceiling of $(1-\epsilon)A^{\pi_{\theta_k}}(s,a)$. Thus, again, the new policy does not benefit from moving far away from the old policy.

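What we have seen so far is that clipping serves as a regularizer by removing incentives for the policy to change dramatically, and the hyperparameter $\epsilon$ corresponds to how far the new policy can move from the old policy while still improving the objective.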
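One way to see the "removed incentive" is to look at the slope of the per-sample objective with respect to the probability ratio: where the slope is zero, moving the policy further gains nothing. The sketch below estimates that slope with a finite difference; the function names and the sample ratios are illustrative assumptions.

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip objective for a given probability ratio and advantage."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * adv, clipped * adv)


def slope(ratio, adv, eps=0.2, h=1e-6):
    """Central finite-difference derivative of the term with respect to the ratio."""
    upper = ppo_clip_term(ratio + h, adv, eps)
    lower = ppo_clip_term(ratio - h, adv, eps)
    return (upper - lower) / (2 * h)


for adv in (1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        value = ppo_clip_term(ratio, adv)
        print(f"A={adv:+.0f} ratio={ratio}: L={value:+.2f}, dL/dratio={slope(ratio, adv):+.2f}")
```

With a positive advantage the slope vanishes once the ratio exceeds $1+\epsilon$, and with a negative advantage it vanishes once the ratio drops below $1-\epsilon$. In the opposite directions (for example, a negative advantage with a ratio above $1+\epsilon$) the slope is nonzero, so the objective still pushes the policy back toward the old one.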

[1]https://drive.google.com/file/d/1PDzn9RPvaXjJFZkGeapMHbHGiWWW20Ey/view?usp=sharing

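While this kind of clipping goes a long way towards ensuring reasonable policy updates, it is still possible to end up with a new policy that is too far from the old policy, and different PPO implementations use a number of tricks to avoid this. In the implementation here, we use a particularly simple method: early stopping. If the mean KL-divergence of the new policy from the old policy grows beyond a threshold, we stop taking gradient steps.
Once you feel comfortable with the basic math and the implementation details, it is worth checking out other implementations to see how they handle this issue!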
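The early-stopping rule can be sketched on a toy problem: a single state with three actions and a softmax policy, updated by gradient ascent on the clipped objective until the KL-divergence from the old policy crosses a threshold. Everything here is an illustrative assumption rather than the implementation described above: the toy policy, the finite-difference gradient, the learning rate, the advantages, and the threshold of 1.5 times `target_kl`.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clip_objective(logits, pi_old, adv, eps):
    """Exact expectation over a ~ pi_old of the PPO-Clip term (single state)."""
    total = 0.0
    for p_new, p_old, a in zip(softmax(logits), pi_old, adv):
        ratio = p_new / p_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        total += p_old * min(ratio * a, clipped * a)
    return total


def kl_old_new(pi_old, logits):
    """KL(pi_old || pi_new) = E_{a ~ pi_old}[log pi_old(a) - log pi_new(a)]."""
    return sum(p * math.log(p / q) for p, q in zip(pi_old, softmax(logits)))


def update_policy(logits, adv, eps=0.2, lr=0.3, train_pi_iters=80,
                  target_kl=0.01, h=1e-5):
    pi_old = softmax(logits)   # snapshot of the old policy
    theta = list(logits)
    steps, kl = 0, 0.0
    for _ in range(train_pi_iters):
        grad = []
        for j in range(len(theta)):
            up = theta[:j] + [theta[j] + h] + theta[j + 1:]
            down = theta[:j] + [theta[j] - h] + theta[j + 1:]
            grad.append((clip_objective(up, pi_old, adv, eps)
                         - clip_objective(down, pi_old, adv, eps)) / (2 * h))
        theta = [t + lr * g for t, g in zip(theta, grad)]   # gradient ascent step
        steps += 1
        kl = kl_old_new(pi_old, theta)
        if kl > 1.5 * target_kl:   # assumed threshold: stop taking gradient steps
            break
    return theta, steps, kl


theta, steps, kl = update_policy([0.0, 0.0, 0.0], adv=[1.0, -0.5, -0.5])
print(f"stopped after {steps} of 80 steps, KL = {kl:.4f}")
```

In this toy run the loop exits well before the 80-step budget, which is the behavior that the `train_pi_iters` and `target_kl` parameters in the documentation below refer to.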

Exploration vs. Exploitation

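PPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both the initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, because the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.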
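As a small illustration of "less random over time", the sketch below samples actions from two hypothetical categorical policies and compares their entropy. The logits are made-up stand-ins for a policy early and late in training, not values from any real run.

```python
import math
import random
from collections import Counter


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means more random action selection."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sample_action(logits, rng):
    """Explore by sampling from the current stochastic policy."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
early_logits = [0.0, 0.0, 0.0, 0.0]   # hypothetical policy at initialization
late_logits = [4.0, 0.0, 0.0, 0.0]    # hypothetical policy after exploiting action 0

for name, logits in (("early", early_logits), ("late", late_logits)):
    counts = Counter(sample_action(logits, rng) for _ in range(1000))
    print(name, "entropy:", round(entropy(softmax(logits)), 3), "action counts:", dict(counts))
```

The peaked policy has much lower entropy and samples the alternative actions only rarely, which is how an on-policy learner can stop discovering better actions.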

Pseudocode

Documentation

spinup.ppo(env_fn, actor_critic=, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, clip_ratio=0.2, pi_lr=0.0003, vf_lr=0.001, train_pi_iters=80, train_v_iters=80, lam=0.97, max_ep_len=1000, target_kl=0.01, logger_kwargs={}, save_freq=10)
Parameters:

env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:
ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to PPO.
seed (int) – Seed for random number generators.
steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
gamma (float) – Discount factor. (Always between 0 and 1.)
clip_ratio (float) – Hyperparameter for clipping in the policy objective. Roughly: how far can the new policy go from the old policy while still profiting (improving the objective function)? The new policy can still go farther than the clip_ratio says, but it doesn’t help on the objective anymore. (Usually small, 0.1 to 0.3.)
pi_lr (float) – Learning rate for policy optimizer.
vf_lr (float) – Learning rate for value function optimizer.
train_pi_iters (int) – Maximum number of gradient descent steps to take on policy loss per epoch. (Early stopping may cause optimizer to take fewer than this.)
train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
max_ep_len (int) – Maximum length of trajectory / episode / rollout.
target_kl (float) – Roughly what KL divergence we think is appropriate between new and old policies after an update. This will get used for early stopping. (Usually small, 0.01 or 0.05.)
logger_kwargs (dict) – Keyword args for EpochLogger.
save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
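The `gamma` and `lam` parameters above enter through GAE-Lambda advantage estimation. The following is a minimal sketch of that computation for a single trajectory, assuming a list of rewards and value estimates; the function name and argument layout are illustrative, not the library's own code.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.97):
    """GAE-Lambda advantages for one trajectory.

    values[t] is the value estimate V(s_t); last_value is the value of the
    state reached after the final step (use 0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Discounted sum of residuals: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]
print(gae_advantages(rewards, values, last_value=0.0))
```

Setting `lam` close to 1 weights long-horizon returns more heavily (lower bias, higher variance), while `lam=0` reduces each advantage to the one-step TD residual.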
