Core Concepts in Reinforcement Learning, by Example
By Hannah Peterson and George Williams (gwilliams@gsitechnology.com)
You may recall from the previous post in this series on Reinforcement Learning (RL) that a Markov Decision Process (MDP) is a mathematical framework for modeling RL problems. To ground the mathematical theory of an MDP in reality, we identified each of its elements in the context of OpenAI Gym’s MountainCar game. Now that we understand them individually, let’s explore how they combine to form equations, continuing to refer to MountainCar when applicable. The field of RL consists of using different algorithms to approximate optimal solutions to the equations we derive here.
Disclaimer: All of the math involved in MDPs can easily get overwhelming, so I don’t go through step-by-step derivations for every equation. Additionally, there’s not one correct way to write these equations—they have multiple representations invoking a variety of notations, from which I picked those that I found to be the most intuitive. If you’re interested in a lower-level explanation, I suggest taking a look at the resources I link to in the References section.
OpenAI Gym Basics
First, let’s explore some basic commands in Gym to help us see how an MDP is realized within the MountainCar game. These are the building blocks for the code we will be writing in the next post for getting the car agent to learn an intelligent decision-making process.
The command env = gym.make("MountainCar-v0") creates an instance of the MountainCar game environment. Recall that in this game, a state (i.e. a snapshot of the environment) is defined by the car’s horizontal position and velocity, of which there are theoretically infinite combinations. This is why when we access the environment’s observation_space (i.e. all its possible states) attribute, we find it is represented by a 2D box. In contrast, we see its action_space consists of 3 discrete values, corresponding to the 3 actions available to the car agent in a given state: accelerate to the left, don’t accelerate, or accelerate to the right, encoded as actions 0, 1 and 2, respectively. Actions are taken to transition between states, creating a series of states known in RL as an episode or trajectory.
A random starting state in the MountainCar game environment.

To begin an episode, we invoke the game's reset() function, which yields a vector with a randomly generated position and a velocity of 0, representing the episode’s starting state. Calling env.action_space.sample() generates a random action (0, 1, or 2) from the action space. We take an action by inputting it into the step() function, which returns a [state, reward, done, info] array. Here, state, reward, and done correspond to the new current state, the reward received, and a boolean indicating whether the agent has reached its goal as a result of taking the action (info is not relevant in this game—it’s just an empty dictionary). We can see that taking the random action leads to a new state with a slightly different position and a non-zero velocity compared to the starting state.
Having familiarized ourselves with the basics of OpenAI Gym, we can now ask the all-important question of How does our car determine which actions to take? Above, we used the random env.action_space.sample() function, which samples from a uniform distribution across all actions—that is, each of our 3 actions has a 0.33 probability of being selected. With this naive decision-making approach, we can imagine the extremely high number of trials it would take for the car to reach the flag at the top of the hill. We’ll now go over the math involved in teaching our agent a more intelligent way of choosing between actions.
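The naive uniform sampling described above can be mimicked in plain Python. This is only a sketch of the same distribution env.action_space.sample() draws from, not Gym's actual implementation:

```python
import random

# The 3 MountainCar actions: 0 = accelerate left, 1 = don't accelerate,
# 2 = accelerate right.
ACTIONS = [0, 1, 2]

def random_policy():
    """A naive 'policy': each action has probability 1/3, regardless of state."""
    return random.choice(ACTIONS)

# Empirically, the frequencies approach 1/3 (~0.33) each.
random.seed(0)
counts = {a: 0 for a in ACTIONS}
for _ in range(30_000):
    counts[random_policy()] += 1
print({a: round(c / 30_000, 2) for a, c in counts.items()})
```

Because this policy ignores the state entirely, the car wanders at random, which is why it would take so many trials to reach the flag this way.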
Deciding Which Actions to Take
The Policy Function
A natural approach to this problem is to think, well, the flag is to the right of the car, so the car should just accelerate to the right to reach it; however, the game is structured such that the car is unable to overcome the force of gravity by only accelerating to the right—rather, it has to build up its momentum by moving left and right between the two slopes, alternating its acceleration. To determine when to switch between these actions, the car must take into account its current position and velocity, i.e. its state. In RL, the function that decides which action a should be taken in a particular state s is called a policy, π:
This function represents a distribution, taking a state as input and outputting a probability for each possible action, which is sampled from during gameplay to determine which action to take. As we discussed above, the policy for env.action_space.sample() is a uniform distribution across all 3 actions, one that does not take into account an input state s. Let’s take a look at what an optimal policy for MountainCar that does take state into account might look like, bearing in mind that this is a vastly simplified example:
The agent navigates between states to reach the flag by taking actions outputted by its policy. The flag is at position x = 0.5, and states have a reward of -1 if the agent is at x < 0.5 and a reward of 0 otherwise. Note, this idealized example does not take into account stochasticity of the environment (which we’ll account for later).

Ideally, such a policy would know with 100% certainty the best action to take in any state; that is, it would output a probability of 1 for this action and 0 for the other two. Recall that the car’s starting position is randomized for each episode, so there’s no one optimal route for the policy to learn. Additionally, we see that the agent refers to the policy at each state to determine which action to take; that is, it’s not as if the policy just takes in the episode’s starting state and outputs the optimal series of actions for the agent to follow through to the end—rather, it outputs one action per state. This is because, while a pre-planned approach could potentially work for a simple game like MountainCar, MDPs are also used to model dynamic environments, in which it’s imperative for the agent to make decisions based on real-time information about its state.
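A decisive policy like the one pictured can be sketched as a function from state to action probabilities, with all the probability mass on one action. The heuristic below (push in the direction the car is already moving) is a hand-written illustration, not a learned or official MountainCar policy:

```python
def simple_policy(state):
    """A hypothetical decisive policy for MountainCar.

    `state` is a (position, velocity) pair. To build momentum, put
    probability 1 on "accelerate left" (0) when velocity is negative,
    and on "accelerate right" (2) otherwise.
    """
    position, velocity = state
    probs = {0: 0.0, 1: 0.0, 2: 0.0}
    probs[0 if velocity < 0 else 2] = 1.0
    return probs

def act(state):
    """Sample from the policy; here that just means taking the probability-1 action."""
    probs = simple_policy(state)
    return max(probs, key=probs.get)
```

Note that the function is consulted anew at every state, matching the one-action-per-state behavior described above.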
In reality, it’s not feasible to learn policies as decisive as the one above; rather, we use RL algorithms to arrive at decent approximations. In sum, the fundamental goal of RL is to find a policy that maximizes the reward the agent can expect to receive over the course of an episode by acting on it. There are two important things to note here:
To address the first, you may notice that the policy equation we defined above doesn’t include a variable for reward. Indeed, the policy still needs some means of assigning rewards to actions so it can output high probabilities for “rewarding” actions and vice versa, which requires us to first find a way of mapping rewards to states, since actions are just transitions between states. Let’s discuss how to go about assigning rewards to states, for which we must take into account the second point—the notion of expectation.
Determining Expected Reward
The inclusion of the word expect in the fundamental goal of RL implies that:
Regarding the first point, our agent needs to map rewards to states in a way that isn’t based solely on how rewarding a given state is immediately upon visiting it but also on its potential for being a part of an overall rewarding trajectory of many states; that is, the agent’s calculation of reward needs to be structured such that it is still able to recognize states that may be penalizing in the short term but beneficial in the long term as rewarding. With this in mind, it makes sense to define the “rewardingness” of a particular state using a sum of rewards of future states.
But, as the second point addresses, the future is uncertain. Recall that MDPs are used to model environments that are stochastic in nature, meaning that instructing an agent to take a particular action doesn’t guarantee it will actually be carried out 100% of the time. Further, environments are often dynamic in ways that are extremely difficult to model directly, such as games played against a human opponent. Given these factors, it makes sense to weigh expected rewards less heavily the further in the future they occur, because we are less certain we will actually receive them.
The State-Value Function
With these two points in mind, we can introduce the state-value function, which maps values to states in MDPs. The value of a state s is defined as the total reward an agent can expect to receive by starting in s and following a trajectory of successive states from s:
Here, the first term in the expectation operator is the immediate reward the agent expects to receive by being in s and the remaining terms are the rewards for successive states, which we see are weighted by γ. γ is a constant between 0 and 1 called the discount factor because it gets exponentiated to weigh future expected rewards less heavily in the sum of total expected rewards attributed to s, accounting for uncertainty of the future in this way. The closer γ is to 1, the more future reward is considered when attributing value to the present state and vice versa. We can group the discounted terms together, arriving at the following representation of the state-value function:
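Concretely, the discounted sum inside the expectation can be computed for any sampled sequence of rewards. A minimal sketch, using MountainCar’s reward scheme (-1 per step until the goal) as a toy trajectory:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards weighted by gamma**k, where k is steps into the future."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Three steps: two -1 rewards, then 0 at the goal.
g = discounted_return([-1, -1, 0], gamma=0.9)
print(g)  # -1 + 0.9*(-1) + 0.81*0 = -1.9
```

With γ closer to 0, the later rewards would contribute almost nothing; with γ = 1, all rewards would count equally, matching the discussion above.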
Additionally, it is the policy function π that determines the trajectory of states the agent takes, so π can be viewed as another parameter of the function:
This is what is known as the Bellman equation for the state-value function. A Bellman equation is a type of optimization equation that specifies that the “value” of a decision problem at a given point in time depends on the value generated from an initial choice and the value of resulting choices, and, hence, is recursive in nature.
The Action-Value Function
Given we now have a means of mapping a value to a given state, we can use the fact that an action is a transition between states to map values to state-action pairs as well. Using such a mapping, the agent in any state will have a way to estimate the value of taking each action available to it so it can choose the one with the highest chance of contributing to a maximum overall reward. This is achieved by the action-value function, which takes in a state s and action a and outputs the expected reward (a scalar value) for taking a from s and following policy π to take actions thereafter:
You may notice it is also a Bellman equation, and that it looks very similar to the state-value function. How are these two equations related? The value of a state s can be determined by taking the sum of the action-values for all actions, where each action-value term is weighted by the probability of taking the respective action a from s, which, as we know, is determined by the policy. Thus, the state-value function can be re-written as:
In MountainCar, for example, say the car is following some policy π that outputs probabilities of 0.3, 0.1 and 0.6 and its action-value function outputs values of 1.2, 0.4 and 2.8 for the three actions given its current state s. Then the value of s is 0.3(1.2) + 0.1(0.4) + 0.6(2.8) = 2.08.
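That arithmetic is easy to check in code. The probabilities and action-values below are the made-up numbers from the example, not outputs of a real learned policy:

```python
def state_value(action_probs, action_values):
    """v(s) = sum over actions a of pi(a|s) * q(s, a)."""
    return sum(action_probs[a] * action_values[a] for a in action_probs)

pi = {0: 0.3, 1: 0.1, 2: 0.6}  # policy output pi(a|s) for state s
q = {0: 1.2, 1: 0.4, 2: 2.8}   # action-values q(s, a) for state s
print(state_value(pi, q))  # 0.3*1.2 + 0.1*0.4 + 0.6*2.8 = 2.08
```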
We can also define the action-value function in terms of the state-value function. Calculating the action-value of a state s and some action a must take into account the stochastic nature of the environment, that is, that even if the agent chooses to take a, random interaction with the environment may cause it to take a different action. You may recall from the first blog that this randomness is encoded in an MDP as a matrix P where an entry P_ss’ is the probability of ending up in state s’ by choosing to take a from s. We can visualize this using an example in MountainCar. In this case, let’s assume a policy that takes in state s and outputs probabilities of 0.8, 0.06 and 0.14 for actions 0, 1 and 2, respectively. The agent draws from this distribution and chooses action 0 (accelerate to the left), as expected given it has the largest probability. Despite this choice, the probability P_ss’ that the agent actually takes this action is less than certain, equaling 0.9 for example, with the remaining probability being distributed among the other two actions:
This uncertainty in taking actions means that to calculate the expected reward (i.e. the action-value) for taking any action a in state s, the immediate expected reward for a and s must be added to the discounted sum of expected rewards (i.e. state-values) of all possible successor states, weighted by their probabilities of occurring:
We can substitute the new form of the state-value function we derived above into this formula to arrive at the following definition of the action-value function, which we see is the same as the initial equation with the expectation operator applied:
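This backup can be sketched with toy numbers. The transition probabilities and state-values below are invented for illustration; they do not come from the actual MountainCar dynamics:

```python
def action_value(reward, transition_probs, state_values, gamma=0.9):
    """q(s, a) = R(s, a) + gamma * sum over s' of P(s'|s, a) * v(s')."""
    expected_next = sum(p * state_values[s_next]
                        for s_next, p in transition_probs.items())
    return reward + gamma * expected_next

# Choosing "accelerate left" from s: the environment is stochastic, so
# the agent may land in one of three successor states.
P = {"s1": 0.9, "s2": 0.06, "s3": 0.04}   # P_ss' for this action
V = {"s1": -5.0, "s2": -8.0, "s3": -8.0}  # state-values v(s')
qa = action_value(reward=-1.0, transition_probs=P, state_values=V)
print(qa)  # -1 + 0.9 * (0.9*-5 + 0.06*-8 + 0.04*-8) = -5.77
```

Note how the successor-state values are weighted by their probabilities of occurring, exactly as the equation prescribes.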
The Optimization Problem
The action-value function gives us a way of determining the value of any action a from any state s in a trajectory determined by policy π—the higher the output of the action-value function, the better that action is for accumulating a high reward over the course of the trajectory. Thus, the optimal policy will know which actions result in the highest action-values with 100% certainty and, as we discussed earlier with the first MountainCar example, will output a probability of 1 for those actions and 0 for the others:
With this, we arrive at the Bellman optimality equation for the action-value function:
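Under the optimality equation, the optimal policy simply puts all its probability on the argmax action, and the optimal state value is the max over action-values. A sketch using the toy action-values from the earlier example:

```python
def greedy_policy(q_values):
    """pi*(s): probability 1 on the action with the highest action-value."""
    best = max(q_values, key=q_values.get)
    return {a: (1.0 if a == best else 0.0) for a in q_values}

def optimal_state_value(q_values):
    """v*(s) = max over a of q*(s, a)."""
    return max(q_values.values())

q = {0: 1.2, 1: 0.4, 2: 2.8}
print(greedy_policy(q))        # all probability on action 2
print(optimal_state_value(q))  # 2.8
```

Of course, this assumes the true optimal action-values are already known; approximating them is exactly what algorithms like Q-Learning, covered next, set out to do.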
This is what Reinforcement Learning aims to solve through a wide variety of approaches. In the next blog of this series, we’ll walk through the theory behind one popular approach called “Q-Learning” to arrive at an estimate of this optimal action-value function using MountainCar.
Source: https://medium.com/gsi-technology/core-concepts-in-reinforcement-learning-by-example-dc8e839f6a2c