How to Structure a Reinforcement Learning Project (Part 1)
Ten months ago, I started my work as an undergraduate researcher. What I can say clearly is that working on a research project is hard, but working on a Reinforcement Learning (RL) research project is even harder!
What made it challenging to work on such a project was the lack of proper online resources for structuring this type of project;
Structuring a Web Development project? Check!
Structuring a Mobile Development project? Check!
Structuring a Machine Learning project? Check!
Structuring a Reinforcement Learning project? Not really!
To better guide future novice researchers, beginner machine learning engineers, and amateur software developers starting their RL projects, I put together this non-comprehensive step-by-step guide for structuring an RL project, which will be divided as follows:
Start the Journey: Frame your Problem as an RL Problem
Choose your Weapons: All the Tools You Need to Build a Working RL Environment
Face the Beast: Pick your RL (or Deep RL) Algorithm
Tame the Beast: Test the Performance of the Algorithm
Set it Free: Prepare your Project for Deployment/Publishing
In this post, we will discuss the first part of this series:
Start the Journey: Frame your Problem as an RL Problem
This step is the most crucial one in the whole project. First, we need to make sure whether Reinforcement Learning can actually be used to solve your problem.
1. Framing the Problem as a Markov Decision Process (MDP)
For a problem to be framed as an RL problem, it must first be modeled as a Markov Decision Process (MDP).
A Markov Decision Process (MDP) is a representation of the sequence of an agent's actions in an environment and their consequences, not only on the immediate rewards but also on future states and rewards.
An example of an MDP is the following, where S0, S1, and S2 are the states, a0 and a1 are the actions, and the orange arrows are the rewards.
Figure 2: example of an MDP (source: Wikipedia)
An MDP must also satisfy the Markov Property:
The new state depends only on the preceding state and action, and is independent of all previous states and actions.
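To make this concrete, here is a minimal sketch of a three-state MDP like the one in Figure 2, encoded as a transition table. `P[state][action]` is a list of `(probability, next_state, reward)` tuples, which is also roughly the layout Gym uses internally for its toy-text environments. All the specific probabilities and reward values below are illustrative placeholders, not taken from the figure.

```python
# Illustrative MDP: P[state][action] -> list of (probability, next_state, reward).
P = {
    "S0": {"a0": [(0.5, "S0", 0.0), (0.5, "S2", 0.0)],
           "a1": [(1.0, "S2", 0.0)]},
    "S1": {"a0": [(0.7, "S0", 5.0), (0.3, "S1", 0.0)],
           "a1": [(0.95, "S1", 0.0), (0.05, "S2", 0.0)]},
    "S2": {"a0": [(0.4, "S0", 0.0), (0.6, "S2", 0.0)],
           "a1": [(0.3, "S0", -1.0), (0.3, "S1", 0.0), (0.4, "S2", 0.0)]},
}

# The Markov Property is baked into this structure: the distribution over
# (next_state, reward) depends only on the current (state, action) pair,
# not on how the agent got there.
for state, actions in P.items():
    for action, transitions in actions.items():
        total = sum(prob for prob, _, _ in transitions)
        assert abs(total - 1.0) < 1e-9  # each (state, action) is a valid distribution
```

Note how nothing in the table refers to the history of visited states; that independence from the past is exactly the Markov Property stated above.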
2. Identifying your Goal
Figure 3: Photo by Paul Alnet on Unsplash
What distinguishes Reinforcement Learning from other types of Learning such as Supervised Learning is the presence of exploration and exploitation and the trade-off between them.
While Supervised Learning agents learn by comparing their predictions with existing labels and updating their strategies afterward, RL agents learn by interacting with an environment, trying different actions, and receiving different reward values, while aiming to maximize the cumulative expected reward in the end.
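The exploration/exploitation trade-off mentioned above can be sketched in its simplest form: an epsilon-greedy rule, where with probability epsilon the agent explores (picks a random action) and otherwise exploits its current value estimates. The value table below is a made-up example, not something from the article.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:                 # explore: try something random
        return rng.randrange(len(q_values))
    # exploit: pick the action with the highest estimated value
    return max(range(len(q_values)), key=q_values.__getitem__)

q = [0.1, 0.7, 0.3]          # illustrative estimated values for three actions
rng = random.Random(0)

# With epsilon = 0 the agent always exploits the best-valued action (index 1):
assert epsilon_greedy(q, 0.0, rng) == 1
# With epsilon = 1 the agent always explores, so any action is possible:
assert epsilon_greedy(q, 1.0, rng) in (0, 1, 2)
```

In practice, epsilon is usually decayed over training so the agent explores a lot early on and exploits more as its value estimates improve.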
Therefore, it becomes crucial to identify the reason that pushed us to use RL:
Is the task an optimization problem?
Is there any metric that we want the RL agent to learn to maximize (or minimize)?
If your answer is yes, then RL might be a good fit for the problem!
3. Framing the Environment
Now that we are convinced that RL is a good fit for our problem, it is important to define the main components of the RL environment: the states, the observation space, the action space, the reward signal, and the terminal state.
Formally speaking, an agent lies in a specific state s1 at a specific time. For the agent to move to another state s2, it must perform a specific action, a0 for example. We can confidently say that the state s1 encapsulates all the current conditions of the environment at that time.
The observation space: In practice, the terms state and observation are used interchangeably. However, we must be careful, because there is a discernible difference between them. The observation represents all the information that the agent can capture from the environment at a specific state.
Let us take the very famous RL example of the CartPole environment, where the agent has to learn to balance a pole on a cart:
Figure 4: CartPole trained agent in action (Source)
The observations that are recorded at each step are the following:
Table 1: Observation space for CartPole-v0 (source: OpenAI Gym Wiki)
Another good example might be the case of an agent trying to discover its way through a maze, where at each step, the agent might receive, for example, an observation of the maze architecture and its current position.
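A toy illustration of the state/observation distinction, using the maze example: the environment's full state includes everything (the whole grid, the agent's position, the goal), while the observation handed to the agent can deliberately be narrower. Here it is just the agent's position and its four neighboring cells. The class and method names are ours, invented for this sketch, not from any library.

```python
MAZE = [
    "#####",
    "#A..#",   # 'A' marks the agent's start
    "#.#.#",
    "#..G#",   # 'G' marks the goal
    "#####",
]

class MazeEnv:
    """Toy environment whose observation is narrower than its full state."""

    def __init__(self, grid):
        self.grid = [list(row) for row in grid]
        # Full state knows everything, including where the agent starts.
        self.pos = next((r, c) for r, row in enumerate(grid)
                        for c, ch in enumerate(row) if ch == "A")

    def observe(self):
        # Partial observation: current position plus the four adjacent cells.
        r, c = self.pos
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        neighbors = {d: self.grid[r + dr][c + dc] for d, (dr, dc) in moves.items()}
        return {"position": (r, c), "neighbors": neighbors}

env = MazeEnv(MAZE)
obs = env.observe()
print(obs["position"])   # (1, 1) -- the agent sees where it is and its local walls
```

In Gym terms, the observation space of this sketch would be what `observe()` returns, while the underlying state (the full grid) stays hidden from the agent.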
Figure 5: Another similar environment is Pac-Man (source: Wikipedia)
The action space: The action space defines the possible actions the agent can choose to take at a specific state. It is by optimizing its choice of actions that the agent can optimize its behavior.
In the Maze example, an agent roaming the environment would have the choice of moving up, down, left, or right to move to another state.
The reward signal: Generally speaking, the RL agent tries to maximize the cumulative reward over time. With that in mind, we can design the reward function in the best way to be able to maximize or minimize specific metrics that we choose.
For the CartPole environment example, the reward function was designed as follows:
“Reward is 1 for every step taken, including the termination step. The threshold is 475.”
Knowing that the simulation ends when the pole slips off the cart, the agent eventually has to learn to balance the pole on the cart for as long as possible, by maximizing the sum of the individual rewards it gets at each step.
In the case of the Maze environment, our goal might be to let the agent find its way from the source to the destination in the least number of steps possible. To do so, we can design the reward function to give the RL agent a negative reward at each step, eventually teaching it to take the least number of steps while approaching the destination.
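A sketch of what a step function with this per-step penalty might look like for the maze: -1 on every move, plus a bonus on reaching the goal. The specific numbers (-1, +10) and names are illustrative design choices, not values from the article.

```python
# Maps each action name to a (row, column) delta on the grid.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(grid, pos, action, goal):
    """One environment step: returns (new_position, reward, done)."""
    dr, dc = ACTIONS[action]
    r, c = pos[0] + dr, pos[1] + dc
    if grid[r][c] == "#":        # bumping into a wall leaves the agent in place
        r, c = pos
    reward = -1.0                # constant per-step cost pushes toward short paths
    done = (r, c) == goal
    if done:
        reward += 10.0           # illustrative terminal bonus for reaching the goal
    return (r, c), reward, done

grid = ["#####",
        "#...#",
        "#.#.#",
        "#...#",
        "#####"]

pos, reward, done = step(grid, (1, 1), "right", goal=(3, 3))
# The agent moved one cell and paid the step cost: pos == (1, 2), reward == -1.0
```

Because every step costs -1, the only way for the agent to maximize the episode's total reward is to reach the goal in as few steps as possible, which is exactly the behavior we wanted to encode.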
The terminal state: Another crucial component is the terminal state. Although it might not seem like a major issue, failing to correctly set the flag that signals the end of a simulation can badly affect the performance of an RL agent.
The done flag, as it is referred to in many RL environment implementations, can be set whenever the simulation reaches its end or when a maximum number of steps is reached. Setting a maximum number of steps will prevent the agent from taking an infinite number of steps to maximize its rewards as much as possible.
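A minimal episode loop showing the two ways the done flag is typically raised: the environment reaches a genuine terminal state, or the step cap is hit. `fake_step` stands in for a real `env.step()`; its 1%-termination rule is purely illustrative.

```python
import random

MAX_STEPS = 200      # illustrative cap, like CartPole-v0's episode limit

def fake_step(state, action):
    """Stand-in for env.step(): returns (next_state, reward, done)."""
    terminal = random.random() < 0.01   # pretend 1% chance of a true terminal state
    return state + 1, 1.0, terminal

random.seed(0)
state, total_reward = 0, 0.0
for t in range(1, MAX_STEPS + 1):
    state, reward, done = fake_step(state, action=0)
    total_reward += reward
    if done or t == MAX_STEPS:          # step cap prevents infinite episodes
        done = True
        break
```

Without the `t == MAX_STEPS` check, an agent in an environment with no natural terminal state could accumulate reward forever, which is exactly the failure mode the paragraph above warns about.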
It is important to note that designing the environment is one of the hardest parts of RL. It requires lots of tough design decisions and many remakes. Unless the problem is that straightforward, you will most likely have to experiment with many environment definitions until you land on the one that yields the best results.
My advice: Try to look up previous implementations of similar environments to get some inspiration for building yours.
4. Are Rewards Delayed?
Another important consideration is to check whether our goal is to maximize the immediate reward or the cumulative reward. It is crucial to have this distinction set clearly before starting the implementation, since RL algorithms optimize the cumulative reward at the end of the simulation, not the immediate reward.
Consequently, the RL agent might opt for an action that leads to a low immediate reward in order to get higher rewards later on, maximizing the cumulative reward over time. In case you are interested in maximizing the immediate reward, you might be better off using other techniques such as Bandit and Greedy approaches.
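This preference for delayed rewards can be shown with a few lines of arithmetic on the discounted return, the quantity RL algorithms actually optimize. The reward sequences and discount factor below are made-up numbers for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards discounted by gamma^t, the quantity RL maximizes."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

path_a = [5.0, 0.0, 0.0]   # greedy path: big immediate reward, nothing after
path_b = [0.0, 3.0, 4.0]   # patient path: small now, larger rewards later

# A wins on immediate reward, but B wins on the cumulative return:
assert path_a[0] > path_b[0]
assert discounted_return(path_b) > discounted_return(path_a)
```

A bandit or greedy method would pick path A (it only compares the first rewards), while an RL agent optimizing the return would learn to prefer path B.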
5. Be Aware of the Consequences
Similar to other types of Machine Learning, Reinforcement Learning (and especially Deep Reinforcement Learning (DRL)) is very computationally expensive. So you have to expect to run many training episodes, testing and tuning the hyperparameters iteratively.
Moreover, some DRL algorithms (like DQN) are unstable and may require more training episodes to converge than you think. Therefore, I would suggest allocating a decent amount of time to optimizing and perfecting the implementation before letting the RL agent train long enough.
Conclusion
In this article, we laid the foundations needed to determine whether Reinforcement Learning is a good paradigm for tackling your problem, and how to properly design the RL environment.
In the next part, “Choose your Weapons: All the Tools You Need to Build a Working RL Environment”, I am going to discuss how to build the infrastructure needed for an RL environment, with all the tools you might need!
Buckle up and stay tuned!
Originally published at https://anisdismail.com on June 21, 2020.
Source: https://medium.com/analytics-vidhya/how-to-structure-a-reinforcement-learning-project-part-1-8a88f9025a73