Monte Carlo Tree Search: Implementing Reinforcement Learning in a Real-Time Game Player
In the previous article, we covered the fundamental concepts of reinforcement learning and closed the article with these two key questions:
1 — How can we find the best move among all the alternatives if we cannot process every successive state one by one in the limited time available?
2 — How do we map the task of finding the best move to long-term rewards if we are limited in computational resources and time?
In this article, to answer these questions, we go through the fundamentals of Monte Carlo Tree Search. Since in the next articles we will implement this algorithm for the “HEX” board game, I try to explain the concepts through examples in this board-game environment. If you’re more interested in the code, you can find it in this link. There is also a more optimized version, which is applicable on Linux because it utilizes Cython, and you can find it here.
Here is the outline:
1 — Overview
2 — Exploration and Exploitation Trade-off
3 — HEX: A Classic Board Game
4 — Algorithm structure: Selection and Expansion
5 — Algorithm structure: Rollout
6 — Algorithm structure: Backpropagation
7 — Advantages and Disadvantages
Conclusion
Overview
The Monte Carlo method was coined by Stanislaw Ulam, who first applied this statistical approach. The concept is simple: use randomness to solve problems that might be deterministic in principle. For example, in mathematics it is used to estimate an integral when we cannot calculate it directly. The image below also shows how we can estimate pi based on Monte Carlo simulations.
Fig 1: Estimating pi with Monte Carlo simulations (source).
The image above indicates that, in the Monte Carlo method, the more samples we gather, the more accurate an estimate of the target value we attain.
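As a quick illustration (not part of the article's HEX code), here is a minimal Python sketch that estimates pi this way; the function name and sample counts are arbitrary choices:

```python
import random

def estimate_pi(num_samples: int) -> float:
    """Estimate pi by sampling random points in the unit square and
    counting how many fall inside the quarter circle of radius 1."""
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples  # (inside / total) approximates pi / 4

for n in (100, 10_000, 1_000_000):
    print(n, estimate_pi(n))           # the estimate tightens as n grows
```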
- But how do Monte Carlo methods come in handy for general game playing?
We use the Monte Carlo method to estimate the quality of states stochastically, based on simulations, when we cannot process all the states. Each simulation is a self-play game that traverses the game tree from the current state until a leaf state (end of game) is reached.
So this algorithm is just perfect for our problem.
- Since it samples the future state-action space, it can estimate a near-optimal action in the current state while keeping the computational effort low (which addresses the first question).
- Also, the fact that it chooses the best action based on long-term rewards (rewarding based on the results at the tree leaves) answers the second question.
This process is exactly like a human trying to estimate future moves to come up with the best possible move in a game of chess. They mentally simulate various games through self-play (from the current state to the last possible future state) and choose the one that has the best overall result.
Monte Carlo Tree Search (MCTS), which combines Monte Carlo methods with tree search, is a method for finding optimal decisions in a given domain by taking random samples in the decision space and building a search tree according to the results.
Before we explain the algorithm structure, we should first discuss the exploration and exploitation trade-off.
在解釋算法結(jié)構(gòu)之前,我們應(yīng)該首先討論勘探與開發(fā)之間的權(quán)衡。
勘探與開發(fā)的權(quán)衡 (Exploration and Exploitation Trade-off)
As explained, in reinforcement learning an agent always aims to achieve an optimal strategy by repeatedly using the best actions that it has found for that problem (remember the chess example in the previous article). However, there is a probability that the current best action is not actually optimal. As such, the agent will continue to evaluate alternatives periodically during the learning phase by executing them instead of the perceived optimum. In RL terms, this is known as the exploration-exploitation trade-off. All RL algorithms (MCTS as well) try to balance the exploration-exploitation trade-off.
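As a toy illustration of this trade-off (an epsilon-greedy bandit, which is not the mechanism MCTS itself uses — MCTS relies on UCT, covered below), here is a minimal sketch with made-up action values:

```python
import random

def epsilon_greedy(q_values: dict, epsilon: float = 0.1) -> str:
    """With probability epsilon, explore a random action; otherwise
    exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

# The agent usually exploits "b" but occasionally explores "a" or "c".
print(epsilon_greedy({"a": 0.2, "b": 0.9, "c": 0.5}))
```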
I think this video best explains the concept of exploration-exploitation:
我認(rèn)為這段視頻最能說(shuō)明勘探開發(fā)的概念:
十六進(jìn)制:經(jīng)典棋盤游戲 (HEX: A Classic Board Game)
Now it’s time to get to know the Hex game. It has simple rules:
現(xiàn)在是時(shí)候了解十六進(jìn)制游戲了。 它有簡(jiǎn)單的規(guī)則 :
Fig 2: A HEX board. The winner is the white player, who has connected the two white sides with a chain of stones.
- Black and white alternate turns.
- On each turn, a player places a single stone of their color on any unoccupied cell.
- The winner is the player who forms a chain of their stones connecting their two opposing sides of the board.
Hex can never end in a draw and can be played on any n × n board [1].
Now let's go through the algorithm structure.
算法結(jié)構(gòu) (Algorithm structure)
1 — Selection and Expansion
In this step, the agent takes the current state of the game, selects a node in the tree (each node represents the state that results from choosing an action), and traverses the tree. Each move in each state is assigned two parameters, namely the total number of rollouts and the number of wins among those rollouts (they will be covered in the rollout section).
The strategy used to select the optimal node among the other nodes really matters. Upper Confidence Bound applied to Trees (UCT) is the simplest and yet an effective strategy for selecting the optimal node. This strategy is designed to balance the exploration-exploitation trade-off. This is the UCT formula:
Fig 3: The UCT formula. The first term (w/n) is the exploitation term, and the second term (c * sqrt(log t / n)) is the exploration term.
In this formula, i indicates the i-th node among the children. w is the number of wins among the rollouts that passed through node i, and n is the total number of those rollouts; this part of the formula represents exploitation. c is the exploration coefficient, a constant in the range [0, 1]. This parameter indicates how much the agent should favor unexplored nodes. t is the number of rollouts in the parent node; the second term represents exploration.
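A small Python sketch of UCT-based child selection is shown below; the Node fields and the default exploration constant c = 1.0 are illustrative assumptions for these examples, not code taken from the article's repository:

```python
import math

class Node:
    """One search-tree node: it stores the move that led to it, the player
    who made that move, and the rollout statistics used by UCT."""

    def __init__(self, move=None, parent=None, player=None):
        self.move = move            # action that produced this node's state
        self.player = player        # player who played self.move
        self.parent = parent
        self.children = []
        self.untried_moves = []     # legal moves not yet expanded into children
        self.wins = 0               # w: wins observed through this node
        self.visits = 0             # n: rollouts that passed through this node

def uct_value(node: Node, c: float = 1.0) -> float:
    """UCT score: exploitation (w / n) plus exploration c * sqrt(log t / n)."""
    if node.visits == 0:
        return float("inf")         # unvisited children are always tried first
    exploitation = node.wins / node.visits
    exploration = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploitation + exploration

def select_child(node: Node, c: float = 1.0) -> Node:
    """Pick the child with the highest UCT score."""
    return max(node.children, key=lambda child: uct_value(child, c))
```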
Let's go through an example in order to let all the information provided so far sink in. Look at the image below:
Fig 4: The selection phase.
Consider the action C3 at depth 2: its rollout count n is 2 and its win count w is 1, while t, the parent node's number of rollouts, is 4. As you can see, the selection phase stops at the depth where we encounter an unvisited node. Then, in the expansion phase, when we visit B1 at depth 4, we add it to the tree.
2 — Rollout (also called simulation, playout)
In this step, based on a predefined policy (such as completely random selection), we select actions until we reach a terminal state. At the terminal state, the result of the game for the current player is either 0 (if it loses the rollout) or 1 (if it wins the rollout). In the game of HEX, the terminal state is always reachable and the result of the game is a loss or a win (no draws). But in games like chess, we might end up in an effectively endless playout due to the large branching factor of chess and the depth of the search tree.
Fig 5: The rollout phase, following the previous steps (selection and expansion).
In the image above, after the black player chose B1 in the expansion step, a rollout is started in the simulation step and played out to a terminal state of the game.
Here, we chose random actions to reach the terminal state of the game. In the terminal state, as you can see, the white player has won the game by connecting the left side to the right side with its stones. Now it's time to use this information in the backpropagation part.
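A purely random rollout policy, as used in this example, can be sketched like this (again relying on the assumed HexState-style interface from earlier):

```python
import random

def rollout(state) -> int:
    """Play uniformly random moves until the game ends and return the winner.
    In HEX a winner always exists, so this loop always terminates."""
    while not state.is_terminal():
        state.play(random.choice(state.legal_moves()))
    return state.winner()   # BLACK or WHITE in the HexState sketch above
```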
3 — Backpropagation
In this part, we update the statistics (the rollout count and the number of wins out of the total rollouts) in the nodes that we traversed in the tree during the selection and expansion parts.
During backpropagation, we need to update the rollout counts and the win/loss statistics of the nodes. The only thing we need is to figure out which player won the game in the rollout (e.g., the white player in Figure 4).
For Figure 4, since the black player is the winner (the one who chose the action leading to the terminal state), all the states that resulted from the black player's actions are rewarded with 1, and the states that resulted from the white player's actions are given a reward of 0 (we could choose punishment instead by setting it to -1).
For all of these states (the tree nodes selected through step 1), the total rollout count increases by one, as Figure 6 displays.
Fig 6: Wins (for the black player), losses (for the white player), and the total number of rollouts are updated for the nodes along the tree-search path.
These steps keep repeating until a predefined condition ends the loop (such as a time limit).
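Putting the pieces together, a backpropagation step and the overall MCTS loop might look like the following sketch, built on the hypothetical helpers above, with a simple 0/1 reward and a fixed iteration budget standing in for a time limit:

```python
def backpropagate(node, winner: int) -> None:
    """Walk back up to the root, updating visit counts and rewarding the
    nodes whose stored move was made by the winning player."""
    while node is not None:
        node.visits += 1
        if node.move is not None and node.player == winner:
            node.wins += 1    # reward 1 for the winner's moves, 0 otherwise
        node = node.parent

def mcts(root_state, iterations: int = 1000):
    """Run selection, expansion, rollout, and backpropagation repeatedly,
    then return the move of the most-visited child of the root."""
    root = Node()
    root.untried_moves = root_state.legal_moves()
    for _ in range(iterations):
        leaf, state = select_and_expand(root, root_state)
        winner = rollout(state)
        backpropagate(leaf, winner)
    best = max(root.children, key=lambda child: child.visits)
    return best.move
```

For example, mcts(HexState(5), iterations=2000) would return a move for the player to act on an empty 5 × 5 board under the assumptions of these sketches.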
的優(yōu)點(diǎn)和缺點(diǎn) (Advantages and Disadvantages)
Advantages:
1 — MCTS is a simple algorithm to implement.
2 — Monte Carlo Tree Search is a heuristic algorithm. MCTS can operate effectively without any knowledge of the particular domain, apart from the rules and end conditions, and it can find its own moves and learn from them by playing random playouts.
3 — The MCTS tree can be saved at any intermediate state, and that state can be reused in future use cases whenever required.
4 — MCTS supports asymmetric expansion of the search tree based on the circumstances in which it is operating.
Disadvantages:
1 — As the tree grows rapidly after a few iterations, it might require a huge amount of memory.
2 — There is a bit of a reliability issue with Monte Carlo Tree Search. In certain scenarios, a single branch or path might lead to a loss against the opponent when the algorithm is used for turn-based games. This is mainly due to the vast number of move combinations: some nodes might not be visited enough times to understand their long-term outcome.
3 — The MCTS algorithm needs a huge number of iterations to be able to decide the most effective path reliably, so there is a bit of a speed issue there.
4 — MCTS can return a recommended move at any time because the statistics about the simulated games are constantly updated. The recommended moves aren't great when the algorithm starts, but they continually improve as the algorithm runs.
結(jié)論 (Conclusion)
Now we have seen how the MCTS algorithm can efficiently use randomness to sample the possible scenarios and come up with the best action from its simulations. The quality of the action chosen by MCTS each time depends on how well it handles exploration and exploitation in the environment.
OK, now that we have covered the necessary theoretical concepts, we're ready to go to the next level and get our hands dirty with code. In the next article, we will first describe the whole framework and the necessary modules to implement; then we will implement the basic MCTS with UCT. After that, we will improve the framework by adding more functionality to our code.
Source: https://towardsdatascience.com/monte-carlo-tree-search-implementing-reinforcement-learning-in-real-time-game-player-25b6f6ac3b43