Dynamic Programming to Artificial Intelligence: Q-Learning
A failure is not always a mistake, it may simply be the best one can do under the circumstances. The real mistake is to stop trying. — B. F. Skinner
Reinforcement learning models are beating human players in games around the world. Huge international companies are investing millions in reinforcement learning. Reinforcement learning in today’s world is so powerful because it requires neither data nor labels. It could be a technique that leads to general artificial intelligence.
Supervised and Unsupervised Learning
As a summary, in supervised learning, a model learns to map inputs to outputs using predefined, labeled data. In unsupervised learning, a model learns to cluster and group unlabeled data.
Reinforcement Learning
However, in reinforcement learning, the model receives no data set and no explicit guidance; instead, it learns through a trial-and-error approach.
Reinforcement learning is an area of machine learning defined by how some model (called agent in reinforcement learning) behaves in an environment to maximize a given reward. The most similar real-world example is of a wild animal trying to find food in its ecosystem. In this example, the animal is the agent, the ecosystem is the environment, and the food is the reward.
Reinforcement learning is frequently used in the domain of game playing, where there is no immediate way to label how “good” an action was, since we would need to consider all future outcomes.
Markov Decision Processes
The Markov Decision Process is the most fundamental concept of reinforcement learning. There are a few components in an MDP that interact with each other:
- Agent — the model
- Environment — the overall situation
- State — the situation at a specific time
- Action — how the agent acts
- Reward — feedback from the environment
MDP Notation
Sutton, R. S. and Barto, A. G., Introduction to Reinforcement Learning

To repeat what was previously discussed in more mathematically formal terms, some notation must be defined.
t represents the current time step
S is the set of all possible states, with S_t being the state at time t
A is the set of all possible actions, with A_t being the action performed at time t
R is the set of all possible rewards, with R_t being the reward received after performing A_(t-1)
T is the final time step (the episode ends when a terminal condition is reached or t exceeds a limit)
The process can be written as the following cycle (a minimal code sketch of this loop follows the list):
The agent receives a state S_t
The agent performs an action A_t based on S_t
The agent receives a reward R_(t+1)
The environment transitions into a new state S_(t+1)
The cycle repeats for t+1
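To make this cycle concrete, here is a minimal sketch of the loop in Python. The SimpleEnv environment and the random agent below are illustrative stand-ins invented for this sketch; they are not part of the original article.

import random

class SimpleEnv:
    """A toy environment: the state is a step counter, and the episode ends after 5 steps."""
    def reset(self):
        self.t = 0
        return 0                                   # initial state S_0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0       # R_(t+1)
        new_state = self.t                         # S_(t+1)
        done = self.t >= 5                         # terminal condition (time step T)
        return new_state, reward, done

def random_agent(state, n_actions=2):
    return random.randrange(n_actions)             # choose A_t from the action set A

env = SimpleEnv()
state = env.reset()                                # the agent receives a state S_t
done = False
while not done:                                    # the cycle repeats for t+1
    action = random_agent(state)                   # the agent performs A_t based on S_t
    state, reward, done = env.step(action)         # it receives R_(t+1) and the new state S_(t+1)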
Expected Discounted Return (Making Long-Term Decisions)
We discussed that in order for an agent to play a game well, it would need to take future rewards into consideration. This can be described as:
G(t) = R_(t+1) + R_(t+2) +… + R_(T), where G(t) is the sum of the rewards the agent expects after time t.
However, if T is infinite, then in order to make G(t) converge to a single number, we define the discount rate γ to be a number smaller than 1, and define:
G(t) = R_(t+1) + γR_(t+2) + γ²R_(t+3) + …
This can also be written as:
G(t) = R_(t+1) + γG(t+1)
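As a short illustration of this recursion (the reward values and γ below are made-up numbers, not from the original article), the discounted return can be computed by working backwards through an episode's rewards:

def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):          # apply G(t) = R_(t+1) + γ · G(t+1) from the end backwards
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 10.0]                # R_1, R_2, R_3, R_4 for a short episode
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9*0 + 0.81*0 + 0.729*10 = 8.29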
Value and Quality (Q-Learning is Quality-Learning)
A policy describes how an agent will act in any state it finds itself in; an agent is said to follow a policy. Value and quality functions describe how "good" it is for an agent to be in a state, or to be in a state and perform a particular action.
Specifically, the value function v_p(s) is equal to the expected discounted return when starting in state s and following a policy p. The quality function q_p(s, a) is equal to the expected discounted return when starting in state s, performing action a, and then following policy p.
v_p(s) = E[G(t) | S_t = s]

q_p(s, a) = E[G(t) | S_t = s, A_t = a]
A policy is better than or equal to another policy if it has a greater or equal expected discounted return for every state. The optimal value and quality functions v* and q* are those obtained under the best possible policy.
Bellman Equation for Q*
The Bellman Equation is another extremely important concept; it is what turns q-learning into dynamic programming combined with a gradient-descent-like idea.
It states that, when following the best policy, the q-value of a state-action pair, q*(s, a), is equal to the reward received for performing a in s plus the maximum expected discounted reward obtainable after performing a in s, multiplied by the discount rate.
q*(s_t, a_t) = R_(t+1) + γ max_a q*(s_(t+1), a)
In other words, the quality of the best action is equal to the reward plus the quality of the best action on the next time step, multiplied by the discount rate.
Once we find q*, the best policy follows directly: in every state, choose the action with the highest q-value. Q-learning is the algorithm we use to find q*.
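As a small illustration, suppose we already have an approximation of q* stored as a table with one row per state and one column per action (the three-state, two-action numbers below are made up). The best policy and the optimal state values can then be read directly off the table:

import numpy as np

Q = np.array([[0.1, 0.5],
              [0.9, 0.2],
              [0.4, 0.4]])               # made-up q*-values for 3 states and 2 actions

best_policy = np.argmax(Q, axis=1)       # the best action in each state
state_values = np.max(Q, axis=1)         # v*(s) = max_a q*(s, a)
print(best_policy)                       # [1 0 0]
print(state_values)                      # [0.5 0.9 0.4]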
Q-Learning
Q-learning is a technique that attempts to maximize the expected reward over all time steps by finding the best q function. In other words, the objective of q-learning is the same as the objective of dynamic programming, but with future rewards discounted by the discount rate.
In q-learning, a table with all possible state-action pairs is created, and the algorithm iteratively updates the values of the table using the Bellman equation until the optimal q-values are found.
We define a learning rate, a number between 0 and 1 describing how much of the old q-value we keep and how much of the new estimate we adopt in each update.
The process can be described with the following pseudocode:
# Pseudocode: state_size, action_size, max_t, state, step(), game_over(), etc. are assumed to be defined.
Q = np.zeros((state_size, action_size))
for i in range(max_t):
    action = np.argmax(Q[state, :])                   # act greedily with respect to the current q-table
    new_state, reward = step(action)                  # receive R_(t+1) and the new state S_(t+1)
    # Blend the old q-value with the new Bellman estimate using the learning rate
    Q[state, action] = Q[state, action] * (1 - learning_rate) + \
        (reward + gamma * np.max(Q[new_state, :])) * learning_rate
    state = new_state
    if game_over(state):
        break
Exploration and Exploitation
In the beginning, we do not know anything about our environment, so we want to prioritize exploring and gathering information, even if it means we do not get as much reward as possible.
Later, we want to increase our high score and prioritize finding ways to get more rewards by exploiting the q-table.
To do this, we can create a variable epsilon, controlled by hyperparameters, that decides when to explore and when to exploit. Specifically, when a randomly generated number is higher than epsilon, we exploit; otherwise, we explore.
The new code is as follows:
# Pseudocode: batches, max_t, state, time_step(), game_over(), etc. are assumed to be defined as before.
Q = np.zeros((state_size, action_size))
epsilon = 1.0
for _ in range(batches):                              # one pass per episode
    for i in range(max_t):
        if random.uniform(0, 1) > epsilon:            # exploit: use the best-known action
            action = np.argmax(Q[state, :])
        else:                                         # explore: pick a random action
            action = random.randrange(action_size)
        new_state, reward = time_step(action)
        Q[state, action] = Q[state, action] * (1 - learning_rate) + \
            (reward + gamma * np.max(Q[new_state, :])) * learning_rate
        state = new_state
        if game_over(state):
            break
    epsilon *= epsilon_decay_rate                     # explore less as training progresses
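To make the pseudocode concrete, here is a self-contained toy sketch of q-learning with epsilon-greedy exploration. The five-state "corridor" environment, its rewards, and the hyperparameters are invented for illustration and are not from the original article:

import random
import numpy as np

# Toy "corridor": states 0..4, start at state 0, reaching state 4 ends the episode with reward 1.
# Action 0 moves left, action 1 moves right. Everything here is an illustrative assumption.
n_states, n_actions = 5, 2
learning_rate, gamma = 0.1, 0.9
epsilon, epsilon_decay_rate = 1.0, 0.99

def time_step(state, action):
    new_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if new_state == n_states - 1 else 0.0
    done = new_state == n_states - 1
    return new_state, reward, done

Q = np.zeros((n_states, n_actions))
for episode in range(500):
    state = 0
    for t in range(20):
        if random.uniform(0, 1) > epsilon:                 # exploit the q-table
            action = int(np.argmax(Q[state, :]))
        else:                                              # explore with a random action
            action = random.randrange(n_actions)
        new_state, reward, done = time_step(state, action)
        Q[state, action] = Q[state, action] * (1 - learning_rate) + \
            (reward + gamma * np.max(Q[new_state, :])) * learning_rate
        state = new_state
        if done:
            break
    epsilon *= epsilon_decay_rate                          # gradually shift from exploring to exploiting

print(np.argmax(Q, axis=1))   # greedy policy: states 0-3 should learn to move right (action 1)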
Summary
- Reinforcement learning focuses on a situation where an agent receives no data set, and learns from the actions it takes and the rewards it receives from the environment.
- The Markov Decision Process is a control process that models the decision making of an agent placed in an environment.
- The Bellman Equation describes a characteristic of the best policy that turns the problem into a modified form of dynamic programming.
- The agent prioritizes exploring in the beginning, but eventually transitions to exploiting.
Translated from: https://medium.com/swlh/dynamic-programming-to-artificial-intelligence-q-learning-51a189fc0441