
Reinforcement Learning for Game Development

Published: 2025/3/15



Value-Based Methods
- Q-Learning (off-policy)
  - Overview
  - Implementation
- Sarsa (on-policy)
  - Overview
  - Implementation
- Sarsa(λ) (on-policy)
  - Introduction
  - Implementation

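Value-Based Methods

Q-Learning (off-policy)

See: Mofan Python, Q-Learning principles

Overview

In a given environment, the Player cannot judge on its own whether the Action it takes in its current State is a good one; that feedback has to come from the environment (Env). Every decision the Player makes receives feedback from Env, and the Player uses it to keep updating the weight of each Action in each State. Over many steps, this is how it learns.

The target ("actual") value is computed as q_target = r + self.gamma * self.q_table.loc[s_, :].max(), that is, the reward plus the discounted largest action weight in the next state.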
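The target above can be evaluated by hand on a tiny table. The following is a minimal sketch; the two states, two actions, and all numbers are made up for illustration and do not come from the article's environment.

```python
import pandas as pd

# Hypothetical 2-state, 2-action Q-table; the values are illustrative only.
q_table = pd.DataFrame(
    {"left": [0.0, 0.2], "right": [0.5, 1.0]},
    index=["s0", "s1"],
)

gamma = 0.9    # discount factor (reward_decay in the article's class)
r = 0.0        # reward received for the transition s0 -> s1
s_next = "s1"  # state reached after the action

# Target = immediate reward + discounted best Q-value of the next state.
q_target = r + gamma * q_table.loc[s_next, :].max()
print(q_target)
```

The largest weight in row s1 is 1.0, so the target works out to 0 + 0.9 * 1.0.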

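Looking at the update rule self.q_table.loc[s, a] += self.lr * (q_target - q_predict), the method resembles the backpropagation (BP) algorithm used in AI: both move an estimate by a learning rate times an error. It is still quite different from what is usually meant by artificial intelligence, though, and it has one obvious weakness: the Q-table can need an enormous amount of memory, because the number of States may be very large.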
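To make the update rule concrete, here is a small sketch of a single step written as a plain function. The function name q_update and the numbers passed to it are assumptions made for this example; only the formula itself comes from the article.

```python
def q_update(q_predict, r, max_q_next, lr=0.01, gamma=0.9):
    """One tabular Q-learning step: move the estimate toward the target by lr * error."""
    q_target = r + gamma * max_q_next
    return q_predict + lr * (q_target - q_predict)


# A single step with made-up values: the estimate moves 1% of the way to the target.
q_new = q_update(q_predict=0.5, r=1.0, max_q_next=2.0)
print(q_new)

# Repeating the step against a fixed target shrinks the remaining gap
# by a factor of (1 - lr) each time, so the estimate approaches the target.
q = 0.5
for _ in range(1000):
    q = q_update(q, r=1.0, max_q_next=2.0)
print(q)
```

With a learning rate of 0.01 each step closes only 1% of the gap, which is why many visits to the same state-action pair are needed before the table settles.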

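Because the number of States has to stay small, a Q-Learning table ends up tightly bound to the environment it was trained in. Move it to a different environment and it can no longer adapt.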
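How quickly the table grows can be estimated with simple arithmetic. The sketch below assumes a hypothetical grid game in which a state is the player's cell plus a set of on/off flags (for example, which keys have been collected); the game and the helper name q_table_entries are illustrative assumptions, not part of the article.

```python
def q_table_entries(width, height, n_flags, n_actions):
    """Number of Q-values for a grid game whose state is (cell, boolean flags)."""
    n_states = width * height * 2 ** n_flags
    return n_states * n_actions


# A 10x10 grid with no flags and 4 move actions stays small.
small = q_table_entries(width=10, height=10, n_flags=0, n_actions=4)

# A 100x100 grid with 10 on/off flags already needs tens of millions of entries.
large = q_table_entries(width=100, height=100, n_flags=10, n_actions=4)

print(small, large)
```

Each extra boolean flag doubles the number of rows, which is the kind of growth the table cannot keep up with.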

Implementation

import numpy as np
import pandas as pd

"""
Algorithm reference: https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/tabular-q1/
Idea: greedy.
1. The Q-table stores a weight for every action of every state; the action with the largest weight is chosen.
2. The Q-table is updated from error = (actual reward - estimated reward).
"""


class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        # Available actions
        self.actions = actions
        # Learning rate
        self.lr = learning_rate
        # Discount factor applied to the next state's Q-value
        self.gamma = reward_decay
        # Probability of acting greedily instead of exploring
        self.epsilon = e_greedy
        # Decision table: rows are states, columns are actions
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def check_state_exit(self, state):
        """Add a zero-initialised row for state if the table has none yet."""
        if state not in self.q_table.index:
            # DataFrame.append was removed in pandas 2.0, so the row is added by
            # label assignment (states are assumed to be scalar labels such as strings)
            self.q_table.loc[state] = [0.0] * len(self.actions)

    def choose_action(self, observation):
        """Choose an action for the current state of the environment."""
        # Make sure the state has a row
        self.check_state_exit(observation)
        if np.random.uniform() < self.epsilon:
            # Greedy: take the weights of this state and pick a largest one,
            # breaking ties at random
            state_action = self.q_table.loc[observation, :]
            action = np.random.choice(state_action[state_action == np.max(state_action)].index)
        else:
            # Explore: pick any action at random
            action = np.random.choice(self.actions)
        return action

    def learn(self, s, a, r, s_):
        """Update the table. s: current state, a: action, r: reward, s_: next state."""
        # Make sure the next state has a row
        self.check_state_exit(s_)
        # Current estimate
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            # Not the end: reward plus discounted best weight of the next state
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()
        else:
            # End of the episode: there is no next state to look at
            q_target = r
        # Move the estimate toward the target
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)


# ---- The following is pseudocode: Env is a placeholder for your own environment ----
def main():
    # Create the environment
    env = Env()
    # env.actions lists the actions available in the environment
    RL = QLearningTable(env.actions)
    # Repeat for 100 episodes
    for i in range(100):
        # Reset the environment
        env.reset()
        while True:
            # Read the player's state
            s = env.player.state()
            # Choose the player's action in this state
            a = RL.choose_action(s)
            # The player takes one step
            s_, r, done = env.step(a)
            # Learn from the transition
            RL.learn(s, a, r, s_)
            # Stop when the episode is over
            if done:
                break


if __name__ == "__main__":
    main()
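Since the driver loop above is pseudocode, the following self-contained sketch shows the same loop running end to end. Everything in it is an assumption made for illustration: CorridorEnv is a hypothetical five-cell corridor with the goal at the right end, the Q-table is a plain dict rather than the article's DataFrame, and the hyperparameters (including the fixed seed) were chosen only so the example trains quickly.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: cells 0..n-1, start at cell 0, goal at the last cell."""
    actions = ["left", "right"]

    def __init__(self, n_cells=5):
        self.n_cells = n_cells
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Move one cell; walking left from cell 0 keeps the player in place.
        if action == "right":
            self.pos += 1
        else:
            self.pos = max(0, self.pos - 1)
        done = self.pos == self.n_cells - 1
        reward = 1.0 if done else 0.0
        next_state = "terminal" if done else self.pos
        return next_state, reward, done


def train(episodes=200, lr=0.1, gamma=0.9, epsilon=0.9, seed=0):
    rng = random.Random(seed)
    env = CorridorEnv()
    q = {}  # state -> {action: value}

    def row(state):
        # Create a zero-initialised row the first time a state is seen.
        return q.setdefault(state, {a: 0.0 for a in env.actions})

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            values = row(s)
            if rng.random() < epsilon:
                # Greedy choice, ties broken at random.
                best = max(values.values())
                a = rng.choice([x for x in env.actions if values[x] == best])
            else:
                # Exploration.
                a = rng.choice(env.actions)
            s_, r, done = env.step(a)
            # Q-learning target: reward plus discounted best value of the next state.
            target = r if done else r + gamma * max(row(s_).values())
            values[a] += lr * (target - values[a])
            s = s_
    return q


q = train()
# Greedy action learned for each non-terminal cell.
greedy = {s: max(q[s], key=q[s].get) for s in range(4)}
print(greedy)
```

Only the last step is rewarded, so the value of moving right has to propagate backwards one cell per visit; after enough episodes the greedy action in every cell should point toward the goal.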

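Sarsa (on-policy)

See: Mofan Python, Sarsa principles

Overview

Sarsa is similar to Q-Learning. It picks an Action for the current State and, before updating, has already decided which Action it will take in the next State; the table is updated with that pair, step after step. The difference is that Q-Learning picks an Action for the current State and then uses the largest action weight of the next State as its target. That is a tentative estimate: the agent may not actually take that best Action next, so the target assumes a reward it has not yet obtained.

Main differences:
A. Policy: Sarsa commits to the Action it will actually take at every step.
B. Learning: because Sarsa really does take the chosen Action at every step, its target is q_target = r + self.gamma * self.q_table.loc[s_, a_].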
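The two targets can be compared side by side on the same transition. This is a minimal sketch with a made-up table; it assumes the exploration step happened to choose "left" as the next action, which is not the best one in that state.

```python
import pandas as pd

# Hypothetical Q-table; the values are illustrative only.
q_table = pd.DataFrame(
    {"left": [0.0, 0.2], "right": [0.5, 1.0]},
    index=["s0", "s1"],
)

gamma, r = 0.9, 0.0
s_next = "s1"
a_next = "left"  # assumed: the action the agent will actually take in s1

# Q-Learning looks at the best action of the next state, whether or not it is taken.
q_target_qlearning = r + gamma * q_table.loc[s_next, :].max()

# Sarsa looks at the action that was actually chosen for the next state.
q_target_sarsa = r + gamma * q_table.loc[s_next, a_next]

print(q_target_qlearning, q_target_sarsa)
```

Whenever the chosen next action is not the best one, the Sarsa target is the smaller of the two, which is why Sarsa tends to learn more cautious values while exploring.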

Implementation

import numpy as np
import pandas as pd
from QLearningTable import QLearningTable


class SarsaTable(QLearningTable):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(SarsaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)

    def learn(self, s, a, r, s_, a_, done):
        """Update the table.

        s: current state, a: action, r: reward,
        s_: next state, a_: next action, done: whether the episode has ended
        """
        # Make sure the next state has a row
        self.check_state_exit(s_)
        # Current estimate
        q_predict = self.q_table.loc[s, a]
        if not done:
            # Not the end: use the weight of the action that will actually be taken
            q_target = r + self.gamma * self.q_table.loc[s_, a_]
        else:
            # End of the episode
            q_target = r
        # Move the estimate toward the target
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)


# ---- The following is pseudocode: Env is a placeholder for your own environment ----
def main():
    # Create the environment
    env = Env()
    # env.actions lists the actions available in the environment
    RL = SarsaTable(env.actions)
    # Repeat for 100 episodes
    for i in range(100):
        # Reset the environment
        env.reset()
        # Read the player's state
        s = env.player.state()
        # Choose the player's action in this state
        a = RL.choose_action(s)
        while True:
            # The player takes one step
            s_, r, done = env.step(a)
            # Choose the action for the next state
            a_ = RL.choose_action(s_)
            # Learn from the transition
            RL.learn(s, a, r, s_, a_, done)
            # Carry the next state and action over to the next step
            s = s_
            a = a_
            # Stop when the episode is over
            if done:
                break


if __name__ == "__main__":
    main()

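Sarsa(λ) (on-policy)

See: Mofan Python, Sarsa(λ) principles

Introduction

Unlike Sarsa, Sarsa(λ) uses a lambda value to also update the weights of the state-action pairs along the path that led to the reward, not just the most recent one. This tends to make the Q-table converge faster.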
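The effect of lambda is easiest to see on one short episode. The sketch below is a simplified, self-contained illustration: the three-state path, the fixed "right" actions, and the hyperparameters are all made up, and the update follows the replace-then-decay trace scheme described in this section.

```python
import pandas as pd

gamma, lambda_, lr = 0.9, 0.9, 0.1
states, actions = ["s0", "s1", "s2"], ["left", "right"]

q_table = pd.DataFrame(0.0, index=states, columns=actions)
trace = pd.DataFrame(0.0, index=states, columns=actions)

# One hypothetical episode: s0 -> s1 -> s2 -> goal, rewarded only on the last step.
# Each entry is (s, a, r, s_, a_, done).
transitions = [
    ("s0", "right", 0.0, "s1", "right", False),
    ("s1", "right", 0.0, "s2", "right", False),
    ("s2", "right", 1.0, None, None, True),
]

for s, a, r, s_, a_, done in transitions:
    q_predict = q_table.loc[s, a]
    q_target = r if done else r + gamma * q_table.loc[s_, a_]
    error = q_target - q_predict

    # Mark the visited pair as fully eligible, clearing the other actions of s.
    trace.loc[s, :] = 0.0
    trace.loc[s, a] = 1.0

    # Every pair is updated in proportion to its eligibility.
    q_table += lr * trace * error

    # Older steps fade: eligibility decays by gamma * lambda per step.
    trace *= gamma * lambda_

print(q_table)
```

After this single episode all three "right" entries are already non-zero, with the pair closest to the reward receiving the largest share. Plain Sarsa would have updated only the last pair.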

Implementation

import numpy as np
import pandas as pd
from SarsaTable import SarsaTable


class SarsaLambdaTable(SarsaTable):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
        super(SarsaLambdaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)
        # Backward-view algorithm: decay rate of the eligibility trace
        self.lambda_ = trace_decay
        # Empty eligibility trace table with the same shape as the Q-table
        self.eligibility_trace = self.q_table.copy()

    def check_state_exit(self, state):
        """Add a zero-initialised row for state to both tables if it is missing."""
        if state not in self.q_table.index:
            # DataFrame.append was removed in pandas 2.0, so rows are added by
            # label assignment (states are assumed to be scalar labels such as strings)
            zeros = [0.0] * len(self.actions)
            self.q_table.loc[state] = zeros
            self.eligibility_trace.loc[state] = zeros

    def learn(self, s, a, r, s_, a_, done):
        """Update the table.

        s: current state, a: action, r: reward,
        s_: next state, a_: next action, done: whether the episode has ended
        """
        # Make sure the next state has a row
        self.check_state_exit(s_)
        # Current estimate
        q_predict = self.q_table.loc[s, a]
        if not done:
            # Not the end
            q_target = r + self.gamma * self.q_table.loc[s_, a_]
        else:
            # End of the episode
            q_target = r
        # Error
        error = q_target - q_predict
        # Set the visited state-action pair to 1: it is an indispensable
        # step on the way to the reward
        self.eligibility_trace.loc[s, :] *= 0
        self.eligibility_trace.loc[s, a] = 1
        # Unlike before, every state-action pair is updated
        self.q_table += self.lr * self.eligibility_trace * error
        # Decay the trace over time: the further a step is from the reward,
        # the less "indispensable" it is
        self.eligibility_trace *= self.gamma * self.lambda_


# ---- The following is pseudocode: Env is a placeholder for your own environment ----
def main():
    # Create the environment
    env = Env()
    # env.actions lists the actions available in the environment
    RL = SarsaLambdaTable(env.actions)
    # Repeat for 100 episodes
    for i in range(100):
        # Reset the environment
        env.reset()
        # Clear the trace so one episode's path does not leak into the next
        RL.eligibility_trace *= 0
        # Read the player's state
        s = env.player.state()
        # Choose the player's action in this state
        a = RL.choose_action(s)
        while True:
            # The player takes one step
            s_, r, done = env.step(a)
            # Choose the action for the next state
            a_ = RL.choose_action(s_)
            # Learn from the transition
            RL.learn(s, a, r, s_, a_, done)
            # Carry the next state and action over to the next step
            s = s_
            a = a_
            # Stop when the episode is over
            if done:
                break


if __name__ == "__main__":
    main()
