Reinforcement Learning: DQN (with Morvan's (莫烦) code)

Published: 2023/12/20


1. Introduction

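Imagine using Q-learning to learn a video game from its individual frames: every image can be a state, and the character in the game can take many actions (up, down, left, right, crouch, jump, and so on). If a Q-table were used to record a value for every action in every state, the table would be unimaginably large. DQN does not record Q-values in a table; it uses a neural network to predict them, and learns the optimal course of action by repeatedly updating that network.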
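To put a rough number on "unimaginably large", here is a back-of-the-envelope sketch. The frame size, the number of gray levels and the number of actions are illustrative assumptions, not values taken from a particular game.

```python
import math

# Illustrative assumptions (not from a specific game):
FRAME_SIDE = 80     # each frame is FRAME_SIDE x FRAME_SIDE pixels
GRAY_LEVELS = 256   # possible values per pixel
N_ACTIONS = 6       # size of the action set


def q_table_entry_digits(frame_side, gray_levels, n_actions):
    """Return how many decimal digits the number of Q-table entries has.

    The table needs one entry per (state, action) pair, i.e.
    gray_levels ** (frame_side ** 2) * n_actions entries. The count is
    computed in log10 so the huge integer itself is never built.
    """
    n_pixels = frame_side * frame_side
    log10_entries = n_pixels * math.log10(gray_levels) + math.log10(n_actions)
    return math.floor(log10_entries) + 1


digits = q_table_entry_digits(FRAME_SIDE, GRAY_LEVELS, N_ACTIONS)
print(f"The Q-table entry count is a number with about {digits} digits")
```

Even under these modest assumptions the entry count has thousands of digits, which is why a function approximator replaces the table.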

A deep Q-network (DQN) combines Q-learning with a convolutional neural network (CNN).

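Being off-policy is a characteristic of Q-learning, and DQN keeps it. The difference is this: when Q-learning is implemented with a single neural network, the Q used to compute the target and the Q used to compute the prediction are the same Q, i.e. the same network. The resulting problem is that every update to the network also changes the target, which easily prevents the parameters from converging. Recall that in supervised learning the labels are fixed and do not change as the parameters are updated.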

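DQN therefore adds a target Q-network alongside the original Q-network, used only to compute the target. It has the same structure and the same initial weights as the Q-network, but the Q-network is updated at every iteration while the target Q-network is updated only once every so often. DQN's target is $R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'; \omega^-)$, where $\omega^-$ denotes weights that are updated more slowly than the Q-network's weights $\omega$. The loss function used to train the network parameters is built from the difference between q_target and q_eval: it is the mean squared difference, loss = mean((q_target - q_eval)^2).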

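Backpropagation actually trains only one network, eval_net. target_net only runs forward passes to produce q_target (q_target = r + γ * max_a' Q(s', a')), where the Q(s', a') values are the outputs of forward passes through target_net for the next state s'.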
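The target and the per-sample loss described above can be sketched in a few lines of plain Python. The function names and the numbers are hypothetical and chosen only so the arithmetic is easy to follow; the terminal-state branch follows the algorithm listing in this section rather than the Morvan code further down, which does not store a terminal flag.

```python
def td_target(reward, next_q_values, gamma, done):
    """Return y = r for a terminal step, else r + gamma * max_a' Q(s', a'; w^-).

    next_q_values are the target network's outputs for the next state s'.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_error(target, q_eval_chosen):
    """Per-sample loss (y - Q(s, a; w))^2, minimized with respect to eval_net only."""
    return (target - q_eval_chosen) ** 2


# Hypothetical numbers for a single transition:
next_q_from_target_net = [0.5, 2.0, 1.0]   # Q(s', a'; w^-) for each action a'
y = td_target(reward=1.0, next_q_values=next_q_from_target_net, gamma=0.5, done=False)
loss = squared_error(y, q_eval_chosen=1.5)
print(y, loss)  # 2.0 0.25
```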

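Compared with Q-learning, DQN makes three improvements: it uses a convolutional neural network to approximate the action-value function, it uses a target Q-network to compute the target, and it uses experience replay. In reinforcement learning the observations arrive in order, step by step, and updating a neural network on such correlated data causes problems. Recall that in supervised learning the samples are independent of one another. DQN therefore uses experience replay: a memory stores the transitions that have been experienced, and each parameter update draws a random subset of that memory, which breaks the correlation between samples.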
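A minimal sketch of such a memory, using only the Python standard library. The class name `ReplayMemory` and the `done` field are assumptions made for illustration; the Morvan code later in this post uses a NumPy array instead.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal order of the data.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=5)
for t in range(8):                      # 8 stores into a capacity of 5
    memory.store(s=t, a=0, r=0.0, s_next=t + 1, done=False)
batch = memory.sample(batch_size=3)
print(len(memory), batch)
```

Only the five most recent transitions (states 3 to 7) remain, and the minibatch is drawn from them in random order.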

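Initialize the memory D with capacity $N$;
Initialize the Q-network with randomly generated weights $\omega$;
Initialize the target Q-network with weights $\omega^- = \omega$;
For episode = 1, 2, …, M:
Initialize the initial state $S_1$;
For step t = 1, 2, …, T:
- Generate action $a_t$ with an $\epsilon$-greedy policy: with probability $\epsilon$ choose a random action, otherwise choose $a_t = \arg\max_a Q(S_t, a; \omega)$;
- Execute action $a_t$ and receive the reward $r_t$ and the new state $S_{t+1}$;
- Store the transition $(S_t, a_t, r_t, S_{t+1})$ in D;
- Randomly draw a minibatch of transitions $(S_j, a_j, r_j, S_{j+1})$ from D;
- Set $y_j = r_j$ if step $j+1$ is terminal; otherwise set $y_j = r_j + \gamma \max_{a'} Q(S_{j+1}, a'; \omega^-)$;
- Take a gradient-descent step on $(y_j - Q(S_j, a_j; \omega))^2$ with respect to $\omega$;
- Every C steps, update the target Q-network: $\omega^- = \omega$.
End For;
End For.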
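The loop above can be exercised end to end on a toy problem. The sketch below is not the DQN from this post: to stay dependency-free it replaces the neural network with a table of values (so $\omega$ is simply the table), and it uses a made-up five-state corridor as the environment. All names and hyperparameter values are assumptions chosen for illustration.

```python
import random
from collections import deque

N_STATES, GOAL = 5, 4      # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (0, 1)           # 0 = move left, 1 = move right


def env_step(state, action):
    """Toy deterministic environment: reward 1 on reaching the goal, else 0."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def greedy(q_row):
    """argmax over actions, breaking ties at random."""
    best = max(q_row)
    return random.choice([a for a in ACTIONS if q_row[a] == best])


def train(episodes=300, max_steps=50, gamma=0.9, lr=0.5, epsilon=0.3,
          capacity=1000, batch_size=8, sync_every=20, seed=0):
    random.seed(seed)
    memory = deque(maxlen=capacity)               # memory D with capacity N
    q = [[0.0, 0.0] for _ in range(N_STATES)]     # "Q-network" weights w
    q_target = [row[:] for row in q]              # target weights w^- = w
    total_steps = 0
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # epsilon-greedy: random action with probability epsilon
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = greedy(q[state])
            next_state, reward, done = env_step(state, action)
            memory.append((state, action, reward, next_state, done))
            # random minibatch from D
            batch = random.sample(list(memory), min(batch_size, len(memory)))
            for s, a, r, s2, d in batch:
                # the target uses the slowly updated copy q_target
                y = r if d else r + gamma * max(q_target[s2])
                # gradient step on (y - Q(s, a))^2; the factor 2 is folded into lr
                q[s][a] += lr * (y - q[s][a])
            total_steps += 1
            if total_steps % sync_every == 0:     # every C steps: w^- = w
                q_target = [row[:] for row in q]
            state = next_state
            if done:
                break
    return q


q_values = train()
policy = [greedy(q_values[s]) for s in range(GOAL)]
print(policy)
```

After training, the greedy policy should move right in every non-terminal state. The target copy `q_target` changes only every `sync_every` steps, which is the stabilizing idea described earlier.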

2. Code walkthrough

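(1) SciPy's imresize function is used here to downsample the image. Note that imresize lived in scipy.misc and was removed in SciPy 1.3, so this code needs an older SciPy or an equivalent resize function. The preprocess function preprocesses each image before it is fed to the DQN: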

from scipy.misc import imresize  # available only in SciPy < 1.3

def preprocess(img):
    img_temp = img[31:195]  # choose the important area of the image
    img_temp = img_temp.mean(axis=2)  # convert to grayscale
    # downsample the image
    img_temp = imresize(img_temp, size=(IM_SIZE, IM_SIZE), interp='nearest')
    return img_temp

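IM_SIZE is a global parameter, set to 80 here. The comments in the function describe each step.

[Figure: the observation space before and after preprocessing]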
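For readers without an old SciPy at hand, the same three steps (crop, grayscale, nearest-neighbour downsample) can be sketched on nested lists with no dependencies. This is an illustration of the idea, not a drop-in replacement: the function names are made up, and the index mapping used for the resize is one simple convention that may differ by a pixel from what imresize produces.

```python
def to_grayscale(img):
    """Average the colour channels of an H x W x C nested list."""
    return [[sum(pixel) / len(pixel) for pixel in row] for row in img]


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an H x W nested list.

    Output pixel (i, j) copies input pixel (i * in_h // out_h, j * in_w // out_w).
    """
    in_h, in_w = len(img), len(img[0])
    return [[img[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def preprocess_pure(img, top, bottom, size):
    """Crop rows [top, bottom), convert to grayscale, downsample to size x size."""
    cropped = img[top:bottom]
    gray = to_grayscale(cropped)
    return resize_nearest(gray, size, size)


# Tiny hypothetical 4 x 4 RGB "frame" whose pixel value encodes its position.
frame = [[(r * 4 + c,) * 3 for c in range(4)] for r in range(4)]
small = preprocess_pure(frame, top=0, bottom=4, size=2)
print(small)  # [[0.0, 2.0], [8.0, 10.0]]
```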

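A sequence of the four most recent observations is used to determine the current situation and to train the agent. The update_state function appends the current observation to the previous state, dropping the oldest one, to produce that sequence: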

import numpy as np

def update_state(state, obs):
    obs_small = preprocess(obs)
    # drop the oldest frame and append the newest one
    return np.append(state[1:], np.expand_dims(obs_small, 0), axis=0)
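The same sliding-window behaviour can be seen without NumPy by using a bounded deque. The names below are hypothetical, and strings stand in for frames. Unlike update_state, which returns a new array, this variant mutates the deque in place.

```python
from collections import deque

STACK_SIZE = 4   # number of most recent observations kept in the state


def init_state(first_obs):
    """Start an episode by repeating the first observation STACK_SIZE times."""
    return deque([first_obs] * STACK_SIZE, maxlen=STACK_SIZE)


def update_state_deque(state, obs):
    """Append the newest observation; maxlen drops the oldest automatically."""
    state.append(obs)
    return state


state = init_state("f0")
for name in ["f1", "f2", "f3", "f4", "f5"]:
    state = update_state_deque(state, name)
print(list(state))  # ['f2', 'f3', 'f4', 'f5']
```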

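(2) Import the necessary modules. The sys module's stdout.flush() is used to flush standard output (the screen in this example). The random module is used to draw random samples from the experience replay buffer (the buffer that stores past experience). The datetime module is used to record how long training takes.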

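Define the training hyperparameters, which you can experiment with: the minimum and maximum sizes of the experience replay buffer, and how often the target network is updated.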

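Define the DQN class. Its constructor builds the CNN with the tf.contrib.layers.conv2d function (tf.contrib exists only in TensorFlow 1.x) and defines the loss and the training operation.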

In the class, set_session() sets up the session, predict() predicts the action-value function, update() updates the network, and sample_action() selects an action with the epsilon-greedy algorithm.
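The epsilon-greedy choice made in sample_action() can be sketched as follows. This standalone function takes the predicted action values directly. Its signature is an assumption for illustration and not the signature of the class method described above.

```python
import random


def sample_action(q_values, eps, rng=random):
    """Epsilon-greedy: a random action with probability eps, else the argmax."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))        # explore
    # exploit: index of the largest predicted action value
    return max(range(len(q_values)), key=lambda a: q_values[a])


predicted = [0.1, 0.9, 0.3]                        # hypothetical Q(s, a) values
greedy_action = sample_action(predicted, eps=0.0)  # never explores
random_action = sample_action(predicted, eps=1.0)  # always explores
print(greedy_action, random_action)
```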

Methods for loading and saving the network are also defined, because training takes a long time.

A method is also defined that copies the parameters of the main DQN network to the target network.
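The effect of that copy can be illustrated with plain Python containers standing in for network weights. The dictionary layout and the function name are assumptions; in TensorFlow the copy is done with assign operations, as in the Morvan code further down.

```python
import copy


def copy_params(main_params):
    """Return an independent snapshot of the main network's parameters."""
    return copy.deepcopy(main_params)


# Hypothetical parameters of the main network.
main = {"w1": [[0.1, 0.2], [0.3, 0.4]], "b1": [0.0, 0.0]}
target = copy_params(main)          # target network starts equal to main

main["w1"][0][0] = 9.9              # a training step changes the main network
frozen_value = target["w1"][0][0]   # the target snapshot is unaffected

target = copy_params(main)          # periodic sync: target catches up
synced_value = target["w1"][0][0]
print(frozen_value, synced_value)   # 0.1 9.9
```

A deep copy matters here: a shallow copy would share the inner lists, and the "frozen" target would silently follow every training step.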

Define the function learn(), which predicts the value function and updates the original DQN network.

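With all the components defined in the main code, a DQN can now be built and trained to play an Atari game. The code is commented in detail; it is mainly an extension of the earlier Q-learning code, with an experience replay buffer added.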

[Figure: average reward per 100 runs, which shows the improvement in reward more clearly]
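A curve of this kind can be produced from a list of per-episode rewards. The sketch below assumes non-overlapping blocks of 100 episodes. The original plot may instead have used a running average, and the reward values here are made up.

```python
def block_averages(rewards, block=100):
    """Average reward over consecutive, non-overlapping blocks of episodes."""
    averages = []
    for start in range(0, len(rewards), block):
        chunk = rewards[start:start + block]   # the last chunk may be shorter
        averages.append(sum(chunk) / len(chunk))
    return averages


# Made-up rewards: 100 episodes at 1.0, 100 at 3.0, then 50 at 5.0.
episode_rewards = [1.0] * 100 + [3.0] * 100 + [5.0] * 50
print(block_averages(episode_rewards))  # [1.0, 3.0, 5.0]
```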

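These are only the results after the first 500 runs. Better results require many more runs, roughly 10,000. Training the agent means running the game many times, which consumes a lot of time and memory. OpenAI Gym provides a wrapper that saves the game as a video, so instead of calling render you can use this wrapper to save videos and review later how the agent learned. AI engineers and enthusiasts can upload these videos to show their results.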

Morvan's code

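Building the neural networks
To implement DQN in TensorFlow, the recommended approach is to build two neural networks: target_net predicts the q_target values and does not update its parameters right away, while eval_net predicts q_eval and always holds the latest parameters. The two networks have exactly the same structure; only their parameters differ. The point of having two networks is to hold the parameters of one of them (target_net) fixed: target_net is a historical version of eval_net, holding a set of parameters that eval_net had some time ago. Those parameters stay fixed for a while and are then replaced with eval_net's new parameters. eval_net is continually improved, so it is the trainable network (trainable=True), whereas target_net is not trained (trainable=False).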

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)


class DeepQNetwork:
    def _build_net(self):
        # ------------------ build evaluate_net ------------------
        self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input
        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss
        with tf.variable_scope('eval_net'):
            # c_names (collections_names) are the collections to store variables
            c_names, n_l1, w_initializer, b_initializer = \
                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers

            # first layer. collections is used later when assigning to the target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)

            # second layer. collections is used later when assigning to the target net
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                self.q_eval = tf.matmul(l1, w2) + b2

        with tf.variable_scope('loss'):
            self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
        with tf.variable_scope('train'):
            self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

        # ------------------ build target_net ------------------
        self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')  # input
        with tf.variable_scope('target_net'):
            # c_names (collections_names) are the collections to store variables
            c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

            # first layer. collections is used later when assigning to the target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)

            # second layer. collections is used later when assigning to the target net
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
                self.q_next = tf.matmul(l1, w2) + b2

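2. The decision-making process
With the neural-network part defined, the remaining parts come next, starting with the initialization of the attributes.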

class DeepQNetwork:
    def __init__(
            self,
            n_actions,
            n_features,
            learning_rate=0.01,
            reward_decay=0.9,
            e_greedy=0.9,
            replace_target_iter=300,
            memory_size=500,
            batch_size=32,
            e_greedy_increment=None,
            output_graph=False,
    ):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon_max = e_greedy  # maximum value of epsilon
        self.replace_target_iter = replace_target_iter  # number of steps between target_net replacements
        self.memory_size = memory_size  # upper limit of the memory
        self.batch_size = batch_size  # how many memories are drawn for each update
        self.epsilon_increment = e_greedy_increment  # increment of epsilon
        # whether to start in exploration mode and gradually reduce exploration
        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max

        # total learning step
        self.learn_step_counter = 0

        # initialize zero memory [s, a, r, s_]
        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))

        # consist of [target_net, evaluate_net]
        self._build_net()
        t_params = tf.get_collection('target_net_params')  # parameters of target_net
        e_params = tf.get_collection('eval_net_params')  # parameters of eval_net
        self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

        self.sess = tf.Session()

        if output_graph:
            # $ tensorboard --logdir=logs
            # tf.train.SummaryWriter will soon be deprecated, use the following
            tf.summary.FileWriter("logs/", self.sess.graph)

        self.sess.run(tf.global_variables_initializer())
        self.cost_his = []  # records every cost value, to be plotted at the end
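One detail of this constructor is easy to misread: in this code epsilon is the probability of taking the greedy action (see choose_action further down), so it starts at 0 and grows toward epsilon_max when e_greedy_increment is set. The sketch below shows such a schedule with a hypothetical helper. Unlike the update in learn(), it clamps with min() so epsilon never overshoots the maximum.

```python
def next_epsilon(epsilon, increment, epsilon_max):
    """Grow epsilon (the greedy probability here) by increment, capped at epsilon_max."""
    if increment is None:
        return epsilon_max          # no schedule: stay at the maximum
    return min(epsilon + increment, epsilon_max)


eps = 0.0
history = []
for _ in range(12):                 # twelve hypothetical learning steps
    eps = next_epsilon(eps, increment=0.1, epsilon_max=0.9)
    history.append(eps)
print(history[-1])                  # 0.9
```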

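Memory storage, one of the key parts of DQN: the steps that have been experienced are recorded (up to memory_size of the most recent ones) and can be learned from repeatedly, which is why this is an off-policy method.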

class DeepQNetwork:
    def store_transition(self, s, a, r, s_):
        if not hasattr(self, 'memory_counter'):
            self.memory_counter = 0

        transition = np.hstack((s, [a, r], s_))

        # replace the old memory with new memory
        index = self.memory_counter % self.memory_size
        self.memory[index, :] = transition

        self.memory_counter += 1
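The modulo index in store_transition is what turns the fixed-size array into a ring buffer: once the memory is full, new transitions overwrite the oldest rows. A dependency-free sketch of just that indexing, with a hypothetical class name and integers standing in for transitions:

```python
class RingMemory:
    """Fixed-size memory whose write position wraps around via modulo."""

    def __init__(self, size):
        self.size = size
        self.rows = [None] * size
        self.counter = 0                      # total number of stores so far

    def store(self, transition):
        index = self.counter % self.size      # wraps back to 0 when full
        self.rows[index] = transition
        self.counter += 1

    def valid_count(self):
        """Number of rows that hold real data (relevant before the first wrap)."""
        return min(self.counter, self.size)


mem = RingMemory(size=3)
for t in range(5):
    mem.store(t)
print(mem.rows)  # [3, 4, 2]
```

Transitions 0 and 1 were overwritten by 3 and 4, while 2 is now the oldest surviving entry.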

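Action selection: eval_net produces the values of all actions and the action with the largest value is chosen. In this code that happens with probability epsilon; otherwise a random action is taken. The learning step is how DeepQNetwork learns and updates its parameters. It involves the interplay of target_net and eval_net, and is a very important step.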

class DeepQNetwork:
    def choose_action(self, observation):
        # to have batch dimension when fed into the tf placeholder
        observation = observation[np.newaxis, :]

        if np.random.uniform() < self.epsilon:
            # forward feed the observation and get the q value for every action
            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
            action = np.argmax(actions_value)
        else:
            action = np.random.randint(0, self.n_actions)
        return action

    def learn(self):
        # check to replace target parameters
        if self.learn_step_counter % self.replace_target_iter == 0:
            self.sess.run(self.replace_target_op)
            print('\ntarget_params_replaced\n')

        # sample batch memory from all memory
        if self.memory_counter > self.memory_size:
            sample_index = np.random.choice(self.memory_size, size=self.batch_size)
        else:
            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
        batch_memory = self.memory[sample_index, :]

        # get q_next (the q produced by target_net) and q_eval (the q produced by eval_net)
        q_next, q_eval = self.sess.run(
            [self.q_next, self.q_eval],
            feed_dict={
                self.s_: batch_memory[:, -self.n_features:],  # fixed params
                self.s: batch_memory[:, :self.n_features],  # newest params
            })

        # The next few steps are very important.
        # q_next and q_eval contain the values of all actions,
        # but we only need the value of the action that was actually chosen; the others are not needed.
        # So the error of every other action is made 0, and only the error of the chosen action
        # is propagated back as the update signal.
        # This is what we ultimately want, e.g. q_target - q_eval = [1, 0, 0] - [-1, 0, 0] = [2, 0, 0]
        # q_eval = [-1, 0, 0] means that in this memory action 0 was taken and Q(s, a0) = -1,
        # so the other entries are Q(s, a1) = Q(s, a2) = 0.
        # q_target = [1, 0, 0] means that r + gamma * maxQ(s_) = 1 for this memory; whichever action
        # attains the max in s_, the value must line up with the action position in q_eval,
        # so the 1 is placed at the position of action 0.
        # The code below achieves this, but in a slightly different way that is easier to compute:
        # first copy q_eval into q_target, so that q_target - q_eval is all zeros,
        # then use the action column of batch_memory to overwrite the matching
        # memory-action entries of q_target with reward + gamma * maxQ(s_),
        # so that q_target - q_eval takes the form we need.

        # change q_target w.r.t. q_eval's action
        q_target = q_eval.copy()

        batch_index = np.arange(self.batch_size, dtype=np.int32)
        eval_act_index = batch_memory[:, self.n_features].astype(int)
        reward = batch_memory[:, self.n_features + 1]

        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

        # train eval network
        _, self.cost = self.sess.run([self._train_op, self.loss],
                                     feed_dict={self.s: batch_memory[:, :self.n_features],
                                                self.q_target: q_target})
        self.cost_his.append(self.cost)

        # increasing epsilon
        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
        self.learn_step_counter += 1
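The q_target trick described in the comments of learn() can be checked by hand on a single memory. The sketch below reproduces it with plain lists instead of NumPy arrays. The function name and the numbers are hypothetical.

```python
def build_q_target(q_eval, q_next, actions, rewards, gamma):
    """Copy q_eval, then overwrite only the entry of the action that was taken.

    Every other entry stays equal to q_eval, so its error is exactly zero.
    """
    q_target = [row[:] for row in q_eval]            # same role as q_eval.copy()
    for i, (a, r) in enumerate(zip(actions, rewards)):
        q_target[i][a] = r + gamma * max(q_next[i])  # r + gamma * max Q(s_, a')
    return q_target


q_eval = [[-1.0, 0.5, 0.25]]    # eval_net output for one stored state
q_next = [[2.0, 0.0, 1.0]]      # target_net output for its next state
q_target = build_q_target(q_eval, q_next, actions=[0], rewards=[1.0], gamma=0.5)

# Element-wise error that would be backpropagated through eval_net.
error = [[t - e for t, e in zip(t_row, e_row)]
         for t_row, e_row in zip(q_target, q_eval)]
print(q_target, error)  # [[2.0, 0.5, 0.25]] [[3.0, 0.0, 0.0]]
```

Only the chosen action (index 0) carries a non-zero error, which matches the `[2, 0, 0]` pattern given as an example in the code comments.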

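3. The interaction process
The way DQN interacts with the environment is broadly the same as in Q-learning; the only addition is the memory-storage step. In plain Q-learning the agent interacts and improves based on the current policy, and each sample generated by interaction is used for learning once and then discarded; DQN follows the same loop but stores each sample in memory so it can be reused.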

from maze_env import Maze
from RL_brain import DeepQNetwork


def run_maze():
    step = 0
    for episode in range(1000):
        # initial observation
        observation = env.reset()

        while True:
            # fresh env
            env.render()

            # RL chooses an action based on the observation
            action = RL.choose_action(observation)

            # RL takes the action and gets the next observation and reward
            observation_, reward, done = env.step(action)

            RL.store_transition(observation, action, reward, observation_)

            if (step > 200) and (step % 5 == 0):
                RL.learn()

            # swap observation
            observation = observation_

            # break the while loop at the end of this episode
            if done:
                break
            step += 1

    # end of game
    print('game over')
    env.destroy()


if __name__ == "__main__":
    # maze game
    env = Maze()
    RL = DeepQNetwork(env.n_actions, env.n_features,
                      learning_rate=0.01,
                      reward_decay=0.9,
                      e_greedy=0.9,
                      replace_target_iter=200,
                      memory_size=20000,
                      output_graph=True)
    env.after(100, run_maze)
    env.mainloop()
    RL.plot_cost()  # plot_cost() is not shown in the excerpts above

Finally, many thanks to Morvan!
