A Detailed Walkthrough of Morvan's (莫烦) DQN Code

Published: 2023/12/20

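If you are getting started with Python, Morvan is a great choice; go search for his videos on Bilibili!
As a complete beginner, I watched Morvan's introductory reinforcement learning course, and here I recap DQN as a set of notes.
Mostly, I have added detailed comments to the code.
DQN uses two networks, an eval network and a target network. They share the same structure; the only difference is that the target network's parameters are overwritten with the eval network's parameters at fixed intervals.
maze_env.py is the environment file. It builds a maze game with traps, so I won't go through it in detail.
RL_brain.py is the file that builds the network:
The class DeepQNetwork has five functions.
n_actions is the size of the action space; the agent can move up, down, left, or right, so it is 4. n_features is the number of state features; the state is a position coordinate, so it is 2.
Function _build_net(self): (honestly, the comments here are about as detailed as they can get)
Building the eval network: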

    # Excerpt from DeepQNetwork._build_net (TensorFlow 1.x API; assumes `import tensorflow as tf`)
    # ------------------ build evaluate_net ------------------
    # input: receives the observation
    self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')
    # for calculating loss: receives the q_target values
    self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')
    # Two layers, l1 and l2: l1 has 10 neurons, l2 outputs one value per action.
    # tf.variable_scope() is a context manager for the ops that create variables (layers).
    with tf.variable_scope('eval_net'):
        # c_names (collections_names) are the collections to store variables;
        # they are used later when updating the target_net parameters.
        # A trailing backslash continues a line that is not inside [] or ().
        # n_l1 is the number of neurons in the first layer.
        c_names, n_l1, w_initializer, b_initializer = \
            ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
            tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers

        # First layer of eval_net. collections is used when updating the target_net parameters.
        with tf.variable_scope('l1'):
            w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
            b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
            l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)
            print(l1)

        # Second layer of eval_net. collections is used when updating the target_net parameters.
        with tf.variable_scope('l2'):
            w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
            b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
            self.q_eval = tf.matmul(l1, w2) + b2  # the Q-value estimate for each action

    with tf.variable_scope('loss'):  # compute the error
        self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
    with tf.variable_scope('train'):  # gradient descent
        self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

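The network is two fully connected layers: a hidden layer of 10 neurons followed by an output layer with one unit per action. The final output is q_eval, from which the error is computed.
The target network is built in almost the same way and has the same structure. Its output is q_next.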
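
To make the shapes concrete, here is a minimal sketch of the same forward pass in plain Python, without TensorFlow. The helper names (`relu`, `matmul_add`, `forward`) and the constant weights are made up for illustration; the real network draws its weights from a random normal initializer.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def matmul_add(rows, w, b):
    """Compute rows @ w + b on nested lists; b is one row broadcast over the batch."""
    return [
        [sum(r[k] * w[k][j] for k in range(len(w))) + b[j] for j in range(len(b))]
        for r in rows
    ]


def forward(s, w1, b1, w2, b2):
    """Hidden layer with ReLU, then a linear output layer (one Q value per action)."""
    l1 = [[relu(v) for v in row] for row in matmul_add(s, w1, b1)]
    return matmul_add(l1, w2, b2)


n_features, n_l1, n_actions = 2, 10, 4

# Illustrative constant weights (an assumption, not the values used in training).
w1 = [[0.1] * n_l1 for _ in range(n_features)]   # shape [n_features, n_l1]
b1 = [0.1] * n_l1
w2 = [[0.1] * n_actions for _ in range(n_l1)]    # shape [n_l1, n_actions]
b2 = [0.1] * n_actions

batch = [[0.5, -0.5], [1.0, 1.0]]                # two observations
q = forward(batch, w1, b1, w2, b2)               # shape [2, n_actions]
```

Each observation comes out as a row of n_actions Q values, which is the shape that both q_eval and q_next have.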

Function store_transition(): storing memories

    def store_transition(self, s, a, r, s_):
        # hasattr() returns True if the object has the named attribute, otherwise False.
        if not hasattr(self, 'memory_counter'):
            self.memory_counter = 0
        # Record one [s, a, r, s_] transition.
        # np.hstack(tup) takes a tuple, list, or array of arrays and stacks them
        # in order horizontally (column-wise) into a single array.
        transition = np.hstack((s, [a, r], s_))
        # The total memory size is fixed; once it is exceeded, old memories are replaced by new ones.
        index = self.memory_counter % self.memory_size
        self.memory[index, :] = transition
        self.memory_counter += 1

This stores each transition as one row of the memory pool. Once the pool is full, new transitions overwrite the oldest ones.
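
The overwrite behavior comes entirely from the modulo index. Below is a small stand-alone sketch of that idea using a plain list instead of a NumPy array; the class name `ReplayMemory` and the toy transitions are hypothetical.

```python
class ReplayMemory:
    """Fixed-size memory that overwrites the oldest row once it is full."""

    def __init__(self, memory_size):
        self.memory_size = memory_size
        self.memory = [None] * memory_size
        self.memory_counter = 0

    def store(self, s, a, r, s_):
        # Flatten [s, a, r, s_] into one row, like np.hstack in the original.
        transition = list(s) + [a, r] + list(s_)
        # The counter keeps growing; the modulo wraps the write position around.
        index = self.memory_counter % self.memory_size
        self.memory[index] = transition
        self.memory_counter += 1


mem = ReplayMemory(memory_size=3)
for step in range(5):
    # Toy transition: state (step, step), action step % 4, reward 0.0.
    mem.store([step, step], step % 4, 0.0, [step + 1, step + 1])

# Rows 0 and 1 now hold the transitions from steps 3 and 4;
# row 2 still holds the transition from step 2.
```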

Function choose_action(): choosing an action

    def choose_action(self, observation):
        # to have batch dimension when feed into tf placeholder:
        # normalize the observation's shape to (1, size_of_observation).
        # np.newaxis adds a row axis, so [] becomes [[]] and a 1-D array becomes 2-D.
        observation = observation[np.newaxis, :]
        if np.random.uniform() < self.epsilon:
            # forward feed the observation and get q value for every actions:
            # let eval_net produce the value of every action, then pick the action with the largest value.
            actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
            action = np.argmax(actions_value)  # index of the largest value
        else:
            action = np.random.randint(0, self.n_actions)
        return action

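If the random number drawn is less than epsilon, the action is the index of the largest value in q_eval; otherwise an action is picked at random from the action space. Note that epsilon here is the probability of acting greedily, not the probability of exploring.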
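
The same selection rule can be sketched without TensorFlow by passing the Q values in directly. This is only an illustration: the stand-alone signature `choose_action(q_values, epsilon)` is hypothetical, and `random` from the standard library replaces `np.random`.

```python
import random


def choose_action(q_values, epsilon):
    """Greedy with probability epsilon, uniformly random otherwise."""
    if random.random() < epsilon:
        # argmax: index of the largest Q value (first one on ties)
        return max(range(len(q_values)), key=lambda i: q_values[i])
    return random.randrange(len(q_values))


q_values = [0.1, 0.9, 0.3, 0.2]
greedy_action = choose_action(q_values, epsilon=1.0)   # always the argmax
random_action = choose_action(q_values, epsilon=0.0)   # always a random action
```

With epsilon=1.0 the branch is always greedy, because random.random() returns values in [0, 1); with epsilon=0.0 it is never greedy.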

Function learn(): the agent's learning step

    def learn(self):
        # Check whether it is time to replace the target_net parameters.
        if self.learn_step_counter % self.replace_target_iter == 0:
            self.sess.run(self.replace_target_op)
            print('\ntarget_params_replaced\n')

        # sample batch memory from all memory:
        # randomly draw batch_size memories to form the batch.
        if self.memory_counter > self.memory_size:
            sample_index = np.random.choice(self.memory_size, size=self.batch_size)
        else:
            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
        batch_memory = self.memory[sample_index, :]  # the randomly selected memories

        # Get q_next (produced by target_net) and q_eval (produced by eval_net).
        q_next, q_eval = self.sess.run(
            [self.q_next, self.q_eval],
            feed_dict={
                self.s_: batch_memory[:, -self.n_features:],  # fixed params
                self.s: batch_memory[:, :self.n_features],  # newest params
            })

        # change q_target w.r.t q_eval's action: start from q_target = q_eval
        q_target = q_eval.copy()
        # Index array of length batch_size, e.g. array([0, 1, 2, ..., 31]) for a batch of 32.
        batch_index = np.arange(self.batch_size, dtype=np.int32)
        # The action taken in each sampled transition, i.e. the action passed to
        # RL.store_transition(observation, action, reward, observation_).
        # Columns are counted from 0, so with n_features = 2 the action is column 2.
        eval_act_index = batch_memory[:, self.n_features].astype(int)
        # The reward of each sampled transition (the column right after the action).
        reward = batch_memory[:, self.n_features + 1]

        q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

        """
        For example in this batch I have 2 samples and 3 actions:
        q_eval =
        [[1, 2, 3],
         [4, 5, 6]]

        q_target = q_eval =
        [[1, 2, 3],
         [4, 5, 6]]

        Then change q_target with the real q_target value w.r.t the q_eval's action.
        For example in:
            sample 0, I took action 0, and the max q_target value is -1;
            sample 1, I took action 2, and the max q_target value is -2:
        q_target =
        [[-1, 2, 3],
         [4, 5, -2]]

        So the (q_target - q_eval) becomes (only the chosen entries actually differ):
        [[(-1)-(1), 0, 0],
         [0, 0, (-2)-(6)]]

        We then backpropagate this error w.r.t the corresponding action to network.
        Every entry that is 0 belongs to an action that was not chosen at the time;
        only the actions that were chosen have a non-zero error, so only those are
        backpropagated. Leave other action as error=0 cause we didn't choose it.
        """

        # train eval network
        _, self.cost = self.sess.run([self._train_op, self.loss],
                                     feed_dict={self.s: batch_memory[:, :self.n_features],
                                                self.q_target: q_target})
        self.cost_his.append(self.cost)  # record the cost

        # increasing epsilon: gradually raise epsilon to make behavior less random
        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
        self.learn_step_counter += 1
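
One small detail in the epsilon update at the end of learn() is easier to see in isolation. The sketch below copies that expression into a hypothetical helper, `next_epsilon`, and compares it with a clamped variant, `next_epsilon_clamped`, which is my own alternative and not part of the original code. The numbers are chosen only to show the effect.

```python
def next_epsilon(epsilon, epsilon_increment, epsilon_max):
    """Same expression as the last lines of learn()."""
    return epsilon + epsilon_increment if epsilon < epsilon_max else epsilon_max


def next_epsilon_clamped(epsilon, epsilon_increment, epsilon_max):
    """Hypothetical variant that never exceeds epsilon_max."""
    return min(epsilon + epsilon_increment, epsilon_max)


# When the increment does not divide the remaining gap evenly, the original
# expression overshoots epsilon_max for one learning step, then snaps back.
e1 = next_epsilon(0.85, 0.1, 0.9)                  # slightly above 0.9
e2 = next_epsilon(e1, 0.1, 0.9)                    # back to 0.9
e1_clamped = next_epsilon_clamped(0.85, 0.1, 0.9)  # 0.9 straight away
```

In practice the overshoot is harmless when the increment is tiny, but it is worth knowing about if you print epsilon while debugging.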

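The target network's parameters are replaced every 200 learning steps. The eval network's parameters are updated at every learning step. The target network is not trained directly; it only supplies q_next for computing the loss. Every 200 steps the eval network's parameters are assigned to the target network, and that is how the target network gets updated.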
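
The update schedule can be sketched with a toy stand-in for the two networks. Everything here is illustrative: the class `TwoNets`, the single parameter "w", and the "+= 1.0" that stands in for a gradient step are assumptions, not the original TensorFlow ops.

```python
import copy


class TwoNets:
    """Toy model of the eval/target pair: only the update schedule is real."""

    def __init__(self, replace_target_iter):
        self.eval_params = {"w": 0.0}
        self.target_params = copy.deepcopy(self.eval_params)
        self.replace_target_iter = replace_target_iter
        self.learn_step_counter = 0
        self.n_replacements = 0

    def learn(self):
        # Hard update: copy eval -> target every replace_target_iter learning steps.
        if self.learn_step_counter % self.replace_target_iter == 0:
            self.target_params = copy.deepcopy(self.eval_params)
            self.n_replacements += 1
        # Stand-in for one gradient step on the eval network.
        self.eval_params["w"] += 1.0
        self.learn_step_counter += 1


nets = TwoNets(replace_target_iter=200)
for _ in range(450):
    nets.learn()

# Copies happened at learning steps 0, 200, and 400, so the target network
# lags behind the eval network until the next copy.
```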

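I'm not sure why one-hot encoding isn't used here, and that is also why Morvan's explanation of the subtraction gets a little tangled. What happens is simply this: q_eval is copied into q_target, and then, at the index of the action that was actually taken, the entry is overwritten with reward + gamma * max(q_next). Only the Q value at the chosen action changes; every other position keeps its q_eval value. This makes the subtraction easy: the loss is computed from the difference and backpropagated through the network.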
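
Here is the example from the docstring worked through in plain Python. The helper `build_q_target` is a hypothetical stand-in for the NumPy fancy-indexing line, and the q_next values, zero rewards, and gamma = 1.0 are chosen only so that the targets come out as -1 and -2, as in the docstring.

```python
def build_q_target(q_eval, q_next, actions, rewards, gamma):
    """Copy q_eval, then overwrite only the entry of the action that was taken."""
    q_target = [row[:] for row in q_eval]
    for i, (a, r) in enumerate(zip(actions, rewards)):
        q_target[i][a] = r + gamma * max(q_next[i])
    return q_target


q_eval = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
# Illustrative target-network outputs whose row maxima are -1 and -2.
q_next = [[-1.0, -3.0, -2.0],
          [-2.0, -5.0, -4.0]]
actions = [0, 2]        # sample 0 took action 0, sample 1 took action 2
rewards = [0.0, 0.0]    # zero reward so the target equals max(q_next)

q_target = build_q_target(q_eval, q_next, actions, rewards, gamma=1.0)
error = [[t - e for t, e in zip(t_row, e_row)]
         for t_row, e_row in zip(q_target, q_eval)]
# q_target is [[-1, 2, 3], [4, 5, -2]] and error is [[-2, 0, 0], [0, 0, -8]]
```

A one-hot formulation would instead pick out q_eval at the chosen action with a mask and compare a single number per sample. The zero entries above play the same role as the mask.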

The run_this.py file, which runs everything:

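    def run_maze():
        step = 0  # controls when learning starts
        for episode in range(100):
            # initialize the environment
            observation = env.reset()
            # print(observation)
            while True:
                # refresh the environment
                env.render()
                # DQN chooses an action based on the observation
                action = RL.choose_action(observation)
                # the environment returns the next state, the reward, and whether the episode ended
                observation_, reward, done = env.step(action)
                # DQN stores the transition
                RL.store_transition(observation, action, reward, observation_)
                # start training only after more than 200 steps, then learn every 5 steps
                if (step > 200) and (step % 5 == 0):
                    RL.learn()
                # the next state becomes the current state of the next iteration
                observation = observation_
                # leave the loop when the episode ends
                if done:
                    break
                step += 1
        # end of game
        print('game over')
        env.destroy()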

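With that, the execution flow is fairly clear. The loop calls the functions above to interact with the environment and get an observation, choose an action, store the transition, and learn, which trains the network.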
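
The condition that gates RL.learn() is easy to misread, so here is a tiny sketch that lists which steps actually trigger learning. The helper `should_learn` and its parameter names `warmup` and `every` are hypothetical; the values 200 and 5 come from run_maze().

```python
def should_learn(step, warmup=200, every=5):
    """Mirror of `(step > 200) and (step % 5 == 0)` from run_maze()."""
    return step > warmup and step % every == 0


learn_steps = [s for s in range(300) if should_learn(s)]
# The first learning step is 205, not 200, because the comparison is strict.
```

The first 200 steps therefore only fill the memory, so learn() has stored transitions to sample from.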

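That is my understanding of the DQN code. Thanks to Morvan. My own knowledge is limited, so please point out any mistakes above; questions and discussion are welcome too.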
