PyTorch notes: actor-critic (A2C)
For the theoretical background, see: 強化學習筆記:Actor-critic_UQI-LIUWJ的博客-CSDN博客

Since actor-critic is a combination of policy gradient and DQN, large parts of the code are close to the policy network and DQN implementations:
pytorch筆記:policy gradient_UQI-LIUWJ的博客-CSDN博客
pytorch 筆記: DQN(experience replay)_UQI-LIUWJ的博客-CSDN博客
1 Imports & hyperparameters

import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import time
from torch.distributions import Categorical

GAMMA = 0.95    # reward discount factor
LR = 0.01       # learning rate

EPISODE = 3000  # number of training episodes to generate
STEP = 3000     # maximum number of steps per episode
TEST = 10       # number of test episodes run after every 100 training episodes

2 Actor
2.1 Actor network class
class PGNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PGNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 20)
        self.fc2 = nn.Linear(20, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        action_scores = self.fc2(x)
        return F.softmax(action_scores, dim=1)

# PGNetwork takes the state vector at a given time step as input and outputs
# the probability of each action being taken.
# It is the same as the Policy network in the policy gradient notes.
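As a quick sanity check, here is a minimal sketch of what PGNetwork produces, assuming the CartPole dimensions used throughout this post (4 state dimensions, 2 actions); the dummy state is made up:

net = PGNetwork(state_dim=4, action_dim=2)   # CartPole sizes: 4 state dims, 2 actions
dummy_state = torch.randn(1, 4)              # a made-up batch containing one state
probs = net(dummy_state)
print(probs.shape)                           # torch.Size([1, 2])
print(probs.sum(dim=1))                      # each row sums to 1: a probability distribution over actions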
2.2 Actor class

2.2.1 __init__
class Actor(object):
    def __init__(self, env):
        self.state_dim = env.observation_space.shape[0]
        # dimensionality of a state; 4 for CartPole
        self.action_dim = env.action_space.n
        # size of the action space (how many different actions exist); 2 for CartPole
        self.network = PGNetwork(state_dim=self.state_dim, action_dim=self.action_dim)
        # takes S as input and outputs the probability of each action
        self.optimizer = torch.optim.Adam(self.network.parameters(), lr=LR)
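For reference, the values 4 and 2 quoted in the comments come straight from the gym environment; a quick check under the classic gym API that the rest of this post assumes:

env = gym.make('CartPole-v1')
print(env.observation_space.shape[0])  # 4: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)              # 2: push the cart left or right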
2.2.2 Choosing an action

This is almost identical to the policy gradient version.
    def choose_action(self, observation):
        # The action is not chosen from Q values; it is sampled from the softmax
        # probabilities. Policy gradient and A2C do not need epsilon-greedy
        # exploration, because sampling from the probabilities is already stochastic.
        observation = torch.from_numpy(observation).float().unsqueeze(0)
        # unsqueeze(0) turns the [4] state vector into a [1, 4] batch
        probs = self.network(observation)
        # probability of each action being taken in this state
        m = Categorical(probs)  # build the categorical distribution
        action = m.sample()     # sample an action according to those probabilities
        # m.log_prob(action) is equivalent to probs.log()[0][action.item()].unsqueeze(0),
        # in other words the probability of the chosen action, passed through a log
        return action.item()    # return the action as a plain Python int

    # So each call to choose_action samples one sensible action and returns it.
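A small check of the equivalence claimed in the comment above, with made-up probabilities (0.7 and 0.3) and a fixed action index instead of a sampled one:

probs = torch.tensor([[0.7, 0.3]])                  # made-up action probabilities for one state
m = Categorical(probs)
action = torch.tensor(1)                            # pretend action 1 was sampled
print(m.log_prob(action))                           # tensor([-1.2040])
print(probs.log()[0][action.item()].unsqueeze(0))   # tensor([-1.2040]), the same value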
2.2.3 Training the actor network

That is, learning how to choose better actions.
td_error is computed by the critic later on (Section 3). neg_log_prob is obtained here from a cross-entropy term, which plays the role of -log π(a|s); a small numerical check follows the code below.
    def learn(self, state, action, td_error):
        observation = torch.from_numpy(state).float().unsqueeze(0)
        softmax_input = self.network(observation)
        # probability of each action being taken in this state
        action = torch.LongTensor([action])
        neg_log_prob = F.cross_entropy(input=softmax_input, target=action)
        # roughly -log π(a|s); note that F.cross_entropy applies log_softmax to its
        # input, and the network output is already a softmax, so this is only an
        # approximation of -log π(a|s)

        # Backpropagation (gradient ascent on the policy's value):
        # we want to maximize log π(a|s) * td_error, which means
        # minimizing neg_log_prob * td_error.
        loss_a = neg_log_prob * td_error
        self.optimizer.zero_grad()
        loss_a.backward()
        self.optimizer.step()
        # the usual three PyTorch steps
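To make the cross-entropy term concrete, here is a small check (with made-up scores) that, for a single example, F.cross_entropy is exactly -log_softmax of the chosen action, i.e. -log π(a|s) when the input is the raw scores:

scores = torch.tensor([[2.0, 0.5]])     # made-up action scores for one state
action = torch.LongTensor([1])

neg_log_prob = F.cross_entropy(input=scores, target=action)
manual = -F.log_softmax(scores, dim=1)[0, action.item()]
print(neg_log_prob, manual)             # the two values are identical

In learn() above the network output has already been through a softmax, so the cross-entropy effectively applies a second softmax; the result still decreases as π(a|s) increases, but it is not exactly equal to -log π(a|s).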
根據(jù)actor的采樣,用TD的方式計算V(s)
為了方便起見,這里沒有使用target network以及experience relay,這兩個可以看DQN 的pytorch代碼,里面有涉及
3.1 critic 基本類
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 20)
        self.fc2 = nn.Linear(20, 1)
        # Unlike the earlier networks, the output is 1-dimensional rather than
        # action_dim-dimensional: the critic estimates V(s), whereas DQN estimates
        # Q(s, a) and therefore needs one output per action.

    def forward(self, x):
        out = F.relu(self.fc1(x))
        out = self.fc2(out)
        return out
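A minimal shape check of the critic, again assuming the CartPole dimensions:

critic_net = QNetwork(state_dim=4, action_dim=2)  # action_dim is accepted but does not affect the output size
s = torch.randn(1, 4)                             # a made-up batch containing one state
print(critic_net(s).shape)                        # torch.Size([1, 1]): a single value V(s), not one value per action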
3.2 Critic class

3.2.1 __init__
class Critic(object):
    # Learns V(S) from the sampled transitions.
    def __init__(self, env):
        self.state_dim = env.observation_space.shape[0]
        # dimensionality of a state; 4 for CartPole
        self.action_dim = env.action_space.n
        # size of the action space (how many different actions exist); 2 for CartPole
        self.network = QNetwork(state_dim=self.state_dim, action_dim=self.action_dim)
        # takes S as input and outputs V(S)
        self.optimizer = torch.optim.Adam(self.network.parameters(), lr=LR)
        self.loss_func = nn.MSELoss()

3.2.2 Training the critic network
    def train_Q_network(self, state, reward, next_state):
        # Similar to the training step in the DQN notes (5.4), but without the
        # fixed target network and experience replay mechanisms.
        s, s_ = torch.FloatTensor(state), torch.FloatTensor(next_state)
        # current state and the state reached after taking the action
        v = self.network(s)    # V(s)
        v_ = self.network(s_)  # V(s')

        # TD update: move V(s) towards r + γ * V(s').
        # The bootstrapped target is detached so that gradients only flow through V(s).
        loss_q = self.loss_func(v, (reward + GAMMA * v_).detach())
        self.optimizer.zero_grad()
        loss_q.backward()
        self.optimizer.step()
        # the usual three PyTorch steps

        with torch.no_grad():
            td_error = reward + GAMMA * v_ - v
            # computed without tracking gradients, so no critic gradients leak into
            # the actor (actor and critic are trained independently)
        return td_error
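To see what td_error represents, a tiny numeric sketch with made-up values:

reward = 1.0                         # made-up reward for one step
v, v_ = 1.0, 2.0                     # made-up critic estimates of V(s) and V(s')
td_error = reward + GAMMA * v_ - v   # 1.0 + 0.95 * 2.0 - 1.0 = 1.9 > 0: the step went
                                     # better than expected, so learn() makes that action more likely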
4 Main function

def main():
    env = gym.make('CartPole-v1')
    # create a CartPole gym environment
    actor = Actor(env)
    critic = Critic(env)

    for episode in range(EPISODE):
        state = env.reset()
        # reset the environment at the start of each episode
        for step in range(STEP):
            action = actor.choose_action(state)  # the actor chooses an action
            next_state, reward, done, _ = env.step(action)
            # returns the next state, the reward, done (whether the environment
            # needs to be reset) and info
            td_error = critic.train_Q_network(state, reward, next_state)  # gradient = grad[r + gamma * V(s_) - V(s)]
            # first train the critic on the sampled action, current state and next
            # state, so that V(s) becomes more accurate
            actor.learn(state, action, td_error)  # true_gradient = grad[logPi(a|s) * td_error]
            # then use the learned V(s), through td_error, to train the actor to
            # sample better actions
            state = next_state
            if done:
                break

        # evaluate the policy every 100 episodes
        if episode % 100 == 0:
            total_reward = 0
            for i in range(TEST):
                state = env.reset()
                for j in range(STEP):
                    # env.render()
                    # renders the environment; leave it commented out if you are
                    # running on a server and only want the numbers
                    action = actor.choose_action(state)  # sample an action
                    state, reward, done, _ = env.step(action)
                    total_reward += reward
                    if done:
                        break
            ave_reward = total_reward / TEST
            print('episode: ', episode, 'Evaluation Average Reward:', ave_reward)


if __name__ == '__main__':
    time_start = time.time()
    main()
    time_end = time.time()
    print('Total time is ', time_end - time_start, 's')

'''
episode: 0 Evaluation Average Reward: 17.2
episode: 100 Evaluation Average Reward: 10.6
episode: 200 Evaluation Average Reward: 11.4
episode: 300 Evaluation Average Reward: 10.7
episode: 400 Evaluation Average Reward: 9.3
episode: 500 Evaluation Average Reward: 9.5
episode: 600 Evaluation Average Reward: 9.5
episode: 700 Evaluation Average Reward: 9.6
episode: 800 Evaluation Average Reward: 9.9
episode: 900 Evaluation Average Reward: 8.9
episode: 1000 Evaluation Average Reward: 9.3
episode: 1100 Evaluation Average Reward: 9.8
episode: 1200 Evaluation Average Reward: 9.3
episode: 1300 Evaluation Average Reward: 9.0
episode: 1400 Evaluation Average Reward: 9.4
episode: 1500 Evaluation Average Reward: 9.3
episode: 1600 Evaluation Average Reward: 9.1
episode: 1700 Evaluation Average Reward: 9.0
episode: 1800 Evaluation Average Reward: 9.6
episode: 1900 Evaluation Average Reward: 8.8
episode: 2000 Evaluation Average Reward: 9.4
episode: 2100 Evaluation Average Reward: 9.2
episode: 2200 Evaluation Average Reward: 9.4
episode: 2300 Evaluation Average Reward: 9.2
episode: 2400 Evaluation Average Reward: 9.3
episode: 2500 Evaluation Average Reward: 9.5
episode: 2600 Evaluation Average Reward: 9.6
episode: 2700 Evaluation Average Reward: 9.2
episode: 2800 Evaluation Average Reward: 9.1
episode: 2900 Evaluation Average Reward: 9.6
Total time is 41.6014940738678 s
'''