Silver-Slides Chapter 1 - Introduction to Reinforcement Learning: Basic Concepts
Key points
Machine learning = supervised learning + unsupervised learning + reinforcement learning
What makes RL different:
There is no supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non i.i.d data)
Agent’s actions affect the subsequent data it receives
Reward in RL
A reward Rt is a scalar feedback signal
Indicates how well agent is doing at step t
The agent’s job is to maximise cumulative reward
All goals can be described by the maximisation of expected cumulative reward
Sequential decision making
Goal: select actions to maximise total future reward
Actions may have long term consequences
Reward may be delayed
It may be better to sacrifice immediate reward to gain more long-term reward
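A tiny numerical illustration of this trade-off (my own, not from the slides; the discount factor γ = 0.9 and the two reward sequences are made up): comparing discounted returns G = Σ_k γ^k R_{k+1} shows that grabbing the immediate reward can lose to a more patient plan.

```python
GAMMA = 0.9  # discount factor, an assumed value for this example

def discounted_return(rewards, gamma=GAMMA):
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

greedy_now   = [10, 0, 0, 0]   # take the immediate reward, nothing afterwards
patient_plan = [0, 0, 0, 20]   # sacrifice now for a larger delayed payoff

print(discounted_return(greedy_now))    # 10.0
print(discounted_return(patient_plan))  # 20 * 0.9**3 = 14.58
```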
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning. The agent should discover a good policy from its experiences of the environment without losing too much reward along the way.
- Exploration finds more information about the environment
- Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
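A common way to balance the two is ε-greedy action selection. The sketch below is my own minimal example (the action-value estimates q_values and ε = 0.1 are made up): it exploits the best-known action most of the time and explores a random one with small probability.

```python
import random

EPSILON = 0.1                   # exploration rate, an assumed value
q_values = [0.2, 0.5, 0.1]      # hypothetical value estimates for 3 actions

def epsilon_greedy(q, epsilon=EPSILON):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q))               # exploration: random action
    return max(range(len(q)), key=lambda a: q[a])     # exploitation: best estimate

counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q_values)] += 1
print(counts)   # mostly action 1, with occasional exploratory picks
```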
Elements of RL
References: the Silver slides and https://zhuanlan.zhihu.com/p/26608059
Agent
The agent is the protagonist of the story and has three components:
Policy, Value Function, Model.
Policy:
The policy is the agent's guide to behaviour: a mapping from states (s) to actions (a). It can be a deterministic policy or a stochastic policy. The former assigns a fixed action to each state, a = π(s); the latter assigns each action a probability in each state, π(a|s) = P[A_t = a | S_t = s]. Which kind to use can be decided according to the actual problem.
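To make the distinction concrete, here is a minimal Python sketch (mine, not from the slides; the 3-state/2-action problem and the names policy_table / action_probs are invented for illustration):

```python
import numpy as np

N_STATES, N_ACTIONS = 3, 2      # a made-up tiny problem

# Deterministic policy: a = pi(s), stored as a simple lookup table.
policy_table = {0: 1, 1: 0, 2: 1}                 # state -> action

def deterministic_policy(s):
    return policy_table[s]

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], one distribution per state.
action_probs = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.2, 0.8]}

def stochastic_policy(s):
    return np.random.choice(N_ACTIONS, p=action_probs[s])

print(deterministic_policy(0))   # always the same action in state 0
print(stochastic_policy(0))      # a sampled action; varies from call to call
```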
Value Function:
The value function is a prediction of total future reward. It is used to evaluate the goodness/badness of states, and therefore to select between actions.
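As a rough sketch of what that means in code (my own example, not the course's; the episode format and γ = 0.9 are assumptions), a Monte-Carlo style estimate of the state-value function simply averages the discounted returns observed after visiting each state:

```python
from collections import defaultdict

GAMMA = 0.9  # assumed discount factor

def discounted_return(rewards, gamma=GAMMA):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... for the remainder of one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_values(episodes):
    """Average the return observed after each visit to a state.

    `episodes` is a list of trajectories, each a list of (S_t, R_{t+1}) pairs.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        rewards = [r for _, r in episode]
        for t, (s, _) in enumerate(episode):
            totals[s] += discounted_return(rewards[t:])
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

episodes = [[("s0", 0.0), ("s1", 1.0)], [("s0", 1.0), ("s1", 0.0)]]
print(estimate_values(episodes))   # {'s0': 0.95, 's1': 0.5}
```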
Model:
The model is the agent's own internal picture of the environment, built from its reading of environment states. It can be used to predict what the environment will do next, e.g. what the next state will be if I take a particular action, or how much reward that would earn. Note, however, that in some settings there is no model at all.
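In code a model usually splits into a transition part and a reward part, roughly P(s'|s,a) and R(s,a). The tabular model below is hand-written for a made-up 2-state, 2-action problem, purely to show the interface (none of these numbers come from the slides):

```python
import numpy as np

# transition[s][a] is a distribution over next states: P(s' | s, a)
transition = {
    0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
    1: {0: [0.5, 0.5], 1: [0.0, 1.0]},
}
# reward[s][a] is the expected immediate reward: R(s, a)
reward = {
    0: {0: 0.0, 1: 1.0},
    1: {0: -1.0, 1: 2.0},
}

def predict(s, a):
    """Use the model to predict (a sampled next state, the expected reward)."""
    next_s = np.random.choice(len(transition[s][a]), p=transition[s][a])
    return next_s, reward[s][a]

print(predict(0, 1))   # usually (1, 1.0)
```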
This divides the problems an agent meets in sequential decision making into two kinds: the Reinforcement Learning problem and the Planning problem.
In the former there is no model of the environment, and the agent can only improve its policy step by step by interacting with it: the environment is initially unknown, the agent interacts with the environment, and the agent improves its policy.
In the latter a model of the environment is already known, so the outcome of every move is determined; the agent only has to compute with its model (without any external interaction) which actions are best in order to improve its policy. This is also known as deliberation, reasoning, introspection, pondering, thought, or search.
For example, the Reinforcement Learning problem is playing a game without knowing its rules, acting through the joystick and reading the score as the reward; the Planning problem is playing with the rules known, so the agent can query a simulator and plan ahead to find the optimal policy, e.g. by tree search.
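As a toy contrast, planning with a known model needs no interaction at all: it can simply search the tree of action sequences. The sketch below is mine, with a made-up deterministic next_state/reward table standing in for the "known rules":

```python
# A known, deterministic model of a tiny game (all numbers invented).
next_state = {0: {0: 1, 1: 2}, 1: {0: 3, 1: 3}, 2: {0: 3, 1: 3}, 3: {0: 3, 1: 3}}
reward     = {0: {0: 0, 1: 1}, 1: {0: 5, 1: 0}, 2: {0: 0, 1: 1}, 3: {0: 0, 1: 0}}
ACTIONS = [0, 1]

def plan(s, depth):
    """Exhaustive tree search: best total reward and best first action from s."""
    if depth == 0:
        return 0, None
    best_value, best_action = float("-inf"), None
    for a in ACTIONS:
        future, _ = plan(next_state[s][a], depth - 1)
        value = reward[s][a] + future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

print(plan(0, depth=3))   # (5, 0): forgo the immediate point to reach the 5-reward move
```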
Categories of agents:
By method, agents can be divided into Value Based, Policy Based, and Actor Critic: the first selects actions based on a value function, the second based on a policy directly, and the third combines the two.
By whether they contain a model, agents can also be divided into Model Free and Model Based.
Environment
The environment is the setting in which the story takes place. There are two kinds:
Fully Observable Environment: the agent can observe all of the environment's information, so agent state = environment state = information state. Formally, this is a Markov decision process (MDP).
Partially Observable Environment: the agent can observe only part of the environment's information, so agent state ≠ environment state. Formally, this is a partially observable Markov decision process (POMDP).
Agent must construct its own state representation:
Complete history: S_t^a = H_t
Beliefs of environment state: S_t^a = (P[S_t^e = s^1], ..., P[S_t^e = s^n])
Recurrent neural network: S_t^a = σ(S_{t-1}^a W_s + O_t W_o)
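The recurrent update above fits in a few lines of numpy. This is only a sketch under assumed shapes (vector agent state and observation, randomly initialised W_s and W_o), not code from the course:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, OBS_DIM = 4, 3                       # assumed dimensions
W_s = rng.normal(size=(STATE_DIM, STATE_DIM))
W_o = rng.normal(size=(OBS_DIM, STATE_DIM))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_agent_state(prev_state, observation):
    """S_t^a = sigma(S_{t-1}^a W_s + O_t W_o): fold the new observation into the state."""
    return sigmoid(prev_state @ W_s + observation @ W_o)

state = np.zeros(STATE_DIM)
for obs in rng.normal(size=(5, OBS_DIM)):       # a made-up stream of observations
    state = update_agent_state(state, obs)
print(state)
```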
State
State comes in three kinds: Environment State, Agent State, and Information State, also called the Markov state.
Environment State:
All the information the environment uses to select the next observation/reward, i.e. the information the real environment actually contains. The agent usually cannot see it, or cannot fully obtain it by its own means; and even if the entire environment state were visible, it might still contain a lot of irrelevant information.
Agent State:
All the information the agent uses to select its next action, which is also the information our algorithms run on. My own understanding is that it is the agent's reading and translation of the environment state: it may be incomplete, but it really is what we rely on to make decisions.
Information State/Markov state:
It contains all the useful information in the history. My feeling is that this is just an objective notion rather than a third kind of state parallel to the previous two; it is really a property.
Its core idea is that "given the present, past events are of no use for predicting the future": the current state already contains all the information useful for predicting the future, and once you have it, everything before can be thrown away.
The environment state S_t^e is Markov. The history H_t is Markov.
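Formally (this is the definition in the slides), a state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t], i.e. the future is independent of the past given the present.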
Related to state there is also the History:
The history is the sequence of observations, actions, rewards: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
It contains all the variables observable up to time t, such as observations, actions, and rewards. So what happens next is based on the history, whether that is the agent's action or the environment's observation/reward.
State is then defined as a function of the history: S_t = f(H_t). There is a natural correspondence between them, because the state is itself an observation and aggregation of the relevant information in the environment, and it is exactly this information that determines everything that happens next.
What happens next depends on the history:
- The agent selects actions
- The environment selects observations/rewards
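A trivial sketch of S_t = f(H_t) (mine, just to make the idea concrete; the tuple format of the history is a simplification): the simplest choice of f keeps only the last observation, while another keeps the entire history.

```python
# History stored as a list of (observation, action, reward) tuples (simplified).

def state_last_observation(history):
    """f(H_t) = the most recent observation only (compact, but may lose information)."""
    observation, _, _ = history[-1]
    return observation

def state_full_history(history):
    """f(H_t) = H_t itself: loses nothing, but grows without bound."""
    return tuple(history)

history = [("o1", "a1", 0.0), ("o2", "a2", 1.0), ("o3", "a3", 0.5)]
print(state_last_observation(history))   # 'o3'
print(state_full_history(history))       # the whole sequence
```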
Observation
Action
Reward
The reward is a scalar, a measure of how well or badly things are going. The agent's ultimate goal is to maximise the cumulative reward over the whole process, so it often has to take the long view rather than pick up a sesame seed while losing the watermelon (win a small immediate reward at the cost of a much larger one later).
A reward Rt is a scalar feedback signal. Indicates how well agent is doing at step t. The agent’s job is to maximise cumulative reward
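Putting observation, action, and reward together, the standard agent-environment interaction loop looks roughly like this. The sketch is mine: ToyEnv is a made-up environment, and the reset/step interface follows the common Gym-style convention rather than anything defined in the slides.

```python
import random

class ToyEnv:
    """A made-up two-action environment, used only to show the interaction loop."""
    def reset(self):
        self.t = 0
        return 0.0                                  # initial observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0        # action 1 happens to pay off
        observation = random.random()               # next observation
        done = self.t >= 5                          # episode ends after 5 steps
        return observation, reward, done

env = ToyEnv()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])                  # a random (purely exploring) agent
    obs, reward, done = env.step(action)            # environment returns observation + reward
    total_reward += reward                          # cumulative reward the agent should maximise
print("cumulative reward:", total_reward)
```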
Summary
That covers the basic concepts from Chapter 1 of the Silver slides: the reward signal and sequential decision making, exploration vs exploitation, and the elements of RL (the agent with its policy, value function, and model; the environment; state; observation; action; reward).