Filtrations in Reinforcement Learning: What They Are and Why We Don't Need Them
Once in a while, when reading papers in the Reinforcement Learning domain, you may stumble across mysterious-sounding phrases such as ‘we deal with a filtered probability space’, ‘the expected value is conditional on a filtration’ or ‘the decision-making policy is ?-measurable’. Without formal training in measure theory [2,3], it may be difficult to grasp what exactly such a filtration entails. Formal definitions look something like this:
Formal definition of a filtration ([2], own work)
Boilerplate language for those familiar with measure theory, no doubt, but hardly helpful otherwise. Googling for answers likely leads through a maze of σ-algebras, Borel sets, Lebesgue measures and Hausdorff spaces, again presuming that one already knows the basics. Fortunately, only a very basic understanding of a filtration is needed to grasp its implications within the RL domain. This article is far from a full discussion of the topic, but aims to give a brief and (hopefully) intuitive outline of the core concept.
An example
In RL, we typically define an outcome space Ω that contains all possible outcomes or samples that may occur, with ω being a specific sample path. For the sake of illustration, we will assume that our RL problem relates to a stock with price S? at day t. Of course we’d like to buy low and sell high (the precise decision-making problem is irrelevant here). We might denote the buying/selling decision as x?(ω), i.e., the decision is conditional on the price path. We start with price S? (a real number) and every day the price goes up or down according to some random process. We may simulate (or mathematically define) such a price path ω=[ω?,…,ω?] up front, before running the episode. However, that does not mean we should know stock price movements before they actually happen — even Warren Buffett could only dream of having such information! To claim we base our decisions on ω without being a clairvoyant, we may state that the outcome space is ‘filtered’ (using the symbol ?), meaning we can only observe the sample up to time t.
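This simulate-then-restrict idea can be sketched in a few lines of Python. This is a minimal illustration, not the article's method: the 50/50 up/down moves and the toy buy/hold rule are assumptions chosen for the example.

```python
import random

def simulate_path(T, u=1.0, d=1.0, seed=0):
    """Draw a full sample path omega of daily up/down price moves."""
    rng = random.Random(seed)
    return [u if rng.random() < 0.5 else -d for _ in range(T)]

def decide(omega, t, s0=100.0):
    """A non-anticipative decision x_t(omega): it may only look at omega[:t],
    the moves revealed up to day t, never at the future part of the path."""
    price = s0 + sum(omega[:t])
    return "buy" if price < s0 else "hold"

omega = simulate_path(T=3)                       # the whole path exists up front...
actions = [decide(omega, t) for t in range(4)]   # ...but each decision reads only omega[:t]
print(actions)
```

The key point is purely structural: `decide` receives the full path but, by construction, slices away everything after day t, which is exactly what the filtration formalizes.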
For most RL practitioners, this restriction must sound familiar. Don’t we usually base our decisions on the current state S?? Indeed, we do. In fact, the Markov property implies that the stochastic process is memoryless: we only need the information embedded in the prevailing state S?; information from the past is irrelevant [5]. As we will shortly see, the filtration is richer and more generic than a state, yet for practical purposes their implications are similar.
Let’s formalize our stock price problem a bit more. We start with a discrete problem setting, in which the price either goes up (u) or down (-d). Considering an episode horizon of three days, the outcome space Ω may be visualized by a binomial lattice [4]:
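For a three-day horizon, the outcome space behind the lattice can be enumerated directly (a small Python sketch):

```python
from itertools import product

# The outcome space Omega for a three-day horizon: every possible
# sequence of up ('u') and down ('d') moves.
omega_space = ["".join(path) for path in product("ud", repeat=3)]
print(omega_space)
print(len(omega_space))  # 2^3 = 8 paths
```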
(Image by author)
Definition of events and filtrations
At this point, we need to define the notion of an ‘event’ A ? Ω. Stated somewhat abstractly, an event is a subset of the outcome space. Simply put, we can assign a probability to an event and assert whether it has happened or not. As we will soon show, it is not the same as a realization ω though.
A filtration ? is a mathematical model that represents partial knowledge about the outcome. In essence, it tells us whether an event has happened or not. The ‘filtration process’ may be visualized as a sequence of filters, with each filter providing us with a more detailed view. Concretely, in an RL context the filtration provides us with the information needed to compute the current state S?, without giving any indication of future changes in the process [2]. Indeed, just like the Markov property.
Formally, a filtration is a σ-algebra, and although you don’t need to know the ins and outs, some background is useful. Loosely defined, a σ-algebra is a collection of subsets of the outcome space, containing a countable number of events and closed under complements and countable unions. In measure theory this concept has major implications; for the purpose of this article you only need to remember that the σ-algebra is a collection of events.
Example revisited — discrete case
Back to the example, because the filtration only comes alive when we see it in action. We first need to define the events, using sequences such as ‘udu’ to describe price movements over time. At t=0 we basically don’t know anything — all paths are still possible. Thus, the event set A={uuu, uud, udu, udd, ddd, ddu, dud, duu} contains all possible paths ω ∈ Ω. At t=1, we know a little more: the stock price went either up or down. The corresponding events are defined by A?={uuu,uud,udu,udd} and A?={ddd,ddu,dud,duu}. If the stock price went up, we can surmise that our sample path ω will be in A? and not in A? (and vice versa, of course). At t=2, we have four event sets: A??={uuu,uud}, A??={udu,udd}, A??={duu,dud}, and A??={ddu,ddd}. Observe that the information is getting increasingly fine-grained; the sets to which ω might belong are becoming smaller and more numerous. At t=3, we obviously know the exact price path that has been followed.
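These prefix-based event sets are easy to construct programmatically. A Python sketch (the helper name `event` is ours, not standard notation):

```python
from itertools import product

omega_space = {"".join(path) for path in product("ud", repeat=3)}

def event(prefix):
    """The event 'the path starts with this prefix': all outcomes omega
    that are still possible given the moves observed so far."""
    return {w for w in omega_space if w.startswith(prefix)}

A_u  = event("u")    # the price went up on day 1
A_d  = event("d")    # the price went down on day 1
A_uu = event("uu")   # up on both of the first two days
print(sorted(A_u))
```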
Having defined the events, we can define the corresponding filtrations for t=0,1,2,3:
(Image by author)
At t=0, every outcome is possible. We initialize the filtration with the empty set ? and the outcome space Ω, also known as the trivial σ-algebra.
At t=1, we can simply add A? and A? to ?? to obtain ??; recall from the definition that each filtration always includes all elements of its predecessor. We can use the freshly revealed information to compute S?. We also get a peek into the future (without actually revealing future information!): if the price went up, we cannot reach the lowest possible price at t=3 anymore. The event sets are illustrated below.
Visualization of event sets A? and A? at t=1 (Source: [2], image by author)
At t=2, we may distinguish between four events depending on the price paths revealed so far. Here things get a bit more involved, as we also need to add the unions and complements (in line with the requirements of the σ-algebra). This was not necessary for ??, as the union of A? and A? equals the outcome space and A? is the complement of A?. From an RL perspective, you might note that we have more information than strictly needed. For instance, an up-movement followed by a down-movement yields the same price as the reverse. In RL applications we would typically not store such redundant information, yet you can probably recognize the mathematical appeal.
Visualization of event sets A??, A??, A?? and A?? at t=2 (Source: [2], image by author)
At t=3, we already have 256 sets, using the same procedure as before. You can see that filtrations quickly become extremely large. A filtration always contains all elements of the preceding step; our filtration gets richer and more fine-grained with the passing of time. All this means is that we can more precisely pinpoint the events to which our sample price path may or may not belong.
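The growth of the filtration can be verified by generating the σ-algebra from the partition of paths sharing a prefix. The sketch below confirms the set counts of 2, 4, 16 and 256 for t=0,1,2,3:

```python
from itertools import combinations, product

omega_space = frozenset("".join(p) for p in product("ud", repeat=3))

def atoms_at(t):
    """The atoms of F_t: outcomes grouped by their first t moves."""
    prefixes = {"".join(p) for p in product("ud", repeat=t)}
    return [frozenset(w for w in omega_space if w.startswith(pre))
            for pre in prefixes]

def sigma_algebra(atoms):
    """The sigma-algebra generated by a partition: every union of atoms,
    including the empty union (the empty set) and the full union (Omega)."""
    algebra = set()
    for r in range(len(atoms) + 1):
        for combo in combinations(atoms, r):
            algebra.add(frozenset().union(*combo))
    return algebra

for t in range(4):
    print(t, len(sigma_algebra(atoms_at(t))))
# prints sizes 2, 4, 16, 256: each step doubles the number of atoms,
# which squares the size of the generated sigma-algebra
```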
A continuous example
We are almost there, but we would be remiss if we only treated discrete problems. In reality, stock prices do not only go ‘up’ or ‘down’; they change within a continuous domain. The same holds for many other RL problems. Although conceptually the same as in the discrete case, providing explicit descriptions of filtrations in continuous settings is difficult. Again, some illustrations might help more than formal definitions.
Suppose that at every time step, we simulate a return from the real domain [-d,u]. Depending on how far we look ahead, we may then define an interval in which the future stock price will fall, say [329,335] at a given point in time. We can then define intervals within this domain, and any such interval may constitute an event.
The complement of an interval is an event as well.
Furthermore, a plethora of unions of such intervals may be defined.
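To make these three kinds of events concrete, here is an illustrative sketch; the specific interval bounds are hypothetical, chosen to fit the [329,335] range mentioned above:

```latex
% an interval event
A_1 = \{\, \omega : S_t(\omega) \in [330, 332] \,\}
% its complement within the domain [329, 335]
A_1^c = \{\, \omega : S_t(\omega) \in [329, 330) \cup (332, 335] \,\}
% a union of intervals
A_2 = \{\, \omega : S_t(\omega) \in [329, 330] \cup [333, 335] \,\}
```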
As you may have guessed, there are infinitely many such events in all shapes and sizes, yet each of them can be built from countably many operations on intervals, and we can assign a probability to each of them [2,3].
The further we look into the future, the more the price can deviate from its current value. We might visualize this with a cone shape that expands over time (displayed below for t=50 and t=80). Within the cone, we can define infinitely many intervals. As before, we acquire a more detailed view once more time has passed.
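The widening cone can be illustrated with a quick Monte Carlo experiment. This is only a sketch: drawing each daily return uniformly from [-d,u] is our assumption, not a claim about the underlying process.

```python
import random

def price_after(horizon, seed, s0=100.0, u=1.0, d=1.0):
    """Price after `horizon` days of continuous returns, each drawn
    uniformly from [-d, u] (the uniform distribution is an assumption)."""
    rng = random.Random(seed)
    return s0 + sum(rng.uniform(-d, u) for _ in range(horizon))

# The spread of simulated prices (the 'cone') widens with the horizon.
spreads = {}
for horizon in (50, 80):
    prices = [price_after(horizon, seed=i) for i in range(2000)]
    spreads[horizon] = max(prices) - min(prices)
    print(horizon, round(spreads[horizon], 1))
```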
Event sets in the continuous domain for t=50 and t=80. Within the cones, infinitely many intervals can be defined to construct the filtrations. (Source: [2], image by author)
Wrapping things up
When encountering filtrations in an RL paper, the basics treated in this article should suffice. Essentially, the only purpose of introducing filtrations ?? is to ensure that decisions x?(ω) do not utilize information that has not yet been revealed. When the Markov property holds, a decision x?(S?) that operates on the current state serves the same purpose. The filtration provides a rich description of the past, yet we do not need this information in memoryless problems. Nevertheless, from a mathematical perspective it is an elegant solution with many interesting applications. The reinforcement learning community consists of researchers and engineers from different backgrounds working in a variety of domains; not everyone speaks the same language. Sometimes it goes a long way to learn another language, even if only a few words.
[This article is partially based on my ArXiv article ‘A Gentle Lecture Note on Filtrations in Reinforcement Learning’]
[1] Van Heeswijk, W.J.A. (2020). A Gentle Lecture Note on Filtrations in Reinforcement Learning. arXiv preprint arXiv:2008.02622
[2] Shreve, S. E. (2004). Stochastic Calculus for Finance II: Continuous-Time Models, Volume 11. Springer Science & Business Media.
[3] Shiryaev, A. N. (1996). Probability. Springer New York-Heidelberg.
[4] Luenberger, D. G. (1997). Investment Science. Oxford University Press.
[5] Powell, W. B. (2020). On State Variables, Bandit Problems and POMDPs. arXiv preprint arXiv:2002.06238
翻譯自: https://towardsdatascience.com/filtrations-in-reinforcement-learning-what-they-are-and-why-we-dont-need-them-463c93a170d4
總結(jié)
以上是生活随笔為你收集整理的它们是什么以及为什么我们不需要它们的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: 数据开放 数据集_除开放式清洗之外:叙述
- 下一篇: 梦到别人吃屎预示着什么