Q-Learning By Examples

In this tutorial, you will discover step by step how an agent learns through training, without a teacher telling it the correct action, in an unknown environment. You will get to know one reinforcement learning algorithm, called Q-learning. Reinforcement learning has been widely used in applications such as robotics, multi-agent systems, and games.

Instead of presenting the theory of reinforcement learning, which you can read in many books and on other web sites (see Resources for more references), this tutorial introduces the concept through a simple but comprehensive numerical example. You may also download the Matlab code or MS Excel spreadsheet for free.

Suppose we have 5 rooms in a building connected by certain doors, as shown in the figure below. We name the rooms A to E. We can consider the outside of the building as one big room that surrounds the building, and name it F. Notice that two doors lead into the building from F: one through room B and one through room E.

We can represent the rooms by a graph, with each room as a vertex (or node) and each door as an edge (or link). Refer to my other tutorial on Graph if you are not sure what a graph is.

We want to set the target room. If we put an agent in any room, we want the agent to go outside the building. In other words, the goal room is node F. To set this kind of goal, we give a reward value to each door (i.e. each edge of the graph). The doors that lead immediately to the goal have an instant reward of 100 (see the diagram below; they have red arrows). Other doors, which have no direct connection to the target room, have zero reward. Because each door is two-way (from A the agent can go to E, and from E it can go back to A), we assign two arrows between each pair of connected rooms in the previous graph. Each arrow carries an instant reward value. The graph becomes the state diagram shown below.

An additional loop with the highest reward (100) is given to the goal room (F back to F), so that if the agent arrives at the goal, it will remain there forever. This type of goal is called an absorbing goal, because once the agent reaches the goal state, it stays in the goal state.

Ladies and gentlemen, now is the time to introduce our superstar agent...

Imagine our agent as a dumb virtual robot that can learn through experience. The agent can pass from one room to another but has no knowledge of the environment. It does not know which sequence of doors it must pass through to get outside the building.

Suppose we want to model some kind of simple evacuation of an agent from any room in the building. Now suppose we have an agent in room C and we want the agent to learn to reach the outside of the house (F) (see diagram below).

How do we make our agent learn from experience?

Before we discuss how the agent will learn (using Q-learning) in the next section, let us discuss the terms state and action.

We call each room (including the outside of the building) a state. The agent's movement from one room to another is called an action. Let us redraw our state diagram. A state is depicted as a node in the state diagram, while an action is represented by an arrow.

Suppose now the agent is in state C. From state C, the agent can go to state D because state C is connected to D. From state C, however, the agent cannot go directly to state B because there is no door connecting rooms B and C (thus, no arrow). From state D, the agent can go to state B, to state E, or back to state C (look at the arrows out of state D). If the agent is in state E, the three possible actions are to go to state A, state F, or state D. If the agent is in state B, it can go either to state D or to state F. From state A, it can only go back to state E.

We can put the state diagram and the instant reward values into the following reward table, or matrix R.

The minus sign in the table says that the row state has no action leading to the column state. For example, state A cannot go to state B (because no door connects rooms A and B, remember?).

In the previous sections of this tutorial, we modeled the environment and the reward system for our agent. This section describes a learning algorithm called Q-learning (a simple form of reinforcement learning).

We have modeled the environment's reward system as matrix R.

Now we need to put a similar matrix, named Q, in the brain of our agent to represent the memory of what the agent has learned through many experiences. Each row of matrix Q represents the current state of the agent; each column of matrix Q points to the action of going to the next state.

In the beginning, we say that the agent knows nothing, so we set Q to the zero matrix. In this example, for simplicity of explanation, we assume the number of states is known (to be six). In the more general case, you can start with a zero matrix of a single cell. It is a simple task to add more columns and rows to the Q matrix when a new state is found.

The transition rule of Q-learning is a very simple formula:

Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]

The formula above means that an entry of matrix Q (where the row represents the state and the column represents the action) equals the corresponding entry of matrix R plus the learning parameter γ multiplied by the maximum value of Q over all actions in the next state.
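The reward table and the transition rule described above can be sketched in Python. The matrix is reconstructed from the doors named in the text (A-E, B-D, B-F, C-D, D-E, E-F, plus the F-F loop). The use of -1 as the "no door" marker, the value 0.8 for the learning parameter, and the names `R`, `Q`, `GAMMA` and `update` are assumptions made for illustration; this is not the tutorial's downloadable Matlab code.

```python
# States A..F are mapped to indices 0..5 (an illustrative convention).
STATES = "ABCDEF"
NO_DOOR = -1  # assumed marker: the row state has no action to the column state

# Reward matrix R, reconstructed from the door description in the text.
R = [
    # A   B   C   D   E    F
    [-1, -1, -1, -1,  0,  -1],   # A: door to E
    [-1, -1, -1,  0, -1, 100],   # B: doors to D and F
    [-1, -1, -1,  0, -1,  -1],   # C: door to D
    [-1,  0,  0, -1,  0,  -1],   # D: doors to B, C and E
    [ 0, -1, -1,  0, -1, 100],   # E: doors to A, D and F
    [-1,  0, -1, -1,  0, 100],   # F: doors to B and E, plus the loop F -> F
]

# The agent starts out knowing nothing, so Q is a 6 x 6 zero matrix.
Q = [[0.0] * len(STATES) for _ in STATES]

GAMMA = 0.8  # learning parameter; the value is an assumption


def update(Q, R, state, action, gamma=GAMMA):
    """Q(state, action) = R(state, action) + gamma * max Q(next state, all actions)."""
    next_state = action  # the action "go to room X" puts the agent in state X
    best_next = max(
        Q[next_state][a]
        for a in range(len(STATES))
        if R[next_state][a] != NO_DOOR  # only actions that exist in the next state
    )
    Q[state][action] = R[state][action] + gamma * best_next
    return Q[state][action]


b, f = STATES.index("B"), STATES.index("F")
first = update(Q, R, b, f)  # Q is still all zeros, so this is 100 + 0.8 * 0
print(first)  # 100.0
```

Because taking an action means moving to the room of the same name, the column index doubles as the next state, which is why `update` needs no separate transition table.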
Our virtual agent will learn through experience without a teacher (in this loose sense the learning is unsupervised, although the agent is still guided by rewards). The agent explores from state to state until it reaches the goal. We call each exploration an episode. In one episode the agent moves from the initial state to the goal state. Once the agent arrives at the goal state, the program goes on to the next episode. The algorithm below has been proved to converge (see References for the proof).
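The episode loop described above can be sketched as follows: start from a random state, repeatedly take one of the possible actions, apply the transition rule, and stop once the goal state F is reached. The uniformly random choice of actions, the 2000 episodes, the learning parameter 0.8, the fixed random seed, and all function names are assumptions made for illustration. The `greedy_path` helper is a hypothetical check on the learned matrix, not part of the tutorial's algorithm listing.

```python
import random

STATES = "ABCDEF"
GOAL = STATES.index("F")
NO_DOOR = -1  # assumed marker for "no action from row state to column state"

# Reward matrix reconstructed from the doors named in the text.
R = [
    [-1, -1, -1, -1,  0,  -1],   # A
    [-1, -1, -1,  0, -1, 100],   # B
    [-1, -1, -1,  0, -1,  -1],   # C
    [-1,  0,  0, -1,  0,  -1],   # D
    [ 0, -1, -1,  0, -1, 100],   # E
    [-1,  0, -1, -1,  0, 100],   # F
]


def possible_actions(state):
    """Rooms reachable through a door from the given state."""
    return [a for a, reward in enumerate(R[state]) if reward != NO_DOOR]


def train(episodes=2000, gamma=0.8, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * len(STATES) for _ in STATES]  # the agent knows nothing yet
    for _ in range(episodes):
        state = rng.randrange(len(STATES))  # random initial state
        while True:  # at least one step per episode, even when starting at F
            action = rng.choice(possible_actions(state))
            next_state = action  # "go to room X" lands the agent in state X
            best_next = max(Q[next_state][a] for a in possible_actions(next_state))
            Q[state][action] = R[state][action] + gamma * best_next
            state = next_state
            if state == GOAL:
                break  # goal reached: the episode ends
    return Q


def greedy_path(Q, start, max_steps=10):
    """Follow the highest-valued action from each state until F is reached."""
    path, state = [start], start
    while state != GOAL and len(path) <= max_steps:
        state = max(possible_actions(state), key=lambda a: Q[state][a])
        path.append(state)
    return "".join(STATES[s] for s in path)


Q = train()
route = greedy_path(Q, STATES.index("C"))
print(route)
```

With enough episodes, the greedy route from room C should pass through D and then through B or E before reaching F, because those are the doors with the largest learned values.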