Q-Learning By Examples

In this tutorial, you will discover step by step how an agent learns through training without a teacher (unsupervised) in an unknown environment. You will learn about part of the reinforcement learning family of algorithms called Q-learning. Reinforcement learning has been widely used for many applications such as robotics, multi-agent systems, and games.

Instead of covering the theory of reinforcement learning, which you can read in many books and on other web sites (see Resources for more references), this tutorial will introduce the concept through a simple but comprehensive numerical example. You may also download the Matlab code or MS Excel spreadsheet for free.

Suppose we have 5 rooms in a building connected by certain doors as shown in the figure below. We name the rooms A to E. We can consider the outside of the building as one big room surrounding the building, and name it F. Notice that there are two doors leading into the building from F, through room B and room E.
We can represent the rooms as a graph, with each room as a vertex (or node) and each door as an edge (or link). Refer to my other tutorial on graphs if you are not sure what a graph is.

We want to set the target room. If we put an agent in any room, we want the agent to go outside the building. In other words, the goal room is node F. To set this kind of goal, we give a reward value to each door (i.e., each edge of the graph). The doors that lead immediately to the goal have an instant reward of 100 (see the diagram below; they have red arrows). Other doors that do not connect directly to the target room have zero reward. Because each door is two-way (from A you can go to E, and from E you can go back to A), we assign two arrows to each door in the previous graph, each arrow carrying an instant reward value. The graph then becomes the state diagram shown below.

An additional loop with the highest reward (100) is given to the goal room (F back to F) so that if the agent arrives at the goal, it will remain there forever. This type of goal is called an absorbing goal because once the agent reaches the goal state, it stays in the goal state.

Ladies and gentlemen, now is the time to introduce our superstar agent….
Imagine our agent as a dumb virtual robot that can learn through experience. The agent can pass from one room to another but has no knowledge of the environment. It does not know which sequence of doors it must pass through to get outside the building.

Suppose we want to model a simple evacuation of an agent from any room in the building. Now suppose we have an agent in room C and we want the agent to learn to reach the outside of the house (F) (see the diagram below).

How can we make our agent learn from experience?

Before we discuss how the agent will learn (using Q-learning) in the next section, let us discuss some terminology: state and action.

We call each room (including the outside of the building) a state. The agent's movement from one room to another is called an action. Let us redraw our state diagram: a state is depicted as a node in the state diagram, while an action is represented by an arrow.

Suppose the agent is now in state C. From state C, the agent can go to state D because state C is connected to D. From state C, however, the agent cannot go directly to state B because there is no door connecting rooms B and C (thus, no arrow). From state D, the agent can go either to state B, to state E, or back to state C (look at the arrows out of state D). If the agent is in state E, then the three possible actions are to go to state A, state F, or state D. If the agent is in state B, it can go either to state D or state F. From state A, it can only go back to state E.
We can put the state diagram and the instant reward values into the following reward table, or matrix R.

The minus sign in the table says that the row state has no action leading to the column state. For example, state A cannot go to state B (because there is no door connecting rooms A and B, remember?).
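The reward table itself does not survive in this extract, so here is a small sketch, in Python/NumPy rather than the Matlab or Excel versions the tutorial offers for download, that builds a candidate matrix R from the door connections described above: -1 where there is no door, 0 for an ordinary door, and 100 for a door that leads to the goal F, including the F-to-F loop.

```python
import numpy as np

# States in order: A, B, C, D, E, F (F is "outside").
states = ["A", "B", "C", "D", "E", "F"]
GOAL = states.index("F")

# Doors described in the text: A-E, B-D, B-F, C-D, D-E, E-F,
# plus the F-to-F loop for the absorbing goal.
doors = [("A", "E"), ("B", "D"), ("B", "F"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("F", "F")]

# -1 marks "no door", 0 an ordinary door, 100 a door into the goal.
R = -np.ones((6, 6))
for a, b in doors:
    i, j = states.index(a), states.index(b)
    R[i, j] = 100 if j == GOAL else 0
    R[j, i] = 100 if i == GOAL else 0   # doors are two-way

print(R)
```

In the printed matrix, each row corresponds to the agent's current state (A through F) and each column to the action of moving into the room of that name.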
In the previous sections of this tutorial, we modeled the environment and the reward system for our agent. This section will describe the learning algorithm called Q-learning (which is a simplification of reinforcement learning).

We have modeled the environment reward system as matrix R.

Now we need to put a similar matrix, named Q, into the brain of our agent; it will represent the memory of what the agent has learned through many experiences. The rows of matrix Q represent the current state of the agent, and the columns of matrix Q point to the action to take to reach the next state.

In the beginning, we say that the agent knows nothing, so we set Q to a zero matrix. In this example, for simplicity of explanation, we assume the number of states is known (to be six). In the more general case, you can start with a zero matrix of a single cell. It is a simple task to add more columns and rows to the Q matrix whenever a new state is found.
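As a minimal sketch of this initialization, continuing the NumPy example above (the helper name add_state is my own and purely illustrative): the six-state case is just a 6x6 zero matrix, and the single-cell variant can be grown by padding in a new zero row and column whenever an unseen state appears.

```python
import numpy as np

# Known number of states: start Q as a 6x6 zero matrix.
Q = np.zeros((6, 6))

# General case: start with a single-cell zero matrix and grow it
# by one row and one column each time a new state is discovered.
Q_general = np.zeros((1, 1))

def add_state(Q):
    """Return Q extended with one extra zero row and one extra zero column."""
    return np.pad(Q, ((0, 1), (0, 1)), mode="constant", constant_values=0.0)

Q_general = add_state(Q_general)   # now 2x2, still all zeros
```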
The transition rule of this Q-learning is a very simple formula:

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

The formula above means that the entry value in matrix Q (where the row represents the state and the column represents the action) is equal to the corresponding entry of matrix R plus the learning parameter Gamma multiplied by the maximum value of Q over all actions in the next state.
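Continuing the sketch, the transition rule can be written as a one-line update. The convention that taking a door leads directly into the room it names follows the description above, while the function name and the value chosen for Gamma are illustrative assumptions only.

```python
import numpy as np

GAMMA = 0.8  # learning parameter (an assumed value for illustration)

def q_update(Q, R, state, action):
    """Q(state, action) = R(state, action) + Gamma * max of Q(next state, all actions).

    In this example an action means "move into room `action`",
    so the next state is simply the action index itself.
    """
    next_state = action
    Q[state, action] = R[state, action] + GAMMA * Q[next_state].max()
    return next_state
```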
Our virtual agent will learn through experience without a teacher (this is called unsupervised learning). The agent will explore from state to state until it reaches the goal. We call each exploration an episode. In one episode, the agent moves from an initial state to the goal state. Once the agent arrives at the goal state, the program goes on to the next episode. The algorithm below has been proved to be convergent (see the references for the proof).
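The algorithm itself is not included in this extract, so the loop below is only a rough sketch of how the episodes just described might be run: the agent starts in a random room, keeps picking a random available door until it reaches F, and applies the transition rule at every step. The value Gamma = 0.8, the 500 episodes, the random exploration strategy, and the final normalization are assumptions for illustration, not taken from the text above.

```python
import numpy as np

GAMMA = 0.8          # learning parameter (assumed value)
GOAL = 5             # index of state F ("outside")
EPISODES = 500       # number of training episodes (assumed value)

# Reward matrix R built from the door connections described earlier.
R = np.array([
    # A    B    C    D    E    F
    [ -1,  -1,  -1,  -1,   0,  -1],   # A
    [ -1,  -1,  -1,   0,  -1, 100],   # B
    [ -1,  -1,  -1,   0,  -1,  -1],   # C
    [ -1,   0,   0,  -1,   0,  -1],   # D
    [  0,  -1,  -1,   0,  -1, 100],   # E
    [ -1,   0,  -1,  -1,   0, 100],   # F
], dtype=float)

Q = np.zeros_like(R)
rng = np.random.default_rng(0)

for _ in range(EPISODES):
    state = rng.integers(0, 6)                    # random initial room
    while state != GOAL:                          # one episode: walk until F
        actions = np.flatnonzero(R[state] >= 0)   # doors available from this room
        action = rng.choice(actions)              # explore randomly
        next_state = action                       # taking a door leads to that room
        Q[state, action] = R[state, action] + GAMMA * Q[next_state].max()
        state = next_state

# Normalize for readability, as a percentage of the largest entry.
print(np.round(100 * Q / Q.max()).astype(int))
```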