Wednesday, October 14, 2015

A Tutorial for Reinforcement Learning

A Tutorial for Reinforcement Learning
Abhijit Gosavi
Department of Engineering Management and Systems Engineering
Missouri University of Science and Technology
219 Engineering Management, Rolla, MO 65409
Email: gosavia@mst.edu
November 22, 2014
If you find this tutorial useful, or the codes in C at
http://web.mst.edu/~gosavia/bookcodes.html
useful, or the MATLAB codes at
http://web.mst.edu/~gosavia/mrrl_website.html
useful, please do cite my book (for which this material was prepared), now in its second edition.
A. Gosavi. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, Springer, New York, NY, Second edition, 2014.
Book website: http://web.mst.edu/~gosavia/book.html
Contents
1 Introduction
2 MDPs and SMDPs
3 RL
  3.1 Average reward
  3.2 Selecting the learning rate or step size
  3.3 Discounted reward
  3.4 Codes
4 Example
5 Conclusions
1 Introduction
The tutorial is written for those who would like an introduction to reinforcement learning (RL). The aim is to provide an intuitive presentation of the ideas rather than concentrate on the deeper mathematics underlying the topic.
RL is generally used to solve the so-called Markov decision problem (MDP). In other words, the problem that you are attempting to solve with RL should be an MDP or its variant. The theory of RL relies on dynamic programming (DP) and artificial intelligence (AI). We will begin with a quick description of MDPs. We will discuss what we mean by “complex” and “large-scale” MDPs. Then we will explain why RL is needed to solve complex and large-scale MDPs. The semi-Markov decision problem (SMDP) will also be covered.
The tutorial is meant to serve as an introduction to these topics and is based mostly on the book "Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning" [3]. The book discusses this topic in greater detail in the context of simulators. There are at least two other textbooks that I would recommend you read:
(i) Neuro-dynamic programming [1] (lots of details on convergence analysis) and (ii) Reinforcement Learning: An Introduction [8] (lots of details on underlying AI concepts). A more recent tutorial on this topic is [7]. This tutorial has two main sections:
Section 2 discusses MDPs and SMDPs.
Section 3 discusses RL.
By the end of this tutorial, you should be able to
Identify problem structures that can be set up as MDPs / SMDPs.
Use some RL algorithms.
We will not discuss how to use function approximation, but will provide some general advice towards the end.
2 MDPs and SMDPs
The framework of the MDP has the following elements: (1) state of the system, (2) actions,
(3) transition probabilities, (4) transition rewards, (5) a policy, and (6) a performance metric. We assume that the system is modeled by a so-called abstract stochastic process called the Markov chain. When we observe the system, we observe its Markov chain, which is defined by the states. We explain these ideas in more detail below.
State: The “state” of a system is a parameter or a set of parameters that can be used to describe a system. For example, the geographical coordinates of a robot can be used to describe its “state.” A system whose state changes with time is called a dynamic system. It is then not hard to see why a moving robot produces a dynamic system.
Another example of a dynamic system is the queue that forms in a supermarket in front of the counter. Imagine that the state of the queuing system is defined by the number of
people in the queue. Then, it should be clear that the state fluctuates with time, and hence this is a dynamic system.
It is to be understood that the transition from one state to another in an MDP is usually a random affair. Consider a queue in which there is one server and one waiting line. In this queue, the state x, defined by the number of people in the queue, transitions to x + 1 with some probability and to x − 1 with the remaining probability. The former type of transition occurs when a new customer arrives, while the latter event occurs when one customer departs from the system because of service completion.
Actions: Now, usually, the motion of the robot can be controlled, and in fact we are interested in controlling it in an optimal manner. Assume that the robot can move in discrete steps, and that after every step the robot takes, it can go North, go South, go East, or go West. These four options are called actions or controls allowed for the robot.
For the queuing system discussed above, an action could be as follows: when the number of customers in a line exceeds some prefixed number (say 10), the remaining customers are diverted to a new counter that is opened. Hence, two actions for this system can be described as: (1) Open a new counter; (2) Do not open a new counter.
Transition Probability: Assume that action a is selected in state i. Let the next state be j. Let p(i, a, j) denote the probability of going from state i to state j in one step under the influence of action a. This quantity is also called the transition probability. If an MDP has 3 states and 2 actions, there are 9 transition probabilities per action (18 in total).
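To make the bookkeeping concrete, here is a minimal sketch (in Python, with made-up numbers) of one way the p(i, a, j) and r(i, a, j) terms of a small MDP could be stored; nothing here is tied to any specific example from this tutorial.

```python
import numpy as np

# Hypothetical 2-action, 3-state MDP; all numbers are for illustration only.
# p[a][i][j] = probability of jumping from state i to state j under action a.
p = np.array([
    [[0.6, 0.3, 0.1],   # action 0
     [0.2, 0.5, 0.3],
     [0.4, 0.4, 0.2]],
    [[0.1, 0.8, 0.1],   # action 1
     [0.3, 0.3, 0.4],
     [0.5, 0.2, 0.3]],
])

# r[a][i][j] = immediate reward earned on the transition i -> j under action a.
rng = np.random.default_rng(0)
r = rng.uniform(-5.0, 10.0, size=p.shape)

# Each row of p must be a probability distribution over next states.
assert np.allclose(p.sum(axis=2), 1.0)
```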
Immediate Rewards: Usually, the system receives an immediate reward (which could be positive or negative) when it transitions from one state to another. This is denoted by r(i, a, j).
Policy: The policy defines the action to be chosen in every state visited by the system. Note that in some states, no actions are to be chosen. States in which decisions are to be made, i.e., actions are to be chosen, are called decision-making states. In this tutorial, by states, we will mean decision-making states.
Performance Metric: Associated with any given policy, there exists a so-called performance metric, with which the performance of the policy is judged. Our goal is to select the policy that has the best performance metric. We will first consider the metric called the average reward of a policy. We will later discuss the metric called discounted reward. We will assume that the system is run for a long time and that we are interested in a metric measured over what is called the infinite time horizon.
Time of transition: We will assume for the MDP that the time of transition is unity (1), which means it is the same for every transition. Clearly, 1 here does not have to mean 1 hour or minute or second; it is some quantity fixed by the analyst. For the SMDP, this quantity is not fixed, as we will see later.
Let us assume that a policy named π is to be followed. Then π(i) will denote the action selected by this policy for state i. Every time there is a jump in the Markov chain (of the system under the policy in consideration), we say a transition (or jump) has occurred. It is
important to understand that during a transition we may go from a state to itself!
Let x_s denote the state of the system before the sth transition. Then the following quantity, in which x_1 = i, is called the average reward of the policy π starting at state i:
\[
\rho_i \;=\; \lim_{k \to \infty} \frac{E\left[\sum_{s=1}^{k} r(x_s, \pi(x_s), x_{s+1}) \,\middle|\, x_1 = i\right]}{k}. \tag{1}
\]
This average reward essentially denotes the sum of the immediate rewards earned, divided by the number of jumps (transitions), calculated over a very long time horizon (that is, k assumes a large value). In the above, the starting state is i and π(x_s) denotes the action selected by the policy in state x_s. Also note that E[·] denotes the expected (average) value of the quantity inside the square brackets.
It is not hard to show that the limit in (1) is such that its value is the same for any value of x1, if the underlying Markov chains in the system satisfy certain conditions (related to the regularity of the Markov chains); in many real-world problems such conditions are often satisfied. Then
ρ_i = ρ for any value of i.
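Equation (1) also suggests a direct way to estimate the average reward of a given policy when a simulator is available: run the policy for many jumps and divide the accumulated reward by the number of jumps. A minimal Python sketch, assuming transition arrays p and r like those in the earlier illustration, is shown below.

```python
import numpy as np

def average_reward(p, r, policy, start_state=0, num_jumps=200_000, seed=1):
    """Estimate rho for a fixed policy by simulation (cf. Eqn. 1).

    p[a][i][j] and r[a][i][j] are transition probabilities and rewards,
    and policy[i] is the action prescribed in state i (all hypothetical).
    """
    rng = np.random.default_rng(seed)
    n_states = p.shape[1]
    i, total = start_state, 0.0
    for _ in range(num_jumps):
        a = policy[i]
        j = rng.choice(n_states, p=p[a][i])   # sample the next state
        total += r[a][i][j]
        i = j
    return total / num_jumps                  # total reward / number of jumps
```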
The objective of the average-reward MDP is to find the policy that maximizes the per- formance metric (average reward) of the policy.
Another popular performance metric, commonly used in the literature, is the discounted reward. The following quantity is called the discounted reward of a policy π. Again, let x_s denote the state of the system before the sth jump (transition).


\[
\psi_i \;=\; \lim_{k \to \infty} E\left[\sum_{s=1}^{k} \gamma^{\,s-1}\, r(x_s, \pi(x_s), x_{s+1}) \,\middle|\, x_1 = i\right], \tag{2}
\]
where γ denotes the discount factor; this factor, γ, is greater than 0 but less than 1. Eqn. (2) has a simple interpretation:
\[
\psi_i = E\left[ r(x_1, \pi(x_1), x_2) + \gamma\, r(x_2, \pi(x_2), x_3) + \gamma^2\, r(x_3, \pi(x_3), x_4) + \cdots \right].
\]
The discounted reward essentially measures the present value of the sum of the rewards earned in the future over an infinite time horizon, where γ is used to discount money’s value. We should point out that:
\[
\gamma \;=\; \frac{1}{1 + \mu}, \tag{3}
\]
where µ is the rate of interest; the rate is expressed as a fraction here and not in percent. When µ > 0, we have that 0 < γ < 1. It is worthwhile pointing out that in the MDP, we have 1/(1 + µ) raised to the power 1 because in the MDP we assume a fixed rate of discounting and that the time duration of each transition is fixed at 1. This mechanism thus captures within the MDP framework the notion of time value of money.
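As a quick worked example (with an interest rate chosen purely for illustration), a per-period rate of µ = 0.10 gives
\[
\gamma = \frac{1}{1 + 0.10} \approx 0.909,
\]
so a reward of 1 received one transition from now is worth about 0.909 today, and one received two transitions from now is worth γ² ≈ 0.826.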
The objective of the discounted-reward MDP is to find the policy that maximizes the performance metric (discounted reward) of the policy starting from every state.
Note that for average reward problems, the immediate reward in our algorithms can be earned during the entire duration of the transition. However, for the discounted problems, we will assume that the immediate reward is earned immediately after the transition starts.
The MDP can be solved with the classical method of dynamic programming (DP). How- ever, DP needs all the transition probabilities (the p(i, a, j) terms) and the transition rewards (the r(i, a, j) terms) of the MDP.
For semi-Markov decision problems (SMDPs), an additional parameter of interest is the time spent in each transition. The time spent in the transition from state i to state j under the influence of action a is denoted by t(i, a, j). To solve SMDPs via DP, one also needs the transition times (the t(i, a, j) terms). For SMDPs, the average reward that we seek to maximize is defined as:


\[
\rho_i \;=\; \lim_{k \to \infty} \frac{E\left[\sum_{s=1}^{k} r(x_s, \pi(x_s), x_{s+1}) \,\middle|\, x_1 = i\right]}{E\left[\sum_{s=1}^{k} t(x_s, \pi(x_s), x_{s+1}) \,\middle|\, x_1 = i\right]}. \tag{4}
\]

(Technically, lim should be replaced by lim inf everywhere in this tutorial, but we will not worry about such technicalities here.) It can be shown that this quantity has the same limit for any starting state (under certain conditions). A possible unit for the average reward here is $ per hour.
For discounted reward, we will, as stated above, assume the immediate reward is earned immediately after the transition starts and does not depend on the duration of the transition. Thus, the immediate reward is a lumpsum reward earned at the start of the transition (when the immediate reward is a function of the time interval, see instead the algorithm in [2]). Also, because of the variability in time, we will assume continuously compounded rate of interest. Then, we seek to maximize:
\[
\psi_i \;=\; \lim_{k \to \infty} E\left[\, r(x_1, \pi(x_1), x_2) \;+\; \sum_{s=2}^{k} r(x_s, \pi(x_s), x_{s+1}) \int_{\tau_s}^{\tau_{s+1}} e^{-\mu \tau}\, d\tau \;\middle|\; x_1 = i \right],
\]
where e^{-µτ} denotes the discounting factor over a period of length τ (under continuous compounding) and τ_s is the time of occurrence of the sth jump (transition). Note that since the immediate reward does not depend on the time τ, it can be taken out of the integral. Essentially, what we have above is the sum of discounted rewards, with the discount factor for each transition appropriately calculated using the notion of continuous compounding.
Curses: For systems which have a large number of governing random variables, it is often hard to derive the exact values of the associated transition probabilities. This is called the curse of modeling. For large-scale systems with millions of states, it is impractical to store these values. This is called the curse of dimensionality.
DP breaks down on problems which suffer from any one of these curses because it needs all these values.
Reinforcement Learning (RL) can generate near-optimal solutions to large and complex MDPs. In other words, RL is able to make inroads into problems which suffer from one or more of these two curses and cannot be solved by DP.
3 RL
We will describe a basic RL algorithm that can be used for average reward SMDPs. Note that if t(i, a, j) = 1 for all values of i, j,and a, we have an MDP. Hence our presentation will
be for an SMDP, but it can easily be translated into that of an MDP by setting t(i, a, j) = 1 in the steps.
It is also important to understand that the transition probabilities and rewards of the system are not needed if either of the following is true:
1. we can play around in the real-world system, choosing actions and observing the rewards, or
2. we have a simulator of the system.
The simulator of the system can usually be written on the basis of the knowledge of some other easily accessible parameters. For example, the queue can be simulated with the knowledge of the distribution functions of the inter-arrival time and the service time. Thus the transition probabilities of the system are usually not required for writing the simulation program.
Also, it is important to know that the RL algorithm that we will describe below requires the updating of certain quantities (called Q-factors) in its database whenever the system visits a new state.
When the simulator is written in C or in any special package such as ARENA, it is possible to update certain quantities that the algorithm needs whenever a new state is visited.
Usually, the updating that we will need has to be performed immediately after a new state is visited. In the simulator, or in real time, it IS possible to keep track of the state of the system so that when it changes, one can update the relevant quantities.
The key idea in RL is to store a so-called Q-factor for each state-action pair in the system. Thus, Q(i, a) will denote the Q-factor for state i and action a. The values of these Q-factors are initialized to suitable numbers in the beginning (e.g., zero or some small number for all the Q-factors). Then the system is simulated (or controlled in real time) using the algorithm. In each state visited, some action is selected and the system is allowed to transition to the next state. The immediate reward (and the transition time) generated in the transition is recorded as the feedback. The feedback is used to update the Q-factor for the action selected in the previous state. Roughly speaking, if the feedback is good, the Q-factor of that particular action and the state in which the action was selected is increased (rewarded) by the update rule (e.g., in the R-SMART algorithm described below); if the feedback is poor, the Q-factor is punished by reducing its value.
Then the same reward-punishment policy is carried out in the next state. This is done for a large number of transitions. At the end of this phase, also called the learning phase, the action whose Q-factor has the highest value is declared to be the optimal action for that state. Thus the optimal policy is determined. Note that this strategy does not require the transition probabilities.
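The bookkeeping just described can be pictured with a short sketch (Python, illustrative only): the Q-factors live in a table keyed by (state, action), the action with the highest Q-factor in the current state is the greedy choice, and the feedback observed after each transition nudges the corresponding Q-factor up or down. The exact form of the feedback term is what distinguishes the algorithms given in the next subsections.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] starts at 0 the first time it is touched

def greedy_action(Q, state, actions):
    """Return the action with the highest Q-factor in `state`."""
    return max(actions, key=lambda a: Q[(state, a)])

def update_q(Q, state, action, feedback, alpha):
    """Move Q(state, action) toward the observed feedback.

    Good feedback (larger than the current Q-factor) increases the value,
    poor feedback decreases it; the exact `feedback` term is algorithm-specific.
    """
    Q[(state, action)] += alpha * (feedback - Q[(state, action)])
```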
3.1 Average reward
We begin with average reward. The algorithm that we will describe is called R-SMART (Relaxed Semi-Markov Average Reward Technique). The original version in [5] required that the average reward be initialized to a value close to its optimal value. We present here a modified version from [4] that does not require this. We first present the CF-version of R-SMART, where CF stands for the contraction factor. We will later present another version
of this algorithm.
Steps in CF-version of R-SMART:
The steps in the Learning Phase are given below.
• Step 1: Set the Q-factors to some arbitrary values (e.g., 0), that is:
Q(i, a) ← 0 for all i and a.
Set the iteration count, k, to 1. Let ρ^k denote the average reward estimate in the kth iteration of the algorithm. Set ρ^1 to 0 or any other value. Let the first state be i. Let A(i) denote the set of actions allowed in state i. Let α^k denote the main learning rate in the kth iteration. Selecting a value for α^k (or any learning rate/step size) is a topic that we discuss below. Let β^k denote the secondary learning rate. Also set the following two quantities to 0: total reward and total time. ITERMAX will denote the number of iterations for which the algorithm is run; it should be set to a large number. Set η, a scaling constant, to a positive value that is close to, but less than, 1, e.g., 0.99.
• Step 2: Determine the action associated with the Q-factor that has the highest value in state i. (Imagine there are two actions in a state and their values are Q(i, 1) = 19 and Q(i, 2) = 45; then clearly action 2 has the greater Q-factor.) This is called the greedy action. Select the greedy action with probability (1 − p(k)), where for instance p(k) could be defined as follows:
\[
p(k) = \frac{G_1}{G_2 + k},
\]
in which G_2 > G_1, and G_1 and G_2 are large positive constants, e.g., 1000 and 2000 respectively. With a probability of p(k), choose one of the other actions. This can be done easily. The non-greedy actions are called exploratory actions, and selecting an exploratory action is called exploration. It should be clear that our probability of exploration will decay with k, the iteration number, and it should. (For the two-action case, you can generate a random number between 0 and 1. If the number is less than or equal to (1 − p(k)), choose the greedy action; otherwise, choose the other action.)
Let the action selected be denoted by a. If a is a greedy action, set ϕ = 0 (this means that the action selected is greedy with respect to the Q-factors); otherwise, set ϕ = 1. Simulate action a.
• Step 3: Let the next state be denoted by j. Let r(i, a, j) denote the transition reward and t(i, a, j) denote the transition time. Then update Q(i, a) as follows:
\[
Q(i, a) \;\leftarrow\; (1 - \alpha^k)\, Q(i, a) \;+\; \alpha^k \left[ r(i, a, j) \;-\; \rho^k\, t(i, a, j) \;+\; \eta \max_{b \in A(j)} Q(j, b) \right].
\]

In the above, max_{b∈A(j)} Q(j, b) is simply the maximum numeric value of all the Q-factors of state j. The value of α^k should be selected appropriately; see the next subsection for a discussion of learning rate rules.
Step 4: If ϕ = 1, that is, if the action was non-greedy, go to Step 5. Otherwise, update total reward and total time as follows:
total reward ← total reward + r(i, a, j);
total time ← total time + t(i, a, j).
Then update the average reward as:






\[
\rho^{k+1} \;\leftarrow\; (1 - \beta^k)\,\rho^k \;+\; \beta^k \left[ \frac{\text{total reward}}{\text{total time}} \right],
\]

where β^k is chosen using a rule that decays to 0 faster than α^k. The next subsection discusses how appropriate values for α and β can be chosen.
Step 5: Increment k by 1. Set i ← j. If k < ITERMAX, return to Step 2. Otherwise, declare the action for which Q(i, ·) is maximum to be the optimal action for state i (do this for all values of i, i.e., for all states, to generate a policy), and STOP.
Note that in the above, the exploration probability p(k) gradually decays to 0, and in the limit, the algorithm selects only the greedy actions. However, this decay should be gradual. If the decay is very quick (e.g., you use small values for G_1 and G_2), the algorithm will most likely converge to an incorrect solution. In other words, the algorithm should be allowed to explore sufficiently.
Further note: η is a positive scalar selected by the user such that 0 < η < 1. The positive scalar η has to be less than 1; it is the contraction factor (CF) that enables the algorithm to converge gracefully to the optimal solution. Its value should be as close to 1 as possible and should not be changed during the learning. An example could be η = 0.99. For a sufficiently large value of η, the algorithm is guaranteed to converge to the optimal solution. In practice, η = 1 may also generate convergent behavior, but there is no known proof of this.
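Putting the steps together, a compact sketch of the learning phase of the CF-version is given below (Python, illustrative only, not the reference code). It assumes a hypothetical simulator interface simulate(i, a) that returns the next state j, the transition reward r, and the transition time t; the exploration and step-size rules follow Step 2 and Section 3.2.

```python
import math
import random
from collections import defaultdict

def rsmart_cf_learning_phase(simulate, states, actions, iter_max=100_000,
                             eta=0.99, G1=1000.0, G2=2000.0, seed=0):
    """Learning phase of the CF-version of R-SMART (sketch)."""
    random.seed(seed)
    Q = defaultdict(float)                       # Step 1: Q(i, a) <- 0 for all i, a
    rho, total_reward, total_time = 0.0, 0.0, 0.0
    i = states[0]
    for k in range(1, iter_max + 1):
        alpha = math.log(k + 1) / (k + 1)        # main learning rate (log rule)
        beta = 90.0 / (100.0 + k)                # secondary learning rate
        p_k = G1 / (G2 + k)                      # Step 2: decaying exploration
        greedy = max(actions[i], key=lambda a: Q[(i, a)])
        others = [b for b in actions[i] if b != greedy]
        if others and random.random() < p_k:
            a, greedy_move = random.choice(others), False
        else:
            a, greedy_move = greedy, True
        j, r, t = simulate(i, a)                 # Step 3: observe the feedback
        target = r - rho * t + eta * max(Q[(j, b)] for b in actions[j])
        Q[(i, a)] = (1.0 - alpha) * Q[(i, a)] + alpha * target
        if greedy_move:                          # Step 4: update rho on greedy moves only
            total_reward += r
            total_time += t
            rho = (1.0 - beta) * rho + beta * (total_reward / total_time)
        i = j                                    # Step 5
    policy = {s: max(actions[s], key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy, rho
```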
The next phase is called the frozen phase because the Q-factors are not updated in it. This phase is performed to estimate the average reward of the policy declared by the learning phase to be the optimal policy. (By optimal, of course, we only mean the best that RL can generate; it may not necessarily be optimal, but hopefully it is very close to the optimal in its performance.) The steps in the frozen phase are as follows.
• Step 1: Use the Q-factors learned in the learning phase. Set the iteration count k to 0. ITERMAX will denote the number of iterations for which the frozen phase is run; it should be set to a large number. Also set the following two quantities to 0: total reward and total time.
Step 2: For state i, select the action which has the maximum Q-factor. Let that action be denoted by u. Simulate action u.
Step 3: Let the next state be denoted by j. Let r(i, u, j) denote the transition reward and t(i, u, j) denote the transition time.
Then update total reward and total time as follows.
total reward ← total reward + r(i, u, j);
total time ← total time + t(i, u, j).
Step 4: Increment k by 1. Set i ← j. If k < ITERMAX, return to Step 2. Otherwise, calculate the average reward of the policy learned in the learning phase as
ρ ← total reward / total time,
and STOP.
The value of ρ^k in the learning phase can also be used as an estimate of the actual ρ while the learning is on. But typically a frozen phase is carried out to get a cleaner estimate of the actual average reward of the policy learned.
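A matching sketch of the frozen phase, under the same hypothetical simulate(i, a) interface, simply runs the greedy policy with the Q-factors held fixed and reports total reward divided by total time.

```python
def rsmart_frozen_phase(simulate, actions, Q, start_state, iter_max=100_000):
    """Frozen phase (sketch): estimate the average reward of the learned policy."""
    total_reward, total_time, i = 0.0, 0.0, start_state
    for _ in range(iter_max):
        u = max(actions[i], key=lambda a: Q[(i, a)])   # greedy w.r.t. the frozen Q
        j, r, t = simulate(i, u)
        total_reward += r
        total_time += t
        i = j
    return total_reward / total_time
```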
Steps in the SSP-version of R-SMART:
We now present the SSP-version of the algorithm, where SSP stands for the stochastic shortest-path problem, a problem that we do not consider in this tutorial. The problem we solve is not an SSP; the algorithm will use a connection to the SSP to solve our average reward problem. Thus, this SSP-version is essentially going to solve the average reward problem that we have considered above. The algorithm’s steps are identical to those of the CF-version described above with two differences:
Step 1: There is no need to select a value for η because η is not needed in this algorithm. Further, choose any state in the system to be a distinguished state, and denote it by i*.
The main update in Step 3 will be as follows:
\[
Q(i, a) \;\leftarrow\; (1 - \alpha^k)\, Q(i, a) \;+\; \alpha^k \left[ r(i, a, j) \;-\; \rho^k\, t(i, a, j) \;+\; I(j \neq i^*) \max_{b \in A(j)} Q(j, b) \right].
\]


In the above, I(j ≠ i*) is the indicator function whose value is calculated as follows: I(j ≠ i*) = 1 when j ≠ i* and I(j ≠ i*) = 0 when j = i*. Note that the indicator function replaces the η used in the CF-version.
Note that, in practice, the CF-version of R-SMART often outperforms the SSP-version, although the SSP version can be shown to converge under milder conditions.
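In code, the only change from the CF-version sketch given earlier is the target used in Step 3; a hedged one-function sketch:

```python
def ssp_step3_update(Q, actions, i, a, j, r, t, rho, alpha, i_star):
    """Step 3 of the SSP-version of R-SMART (sketch): the indicator
    I(j != i_star) replaces the contraction factor eta."""
    indicator = 1.0 if j != i_star else 0.0
    target = r - rho * t + indicator * max(Q[(j, b)] for b in actions[j])
    Q[(i, a)] = (1.0 - alpha) * Q[(i, a)] + alpha * target
```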
3.2 Selecting the learning rate or step size
The learning rate (α or β) should be a positive value, typically less than 1. It is also a function of k. We need to ensure some conditions on these rules for RL within simulators; these are described in detail in [1]. A commonly used example of a step-size rule is:
\[
\alpha^k = \frac{a}{b + k},
\]
where for instance a = 90 and b = 100. Another rule is the log rule, log(k)/k (make sure k starts at 2; otherwise use log(k + 1)/(k + 1) for k starting at 1), studied in [6]. The log rule converges to 0 more slowly than the a/(b + k) rule for many values of a and b.
The rule a/(b + k) with a = 1 and b = 0, i.e., 1/k, may not lead to good behavior [6], although it is still widely used and was also used by this author in some of his own prior work. (We, obviously, do not recommend it any more, especially after the findings in [6] in the context of Q-Learning.)
When we have two step sizes, as in R-SMART above, we must make sure that β converges to 0 faster than α. We can, for example, use α^k = log(k)/k and β^k = 90/(100 + k). Ideally, as k tends to ∞, β^k/α^k should tend to 0.
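The rules above are easy to write down as functions; a small sketch:

```python
import math

def step_ratio(k, a=90.0, b=100.0):
    """The a/(b + k) rule, e.g., 90/(100 + k)."""
    return a / (b + k)

def step_log(k):
    """The log rule; shifted to log(k + 1)/(k + 1) so it is defined from k = 1."""
    return math.log(k + 1) / (k + 1)

# Example usage inside a learning loop (k is the iteration count):
# alpha = step_log(k)     # main learning rate
# beta  = step_ratio(k)   # secondary learning rate for R-SMART
```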
3.3 Discounted reward
For discounted reward, the learning phase is sufficient. The steps we describe are from the famous Q-Learning algorithm of Watkins [11]. They apply to the MDP; we will discuss the SMDP extension later.
• Step 1: Set the Q-factors to some arbitrary values (e.g., 0), that is:
Q(i, a) ← 0 for all i and a.
Let A(i) denote the set of actions allowed in state i. Let α^k denote the main learning rate in the kth iteration. Set k = 1. ITERMAX will denote the number of iterations for which the algorithm is run; it should be set to a large number.
Step 2: Select an action a in state i with probability 1/|A(i)|, where |A(i)| denotes the number of elements in the set A(i). Simulate action a.
Step 3: Let the next state be denoted by j. Let r(i, a, j) denote the transition reward
and t(i, a, j) denote the transition time. Then update Q(i, a) as follows:
\[
Q(i, a) \;\leftarrow\; (1 - \alpha^k)\, Q(i, a) \;+\; \alpha^k \left[ r(i, a, j) \;+\; \gamma \max_{b \in A(j)} Q(j, b) \right],
\]


where you should compute α^k using one of the rules discussed above. Also, γ denotes the discount factor.
Step 4: Increment k by 1. Set i ← j. If k < ITERMAX, return to Step 2. Otherwise, for each state i, declare the action for which Q(i, ·) is maximum to be the optimal action, and STOP.
Note that in the algorithm above, there is no decay of exploration. This is because in Q-Learning, the exploration need not decay. However, in practice, to get the algorithm to converge faster, people do employ decay of exploration. It is only recommended in online applications and not in simulators.
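For completeness, here is a compact sketch of the steps above (Python, illustrative only), again assuming a hypothetical simulator interface, this time simulate(i, a) returning the next state j and the transition reward r:

```python
import random
from collections import defaultdict

def q_learning(simulate, states, actions, gamma=0.8, iter_max=100_000, seed=0):
    """Q-Learning for a discounted-reward MDP (sketch of the steps above)."""
    random.seed(seed)
    Q = defaultdict(float)                            # Step 1
    i = states[0]
    for k in range(1, iter_max + 1):
        alpha = 90.0 / (100.0 + k)                    # any rule from Section 3.2
        a = random.choice(actions[i])                 # Step 2: uniform exploration
        j, r = simulate(i, a)                         # Step 3: observe the feedback
        target = r + gamma * max(Q[(j, b)] for b in actions[j])
        Q[(i, a)] = (1.0 - alpha) * Q[(i, a)] + alpha * target
        i = j                                         # Step 4
    return {s: max(actions[s], key=lambda a: Q[(s, a)]) for s in states}
```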
For the SMDP extension, you can use the following update in Step 3 of the above algorithm (see pages 245-246 of [3] for more details):
\[
Q(i, a) \;\leftarrow\; (1 - \alpha^k)\, Q(i, a) \;+\; \alpha^k \left[ r(i, a, j) \;+\; e^{-\mu\, t(i, a, j)} \max_{b \in A(j)} Q(j, b) \right],
\]
where the exponential term arises as follows:
\[
\gamma^{\tau} \;=\; \left(\frac{1}{1 + \mu}\right)^{\tau} \;\approx\; e^{-\mu \tau},
\]
in which τ denotes the time and the discounting mechanism was explained above in Equation
(3). Note that in the above approximation to obtain the exponential term, we use the fact that µ is quite small.
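The SMDP variant of the update differs only in how the next-state term is discounted; a short sketch of that single step:

```python
import math

def smdp_q_update(Q, actions, i, a, j, r, t, alpha, mu):
    """Discounted-reward Q-Learning update for the SMDP (sketch):
    the next-state term is discounted by exp(-mu * t(i, a, j))."""
    discount = math.exp(-mu * t)
    target = r + discount * max(Q[(j, b)] for b in actions[j])
    Q[(i, a)] = (1.0 - alpha) * Q[(i, a)] + alpha * target
```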
Note on the vanishing discount approach: You can actually use a discounted RL algorithm to solve the average reward problem via the vanishing discount approach. In this heuristic approach, one uses a discounted algorithm with γ very close to 1 (for MDPs) and µ very close to 0 (for SMDPs). This can work very well in practice.
3.4 Codes
Some MATLAB codes for these algorithms have been placed on the following website [9]:
http://web.mst.edu/~gosavia/mrrl_website.html
Codes in C have been placed at the following website:
http://web.mst.edu/~gosavia/bookcodes.html
4 Example
We end with a simple example from Gosavi [3]. Figure 1 shows a simple MDP; the legend in the figure explains the transition rewards and probabilities. This is an average reward problem on which you could use R-SMART. The optimal policy for this MDP is: action 2 in state 1 and action 1 in state 2. The average reward of the optimal policy is 10.56.
5 Conclusions
The tutorial presented above showed you one way to solve an MDP, provided you have a simulator of the system or you can actually experiment in the real-world system. The transition probabilities of the state transitions were not needed in this approach; this is its most attractive feature.
We did not discuss what is to be done for large-scale problems; that is beyond the scope of this tutorial. What was discussed above is called the lookup-table approach, in which each Q-factor is stored explicitly (separately). For large-scale problems, it is clearly not possible to store the Q-factors explicitly because there are too many of them. Instead, one stores a few scalars, based on so-called basis functions, which on demand can generate the Q-factor for any state-action pair. Function approximation, when done improperly, can become unstable [3].
We can provide some advice to beginners in this regard. First, attempt to cluster states so that you obtain a manageable state space for which you can actually use a lookup table.
[Figure 1: A two-state MDP. Each arc is labeled (a, p, r), where a = action, p = transition probability, and r = immediate reward. The arc labels in the figure are (1, 0.7, 6), (2, 0.8, 13), (1, 0.3, -5), (2, 0.1, 17), (1, 0.4, 7), (2, 0.2, -14), (2, 0.9, 10), and (1, 0.6, 12).]
In other words, divide the state space for each action into grids and use only one Q-factor for all the states in each grid. The total number of grids should typically be a manageable number, such as 10,000. If this does not work well, produce a smaller number of grids but use an incremental Widrow-Hoff algorithm, that is, a neuron (see [3]), in each grid. If you prefer using linear regression, go ahead, because that will work just as well. If this works well and you want to see additional improvement, do attempt to use neural networks, either neurons or those based on back-propagation [10]. It is with function approximation that you can scale your algorithm up to realistic problem sizes.
I wish you all the best for your new adventures with RL, but cannot promise any help with homework or term papers —sorry :-(
References
[1] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[2] S. J. Bradtke and M. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, MA, USA, 1995.
[3] A. Gosavi. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning. Springer, New York, NY, 2014. http://web.mst.edu/~gosavia/book.html
[4] A. Gosavi. Target-Sensitive Control of Markov and Semi-Markov Processes. International Journal of Control, Automation, and Systems, 9(5):1-11, 2011.
[5] A. Gosavi. Reinforcement Learning for Long-Run Average Cost. European Journal of Operational Research, 155:654-674, 2004. Available at: http://web.mst.edu/~gosavia/rsmart.pdf
[6] A. Gosavi. On Step Sizes, Stochastic Shortest Paths, and Survival Probabilities in Reinforcement Learning. In Proceedings of the Winter Simulation Conference, 2009. Available at: http://web.mst.edu/~gosavia/wsc_2008.pdf
[7] A. Gosavi. Reinforcement Learning: A Tutorial Survey and Recent Advances. INFORMS Journal on Computing, 21(2):178-192, 2009. Available at: http://web.mst.edu/~gosavia/joc.pdf
[8] R. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, USA, 1998.
[9] MATLAB repository for Reinforcement Learning, created by A. Gosavi at Missouri University of Science and Technology. http://web.mst.edu/~gosavia/mrrl_website.html
[10] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University, USA, 1974.
[11] C. J. Watkins. Learning from Delayed Rewards. Ph.D. thesis, King's College, Cambridge, England, May 1989.

Tuesday, October 13, 2015

Reinforcement Learning - Flappy Bird. Brian Gatt. + 3D Flappy Bird with Reinforcement Learning

source: http://www.brian-gatt.com/portfolio/flappy/documentation.pdf


Department of Computing - MSc. Computer Games and Entertainment
IS71027B – AI for Games
Dr. Jeremy Gow
Term 2 - Assignment 1 (2013-2014)
Reinforcement Learning - Flappy Bird
Brian Gatt (ma301bg)
28 March 2014


Introduction
The following is the documentation for an application of reinforcement learning to Dong Nguyen’s popular game ‘Flappy Bird’. An overview of the attached deliverable is provided, explaining the aims, background and implementation details. The appendices section contains evidence of the resulting program.
Aims and Objectives
The main aim of this project is to apply and implement the AI technique of reinforcement learning. Dong Nguyen’s popular ‘Flappy Bird’ game was chosen as a test bed to implement reinforcement learning in a simple game context. The following quote is the original project proposal which summarises the intent of the project:
The project will consist of a ‘Flappy Bird’ implementation using the Unity game engine and C#. The bird will be automatically controlled using reinforcement learning, in particular Q-Learning. The deliverable will provide a visualization of the learning process (on-line learning) with accompanying statistics in a text file format. Certain attributes will be made available for modification via the Unity editor interface.
In the end, all of the intended objectives were achieved while also allowing for on-line learning from player input.
Background
Reinforcement Learning
Reinforcement learning is a machine learning technique which is prevalent in today’s modern games.
Agents learn from their past experience which in turn allows them to better judge future actions. This is achieved by providing the agent with a reward for their actions in relation to the current world state.
Reinforcement learning is based on three important factors:
1.The reward function – A function which rewards or punishes (perceived as a negative reward) the agent for the action he just performed.
2.The learning rule – A rule which reinforces the agent’s memory based on the current experience and reward.
3.The exploration strategy – The strategy which the agent employs in order to select actions.
The reward function, learning rule and exploration strategy generally depend on the current world state; a set of variables which define what the learning rule will take into consideration to adapt.
One popular implementation of reinforcement learning is Q-learning. It is a learning rule which evaluates Q-values, values which represent the quality score of a state-action tuple. The learning rule returns a Q-value by blending the prior state-action Q-value and the best Q-value of the current state. This is represented as:
\[
Q(s, a) = (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)
\]
The variable α is the learning rate which linearly blends the Q-values, r is the perceived reward, and γ is the discount factor which defines “how much an action’s Q-value depends on the Q-value at the state (or states) it leads to.” (Millington & Funge, 2009). Both α and γ are bound between 0 and 1 inclusive.
Reinforcement learning requires time in order to train to a state where agents can react in reasonable manners. This all depends on the parameters chosen for the learning rule, the rewards, and how the world is encoded (discretizing the world state in a reasonable manner). The data structures used to store this information can also hinder the experience.
Design
The design for this implementation closely followed GitHub user SarvagyaVaish’s implementation (Vaish, 2014).
Vaish uses Q-Learning in order to train the ‘Flappy Bird’ character using a reward function which rewards the bird with a score of 1 on each frame the bird is alive. Once the bird dies, the reward function heavily punishes with a score of -1000. The learning state is defined as the horizontal and vertical proximity between the bird and the upcoming pipe hazard. The bird is allowed to do any of the available actions (Do Nothing, Jump) at any time during the simulation.
We took some liberties in relation to Vaish’s implementation in order to make the implementation easier and adapt it to the Unity engine. One clear example being the data structures used to store the character’s experience. Vaish uses a multi-dimensional fixed size array based on the maximum distances between the bird and the pipes on the horizontal and vertical axis. He then creates a basic hash mechanism and stores Q-values in this data structure, overwriting values which exceed the minimum and maximum in the lowest or greatest element respectively. In our case, we use a simple map or dictionary and store the experiences accordingly.
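Conceptually, a coalesced store simply keys the Q-values on the (state, action) pair. The following is a language-agnostic sketch of that idea in Python (the actual deliverable implements it as a C# class inside Unity, and the names here are illustrative only):

```python
class CoalescedQValueStore:
    """Sketch of a Q-value store keyed on the coalesced (state, action) pair."""

    def __init__(self, default=0.0):
        self._table = {}          # {(state, action): q_value}
        self._default = default

    def get(self, state, action):
        return self._table.get((state, action), self._default)

    def set(self, state, action, value):
        self._table[(state, action)] = value

    def best_action(self, state, actions):
        """Return the stored action with the highest Q-value in `state`."""
        return max(actions, key=lambda a: self.get(state, a))
```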
Implementation
Following are some details on how to configure the AI parameters and how the implementation is defined in terms of the major Unity game objects, components, and behaviours.
Scene
The scene mainly consists of the main camera, the ‘Flappy Bird’ character and hazard spawn points. The ‘GameController’ game object is an empty game object which is used to control the overall game. The ‘Destroyer’ object is used to destroy previously instantiated hazards in order to keep the game object instance count to a minimum. The ‘Cover Up’ game object is essentially a cover layer which hides what is occurring behind the scenes so that users are not able to see hazard spawn points and destruction points. Finally, the ‘ceiling’ game object was necessary to avoid cheating. During implementation, there were cases where the bird was exploring areas beyond its intended space, allowing him to stay alive yet not achieving a better score since he was, literally, jumping over the pipes.
Configuration
AI parameter configuration is achieved by modifying the variables within the ‘Player Character’ script component attached to the ‘Flappy Bird’ game object. Below is a screenshot of the mentioned component.
Figure 1: ‘Flappy Bird’ Configuration
The ‘Controller Type’ variable lets the user choose between ‘InputController’, ‘AIController’ and
‘HybridController’. The ‘InputController’ is an implementation artifact which was used to test the game mechanics; it does not learn and only reacts to user input. The ‘AIController’ is the reinforcement learning controller strategy which learns and acts on its own. The ‘HybridController’ is an extension of the ‘AIController’ which learns but also allows the user to provide input, so that the learning algorithm can learn from a different source.
The Intelligence properties expose the learning parameters previously mentioned in the ‘Background’ and ‘Design’ sections. Note that the ‘Saved State’ field is an implementation artifact. It was intended to allow learning to be resumed across sessions; alas, encoding the state of the learning algorithm within the log file became unwieldy due to excessive file sizes, so it was abandoned. The ‘Precision’ field specifies how many decimal places of the bird-pipe proximity are taken into consideration.
The ‘GameController’ game object exists solely to host the ‘Game Controller’ script controller which hosts minor configuration parameters for the overall game. Below is a screenshot of the mentioned component:
Figure 2: ’GameController’ Overall Configuration
Please note that the ‘Best Score’ field does not affect the game per se but is only there as a visual cue to keep up with the best score recorded.
To speed up the simulation, simply modify the Unity time engine parameter from ‘Edit’ – ‘Project Settings’ – ‘Time’ and modify the ‘Time Scale’ parameter.
Classes and Interfaces
Following is a brief overview of the major classes and interfaces which compose the overall implementation.
GameController
Manages the overall game flow by managing the state of the game and restarting accordingly. It is based on the singleton pattern and is persisted across scene loads. It stores the state of the AI algorithms on death of the player and restores them once the level is initiated.
PlayerCharacter
Contains the behaviour of the player character. The strategy pattern is used to represent the controller implementations and is switched on scene load accordingly. The update event delegates to the underlying controller implementation.
AIController
A Controller implementation which uses the Q-Learning algorithm to store and base the actions of the player character.
CoalescedQValueStore
An IQValueStore implementation. This follows Millington’s and Funge’s (Millington & Funge, 2009) recommendations by coupling the state and the action as one entity. This implementation follows the QValueStore implementation which was used in earlier stages of development. We were initially afraid that the original version had multiple hash collisions so we implemented the coalesced version which provided better, more stable, results.
QLearning
The implementation of the Q-Learning algorithm according to Millington and Funge.
User Manual
Prerequisites
Please ensure that Unity is installed on your system. This project was developed and tested using Unity version 4.3.
Launching the Project
In order to launch the project, double-click on the ‘MainScene’ scene file located in ‘Assets/Scenes/’. Alternatively, open Unity and via the ‘Open Project…’ menu item, navigate to the top-level directory of the project and launch it from there.
Initiating the Simulation
In order to start the simulation, simply click the ‘play’ button in the Unity editor. Ensure that prior to starting the simulation, the parameters are set up correctly. Certain parameters can also be modified mid-run. It is recommended that for mid-run parameter modification, the simulation is paused via the ‘pause’ button in the Unity editor so that it is easier to modify AI parameters accordingly.
Evaluation
Based on our implementation and the generated log files (refer to ‘Appendices’), it takes time for the character to learn the problem. Important elements which define the learning algorithm are the space quantization parameters and the underlying data structures used to store the experiences. Imprecise space quantization can lead to faster results, but the character will start to generalize quickly. On the other hand, precise values lead to longer training but more fine-tuned results. Following is a chart which shows the best run we managed to achieve (refer to the attached ‘FlappyBirdRL.best.1x.csv’ for a detailed overview):


[Chart: ‘Final Scores VS Run Instances’ — final score per run plotted against run instance (y-axis 0 to 200, x-axis 0 to 1800).]
Whether or not reinforcement learning is useful for this type of application is debatable. According to one of the comments on the SarvagyaVaish repository, ‘Flappy Bird’ uses a deterministic physics model where both the jump height and width can easily be computed.
Figure 3: Flappy Bird's deterministic physics model (Vaish, 2014)
Based on these two values, a simpler scheme can be used. Another project (Jou, 2014) shows a different implementation of this concept, focusing on computer vision. We believe that this project exploits the deterministic physics model in order to achieve its results.
Conclusion
Reinforcement learning is an AI technique which allows agents to learn and adapt via a reward-punishment system. This technique was implemented and applied to Dong Nguyen’s ‘Flappy Bird’ and its implications evaluated. During development, testing was continuously performed in order to ensure that the requirements were met and the deliverable is of a high quality. The appendices section shows a running demonstration of the artefact.
References
Jou, E. (2014, February 24). Chinese Robot Will Decimate Your Flappy Bird Score. Retrieved from Kotaku: http://kotaku.com/chinese-robot-will-decimate-your-flappy-bird-score-1529530681
Millington, I., & Funge, J. (2009). Artificial Intelligence for Games. Morgan Kaufmann.
Vaish, S. (2014, February 15). Flappy Bird RL. Retrieved February 21, 2014, from Github Pages: http://sarvagyavaish.github.io/FlappyBirdRL/
Appendices
List of Included Log files
FlappyBirdRL.3.x1 – Sample log file (using non-coalesced QValueStore).
FlappyBirdRL.4.x1 – Sample log file using different parameters (using non-coalesced QValueStore).
FlappyBirdRL.coalesced.x4 – The log file generated from a 4x speed up when using the
CoalescedQValue store.
FlappyBirdRL.video.x2 – The log file generated by the run which is shown in the video linked below.
FlappyBirdRL.best.x1 – The log file generated by what we consider, the best (and longest) run, in which 176 pipes are recorded as the best score when using the CoalescedQValue store.
Video Demonstration
The video shows a demonstration of the attached deliverable. The game was sped up by a factor of 2 in order to keep footage short while showing the concept of reinforcement learning applied to ‘Flappy Bird’.
Screenshots
Figure 4: Screenshot
Figure 5: Screenshot
Figure 6: Screenshot



================================================================










3D Flappy Bird with Reinforcement Learning

Hang Qi
Jian Gong
Lunbo Xu

hangqi@cs.ucla.edu
jiangong@cs.ucla.edu
lunbo_xu@cs.ucla.edu



ABSTRACT

In this work, we present a reinforcement learning method that allows an artificial bird to survive in a randomly generated environment, without colliding with any obstacles, by deciding when to flap its wings. The setting is based on a very popular mobile game called Flappy Bird. We implemented it in 3D with WebGL, making it easy for the public to access and experiment with it in a browser. The bird is given two different perception models: (1) instant fovy, and (2) fovy with short-term memory. In both models, the perception state is represented by a 16 × 1 vector consisting of the coordinates of the two pillars in front of the bird. A reinforcement learning algorithm, Q-learning, is used to learn a decision policy, powered by which the bird is able to survive by repeatedly deciding which action to perform given its current perception state. The experiments show our Q-learning algorithm allows the bird to survive for more than 200 pillars.

Keywords
Artificial Life, Reinforcement Learning, Flappy Bird

1. INTRODUCTION

Recently, as the mobile game app Flappy Bird hit the market and became popular, people complained about how hard it is to prevent the bird from crashing into the pillars. As humans try hard to adapt to the physics of the game, we found it promising to use reinforcement learning [8] algorithms to let the bird learn a decision rule by itself through interactions with the environment.
In this work, we model the physics in a graphics environment, simulate the perception of the bird by feeding in the relative coordinates of selected key points of the pillars, and quantize the relative position between the bird and the pillars into a state space of reasonable size. A decision policy between jumping and not jumping at each state is learned using the Q-learning algorithm [10].
In this report, we briefly discuss the motivation and related work in section 2. Section 3 discusses our formulation of the world and the learning problem. Section 4 presents the learning algorithm we used to learn the decision policy. Tools and implementation details are presented in section 5. Finally, sections 6 and 7 include our experiments and discussions of future directions.

2. RELATED WORK

Recently a mobile game app called Flappy Bird hit the market and became very popular on the Android and iOS platforms. The game was originally created by Nguyen Ha Dong, a Vietnamese developer. In this game, a player can make the bird jump by tapping the touch screen. The goal is to let the bird fly as far as possible between columns of green pipes without coming into contact with them. A plethora of Flappy Bird copies and clones emerged, such as Squishy Bird [4], Clumsy Bird [5], and even an MMO version [1]. The popularity of the game originates not only from its simplicity, but also from the challenge it poses to a human's adaptiveness to its dynamics.
Although the rules of the game are simple, it is agreed to be challenging to make the bird fly over more than ten columns of pipes. It requires the player to keep sending commands to the bird precisely at the correct moments, with high concentration and fast reactions.
Some work has been done to make the bird intelligent enough to fly by itself. FlappyBirdRL [9] is one such work, which used a 2D model and reinforcement learning to teach the virtual bird to fly. It assumes that the bird fully knows the configuration of the environment, including obstacles out of reach of the bird's sight, which is obviously not true for a real bird.
Thus, in our project, we try to overcome this limitation by explicitly modeling the bird's field of view. In our 2.5D world, we define the field of view of the bird in the vertical direction, which means the bird can only see things that are within a certain angle of its sight. In particular, any column that is too high or too low will fall outside of the bird's field of view. The detailed modeling method is discussed in the following section.
Figure 1: Occluded points or those not in the fovy (red points) are unknown to the bird. The bird knows only what it sees (green points) with instant fovy perception. The field of view of the bird is represented by white rays. The RGB lines are the axes of the world coordinate system.
3. FORMULATION
In this section, we introduce our modeling and formulation of the world.
3.1 World
Although the bird is flying in a 3D world, the space is essentially represented in 2D. In our 2D model, the bird is represented by its bounding box, a W_b × H_b rectangle. The bird is given a constant horizontal velocity v_x to fly, whereas the vertical velocity v_y is controlled by gravity by default. Whenever the bird chooses to perform a jump action, a vertical velocity increment ∆v_y is added.
Cylindrical pillars are represented by rectangles as well. The y positions of the gap between two pillars in the same column are generated randomly. The bird is dead, and therefore the game is over, when the bounding box of the bird overlaps with a pillar. To prevent the game from being too difficult, however, the gap between pillars in the same column, the distance between two adjacent columns, and the width of the pillars are fixed to constant values (yellow lines in Figure 1).
3.2 Perception
To learn a decision policy, the input to the learning algorithm is, of course, the perception of the bird. However, simulating real vision as a full-size image directly would have introduced too much complexity in segmentation, triangulation, 3D reconstruction, and vision feature extraction. Given that we want to focus our work on reinforcement learning, we consider the perception in two scenarios based on different assumptions.
3.2.1 Instant Fovy
In the first scenario, we assume that the bird can only see the two nearest columns of pillars and cannot see occluded corners. Each column is abstracted as four key points (i.e., corners), as shown in Figure 1. Hence, eight coordinates relative to the bird's eye position, {(x_i, y_i)} for i = 1, ..., 8, are used to represent the vision. Points that are occluded or not in the fovy (i.e., field of view y) are marked unknown to the bird by setting x = 1 explicitly: (1, 1) is used to represent the occluded corner above the bird, whereas (−1, 1) is used for points below the bird.
Figure 2: With short-term memory, the bird not only knows what it sees at the current position (green points), but also remembers what it saw (yellow points).
Figure 3: From a given state s, the bird can perform two actions (1) jump and (2) not jump, each of which leads the bird to a new state.
3.2.2 Fovy with Short-term Memory
In the second scenario, in addition to perception, we assume the bird has a short-term memory, so that it remembers the existence of a corner once it has seen it. In this case, we can simply give all four corners to the bird without marking any occlusion explicitly, since the bird will see all the points anyway as it jumps up and down; see Figure 2.
3.3 States and actions
Given the above discussion on perception, we can represent the state of the bird by a 16 × 1 vector
s = (x_1, y_1, x_2, y_2, ..., x_8, y_8)
consisting of the coordinates (relative to the bird's eyes) of the eight key points of the nearest two pillars. Real numbers are rounded to their nearest tenth to limit the size of the state space. It is clear that in the case of instant fovy, the state space is smaller due to the occlusion marks.
An action leads to the transition from one state to another. In our configuration, the bird can only perform two actions: (1) jump, (2) not jump, as shown in Figure 3. Given the current state s_t, the next state s_{t+1} can easily be obtained by computing the movement of the bird according to the action a_t performed.
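A small sketch (Python, illustrative only) of the state encoding just described: the eight key-point coordinates are expressed relative to the bird's eye position and rounded to the nearest tenth, so that perceptually similar situations collapse onto the same hashable state key.

```python
def encode_state(key_points, bird_eye):
    """Build the 16-component state vector from the eight key points.

    `key_points` is a list of eight (x, y) pairs and `bird_eye` is the bird's
    eye position; coordinates are made relative to the bird and rounded to
    the nearest tenth to keep the state space finite.
    """
    bx, by = bird_eye
    state = []
    for x, y in key_points:
        state.append(round(x - bx, 1))
        state.append(round(y - by, 1))
    return tuple(state)   # hashable, so it can index the Q-table directly
```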

4. LEARNING

To let the bird survive in the game, we want to learn a decision policy a_t = π(s_t) which decides an action a_t for each state s_t so as to maximize the total reward accumulated over time t:
\[
\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1}),
\]
where γ is a discount factor and R_{a_t}(s_t, s_{t+1}) is the reward gained from action a_t, which causes the state transition from s_t to s_{t+1}.
If we knew the exact transition probabilities P_a(s, s') and the values of the states V(s'), we could solve for the optimal policy that maximizes the expected reward from the following equation:
\[
\pi(s) = \arg\max_{a} \left\{ \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V(s') \right) \right\}.
\]
However, in our game environment, we assume the bird has no prior knowledge about the world, so the transition probabilities and rewards are unknown to us. This leads to the use of reinforcement learning [8].
In the framework of Q-learning [10], a specific reinforcement learning algorithm, we want to learn a policy in the form of a quality table
\[
Q : S \times A \to \mathbb{R},
\]
which maps a state s ∈ S and an action a ∈ A to a real number that measures the reward of performing action a at state s. Once this quality table is learned, the decision can easily be made at any given state s as follows:
\[
a = \arg\max_{a \in A} Q(s, a).
\]
Starting from an arbitrary Q, the Q-learning algorithm updates the table iteratively using the equation
\[
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma F - Q_t(s_t, a_t) \right],
\]
where F = max_a Q_t(s_{t+1}, a) is the maximum expected future reward at state s_{t+1}, α is the learning rate controlling the extent to which we forget the history, and γ is the discount factor for the expected future reward. This equation essentially says that an action a_t will get a higher value if it lands in a state s_{t+1} that has high reward and we are not expected to die fast at that state. It has been proved that the Q-learning algorithm is guaranteed to converge to the optimal decision policy.
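Written as a single incremental step (a sketch; α, γ and the reward convention are as defined above), the update is:

```python
def q_update(Q, s_t, a_t, reward, s_next, actions, alpha, gamma):
    """One Q-learning step: Q <- Q + alpha * (R + gamma * F - Q),
    where F is the best Q-value available at the next state."""
    F = max(Q.get((s_next, a), 0.0) for a in actions)
    q_old = Q.get((s_t, a_t), 0.0)
    Q[(s_t, a_t)] = q_old + alpha * (reward + gamma * F - q_old)
```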

5. GRAPHICS

This project is implemented with WebGL, which enables the rendering of complex graphics on an HTML5 canvas with JavaScript. Our project can run efficiently and directly in the browser without the help of any other plugins. This makes it easy for the public to access our project.
In this section, we'll introduce the major development tools and some of the graphics techniques used in this project. All the graphics-related functions and APIs are wrapped in the graphics.js file.

5.1 Tools

Three.js [2] is a lightweight cross-browser JavaScript library/API used to create and display animated 3D computer graphics in a Web browser. Three.js scripts may be used in conjunction with the HTML5 canvas element, SVG or WebGL. It greatly sped up our development process.
Figure 4: The background scene is built with a skybox. It consists of six properly aligned images.
Blender [3] is a free and open-source 3D creation suite, which is used in this project for building and converting 3D models, including the Greek columns^1 and the flapping bird^2.
Chrome Developer Tools are used to assist the interface design and most importantly, to debug JavaScript functions and APIs.

5.2 Scene Rendering

The background scene in the game is created with the SkyBox technique [6]. Cube mapping is performed to project images onto the different faces of a cube. As shown in Figure 4, the images are properly aligned on the box. As the cube moves with the camera, it creates an illusion of distant three-dimensional surroundings. In our implementation, a separate camera is used to build the skybox^3.
The animation in the scene is mainly controlled by the reinforcement learning module, which decides the exact moment the bird needs to make a jump. Whenever the bird jumps, the bird model plays a predefined flapping animation.
The point light is placed close to the setting sun in the background. We also adjusted the colors of both the point light and the ambient light to render the scene more realistically.

6. RESULTS

Under the first scenario, where we assume the bird is aware only of points in the fovy, the bird's intelligence evolves faster due to the limited size of the state space. Figure 5 plots the score against generation. Although the training is fast, after generation 16 it starts to converge to non-promising results. We believe this "degeneration" is due to the limitations of the features used for reinforcement learning and possible contradictions in the underlying randomly generated world.
In the second senario where the bird is given both fovy and short-term memory, the score is ploted against generation in Figure 6. In this case, our bird is able to y further at its best performance. However, the training process takes much more time due to the large state space.
Figure 5: Reinforcement learning result when the bird is aware only of unoccluded points in the fovy.
Figure 6: Reinforcement learning result when the bird is aware of all eight points via its sight and short-term memory.
¹The Greek column model is downloaded from http://archive3d.net/?category=555
²The bird model is created by mirada from ro.me [7].
³Images in the skybox are created by Jochum Skoglund.
For the sake of easier debugging, we also implemented a quick training mode in which we can train without visualizing the flying bird. Quick training takes much less time because it skips real-time graphics rendering: it takes only five minutes to quick-train the bird to generation 18,000. However, it is interesting to witness the bird's growth from scratch (generation 0). Related video clips can be found on our project webpage.
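A minimal sketch of what such a render-free quick-training loop might look like is shown below; every function name in it is hypothetical and stands in for the project's actual game and learning code.

```javascript
// Hypothetical quick-training loop: step the simulation and the learner as fast as possible,
// skipping requestAnimationFrame and all rendering work.
function quickTrain(generations) {
  for (var g = 0; g < generations; g++) {
    resetWorld();                                    // new randomly generated world
    while (!birdIsDead()) {
      var state  = extractState();                   // fovy points (+ short-term memory in scenario 2)
      var action = chooseAction(state);              // greedy / epsilon-greedy over the Q-table
      stepPhysics(action);                           // advance the game without drawing anything
      updateQ(state, action, currentReward(), extractState(), ACTIONS);  // tabular Q-learning update
    }
  }
}
```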

7. FUTURE WORK

Our future work includes implementing a real 3D environment and modifying the training method. First, instead of flying up and down only, we want to enable the bird to turn left and right. The features extracted for a 3D game environment are more complicated than those we use for the 2D game model.
Besides this, we also want to improve the feature extraction for reinforcement learning. First, we need to add the vertical velocity vy of the bird to the state space, since it is natural for the bird to be aware of its own velocity, and we strongly believe that some failed training runs were due to unawareness of this parameter. In addition, we would like to set a constant time interval, e.g. one second, between two consecutive decisions to prevent the bird from flying straight up.

8. ACKNOWLEDGEMENT

We want to give special thanks to Professor Terzopoulos, who gave us a great course about artificial life and an overview of related techniques. We had plenty of freedom while doing this interesting project and have learned a lot during the process.

9. REFERENCES

[1] Flappy Bird Massively Multiplayer Online. http://flapmmo.com/, Feb. 2014.
[2] R. Cabello. Three.js. http://www.threejs.org/, Apr. 2010.
[3] Blender Foundation. Blender: Open source 3D graphics and animation software. http://www.blender.org/, 1995.
[4] R. Games. Splashy Fish. https://play.google.com/store/apps/details?id=it.junglestudios.splashyfish, Jan. 2014.
[5] E. Leão. A MelonJS port of the famous Flappy Bird Game. https://github.com/ellisonleao/clumsy-bird, Jan. 2014.
[6] E. Meiri. Tutorial 25 - SkyBox. http://ogldev.atspace.co.uk/www/tutorial25/tutorial25.html, Oct. 2010.
[7] Mirada. Dynamic Procedural Terrain Using 3D Simple Noise. http://alteredqualia.com/three/examples/webgl_terrain_dynamic.html.
[8] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
[9] S. Vaish. Flappy Bird hack using Reinforcement Learning. https://github.com/SarvagyaVaish/FlappyBirdRL, Feb. 2014.
[10] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 

Saturday, October 10, 2015

UN delay could open door to robot wars, say experts

http://www.theguardian.com/science/2015/oct/06/autonomous-weapons-un-delay-robot-wars


Observers say UK and US are seeking to water down agreement so that any autonomous weapons deployed before talks conclude will be beyond reach of ban

A Sentry Tech remote-controlled weapon station. Photograph: Supplied
Harriet Grant
Tuesday 6 October 2015 06.00 BST Last modified on Tuesday 6 October 2015 19.20 BST

The United Nations has been warned that its protracted negotiations over the future of lethal autonomous weapons – or “killer robots” – are moving too slowly to stop robot wars becoming a reality.

Lobbying for a pre-emptive ban on the weapons is intensifying at the UN general assembly in New York, but a deal may not emerge quickly enough to prevent devices from being deployed, experts say.

“There is indeed a danger now that [the process] may get stuck,” said Christof Heyns, the UN special rapporteur on extrajudicial, summary or arbitrary executions.

Video: Robot swarms: will they bomb us or save us? A look at the evolution and development of intelligent robots that can ‘think’ and coordinate themselves.
“A lot of money is going into development and people will want a return on their investment,” he said. “If there is not a pre-emptive ban on the high-level autonomous weapons then once the genie is out of the bottle it will be extremely difficult to get it back in.”

Observers say the UK and US are seeking to water down an agreement so that it only includes emerging technology, meaning that any weapons put into practice while discussions continue are beyond the reach of a ban.

“China wanted to discuss ‘existing and emerging technologies’ but the wording insisted on by the US and the UK is that it is only about emerging technologies,” said Noel Sharkey, a professor of artificial intelligence and robotics at the University of Sheffield and co-founder of the International Committee for Robot Arms Control, a coalition of robotics experts who are campaigning against the military use of robots.

“The UK and US are both insisting that the wording for any mandate about autonomous weapons should discuss only emerging technologies. Ostensibly this is because there is concern that … we will want to ban some of their current defensive weapons like the Phalanx or the Iron Dome.

“However, if the discussions go on for several years as they seem to be doing, many of the weapons that we are concerned about will already have been developed and potentially used.”

A rocket is launched from Israel’s Iron Dome system. Photograph: Dan Balilty/AP
Sharkey said current developments made that a likely scenario. “Governments are continuing to test autonomous weapons systems, for example with the X-47B, which is a fighter jet that can fly on its own, and there are contracts already out for swarms of autonomous gun ships. So if we are tied up [discussing a ban] for a long time then the word ‘emerging’ is worrying.”

“People say that the convention on conventional weapons is a graveyard for good ideas because it’s notoriously slow moving. If we see the brakes being applied now that would take discussions into a fourth year.”

No fully autonomous weapons are yet in use, but many semi-autonomous lethal precursors are in development. One such weapon is the South Korean sentry robot SGR-1, which patrols the country’s border with North Korea and detects intruders as far as two miles away using heat and light sensors.

The robots are armed with machine guns and although currently controlled by humans from a distance, they are reportedly capable of making a decision to kill without human intervention.

Israel is deploying machine-gun turrets along its border with the Gaza Strip to target Palestinian infiltrators automatically. And the UK’s Taranis fighter jet flies autonomously and can identify and locate enemies. Although it does not yet act completely autonomously, it was described by a defence procurement minister as having “almost no need for operator input”.

Campaigners from across the fields of robotics and technology have made several high-profile pleas for a pre-emptive ban on offensive autonomous weapons in the past two years, including in a letter in July that was signed by more than 1,000 artificial intelligence researchers. The letter said offensive weapons that operate on their own would lower the threshold of going to battle and result in greater loss of human life.

Sharkey has repeatedly argued at the UN that allowing robots to make the decision to kill is wrong and fraught with risk. “We shouldn’t delegate the decision to kill to a machine full stop,” he said. “Having met with the UN, the Red Cross, roboticists and many groups across civil society, we have a general agreement that these weapons could not comply with the laws of war. There is a problem with them being able to discriminate between civilian and military targets, there is no software that can do that.

“The concern that exercises me most is that people like the US government keep talking about gaining a military edge. So the talk is of using large numbers – swarms – of robots.”

The UN convention on conventional weapons (CCW) examines whether certain weapons might be “excessively injurious or have indiscriminate effects”. This same convention was where the discussions that led to a ban on landmines and sight-blinding laser weapons began.

Only five countries have backed a ban so far, with countries such as the US, the UK and France arguing that a human will always have “meaningful control” over a robot’s decision to kill – a concept that is much debated.

There is high-level support within the UN for serious restrictions on lethal autonomous weapons. One possible route to a ban if the CCW cannot reach agreement on the next stage would be for states who want a ban to reach an agreement outside the UN, as happened with cluster bombs. But this would mean going ahead without the agreement of some of the major producers of the weapons.

Heyns believes the UN is a vital step in any global decision on banning certain weapons. “Some people say the CCW is where ideas go to die but that is not necessarily true,” he said. “They get aired there and we shouldn’t underestimate what has already been done. It’s not insignificant. At both meetings the states sent very high-level representatives. It’s not the end of the road, it’s a step and you can’t skip it.”