source: http://www.brian-gatt.com/portfolio/flappy/documentation.pdf
Department of Computing - MSc. Computer Games and Entertainment
IS71027B – AI for Games
Dr. Jeremy Gow
Term 2 - Assignment 1 (2013-2014)
Reinforcement Learning - Flappy Bird
Brian Gatt (ma301bg)
28 March 2014
Introduction
The following is the documentation for an application of reinforcement learning to Dong Nguyen’s popular game ‘Flappy Bird’. An overview of the attached deliverable is provided, explaining the aims, background and implementation details. The appendices section contains evidence of the resulting program.
Aims and Objectives
The main aim of this project is to apply and
implement the AI technique of reinforcement learning. Dong Nguyen’s
popular ‘Flappy Bird’ game was chosen as a test bed to implement
reinforcement learning in a simple game context. The following quote is
the original project proposal which summarises the intent of the
project:
The project will consist of a ‘Flappy Bird’
implementation using the Unity game engine and C#. The bird will be
automatically controlled using reinforcement learning, in particular Q-Learning. The deliverable will provide a visualization of the learning process (on-line
learning) with accompanying statistics in a text file format. Certain
attributes will be made available for modification via the Unity editor
interface.
In the end, all of the intended objectives were achieved while also allowing for on-line learning from player input.
Background
Reinforcement Learning
Reinforcement learning is a machine learning technique which is prevalent in modern games. Agents learn from their past experience, which in turn allows them to better judge future actions. This is achieved by providing the agent with a reward for its actions in relation to the current world state.
Reinforcement learning is based on three important factors:
1. The reward function – A function which rewards or punishes (punishment being perceived as a negative reward) the agent for the action it just performed.
2. The learning rule – A rule which reinforces the agent’s memory based on the current experience and reward.
3. The exploration strategy – The strategy which the agent employs in order to select actions.
The reward function, learning rule and exploration strategy generally depend on the current world state: the set of variables which defines what the learning rule takes into consideration in order to adapt.
One popular implementation of reinforcement learning is Q-learning. It is a learning rule which evaluates Q-values, values which represent the quality of a state-action tuple. The learning rule produces a new Q-value by blending the prior state-action Q-value with the best Q-value of the state the action leads to. This is represented as:
Q(s, a) = (1 − α)·Q(s, a) + α·(r + γ·max_a′ Q(s′, a′))

The variable α is the learning rate which linearly blends the Q-values, r is the perceived reward, and γ is the discount factor which defines “how much an action’s Q-value depends on the Q-value at the state (or states) it leads to.” (Millington & Funge, 2009). Both α and γ are bound between 0 and 1 inclusive.
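As a minimal illustration (not the exact code in the deliverable), the blend above can be written as a single pure function in C#:

```csharp
// Minimal sketch of the Q-learning blend described above.
// oldQ is Q(s, a); bestNextQ is max over a' of Q(s', a'); reward is r.
public static class QUpdate
{
    public static float Blend(float oldQ, float reward, float bestNextQ,
                              float learningRate, float discountFactor)
    {
        return (1f - learningRate) * oldQ
             + learningRate * (reward + discountFactor * bestNextQ);
    }
}
```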
Reinforcement learning requires time in order to train to a state where agents can react in a reasonable manner. This is all dependent on the parameters chosen for the learning rule, the rewards, and how the world is encoded (discretizing the world state in a reasonable manner). The data structures used to store this experience can also hinder performance.
Design
The design for this implementation closely follows GitHub user SarvagyaVaish’s implementation (Vaish, 2014). Vaish uses Q-Learning in order to train the ‘Flappy Bird’ character, using a reward function which rewards the bird with a score of 1 on each frame the bird is alive. Once the bird dies, the reward function heavily punishes it with a score of -1000. The learning state is defined as the horizontal and vertical proximity between the bird and the upcoming pipe hazard. The bird is allowed to perform any of the available actions (Do Nothing, Jump) at any time during the simulation.
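To make this concrete, the sketch below encodes the reward scheme and learning state described above. The names (‘BirdAction’, ‘LearningState’, ‘RewardScheme’) are illustrative and are not taken from the actual deliverable.

```csharp
// Illustrative encoding of the reward scheme and learning state described above.
public enum BirdAction { DoNothing, Jump }

public struct LearningState
{
    // Horizontal and vertical proximity between the bird and the upcoming pipe.
    public float HorizontalDistance;
    public float VerticalDistance;
}

public static class RewardScheme
{
    // +1 for every frame the bird survives, -1000 once it dies.
    public static float RewardFor(bool isAlive) => isAlive ? 1f : -1000f;
}
```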
We took some liberties in relation to Vaish’s implementation in order to simplify it and adapt it to the Unity engine. One clear example is the data structures used to store the character’s experience. Vaish uses a multi-dimensional fixed-size array based on the maximum distances between the bird and the pipes on the horizontal and vertical axes. He then creates a basic hash mechanism and stores Q-values in this data structure, storing values which exceed the minimum or maximum bounds in the lowest or greatest element respectively. In our case, we use a simple map or dictionary and store the experiences accordingly, as sketched below.
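A minimal sketch of such a dictionary-based store follows; the key couples the quantized state with the chosen action, in the spirit of the ‘CoalescedQValueStore’ described later. Names and types here are illustrative and may differ from the deliverable.

```csharp
using System.Collections.Generic;

// Sketch of a dictionary-based Q-value store. The state and action are
// coalesced into a single key, so no fixed-size array or manual hashing
// scheme is required; unseen state-action pairs simply default to 0.
public readonly struct StateAction
{
    public readonly float HorizontalDistance;
    public readonly float VerticalDistance;
    public readonly int Action; // 0 = Do Nothing, 1 = Jump

    public StateAction(float dx, float dy, int action)
    {
        HorizontalDistance = dx;
        VerticalDistance = dy;
        Action = action;
    }
    // Note: default struct equality is used here for brevity; a production
    // version would override Equals/GetHashCode for speed.
}

public class DictionaryQValueStore
{
    private readonly Dictionary<StateAction, float> values =
        new Dictionary<StateAction, float>();

    public float GetQValue(StateAction key) =>
        values.TryGetValue(key, out var q) ? q : 0f;

    public void StoreQValue(StateAction key, float q) => values[key] = q;
}
```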
Implementation
Following are some details on how to configure the
AI parameters and how the implementation is defined in terms of the
major Unity game objects, components, and behaviours.
Scene
The scene mainly consists of the main camera, the
‘Flappy Bird’ character and hazard spawn points. The ‘GameController’
game object is an empty game object which is used to control the overall
game. The ‘Destroyer’ object is used to destroy previously instantiated
hazards in order to keep the game object instance count to a minimum.
The ‘Cover Up’ game object is essentially a cover layer which hides what
is occurring behind the scenes so that users are not able to see hazard
spawn points and destruction points. Finally, the ‘ceiling’ game object
was necessary to prevent cheating. During implementation, there were cases where the bird explored areas beyond its intended space, allowing it to stay alive yet not achieve a better score since it was, literally, jumping over the pipes.
Configuration
AI parameter configuration is achieved by modifying
the variables within the ‘Player Character’ script component attached to
the ‘Flappy Bird’ game object. Below is a screenshot of the mentioned
component.
Figure 1: ‘Flappy Bird’ Configuration
The ‘Controller Type’ variable lets the user choose between ‘InputController’, ‘AIController’ and
‘HybridController’. The ‘InputController’ is an
implementation artifact which was used to test the game mechanics. It
does not learn and only reacts to user input. The ‘AIController’ is the
reinforcement learning controller strategy which learns and acts on its
own. The ‘HybridController’ is an extension of the ‘AIController’ which learns but also allows the user to provide their own input so that the learning algorithm can learn from a different source.
The Intelligence properties expose the learning
parameters previously mentioned in the ‘Background’ and ‘Design’
sections. Note that the ‘Saved State’ field is an implementation artifact. It was intended to allow learning to be resumed across different sessions; however, encoding the state of the learning algorithm within the log file became unwieldy due to excessive file sizes, so the feature was abandoned. The ‘Precision’ field specifies how many decimal places of the bird-pipe proximity are taken into consideration.
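As an illustration of how such a precision setting might quantize the proximity values, consider the following sketch (the actual implementation in the deliverable may differ):

```csharp
using UnityEngine;

public static class ProximityQuantizer
{
    // Rounds a proximity value to the given number of decimal places,
    // collapsing nearby world states into the same learning state.
    public static float Quantize(float value, int decimalPlaces)
    {
        float scale = Mathf.Pow(10f, decimalPlaces);
        return Mathf.Round(value * scale) / scale;
    }
}

// Example: Quantize(1.2345f, 1) yields 1.2f, so raw proximities 1.21 and 1.24
// map to the same learning state.
```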
The ‘GameController’ game object exists solely to host the ‘Game Controller’ script component, which hosts minor configuration parameters for the overall game. Below is a screenshot of the mentioned component:
Figure 2: ’GameController’ Overall Configuration
Please note that the ‘Best Score’ field does not affect the game per se; it is only there as a visual cue to keep track of the best score recorded.
To speed up the simulation, open ‘Edit’ – ‘Project Settings’ – ‘Time’ in the Unity editor and modify the ‘Time Scale’ parameter.
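The same effect can be achieved from script via Unity’s Time.timeScale property; the factor of 4 below is illustrative:

```csharp
using UnityEngine;

public class SimulationSpeed : MonoBehaviour
{
    void Start()
    {
        // Run the simulation four times faster than real time.
        Time.timeScale = 4f;
    }
}
```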
Classes and Interfaces
Following is a brief overview of the major classes and interfaces which compose the overall implementation.
GameController
Manages the overall game flow by managing the state of the game and restarting it accordingly. It is based on the singleton pattern and is persisted across scene loads. It stores the state of the AI algorithms on the death of the player and restores them once the level is re-initiated.
PlayerCharacter
Contains the behaviour of the player character. The strategy pattern is used to represent the controller implementations, and the active controller is switched on scene load accordingly. The update event delegates to the underlying controller implementation.
AIController
A controller implementation which uses the Q-Learning algorithm to store experience and decide the actions of the player character (a sketch of the controller strategy interface is given at the end of this section).
CoalescedQValueStore
An IQValueStore implementation. This follows Millington and Funge’s (Millington & Funge, 2009) recommendation of coupling the state and the action as one entity. This implementation supersedes the QValueStore implementation which was used in earlier stages of development. We initially suspected that the original version suffered from multiple hash collisions, so we implemented the coalesced version, which provided better, more stable results.
QLearning
The implementation of the Q-Learning algorithm according to Millington and Funge.
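The controller strategies above can be summarised by an interface along the following lines; this is an illustrative reconstruction (reusing the BirdAction and LearningState types sketched in the ‘Design’ section), not the exact code in the deliverable:

```csharp
// Illustrative controller strategy interface. The PlayerCharacter delegates
// its update to whichever controller is configured (input, AI, or hybrid),
// and learning controllers receive reward feedback to update their Q-values.
public interface IController
{
    // Decide the next action for the given learning state.
    BirdAction NextAction(LearningState state);

    // Receive feedback so that learning controllers can update their Q-values.
    void OnReward(LearningState previousState, BirdAction action,
                  float reward, LearningState currentState);
}
```

An ‘InputController’ would ignore OnReward, while the ‘AIController’ and ‘HybridController’ would feed it into the Q-learning update.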
User Manual
Prerequisites
Please ensure that Unity is installed on your system. This project was developed and tested using Unity version 4.3.
Launching the Project
In order to launch the project, double-click
on the ‘MainScene’ scene file located in ‘Assets/Scenes/’.
Alternatively, open Unity and, via the ‘Open Project…’ menu item, navigate to the top-level directory of the project and open it from there.
Initiating the Simulation
In order to start the simulation, simply click the
‘play’ button in the Unity editor. Ensure that the parameters are set up correctly before starting the simulation. Certain parameters can also be modified mid-run; for mid-run modification, it is recommended to pause the simulation via the ‘pause’ button in the Unity editor so that the AI parameters are easier to modify.
Evaluation
Based on our implementation and the generated log files (refer to ‘Appendices’), it takes time for the character to learn the problem. Important elements which define the learning algorithm are the space quantization parameters and the underlying data structures used to store the experiences. Coarse space quantization can lead to faster results, but the character starts to generalize quickly. On the other hand, precise values lead to longer training but more fine-tuned results. Following is a chart which shows the best run we managed to achieve (refer to the attached ‘FlappyBirdRL.best.1x.csv’ for a detailed overview):
[Chart: Final Scores vs. Run Instances – the final score of each run (y-axis, 0–200) plotted against the run instance (x-axis, 0–1800) for the best recorded run.]
Whether or not reinforcement learning is useful for this type of application is debatable. According to one of the comments on the SarvagyaVaish repository, ‘Flappy Bird’ uses a deterministic physics model where both the jump height and the jump width can easily be computed.
Figure 3: Flappy Bird's deterministic physics model (Vaish, 2014)
Based on these two values, a simpler control scheme could be used. Another project (Jou, 2014) shows a different implementation of this concept, focusing on computer vision. We believe that that project also exploits the deterministic physics model in order to achieve its results.
Conclusion
Reinforcement learning is an AI technique which allows agents to learn and adapt via a reward-punishment system. This technique was implemented and applied to Dong Nguyen’s ‘Flappy Bird’ and its implications were evaluated. During development, testing was continuously performed in order to ensure that the requirements were met and that the deliverable is of high quality. The appendices section shows a running demonstration of the artefact.
References
Jou, E. (2014, February 24). Chinese Robot Will Decimate Your Flappy Bird Score. Retrieved from Kotaku: http://kotaku.com/chinese-robot-will-decimate-your-flappy-bird-score-1529530681
Millington, I., & Funge, J. (2009). Artificial Intelligence for Games. Morgan Kaufmann.
Vaish, S. (2014, February 15). Flappy Bird RL. Retrieved February 21, 2014, from Github Pages: http://sarvagyavaish.github.io/FlappyBirdRL/
Appendices
List of Included Log files
FlappyBirdRL.3.x1 – Sample log file (using the non-coalesced QValueStore).
FlappyBirdRL.4.x1 – Sample log file using different parameters (using the non-coalesced QValueStore).
FlappyBirdRL.coalesced.x4 – The log file generated from a 4x speed-up when using the CoalescedQValueStore.
FlappyBirdRL.video.x2 – The log file generated by the run which is shown in the video linked below.
FlappyBirdRL.best.x1 – The log file generated by what we consider the best (and longest) run, in which 176 pipes were recorded as the best score when using the CoalescedQValueStore.
Source Repository
https://bitbucket.org/briangatt/flappy-bird-rl-unity
Video Demonstration
The video shows a demonstration of the attached
deliverable. The game was sped up by a factor of 2 in order to keep
footage short while showing the concept of reinforcement learning
applied to ‘Flappy Bird’.
Screenshots
Figure 4: Screenshot
Figure 5: Screenshot
Figure 6: Screenshot
================================================================
3D Flappy Bird with Reinforcement Learning
Hang Qi (hangqi@cs.ucla.edu)
Jian Gong (jiangong@cs.ucla.edu)
Lunbo Xu (lunbo_xu@cs.ucla.edu)
ABSTRACT
In this work, we present a reinforcement learning method that allows an artificial bird to survive in a randomly generated environment, without colliding with any obstacles, by deciding when to flap its wings. The setting is based on a very popular mobile game called Flappy Bird. We implemented it in 3D with WebGL, making it easy for the public to access and experiment with it in a browser. The bird is given two different perception models: (1) instant fovy, and (2) fovy with short-term memory. In both models, the perception state is represented by a 16 × 1 vector consisting of the coordinates of the two pillars in front of the bird. A reinforcement learning algorithm, Q-learning, is used to learn a decision policy, with which the bird is able to survive by repeatedly deciding which action to perform given its current perception state. The experiments show that our Q-learning algorithm allows the bird to survive for more than 200 pillars.
Keywords
Artificial Life, Reinforcement Learning, Flappy Bird
1. INTRODUCTION
Recently, as the mobile game app Flappy Bird hit the market and became popular, people complained about how hard it is to accomplish the task of preventing the bird from crashing into pillars. As humans try hard to adapt to the physics of the game, we found it promising to use reinforcement learning [8] algorithms to let the bird learn a decision rule by itself through interactions with the environment.
In this work, we model the physics in a graphics environment, simulate the perception of the bird by feeding in the relative coordinates of selected key points of the pillars, and quantize the relative position between the bird and the pillars into a state space of reasonable size. A decision policy between jumping or not jumping at each state is learned using the Q-learning algorithm [10].
In this report, we briefly discuss the motivation and related work in Section 2. Section 3 discusses our formulation of the world and of the learning problem. Section 4 presents the learning algorithm we used to learn the decision policy. Tools and implementation details are presented in Section 5. Finally, Sections 6 and 7 include our experiments and a discussion of future directions.
2. RELATED WORK
Recently, a mobile game app called Flappy Bird hit the market and became very popular on the Android and iOS platforms. The game was originally created by Nguyen Ha Dong, a Vietnamese developer. In this game, a player can make the bird jump by tapping the touch screen. The goal is to let the bird fly as far as possible between columns of green pipes without coming into contact with them. A plethora of Flappy Bird copies and clones emerged, such as Squishy Bird [4], Clumsy Bird [5], and even an MMO version [1]. The popularity of the game not only originates from its simplicity, but also from the challenge it poses to a human’s ability to adapt to its dynamics.
Although the rules of the game are simple, it is generally agreed to be challenging to make the bird fly past more than ten columns of pipes. It requires the player to keep sending commands to the bird precisely at the correct moments, with high concentration and fast reactions.
Some work has been done to make the bird intelligent enough to fly by itself. FlappyBirdRL [9] is one such work, which used a 2D model and reinforcement learning to teach the virtual bird to fly. It assumes that the bird fully knows the configuration of the environment, including obstacles out of reach of the bird’s sight, which is obviously not true for a real bird.
Thus, in our project, we try to overcome this limitation by explicitly modeling its field of view. In our 2.5D world, we defined the field of view of the bird in the vertical direction, which means the bird can only see things that are within a certain angle of its sight. In particular, any column that is too high or too low will fall outside of the bird’s field of view. The detailed modeling method is discussed in the following section.
Figure 1: Occluded points or those not in the fovy (red points) are unknown to the bird. With instant fovy perception, the bird knows only what it sees (green points). The field of view of the bird is represented by white rays. The RGB lines are the axes of the world coordinate system.
3. FORMULATION
In this section, we introduce our modeling and formulation of the world.
3.1 World
Although the bird is flying in a 3D world, the space is essentially represented in 2D. In our 2D model, the bird is represented by its bounding box, a W_b × H_b rectangle. The bird is given a constant horizontal velocity v_x to fly, whereas the vertical velocity v_y is controlled by gravity by default. Whenever the bird chooses to perform a jump action, a vertical velocity increment Δv_y is added. Cylindrical pillars are represented by rectangles as well. The y-positions of the gaps between the two pillars in the same column are generated randomly. The bird is dead, and therefore the game is over, when the bounding box of the bird overlaps with a pillar. To prevent the game from being too difficult, however, the gap between pillars in the same column, the distance between two adjacent columns, and the width of the pillars are fixed to constant values (yellow lines in Figure 1).
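As a worked summary of the kinematics described above (the time step Δt and the gravity constant g are symbols introduced here for illustration, not values given by the authors):

x_{t+1} = x_t + v_x·Δt,  y_{t+1} = y_t + v_y·Δt,  v_y ← v_y − g·Δt,

with an additional v_y ← v_y + Δv_y applied whenever a jump action is performed.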
3.2 Perception
To learn a decision policy, the input to the learning algorithm is, of course, the perception of the bird. However, simulating real vision as a full-size image directly would have introduced too much complexity in segmentation, triangulation, 3D reconstruction, and vision feature extraction. Given that we want to focus our work on reinforcement learning, we consider the perception in two scenarios based on different assumptions.
3.2.1 Instant Fovy
In the first scenario, we assume that the bird can only see the two nearest columns of pillars and cannot see occluded corners. Each column is abstracted as four key points (i.e. corners), as shown in Figure 1. Hence, eight coordinates relative to the bird’s eye position, {(x_i, y_i)}, i = 1, …, 8, are used to represent the vision. Occluded points, or points not in the fovy (i.e. the field of view in y), are explicitly marked as unknown to the bird by setting their coordinates to sentinel values: one sentinel pair represents an occluded corner above the bird, whereas another is used for points below the bird.
Figure 2: With short-term memory, the bird not only knows what it sees at the current position (green points), but also remembers what it saw (yellow points).
Figure 3: From a given state s, the bird can perform two actions, (1) jump and (2) not jump, each of which leads the bird to a new state.
3.2.2 Fovy with Short-term Memory
In the second scenario, in addition to perception, we assume the bird has a short-term memory so that it will remember the existence of a corner once it has seen it. In this case, we can simply give all four corners to the bird without marking any occlusion explicitly, since the bird will see all the points anyway as it jumps up and down (Figure 2).
3.3 States and actions
Given the above discussion on perception, we can represent the state of the bird by a 16 × 1 vector

s = (x_1, y_1, x_2, y_2, …, x_8, y_8)

consisting of the coordinates (relative to the bird’s eyes) of the eight key points of the nearest two pillars. Real numbers are rounded to their nearest tenth to limit the size of the state space. It is clear that in the case of instant fovy, the state space is smaller due to the occlusion marks.
Actions lead to transitions from one state to another. In our configuration, the bird can only perform two actions, (1) jump and (2) not jump, as shown in Figure 3. Given the current state s_t, the next state s_{t+1} can easily be obtained by computing the movement of the bird according to the action a_t performed.
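As an illustrative example of the rounding (the numbers are ours, not the authors’): a raw relative corner position such as (2.337, −0.482) is quantized to (2.3, −0.5), so every raw position that rounds to the same pair is treated as the same state, which keeps the state space finite.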
4. LEARNING
To let the bird survive in the game, we want to learn a decision policy a_t = π(s_t) which decides an action a_t for each state s_t so as to maximize the total reward accumulated over time t:

∑_{t=0}^{∞} γ^t · R_{a_t}(s_t, s_{t+1}),

where γ is a discount factor and R_{a_t}(s_t, s_{t+1}) is the reward gained from action a_t, which causes the state transition from s_t to s_{t+1}.
If we know the exact transition probabilities P_a(s, s′) and the rewards of the states V(s), we can solve for the optimal policy that maximizes the expected reward from the following equation:

π(s) = arg max_a { ∑_{s′} P_a(s, s′) · ( R_a(s, s′) + γ · V(s′) ) }.
However, in our game environment, we assume the bird has no prior knowledge about the world, so the transition probabilities and rewards are unknown to us. This leads to the use of reinforcement learning [8].
In the framework of Q-learning [10], a specific reinforcement learning algorithm, we want to learn a policy in the form of a quality table

Q : S × A → ℝ,

which maps a state s ∈ S and an action a ∈ A to a real number r ∈ ℝ that measures the reward of performing action a at state s. Once this quality table is learned, the decision can easily be made at any given state s as follows:

a = arg max_{a ∈ A} Q(s, a).

Starting from an arbitrary Q, the Q-learning algorithm updates the table iteratively using the equation

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α · [ R_{t+1} + γ · F − Q_t(s_t, a_t) ],

where F = max_a Q_t(s_{t+1}, a) is the maximum expected future reward at state s_{t+1}, α is the learning rate controlling the extent to which we want to forget the history, and γ is the discount factor for the expected future reward. This equation essentially says that an action a_t will get a higher reward if it lands in a state s_{t+1} that has a high reward and at which we are not expected to die fast. It has been proved that the Q-learning algorithm is guaranteed to converge to the optimal decision policy.
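As a worked example with assumed values (these numbers are ours, not the authors’): take α = 0.7, γ = 0.9, Q_t(s_t, a_t) = 1.0, R_{t+1} = 1 and F = max_a Q_t(s_{t+1}, a) = 2.0. Then

Q_{t+1}(s_t, a_t) = 1.0 + 0.7 · (1 + 0.9 · 2.0 − 1.0) = 1.0 + 0.7 · 1.8 = 2.26,

so the Q-value of (s_t, a_t) increases because the transition earned a reward and led to a state with a promising future reward.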
5. GRAPHICS
This project is implemented with WebGL, which enables complex graphics to be rendered on an HTML5 canvas with JavaScript. Our project runs efficiently and directly in the browser without the help of any other plugins. This makes it easy for the public to access our project.
In this section, we introduce the major development tools and some of the graphics techniques used in this project. All the graphics-related functions and APIs are wrapped in the graphics.js file.
5.1 Tools
Three.js [2] is a lightweight cross-browser JavaScript library/API used to create and display animated 3D computer graphics in a Web browser. Three.js scripts may be used in conjunction with the HTML5 canvas element, SVG or WebGL. It greatly speeds up our development process.
Figure 4: The background scene is built with a skybox. It consists of six properly aligned images.
Blender [3] is a free and open-source 3D creation suite, which is used in this project for building and converting 3D models, including the Greek columns¹ and the flapping bird².
Chrome Developer Tools are used to assist with the interface design and, most importantly, to debug JavaScript functions and APIs.
5.2 Scene Rendering
The background scene in the game is created with the skybox technique [6]. Cube mapping is performed to project images onto the different faces of a cube. As shown in Figure 4, the images were properly aligned on the box. As the cube moves with the camera, it creates an illusion of distant three-dimensional surroundings. In our implementation, a separate camera is used to build the skybox³.
The animation in the scene is mainly controlled by the reinforcement learning module, which decides the exact moment the bird needs to make a jump. Whenever the bird jumps, the bird model plays a predefined flapping animation.
The point light is placed close to the setting sun in the background. We also adjusted the colors of both the point light and the ambient light to render the scene more realistically.
6. RESULTS
Under the first scenario, where we assume the bird is only aware of points in the fovy, the bird's intelligence evolves faster due to the limited size of the state space. Figure 5 plots the score against generation. Although the training speed is fast, after generation 16 it starts to converge to a non-promising result. We believe this "degeneration" is due to the limitations of the features used for reinforcement learning and possible contradictions in the underlying randomly generated world.
In the second scenario, where the bird is given both fovy and short-term memory, the score is plotted against generation in Figure 6. In this case, our bird is able to fly further at its best performance. However, the training process takes much more time due to the larger state space.
For the sake of easy debugging, we also implemented a quick training mode in which we were able to train without visualizing the flying bird. Quick training takes much less time since it cuts out the time used for real-time graphics rendering. It takes only five minutes of quick training to bring the bird to generation 18,000. However, it is interesting to witness the bird's growth from scratch (generation 0). Related video clips can be found on our project webpage.
Figure 5: Reinforcement learning result when the bird is aware of only the unoccluded points in the fovy.
Figure 6: Reinforcement learning result when the bird is aware of all eight points through its sight and short-term memory.
¹The Greek column model is downloaded from http://archive3d.net/?category=555
²The bird model is created by mirada from ro.me. [7]
³Images in the skybox are created by Jochum Skoglund.
7. FUTURE WORK
Our future work includes implementing a real 3D environment and modifying the training method. First, instead of flying up and down only, we want to enable the bird to turn left and right. Features extracted for a 3D game environment are more complicated than the ones we use for the 2D game model.
Besides this, we also want to improve the feature extraction for reinforcement learning. First, we need to add the vertical velocity v_y of the bird to the state space, since it is natural for a bird to be aware of its own velocity, and we strongly believe that some failed training results were due to unawareness of this parameter. In addition, we would like to set a constant time interval, e.g. one second, between two consecutive decisions to prevent the bird from simply flying straight up.
8. ACKNOWLEDGEMENT
We want to give special thanks to Professor Terzopoulos, who gave us a great course about artificial life and an overview of related techniques. We had plenty of freedom while doing this interesting project and learnt a lot during the process.
9. REFERENCES
[1] Flappy Bird Massively Multiplayer Online. http://flapmmo.com/, Feb. 2014.
[2] R. Cabello. Three.js. http://www.threejs.org/, Apr. 2010.
[3] B. Foundation. Blender: Open source 3D graphics and animation software. http://www.blender.org/, 1995.
[4] R. Games. Splashy Fish. https://play.google.com/store/apps/details?id=it.junglestudios.splashyfish, Jan. 2014.
[5] E. Leão. A MelonJS port of the famous Flappy Bird Game. https://github.com/ellisonleao/clumsy-bird, Jan. 2014.
[6] E. Meiri. Tutorial 25 - SkyBox. http://ogldev.atspace.co.uk/www/tutorial25/tutorial25.html, Oct. 2010.
[7] Mirada. Dynamic Procedural Terrain Using 3D Simple Noise. http://alteredqualia.com/three/examples/webgl_terrain_dynamic.html.
[8] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
[9] S. Vaish. Flappy Bird hack using Reinforcement Learning. https://github.com/SarvagyaVaish/FlappyBirdRL, Feb. 2014.
[10] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.