Tuesday, October 13, 2015

Reinforcement Learning - Flappy Bird. Brian Gatt. + 3D Flappy Bird with Reinforcement Learning

source: http://www.brian-gatt.com/portfolio/flappy/documentation.pdf


Department of Computing - MSc. Computer Games and Entertainment
IS71027B – AI for Games
Dr. Jeremy Gow
Term 2 - Assignment 1 (2013-2014)
Reinforcement Learning - Flappy Bird
Brian Gatt (ma301bg)
28 March 2014
Introduction
The following is the documentation for an application of reinforcement learning to Dong Nguyen’s popular game ‘Flappy Bird’. An overview of the attached deliverable is provided, explaining the aims, background and implementation details. The appendices section contains proof of the resulting program.
Aims and Objectives
The main aim of this project is to apply and implement the AI technique of reinforcement learning. Dong Nguyen’s popular ‘Flappy Bird’ game was chosen as a test bed to implement reinforcement learning in a simple game context. The following quote is the original project proposal which summarises the intent of the project:
The project will consist of a ‘Flappy Bird’ implementation using the Unity game engine and C#. The bird will be automatically controlled using reinforcement learning, in particular Q-Learning. The deliverable will provide a visualization of the learning process (on-line learning) with accompanying statistics in a text file format. Certain attributes will be made available for modification via the Unity editor interface.
In the end, all of the intended objectives were achieved while also allowing for on-line learning from player input.
Background
Reinforcement Learning
Reinforcement learning is a machine learning technique which is prevalent in modern games.
Agents learn from their past experience which in turn allows them to better judge future actions. This is achieved by providing the agent with a reward for their actions in relation to the current world state.
Reinforcement learning is based on three important factors:
1. The reward function – A function which rewards or punishes (a negative reward) the agent for the action it just performed.
2. The learning rule – A rule which reinforces the agent’s memory based on the current experience and reward.
3. The exploration strategy – The strategy which the agent employs in order to select actions.
The reward function, learning rule and exploration strategy generally depend on the current world state; a set of variables which define what the learning rule will take into consideration to adapt.
One popular implementation of reinforcement learning is Q-learning. It is a learning rule which evaluates Q-values, values which represent the quality score of a state-action tuple. The learning rule returns a Q-value by blending the prior state-action Q-value and the best Q-value of the current state. This is represented as:
Q(s, a) = (1 − α) Q(s, a) + α (r + γ max_a′ Q(s′, a′))
The variable α is the learning rate which linearly blends the Q-values, r is the perceived reward, and γ is the discount factor which defines “how much an action’s Q-value depends on the Q-value at the state (or states) it leads to.” (Millington & Funge, 2009). Both α and γ are bound between 0 and 1 inclusive.
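As an illustrative sketch only (the names below are hypothetical, not the project’s actual classes), the update rule translates to C# roughly as follows:

```csharp
// Hypothetical sketch of the Q-learning update described above.
// alpha = learning rate, gamma = discount factor, both in [0, 1].
public static class QLearningUpdate
{
    public static float Update(
        float oldQ,        // Q(s, a): Q-value of the previous state-action pair
        float reward,      // r: reward received for taking action a in state s
        float bestNextQ,   // max over a' of Q(s', a') in the new state s'
        float alpha,       // learning rate
        float gamma)       // discount factor
    {
        // Q(s, a) = (1 - alpha) * Q(s, a) + alpha * (r + gamma * max Q(s', a'))
        return (1f - alpha) * oldQ + alpha * (reward + gamma * bestNextQ);
    }
}
```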
Reinforcement learning requires time to train to a state where agents can react in a reasonable manner. This is all dependent on the parameters chosen for the learning rule, the rewards, and how the world is encoded (discretizing the world state in a reasonable manner). The data structures used to store this information can also hinder the experience.
Design
The design for this implementation closely follows GitHub user SarvagyaVaish’s implementation (Vaish, 2014).
Vaish uses Q-Learning in order to train the ‘Flappy Bird’ character using a reward function which rewards the bird with a score of 1 on each frame the bird is alive. Once the bird dies, the reward function heavily punishes with a score of -1000. The learning state is defined as the horizontal and vertical proximity between the bird and the upcoming pipe hazard. The bird is allowed to do any of the available actions (Do Nothing, Jump) at any time during the simulation.
We took some liberties in relation to Vaish’s implementation in order to make it easier to implement and to adapt it to the Unity engine. One clear example is the data structure used to store the character’s experience. Vaish uses a multi-dimensional fixed-size array based on the maximum distances between the bird and the pipes on the horizontal and vertical axes. He then creates a basic hash mechanism and stores Q-values in this data structure, storing values which exceed the minimum or maximum in the lowest or greatest element respectively. In our case, we use a simple map or dictionary and store the experiences accordingly.
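As a rough sketch of the map-based alternative (hypothetical names; the actual classes are described under ‘Classes and Interfaces’ below), unknown state-action pairs can simply default to a Q-value of zero, so no fixed-size array or min/max clamping of out-of-range states is required:

```csharp
using System.Collections.Generic;

// Illustrative sketch only: a dictionary-backed Q-value store keyed by the
// quantized state, holding one Q-value per action.
public class MapBasedQValueStore
{
    // Key: quantized state, Value: Q-values per action (0 = do nothing, 1 = jump).
    private readonly Dictionary<string, float[]> qValues = new Dictionary<string, float[]>();

    public float GetQValue(string state, int action)
    {
        float[] values;
        return qValues.TryGetValue(state, out values) ? values[action] : 0f;
    }

    public void SetQValue(string state, int action, float value)
    {
        float[] values;
        if (!qValues.TryGetValue(state, out values))
        {
            values = new float[2];
            qValues[state] = values;
        }
        values[action] = value;
    }
}
```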
Implementation
Following are some details on how to configure the AI parameters and how the implementation is defined in terms of the major Unity game objects, components, and behaviours.
Scene
The scene mainly consists of the main camera, the ‘Flappy Bird’ character and hazard spawn points. The ‘GameController’ game object is an empty game object which is used to control the overall game. The ‘Destroyer’ object is used to destroy previously instantiated hazards in order to keep the game object instance count to a minimum. The ‘Cover Up’ game object is essentially a cover layer which hides what is occurring behind the scenes so that users are not able to see hazard spawn points and destruction points. Finally, the ‘ceiling’ game object was necessary to avoid cheating. During implementation, there were cases where the bird was exploring areas beyond its intended space, allowing him to stay alive yet not achieving a better score since he was, literally, jumping over the pipes.
Configuration
AI parameter configuration is achieved by modifying the variables within the ‘Player Character’ script component attached to the ‘Flappy Bird’ game object. Below is a screenshot of the mentioned component.
Figure 1: ‘Flappy Bird’ Configuration
The ‘Controller Type’ variable lets the user choose between ‘InputController’, ‘AIController’ and ‘HybridController’. The ‘InputController’ is an implementation artifact which was used to test the game mechanics; it does not learn and only reacts to user input. The ‘AIController’ is the reinforcement learning controller strategy which learns and acts on its own. The ‘HybridController’ is an extension of the ‘AIController’ which learns on its own but also allows the user to provide input, so that the learning algorithm can learn from a different source.
The Intelligence properties expose the learning parameters previously mentioned in the ‘Background’ and ‘Design’ sections. Note that the ‘Saved State’ field is an implementation artifact. It was intended to allow learning to be resumed across sessions; alas, encoding the state of the learning algorithm within the log file became unwieldy due to excessive file sizes, so it was abandoned. The ‘Precision’ field specifies how many decimal places of the bird-pipe proximity are taken into consideration.
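As a rough sketch of how such a precision parameter might be applied (hypothetical names; the actual project code may differ), rounding the proximity values to a given number of decimal places controls how many distinct learning states exist:

```csharp
using System;

// Illustrative sketch: quantizing the bird-pipe proximity so that only
// 'precision' decimal places contribute to the learning state.
public static class StateQuantizer
{
    public static string Quantize(float dx, float dy, int precision)
    {
        float qx = (float)Math.Round(dx, precision);
        float qy = (float)Math.Round(dy, precision);
        // Fewer decimal places -> fewer distinct states -> faster but coarser learning.
        return qx.ToString("F" + precision) + ":" + qy.ToString("F" + precision);
    }
}
```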
The ‘GameController’ game object exists solely to host the ‘Game Controller’ script controller which hosts minor configuration parameters for the overall game. Below is a screenshot of the mentioned component:
Figure 2: ’GameController’ Overall Configuration
Please note that the ‘Best Score’ field does not affect the game per se; it is only there as a visual cue to keep track of the best score recorded.
To speed up the simulation, open ‘Edit’ – ‘Project Settings’ – ‘Time’ and modify the ‘Time Scale’ parameter.
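The same parameter is also exposed to scripts through Unity’s Time.timeScale property; a minimal, hypothetical helper (not part of the project) could look like this:

```csharp
using UnityEngine;

// Hypothetical helper: the document changes 'Time Scale' via the editor menu;
// the same value can be set from script through Time.timeScale.
public class SimulationSpeed : MonoBehaviour
{
    [Range(1f, 10f)]
    public float speedUp = 4f; // e.g. a 4x speed-up, as used for the coalesced log file

    void Update()
    {
        Time.timeScale = speedUp; // 1 = real time, >1 = faster simulation
    }
}
```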
Classes and Interfaces
Following is a brief overview of the major classes and interfaces which compose the overall implementation.
GameController
Manages the overall game flow by managing the state of the game and restarting accordingly. It is based on the singleton pattern and is persisted across scene loads. It stores the state of the AI algorithms on death of the player and restores them once the level is initiated.
PlayerCharacter
Contains the behaviour of the player character. The strategy pattern is used to represent the controller implementations and is switched on scene load accordingly. The update event delegates to the underlying controller implementation.
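As a rough illustration of this strategy-pattern arrangement (hypothetical names, not the project’s actual API), the character delegates its per-frame update to whichever controller was selected:

```csharp
using UnityEngine;

// Illustrative sketch of the controller strategy described above.
public interface IController
{
    // Decide and apply an action for the current frame.
    void Update(PlayerCharacter character);
}

public class PlayerCharacter : MonoBehaviour
{
    private IController controller;

    void Start()
    {
        // Selected according to the 'Controller Type' field:
        // InputController, AIController or HybridController.
        controller = new AIController();
    }

    void Update()
    {
        // The update event simply delegates to the underlying strategy.
        controller.Update(this);
    }
}

public class AIController : IController
{
    public void Update(PlayerCharacter character)
    {
        // Q-learning based decision making would go here.
    }
}
```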
AIController
A Controller implementation which uses the Q-Learning algorithm to store experience and decide the actions of the player character.
CoalescedQValueStore
An IQValueStore implementation. It follows Millington and Funge’s (2009) recommendation of coupling the state and the action as one entity. This implementation replaces the QValueStore implementation which was used in earlier stages of development; we were initially concerned that the original version suffered from hash collisions, so we implemented the coalesced version, which provided better, more stable results.
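A rough sketch (hypothetical names) of coupling the state and the action into a single dictionary key, as described above:

```csharp
using System;

// Illustrative sketch only: state and action coalesced into one immutable key,
// so each dictionary entry corresponds to exactly one state-action pair and
// lookups rely on value equality rather than a hand-rolled hash scheme.
public struct StateAction : IEquatable<StateAction>
{
    public readonly int Dx;      // quantized horizontal bird-pipe distance
    public readonly int Dy;      // quantized vertical bird-pipe distance
    public readonly int Action;  // 0 = do nothing, 1 = jump

    public StateAction(int dx, int dy, int action)
    {
        Dx = dx;
        Dy = dy;
        Action = action;
    }

    public bool Equals(StateAction other)
    {
        return Dx == other.Dx && Dy == other.Dy && Action == other.Action;
    }

    public override bool Equals(object obj)
    {
        return obj is StateAction && Equals((StateAction)obj);
    }

    public override int GetHashCode()
    {
        return ((Dx * 397) ^ Dy) * 397 ^ Action;
    }
}

// A Dictionary<StateAction, float> can then hold one Q-value per coalesced key.
```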
QLearning
The implementation of the Q-Learning algorithm according to Millington and Funge.
User Manual
Prerequisites
Please ensure that Unity is installed on your system. This project was developed and tested using Unity version 4.3.
Launching the Project
In order to launch the project, double-click on the ‘MainScene’ scene file located in ‘Assets/Scenes/’. Alternatively, open Unity and via the ‘Open Project…’ menu item, navigate to the top-level directory of the project and launch it from there.
Initiating the Simulation
In order to start the simulation, simply click the ‘play’ button in the Unity editor. Prior to starting the simulation, ensure that the parameters are set up correctly. Certain parameters can also be modified mid-run; for mid-run modification it is recommended to pause the simulation via the ‘pause’ button in the Unity editor so that the AI parameters are easier to modify.
Evaluation
Based on our implementation and the generated log files (refer to ‘Appendices’), it takes time for the character to learn the problem. Important elements which define the learning algorithm are the space quantization parameters and the underlying data structures used to store the experiences. Coarse space quantization can lead to faster results, but the character starts to generalize quickly; on the other hand, precise values lead to longer training but more fine-tuned results. The following chart shows the best run we managed to achieve (refer to the attached ‘FlappyBirdRL.best.1x.csv’ for a detailed overview):


[Chart: ‘Final Scores vs Run Instances’ – final score per run (y-axis, 0–200) plotted against run instance (x-axis, 0–1800) for the best recorded run.]
Whether or not reinforcement learning is useful for this type of application is debatable. According to one of the comments on the SarvagyaVaish repository, ‘Flappy Bird’ uses a deterministic physics model where both the jump height and width can easily be computed.
Figure 3: Flappy Bird's deterministic physics model (Vaish, 2014)
Based on these two values, a simpler scheme can be used. Another project (Jou, 2014) shows another implementation of this concept, focusing instead on computer vision. We believe that this project exploits the deterministic physics model in order to achieve its results.
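For illustration only (the constants below are made up, not Flappy Bird’s actual values), the jump height and width of such a deterministic model follow directly from projectile motion:

```csharp
using System;

// Hypothetical sketch: with a fixed jump impulse, gravity and horizontal speed,
// the apex height and the horizontal distance covered while rising are deterministic.
public static class JumpArc
{
    public static void Compute(float jumpVelocity, float gravity, float horizontalSpeed)
    {
        float timeToApex = jumpVelocity / gravity;                       // seconds
        float jumpHeight = jumpVelocity * jumpVelocity / (2f * gravity); // peak rise
        float jumpWidth  = horizontalSpeed * timeToApex;                 // distance while rising

        Console.WriteLine("height = " + jumpHeight + ", width = " + jumpWidth);
    }
}
```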
Conclusion
Reinforcement learning is an AI technique which allows agents to learn and adapt via a reward-punishment system. This technique was implemented and applied to Dong Nguyen’s ‘Flappy Bird’ and its implications were evaluated. During development, testing was continuously performed in order to ensure that the requirements were met and the deliverable is of a high quality. The appendices section shows a running demonstration of the artefact.
References
Jou, E. (2014, February 24). Chinese Robot Will Decimate Your Flappy Bird Score. Retrieved from Kotaku: http://kotaku.com/chinese-robot-will-decimate-your-flappy-bird-score-1529530681
Millington, I., & Funge, J. (2009). Artificial Intelligence for Games. Morgan Kaufmann.
Vaish, S. (2014, February 15). Flappy Bird RL. Retrieved February 21, 2014, from Github Pages: http://sarvagyavaish.github.io/FlappyBirdRL/
Appendices
List of Included Log files
FlappyBirdRL.3.x1 – Sample log file (using non-coalesced QValueStore).
FlappyBirdRL.4.x1 – Sample log file using different parameters (using non-coalesced QValueStore).
FlappyBirdRL.coalesced.x4 – The log file generated from a 4x speed-up when using the CoalescedQValue store.
FlappyBirdRL.video.x2 – The log file generated by the run which is shown in the video linked below.
FlappyBirdRL.best.x1 – The log file generated by what we consider the best (and longest) run, in which 176 pipes are recorded as the best score when using the CoalescedQValue store.
Video Demonstration
The video shows a demonstration of the attached deliverable. The game was sped up by a factor of 2 in order to keep footage short while showing the concept of reinforcement learning applied to ‘Flappy Bird’.
Screenshots
Figure 4: Screenshot
Figure 5: Screenshot
Figure 6: Screenshot



================================================================










3D Flappy Bird with Reinforcement Learning

Hang Qi
Jian Gong
Lunbo Xu

hangqi@cs.ucla.edu
jiangong@cs.ucla.edu
lunbo_xu@cs.ucla.edu



ABSTRACT

In this work, we present a reinforcement learning method to allow an artificial bird to survive in a randomly generated environment without colliding with any obstacles, by deciding when to flap its wings. The setting is based on a very popular mobile game called Flappy Bird. We implemented it in 3D with WebGL, making it easy for the public to access and experiment with in a browser. The bird is given two different perception models: (1) instant fovy, and (2) fovy with short-term memory. In both models, the perception state is represented by a 16×1 vector consisting of coordinates of the two pillars in front of the bird. A reinforcement learning algorithm, Q-learning, is used to learn a decision policy, powered by which the bird is able to survive by repeatedly deciding which action to perform given its current perception state. The experiments show our Q-learning algorithm allows the bird to survive for more than 200 pillars.

Keywords
Artificial Life, Reinforcement Learning, Flappy Bird

1. INTRODUCTION

Recently, as the mobile game app Flappy Bird hit the market and became popular, people complained about how hard it is to prevent the bird from crashing into pillars. As humans try hard to adapt to the physics in the game, we found it promising to use reinforcement learning [8] algorithms to let the bird learn a decision rule by itself through interactions with the environment.
In this work, we model the physics in a graphics environment, simulate the perception of the bird by feeding in the relative coordinates of selected key points of the pillars, and quantize the relative position between the bird and the pillars into a state space of reasonable size. A decision policy between jumping or not jumping at each state is learned using the Q-learning algorithm [10].
In this report, we briefly discuss the motivation and related work in section 2. Section 3 discusses our formulation of the world and the learning problem. Section 4 presents the learning algorithm we used to learn the decision policy. Tools and implementation details are presented in section 5. Finally, sections 6 and 7 include our experiments and discussions on future directions.

2. RELATED WORK

Recently a mobile game app called Flappy Bird hit the market and became very popular on the Android and iOS platforms. This game was originally created by Nguyen Ha Dong, a Vietnamese developer. In this game, a player controls a bird, making it jump by tapping the touch screen. The goal is to let the bird fly as far as possible between columns of green pipes without coming into contact with them. A plethora of Flappy Bird copies and clones emerged, such as Squishy Bird [4], Clumsy Bird [5], and even an MMO version [1]. The popularity of the game not only originates from its simplicity, but also from the challenge it poses to humans' adaptiveness to its dynamics.
Although the rule of the game is simple, it is agreed to be challenging to make the bird fly over more than ten columns of pipes. It requires the player to keep sending commands to the bird precisely at the correct moments, with high concentration and fast reactions.
Some work has been done to let the bird become intelligent enough to fly by itself. FlappyBirdRL [9] is one such work, which used a 2D model and reinforcement learning to teach the virtual bird to fly. It assumes that the bird fully knows the configuration of the environment, including obstacles out of reach of the bird's sight, which is obviously not true for a real bird.
Thus, in our project, we try to overcome this limitation by explicitly modeling the bird's field of view. In our 2.5D world, we defined the field of view of the bird in the vertical direction, which means the bird can only see things that are within a certain number of degrees of its sight. In particular, any column that is too high or too low will fall outside of the bird's field of view. The detailed modeling method is discussed in the following section.
Figure 1: Occluded points or those not in the fovy (red points) are unknown to the bird. The bird knows only what it sees (green points) with instant fovy perception. The field of view of the bird is represented by white rays. RGB lines are the axes of the world coordinate system.
3. FORMULATION
In this section, we introduce our modeling and formulation of the world.
3.1 World
Although the bird is flying in a 3D world, the space is essentially represented in 2D. In our 2D model, the bird is represented by its bounding box, a Wb × Hb rectangle. The bird is given a constant horizontal velocity vx to fly, whereas the vertical velocity vy is controlled by gravity by default. Whenever the bird chooses to perform a jump action, a vertical velocity increment ∆vy is added.
Cylinder pillars are represented by rectangles as well. The y positions of the gap between two pillars in the same column are generated randomly. The bird is dead, and therefore the game is over, when the bounding box of the bird overlaps with a pillar. To prevent the game from being too difficult, however, the gap between pillars in the same column, the distance between two adjacent columns, and the width of the pillars are fixed to constant values (yellow lines in Figure 1).
3.2 Perception
To learn a decision policy, the input to the learning algorithm is, of course, the perception of the bird. However, simulating real vision as a full-size image directly would have introduced too much complexity in segmentation, triangulation, 3D reconstruction, and vision feature extraction. Given that we want to focus our work on reinforcement learning, we consider the perception in two scenarios based on different assumptions.
3.2.1 Instant Fovy
In the first scenario, we assume that the bird can only see the two nearest columns of pillars and cannot see occluded corners. Each column is abstracted as four key points (i.e. corners), as shown in Figure 1. Hence, eight coordinates relative to the bird's eye position, {(xi, yi)}, i = 1, ..., 8, are used to represent the vision. Occluded points or points not in the fovy (i.e. field of view in y) are marked unknown to the bird by explicitly setting them to an infinite sentinel: (∞, ∞) is used to represent the occluded corner above the bird, whereas (−∞, −∞) is used for points below the bird.
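A rough sketch of this encoding follows (hypothetical names, written here in C# although the project itself is implemented in JavaScript/WebGL): eight key points become a 16-element state vector, with infinite sentinel values for corners the bird cannot currently see.

```csharp
// Illustrative sketch only of the instant-fovy encoding described above.
public struct Corner
{
    public float X;          // relative to the bird's eye position
    public float Y;
    public bool Visible;     // false if occluded or outside the fovy
    public bool AboveBird;   // chooses the sentinel sign for unseen points
}

public static class InstantFovyPerception
{
    public static float[] Encode(Corner[] corners) // the 8 key points of the 2 nearest columns
    {
        float[] state = new float[16];
        for (int i = 0; i < 8; i++)
        {
            if (corners[i].Visible)
            {
                state[2 * i] = corners[i].X;
                state[2 * i + 1] = corners[i].Y;
            }
            else
            {
                // Unseen point: an infinite sentinel whose sign encodes
                // whether the point lies above or below the bird.
                float sentinel = corners[i].AboveBird
                    ? float.PositiveInfinity
                    : float.NegativeInfinity;
                state[2 * i] = sentinel;
                state[2 * i + 1] = sentinel;
            }
        }
        return state;
    }
}
```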
Figure 2: With short-term memory, the bird not only knows what it sees at the current position (green points), but also remembers what it saw (yellow points).
Figure 3: From a given state s, the bird can perform two actions (1) jump and (2) not jump, each of which leads the bird to a new state.
3.2.2 Fovy with Short-term Memory
In the second scenario, in addition to perception, we assume the bird has a short-term memory so that it will remember the existence of a corner once it has seen it. In this case, we can simply give all four corners to the bird without marking any occlusion explicitly, since the bird will see all the points anyway as it jumps up and down (Figure 2).
3.3 States and actions
Given the above discussion on perception, we can represent the state of the bird by a 16×1 vector

s = (x1, y1, x2, y2, ..., x8, y8),

consisting of the coordinates (relative to the bird's eyes) of the eight key points of the nearest two pillars. Real numbers are rounded to their nearest tenth to limit the size of the state space. It is clear that in the case of instant fovy, the state space is smaller due to the occlusion marks.
An action leads to the transition from one state to another. In our configuration, the bird can only perform two actions: (1) jump, (2) not jump, as shown in Figure 3. Given the current state st, the next state st+1 can easily be obtained by computing the movement of the bird according to the action at performed.
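As a rough sketch (hypothetical names, in C# although the project itself is JavaScript/WebGL), the next state follows directly from the simple physics model described above:

```csharp
// Illustrative sketch only: one simulation step of the state transition.
public struct BirdState
{
    public float X;   // horizontal position
    public float Y;   // vertical position
    public float Vy;  // vertical velocity
}

public static class Transition
{
    public static BirdState Step(
        BirdState s,
        bool jump,     // action: (1) jump or (2) not jump
        float vx,      // constant horizontal velocity
        float gravity, // gravitational acceleration
        float deltaVy, // vertical velocity increment added on a jump
        float dt)      // simulation time step
    {
        BirdState next = s;
        next.Vy = s.Vy - gravity * dt + (jump ? deltaVy : 0f); // gravity by default, impulse on jump
        next.X = s.X + vx * dt;                                // constant horizontal motion
        next.Y = s.Y + next.Vy * dt;
        return next;
    }
}
```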

4. LEARNING

To let the bird survive in the game, we want to learn a decision policy a_t = π(s_t) which decides an action a_t for each state s_t so as to maximize the total reward accumulated over time t:

Σ_{t=0}^{∞} γ^t R_{a_t}(s_t, s_{t+1}),

where γ is a discount factor and R_{a_t}(s_t, s_{t+1}) is the reward gained from action a_t, which causes the state transition from s_t to s_{t+1}.
If we know the exact transition probabilities P_a(s, s′) and the reward of the states V(s′), we can solve for the optimal policy that maximizes the expected reward from the following equation:

π(s) = arg max_a { Σ_{s′} P_a(s, s′) ( R_a(s, s′) + γ V(s′) ) }.

However, in our game environment, we assume the bird has no prior knowledge about the world, so the transition probabilities and rewards are unknown to us. This leads to the use of reinforcement learning [8].
In the framework of Q-Learning [10], a specific reinforcement learning algorithm, we want to learn a policy in the form of a quality table

Q : S × A → R,

which maps a state s ∈ S and an action a ∈ A to a real number r ∈ R that measures the reward of performing action a at state s. Once this quality table is learned, the decision can easily be made at any given state s as follows:

a = arg max_{a ∈ A} Q(s, a).
Starting from an arbitrary Q, the Q-learning algorithm updates the table iteratively using the equation

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α [ R_{t+1} + γ F − Q_t(s_t, a_t) ],

where F = max_a Q_t(s_{t+1}, a) is the maximum expected future reward at state s_{t+1}, α is the learning rate controlling the extent to which we want to forget the history, and γ is the discount factor for the expected future reward. This equation essentially says that an action a_t will get a higher reward if it lands in a state s_{t+1} that has a high reward and in which we are not expected to die quickly. It has been proved that the Q-learning algorithm is guaranteed to converge to the optimal decision policy.

5. GRAPHICS

This project is implemented with WebGL, which enables rendering complex graphics on an HTML5 canvas with JavaScript. Our project runs efficiently and directly in the browser without the help of any other plugins, making it easy for the public to access.
In this section, we'll introduce the major development tools and some graphics techniques used in this project. All the graphics-related functions and APIs are wrapped in the graphics.js file.

5.1 Tools

Three.js [2] is a lightweight cross-browser JavaScript library/API used to create and display animated 3D computer graphics in a Web browser. Three.js scripts may be used in conjunction with the HTML5 canvas element, SVG or WebGL. It greatly speeds up our development process.
Figure 4: The background scene is built with a skybox. It consists of six properly aligned images.
Blender [3] is free and open-source 3D creation software, which is used in this project for building and converting 3D models, including the Greek columns¹ and the flapping bird².
Chrome Developer Tools are used to assist the interface design and most importantly, to debug JavaScript functions and APIs.

5.2 Scene Rendering

The background scene in the game is created with the skybox technique [6]. Cube mapping is performed to project images onto the different faces of a cube. As shown in Figure 4, the images were properly aligned on the box. As the cube moves with the camera, it creates an illusion of distant three-dimensional surroundings. In our implementation, a separate camera is used to build the skybox³.
The animation in the scene is mainly controlled by the reinforcement learning module, which decides the exact moment the bird needs to make a jump. Whenever the bird jumps, the bird model plays a predefined flapping animation.
The point light is placed close to the setting sun in the background. We also adjusted the colors of both point light and ambient light to render the scene more realistically.

6. RESULTS

Under the first scenario, where we assume the bird is only aware of points in the fovy, the bird's intelligence evolves faster due to the limited size of the state space. Figure 5 plots the score against generation. Although the training speed is fast, after generation 16 it starts to converge to non-promising results. We believe this "degeneration" is due to the limitations of the features used for reinforcement learning and possible contradictions in the underlying randomly generated world.
In the second scenario, where the bird is given both fovy and short-term memory, the score is plotted against generation in Figure 6. In this case, our bird is able to fly further at its best performance. However, the training process takes much more time due to the large state space.
For the sake of easy debugging, we also implemented a quick training mode in which we were able to train without visualizing the flying bird. Quick training takes much less time since it cuts off the time used for real-time graphics rendering; it takes only five minutes of quick training to bring the bird to generation 18,000. However, it is interesting to witness the bird's growth from scratch (generation 0). Related video clips can be found on our project webpage.

¹ The Greek column model is downloaded from http://archive3d.net/?category=555
² The bird model is created by mirada from ro.me. [7]
³ Images in the skybox are created by Jochum Skoglund.

Figure 5: Reinforcement learning result when the bird is aware of only unoccluded points in the fovy.
Figure 6: Reinforcement learning result when the bird is aware of all eight points through its sight and short-term memory.

7. FUTURE WORK

Our future work includes implementing a real 3D environment and modifications to the training method. First, instead of flying up and down only, we want to enable the bird to turn left and right. Features extracted for a 3D game environment are more complicated than the ones we use for the 2D game model.
Besides, we also want to improve the feature extraction for reinforcement learning. First, we need to add the vertical velocity vy of the bird to the state space, since it is natural for the bird to be aware of its own velocity, and we strongly believe that some failed training results were due to unawareness of this parameter. In addition, we would like to set a constant time interval, e.g. one second, between two consecutive decisions to prevent the bird from flying straight up.

8. ACKNOWLEDGEMENT

We want to give special thanks to Professor Terzopoulos, who gave us a great course about artificial life and an overview of related techniques. We had enough freedom when doing this interesting project and have learnt a lot during the process.

9. REFERENCES

[1] Flappy Bird Massively Multiplayer Online. http://flapmmo.com/, Feb. 2014.
[2] R. Cabello. Three.js. http://www.threejs.org/, Apr. 2010.
[3] B. Foundation. Blender: Open source 3D graphics and animation software. http://www.blender.org/, 1995.
[4] R. Games. Splashy Fish. https://play.google.com/store/apps/details?id=it.junglestudios.splashyfish, Jan. 2014.
[5] E. Leão. A MelonJS port of the famous Flappy Bird Game. https://github.com/ellisonleao/clumsy-bird, Jan. 2014.
[6] E. Meiri. Tutorial 25 - SkyBox. http://ogldev.atspace.co.uk/www/tutorial25/tutorial25.html, Oct. 2010.
[7] Mirada. Dynamic Procedural Terrain Using 3D Simple Noise. http://alteredqualia.com/three/examples/webgl_terrain_dynamic.html.
[8] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
[9] S. Vaish. Flappy Bird hack using Reinforcement Learning. https://github.com/SarvagyaVaish/FlappyBirdRL, Feb. 2014.
[10] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 
