Tuesday, December 8, 2015

Learning about deep learning

https://www.lab41.org/learning-about-deep-learning/

GAB41 No. 15
Learning About Deep Learning!

Abhinav Ganesh, September 2015
With all the coverage on CNN and Fox News lately, Deep Learning has quickly become a household term. Well, not quite, but Deep Learning is definitely all the craze these days for those of us steeped in Machine Learning and big data. We at Lab41 have been trying to find signal in noise over the past few months, and thought it worthwhile to share our “Unauthoritative Practical Getting Started Guide” with others looking to get started with Deep Learning. Before I go on, I must warn you this largely avoids the difficult Greek symbols and other maths underpinning this complex topic. If you’re looking for formal definitions and proofs, you’ll have to follow the links to resources peppered throughout this post. Now let’s get Deeply Learned!

Why Deep Learning?

Before we get started, we have to ask the most basic question: Why is Deep Learning so interesting?

Photo Credit: TechCrunch
Well, it’s partly because you can be so incredibly flexible and creative with the tasks you want these algorithms to complete:

Andrej Karpathy’s blog post, “The Unreasonable Effectiveness of Recurrent Neural Networks,” provides Deep Learning code to automatically generate text from scratch that looks and reads like Shakespeare. The same code base also can do other impressive things, including generating software source code (in C) that looks like a real programmer wrote it. Pretty soon I’ll probably be out of a job.
People are also starting to write music using neural networks, which, lucky for me, enabled an extension of my favorite song, “Let it Go” from Disney’s “Frozen”. It’s amazing; the music actually sounds pretty good, which is incredible since the algorithms required no human intervention and still managed to learn the rhythm, tempo, and style of the song.
Yarin Gal, a third-year Ph.D. student at Cambridge, was able to take famous pieces of art and extrapolate what the painting would have looked like if the artist had drawn more on the canvas.
These are but a sample of the creative examples posted to places like Hacker News on a seemingly weekly basis, which makes it very exciting to think about how many domains and applications will be improved by this still-burgeoning field. But the question we asked ourselves is how do we actually get started with understanding deep learning? Since we’re not Ph.D. Machine Learning candidates and don’t work for any of the Google/Facebook/Microsoft brain trusts in the field, we initially felt a bit uneasy about how to tackle the foundations and applications. Luckily for us, people like Karpathy, repos on GitHub, and several other amazing resources are out there if you know where to look. Let’s do just that…

How to Learn about Deep Learning?

As we initially approached this question, we realized that there are already a ton of online resources that can help you get started, and we have aggregated many of them on our GitHub Wiki.


This is great news and shifts the focus from “finding resources” to “filtering good ones.” If you follow down this path, you’ll quickly find that one of the first resources everyone mentions is Andrew Ng’s Coursera class on machine learning (ML).

We found the lectures about basic ML concepts like linear and logistic regression to be useful if you don’t have much of an ML background. More specific to Deep Learning, we found the lectures on neural networks to be a great primer for understanding the basic concepts and architectures. Once we got the hang of some of the basic neural net lingo, we found a lecture series that Quoc Le (from the Google Brain project) gave at Carnegie Mellon a few years ago to be very effective. That lecture series not only explains what neural networks are, but also shows examples of how a place like Google uses them and why we call this field “deep learning”.


After we watched these different lecture series, we felt like we were getting a grasp of the field. However, we were still looking for something that tied the math and theory to a practical application in a cohesive way. One of the best resources we evaluated for putting this picture together was the notes by the aforementioned Andrej Karpathy for his Stanford class “Convolutional Neural Networks for Visual Recognition”. His notes are tremendously clear and understandable (unlike many other deep learning texts we found) and explain how to think of complicated deep learning concepts in simpler terms. We really can’t emphasize enough how his notes (which led us to referring to him as Andrej the Jiant) illuminated so many difficult topics and helped us transition from “reading about” to “actually doing” Deep Learning.

Photo Credit: Stanford University

Once we got a good sense for the overall picture with a combination of class notes and lecture videos, we were in a good position to start understanding recent academic literature in this space. We found the transition from more high-level resources like Andrew Ng’s class to more specific, example-based resources like Andrej Karpathy’s notes to be extremely valuable in solidifying our understanding of key concepts. As we alluded to earlier, the resources mentioned above represent only a few of the resources we used to get started. More detailed resource lists can be found on our GitHub Wiki, which includes deep learning textbooks and academic papers we’re reading that apply deep learning in interesting domains.

Next Steps

So where do we plan to go from here? We’ve seen great progress in the computer vision field with people getting higher and higher scores on ImageNet. This is great news for the field and for image-related applications. However, as a recent Nature article from Bengio, Hinton, and LeCun highlighted, Natural Language Processing could be the next frontier to benefit from Deep Learning architectures. Classes like Richard Socher’s at Stanford, “Deep Learning for Natural Language Processing” (the class notes and lectures are freely available!) only reinforce this notion and indicate the growing interest in this area.

Photo Credit: Stanford University

Since text processing is central to so many of our challenges at Lab41, we wanted to use NLP as our first hands-on foray into the field. Stay tuned to our blog as we document our exploration of Deep Learning! Our journey will be documented in code and Wiki within our Github repository and with blog entries centered on the following topics:

Deep Learning applied to text: Sentiment analysis vs. Traditional ML approaches
Easy setup of development environment for Deep Learning using Docker and GPUs
Deep Learning framework comparison: Theano vs Caffe vs Torch vs What is a Keras?



2 New first steps - AI projects to study

Neural Slime Volleyball



Recurrent neural network playing slime volleyball.  Can you beat them?
I remember playing this game called slime volleyball, back in the day when Java applets were still popular.  Although the game had somewhat dodgy physics, people like me were hooked on its simplicity and spent countless hours at night playing the game in the dorm rather than getting any actual work done.
As I can’t find any versions on the web apart from the old antiquated Java applets, I set out to create my own js+html5 canvas based version of the game (complete with the unrealistic arcade-style ‘physics’).  I also set out to apply the genetic algorithm coded earlier to train a simple recurrent neural network to play slime volleyball.  Basically, I want to find out whether even a simple conventional neuroevolution technique can train a neural network to become an expert at this game, before exploring more advanced methods such as NEAT.
The first step was to write a simple physics engine to get the ball to bounce off the ground, collide with the fence, and with the players.  This was done using the designer-artist-friendly p5.js library in javascript for the graphics, and some simple physics math routines.  I had to brush up on the vector maths to get the ball bouncing function to work properly.  After this was all done, the next step was to add in keyboard / touchpad controls so that the players can move and jump around, even when using a smartphone / tablet.
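For readers curious about the vector maths behind the bounce, here is a minimal sketch of the usual reflection-plus-restitution formula. This is not the game's actual p5.js code (the original is JavaScript); the Vec2 type, the unit-normal assumption, and the 0.9 restitution value are illustrative choices.

// Minimal sketch: reflect the ball's velocity about the unit normal of the
// surface it hit, then damp it with a restitution factor so bounces lose energy.
// v' = restitution * (v - 2 (v . n) n)
struct Vec2 { double x, y; };

double dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }

Vec2 bounce(Vec2 v, Vec2 n, double restitution = 0.9)  // n must be unit length
{
    const double d = dot(v, n);
    return { restitution * (v.x - 2 * d * n.x),
             restitution * (v.y - 2 * d * n.y) };
}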
The fun and exciting part was to create the AI module to control the agent, and to see whether it can become good at playing the game.  I ended up using the basic CNE method implemented earlier, as an initial test, to train a standard recurrent neural network, hacked together using the convnet.js library.  Below is a diagram of the recurrent network we will train to play slime volleyball, where the magic is done:
[Figure: the recurrent neural network used to play slime volleyball]
The inputs of the network would be the position and velocity of the agent, the position and velocity of the ball, and also those of the opponent.  The output would be three signals that trigger the ‘forward’, ‘backward’, and ‘jump’ controls.  In addition, an extra 4 hidden neurons would act as hidden state and be fed back to the input; this way it is essentially an infinitely deep feed forward neural network, and can potentially remember previous events and states automatically, in the hope of being able to formulate more complicated gameplay strategies.  One thing to note is that the activation functions would fire only if the signal is higher than a certain threshold (0.75).
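As a rough illustration of that recurrent step, here is a small C++ sketch (the original is built with convnet.js in JavaScript). The exact input layout is an assumption: 12 game inputs (2D position and velocity for the agent, the ball and the opponent) plus the 4 recurrent hidden values, with tanh activations and the 0.75 firing threshold mentioned above.

#include <array>
#include <cmath>

// Assumed layout: 12 game inputs plus 4 hidden values fed back from the previous
// step; 3 control outputs plus 4 new hidden values. The weights are what the
// genetic algorithm evolves.
constexpr int GAME_INPUTS = 12;
constexpr int HIDDEN      = 4;
constexpr int CONTROLS    = 3;
constexpr int IN  = GAME_INPUTS + HIDDEN;   // 16
constexpr int OUT = CONTROLS + HIDDEN;      // 7

struct SlimeRNN
{
    std::array<std::array<double, IN>, OUT> w{};   // one fully connected layer
    std::array<double, OUT> bias{};
    std::array<double, HIDDEN> hidden{};           // recurrent state across steps

    // One time step: returns on/off flags for (forward, backward, jump).
    std::array<bool, CONTROLS> step(const std::array<double, GAME_INPUTS>& game)
    {
        std::array<double, IN> in{};
        for (int i = 0; i < GAME_INPUTS; ++i) in[i] = game[i];
        for (int i = 0; i < HIDDEN; ++i)      in[GAME_INPUTS + i] = hidden[i];

        std::array<double, OUT> out{};
        for (int o = 0; o < OUT; ++o)
        {
            double sum = bias[o];
            for (int i = 0; i < IN; ++i) sum += w[o][i] * in[i];
            out[o] = std::tanh(sum);               // activations stay within +/- 1
        }

        // A control only fires if its activation clears the 0.75 threshold.
        std::array<bool, CONTROLS> controls{};
        for (int c = 0; c < CONTROLS; ++c) controls[c] = out[c] > 0.75;

        // The remaining outputs become next step's hidden state.
        for (int h = 0; h < HIDDEN; ++h) hidden[h] = out[CONTROLS + h];
        return controls;
    }
};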
I also made the agent’s states be the same regardless of whether the agent was playing on the left or the right hand side of the fence, by having locations be relative to the fence and ball positions adjusted according to which side they were playing on.  That way, a trained agent can use the same neural network to play on either side of the fence.
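A sketch of that mirroring idea, continuing the C++ illustration above (the field names and fence convention are assumptions, not the game's actual code):

// Express x coordinates relative to the fence and flip them for the right-hand
// player, so a single trained network always "sees" itself on the same side.
struct AgentView { double x, y, vx, vy, bx, by, bvx, bvy, ox, oy, ovx, ovy; };

AgentView make_view(const AgentView& world, bool playing_on_right, double fence_x)
{
    AgentView v = world;
    const double sign = playing_on_right ? -1.0 : 1.0;
    v.x  = sign * (world.x  - fence_x);  v.vx  = sign * world.vx;
    v.bx = sign * (world.bx - fence_x);  v.bvx = sign * world.bvx;
    v.ox = sign * (world.ox - fence_x);  v.ovx = sign * world.ovx;
    return v;
}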
Rather than using the sigmoid function, I ended up using the hyperbolic tangent (tanh) function to control the activations, which convnet.js supports.
The tanh function is defined as:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
The tanh function can be a reasonable activation function for a neural network, as it tends towards +1 or -1 when the inputs get steered one way or the other.  The x-axis would be the game inputs, such as the locations and velocities of the agent, the ball, and the opponent (all scaled to be +/- 1.0 give or take another 1.0) and also the output and hidden states in the neural network (which will be within +/- 1.0 by definition).
[Figure: graph of tanh(x)]
As velocities and ball locations can be positive or negative, this may be more efficient and a more natural choice compared to the sigmoid.  As explained earlier, I also scaled my inputs so they were all in the order of +/- 1.0 size, similar to the output states of the hidden neurons, so that all inputs to the network will have roughly the same orders of magnitude in size on average.
Training such a recurrent neural network involves tweaks on the genetic algorithm trainer I made earlier, since there’s really no fitness function that can return a score; either one wins or loses a match.  What I ended up doing is to write a similar training function that gets each agent in the training population to play against other agents.  If the agent wins, its score increases by one, and decreases by one if it loses.  On ties (games that last longer than the equivalent of 20 real seconds in simulation), no score is added or deducted.  Each agent will play against 10 random agents in the population in the training loop.  The top 20% of the population is kept, the rest discarded, and crossover and mutations are performed for the next generation.  This is referred to as the ‘arms race’ method to train agents to play a one-on-one game.
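A minimal C++ sketch of that arms-race loop, under the assumptions stated above (10 matches per agent, +1/-1/0 scoring, top 20% kept). play_match(), crossover() and mutate() stand in for the game simulation and GA operators, which are not shown, and the mutation rate is a placeholder; the real trainer is JavaScript built on convnet.js.

#include <algorithm>
#include <random>
#include <vector>

struct Genome { std::vector<double> weights; double score = 0.0; };

// Stand-ins, assumed to exist elsewhere: play_match returns +1 if 'a' beats 'b',
// -1 if it loses, 0 on a tie (match ran past the ~20-second limit).
int    play_match(const Genome& a, const Genome& b);
Genome crossover(const Genome& p1, const Genome& p2);
void   mutate(Genome& g, double rate);

void evolve_one_generation(std::vector<Genome>& pop, std::mt19937& rng)
{
    std::uniform_int_distribution<size_t> pick(0, pop.size() - 1);

    // Every agent plays 10 randomly chosen opponents; win +1, loss -1, tie 0.
    for (auto& g : pop) g.score = 0.0;
    for (size_t i = 0; i < pop.size(); ++i)
        for (int m = 0; m < 10; ++m)
        {
            size_t j = pick(rng);
            if (j != i) pop[i].score += play_match(pop[i], pop[j]);
        }

    // Keep the top 20% unchanged; rebuild the rest by crossover + mutation
    // between randomly chosen survivors.
    std::sort(pop.begin(), pop.end(),
              [](const Genome& a, const Genome& b) { return a.score > b.score; });
    const size_t keep = pop.size() / 5;
    std::uniform_int_distribution<size_t> elite(0, keep - 1);
    for (size_t i = keep; i < pop.size(); ++i)
    {
        pop[i] = crossover(pop[elite(rng)], pop[elite(rng)]);
        mutate(pop[i], 0.1);   // mutation rate is a placeholder
    }
}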
By using this method, the agents did not need to be hand-programmed with any heuristics or rules of the game, but would simply explore the game and figure out how to win.  And the end result suggests that they seem to be quite good at it, after a few hundred generations of evolution!  Check out the demo of the final result in the YouTube video below.

The next step could be to employ more advanced methods such as NEAT or ESP for the AI, but that may be overkill for a simple pong-like game.  It is also a candidate for applying the Deep Q-Learner already built into convnet.js, as the game playing strategy is quite simple.  For now I think I have created a fairly robust slime volleyball player that is virtually impossible for a human player to beat consistently.
Try the game out yourself and see if you can beat it consistently.  It works on both desktop (keyboard control) and smartphone / tablet via touch controls.  The desktop version is easier to control, either via keyboard arrows or mouse dragging.  Feel free to play around with the source on github, but apologies if it’s not the most neatly structured code, as it is intended to be more of a sketch than a proper program.
Update (13-May-2015)
This demo at one point got to the front page of Y Combinator’s Hacker News.  I made another demo showing the evolution of the agents’ behaviour over time, starting from knowing nothing at all.  Please see this post for more information.






========================================================================





dlib C++ Library

 Reinforcement Learning, Control, and 3D Visualization

source: http://blog.dlib.net/2015/06/reinforcement-learning-control-and-3d.html?m=0



Over the last few months I've spent a lot of time studying optimal control and reinforcement learning. Aside from reading, one of the best ways to learn about something is to do it yourself, which in this case means a lot of playing around with the well known algorithms, and for those I really like, including them into dlib, which is the subject of this post.  So far I've added two methods, the first, added in a previous dlib release was the well known least squares policy iteration reinforcement learning algorithm.  The second, and my favorite so far due to its practicality, is a tool for solving model predictive control problems.

There is a dlib example program that explains the new model predictive control tool in detail.  But the basic idea is that it takes as input a simple linear equation defining how some process evolves in time and then tells you what control input you should apply to make the process go into some user specified state.  For example, imagine you have an air vehicle with a rocket on it and you want it to hover at some specific location in the air.  You could use a model predictive controller to find out what direction to fire the rocket at each moment to get the desired outcome.  In fact, the dlib example program is just that.  It produces the following visualization where the vehicle is the black dot and you want it to hover at the green location.  The rocket thrust is shown as the red line:






// The contents of this file are in the public domain. See LICENSE_FOR_EXAMPLE_PROGRAMS.txt
/*
    This is an example illustrating the use of the linear model predictive
    control tool from the dlib C++ Library. To explain what it does, suppose
    you have some process you want to control and the process dynamics are
    described by the linear equation: x_{i+1} = A*x_i + B*u_i + C
    That is, the next state the system goes into is a linear function of its
    current state (x_i) and the current control (u_i) plus some constant bias or disturbance.

    A model predictive controller can find the control (u) you should apply to
    drive the state (x) to some reference value, which is what we show in this
    example. In particular, we will simulate a simple vehicle moving around in
    a planet's gravity. We will use MPC to get the vehicle to fly to and then
    hover at a certain point in the air.
*/
#include <dlib/gui_widgets.h>
#include <dlib/control.h>
#include <dlib/image_transforms.h>

using namespace std;
using namespace dlib;

// ----------------------------------------------------------------------------

int main()
{
    const int STATES = 4;
    const int CONTROLS = 2;

    // The first thing we do is setup our vehicle dynamics model (A*x + B*u + C).
    // Our state space (the x) will have 4 dimensions, the 2D vehicle position
    // and also the 2D velocity. The control space (u) will be just 2 variables
    // which encode the amount of force we apply to the vehicle along each axis.
    // Therefore, the A matrix defines a simple constant velocity model.
    matrix<double,STATES,STATES> A;
    A = 1, 0, 1, 0,  // next_pos = pos + velocity
        0, 1, 0, 1,  // next_pos = pos + velocity
        0, 0, 1, 0,  // next_velocity = velocity
        0, 0, 0, 1;  // next_velocity = velocity

    // Here we say that the control variables effect only the velocity. That is,
    // the control applies an acceleration to the vehicle.
    matrix<double,STATES,CONTROLS> B;
    B = 0, 0,
        0, 0,
        1, 0,
        0, 1;

    // Let's also say there is a small constant acceleration in one direction.
    // This is the force of gravity in our model.
    matrix<double,STATES,1> C;
    C = 0,
        0,
        0,
        0.1;

    const int HORIZON = 30;
    // Now we need to setup some MPC specific parameters. To understand them,
    // let's first talk about how MPC works. When the MPC tool finds the "best"
    // control to apply it does it by simulating the process for HORIZON time
    // steps and selecting the control that leads to the best performance over
    // the next HORIZON steps.
    //
    // To be precise, each time you ask it for a control, it solves the
    // following quadratic program:
    //
    //     min     sum_i trans(x_i-target_i)*Q*(x_i-target_i) + trans(u_i)*R*u_i
    //   x_i,u_i
    //
    //     such that: x_0     == current_state
    //                x_{i+1} == A*x_i + B*u_i + C
    //                lower <= u_i <= upper
    //                0 <= i < HORIZON
    //
    // and reports u_0 as the control you should take given that you are currently
    // in current_state. Q and R are user supplied matrices that define how we
    // penalize variations away from the target state as well as how much we want
    // to avoid generating large control signals. We also allow you to specify
    // upper and lower bound constraints on the controls. The next few lines
    // define these parameters for our simple example.
    matrix<double,STATES,1> Q;
    // Setup Q so that the MPC only cares about matching the target position and
    // ignores the velocity.
    Q = 1, 1, 0, 0;

    matrix<double,CONTROLS,1> R, lower, upper;
    R = 1, 1;
    lower = -0.5, -0.5;
    upper =  0.5,  0.5;

    // Finally, create the MPC controller.
    mpc<STATES,CONTROLS,HORIZON> controller(A,B,C,Q,R,lower,upper);

    // Let's tell the controller to send our vehicle to a random location. It
    // will try to find the controls that makes the vehicle just hover at this
    // target position.
    dlib::rand rnd;
    matrix<double,STATES,1> target;
    target = rnd.get_random_double()*400,rnd.get_random_double()*400,0,0;
    controller.set_target(target);

    // Now let's start simulating our vehicle. Our vehicle moves around inside
    // a 400x400 unit sized world.
    matrix<rgb_pixel> world(400,400);
    image_window win;

    matrix<double,STATES,1> current_state;
    // And we start it at the center of the world with zero velocity.
    current_state = 200,200,0,0;

    int iter = 0;
    while(!win.is_closed())
    {
        // Find the best control action given our current state.
        matrix<double,CONTROLS,1> action = controller(current_state);
        cout << "best control: " << trans(action);

        // Now draw our vehicle on the world. We will draw the vehicle as a
        // black circle and its target position as a green circle.
        assign_all_pixels(world, rgb_pixel(255,255,255));
        const dpoint pos = point(current_state(0),current_state(1));
        const dpoint goal = point(target(0),target(1));
        draw_solid_circle(world, goal, 9, rgb_pixel(100,255,100));
        draw_solid_circle(world, pos, 7, 0);
        // We will also draw the control as a line showing which direction the
        // vehicle's thruster is firing.
        draw_line(world, pos, pos-50*action, rgb_pixel(255,0,0));
        win.set_image(world);

        // Take a step in the simulation
        current_state = A*current_state + B*action + C;
        dlib::sleep(100);

        // Every 100 iterations change the target to some other random location.
        ++iter;
        if (iter > 100)
        {
            iter = 0;
            target = rnd.get_random_double()*400,rnd.get_random_double()*400,0,0;
            controller.set_target(target);
        }
    }
}

// ----------------------------------------------------------------------------

Sunday, October 18, 2015

A Guided Genetic Algorithm for the Planning in Lunar Lander Games

A Guided Genetic Algorithm for the Planning in Lunar Lander Games
Zhangbo Liu
Department of Computer Science
University of British Columbia
2366 Main Mall
Vancouver, B.C, V6T 1Z4, Canada
email: zephyr@cs.ubc.ca

KEYWORDS
Guided Genetic Algorithm, Reinforcement Learning,
Planning, Games
ABSTRACT
We propose a guided genetic algorithm (GA) for planning in games. In guided GA, an extra reinforcement component is inserted into the evolution procedure of GA. During each evolution procedure, the reinforcement component will simulate the execution of a series of actions of an individual before the real trial and adjust the series of actions according to the reinforcement, thus trying to improve the performance. We then apply it to a Lunar Lander game in which the falling lunar module needs to learn to land on a platform safely. We compare the performance of guided GA and general GA as well as Q-Learning on the game. The result shows that the guided GA can guarantee reaching the goal and achieves much higher performance than general GA and Q-Learning.
INTRODUCTION
There are two main strategies for solving reinforcement learning problems. The first is to search in the space of behaviors in order to find one that performs well in the environment. The second is to use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world (Kaelbling et al. 1996). Genetic algorithms (GA) and Temporal Difference (TD-based) algorithms (e.g. Q-Learning) belong to each of the two categories, respectively.
Both GA and TD-based algorithms have advantages and disadvantages. GA leads to very good exploration with the large population that can be generated within a generation, but weak exploitation with the elitism selection operator, because its other two operators, crossover and mutation, usually work randomly. TD-based algorithms use two strategies to solve problems with continuous spaces: discretization and function approximation. They usually face the curse of dimensionality when using discretization. Function approximation is said to alleviate this problem, but it might get stuck in certain local optima. In this paper, we first investigate the GA as well as the Q-Learning approach on the Lunar Lander game. Then we propose the guided GA by inserting a reinforcement component into the evolution procedure of GA. Since general GA uses random crossover and mutation operations, its performance is quite unstable. Guided GA is designed to achieve higher efficiency by bringing the reward concept of reinforcement learning into general GA while keeping all components of general GA unchanged, so that the extension from general GA to guided GA is easy to achieve.
The remainder of this paper is organized as follows. In section 2 we introduce research work that is relevant to this paper. In section 3 we describe the Lunar Lander game as well as alternative approaches to the problem implemented with general GA and Q-Learning. In section 4 we present the guided GA approach for the problem. The results of the experiment are shown in section 5, followed by the conclusions.
RELATED WORK
Reinforcement Learning for Continuous State-Action Space Problems
The issue of using reinforcement learning to solve continuous state-action space problems has been investigated by many researchers, and games are actually an ideal test bed.
There are a few well-known benchmark problems in the reinforcement learning domain such as Mountain Car (Moore and Atkeson 1995), Cart-Pole (Barto et al. 1983) and Acrobot (Boone 1997). In the Mountain Car problem, the car must reach the top of the hill as fast as possible and stop there. This problem is of dimension 2, the variables being the position and velocity of the car. The Cart-Pole is a 4-dimensional physical system in which the cart has to go from the start point to the goal and keep the orientation of its pole vertical within a certain threshold when it reaches the goal. The Acrobot problem is also a 4-dimensional problem which consists of a two-link arm with a single actuator at the elbow. The goal of the controller is to balance the Acrobot at its unstable, inverted vertical position in the minimum time. Another implementation, described in (Ng et al. 2004), achieved autonomous inverted helicopter flight.
Two main strategies here are discretization and function approximation. For the first strategy, discretization techniques have been widely pursued and provide convergence results and rates of convergence (Munos and Moore 2002), (Monson et al. 2004). For the second strategy, several approaches have come out on how to configure multiple function approximators (Gaskett et al. 1999), (Smart and Kaelbling 2000).
Reinforcement Learning + Genetic Algorithm
Some research on combining the advantages of GA and TD-based reinforcement learning has been proposed in (Chiang et al. 1997), (Lin and Jou 1999). However, both of them use gradient descent learning methods, which are complex, and the learning speed is always too slow to achieve the optimum solution. The idea of guided GA we propose is inspired by (Ito and Matsuno 2002), in which Q-Learning is carried out and the fitness of the genes is calculated from the reinforced Q-table. However, in guided GA, instead of using a Q-table, we directly insert a reinforcement component into the evolution procedure of the general GA so that the large Q-table and the hidden state problem are avoided. In (Juang 2005), Juang proposed another approach combining online clustering and Q-value based GA for reinforcement fuzzy system design. Compared with the approach described in that paper, guided GA is much simpler in structure and easier to implement, while the problem we address in this paper has a higher dimension than that in (Juang 2005).
THE LUNAR LANDER GAME
The Lunar Lander Game
The Lunar Lander game is actually a physically-based problem in which the controller needs to gently guide and land a lunar module onto a small landing platform, as shown in Figure 1. The space is a 400 × 300 pixel rectangular area. It simulates the real environment on the moon: the lunar module has mass and is influenced by the moon's gravity (1.63 m/s²). The controller here has a 5-dimensional state space: position (x, y), velocity (x, y) and orientation (θ). The controller is able to take four actions: rotate left, rotate right, thrust and do nothing (drift).
When an agent becomes the controller instead of a human player, the problem becomes an advanced path-finding issue. The successful landing requirement consists of checking the following variables when any part of the lunar module reaches the ground:
• Distance from the pad
• Speed
• Degrees of rotation

Figure 1: The Lunar Lander Game

All of them must be below certain thresholds to achieve a safe landing; otherwise the module will crash and the game will start from the beginning again. The game runs in real time, which makes it a good test bed for problems with continuous state and discrete action spaces.
Alternative Approaches
Genetic Algorithm
One alternative approach to this problem is using a genetic algorithm (GA) for planning. For a complete introduction to GA please refer to (Goldberg 1989). The GA approach to this problem follows the steps below in one epoch to try to achieve the goal.
First, the genome is encoded as a series of genes, each of which contains an action-duration pair, as shown in Figure 2. The duration here represents the period of time that each specific action is applied. At the beginning, all the actions and durations in the genes of one genome are randomly assigned. A number of genomes are created together in one generation.
Figure 2: Genome encoding
Next, the controller starts a trial according to the action-duration series in each genome and uses a fitness function to evaluate their utilities when they crash. There might be many approaches to building the fitness function for this problem. In (Buckland and LaMothe 2002), Buckland and LaMothe suggested the following fitness function:

Fitness = w1 · disFit + w2 · rotFit + w3 · airTime    (1)

where disFit and rotFit represent the value functions of the position and the orientation features separately. The airTime is the time period that the lunar module stays in the air, which is defined as na/(v + 1), where na is the number of actions it does ignoring the duration and v is the velocity at landing. The wi are the weights that are applied to balance the function. Those weights are quite important and must be carefully decided in order to achieve good performance. If the safe landing requirement is satisfied, the fitness value is assigned a predefined Big Number instead of being calculated using equation (1).
After one trial for all genomes of the current generation, the fitness value of each genome is calculated, and the best genomes with the highest fitness values remain and are put into the next generation. Other genomes of the next generation are created by using the crossover and mutation operators. The crossover operator works by stepping through each gene in its parents' genomes and swapping them at random to generate their offspring. The mutation operator runs down the length of a genome and alters the genes in both action and duration according to the mutation rate.
The process repeats one epoch after another until one genome's result reaches the goal or the number of generations exceeds the predefined maximum value. An implementation of a GA solution to this problem can be found in (Buckland and LaMothe 2002).
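As a concrete reading of the encoding and of equation (1), here is a minimal C++ sketch (not from the paper). The pairing of the listed weight values with the three terms follows their order in equation (1), and the Big Number, the LandingOutcome fields and the default weights are illustrative assumptions.

#include <vector>

enum class Action { RotateLeft, RotateRight, Thrust, Drift };

struct Gene   { Action action; int duration; };   // one action-duration pair
struct Genome { std::vector<Gene> genes; double fitness = 0.0; };

struct LandingOutcome
{
    double disFit;     // value function of the final distance from the pad
    double rotFit;     // value function of the final orientation
    int    numActions; // n_a: actions taken, ignoring durations
    double speed;      // v: velocity at touchdown
    bool   safe;       // all three thresholds satisfied
};

// Equation (1): Fitness = w1*disFit + w2*rotFit + w3*airTime, airTime = n_a/(v+1).
double fitness(const LandingOutcome& o,
               double w1 = 1, double w2 = 400, double w3 = 4)
{
    const double BIG_NUMBER = 1e6;                // awarded on a safe landing
    if (o.safe) return BIG_NUMBER;
    const double airTime = o.numActions / (o.speed + 1.0);
    return w1 * o.disFit + w2 * o.rotFit + w3 * airTime;
}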
Q-Learning
Based on our experience, Q-Learning with only discretization won't work for this problem. So we implement a linear, gradient-descent version of Watkins's Q(λ) for this problem with binary features, an ε-greedy policy, and accumulating traces, as described in (Sutton and Barto 1998). Tile coding (Sutton and Barto 1998) is also used to partition the continuous space into multiple tilings.
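For readers unfamiliar with that setup, the sketch below shows roughly what one update of linear Q(λ) with binary tile-coded features looks like; it is not the authors' implementation. The feature count, step-size defaults, and the active_tiles() stand-in are assumptions, and Watkins's trace-cutting rule is approximated by passing in whether the next action is greedy.

#include <algorithm>
#include <array>
#include <vector>

constexpr int NUM_ACTIONS  = 4;       // rotate left, rotate right, thrust, drift
constexpr int NUM_FEATURES = 4096;    // assumed total number of tiles

// Stand-in for a tile coder: returns the indices of the active (binary) features.
std::vector<int> active_tiles(const std::vector<double>& state, int action);

double q_value(const std::array<double, NUM_FEATURES>& theta,
               const std::vector<int>& tiles)
{
    double q = 0.0;
    for (int t : tiles) q += theta[t];   // binary features: Q is a sum of weights
    return q;
}

// One step: we were in `state`, took `action`, saw `reward`, landed in `next_state`.
// `greedy_next` says whether the action chosen for next_state is greedy; Watkins's
// variant cuts the eligibility traces after an exploratory action.
void q_lambda_update(std::array<double, NUM_FEATURES>& theta,
                     std::array<double, NUM_FEATURES>& trace,
                     const std::vector<double>& state, int action, double reward,
                     const std::vector<double>& next_state, bool greedy_next,
                     double alpha = 0.1, double gamma = 0.99, double lambda = 0.9)
{
    const double q_sa = q_value(theta, active_tiles(state, action));
    double q_max = -1e300;
    for (int a = 0; a < NUM_ACTIONS; ++a)
        q_max = std::max(q_max, q_value(theta, active_tiles(next_state, a)));

    const double delta = reward + gamma * q_max - q_sa;            // TD error

    for (int t : active_tiles(state, action)) trace[t] += 1.0;     // accumulating traces
    for (int i = 0; i < NUM_FEATURES; ++i)
    {
        theta[i] += alpha * delta * trace[i];
        trace[i] = greedy_next ? gamma * lambda * trace[i] : 0.0;
    }
}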
THE GUIDED GENETIC ALGORITHM APPROACH
The approaches we mentioned in the previous section both have advantages and disadvantages. The GA is simple to implement and is able to achieve the goal, while its disadvantage is that all its actions are randomly assigned, so its performance is quite unstable. The basic concept of the Q-Learning approach is also simple and supposed to be efficient. However, for this game, which is a real-time continuous-state problem, Q-Learning with discretization does not work and Q-Learning with function approximation is hard to accommodate. We design the guided GA, which incorporates the concept of reward from Q-Learning into GA. Here we call our function the "reinforcement function" because, unlike the reward function in Q-Learning whose values need to be summed to calculate the Q-value (Σ rewards), the reinforcement function here gets the immediate fitness value and is extended to the fitness function at the end of each epoch. In the following subsections we first introduce the reinforcement function design for guided GA for this problem and then discuss the details of the algorithm.
Reinforcement Function Design
Modeling the reinforcement function is very challenging work. It has to transform smoothly into the fitness function of the general GA (equation (1)) at the end of each epoch, so that we can easily extend the general GA to a guided GA without modifying the existing fitness function. On the other hand, it should be properly defined to efficiently guide the agents to perform better. We tried many different versions before finally reaching a solution.
In equation (1) there are 3 parameters, and we need to modify two of them, disFit and airTime, in our reward function. The main difference between equation (1) and the reinforcement function is that in equation (1), all lunar modules reach the ground (position.y = 0) and each of them has an accumulator na whose value is the number of actions they do during the whole procedure; while in the reinforcement function, the lunar modules are in the air and they only focus on the next action. Based on this difference, we build our reinforcement function as follows:
We use disFitx to represent disFit in (1), and then we build disFity, which is similar to disFitx but for the y coordinate. Then our distance function is:

disFit = sqrt( (disFitx)² + (disFity / wy)² )    (2)

where wy is used for balancing the weight between disFitx and disFity.
airTime, as mentioned in equation (1), is defined as na/(v + 1). In our reinforcement function, na no longer exists, and we find that a single fixed definition does not work well all the time, since at different stages our focus might be different. For example, when the lunar module is high in the air we pay more attention to its horizontal position, while when it is close to the ground it needs to slow down to prepare for landing. So instead of simply redefining it as 1/(v + 1), we take the vertical position into consideration and come up with the following definition:
/* Defining disFit and airTime */
if position.y < h1 {
    disFit = disFit × r;
    if position.y < h2
        airTime = 1/(wv × v + 1);
} else
    airTime = 1/(v + 1);

where h1 and h2 (h1 > h2) are values of height at which we think we should change our strategies, and wv is the weight that helps slow down the velocity of the lunar module to very small values when it nearly reaches the ground. r is a scaling factor. Then the reward function we build is:
w1 · disFit + w2 · rotFit + w3 · airTime    (3)

where the wi and rotFit are the same as in (1).
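Putting equations (2) and (3) and the height rules together, here is a small C++ sketch of one possible reading (the exact form of equation (2), the behavior when h2 ≤ y < h1, and the ModuleState fields are reconstructions, not the paper's code; the numeric defaults come from the experiment section):

#include <cmath>

struct ModuleState
{
    double disFitX, disFitY;   // distance value functions along x and y
    double rotFit;             // orientation value function
    double y;                  // current height
    double v;                  // current speed
};

double reinforcement(const ModuleState& s,
                     double w1 = 1, double w2 = 400, double w3 = 4,
                     double wy = 3, double wv = 6, double r = 1.7,
                     double h1 = 100, double h2 = 30)
{
    // Equation (2): combine the x and y distance terms, down-weighting y by wy.
    double disFit = std::sqrt(s.disFitX * s.disFitX
                              + (s.disFitY / wy) * (s.disFitY / wy));

    // Height rules: below h1 the distance term is scaled by r; below h2 the
    // airTime term switches to 1/(wv*v + 1) to push the module to slow down.
    double airTime = 1.0 / (s.v + 1.0);            // assumed default elsewhere
    if (s.y < h1)
    {
        disFit *= r;
        if (s.y < h2)
            airTime = 1.0 / (wv * s.v + 1.0);
    }

    // Equation (3): same weights and rotFit as the fitness function (1).
    return w1 * disFit + w2 * s.rotFit + w3 * airTime;
}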
Algorithm Description
In each epoch of the GA, the evolution of its genomes is done by three operators: selection, crossover and mutation. The selection is based on elitism, while the crossover and mutation are random, which leads to the unstable performance of the general GA. In order to better perform the evolution, we insert a reinforcement component whose idea comes from the reward in Q-Learning. There are two strategies to do this. The first one is on-line updating, which is similar to other reinforcement learning algorithms. The second one is off-line updating, which updates the whole genome at one time before each epoch. We choose the latter based on the consideration of both the standard mechanism of GA and the real-time property of the problem. The high-level description of the guided GA is shown below:
algorithm guided genetic;
begin
    obtain last generation;
    put a few best individuals directly into new generation;
    use crossover operator to generate new generation;
    use mutate operator on the new generation;
    evolve the new generation;
end
What we add here is the last step whose input is the mutated new generation. Below is the procedure:
procedure evolve;
begin
    for each individual i in the generation
        for each gene in i's action-duration series
            get duration d, current state s;
            from state s consider all possible actions a' with duration d,
                and suppose s' are the possible resulting states;
            select a' and s' based on equation (3);
            if s' satisfies the safe landing requirement
                a ← a';
            else if a' ≠ a
                a ← a' with probability (1 − ε);
            update state;
end
where the greedy rate ε has the same meaning as in the ε-greedy policy in reinforcement learning. For any given gene of an individual's genome, there are 4 possible actions and numerous durations (in our implementation for this problem the duration ranges from 1 to 30, which means for any given state there are 120 possible states in the next step). We only change the action in the action-duration pair, so that for any given state there are only 4 possible states in the next step.
We use the ε-greedy policy here, but unlike the so-called greedy genetic algorithm (Ahuja et al. 1995), which focuses on greedy crossover, guided GA is inspired by (Ito and Matsuno 2002), in which the authors used a Q-table to integrate Q-Learning with GA. However, for our problem using a Q-table won't work because of the large state space. Instead, we use the above method to directly insert the reinforcement component into the evolution procedure without saving any previous state or function value in memory.
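The sketch below restates the evolve procedure in C++, reusing the Genome/Action types and the reinforcement() function from the sketches above. simulate() and safe_landing() stand in for the game physics and threshold checks, and reading "update state" as advancing the simulation with the possibly updated action is an assumption.

#include <random>

// Stand-ins assumed to exist elsewhere.
ModuleState simulate(const ModuleState& s, Action a, int duration);
bool safe_landing(const ModuleState& s);

void evolve(Genome& g, ModuleState state, double epsilon, std::mt19937& rng)
{
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    const Action all_actions[] = { Action::RotateLeft, Action::RotateRight,
                                   Action::Thrust, Action::Drift };

    for (Gene& gene : g.genes)
    {
        // Try every possible action with this gene's duration and score the
        // simulated result with the reinforcement function (3).
        Action      best      = gene.action;
        ModuleState best_next = simulate(state, gene.action, gene.duration);
        double      best_r    = reinforcement(best_next);
        for (Action a : all_actions)
        {
            ModuleState next = simulate(state, a, gene.duration);
            double r = reinforcement(next);
            if (r > best_r) { best = a; best_next = next; best_r = r; }
        }

        // Replace the gene's action: always if the best move lands safely,
        // otherwise with probability (1 - epsilon).
        if (safe_landing(best_next))
            gene.action = best;
        else if (best != gene.action && coin(rng) < 1.0 - epsilon)
            gene.action = best;

        // Advance the simulated state using the (possibly updated) action.
        state = simulate(state, gene.action, gene.duration);
    }
}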
EXPERIMENTAL DETAILS
Experimental Design and Results
We conducted an experiment to test the performance of our guided GA and compared it with the general GA and Q-Learning. For guided GA and general GA, we made all variables the same for both of them to ensure fairness. The parameters of our experiment were given as follows:
1. Both of the two GAs contained 100 individuals in one generation. The maximum number of generations was 500. A run was considered a failure if it did not achieve the goal within 500 generations, and it would then start from the beginning again. The length of the chromosome was 50. The crossover rate was 0.7 and the mutation rate was 0.05. The greedy rate ε was 0.1.
2. The thresholds for the safe landing requirements were:
(a) Distance = 10.0
(b) Velocity = 0.5
(c) Rotation = π/16
3. Defining the values of the weights was the most difficult work for the experiment. Below are the best value settings, which were selected by empirical study:
(a) w1 = 1; w2 = 400; w3 = 4 (taken from (Buckland and LaMothe 2002))
(b) wy = 3; wv = 6; r = 1.7; h1 = 100; h2 = 30
We also introduced the same feature of guided GA into the Q-Learning implementation for building its reward function.
To learn to solve a problem by reinforcement learning, the learning agent must achieve the goal (by trial and error) at least once (Lin 1992). Testing results showed that general and guided GA were able to achieve the goal almost every time. However, it was very hard for Q-Learning to complete the task. Besides the general reasons, such as the function approximation strategy often falling into local optima and Q-Learning converging too slowly, we believe that another important reason is that in this real-time problem the control of the duration of each action is crucial. GAs could evolve the durations with the crossover and mutation operations, but Q-Learning could not. Adding duration together with action into the state space might make the state space extremely huge, thus leading to Q-Learning's failure. Based on this fact, we only compared the data we got from the tests using general GA and guided GA. The experimental results from running both general and guided GA for 25 trials are shown in Figure 3.
Figure 3: Experimental Results
From the results we can observe that most of the time the performance of the guided GA was much higher than that of the general GA, except in the last trial. Figure 4 shows the fitness that both of them gained during all the generations before the last generation in the 13th trial. According to the data, both the highest and the average fitness of guided GA were higher than those of general GA.
Analysis
Some questions came up when we observed the data of the results. First, what was the goal's fitness/reinforcement value? Second, why was the highest fitness of guided GA much higher than that of general GA while they achieved the goal in very similar numbers of steps? Third, why did guided GA lose in the last trial while performing much better in previous trials?
Figure 4: Fitness Gained in the 13th Trial
We used the thresholds for the problem to calculate the fitness value and found that the fitness value of the goal was no more than 900. The reason why those individuals with very high fitness values failed to achieve the goal was that there were three parameters in the fitness/reinforcement function. No matter how high a fitness value a certain individual gained, as long as there was one parameter whose value was above the threshold, it failed to achieve the goal. So it was possible for an individual with a low fitness to achieve the goal in the next generation by randomly evolving its genome, accidentally hitting all the thresholds and triggering a sudden success. And that was the reason that sometimes the individual that achieved the goal was not the one that performed the best in the previous state.
Both general and guided GA involve randomness that brings uncertainty to the procedure. So the possible explanation for the third question was that the randomness caused a sudden success for the general GA before the guided GA got out of certain locally optimal states.
Although the highest fitness in each step did not tell us much, the average fitness was useful, because a higher average fitness demonstrated a better chance for the whole generation to achieve the goal. For all the trials we observed, the average fitness of guided GA was much higher than the average fitness of general GA.
CONCLUSION AND FUTURE WORK
In this paper we proposed a guided genetic algorithm by adding a reinforcement component to GA. We successfully applied the guided GA to the planning of the Lunar Lander game. Based on the experimental results, guided GA achieved much higher performance than general GA and Q-Learning.
The guided GA which we proposed in this paper demonstrated very good performance. However, it still has some shortcomings and has the potential to be improved. Possible improvement directions are: first, figure out a method that could update the reinforcement function more effectively; second, optimize the procedure of crossover and mutation; and last but not least, find some rules to model the reinforcement function without doing trial and error.
ACKNOWLEDGEMENT
We would like to thank Dr. David Poole, Dr. Michiel van de Panne, Dr. Chris Gaskett and Joel Lanir for their invaluable suggestions on our work. We also thank the peer reviewers for their valuable comments.
REFERENCES
R. Ahuja, J. Orlin, and A. Tivari. A greedy genetic algorithm for the quadratic assignment problem. Working paper 3826-95, Sloan School of Management, MIT, 1995.
A.G. Barto, R.S. Sutton, and C.W. Anderson. Neurolike adaptive elements that can learn difficult control problems. IEEE Trans. on Systems, Man and Cybernetics, 1983.
G. Boone. Minimum-time control of the acrobot. International Conference on Robotics and Automation, 1997.
M. Buckland and A. LaMothe. AI Techniques for Game Programming. Premier Press, 2002.
C.K. Chiang, H.Y. Chung, and J.J. Lin. A self-learning fuzzy logic controller using genetic algorithms with reinforcements. IEEE Transactions on Fuzzy Systems, 1997.
C. Gaskett, D. Wettergreen, and A. Zelinsky. Q-learning in continuous state and action spaces. In Australian Joint Conference on Artificial Intelligence, pages 417-428, 1999.
D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Kluwer Academic Publishers, 1989.
K. Ito and F. Matsuno. A study of reinforcement learning for the robot with many degrees of freedom - acquisition of locomotion patterns for multi-legged robot. In ICRA '02, IEEE International Conference on Robotics and Automation, volume 4, pages 3392-3397, 2002.
C.F. Juang. Combination of online clustering and Q-value based GA for reinforcement fuzzy system design. IEEE Transactions on Fuzzy Systems, 2005.
L.P. Kaelbling, M.L. Littman, and A.P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
C.T. Lin and C.P. Jou. Controlling chaos by GA-based reinforcement learning neural network. IEEE Transactions on Neural Networks, 1999.
L.J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn., 8(3-4):293-321, 1992.
C.K. Monson, D. Wingate, K.D. Seppi, and T.S. Peterson. Variable resolution discretization in the joint space. International Conference on Machine Learning and Applications, 2004.
A.W. Moore and C.G. Atkeson. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Mach. Learn., 21(3):199-233, 1995.
R. Munos and A. Moore. Variable resolution discretization in optimal control. Mach. Learn., 49(2-3):291-323, 2002.
A.Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Inverted autonomous helicopter flight via reinforcement learning, 2004.
W.D. Smart and L.P. Kaelbling. Practical reinforcement learning in continuous spaces. In Proc. 17th International Conf. on Machine Learning, pages 903-910. Morgan Kaufmann, San Francisco, CA, 2000.
R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
BIOGRAPHY
ZHANGBO LIU studies computer science at the University of British Columbia, Canada. His main research interests are human-computer interaction and artificial intelligence in games.