Monday, December 7, 2015

Two new first steps - AI projects to study

Neural Slime Volleyball



[Screenshot: neural_slime_volleyball]
Recurrent neural network playing slime volleyball.  Can you beat them?
I remember playing a game called Slime Volleyball back in the day when Java applets were still popular.  Although the game had somewhat dodgy physics, people like me were hooked on its simplicity, and spent countless hours at night playing the game in the dorm rather than getting any actual work done.
As I can’t find any versions on the web apart from the old antiquated Java applets, I set out to create my own JS + HTML5 canvas version of the game (complete with the unrealistic arcade-style ‘physics’).  I also set out to apply the genetic algorithm coded earlier to train a simple recurrent neural network to play slime volleyball.  Basically, I want to find out whether even a simple conventional neuroevolution technique can train a neural network to become an expert at this game, before exploring more advanced methods such as NEAT.
The first step was to write a simple physics engine to get the ball to bounce off the ground, collide with the fence, and collide with the players.  This was done using the designer-artist-friendly p5.js library in JavaScript for the graphics, plus some simple physics math routines.  I had to brush up on my vector maths to get the ball-bouncing function to work properly.  After this was all done, the next step was to add keyboard / touch controls so that the players can move and jump around, even when using a smartphone or tablet.
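To give a flavour of the bounce math, here is an illustrative sketch only (the function and field names are assumptions, not the actual source): the ball's velocity is reflected about the collision normal when it hits a player.

function bounceOffPlayer(ball, player) {
  // unit normal pointing from the player's centre to the ball at contact
  var nx = ball.x - player.x;
  var ny = ball.y - player.y;
  var len = Math.sqrt(nx * nx + ny * ny);
  nx /= len;
  ny /= len;
  // reflect the velocity about the normal: v' = v - 2 (v . n) n
  var dot = ball.vx * nx + ball.vy * ny;
  if (dot < 0) {  // only bounce if the ball is moving toward the player
    ball.vx -= 2 * dot * nx;
    ball.vy -= 2 * dot * ny;
  }
}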
The fun and exciting part was creating the AI module to control the agent, and seeing whether it could become good at playing the game.  As an initial test, I ended up using the basic CNE method implemented earlier to train a standard recurrent neural network, hacked together using the convnet.js library.  Below is a diagram of the recurrent network we will train to play slime volleyball, where the magic happens:
[Diagram: slime_rnn, the recurrent network that controls the agent]
The inputs of the network are the position and velocity of the agent, the position and velocity of the ball, and those of the opponent.  The outputs are three signals that trigger the ‘forward’, ‘backward’, and ‘jump’ controls.  In addition, an extra 4 hidden neurons act as hidden state and are fed back into the input, so the network is essentially an infinitely deep feed-forward neural network that can potentially remember previous events and states automatically, in the hopes of formulating more complicated gameplay strategies.  One thing to note is that an output fires only if its signal is higher than a certain threshold (0.75).
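As a rough sketch of what this recurrent controller does at each time step (plain JavaScript with assumed names and sizes, not the actual convnet.js-based implementation):

var N_INPUT  = 12;  // agent (x, y, vx, vy), ball, and opponent
var N_HIDDEN = 4;   // recurrent hidden units, fed back each step
var N_OUTPUT = 3;   // forward, backward, jump

function RNNController(weights, biases) {
  this.w = weights;   // (N_OUTPUT + N_HIDDEN) rows x (N_INPUT + N_HIDDEN) columns
  this.b = biases;    // length N_OUTPUT + N_HIDDEN
  this.hidden = [0, 0, 0, 0];
}

RNNController.prototype.step = function (gameInputs) {
  var x = gameInputs.concat(this.hidden);  // append the previous hidden state
  var out = [];
  for (var i = 0; i < N_OUTPUT + N_HIDDEN; i++) {
    var sum = this.b[i];
    for (var j = 0; j < x.length; j++) {
      sum += this.w[i][j] * x[j];
    }
    out.push(Math.tanh(sum));              // squash to (-1, 1)
  }
  this.hidden = out.slice(N_OUTPUT);       // feed the hidden units back next step
  return {                                 // fire only above the 0.75 threshold
    forward:  out[0] > 0.75,
    backward: out[1] > 0.75,
    jump:     out[2] > 0.75
  };
};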
I also made the agent’s state representation independent of whether the agent was playing on the left or the right side of the fence, by measuring locations relative to the fence and mirroring the ball and opponent coordinates according to which side the agent is playing on.  That way, a trained agent can use the same neural network to play on either side of the fence.
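A hypothetical sketch of this side-agnostic observation (the names and ordering here are assumptions for illustration only):

function relativeState(agent, ball, opponent, fenceX, onRightSide) {
  var dir = onRightSide ? -1 : 1;  // mirror the x-axis for the right-hand player
  return [
    dir * (agent.x - fenceX),    agent.y,    dir * agent.vx,    agent.vy,
    dir * (ball.x - fenceX),     ball.y,     dir * ball.vx,     ball.vy,
    dir * (opponent.x - fenceX), opponent.y, dir * opponent.vx, opponent.vy
  ];
}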
Rather than the sigmoid function, I ended up using the hyperbolic tangent (tanh) function, which convnet.js supports, for the activations.
The tanh function is defined as:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
The tanh function is a reasonable activation function for this network, as it tends towards +1 or -1 as its input is pushed one way or the other.  The x-axis in the graph below corresponds to the network’s inputs: the locations and velocities of the agent, the ball, and the opponent (all scaled to be around +/- 1.0, give or take another 1.0), and also the output and hidden states of the neural network (which are within +/- 1.0 by definition).
[Figure: graph of the tanh function]
As velocities and ball locations can be positive or negative, this is a more natural and possibly more efficient choice than the sigmoid.  As explained earlier, I also scaled the inputs so they were all on the order of +/- 1.0, similar to the output states of the hidden neurons, so that all inputs to the network have roughly the same magnitude on average.
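For illustration only, scaling the twelve raw inputs into that range could look something like this (the constants here are assumptions, not the values used in the demo):

var SCALE_POS = 1.0 / 100.0;  // court units -> roughly +/- 1.0
var SCALE_VEL = 1.0 / 10.0;   // velocity units -> roughly +/- 1.0

function scaleInputs(raw) {
  // raw is ordered as [x, y, vx, vy] per entity (agent, ball, opponent)
  return raw.map(function (v, i) {
    return (i % 4 < 2) ? v * SCALE_POS : v * SCALE_VEL;
  });
}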
Training such a recurrent neural network required tweaks to the genetic algorithm trainer I made earlier, since there’s no fitness function that returns a score: an agent either wins or loses a match.  What I ended up doing was writing a similar training function that gets each agent in the training population to play against other agents.  If the agent wins, its score increases by one; if it loses, its score decreases by one.  On ties (games that last longer than the equivalent of 20 real seconds in simulation), no score is added or deducted.  Each agent plays against 10 random agents in the population in the training loop.  The top 20% of the population is kept, the rest discarded, and crossover and mutation are performed to produce the next generation.  This is sometimes referred to as the ‘arms race’ method of training agents to play a one-on-one game.  A rough sketch of this loop is shown below.
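Here is a hedged sketch of that tournament and selection loop; playMatch, crossover, and mutate are assumed helper functions, and the real trainer’s details may differ:

function tournamentScores(population, numOpponents) {
  var scores = population.map(function () { return 0; });
  for (var i = 0; i < population.length; i++) {
    for (var k = 0; k < numOpponents; k++) {
      var j = Math.floor(Math.random() * population.length);
      if (j === i) continue;
      // playMatch returns +1 if population[i] wins, -1 if it loses,
      // and 0 for a tie (a game lasting longer than ~20 simulated seconds)
      scores[i] += playMatch(population[i], population[j]);
    }
  }
  return scores;
}

function nextGeneration(population, scores) {
  // rank by score and keep the top 20%
  var ranked = population
    .map(function (agent, i) { return { agent: agent, score: scores[i] }; })
    .sort(function (a, b) { return b.score - a.score; });
  var survivors = ranked.slice(0, Math.floor(population.length * 0.2))
                        .map(function (r) { return r.agent; });
  // refill the population with crossover + mutation of random survivors
  var next = survivors.slice();
  while (next.length < population.length) {
    var a = survivors[Math.floor(Math.random() * survivors.length)];
    var b = survivors[Math.floor(Math.random() * survivors.length)];
    next.push(mutate(crossover(a, b)));
  }
  return next;
}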
With this method, the agents did not need to be hand-programmed with any heuristics or rules of the game; they simply explore the game and figure out how to win.  And the end result suggests they became quite good at it after a few hundred generations of evolution!  Check out the demo of the final result in the YouTube video below.

The next step could be to employ more advanced methods such as NEAT or ESP for the AI, but that may be overkill for a simple pong-like game.  The game is also a candidate for applying the Deep Q-Learner already built into convnet.js, as the game-playing strategy is quite simple.  For now, I think I have created a fairly robust slime volleyball player that is virtually impossible for a human player to beat consistently.
Try the game out yourself and see if you can beat it consistently.  It works on both desktop (keyboard control) and smartphone / tablet (touch controls).  The desktop version is easier to control, either via the keyboard arrows or by mouse dragging.  Feel free to play around with the source on GitHub, but apologies if it’s not the most neatly structured code, as it is intended to be more of a sketch than a proper program.
Update (13-May-2015)
This demo at one point reached the front page of Y Combinator’s Hacker News.  I made another demo showing the evolution of the agents’ behaviour over time, starting from knowing nothing at all.  Please see this post for more information.






========================================================================





dlib C++ Library

 Reinforcement Learning, Control, and 3D Visualization

source: http://blog.dlib.net/2015/06/reinforcement-learning-control-and-3d.html?m=0



Over the last few months I've spent a lot of time studying optimal control and reinforcement learning.  Aside from reading, one of the best ways to learn about something is to do it yourself, which in this case means a lot of playing around with the well-known algorithms and, for those I really like, including them in dlib, which is the subject of this post.  So far I've added two methods.  The first, added in a previous dlib release, was the well-known least squares policy iteration reinforcement learning algorithm.  The second, and my favorite so far due to its practicality, is a tool for solving model predictive control problems.

There is a dlib example program that explains the new model predictive control tool in detail.  The basic idea is that it takes as input a simple linear equation defining how some process evolves in time, and then tells you what control input you should apply to drive the process into some user-specified state.  For example, imagine you have an air vehicle with a rocket on it and you want it to hover at some specific location in the air.  You could use a model predictive controller to find out which direction to fire the rocket at each moment to get the desired outcome.  In fact, the dlib example program does just that.  It produces the following visualization, where the vehicle is the black dot and the green dot is the location where you want it to hover.  The rocket thrust is shown as the red line:






// The contents of this file are in the public domain. See LICENSE_FOR_EXAMPLE_PROGRAMS.txt
/*
    This is an example illustrating the use of the linear model predictive
    control tool from the dlib C++ Library.  To explain what it does, suppose
    you have some process you want to control and the process dynamics are
    described by the linear equation:
        x_{i+1} = A*x_i + B*u_i + C
    That is, the next state the system goes into is a linear function of its
    current state (x_i) and the current control (u_i) plus some constant bias
    or disturbance.

    A model predictive controller can find the control (u) you should apply to
    drive the state (x) to some reference value, which is what we show in this
    example.  In particular, we will simulate a simple vehicle moving around in
    a planet's gravity.  We will use MPC to get the vehicle to fly to and then
    hover at a certain point in the air.
*/
#include <dlib/gui_widgets.h>
#include <dlib/control.h>
#include <dlib/image_transforms.h>
using namespace std;
using namespace dlib;
// ----------------------------------------------------------------------------
int main()
{
    const int STATES = 4;
    const int CONTROLS = 2;

    // The first thing we do is setup our vehicle dynamics model (A*x + B*u + C).
    // Our state space (the x) will have 4 dimensions, the 2D vehicle position
    // and also the 2D velocity.  The control space (u) will be just 2 variables
    // which encode the amount of force we apply to the vehicle along each axis.
    // Therefore, the A matrix defines a simple constant velocity model.
    matrix<double,STATES,STATES> A;
    A = 1, 0, 1, 0,  // next_pos = pos + velocity
        0, 1, 0, 1,  // next_pos = pos + velocity
        0, 0, 1, 0,  // next_velocity = velocity
        0, 0, 0, 1;  // next_velocity = velocity

    // Here we say that the control variables effect only the velocity. That is,
    // the control applies an acceleration to the vehicle.
    matrix<double,STATES,CONTROLS> B;
    B = 0, 0,
        0, 0,
        1, 0,
        0, 1;

    // Let's also say there is a small constant acceleration in one direction.
    // This is the force of gravity in our model.
    matrix<double,STATES,1> C;
    C = 0,
        0,
        0,
        0.1;

    const int HORIZON = 30;
    // Now we need to setup some MPC specific parameters.  To understand them,
    // let's first talk about how MPC works.  When the MPC tool finds the "best"
    // control to apply it does it by simulating the process for HORIZON time
    // steps and selecting the control that leads to the best performance over
    // the next HORIZON steps.
    //
    // To be precise, each time you ask it for a control, it solves the
    // following quadratic program:
    //
    //     min    sum_i trans(x_i-target_i)*Q*(x_i-target_i) + trans(u_i)*R*u_i
    //   x_i,u_i
    //
    //     such that: x_0     == current_state
    //                x_{i+1} == A*x_i + B*u_i + C
    //                lower <= u_i <= upper
    //                0 <= i < HORIZON
    //
    // and reports u_0 as the control you should take given that you are currently
    // in current_state.  Q and R are user supplied matrices that define how we
    // penalize variations away from the target state as well as how much we want
    // to avoid generating large control signals.  We also allow you to specify
    // upper and lower bound constraints on the controls.  The next few lines
    // define these parameters for our simple example.

    matrix<double,STATES,1> Q;
    // Setup Q so that the MPC only cares about matching the target position and
    // ignores the velocity.
    Q = 1, 1, 0, 0;

    matrix<double,CONTROLS,1> R, lower, upper;
    R = 1, 1;
    lower = -0.5, -0.5;
    upper =  0.5,  0.5;

    // Finally, create the MPC controller.
    mpc<STATES,CONTROLS,HORIZON> controller(A,B,C,Q,R,lower,upper);

    // Let's tell the controller to send our vehicle to a random location.  It
    // will try to find the controls that makes the vehicle just hover at this
    // target position.
    dlib::rand rnd;
    matrix<double,STATES,1> target;
    target = rnd.get_random_double()*400,rnd.get_random_double()*400,0,0;
    controller.set_target(target);

    // Now let's start simulating our vehicle.  Our vehicle moves around inside
    // a 400x400 unit sized world.
    matrix<rgb_pixel> world(400,400);
    image_window win;
    matrix<double,STATES,1> current_state;
    // And we start it at the center of the world with zero velocity.
    current_state = 200,200,0,0;

    int iter = 0;
    while(!win.is_closed())
    {
        // Find the best control action given our current state.
        matrix<double,CONTROLS,1> action = controller(current_state);
        cout << "best control: " << trans(action);

        // Now draw our vehicle on the world.  We will draw the vehicle as a
        // black circle and its target position as a green circle.
        assign_all_pixels(world, rgb_pixel(255,255,255));
        const dpoint pos = point(current_state(0),current_state(1));
        const dpoint goal = point(target(0),target(1));
        draw_solid_circle(world, goal, 9, rgb_pixel(100,255,100));
        draw_solid_circle(world, pos, 7, 0);
        // We will also draw the control as a line showing which direction the
        // vehicle's thruster is firing.
        draw_line(world, pos, pos-50*action, rgb_pixel(255,0,0));
        win.set_image(world);

        // Take a step in the simulation
        current_state = A*current_state + B*action + C;
        dlib::sleep(100);

        // Every 100 iterations change the target to some other random location.
        ++iter;
        if (iter > 100)
        {
            iter = 0;
            target = rnd.get_random_double()*400,rnd.get_random_double()*400,0,0;
            controller.set_target(target);
        }
    }
}
// ----------------------------------------------------------------------------
