
An Intro to Reinforcement Learning


Recently I have been reading a book by Richard Sutton and Andrew Barto called “Reinforcement Learning: An Introduction”. It gave me a little primer on what Reinforcement Learning really means, and why it’s different from just another neural net. I have tried my best to understand the core concepts, even though some of the math is still unfathomable to me, and I will try to summarise what I have learned.

What is it?

Reinforcement learning (or RL) solves a very specific problem: figuring out how to act over time so that you get the most long-term reward. Both parts, the sequence of actions and the reward, are important components in deciding whether RL is a “good” approach to a problem.

For example, say WALL-E is the last living robot on Earth and we want to get it home. Without thinking about the sequence of actions it takes, it will just roam around aimlessly, which is futile. Let’s think about how an RL approach could get WALL-E home.

What it isn’t

There are many things with the word “learning” in them that aren’t Reinforcement Learning.

supervised learning. This is a kind of Machine Learning where someone gives you training data that has everything labelled correctly, you learn from it, and you hope that at exam time what you’ve learnt holds up. This is the “I have 1000 images of cats, now I can tell you if this new image is a cat” problem.

unsupervised learning. This is another kind of Machine Learning where someone gives you a bunch of data with no labels, and just by staring at it you try to find structure in it and make up your own labels. This is the “I have 1000 images of cats and dogs, but nobody told me what a cat or a dog looks like; now I can tell you if this new image is like what I call a cat or what I call a dog” problem. Classification is a very common problem that can be solved with both of these Machine Learning approaches (but can’t be solved very well with RL, which isn’t really suited for one-off actions).

Neural nets are very good at solving these two kinds of Machine Learning problems. For example, those images straight out of your nightmares are created by a “deep” neural net, which is just a neural net that has several layers in between its input and output layers.

If you put some neural nets on top of some Reinforcement Learning algorithms, you get something called Deep Reinforcement Learning, which is the area of research that brought you programs that beat world champions at Go.

The world

RL problems are usually set up in a scenario that is built out of states, and you can move between them by taking actions. Once you take an action, you reap a reward, and you keep doing this until someone tells you to stop.

In the WALL-E example, the states could be the (x, y) positions on the map, and you move between two states (i.e. locations) by moving the motors in a particular direction. The reward might be set up in such a way that you only get 1 point if you reach the home base, and 0 otherwise. If there are particularly dangerous spots in the world you want to make sure WALL-E learns to avoid (like cliffs, or a cat), you can give any action that ends up in one of those states a reward of -1.
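To make that concrete, here’s a tiny, made-up sketch of WALL-E’s world in Python. The grid size, the home position, and the cliff locations are all invented for illustration:

```python
# A toy 5x5 grid world for WALL-E. States are (x, y) positions,
# actions move the motors one cell in a direction.
HOME = (4, 4)               # reaching home base is worth +1
CLIFFS = {(2, 2), (3, 1)}   # dangerous spots get -1

ACTIONS = {
    "up": (0, 1),
    "down": (0, -1),
    "left": (-1, 0),
    "right": (1, 0),
}

def step(state, action):
    """Apply an action and return (next_state, reward)."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), 4)   # stay inside the 5x5 grid
    y = min(max(state[1] + dy, 0), 4)
    next_state = (x, y)
    if next_state == HOME:
        return next_state, 1.0
    if next_state in CLIFFS:
        return next_state, -1.0
    return next_state, 0.0
```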

Some environments are less like real worlds and more like abstract worlds: when you’re playing Texas Hold’em poker, the state you’re in could be the hand you have and the cards on the table, and the actions could be folding, raising, or checking. If you only give a reward at the end of the game, it’s difficult to know how you’re actually doing along the way. These problems have much more complicated reward signals (and certainly states): how the other players are doing, which ones are staying in, and how they’re playing all need to be taken into consideration.

If you’re interested in the math behind this, the environments are usually represented by a Markov Decision Process (MDP), or a Partially Observable Markov Decision Process (POMDP). The difference between the two is that in the latter case you’re not told exactly what your state in the world is. You still know what actions you took and what reward you’re accumulating, but since you don’t know what they actually mean in the world, you have to make up your own representation of it. These are typically harder and weirder problems, and they were the ones I was focusing my research on, btw!
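Very roughly, a finite MDP boils down to a handful of ingredients. Here’s a little sketch of how you might hold them in Python (the field names are mine, not the book’s):

```python
from typing import Dict, List, NamedTuple, Tuple

State = Tuple[int, int]   # e.g. an (x, y) position for WALL-E
Action = str              # e.g. "up", "down", "left", "right"

class MDP(NamedTuple):
    states: List[State]                                           # everywhere the agent can be
    actions: List[Action]                                         # everything the agent can do
    transitions: Dict[Tuple[State, Action], Dict[State, float]]   # P(next state | state, action)
    rewards: Dict[Tuple[State, Action, State], float]             # reward for each transition
    gamma: float                                                  # discount factor: how much we care about future rewards

# In a POMDP you never get to see the state directly; you only receive
# observations, and you have to build your own belief about where you are.
```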

Learning how to act

The thing that tells us which action to take in a state is our policy. If we can figure out the best action to take in every state in the world, then we have an optimal policy.

In order to figure out if one policy is better than another, we need a way to measure how “good” it is to be in a certain state when following that policy (so we can compare: one state might lead you to a pot of gold, another to sudden death, and one of those is clearly superior). We call this the value of a state, and it’s basically the reward we expect to get from that state onwards if we follow what the policy tells us to do.

The expected reward bit is subtle but important: if we only cared about immediate reward, any state that doesn’t lead to instant death sounds pretty good! However, if we keep taking these seemingly-ok-because-they-didn’t-kill-us actions, we might still end up at the edge of a cliff, one step away from instant death. By looking ahead a number of steps, we avoid being trapped in useless trajectories like this.
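For the notation-curious, the book writes this “expected reward from a state onwards” using a discount factor γ (a number between 0 and 1). Roughly, the value of a state s under a policy π is:

```latex
v_\pi(s) = \mathbb{E}_\pi\left[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \;\middle|\; S_t = s \,\right]
```

A γ close to 0 makes the agent short-sighted (it only cares about the next reward), while a γ close to 1 makes it care about rewards far in the future.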

Most basic RL algorithms try to learn one of these functions:

  - the state-value function, which is the value of every state in the world. This is usually called V (for value).
  - the action-value function, which is the value of taking an action from a state, for all actions and states in the world. This is usually called Q.

The difference between the two is potentially religious. The state-value function basically says “where you are in the world is important, so figure out the sequence of good states and follow that”. The action-value function says “we’re in a state, and some of the actions we can take are awesome, and some are terribad, so figure out the awesome ones”.

The point of an RL algorithm is basically to learn one of these functions, and then, in every state, pick the action with the highest value: that’s your optimal policy!
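In code, “pick the action with the highest value” really is just a lookup. Here’s a minimal sketch, assuming Q is a dictionary keyed by (state, action) pairs like in the grid-world example:

```python
def greedy_policy(Q, state, actions):
    """Act greedily: return the action with the highest Q value in this state."""
    return max(actions, key=lambda a: Q[(state, a)])
```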

How do we learn?

We learn things about the world by exploring the world. You can think about it as roaming the world in “practice mode”, which gives you experience, which helps you learn what your policy is (what to do in a particular state). When it’s “exam time mode”, you use the policy you’ve learnt and act according to that. The more data you have, the better you learn.

If we think about our practice policy as the way we decided to act while in practice mode, and our optimal policy as the way we will act during “exam time” (always be the very best you at exams), then there are two fundamentally different ways in which you can learn:

  - on-policy learning: in practice mode, you are following the practice policy to explore the environment, and learning how well it works. The more you learn, the better it gets. In “exam time mode”, you still use this practice policy you’ve perfected.
  - off-policy learning: in practice mode, you are following the practice policy to explore the environment, and learning what the optimal policy should look like, based on what you’re discovering. In “exam time mode”, you would use the optimal policy you’ve been learning.

I’ll try doing a demo of Q-learning (which learns the action-value function, i.e. the value of all of the actions, from all of the states). It works like this (there’s a code sketch of the whole loop after the steps):

  1. Initialize a Q function so that the value of any action from any state is a random number. This little thing is important so that we don’t start out with a biased policy.
  2. Start in a random state (let’s call it S).
  3. From this state, we need to figure out how to move in the world. We’re going to use a policy called epsilon-greedy: most of the time, we move according to what the policy says (“greedily”), but epsilon percent of the time we move randomly. This means we still get to do some random exploration, which is important to make sure we see new states we might not see otherwise.
  4. Pick an action this way (let’s call it A) and follow it! Once we take it, we’ll end up in a new state S_2, and the world tells us the reward we got; let’s call it R. We’re going to use this reward to update our Q function for the state we were in and the action we took; more precisely, we’re going to update our Q(S,A) value. Note how we basically always update the last state-action pair by seeing the results of that action in the world.
  5. The update step is a slightly hairy equation (which I don’t fully get either, I mostly care about the output). The standard Q-learning update is Q(S,A) ← Q(S,A) + α[R + γ·max_a Q(S_2,a) − Q(S,A)], where α is a learning rate and γ is the discount factor. The TL;DR is: if this was a good action, then the state we ended up in should be a better state than the one we were in (closer to the goal). If we got a bad reward, we reduce the value of Q(S,A); if we didn’t, we increase it.
  6. This is an off-policy algorithm: how we calculate the Q(S,A) values isn’t affected by how we actually moved in the world; we assume we followed the greedy (a.k.a. always best) policy, even if we didn’t.

Anyway, now we’re in a new state, so we go back to Step 3. Repeat Steps 3-6 until you end up in the goal state, which is our destination. Once we do, we can go back to Step 2 and start again from a new random state, to see parts of the world we haven’t explored yet.
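Putting the steps together, here’s a rough sketch of tabular Q-learning on the toy WALL-E grid from earlier. The grid, the number of episodes, and the values of ALPHA, GAMMA and EPSILON are all made up for illustration, not anything from the book:

```python
import random

# The toy grid world from before: (x, y) states, +1 for home, -1 for cliffs.
HOME, CLIFFS = (4, 4), {(2, 2), (3, 1)}
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Apply an action, clamp to the grid, and return (next_state, reward)."""
    dx, dy = ACTIONS[action]
    nxt = (min(max(state[0] + dx, 0), 4), min(max(state[1] + dy, 0), 4))
    reward = 1.0 if nxt == HOME else -1.0 if nxt in CLIFFS else 0.0
    return nxt, reward

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate
STATES = [(x, y) for x in range(5) for y in range(5)]

# Step 1: initialise Q(S, A) with a random value for every state-action pair.
Q = {(s, a): random.random() for s in STATES for a in ACTIONS}

def epsilon_greedy(state):
    """Step 3: mostly act greedily, but explore EPSILON of the time."""
    if random.random() < EPSILON:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(5000):
    state = random.choice(STATES)                 # Step 2: start somewhere random
    while state != HOME:
        action = epsilon_greedy(state)            # Step 3: pick an action A
        next_state, reward = step(state, action)  # Step 4: take it, observe S_2 and R
        # Step 5: nudge Q(S, A) towards R plus the value of the *best* action
        # from S_2. Using that max regardless of how we actually move next
        # is what makes this off-policy (Step 6).
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state                        # back to Step 3 from the new state
```

After enough episodes, greedily following Q (always taking the highest-valued action) should steer WALL-E around the cliffs and towards home.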

If we do this enough times, we eventually experience enough of the world that our Q function will tell us what to do! In the next tutorial, I will try to follow up with a working demo so we can see the result. Until then, ciao!
