Q-Learning

In this problem, since we are dealing with a two-dimensional state space, we replace Q(s, a) with Q(s1, s2, a), but other than that, the Q-learning algorithm remains more or less the same.

To recap, the algorithm is as follows:

  1. Initialize Q(s1, s2, a) by setting all of the elements equal to small random values;

  2. Observe the current state, (s1, s2);

  3. Based on the exploration strategy, choose an action to take, a;

  4. Take action a and observe the resulting reward, r, and the new state of the environment, (s1', s2');

  5. Update Q(s1, s2, a) based on the update rule:

Q'(s1, s2, a) = (1 - w)*Q(s1, s2, a) + w*(r + d*max_a' Q(s1', s2', a'))

where w is the learning rate and d is the discount rate (a sketch of this update in code is given after the list);

  6. Repeat steps 2–5 until convergence.
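For concreteness, the update in step 5 can be written as a single NumPy operation on a Q-table indexed by the two discretized state variables and the action. This is only a minimal sketch: the table shape and the names Q, s1, s2, a, s1_new, s2_new, w and d are illustrative assumptions rather than anything fixed by the algorithm itself.

import numpy as np

# Step 1: a Q-table of small random values; the 19 x 15 x 3 shape is an assumed discretization
Q = np.random.uniform(low=-1, high=1, size=(19, 15, 3))

def update_q(Q, s1, s2, a, r, s1_new, s2_new, w, d):
    # Q'(s1, s2, a) = (1 - w)*Q(s1, s2, a) + w*(r + d*max_a' Q(s1', s2', a'))
    Q[s1, s2, a] = (1 - w) * Q[s1, s2, a] + w * (r + d * np.max(Q[s1_new, s2_new]))
    return Q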

To implement Q-learning in OpenAI Gym, we need ways of observing the current state, taking an action, and observing the consequences of that action. These can be done as follows.

The initial state of an environment is returned when you reset the environment:

> print(env.reset())
array([-0.50926558, 0. ])

To take an action (for example, a = 2), it is necessary to "step forward" the environment by that action using the step() method. This returns a 4-tuple giving the new state, the reward, a Boolean indicating whether or not the episode has terminated (due to the goal being reached or 200 steps having elapsed), and any additional information (this is always empty for this problem).

> print(env.step(2))
(array([-0.50837305, 0.00089253]), -1.0, False, {})
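In code, this tuple is typically unpacked into separate variables so that each piece can be used in the update rule; the variable names below are illustrative:

# Take action 2 (push right in MountainCar) and unpack the result
state2, reward, done, info = env.step(2)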

If we assume an epsilon-greedy exploration strategy where epsilon decays linearly to a specified minimum (min_eps) over the total number of episodes, we can put all of the above together with the algorithm from the previous section and produce the following function for implementing Q-learning.
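A minimal sketch of such a function is given below. It assumes the MountainCar-v0 environment and the reset/step interface shown above, and it discretizes the continuous (position, velocity) state to a grid by scaling position by 10 and velocity by 100; the function name QLearning, the grid resolution, and the variable names are illustrative choices rather than the only possible ones.

import numpy as np
import gym

def QLearning(env, learning, discount, epsilon, min_eps, episodes):
    # Size of the discretized state grid (position scaled by 10, velocity by 100)
    num_states = (env.observation_space.high - env.observation_space.low) * np.array([10, 100])
    num_states = np.round(num_states, 0).astype(int) + 1

    # Step 1: initialize the Q-table with small random values
    Q = np.random.uniform(low=-1, high=1,
                          size=(num_states[0], num_states[1], env.action_space.n))

    reward_list = []
    # Amount by which epsilon decays linearly each episode
    reduction = (epsilon - min_eps) / episodes

    for i in range(episodes):
        done = False
        tot_reward = 0

        # Step 2: observe (and discretize) the initial state
        state = env.reset()
        state_adj = np.round((state - env.observation_space.low) * np.array([10, 100])).astype(int)

        while not done:
            # Step 3: epsilon-greedy action selection
            if np.random.random() < 1 - epsilon:
                action = np.argmax(Q[state_adj[0], state_adj[1]])
            else:
                action = np.random.randint(0, env.action_space.n)

            # Step 4: take the action, observe the reward and the new state
            state2, reward, done, info = env.step(action)
            state2_adj = np.round((state2 - env.observation_space.low) * np.array([10, 100])).astype(int)

            # Step 5: update the Q-table
            if done and state2[0] >= 0.5:
                # Goal reached: no future value to bootstrap from
                Q[state_adj[0], state_adj[1], action] = reward
            else:
                Q[state_adj[0], state_adj[1], action] = (
                    (1 - learning) * Q[state_adj[0], state_adj[1], action]
                    + learning * (reward + discount * np.max(Q[state2_adj[0], state2_adj[1]]))
                )

            tot_reward += reward
            state_adj = state2_adj

        # Decay epsilon linearly towards min_eps
        if epsilon > min_eps:
            epsilon -= reduction

        reward_list.append(tot_reward)

    env.close()
    return reward_list

For example, with env = gym.make('MountainCar-v0'), a call such as rewards = QLearning(env, 0.2, 0.9, 0.8, 0, 5000) would run 5,000 training episodes and return the total reward earned in each one (the hyper-parameter values here are purely illustrative).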
