Deep Reinforcement Learning With Python | Part 2 | Creating & Training The RL Agent Using Deep Q Network (DQN)

Mohammed AL-Ma'amari
Towards Data Science
7 min read · Jul 9, 2020


In the first part, we went through making the game environment and explained it line by line. In this part, we are going to learn how to create and train a Deep Q Network (DQN) and enable the agent to use it to become an expert at our game.

In this part, we are going to be discussing :

1- Why Deep Q Network (DQN)?

2- What is DQN?

3- How does a DQN work?

4- Explaining our DQN Architecture.

5- Explaining the Agent Class.

6- Training the Agent.

Someone might ask, “Why didn’t you use Q-Learning instead of DQN?” The answer to this question depends on many things, such as:

How complex is the environment?

In our case, we can answer this question in two ways:

  • If we want the input to the RL agent to be as close as possible to what a human player sees, we will choose the input to be the array representation of the field.
[Figure: What the RL agent sees vs. what a human player sees]

In this case, the environment would be too complex for Q-Learning: the Q-Table would be so tremendously large that it would be impossible to store. To prove this, consider the following calculations:

Number of states the input array can have = (number of different values each cell can take) ^ (width × height)

Number of states the input array can have = 4 ^ (10 × 20)

= 4 ^ 200 ≈ 2.58225e120

Q-Table size = ACTION_SPACE size × number of states the input array can have

Q-Table size = 5 × 2.58225e120 ≈ 1.291125e121

To store a table with this number of entries (each entry is 8 bits, i.e. one byte), we would need roughly 1.29e109 terabytes.
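As a quick sanity check, the numbers above can be reproduced with a few lines of Python (the constant names here are only illustrative, not taken from the project code):

# Illustrative back-of-the-envelope check of the Q-Table size (names are hypothetical).
FIELD_WIDTH, FIELD_HEIGHT = 10, 20   # playing field dimensions
CELL_VALUES = 4                      # distinct values each cell of the array can take
ACTION_SPACE_SIZE = 5                # number of actions the agent can choose from

num_states = CELL_VALUES ** (FIELD_WIDTH * FIELD_HEIGHT)   # 4 ** 200 ≈ 2.58e120
q_table_entries = ACTION_SPACE_SIZE * num_states           # ≈ 1.29e121

# At one byte (8 bits) per entry, with 1 TB = 1e12 bytes
terabytes = q_table_entries / 1e12                          # ≈ 1.29e109 TB
print(f"states: {num_states:.5e}, entries: {q_table_entries:.5e}, storage: {terabytes:.2e} TB")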

That is why we simply use DQN instead of Q-Learning.

  • On the other hand, if you want to use Q-Learning, it would be more efficient to use another kind of input. For example, you could use the X coordinates of the player and the hole, the player’s width, and the hole’s width. This way the input is much simpler than the array representation.

What is DQN?

A DQN is simply a normal neural network; the only difference is in how it is used: its input is a state of the environment, and its output is an estimate of the Q-value (the expected future reward) of each possible action in that state. The agent then performs the action with the highest predicted Q-value.

We train it using Experience Replay and Replay Memory; these concepts will be explained in the next section.
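To make the action-selection side concrete, here is a minimal sketch of how an agent could turn the network’s Q-value predictions into an action. The epsilon-greedy exploration step and the function name are assumptions on my part, not the article’s exact code:

import numpy as np

def choose_action(model, state, epsilon=0.1, action_space_size=5):
    # With probability epsilon, explore by picking a random action.
    if np.random.random() < epsilon:
        return np.random.randint(0, action_space_size)
    # Otherwise exploit: predict the Q-values for this state and pick the best action.
    qs = model.predict(np.expand_dims(state, axis=0))[0]
    return int(np.argmax(qs))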

How Does a Deep Q Network (DQN) Work?

To fully understand how DQN works, you need to know some concepts related to DQN:

1- Experience Replay and Replay Memory:

Similar to the way humans learn from their memory of previous experiences, DQNs use this technique too.

Experience Replay: a piece of data collected after every step the agent performs; each experience contains [current_state, current_action, step_reward, next_state].

Replay Memory: a buffer of the last n experiences. It is mainly used to train the DQN: we take a random sample of experiences and use them as the input to the DQN.

Why use a random sample of replays instead of sequential replays?

  • When using sequential replays, the DQN tends to overfit instead of generalizing.

A key reason for using replay memory is to break the correlation between consecutive samples.
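A minimal sketch of such a replay memory, assuming the four-element experiences described above (the constant values and function names are illustrative, not the repository’s exact code):

import random
from collections import deque

REPLAY_MEMORY_SIZE = 50_000   # keep only the most recent experiences
MINIBATCH_SIZE = 128          # how many experiences to sample for one training step

# The oldest experience is discarded automatically once the deque is full.
replay_memory = deque(maxlen=REPLAY_MEMORY_SIZE)

def update_replay_memory(current_state, current_action, step_reward, next_state):
    # Store one experience after every step the agent performs.
    replay_memory.append((current_state, current_action, step_reward, next_state))

def sample_minibatch():
    # Random sampling breaks the correlation between consecutive steps.
    return random.sample(replay_memory, MINIBATCH_SIZE)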

2- Model and Target Model:

To get consistent results, we train two models. The first model, “model”, is fit after every step made by the agent; the second model, “target_model”, loads the weights of “model” every n steps (n = UPDATE_TARGET_EVERY).

We do this because, at the beginning, everything is random, from the initial weights of “model” to the actions performed by the agent. This randomness makes it harder for the model to perform good actions, but when a second model takes over the knowledge gained by the first one only every n steps, we get some degree of consistency.
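To make the interplay between “model” and “target_model” concrete, here is a sketch of a single training step, assuming experiences shaped like the ones above and an illustrative discount factor (the full code is in the GitHub repository linked below):

import numpy as np

DISCOUNT = 0.99   # illustrative discount factor (gamma)

def train_step(model, target_model, minibatch):
    current_states = np.array([experience[0] for experience in minibatch])
    next_states = np.array([experience[3] for experience in minibatch])

    # "model" predicts the Q-values we are going to update...
    current_qs_list = model.predict(current_states)
    # ...while "target_model" provides stable targets for the next states.
    future_qs_list = target_model.predict(next_states)

    X, y = [], []
    for index, (state, action, reward, next_state) in enumerate(minibatch):
        # Bellman update: immediate reward plus the discounted best future Q-value.
        # (Terminal states, where the future term is dropped, are omitted here for brevity.)
        new_q = reward + DISCOUNT * np.max(future_qs_list[index])
        current_qs = current_qs_list[index]
        current_qs[action] = new_q
        X.append(state)
        y.append(current_qs)

    model.fit(np.array(X), np.array(y), batch_size=len(minibatch), verbose=0)

# Every UPDATE_TARGET_EVERY steps, the target network is synchronized:
# target_model.set_weights(model.get_weights())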

Now that we have explained some key concepts, we can summarize the learning process. I will use the words of DeepLizard from this wonderful blog:

Explaining Our DQN Architecture:

For our DQN, many architectures were tried and many of them did not work, but eventually one architecture proved to work well.

Failed Tries:

  • One of the first failures was an architecture with two output layers: the first output layer was responsible for predicting the best move (left, right, or no move), while the other output layer was responsible for predicting the best width-changing action (increasing the width, decreasing the width, or not changing the width).
  • Other failures included networks that were too deep; in addition to their slow training process, their performance was too poor.

Finding Better Architectures:

After some failures, a grid search was performed to find architectures that could outperform humans playing the game. The following tables show the results of some grid searches:

Note: the tables are ordered so that the best result appears last.

From the results of the first grid search, we can clearly see that complicated and deep networks failed to learn how to play the game; on the other hand, the simplest network worked the best.

Using the results from the first grid search, another grid search was performed and gave some good results:

From these results we see that “Best Only” does not enhance the model’s performance; on the other hand, using both ECC (Epsilon Conditional Constentation) and EF (Epsilon Fluctuation) together can improve the model’s performance.

We will discuss ECC and EF in another blog.

Some other grid search results:

  • Testing “Best Only”:
  • Testing even simpler networks:

After all these grid searches, we finally settled on an architecture with one convolutional layer with 32 filters, a batch size of 128, and two dense (fully connected) layers with 32 nodes each, and we will use both ECC and EF together.

Network Summary:

Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 20, 10, 1) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 18, 8, 32) 320
_________________________________________________________________
dropout_1 (Dropout) (None, 18, 8, 32) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 4608) 0
_________________________________________________________________
dense_1 (Dense) (None, 32) 147488
_________________________________________________________________
dense_2 (Dense) (None, 32) 1056
_________________________________________________________________
output (Dense) (None, 5) 165
=================================================================
Total params: 149,029
Trainable params: 149,029
Non-trainable params: 0
_________________________________________________________________
  • Input Layer: the input shape is the same as the shape of the array that represents the playing field (20 by 10).
  • Convolutional layers: one Conv2D layer with 32 filters of size 3*3 (which gives the 18*8*32 output and 320 parameters shown in the summary).
  • Dropout of 20%.
  • Flatten: converts the output of the convolutional layer from a 2D array into a 1D array.
  • Dense (fully connected) layers: two dense layers with 32 nodes each.
  • Output Layer: the output layer contains 5 nodes; each node represents an action [no_action, move_left, move_right, decrease_width, increase_width].

Explaining the Agent Class:

The Agent class contains everything related to the agent, such as the DQN, the training function, and the replay memory. The following is a line-by-line explanation of this class.

Model Creation:

These two functions are used to create a model given two lists (a sketch of such a builder follows the list):

  • conv_list: each item in this list defines the number of filters for a convolutional layer.
  • dense_list: each item in this list defines the number of nodes for a dense layer.
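The real implementation lives in the repository; under those assumptions, a minimal Keras sketch of such a builder (the activations and optimizer are my own guesses) could look like this:

from tensorflow.keras.layers import Input, Conv2D, Dropout, Flatten, Dense
from tensorflow.keras.models import Model

def create_model(conv_list, dense_list, input_shape=(20, 10, 1), action_space_size=5):
    inputs = Input(shape=input_shape)
    x = inputs
    for filters in conv_list:                 # one Conv2D layer per item in conv_list
        x = Conv2D(filters, (3, 3), activation='relu')(x)
        x = Dropout(0.2)(x)                   # 20% dropout, as in the summary above
    x = Flatten()(x)
    for nodes in dense_list:                  # one Dense layer per item in dense_list
        x = Dense(nodes, activation='relu')(x)
    outputs = Dense(action_space_size, activation='linear', name='output')(x)  # one Q-value per action
    model = Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mse')
    return model

# The final architecture from the grid search: one conv layer with 32 filters,
# two dense layers with 32 nodes each.
# model = create_model(conv_list=[32], dense_list=[32, 32])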

Training the Agent:

In order to keep track of the best model and save it to be used after training, the following function is used:
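Since the original snippet is embedded from the repository, here is only a hedged sketch of what such a “save the best model” helper could look like (the function name and save path are hypothetical):

best_score = float('-inf')   # best score seen so far across all episodes

def save_if_best(model, score, path='models/best_model.h5'):
    # Save the model only when it beats the best score recorded so far.
    global best_score
    if score > best_score:
        best_score = score
        model.save(path)   # Keras stores the architecture and weights in one file
        return True
    return False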

Next come some constants:
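The exact values live in the repository; the constants below are illustrative stand-ins, except for the batch size of 128, the five-action space, and the UPDATE_TARGET_EVERY name, which are mentioned in the article itself:

# Illustrative hyperparameters; only MINIBATCH_SIZE, ACTION_SPACE_SIZE, and the
# name UPDATE_TARGET_EVERY come from the article, the rest are typical DQN defaults.
REPLAY_MEMORY_SIZE = 50_000       # experiences kept in the replay memory
MIN_REPLAY_MEMORY_SIZE = 1_000    # start training only after this many experiences
MINIBATCH_SIZE = 128              # batch size chosen by the grid search
DISCOUNT = 0.99                   # gamma in the Bellman update
UPDATE_TARGET_EVERY = 5           # sync target_model with model every n episodes
EPISODES = 20_000                 # total training episodes
ACTION_SPACE_SIZE = 5             # [no_action, move_left, move_right, decrease_width, increase_width]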

Then come the architectures that will be trained:

A grid search will be performed using the previous three architectures, and the results of the grid search are stored in a dataframe.
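As with the other snippets, the real grid-search code is in the repository; a runnable sketch of the idea, with made-up candidate architectures and a stubbed-out training call, might look like this:

import pandas as pd

# Candidate (conv_list, dense_list) pairs; these tuples are made up for illustration.
architectures = [
    ([32], [32, 32]),
    ([32, 64], [64, 64]),
    ([64], [128]),
]

def train_agent(conv_list, dense_list):
    # Placeholder for the real training loop, which would build the model with
    # create_model(conv_list, dense_list), train it, and return the best score reached.
    return 0.0

results = []
for conv_list, dense_list in architectures:
    best_score = train_agent(conv_list, dense_list)
    results.append({'conv_list': conv_list, 'dense_list': dense_list, 'best_score': best_score})

# Store the grid-search results in a dataframe for later analysis.
results_df = pd.DataFrame(results)
print(results_df)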

Check out my GitHub repository for this code:

To recap, we discussed:

  • The reasons behind choosing DQN instead of Q-Learning.
  • A brief explanation of DQNs.
  • How DQNs work.
  • The architectures we used and why.
  • The Agent class, with its code explained.
  • The process of training the models and the grid search for the best one.

In the next part we will :

  • Analyse the training results using TensorBoard.
  • Try the best model.


I am a computer engineer | I love machine learning and data science and spend my time learning new stuff about them.