Overview of Deep Reinforcement Learning Methods

Prof. Steven L. Brunton


Introduction

Welcome back, everyone. In this lecture, we will continue our discussion of reinforcement learning. Specifically, we will focus on deep RL, also known as reinforcement learning with deep neural networks. This has been one of the most exciting developments in both control theory and machine learning over the past decade, with a plethora of new results emerging every month, which makes it an exciting topic to explore.

It is worth noting that this lecture follows a new chapter in the second edition of our Data-Driven Science and Engineering book. I wrote this chapter on reinforcement learning to cap off the control theory section, and it also served as an excellent opportunity for me to learn more about reinforcement learning.

I would like to offer a meta point for all of you: explaining things to others is an effective way to learn something. In my case, I used this opportunity to write a book chapter and create this video sequence to deepen my own understanding of reinforcement learning.

Reinforcement Learning

I hope you are enjoying this as much as I am. Today, I am going to touch on the different techniques of deep reinforcement learning; however, given the vastness of the field, I will only be covering the tip of the iceberg. I have previously shared a video discussing the use of deep neural networks in reinforcement learning at a high level, with a focus on examples and demonstrations of the cool things you can do with these techniques. In this lecture, I will delve deeper into the algorithms and explore how the different flavors of deep reinforcement learning fit into the classic picture we have been developing in the previous lectures.

Without further ado, I will jump right in and discuss some exciting concepts.

References
  • 1.
    https://faculty.washington.edu/sbrunton/databookRL.pdf

Deep Policy Network

In this transcript, the speaker explains the concept of a deep policy network, which is a type of deep neural network embedded into a reinforcement learner. The purpose of this network is to optimize the policy function, π, which is a complex function of the state and the action being taken. The goal is to maximize future rewards in the system.

The speaker goes on to explain that the policy function can be parameterized as a neural network with weights θ. The input of the neural network is the state, and the output is a probability distribution over which action to take. The network is optimized over these parameters, θ, to maximize future rewards and give the best policy possible.

Overall, the deep policy network is a relatively simple approach to incorporating deep neural networks into reinforcement learning. By optimizing the policy function, the network can learn to make decisions and take actions that lead to the most reward in the system.
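To make this concrete, here is a minimal sketch of such a policy network in Python with PyTorch. The layer sizes, the discrete two-action space, and all names here are illustrative assumptions, not details from the lecture.

    import torch
    import torch.nn as nn

    class PolicyNetwork(nn.Module):
        # Maps a state vector to a probability distribution over discrete actions.
        # The weights of this network play the role of the policy parameters theta.
        def __init__(self, state_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),
                nn.Softmax(dim=-1),   # output: probability of taking each action
            )

        def forward(self, state):
            return self.net(state)

    # Sample an action from the current policy for a (made-up) 4-dimensional state.
    policy = PolicyNetwork(state_dim=4, n_actions=2)
    probs = policy(torch.randn(1, 4))
    action = torch.distributions.Categorical(probs).sample()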

There is a blog post by Andrej Karpathy on deep reinforcement learning in which he provides roughly 150 lines of Python that implement a deep policy network. This network takes a high-dimensional state, specifically pixel space, and learns the best decision for moving a Pong paddle up or down. The policy network's parameters are the network weights θ, and gradients are computed through backpropagation to optimize these weights for the best chance of receiving a future reward.

There are also actor-critic methods, discussed later, that embed additional value information into the deep policy network. The general idea, though, is to represent the policy as a network and optimize its parameters using gradients computed through backpropagation. The following section explains how policy gradient optimization works and how θ is updated.

The cumulative future reward R_Σ,θ weights each state-action pair by μ_θ(s), the expected probability of finding oneself in state s at long times under the policy π_θ; the sum of all possible future rewards is folded up into the quality function Q. To compute the gradient of this reward with respect to θ, one multiplies and divides by the policy π_θ inside the sum, which converts the gradient of the policy into the policy times the gradient of its logarithm. The result is the expected value of the quality function times the gradient of the log of the policy with respect to θ, and this quantity is used to update the weights θ.
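In symbols, a sketch of this calculation (using μ_θ(s) for the long-run state distribution under policy π_θ; the notation is assumed to follow the book chapter and may differ slightly in detail):

    R_{\Sigma,\theta} = \sum_s \mu_\theta(s) \sum_a \pi_\theta(s,a)\, Q(s,a),

    \nabla_\theta R_{\Sigma,\theta}
      = \sum_s \mu_\theta(s) \sum_a Q(s,a)\, \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)
      = \mathbb{E}\!\left[ Q(s,a)\, \nabla_\theta \log \pi_\theta(s,a) \right],
    \qquad
    \theta_{\mathrm{new}} = \theta_{\mathrm{old}} + \alpha\, \nabla_\theta R_{\Sigma,\theta},

where the middle step uses the identity ∇_θ π_θ = π_θ ∇_θ log π_θ and α is a learning rate.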

Overall, the idea is to optimize the policy network's parameters to make the decisions that yield the most future reward. The network can be optimized using backpropagation, and different methods can be used to embed additional information into the deep policy network.

References
  • 1.
    http://karpathy.github.io/2016/05/31/rl/

Policy Gradient Optimization

That was a bit of a mathematical aside, but for those interested, it shows at least one way to compute the policy gradient. Moving on to deep Q-learning: it is important to note that many of the most impressive demonstrations of deep reinforcement learning in the past five to ten years have used this method. Essentially, deep Q-learning involves learning the quality function with a neural network.

Q-learning is based on an off-policy temporal difference learning algorithm. The estimated future reward is computed from the current state s and the action a taken, while the actual reward is the result of taking that action; the difference between these two values, the temporal difference error, is used to update the quality function through trial and error. The quality function can then be parameterized by neural network weights θ so that it can be optimized for large state spaces.
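For reference, here is a minimal sketch of the tabular version of this update in Python; the state/action counts, learning rate alpha, and discount gamma are illustrative assumptions.

    import numpy as np

    n_states, n_actions = 10, 4
    Q = np.zeros((n_states, n_actions))   # tabular quality function
    alpha, gamma = 0.1, 0.99              # learning rate and discount factor

    def q_update(s, a, r, s_next):
        # One off-policy temporal difference (Q-learning) update:
        # move Q(s, a) toward r + gamma * max_a' Q(s', a').
        td_target = r + gamma * np.max(Q[s_next])   # estimated future reward
        td_error = td_target - Q[s, a]              # temporal difference error
        Q[s, a] += alpha * td_error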

For example, in games like backgammon, chess, or Go, the state space is astronomically large. Chess has more than 10^80 board combinations, more than the number of nucleons in the known universe, and Go has an even larger state space. Instead of iterating over states repeatedly, the quality function can be parameterized by a lower-dimensional set of parameters θ and optimized over θ. This allows low-dimensional features of the quality function to be extracted, which is important for addressing the curse of dimensionality in these high-dimensional state spaces.

Deep Q-Learning

To solve this problem, I will write down the cost function involved. This is the neural network cost function used when building a deep Q-learner. Essentially, the loss function that the network is trying to minimize is the expectation of the square of the temporal difference error. The Q function's parameters are represented by θ, and the neural network uses stochastic gradient descent with backpropagation to optimize these parameters to give the best possible Q function, i.e., the one that minimizes the temporal difference error.
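A minimal PyTorch-style sketch of that loss is shown below; the target-network and experience-replay details used in practice are omitted, and the network sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, a; theta)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    gamma = 0.99

    def dqn_loss(s, a, r, s_next):
        # Expectation (over a batch) of the squared temporal difference error.
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
        with torch.no_grad():                                       # treat the target as fixed
            target = r + gamma * q_net(s_next).max(dim=1).values    # r + gamma * max_a' Q(s', a')
        return nn.functional.mse_loss(q_sa, target)

    # One gradient step on a batch (s, a, r, s_next):
    # loss = dqn_loss(s, a, r, s_next); optimizer.zero_grad(); loss.backward(); optimizer.step()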

There is strong evidence that biological learners are also minimizing a temporal difference error at some level of their neurological hardware. The Q-learning update can be turned into a loss function, and the neural network can optimize its parameters, which is a powerful approach. This has been demonstrated in DeepMind's Atari video game playing, where a deep Q-learner with convolutional layers can learn directly from pixel space.

References
  • 1.
    V. Mnih et al. (2015) Human-level control through deep reinforcement learning. Nature

The speaker discusses the deep Q-learner's ability to determine the appropriate actions to take based on its current state. The algorithm has learned that drilling a hole through one side of the playing field will increase its reward and is the most efficient strategy. The speaker notes that the algorithm's ability to exploit the physics of the game to find such solutions is comparable to that of expert human players, which is why this is referred to as human-level control.

References
  • 1.
    V. Mnih et al. (2015) Human-level control through deep reinforcement learning. Nature

The architecture consists of convolutional layers followed by fully connected layers, which convert pixel space into joystick signals; essentially, it is a deep Q-learning demonstration that uses convolutional Q-learning. A list of video games is also presented: the games above the line are those on which the deep Q-learner performs as well as or better than humans, while the games below the line are those on which it is still not as proficient as humans.

References
  • 1.
    V. Mnih et al. (2015) Human-level control through deep reinforcement learning. Nature

Okay, so that was a discussion of deep Q-learning. Essentially, one can take the traditional Q-learning update and turn it into a loss function for a neural network; through trial and error and experience, the neural network will learn from the data to provide the best Q function possible.

There's a variation on this called dueling deep Q-networks (DDQN). This method splits the quality function into two networks: a value network that is a function of the current state, and an advantage network that determines the advantage of taking a particular action in that state. This architecture is useful when the difference in quality between different actions is subtle. The value network is optimized to explain as much of the Q function as possible from the state alone, while the advantage network captures the additional effect of taking each action.
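A minimal sketch of that split in PyTorch follows; subtracting the mean advantage (a standard trick to make the decomposition identifiable) and the layer sizes are assumptions, not details from the lecture.

    import torch
    import torch.nn as nn

    class DuelingQNetwork(nn.Module):
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        def __init__(self, state_dim=4, n_actions=2, hidden=64):
            super().__init__()
            self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.value = nn.Linear(hidden, 1)              # value of being in state s
            self.advantage = nn.Linear(hidden, n_actions)  # advantage of each action in s

        def forward(self, state):
            h = self.features(state)
            v, adv = self.value(h), self.advantage(h)
            return v + adv - adv.mean(dim=-1, keepdim=True)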

Another important concept in reinforcement learning is actor-critic learning. Actor-critic methods combine the best of policy-based and value-based learning. In actor-critic learning, there are two learners: an actor and a critic. The actor learns a good policy, while the critic critiques that policy based on its estimate of the value function. Essentially, the actor represents the policy and the critic learns the value function.

One simple way to implement actor-critic learning is to use the policy gradient algorithm. The parameters of the policy are updated based on information from the critic's estimate of the value function.

The policy update uses the temporal difference signal from the value learner: the critic provides an error signal that is used both to update the value function and to update the policy, combining value-based and policy gradient information.

One method that can be used in the context of deep neural networks is the advantage actor-critic network, which utilizes a deep dueling Q network to split the quality function into the value function and the advantage of taking an action. The actor is a deep policy network with weights θ, while the critic is a deep dueling Q network that assesses the quality of taking an action in a given state.
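A heavily simplified sketch of one actor-critic step in PyTorch is shown below, using the temporal difference error from a learned value function as the critic's signal; the network definitions, learning rates, and batch conventions are illustrative assumptions.

    import torch
    import torch.nn as nn

    actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Softmax(dim=-1))
    critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))   # value function V(s)
    opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
    opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
    gamma = 0.99

    def actor_critic_step(s, a, r, s_next):
        # Critic: temporal difference error  delta = r + gamma * V(s') - V(s)
        value = critic(s)
        next_value = critic(s_next).detach()
        td_error = r.unsqueeze(1) + gamma * next_value - value
        critic_loss = td_error.pow(2).mean()

        # Actor: policy gradient step weighted by the critic's TD error
        # (an estimate of the advantage of action a in state s).
        log_prob = torch.log(actor(s).gather(1, a.unsqueeze(1)))
        actor_loss = -(td_error.detach() * log_prob).mean()

        opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()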

Advantage Actor-Critic Network

In this part of the lecture, the speaker discusses policy iteration and policy gradient iteration, which can update the deep policy network much faster than the model-free techniques discussed in the previous lecture; the trade-off is that the policy must be parameterized by θ so that derivatives can be taken. The deep policy gradient also requires a quality function, so an actor-critic method is used: a Q network learns the quality function, updated with the temporal difference error, while the policy is updated using the policy gradient network. This combines the best of value-based and policy-based formulations, in contrast to pure Q-learning, where the Q function is updated from Q information alone and the policy is extracted separately. The speaker finds this a particularly elegant way to optimize policies.

The speaker mentions that deep quality function networks are very popular and can be combined with deep policy gradients in actor-critic methods. They also briefly touch on deep model predictive control, a different flavor of optimization that requires a lot of computational power but allows optimal nonlinear controllers to be learned for tasks like teaching a quadrotor to fly through an obstacle field. The speaker suggests that once the right control actions are learned, they can be embedded in a neural network to rapidly encode the information of these deep model predictive controllers.

Overall, the speaker provides a high-level overview of some important topics in deep reinforcement learning, building on previous lectures.