Overview of Deep Reinforcement Learning Methods

Prof. Steven L. Brunton


Introduction

Welcome back, everyone. In this lecture, we will continue our discussion of reinforcement learning. Specifically, we will focus on deep RL, also known as reinforcement learning with deep neural networks. This has been one of the most exciting developments in both control theory and machine learning over the past decade, with a plethora of new results emerging every month, which makes it an exciting topic to explore.

It is worth noting that this lecture follows a new chapter in the second edition of our Data-Driven Science and Engineering book. I wrote this chapter on reinforcement learning to cap off the control theory section, and it also served as an excellent opportunity for me to learn more about reinforcement learning.

I would like to offer a meta point for all of you: explaining things to others is an effective way to learn something. In my case, I used this opportunity to write a book chapter and create this video sequence to deepen my own understanding of reinforcement learning.

Reinforcement Learning

I hope you are enjoying this as much as I am. Today, I am going to touch on the different techniques of deep reinforcement learning; however, given the vastness of the field, I will only be covering the tip of the iceberg. I have previously shared a video discussing the use of deep neural networks in reinforcement learning at a high level, with a focus on examples and demonstrations of the cool things you can do with these techniques. In this lecture, I will delve deeper into the algorithms and explore how the different flavors of deep reinforcement learning fit into the classic picture we have been developing in the previous lectures.

Without further ado, I will jump right in and discuss some exciting concepts.

References
  • 1.
    https://faculty.washington.edu/sbrunton/databookRL.pdf

Deep Policy Network

In this transcript, the speaker explains the concept of a deep policy network, which is a type of deep neural network embedded into a reinforcement learner. The purpose of this network is to optimize the policy function, π, which is a complex function of the state and the action being taken. The goal is to maximize future rewards in the system.

The speaker goes on to explain that the policy function can be parameterized as a neural network with weights θ. The input of the neural network is the state, and the output is a probability distribution over which action to take. The network is optimized over these parameters, θ, to maximize future rewards and give the best policy possible.

Overall, the deep policy network is a relatively simple approach to incorporating deep neural networks into reinforcement learning. By optimizing the policy function, the network can learn to make decisions and take actions that lead to the most reward in the system.
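To make this concrete, here is a minimal sketch of such a policy network in Python with PyTorch. The layer sizes, the discrete two-action space, and all names here are illustrative assumptions, not details from the lecture.

    import torch
    import torch.nn as nn

    class PolicyNetwork(nn.Module):
        # Maps a state vector to a probability distribution over discrete actions.
        # The weights of this network play the role of the policy parameters theta.
        def __init__(self, state_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),
                nn.Softmax(dim=-1),   # output: probability of taking each action
            )

        def forward(self, state):
            return self.net(state)

    # Sample an action from the current policy for a (made-up) 4-dimensional state.
    policy = PolicyNetwork(state_dim=4, n_actions=2)
    probs = policy(torch.randn(1, 4))
    action = torch.distributions.Categorical(probs).sample()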

There is a blog post by Andrej Karpathy on deep reinforcement learning in which he provides roughly 150 lines of Python that implement a deep policy network. This network takes a high-dimensional state, specifically pixel space, and learns the best decision for moving a Pong paddle up or down. The policy network's parameters are the network weights θ, and gradients are computed through backpropagation to optimize these weights for the best chance of receiving a future reward.

There are also actor-critic methods, discussed later, that embed additional value information into the deep policy network. The general idea, though, is to represent the policy as a network and optimize its parameters using gradients computed through backpropagation. The following section explains how policy gradient optimization works and how θ is updated.

The cumulative future reward R_Σ,θ weights each state-action pair by μ_θ(s), the expected probability of finding oneself in state s at long times under the policy π_θ; the sum of all possible future rewards is folded up into the quality function Q. To compute the gradient of this reward with respect to θ, one multiplies and divides by the policy π_θ inside the sum, which converts the gradient of the policy into the policy times the gradient of its logarithm. The result is the expected value of the quality function times the gradient of the log of the policy with respect to θ, and this quantity is used to update the weights θ.
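In symbols, a sketch of this calculation (using μ_θ(s) for the long-run state distribution under policy π_θ; the notation is assumed to follow the book chapter and may differ slightly in detail):

    R_{\Sigma,\theta} = \sum_s \mu_\theta(s) \sum_a \pi_\theta(s,a)\, Q(s,a),

    \nabla_\theta R_{\Sigma,\theta}
      = \sum_s \mu_\theta(s) \sum_a Q(s,a)\, \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)
      = \mathbb{E}\!\left[ Q(s,a)\, \nabla_\theta \log \pi_\theta(s,a) \right],
    \qquad
    \theta_{\mathrm{new}} = \theta_{\mathrm{old}} + \alpha\, \nabla_\theta R_{\Sigma,\theta},

where the middle step uses the identity ∇_θ π_θ = π_θ ∇_θ log π_θ and α is a learning rate.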

Overall, the idea is to optimize the policy network's parameters to make the decisions that yield the most future reward. The network can be optimized using backpropagation, and different methods can be used to embed additional information into the deep policy network.

References
  • 1.
    http://karpathy.github.io/2016/05/31/rl/

Policy Gradient Optimization

That was a bit of a mathematical aside, but for those interested, it shows at least one way to compute the policy gradient. Moving on to deep Q-learning: it is important to note that many of the most impressive demonstrations of deep reinforcement learning in the past five to ten years have used this method. Essentially, deep Q-learning involves learning the quality function with a neural network.

Q-learning is based on an off-policy temporal difference learning algorithm. The estimated future reward is computed from the current state s and the action a taken, while the actual reward is the result of taking that action; the difference between these two values, the temporal difference error, is used to update the quality function through trial and error. The quality function can then be parameterized by neural network weights θ so that it can be optimized for large state spaces.
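For reference, here is a minimal sketch of the tabular version of this update in Python; the state/action counts, learning rate alpha, and discount gamma are illustrative assumptions.

    import numpy as np

    n_states, n_actions = 10, 4
    Q = np.zeros((n_states, n_actions))   # tabular quality function
    alpha, gamma = 0.1, 0.99              # learning rate and discount factor

    def q_update(s, a, r, s_next):
        # One off-policy temporal difference (Q-learning) update:
        # move Q(s, a) toward r + gamma * max_a' Q(s', a').
        td_target = r + gamma * np.max(Q[s_next])   # estimated future reward
        td_error = td_target - Q[s, a]              # temporal difference error
        Q[s, a] += alpha * td_error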

For example, in games like backgammon, chess, or Go, the state space is astronomically large. Chess has more than 10^80 board combinations, more than the number of nucleons in the known universe, and Go has an even larger state space. Instead of iterating over states repeatedly, the quality function can be parameterized by a lower-dimensional set of parameters θ and optimized over θ. This allows low-dimensional features of the quality function to be extracted, which is important for addressing the curse of dimensionality in these high-dimensional state spaces.

Deep Q-Learning

To solve this problem, I will write down the cost function involved. This is the neural network cost function used when building a deep Q-learner. Essentially, the loss function that the network is trying to minimize is the expectation of the square of the temporal difference error. The Q function's parameters are represented by θ, and the neural network uses stochastic gradient descent with backpropagation to optimize these parameters to give the best possible Q function, i.e., the one that minimizes the temporal difference error.
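A minimal PyTorch-style sketch of that loss is shown below; the target-network and experience-replay details used in practice are omitted, and the network sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, a; theta)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    gamma = 0.99

    def dqn_loss(s, a, r, s_next):
        # Expectation (over a batch) of the squared temporal difference error.
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
        with torch.no_grad():                                       # treat the target as fixed
            target = r + gamma * q_net(s_next).max(dim=1).values    # r + gamma * max_a' Q(s', a')
        return nn.functional.mse_loss(q_sa, target)

    # One gradient step on a batch (s, a, r, s_next):
    # loss = dqn_loss(s, a, r, s_next); optimizer.zero_grad(); loss.backward(); optimizer.step()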

There is strong evidence that biological learners are also minimizing a temporal difference error at some level of their neurological hardware. The Q-learning update can be turned into a loss function, and the neural network can optimize its parameters, which is a powerful approach. This has been demonstrated in DeepMind's Atari video game playing, where a deep Q-learner with convolutional layers can learn directly from pixel space.

References
  • 1.
    V. Mnih et al. (2015) Human-level control through deep reinforcement learning. Nature

The speaker discusses the deep Q-learner's ability to determine the appropriate actions to take based on its current state. The algorithm has learned that drilling a hole through one side of the playing field will increase its reward and is the most efficient strategy. The speaker notes that the algorithm's ability to exploit the physics of the game to find such solutions is comparable to that of expert human players, which is why this is referred to as human-level control.

References
  • 1.
    V. Mnih et al. (2015) Human-level control through deep reinforcement learning. Nature

The architecture consists of convolutional layers followed by fully connected layers, which convert pixel space into joystick signals; essentially, it is a deep Q-learning demonstration that uses convolutional Q-learning. A list of video games is also presented: the games above the line are those on which the deep Q-learner performs as well as or better than humans, while the games below the line are those on which it is still not as proficient as humans.

References
  • 1.
    V. Mnih et al. (2015) Human-level control through deep reinforcement learning. Nature

Okay, so that was a discussion of deep Q-learning. Essentially, one can take the traditional Q-learning update and turn it into a loss function for a neural network; through trial and error and experience, the neural network will learn from the data to provide the best Q function possible.

There's a variation on this called dueling deep Q-networks (DDQN). This method splits the quality function into two networks: a value network that is a function of the current state, and an advantage network that determines the advantage of taking a particular action in that state. This architecture is useful when the difference in quality between different actions is subtle. The value network is optimized to explain as much of the Q function as possible from the state alone, while the advantage network captures the additional effect of taking each action.
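A minimal sketch of that split in PyTorch follows; subtracting the mean advantage (a standard trick to make the decomposition identifiable) and the layer sizes are assumptions, not details from the lecture.

    import torch
    import torch.nn as nn

    class DuelingQNetwork(nn.Module):
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        def __init__(self, state_dim=4, n_actions=2, hidden=64):
            super().__init__()
            self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.value = nn.Linear(hidden, 1)              # value of being in state s
            self.advantage = nn.Linear(hidden, n_actions)  # advantage of each action in s

        def forward(self, state):
            h = self.features(state)
            v, adv = self.value(h), self.advantage(h)
            return v + adv - adv.mean(dim=-1, keepdim=True)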

Another important concept in reinforcement learning is actor-critic learning. Actor-critic methods combine the best of policy-based and value-based learning. In actor-critic learning, there are two learners: an actor and a critic. The actor learns a good policy, while the critic critiques that policy based on its estimate of the value function. Essentially, the actor represents the policy and the critic learns the value function.

One simple way to implement actor-critic learning is to use the policy gradient algorithm. The parameters of the policy are updated based on information from the critic's estimate of the value function.

The policy update uses the temporal difference signal from the value learner: the critic provides an error signal that is used both to update the value function and to update the policy, combining value-based and policy gradient information.

One method that can be used in the context of deep neural networks is the advantage actor-critic network, which utilizes a deep dueling Q network to split the quality function into the value function and the advantage of taking an action. The actor is a deep policy network with weights θ, while the critic is a deep dueling Q network that assesses the quality of taking an action in a given state.
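A heavily simplified sketch of one actor-critic step in PyTorch is shown below, using the temporal difference error from a learned value function as the critic's signal; the network definitions, learning rates, and batch conventions are illustrative assumptions.

    import torch
    import torch.nn as nn

    actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Softmax(dim=-1))
    critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))   # value function V(s)
    opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
    opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
    gamma = 0.99

    def actor_critic_step(s, a, r, s_next):
        # Critic: temporal difference error  delta = r + gamma * V(s') - V(s)
        value = critic(s)
        next_value = critic(s_next).detach()
        td_error = r.unsqueeze(1) + gamma * next_value - value
        critic_loss = td_error.pow(2).mean()

        # Actor: policy gradient step weighted by the critic's TD error
        # (an estimate of the advantage of action a in state s).
        log_prob = torch.log(actor(s).gather(1, a.unsqueeze(1)))
        actor_loss = -(td_error.detach() * log_prob).mean()

        opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()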

Advantage Actor-Critic Network

In this part of the lecture, the speaker discusses policy iteration and policy gradient iteration, which can update the deep policy network much faster than the model-free techniques discussed in the previous lecture; the trade-off is that the policy must be parameterized by θ so that derivatives can be taken. The deep policy gradient also requires a quality function, so an actor-critic method is used: a Q network learns the quality function, updated with the temporal difference error, while the policy is updated using the policy gradient network. This combines the best of value-based and policy-based formulations, in contrast to pure Q-learning, where the Q function is updated from Q information alone and the policy is extracted separately. The speaker finds this a particularly elegant way to optimize policies.

The speaker mentions that deep quality function networks are very popular and can be combined with deep policy gradients in actor-critic methods. They also briefly touch on deep model predictive control, a different flavor of optimization that requires a lot of computational power but allows optimal nonlinear controllers to be learned for tasks like teaching a quadrotor to fly through an obstacle field. The speaker suggests that once the right control actions are learned, they can be embedded in a neural network to rapidly encode the information of these deep model predictive controllers.

Overall, the speaker provides a high-level overview of some important topics in deep reinforcement learning, building on previous lectures.