Overview of Deep Reinforcement Learning Methods
Prof. Steven L. Brunton
Summary (AI generated)
This is a bit of a mathematical aside, but for those interested, it shows at least one way to compute the policy gradient. Moving on to deep Q-learning, it is worth noting that many of the impressive demonstrations of deep reinforcement learning in the past five to ten years have used this method. Essentially, deep Q-learning means learning a quality function with a neural network.
The basic Q-learning update is an off-policy temporal-difference learning rule. The current estimate of future reward is computed from the state (s) and the action taken (a), while the actual reward is what is observed after taking that action. The difference between these two values, the temporal-difference error, is used to update the quality function through trial and error. For large state spaces, the quality function can be parameterized by neural network weights (θ) and optimized over those weights.
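To make that update concrete, the standard tabular Q-learning rule (not written out explicitly in this summary, but standard for the method) is

    Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

where α is a learning rate, γ is a discount factor, and the bracketed term is the temporal-difference error described above. A minimal Python sketch of one such update, with the array Q and the constants alpha and gamma chosen here purely for illustration:

    import numpy as np

    # Illustrative sizes; a real problem defines these from its state and action spaces.
    n_states, n_actions = 10, 4
    Q = np.zeros((n_states, n_actions))   # tabular quality function Q(s, a)
    alpha, gamma = 0.1, 0.99              # learning rate and discount factor

    def q_update(s, a, r, s_next):
        """One tabular Q-learning step: move Q(s, a) toward the TD target."""
        td_target = r + gamma * Q[s_next].max()   # reward plus discounted future value
        td_error = td_target - Q[s, a]            # temporal-difference error
        Q[s, a] += alpha * td_error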
For example, in games like backgammon, chess, or Go, the state space is astronomically large: chess admits more possible games than there are nucleons in the known universe (well over 10^80), and Go has an even larger state space. Instead of iterating over the states themselves, the quality function can be parameterized by a lower-dimensional set of parameters θ and optimized over θ. This extracts low-dimensional features of the quality function, which is important for addressing the curse of dimensionality in such high-dimensional state spaces.
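As an illustration of that parameterization, the following is a minimal sketch (assumed details, not code from the lecture) of a quality function represented by a small neural network with weights θ, trained on the squared temporal-difference error. It uses PyTorch and deliberately omits standard deep Q-learning ingredients such as a replay buffer and a separate target network; state_dim and num_actions are placeholder dimensions.

    import torch
    import torch.nn as nn

    # Placeholder sizes: the state is a feature vector, and the network
    # outputs one Q-value per discrete action.
    state_dim, num_actions = 8, 4
    q_net = nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, num_actions),
    )
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    gamma = 0.99

    def td_step(s, a, r, s_next, done):
        """One gradient step on the squared temporal-difference error.

        s, s_next: float tensors of shape (batch, state_dim)
        a: long tensor of action indices, shape (batch,)
        r, done: float tensors of shape (batch,)
        """
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; θ)
        with torch.no_grad():
            target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
        loss = nn.functional.mse_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The point of the parameterization is that the number of weights θ is fixed by the network architecture rather than by the number of states, so a comparatively small set of parameters can represent the quality function even when the state space itself is astronomically large.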