Overview of Deep Reinforcement Learning Methods
Prof. Steven L. Brunton
Summary (AI generated)
There is a blog post by Andrej Karpathy discussing deep reinforcement learning in which he provides a Python script of roughly 150 lines that implements a deep policy network. This network takes a high-dimensional state, specifically the raw pixel space, and learns the best decision for moving a Pong paddle up or down. The policy network is parameterized by its weights θ, and gradients are computed through backpropagation to optimize these weights for the best chance of receiving future rewards.
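The actual script in that post is longer and includes frame preprocessing and the full training loop; the following is only a minimal sketch in the same spirit, with assumed sizes (an 80x80 preprocessed frame, 200 hidden units) and illustrative Pong action codes:

```python
import numpy as np

# Minimal sketch of a deep policy network in the spirit of the "Pong from
# Pixels" post. Sizes are illustrative assumptions: an 80x80 preprocessed
# difference frame (6400 pixels) and 200 hidden units.
D, H = 80 * 80, 200
model = {
    "W1": np.random.randn(H, D) / np.sqrt(D),  # pixels -> hidden features
    "W2": np.random.randn(H) / np.sqrt(H),     # hidden features -> logit
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def policy_forward(x):
    """Map a flattened pixel state x to P(move paddle up)."""
    h = np.maximum(0, model["W1"] @ x)           # ReLU hidden layer
    p_up = sigmoid(model["W2"] @ h)              # probability of "up"
    return p_up, h                               # h is cached for backprop

# Sample an action from the stochastic policy pi_theta(a | s).
x = np.random.randn(D)                           # stand-in for a real frame
p_up, h = policy_forward(x)
action = 2 if np.random.uniform() < p_up else 3  # assumed UP/DOWN codes for Atari Pong
```

The weights θ here are W1 and W2; in training, the gradient of the log-probability of each sampled action would be backpropagated through this network and scaled by the eventual reward.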
Actor-critic methods can also be used, embedding value-function information (the critic) into the training of the deep policy network (the actor). The general idea remains the same: represent the policy as a network and optimize its parameters θ using gradients computed through backpropagation. The following section explains how policy gradient optimization works and how θ is updated.
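As a rough illustration of the actor-critic idea (not the specific algorithm from the lecture), the sketch below assumes linear features, a softmax actor, and a critic whose temporal-difference error drives both updates:

```python
import numpy as np

# Hedged sketch of actor-critic: the critic estimates a value function, and
# its TD error, rather than the raw reward, scales the policy update.
# Linear features and two discrete actions are simplifying assumptions.
n_features, n_actions = 8, 2
theta = np.zeros((n_actions, n_features))   # actor (policy) parameters
w = np.zeros(n_features)                    # critic (value) parameters
alpha_theta, alpha_w, gamma = 1e-2, 1e-1, 0.99

def softmax_policy(s):
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def actor_critic_step(s, a, r, s_next):
    """One online update: critic evaluates, actor is nudged by the TD error."""
    td_error = r + gamma * (w @ s_next) - (w @ s)     # critic's "surprise"
    w[:] = w + alpha_w * td_error * s                 # improve value estimate
    grad_log_pi = -softmax_policy(s)[:, None] * s     # d log pi / d theta ...
    grad_log_pi[a] += s                               # ... for a softmax policy
    theta[:] = theta + alpha_theta * td_error * grad_log_pi

# Example of a single update with random stand-in data.
s, s_next = np.random.randn(n_features), np.random.randn(n_features)
a = np.random.choice(n_actions, p=softmax_policy(s))
actor_critic_step(s, a, r=1.0, s_next=s_next)
```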
The cumulative future reward R_Σ,θ is built from μ_θ(s), the probability of finding oneself in state s at long times under the policy π_θ. The reward function is the sum over states and actions of the quality function Q(s, a), weighted by μ_θ(s) and by the policy π_θ(s, a), so that all possible future rewards are folded in through Q. To compute the gradient of this reward with respect to θ, one multiplies and divides by the policy π_θ, turning ∇_θ π_θ into π_θ ∇_θ log π_θ. This results in the expected value of the quality function times the gradient of the log of the policy with respect to θ, which is then used to update the weights θ.
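In symbols, writing μ_θ(s) for the long-time state distribution and Q(s, a) for the quality function (the exact notation here is an assumption about the lecture's conventions), the calculation described above reads roughly as:

```latex
\begin{align*}
  R_{\Sigma,\theta} &= \sum_{s} \mu_\theta(s) \sum_{a} \pi_\theta(s,a)\, Q(s,a), \\
  \nabla_\theta R_{\Sigma,\theta}
    &\approx \sum_{s} \mu_\theta(s) \sum_{a} \pi_\theta(s,a)\, Q(s,a)\,
       \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} \\
    &= \mathbb{E}\!\left[ Q(s,a)\, \nabla_\theta \log \pi_\theta(s,a) \right], \\
  \theta_{\text{new}} &= \theta_{\text{old}} + \alpha\, \nabla_\theta R_{\Sigma,\theta}.
\end{align*}
```

The approximation sign reflects that the dependence of μ_θ on θ is neglected, which is the usual simplification behind this form of the policy gradient.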
Overall, the idea behind deep reinforcement learning is to optimize the policy network's parameters so that it makes the decisions most likely to yield future rewards. The network can be trained using backpropagation, and different methods, such as actor-critic, can be used to embed additional information into the deep policy network.