In the following posts, I will attempt to break down the Actor-Critic Policy Gradient algorithm.
Research often highlights the practicality of the policy gradient method in problems with high-dimensional or continuous action spaces, as opposed to discrete and limited action spaces where some variant of Q-learning, or more generally value function approximation, is typically applied. In contrast to value function approximation, where convergence is not guaranteed (even though it has produced significant results in practice, e.g. DQN), the variants of the policy gradient method offer guarantees of convergence to at least a local optimum. The downsides are, likewise, the risk of converging only to a local optimum, inefficiency (high sample complexity), and the fact that, on a standalone basis, the associated variance is high as well.
In terms of progression, in the coming posts I will:
- Break down the policy gradient method, and introduce the associated algorithm.
- Attempt to address the so-called “likelihood ratio” trick that is often just rushed through in the material, which will require a detour into the chain rule and the likelihood estimator.
- Touch on the topic of variance, what this actually means, and compare it to an algorithm that exhibits high bias.
- Wrap back around to the Actor-Critic algorithm, which introduces a baseline to address the high variance that the policy gradient method exhibits on a standalone basis.
So why go through this and take notes? The current state of the art in Deep Reinforcement Learning is a variant of the Actor-Critic architecture, named Asynchronous Advantage Actor-Critic, or A3C, presented here by Mnih et al.
So let's begin… The policy gradient method essentially applies gradient-based optimization to find the \(\theta\) that maximizes \(J(\theta)\). If your background is in neural nets or some other optimization problem where the objective is to minimize a cost function, you may be used to seeing the gradient \(\textit{descent}\) update. In this case the goal is to \(\textbf{maximize}\) the value of the policy, so we need to ascend, i.e. increase \(J(\theta)\), which leads us to the (approximate) gradient ascent update rule: \[\theta' = \theta + \alpha\frac{\partial J(\theta)}{\partial \theta}\] or equivalently: \[\Delta\theta=\alpha\nabla_{\theta}J(\theta)\]
For clarity, we are trying to find the best \(\theta\) that maximizes the “quality” of \(\pi(a \mid s;\theta)\).
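To make the update rule concrete, here is a minimal sketch in Python of a single gradient ascent step on the policy parameters. It assumes we already have some estimate of \(\nabla_{\theta}J(\theta)\) in hand (how to actually obtain that estimate is exactly what the coming posts work through); the names `ascent_step` and `grad_estimate` are just illustrative, not part of any library.

```python
import numpy as np

def ascent_step(theta: np.ndarray, grad_J: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    """One gradient ascent step: theta' = theta + alpha * grad_J.

    Note the plus sign: we are maximizing J(theta), not minimizing a cost.
    """
    return theta + alpha * grad_J

# Hypothetical usage: a 3-parameter policy and a placeholder gradient estimate.
theta = np.zeros(3)
grad_estimate = np.array([0.5, -0.2, 0.1])  # stand-in for an estimate of nabla_theta J(theta)
theta = ascent_step(theta, grad_estimate, alpha=0.1)
print(theta)  # [ 0.05 -0.02  0.01]
```

The only difference from the familiar gradient descent update is the sign of the step: we move in the direction of the gradient rather than against it.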
A couple of questions should be considered now. First, what is \(J(\theta)\), the quality of the policy? And second, how can we obtain the gradient of this particular function?
I will answer these in the following post…