Previous Posts:
- Actor-Critic and Policy Gradient Methods #1
- Actor-Critic and Policy Gradient Methods #2
- Actor-Critic and Policy Gradient Methods #3
In this post, we will break down the ratio trick (also known as the log-derivative trick) that helped us digest the policy gradient theorem in the previous post, following the steps outlined by D. Meyer [1].
First off, let's recall the identity \(\nabla_{\theta}\log(w) = \frac{1}{w}\nabla_{\theta}w\). Keep it in mind, as it will come in handy later.
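To convince ourselves, here is a quick numerical sanity check of the identity using finite differences. The scalar function \(w(\theta)=\theta^2+1\) is just an illustrative choice, not anything from the derivation below:

```python
# Quick numerical check of d/dθ log w(θ) = (1/w) dw/dθ,
# using an arbitrary scalar function w(θ) = θ² + 1 and central finite differences.
import numpy as np

def w(theta):
    return theta**2 + 1.0

theta, eps = 0.7, 1e-6

# Left-hand side: derivative of log w(θ).
lhs = (np.log(w(theta + eps)) - np.log(w(theta - eps))) / (2 * eps)

# Right-hand side: (1/w) * dw/dθ.
dw = (w(theta + eps) - w(theta - eps)) / (2 * eps)
rhs = dw / w(theta)

print(lhs, rhs)  # both ≈ 2θ / (θ² + 1) ≈ 0.9396
```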
Start with the policy function: \[y=\pi(s,a;\theta)\] Take the log of both sides and assign it to a new variable \(z\): \[z=\log y=\log \pi(s,a;\theta)\] Differentiate with respect to \(\theta\), recalling the chain rule: \[\frac{\partial z}{\partial\theta}=\frac{\partial z}{\partial y}\frac{\partial y}{\partial \theta}\] where \[\frac{\partial z}{\partial y}=\frac{1}{\pi(s,a;\theta)}\] \[\frac{\partial y}{\partial \theta}=\frac{\partial \pi(s,a;\theta)}{\partial \theta}=\nabla_{\theta}\pi(s,a;\theta)\] thus \[\frac{\partial z}{\partial\theta}=\frac{\nabla_{\theta}\pi(s,a;\theta)}{\pi(s,a;\theta)}\] Since \(z=\log\pi(s,a;\theta)\), this is exactly the identity we introduced earlier, with \(w=\pi(s,a;\theta)\), and we arrive at: \[\frac{\partial z}{\partial\theta}=\frac{\nabla_{\theta}\pi(s,a;\theta)}{\pi(s,a;\theta)}= \nabla_{\theta}\log\pi(s,a;\theta)\]
Finally, rearranging: \[\nabla_{\theta}\pi(s,a;\theta)=\pi(s,a;\theta)\nabla_{\theta}\log\pi(s,a;\theta)\]
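Here is a minimal sketch that checks this result numerically, assuming a small softmax policy over three actions (the parameterisation is an illustrative assumption, not the one from the previous posts):

```python
# Check ∇_θ π(s,a;θ) = π(s,a;θ) ∇_θ log π(s,a;θ) for a small softmax policy
# over 3 actions, using finite-difference gradients.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)   # one logit per action
a, eps = 1, 1e-6             # action whose probability we differentiate

def pi(theta, a):
    """Softmax probability of action a under parameters theta."""
    z = np.exp(theta - theta.max())
    return z[a] / z.sum()

# Finite-difference gradients of π and log π with respect to each θ_i.
grad_pi = np.zeros(3)
grad_log_pi = np.zeros(3)
for i in range(3):
    d = np.zeros(3); d[i] = eps
    grad_pi[i] = (pi(theta + d, a) - pi(theta - d, a)) / (2 * eps)
    grad_log_pi[i] = (np.log(pi(theta + d, a)) - np.log(pi(theta - d, a))) / (2 * eps)

# The two sides of the identity should match (up to finite-difference error).
print(grad_pi)
print(pi(theta, a) * grad_log_pi)
```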
which is the final result we were after: the “trick” that helps make the policy gradient method palatable, because it expresses the gradient of the policy as the policy itself times the gradient of its log-probability.
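To see why this matters in practice, note that the trick lets us write \(\nabla_{\theta}\,\mathbb{E}_{a\sim\pi}[f(a)]=\mathbb{E}_{a\sim\pi}[f(a)\,\nabla_{\theta}\log\pi(a;\theta)]\): the gradient of an expectation becomes an expectation we can estimate by sampling. Below is a rough sketch checking this on a toy softmax distribution with an arbitrary reward vector; both are illustrative assumptions, not part of the derivation above:

```python
# Sketch of the score-function (log-derivative) estimator enabled by the trick:
# ∇_θ E_{a~π}[f(a)] = E_{a~π}[f(a) ∇_θ log π(a;θ)].
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.2, -0.5, 0.1])      # logits of a 3-action softmax policy
f = np.array([1.0, 3.0, -2.0])          # arbitrary "reward" for each action

def probs(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Exact gradient of E[f(a)] = Σ_a π(a;θ) f(a) for a softmax:
# ∂/∂θ_i = π(i) (f(i) - Σ_a π(a) f(a)).
p = probs(theta)
exact = p * (f - p @ f)

# Monte Carlo estimate via the log-derivative trick: sample a ~ π and average
# f(a) ∇_θ log π(a;θ), where ∇_{θ_i} log π(a;θ) = 1{i=a} - π(i) for a softmax.
samples = rng.choice(3, size=200_000, p=p)
grad_log = np.eye(3)[samples] - p       # one row of ∇ log π per sample
estimate = (f[samples][:, None] * grad_log).mean(axis=0)

print(exact)
print(estimate)  # should be close to the exact gradient
```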
At last we can move on to looking at some algorithms…
(All mistakes are mine, corrections appreciated.)
References:
- [1] Meyer, David. “Notes on policy gradients and the log derivative trick for reinforcement learning.”