Ok, so now we have the objective function, \(J(\theta)\), defined, which, to remind ourselves, is a function that evaluates the quality of the policy, where higher is better. We also have an update rule that is basically hill “climbing”, as we attempt to ascend to an optimum, preferably a global one but most likely a local one: \(\theta' = \theta + \alpha\nabla_{\theta}J(\theta)\).
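Just to make the update rule concrete, here is a minimal sketch in Python, assuming we already have some callable grad_J that estimates \(\nabla_{\theta}J(\theta)\) (a hypothetical placeholder, since getting a handle on that gradient is the whole point of this post):

```python
import numpy as np

def ascent_step(theta, grad_J, alpha=0.01):
    """One hill-climbing step: theta' = theta + alpha * grad_J(theta).

    grad_J is a hypothetical callable returning an estimate of the
    policy gradient at theta; we have not derived it yet.
    """
    return theta + alpha * grad_J(theta)

# Example usage with a made-up gradient (of -(theta - 1)^2), just to show the call:
theta = np.zeros(3)
theta = ascent_step(theta, lambda t: -2.0 * (t - 1.0), alpha=0.1)
```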
Now let's get a grip on \(\nabla_{\theta}J(\theta)\). We talked about gradients in a separate series, “Gradient Descent with a dash of Linear Algebra”, and understand that the gradient is a column vector of partial derivatives with respect to each parameter that parameterizes the policy, \(\pi(a|s, \theta)\).
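In other words, if the policy has parameters \(\theta = (\theta_{1}, \dots, \theta_{n})\), then\[\nabla_{\theta}J(\theta) = \begin{pmatrix} \frac{\partial J(\theta)}{\partial \theta_{1}} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_{n}} \end{pmatrix}\]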
The policy gradient theorem states that for any differentiable policy \(\pi_{\theta}\) and any policy objective function \(J(\theta)\), the policy gradient is given by:
\[\nabla_{\theta}J(\theta)=\mathbf{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)Q^{\pi_{\theta}}(s,a)]\]The question is: how do we get from \(\nabla_{\theta}J(\theta)\) to \(\mathbf{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)Q^{\pi_{\theta}}(s,a)]\)? We'll work backwards step by step, starting from the end result stated above.
First, rewrite the expected value under the policy, \(\mathbf{E}_{\pi_{\theta}}[..]\), as \(\sum_{s}d(s)\sum_{a}\pi_{\theta}(s,a)\), where \(d(s)\) is the distribution of states encountered while following \(\pi_{\theta}\), to get: \[\nabla_{\theta}J(\theta)=\sum_{s}d(s)\sum_{a}\pi_{\theta}(s,a)\nabla_{\theta}\log\pi_{\theta}(s,a)Q^{\pi_{\theta}}(s,a)\] Here \(Q^{\pi_{\theta}}(s,a)\) is the state-action value function: an estimate of the return obtained from taking action \(a\) in state \(s\) and following policy \(\pi_{\theta}\) thereafter. We'll leave this as is.
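Before unwinding further, it is worth noting that this form is exactly what makes the theorem useful in practice: it is an expectation over the state-action pairs visited while following the policy, so we can estimate it by sampling. Here is a rough sketch, assuming a tabular softmax policy and assuming we are handed both sampled \((s, a)\) pairs and some estimate of \(Q^{\pi_{\theta}}\) (all names here are illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Score of a tabular softmax policy with parameters theta[s, a].

    d/dtheta log pi(a|s) is zero outside row s; within row s it is
    one_hot(a) - pi(.|s).
    """
    grad = np.zeros_like(theta)
    pi_s = softmax(theta[s])
    grad[s] = -pi_s
    grad[s, a] += 1.0
    return grad

def policy_gradient_estimate(theta, samples, Q):
    """Monte Carlo estimate of E_pi[ grad log pi(s, a) * Q(s, a) ].

    samples : list of (s, a) pairs collected while following the policy
    Q       : array Q[s, a] of (estimated) action values, assumed given here
    """
    grads = [grad_log_pi(theta, s, a) * Q[s, a] for s, a in samples]
    return np.mean(grads, axis=0)
```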
Next, use the \(\textbf{so-called ratio trick}\), \(\nabla_{\theta}\pi_{\theta}(s,a) = \pi_{\theta}(s,a)\frac{\nabla_{\theta}\pi_{\theta}(s,a)}{\pi_{\theta}(s,a)}=\pi_{\theta}(s,a)\nabla_{\theta}\log\pi_{\theta}(s,a)\), which we will break down in the next post, to unwind further: applying it in reverse, \(\pi_{\theta}(s,a)\nabla_{\theta}\log\pi_{\theta}(s,a)\) collapses back into \(\nabla_{\theta}\pi_{\theta}(s,a)\). This leaves us with: \[\nabla_{\theta}J(\theta)=\sum_{s}d(s)\sum_{a}\nabla_{\theta}\pi_{\theta}(s,a)Q^{\pi_{\theta}}(s,a)\]
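If the identity feels suspicious, here is a toy numerical check for a softmax policy over three actions, with finite differences standing in for the analytic gradients (purely illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

# Check  grad pi(a) = pi(a) * grad log pi(a)  by finite differences.
theta = np.array([0.2, -1.0, 0.5])  # logits for 3 actions
a, eps = 1, 1e-6

grad_pi = np.array([
    (softmax(theta + eps * np.eye(3)[k])[a] - softmax(theta)[a]) / eps
    for k in range(3)
])
grad_log_pi = np.array([
    (np.log(softmax(theta + eps * np.eye(3)[k])[a]) - np.log(softmax(theta)[a])) / eps
    for k in range(3)
])

print(np.allclose(grad_pi, softmax(theta)[a] * grad_log_pi, atol=1e-4))  # True
```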
By linearity, we can pull the \(\nabla_{\theta}\) outside the summations, treating \(d(s)\) and \(Q^{\pi_{\theta}}(s,a)\) as constants with respect to \(\theta\) for this informal backwards walk (handling their actual dependence on \(\theta\) is what the full proof of the policy gradient theorem takes care of):\[\nabla_{\theta}J(\theta)=\nabla_{\theta}\sum_{s}d(s)\sum_{a}\pi_{\theta}(s,a)Q^{\pi_{\theta}}(s,a)\]
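For a bit of intuition about why this is reasonable, consider the one-step case (a contextual bandit), where the return is just the immediate reward \(r(s,a)\) and the state distribution \(d(s)\) does not depend on the policy. There the step is exact, since only \(\pi_{\theta}\) carries any \(\theta\) dependence:\[\nabla_{\theta}\sum_{s}d(s)\sum_{a}\pi_{\theta}(s,a)\,r(s,a)=\sum_{s}d(s)\sum_{a}\nabla_{\theta}\pi_{\theta}(s,a)\,r(s,a)\]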
We can re-introduce the expectation under the policy for clarity.
\[\nabla_{\theta}J(\theta)=\nabla_{\theta}\mathbf{E}_{\pi_{\theta}}[Q^{\pi_{\theta}}(s,a)]\]What does this mean?! I guess it makes some intuitive sense in that we are trying to find the direction, given by the gradient, in which the expected value of the Q function under the policy increases. The reason researchers want to move away from this formulation is that the gradient of an expectation is awkward to compute directly: the expectation is taken with respect to a distribution that itself depends on \(\theta\), so we cannot simply sample from the policy and average. The log form we started from pushes the gradient inside the expectation, giving something we can estimate from samples.
Now we just need to get a handle on this ratio trick, which comes up often when talking about policy gradient methods.
(All mistakes are mine, corrections appreciated.)
References:
- Silver, David. Lecture 7: Policy Gradient
- Meyer, David. Notes on policy gradients and the log derivative trick for reinforcement learning
- Mohamed, Shakir. Machine Learning Trick of the Day (5): Log Derivative Trick