Actor-Critic and Policy Gradient Methods #6

08 Mar 2017 » rldm

Previous Posts:

  1. Actor-Critic and Policy Gradient Methods #1
  2. Actor-Critic and Policy Gradient Methods #2
  3. Actor-Critic and Policy Gradient Methods #3
  4. Actor-Critic and Policy Gradient Methods #4
  5. Actor-Critic and Policy Gradient Methods #5

Finally, after discussing the policy gradient method and the variance inherent in a Monte-Carlo-driven algorithm, we can trek into the space of Actor-Critic algorithms.

To quickly recap, the derivation of the policy gradient requires the calculation of an integral, which can be estimated via Monte-Carlo techniques; this transforms the integration problem into one of calculating a weighted average of samples.[1] In the case of an RL agent, the samples equate to episodes, so the agent can obtain an estimate simply by acting in the environment. As we discussed previously, a small sample set can misrepresent the expected value, thus a large number of episodes is required to counter the variance.
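
As a rough illustration of that recap (a sketch of my own, not from the references), the snippet below averages samples of \(f(X)=X^2\) with \(X\sim\mathcal{N}(0,1)\), whose true expectation is 1, and shows how the estimator's spread shrinks only as the sample count grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(n_samples):
    # Stand-in for per-episode returns: f(X) = X**2 with X ~ N(0, 1), so E[f(X)] = 1.
    x = rng.normal(size=n_samples)
    return np.mean(x ** 2)

for n in (10, 100, 10_000):
    estimates = [mc_estimate(n) for _ in range(1_000)]
    print(f"n={n:6d}  mean={np.mean(estimates):.3f}  std={np.std(estimates):.3f}")

# The mean stays near the true value 1.0, while the standard deviation of the
# estimator falls roughly as 1/sqrt(n): many episodes are needed to tame variance.
```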

To address this potential for high variance, the literature indicates that a general approach to reducing the variance of a Monte-Carlo estimator is to introduce a control variate: a baseline, or a parameterized function approximator, with zero mean. The Actor-Critic algorithm introduces a “critic”, which, framed intuitively, is a judge of the actor's actions. If the actor is the source of high variance, the critic supplies some sanity, providing low-variance feedback on the quality of the performance.[2]
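
To make the effect concrete, here is a toy sketch of my own (not an algorithm from the survey): a two-action sigmoid policy in a single state, where the score-function gradient estimator is computed with and without a baseline. Both versions share the same mean, but subtracting the baseline cuts the variance considerably.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.3                      # single policy parameter (sigmoid parameterization)
rewards = np.array([1.0, 5.0])   # deterministic reward for each of the two actions

def grad_sample(baseline):
    p1 = 1.0 / (1.0 + np.exp(-theta))       # pi(a=1 | theta)
    a = rng.choice(2, p=[1.0 - p1, p1])
    score = a - p1                          # d/dtheta log pi(a) for this sigmoid policy
    return score * (rewards[a] - baseline)

for b in (0.0, rewards.mean()):
    samples = np.array([grad_sample(b) for _ in range(100_000)])
    print(f"baseline={b:3.1f}  mean={samples.mean():+.3f}  var={samples.var():.3f}")

# Both rows report (roughly) the same mean gradient, but the variance is far
# smaller when the baseline is subtracted from the sampled return.
```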

The introduction of a critic not only reduces variance but also maintains the favorable convergence properties of policy gradient methods, where convergence is obtained if the estimated gradients are unbiased and the learning rates satisfy: \[\sum_{k=0}^{\infty}\alpha_{a,k} = \infty\] \[\sum_{k=0}^{\infty}\alpha^2_{a,k} < \infty\]
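
A standard schedule that meets both conditions (my example, not one prescribed by the survey) is \(\alpha_{a,k} = 1/k\), since the harmonic series diverges while the sum of its squares converges: \[\sum_{k=1}^{\infty}\frac{1}{k} = \infty, \qquad \sum_{k=1}^{\infty}\frac{1}{k^2} = \frac{\pi^2}{6} < \infty\]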

As suggested earlier, the introduced term has zero mean, which guarantees that the expected value of the estimator, and hence its unbiasedness, is unchanged; the same cannot be said for the variance.
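
To see why the variance can drop (a standard decomposition, not spelled out in the references above), write the estimator with its control variate as \(X - B\): \[\text{Var}(X - B) = \text{Var}(X) - 2\,\text{Cov}(X, B) + \text{Var}(B)\] A control variate that correlates strongly with \(X\) therefore lowers the variance even though, having zero mean, it leaves the expectation untouched.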

Those guarantees on the expectation assume the term is independent of the action. We can quickly work through the math to show that introducing a control variate does not, in fact, affect the expected value.

Introduce the variate: \[\nabla_{\theta}J(\theta) = E_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)(Q^{\pi_{\theta}}(s,a)-B(s))]\]

By linearity we can break up the expected value: \[\nabla_{\theta}J(\theta) = E_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)Q^{\pi_{\theta}}(s,a)]-E_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)B(s)]\]

Focus on the term on the right: \[E_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)B(s)]\]

Using the ratio trick in reverse, and writing the expectation as a sum over states and actions: \[\sum_Sd^{\pi_{\theta}}(s)\sum_A\nabla_{\theta}\pi_{\theta}(s,a)B(s)\]

Pull \(B(s)\) out of the inner sum, since it is independent of the action, and move \(\nabla_{\theta}\) outside by linearity: \[\sum_Sd^{\pi_{\theta}}(s)B(s)\nabla_{\theta}\sum_A\pi_{\theta}(s,a)\]

We know that a PMF satisfies \(\sum_x p(x) = 1\) and that the derivative of a constant is 0: \[\sum_Sd^{\pi_{\theta}}(s)B(s)\cdot 0 = 0\]

Thus the expected value is unchanged: \[\nabla_{\theta}J(\theta) = E_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)Q^{\pi_{\theta}}(s,a)] - 0\]
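
As a sanity check on the derivation (again a sketch of my own, with made-up sizes and distributions), the snippet below enumerates a small tabular softmax policy and verifies that the baseline term \(E_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)B(s)]\) is exactly zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 3, 4
theta = rng.normal(size=(n_states, n_actions))   # tabular softmax policy parameters
d = np.array([0.5, 0.3, 0.2])                    # some state distribution d(s)
B = rng.normal(size=n_states)                    # arbitrary state-only baseline B(s)

def pi(s):
    z = np.exp(theta[s] - theta[s].max())        # softmax over actions in state s
    return z / z.sum()

total = np.zeros_like(theta)
for s in range(n_states):
    p = pi(s)
    for a in range(n_actions):
        grad_log = -p.copy()                     # grad_theta[s,:] log pi(s,a) = e_a - pi(s,.)
        grad_log[a] += 1.0
        # pi(s,a) * grad log pi(s,a) = grad pi(s,a); weight by d(s) and B(s)
        total[s] += d[s] * p[a] * grad_log * B[s]

print(np.allclose(total, 0.0))                   # True: the baseline term has zero mean
```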

In summary, an algorithm that uses the policy gradient method enjoys convergence if certain conditions on bias and the learning rate are satisfied, but it remains exposed to high variance. This translates directly to high sample complexity: a large number of episodes is needed to derive a satisfactory estimate using Monte-Carlo methods. Variance can be reduced by introducing a zero-mean term, the critic, which supplies low-variance feedback without impacting the expected value, and thus maintains the convergence properties inherited from the policy gradient method. The next post will cover different critics, followed by current state-of-the-art algorithms.

References

  1. Grondman et al. A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients
  2. E. Greensmith, P. L. Bartlett, and J. Baxter. Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning