Actor-Critic and Policy Gradient Methods #2

24 Jan 2017 » rldm

This is a continuation of a series of posts examining the actor-critic policy gradient algorithm. In the previous post we left off by introducing the gradient ascent update \[\Delta\theta=\alpha\nabla_{\theta}J(\theta)\] and left a couple of questions unanswered. Specifically, what is the function \(J(\theta)\), and how do we compute its gradient \(\nabla_{\theta}J(\theta)\)?
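
As a quick sketch of what that update looks like in code, assuming we already have some estimate of the gradient (here just a placeholder argument, `grad_J`) and a step size `alpha`, the parameter update is a single line:

```python
import numpy as np

def gradient_ascent_step(theta, grad_J, alpha=0.01):
    """One gradient ascent step on the policy parameters.

    theta  -- current policy parameters (numpy array)
    grad_J -- an estimate of the gradient of J(theta) w.r.t. theta
    alpha  -- step size (learning rate)
    """
    return theta + alpha * grad_J
```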

tl;dr: \(J(\theta)\) is the policy objective function and represents the quality of a given policy. There are three variations typically introduced, but the policy gradient method applies uniformly to all of them.

First, \(J(\theta)\), which evaluates the quality or “goodness” of a particular policy, \(\pi(a|s,\theta)\), is typically presented in three forms: one for the episodic case, one for the continuing case, and one where the average reward per time step is considered.

For the episodic case, the start value can be used, thus

\[J_1(\theta) = V^{\pi_{\theta}}(s_1) = \mathbf{E}_{\pi_{\theta}}[v_1]\]

In English, assuming my understanding is correct, what this is saying is that the policy objective function \(J_1(\theta)\) is defined as the expected return, i.e. the cumulative discounted reward, starting from the start state \(s_1\) and following the given policy.
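
To make that concrete, \(J_1(\theta)\) can be estimated by simply rolling out episodes from the start state and averaging the discounted returns. The sketch below assumes a Gym-style environment with `reset()`/`step()` and a `policy` function that samples an action from \(\pi_{\theta}\); these names are my own, not from the definition above.

```python
import numpy as np

def estimate_start_value(env, policy, gamma=0.99, n_episodes=100):
    """Monte Carlo estimate of J_1(theta) = E_{pi_theta}[return from s_1]."""
    returns = []
    for _ in range(n_episodes):
        state = env.reset()                 # start state s_1
        done, discount, total = False, 1.0, 0.0
        while not done:
            action = policy(state)          # a ~ pi_theta(a | s)
            state, reward, done, _ = env.step(action)
            total += discount * reward      # accumulate discounted reward
            discount *= gamma
        returns.append(total)
    return np.mean(returns)                 # average return over episodes
```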

For the continuing case, the average value is typically considered, where the objective function is defined as: \[J_{avV}(\theta)=\sum_sd^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)\]

\(d^{\pi_{\theta}}(s)\) is the probability of being in state \(s\) under policy \(\pi_{\theta}\), i.e. the stationary distribution of states induced by the policy. Thus \(J_{avV}(\theta)\) is just the probability of being in state \(s\) multiplied by the expected return from state \(s\) under the policy, summed over all states. Basically, as the name indicates, it is the \(\textit{average}\) expected return over all possible states, with the probability of each state drawn from the stationary distribution \(d^{\pi_{\theta}}(s)\).
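
In the tabular case this objective is just a dot product. The sketch below assumes we somehow already have the stationary distribution `d` and the state values `V` as arrays (which, in practice, is exactly the hard part):

```python
import numpy as np

def average_value_objective(d, V):
    """J_avV(theta) = sum_s d^{pi_theta}(s) * V^{pi_theta}(s).

    d -- stationary state distribution under pi_theta, shape (n_states,)
    V -- state values under pi_theta, shape (n_states,)
    """
    return float(np.dot(d, V))
```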

Finally, the average reward per time step is defined as: \[J_{avR}(\theta) = \sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(s,a)\mathbf{R}_s^a\]

Breaking down the symbols, starting from the right side: we take the expected reward \(\mathbf{R}_s^a\) for taking a particular action, weight it by the probability \(\pi_{\theta}(s,a)\) of that action under the policy, and sum over all actions; we then weight the result by the stationary probability \(d^{\pi_{\theta}}(s)\) of being in each state and sum over all states.
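
The same breakdown, again assuming the tabular quantities are available (stationary distribution `d`, policy probabilities `pi`, and expected immediate rewards `R`, all hypothetical names), reads as two weighted sums:

```python
import numpy as np

def average_reward_objective(d, pi, R):
    """J_avR(theta) = sum_s d^{pi_theta}(s) sum_a pi_theta(s, a) * R_s^a.

    d  -- stationary state distribution, shape (n_states,)
    pi -- action probabilities pi_theta(s, a), shape (n_states, n_actions)
    R  -- expected immediate reward R_s^a, shape (n_states, n_actions)
    """
    expected_reward_per_state = np.sum(pi * R, axis=1)  # inner sum over actions
    return float(np.dot(d, expected_reward_per_state))  # outer sum over states
```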

As defining the objective functions took a bit longer than expected, I will continue the interpretation in the next post.