A compilation of concepts I want to remember...

Gradient descent with a dash of Linear Algebra #3

12 Jan 2017 » deeplearning, math

This is #3 of a series in which I am trying to digest the presentation by Ian Goodfellow.

In the last post, we broke down and assessed the Hessian matrix, \(H\). We continue on in this post by analyzing the impact of a gradient step on the cost function and finding the optimal gradient step with respect to cost reduction.

This is followed by the introduction of the Taylor series approximation and the expansion of the cost function, \(J(\theta)\). The Taylor series is typically defined as \[\sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x-a)^n\] and written out as an expansion it looks like: \[f(x)= f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2+\dots\]
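
To make the expansion concrete, here is a quick numerical sketch (my own toy example, not from the presentation) that approximates \(e^x\) around \(a = 0\) with partial Taylor sums:

```python
import math

# Toy illustration: approximate f(x) = exp(x) around a = 0 with partial sums of
# the Taylor series sum_{n=0}^{N} f^(n)(a)/n! * (x - a)^n.
# For exp, every derivative at a = 0 equals 1, so each term is x^n / n!.

def taylor_exp(x, n_terms):
    """Partial Taylor sum of exp(x) around a = 0 using n_terms terms."""
    return sum(x ** n / math.factorial(n) for n in range(n_terms))

x = 0.5
for n in (1, 2, 3, 5):
    print(n, taylor_exp(x, n), math.exp(x))
# The partial sums approach exp(0.5) ≈ 1.6487 as more terms are kept.
```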

When applying this to the cost function, keep in mind that \((\vec{\theta}-\vec{\theta}_0)\cdot(\vec{\theta}-\vec{\theta}_0)=(\vec{\theta}-\vec{\theta}_0)^T(\vec{\theta}-\vec{\theta}_0)\). Note that I use the vector symbol to make it clear that we are dealing with vectors; this applies to all \(\theta\) when referring generally to the parameters.

\[J(\theta) = J(\theta_0) + (\theta-\theta_0)^T\textbf{g}+\frac{1}{2}(\theta-\theta_0)^T\textbf{H}(\theta-\theta_0)+\dots\]

Specifically, Ian approximates the cost function \(J(\theta)\) out to second order to assess the sensitivity of the cost to a gradient step, resulting in: \[J(\theta-\epsilon \textbf{g}) \approx J(\theta) -\epsilon \textbf{g}^T\textbf{g} + \frac{1}{2}\epsilon^2\textbf{g}^T\textbf{H}\textbf{g}\]
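
As a sanity check of this second-order approximation (a minimal sketch, assuming a made-up cost \(J(\theta) = \frac{1}{4}\sum_i \theta_i^4\), not anything from the slides), we can compare the true \(J(\theta-\epsilon \textbf{g})\) against the approximation:

```python
import numpy as np

# Hypothetical cost J(theta) = 0.25 * sum(theta^4): its gradient is theta^3
# (elementwise) and its Hessian is diag(3 * theta^2).

def J(theta):
    return 0.25 * np.sum(theta ** 4)

theta = np.array([1.0, -2.0, 0.5])
g = theta ** 3                    # gradient of J at theta
H = np.diag(3.0 * theta ** 2)     # Hessian of J at theta

eps = 0.01
true_value = J(theta - eps * g)
approx = J(theta) - eps * g @ g + 0.5 * eps ** 2 * g @ H @ g
print(true_value, approx)         # the two values agree closely for a small eps
```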

For easier comprehension, we can factor the minus sign out: \[J(\theta-\epsilon \textbf{g}) \approx J(\theta) -(\epsilon \textbf{g}^T\textbf{g} - \frac{1}{2}\epsilon^2\textbf{g}^T\textbf{H}\textbf{g})\]

and break down the equation as: \[\text{new } J(\theta) \approx \text{old } J(\theta) - (\text{adjustment to cost as a result of the gradient step of size } \epsilon)\]

Now, if \(\textbf{g}^T\textbf{H}\textbf{g} \leq 0\), it is guaranteed that the term \((\epsilon \textbf{g}^T\textbf{g} - \frac{1}{2}\epsilon^2\textbf{g}^T\textbf{H}\textbf{g})\) is positive for any positive step size \(\epsilon\), so we reduce the old \(J(\theta)\) by some amount.

So the question is: what is the optimal step size, i.e., the \(\epsilon\) that results in the largest cost reduction? We can take the derivative of the reduction term with respect to \(\epsilon\), set it to 0, and solve for \(\epsilon\). \[\frac{\partial}{\partial \epsilon}(\epsilon \textbf{g}^T\textbf{g} - \frac{1}{2}\epsilon^2\textbf{g}^T\textbf{H}\textbf{g}) = \textbf{g}^T\textbf{g} - \epsilon\textbf{g}^T\textbf{H}\textbf{g}=0\] \[\epsilon^* = \frac{\textbf{g}^T\textbf{g}}{\textbf{g}^T\textbf{H}\textbf{g}}\]
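
Here is a quick numerical check of \(\epsilon^*\) (a sketch under my own assumptions: a quadratic cost \(J(\theta) = \frac{1}{2}\theta^T A \theta\), so the gradient is \(A\theta\) and the Hessian is simply \(A\)). For a quadratic, the second-order expansion is exact, so \(\epsilon^*\) should be exactly the best step along \(-\textbf{g}\):

```python
import numpy as np

# Quadratic cost J(theta) = 0.5 * theta^T A theta with a fixed SPD matrix A,
# so g = A @ theta and H = A everywhere.

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)       # symmetric positive definite Hessian
theta = rng.standard_normal(4)

def J(t):
    return 0.5 * t @ A @ t

g = A @ theta                     # gradient at theta
H = A                             # Hessian (constant for a quadratic)

eps_star = (g @ g) / (g @ H @ g)  # analytic optimal step along -g

# Scan step sizes and confirm eps_star gives the lowest cost along the gradient line.
eps_grid = np.linspace(0.0, 2 * eps_star, 201)
costs = [J(theta - e * g) for e in eps_grid]
print(eps_star, eps_grid[int(np.argmin(costs))])  # the two should match
```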

He also touches upon the worst-case scenario, when \(\textbf{g}\) aligns with the eigenvector corresponding to \(\lambda_{max}\). This goes back to the previous post: if the gradient runs parallel to that eigenvector, the angle between them is 0, so \(\cos^2(0) = 1\) and \(\textbf{g}^T\textbf{H}\textbf{g}\) becomes \(\lambda_{max}\textbf{g}^T\textbf{g}\) without any discounting. The “adjustment to cost as a result of gradient step” is therefore smallest when \(\textbf{g}\) aligns with \(\lambda_{max}\), and the reduction term becomes: \[(\epsilon - \frac{1}{2}\epsilon^2\lambda_{max})\textbf{g}^T\textbf{g}\]
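
A small sketch of that worst case (again using my own toy matrices): when \(\textbf{g}\) points along the eigenvector belonging to \(\lambda_{max}\), we get \(\textbf{g}^T\textbf{H}\textbf{g} = \lambda_{max}\textbf{g}^T\textbf{g}\), and the optimal step collapses to \(\epsilon^* = 1/\lambda_{max}\):

```python
import numpy as np

# Worst case: g aligned with the eigenvector of H that has the largest
# eigenvalue lambda_max, so g^T H g = lambda_max * g^T g and eps* = 1 / lambda_max.

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
H = M @ M.T                            # symmetric PSD "Hessian"

eigvals, eigvecs = np.linalg.eigh(H)   # eigenvalues in ascending order
lambda_max = eigvals[-1]
v_max = eigvecs[:, -1]                 # eigenvector for lambda_max

g = 3.0 * v_max                        # a gradient that aligns with v_max
eps_star = (g @ g) / (g @ H @ g)
print(eps_star, 1.0 / lambda_max)      # agree up to floating point error
```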

Will continue in the next post… All mistakes are mine; if you find any, please point them out and I will amend.