
Gradient descent with a dash of Linear Algebra #4

16 Jan 2017 » deeplearning, math

This is #4 of a series in which I am trying to digest the presentation by Ian Goodfellow.

In the last post, we assessed the impact of a gradient step on the cost function, \(J(\theta)\). Now we consider the types of destinations that the method will encounter as the gradient steps progress.
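To make the transition concrete, here is a tiny sketch (my own toy example, not from the slides) of gradient steps on a simple quadratic cost; the gradient shrinks toward zero as the iterates approach a critical point, which is exactly the kind of destination considered next.

```python
# A minimal sketch (not from the slides): gradient descent on a toy
# quadratic cost J(theta) = theta^T A theta / 2, just to watch the gradient
# shrink toward zero as the steps approach a critical point.
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 1.0]])          # toy positive-definite "Hessian"
theta = np.array([2.0, -1.5])       # arbitrary starting point
epsilon = 0.1                       # learning rate

for step in range(50):
    grad = A @ theta                # gradient of the quadratic cost
    theta = theta - epsilon * grad  # the usual gradient step

print(theta, np.linalg.norm(A @ theta))  # theta near 0, gradient norm near 0
```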

In the following slides he talks about critical points. From single-variable calculus, we know that if a function \(f\) is differentiable and \(x\) is an optimum, then \(f'(x) = 0\) and \(x\) is a critical point of \(f\).

Since the gradient is the vector of first derivatives, a zero gradient signals a critical point, and with the Hessian matrix, \(\textbf{H}\), in hand we can classify that point as a minimum, a maximum, or a saddle point.

If all the eigenvalues \(\lambda_i > 0\), then the critical point is a minimum, which he shorthands as \(\lambda_{min} > 0\). If all the eigenvalues \(\lambda_i < 0\), then the critical point is a maximum, while if the signs are mixed, the point is classified as a saddle point.
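As a quick illustration of this eigenvalue test, here is a small numpy sketch (my own code, not from the presentation); the function name and tolerance are just placeholders.

```python
# A small sketch (my own, not from the slides): classify a critical point
# from the eigenvalues of the Hessian evaluated at that point.
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Label a critical point given its (symmetric) Hessian H."""
    eigvals = np.linalg.eigvalsh(H)         # real eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "minimum"                    # lambda_min > 0
    if np.all(eigvals < -tol):
        return "maximum"                    # lambda_max < 0
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"               # mixed signs
    return "inconclusive"                   # some eigenvalues ~ 0

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 5.0]])))   # minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -5.0]])))  # saddle point
```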

Is there a way to back out the same information without calculating the eigenvalues?

If we take advantage of a couple of properties, we can:

  1. \(\det(H) = \lambda_1\lambda_2\cdots\lambda_n\)
  2. \(\text{tr}(H) = \lambda_1 + \lambda_2 + \cdots + \lambda_n\)

where \(\det(H)\) can be computed without finding the eigenvalues (e.g., by cofactor expansion or LU decomposition) and \(\text{tr}(H) = \sum_{i=1}^n h_{ii}\), where \(h_{ii}\) are the entries of \(\textbf{H}\) along the diagonal.
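A quick numerical sanity check of those two properties, using an arbitrary symmetric matrix of my own choosing:

```python
# A quick numerical check (my own sketch) of the two properties above:
# det(H) equals the product of the eigenvalues and tr(H) equals their sum.
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])                 # an arbitrary symmetric matrix
eigvals = np.linalg.eigvalsh(H)

print(np.linalg.det(H), np.prod(eigvals))  # ~11.0 and ~11.0
print(np.trace(H), np.sum(eigvals))        # 7.0 and 7.0
```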

Thus, for a \(2\times 2\) Hessian we can use the commonly cited second-derivative-test rules (a sketch in code follows the list):

  1. if \(\det(H) > 0 \wedge \text{tr}(H) > 0 \rightarrow \text{Minimum}\)
  2. if \(\det(H) > 0 \wedge \text{tr}(H) < 0 \rightarrow \text{Maximum}\)
  3. if \(\det(H) < 0 \rightarrow \text{Saddle point}\)

with \(\det(H) = 0\) leaving the test inconclusive. For larger Hessians, the determinant and trace alone do not pin down the signs of all the eigenvalues, so this shortcut does not fully replace the eigenvalue check.
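Here is the determinant/trace shortcut in code for the \(2\times 2\) case (again my own sketch, following the rules above); no eigenvalue computation needed.

```python
# The determinant/trace test for a 2x2 Hessian (my own code, following the
# rules above); classifies the critical point without computing eigenvalues.
import numpy as np

def classify_2x2(H):
    det = np.linalg.det(H)     # lambda_1 * lambda_2
    tr = np.trace(H)           # lambda_1 + lambda_2
    if det > 0 and tr > 0:
        return "minimum"       # both eigenvalues positive
    if det > 0 and tr < 0:
        return "maximum"       # both eigenvalues negative
    if det < 0:
        return "saddle point"  # eigenvalues of opposite sign
    return "inconclusive"      # det == 0: the test says nothing

print(classify_2x2(np.array([[2.0, 0.0], [0.0, 5.0]])))    # minimum
print(classify_2x2(np.array([[-2.0, 0.0], [0.0, -5.0]])))  # maximum
print(classify_2x2(np.array([[2.0, 0.0], [0.0, -5.0]])))   # saddle point
```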

and we are done…for now.