A compilation of concepts I want to remember...

Gradient descent with a dash of Linear Algebra #1

10 Jan 2017 » deeplearning
  1. Gradient descent with a dash of Linear Algebra #1
  2. Gradient descent with a dash of Linear Algebra #2
  3. Gradient descent with a dash of Linear Algebra #3
  4. Gradient descent with a dash of Linear Algebra #4

Update (Dec/10/2019): added links and fixed grammar.

In this series of posts I am trying to digest the presentation by Ian Goodfellow found among the deep learning book's lecture resources, as it's sparse in details (which I am sure were expanded on in the live session) and dense in mathematical notation, at least from the perspective of a non-math person. The purpose of this post is to break down the math and reinforce my understanding.

Ian starts by introducing the cost function, \(J(\theta)\), that we want to minimize; the “gradient”, \(g=\nabla_{\theta}J(\theta)\), which is basically a column vector of the partial derivatives with respect to each parameter, \(\frac{\partial J(\theta)}{\partial \theta_i}\), of \(\vec{\theta}\); and \(\textbf{H}\), the Hessian matrix, which holds the partial derivatives of each component of the gradient, resulting in an \(n\times n\) matrix for \(n\) parameters.

A quick example to move away from abstraction and crystallize what we are dealing with… Consider a hypothetical cost function \(J(\theta_1,\theta_2) = \theta_1^2 + \theta_1\theta_2\). In this case the gradient vector would be \[\begin{bmatrix}\frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2}\end{bmatrix}=\begin{bmatrix}2\theta_1+\theta_2 \\ \theta_1\end{bmatrix}\] The Hessian matrix, \(\textbf{H}\), in this case would be:

\[\begin{bmatrix} \frac{\partial}{\partial\theta_1}\frac{\partial}{\partial \theta_1}J(\theta) & \frac{\partial}{\partial \theta_1}\frac{\partial}{\partial \theta_2}J(\theta)\\ \frac{\partial}{\partial\theta_2}\frac{\partial}{\partial \theta_1}J(\theta) & \frac{\partial}{\partial \theta_2}\frac{\partial}{\partial \theta_2}J(\theta) \end{bmatrix}= \begin{bmatrix}2 & 1\\ 1 & 0 \end{bmatrix}\]
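
To make this concrete beyond the algebra, here is a small Python sketch (my own addition, not from Ian's slides) that checks the analytic gradient and Hessian of \(J(\theta_1,\theta_2)=\theta_1^2+\theta_1\theta_2\) against central finite-difference approximations; the function names are just illustrative placeholders.

```python
import numpy as np

def J(theta):
    # Hypothetical cost function J(theta1, theta2) = theta1^2 + theta1*theta2
    t1, t2 = theta
    return t1**2 + t1 * t2

def analytic_grad(theta):
    # Gradient worked out above: [2*theta1 + theta2, theta1]
    t1, t2 = theta
    return np.array([2 * t1 + t2, t1])

def analytic_hessian(theta):
    # Hessian worked out above: [[2, 1], [1, 0]]
    return np.array([[2.0, 1.0],
                     [1.0, 0.0]])

def numeric_grad(f, theta, eps=1e-6):
    # Central finite differences of f with respect to each parameter
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def numeric_hessian(f, theta, eps=1e-5):
    # Each column j is the finite difference of the gradient along theta_j
    n = len(theta)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros_like(theta)
        e[j] = eps
        H[:, j] = (numeric_grad(f, theta + e) - numeric_grad(f, theta - e)) / (2 * eps)
    return H

theta = np.array([1.5, -0.5])
print(analytic_grad(theta), numeric_grad(J, theta))   # both ~ [2.5, 1.5]
print(analytic_hessian(theta))
print(numeric_hessian(J, theta))                      # ~ [[2, 1], [1, 0]]
```

In a real deep learning setting these derivatives come from automatic differentiation rather than finite differences, but a finite-difference check like this is a handy way to convince yourself that the analytic expressions above are right.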

Ok…one post to handle one slide…this may take a series of posts…