Happy New Year optimisation with Gradient Descent!

Happy 2019 everyone!!

New year, new hopes and a new journey ahead towards minimising a loss function to reach our goals throughout the year!

If only our journey were a multivariate continuous function, things would be so much easier…!

Our life may be way more complex than that, but fortunately, there is a field where optimisation finds extremely practical and fruitful applications. This field is no other than machine learning of course.

Let us dive in the new year by sliding down new exciting slopes using one of the most popular optimisation techniques: the fabulous Gradient Descent.

Gradient descent is a widely popular optimisation technique used for training neural networks (NNs). Despite not assuring to reach a theoretical global minimum, empirical use and evidence shows that it still works “well enough” by reaching local minima (which are usually proximal to the global optimum). Thus, instead of providing an analytical solution for achieving the global optimum (which is most often not possible due to the high complexity of the functions represented by NNs), it manages to attain local optima using heuristics.

The algorithm for calculating Gradient Descent is very simple:

  • Pick a random point (where the function-of-interest is defined)
  • Compute the gradient of the function at that point.
  • Move to the negative direction of the gradient, scaled by a step size / learning rate factor.

As a quick Calculus refresher, the gradient of a function is a vector of all partial derivatives of the function. For example, let’s assume a function with three variables:

f = f(x,y,z)

The gradient of f is:

\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z} ]

Example

Let’s try to calculate the gradient descent for a 2-variable function:

f(x, y) = -(\cos(x) + \cos(y))

starting from point (-\pi/2, 0) .

The gradient of f(x, y) is:

\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} ] = [\sin(x), \sin(y) ]

so at point (-\pi/2, 0) , we get:

\nabla f(-\frac{\pi}{2}, 0) = (\sin(-\frac{\pi}{2}), \sin(0) ) = (-1, 0)

Assuming a learning rate a of 0.5, we move to the negative direction of the gradient:

​​​​​start_point – a * gradient_at_start_point = (-\frac{\pi}{2}, 0) - 0.5 * (-1, 0) = (-\frac{\pi}{2}, 0) + (0.5, 0) = (\frac{(1-\pi)}{2}, 0) \approx (-1.07, 0)

(see Figure 1)

Presentation1

Fig. 1. Calculating gradient descent for 2-variable function f(x, y) = -(cos(x) + cos(y)).

As we said, our function in that case is a loss function, which we want to minimise. We can see that by using gradient descent we’re moving closer to the Global minimum of our function, so we’re getting closer to our goal!

Quiz:

Can you think what would happen if our learning rate was a = π ?…

Answer:

Then, starting from point (-\pi/2, 0)  we would move to (\pi/2, 0)  and then back again to (-\pi/2, 0) , ending up to an endless oscillation that would never converge. So, selecting an appropriate learning rate is critical for the convergence of the optimisation method!

Hope that helped a bit build your intuition regarding gradient descent, a simple yet powerful technique for optimising a multivariate continuous loss function during neural network training.

Happy New Year to all! 🙂

Inspired by brilliant.org

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s