Gradient Descent#
Basic Terminology#
You have already seen the first example of machine learning as decision trees. Now its time for ML with neural networks. Before going into the details of it, let us first understand the idea of “gradient descent”. Let us first understand three basic terminology of ML
Data(\(x_i\), \(y_i\)) Here the index \(i\) represents the \(i^{th}\) datapoint. Traditionally \(x_i\) refers to the input/instance and \(y_i\) refers to the output or labels of the data. For example each \(x_i\) can be the images of certain animals (collection of RGB information of all the pixel in that particular image) and \(y_i\) can be a string of numbers denoting the type of animal (ex: \(100\cdots\)=> Dog, \(010\cdots\)=> Cat, \(001\cdots\)=> Lion and so on)
Model: A complicated enough function \(f\) such that \(f(x_i;\theta)=y_i\) where \(\theta\) denotes some internal parameters of the function. For example, if \(x_i\) denotes time and \(y_i\) denotes the position of a projectile then one might try to find a model \(f(x;a,b,c)= ax^2+bx +c\) with the internal parameters \(a\), \(b\) and \(c\) such that \(f\) is a very good approximation of the relation between the input/output of the data (\(x_i\), \(y_i\)). In a general setup \(f\) needs to be built out of thousands or millions of internal parameters for the dataset with animal pictures and the labels.
Loss Function: In principle, you are free to choose and try-out different functions \(f\) with different internal parameters to fit a particular data. The question of goodness or badness of the function is determined by another function, called the Loss function.
It determines how well the the model \(y=f(x;\theta)\) predicts the data (\(x_i\), \(y_i\)). One quick example would be \(\mathcal{L}(\theta)=\sum_i(y_i-f(x_i;\theta))^2\), dubbed as the mean square error. There exists several different kinds of loss functions depending on the scope and aim of the model that we are working with. In all of them the basic target is to methodically change the internal parameters \(\theta\) of the model \(f(x;\theta)\) such that the the loss function \(\mathcal{L}(\theta)\) is optimised.
Introduction#
The method of Gradient Descent is a fundamental optimization algorithm used in machine learning. Students can modify code cells and visualize how learning rate, number of iterations, and function choice affect convergence.
Why?
For most applications of machine learning, the final goal boils down to optimising the *loss function* given a *dataset* and a *model*.What?
At its core, gradient descent is a method of reaching to the extrema of a given function.The Gradient or slope measures how a function changes with regards to small changes in its parameters. Example: For a function \(f(x)=x^2\) its slope is \(f'(x)\equiv \frac{d f(x)}{dx}= 2x\).
Imagine you are standing on a smooth mountain at night and you need to go to the bottom of the mountain for shelter quickly then
you investigate the slope (the *gradient) at your feet in various directions.
you take a small step downhill in the direction of the steepest descent.
do the same until you reach the bottom.
How
More mathematically, for a differentiable loss function \(\mathcal{L}(\theta)\), the gradient descent update is
which should be repeated until the the \(\theta\) values converge or not change much as the steps (denoted by \(t\)) evolve. In the formula above
\(\alpha\) is called the learning rate, which define the step size.
\(\theta_{t}\) denotes the value of the internal parameters in the current step and \(\theta_{t+1}\) tells you what should be the value of the internal parameters on the next step if one follows the gradient descent method.
\(\nabla_\theta(\mathcal{L}(\theta))\) symbolises the gradient of the loss function with respect to various internal parameters.
Gradient Descent Implementation#
At first let us introduce some packages
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from ipywidgets import interact, FloatSlider, IntSlider
# If you're using Google Colab, run this once
# !pip install ipywidgets --quiet
%matplotlib inline
1. Quadratic Function and Its Gradient#
def f(theta):
return theta**2
def grad_f(theta):
return 2*theta
Gradient Descent Implementation#
Update rule:
def gradient_descent(start, lr, iterations):
theta = start
path = [theta]
for i in range(iterations):
theta = theta - lr * grad_f(theta)
path.append(theta)
return np.array(path)
Visualizing descent on the Quadratic#
Use sliders to adjust the learning rate and number of steps and check the following statements
if \(\alpha\) is too small -> very slow convergence
if \(\alpha\) is too big -> might oscillate or diverge
def plot_gd(lr=0.1, steps=20, start=5.0):
steps = int(steps)
path = gradient_descent(start=start, lr=lr, iterations=steps)
xs = np.linspace(-6, 6, 400)
ys = f(xs)
# Compute convergence step
convergence_eps = 1e-3
converged_at = next((i for i in range(1, len(path)) if abs(path[i] - path[i-1]) < convergence_eps), None)
plt.figure(figsize=(8, 4))
plt.plot(xs, ys, label='$f(x)=x^2$', color='gray', alpha=0.5)
plt.plot(path, f(path), 'o', color='red', label='Descent Steps')
for i in range(len(path)-1):
x0, x1 = path[i], path[i+1]
y0, y1 = f(x0), f(x1)
plt.annotate(
'', xy=(x1, y1), xytext=(x0, y0),
arrowprops=dict(arrowstyle='->', color='red', lw=1.5)
)
title = f'Gradient Descent Path)'
if converged_at is not None:
title += f"\nConverged at step {converged_at}"
else:
title += "\nDid not converge within given steps"
plt.title(title)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.grid(True)
plt.show()
interact(plot_gd, lr=FloatSlider(value=0.1, min=0.01, max=1.0, step=0.01), steps=IntSlider(value=20, min=1, max=100, step=1), start=FloatSlider(value=8.0, min=0.1, max=9.0, step=0.1))
<function __main__.plot_gd(lr=0.1, steps=20, start=5.0)>
Types of Gradient Descent#
There are three popular formats of gradient descent
Batch Gradient Descent#
Computes the gradient of the loss function using the entire dataset.
Pros: Stable and accurate updates; good for convex functions.
Cons: Very slow on large datasets; high memory usage.
Stochastic Gradient Descent#
Updates parameters using only one random sample at each step and introduces some noise
Pros: Fast and can escape local minima; low memory footprint.
Cons: Noisy updates lead to fluctuations; may not converge smoothly.
Mini-batch Gradient Descent#
Computes the gradient using a small random subset (mini-batch) of the data.
Pros: Combines speed of SGD with stability of batch GD; suitable for GPUs.
Cons: Still introduces some noise; batch size selection is critical.
All these variations are trade-offs between speed, stability, and resource usage, and the choice depends on the dataset size and hardware constraints.
SGD vs BGD: 2D Quadratic Function#
# Define the quadratic loss function and its gradient
def loss(theta):
"""Quadratic loss: (θ1 - 2)^2 + (θ2 - 3)^2"""
return (theta[0] - 2)**2 + (theta[1] - 3)**2
def grad_loss(theta):
"""Exact gradient of the loss"""
return np.array([2*(theta[0] - 2), 2*(theta[1] - 3)])
# Batch Gradient Descent
def batch_gd(start, lr, steps):
theta = np.array(start, dtype=float)
path = [theta.copy()]
for _ in range(steps):
theta -= lr * grad_loss(theta)
path.append(theta.copy())
return np.array(path)
# Stochastic Gradient Descent (adds Gaussian noise to gradient)
def sgd(start, lr, steps, noise_scale):
theta = np.array(start, dtype=float)
path = [theta.copy()]
for _ in range(steps):
g = grad_loss(theta)
g += np.random.randn(2) * noise_scale
theta -= lr * g
path.append(theta.copy())
return np.array(path)
def plot_descent(lr=0.1, steps=30, noise=0.5, start1=5.0, start2=5.0, elev=45, azim=135):
# Compute trajectories
bd_path = batch_gd([start1, start2], lr, int(steps))
sd_path = sgd([start1, start2], lr, int(steps), noise)
# Dynamically choose plotting range around data
# include both start and optimum (2,3)
mins = np.min([[start1, start2], [2, 3]], axis=0) - 1
maxs = np.max([[start1, start2], [2, 3]], axis=0) + 1
A = np.linspace(mins[0], maxs[0], 100)
B = np.linspace(mins[1], maxs[1], 100)
AA, BB = np.meshgrid(A, B)
ZZ = (AA - 2)**2 + (BB - 3)**2
# Plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(AA, BB, ZZ, cmap='viridis', alpha=0.6, edgecolor='none')
ax.plot(bd_path[:,0], bd_path[:,1], [loss(p) for p in bd_path],
'ro--', label='Batch GD', lw=2)
ax.plot(sd_path[:,0], sd_path[:,1], [loss(p) for p in sd_path],
'bx--', label='SGD', lw=2)
ax.set_title(f"GD on f=(θ₁-2)²+(θ₂-3)²\n"
f"lr={lr}, steps={steps}, noise={noise}")
ax.set_xlabel('θ₁'); ax.set_ylabel('θ₂'); ax.set_zlabel('Loss')
ax.legend()
# apply interactive view angles
ax.view_init(elev=elev, azim=azim)
plt.show()
# Now include elev/azim sliders in the interact call
interact(
plot_descent,
lr=FloatSlider(value=0.1, min=0.001, max=1.0, step=0.001, description='Learning rate'),
steps=IntSlider(value=30, min=1, max=100, step=1, description='Steps'),
noise=FloatSlider(value=0.5, min=0.0, max=2.0, step=0.05, description='SGD noise'),
start1=FloatSlider(value=5.0, min=-5.0, max=10.0, step=0.5, description='θ₁ start'),
start2=FloatSlider(value=5.0, min=-5.0, max=10.0, step=0.5, description='θ₂ start'),
elev=IntSlider(value=45, min=0, max=90, step=5, description='Elevation'),
azim=IntSlider(value=135, min=-180, max=180, step=15, description='Azimuth'),
)
<function __main__.plot_descent(lr=0.1, steps=30, noise=0.5, start1=5.0, start2=5.0, elev=45, azim=135)>
Further study#
Study adaptive optimizers (AdaGrad, RMSProp, Adam)
Dive into neural networks