Gradient Descent vs Normal Equation

To understand the difference between solving linear regression with gradient descent and with the closed-form normal equation (linear algebra), let's break down both methods mathematically and compare their characteristics.

1. Linear Regression Objective

The goal is to find the optimal parameters (weights) θ that minimize the Mean Squared Error (MSE) cost function:

\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \]

where:

  • \(h_\theta(x) = \theta^T x\) is the linear hypothesis function
  • m is the number of training examples
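
A minimal sketch of this cost computation, assuming NumPy and a design matrix X whose first column is all ones for the intercept term:

    import numpy as np

    def compute_cost(X, y, theta):
        """Mean squared error cost: J(theta) = (1/2m) * sum((X @ theta - y)**2)."""
        m = len(y)
        residuals = X @ theta - y
        return residuals @ residuals / (2 * m)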

2. Solving with Linear Algebra (Normal Equation)

The closed-form solution uses matrix operations to directly compute θ:

\[ \theta = (X^T X)^{-1} X^T y \]

Steps:

  • Compute \(X^T X\) (a square matrix of size n×n, where n is the number of features)
  • Invert \(X^T X\)
  • Multiply the inverse by \(X^T y\)
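
A minimal NumPy sketch of these steps, reusing X and y from above; it calls np.linalg.solve on the normal equations rather than forming an explicit inverse (numerically more stable), and np.linalg.pinv covers the singular case mentioned under Limitations:

    def normal_equation(X, y):
        """Solve (X^T X) theta = X^T y directly; assumes X^T X is invertible."""
        return np.linalg.solve(X.T @ X, X.T @ y)

    def normal_equation_pinv(X, y):
        """Pseudoinverse variant that also handles a singular X^T X."""
        return np.linalg.pinv(X) @ y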

Advantages:

  • Exact solution (no iterations)
  • No learning rate or hyperparameters

Limitations:

  • Computational complexity: \(O(n^3)\) for matrix inversion, which is slow for large n
  • Fails if \(X^T X\) is non-invertible (singular), though pseudoinverses can mitigate this

3. Solving with Gradient Descent

An iterative optimization algorithm that updates θ to minimize J(θ):

\[ \theta := \theta - \alpha \nabla J(\theta) \]

The gradient of J(θ) is:

\[ \nabla J(\theta) = \frac{1}{m} X^T(X\theta - y) \]

Update Rule:

\[ \theta := \theta - \alpha(\frac{1}{m} X^T(X\theta - y)) \]

Steps:

  • Initialize θ (e.g., to zeros)
  • Iteratively adjust θ using the gradient and learning rate α
  • Stop when convergence is reached (e.g., changes in J(θ) are minimal)
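
A minimal full-batch implementation of these steps, reusing compute_cost from section 1; the learning rate alpha, iteration cap, and convergence tolerance below are illustrative defaults, not values prescribed by the method:

    def gradient_descent(X, y, alpha=0.01, max_iters=10_000, tol=1e-8):
        """Repeat theta := theta - alpha * (1/m) X^T (X theta - y) until J(theta) stops improving."""
        m, n = X.shape
        theta = np.zeros(n)                      # initialize theta to zeros
        prev_cost = compute_cost(X, y, theta)
        for _ in range(max_iters):
            grad = X.T @ (X @ theta - y) / m     # gradient of J(theta)
            theta = theta - alpha * grad
            cost = compute_cost(X, y, theta)
            if abs(prev_cost - cost) < tol:      # stop when changes in J(theta) are minimal
                break
            prev_cost = cost
        return theta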

Advantages:

  • Works for large n (computationally efficient per iteration)
  • Scales to massive datasets with stochastic/mini-batch variants (sketched after this list)
  • No matrix inversion required
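
A rough sketch of the mini-batch variant mentioned above; the batch size, epoch count, and shuffling scheme are illustrative assumptions rather than part of the core method:

    def minibatch_gradient_descent(X, y, alpha=0.01, epochs=50, batch_size=32):
        """Update theta from small random batches instead of the full dataset each step."""
        rng = np.random.default_rng(0)
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(epochs):
            order = rng.permutation(m)                  # shuffle examples each epoch
            for start in range(0, m, batch_size):
                idx = order[start:start + batch_size]
                Xb, yb = X[idx], y[idx]
                grad = Xb.T @ (Xb @ theta - yb) / len(idx)
                theta = theta - alpha * grad
        return theta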

Limitations:

  • Requires tuning the learning rate α
  • May converge slowly or diverge if α is poorly chosen

4. Key Differences

Aspect | Normal Equation (Linear Algebra) | Gradient Descent
Approach | Direct analytical solution | Iterative numerical approximation
Computation | \(O(n^3)\) time for matrix inversion | \(O(mn)\) per full-batch iteration; total time depends on the number of iterations
Scalability | Impractical for large n (e.g., >10k features) | Efficient for large n

5. When to Use Each Method

  • Normal Equation: Use when the number of features is modest (roughly n ≤ 10,000) or when an exact solution is needed
  • Gradient Descent: Use for large datasets, high-dimensional features, or when memory constraints exist
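
As a quick end-to-end check, both approaches should recover essentially the same parameters whenever the normal equation is applicable; the synthetic data below and the helper functions from the earlier sketches are assumptions for illustration:

    # Hypothetical synthetic data: y = 4 + 3*x plus Gaussian noise.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 2, size=(200, 1))
    X = np.hstack([np.ones((200, 1)), x])        # prepend intercept column
    y = 4 + 3 * x[:, 0] + rng.normal(0, 0.1, size=200)

    theta_ne = normal_equation(X, y)
    theta_gd = gradient_descent(X, y, alpha=0.1, max_iters=50_000)
    print(theta_ne, theta_gd)                    # both should be close to [4, 3]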

Summary

  • Normal Equation gives an exact solution via matrix inversion but is impractical for large n
  • Gradient Descent iteratively minimizes the cost function and scales efficiently, making it suitable for modern machine learning tasks