To understand the difference between solving linear regression using gradient descent and linear algebra (normal equation), let's break down both methods mathematically and compare their characteristics.
1. Linear Regression Objective
The goal is to find the optimal parameters (weights) θ that minimize the Mean Squared Error (MSE) cost function:
\[
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
\]
where:
- \(h_\theta(x) = \theta^T x\) is the linear hypothesis function
- \(m\) is the number of training examples
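To make the notation concrete, here is a minimal NumPy sketch of this cost function. It assumes the design matrix X already contains a leading column of ones for the intercept; the helper name compute_cost is purely illustrative.

```python
import numpy as np

def compute_cost(X: np.ndarray, y: np.ndarray, theta: np.ndarray) -> float:
    """MSE cost J(theta) = (1/2m) * sum((X @ theta - y)^2)."""
    m = len(y)
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for every example
    return float(residuals @ residuals) / (2 * m)
```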
2. Solving with Linear Algebra (Normal Equation)
The closed-form solution uses matrix operations to directly compute θ:
\[
\theta = (X^T X)^{-1} X^T y
\]
Steps:
- Compute \(X^T X\) (a square matrix of size n×n, where n is the number of features)
- Invert \(X^T X\)
- Multiply the inverse by \(X^T y\)
Advantages:
- Exact solution (no iterations)
- No learning rate or hyperparameters
Limitations:
- Computational complexity: \(O(n^3)\) for matrix inversion, which is slow for large n
- Fails if \(X^T X\) is non-invertible (singular), though pseudoinverses can mitigate this
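As a concrete illustration of these steps, here is a small NumPy sketch. The helper name normal_equation and the toy data are not from the text, and np.linalg.pinv (the pseudoinverse) is used so the code still behaves sensibly when \(X^T X\) is singular, as noted above.

```python
import numpy as np

def normal_equation(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return theta = (X^T X)^(-1) X^T y, using the pseudoinverse for stability."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Example: fit y = 1 + 2x on a tiny dataset (first column of X is the intercept)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = normal_equation(X, y)   # approximately [1.0, 2.0]
```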
3. Solving with Gradient Descent
An iterative optimization algorithm that updates θ to minimize J(θ):
\[
\theta := \theta - \alpha \nabla J(\theta)
\]
The gradient of J(θ) is:
\[
\nabla J(\theta) = \frac{1}{m} X^T(X\theta - y)
\]
Update Rule:
\[
\theta := \theta - \alpha(\frac{1}{m} X^T(X\theta - y))
\]
Steps:
- Initialize θ (e.g., to zeros)
- Iteratively adjust θ using the gradient and learning rate α
- Stop when convergence is reached (e.g., changes in J(θ) are minimal)
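The following is a minimal NumPy sketch of batch gradient descent implementing the update rule above. The learning rate, iteration cap, and convergence tolerance are arbitrary example values, not prescribed by the text; X is again assumed to include an intercept column of ones.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000, tol=1e-10):
    m = len(y)
    theta = np.zeros(X.shape[1])             # initialize theta to zeros
    prev_cost = np.inf
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m     # (1/m) X^T (X theta - y)
        theta = theta - alpha * grad         # theta := theta - alpha * grad
        cost = np.sum((X @ theta - y) ** 2) / (2 * m)
        if abs(prev_cost - cost) < tol:      # stop when J(theta) barely changes
            break
        prev_cost = cost
    return theta
```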
Advantages:
- Works for large n (computationally efficient per iteration)
- Scales to massive datasets with stochastic/mini-batch variants
- No matrix inversion required
Limitations:
- Requires tuning the learning rate α
- May converge slowly or diverge if α is poorly chosen
4. Key Differences
| Aspect | Normal Equation (Linear Algebra) | Gradient Descent |
|---|---|---|
| Approach | Direct analytical solution | Iterative numerical approximation |
| Computation | \(O(n^3)\) time for matrix inversion | \(O(mn)\) per iteration (or \(O(n)\) per example with stochastic updates); total time depends on the number of iterations |
| Scalability | Impractical for large n (e.g., >10k features) | Efficient for large n |
5. When to Use Each Method
- Normal Equation: Use when the number of features is small (roughly n ≤ 10,000) or when an exact solution is needed
- Gradient Descent: Use for large datasets, high-dimensional features, or when memory constraints exist
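To see that both routes reach the same solution, here is a hedged end-to-end sketch on synthetic data; the dataset size, noise level, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # intercept column + features
true_theta = np.array([0.5, 2.0, -1.0, 3.0])
y = X @ true_theta + 0.01 * rng.normal(size=m)

# Normal equation (pseudoinverse for robustness)
theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y

# Batch gradient descent
theta_gd = np.zeros(n + 1)
for _ in range(5000):
    theta_gd -= 0.1 * X.T @ (X @ theta_gd - y) / m

print(np.allclose(theta_ne, theta_gd, atol=1e-3))   # both recover ~true_theta
```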
Summary
- Normal Equation gives an exact solution via matrix inversion but is impractical for large n
- Gradient Descent iteratively minimizes the cost function and scales efficiently, making it suitable for modern machine learning tasks