To understand the difference between solving linear regression using gradient descent and linear algebra (normal equation), let's break down both methods mathematically and compare their characteristics.
1. Linear Regression Objective
The goal is to find the optimal parameters (weights) θ that minimize the Mean Squared Error (MSE) cost function:
\[
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2
\]
where:
- \(h_\theta(x) = \theta^T x\) is the linear hypothesis function
- \(m\) is the number of training examples
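To make the notation concrete, here is a minimal NumPy sketch of this cost function. It assumes the design matrix X already contains a leading column of ones for the intercept; the helper name compute_cost is purely illustrative.

```python
import numpy as np

def compute_cost(X: np.ndarray, y: np.ndarray, theta: np.ndarray) -> float:
    """MSE cost J(theta) = (1/2m) * sum((X @ theta - y)^2)."""
    m = len(y)
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for every example
    return float(residuals @ residuals) / (2 * m)
```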
2. Solving with Linear Algebra (Normal Equation)
The closed-form solution uses matrix operations to directly compute θ:
\[
\theta = (X^T X)^{-1} X^T y
\]
Steps:
- Compute \(X^T X\) (a square matrix of size n×n, where n is the number of features)
- Invert \(X^T X\)
- Multiply the inverse by \(X^T y\)
Advantages:
- Exact solution (no iterations)
- No learning rate or hyperparameters
Limitations:
- Computational complexity: \(O(n^3)\) for matrix inversion, which is slow for large n
- Fails if \(X^T X\) is non-invertible (singular), though pseudoinverses can mitigate this
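As a concrete illustration of these steps, here is a small NumPy sketch. The helper name normal_equation and the toy data are not from the text, and np.linalg.pinv (the pseudoinverse) is used so the code still behaves sensibly when \(X^T X\) is singular, as noted above.

```python
import numpy as np

def normal_equation(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return theta = (X^T X)^(-1) X^T y, using the pseudoinverse for stability."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Example: fit y = 1 + 2x on a tiny dataset (first column of X is the intercept)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = normal_equation(X, y)   # approximately [1.0, 2.0]
```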
3. Solving with Gradient Descent
An iterative optimization algorithm that updates θ to minimize J(θ):
\[
\theta := \theta - \alpha \nabla J(\theta)
\]
The gradient of J(θ) is:
\[
\nabla J(\theta) = \frac{1}{m} X^T(X\theta - y)
\]
Update Rule:
\[
\theta := \theta - \alpha(\frac{1}{m} X^T(X\theta - y))
\]
Steps:
- Initialize θ (e.g., to zeros)
- Iteratively adjust θ using the gradient and learning rate α
- Stop when convergence is reached (e.g., changes in J(θ) are minimal)
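The following is a minimal NumPy sketch of batch gradient descent implementing the update rule above. The learning rate, iteration cap, and convergence tolerance are arbitrary example values, not prescribed by the text; X is again assumed to include an intercept column of ones.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000, tol=1e-10):
    m = len(y)
    theta = np.zeros(X.shape[1])             # initialize theta to zeros
    prev_cost = np.inf
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m     # (1/m) X^T (X theta - y)
        theta = theta - alpha * grad         # theta := theta - alpha * grad
        cost = np.sum((X @ theta - y) ** 2) / (2 * m)
        if abs(prev_cost - cost) < tol:      # stop when J(theta) barely changes
            break
        prev_cost = cost
    return theta
```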
Advantages:
- Works for large n (computationally efficient per iteration)
- Scales to massive datasets with stochastic/mini-batch variants
- No matrix inversion required
Limitations:
- Requires tuning the learning rate α
- May converge slowly or diverge if α is poorly chosen
4. Key Differences
| Aspect | Normal Equation (Linear Algebra) | Gradient Descent |
|---|---|---|
| Approach | Direct analytical solution | Iterative numerical approximation |
| Computation | \(O(n^3)\) time for matrix inversion | \(O(mn)\) per iteration (or \(O(n)\) per example with stochastic updates); total time depends on the number of iterations |
| Scalability | Impractical for large n (e.g., >10k features) | Efficient for large n |
5. When to Use Each Method
- Normal Equation: Use when the number of features is small (roughly n ≤ 10,000) or when an exact solution is needed
- Gradient Descent: Use for large datasets, high-dimensional features, or when memory constraints exist
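To see that both routes reach the same solution, here is a hedged end-to-end sketch on synthetic data; the dataset size, noise level, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # intercept column + features
true_theta = np.array([0.5, 2.0, -1.0, 3.0])
y = X @ true_theta + 0.01 * rng.normal(size=m)

# Normal equation (pseudoinverse for robustness)
theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y

# Batch gradient descent
theta_gd = np.zeros(n + 1)
for _ in range(5000):
    theta_gd -= 0.1 * X.T @ (X @ theta_gd - y) / m

print(np.allclose(theta_ne, theta_gd, atol=1e-3))   # both recover ~true_theta
```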
Summary
- Normal Equation gives an exact solution via matrix inversion but is impractical for large n
- Gradient Descent iteratively minimizes the cost function and scales efficiently, making it suitable for modern machine learning tasks