Logistic Regression#

Linear Regression Limitation#

Linear regression is not suitable for classification because:

  1. Output Range:

    • Linear regression predicts continuous values (−∞ to +∞).

    • Classification needs bounded probabilities (0 to 1). Predictions outside [0,1] cannot be interpreted as valid class probabilities.

  2. Decision Boundary:

    • If you force a threshold (e.g., ≥0.5 → class 1, else class 0), the model assumes the class probability varies linearly with the input features, while real-world class separation is often nonlinear.

  3. Violation of Assumptions:

    • Linear regression assumes homoscedasticity and normally distributed errors.

    • Classification problems involve heteroscedasticity (the variance of a 0/1 outcome is p(1−p), so it changes with the predicted probability), violating these assumptions.

  4. Multiple Classes:

    • Linear regression cannot naturally extend to multi-class classification. Logistic regression or SVM handle this better.

  5. Sensitivity to Outliers:

    • Outliers can push regression predictions far outside the [0,1] range, distorting classification thresholds.

Because of these issues, logistic regression or classification-specific algorithms (e.g., SVM, decision trees) are preferred.
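
To see point 1 concretely, here is a minimal sketch (the 1-D data is made up for illustration) that fits ordinary least squares to 0/1 labels and produces predictions outside [0, 1]:

# Ordinary least squares fit to binary labels (illustrative data).
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1  ])

# Fit y ≈ b0 + b1 * x by least squares
A = np.column_stack([np.ones_like(X), X])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]

# Predictions on a wider range escape [0, 1]
x_new = np.array([-2.0, 0.0, 4.5, 10.0, 15.0])
print(b0 + b1 * x_new)   # values below 0 and above 1 cannot be read as probabilities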

How Logistic Regression Solves the Problem#

Logistic regression solves the problems of using linear regression for classification in these ways:

  1. Bounded Outputs:

    • Logistic regression uses the sigmoid function:

      \[ p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]
    • This maps predictions to the range [0,1], making them interpretable as probabilities.

  2. Decision Boundary:

    • The probability threshold (commonly 0.5) defines the decision boundary.

    • The boundary is still linear in the feature space, but probabilities are modeled correctly.

  3. Error Distribution:

    • Logistic regression does not assume normality of errors.

    • Instead, it uses maximum likelihood estimation (MLE) to find parameters that best fit the classification problem.

  4. Robustness to Outliers:

    • While not immune, logistic regression is less sensitive than linear regression because extreme predictions are squashed by the sigmoid.

  5. Extension to Multiple Classes:

    • Logistic regression extends to multi-class problems through one-vs-rest (OvR) or multinomial logistic regression (softmax regression).
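
As a hedged illustration of the multi-class extension, the short sketch below fits scikit-learn's LogisticRegression (assuming scikit-learn is available) on the three-class Iris dataset; with the default lbfgs solver the multi-class case is handled with the multinomial (softmax) formulation, and predict_proba returns one probability per class:

# Multi-class sketch with scikit-learn (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)             # 3 classes
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict_proba(X[:3]))               # each row sums to 1
print(clf.predict(X[:3]))                     # predicted class labels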

In short:

  • Linear regression → predicts unbounded values, which cannot be interpreted as probabilities.

  • Logistic regression → predicts bounded probabilities, explicitly designed for classification.

Convex and Non-Convex Functions#

A convex function has a bowl-shaped curve, while a non-convex function can have multiple dips and peaks.


Convex function#

  • Definition: A function \(f(x)\) is convex if for any two points \(x_1, x_2\) in its domain and any \(\lambda \in [0,1]\):

    \[ f(\lambda x_1 + (1-\lambda)x_2) \leq \lambda f(x_1) + (1-\lambda)f(x_2) \]
  • Visual property: The line segment between any two points on the curve lies above or on the curve.

  • Optimization:

    • Has a single global minimum.

    • Gradient Descent is guaranteed to converge.

  • Examples:

    • \(f(x) = x^2\)

    • Logistic Regression cost function (Log Loss).
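
As a quick numerical sanity check of the inequality above, the sketch below evaluates both sides for \(f(x) = x^2\) at random points \(x_1, x_2\) and random \(\lambda\):

# Numerically checking the convexity inequality for f(x) = x^2.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2

for _ in range(5):
    x1, x2 = rng.uniform(-10, 10, size=2)
    lam = rng.uniform(0, 1)
    lhs = f(lam * x1 + (1 - lam) * x2)        # f of the convex combination
    rhs = lam * f(x1) + (1 - lam) * f(x2)     # convex combination of f values
    print(f"lhs={lhs:8.3f}  rhs={rhs:8.3f}  holds: {lhs <= rhs}")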


Non-convex function#

  • Definition: Fails the convex inequality above.

  • Visual property: The curve has multiple local minima/maxima.

  • Optimization:

    • Gradient Descent can get stuck in a local minimum instead of the global one.

    • Requires advanced methods (random restarts, stochastic optimization).

  • Examples:

    • \(f(x) = \sin(x)\)

    • Mean Squared Error applied to Logistic Regression.

    • Deep Neural Network loss surfaces.


👉 In Logistic Regression, the log loss cost function is convex, ensuring a single minimum. In contrast, if we incorrectly used squared error, the cost would be non-convex, making training unstable.

import numpy as np
import matplotlib.pyplot as plt

# Define functions
x = np.linspace(-5, 5, 400)
convex_func = x**2
nonconvex_func = np.sin(x)

# Find minima for convex (single global min at x=0)
convex_min_x = 0
convex_min_y = convex_min_x**2

# For non-convex: local minima of sin(x) occur where sin(x) = -1, i.e. x = -pi/2 + 2k*pi
local_minima_x = [-np.pi/2, 3*np.pi/2]  # the only local minima within [-5, 5]
local_minima_y = [np.sin(val) for val in local_minima_x]

# Create side-by-side subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Convex function plot
axes[0].plot(x, convex_func, label=r"$f(x) = x^2$ (Convex)")
axes[0].scatter(convex_min_x, convex_min_y, color='red', zorder=5)
axes[0].annotate("Global Min", (convex_min_x, convex_min_y), textcoords="offset points", xytext=(-10,-15), ha='center')
axes[0].set_title("Convex Function")
axes[0].set_xlabel("x")
axes[0].set_ylabel("f(x)")
axes[0].legend()
axes[0].grid(True)

# Non-convex function plot
axes[1].plot(x, nonconvex_func, label=r"$f(x) = \sin(x)$ (Non-Convex)", color="red")
axes[1].scatter(local_minima_x, local_minima_y, color='blue', zorder=5)
for xi, yi in zip(local_minima_x, local_minima_y):
    axes[1].annotate("Local Min", (xi, yi), textcoords="offset points", xytext=(-5,-15), ha='center')
axes[1].set_title("Non-Convex Function")
axes[1].set_xlabel("x")
axes[1].set_ylabel("f(x)")
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()
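
To connect this to logistic regression specifically, the sketch below (with a small made-up 1-D dataset) sweeps a single weight \(\theta_1\) and compares the squared-error cost against the log-loss cost of a sigmoid model. The squared-error curve flattens into plateaus and is not convex, while the log-loss curve is bowl-shaped:

# Squared error vs. log loss for a sigmoid model, as a function of theta1 (toy data).
import numpy as np
import matplotlib.pyplot as plt

x = np.array([-4.0, -3.0, -1.5, -0.5, 0.5, 1.5, 3.0, 4.0])
y = np.array([0,    0,    0,    1,   0,   1,   1,   1])   # one noisy label on each side

theta0 = 0.0
theta1_grid = np.linspace(-10, 10, 400)
eps = 1e-12                                   # keeps log() finite

mse_cost, logloss_cost = [], []
for t1 in theta1_grid:
    p = 1 / (1 + np.exp(-(theta0 + t1 * x)))
    mse_cost.append(np.mean((p - y) ** 2))
    logloss_cost.append(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(theta1_grid, mse_cost)
axes[0].set_title("Squared error with sigmoid (non-convex)")
axes[0].set_xlabel(r"$\theta_1$"); axes[0].set_ylabel("cost")
axes[1].plot(theta1_grid, logloss_cost)
axes[1].set_title("Log loss (convex)")
axes[1].set_xlabel(r"$\theta_1$"); axes[1].set_ylabel("cost")
plt.tight_layout()
plt.show()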
# Generating a single plot showing the linear score z, its sigmoid, and decision boundary.
import numpy as np
import matplotlib.pyplot as plt

# Parameters
theta0 = -2.0
theta1 = 0.8

# Input range
x = np.linspace(-10, 10, 400)
z = theta0 + theta1 * x
sigmoid = 1 / (1 + np.exp(-z))

# Sample points to illustrate classification
x_samples = np.array([-8, -4, -2, -1, 0.5, 2, 5, 8])
z_samples = theta0 + theta1 * x_samples
probs_samples = 1 / (1 + np.exp(-z_samples))
labels = (probs_samples >= 0.5).astype(int)

plt.figure(figsize=(9,6))
plt.plot(x, z, label='Linear score $z=\\theta_0+\\theta_1 x$')
plt.plot(x, sigmoid, label='Sigmoid $\\sigma(z)$')
plt.axhline(0.5, linestyle='--', label='Decision threshold (0.5)')
# Decision boundary where z = 0 -> x = -theta0/theta1
x_boundary = -theta0 / theta1
plt.axvline(x_boundary, linestyle=':', label=f'Decision boundary $x={x_boundary:.2f}$')

# Plot sample points and their probabilities
for xi, pi, li in zip(x_samples, probs_samples, labels):
    plt.scatter([xi], [pi], s=60)
    plt.text(xi + 0.2, pi - 0.06, f'{pi:.2f}, label={li}', fontsize=9)

plt.xlabel('Feature value $x$')
plt.ylabel('Value')
plt.title('Logistic Regression: Linear score, Sigmoid, and Decision Boundary')
plt.legend()
plt.grid(True)
plt.ylim(-3, 1.1)
plt.xlim(x.min(), x.max())
plt.show()

Why Linear Regression Fails for Classification#

  1. Outliers: The best-fit line can be heavily influenced by extreme values.

  2. Output Range: Linear regression can produce outputs less than 0 or greater than 1, which are invalid for probabilities.
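
As a hedged sketch of point 1 (the data and the outlier are made up), adding a single extreme point noticeably shifts where the least-squares line crosses the 0.5 threshold:

# An outlier drags the least-squares line and moves the 0.5 crossing point.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

def ols_threshold(x, y):
    """Fit y ~ b0 + b1*x by least squares; return where the line crosses 0.5."""
    b1, b0 = np.polyfit(x, y, deg=1)          # slope, intercept
    return (0.5 - b0) / b1

print("threshold without outlier:", ols_threshold(x, y))   # ~4.5, between the classes

# One extreme (but correctly labeled) positive example far to the right
x_out = np.append(x, 60.0)
y_out = np.append(y, 1.0)
print("threshold with outlier:   ", ols_threshold(x_out, y_out))
# The threshold shifts to the right, so positive examples near x = 5 now fall below it.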


Solution: Logistic Regression#

  • Apply a squashing function to the linear output to constrain predictions between 0 and 1.

  • The sigmoid (logistic) function is used:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

Key properties:

  • Outputs always between 0 and 1.

  • Midpoint at 0.5 when \(z = 0\).

  • If \(z > 0\), \(\sigma(z) > 0.5\).

  • Hypothesis function with sigmoid:

\[ h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1) \]
  • For multiple features:

\[ h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots) \]
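
A minimal vectorized sketch of the hypothesis for multiple features (the θ values and inputs below are illustrative):

# Vectorized hypothesis h_theta(x) = sigmoid(theta_0 + theta_1*x_1 + theta_2*x_2 + ...).
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def hypothesis(theta, X):
    """X: (m, n) feature matrix; theta: (n + 1,) parameters including theta_0."""
    X_bias = np.column_stack([np.ones(len(X)), X])   # prepend a column of 1s for theta_0
    return sigmoid(X_bias @ theta)

theta = np.array([-1.0, 2.0, 0.5])                   # theta_0, theta_1, theta_2 (made up)
X = np.array([[0.5, -1.0],
              [2.0,  0.3],
              [-1.5, 4.0]])
print(hypothesis(theta, X))                          # probabilities strictly in (0, 1)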

Cost Function#

  • The squared-error cost used in linear regression becomes non-convex when combined with the sigmoid, producing local minima.

  • Logistic regression uses log loss (cross-entropy) for convexity:

\[\begin{split} \text{Cost}(h_\theta(x), y) = \begin{cases} - \log(h_\theta(x)) & \text{if } y = 1 \\ - \log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \end{split}\]
  • Combined form:

\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \big] \]
  • Ensures convexity, allowing gradient descent to reliably find the global minimum.
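
The combined form of \(J(\theta)\) translates almost line-for-line into code; this sketch (with toy labels and predicted probabilities) also clips the probabilities so the logarithm stays finite:

# Log loss (binary cross-entropy), matching the combined form of J(theta).
import numpy as np

def log_loss(y_true, y_prob, eps=1e-12):
    """Average cross-entropy between labels in {0, 1} and predicted probabilities."""
    p = np.clip(y_prob, eps, 1 - eps)         # keep log() finite
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # illustrative model outputs
print(log_loss(y_true, y_prob))               # small: predictions match labels well
print(log_loss(np.array([1]), np.array([0.01])))  # a confident wrong prediction is punished heavily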


Gradient Descent#

  • Parameter update rule:

\[ \theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} \]
  • Repeat until convergence.

  • For multiple features, update each \(\theta_j\) in the same way.
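
Putting the pieces together, here is a minimal batch gradient descent loop for logistic regression (toy 1-D data; the learning rate and iteration count are arbitrary choices):

# Batch gradient descent for logistic regression on a toy 1-D dataset.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])     # shape (m, 2): bias column plus feature

theta = np.zeros(2)
alpha = 0.1                                    # learning rate (arbitrary choice)
m = len(y)

for _ in range(5000):
    h = sigmoid(X @ theta)                     # predicted probabilities
    grad = X.T @ (h - y) / m                   # dJ/dtheta_j = (1/m) * sum (h - y) * x_j
    theta -= alpha * grad                      # simultaneous update of all theta_j

print("theta:", theta)
print("predictions:", (sigmoid(X @ theta) >= 0.5).astype(int))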


Summary#

  1. Fit a linear model: \(\theta_0 + \theta_1 x_1 + \dots\)

  2. Apply sigmoid activation to squash outputs between 0 and 1.

  3. Use log loss to ensure a convex cost function.

  4. Optimize parameters using gradient descent.

  5. Predictions can now be interpreted as probabilities for classification.
