Cost Function#

Unlike in linear regression, we cannot use Mean Squared Error (MSE) directly: passing predictions through the non-linear sigmoid makes the resulting cost function non-convex, which is hard to optimize with gradient descent.

Here’s a breakdown:


Sigmoid Function#

First, logistic regression outputs probabilities using the sigmoid function:

\[ \hat{y} = h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \]

Where:

  • \(\hat{y}\) = predicted probability that \(y = 1\)

  • \(\theta\) = model parameters

  • \(x\) = input features

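As a quick check of the formula, here is a minimal NumPy sketch of the sigmoid hypothesis; the design matrix `X` (with a leading intercept column) and the parameter vector `theta` are hypothetical example values, not taken from the text above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """h_theta(x) = sigmoid(theta^T x), computed for every row of X."""
    return sigmoid(X @ theta)

# Hypothetical example: 3 samples, intercept column plus 2 features.
X = np.array([[1.0,  0.5,  1.2],
              [1.0, -1.3,  0.7],
              [1.0,  2.1, -0.4]])
theta = np.array([0.1, 0.8, -0.5])

print(predict_proba(theta, X))  # predicted P(y = 1 | x) for each sample
```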

Likelihood Function#

Logistic regression is based on Maximum Likelihood Estimation (MLE).

The likelihood is the probability of observing the given labels under parameters \(\theta\); assuming the \(m\) training examples are independent, it factorizes as:

\[ L(\theta) = \prod_{i=1}^{m} P(y^{(i)} | x^{(i)}; \theta) \]

For binary labels \(y \in \{0,1\}\):

\[ P(y|x; \theta) = (\hat{y})^y (1-\hat{y})^{1-y} \]

So the likelihood becomes:

\[ L(\theta) = \prod_{i=1}^{m} (\hat{y}^{(i)})^{y^{(i)}} (1-\hat{y}^{(i)})^{1-y^{(i)}} \]
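
As an illustration, the product can be computed directly for a tiny hypothetical dataset (the labels and predicted probabilities below are made up for the example):

```python
import numpy as np

# Hypothetical toy data: true labels and the model's predicted P(y = 1).
y     = np.array([1,   0,   1,   1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])

# Per-sample Bernoulli probability: y_hat^y * (1 - y_hat)^(1 - y)
per_sample = y_hat**y * (1 - y_hat)**(1 - y)
print(per_sample)           # [0.9 0.8 0.7 0.6]
print(np.prod(per_sample))  # likelihood L(theta) = 0.3024
```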

Log-Likelihood#

We usually take the log of the likelihood, which turns the product into a sum that is easier to differentiate and numerically more stable:

\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big] \]
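
Continuing the toy numbers from the previous sketch, we can confirm that summing the log terms gives the same value as taking the log of the product:

```python
import numpy as np

y     = np.array([1,   0,   1,   1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])

# Sum of per-sample log terms: y*log(y_hat) + (1-y)*log(1-y_hat)
log_lik = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(log_lik)         # approximately -1.196
print(np.log(0.3024))  # log of the likelihood computed earlier: same value
```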

Cost Function (Negative Log-Likelihood / Log Loss)#

Optimizers conventionally minimize, so we take the negative log-likelihood and average it over the \(m\) training examples:

\[ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big] \]
  • This is the primary cost function used in logistic regression.

  • Intuition:

    • If the model predicts correctly, \(\hat{y}\) is close to \(y\), so log loss is small.

    • If the model is confident but wrong, log loss is very large.

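Here is a minimal sketch of the cost, with a small epsilon clip (an implementation detail, not part of the formula) to guard against \(\log(0)\); the probability vectors are hypothetical:

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy J(theta); eps avoids log(0) at extreme outputs."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])
print(log_loss(y, np.array([0.90, 0.10, 0.80, 0.95])))  # confident and right: small loss
print(log_loss(y, np.array([0.05, 0.95, 0.10, 0.02])))  # confident but wrong: large loss
```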

Variants / Regularized Cost Functions#

To prevent overfitting, we add a regularization penalty to the log loss (a code sketch follows the list):

  1. L2 Regularization (Ridge)

\[ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \]
  2. L1 Regularization (Lasso)

\[ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big] + \frac{\lambda}{m} \sum_{j=1}^{n} |\theta_j| \]
  • \(\lambda\) = regularization parameter

  • L2 penalizes large weights

  • L1 encourages sparsity (many weights become 0)

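The sketch below shows both regularized variants under the common convention of not penalizing the bias term \(\theta_0\); the data reuses the hypothetical design-matrix pattern from the sigmoid example:

```python
import numpy as np

def regularized_log_loss(theta, X, y, lam=1.0, penalty="l2", eps=1e-12):
    """Log loss plus an L1 or L2 penalty; theta[0] (the bias) is not penalized."""
    m = len(y)
    y_hat = np.clip(1.0 / (1.0 + np.exp(-(X @ theta))), eps, 1 - eps)
    base = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    if penalty == "l2":
        return base + (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return base + (lam / m) * np.sum(np.abs(theta[1:]))

X = np.array([[1.0, 0.5, 1.2], [1.0, -1.3, 0.7], [1.0, 2.1, -0.4]])
y = np.array([1, 0, 1])
theta = np.array([0.1, 0.8, -0.5])
print(regularized_log_loss(theta, X, y, lam=0.5, penalty="l2"))
print(regularized_log_loss(theta, X, y, lam=0.5, penalty="l1"))
```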

Alternative (Less Common) Cost Functions#

  • Mean Squared Error (MSE): sometimes used, but not preferred because combining it with the sigmoid makes the cost function non-convex for logistic regression (see the numerical check after this list).

  • Hinge Loss: Used in SVMs, not typical for logistic regression.

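To make the non-convexity claim concrete, here is a quick numerical check under the simplifying assumption of a single training example with \(x = 1\) and \(y = 1\): the second finite differences of the MSE cost change sign along \(\theta\), while those of the log loss stay non-negative.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Simplifying assumption: one training example with x = 1, y = 1.
theta = np.linspace(-6, 6, 401)
mse = 0.5 * (sigmoid(theta) - 1) ** 2   # MSE cost as a function of theta
nll = -np.log(sigmoid(theta))           # log loss as a function of theta

# A convex function has non-negative second differences everywhere.
print(np.all(np.diff(mse, 2) >= 0))  # False: MSE cost is non-convex
print(np.all(np.diff(nll, 2) >= 0))  # True: log loss is convex
```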

Summary Table#

| Cost Function | Formula | Notes |
|---|---|---|
| Log Loss (Binary) | \(-\frac{1}{m} \sum \big[ y \log \hat{y} + (1-y) \log (1-\hat{y}) \big]\) | Standard cost for logistic regression |
| L2 Regularized | Log Loss \(+ \frac{\lambda}{2m} \sum \theta_j^2\) | Penalizes large weights, prevents overfitting |
| L1 Regularized | Log Loss \(+ \frac{\lambda}{m} \sum \lvert \theta_j \rvert\) | Encourages sparse models |
| MSE (Not Recommended) | \(\frac{1}{2m} \sum (\hat{y}-y)^2\) | Non-convex for logistic regression, rarely used |