Cost Function#

Unlike in linear regression, we cannot use Mean Squared Error (MSE) directly: passing predictions through the non-linear sigmoid makes the resulting cost function non-convex, which is hard to optimize with gradient descent.

Here’s a breakdown:


Sigmoid Function#

First, logistic regression outputs probabilities using the sigmoid function:

\[ \hat{y} = h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \]

Where:

  • \(\hat{y}\) = predicted probability that \(y = 1\)

  • \(\theta\) = model parameters

  • \(x\) = input features

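As a quick check of the formula, here is a minimal NumPy sketch of the sigmoid hypothesis; the design matrix `X` (with a leading intercept column) and the parameter vector `theta` are hypothetical example values, not taken from the text above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """h_theta(x) = sigmoid(theta^T x), computed for every row of X."""
    return sigmoid(X @ theta)

# Hypothetical example: 3 samples, intercept column plus 2 features.
X = np.array([[1.0,  0.5,  1.2],
              [1.0, -1.3,  0.7],
              [1.0,  2.1, -0.4]])
theta = np.array([0.1, 0.8, -0.5])

print(predict_proba(theta, X))  # predicted P(y = 1 | x) for each sample
```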

Likelihood Function#

Logistic regression is based on Maximum Likelihood Estimation (MLE).

The likelihood is the probability of observing the given labels under parameters \(\theta\); assuming the \(m\) training examples are independent, it factorizes as:

\[ L(\theta) = \prod_{i=1}^{m} P(y^{(i)} | x^{(i)}; \theta) \]

For binary labels \(y \in \{0,1\}\):

\[ P(y|x; \theta) = (\hat{y})^y (1-\hat{y})^{1-y} \]

So the likelihood becomes:

\[ L(\theta) = \prod_{i=1}^{m} (\hat{y}^{(i)})^{y^{(i)}} (1-\hat{y}^{(i)})^{1-y^{(i)}} \]
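
As an illustration, the product can be computed directly for a tiny hypothetical dataset (the labels and predicted probabilities below are made up for the example):

```python
import numpy as np

# Hypothetical toy data: true labels and the model's predicted P(y = 1).
y     = np.array([1,   0,   1,   1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])

# Per-sample Bernoulli probability: y_hat^y * (1 - y_hat)^(1 - y)
per_sample = y_hat**y * (1 - y_hat)**(1 - y)
print(per_sample)           # [0.9 0.8 0.7 0.6]
print(np.prod(per_sample))  # likelihood L(theta) = 0.3024
```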

Log-Likelihood#

We usually take the log of the likelihood, which turns the product into a sum that is easier to differentiate and numerically more stable:

\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big] \]
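
Continuing the toy numbers from the previous sketch, we can confirm that summing the log terms gives the same value as taking the log of the product:

```python
import numpy as np

y     = np.array([1,   0,   1,   1])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])

# Sum of per-sample log terms: y*log(y_hat) + (1-y)*log(1-y_hat)
log_lik = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(log_lik)         # approximately -1.196
print(np.log(0.3024))  # log of the likelihood computed earlier: same value
```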

Cost Function (Negative Log-Likelihood / Log Loss)#

Optimizers conventionally minimize, so we take the negative log-likelihood and average it over the \(m\) training examples:

\[ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big] \]
  • This is the primary cost function used in logistic regression.

  • Intuition:

    • If the model predicts correctly, \(\hat{y}\) is close to \(y\), so log loss is small.

    • If the model is confident but wrong, log loss is very large.

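Here is a minimal sketch of the cost, with a small epsilon clip (an implementation detail, not part of the formula) to guard against \(\log(0)\); the probability vectors are hypothetical:

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy J(theta); eps avoids log(0) at extreme outputs."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])
print(log_loss(y, np.array([0.90, 0.10, 0.80, 0.95])))  # confident and right: small loss
print(log_loss(y, np.array([0.05, 0.95, 0.10, 0.02])))  # confident but wrong: large loss
```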

Variants / Regularized Cost Functions#

To prevent overfitting, we add a regularization penalty to the log loss (a code sketch follows the list):

  1. L2 Regularization (Ridge)

\[ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \]
  2. L1 Regularization (Lasso)

\[ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big] + \frac{\lambda}{m} \sum_{j=1}^{n} |\theta_j| \]
  • \(\lambda\) = regularization parameter

  • L2 penalizes large weights

  • L1 encourages sparsity (many weights become 0)

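The sketch below shows both regularized variants under the common convention of not penalizing the bias term \(\theta_0\); the data reuses the hypothetical design-matrix pattern from the sigmoid example:

```python
import numpy as np

def regularized_log_loss(theta, X, y, lam=1.0, penalty="l2", eps=1e-12):
    """Log loss plus an L1 or L2 penalty; theta[0] (the bias) is not penalized."""
    m = len(y)
    y_hat = np.clip(1.0 / (1.0 + np.exp(-(X @ theta))), eps, 1 - eps)
    base = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    if penalty == "l2":
        return base + (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return base + (lam / m) * np.sum(np.abs(theta[1:]))

X = np.array([[1.0, 0.5, 1.2], [1.0, -1.3, 0.7], [1.0, 2.1, -0.4]])
y = np.array([1, 0, 1])
theta = np.array([0.1, 0.8, -0.5])
print(regularized_log_loss(theta, X, y, lam=0.5, penalty="l2"))
print(regularized_log_loss(theta, X, y, lam=0.5, penalty="l1"))
```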

Alternative (Less Common) Cost Functions#

  • Mean Squared Error (MSE): sometimes used, but not preferred because combining it with the sigmoid makes the cost function non-convex for logistic regression (see the numerical check after this list).

  • Hinge Loss: Used in SVMs, not typical for logistic regression.

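To make the non-convexity claim concrete, here is a quick numerical check under the simplifying assumption of a single training example with \(x = 1\) and \(y = 1\): the second finite differences of the MSE cost change sign along \(\theta\), while those of the log loss stay non-negative.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Simplifying assumption: one training example with x = 1, y = 1.
theta = np.linspace(-6, 6, 401)
mse = 0.5 * (sigmoid(theta) - 1) ** 2   # MSE cost as a function of theta
nll = -np.log(sigmoid(theta))           # log loss as a function of theta

# A convex function has non-negative second differences everywhere.
print(np.all(np.diff(mse, 2) >= 0))  # False: MSE cost is non-convex
print(np.all(np.diff(nll, 2) >= 0))  # True: log loss is convex
```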

Summary Table#

| Cost Function | Formula | Notes |
|---|---|---|
| Log Loss (Binary) | \(-\frac{1}{m} \sum \big[ y \log \hat{y} + (1-y) \log (1-\hat{y}) \big]\) | Standard cost for logistic regression |
| L2 Regularized | Log Loss \(+ \frac{\lambda}{2m} \sum \theta_j^2\) | Penalizes large weights, prevents overfitting |
| L1 Regularized | Log Loss \(+ \frac{\lambda}{m} \sum \lvert \theta_j \rvert\) | Encourages sparse models |
| MSE (Not Recommended) | \(\frac{1}{2m} \sum (\hat{y}-y)^2\) | Non-convex for logistic regression, rarely used |