# Machine Learning

| Operator / Function | Definition | Usage / Intuition | Example |
|---|---|---|---|
| \( \min_x f(x) \) | Minimum value of a function | Find the smallest value of the objective | \( \min_x (x-3)^2 = 0 \) |
| \( \max_x f(x) \) | Maximum value of a function | Find the largest value of the objective | \( \max_x -(x-3)^2 = 0 \) |
| \( \arg\min_x f(x) \) | Input where the function is minimized | Optimization to find the best parameters | \( \arg\min_x (x-3)^2 = 3 \) |
| \( \arg\max_x f(x) \) | Input where the function is maximized | Find the best parameter location | \( \arg\max_x -(x-3)^2 = 3 \) |
| \( \frac{d}{dx} f(x) \) | Derivative w.r.t. a scalar | Slope / rate of change | \( \frac{d}{dx} (x^2) = 2x \) |
| \( \frac{\partial f}{\partial x_i} \) | Partial derivative | Multivariate rate of change | \( \frac{\partial}{\partial x} (x^2 + y^2) = 2x \) |
| \( \nabla f(x) \) | Gradient vector | Direction of steepest ascent | \( \nabla (x^2 + y^2) = [2x, 2y] \) |
| \( \theta_{t+1} = \theta_t - \eta \nabla_\theta L \) | Gradient descent update | Iteratively minimize the loss | Linear regression update |
| \( \langle u, v \rangle \) | Dot product / inner product | Similarity / projection | \( \langle [1,2],[3,4] \rangle = 11 \) |
| \( |x|_2 \) | L2 norm (Euclidean) | Magnitude of a vector | \( |[3,4]|_2 = 5 \) |
| \( |x|_1 \) | L1 norm (Manhattan) | Sum of absolute values | \( |[3,-4]|_1 = 7 \) |
| \( A^\top \) | Matrix transpose | Swap rows and columns | \( [[1,2],[3,4]]^\top = [[1,3],[2,4]] \) |
| \( \text{Tr}(A) \) | Trace of a matrix | Sum of the diagonal | \( \text{Tr}([[1,2],[3,4]]) = 5 \) |
| \( \det(A) \) | Determinant | Scaling factor of the linear map | \( \det([[1,2],[3,4]]) = -2 \) |
| \( \mathbb{E}[X] \) | Expectation / mean | Average value | \( \mathbb{E}[X] = \sum_i x_i P(x_i) \) |
| \( \text{Var}(X) \) | Variance | Spread of \( X \) | \( \text{Var}([1,2,3]) = 2/3 \) (population variance) |
| \( \text{Cov}(X,Y) \) | Covariance | How \( X \) and \( Y \) vary together | \( \text{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] \) |
| \( \mathbb{P}(A) \) | Probability | Chance of an event | \( \mathbb{P}(X>0) \) |
| \( L(y,\hat{y}) \) | Loss function | Measures prediction error | MSE, cross-entropy |
| \( r_{im} = - \frac{\partial L}{\partial F(x_i)} \) | Pseudo-residuals (boosting) | Direction to reduce the loss | Gradient boosting step |
| \( F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x) \) | Boosted model update | Add the new tree's contribution | Gradient boosting |
| \( \text{sign}(x) \) | Sign function | Direction of a number | \( \text{sign}(-5) = -1 \) |
| \( \mathbf{1}_{\{\text{condition}\}} \) | Indicator function | 1 if true, 0 if false | \( \mathbf{1}_{\{x>0\}} \) |
| \( \sigma(x) \) | Sigmoid function | Maps to a probability in \( [0,1] \) | \( \sigma(0) = 0.5 \) |
| \( \text{softmax}(z)_i \) | Softmax function | Multi-class probabilities | \( \text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j} \) |
| \( \text{ReLU}(x) \) | Rectified linear unit | Nonlinear activation | \( \text{ReLU}(-2) = 0,\ \text{ReLU}(3) = 3 \) |
| \( \hat{y} = F_M(x) \) | Regression prediction | Final model output | Gradient boosting regressor |
| \( \hat{y} = \mathbf{1}[\sigma(F_M(x)) > 0.5] \) | Binary classification prediction | Threshold the probability | Gradient boosting classifier |
| \( \hat{y}_i = \text{softmax}(F_M(x))_i \) | Multi-class classification | Probability per class | Gradient boosting multi-class |
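The optimization entries above can be checked numerically. Below is a minimal sketch of the gradient descent update \( \theta_{t+1} = \theta_t - \eta \nabla_\theta L \) applied to the table's running example \( f(x) = (x-3)^2 \); the function names and the learning rate \( \eta = 0.1 \) are illustrative choices, not fixed by the table.

```python
def f(x):
    # objective from the table: minimum value 0, argmin 3
    return (x - 3) ** 2

def grad(x):
    # derivative df/dx = 2(x - 3)
    return 2 * (x - 3)

x = 0.0     # initial guess theta_0
eta = 0.1   # learning rate

# gradient descent: theta_{t+1} = theta_t - eta * gradient
for _ in range(100):
    x -= eta * grad(x)

print(round(x, 4))     # → 3.0 (the argmin)
print(round(f(x), 4))  # → 0.0 (the min)
```

After enough steps, `x` recovers \( \arg\min_x f(x) = 3 \) and `f(x)` recovers \( \min_x f(x) = 0 \), matching the first and third rows of the table.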
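The linear-algebra rows (dot product, norms, trace, determinant) can likewise be verified with plain Python, using the same example vectors and the 2×2 matrix \( [[1,2],[3,4]] \) from the table:

```python
# dot product: <[1,2], [3,4]> = 1*3 + 2*4 = 11
u, v = [1, 2], [3, 4]
dot = sum(a * b for a, b in zip(u, v))

# L2 norm: |[3,4]|_2 = sqrt(9 + 16) = 5
l2 = sum(a * a for a in [3, 4]) ** 0.5

# L1 norm: |[3,-4]|_1 = 3 + 4 = 7
l1 = sum(abs(a) for a in [3, -4])

A = [[1, 2], [3, 4]]
trace = A[0][0] + A[1][1]                    # Tr(A) = 1 + 4 = 5
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]  # det(A) = 1*4 - 2*3 = -2

print(dot, l2, l1, trace, det)  # → 11 5.0 7 5 -2
```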
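The activation-function rows can be implemented in a few lines. This sketch uses only the standard library; the max-subtraction in `softmax` is a common numerical-stability trick, not something the table prescribes.

```python
import math

def sigmoid(x):
    # maps any real number to (0, 1); sigma(0) = 0.5
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j);
    # subtracting max(z) avoids overflow without changing the result
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def relu(x):
    # ReLU(x) = max(0, x)
    return max(0.0, x)

print(sigmoid(0))          # → 0.5
print(softmax([1, 2, 3]))  # ≈ [0.090, 0.245, 0.665], sums to 1
print(relu(-2), relu(3))   # → 0.0 3
```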
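Finally, the boosting rows fit together as follows: for the squared loss \( L = \frac{1}{2}(y - F(x))^2 \), the pseudo-residuals reduce to \( r_i = y_i - F(x_i) \), and each round adds a shrunken weak learner via \( F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x) \). The toy below replaces the regression tree with a constant weak learner (the mean residual, with \( \gamma_m \) absorbed into \( h_m \)), so it is a sketch of the update rule rather than a real gradient boosting implementation:

```python
y = [1.0, 2.0, 4.0, 7.0]   # training targets
nu = 0.5                   # learning rate (shrinkage)
F = [0.0] * len(y)         # F_0: initial model predicts 0

for m in range(50):
    # pseudo-residuals for squared loss: r_i = y_i - F(x_i)
    r = [yi - fi for yi, fi in zip(y, F)]
    # "weak learner": a depth-0 tree, i.e. the mean residual
    h = sum(r) / len(r)
    # boosted update: F_m = F_{m-1} + nu * h_m
    F = [fi + nu * h for fi in F]

print([round(fi, 3) for fi in F])  # each prediction approaches mean(y) = 3.5
```

With a constant learner every prediction converges to the target mean, which is exactly what minimizing squared loss with this function class allows; real gradient boosting swaps in trees fit to the residuals so predictions can differ per input.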