# Machine Learning
| Operator / Function | Definition | Usage / Intuition | Example |
|---|---|---|---|
| \( \min_x f(x) \) | Minimum value of a function | Find the smallest value of the objective | \( \min_x (x-3)^2 = 0 \) |
| \( \max_x f(x) \) | Maximum value of a function | Find the largest value of the objective | \( \max_x -(x-3)^2 = 0 \) |
| \( \arg\min_x f(x) \) | Input where the function is minimized | Optimization to find the best parameters | \( \arg\min_x (x-3)^2 = 3 \) |
| \( \arg\max_x f(x) \) | Input where the function is maximized | Find the best parameter location | \( \arg\max_x -(x-3)^2 = 3 \) |
| \( \frac{d}{dx} f(x) \) | Derivative w.r.t. a scalar | Slope / rate of change | \( \frac{d}{dx} (x^2) = 2x \) |
| \( \frac{\partial f}{\partial x_i} \) | Partial derivative | Multivariate rate of change | \( \frac{\partial}{\partial x} (x^2 + y^2) = 2x \) |
| \( \nabla f(x) \) | Gradient vector | Direction of steepest ascent | \( \nabla (x^2 + y^2) = [2x, 2y] \) |
| \( \theta_{t+1} = \theta_t - \eta \nabla_\theta L \) | Gradient descent update | Iteratively minimize the loss | Linear regression update |
| \( \langle u, v \rangle \) | Dot product / inner product | Similarity / projection | \( \langle [1,2],[3,4] \rangle = 11 \) |
| \( \lVert x \rVert_2 \) | L2 norm (Euclidean) | Magnitude of a vector | \( \lVert [3,4] \rVert_2 = 5 \) |
| \( \lVert x \rVert_1 \) | L1 norm (Manhattan) | Sum of absolute values | \( \lVert [3,-4] \rVert_1 = 7 \) |
| \( A^\top \) | Matrix transpose | Swap rows ↔ columns | \( [[1,2],[3,4]]^\top = [[1,3],[2,4]] \) |
| \( \text{Tr}(A) \) | Trace of a matrix | Sum of the diagonal | \( \text{Tr}([[1,2],[3,4]]) = 5 \) |
| \( \det(A) \) | Determinant | Scaling factor of a linear map | \( \det([[1,2],[3,4]]) = -2 \) |
| \( \mathbb{E}[X] \) | Expectation / mean | Average value | \( \mathbb{E}[X] = \sum_i x_i P(x_i) \) |
| \( \text{Var}(X) \) | Variance | Spread of \( X \) | \( \text{Var}([1,2,3]) = 2/3 \) |
| \( \text{Cov}(X,Y) \) | Covariance | Measure of joint variability | \( \text{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] \) |
| \( \mathbb{P}(A) \) | Probability | Chance of an event | \( \mathbb{P}(X>0) \) |
| \( L(y,\hat{y}) \) | Loss function | Measures prediction error | \( L(y,\hat{y}) = (y-\hat{y})^2 \) |
| \( r_{im} = - \frac{\partial L}{\partial F(x_i)} \) | Pseudo-residuals (boosting) | Direction that reduces the loss | Gradient boosting step |
| \( F_m = F_{m-1} + \nu \gamma_m h_m(x) \) | Boosted model update | Add the new tree's contribution | Gradient boosting |
| \( \text{sign}(x) \) | Sign function | Sign (direction) of a number | \( \text{sign}(-5) = -1 \) |
| \( \mathbf{1}_{\{\text{condition}\}} \) | Indicator function | 1 if true, 0 if false | \( \mathbf{1}_{x>0} \) |
| \( \sigma(x) \) | Sigmoid function | Map to a probability in \( [0,1] \) | \( \sigma(0) = 0.5 \) |
| \( \text{softmax}(z)_i \) | Softmax function | Multi-class probabilities | \( \text{softmax}([1,2,3]) \approx [0.09, 0.24, 0.67] \) |
| \( \text{ReLU}(x) \) | Rectified Linear Unit | Nonlinear activation | \( \text{ReLU}(-2)=0,\ \text{ReLU}(3)=3 \) |
| \( \hat{y} = F_M(x) \) | Regression prediction | Final model output | Gradient Boosting Regressor |
| \( \hat{y} = \mathbf{1}[\sigma(F_M(x)) > 0.5] \) | Binary classification prediction | Threshold the probability | Gradient Boosting Classifier |
| \( \hat{y}_i = \text{softmax}(F_M(x))_i \) | Multi-class classification | Probability per class | Gradient Boosting Multi-class |
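The inner product and norm rows can be checked numerically. A minimal pure-Python sketch (the function names `dot`, `l2_norm`, and `l1_norm` are ours, not from any library):

```python
import math

# Inner product: <u, v> = sum of elementwise products.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# L2 (Euclidean) norm: square root of the sum of squares.
def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

# L1 (Manhattan) norm: sum of absolute values.
def l1_norm(x):
    return sum(abs(v) for v in x)

print(dot([1, 2], [3, 4]))   # 11
print(l2_norm([3, 4]))       # 5.0
print(l1_norm([3, -4]))      # 7
```

The outputs match the table's examples for \( \langle [1,2],[3,4] \rangle \), \( \lVert [3,4] \rVert_2 \), and \( \lVert [3,-4] \rVert_1 \).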
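The gradient-descent update \( \theta_{t+1} = \theta_t - \eta \nabla_\theta L \) can be sketched on the table's running example \( f(x) = (x-3)^2 \), whose argmin is 3 (the step size `eta` and iteration count are illustrative choices, not prescribed values):

```python
# Minimal gradient-descent sketch: repeatedly apply theta <- theta - eta * grad(theta).
def gradient_descent(grad, theta=0.0, eta=0.1, steps=100):
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# d/dx (x - 3)^2 = 2(x - 3); the iterates converge to the argmin at 3.
theta_star = gradient_descent(lambda x: 2 * (x - 3))
print(round(theta_star, 4))  # 3.0
```

Each step moves \( \theta \) against the gradient, which is why the same update drives linear-regression training when `grad` is the gradient of the squared-error loss in the weights.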
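The activation rows (sigmoid, softmax, ReLU) can likewise be reproduced in a few lines of pure Python; this is a didactic sketch, not a numerics-grade implementation:

```python
import math

# Sigmoid: squashes a real number into (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Softmax: exponentiate and normalize so the outputs sum to 1.
# Subtracting the max is the standard numerical-stability trick.
def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# ReLU: zero out negative inputs, pass positives through.
def relu(x):
    return max(0.0, x)

print(sigmoid(0))                                  # 0.5
print([round(p, 2) for p in softmax([1, 2, 3])])   # [0.09, 0.24, 0.67]
print(relu(-2), relu(3))                           # 0.0 3
```

These match the table's examples: \( \sigma(0) = 0.5 \), \( \text{softmax}([1,2,3]) \approx [0.09, 0.24, 0.67] \), and \( \text{ReLU}(-2)=0,\ \text{ReLU}(3)=3 \).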
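The boosting rows can be tied together in a toy sketch. For the squared-error loss \( L(y,F) = \tfrac{1}{2}(y-F)^2 \), the pseudo-residuals reduce to \( r_{im} = y_i - F_{m-1}(x_i) \). Here the weak learner \( h_m \) is replaced by a constant predictor (the mean residual) purely for illustration, and `nu` plays the role of the shrinkage \( \nu \); a real gradient-boosting model would fit a regression tree to the residuals instead:

```python
# Toy stagewise boosting for squared-error loss.
def boosting_steps(y, n_steps=3, nu=0.5):
    preds = [0.0] * len(y)  # F_0(x) = 0
    for _ in range(n_steps):
        residuals = [yi - fi for yi, fi in zip(y, preds)]  # pseudo-residuals r_im
        h = sum(residuals) / len(residuals)                # "weak learner": mean residual
        preds = [fi + nu * h for fi in preds]              # F_m = F_{m-1} + nu * h
    return preds

print(boosting_steps([1.0, 2.0, 3.0]))  # [1.75, 1.75, 1.75]
```

Each step shrinks the remaining residuals by a factor of \( 1 - \nu \), so the predictions converge toward the target mean (2.0 here), mirroring how \( F_m = F_{m-1} + \nu \gamma_m h_m(x) \) accumulates corrections.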