Statistics
The task of estimating the parameters ($\theta$) from the training data is called training or model fitting.
$\hat{\theta} = \operatorname{argmin}_{\theta} L(\theta)$
- $\hat{\theta}$ = estimated parameter
- $L$ = loss function (aka objective/cost function)
$\operatorname{argmin}\, L(\theta)$ returns the $\theta$ that minimizes $L(\theta)$. Ex: $\operatorname{argmin}_x x^2 = 0$
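As a sanity check, here is a minimal Python sketch (NumPy; names are illustrative) that approximates $\operatorname{argmin}_x x^2$ by grid search:

```python
import numpy as np

# Candidate parameter values and the loss L(x) = x^2 at each one.
candidates = np.linspace(-5, 5, 1001)
loss = candidates ** 2

# argmin picks the candidate with the smallest loss, which is (approximately) 0.
best = candidates[np.argmin(loss)]
print(best)  # ~0.0
```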
Maximum Likelihood Estimate (MLE)
Refers to finding the parameters that maximize the likelihood of the training data. It is the most common approach to parameter estimation: it picks the parameters that assign the highest probability to the training data. $\hat{\theta}_{MLE} = \operatorname{argmax}_{\theta} P(D \mid \theta)$
Maximizing the likelihood is equivalent to maximizing the log-likelihood: $\log P(D \mid \theta) = \sum_{n=1}^{N} \log P(y_n \mid x_n, \theta)$, assuming the training examples are independent.
- Most optimization algorithms minimize a cost function, so we can instead use the negative log-likelihood (NLL) as the cost: $\mathrm{NLL}(\theta) = -\log P(D \mid \theta) = -\sum_{n=1}^{N} \log P(y_n \mid x_n, \theta)$
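To make the NLL concrete, here is a hedged sketch for a Bernoulli (coin-flip) model with no inputs $x_n$; the function name and toy data are illustrative:

```python
import numpy as np

def bernoulli_nll(theta, y):
    """Negative log-likelihood of binary outcomes y under P(y=1) = theta."""
    # P(y_n | theta) = theta if y_n == 1 else (1 - theta)
    probs = np.where(y == 1, theta, 1.0 - theta)
    return -np.sum(np.log(probs))

y = np.array([1, 0, 1, 1, 0, 1])  # toy training data: four 1s, two 0s
# The value closer to the empirical fraction of 1s (2/3) gives the lower NLL.
print(bernoulli_nll(0.5, y), bernoulli_nll(0.67, y))
```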
MLE for Bernoulli
$\hat{\theta} = \operatorname{argmin}_{\theta} \mathrm{NLL}(\theta)$, which for a Bernoulli model gives $\hat{\theta}_{MLE} = \frac{N_1}{N_0 + N_1}$, where $N_1$ is the number of 1s and $N_0$ the number of 0s in the training data.
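To see where this closed form comes from: for a dataset with $N_1$ ones and $N_0$ zeros, the Bernoulli NLL is $\mathrm{NLL}(\theta) = -\big(N_1 \log\theta + N_0 \log(1-\theta)\big)$. Setting the derivative to zero, $-\frac{N_1}{\theta} + \frac{N_0}{1-\theta} = 0$, gives $\hat{\theta}_{MLE} = \frac{N_1}{N_0 + N_1}$, i.e. the empirical fraction of 1s.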
Empirical Risk Minimization (ERM)
We generalize MLE by replacing the log loss with any other loss function: $L(\theta) = \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, \theta, x_n)$, where $\ell$ is the per-example loss.
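A minimal ERM sketch (illustrative names; any per-example loss could be plugged in), here with squared loss for a one-parameter model $y \approx \theta x$:

```python
import numpy as np

def empirical_risk(theta, x, y, loss_fn):
    """Average per-example loss over the training set: (1/N) * sum loss(y_n, theta, x_n)."""
    return np.mean([loss_fn(y_n, theta, x_n) for x_n, y_n in zip(x, y)])

# Per-example squared loss for the model y ≈ theta * x.
squared_loss = lambda y_n, theta, x_n: (y_n - theta * x_n) ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])

# Grid search over theta and pick the empirical risk minimizer.
thetas = np.linspace(0.0, 4.0, 401)
risks = [empirical_risk(t, x, y, squared_loss) for t in thetas]
print(thetas[np.argmin(risks)])  # close to 2.0
```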
Regularization
MLE (Maximum Likelihood Estimation) and ERM (Empirical Risk Minimization) find parameters that minimize loss on the training data, but this doesn’t guarantee that the model will generalize well to unseen data.
If the model fits the training data well but doesn’t generalize to new data, it is overfitted.
Ways to combat overfitting
- Use more data and/or a less complex model
- Apply regularization (adding a penalty term to the model)
$L(\theta; \lambda) = \left[\frac{1}{N}\sum_{n=1}^{N} \ell(y_n, \theta, x_n)\right] + \lambda C(\theta)$
- C is the complexity penalty
- $\lambda$ is the regularization constant
Increasing the regularization constant $\lambda$ strengthens the complexity penalty, which yields a smoother (less overfitted) model.
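A hedged sketch of the penalized objective (illustrative: squared loss and $C(\theta) = \theta^2$), showing how a larger $\lambda$ pulls the fitted parameter toward a simpler model:

```python
import numpy as np

def regularized_loss(theta, x, y, lam):
    """Empirical risk (squared loss) plus a complexity penalty lambda * theta^2."""
    risk = np.mean((y - theta * x) ** 2)
    return risk + lam * theta ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
thetas = np.linspace(0.0, 4.0, 401)

for lam in [0.0, 1.0, 10.0]:
    best = thetas[np.argmin([regularized_loss(t, x, y, lam) for t in thetas])]
    print(lam, best)  # the fitted theta shrinks toward 0 as lambda grows
```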
Weight Decay
It is another form of regularization, illustrated here with a linear model:
$y_n = w_0 + w_1 x_{n,1} + w_2 x_{n,2} + \dots + w_D x_{n,D}$
The data is D-dimensional.
$\theta = (w_0, w_1, \dots, w_D)$ are the parameters of the model; the $w$'s are called weights.
$\hat{w}_{MAP} = \operatorname{argmin}_{w}\left[\mathrm{NLL}(w) + \lambda \lVert w \rVert_2^2\right]$
- $w$ is the weight vector
- $\lVert w \rVert_2^2$ is the square of the $L_2$ norm of $w$ (the penalty term)
- $\lambda$ is the regularization constant
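A minimal NumPy sketch of the weight-decay (ridge) solution via the standard closed form $(X^\top X + \lambda I)^{-1} X^\top y$ (illustrative: the bias column is included and penalized here for brevity, whereas in practice $w_0$ is usually left unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 using the closed-form normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data: 5 examples, a column of ones for w_0 plus 2 features.
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 1.5, 2.5],
              [1.0, 2.0, 3.5],
              [1.0, 2.5, 4.0]])
y = np.array([1.0, 2.1, 2.9, 4.2, 4.8])

print(ridge_fit(X, y, lam=0.1))   # mild weight decay
print(ridge_fit(X, y, lam=10.0))  # stronger decay -> smaller weights
```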