Statistics
The task of estimating the parameters ($\theta$) from the training data is called training or model fitting.
$\hat{\theta} = \operatorname{argmin}_{\theta} L(\theta)$
- $\hat{\theta}$ = estimated parameter
- $L$ = loss function (aka objective/cost function)
$\operatorname{argmin}\, L(\theta)$ returns the $\theta$ that minimizes $L(\theta)$. Ex: $\operatorname{argmin}_x x^2 = 0$
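As a sanity check, here is a minimal Python sketch (NumPy; names are illustrative) that approximates $\operatorname{argmin}_x x^2$ by grid search:

```python
import numpy as np

# Candidate parameter values and the loss L(x) = x^2 at each one.
candidates = np.linspace(-5, 5, 1001)
loss = candidates ** 2

# argmin picks the candidate with the smallest loss, which is (approximately) 0.
best = candidates[np.argmin(loss)]
print(best)  # ~0.0
```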
Maximum Likelihood Estimate (MLE)
Refers to finding the parameters that maximize the likelihood of the training data. It is the most common approach to parameter estimation: it picks the parameters that assign the highest probability to the training data. $\hat{\theta}_{MLE} = \operatorname{argmax}_{\theta} P(D \mid \theta)$
Maximizing the likelihood is equivalent to maximizing the log-likelihood: $\log P(D \mid \theta) = \sum_{n=1}^{N} \log P(y_n \mid x_n, \theta)$, assuming the training examples are independent.
- Most optimization algorithms minimize a cost function, so we can instead use the negative log-likelihood (NLL) as the cost: $\mathrm{NLL}(\theta) = -\log P(D \mid \theta) = -\sum_{n=1}^{N} \log P(y_n \mid x_n, \theta)$
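To make the NLL concrete, here is a hedged sketch for a Bernoulli (coin-flip) model with no inputs $x_n$; the function name and toy data are illustrative:

```python
import numpy as np

def bernoulli_nll(theta, y):
    """Negative log-likelihood of binary outcomes y under P(y=1) = theta."""
    # P(y_n | theta) = theta if y_n == 1 else (1 - theta)
    probs = np.where(y == 1, theta, 1.0 - theta)
    return -np.sum(np.log(probs))

y = np.array([1, 0, 1, 1, 0, 1])  # toy training data: four 1s, two 0s
# The value closer to the empirical fraction of 1s (2/3) gives the lower NLL.
print(bernoulli_nll(0.5, y), bernoulli_nll(0.67, y))
```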
MLE for Bernoulli
$\hat{\theta} = \operatorname{argmin}_{\theta} \mathrm{NLL}(\theta)$, which for a Bernoulli model gives $\hat{\theta}_{MLE} = \frac{N_1}{N_0 + N_1}$, where $N_1$ is the number of 1s and $N_0$ the number of 0s in the training data.
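To see where this closed form comes from: for a dataset with $N_1$ ones and $N_0$ zeros, the Bernoulli NLL is $\mathrm{NLL}(\theta) = -\big(N_1 \log\theta + N_0 \log(1-\theta)\big)$. Setting the derivative to zero, $-\frac{N_1}{\theta} + \frac{N_0}{1-\theta} = 0$, gives $\hat{\theta}_{MLE} = \frac{N_1}{N_0 + N_1}$, i.e. the empirical fraction of 1s.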
Empirical Risk Minimization (ERM)
We generalize MLE by replacing the log loss with any other loss function: $L(\theta) = \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, \theta, x_n)$, where $\ell$ is the per-example loss.
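A minimal ERM sketch (illustrative names; any per-example loss could be plugged in), here with squared loss for a one-parameter model $y \approx \theta x$:

```python
import numpy as np

def empirical_risk(theta, x, y, loss_fn):
    """Average per-example loss over the training set: (1/N) * sum loss(y_n, theta, x_n)."""
    return np.mean([loss_fn(y_n, theta, x_n) for x_n, y_n in zip(x, y)])

# Per-example squared loss for the model y ≈ theta * x.
squared_loss = lambda y_n, theta, x_n: (y_n - theta * x_n) ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])

# Grid search over theta and pick the empirical risk minimizer.
thetas = np.linspace(0.0, 4.0, 401)
risks = [empirical_risk(t, x, y, squared_loss) for t in thetas]
print(thetas[np.argmin(risks)])  # close to 2.0
```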
Regularization
MLE (Maximum Likelihood Estimation) and ERM (Empirical Risk Minimization) find parameters that minimize loss on the training data, but this doesn’t guarantee that the model will generalize well to unseen data.
If the model fits the training data well but doesn’t generalize to new data, it is overfitted.
Ways to combat overfitting
- Use more data and/or a less complex model
- Apply regularization (adding a penalty term to the model)
$L(\theta; \lambda) = \left[\frac{1}{N}\sum_{n=1}^{N} \ell(y_n, \theta, x_n)\right] + \lambda C(\theta)$
- C is the complexity penalty
- $\lambda$ is the regularization constant
Increasing the regularization constant $\lambda$ strengthens the complexity penalty, which yields a smoother (less overfitted) model.
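A hedged sketch of the penalized objective (illustrative: squared loss and $C(\theta) = \theta^2$), showing how a larger $\lambda$ pulls the fitted parameter toward a simpler model:

```python
import numpy as np

def regularized_loss(theta, x, y, lam):
    """Empirical risk (squared loss) plus a complexity penalty lambda * theta^2."""
    risk = np.mean((y - theta * x) ** 2)
    return risk + lam * theta ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
thetas = np.linspace(0.0, 4.0, 401)

for lam in [0.0, 1.0, 10.0]:
    best = thetas[np.argmin([regularized_loss(t, x, y, lam) for t in thetas])]
    print(lam, best)  # the fitted theta shrinks toward 0 as lambda grows
```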
Weight Decay
It is another form of regularization, illustrated here with a linear model:
$y_n = w_0 + w_1 x_{n,1} + w_2 x_{n,2} + \dots + w_D x_{n,D}$
The data is D-dimensional.
$\theta = (w_0, w_1, \dots, w_D)$ are the parameters of the model; the $w$'s are called weights.
$\hat{w}_{MAP} = \operatorname{argmin}_{w}\left[\mathrm{NLL}(w) + \lambda \lVert w \rVert_2^2\right]$
- $w$ is the weight vector
- $\lVert w \rVert_2^2$ is the square of the $L_2$ norm of $w$ (the penalty term)
- $\lambda$ is the regularization constant
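A minimal NumPy sketch of the weight-decay (ridge) solution via the standard closed form $(X^\top X + \lambda I)^{-1} X^\top y$ (illustrative: the bias column is included and penalized here for brevity, whereas in practice $w_0$ is usually left unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 using the closed-form normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data: 5 examples, a column of ones for w_0 plus 2 features.
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 1.5, 2.5],
              [1.0, 2.0, 3.5],
              [1.0, 2.5, 4.0]])
y = np.array([1.0, 2.1, 2.9, 4.2, 4.8])

print(ridge_fit(X, y, lam=0.1))   # mild weight decay
print(ridge_fit(X, y, lam=10.0))  # stronger decay -> smaller weights
```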