What is Supervised Learning?

The goal is to learn (an approximation of) the function that maps each input to its output.

Supervised Learning Algorithms work with input-output pairs!

  • $f^*$ = the model’s approximation of that function
  • $\mathbf{x}$ (bold lowercase) = input vector
  • $y$ = output (typically a scalar unless the output is multi-dimensional)

What is a model parameter?

A variable that is estimated by fitting the model to the given data.

  • ex: $f(x) = mx+c$, where $x$ is the independent variable and $f(x)$ is the dependent variable

Here $m$ and $c$ are the parameters. They are estimated by fitting a straight line to the data, e.g., by minimizing the root mean squared error (see the sketch below). Fitting the parameters is how the model is optimized.
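A minimal sketch of estimating $m$ and $c$ by least squares; the data values here are made up for illustration:

```python
import numpy as np

# Made-up data that roughly follows y = 2x + 1 with some noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 11.0])

# np.polyfit with degree 1 fits y ≈ m*x + c by minimizing squared error
m, c = np.polyfit(x, y, deg=1)
print(f"estimated m = {m:.2f}, c = {c:.2f}")
```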

How to determine input’s dimensions?

n-dimensional input is determined by the number of features.

  • ex: 5 features yields a 5-dimensional input.

What is Classification?

Categorical Outputs

  • ex: Hand-written Digits

What is a Confusion Matrix?

Describes the performance of a classification algorithm.

  1. Output can be TWO OR MORE classes.
  2. For each actual class, it shows how many examples were predicted as that class and how many were predicted as each of the other classes.

A confusion matrix always compares actual values against predicted values.
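A minimal sketch of building a confusion matrix from made-up actual/predicted labels (the class names are hypothetical):

```python
import numpy as np

classes = ["cat", "dog", "bird"]                     # hypothetical classes
actual    = ["cat", "cat", "dog", "bird", "dog", "cat"]
predicted = ["cat", "dog", "dog", "bird", "cat", "cat"]

# Rows = actual class, columns = predicted class
cm = np.zeros((len(classes), len(classes)), dtype=int)
for a, p in zip(actual, predicted):
    cm[classes.index(a), classes.index(p)] += 1
print(cm)
```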

What is Misclassification Rate?

Tells the percentage of observations that were incorrectly predicted by a classification model. It is the ratio of misclassified examples to all examples.

  • ex: $L(\theta) = \frac{6}{50} = 0.12 = 12\%$. This says that 12% of observations were incorrectly predicted.
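A quick sketch of the misclassification rate on the same kind of made-up labels as above:

```python
actual    = ["cat", "cat", "dog", "bird", "dog", "cat"]
predicted = ["cat", "dog", "dog", "bird", "cat", "cat"]

# Fraction of examples where the prediction disagrees with the true label
errors = sum(a != p for a, p in zip(actual, predicted))
rate = errors / len(actual)
print(f"misclassification rate = {rate:.2f}")   # 2/6 ≈ 0.33
```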

What is Zero-One Loss?

Loss = 1 for misclassification; 0 otherwise

What is a Loss Matrix?

Defines penalties for getting the answer wrong.

  • ex: COVID example. If you misclassify someone as having COVID and give them meds, the cost is paying for expensive meds for nothing, since they don’t actually have COVID.
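A minimal sketch of using a loss matrix to pick the action with the lowest expected loss; the 2x2 costs and the probabilities below are made up for the COVID example:

```python
import numpy as np

# Rows = true state (no COVID, COVID), columns = action (no meds, give meds)
loss = np.array([[0.0, 10.0],      # healthy: wasted meds cost 10
                 [100.0, 1.0]])    # sick: missing treatment costs 100

p_state = np.array([0.7, 0.3])     # model's belief about the true state

# Expected loss of each action, averaged over the belief about the state
expected = p_state @ loss
best_action = ["no meds", "give meds"][int(np.argmin(expected))]
print(expected, "->", best_action)
```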

What is Uncertainty?

The uncertainty of a classification model’s prediction is represented as a conditional probability distribution.

$P(y=c|x, \theta) = f_c(x,\theta)$

  1. y=c is the class
  2. x is the input feature
  3. $\theta$ is the parameter

NOTE: Class uncertainty = conditional probability

What is Maximum A Posteriori (MAP) Estimate?

Model output for each class can be represented as a real number.

The softmax function maps these outputs to probabilities; the MAP estimate is the class with the highest resulting probability.

  • Ex: 3 classes A, B, and C with outputs $a = [5, 10, 100]$: $P(Y = A|x, \theta) = \frac{e^5}{e^5+e^{10}+e^{100}}$
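A minimal sketch of the softmax mapping and the resulting MAP class for the scores above (subtracting the max first is a standard numerical-stability trick, not something stated in the notes):

```python
import numpy as np

a = np.array([5.0, 10.0, 100.0])     # raw model outputs for classes A, B, C

# Numerically stable softmax: subtract the max before exponentiating
p = np.exp(a - a.max())
p /= p.sum()

classes = ["A", "B", "C"]
print(p)                             # probabilities; class C dominates
print("MAP estimate:", classes[int(np.argmax(p))])
```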

What is Regression?

A supervised learning model where outputs are continuous (uncountable/real-valued).

Commonly Used Loss Functions:

The most commonly used loss function for regression is squared loss (a small sketch follows the list below).

  1. $L(x_n) = (y_n - \hat y_n)^2$ is squared error
  2. $\sum_{n=1}^N(y_n-\hat y_n)^2$ is the sum squared error
  3. $\frac{1}{N}\sum_{n=1}^N(y_n-\hat y_n)^2$ is the mean squared error
  4. $\frac{1}{N}\sum_{n=1}^N|y_n-\hat y_n|$ is the mean absolute error
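A minimal sketch of these loss functions on made-up targets and predictions:

```python
import numpy as np

y     = np.array([3.0, -0.5, 2.0, 7.0])   # made-up true values
y_hat = np.array([2.5,  0.0, 2.0, 8.0])   # made-up predictions

sq_err = (y - y_hat) ** 2                 # per-example squared error
sse    = sq_err.sum()                     # sum of squared errors
mse    = sq_err.mean()                    # mean squared error
mae    = np.abs(y - y_hat).mean()         # mean absolute error
print(sse, mse, mae)
```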

What is Generalization?

How well the model performs on data it hasn’t seen before.

  • Underfitting - model not complex enough to represent training data well.
  • Overfitting - model is so complex that it performs very well on training data but very poorly on test data.
  • Note: As model complexity increases, so does the number of parameters.

REALLY IMPORTANT TO KNOW
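A minimal sketch of the under/overfitting behaviour described above, using polynomial fits on made-up noisy data (the degrees and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):   # too simple, about right, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```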

Unsupervised Learning Stuff Below

What is Unsupervised Learning?

The model is trained on inputs only (no output labels).

What is the difference between clustering and classification?

Clustering                        | Classification
No labels (unsupervised)          | Has labels (classes); supervised
Must determine the # of clusters  | Training data determines the # of classes

Dimensionality Reduction With Principal Components Analysis (PCA)

Transforms a large set of variables into a smaller one that still contains most of the information in the large set.

  • It works like a feature selection method over the principal components: keep the subset of components that explains a chosen percentage of the variability; components not kept are eliminated.
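A minimal sketch of PCA via the SVD of centered data, keeping enough components to explain 95% of the variance (the data, the 95% threshold, and the SVD-based implementation are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 5 observed features driven mostly by 2 underlying factors
Z = rng.normal(size=(100, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                     # center each feature

# Principal components come from the SVD of the centered data
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()      # fraction of variance per component
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)

X_reduced = X @ Vt[:k].T                   # project onto the top-k components
print(f"kept {k} of {X.shape[1]} components, shape {X_reduced.shape}")
```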

What are the types of Input/Outputs?

  1. Binary (Categorical w/ TWO possible values)
  2. Categorical (with MORE THAN TWO possible values)
  3. Real-valued (Continuous)

What is One-Hot Encoding?

Represents categorical variables as numerical vectors. Example: Panda = [1,0,0], Cat = [0,1,0], Lion = [0,0,1].

Before:

      x1   x2   x3
n=1   ..   ..   Panda
n=2   ..   ..   Cat
n=3   ..   ..   Lion
n=4   ..   ..   Panda

After:

      x1   x2   x3   x4   x5
n=1   ..   ..   1    0    0
n=2   ..   ..   0    1    0
n=3   ..   ..   0    0    1
n=4   ..   ..   1    0    0
  • A categorical input feature (or output) is replaced with its one-hot encoding.

Basically, the table is expanded by introducing more features.
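A minimal sketch of one-hot encoding the animal column by hand (the category order is arbitrary):

```python
categories = ["Panda", "Cat", "Lion"]
column = ["Panda", "Cat", "Lion", "Panda"]

# Each value becomes a vector with a 1 in its category's position
one_hot = [[1 if value == c else 0 for c in categories] for value in column]
print(one_hot)   # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```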

What are Random Variables?

A variable X that represents an unknown or randomly varying quantity of interest.

What are Discrete Random Variables?

If the sample space is finite or countably infinite.

What are Continuous Random Variables?

Random Variable that is a real-valued quantity.

  • Since there are infinitely many values, the probability of any single exact value is 0, so you must use $P(a \le X \le b)$.

What is a Marginal Distribution?

The distribution of a single variable, obtained by summing the joint distribution over the other variable(s); the joint distribution can be represented using a two-way table.

P(x,y)   y=0   y=1
x=0      0.4   0.2
x=1      0.3   0.1

P(x=0, y=0) = 0.4
P(x=1, y=0) = 0.3
P(x=0) = 0.4 + 0.2 = 0.6
P(y=1) = 0.2 + 0.1 = 0.3
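A minimal sketch of computing these marginals from the joint table with NumPy:

```python
import numpy as np

# Joint distribution P(x, y): rows are x = 0, 1; columns are y = 0, 1
joint = np.array([[0.4, 0.2],
                  [0.3, 0.1]])

p_x = joint.sum(axis=1)   # marginal over y: P(x=0)=0.6, P(x=1)=0.4
p_y = joint.sum(axis=0)   # marginal over x: P(y=0)=0.7, P(y=1)=0.3
print(p_x, p_y)
```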

Bayes’ Rule

It combines conditional probability, product rule, and sum rule

Formula: $P(H=h|Y=y) = \frac{P(Y=y|H=h)\,P(H=h)}{\sum_{h'}P(Y=y|H=h')\,P(H=h')}$

The numerator is the product (product rule); the denominator is the sum (sum rule).
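A minimal sketch of Bayes’ rule with made-up prior and likelihood values:

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])        # P(H=h) for three hypotheses
likelihood = np.array([0.9, 0.2, 0.1])   # P(Y=y | H=h) for the observed y

# Posterior: product in the numerator, normalized by the sum over hypotheses
posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)
```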

What is the Bernoulli Distribution?

Represents the distribution of a binary outcome

  • ex: a coin toss. $y = 1$ for heads: $P(Y=1) = \theta$; $y = 0$ for tails: $P(Y=0) = 1 - \theta$

What is Binomial Distribution?

Represents the distribution of a repeated binary outcome: $P(S|N, \theta) = \binom{N}{S}\,\theta^S\,(1-\theta)^{N-S}$

The Bernoulli distribution is the special case of the Binomial when N = 1.
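A minimal sketch of the Binomial pmf using the formula above (the values plugged in are arbitrary):

```python
from math import comb

def binomial_pmf(s, n, theta):
    """P(S = s | N = n, theta): probability of s successes in n trials."""
    return comb(n, s) * theta**s * (1 - theta)**(n - s)

print(binomial_pmf(3, 10, 0.5))   # e.g., 3 heads in 10 fair coin flips ≈ 0.117
```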

Categorical Distribution (Multinoulli)

Generalizes Bernoulli to more than 2 possible outcomes

  • Ex: Rolling a C-sided die where C > 2

Multinomial Distribution

Generalizes categorical distribution to N > 1

What is Training/Model Fitting?

Task of estimating parameters from training data.

What is Maximum Likelihood Estimate (MLE)?

It is the most common approach to parameter estimation. It picks the parameters that assign the highest probability to the training data.
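A minimal sketch of the MLE for a Bernoulli parameter: the fraction of 1s in the (made-up) training data is the value of $\theta$ that maximizes the likelihood.

```python
data = [1, 0, 1, 1, 0, 1, 1, 0]   # made-up coin-flip outcomes

# For a Bernoulli, the MLE of theta is simply the sample mean
theta_mle = sum(data) / len(data)
print(theta_mle)                   # 5/8 = 0.625
```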

What is Empirical Risk Minimization (ERM)?

Generalizes MLE by replacing the log loss with any other loss function.

What is Regularization?

MLE and ERM find parameters that minimize the loss on the training data, but there is no guarantee the model will generalize well to new data (overfitting). Regularization adds a penalty term to the training loss to discourage this.

What are ways to combat overfitting?

  1. Increase data instances
  2. Use a less complex (simpler) model
  3. Apply Regularization (adding penalty term to model)
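A minimal sketch of the penalty-term idea using ridge regression as one concrete example (the data and penalty strength are made up; ridge is just one standard way to add an $\ell_2$ penalty):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                               # made-up inputs
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 20)

lam = 0.1                                                  # regularization strength
# Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w)
```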

What is Bayesian Decision Theory?

Lets us choose the best action in a given situation (typically the action with the lowest expected loss).

What is Entropy?

Measure of uncertainty in a probability distribution
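A minimal sketch of entropy, assuming the standard Shannon definition $H(p) = -\sum_c p_c \log p_c$ (not stated explicitly above; the distributions below are made up):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))            # 1.0 bit: maximally uncertain coin
print(entropy([0.99, 0.01]))          # close to 0: nearly certain outcome
```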