What is Supervised Learning?
Learning to approximate f, the function that perfectly maps each input to its output.
Supervised Learning Algorithms work with input-output pairs!
- f* = the model’s approximation of f
- x (bold lowercase) = input vector
- y = output (typically a scalar unless the output is multi-dimensional)
What is a model parameter?
A variable that can be estimated by fitting the given data to the model.
- ex: $f(x) = mx+c$, where x = independent variable and f(x) = dependent variable
m (slope) and c (intercept) are the parameters. They are estimated by fitting a straight line to the data by minimizing the root mean squared error. The goal of estimating the parameters is to optimize the model's fit to the data.
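To make the line-fitting idea concrete, here is a minimal sketch of estimating m and c with the closed-form least-squares solution. Minimizing the sum of squared errors gives the same m and c as minimizing the root mean squared error. The function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept c for f(x) = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, c = fit_line(xs, ys)
print(m, c)  # slope 2.0, intercept 1.0
```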
How to determine input’s dimensions?
The input’s dimension n is determined by the number of features.
- ex: 5 features yields a 5-dimensional input.
What is Classification?
A supervised learning problem where the outputs are categorical (class labels).
- ex: Recognizing hand-written digits (classes 0 through 9)
What is a Confusion Matrix?
Describes the performance of a classification algorithm.
- Output can be TWO OR MORE classes.
- Shows how many observations of each class are predicted in their own class and how many are predicted in each of the other classes.
A confusion matrix always cross-tabulates actual values against predicted values.
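A small sketch of how a confusion matrix can be built by counting (actual, predicted) pairs. The convention used here (rows = actual class, columns = predicted class) is an assumption, and the labels are made up for illustration.

```python
def confusion_matrix(actual, predicted, classes):
    """Return a matrix where entry [i][j] counts items of actual class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical labels for a 3-class problem.
classes = ["cat", "dog", "bird"]
actual = ["cat", "cat", "dog", "dog", "bird", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat"]
cm = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, cm):
    print(name, row)
```

The diagonal entries count correct predictions; everything off the diagonal is a misclassification.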
What is Misclassification Rate?
Tells the percentage of observations that were incorrectly predicted by a classification model.
It is the ratio of misclassified examples to all given examples.
- ex: $L(\theta) = \frac{6}{50} = 0.12 = 12\%$. This says that 12% of observations were incorrectly predicted.
What is Zero-One Loss?
Loss = 1 for misclassification; 0 otherwise
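As a rough sketch, the misclassification rate is just the average zero-one loss over all examples. The function names and the labels below are made up for illustration.

```python
def zero_one_loss(y_true, y_pred):
    """1 for a misclassification, 0 otherwise."""
    return 0 if y_true == y_pred else 1

def misclassification_rate(actual, predicted):
    """Average zero-one loss = misclassified examples / all examples."""
    losses = [zero_one_loss(a, p) for a, p in zip(actual, predicted)]
    return sum(losses) / len(losses)

# Hypothetical labels: 2 of the 5 predictions are wrong.
actual = [1, 0, 1, 1, 0]
predicted = [1, 1, 1, 0, 0]
rate = misclassification_rate(actual, predicted)
print(rate)  # 0.4
```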
What is a Loss Matrix?
Defines penalties for getting the answer wrong.
- ex: COVID example. If you misclassify a healthy person as having COVID and give them meds, the cost is the expensive meds given for nothing, since they don’t have COVID. A loss matrix lets each kind of mistake carry its own penalty.
What is Uncertainty?
The uncertainty of a classification model's prediction is represented as a conditional probability distribution.
$P(y=c|x, \theta) = f_c(x,\theta)$
- c is the class, so y=c means the output is class c
- x is the input feature vector
- $\theta$ is the parameter vector
NOTE: Uncertainty about the class = conditional probability
What is Maximum A Posteriori (MAP) Estimate?
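The model output for each class can be represented as a real number.
The softmax function maps these outputs to probabilities. The MAP estimate is then the class with the highest probability, $\hat y = \arg\max_c P(y=c|x, \theta)$.
- Ex: 3 classes A, B, and C with outputs a = [5, 10, 100]: $P(Y = A|x, \theta) = \frac{e^{5}}{e^{5}+e^{10}+e^{100}}$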
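The softmax example above can be sketched in a few lines. Subtracting the largest output before exponentiating is a common numerical-stability trick and does not change the result; the function name `softmax` here is our own, not from any particular library.

```python
import math

def softmax(outputs):
    """Map real-valued class outputs to probabilities that sum to 1."""
    shift = max(outputs)  # subtracting the max avoids overflow for large outputs
    exps = [math.exp(a - shift) for a in outputs]
    total = sum(exps)
    return [e / total for e in exps]

# Outputs for classes A, B, C from the example above.
classes = ["A", "B", "C"]
probs = softmax([5, 10, 100])

# Pick the class with the highest probability.
best = classes[probs.index(max(probs))]
print(probs, best)
```

Because 100 is so much larger than 5 and 10, almost all of the probability goes to class C.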
What is Regression?
A supervised learning model where outputs are continuous (uncountable/real-valued).
Commonly Used Loss Functions:
The most commonly used loss function for regression is squared loss.
- $L(x_n) = (y_n - \hat y_n)^2$ is squared error
- $\sum_{n=1}^N(y_n-\hat y_n)^2$ is the sum squared error
- $\frac{1}{N}\sum_{n=1}^N(y_n-\hat y_n)^2$ is the mean squared error
- $\frac{1}{N}\sum_{n=1}^N|y_n-\hat y_n|$ is the mean absolute error
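A quick sketch of these regression losses computed on a tiny example. The function names and the values of `y` and `y_hat` are made up for illustration.

```python
def sum_squared_error(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))

def mean_squared_error(y, y_hat):
    return sum_squared_error(y, y_hat) / len(y)

def mean_absolute_error(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

# Hypothetical true outputs and predictions; the errors are 1, 0, -2, -1.
y = [3.0, 5.0, 2.0, 7.0]
y_hat = [2.0, 5.0, 4.0, 8.0]

sse = sum_squared_error(y, y_hat)    # 1 + 0 + 4 + 1 = 6
mse = mean_squared_error(y, y_hat)   # 6 / 4 = 1.5
mae = mean_absolute_error(y, y_hat)  # (1 + 0 + 2 + 1) / 4 = 1.0
print(sse, mse, mae)
```

Note how the squared losses punish the error of 2 much more heavily than the absolute loss does.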
What is Generalization?
How well the model performs on data it hasn’t seen before.
- Underfitting - the model is not complex enough to represent the training data well.
- Overfitting - the model is so complex that it performs really well on training data but really poorly on test data.
- Note: As complexity increases, so does the number of model parameters.
REALLY IMPORTANT TO KNOW
Unsupervised Learning Stuff Below
What is Unsupervised Learning?
It is only trained on inputs.
What is the difference between clustering and classification?
Clustering | Classification |
---|---|
No Labels (Unsupervised) | Has Labels (classes). Is supervised |
Must determine # of clusters | Training Data determines # of classes. |
Dimensionality Reduction With Principal Components Analysis (PCA)
Transforms a large set of variables into a smaller one that still contains most of the information in the large set.
- Feature Selection Method = selects a subset of input features that explains a certain percentage of the variability. Features not selected are eliminated.
What are the types of Input/Outputs?
- Binary (Categorical w/ TWO possible values)
- Categorical (with MORE THAN TWO possible values)
- Real-valued (Continuous)
What is One-Hot Encoding?
Represents categorical variables as numerical values.
- Example: Panda = [1,0,0], Cat = [0,1,0], Lion = [0,0,1]
Before:
x1 | x2 | x3 |
---|---|---|
n=1 | .. | Panda |
n=2 | .. | Cat |
n=3 | .. | Lion |
n=4 | .. | Panda |
After:
x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|
n=1 | .. | 1 | 0 | 0 |
n=2 | .. | 0 | 1 | 0 |
n=3 | .. | 0 | 0 | 1 |
n=4 | .. | 1 | 0 | 0 |
- A categorical input feature or output is replaced with its one-hot encoding.
Basically, the table is expanded by introducing one new feature per category.
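The Panda/Cat/Lion example can be sketched as follows. The helper name `one_hot` is our own, and the category order is an assumption that fixes which position gets the 1.

```python
def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 everywhere else."""
    return [1 if value == c else 0 for c in categories]

categories = ["Panda", "Cat", "Lion"]
column = ["Panda", "Cat", "Lion", "Panda"]  # the categorical column from the table

encoded = [one_hot(v, categories) for v in column]
for original, row in zip(column, encoded):
    print(original, row)
```

Each row of `encoded` becomes three new columns in place of the single categorical column.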
What are Random Variables?
A variable X that represents an unknown and/or random quantity of interest.
What are Discrete Random Variables?
A random variable whose sample space is finite or countably infinite.
What are Continuous Random Variables?
A random variable that is a real-valued quantity.
- Since there are infinitely many possible values, the probability of any single exact value is zero, so you must use the probability of an interval, P(a <= x <= b)
What is a Marginal Distribution?
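The distribution of one variable on its own, obtained by summing the joint distribution over the other variable. It can be represented using a two-way table.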
P(x,y) | y=0 | y=1 |
---|---|---|
x=0 | 0.4 | 0.2 |
x=1 | 0.3 | 0.1 |
- P(x=0, y=0) = 0.4
- P(x=1, y=0) = 0.3
- P(x=0) = 0.4 + 0.2 = 0.6
- P(y=1) = 0.2 + 0.1 = 0.3
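A small sketch of computing marginals from the joint table above by summing over the other variable. The dictionary layout and the helper name `marginal` are our own choices.

```python
# Joint distribution P(x, y) from the two-way table, keyed by (x, y).
joint = {
    (0, 0): 0.4, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.1,
}

def marginal(joint, axis, value):
    """Sum the joint over the other variable; axis 0 is x, axis 1 is y."""
    return sum(p for key, p in joint.items() if key[axis] == value)

p_x0 = marginal(joint, 0, 0)  # row x=0: 0.4 + 0.2
p_y1 = marginal(joint, 1, 1)  # column y=1: 0.2 + 0.1
print(round(p_x0, 10), round(p_y1, 10))
```

Marginals of x come from row sums and marginals of y come from column sums.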
Bayes' Rule
It combines conditional probability, the product rule, and the sum rule.
Formula: $P(H=h|Y=y) = \frac{P(Y=y|H=h)P(H=h)}{\sum_{h'}P(Y=y|H=h')P(H=h')}$
The numerator comes from the product rule; the denominator comes from the sum rule.
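Here is a minimal sketch of the rule in code: multiply likelihood by prior for each hypothesis, then divide by the sum over all hypotheses. The function name `posterior` and the test-result numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
def posterior(prior, likelihood):
    """prior: dict h -> P(H=h); likelihood: dict h -> P(Y=y | H=h) for the observed y."""
    # Product rule: P(Y=y, H=h) = P(Y=y | H=h) * P(H=h)
    joint = {h: likelihood[h] * prior[h] for h in prior}
    # Sum rule: P(Y=y) = sum over h of P(Y=y, H=h)
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}

# Hypothetical numbers: a rare condition and an imperfect positive test.
prior = {"sick": 0.01, "healthy": 0.99}
likelihood = {"sick": 0.9, "healthy": 0.05}  # P(test positive | H)
post = posterior(prior, likelihood)
print(post)
```

Even with a positive test, the posterior probability of "sick" stays fairly low here because the prior is so small.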
What is the Bernoulli Distribution?
Represents the distribution of a binary outcome
- ex: a coin toss
  - y = 1 <- heads; $P(Y=1) = \theta$
  - y = 0 <- tails; $P(Y=0) = 1 - \theta$
What is the Binomial Distribution?
Represents the distribution of the number of successes S in N repetitions of a binary outcome: $P(S|N, \theta) = \binom{N}{S}\theta^{S}(1-\theta)^{N-S}$
The Bernoulli distribution is the special case of the Binomial when N = 1.
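A short sketch of the binomial probability, assuming the standard form $\binom{N}{S}\theta^{S}(1-\theta)^{N-S}$. The function name `binomial_pmf` is our own; `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_pmf(s, n, theta):
    """Probability of exactly s successes in n independent trials with success probability theta."""
    return math.comb(n, s) * theta ** s * (1 - theta) ** (n - s)

# Probability of exactly 2 heads in 4 fair coin tosses: 6 * 0.5**4 = 0.375
p_two_heads = binomial_pmf(2, 4, 0.5)

# With N = 1 the binomial reduces to the Bernoulli distribution.
p_heads = binomial_pmf(1, 1, 0.3)  # theta
p_tails = binomial_pmf(0, 1, 0.3)  # 1 - theta
print(p_two_heads, p_heads, p_tails)
```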
Categorical Distribution (Multinoulli)
Generalizes Bernoulli to more than 2 possible outcomes
- Ex: Rolling a C-sided die where C > 2
Multinomial Distribution
Generalizes categorical distribution to N > 1
What is Training/Model Fitting?
Task of estimating parameters from training data.
What is Maximum Likelihood Estimate (MLE)?
It is the most common approach to parameter estimation. It picks the parameters that assign the highest probability to the training data.
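As an illustration, here is a sketch of MLE for a coin-toss (Bernoulli) model. For this model the estimate has a known closed form, the fraction of heads, and a simple grid search over candidate values of theta lands on the same answer. The data and names below are made up.

```python
import math

def log_likelihood(theta, data):
    """Log-probability of the observed coin tosses under P(Y=1) = theta."""
    return sum(math.log(theta) if y == 1 else math.log(1 - theta) for y in data)

# Hypothetical training data: 5 heads (1) and 3 tails (0).
data = [1, 1, 0, 1, 0, 1, 1, 0]

# Closed-form MLE for a Bernoulli model: fraction of heads.
theta_mle = sum(data) / len(data)

# Brute-force check: try many candidate values and keep the most likely one.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, data))
print(theta_mle, theta_grid)
```

Working with the log-likelihood instead of the likelihood is a common choice because it turns products into sums and has the same maximizer.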
What is Empirical Risk Minimization (ERM)?
Generalizes MLE by replacing the log loss with any other loss function.
What is Regularization?
MLE and ERM find parameters that minimize the loss on training data, but there is no guarantee the model will generalize well to new data (overfitting). Regularization is a way to reduce this risk.
What are ways to combat overfitting?
- Increase data instances
- Use a simpler (less complex) model
- Apply Regularization (adding penalty term to model)
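One way to picture the penalty term is the sketch below, which adds a squared-weight (L2) penalty to the mean squared error of a linear model. The choice of an L2 penalty, the function name `regularized_loss`, and the strength parameter `lam` are all assumptions for illustration; the notes only say that a penalty term is added.

```python
def regularized_loss(weights, inputs, targets, lam):
    """Mean squared error of a linear model plus lam * (sum of squared weights)."""
    preds = [sum(w * x for w, x in zip(weights, row)) for row in inputs]
    mse = sum((y - p) ** 2 for y, p in zip(targets, preds)) / len(targets)
    penalty = lam * sum(w ** 2 for w in weights)
    return mse + penalty

# Hypothetical data that a single weight of 2.0 fits perfectly.
inputs = [[1.0], [2.0]]
targets = [2.0, 4.0]

no_penalty = regularized_loss([2.0], inputs, targets, lam=0.0)    # fit term only
with_penalty = regularized_loss([2.0], inputs, targets, lam=0.1)  # adds 0.1 * 2.0**2
print(no_penalty, with_penalty)
```

A larger `lam` pushes the optimizer toward smaller weights, trading some training accuracy for a simpler model.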
What is Bayesian Decision Theory?
Lets us choose the best action in a given situation
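A common way to make this concrete is to pick the action with the lowest expected loss, combining class probabilities with a loss matrix. The sketch below assumes that formulation; the function name `best_action`, the probabilities, and the costs are hypothetical.

```python
def best_action(posterior, loss):
    """posterior: dict state -> probability; loss: dict (state, action) -> cost."""
    actions = sorted({action for (_, action) in loss})

    def expected_loss(action):
        return sum(posterior[s] * loss[(s, action)] for s in posterior)

    return min(actions, key=expected_loss)

# Hypothetical loss matrix: missing a real case is far more costly than unneeded meds.
loss = {
    ("covid", "treat"): 0, ("covid", "no_treat"): 100,
    ("healthy", "treat"): 10, ("healthy", "no_treat"): 0,
}

posterior = {"covid": 0.2, "healthy": 0.8}
choice = best_action(posterior, loss)  # treat: 0.8*10 = 8, no_treat: 0.2*100 = 20
print(choice)
```

With these costs, treating is the better action even though COVID is the less likely state, because the penalty for a missed case dominates.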
What is Entropy?
Measure of uncertainty in a probability distribution
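For a discrete distribution, entropy is commonly computed as $H = -\sum_k p_k \log_2 p_k$. The sketch below assumes base-2 logarithms (entropy in bits); the function name `entropy` is our own.

```python
import math

def entropy(probs):
    """Entropy in bits; terms with probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])    # most uncertain binary outcome: 1 bit
biased_coin = entropy([0.9, 0.1])  # less uncertain, so lower entropy
certain = entropy([1.0, 0.0])      # no uncertainty at all: 0 bits
print(fair_coin, biased_coin, certain)
```

The more evenly the probability is spread across outcomes, the higher the entropy.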