Logistic Regression

Logistic regression is a classification algorithm, despite the “regression” in its name.

It is a discriminative classification algorithm of the form $P(y|x, \theta)$ where $x \in \mathbb{R}^D$ (i.e., the input is continuous and D-dimensional).

NOTE: Logistic regression can still use categorical inputs (e.g., via one-hot encoding). Naive Bayes, by contrast, is a generative algorithm.

If the number of output classes C is:

  • C = 2, it is called binary logistic regression.
  • C > 2, it is called multi-class logistic regression.

Binary Logistic Regression

$P(y|x, \theta) = \mathrm{Ber}(y \mid \mu(x))$

where $\mu(x)$ represents the probability that $x$ belongs to class 1.

Since it is a discriminative classifier, it computes the posterior probability directly from the inputs and the parameters of the model.

$P(y=1|x,\theta) = \mu(x) = \mathrm{sigm}(f(x)) = \mathrm{sigm}(w^Tx+w_0)$, where $\mathrm{sigm}(a) = \frac{1}{1+e^{-a}}$ is the sigmoid (logistic) function.

NOTE: $\theta$ represents the parameters of the model.

where: $w_0$ = bias, $w^Tx = w_1x_1 + w_2x_2 + \dots + w_Dx_D$, and $D$ = number of features.
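
A minimal sketch of this prediction in Python/NumPy; the weights and input below are made-up illustrative values, not from any trained model:

```python
import numpy as np

def sigm(a):
    """Sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(x, w, w0):
    """P(y=1 | x) = sigm(w^T x + w0) for a single example x."""
    return sigm(w @ x + w0)

# Made-up weights for D = 2 features.
w = np.array([1.5, -0.8])
w0 = 0.2
x = np.array([0.4, 1.1])

print(predict_proba(x, w, w0))   # probability of class 1, here ~0.48
```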

Linear Regression

In a linear regression model, a straight line passes through a cluster of data points as closely as possible. The model is that line; to evaluate it, you measure the difference between the model’s predicted output and the actual output of the $i^{th}$ data instance.

NOTE: The difference between the actual and predicted outputs is called the RESIDUAL ERROR.

Residual Error Formula: $\epsilon_i = y_i - f(x_i)$

Linear regression assumes the output comes from a Gaussian distribution: $P(y | x, \theta) = \mathcal{N}(y \mid w_0 + w^Tx, \sigma^2)$

where: $\mu(x) = w_0 + w^Tx $ is the mean and $\sigma^2$ is the variance of the distribution.

where: $w_0$ = bias term, and $w$ = weight vector (not including the bias term).

Since infinitely many lines can fit the data, there can exist infinitely many sets of parameters.

In Linear Regression, parameters are optimized during training to find the line of best fit.

Linear regression uses squared error for its loss: it calculates the average of the squared differences between predicted and actual values.

NOTE: Loss is the difference between the predicted and actual outputs of the model.

$Cost(w) = \frac{1}{2N}\sum^N_{n=1}(y_n - \mu(x_n))^2$ Remember: $\mu(x_n) = w^Tx_n + w_0$, or $w^Tx_n$ if $w$ includes $w_0$.

GOAL FOR LINEAR REGRESSION:

  • Find the w (parameter/weight) vector that will minimize the cost.

For linear regression with gradient descent, gradient descent computes the gradient of the cost function and subtracts a fraction of it from $w$: $w \leftarrow w - \eta\,\nabla Cost(w)$, where $\eta$ is the learning rate.
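
A minimal gradient-descent sketch for this, assuming the squared-error cost above; the toy dataset and learning rate are made up for illustration:

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.1, epochs=500):
    """Gradient descent on Cost(w) = 1/(2N) * sum (y_n - w^T x_n)^2.
    A column of ones is prepended so w includes the bias w0."""
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])   # bias column of ones
    w = np.zeros(Xb.shape[1])              # initialize parameters
    for _ in range(epochs):
        residual = Xb @ w - y              # predicted minus actual, all n
        grad = Xb.T @ residual / N         # gradient of the cost
        w -= lr * grad                     # subtract a fraction of it
    return w

# Tiny made-up dataset: y is roughly 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])
print(fit_linear_regression(X, y))         # ~[1.0, 2.0] = (bias, slope)
```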

Neural Networks for Tabular Data

What is a Perceptron?

  • It is a deterministic version of logistic regression. Logistic regression’s output is between 0 and 1, making it probabilistic.
  • A perceptron’s output is 1 iff $\mathrm{sigm}(w^Tx + w_0) \ge 0.5$ and 0 otherwise; since $\mathrm{sigm}(a) \ge 0.5$ exactly when $a \ge 0$, this is the same as testing the sign of $w^Tx + w_0$ (see the sketch below).
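
A one-function sketch of this decision rule (the example numbers are made up):

```python
import numpy as np

def perceptron_predict(x, w, w0):
    """Deterministic output: 1 iff sigm(w^T x + w0) >= 0.5, else 0.
    sigm(a) >= 0.5 exactly when a >= 0, so testing the sign suffices."""
    return 1 if (w @ x + w0) >= 0 else 0

print(perceptron_predict(np.array([0.4, 1.1]), np.array([1.5, -0.8]), 0.2))  # 0
```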

Difference between deep learning and multi-layer perceptron:

  • Deep Learning - Networks w/ multiple layers
  • Multi-layer Perceptron - Simplest deep-learning neural network (also called Feedforward Neural Network)

Functions used as activation functions:

  1. Sigmoid (non-linear)
  2. Hyperbolic tangent, aka tanh (non-linear)
  3. Rectified Linear Unit, aka ReLU (non-linear)
  4. Identity (linear)

NOTE: We only use non-linear functions at hidden units.
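
Sketches of the four activations in NumPy, written out from their standard definitions:

```python
import numpy as np

def sigmoid(a):                # non-linear: squashes into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):                   # non-linear: squashes into (-1, 1)
    return np.tanh(a)

def relu(a):                   # non-linear: max(0, a) elementwise
    return np.maximum(0.0, a)

def identity(a):               # linear: passes the input through unchanged
    return a

a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a), tanh(a), relu(a), identity(a))
```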

Multi-layer Perceptron Architecture:

  • An MLP will have 1 input layer and 1 output layer; there can exist 1 or more hidden layers.

Types of Models:

  1. Shallow Model - 1 Hidden Layer
  2. Deep Model - 2 or more hidden layers

Things to Note:

  • In an MLP, a bias node exists in every layer except the output layer; the last hidden layer’s bias connects to the output.
  • A feedforward network/MLP does not allow connections within the same layer, nor backward connections.
  • Neural networks with same-layer and/or backward connections do exist, but they are not MLPs or feedforward neural networks.
  • When indicating how many layers an MLP has, we don’t count the input layer.

SoftMax Function

  • Its input is a vector of continuous values; its output is the corresponding probabilities for those values.
  • A softmax function will take a vector of real numbers as input and normalize them into a probability distribution (sum of values will add up to 1).

NOTE: a neural network used for multi-class classification uses the softmax function as the activation at the output layer.

Formula: $s_j(a) = \frac{e^{a_j}}{\sum^k_{i=1} e^{a_i}}$, so that $\sum^k_{j=1} s_j(a) = 1$
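
A minimal NumPy sketch; subtracting the max before exponentiating is a standard numerical-stability trick, not part of the formula itself:

```python
import numpy as np

def softmax(a):
    """Normalize a vector of real numbers into a probability distribution."""
    e = np.exp(a - np.max(a))   # stability: avoids overflow for large a_j
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)            # ~[0.659 0.242 0.099]
print(p.sum())      # 1.0
```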

Backpropagation Algorithm w/ Gradient Descent

Steps to the algorithm (see the sketch after this list):

  1. Initialize the parameters (w’s)
  2. Perform forward pass to compute model’s output
  3. Compute error signals at the output layer nodes
  4. Pass error signals backwards to compute error signals at the hidden layers
  5. Compute gradients (w/ chain derivatives)
  6. Update the parameters (weights) by subtracting a portion of the gradient from the corresponding weights.
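
A toy end-to-end sketch of these six steps for a one-hidden-layer MLP, trained online on made-up XOR data. The pairing of cross-entropy loss with a sigmoid output (which makes the output error signal simply $\hat{y} - y$) is an assumption of this sketch:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up toy data: XOR, which needs a hidden layer to solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])

# Step 1: initialize the parameters (one hidden layer with H units).
rng = np.random.default_rng(0)
D, H, lr = 2, 3, 0.5
W1 = rng.normal(scale=0.5, size=(H, D)); b1 = np.zeros(H)
w2 = rng.normal(scale=0.5, size=H);      b2 = 0.0

for epoch in range(5000):                # online: one example per update
    for x, y in zip(X, Y):
        # Step 2: forward pass to compute the model's output.
        z = sigm(W1 @ x + b1)            # hidden-layer activations
        yhat = sigm(w2 @ z + b2)         # output-layer activation

        # Step 3: error signal at the output node.
        delta_out = yhat - y

        # Step 4: pass the error backwards to the hidden layer.
        delta_hid = (w2 * delta_out) * z * (1 - z)

        # Steps 5-6: gradients via the chain rule, then subtract
        # a portion (lr) of each gradient from the weights.
        W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid
        w2 -= lr * delta_out * z;          b2 -= lr * delta_out

# Predictions should approach [0, 1, 1, 0] for most initializations.
print([round(float(sigm(w2 @ sigm(W1 @ x + b1) + b2)), 2) for x in X])
```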

Training Neural Networks

There are 3 types of training, based on what portion of the training data is used at each iteration:

  1. Batch Learning
  • The entire training dataset is used at each iteration to train the model.
  2. Online Learning
  • One data instance (example) is used at each iteration to train the model.
  3. Mini-Batch Learning
  • A subset of the training data is used at each iteration to train the model.

NOTE: an epoch is a single complete pass over the entire training dataset.

Example: N = 1,000 examples

  • In Batch Learning: each iteration uses 1,000 examples to train model; 1 epoch = 1 iteration
  • In Online Learning: each iteration uses 1 example to train model; 1 epoch = 1,000 iterations
  • In Mini-Batch Learning: if batch size = 50, 1 epoch = 20 iterations since $\frac{1000}{50} = 20$

NOTE: For online and mini-batch learning, the dataset should be shuffled before each epoch.
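
A sketch of the epoch/iteration bookkeeping; the data, batch size, and update step are placeholders:

```python
import numpy as np

N, batch_size = 1000, 50
X = np.random.rand(N, 4)                 # made-up dataset
y = np.random.randint(0, 2, size=N)

for epoch in range(3):
    idx = np.random.permutation(N)       # shuffle before each epoch
    iterations = 0
    for start in range(0, N, batch_size):
        batch = idx[start:start + batch_size]
        X_batch, y_batch = X[batch], y[batch]
        # ... perform one gradient update on (X_batch, y_batch) here ...
        iterations += 1
    print(iterations)                    # 1000 / 50 = 20 iterations per epoch

# batch_size = N -> batch learning  (1 iteration per epoch)
# batch_size = 1 -> online learning (N iterations per epoch)
```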

Neural Networks for Images

Problems with feeding images to an MLP:

  1. Images have pixels, and pixels have channels (e.g., each pixel has 3 channels for RGB color), which makes the number of input features grow quickly.
  2. An MLP ignores translation invariance: the presence of a shape, object, etc. in an image, regardless of its position in the image.
  3. Images come in different sizes, which would require learning $w$’s with varying dimensions.


Convolutional Neural Networks (CNN)

  • Are a specialized type of feedforward neural network.
  • Inspired by animal visual cortex (basically scanning your surroundings).
  • Is useful for image recognition and processing.

It uses weight sharing, which decreases the number of connections (i.e., the model’s complexity).

NOTE: if the number of connections decreases, the complexity decreases as well.
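
A sketch of weight sharing via a plain 2D convolution: one small kernel (9 weights here) is reused at every position of the image, instead of one weight per input-output pair as in a fully connected layer. The image and kernel values are made up:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one kernel over the whole image: the SAME weights are
    reused at every position (weight sharing)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.random.rand(8, 8)             # made-up 8x8 grayscale image
kernel = np.random.rand(3, 3)            # 3x3 filter: only 9 shared weights
print(conv2d(image, kernel).shape)       # (6, 6): 36 outputs from 9 weights
```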

K-Nearest Neighbor (KNN)

KNN is a classification algorithm. It simply looks for the k instances in the training data that are nearest (closest) to the test input $x$, counts how many of them fall in each class, and returns the resulting probability distribution over the output.

NOTE: Euclidean Distance is the most commonly used distance in KNN.

Example: You choose a spot on a graph (the graph represents fruits; y = color, x = size). You then look at the three closest neighbors of that spot. If more neighbors belong to one class than the other, the spot is classified as that class (see page 189 of the Grokking Algorithms textbook).

Practical Applications of KNN: a recommendation system for music, food, etc.
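
A minimal KNN sketch using Euclidean distance; the fruit data loosely echoes the Grokking Algorithms example, but the numbers are made up:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Find the k training instances nearest to x (Euclidean distance),
    count the class labels, and return a probability distribution."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every instance
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest
    return {label: c / k for label, c in Counter(nearest).items()}

# Made-up fruit data: first feature = size, second = color (numeric).
X_train = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 5.0], [4.2, 4.8]])
y_train = np.array(["grapefruit", "grapefruit", "orange", "orange"])
print(knn_predict(X_train, y_train, np.array([1.1, 2.1])))
# ~{'grapefruit': 0.67, 'orange': 0.33} with k = 3
```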

Parametric vs Non-parametric Models:

  • Parametric: a ML model with a fixed # of parameters regardless of the size of the training dataset (e.g., logistic regression).
  • Non-parametric: a model whose # of parameters grows with the size of the dataset (e.g., KNN).