Machine Learning Notes

Goal for Machine Learning

The goal for machine learning is to help us make decisions with or without human supervision. To achieve this, machine learning uses a group of algorithms/methodologies to discover and formulate repeatable patterns in data.

Unsupervised Learning

An algorithm that works on unlabeled data. It assumes the data were not generated randomly, so it looks for patterns/structure within the data in order to make sense of it.

Layman’s Terms: The algorithm reads data that is not random and detects patterns within it to better understand the data.

NOTE:

Feature - the characteristics of the data. Ex: Data about customer information. Features of the customer would include their age, purchase history, browsing behavior, etc.
Label - the outcome the model is asked to predict from those features. Ex: the customer’s purchasing habits.

Types of Distances:

Distance - how similar/different two data points are in a multi-dimensional space.

Euclidean - The straight-line distance between two points in a space of any number of dimensions; it is the most common and simplest measure. Use the Pythagorean Theorem to find the distance between the two points.
Manhattan - A grid-like distance: the sum of the absolute differences along each axis, as if travelling between the two points along a zigzag path of city blocks.
Cosine - Uses the cosine of the angle formed at the origin by the two data points; the focus is on the angle between the points, not on how far each point is from the origin.
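
The three distances above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names are made up for these notes, and cosine is written as a distance (1 minus the cosine of the angle), which is one common convention rather than the only one.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Grid distance: sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b, measured at the origin.

    Assumes neither point is the origin itself (the angle is undefined there).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))  # 1.0 (perpendicular directions)
```

Note that cosine distance ignores length: (1, 0) and (2, 0) point the same way, so their cosine distance is 0 even though their Euclidean distance is 1.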

K-Means Clustering

You create ‘k’ clusters (groups of data) and use the means/averages to measure how close the data points are to each other. The mean is the center point of each cluster, computed by averaging each feature over the points in that cluster.

Layman’s Terms: You have a bag of different colored marbles and you want to group the marbles based on their colors.

Steps for K-Means Clustering:

1. Choose the number of clusters, k.
2. Among the data points, randomly choose k points as the initial cluster centers.
3. Using a distance measure (cosine, Manhattan, or Euclidean), compute the distance from each point in the problem space to each of the k cluster centers. For example, if k = 3 and there are 10,000 points, you will need to compute 30,000 distances per iteration.
4. Assign each data point to the nearest cluster center.
5. Recalculate the cluster centers by computing the mean of the data points in each of the k clusters.
6. If the clusters have changed, recompute the cluster assignment for each data point (go back to step 3); if there is no change, the algorithm is done.
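
The steps above can be sketched as a short Python function. This is a rough sketch, not a production implementation: it assumes Euclidean distance (via math.dist, Python 3.8+), and the max_iters and seed parameters are additions for safety and repeatability that are not part of the steps listed.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Cluster points (a list of tuples) into k groups; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial cluster centers.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Steps 3-4: measure the distance from every point to every center
        # and assign each point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 6: stop once no point has changed cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 5: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignments

# Hypothetical data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(data, k=2)
print(labels)   # the first three points share one cluster id, the last three the other
print(centers)
```

Because the starting centers are random, different seeds can number the clusters differently, and on harder data they can lead to different final clusters.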

Supervised Learning

An algorithm that takes features as its inputs and labels as its outputs. Ex: User profiles would be the features and customer purchasing habits/product quality ratings would be the labels.

The goal for supervised learning is to train a model that captures the relationships, often complex, between the features and the labels as a mathematical formula. As a result, the model can be used to make predictions.

Layman’s Terms: You train the model by showing it various examples (instances). It then uses what it learned from those examples to make predictions/decisions.

Categories within Supervised Learning:

Regression
Neural Networks
Classification

Ways to train the model:

Using Naive Bayes Classifier
Decision Trees

Terminology	Explanation
Label	Variable the model is tasked with predicting. There is typically only one label in a supervised learning model.
Feature	Set of input variables used to predict the label.
Model	Mathematical formulation of the patterns that capture the relationship between the features and the label.
Training	Creating a model using training data.
Prediction	Utilizing the trained model to estimate the label.
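
To make the terms in the table concrete, here is a small Python sketch. The model is a nearest-mean classifier chosen only because it is short; it is a stand-in, not one of the methods named in these notes, and the customer data is made up.

```python
import math
from collections import defaultdict

def train(features, labels):
    """Training: build a model (one mean feature vector per label) from training data."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    return {
        y: tuple(sum(dim) / len(rows) for dim in zip(*rows))
        for y, rows in grouped.items()
    }

def predict(model, x):
    """Prediction: estimate the label of x as the label whose mean is nearest."""
    return min(model, key=lambda y: math.dist(x, model[y]))

# Hypothetical training data. Features: (age, purchases per month). Label: shopper type.
X = [(22, 1), (25, 2), (47, 8), (52, 9)]
y = ["occasional", "occasional", "frequent", "frequent"]

model = train(X, y)
print(model["occasional"])       # (23.5, 1.5)
print(predict(model, (50, 7)))   # frequent
```

Here X holds the features, y holds the label, train() produces the model, and predict() uses the trained model to estimate the label for a new customer.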

Two Types of Supervised Models:

Classifiers - Label is categorical: qualitative variables are classified into distinct categories.

Example of Classifier: Determining whether abnormal tissue growth is malignant or not.

Regressors - Label is numeric: the label is a continuous quantity, so an infinite number of values exist between any two values.

Example of Regressor: Determining the price of a particular home given its characteristics.
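
As a rough illustration of a regressor, the sketch below fits a straight line to one feature with ordinary least squares. Both the choice of method and the data are assumptions for illustration: the areas and prices are invented, and the prices were picked to be exactly 3 times the area so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data. Feature: floor area (square metres). Label: price (thousands).
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]

slope, intercept = fit_line(areas, prices)
predicted = slope * 90.0 + intercept   # numeric prediction for a 90 square metre home
print(round(predicted, 2))  # 270.0
```

Unlike a classifier, the output is a number on a continuous scale rather than one of a fixed set of categories.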

MAIN DIFFERENCE BETWEEN SUPERVISED AND UNSUPERVISED LEARNING: SUPERVISED is given LABELED DATA (both the inputs and the desired outputs are provided) and learns to predict the outputs; UNSUPERVISED is given UNLABELED DATA, and it must FIND patterns on its own to better understand the data.