Teko's ML training program

Machine learning crash course

1. Introduction to Machine Learning

Presented by Thức NC

### Content 1/2 - Traditional rule based program vs ML - Why ML? - Key ML terminology - Types of Machine Learning - Machine learning vs ... - Machine learning pipeline
### Content 2/2 - Linear Regression - Training and Loss - Reducing loss - Gradient Descent - Learning rate Note: Part 2

Traditional program vs ML

Can you write traditional rule based program to tell the difference between an orange and an apple?

How many orange & green pixels?

How about these images?

### ML approach - Prepare a lot of images with labels ("apple" or "orange") - ML algorithms will figure out the "rules" for us automatically

For that we need a classifier

  • You can think of a classifier as a function that takes data as input and assigns a label to it as output.
  • The ML technique that writes the classifier automatically is called supervised learning.
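As a sketch, a classifier is just a function from input data to a label. A minimal hand-written one (the feature name and threshold below are made up for illustration) shows exactly what supervised learning would instead derive automatically from labeled examples:

```python
def classify(image_features):
    """A classifier maps input data (features) to a label.

    This hand-written rule is the kind of thing supervised learning
    would learn automatically from labeled example images.
    """
    # Hypothetical feature: fraction of orange-colored pixels in the image
    if image_features["orange_pixel_ratio"] > 0.5:
        return "orange"
    return "apple"

print(classify({"orange_pixel_ratio": 0.8}))  # orange
```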

Traditional program vs ML

Standard programming paradigm

The developer identifies the algorithm and implements the code; the users supply the data

Machine learning paradigm

During development, we generate an algorithm from a data set and then incorporate that into our final application.
### Why ML? * Practical reasons to use ML: - Reduce the time you spend on programming - Allow you to customize and scale products - ML lets you solve problems that you, as a programmer, have no idea how to solve by hand: face recognition Notes: Program that corrects spelling errors - Reduce the time you spend on programming + Instead of: a lot of rules, e.g. "I before E except after C" + ML: feed it some examples - Allow you to customize and scale products + Much easier to move from a specific language to others
### Why ML? * Philosophical reasons: ML changes the way we think about a problem: - Software engineers are trained to think logically and mathematically - With ML, the focus shifts from a mathematical science to **natural science**: observing an uncertain world, running experiments, and using **statistics**, not logic, to analyze the results of those experiments Notes: - ⇒ thinking like a scientist will expand your horizons and open up new areas that you couldn't explore without it
### ML Terminology - **Supervised learning**: ML systems learn how to combine input to produce useful predictions on never-before-seen data - **Label** - the thing we're predicting: **$y$** - **Features** - the input variables describing our data: $\{x_1, x_2, \dots, x_N\}$
### Spam detector example - The **labels**: *spam* or *ham* - The **features** could include the following: + words in the email text + sender's address + time of day the email was sent + email contains the phrase "one weird trick."
- An **example** is a particular instance of data, **$X$** (a vector). We break examples into two categories: - **labeled** examples: $\{features, label\}: (x, y)$ - **unlabeled** examples: $\{features, ?\}: (x, ?)$
- A **model** defines the relationship between features and label - Two phases of a model's life: - **Training**: creating or learning the model - **Inference**: applying the trained model to unlabeled examples Note: - Two phases of a model's life: - **Training**: creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label. - **Inference**: applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions ($y'$). For example, during inference, you can predict medianHouseValue for new unlabeled examples.
- **Metric**: A number that you care about - **Objective**: A metric that your algorithm is trying to optimize. - **Pipeline**: The infrastructure surrounding a machine learning algorithm. Includes: *gathering* the data, *training* model(s), and *exporting* the models to production. Note: - Metric may or may not be directly optimized. Pipeline: The infrastructure surrounding a machine learning algorithm. Includes gathering the data from the front end, putting it into training data files, training one or more models, and exporting the models to production.

Types of machine learning

Supervised learning

  • Learn a model from labeled training data
  • Make predictions about unseen or future data
### Supervised learning - A **regression** model predicts *continuous values*: - What is the value of a house in California? - What is the probability that a user will click on this ad? - A **classification** model predicts *discrete values* (*class labels*): - Is a given email message spam or not spam? - Is this an image of a dog, a cat, or a hamster?

Classification

Regression

Reinforcement Learning

  • Solving interactive problems
  • The agent tries to maximize the reward by a series of interactions with the environment
#### Reinforcement Learning - A popular example: a chess engine - The agent decides upon a series of moves depending on the **state** of the board - The outcome of each move: a **different state** of the environment - Each state can be associated with a positive or negative reward - The **reward** can be defined as *win* or *lose* at the end of the game Note: There are many different subtypes of reinforcement learning. However, a general scheme is that the agent in reinforcement learning tries to maximize the reward by a series of interactions with the environment. Each state can be associated with a positive or negative reward, and a reward can be defined as accomplishing an overall goal, such as winning or losing a game of chess. For instance, in chess the outcome of each move can be thought of as a different state of the environment. To explore the chess example further, let’s think of visiting certain locations on the chess board as being associated with a positive event—for instance, removing an opponent’s chess piece from the board or threatening the queen. Other positions, however, are associated with a negative event, such as losing a chess piece to the opponent in the following turn. Now, not every turn results in the removal of a chess piece, and reinforcement learning is concerned with learning the series of steps by maximizing a reward based on immediate and delayed feedback.
### Unsupervised Learning - Dealing with **unlabeled data** or data of unknown structure - **Discovering hidden structures** of data without the guidance of a known outcome variable or reward function

Clustering

  • Organize data into meaningful subgroups (clusters) without having any prior knowledge of their group memberships.
#### Clustering - Each cluster defines a group of objects that share a certain degree of **similarity** but are more dissimilar to objects in other clusters - Example: discover customer groups based on their interests, in order to develop distinct marketing programs.

Dimensionality reduction

  • Remove noise from data
  • Compress the data onto a smaller dimensional subspace while retaining most of the relevant information
  • Also useful for visualizing data

Dimensionality reduction

Compress a 3D Swiss Roll onto a new 2D feature subspace

Association Rules

Machine learning vs ...

Machine learning vs ...

Machine learning pipeline

Feature engineering

Feature engineering is the process of transforming raw data into inputs for a machine learning algorithm.
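A minimal sketch of that transformation, assuming hypothetical raw fields for a transaction record (the field names and derived features below are illustrative, not part of the course material):

```python
def make_features(transaction):
    """Turn a raw record into numeric inputs for an ML algorithm.

    The field names ("amount", "country", "card_country", "hour")
    are hypothetical, chosen only to illustrate the idea.
    """
    return [
        transaction["amount"],  # numeric field, used as-is
        # derived binary feature: transaction country differs from card country
        1.0 if transaction["country"] != transaction["card_country"] else 0.0,
        transaction["hour"] / 23.0,  # scale hour-of-day to [0, 1]
    ]

print(make_features(
    {"amount": 50.0, "country": "US", "card_country": "VN", "hour": 23}
))  # [50.0, 1.0, 1.0]
```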
Feature engineering example - Credit Card Fraud detection
Model building workflow

Quiz - Question 1

Q1. Suppose you want to develop a supervised machine learning model to predict whether a given email is "spam" or "not spam". Which of the following statements are true?

  • A. We'll use unlabeled examples to train the model.
  • B. Words in the subject header will make good labels.
  • C. Emails not marked as "spam" or "not spam" are unlabeled examples.
  • D. The labels applied to some examples might be untrustworthy.

Quiz - Question 2

Q2. Suppose an online shoe store wants to create a supervised ML model that will provide personalized shoe recommendations to users. That is, the model will recommend certain pairs of shoes to Marty and different pairs of shoes to Janet. Which of the following statements are true?

  • A. Shoe size is a useful feature.
  • B. User clicks on a shoe's description is a useful label.
  • C. Shoe beauty is a useful feature.
  • D. The shoes that a user adores is a useful label.

Quiz - Question 3

Q3. Suppose you are working on weather prediction, and you would like to predict whether or not it will be raining at 5pm tomorrow. You want to use a learning algorithm for this. Would you treat this as a classification or a regression problem?

Quiz - Question 4

Q4. Suppose you are working on stock market prediction, and you would like to predict the price of a particular stock tomorrow (measured in dollars). You want to use a learning algorithm for this. Would you treat this as a classification or a regression problem?

Quiz - Question 5

Q5. Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to?

  1. Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number of groups of essays that are somehow "similar" or "related".
  2. Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different groups of such patients for which we might tailor separate treatments.
  3. Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years.
  4. Given 50 articles written by male authors, and 50 articles written by female authors, learn to predict the gender of a new manuscript's author (when the identity of this author is unknown).
  5. In farming, given data on crop yields over the last 50 years, learn to predict next year's crop yields.
  6. Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail.
  7. Examine a web page, and classify whether the content on the web page should be considered "child friendly" (e.g., non-pornographic, etc.) or "adult."
  8. Examine the statistics of two football teams, and predict which team will win tomorrow's match (given historical data of teams' wins/losses to learn from).

Linear regression

Cricket dataset: Chirps per Minute vs. Temperature. What is the relationship?

Linear regression

The relationship seems to be linear

#### Linear regression Using the equation for a line, you could write down this relationship as follows: $$y=mx+b$$ - where: - $y$ is the temperature in Celsius—the value we're trying to predict. - $m$ is the slope of the line. - $x$ is the number of chirps per minute—the value of our input feature. - $b$ is the y-intercept. Note: True, the line doesn't pass through every dot, but the line does clearly show the relationship between chirps and temperature. Using the equation for a line, you could write down this relationship as follows:
#### Linear regression By convention in machine learning, you'll write the equation for a model slightly differently: $$y' = b + w_1x_1$$ - where: - $y'$ is the predicted label (a desired output). - $w_1$ is the weight of feature 1. - $x_1$ is a feature (a known input). - $b$ is the bias (the y-intercept or offset from an origin), sometimes referred to as $w_0$.
#### Linear regression $$y' = b + w_1x_1$$ To **infer** (predict) the temperature for a new chirps-per-minute value $x_1$, just substitute the $x_1$ value into this model.
#### Linear regression More sophisticated linear model: model might rely on multiple features, each having a separate weight $$y' = b + w_1x_1 + w_2x_2 + w_3x_3 + ... + w_nx_n$$
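This multi-feature model translates directly to a few lines of Python (the weight values below are hypothetical, for illustration):

```python
def predict(bias, weights, features):
    """Linear model: y' = b + w_1*x_1 + w_2*x_2 + ... + w_n*x_n."""
    return bias + sum(w * x for w, x in zip(weights, features))

# One feature (the cricket example), with hypothetical b = 2.0 and w1 = 0.5:
print(predict(2.0, [0.5], [30.0]))  # 2.0 + 0.5 * 30 = 17.0
```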
### Training and loss - **Training** a model simply means learning (determining) good values for all the weights and the bias from labeled examples. - In supervised learning: a ML algorithm attempts to find a model that **minimizes loss** $\Rightarrow$ **empirical risk minimization** process

Loss

  • Loss is a number indicating how bad the model's prediction was on a single example.
  • The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
High loss in the left model; low loss in the right model.
#### A popular loss function - **Squared loss** (also known as **$L_2$** loss), for a single example: $$=(observation - prediction(x))^2=(y - y')^2$$ - **Mean square error (MSE)** is the average squared loss per example over the whole dataset: $$MSE=\frac{1}{N}\sum_{(x,y) \in D}{(y-prediction(x))^2}$$
$$MSE=\frac{1}{N}\sum_{(x,y) \in D}{(y-prediction(x))^2}$$ - $(x,y)$ is an example in which - $x$ is the set of features - $y$ is the example's label - $prediction(x)$ is a function of the weights and bias in combination with the set of features $x$. - $D$ is a data set containing many labeled examples, which are $(x,y)$ pairs. - $N$ is the number of examples in $D$.
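The MSE formula maps directly to code. A minimal Python version, with a toy model used only for illustration:

```python
def mse(dataset, prediction):
    """Mean squared error: average squared loss over all (x, y) pairs in D."""
    return sum((y - prediction(x)) ** 2 for x, y in dataset) / len(dataset)

# Toy check: a model that predicts y' = x, on two labeled examples
print(mse([(1.0, 1.0), (2.0, 3.0)], lambda x: x))  # ((1-1)^2 + (3-2)^2) / 2 = 0.5
```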

Quiz

Which of the two data sets shown in the preceding plots has the higher Mean Squared Error (MSE)?

Reducing Loss

  • To train a model, we need a good way to reduce the model’s loss
  • An iterative approach is one widely used method for reducing loss, and is as easy and efficient as walking down a hill.
#### An iterative approach - Consider a simple model: $y' = b + w_1x_1$ - Pick initial values: $(b, w_1) = (0, 0)$ - Suppose the initial feature value is 10; the prediction: $y' = 0 + 0\times10 = 0$ - Use a loss function to compute the loss - The "**Compute parameter updates**" part generates new values for $(b,w_1)$ $\Rightarrow$ new loss value $\Rightarrow$ ... until the overall loss **stops changing** or at least changes extremely slowly $\Rightarrow$ the model has **converged**. Note: It is here that the machine learning system examines the value of the loss function and generates new values for $b$ and $w_1$. For now, just assume that this mysterious box devises new values, and then the machine learning system re-evaluates all those features against all those labels, yielding a new value for the loss function, which yields new parameter values.
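The loop above can be sketched in plain Python for the one-feature model. The learning rate, step limit, and convergence tolerance here are illustrative choices, not values from the slides:

```python
def train(examples, learning_rate=0.1, max_steps=10000, tolerance=1e-9):
    """Iterative training for y' = b + w1*x1: repeat loss -> parameter update
    until the overall loss stops changing (the model has converged)."""
    b, w1 = 0.0, 0.0  # initial values (b, w1) = (0, 0)
    n = len(examples)
    prev_loss = float("inf")
    for _ in range(max_steps):
        loss = sum((b + w1 * x - y) ** 2 for x, y in examples) / n
        if prev_loss - loss < tolerance:  # loss stopped changing: converged
            break
        prev_loss = loss
        # "Compute parameter updates": gradient of MSE w.r.t. b and w1
        grad_b = sum(2 * (b + w1 * x - y) for x, y in examples) / n
        grad_w1 = sum(2 * (b + w1 * x - y) * x for x, y in examples) / n
        b -= learning_rate * grad_b
        w1 -= learning_rate * grad_w1
    return b, w1

# Data generated from y = 1 + 2x; training should recover b ≈ 1, w1 ≈ 2
print(train([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]))
```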

Gradient descent

  • An algorithm for the "Compute parameter updates" part
  • Goal: find the $w_1$ that minimizes the loss
  • An inefficient way: calculate the loss for every possible value of $w_1$ over the entire data set
  • A better mechanism: gradient descent

Gradient descent

  • First step: pick a starting value for $w_1$ (typically 0 or a random value)
  • The gradient descent algorithm then uses the gradient of the loss curve at the current point to calculate the next point.

Gradient descent

  • The gradient of loss is: the derivative (slope) of the curve, and tells you which way is "warmer" or "colder."
  • With multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
$$\nabla f= \left( \frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2},...,\frac{\partial f}{\partial x_n} \right)$$
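One way to make the gradient concrete is to approximate each partial derivative with a small finite difference. This is only a sketch for intuition; ML frameworks compute gradients analytically or via automatic differentiation:

```python
def gradient(f, point, h=1e-6):
    """Numerically approximate the gradient: the vector of partial
    derivatives of f, one per coordinate of the input point."""
    grads = []
    for i in range(len(point)):
        shifted = list(point)
        shifted[i] += h  # nudge one coordinate, hold the others fixed
        grads.append((f(shifted) - f(point)) / h)
    return grads

# f(x1, x2) = x1^2 + x2^2 has gradient (2*x1, 2*x2)
print(gradient(lambda p: p[0] ** 2 + p[1] ** 2, [1.0, 1.0]))  # ≈ [2.0, 2.0]
```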

Gradient descent

  • The gradient always points in the direction of steepest increase in the loss function
  • In gradient descent, we are trying to minimize the loss by following the negative of the gradient: $-\nabla f$

Gradient descent

  • Note that a gradient is a vector, so it has both: a direction and a magnitude
  • To determine the next point, the algorithm adds some fraction of the gradient's magnitude to the current point, stepping in the direction of the negative gradient

Learning rate

  • Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point
  • Hyperparameters are the knobs that programmers tweak in machine learning algorithms.
  • Learning rate is a hyperparameter. Most machine learning programmers spend a fair amount of time tuning the learning rate.

Learning rate

  • If the learning rate is too small, learning will take too long:

Learning rate

  • If the learning rate is too large, the next point will perpetually bounce haphazardly across the bottom of the loss curve:

Good learning rate

  • For every regression problem, there is a way to find the ideal learning rate:
    • In one dimension: $\frac{1}{f''(x)}$
    • For two or more dimensions: the inverse of the Hessian (the matrix of second partial derivatives)
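A quick check of the one-dimensional rule: for a quadratic, $f''$ is constant, so a single step with learning rate $1/f''$ lands exactly on the minimum. The function below is an illustrative example, not one from the slides:

```python
def step(x, grad, learning_rate):
    """One gradient descent step in one dimension."""
    return x - learning_rate * grad(x)

# f(x) = (x - 3)^2 has constant f''(x) = 2, so the ideal rate is 1/2 = 0.5
grad = lambda x: 2 * (x - 3)  # f'(x)
print(step(10.0, grad, 0.5))  # one step from x = 10 lands on the minimum x = 3
```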

Hands-on: Calculate next values of weights

| House size (X) | Price (Y) | X (Min-max Standardized) | Y (Min-max Standardized) |
|---:|---:|---:|---:|
| 1,100 | 199,000 | 0.00 | 0.00 |
| 1,400 | 245,000 | 0.22 | 0.22 |
| 1,425 | 319,000 | 0.24 | 0.58 |
| 1,550 | 240,000 | 0.33 | 0.20 |
| 1,600 | 312,000 | 0.37 | 0.55 |
| 1,700 | 279,000 | 0.44 | 0.39 |
| 1,700 | 310,000 | 0.44 | 0.54 |
| 1,875 | 308,000 | 0.57 | 0.53 |
| 2,350 | 405,000 | 0.93 | 1.00 |
| 2,450 | 324,000 | 1.00 | 0.61 |

Linear regression

  • Model: $y' = a + bx$
  • Starting value: $(a,b)=(0.45,0.75)$
  • Loss function: Sum of Squared Errors
$$SSE=f(a,b)=\frac{1}{2}\sum{(y'-y)^2}=\frac{1}{2}\sum{(a+bx-y)^2}$$

Step 1. Calculate loss (SSE)

| X | Y | a | b | Y' | SSE |
|---:|---:|---:|---:|---:|---:|
| 0.00 | 0.00 | 0.45 | 0.75 | 0.45 | 0.101 |
| 0.22 | 0.22 | | | 0.62 | 0.077 |
| 0.24 | 0.58 | | | 0.63 | 0.001 |
| 0.33 | 0.20 | | | 0.70 | 0.125 |
| 0.37 | 0.55 | | | 0.73 | 0.016 |
| 0.44 | 0.39 | | | 0.78 | 0.078 |
| 0.44 | 0.54 | | | 0.78 | 0.030 |
| 0.57 | 0.53 | | | 0.88 | 0.062 |
| 0.93 | 1.00 | | | 1.14 | 0.010 |
| 1.00 | 0.61 | | | 1.20 | 0.176 |
| | | | | **Total:** | **0.677** |

Input examples

Step 2. Calculate gradient

$$SSE=f(a,b)=\frac{1}{2}\sum{(y'-y)^2}=\frac{1}{2}\sum{(a+bx-y)^2}$$ $$\left(\frac{\partial f}{\partial a},\frac{\partial f}{\partial b}\right)=\left(\sum{(a+bx-y)}, \sum{x(a+bx-y)}\right)=\left(\sum{(y'-y)}, \sum{x(y'-y)}\right)$$ (the table rows below list each example's contribution; the Total row gives the sums)
| X | Y | a | b | Y' | SSE=f(a,b) | df/da | df/db |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 0.00 | 0.00 | 0.45 | 0.75 | 0.45 | 0.101 | 0.45 | 0.00 |
| 0.22 | 0.22 | | | 0.62 | 0.077 | 0.39 | 0.09 |
| 0.24 | 0.58 | | | 0.63 | 0.001 | 0.05 | 0.01 |
| 0.33 | 0.20 | | | 0.70 | 0.125 | 0.50 | 0.17 |
| 0.37 | 0.55 | | | 0.73 | 0.016 | 0.18 | 0.07 |
| 0.44 | 0.39 | | | 0.78 | 0.078 | 0.39 | 0.18 |
| 0.44 | 0.54 | | | 0.78 | 0.030 | 0.24 | 0.11 |
| 0.57 | 0.53 | | | 0.88 | 0.062 | 0.35 | 0.20 |
| 0.93 | 1.00 | | | 1.14 | 0.010 | 0.14 | 0.13 |
| 1.00 | 0.61 | | | 1.20 | 0.176 | 0.59 | 0.59 |
| | | | | **Total:** | **0.677** | **3.30** | **1.55** |

Our initial model

Step 3. Update weight, re-calculate SSE

  • Take learning rate $ \alpha=0.1$
  • Update weights:
    • $a_{new} = a - \alpha\times\frac{\partial f}{\partial a} = 0.45-0.1 \times 3.30 = 0.12$
    • $b_{new} = b - \alpha\times\frac{\partial f}{\partial b} = 0.75-0.1 \times 1.55 = 0.60$
| X | Y | a | b | Y' | SSE=f(a,b) | df/da | df/db | new a | new b | new Y' | new SSE |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0.00 | 0.00 | 0.45 | 0.75 | 0.45 | 0.101 | 0.45 | 0.00 | 0.12 | 0.60 | 0.12 | 0.007 |
| 0.22 | 0.22 | | | 0.62 | 0.077 | 0.39 | 0.09 | | | 0.25 | 0.000 |
| 0.24 | 0.58 | | | 0.63 | 0.001 | 0.05 | 0.01 | | | 0.26 | 0.051 |
| 0.33 | 0.20 | | | 0.70 | 0.125 | 0.50 | 0.17 | | | 0.32 | 0.007 |
| 0.37 | 0.55 | | | 0.73 | 0.016 | 0.18 | 0.07 | | | 0.34 | 0.022 |
| 0.44 | 0.39 | | | 0.78 | 0.078 | 0.39 | 0.18 | | | 0.38 | 0.000 |
| 0.44 | 0.54 | | | 0.78 | 0.030 | 0.24 | 0.11 | | | 0.38 | 0.012 |
| 0.57 | 0.53 | | | 0.88 | 0.062 | 0.35 | 0.20 | | | 0.46 | 0.002 |
| 0.93 | 1.00 | | | 1.14 | 0.010 | 0.14 | 0.13 | | | 0.67 | 0.054 |
| 1.00 | 0.61 | | | 1.20 | 0.176 | 0.59 | 0.59 | | | 0.72 | 0.006 |
| | | | | **Total:** | **0.677** | **3.30** | **1.55** | | | | **0.161** |
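The three steps can be reproduced in a few lines of Python. Because this uses the unrounded standardized values, the totals match the tables only up to rounding:

```python
# Min-max standardized house sizes (X) and prices (Y) from the hands-on table
X = [0.00, 0.22, 0.24, 0.33, 0.37, 0.44, 0.44, 0.57, 0.93, 1.00]
Y = [0.00, 0.22, 0.58, 0.20, 0.55, 0.39, 0.54, 0.53, 1.00, 0.61]
a, b, alpha = 0.45, 0.75, 0.1  # starting weights and learning rate

def sse(a, b):
    """Step 1: loss SSE = (1/2) * sum((a + b*x - y)^2)."""
    return 0.5 * sum((a + b * x - y) ** 2 for x, y in zip(X, Y))

# Step 2: gradient, summed over all examples
df_da = sum(a + b * x - y for x, y in zip(X, Y))
df_db = sum(x * (a + b * x - y) for x, y in zip(X, Y))

# Step 3: weight update
a_new = a - alpha * df_da  # 0.45 - 0.1 * 3.29 ≈ 0.12
b_new = b - alpha * df_db  # 0.75 - 0.1 * 1.53 ≈ 0.60

print(round(sse(a, b), 2), round(sse(a_new, b_new), 2))  # loss drops: ≈0.67 -> ≈0.16
```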

Our model after updating weight

Hands-on: Optimize learning rate

Stochastic Gradient Descent

  • In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.
  • Stochastic gradient descent (SGD) uses only a single (random) example (a batch size of 1) per iteration
  • Mini-batch SGD is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random.
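A sketch of one mini-batch update for the two-parameter model from the hands-on section (the function name and interface are illustrative):

```python
import random

def minibatch_step(examples, a, b, learning_rate, batch_size):
    """One mini-batch SGD step for y' = a + b*x with squared loss.

    batch_size=1 gives plain SGD; batch_size=len(examples) is
    full-batch gradient descent.
    """
    batch = random.sample(examples, batch_size)  # examples chosen at random
    grad_a = sum(a + b * x - y for x, y in batch) / batch_size
    grad_b = sum(x * (a + b * x - y) for x, y in batch) / batch_size
    return a - learning_rate * grad_a, b - learning_rate * grad_b

# One step on a single example (batch of 1): both parameters move toward the data
print(minibatch_step([(1.0, 1.0)], 0.0, 0.0, 0.5, 1))  # (0.5, 0.5)
```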

Reducing Loss: Playground Exercise

Quiz

Q1. Suppose gradient descent is used to try to find the minimum of the function $f(x,y) = 1+x^2+y^2$ starting at the point $(1,1)$. What will the x and y coordinates be after the first step, given the learning rate of $0.5$?

Quiz

Q2. Suppose gradient descent is used on function $f(x)=x^2-1$. If the gradient descent begins at $x=-1$ and uses the learning rate of $1.0$, in how many steps will it converge to the global minimum $x=0$?

### Thank you!