Gradient Descent
Week 3 | Lesson 4.1
LEARNING OBJECTIVES
After this lesson, you will be able to:
- Review the basics of derivatives
- Define the gradient descent algorithm
- Step through an example of gradient descent
- Discuss when the gradient descent algorithm can get stuck or fail
STUDENT PRE-WORK
Before this lesson, you should already be able to:
- Make plots and scatter plots with matplotlib
INSTRUCTOR PREP
Before this lesson, instructors will need to:
- Read in / Review any dataset(s) & starter/solution code
- Generate a brief slide deck
- Prepare any specific materials
- Provide students with additional resources
STARTER CODE
LESSON GUIDE
| TIMING | TYPE                 | TOPIC            |
|--------|----------------------|------------------|
| 5 min  | Opening              | Discussion       |
| 10 min | Introduction         | Derivatives      |
| 15 min | Demo                 | Gradient Descent |
| 25 min | Guided Practice      | Gradient Descent |
| 25 min | Independent Practice | Gradient Descent |
| 5 min  | Conclusion           | Conclusion       |
Opening (5 mins)
- Review prior labs/homework, upcoming projects, or exit tickets, when applicable
- Review lesson objectives
- Discuss real world relevance of these topics
- Relate topics to the Data Science Workflow - i.e. are these concepts typically used to acquire, parse, clean, mine, refine, model, present, or deploy?
Check: Ask students to define, explain, or recall any relevant prework concepts, such as local minima of functions, loss functions, and derivatives (if your students have calculus experience).
Introduction: Derivatives (10-30 mins)
Review derivatives first, adjusting your timing to the mathematics background of your class. There isn't enough time to teach calculus from scratch, of course, but a good demo and a few analogies can build a lot of intuitive understanding.
The derivative of a function measures the rate of change of the function's output with respect to its input.
The gradient generalizes the derivative to multivariable functions: it points in the direction in which the function changes most quickly. Not every function has a derivative at every point. For example, the absolute value function has no well-defined derivative at the origin, because many candidate tangent lines touch the graph there, as the sketch below illustrates:
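A quick matplotlib sketch can stand in for a figure here; the plotting range and slopes are chosen purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 400)
plt.plot(x, np.abs(x), label="|x|", linewidth=2)

# Every line y = m*x with slope between -1 and 1 touches |x| at the
# origin without crossing it, so no single tangent line is privileged.
for m in [-1, -0.5, 0, 0.5, 1]:
    plt.plot(x, m * x, linestyle="--", alpha=0.6, label=f"slope {m}")

plt.legend()
plt.title("No unique tangent line to |x| at x = 0")
plt.show()
```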
This is one reason squared error is preferred as a loss function whenever possible. Even though we've seen a scenario in which the least absolute deviation (LAD) model produced a better fit, finding the LAD model can be tricky precisely because the absolute value has no derivative at the origin. Many loss functions, however, do have derivatives at every point, so we can use derivatives and gradients to minimize those loss functions and thereby find best-fit models for a variety of machine learning algorithms.
Demo: Gradient Descent (15-20 mins)
Use the included Jupyter notebook to explain the gradient descent algorithm and walk through a demo.
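If you want a compact fallback alongside the notebook, here is a minimal sketch of the algorithm on a simple quadratic, f(x) = (x - 3)^2; the learning rate and iteration count are illustrative choices, not values from the notebook:

```python
# Minimize f(x) = (x - 3)**2, whose derivative is f'(x) = 2 * (x - 3).
# The true minimum is at x = 3.
def gradient_descent(grad, x0, learning_rate=0.1, n_iters=100):
    x = x0
    for _ in range(n_iters):
        x = x - learning_rate * grad(x)  # step opposite the gradient
    return x

grad_f = lambda x: 2 * (x - 3)
print(gradient_descent(grad_f, x0=0.0))  # converges very close to 3.0
```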
Advantages and Disadvantages
Advantages:
- Relatively simple algorithm to code, many packages available
- Efficient (linear complexity in the size of the training set)
- Works for a variety of models
Disadvantages:
- Only works for differentiable functions
- Can get stuck in local minima rather than the global minimum (see the sketch after this list)
- Can be sensitive to learning rate and scaling
- For smaller datasets other algorithms can outperform gradient descent
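To make the local-minimum failure mode concrete, here is a small sketch on a non-convex function; the function and starting points are illustrative:

```python
# f(x) = x**4 - 3*x**2 + x has a local minimum near x ≈ 1.13 and a
# global minimum near x ≈ -1.30. Where gradient descent ends up
# depends entirely on where it starts.
def descend(grad, x0, lr=0.01, n_iters=500):
    x = x0
    for _ in range(n_iters):
        x = x - lr * grad(x)
    return x

grad_f = lambda x: 4 * x**3 - 6 * x + 1

print(descend(grad_f, x0=2.0))   # ends near 1.13: stuck in the local minimum
print(descend(grad_f, x0=-2.0))  # ends near -1.30: the global minimum
```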
Guided Practice: Gradient Descent (25 mins)
Work through the examples in this Worksheet together.
Check: Why does the learning rate make a difference in convergence?
Check: On which functions can we use gradient descent?
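For the learning-rate check, this minimal sketch on f(x) = x² shows one step size converging and another diverging; both rates are illustrative:

```python
# On f(x) = x**2 the gradient is 2x, so each update is
# x <- x * (1 - 2 * lr). Convergence requires |1 - 2*lr| < 1;
# lr = 1.1 overshoots further on every step and the iterates blow up.
def run(lr, x0=1.0, n_iters=20):
    x = x0
    for _ in range(n_iters):
        x = x - lr * (2 * x)
    return x

print(run(lr=0.1))  # shrinks toward 0: converges
print(run(lr=1.1))  # grows in magnitude every step: diverges
```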
Independent Practice: Gradient Descent (25 mins)
Complete the Jupyter notebook for this exercise.
Check: Were students able to compute the gradients of the example functions? Do they intuitively understand the use of the gradient even if their calculus skills are rusty?
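If students' calculus is rusty, a finite-difference check lets them verify hand-computed gradients numerically; the example function below is a hypothetical stand-in for the notebook's exercises:

```python
import numpy as np

# Approximate each partial derivative with a centered difference:
# (f(x + h*e_i) - f(x - h*e_i)) / (2h) for each coordinate i.
def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        step = np.zeros_like(x, dtype=float)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

# Example: f(x, y) = x**2 + 3*y, whose gradient is (2x, 3).
f = lambda v: v[0]**2 + 3 * v[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))  # approx [2.0, 3.0]
```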
Conclusion (5 mins)
- Review Objectives
- Recap major takeaways
- Discuss additional resources
ADDITIONAL RESOURCES
- Derivatives on Khan Academy
- Partial Derivatives
- A nice demo of gradient descent for linear regression.