Gradient Descent
Week 3 | Lesson 4.1
LEARNING OBJECTIVES
After this lesson, you will be able to:
- Review the basics of derivatives
- Define the gradient descent algorithm
- Step through an example of gradient descent
- Discuss when the gradient descent algorithm can get stuck or fail
STUDENT PRE-WORK
Before this lesson, you should already be able to:
- Make plots and scatter plots with matplotlib
INSTRUCTOR PREP
Before this lesson, instructors will need to:
- Read in / Review any dataset(s) & starter/solution code
- Generate a brief slide deck
- Prepare any specific materials
- Provide students with additional resources
STARTER CODE
LESSON GUIDE
| TIMING | TYPE                 | TOPIC            |
|--------|----------------------|------------------|
| 5 min  | Opening              | Discussion       |
| 10 min | Introduction         | Derivatives      |
| 15 min | Demo                 | Gradient Descent |
| 25 min | Guided Practice      | Gradient Descent |
| 25 min | Independent Practice | Gradient Descent |
| 5 min  | Conclusion           | Conclusion       |
Opening (5 mins)
- Review prior labs/homework, upcoming projects, or exit tickets, when applicable
- Review lesson objectives
- Discuss real world relevance of these topics
- Relate topics to the Data Science Workflow - i.e. are these concepts typically used to acquire, parse, clean, mine, refine, model, present, or deploy?
Check: Ask students to define, explain, or recall any relevant prework concepts, such as local minima of functions, loss functions, and derivatives (if your students have calculus experience).
Introduction: Derivatives (10-30 mins)
Review derivatives first, adjusting your timing to the mathematics background of your class. There isn't enough time to teach calculus from scratch, of course, but a good demo and a few analogies can build a lot of intuitive understanding.
The derivative of a function measures the rate of change of the function's output with respect to its input.
The gradient generalizes the derivative to multivariable functions: it points in the direction in which the function changes most quickly. Not every function has a derivative at every point. For example, the absolute value function has no well-defined derivative at the origin, because many candidate tangent lines touch the graph there, as the sketch below illustrates:
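A quick matplotlib sketch can stand in for a figure here; the plotting range and slopes are chosen purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 400)
plt.plot(x, np.abs(x), label="|x|", linewidth=2)

# Every line y = m*x with slope between -1 and 1 touches |x| at the
# origin without crossing it, so no single tangent line is privileged.
for m in [-1, -0.5, 0, 0.5, 1]:
    plt.plot(x, m * x, linestyle="--", alpha=0.6, label=f"slope {m}")

plt.legend()
plt.title("No unique tangent line to |x| at x = 0")
plt.show()
```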
This is one reason squared error is preferred as a loss function whenever possible. Even though we've seen a scenario in which the least absolute deviation (LAD) model produced a better fit, finding the LAD model can be tricky precisely because the absolute value has no derivative at the origin. Many loss functions, however, do have derivatives at every point, so we can use derivatives and gradients to minimize those loss functions and thereby find best-fit models for a variety of machine learning algorithms.
Demo: Gradient Descent (15-20 mins)
Use the included Jupyter notebook to explain the gradient descent algorithm and walk through a demo.
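If you want a compact fallback alongside the notebook, here is a minimal sketch of the algorithm on a simple quadratic, f(x) = (x - 3)^2; the learning rate and iteration count are illustrative choices, not values from the notebook:

```python
# Minimize f(x) = (x - 3)**2, whose derivative is f'(x) = 2 * (x - 3).
# The true minimum is at x = 3.
def gradient_descent(grad, x0, learning_rate=0.1, n_iters=100):
    x = x0
    for _ in range(n_iters):
        x = x - learning_rate * grad(x)  # step opposite the gradient
    return x

grad_f = lambda x: 2 * (x - 3)
print(gradient_descent(grad_f, x0=0.0))  # converges very close to 3.0
```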
Advantages and Disadvantages
Advantages:
- Relatively simple algorithm to code, many packages available
- Efficient (linear complexity in the size of the training set)
- Works for a variety of models
Disadvantages:
- Only works for differentiable functions
- Can get stuck in local minima rather than the global minimum (see the sketch after this list)
- Can be sensitive to learning rate and scaling
- For smaller datasets other algorithms can outperform gradient descent
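To make the local-minimum failure mode concrete, here is a small sketch on a non-convex function; the function and starting points are illustrative:

```python
# f(x) = x**4 - 3*x**2 + x has a local minimum near x ≈ 1.13 and a
# global minimum near x ≈ -1.30. Where gradient descent ends up
# depends entirely on where it starts.
def descend(grad, x0, lr=0.01, n_iters=500):
    x = x0
    for _ in range(n_iters):
        x = x - lr * grad(x)
    return x

grad_f = lambda x: 4 * x**3 - 6 * x + 1

print(descend(grad_f, x0=2.0))   # ends near 1.13: stuck in the local minimum
print(descend(grad_f, x0=-2.0))  # ends near -1.30: the global minimum
```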
Guided Practice: Gradient Descent (25 mins)
Work through the examples in this Worksheet together.
Check: Why does the learning rate make a difference in convergence?
Check: On which functions can we use gradient descent?
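For the learning-rate check, this minimal sketch on f(x) = x² shows one step size converging and another diverging; both rates are illustrative:

```python
# On f(x) = x**2 the gradient is 2x, so each update is
# x <- x * (1 - 2 * lr). Convergence requires |1 - 2*lr| < 1;
# lr = 1.1 overshoots further on every step and the iterates blow up.
def run(lr, x0=1.0, n_iters=20):
    x = x0
    for _ in range(n_iters):
        x = x - lr * (2 * x)
    return x

print(run(lr=0.1))  # shrinks toward 0: converges
print(run(lr=1.1))  # grows in magnitude every step: diverges
```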
Independent Practice: Gradient Descent (25 mins)
Complete the Jupyter notebook for this exercise.
Check: Were students able to compute the gradients of the example functions? Do they intuitively understand the use of the gradient even if their calculus skills are rusty?
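If students' calculus is rusty, a finite-difference check lets them verify hand-computed gradients numerically; the example function below is a hypothetical stand-in for the notebook's exercises:

```python
import numpy as np

# Approximate each partial derivative with a centered difference:
# (f(x + h*e_i) - f(x - h*e_i)) / (2h) for each coordinate i.
def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        step = np.zeros_like(x, dtype=float)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

# Example: f(x, y) = x**2 + 3*y, whose gradient is (2x, 3).
f = lambda v: v[0]**2 + 3 * v[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))  # approx [2.0, 3.0]
```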
Conclusion (5 mins)
- Review Objectives
- Recap major takeaways
- Discuss additional resources
ADDITIONAL RESOURCES
- Derivatives on Khan Academy
- Partial Derivatives
- A nice demo of gradient descent for linear regression.