Gradient Descent

Week 3 | Lesson 4.1


After this lesson, you will be able to:

  • Review the basics of derivatives
  • Define the gradient descent algorithm
  • Step through an example of gradient descent
  • Discuss when the gradient descent algorithm can get stuck or fail


Before this lesson, you should already be able to:

  • Make plots and scatter plots with matplotlib


Before this lesson, instructors will need to:

  • Read in / Review any dataset(s) & starter/solution code
  • Generate a brief slide deck
  • Prepare any specific materials
  • Provide students with additional resources




Timing   Type                  Topic
5 min    Opening               Discussion
10 min   Introduction          Derivatives
15 min   Demo                  Gradient Descent
25 min   Guided Practice       Gradient Descent
25 min   Independent Practice  Gradient Descent
5 min    Conclusion            Conclusion

Opening (5 mins)

  • Review prior labs/homework, upcoming projects, or exit tickets, when applicable
  • Review lesson objectives
  • Discuss real world relevance of these topics
  • Relate topics to the Data Science Workflow - i.e. are these concepts typically used to acquire, parse, clean, mine, refine, model, present, or deploy?

Check: Ask students to define, explain, or recall any relevant prework concepts, such as local minima of functions, loss functions, and derivatives (if your students have calculus experience).

Introduction: Derivatives (10-30 mins)

Review derivatives first. How much time you spend here should depend strongly on the math proficiency of your students, so adjust your timing accordingly. There is not enough time to teach calculus from scratch, but you can build a lot of intuitive understanding with a good demo and a few analogies.

Derivatives on Khan Academy

Partial Derivatives

The derivative of a function measures the rate of change of the values of the function with respect to another quantity.
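For students whose calculus is rusty, the "rate of change" idea can be made concrete by estimating a slope numerically. The following is a minimal sketch, not part of the lesson's notebook; the helper name `numerical_derivative`, the step size `h`, and the example function are assumptions chosen for illustration.

```python
def numerical_derivative(f, x, h=1e-5):
    """Estimate f'(x) with a central difference: the slope between two nearby points."""
    return (f(x + h) - f(x - h)) / (2 * h)


def square(x):
    return x ** 2


# The exact derivative of x**2 is 2x, so the slope at x = 3 should be close to 6.
slope_at_3 = numerical_derivative(square, 3.0)
print(round(slope_at_3, 4))
```

A smaller `h` brings the two points closer together, which is the intuition behind the limit in the formal definition.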


The gradient generalizes the derivative to multi-variable functions: it is the vector of partial derivatives, and it points in the direction in which the function increases most quickly. Not all functions have derivatives at every point. For example, the absolute value function does not have a well-defined derivative at the origin because the slope is -1 just to the left and +1 just to the right, so many lines through the origin could serve as a tangent line:

[Figure: graph of the absolute value function]
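The missing derivative at the origin can also be checked numerically by comparing the slope from the left with the slope from the right. This is an illustrative sketch under the assumption that a small finite step `h` is a good enough stand-in for the limit; `one_sided_slopes` is a hypothetical helper name.

```python
def one_sided_slopes(f, x, h=1e-6):
    """Return the (left, right) difference quotients of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right


# Absolute value: the two slopes disagree, so there is no single derivative at 0.
abs_left, abs_right = one_sided_slopes(abs, 0.0)
print(abs_left, abs_right)

# x**2: both slopes are close to 0, so the derivative at 0 exists.
sq_left, sq_right = one_sided_slopes(lambda x: x ** 2, 0.0)
print(sq_left, sq_right)
```

When the left and right slopes agree as `h` shrinks, the function has a derivative at that point; when they settle on different values, it does not.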

This is one reason why squared error is used as a loss function whenever possible. Even though we've seen a scenario in which the least absolute deviation (LAD) model produced a better fit, finding the LAD fit can be tricky because the absolute value has no derivative at the origin. However, many loss functions do have derivatives at every point, and we can use derivatives and gradients to minimize them, thereby finding best-fit models for a variety of machine learning algorithms.
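As an illustration of a loss function that has a derivative everywhere, here is a sketch of the mean squared error for a line y = m*x + b together with its gradient. The data points, the function names, and the choice of a one-feature linear model are assumptions made for this example.

```python
def mse(m, b, xs, ys):
    """Mean squared error of the line y = m*x + b on the data."""
    n = len(xs)
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def mse_gradient(m, b, xs, ys):
    """Partial derivatives of the MSE with respect to m and b."""
    n = len(xs)
    d_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
    d_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
    return d_m, d_b


# Made-up points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

grad_at_origin = mse_gradient(0.0, 0.0, xs, ys)  # nonzero: the loss can still decrease
grad_at_fit = mse_gradient(2.0, 1.0, xs, ys)     # zero: the line already fits the data
print(grad_at_origin, grad_at_fit)
```

A gradient of zero marks a candidate minimum of the loss, which is the condition that gradient descent works toward.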

Demo: Gradient Descent (15-20 mins)

Use the included Jupyter notebook to explain the gradient descent algorithm and walk through a demo.
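If the notebook is not at hand, the core update rule can be sketched in a few lines: repeatedly move against the derivative, scaled by a learning rate. This is a minimal one-variable sketch and not the notebook's code; the function name, the default learning rate, the stopping tolerance, and the example function f(x) = (x - 3)^2 are all assumptions.

```python
def gradient_descent(grad, start, learning_rate=0.1, n_steps=100, tol=1e-8):
    """Minimize a one-variable function given its derivative `grad`."""
    x = start
    history = [x]  # keep every iterate so the path can be plotted later
    for _ in range(n_steps):
        step = learning_rate * grad(x)
        x = x - step  # move downhill, against the slope
        history.append(x)
        if abs(step) < tol:  # stop once the updates become negligible
            break
    return x, history


def grad_f(x):
    """Derivative of f(x) = (x - 3)**2, whose minimum is at x = 3."""
    return 2 * (x - 3)


x_min, history = gradient_descent(grad_f, start=0.0)
print(round(x_min, 4))  # close to 3, the true minimum
```

The `history` list can be passed to matplotlib to show students how the iterates approach the minimum.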

Advantages and Disadvantages

Advantages:

  • Relatively simple algorithm to code, with many packages available
  • Efficient: the cost of each iteration is linear in the size of the training set
  • Works for a variety of models

Disadvantages:

  • Only works for differentiable functions
  • Can get stuck in local minima rather than reaching the global minimum
  • Can be sensitive to the learning rate and to feature scaling
  • For smaller datasets, other algorithms can outperform gradient descent
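The local-minimum caveat can be shown on a function with two valleys. The sketch below uses f(x) = x^4 - 3x^2 + x, an example chosen here for illustration; the starting points, learning rate, and step count are likewise assumptions.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def descend(start, learning_rate=0.01, n_steps=1000):
    """Plain gradient descent with a fixed number of steps."""
    x = start
    for _ in range(n_steps):
        x = x - learning_rate * grad_f(x)
    return x


x_left = descend(-2.0)   # settles in the deeper valley, near x = -1.30
x_right = descend(2.0)   # settles in the shallower valley, near x = 1.13
print(round(x_left, 3), round(f(x_left), 3))
print(round(x_right, 3), round(f(x_right), 3))
```

Both runs stop where the derivative is zero, but only the run started at -2.0 reaches the lower of the two minima. The starting point alone decides which valley the algorithm ends up in.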


Guided Practice: Gradient Descent (25 mins)

Work through the examples in this Worksheet together.

Check: Why does the learning rate make a difference in convergence?

Check: On which functions can we use gradient descent?
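One way to prompt discussion of the learning-rate question is to run the same descent with several rates. The sketch below uses f(x) = x^2, whose derivative is 2x; the three rates, the starting point, and the step count are assumptions picked to show slow convergence, fast convergence, and divergence.

```python
def descend(learning_rate, start=1.0, n_steps=20):
    """Run gradient descent on f(x) = x**2, whose derivative is 2*x."""
    x = start
    for _ in range(n_steps):
        x = x - learning_rate * 2 * x
    return x


too_small = descend(0.01)  # each step shrinks x by only 2%, so progress is slow
reasonable = descend(0.1)  # each step shrinks x by 20%, so x approaches 0 quickly
too_large = descend(1.1)   # each step overshoots 0 and lands farther away

print(too_small, reasonable, too_large)
```

For this function each update multiplies x by (1 - 2 * learning_rate), so the iterates shrink toward the minimum only when that factor has absolute value below 1.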

Independent Practice: Gradient Descent (25 mins)

Complete the Jupyter notebook for this exercise.

Solution code

Check: Were students able to compute the gradients of the example functions? Do they intuitively understand the use of the gradient even if their calculus skills are rusty?

Conclusion (5 mins)

  • Review Objectives
  • Recap major takeaways
  • Discuss additional resources

