Train-Test Split

Week 3 | Lesson 3.2

LEARNING OBJECTIVES

After this lesson, you will be able to:

  • Split data into training and testing sets
  • Perform cross-validation scoring
  • Make cross-validation predictions

STUDENT PRE-WORK

Before this lesson, you should already be able to:

  • Fit a linear model to a dataframe
  • Use scikit-learn for basic modeling tasks

INSTRUCTOR PREP

Before this lesson, instructors will need to:

  • Read in / Review any dataset(s) & starter/solution code
  • Generate a brief slide deck
  • Prepare any specific materials
  • Provide students with additional resources

STARTER CODE

Demo

LESSON GUIDE

TIMING   TYPE                  TOPIC
5 min    Opening               Discussion
10 min   Introduction          Train/Test Split
15 min   Demo                  Train/Test Split
25 min   Guided Practice       Cross-Validation
25 min   Independent Practice  Cross-Validation
10 min   Conclusion            Review / Recap

Opening (5 mins)

  • Review prior labs/homework, upcoming projects, or exit tickets, when applicable
  • Review lesson objectives
  • Discuss real world relevance of these topics
  • Relate topics to the Data Science Workflow - i.e. are these concepts typically used to acquire, parse, clean, mine, refine, model, present, or deploy?

Introduction: Train/Test Split (10 mins)

So far we've focused on fitting the best model to our data. In practice, we also need to make and validate predictions with our models. By splitting the dataset into one subset for training the model and another for making and testing predictions, we can validate how effective the model is. This is called a train/test split, and we'll explore several ways to carry it out effectively (see the sketch after the benefits list below). It's also a good way to guard against overfitting your dataset.

Train/test split benefits:

  • Save a subset of data to make predictions
  • Can verify predictions without having to collect new data (which may be difficult or expensive)
  • Can help avoid overfitting
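
As a concrete reference, here is a minimal sketch of a basic split with scikit-learn. The feature matrix X and target y are toy stand-ins, not the lesson's actual dataset:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Toy data standing in for the lesson's dataset
    rng = np.random.RandomState(0)
    X = rng.rand(100, 3)
    y = X @ np.array([1.5, -2.0, 3.0]) + rng.rand(100)

    # Hold out 30% of the rows for testing; fixing random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Fit on the training subset, then score on the held-out subset
    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))  # R^2 on unseen data

Note that in scikit-learn versions before 0.18 these helpers live in sklearn.cross_validation rather than sklearn.model_selection.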

Use the included Jupyter notebook for the demo (the first section of the starter code) and for a more in-depth introduction (with equations).

Solution code

Demo: Train/Test Split (15 mins)

The demo covers a basic train/test split as well as k-fold cross-validation.
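
One way to make k-fold concrete is a manual loop over scikit-learn's KFold splitter. This is a sketch under the same toy X and y as the split example above, not the demo notebook's exact code:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(X):
        # Each fold trains on 4/5 of the data and tests on the remaining 1/5
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    print(np.mean(scores))  # average R^2 across the five folds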

Check: Is 2-fold cross-validation the same as a 50:50 test/train split?

It may seem so at first glance, but with 2-fold cross-validation we get a prediction for every data point, since each half of the data is used to train a model that is then tested on the other half.
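
To make that point tangible, cross_val_predict with cv=2 returns a prediction for every row, which a single 50:50 split cannot do. A sketch, reusing the toy data above:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_predict

    # Each half is predicted by a model trained on the other half
    preds = cross_val_predict(LinearRegression(), X, y, cv=2)
    print(preds.shape)  # one prediction per row of X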

Check: Will two different 50:50 (or x:y) splits produce the same model score?

In general, no. If the splits fall unevenly along a categorical variable, the difference can be very large. For example, theme park attendance differs sharply by day of the week, so a split that puts mostly weekends in one half would train and test on very different data. Can students think of other examples?
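
A quick way to demonstrate this in class, under the same toy-data assumptions as above; only random_state differs between the two splits, yet the scores will generally disagree:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    for seed in (0, 1):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, random_state=seed)
        score = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
        print(seed, score)  # different seeds usually give different scores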

Guided Practice: Cross-Validation (25 mins)

In the Starter code, practice cross-validating models.

Solution code
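
The starter code itself isn't reproduced here, but the core pattern students will practice is presumably cross_val_score. A sketch with the toy data from above:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Five-fold cross-validated R^2 scores for a linear model
    scores = cross_val_score(LinearRegression(), X, y, cv=5)
    print(scores.mean(), scores.std())

Comparing these mean scores across candidate models is the model-comparison idea recapped in the conclusion.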

Independent Practice: Cross-Validation (25 mins)

Continue practicing with the Starter code.

Solution code

Conclusion (10 mins)

  • Review any independent practice deliverable(s)
  • Recap topic(s) covered

If you are experienced in statistics and data analysis, you may be accustomed to using all of the available data to establish relationships, so setting some aside to make and test predictions may seem unusual. However, it is critical that we make and test our models' predictions before putting them into practice, and that we take care to avoid overfitting.

Consider also recapping the concept of model comparison introduced during the lesson's independent practice.


ADDITIONAL RESOURCES
