Train-Test Split

Week 3 | Lesson 3.2

LEARNING OBJECTIVES

After this lesson, you will be able to:

Split data into testing and training sets
Perform cross-validation scoring
Make cross-validation predictions

STUDENT PRE-WORK

Before this lesson, you should already be able to:

Fit a linear model to a dataframe
Use scikit-learn at a basic level

INSTRUCTOR PREP

Before this lesson, instructors will need to:

Read in / Review any dataset(s) & starter/solution code
Generate a brief slide deck
Prepare any specific materials
Provide students with additional resources

STARTER CODE

Demo

LESSON GUIDE

TIMING	TYPE	TOPIC
5 min	Opening	Discussion
10 min	Introduction	Test/Train Split
15 min	Demo	Test/Train Split
25 min	Guided Practice	Cross-Validation
25 min	Independent Practice	Cross-Validation
5 min	Conclusion	Review / Recap

Opening (5 mins)

Review prior labs/homework, upcoming projects, or exit tickets, when applicable
Review lesson objectives
Discuss real world relevance of these topics
Relate topics to the Data Science Workflow - i.e. are these concepts typically used to acquire, parse, clean, mine, refine, model, present, or deploy?

Introduction: Test/Train Split (10 mins)

So far we've focused on fitting the best model to our data. In practice, we also need to make predictions with our models and validate them. By splitting our data set into one subset to train the model on and another to make and test predictions on, we can validate the model's effectiveness. This is called a train/test split, and we'll explore a number of ways to carry it out effectively. Scoring the model on held-out data is also a good way to catch overfitting to the training set.

Test/train split benefits:

Sets aside a subset of data to make and test predictions on
Can verify predictions without having to collect new data (which may be difficult or expensive)
Can reveal overfitting, since an overfit model scores worse on data it was not trained on
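
For instructors who want a quick illustration before opening the notebook, here is a minimal sketch of a basic split using scikit-learn's train_test_split. The synthetic data from make_regression, the 25% test size, and the random_state values are assumptions chosen for illustration; they are not taken from the starter code.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, used only for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out 25% of the rows for testing; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# R^2 on the data the model was fit on vs. on the held-out data.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"train R^2: {train_score:.3f}, test R^2: {test_score:.3f}")
```

A test score far below the training score is the usual sign of overfitting.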

Use the included Jupyter notebook for the demo (first section of the starter code) and a more in-depth introduction (with equations).

Solution code

Demo: Test/Train Split (15 mins)

The demo covers a basic test/train split as well as k-fold cross-validation.
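
As a rough sketch of k-fold cross-validation scoring, the example below uses scikit-learn's cross_val_score with five folds. The synthetic data and the choice of k = 5 are assumptions for illustration and may differ from what the demo notebook uses.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption, used only for illustration).
X, y = make_regression(n_samples=150, n_features=4, noise=15.0, random_state=2)

# cv=5 fits five models; each one is scored on a different held-out fifth.
# For a regressor, the default score is R^2.
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(scores)
print(f"mean R^2: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the mean together with the spread across folds gives a steadier estimate than a single split.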

Check: Is 2-fold cross-validation the same as a 50:50 test/train split?

It may seem so at first glance, but a 50:50 split trains a single model on one half and tests it on the other. With 2-fold cross-validation we train two models, each fit on one half and tested on the other, so we get a prediction for every point.
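
The difference can be shown with a short sketch that counts the predictions each approach produces. It assumes synthetic data and uses scikit-learn's cross_val_predict with a shuffled 2-fold KFold; the sample size and seeds are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

# Synthetic regression data (an assumption, used only for illustration).
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=1)

# 50:50 split: one model, predictions only for the held-out half.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)
split_preds = LinearRegression().fit(X_train, y_train).predict(X_test)

# 2-fold CV: two models, each predicting the half it was not trained on.
cv = KFold(n_splits=2, shuffle=True, random_state=1)
cv_preds = cross_val_predict(LinearRegression(), X, y, cv=cv)

print(len(split_preds), len(cv_preds))  # prints: 50 100
```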

Check: Will two different 50:50 (or x:y) splits produce the same model score?

In general, no. If the splits are chosen poorly along a categorical variable, the difference could be very large. For example, theme park attendance might be very different depending on the day of the week, so a split that puts most weekends in the test set would score very differently from one that mixes the days evenly. Can students think of other examples?
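
One way to let students see this variation is to repeat the same 50:50 split with different random seeds and compare the scores. The sketch below assumes a small, noisy synthetic data set so that the spread is easy to see; the seeds and sizes are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Small, noisy synthetic data (an assumption, used only for illustration).
X, y = make_regression(n_samples=60, n_features=3, noise=25.0, random_state=3)

scores = []
for seed in range(5):
    # Same 50:50 proportions each time, but different rows land in each half.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

for seed, score in enumerate(scores):
    print(f"seed {seed}: test R^2 = {score:.3f}")
```

Averaging over several splits, as k-fold cross-validation does, reduces the dependence on any one lucky or unlucky split.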

Guided Practice: Cross-Validation (25 mins)

In the Starter code, practice cross-validating models.

Solution code

Independent Practice: Cross-Validation (25 mins)

Continue practicing with the Starter code.

Solution code

Conclusion (5 mins)

Review any independent practice deliverable(s)
Recap topic(s) covered

If you are experienced in statistics and data analysis, you may be accustomed to using all the available data to establish relationships, so saving some data to make and test predictions may seem unusual. However, it is critical that we make and test the predictions of our models before we put them into practice, and that we take care to avoid overfitting.

Consider also recapping the concept of model comparison introduced during the lesson's independent practice.
