Week 3 | Lesson 3.2
After this lesson, you will be able to:
- Split data into testing and training sets
- Perform cross validation scoring
- Make cross validation predictions
Before this lesson, you should already be able to:
- Fit a linear model to a dataframe
- Use scikit-learn at a basic level
Before this lesson, instructors will need to:
- Read in / Review any dataset(s) & starter/solution code
- Generate a brief slide deck
- Prepare any specific materials
- Provide students with additional resources
| 10 min | Introduction | Test/Train Split |
| 15 min | Demo | Test/Train Split |
| 25 min | Guided Practice | Cross-Validation |
| 25 min | Independent Practice | Cross-Validation |
| 5 min | Conclusion | Review / Recap |
Opening (5 mins)
- Review prior labs/homework, upcoming projects, or exit tickets, when applicable
- Review lesson objectives
- Discuss real world relevance of these topics
- Relate topics to the Data Science Workflow - i.e. are these concepts typically used to acquire, parse, clean, mine, refine, model, present, or deploy?
Introduction: Test/Train Split (5 mins)
So far we've focused on fitting the best model to our data. In practice, we also need to make and validate predictions with our models. By splitting our dataset into one subset to train the model on and another subset to make and test predictions on, we can validate the effectiveness of our model. This is called a train/test split, and we'll explore a number of ways to carry it out effectively. It also helps us detect overfitting: a model that has merely memorized its training data will score poorly on the held-out test data.
Test/train split benefits:
- Save a subset of data to make predictions
- Can verify predictions without having to collect new data (which may be difficult or expensive)
- Can help avoid overfitting
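A basic split can be sketched in a few lines with scikit-learn's `train_test_split`. The synthetic dataset, the 75:25 ratio, and the variable names below are assumptions made for illustration, not taken from the starter code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=100)

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training rows only, then score on rows the model has never seen
model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

A test score far below the training score is the usual warning sign of overfitting.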
Use the included Jupyter notebook for the demo (first section of the starter code) and a more in-depth introduction (with equations).
Demo: Test/Train Split (15 mins)
The demo covers a basic test/train split as well as k-fold cross-validation.
Check: Is 2-fold cross-validation the same as a 50:50 test/train split?
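It may seem so at first glance, but 2-fold cross-validation fits two separate models: each half of the data serves once as the training set and once as the test set, so we get a prediction for every point. A single 50:50 split fits one model and yields predictions for only half of the data.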
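The difference shows up in the number of predictions each approach returns. The following is a minimal sketch on synthetic data (an assumption for illustration), using `cross_val_predict` for the 2-fold case.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(60, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=60)

# Single 50:50 split: one model, predictions for the test half only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
split_preds = LinearRegression().fit(X_train, y_train).predict(X_test)

# 2-fold cross-validation: two models, each predicting the half it was
# not trained on, so every row receives exactly one prediction
folds = KFold(n_splits=2, shuffle=True, random_state=0)
cv_preds = cross_val_predict(LinearRegression(), X, y, cv=folds)

print("predictions from 50:50 split:", len(split_preds))
print("predictions from 2-fold CV:  ", len(cv_preds))
```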
Check: Will two different 50:50 (or x:y) splits produce the same model score?
In general no, and if the splits are chosen poorly along a categorical variable, the difference could be very large. For example, theme park attendance might be very different depending on the day of the week. Can students think of other examples?
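The variation between splits can be demonstrated by repeating the same 50:50 split with different random seeds. This sketch uses a small, noisy synthetic dataset (an assumption for illustration) so that the spread in scores is easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Small, noisy synthetic dataset (an assumption for illustration)
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(40, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 5, size=40)

# Same model, same 50:50 ratio, different random splits
scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

for seed, score in enumerate(scores):
    print(f"seed {seed}: test R^2 = {score:.3f}")
```

Averaging over several splits, as cross-validation does, gives a steadier estimate than any single split.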
Guided Practice: Cross-Validation (25 mins)
In the starter code, practice cross-validating models.
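For reference, cross-validation scoring might look like the sketch below. It uses `cross_val_score` on a synthetic regression dataset from `make_regression`; the dataset, the fold count, and the variable names are assumptions and may differ from the starter code.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# 5-fold cross-validation: the model is fit five times, each time scored
# on the fold that was held out. For regressors the default score is R^2.
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print("per-fold R^2:", scores.round(3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```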
Independent Practice: Cross-Validation (25 mins)
Continue practicing with the Starter code.
Conclusion (10 mins)
- Review any independent practice deliverable(s)
- Recap topic(s) covered
If you are experienced in statistics and data analysis, you may be accustomed to using all the available data to establish relationships, so saving some data to make and test predictions may seem unusual. However, it is critical that we make and test the predictions of our models before we put them into practice, and that we take care to avoid overfitting.
Consider also recapping the concept of model comparison introduced during the lesson's independent practice.
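One way to frame the recap is to score two candidate models on identical folds and compare their mean scores. In this sketch the candidates (plain linear regression and ridge regression) and the synthetic data are stand-ins chosen for illustration, not necessarily the models used in the independent practice.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)

# Reuse the same folds for every candidate so the comparison is like-for-like
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical candidates; swap in whichever models the class compared
candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}

mean_scores = {
    name: cross_val_score(model, X, y, cv=folds).mean()
    for name, model in candidates.items()
}
best = max(mean_scores, key=mean_scores.get)

for name, score in mean_scores.items():
    print(f"{name}: mean R^2 = {score:.4f}")
print("highest mean score:", best)
```

When the mean scores are close, the per-fold spread matters as much as the ranking.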