Feature Scaling
Week 3 | Lesson 4.2
LEARNING OBJECTIVES
After this lesson, you will be able to:
- Use the scikit-learn preprocessing module to scale and normalize data in various ways
STUDENT PRE-WORK
Before this lesson, you should already be able to:
- Load and manipulate data with Pandas
- Fit models with sklearn
INSTRUCTOR PREP
Before this lesson, instructors will need to:
- Read in / Review any dataset(s) & starter/solution code
- Generate a brief slide deck
- Prepare any specific materials
- Provide students with additional resources
STARTER CODE
LESSON GUIDE
TIMING | TYPE | TOPIC |
---|---|---|
5 min | Opening | Discussion |
15 min | Introduction | Feature Scaling |
20 min | Demo | Scaling in Python |
15 min | Guided Practice | Normalization |
35 min | Independent Practice | Scaling and Linear Regression |
5 min | Conclusion | Review, Recap |
Opening (5 mins)
- Review prior labs/homework, upcoming projects, or exit tickets, when applicable
- Review lesson objectives
- Discuss real world relevance of these topics
- Relate topics to the Data Science Workflow - i.e. are these concepts typically used to acquire, parse, clean, mine, refine, model, present, or deploy?
Introduction: Feature Scaling (15-20 mins)
When working with new datasets we always need to process the data. As we've seen, it's usually necessary to convert strings to numbers, handle date formats, and toss out bad data points. It's also often necessary to scale our data, and doing so is rarely a bad idea.
Why scale data?
There are a number of good reasons why we scale our data:
- to handle disparities in units
- because many machine learning models require scaling
- it can speed up gradient descent
The scikit-learn documentation lays it out pretty clearly:
> Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
We scale before gradient descent to prevent the step sizes along different axes from being widely different. Without scaling it is difficult to find a good learning rate: one that is too small will take a long time to move in the direction of a larger-scale feature, and one that is too large will not have good resolution on a smaller-scale feature.
The good news is that it's rarely a bad idea to scale your data. So it's a good practice to apply consistently, and to master early in your progression as a data scientist.
How do we scale our data?
Typically we scale features in one of a few standard ways. For example, a common method called standardization takes a feature and rescales it to have mean zero and variance one, like a standard normal distribution. We do this by computing the mean and standard deviation, and then transform the data as follows:
x' = (x - mean) / std_dev
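As a quick illustration (not part of the starter code; the values below are made up), standardization by hand with NumPy looks like this:

```python
import numpy as np

# A toy feature on a large scale, e.g. prices in dollars (made-up values)
x = np.array([250_000., 320_000., 180_000., 410_000., 295_000.])

# Standardize: subtract the mean, divide by the standard deviation
x_std = (x - x.mean()) / x.std()

print(x_std.mean())  # approximately 0
print(x_std.std())   # approximately 1
```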
Another common method is called Min-Max Scaling, or simply rescaling. In this case we rescale our data to fit into the interval [0, 1] by transforming:
x' = (x - min) / (max - min)
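A hand-rolled sketch of Min-Max Scaling on the same kind of made-up data:

```python
import numpy as np

x = np.array([250_000., 320_000., 180_000., 410_000., 295_000.])

# Rescale to [0, 1]: the minimum maps to 0 and the maximum maps to 1
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled.min(), x_scaled.max())  # 0.0 1.0
```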
Normalization is another scaling method that you may have seen before; it involves dividing each value by the norm of the vector it belongs to, for example the sum of the absolute values (the L1 norm) or the square root of the sum of the squares (the L2 norm). If you know what a unit vector is then you've seen normalization before: in that case you divided a vector by its magnitude (the square root of the sum of the squares).
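For instance, dividing a vector by its magnitude (its L2 norm) produces a unit vector; a tiny made-up example:

```python
import numpy as np

v = np.array([3., 4.])

# Magnitude (L2 norm): square root of the sum of the squares
magnitude = np.sqrt((v ** 2).sum())  # 5.0

v_unit = v / magnitude
print(v_unit)                  # [0.6 0.8]
print(np.linalg.norm(v_unit))  # 1.0 -- a unit vector
```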
Check: Why do we scale data?
Demo: Scaling in Python (20 mins)
Use the starter code to walk through the demo of different scaling methods.
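If the starter code isn't at hand, a minimal stand-in for the demo (the DataFrame and column names are made up) might look like this:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical dataset with features on very different scales
df = pd.DataFrame({
    "sqft":  [850, 1200, 1950, 2400, 3100],
    "price": [250_000, 320_000, 410_000, 520_000, 700_000],
})

# Standardization: each column gets mean 0 and variance 1
standardized = StandardScaler().fit_transform(df)

# Min-max scaling: each column is rescaled to [0, 1]
rescaled = MinMaxScaler().fit_transform(df)

print(standardized)
print(rescaled)
```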
Guided Practice: Normalization (15 mins)
Practice scaling by normalization using pure Python and scikit-learn with the starter code; a sketch of one possible approach follows the list below.
- Apply L1 and L2 normalization using pure Python (5-10 mins)
- Apply L1 and L2 normalization using scikit-learn (5-10 mins)
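One possible solution sketch, using a small made-up array. Note that scikit-learn's Normalizer scales each sample (each row) to unit norm:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[4., 1., 2., 2.],
              [1., 3., 9., 3.],
              [5., 7., 5., 1.]])

# Pure Python / NumPy: divide each row by its L1 or L2 norm
l1 = X / np.abs(X).sum(axis=1, keepdims=True)
l2 = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

# The same transforms with scikit-learn
l1_sk = Normalizer(norm="l1").fit_transform(X)
l2_sk = Normalizer(norm="l2").fit_transform(X)

print(np.allclose(l1, l1_sk), np.allclose(l2, l2_sk))  # True True
```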
Independent Practice: Scaling and Linear Regression (35 mins)
- Practice scaling and linear fits. Does normalization affect any of your models? (10-20 mins)
- Try some regularized models. Does scaling have a significant effect? (10 mins)
- Try some other models from scikit-learn, such as an SGDRegressor. It's OK if you are unfamiliar with the model; just follow the example code and explore the fit and the effect of scaling (a minimal sketch also appears after this list). (10 mins)
- Bonus: try a few extra models like a support vector machine. What do you think about the goodness of fit? Scaling is required for this model.
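If students want a concrete starting point, a minimal sketch of the comparison is below. The data is synthetic and the feature names are made up; the point is only that ordinary least squares is unaffected by scaling, while SGDRegressor depends heavily on it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up features on very different scales: rooms (~1-10) and square footage (~500-3500)
rooms = rng.uniform(1, 10, size=300)
sqft = rng.uniform(500, 3500, size=300)
X = np.column_stack([rooms, sqft])
y = 20 * rooms + 0.05 * sqft + rng.normal(0, 5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "LinearRegression (unscaled)": LinearRegression(),
    "LinearRegression (scaled)":   make_pipeline(StandardScaler(), LinearRegression()),
    "SGDRegressor (unscaled)":     SGDRegressor(random_state=0),
    "SGDRegressor (scaled)":       make_pipeline(StandardScaler(), SGDRegressor(random_state=0)),
}

for name, model in models.items():
    # The unscaled SGDRegressor may diverge (or overflow outright) on these
    # feature scales -- which is exactly the point of the exercise.
    try:
        model.fit(X_train, y_train)
        print(f"{name}: R^2 = {model.score(X_test, y_test):.3f}")
    except ValueError as exc:
        print(f"{name}: failed to fit ({exc})")
```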
Scaling doesn't typically affect linear regression. The bonus exercise asks students to choose another model for which scaling is necessary. Students may need a little guidance, but really they only need to change one line (the model).
If students don't make it to the bonus exercise, take a few minutes at the end to show them that scaling does sometimes matter, e.g. for an SGDRegressor. The support vector machine is a great fit!
Check: Does scaling affect linear regression?
Conclusion (5 mins)
Let's review! Discuss a few of the reasons that scaling is important:
> - To handle disparities in units.
> - Because many machine learning models require scaling.
> - It can speed up gradient descent.
ADDITIONAL RESOURCES
- Feature scaling the wine dataset.
- Some examples of regression with the Boston dataset.