Feature Scaling

Week 3 | Lesson 4.2

LEARNING OBJECTIVES

After this lesson, you will be able to:

Use the scikit-learn preprocessing module to normalize the data in various ways

STUDENT PRE-WORK

Before this lesson, you should already be able to:

Load and manipulate data with Pandas
Fit models with sklearn

INSTRUCTOR PREP

Before this lesson, instructors will need to:

Read in / Review any dataset(s) & starter/solution code
Generate a brief slide deck
Prepare any specific materials
Provide students with additional resources

STARTER CODE

Demo

LESSON GUIDE

TIMING	TYPE	TOPIC
5 min	Opening	Discussion
15 min	Introduction	Feature Scaling
20 min	Demo	Scaling in Python
15 min	Guided Practice	Normalization
35 min	Independent Practice	Scaling and Linear Regression
5 min	Conclusion	Review, Recap

Opening (5 mins)

Review prior labs/homework, upcoming projects, or exit tickets, when applicable
Review lesson objectives
Discuss real world relevance of these topics
Relate topics to the Data Science Workflow - i.e. are these concepts typically used to acquire, parse, clean, mine, refine, model, present, or deploy?

Introduction: Feature Scaling (15-20 mins)

When working with new data sets we always need to process the data. As we've seen, it's usually necessary to convert strings to numbers, handle date formats, and toss out bad data points. It's also often necessary to scale our data, and doing so is rarely a bad idea.

Why scale data?

There are a number of good reasons why we scale our data:

to handle disparities in units
because many machine learning models require scaling
it can speed up gradient descent

The scikit-learn documentation lays it out pretty clearly:

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

We scale for gradient descent so that the step sizes along different axes do not differ widely. When features sit on very different scales, it is difficult to find a good learning rate: one that is too small will take a long time to move in the direction of a larger-scale feature, and one that is too large will not have good resolution on a smaller-scale feature.
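As a rough illustration of the learning-rate problem, the sketch below runs plain batch gradient descent on a synthetic two-feature regression problem, once on the raw features and once after standardizing them with StandardScaler. The data, the gradient_descent_mse helper, and the step-size rule (one over the largest eigenvalue of X.T @ X / n) are assumptions made for this example and are not taken from the starter code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch): one feature on a scale
# of about 1 and another on a scale of about 1000.
rng = np.random.default_rng(0)
n = 500
X_raw = np.column_stack([rng.normal(0, 1, n), rng.normal(0, 1000, n)])
y = 3.0 * X_raw[:, 0] + 0.002 * X_raw[:, 1] + rng.normal(0, 0.1, n)


def gradient_descent_mse(X, y, n_steps=200):
    """Run batch gradient descent on squared error (no intercept).

    Returns the mean squared error after n_steps updates.
    """
    n_samples = X.shape[0]
    hessian = X.T @ X / n_samples
    # A conservative, stable step size: 1 / largest eigenvalue of the Hessian.
    learning_rate = 1.0 / np.linalg.eigvalsh(hessian).max()
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        gradient = X.T @ (X @ w - y) / n_samples
        w -= learning_rate * gradient
    return float(np.mean((X @ w - y) ** 2))


mse_raw = gradient_descent_mse(X_raw, y)
mse_scaled = gradient_descent_mse(StandardScaler().fit_transform(X_raw), y)
print("MSE after 200 steps, raw features:   ", mse_raw)
print("MSE after 200 steps, scaled features:", mse_scaled)
```

With the same number of steps, the run on standardized features should finish with a much lower error. On the raw data the stable step size is dictated by the large-scale feature, so it is far too small to make progress on the small-scale one.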

The good news is that it's rarely a bad idea to scale your data. So it's a good practice to apply consistently, and to master early in your progression as a data scientist.

How do we scale our data?

Typically we scale features in one of a few standard ways. For example, a common method called standardization takes a feature and rescales it to have mean 0 and variance 1, matching the mean and variance of a standard normal distribution. We do this by computing the feature's mean and standard deviation and then transforming the data as follows:

x' = (x - mean) / std_dev
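The standardization formula can be sketched in a few lines. The example below applies it by hand with NumPy and then with scikit-learn's StandardScaler, which should give the same result because both use the population standard deviation. The small array X is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (an assumption for this sketch): rows are samples,
# columns are features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# By hand: subtract each column's mean and divide by its standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# With scikit-learn: fit learns the mean and std, transform applies them.
X_sklearn = StandardScaler().fit_transform(X)

print(X_manual)
print(X_sklearn)
```

In practice, fit the scaler on the training data only and reuse it to transform the test data, so that both sets are scaled with the same mean and standard deviation.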

Another common method is called Min-Max Scaling or simply rescaling. In this case we rescale each feature to fit into a fixed interval, usually [0, 1], using the feature's minimum and maximum values:

x' = (x - min) / (max - min)
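A minimal sketch of min-max scaling, again on a made-up array, first by hand and then with scikit-learn's MinMaxScaler. The scaler's default feature_range is (0, 1), which matches the formula above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (an assumption for this sketch).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# By hand: shift each column so its minimum is 0, then divide by its range.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

# With scikit-learn, using the default feature_range of (0, 1).
X_sklearn = MinMaxScaler().fit_transform(X)

print(X_manual)
print(X_sklearn)
```

Both versions map the smallest value in each column to 0 and the largest to 1.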

Normalization is another scaling method that you may have seen before. It involves dividing each value in a vector by a norm of that vector. L1 normalization divides by the sum of the absolute values. L2 normalization divides by the magnitude of the vector (the square root of the sum of the squares), so if you know what a unit vector is, then you've seen L2 normalization before.
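The sketch below shows L1 and L2 normalization on a made-up array, by hand and with scikit-learn's normalize function. Note that normalize works on each sample (row) by default, unlike the scalers above, which work on each feature (column).

```python
import numpy as np
from sklearn.preprocessing import normalize

# Toy data (an assumption for this sketch): each row is one sample vector.
X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# L1: divide each row by the sum of the absolute values of its entries.
X_l1_manual = X / np.abs(X).sum(axis=1, keepdims=True)

# L2: divide each row by its magnitude (Euclidean length).
X_l2_manual = X / np.linalg.norm(X, axis=1, keepdims=True)

# scikit-learn equivalents (row-wise by default).
X_l1_sklearn = normalize(X, norm="l1")
X_l2_sklearn = normalize(X, norm="l2")

print(X_l1_sklearn)
print(X_l2_sklearn)
```

After L2 normalization every row is a unit vector. For example, the row [3, 4] has magnitude 5 and becomes [0.6, 0.8].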

Check: Why do we scale data?

Demo: Scaling in Python (20 mins)

Use the starter code to walk through the demo of different scaling methods.

Guided Practice: Normalization (15 mins)

Practice scaling by normalization using pure Python and scikit-learn with the starter code.

Apply L1 and L2 normalization using Python (5-10 mins)
Apply L1 and L2 normalization using scikit-learn (5-10 mins)

Solution code

Independent Practice: Scaling and Linear Regression (35 mins)

Practice scaling and linear fits. Does normalization affect any of your models? (10-20 mins)
Try some regularized models. Does scaling have a significant effect? (10 mins)
Try some other models from scikit-learn, such as an SGDRegressor. It's OK if you are unfamiliar with the model; just follow the example code and explore the fit and the effect of scaling. (10 mins)
Bonus: try a few extra models like a support vector machine. What do you think about the goodness of fit? Scaling is required for this model.

Scaling typically doesn't affect the fit of an ordinary linear regression. The Bonus exercise asks students to choose another model for which scaling is necessary. Students may need a little guidance, but they really only need to change one line (the model).

If students don't make it to the bonus exercise, take a few minutes at the end to show them that scaling does sometimes matter, e.g. for an SGDRegressor. The support vector machine is a great fit!
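One possible way to demonstrate the point is sketched below. It compares training R^2 with and without standardization for LinearRegression and for Ridge. Ridge is used here as a stand-in for "a model where scaling matters" because its fit is deterministic. The synthetic data, the r2_raw_and_scaled helper, and the choice of alpha=1.0 are assumptions made for this example and are not taken from the solution code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch): one feature on a scale
# of about 0.01 and another on a scale of about 1000.
rng = np.random.default_rng(42)
n_samples = 200
X = np.column_stack([
    rng.normal(0, 0.01, n_samples),
    rng.normal(0, 1000, n_samples),
])
y = 100 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, n_samples)


def r2_raw_and_scaled(model_class, **params):
    """Return training R^2 for a model fit on raw and on standardized features."""
    raw_model = model_class(**params).fit(X, y)
    scaled_model = make_pipeline(StandardScaler(), model_class(**params)).fit(X, y)
    return raw_model.score(X, y), scaled_model.score(X, y)


lin_raw, lin_scaled = r2_raw_and_scaled(LinearRegression)
ridge_raw, ridge_scaled = r2_raw_and_scaled(Ridge, alpha=1.0)

print("LinearRegression R^2 raw / scaled:", lin_raw, lin_scaled)
print("Ridge R^2 raw / scaled:           ", ridge_raw, ridge_scaled)
```

The two LinearRegression scores should agree, since rescaling the features only rescales the coefficients. The Ridge scores should differ, because the penalty on the coefficients hits the small-scale feature much harder when the data is left unscaled. Swapping Ridge for another model is a one-line change in the calls to r2_raw_and_scaled.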

Solution code

Check: Does scaling affect linear regression?

Conclusion (5 mins)

Let's review! Discuss a few of the reasons that scaling is important:

To handle disparities in units
Because many machine learning models require scaling
It can speed up gradient descent

ADDITIONAL RESOURCES

Feature scaling the wine dataset.
Some examples of regression with the Boston dataset.
