# Feature Scaling

Week 3 | Lesson 4.2

### LEARNING OBJECTIVES

*After this lesson, you will be able to:*

- Use the scikit-learn preprocessing module to normalize the data in various ways

### STUDENT PRE-WORK

*Before this lesson, you should already be able to:*

- Load and manipulate data with Pandas
- Fit models with sklearn

### INSTRUCTOR PREP

*Before this lesson, instructors will need to:*

- Read in / Review any dataset(s) & starter/solution code
- Generate a brief slide deck
- Prepare any specific materials
- Provide students with additional resources

### STARTER CODE

### LESSON GUIDE

| TIMING | TYPE | TOPIC |
| --- | --- | --- |
| 5 min | Opening | Discussion |
| 15 min | Introduction | Feature Scaling |
| 20 min | Demo | Scaling in Python |
| 15 min | Guided Practice | Normalization |
| 35 min | Independent Practice | Scaling and Linear Regression |
| 5 min | Conclusion | Review, Recap |

## Opening (5 mins)

- Review prior labs/homework, upcoming projects, or exit tickets, when applicable
- Review lesson objectives
- Discuss real world relevance of these topics
- Relate topics to the Data Science Workflow - i.e. are these concepts typically used to acquire, parse, clean, mine, refine, model, present, or deploy?

## Introduction: Feature Scaling (15-20 mins)

When working with new datasets, we always need to process the data. As we've seen, it's usually necessary to convert strings to numbers, handle date formats, and toss out bad data points. It's also often necessary to scale our data, and it is rarely a bad idea to do so.

### Why scale data?

There are a number of good reasons why we scale our data:

- To handle disparities in units
- Because many machine learning models require scaling
- It can speed up gradient descent

The scikit-learn documentation lays it out pretty clearly:

> Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

The reason we scale for gradient descent is to prevent the step sizes along different axes from being wildly different. Large scale disparities make it difficult to choose a single good learning rate: a rate small enough to stay stable along a larger-scale feature will make painfully slow progress along a smaller-scale feature, while a larger rate will overshoot along the larger-scale feature.
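To see the problem concretely, here is a small sketch with made-up data (the feature ranges and coefficients are illustrative, not from the lesson dataset): when one feature spans hundreds of units and another spans single digits, the components of the gradient differ by orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features on very different scales, e.g. square feet vs. number of rooms
X = np.column_stack([rng.uniform(500, 3000, 100),  # large-scale feature
                     rng.uniform(1, 5, 100)])      # small-scale feature
y = 0.5 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, 100)

w = np.zeros(2)
# Gradient of the mean squared error with respect to the weights
grad = -2 * X.T @ (y - X @ w) / len(y)
print(grad)  # the first component dwarfs the second
```

No single learning rate handles both components well; after standardizing the features, the gradient components are on comparable scales.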

The good news is that it's rarely a bad idea to scale your data. So it's a good practice to apply consistently, and to master early in your progression as a data scientist.

### How do we scale our data?

Typically we scale features in one of a few standard ways. For example, a
common method called *standardization* takes a feature and rescales it to
have mean zero and variance 1, like a *standard* normal distribution. We do this
by computing the mean and standard deviation, and then transforming the data like so:

```
x' = (x - mean) / std_dev
```
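In scikit-learn this transform is provided by `StandardScaler` in the preprocessing module (the tiny dataset below is made up just to show the mechanics):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean())  # ~0
print(X_scaled.std())   # ~1
```

Note that the scaler learns the mean and standard deviation in `fit`, so the same transform can later be applied to new data with `scaler.transform`.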

Another common method is called *Min-Max Scaling* or simply *rescaling*. In this
case we use the feature's minimum and maximum values to rescale the data into
the interval `[0, 1]` by transforming:

```
x' = (x - min) / (max - min)
```
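The corresponding scikit-learn class is `MinMaxScaler` (again with a toy array for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = MinMaxScaler()  # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())  # 0, 1/3, 2/3, 1
```

`feature_range` can be set to map into an interval other than `(0, 1)` if the model calls for it.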

*Normalization* is another scaling method that you may have seen before. Rather
than scaling each feature (column), it scales each sample (row): we divide by
either the sum of the absolute values of the entries (L1 normalization) or by
the Euclidean length of the row (L2 normalization). If you know what a unit
vector is, then you've seen L2 normalization before: dividing a vector by its
magnitude (the square root of the sum of the squares) produces a unit vector.

**Check**: Why do we scale data?

## Demo: Scaling in Python (20 mins)

Use the starter code to walk through the demo of different scaling methods.

## Guided Practice: Normalization (15 mins)

Practice scaling by normalization using pure Python and scikit-learn with the starter code.

- Apply L1 and L2 normalization using pure Python (5-10 mins)
- Apply L1 and L2 normalization using scikit-learn (5-10 mins)

## Independent Practice: Scaling and Linear Regression (35 minutes)

- Practice scaling and linear fits. Does normalization affect any of your models? (10-20 mins)
- Try some regularized models. Does scaling have a significant effect? (10 mins)
- Try some other models from scikit-learn, such as a SGDRegressor. It's ok if you are unfamiliar with the model, just follow the example code and explore the fit and the effect of scaling. (10 mins)
- **Bonus**: try a few extra models like a support vector machine. What do you think about the goodness of fit? Scaling is *required* for this model.
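If helpful for the demo, the scaling-plus-model workflow can be sketched as a pipeline (the data here is synthetic, standing in for the lesson dataset):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(500, 3000, 200), rng.uniform(1, 5, 200)])
y = 0.01 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 200)

# Scale, then fit: the pipeline applies StandardScaler before SGDRegressor
model = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
model.fit(X, y)
print(model.score(X, y))  # R^2; without the scaler, SGD can be unstable here
```

Swapping `SGDRegressor` for another estimator (e.g. a support vector machine) only requires changing that one line, which is exactly the point of the bonus exercise.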

Scaling typically doesn't affect linear regression. The bonus exercise asks students to choose another model for which scaling is necessary. Students may need a little guidance, but they really only need to change one line (the model).

If students don't make it to the bonus exercise, take a few minutes at the end to show them that scaling does sometimes matter, e.g. for an SGDRegressor. The support vector machine is a great fit!

**Check:** Does scaling affect linear regression?

## Conclusion (5 mins)

Let's review! Discuss a few of the reasons that scaling is important:

- To handle disparities in units
- Because many machine learning models require scaling
- It can speed up gradient descent

### ADDITIONAL RESOURCES

- Feature scaling the wine dataset.
- Some examples of regression with the Boston dataset.