Categorical & Dummy Variables
Week 2 | Lesson 3.3
LEARNING OBJECTIVES
After this lesson, you will be able to:
- Use get_dummies and other methods to convert categorical data to numerical data
- Create indicator variable (0 or 1) columns from categorical data
INSTRUCTOR PREP
Before this lesson, instructors will need to:
- Read in / Review any dataset(s) & starter/solution code
- Generate a brief slide deck
STARTER CODE
LESSON GUIDE
TIMING | TYPE | TOPIC |
---|---|---|
10 min | Introduction | Categorical & Dummy Variables |
25 min | Demo / Guided Practice | Categorical Variables |
25 min | Demo / Guided Practice | Dummy Variables |
25 min | Independent Practice | |
5 min | Conclusion | |
Introduction: Categorical & Dummy Variables (10 mins)
Regression analysis is designed for numerical variables. Its results have a valid interpretation only if it makes sense to assume that a value of 2 on some variable does indeed mean twice as much of something as a value of 1, and that a value of 50 means 50 times as much. Sometimes, however, you need to work with categorical variables whose values have no real numerical relationship with each other. The solution is to use categorical and dummy variables.
A categorical variable is an independent or predictor variable that contains values indicating membership in one of several possible categories. Examples include gender (male or female) and marital status (married, single, divorced, widowed). The categories are often assigned numerical values that serve only as labels, e.g., 0 = male; 1 = female.
A dummy variable is created by recoding categorical variables that have more than two categories into a series of binary variables.
Here is more information on different types of variables.
Demo / Guided Practice: Categorical Variables (25 mins)
Why exactly would you want to use categorical variables? The categorical data type is useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
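The first two cases can be illustrated with a minimal sketch. The toy Series and the variable names below are assumptions made up for illustration, not part of the demo code.

```python
import pandas as pd

# Hypothetical toy data: a string column with only three distinct values.
sizes = pd.Series(["one", "three", "two", "one", "three"] * 1000)

# Memory: a categorical stores each distinct string once plus small integer codes.
as_object = sizes.memory_usage(deep=True)
as_category = sizes.astype("category").memory_usage(deep=True)

# Logical order: declare the order explicitly so sorting and min/max follow it.
order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
ordered_sizes = sizes.astype(order)

lexical_max = sizes.max()          # plain strings compare alphabetically
logical_max = ordered_sizes.max()  # ordered categorical compares by declared order

print(as_object, as_category)
print(lexical_max, logical_max)
```

With plain strings the maximum is "two" because it sorts last alphabetically; with the ordered categorical the maximum is "three".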
Let's use pandas to create a few Categorical Series. Here is a link to the demo code.
One way is by specifying dtype="category" when constructing a Series:
import pandas as pd
s = pd.Series(["a","b","c","a"], dtype="category")
s
Another way is to convert an existing Series or column to a category dtype:
df = pd.DataFrame({"A":["a","b","c","a"]})
df["B"] = df["A"].astype('category')
df
You can also pass a pandas.Categorical object to a Series or assign it to a DataFrame.
raw_cat = pd.Categorical(["a","b","c","a"], categories=["b","c","d"],
                         ordered=False)
s = pd.Series(raw_cat)
s
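Note that the categories passed above do not include "a". A short sketch of how to inspect the result, using the standard `.cat` accessor on the same Series:

```python
import pandas as pd

raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"],
                         ordered=False)
s = pd.Series(raw_cat)

categories = list(s.cat.categories)  # the declared categories
codes = list(s.cat.codes)            # integer code per row; -1 marks a missing value
n_missing = int(s.isna().sum())      # "a" is not a declared category, so it becomes NaN

print(categories, codes, n_missing)
```

Values outside the declared categories are stored as NaN, so it is worth checking the category list before converting real data.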
Check: Why would you use a categorical variable?
Demo / Guided Practice: Dummy Variables (25 mins)
As mentioned above, a dummy variable is created by recoding categorical variables that have more than two categories into a series of binary variables. For example, marital status, if originally labelled 1 = married, 2 = single, and 3 = divorced, widowed, or separated, could be redefined in terms of two variables as follows: var_1: 1 = single, 0 = otherwise; var_2: 1 = divorced, widowed, or separated, 0 = otherwise. A married respondent then has 0 on both variables, so married serves as the reference category.
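The marital status recoding can be sketched in pandas as follows. The sample values, the label strings, and the `marital` prefix are assumptions chosen for illustration; `drop_first=True` is one way to leave out the reference category.

```python
import pandas as pd

# Hypothetical data using the coding from the text:
# 1 = married, 2 = single, 3 = divorced, widowed, or separated.
marital = pd.Series([1, 2, 3, 2, 1], name="marital")

# Illustrative labels; fixing the category order makes "married" the first category.
labels = {1: "married", 2: "single", 3: "div_wid_sep"}
marital_cat = marital.map(labels).astype(
    pd.CategoricalDtype(categories=["married", "single", "div_wid_sep"])
)

# drop_first=True omits the first category ("married"), which becomes the baseline.
dummies = pd.get_dummies(marital_cat, prefix="marital", drop_first=True, dtype=int)
print(dummies)
```

Three categories produce two indicator columns; rows for married respondents contain 0 in both.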
Let's use pd.get_dummies to convert categorical variables into dummy variables. First let's create a small DataFrame with categorical variables.
df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})
Now, let's convert the categorical variables into dummy variables.
pd.get_dummies(df['key'])
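In practice the dummy columns usually need to sit next to the other columns. A minimal sketch, assuming a `key` prefix and integer output are wanted (both are choices made here, not requirements):

```python
import pandas as pd

df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})

# One indicator column per distinct value of 'key', named key_a, key_b, key_c.
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Keep the numeric column and attach the indicator columns alongside it.
df_with_dummies = df[['data1']].join(dummies)
print(df_with_dummies)
```

Each row has exactly one indicator set to 1, matching that row's original `key` value.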
Check: Why are dummy variables useful?
Independent Practice (25 mins)
Use the Crunchbase data set to:
- Create dummy variables based on the Market column
- Clean the funding_total_usd column (it's the wrong data type)
- Create some pivots
Bonus: Extract the different categories, e.g. "Games|Electronics" contains 2 categories
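One possible approach to these tasks, sketched on made-up rows rather than the real data set. The column names `market` and `category_list`, the comma-formatted funding strings, and the "-" placeholder are all assumptions; check the actual file before reusing any of this.

```python
import pandas as pd

# Hypothetical rows; the real column names and value formats may differ.
toy = pd.DataFrame({
    "market": ["Games", "Software", "Games"],
    "category_list": ["Games|Electronics", "Software", "Games"],
    "funding_total_usd": ["1,750,000", "-", "40,000"],
})

# Strip thousands separators, then coerce non-numeric placeholders to NaN.
toy["funding_total_usd"] = pd.to_numeric(
    toy["funding_total_usd"].str.replace(",", "", regex=False), errors="coerce"
)

# One indicator column per market.
market_dummies = pd.get_dummies(toy["market"], prefix="market", dtype=int)

# Bonus: split pipe-delimited strings into one indicator column per category.
category_dummies = toy["category_list"].str.get_dummies(sep="|")

# A simple pivot: mean funding per market.
funding_by_market = toy.pivot_table(
    index="market", values="funding_total_usd", aggfunc="mean"
)
print(category_dummies)
print(funding_by_market)
```

`Series.str.get_dummies` handles the multi-category case directly, so a row such as "Games|Electronics" gets a 1 in both the Games and the Electronics columns.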
Conclusion (5 mins)
We learned that categorical and dummy variables let us represent non-numeric data in a form that analysis tools can work with. Categorical variables are useful when a string column has only a few distinct values or when the lexical order of a variable is not the same as its logical order; dummy variables turn those categories into 0 or 1 indicator columns. When we start Regression in Week 3, it will become even more apparent how valuable these tools are for managing our data and making it easier to analyze.