# Intro to Data Cleaning

Week 2 | Lesson 2.3

### LEARNING OBJECTIVES

After this lesson, you will be able to:

• Inspect data types
• Clean up a column using df.apply()
• Recognize when to use .value_counts() in your code

### INSTRUCTOR PREP

Before this lesson, instructors will need to:

• Read in / Review any dataset(s) & starter/solution code
• Generate a brief slide deck


### LESSON GUIDE

| TIMING | TYPE | TOPIC |
| --- | --- | --- |
| 5 min | Introduction | Inspect data types, df.apply(), .value_counts() |
| 20 min | Demo / Guided Practice | Inspect data types |
| 20 min | Demo / Guided Practice | df.apply() |
| 20 min | Demo / Guided Practice | .value_counts() |
| 20 min | Independent Practice | |
| 5 min | Conclusion | |

## Introduction: Topic (5 mins)

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add a couple more tools to our toolbox.

The main data types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta[ns], category, and object.

df.apply() will apply a function along any axis of the DataFrame. We'll see it in action below.

pandas.Series.value_counts returns a Series containing counts of unique values. The resulting Series is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.

## Demo / Guided Practice: Inspect data types (20 mins)

Let's create a small dictionary with different data types in it.

Instructor Note: The demo code contains all the code for this lesson in a Jupyter notebook. Use it to review the following code output:

In a Jupyter notebook, type:

```python
import pandas as pd
import numpy as np

dft = pd.DataFrame(dict(A=np.random.rand(3),
                        B=1,
                        C='foo',
                        D=pd.Timestamp('20010102'),
                        E=pd.Series([1.0] * 3).astype('float32'),
                        F=False,
                        G=pd.Series([1] * 3, dtype='int8')))
dft
```

There is a really easy way to see what kind of dtypes are in each column.

```python
dft.dtypes
```

If a pandas object contains data of multiple dtypes **in a single column**, the dtype of the column is chosen to accommodate all of the data types (object is the most general).

```python
# these ints are coerced to floats
pd.Series([1, 2, 3, 4, 5, 6.])
```

```python
# string data forces an object dtype
pd.Series([1, 2, 3, 6., 'foo'])
```

To count the number of columns of each dtype in a DataFrame, chain .dtypes with .value_counts() (the older get_dtype_counts() method was removed in pandas 1.0):

```python
dft.dtypes.value_counts()
```

You can do a lot more with dtypes; see the pandas documentation for details.

Check: Why do you think it might be important to know what kind of dtypes you're working with?
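One concrete answer to the check above (a minimal sketch): numeric data sometimes arrives as strings, e.g. from a messy CSV, and arithmetic on an object column behaves very differently from arithmetic on a numeric column.

```python
import pandas as pd

# A column of numbers read in as strings ends up with the object dtype
s = pd.Series(['1', '2', '3'])
print(s.dtype)

# '+' on string elements concatenates instead of adding:
# '1' + '1' gives '11', not 2
print(s + s)

# Converting to a numeric dtype restores real arithmetic
s_num = s.astype(int)
print(s_num + s_num)
```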

## Demo / Guided Practice: df.apply() (20 mins)

Let's create a small data frame.

```python
df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
df
```

Use df.apply to find the square root of all the values.

```python
df.apply(np.sqrt)
```

Find the mean of all of the columns.

```python
df.apply(np.mean, axis=0)
```

Find the mean of all of the rows.

```python
df.apply(np.mean, axis=1)
```

Check: How would you find the standard deviation of the columns and rows?
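One possible answer to the check, following the same pattern as the mean examples (note that np.std uses ddof=0 by default, while pandas' own df.std() uses ddof=1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])

# Standard deviation of each column: one value per column
col_std = df.apply(np.std, axis=0)

# Standard deviation of each row: one value per row
row_std = df.apply(np.std, axis=1)

print(col_std.shape, row_std.shape)
```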

## Demo / Guided Practice: .value_counts() (20 mins)

Let's create a random array with 50 numbers, ranging from 0 to 7.

```python
data = np.random.randint(0, 7, size=50)
```

Convert the array into a series.

```python
s = pd.Series(data)
```

How many of each number are there in the series? Enter .value_counts() (the top-level pd.value_counts(s) form is deprecated in recent pandas, so call the method on the Series):

```python
s.value_counts()
```
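A quick sketch of two behaviors worth pointing out during the demo: the result is sorted by count in descending order, and normalize=True returns relative frequencies instead of raw counts.

```python
import pandas as pd

s = pd.Series([2, 2, 2, 0, 1, 1])

# Sorted by count, descending: 2 (three occurrences) comes first
counts = s.value_counts()
print(counts)

# normalize=True gives each value's share of the total instead of a count
print(s.value_counts(normalize=True))
```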

## Independent Practice: Topic (20 mins)

• Use the sales.csv data set - we've seen this a few times in previous lessons!
• Inspect the data types
• You've found out that all your values in column 1 are off by 1. Use df.apply to add 1 to column 1 of the dataset
• Use .value_counts to count the values of 1 column of the dataset

Bonus

• Add 3 to column 2
• Use .value_counts for each column of the dataset
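The exercise above can be sketched as follows. Note this uses a small stand-in DataFrame because the real sales.csv and its column names aren't shown here; 'volume_sold' is a hypothetical column name standing in for "column 1".

```python
import pandas as pd

# Hypothetical stand-in for sales.csv -- the real file's columns may differ
df = pd.DataFrame({'volume_sold': [10, 12, 10],
                   'margin': [0.2, 0.3, 0.2]})

# Inspect the data types
print(df.dtypes)

# All values in column 1 are off by 1: use .apply to add 1 to each
df['volume_sold'] = df['volume_sold'].apply(lambda x: x + 1)

# Count the values of one column
print(df['volume_sold'].value_counts())
```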

## Conclusion (5 mins)

So far we've used pandas to look at the head and tail of a dataset, examined summary statistics and data types, and selected and sliced data. Today we added inspecting data types, df.apply(), and .value_counts() to our pandas arsenal. Nice!