Intro to Data Cleaning
Week 2 | Lesson 2.3
After this lesson, you will be able to:
- Inspect data types
- Clean up a column using df.apply()
- Know when to use .value_counts() in your code
Before this lesson, instructors will need to:
- Read in / Review any dataset(s) & starter/solution code
- Generate a brief slide deck
| 5 min | Introduction | Inspect data types, df.apply(), .value_counts() |
| 20 min | Demo / Guided Practice | Inspect data types |
| 20 min | Demo / Guided Practice | df.apply() |
| 20 min | Demo / Guided Practice | .value_counts() |
| 20 min | Independent Practice | |
Introduction: Topic (5 mins)
Since we're starting to get pretty comfortable with using pandas to do EDA, let's add a couple more tools to our toolbox.
The main data types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta64[ns], category, and object.
df.apply() will apply a function along any axis of the DataFrame. We'll see it in action below.
pandas.Series.value_counts() returns a Series containing counts of unique values. The resulting Series is in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
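As a quick illustration of the NA behavior, here is a minimal sketch. The `colors` Series and its values are made up for this example; `dropna=False` is the value_counts() parameter that keeps missing values in the tally.

```python
import numpy as np
import pandas as pd

# Hypothetical values; np.nan stands in for a missing entry.
colors = pd.Series(['red', 'blue', 'red', np.nan, 'red', 'blue'])

# By default the missing entry is left out of the counts.
counts = colors.value_counts()
print(counts)

# dropna=False keeps the missing entry as its own row.
counts_with_na = colors.value_counts(dropna=False)
print(counts_with_na)
```

The first result covers five of the six entries, while the second accounts for all six.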
Demo / Guided Practice: Inspect data types (20 mins)
Let's create a small DataFrame from a dictionary whose values have different data types.
Instructor Note: The demo code contains all the code for this lesson in a Jupyter notebook. Use it to review the following code output:
In an IPython (Jupyter) notebook, type:
import pandas as pd
import numpy as np

# One column per dtype; the scalar values (B, C, D, F) are broadcast to three rows.
dft = pd.DataFrame(dict(A = np.random.rand(3),
                        B = 1,
                        C = 'foo',
                        D = pd.Timestamp('20010102'),
                        E = pd.Series([1.0]*3).astype('float32'),
                        F = False,
                        G = pd.Series([1]*3, dtype='int8')))
dft
There is a really easy way to see what kind of dtypes are in each column.
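The easy way is most likely the `dtypes` attribute of the DataFrame. The sketch below rebuilds the demo frame so that it runs on its own; the exact dtype shown for some columns (for example the timestamp and string columns) can vary with the pandas version.

```python
import numpy as np
import pandas as pd

# Same frame as in the demo, rebuilt here so this cell is self-contained.
dft = pd.DataFrame(dict(A=np.random.rand(3),
                        B=1,
                        C='foo',
                        D=pd.Timestamp('20010102'),
                        E=pd.Series([1.0] * 3).astype('float32'),
                        F=False,
                        G=pd.Series([1] * 3, dtype='int8')))

# .dtypes is an attribute (no parentheses). It returns a Series that maps
# each column name to that column's dtype.
print(dft.dtypes)
```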
If a pandas object contains data of multiple dtypes in a single column, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).
# these ints are coerced to floats
pd.Series([1, 2, 3, 4, 5, 6.])

# string data forces an object dtype
pd.Series([1, 2, 3, 6., 'foo'])
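Older versions of pandas provided get_dtype_counts() to return the number of columns of each dtype in a DataFrame. That method was removed in pandas 1.0; calling .value_counts() on the dtypes attribute gives the same counts: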
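A minimal sketch of counting columns per dtype on current pandas. The `example` frame and its column names are hypothetical and only serve to keep the cell self-contained; the same call works on `dft`.

```python
import pandas as pd

# Hypothetical three-column frame: two float columns and one int column.
example = pd.DataFrame({'x': [1.5, 2.5], 'y': [1, 2], 'z': [3.0, 4.0]})

# .dtypes is a Series of dtypes, so .value_counts() tallies
# how many columns share each dtype.
dtype_counts = example.dtypes.value_counts()
print(dtype_counts)
```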
You can do a lot more with dtypes; the pandas documentation covers them in more detail.
Check: Why do you think it might be important to know what kind of dtypes you're working with?
Demo / Guided Practice: df.apply() (20 mins)
Let's create a small data frame.
df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
df
Use df.apply to find the square root of all the values.
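One way to do this is to pass `np.sqrt` to df.apply(), as sketched below. The frame is rebuilt so the cell runs on its own. Because `np.random.randn` draws from a standard normal distribution, roughly half of the values are negative, and their square roots come back as NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])

# np.sqrt is applied to each column in turn (axis=0 is the default).
# Negative entries produce NaN, and NumPy may emit a RuntimeWarning for them.
roots = df.apply(np.sqrt)
print(roots)
```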
Find the mean of all of the columns.
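A sketch of the column means, assuming the same kind of random frame as above. With `axis=0`, df.apply() hands each column to the function, so the result has one entry per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])

# axis=0: the function receives one column at a time.
col_means = df.apply(np.mean, axis=0)
print(col_means)
```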
Find the mean of all of the rows.
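A sketch of the row means, again on a freshly built random frame. With `axis=1`, df.apply() hands each row to the function, so the result has one entry per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])

# axis=1: the function receives one row at a time.
row_means = df.apply(np.mean, axis=1)
print(row_means)
```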
Check: How would you find the std of the columns and rows?
Demo / Guided Practice: .value_counts() (20 mins)
Let's create a random array of 50 integers ranging from 0 to 6 (the upper bound passed to np.random.randint, 7, is exclusive).
data = np.random.randint(0, 7, size = 50)
Convert the array into a series.
s = pd.Series(data)
How many of each number are there in the series? Enter value_counts():
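Put together, the steps look roughly like the sketch below. The data is random, so the actual counts differ on every run.

```python
import numpy as np
import pandas as pd

# 50 random integers from 0 to 6 (the upper bound 7 is exclusive).
data = np.random.randint(0, 7, size=50)
s = pd.Series(data)

# The index holds the distinct values and the values hold their counts,
# sorted from most to least frequent.
counts = s.value_counts()
print(counts)
```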
Independent Practice: Topic (20 minutes)
- Use the sales.csv data set - we've seen this a few times in previous lessons!
- Inspect the data types
- You've found out that all your values in column 1 are off by 1. Use df.apply to add 1 to column 1 of the dataset
- Use .value_counts to count the values of 1 column of the dataset
- Add 3 to column 2
- Use .value_counts for each column of the dataset
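For instructors who want a starting point, here is a rough sketch of the first few steps. The layout of sales.csv is not shown in this lesson, so the `sales` frame below is a made-up stand-in with hypothetical column names; in class, replace it with the frame read from the real file. The sketch uses Series.apply on a single column, which is one reasonable reading of "use df.apply" for a one-column fix.

```python
import pandas as pd

# Stand-in for pd.read_csv('sales.csv'). The column names and values
# are hypothetical, chosen only so this example can run.
sales = pd.DataFrame({'units': [3, 5, 3, 8], 'price': [10, 12, 10, 9]})

# Inspect the data types.
print(sales.dtypes)

# Add 1 to every value in one column.
sales['units'] = sales['units'].apply(lambda x: x + 1)

# Count the values in that column.
unit_counts = sales['units'].value_counts()
print(unit_counts)
```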
Conclusion (5 mins)
So far we've used pandas to look at the head and tail of a data set. We've also looked at summary stats and the different data types. We've selected and sliced data too. Today we added inspecting data types, df.apply(), and .value_counts() to our pandas arsenal. Nice!