Intro to Data Cleaning

Week 2 | Lesson 2.3


After this lesson, you will be able to:

  • Inspect data types
  • Clean up a column using df.apply()
  • Know when to use .value_counts() in your code


Before this lesson, instructors will need to:

  • Read in / Review any dataset(s) & starter/solution code
  • Generate a brief slide deck




5 min Introduction Inspect data types, df.apply(), .value_counts()
20 min Demo /Guided Practice Inspect data types
20 min Demo /Guided Practice df.apply()
20 min Demo /Guided Practice .value_counts()
20 min Independent Practice
5 min Conclusion

Introduction: Topic (5 mins)

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add a couple more tools to our toolbox.

The main data types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta64[ns], category, and object.

df.apply() will apply a function along any axis of the DataFrame. We'll see it in action below.

pandas.Series.value_counts returns a Series containing counts of unique values. The resulting Series is in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
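As a quick illustration of the ordering and the NA handling described above, here is a minimal sketch. The colors Series is hypothetical sample data made up for this example, not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
colors = pd.Series(['red', 'blue', 'red', np.nan, 'red', 'blue', 'green'])

# Most frequent value comes first; the NaN entry is left out
counts = colors.value_counts()
print(counts)

# Pass dropna=False to count missing values as well
counts_with_na = colors.value_counts(dropna=False)
print(counts_with_na)
```

The dropna=False option is handy during cleaning, since it shows how many values are missing alongside the real categories.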

Demo /Guided Practice: Inspect data types (20 mins)

Let's create a small DataFrame from a dictionary with different data types in it.

Instructor Note: The demo code contains all the code for this lesson in a Jupyter notebook. Use it to review the following code output:

In a Jupyter (IPython) notebook, type:

import pandas as pd
import numpy as np
dft = pd.DataFrame(dict(A = np.random.rand(3),
                        B = 1,
                        C = 'foo',
                        D = pd.Timestamp('20010102'),
                        E = pd.Series([1.0]*3).astype('float32'),
                        F = False,
                        G = pd.Series([1]*3, dtype='int8')))

There is a really easy way to see what kind of dtypes are in each column.
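One way to do this is the dtypes attribute of the DataFrame. The sketch below rebuilds the same dft so that it runs on its own; the exact dtype names printed for the string and timestamp columns can vary between pandas versions.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the demo, rebuilt so this snippet is self-contained
dft = pd.DataFrame(dict(A=np.random.rand(3),
                        B=1,
                        C='foo',
                        D=pd.Timestamp('20010102'),
                        E=pd.Series([1.0] * 3).astype('float32'),
                        F=False,
                        G=pd.Series([1] * 3, dtype='int8')))

# .dtypes is an attribute (no parentheses); it returns a Series
# that maps each column name to that column's dtype
print(dft.dtypes)
```

For a single column, dft['A'].dtype gives the same information for just that Series.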


If a pandas object contains data of multiple dtypes in a single column, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).

# these ints are coerced to floats
pd.Series([1, 2, 3, 4, 5, 6.])
# string data forces an ``object`` dtype
pd.Series([1, 2, 3, 6., 'foo'])

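The method get_dtype_counts() returned the number of columns of each type in a DataFrame. It was deprecated in pandas 0.25 and removed in pandas 1.0; on current versions, dft.dtypes.value_counts() gives the same information.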
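A sketch of that count using dft.dtypes.value_counts(), which runs on both older and current pandas releases. The DataFrame is rebuilt here only so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the demo, rebuilt so this snippet is self-contained
dft = pd.DataFrame(dict(A=np.random.rand(3),
                        B=1,
                        C='foo',
                        D=pd.Timestamp('20010102'),
                        E=pd.Series([1.0] * 3).astype('float32'),
                        F=False,
                        G=pd.Series([1] * 3, dtype='int8')))

# dft.dtypes is itself a Series, so value_counts() tallies
# how many columns share each dtype
dtype_counts = dft.dtypes.value_counts()
print(dtype_counts)
```

In this frame every column has a different dtype, so each count is 1.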


You can do a lot more with dtypes; the pandas documentation on dtypes covers the details.

Check: Why do you think it might be important to know what kind of dtypes you're working with?

Demo /Guided Practice: df.apply() (20 mins)

Let's create a small data frame.

df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])

Use df.apply to find the square root of all the values.
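A minimal sketch of this step, assuming the df defined above (it is rebuilt here so the snippet runs on its own). Because np.random.randn produces negative numbers as well as positive ones, some square roots come back as NaN; the np.errstate wrapper and the abs() variant are additions for illustration, not part of the original demo.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])

# np.sqrt is applied to each column in turn; negative inputs become NaN.
# errstate silences NumPy's "invalid value" warning for those entries.
with np.errstate(invalid='ignore'):
    roots = df.apply(np.sqrt)
print(roots)

# One option if NaN is not wanted: take the absolute value first
abs_roots = df.abs().apply(np.sqrt)
print(abs_roots)
```

The NaN values are a useful talking point: they show where the input was negative rather than raising an error.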


Find the mean of all of the columns.

df.apply(np.mean, axis=0)

Find the mean of all of the rows.

df.apply(np.mean, axis=1)


Check: How would you find the standard deviation of the columns and rows?

Demo /Guided Practice: .value_counts() (20 mins)

Let's create a random array with 50 integers, ranging from 0 to 6 (the upper bound passed to np.random.randint, 7, is exclusive).

data = np.random.randint(0, 7, size = 50)

Convert the array into a series.

s = pd.Series(data)

How many of each number are there in the series? Use value_counts() to find out.
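A short sketch of that call, repeating the array and Series setup so it runs on its own. The numbers differ on every run because the data is random.

```python
import numpy as np
import pandas as pd

# 50 random integers from 0 to 6 (the upper bound 7 is exclusive)
data = np.random.randint(0, 7, size=50)
s = pd.Series(data)

# Index = the distinct values, values = how often each one occurs,
# sorted from most frequent to least frequent
counts = s.value_counts()
print(counts)
```

To list the results by value instead of by frequency, s.value_counts().sort_index() is one option.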


Independent Practice: Topic (20 minutes)

  • Use the sales.csv data set - we've seen this a few times in previous lessons!
  • Inspect the data types
  • You've found out that all your values in column 1 are off by 1. Use df.apply to add 1 to column 1 of the dataset
  • Use .value_counts to count the values of one column of the dataset


  • Add 3 to column 2
  • Use .value_counts for each column of the dataset

Conclusion (5 mins)

So far we've used pandas to look at the head and tail of a data set. We've also taken a look at summary stats and different data types, and we've selected and sliced data. Today we added inspecting data types, df.apply, and .value_counts to our pandas arsenal. Nice!
