# Intro to Data Cleaning

Week 2 | Lesson 2.3

### LEARNING OBJECTIVES

After this lesson, you will be able to:

• Inspect data types
• Clean up a column using df.apply()
• Recognize when to use .value_counts() in your code

### INSTRUCTOR PREP

Before this lesson, instructors will need to:

• Read in / Review any dataset(s) & starter/solution code
• Generate a brief slide deck


### LESSON GUIDE

| TIMING | TYPE | TOPIC |
| --- | --- | --- |
| 5 min | Introduction | Inspect data types, df.apply(), .value_counts() |
| 20 min | Demo / Guided Practice | Inspect data types |
| 20 min | Demo / Guided Practice | df.apply() |
| 20 min | Demo / Guided Practice | .value_counts() |
| 20 min | Independent Practice | |
| 5 min | Conclusion | |

## Introduction: Topic (5 mins)

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add a couple more tools to our toolbox.

The main data types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta[ns], category, and object.

df.apply() will apply a function along any axis of the DataFrame. We'll see it in action below.

pandas.Series.value_counts returns a Series containing counts of unique values. The resulting Series is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.

## Demo / Guided Practice: Inspect data types (20 mins)

Let's create a small dictionary with different data types in it.

Instructor Note: The demo code contains all the code for this lesson in a Jupyter notebook. Use it to review the following code output:

In a Jupyter notebook, type:

```python
import pandas as pd
import numpy as np

dft = pd.DataFrame(dict(A=np.random.rand(3),
                        B=1,
                        C='foo',
                        D=pd.Timestamp('20010102'),
                        E=pd.Series([1.0] * 3).astype('float32'),
                        F=False,
                        G=pd.Series([1] * 3, dtype='int8')))
dft
```

There is a really easy way to see what kind of dtypes are in each column.

```python
dft.dtypes
```

If a pandas object contains data of multiple dtypes **in a single column**, the dtype of the column is chosen to accommodate all of the data types (object is the most general).

```python
# these ints are coerced to floats
pd.Series([1, 2, 3, 4, 5, 6.])
```

```python
# string data forces an object dtype
pd.Series([1, 2, 3, 6., 'foo'])
```

To count the number of columns of each dtype in a DataFrame, chain .dtypes with .value_counts() (the older get_dtype_counts() method was removed in pandas 1.0):

```python
dft.dtypes.value_counts()
```

You can do a lot more with dtypes; see the pandas documentation for details.

Check: Why do you think it might be important to know what kind of dtypes you're working with?
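One concrete answer to the check above (a minimal sketch): numeric data sometimes arrives as strings, e.g. from a messy CSV, and arithmetic on an object column behaves very differently from arithmetic on a numeric column.

```python
import pandas as pd

# A column of numbers read in as strings ends up with the object dtype
s = pd.Series(['1', '2', '3'])
print(s.dtype)

# '+' on string elements concatenates instead of adding:
# '1' + '1' gives '11', not 2
print(s + s)

# Converting to a numeric dtype restores real arithmetic
s_num = s.astype(int)
print(s_num + s_num)
```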

## Demo / Guided Practice: df.apply() (20 mins)

Let's create a small data frame.

```python
df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
df
```

Use df.apply to find the square root of all the values.

```python
df.apply(np.sqrt)
```

Find the mean of all of the columns.

```python
df.apply(np.mean, axis=0)
```

Find the mean of all of the rows.

```python
df.apply(np.mean, axis=1)
```

Check: How would you find the standard deviation of the columns and rows?
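One possible answer to the check, following the same pattern as the mean examples (note that np.std uses ddof=0 by default, while pandas' own df.std() uses ddof=1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])

# Standard deviation of each column: one value per column
col_std = df.apply(np.std, axis=0)

# Standard deviation of each row: one value per row
row_std = df.apply(np.std, axis=1)

print(col_std.shape, row_std.shape)
```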

## Demo / Guided Practice: .value_counts() (20 mins)

Let's create a random array with 50 numbers, ranging from 0 to 7.

```python
data = np.random.randint(0, 7, size=50)
```

Convert the array into a series.

```python
s = pd.Series(data)
```

How many of each number are there in the series? Enter .value_counts() (the top-level pd.value_counts(s) form is deprecated in recent pandas, so call the method on the Series):

```python
s.value_counts()
```
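A quick sketch of two behaviors worth pointing out during the demo: the result is sorted by count in descending order, and normalize=True returns relative frequencies instead of raw counts.

```python
import pandas as pd

s = pd.Series([2, 2, 2, 0, 1, 1])

# Sorted by count, descending: 2 (three occurrences) comes first
counts = s.value_counts()
print(counts)

# normalize=True gives each value's share of the total instead of a count
print(s.value_counts(normalize=True))
```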

## Independent Practice: Topic (20 mins)

• Use the sales.csv data set - we've seen this a few times in previous lessons!
• Inspect the data types
• You've found out that all your values in column 1 are off by 1. Use df.apply to add 1 to column 1 of the dataset
• Use .value_counts to count the values of 1 column of the dataset

Bonus

• Add 3 to column 2
• Use .value_counts for each column of the dataset
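The exercise above can be sketched as follows. Note this uses a small stand-in DataFrame because the real sales.csv and its column names aren't shown here; 'volume_sold' is a hypothetical column name standing in for "column 1".

```python
import pandas as pd

# Hypothetical stand-in for sales.csv -- the real file's columns may differ
df = pd.DataFrame({'volume_sold': [10, 12, 10],
                   'margin': [0.2, 0.3, 0.2]})

# Inspect the data types
print(df.dtypes)

# All values in column 1 are off by 1: use .apply to add 1 to each
df['volume_sold'] = df['volume_sold'].apply(lambda x: x + 1)

# Count the values of one column
print(df['volume_sold'].value_counts())
```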

## Conclusion (5 mins)

So far we've used pandas to look at the head and tail of a dataset, examined summary statistics and data types, and selected and sliced data. Today we added inspecting data types, df.apply(), and .value_counts() to our pandas arsenal. Nice!