Intro to Pandas 1
Week 2 | Lesson 1.1
LEARNING OBJECTIVES
After this lesson, you will be able to:
- Read a csv file using pandas
- View data: head, columns, values, describe
- Select data: a single column, slicing by row, by position
STUDENT PRE-WORK
Before this lesson, you should already have:
- pandas installed. Since we're using Anaconda, it should already be there, but make sure all of its dependencies are installed as well:
- setuptools
- NumPy: 1.7.1 or higher
- python-dateutil: 1.5 or higher
- pytz: needed for time zone support
STARTER CODE
INSTRUCTOR PREP
Before this lesson, instructors will need to:
- Read in / Review any dataset(s) & starter/solution code
- Generate a brief slide deck
LESSON GUIDE
TIMING | TYPE | TOPIC |
---|---|---|
5 min | Introduction | Pandas |
10 min | Demo / Guided Practice | Read csv |
25 min | Demo / Guided Practice | Viewing data: head/tail, describe |
25 min | Demo / Guided Practice | Selection: a single column, slicing by row, by position |
20 min | Independent Practice | |
5 min | Conclusion | |
Introduction: Pandas (5 mins)
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Demo / Guided Practice: Read csv (10 mins)
In an IPython notebook, type:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Let's read in a csv file and create a pandas dataframe.
df = pd.read_csv('sales.csv')
Check: This looks familiar... didn't we already learn how to read in csv files? Yes, but that was using Python without any libraries or packages. In W1 L3.2 it took 5 lines of Python; with pandas it takes only one. Nice!
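To make the comparison concrete, here is a minimal sketch that reads the same data both ways. The rows below are made-up stand-ins for sales.csv (an assumption, since the real file is not shown here), and `io.StringIO` is used only so the example runs without a file on disk.

```python
import csv
import io

import pandas as pd

# Made-up sample rows standing in for sales.csv (assumption, not the real file).
csv_text = """Account,Manager,Product,Quantity,Price
1001,Debra,CPU,1,30000
1002,Fred,Software,2,10000
1003,Debra,Maintenance,3,5000
"""

# Plain Python: build a reader, pull off the header, collect the rows.
# Every value comes back as a string.
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
rows = [row for row in reader]

# pandas: one call, with column names and numeric types inferred.
df = pd.read_csv(io.StringIO(csv_text))

print(header)
print(df.dtypes)
```

With a real file, `pd.read_csv('sales.csv')` replaces the `io.StringIO(...)` argument.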
Demo / Guided Practice: Viewing data: head/tail, describe (25 mins)
Let's take a summary look at our data. First, let's look at the head and tail.
df.head()
df.tail()
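Both methods take an optional row count and default to five rows. A small sketch on a made-up ten-row dataframe (the column values are illustrative only):

```python
import pandas as pd

# Made-up ten-row dataframe for illustration.
df = pd.DataFrame({
    "Account": list(range(1, 11)),
    "Quantity": list(range(10, 110, 10)),
})

first_five = df.head()    # no argument: first 5 rows
last_three = df.tail(3)   # explicit count: last 3 rows

print(first_five)
print(last_three)
```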
Check: What can looking at the head and tail of a dataset tell us?
Let's take a look at summary statistics.
df.describe()
For each numeric column, this gives us: count, mean, std, min, 25%, 50%, 75%, and max. Awesome!
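A short sketch of what the summary looks like, again on made-up data. It also shows `include="all"`, which adds text columns to the summary; the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up sample data; not the real sales.csv.
df = pd.DataFrame({
    "Manager": ["Debra", "Fred", "Debra"],
    "Quantity": [1, 2, 3],
    "Price": [30000.0, 10000.0, 5000.0],
})

# describe is a method, so it needs parentheses to run.
numeric_summary = df.describe()

# include="all" also summarises non-numeric columns (unique, top, freq).
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

By default only the numeric columns appear, so 'Manager' is missing from the first summary.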
Check: What was the cautionary tale about relying too heavily on summary stats again?
Demo / Guided Practice: Selection: a single column, slicing by row, by position (25 mins)
Let's select a single column.
df['Account']
Check: How would you select the 'Quantity' and 'Price' columns separately?
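One possible answer to the check, sketched on made-up data. Single brackets with one column name return a Series; passing a list of names (double brackets) returns a dataframe, which is shown here for contrast.

```python
import pandas as pd

# Made-up sample data using column names mentioned in this lesson.
df = pd.DataFrame({
    "Account": [1001, 1002, 1003],
    "Quantity": [1, 2, 3],
    "Price": [30000, 10000, 5000],
})

quantity = df["Quantity"]        # one column -> Series
price = df["Price"]              # one column -> Series
both = df[["Quantity", "Price"]] # list of columns -> DataFrame

print(type(quantity).__name__)
print(type(both).__name__)
```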
Now, let's slice and select for certain rows.
df[0:3]
Check: How would you slice for rows 9 to 14?
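Row slices follow Python's usual rule: the start is included and the stop is not. The sketch below assumes the default integer index and reads "rows 9 to 14" as index labels 9 through 14 inclusive; the data is made up.

```python
import pandas as pd

# Made-up 20-row dataframe with the default index 0..19.
df = pd.DataFrame({"Quantity": list(range(100, 120))})

first_three = df[0:3]     # rows 0, 1, 2 (stop is excluded)
rows_9_to_14 = df[9:15]   # rows 9 through 14

print(first_three)
print(rows_9_to_14)
```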
Now, let's try selecting by position. First, let's slice some rows.
df.iloc[1:3, :]
Check: How would you slice for rows 9 to 14 using iloc?
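Selecting by position ignores the index labels entirely. To make that visible, this sketch uses a made-up dataframe whose index is deliberately not 0, 1, 2, ... (an assumption for illustration only).

```python
import pandas as pd

# Made-up data with a non-default index, so labels and positions differ.
df = pd.DataFrame(
    {"Account": [1001, 1002, 1003, 1004], "Quantity": [1, 2, 3, 4]},
    index=[10, 20, 30, 40],
)

# Positions 1 and 2 (stop is excluded), all columns.
middle = df.iloc[1:3, :]

print(middle)
```

The result holds the second and third rows, whose labels are 20 and 30.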
Now, let's slice some columns.
df.iloc[:,1:3]
Check: How would you slice for the 'Manager' and 'Product' columns?
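One way to approach the check is to look up each column's position first and then pass those positions to `iloc`. The column order below is an assumption about sales.csv; with a different order, the positions change but the lookup still works.

```python
import pandas as pd

# Made-up sample data; the column order is assumed.
df = pd.DataFrame({
    "Account": [1001, 1002],
    "Manager": ["Debra", "Fred"],
    "Product": ["CPU", "Software"],
    "Quantity": [1, 2],
    "Price": [30000, 10000],
})

# Find the integer position of each column name.
manager_pos = df.columns.get_loc("Manager")
product_pos = df.columns.get_loc("Product")

# Select by an explicit list of positions...
subset = df.iloc[:, [manager_pos, product_pos]]

# ...or by a slice, when the columns sit next to each other.
same_subset = df.iloc[:, 1:3]

print(subset)
```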
Now, let's get an explicit value only.
df.iloc[1,1]
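With two integers, `iloc` returns the single value at that row and column position rather than a dataframe. A small sketch on made-up data:

```python
import pandas as pd

# Made-up sample data; the column order is assumed.
df = pd.DataFrame({
    "Account": [1001, 1002, 1003],
    "Manager": ["Debra", "Fred", "Debra"],
    "Quantity": [1, 2, 3],
})

# Second row, second column (positions start at 0).
manager = df.iloc[1, 1]

# Second row, third column.
quantity = df.iloc[1, 2]

print(manager, quantity)
```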
Independent Practice (20 mins)
- Read in this Star Wars survey csv
- Look at its head, tail, and summary stats. What does this tell you about the dataset?
- Select a certain column
- Slice for a set of rows
- Select a data point based on position
Bonus
- Convert one data type to another in the star wars survey csv
- Create a dummy variable for the yes and no answers
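As a starting point for the bonus tasks, here is a rough sketch. The column names `RespondentID` and `SeenAnyFilm` are hypothetical placeholders, not the real headers in the Star Wars survey csv, so swap in the actual names after reading the file.

```python
import pandas as pd

# Hypothetical columns and values; the real survey headers will differ.
survey = pd.DataFrame({
    "RespondentID": ["101", "102", "103"],
    "SeenAnyFilm": ["Yes", "No", "Yes"],
})

# Convert one data type to another: string IDs to integers.
survey["RespondentID"] = survey["RespondentID"].astype(int)

# Dummy variable: 1 where the answer is "Yes", 0 otherwise.
survey["SeenAnyFilm_dummy"] = (survey["SeenAnyFilm"] == "Yes").astype(int)

print(survey.dtypes)
print(survey)
```

Comparing to "Yes" treats every other answer, including missing ones, as 0, so check the column's unique values before relying on it.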
Conclusion (5 mins)
We read a csv file into a pandas dataframe with just one line of code. Last week, when we used plain Python to read in a csv file, it took about 5 lines. Pandas is already making our data lives easier. We also saw how easy pandas makes it to get some general information about our dataset by looking at the head, tail, and summary stats. Lastly, we started to select and slice our dataset.