Welcome to GA's Data Science Immersive!
Week 1 | Day 1
LEARNING OBJECTIVES
After this lesson, you will be able to:
- Describe the roles and components of a successful learning environment
- Define data science and the data science workflow
- Apply the data science workflow to meet your classmates
STUDENT PRE-WORK
Before this lesson, you should already be able to:
- Define basic data types used in object-oriented programming
- Recall the Python syntax for lists, dictionaries, and functions
- Create files and navigate directories using the command line interface (for your specific environment)
INSTRUCTOR PREP
Before this lesson, instructors will need to:
- Modify GA DSI orientation deck
- Review course syllabus & projects
- Prepare for rest of Week 1, Day 1
LESSON GUIDE
TIMING | TYPE | TOPIC |
---|---|---|
20 min | Opening | Welcome to GA! |
20 min | Introduction | What is Data Science? |
10 min | Quiz | Data Science Pop-Quiz :) |
25 min | Introduction | Data Science Workflow |
25 min | Guided Practice | Workflow Application |
Optional | Demo | Data Science Tools / Onboarding Review |
5 min | Conclusion | Review |
Welcome to GA! (20 mins)
Instructors: Feel free to use / modify our sample GA DSI orientation deck provided here.
GA is a Special Learning Environment
- GA is: a global community of individuals empowered to pursue the work we love.
- GA Resources: any relevant discounts, community events, hub, office hours
- GA feedback loop: exit tickets, mid-course feedback, final feedback
Road to Success
- Emotional cycle of change
- Student learning responsibility
- GA graduation requirements
- After GA: build network, find opportunities, community, perks
Your Instructional Team!
- Who we are
- Our professional backgrounds
- Our data interests
Introduction: What is Data Science? (20 mins)
- A set of tools and techniques used to extract useful information from data
- An interdisciplinary, problem-solving oriented practice
- The application of scientific techniques to practical problems
Who uses Data Science?
- Netflix movie recommendations
- Amazon's algorithm - "If you like X, you might also like Y"
- Five Thirty Eight (election and sports coverage)
- Draft Kings: using data science to predict daily bets
- Google: auto-translate and search results
Check: Can you think of some other well-known examples?
What are some typical roles in Data Science?
Common Roles:
- Business Intelligence
- Data Analyst
- Data Researcher
- Data Scientist
- Data Engineer
- Statistician
Common Skills:
- Business Intelligence
- Machine learning
- Big data
- Programming
- Stats
- Math (Calculus, algebra)
- Critical thinking
- storytelling
Breakdown of Skills by Role:
Quiz: Data Science Baseline (10 Min)
Instructor Note: This quiz is intended as a gauge of your students' background knowledge about common data science related topics. It is intended to estimate their prior knowledge and give you a chance to address misconceptions and tailor future materials accordingly. You are welcome to substitute or modify this quiz as you see fit.
Quiz
- True or False: Gender (coded: male= 0 female= 1) is a continuous variable
According to the table below, BMI is the _
- Outcome
- Predictor
- Covariate
- Draw a normal distribution.
- True or False: Linear regression is an unsupervised learning algorithm.
- What is a hypothesis test?
Instructor Note: Discuss results. What trends do you spot? What features can you extract from their answers?
Introduction: The Data Science Work Flow (25 mins)
Overview of Steps:
Throughout the class - and for our projects - we will be following the data science workflow. This workflow will help you produce reliable and reproducible results.
- Reliable = Accurate findings
- Reproducible = Others can follow your steps and get the same results!
Data Science Workflow Steps:
- Identify
- Acquire
- Parse
- Mine
- Refine
- Build
- Present
- Optional: Deploy!
Note: While this may appear to be a linear process, this is in fact a bit simplified. Realistically, at any point, you may need to repeat earlier steps in order to iterate through the workflow, depending on whether you change your goals, acquire new data, or are trying to fine-tune your model.
Overall, the Data Science Workflow will serve as a useful set of standards and as a reference for our course projects.
Let's review these steps a bit further:
IDENTIFY: Understand the problem
- Identify business/product objectives.
- Identify and hypothesize goals and criteria for success.
- Create a set of questions to help you identify the correct data set.
ACQUIRE: Obtain the data
Ideal Data vs. Available Data Often times we start by identifying the ideal data we would want for a project.
During the data acquisition phase, we'll learn about what data is available and any limitations it may have. We'll decide if these limitations will inhibit our ability to answer our question or if we can work with what we have to find a reasonable and reliable answer.
Some typical questions at this stage may include:
- Identifying the “right” data set(s)
- Is there enough data?
- Does it appropriately align with the question/problem statement?
- Can the dataset be trusted? How was it collected?
- Is this dataset aggregated? Can we use the aggregation or do we need to get it pre-aggregation?
- Assess resources, requirements, assumptions, and constraints
Further, we'll need to acquire the data by:
- Importing data from the web (Google Analytics, HTML, XML)
- Importing data from a file (CSV, XML, TXT, JSON)
- Importing data from a preexisting database (SQL)
- Setting up local or remote data structure
- Determining most appropriate tools to work with data (following the format and size of data)
PARSE: Understand the data
Many times we are given secondary data, or data that was previously collected. In these cases, we have to learn as much as possible about our data using tools like data dictionaries and source documentation to determine exactly how this data was gathered.
Check: Why might it be important to understand how data was collected?
A data dictionary is exactly what it sounds like - it's a set of documentation that explains what our data is and how it is formatted. Here is an example:
Variable | Description | Type of Variable
---| ---| ---
Profession | Title of the account owner | Categorical
Company Size | 1- small, 2- medium, 3- large| Categorical
Location | Planet of the company | Categorical
Days Since Last Delivery | Integer | Continuous
Number of Deliveries | Integer | Continuous
Common Tasks at this step include:
- Reading any documentation provided with the data (e.g. data dictionary above)
- Performing exploratory surface analysis via filtering, sorting, and simple visualizations
- Describing data structure and the information being collected
- Exploring variables, data types via select
- Assessing preliminary outliers, trends
- Verifying the quality of the data (feedback loop -> 1)
MINE: Prepare, structure, & clean the data
Often, our data will need to be cleaned prior performing our analysis.
Check: What do you think this means? Why is this necessary?
Common Tasks at this step include:
- Sampling the data, determine sampling methodology
- Iterating and explore outliers, null values via select
- Reviewing qualitative vs quantitative data
- Formatting and cleaning data in Python (e.g. dates, number signs, formatting)
- Defining how to appropriately address missing values (cleaning)
- Categorization, manipulation, slicing, format, integrate data
- Formatting and combining different data points, separate columns, etc.
- Determining most appropriate aggregations, cleaning methods
- Creating necessary derived columns from the data (new data)
REFINE: Exploratory Data Analysis & Iteration
At this point, you'll be conducting EDA - or exploratory data analysis. For example, you may perform some basic summary statistics and check the Mean (STD) or specific frequency counts of your data. Example:
Variable | Mean (STD) or Frequency (%)
---| ---
Number of Deliveries | 50.0 (10)
NYC | 50 (10%)
LA 9 | 100 (20%)
Portland | 100 (20%)
Seattle 8| 100 (20%)
Other | 150 (30%)
Such descriptive statistics allow us to:
- Identify trends and outliers
- Decide how to deal with outliers - excluding, filtering, and communication
- Apply descriptive and inferential statistics
- Determine initial visualization techniques
- Document and capture knowledge
- Choose visualization techniques for different data types
- Transform data
BUILD: Create a data model
One we've fully cleaned and explored the extant data, we'll attempt to build predictive models based on the outcome we are interested in or the assumptions of the model we are using. An example of a model statement might look like this:
- "Completed a logistic regression using Statsmodels. Calculated the probability of a customer placing another order with the company."
Here, we are using a logistic model because we trying to determine the probability that a customer might place a return order, which is - at its heart - a classification problem.
Some of the steps we will take to build a model includes:
- Selecting appropriate model
- Building a model
- Testing and training our model
- Evaluating and refining our model
PRESENT: Communicate the results of your analysis
Presentations are a critical part of your analysis!!!
It doesn't matter how brilliant your model is or how illuminating your findings are, if you are not able to effectively communicate your results then unfortunately they may not be used.
The most basic form of a data science presentation should include - at the very least - a simple sentence that describes your results:
- "Enterprise customers from large companies had twice (CI 1.9, 2.1) the odds of of placing another order with the company compared to enterprise customers from small companies."
Check: What do you think the CI stands for? Why should we include this in our findings?
Of course, data science presentations can also be FAR more complex and exciting - like some of the research presented by Nate Silver's 538 blog.
When creating a presentation, always consider your audience and make sure to practice your presentation beforehand. Try to plan ahead for the types of questions your audience may have or - better yet - test your presentation on a few people and pay attention to their responses. Clarify and refine your presentation accordingly.
In general, make sure to consider your needs and goals as well as those of your audience. A presentation created for your fellow data scientists will be vastly different than a presentation intended for some executives who are trying to make a business decision.
Key factors of a good presentation include:
- Summarize findings with narrative and storytelling techniques
- Refine your visualizations for broader comprehension
- Present both limitations and assumptions
- Determine the integrity of your analysis
- Consider the degree of disclosure for various stakeholders
- Test and evaluate the effectiveness of your presentation beforehand
A (Further) Note About Iteration
Iteration is an important part of every step in the Data Science Workflow. At any given point in the process, you may find yourself repeating or going back and re-doing elements in order to better understand your data, clarify your model, and refine your presentation.
For example, after presenting your findings, you may want to:
- Identify follow-up problems and questions for future analysis
- Create a visually effective summary or report
- Consider the needs of different stakeholders and how your report might be changed for them
- Identify the limitations of your analysis
- Identify relationships between visualizations
So again, remember that the Data Science Workflow is not necessarily linear, and may curve back on itself quite a few times during a typical project :)
Practice: Data Science Work Flow (25 mins)
Get to know your classmates using three of the steps from the Data Science Workflow (e.g. identify, acquire, present)!
Students should get into 4 groups, spaced at the whiteboards around the room.
A. IDENTIFY: What problems are you trying to solve?
Have each group develop 1 research question that they would like to know about the class and make a hypothesis. Don't share these questions with the class just yet!
Examples:
- What is your current favorite tool for working with data?
- What are you most excited about learning?
- What can you help your classmates with when it comes to data analysis?
B. ACQUIRE: Obtain some data from your peers!
Rotate through the groups to "collect the data" and record the raw data on white boards.
Note: Suggest students create an easy visual way for the other students to write their answers, or an option quickly to save time.
C. PRESENT: Communicate the results of your analysis :)
- Summarize findings in a narrative
- Provide a basic visualization for broader comprehension on white board
- Nominate one student to present for each group
Optional Modules
Instructors: Walk through materials from pre-course onboarding tasks and the Installfest Lesson.
- Brief overview of the tools we will use as data scientists
- Workshop to test/finalize any environmental set-up
- Test dataset and discuss pre-course onboarding takeaways (Python Syntax, Statistics, Command Line Basics, Git Basics, SQL Overview)
Conclusion (5 mins)
By now, you should be able to answer the following questions with ease:
- What is data science?
- What is the data science workflow?
- How can you have a successful learning experience at GA?
Resources
- For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
- For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
- Quora has a data science topic FAQ with lots of interesting Q&A.