# Classification Case Studies

Week 5 | Lesson 2.1

### LEARNING OBJECTIVES

After this lesson, you will be able to:

• Walk through real-world dataset case studies
• Map out the analytics/data science workflow used
• Discuss the pros and cons of the methods involved

### STUDENT PRE-WORK

Before this lesson, you should already be able to:

• Perform EDA
• Perform classification
• Demonstrate proficiency with basic SQL syntax

### INSTRUCTOR PREP

Before this lesson, instructors will need to:

• Read in / Review any dataset(s) & starter/solution code
• Generate a brief slide deck
• Prepare any specific materials
• Provide students with additional resources

### LESSON GUIDE

| Timing  | Type            | Topic                          |
|---------|-----------------|--------------------------------|
| 10 mins | Opening         | Opening                        |
| 25 mins | Guided Practice | Default of Credit Card Clients |
| 25 mins | Guided Practice | Chronic Kidney Disease         |
| 25 mins | Guided Practice | Student Alcohol Consumption    |
| 5 mins  | Conclusion      | Review & Recap                 |

## Opening (10 mins)

In this class we will review several case studies to see how messy real-world data can be.

Check: Can anyone explain what steps are involved in data cleaning and preparation?

• filling missing data
• normalization
• scaling
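
As a quick refresher, the steps above might look like this with pandas and scikit-learn. The tiny frame and its column names are made up purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy frame with one missing value per column (illustrative only)
df = pd.DataFrame({"age": [25.0, 32.0, np.nan, 41.0],
                   "income": [40_000.0, 55_000.0, 61_000.0, np.nan]})

# Fill missing data, e.g. with each column's median
df_filled = df.fillna(df.median())

# Normalize each column to the [0, 1] range
normalized = MinMaxScaler().fit_transform(df_filled)

# Scale each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(df_filled)

print(df_filled.isna().sum().sum())  # 0: no missing values remain
```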

Check: Can anyone explain what classification is and how it is performed? What methods have we learned so far?

• KNN
• Logistic Regression
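
A minimal sketch of both methods in scikit-learn, fit on the bundled iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# K-nearest neighbors: predict by majority vote of the k closest points
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Logistic regression: a linear model of the class probabilities
logreg = LogisticRegression(max_iter=1000).fit(X, y)

print(knn.score(X, y), logreg.score(X, y))  # training accuracy of each
```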

## Guided Practice: Default of Credit Card Clients (25 mins)

Our first case is a study on the default of credit card clients performed by Yeh et al. in 2009. The data can be found in the UCI Machine Learning Repository.

For this lesson you will be working in pairs. Take 10 minutes to read over the paper and the dataset below:

Check: See if you can answer the following questions:

1. What is the goal of the study? (Hint: This can typically be found in the abstract.)

To compare the predictive accuracy of the probability of default among six data mining methods.

2. What is the target variable? (Hint: look at the website and dataset.)

A binary variable – default payment (Yes = 1, No = 0)

3. What models do they compare? (Hint: although you have not yet seen all of them, try to grasp the differences.)

KNN, logistic regression, discriminant analysis, naive Bayes, classification trees, and neural networks

4. How do they judge the goodness of a model? Do they use accuracy? If not, what do they use?

The study uses area ratio, instead of the error rate, to examine the classification accuracy among the six data mining techniques.

5. What validation method do they use? Simple train/test split? Cross-validation?

Train/test split
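
For reference, a simple train/test split looks like this in scikit-learn; the 80/20 ratio and the iris data are stand-ins here, not the paper's setup:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for evaluation (the ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out rows
```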

6. Bonus: Which model performs best?

Neural Networks

## Guided Practice: Chronic Kidney Disease (25 mins)

Our second case study is an example of a poorly written study. Many papers have been written about the Chronic Kidney Disease dataset on UCI. The one we've chosen is particularly simple and of low quality. See if you can come up with observations on how to improve it.

Spend 10 minutes reviewing the paper and the dataset.

Check: Let's discuss the following questions:

1. What is the goal of the study? (Hint: this is usually described in the abstract.)

Compare KNN and SVM

2. What is the target variable? (Hint: look at the website and dataset.)

A binary variable (ckd vs. notckd)

3. What models do they compare? (Hint: although you have not yet seen all of them, try to grasp the differences.)

KNN and SVM

4. How do they judge the goodness of a model? Do they use accuracy? If not, what do they use?

Accuracy, Precision, Recall, F1-score
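These four metrics are easy to compute with scikit-learn; the labels below are made up so the arithmetic is easy to check by hand:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Made-up binary labels: TP = 2, FP = 1, FN = 1, TN = 4
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

print(accuracy_score(y_true, y_pred))   # (2 + 4) / 8 = 0.75
print(precision_score(y_true, y_pred))  # 2 / (2 + 1), about 0.667
print(recall_score(y_true, y_pred))     # 2 / (2 + 1), about 0.667
print(f1_score(y_true, y_pred))         # harmonic mean, about 0.667
```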

5. What validation method do they use? Simple train/test split? Cross-validation?

Train/test split

6. Bonus: How could the paper be improved? Consider:

• Is the text well organized?
• Are the methods clear?
• Are the results clear?
• Are the graphs easy to understand?

## Guided Practice: Student Alcohol Consumption (25 mins)

One more time! You know the drill. Let's take 10 minutes to review the material below. First, review the two datasets. Second, identify the goal of the study and the major takeaways:

Check: Now let's consider the following questions:

1. What is the goal of the study? (Hint: this is usually described in the abstract.)

Predicting the alcohol consumption of high school students

2. What is the target variable? (Hint: look at the website and dataset.)

A categorical variable with 5 classes (1 = very low to 5 = very high)

3. What models do they compare? (Hint: although you have not yet seen all of them, try to grasp the differences.)

Decision trees and random forests
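
A quick sketch of the contrast between the two model families, using scikit-learn defaults on the iris data (the dataset and parameters here are illustrative, not the paper's):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A single decision tree: one learned set of if/else splits
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# A random forest: many trees on bootstrap samples, predictions averaged,
# which typically reduces variance relative to a single tree
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(tree.score(X, y), forest.score(X, y))  # training accuracy of each
```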

4. How do they judge the goodness of a model? Do they use accuracy? If not, what do they use?

Accuracy

5. What validation method do they use? Simple train/test split? Cross-validation?

Cross Validation
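
Cross-validation averages a score over several folds instead of relying on a single split. Here is a 5-fold sketch with scikit-learn; the fold count, model, and data are assumptions for illustration, not the paper's setup:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Train on 4 folds, score on the 5th, rotating through all 5 folds
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores.mean())  # average accuracy across the folds
```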

6. Bonus: Is there any missing data? Which pre-processing techniques do they use?

There is no missing data. They use the _ technique.

## Conclusion (5 mins)

We have reviewed a few classification studies and explored some issues around data preparation, model building, and model selection.

Review the learning objectives & relate the discussions to weekly goals, projects, and outcomes/real-world use cases.