Big Data Case Studies
Week 10 | Lesson 2.4
After this lesson, you will be able to:
- describe some success stories of big data
- identify problems that require a big data approach
- connect the technologies you have studied to real world problems
Before this lesson, you should already be able to:
- perform queries in SQL
- perform queries in Hive over a Hadoop cluster
- explain how MapReduce works
- perform calculations on big data using MapReduce
Before this lesson, instructors will need to:
- Read in / Review any dataset(s) & starter/solution code
- Generate a brief slide deck
- Prepare any specific materials
- Provide students with additional resources
- Read the two articles mentioned in the Opening section
| Timing | Type | Topic |
| --- | --- | --- |
| 10 min | Introduction | Case 1: TV Show |
| 10 min | Guided | Case 1: Project goals |
| 10 min | Guided | Case 1: Implementation |
| 5 min | Guided | Case 2: Fraud detection |
| 15 min | Guided | Case 2: Investigation Phase |
| 10 min | Guided | Case 2: Discussion Phase |
| 15 min | Guided | Case 2: Solution Brainstorm |
| 5 min | Conclusion | Case 2: Conclusion |
Opening (5 min)
In this class we will work in groups to review two case studies on the use of Big Data technologies such as Hadoop. We will do this through two activities:
- investigation + presentation for the first
- debate + role playing for the second
Instructor note: Your role is to facilitate the discussion and make sure that the key points from each article emerge.
Case 1: TV Show (10 min)
Today's lecture is going to be highly interactive. Let's form 4 groups. These groups will be called:
- Blue Team
- Red Team
- Yellow Team
- White Team
Each group will use the next 10 minutes to read this article: TV show uses data to change the world.
You may also visit the site of Persistent, the company mentioned in the article, to get a sense of what products and services they offer.
At the end of the activity we will share some of the key findings from each group.
Case 1: Project goals (10 min)
For the first 5 minutes discuss the following questions within your group:
- What was the goal of the show?
  Instructor note: to talk about issues relevant to people.
- How did they leverage big data to achieve it?
  Instructor note: by doing real-time sentiment analysis on Twitter and other channels.
- What aspects of their solution would not have been possible with traditional databases?
  Instructor note: the volume of transactions searched is enormous; handling it would not have been possible without a big data approach.
At the end of 5 minutes one person from each group will summarize their conclusions to the class.
Case 1: Implementation (10 min)
Back in groups, let's look at how they might have implemented the solution. Choose one or two of the questions below and spend the next 5 minutes trying to find an answer. At the end, each group will share with the class.
- Which aspects of the project make it technologically challenging?
  Instructor note: the volume of data, especially when it arrives in bursts.
- Which technology components can you identify? Think about how the data is handled and served: what ingredients are needed?
  Instructor notes to lead the discussion:
  - a stream processor (Kafka? Kinesis? RabbitMQ?)
  - a sentiment analysis algorithm (where is it running?)
  - a SQL-like interface to query the data (Hive? Postgres?)
  - a dashboarding/visualization tool (Hue? D3?)
- If you were to build a prototype of this system that processes tweets at a much smaller rate, how would you build it? Think of the technologies you have learned:
  - How would you get the tweets?
    Instructor note: the Twitter API, Python requests.
  - How would you store the tweets?
    Instructor note: a database, probably NoSQL, but SQL would work too.
  - How would you do sentiment analysis?
    Instructor note: extract features from the text with NLTK and scikit-learn.
  - What would you use to visualize the data?
    Instructor note: matplotlib, bokeh, pandas.
At the end of the 5 minutes, each group will share their key insights.
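To make the prototype questions above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the system described in the article: the tweets are made-up sample strings rather than results from the Twitter API, an in-memory SQLite table stands in for the tweet store, and a tiny hand-written word list (the hypothetical `POSITIVE` and `NEGATIVE` sets) stands in for a classifier trained with NLTK and scikit-learn. All function and table names are invented for this example.

```python
import sqlite3

# Hypothetical mini-lexicon; a real prototype would train a classifier instead.
POSITIVE = {"great", "love", "inspiring", "good"}
NEGATIVE = {"bad", "hate", "boring", "awful"}


def score_sentiment(text):
    """Return +1, -1 or 0 depending on which lexicon has more hits in the text."""
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (balance > 0) - (balance < 0)


def build_store(tweets):
    """Load (topic, text) pairs into an in-memory SQLite table with a sentiment column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (topic TEXT, text TEXT, sentiment INTEGER)")
    conn.executemany(
        "INSERT INTO tweets VALUES (?, ?, ?)",
        [(topic, text, score_sentiment(text)) for topic, text in tweets],
    )
    conn.commit()
    return conn


def sentiment_by_topic(conn):
    """Average sentiment and tweet count per topic, the kind of query a dashboard would run."""
    rows = conn.execute(
        "SELECT topic, AVG(sentiment), COUNT(*) FROM tweets GROUP BY topic ORDER BY topic"
    )
    return rows.fetchall()


# Made-up sample tweets, for illustration only.
SAMPLE_TWEETS = [
    ("health", "Great episode, truly inspiring!"),
    ("health", "I love this show"),
    ("water", "Boring and bad."),
    ("water", "Good points, awful delivery"),
]

if __name__ == "__main__":
    for topic, avg, count in sentiment_by_topic(build_store(SAMPLE_TWEETS)):
        print(topic, avg, count)
```

The final query returns one row per topic with its average sentiment and tweet count, which is the kind of small table a matplotlib or bokeh chart could plot. At the article's scale, each stand-in would be replaced by one of the components discussed above.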
Case 2: Fraud detection (5 min)
For this case we will join the teams to form 2 larger groups:
- Team Pink (formed by the union of Red and White teams)
- Team Green (formed by the union of Blue and Yellow teams)
- Team Pink will represent the client: a global telecom company called GAGlobal.
- Team Green will represent the consulting company: a big data consultancy called DSISolutions.
Team Pink: as mentioned, you represent the telecom company. You are currently experiencing fraud problems of many different kinds, including:
- Subscription fraud
- Technical/network fraud
- Insider fraud
- Handset abuse
- Social engineering
Team Green: you represent the consultant. Your experience ranges from machine learning to big data, and you have been engaged to solve the problems mentioned above.
In particular, you will have to propose a system that detects the potential sources of problems in near-real time.
Instructor note: each group should be sitting on one side of the class, near one another.
Case 2: Investigation Phase (15 min)
The starting point for both teams will be this article that details a case study for fraud detection. However, each team should read the article with a different set of questions in mind, as detailed below.
Team Pink (Client) goals
Your goal for this phase is to learn as much as possible about the different types of fraud mentioned above. Feel free to use other resources, discuss within the group, and split up the research. Here are some of the points you should investigate:
- Which of the above-mentioned types of fraud is a priority for you to tackle?
  Instructor note: there is no correct answer; what matters is how they reason about it. In particular:
  - do they consider multiple perspectives (financial risk, reputational risk, infrastructure risk)?
  - do they try to assess the relative size of each type of problem? How?
- Which of the above-mentioned types of fraud seems easier to tackle with a big data approach?
- Which of the above-mentioned types of fraud does not seem to be related to big data?
At the end of this phase you should select the problem that you would like to tackle first.
Team Green (Consultant) goals
Your goal for this phase is to learn as much as possible about the challenges of fraud detection. In particular, you should learn:
- What makes fraud detection so difficult from a statistical point of view?
  Instructor note: rare events, active deception.
- What technologies could you suggest to the client in order to tackle the problem?
  Instructor note: with a big data approach one can analyze all the data without sampling.
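The "rare events" point can be illustrated with a short Python sketch. The numbers are synthetic and chosen only for illustration (1,000 transactions, 5 of them fraudulent), and the two "detectors" are hard-coded prediction lists rather than real models; `NEVER_FLAG` and `SIMPLE_RULE` are hypothetical names. The sketch compares accuracy with precision and recall for each.

```python
def metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = fraud, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate but flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate, not flagged
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Synthetic data: 5 fraudulent transactions followed by 995 legitimate ones.
Y_TRUE = [1] * 5 + [0] * 995

# Hypothetical detector A: never flags anything.
NEVER_FLAG = [0] * 1000

# Hypothetical detector B: flags 20 transactions, 4 of which are really fraud.
SIMPLE_RULE = [1] * 4 + [0] * 1 + [1] * 16 + [0] * 979

if __name__ == "__main__":
    print("never flag: ", metrics(Y_TRUE, NEVER_FLAG))
    print("simple rule:", metrics(Y_TRUE, SIMPLE_RULE))
```

With these made-up numbers, the detector that never flags anything has the higher accuracy (0.995 versus 0.983) yet catches no fraud at all, while the second one finds 4 of the 5 cases. This is one reason accuracy alone is a poor guide when the event of interest is rare.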
Case 2: Discussion Phase (10 min)
In this phase each group will present its findings to the other group.
Team Pink will present first. They will describe the main sources of fraud they have identified and the impact they estimate each has on the company. At the end they will indicate which problem they have decided to tackle first.
Team Green will present second, and they will describe the advantages of using a Big Data approach to tackle a problem like fraud detection that involves rare events.
Case 2: Solution Brainstorm (15 min)
The client and the consultant agreed to form two joint task forces to tackle the problem.
You will split back into the original 4 groups and recombine them differently to form:
- Team Orange: Team Red (Client) + Team Yellow (Consultant)
- Team Teal: Team White (Client) + Team Blue (Consultant)
Each of these 2 task forces will spend the next 15 minutes brainstorming a solution to the client's main problem.
In particular, the client component of the task force should focus on explaining the problem, while the consultant component should focus on identifying how the solution may solve the challenges of the client.
Instructor note: make sure that they clearly identify:
- what problem they are solving
- how they are going to look for a solution
- what technology they are going to use
Case 2: Conclusion (5 min)
Each task force will present its solution idea. Here are some questions to guide the discussion:
- How well did the client describe the problem?
- How well did the consultant describe the solution?
- How applicable are the solutions to the problem?
- How are the two proposed solutions similar?
- How do they differ?
- Was the client satisfied with the proposed solution?
The original articles on the 2 cases: