Analytic Methods for Business – I
Introduction to Data Analytics Process: Spreadsheets and R

Theophanis C. Stratopoulos1

Nancy Vanden Bosch

November 14, 2021

1Contact author: Theophanis C. Stratopoulos PhD, School of Accounting and Finance
– University of Waterloo, Waterloo ON N2L 3G1, Canada.

Orientation

Learning Objectives
Theme of this Chapter: To help you prepare for your journey in the land of data
analytics and emerging technologies by understanding the main learning objectives
and structure of this course.1

CRISP-DM stands for the Cross Industry Standard Process for Data Mining and
it is a process that data analysts/scientists use to approach data analytics problems.
According to CRISP-DM (Figure 1)2 each data analytics problem goes through the
following steps: 1) business understanding, 2) data understanding, 3) data prepara-
tion, 4) modeling, 5) evaluation and communication, and 6) deployment.

The steps in CRISP-DM reflect the expectations and learning objectives of this
course. More specifically, by the end of this course students should be able to do the
following tasks:

1. Frame a business problem as a decision or question, applying step
1 of CRISP-DM in a simulated situation to identify required data elements.

1Ithaka by Constantinos Cavafis – https://www.onassis.org/initiatives/cavafy-archive/the-canon/ithaka

2Source: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610.

For example, to determine why revenue is growing, we need to analyze sales
volumes in units and average price per unit.

Figure 1: CRISP-DM Process Diagram

2. Develop an understanding of the data, applying steps 2 & 3 of CRISP-
DM in a simulated situation to describe the data and perform simple statistical
analysis of the data set. For example, generate min, max, mean, and mode of
sales volume.

3. Build a spreadsheet-based model to solve a business problem, applying step
4 of CRISP-DM to run a modelling tool on a prepared data set.

4. Build an R-based model to solve a business problem, applying step 4 of
CRISP-DM to run a modelling tool on a prepared data set.

5. Evaluate and communicate your analysis results in an interesting,
attention-getting way, applying step 5 of CRISP-DM and visualization princi-
ples. For example, use spreadsheet- or R-based graphs.

CRISP-DM is a recommended process for solving problems with the help of data.
However, not all problems lend themselves to data analysis. The five-step process
(introduced in AFM 111) provides a more general approach to problem solving.
Table 1 provides a mapping of steps between the two processes. We will use the
mapping in chapter/week 5.

CRISP-DM                                    Five-Stage Problem-Solving Process

1. Business Understanding                   1. Assess the Situation
2. Data Understanding,
3. Data Preparation,
4. Modeling (identify issues)               2. Identify and Analyze Issues
4. Modeling (compare alternatives)          3. Develop and Analyze Alternatives
5. Evaluation and Communication             4. Decide, Recommend, Communicate
6. Deployment                               5. Implementation

Table 1: Mapping: CRISP-DM to Five-Stage Process

Text and Data Analysis Tools
The text for this class is

• Stratopoulos, T. C., & Vanden Bosch, N. (2021). Analytic Methods for Busi-
ness. Waterloo, ON.

The text will be distributed free of charge from the course website. I will update
this text every week, so by the end of the term you will have a new version which
will have been tailored to the exact material covered in your class.

If you want to take a look at the Fall 2019 version of the text, you can
download it from the following link: https://ssrn.com/abstract=3618697.

For data analysis, we will use Google Sheets and R. Both of these tools are free
to use and platform-agnostic (i.e., they work the same on Windows, Mac, Linux, or
any other operating system).

Course and Text Structure
The typical chapter/week will have two lectures and one seminar. The focus in each
one of them will be as follows:

1. Lecture 1 – Introduce Concepts

2. Seminars – Apply concepts using appropriate tools and data

3. Lecture 2 – Debrief and set up the agenda for next week.

Students: Preparation
Each chapter includes a section that provides some suggested technical material that
students should consider reviewing in preparation for this chapter. This material
will be shared through a worksheet named studentPreparation (https://docs.google.com/spreadsheets/d/1uTRBaPmy6u2DRo5kynEoCAtLm_ez6cF7OcBq-g9BCJg/edit?usp=sharing).

1. The worksheet studentPreparation provides suggested videos/reading ma-
terial that you should view/review. We will use Google Sheets to perform data
analysis. If you don't have a Google account, please create one. If you are
not familiar with Google Sheets, these videos will help you with the following
topics:

(a) Accessing Google Sheets with your account
(b) Navigating Google Sheets
(c) Understanding common Google Sheets terminology
(d) Google Sheets formatting tips

2. To give you an outlet for asking questions at any time, we will use a class discus-
sion board named Piazza (https://piazza.com/class/krp90c94lp3ns#). If
you have not done this yet, please use the email you have received from Piazza
to activate your class account and review the protocol (How to use Piazza).

3. To help you learn and practice data analysis tools, we will use DataCamp
(https://www.datacamp.com/). If you have not done this yet, please follow the
directions on the course discussion board on how to register with DataCamp.

4. To enable interactive class exercises we will use Top Hat (https://tophat.com/).
If you have not done this yet, please follow the instructions provided on the
course discussion board on how to register with Top Hat.

Assignments and Assessment
The primary objective of the assessments is to evaluate your understanding of con-
cepts and tools, and how to apply them in a business analytics setting. To emphasize
and help you build your critical thinking, instead of memorization, all assignments
are open-book, open-notes. The following assignments/assessments will be used:

1. Individual and Crew-Based interactive learning exercises using Top Hat.

2. Individual Weekly online quiz on topics and concepts covered in seminars.

3. Assignments via DataCamp.

4. Individual mid-term exam, cumulative and case-based (spreadsheets).

5. Individual final exam, cumulative and case-based (spreadsheets & R).

6. Crew-based project on communication of data analysis results.

Please review the course syllabus for a detailed description of each assignment used
this term and its weight. Collaboration on individual assignments is an academic
violation. Please see school policy for details.

Weekly Topics
The course will give a high-level introduction to the different types/stages of data
analytics and their use in business settings (see Figure 2). More specifically, we will
cover the following topics.

• Course Orientation

• Understand, Organize, and Prepare Data (data management)

• Big Picture: Liquor Store case – business and data understanding, descriptive
statistics and visualization (descriptive analytics)

• Segment Analysis: Liquor Store case – analysis by store, product, etc., using
pivot tables (diagnostic analytics)

• Liquor Store integrative case using CRISP-DM

• Mid-term

• Liquor Store integrative case with R

• Toy Store integrative case with R

• OKCupid integrative case with R – ethical issues in data analytics (predictive
analytics)

• Pet Adoption integrative case with R & Spreadsheets (prescriptive analytics)

• Communications – Dashboards

• Crew Project: Dashboard

• Review & Final Exam

For a week-by-week list of topics and assignments/assessments please see the course
syllabus.

Figure 2: Data Analytics Stages

Preview of Topics from Next Chapter
Each chapter/week will end with the key takeaways from the current chapter/week and a
brief preview of the topics for the following chapter/week. The main theme for this
chapter was to help you understand the course structure and course expectations
(both in terms of learning and assessment). For more details see the course syllabus.

In the following chapter, we will focus on the following topics/objectives:

1. Understand sources of data, types of data.

2. Organize data using spreadsheets.

3. Create new variables out of existing data.

Acknowledgment

We would like to thank our research assistant Wanyue (Pamela) Zeng for her numer-
ous suggestions and feedback.

We would like to acknowledge and thank the following sources/providers for the
data used in this text:

• The liquor store data and toy store data used in these notes are based on the
Bibitor and BAToys data respectively, from the HUB of Analytics Education
(https://www.hubae.org/).

• The Pet Adoption data are based on the Kaggle pet finder competition
(https://www.kaggle.com/c/petfinder-adoption-prediction).

• The data set for OKCupid is based on the cleaned and anonymized version
provided by user rudeboybert in his GitHub repository: https://github.com/rudeboybert/okcupiddata.

Contents

1 Understand & Prepare Data 1
1.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Student Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3.1 Source of Data . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Organizing Data: From Receipt to Spreadsheet . . . . . . . . 6
1.3.3 Types of Variables (Types of Data) . . . . . . . . . . . . . . . 8
1.3.4 Data Preparation: Modeling and Formulas . . . . . . . . . . . 11
1.3.5 Seminar Preview . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.4 Seminar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1 Data Preparation – Part 1: Mathematical Formulas . . . . . . 13
1.4.2 Data Preparation – Part 2: Mathematical Formulas . . . . . . 14
1.4.3 Data Preparation – Part 3: Logical Formulas . . . . . . . . . . 15

1.5 Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5.1 Seminar Debriefing . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5.2 Understand Business & Build Research Skills . . . . . . . . . 17

1.6 Key Lessons from this Chapter . . . . . . . . . . . . . . . . . . . . . 18
1.7 Preview of Next Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 19

2 Big Picture 20
2.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Students: Advance Preparation . . . . . . . . . . . . . . . . . . . . . 21
2.3 Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3.1 Model: LCBO Deposits for Financial Reporting . . . . . . . . 21
2.3.2 Illustration of Descriptive Statistics . . . . . . . . . . . . . . . 22
2.3.3 Visualizations – Illustration of Potential Graphs . . . . . . . . 24
2.3.4 Descriptive and Summary Statistics . . . . . . . . . . . . . . . 28
2.3.5 Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3.6 Seminar Preview . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Seminar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.4.1 Visualization – Aggregate Column/Bar Chart . . . . . . . . . 34
2.5 Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.5.1 Seminar Debriefing . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5.2 Understand Data Quality (Reliability) . . . . . . . . . . . . . 36
2.5.3 Evaluate Data Quality (Reliability) . . . . . . . . . . . . . . . 37
2.5.4 Understand Business: Total Revenue . . . . . . . . . . . . . . 39

2.6 Key Lessons from this Chapter . . . . . . . . . . . . . . . . . . . . . 41
2.7 Preview of Next Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 41

3 Segment Analysis 44
3.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Students: Advance Preparation . . . . . . . . . . . . . . . . . . . . . 45
3.3 Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3.1 Model: Business Segment Analysis . . . . . . . . . . . . . . . 45
3.3.2 Illustration of Segment Analysis . . . . . . . . . . . . . . . . . 47
3.3.3 The Logic Behind the Creation of a Pivot Table . . . . . . . . 48
3.3.4 Understand Seminar Data Set . . . . . . . . . . . . . . . . . . 49
3.3.5 Developing a Simple Algorithm . . . . . . . . . . . . . . . . . 50
3.3.6 Seminar Preview . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.4 Seminar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.1 Prepare the Data Set: Create New Variables . . . . . . . . . . 53
3.4.2 Segment Analysis/Modeling: Create Pivot Tables . . . . . . . 54
3.4.3 Segment Analysis/Modeling: Two Way Pivot Table . . . . . . 56

3.5 Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5.1 Class Exercises: Seminar Debriefing . . . . . . . . . . . . . . . 57
3.5.2 Show As Pivot Table Options . . . . . . . . . . . . . . . . . . 58
3.5.3 Categories: Other . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.6 Key Lessons from chapters 1-3: Product Sales . . . . . . . . . . . . . 61
3.7 Preview of Next Chapter: Product Cost Data . . . . . . . . . . . . . 63

4 Applying CRISP-DM 64
4.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 Students: Advance Preparation . . . . . . . . . . . . . . . . . . . . . 64

4.2.1 Overview of the Week . . . . . . . . . . . . . . . . . . . . . . 65
4.2.2 Liquor Store Limited Mini-Case . . . . . . . . . . . . . . . . . 66

4.3 Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.3.1 Assess the Situation: Role and Required Output . . . . . . . . 67
4.3.2 Assess the Situation: Business Understanding . . . . . . . . . 68
4.3.3 Identify and Analyze Issues: Data Understanding . . . . . . . 69
4.3.4 Identify and Analyze Issues: Data Preparation . . . . . . . . . 70
4.3.5 Seminar Preview . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.4 Seminar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.1 Complete Data Preparation Tasks . . . . . . . . . . . . . . . . 72
4.4.2 Model: Segment Analysis Using Pivot Tables . . . . . . . . . . 73
4.4.3 Model: Segment Analysis Using Two-Way Pivot Table . . . . 74
4.4.4 Model: Analysis by Day of the Week . . . . . . . . . . . . . . 75

4.5 Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5.1 Seminar Debriefing . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5.2 Visualize Answers to Questions . . . . . . . . . . . . . . . . . 77
4.5.3 Filter Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.6 Lessons from Problem-Solving Case . . . . . . . . . . . . . . . . . . . 79
4.7 Preview of Next Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 79

5 Midterm Review 80
5.1 Learning Objectives & Design . . . . . . . . . . . . . . . . . . . . . . 80
5.2 How to Prepare for the Midterm . . . . . . . . . . . . . . . . . . . . . 81
5.3 Midterm Fall 2020 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3.3 Practice Exam Deliverable . . . . . . . . . . . . . . . . . . . . 86

5.4 Preview for Next Chapter . . . . . . . . . . . . . . . . . . . . . . . . 86

6 Midterm 87
6.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

7 CRISP-DM with R 88
7.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.2 Students: Advance Preparation . . . . . . . . . . . . . . . . . . . . . 89
7.3 Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

7.3.1 Overview of the chapter . . . . . . . . . . . . . . . . . . . . . 90
7.3.2 Why Transition from Spreadsheets to R . . . . . . . . . . . . 90
7.3.3 Prepare R Environment . . . . . . . . . . . . . . . . . . . . . 92
7.3.4 LSL Mini-Case: Business Understanding . . . . . . . . . . . . 93
7.3.5 LSL Mini-Case: Data Understanding . . . . . . . . . . . . . . 93
7.3.6 LSL Mini-Case: Detecting Outliers . . . . . . . . . . . . . . . 98

7.3.7 LSL Mini-Case: Data Preparation . . . . . . . . . . . . . . . . 99
7.3.8 LSL Mini-Case: Model . . . . . . . . . . . . . . . . . . . . . . 102

7.4 Seminar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.5 Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

7.5.1 Replicate LSL Analysis for ALL Stores . . . . . . . . . . . . . 104
7.5.2 LSL Analysis for ALL Stores: Data Understanding . . . . . . 105
7.5.3 Understand & Communicate Findings for ALL Stores . . . . . 106

7.6 Key Lessons from this Chapter . . . . . . . . . . . . . . . . . . . . . 107
7.7 Preview of Next Chapter: The Toy Store . . . . . . . . . . . . . . . . 107

8 The Toy Store 108
8.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.2 Students: Advance Preparation . . . . . . . . . . . . . . . . . . . . . 108
8.3 Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

8.3.1 Company Description . . . . . . . . . . . . . . . . . . . . . . . 109
8.3.2 Data Understanding and Preparation . . . . . . . . . . . . . . 109
8.3.3 Models to Address Business Questions . . . . . . . . . . . . . 111
8.3.4 What is the Logic Behind the R Script? . . . . . . . . . . . . 114
8.3.5 Models to Address Business Questions – Continued . . . . . . 117

8.4 Seminar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.4.1 Prepare R Environment for Seminar . . . . . . . . . . . . . . 119
8.4.2 Seminar Questions . . . . . . . . . . . . . . . . . . . . . . . . 120

8.5 Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.5.1 The Toy Store Mini-case . . . . . . . . . . . . . . . . . . . . . 122

8.6 Key Lessons from this Chapter . . . . . . . . . . . . . . . . . . . . . 124
8.7 Preview of Next Chapter: Pet Adoption . . . . . . . . . . . . . . . . 124
8.8 Appendix: Answers to Selected Questions . . . . . . . . . . . . . . . 126

9 Pet Adoption 133
9.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.2 Students: Advance Preparation . . . . . . . . . . . . . . . . . . . . . 134
9.3 Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

9.3.1 PetPivot Mini-Case . . . . . . . . . . . . . . . . . . . . . . . . 135
9.3.2 Data Understanding and Preparation . . . . . . . . . . . . . . 135
9.3.3 Explore (Model) . . . . . . . . . . . . . . . . . . . . . . . . . 140
9.3.4 Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . 142
9.3.5 Prepare for Seminar . . . . . . . . . . . . . . . . . . . . . . . 146

9.4 Seminar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

9.4.1 Extend the Existing Model . . . . . . . . . . . . . . . . . . . . 146
9.4.2 New Target Variable & Model: Monthly Adoption . . . . . . . 147
9.4.3 New Target Variable & Model: First Week Adoption . . . . . 147

9.5 Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
9.5.1 PetPivot Mini-Case . . . . . . . . . . . . . . . . . . . . . . . . 148

9.6 Key Lessons from this Chapter . . . . . . . . . . . . . . . . . . . . . 150
9.7 Preview of Next Chapter: OK Cupid . . . . . . . . . . . . . . . . . . 150

10 OkCupid 152
10.1 Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
10.2 Students: Advance Preparation . . . . . . . . . . . . . . . . . . . . . 152
10.3 Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

10.3.1 Ethical Issues Around Data Analytics . . . . . . . . . . . . . . 153
10.3.2 OkCupid Data . . . . . . . . . . . . . . . . . . . . . . . . . . 155
10.3.3 Data Understanding (1) . . . . . . . . . . . . . . . . . . . . . 156
10.3.4 Data Understanding (2) . . . . . . . . . . . . . . . . . . . . . 159
10.3.5 Data Preparation and Understanding . . . . . . . . . . . . . . 163
10.3.6 Seminar Preparation – Playing Detective . . . . . . . . . . . . 165

10.4 Seminar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
10.4.1 Predict Income Group from Job . . . . . . . . . . . . . . . . . 165
10.4.2 Predict Income Group from Education . . . . . . . . . . . . . 167
10.4.3 Predict Income Group from Other Categories . . . . . . . . . 168

10.5 Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
10.5.1 Seminar Debriefing . . . . . . . . . . . . . . . . . . . . . . . . 168
10.5.2 OkCupid and Dating Algorithms . . . . . . . . . . . . . . . . 169

10.6 Key Lessons from this Chapter . . . . . . . . . . . . . . . . . . . . . 170
10.7 Preview of Next Chapter: Dashboards . . . . . . . . . . . . . . . . . 170

Bibliography 172

Alphabetical Index 174

Chapter 1

Understand, Organize, & Prepare Data

1.1 Learning Objectives
Theme of the Week: Understand sources of data, types of data, how to organize
data, and how to create (prepare) new variables out of existing data (i.e., focus on
data management). By the end of this week, students should …

1. Identify sources of data (e.g., cellphones, Fitbits, social media, customer re-
ceipts) and understand the different types of data.

2. Organize data (start a new spreadsheet or work with an existing one).

3. Generate new variables from existing ones using logical formulas and/or math-
ematical formulas.

4. Translate logical/mathematical formulas into spreadsheet formulas in order to
generate new columns (i.e., new data).

1.2 Student Preparation
The worksheet studentPreparation (https://docs.google.com/spreadsheets/d/1Ekt0bE4LqABehMM9EyPdhshBaZlAOWN4ZfJ0QYDyAEA/edit?usp=sharing)
provides suggested videos/reading material that you should view/review. Students
should prepare the following material for Week 2:

1. Using formulas and functions.

2. Sorting data on a spreadsheet.

3. Finish the DataCamp assignment “Getting Started – Introduction to Spread-
sheets.”

1.3 Lecture 1
Consider the following questions:

1. Did the pandemic affect sales of liquor? If yes, did they increase or decrease,
and by how much?

2. Suppose that you are planning some promotion for a new service and you want
to customize it based on the income group of your customers. You look at the
information you have about your customers and you realize that about half of
them did not answer the question about their income. Does this mean that
you cannot use half of your data? Is there anything you can do to leverage the
data set with the missing values?

3. If you calculate sales per dollar of rent (e.g., if your sales are $50,000 and your
rent is $5,000, your sales per dollar of rent are 10), would you expect a small
store in a densely populated area (e.g., a metropolitan area) to perform better
or worse than a large store in a sparsely populated area (e.g., the suburbs)?

There are two ways to answer these questions. The first one is to try to use your
intuition. The second one is to leverage available data. By analyzing historical and
current (real-time) data, decision makers can get valuable insights that help them
make more informed and better decisions. Using data to enable decision making is
known as business or data analytics.1

Business or data analytics means that we will use data to make business-related
decisions, but where are these data coming from?

1.3.1 Source of Data
Location-Aware Devices: Is location (GPS) turned on in your cell phone?

• Why does this matter? How can companies use these data? According to a
Bloomberg article (Parmar 2019):

Data culled from mobile-phone use can reveal, in real time, the num-
ber of people carrying devices at a particular location. This can shed
light on how many – or few – people are frequenting a retailer, super-
market or fast-food joint. Firms can also monitor app downloads:
how popular they are, where they’re occurring and when they’re be-
ing used to make purchases.

• Google uses such data to populate information about stores. For example,
Figure 1.1 shows the flow of customers for Williams Cafe at the University
Plaza.

Figure 1.1: … People Frequenting a Retailer

Social Media: Do you have a Twitter Account?

• Why does this matter? Continue reading from the Bloomberg article (Parmar
2019):

… How often are people tweeting about Apple’s newest iPhone? Is
the latest Nike sneaker a hit with teens? Firms have started tracking
key words or phrases on social-media sites including Facebook and
Instagram to gauge what consumers are thinking. That information
can be mapped to various companies, providing clues about the
popularity of a product or service.

1The term business intelligence has been around since 1865 (Bogost 2018). According to Bogost,
”The term business intelligence was first coined way back in 1865, in Richard Miller Devens’s book
Cyclopaedia of Commercial and Business Anecdotes …” It refers to the ability of firms to leverage
information on subject matters such as war, competition, and weather.

• A few years ago, I used R to download Tweets for all major Canadian airlines
(i.e., Air Canada, Air Transat, West Jet, and Porter) as well as for Delta (the
largest US airline). I analyzed these data and generated an index showing the
sentiment (positive or negative words) for two separate periods, 2015 and 2016,
shown in Figures 1.2 and 1.3 respectively. When the analysis you do is based
on text, we call this analysis text analytics.

• For one of my research projects, “Blockchain Technology Adoption,” we used a
similar approach to download and analyze firm disclosures (i.e., financial re-
ports that publicly traded companies have to file with the securities commission).
We analyzed the text for references to blockchain and we used the results of
this text analysis to evaluate the adoption of blockchain by companies.2

2You can download a copy of this paper from the following URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3188470. If you would like to learn a bit more about blockchain
you can read the first couple of chapters from the “Introduction to Blockchain for Accounting
Students,” which is available from the following URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3395619.


Figure 1.2: Comparison in 2015

Figure 1.3: Comparison in 2016

Business Transactions (e.g. Receipts)

A customer went to a liquor store (e.g., LCBO in Ontario) and brought home the
products shown in Figure 1.4 and the receipt shown in Figure 1.5.

• Why does it matter? The information captured in these receipts enables the
management of the company to answer some very important questions.

• If we calculate the sum of all receipts for a day, we have generated the total
revenue for that day.

• If we count the units sold for one product (e.g., Baronnes Sancherre) during
a month, and subtract this from the total number of Baronnes bottles in the
warehouse at the beginning of the month, we can find how many units should be
in the warehouse. If the actual number in the warehouse differs from our number,
this could be a signal of potential theft.

These are just a couple of simple examples of the kind of analysis that firms do based
on receipt data.
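To make the second example concrete (the numbers here are hypothetical): if the
warehouse held 120 bottles of Baronnes Sancherre at the beginning of the month and
the receipts show 45 bottles sold, then 120 − 45 = 75 bottles should remain. A
physical count of, say, 70 bottles would leave 5 bottles unaccounted for.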

All kinds of companies – large or small – rely on data. Mediterraneo is a small
family restaurant that specializes in Greek cuisine. Herrle’s is another family busi-
ness that specializes in locally grown produce. Customer receipts issued by these
firms are shown in Figures 1.6 and 1.7 respectively.

Figure 1.4: Customer Purchase

Figure 1.5: LCBO Receipt

Figure 1.6: Mediterraneo Receipt

Figure 1.7: Herrle’s Market Receipt

Work with your crew: Analyze Receipts

• What are some of the questions that the owner of Mediterraneo may want to
answer based on such receipts?

• What are some of the questions that the owner of Herrle’s may want to answer
based on such receipts?

• Fun Fact: How many guests were served by Jessica in Mediterraneo? Healthy
appetite!

1.3.2 Organizing Data: From Receipt to Spreadsheet
The worksheet receipt in the Google sheet receiptLCBO1 (https://docs.google.com/spreadsheets/d/1MKxZvjbpfCnsGZrUnL5gDIOkrC5xaOjrwY_WAzK-814/edit?usp=sharing)
– also shown in Figure 1.8 – is a representation of the receipt. By converting the
receipt into a spreadsheet, we created a new data set. A data set has variables,
values, and observations.

Figure 1.8: Liquor Store Spreadsheet

• Variables (known as columns in a spreadsheet) contain information related to
a specific element or feature. For example, description is the variable that
captures the names of products sold, and retailPriceUnit captures the retail
price of each one of the products sold. Our dataset (spreadsheet) has eight
variables (columns).

• Value of the variable or data point (known as a cell in a spreadsheet) is the
specific entry corresponding to a point of the data set. For example, the retail
price for Grey Goose Vodka is $26.85. This means that 26.85 is one of the
values that the variable retail price takes, or a data point in our data set.

• Observations or records (known as rows in a spreadsheet) contain the complete
set of values that is needed to describe a transaction (purchased item). For
example, the second row in our data set tells us that on April 22, 2019 (salesDate)
a customer bought from a specific store one bottle (salesQuantity) of
Baronnes Sancherre (description) that retails for $31.75 (retailPriceUnit)
and requires a $0.20 deposit to encourage recycling. This means that the
customer paid $31.95 (totalFromCustomer) for this product.
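To preview how these terms map onto R (the second tool we will use in this course),
here is a minimal sketch; the values mirror the receipt rows discussed above, except
the Grey Goose quantity and deposit, which are assumptions for illustration:

# Each column is a variable, each row is an observation, each entry is a value
receipt <- data.frame(
  salesDate       = c("2019-04-22", "2019-04-22"),
  description     = c("Baronnes Sancherre", "Grey Goose Vodka"),
  salesQuantity   = c(1, 1),                    # Grey Goose quantity is assumed
  retailPriceUnit = c(31.75, 26.85),
  depositUnit     = c(0.20, 0.20)               # Grey Goose deposit is assumed
)
receipt[2, "retailPriceUnit"]   # one value (cell): 26.85
receipt[1, ]                    # one observation (row): the Sancherre purchase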

1.3.3 Types of Variables (Types of Data)
The type of values that we assign to a variable determines the type of variable
(i.e., the type of data) that we are working with. At a high level, there are two
classifications of variables: quantitative and qualitative.3

• A quantitative variable, also known as a numeric variable, takes numeric val-
ues. For example, the retail price is in dollars and the quantity is in bottles.
Because numerical variables are measured in standard units (e.g., dollars,
bottles, litres, centimetres), we can add and multiply numerical values.

3For a slightly different approach to this topic see Davenport and Kim 2013.


• A qualitative variable – also known as categorical – takes values associated
with different classes. Depending on the number and the type of classes, we
can make the following classification:

– A qualitative/categorical variable that takes only two values is known as a
binary variable. For example, a product can be classified as liquor or
wine, the temperature is above or below the freezing point, a bill has been
paid or not. Typically, we measure binary variables using values of 1 and
0.

– A qualitative/categorical variable that takes more than two values is called
just categorical or nominal. For example, the variable description in our
data set is a categorical variable. The province/state (e.g., Ontario, Que-
bec, Alberta) where a company is located and the browser (e.g., Chrome,
Firefox, Safari) that a customer uses are examples of nominal variables.

– Ordinal variables: sometimes when we work with qualitative/categorical
variables, we assign numeric values to represent an order or scale. For
example, using numbers 1 to 5 to represent a Likert scale for survey re-
sponses from strongly disagree (1) to strongly agree (5). You cannot add
or multiply ordinal values since the result does not make sense.

Why does the type of data matter?

The type of available data determines the type of analysis that can be done. We
can use common statistical approaches to analyze numerical values. We need special
approaches to analyze categorical or nominal values. The data analysis that we will
cover in this course will revolve primarily around quantitative data and techniques.
We will also perform some basic analysis with qualitative data. For example, in this
week’s seminar, we will illustrate how to create a new variable that identifies regular
and premium products.

As we have seen in our discussion on sources of data (1.3.1), the growth of the network
economy and social media has spawned new forms of data. Some of these, such as
tweets, blog posts, images, videos, and audio, do not fit the traditional classi-
fication of just quantitative and qualitative. Developments in artificial intelligence
(AI) and machine learning have made it feasible to analyze such data. Some of these
techniques will be covered in upper level classes.

Work with your crew: Types of Data


1. The data set has a variable called store that takes values such as 1, 2,
9, and 72. Does this mean that this variable is numeric? Hint: If a variable is
numeric this means that we can add or multiply its values. If we add store 1
and store 2, are we going to get a new store which is the same as store 3?

2. The data set has a variable size that takes such values as 375mL, 750mL, 1
Litre, and 1 Gallon. How will you classify the variable size? Hint: Is there a
standard unit?

Work with your crew: Stages in CRISP-DM According to CRISP-DM
(see Figure 1 in the Orientation) each data mining problem goes through the following steps:

1. we start with a business question or problem (business understanding),

2. we make sure that we understand our data set (data understanding),

3. if necessary we create new variables (data preparation),

4. we build our model (modeling),

5. we evaluate the quality of our model (model evaluation) and

6. we communicate our results in non-technical terms (i.e., deployment/commu-
nication).

The objective of this exercise is to start understanding the different stages. Don’t
worry if you cannot identify all stages at this point. Just try to do your best.

The manager of store #2 looked at our dataset and wants to know the total
amount that we collected from this customer.

An analyst highlighted the three values of the variable totalFromCustomer and
used the quick sum function in Google Sheets to produce the analysis shown in
Figure 1.9.

• Create a list showing each CRISP-DM step in the context of this mini-case.
• Come up with another simple problem, and create a similar list.


Figure 1.9: Quick Sum

1.3.4 Data Preparation: Modeling and Formulas
To answer business questions, such as the one above, we may have to create new
variables based on existing data. This is one of the several activities performed
during the data preparation stage of CRISP-DM.

1. Description/scenario based formula. Look at the first item (second line) in
Figure 1.5 (Cell H2 in receipt). What does the number 31.95 represent? How
was this number calculated?

2. The simplest explanation would be to simply add the price and deposit amounts.

3. Translate description to mathematical formula.

totalFromCustomer = retailPriceUnit + depositUnit

4. How can we modify this formula if someone buys more than one bottle?

totalFromCustomer = salesQuantity ∗ retailPriceUnit
+ salesQuantity ∗ depositUnit

Therefore,

totalFromCustomer = salesQuantity ∗ (retailPriceUnit + depositUnit)
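As a quick check, apply the formula to the receipt’s Baronnes Sancherre line:

totalFromCustomer = 1 ∗ (31.75 + 0.20) = 31.95

which matches the amount the customer paid for this product.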

When working with computer formulas, the standard mathematical order of
operations applies. This means that the computer will first process the con-
tent of parentheses (), continue with exponents, divisions, multiplications,
additions and subtractions.

NB: Order of Operations

Work with your crew: Consensus within Crew Open the worksheet re-
ceipt 4DataPreparation (https://docs.google.com/spreadsheets/d/1MKxZvjbpfCnsGZrUnL5gDIOkrC5xaOjrwY_WAzK-814/edit#gid=1728442476)
and translate the above mathematical formula to a spreadsheet formula that can be
used in cell H2. Which of the following answers is correct?

(a) =E2 ∗ (F2 + G2)

(b) =(E2 + F2) ∗ G2

(c) =E2 ∗ G2 + F2

(d) =E2 ∗ G2 ∗ F2

(e) =E2 + G2 + F2

Try to answer the question individually, before you discuss this with your crew.
If there are different opinions, try to discuss them within your crew. It is very
important to learn how to justify your choice to others and to listen to their feedback.
Having others critique your argument will help you develop your critical thinking
skills.

• It is important to learn how to justify your choice because there are going to
be similar questions in assignments (e.g., quizzes, exams).

1.3.5 Seminar Preview
We have learned how to organize data using spreadsheets, how to prepare new vari-
ables out of existing ones, and how to generate quick summary statistics for each
variable. During the seminar you will apply these skills with a data set that has 27
sales transactions. The spreadsheet is named C01S (https://docs.google.com/spreadsheets/d/1xdgfoabNOY-pcjBkosRB-Eo4kNu2G-W3-BA6JVb8oK0/edit?usp=sharing)
and it has two worksheets. The first one is named receipt, and the second one storeSales.

1.4 Seminar

1.4.1 Data Preparation – Part 1: Mathematical Formulas
• Estimated completion time for this exercise is 10 to 15 minutes.

• The following hands-on exercises are based on the worksheet receipt in C01S.
It contains just the three records from the customer’s liquor store purchase,
based on the LCBO receipt.

Hands-On: Calculate Total from Customer (totalFromCustomer)

1. Using the formula from the lecture exercise in Section 1.3.4, calculate the total
due from the customer (totalFromCustomer) in cell H2.

2. Copy and paste the formula to the rest of the rows (products purchased by this
customer).

Hands-On: Calculate Product Price per Unit (productPriceUnit)

On the customer receipt it says that the 13 per cent sales tax (HST) is included.
How will you express the retail price as a function of the original product price and
HST?

Given that HST is 13% on top of the product price, we can express the retail
price as a function of product price and HST as follows:

retailPriceUnit = productPriceUnit ∗ 1.13

Solve for product price by dividing both sides by 1.13 and simplify:

retailPriceUnit / 1.13 = (productPriceUnit ∗ 1.13) / 1.13

Therefore,

productPriceUnit = retailPriceUnit / 1.13
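As a sanity check, using the Grey Goose Vodka retail price of $26.85 from Figure 1.8:

productPriceUnit = 26.85 / 1.13 ≈ 23.76

and 23.76 ∗ 1.13 ≈ 26.85, which recovers the retail price.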

Work with your crew What is the formula that we need to write in cell I2
in order to calculate productPriceUnit?

• Work with your crew to suggest a formula and explain why.

• Copy and paste the formula to the rest of the rows.

Hands-On: Calculate HST per Unit (HSTperUnit)

HSTperUnit = productPriceUnit ∗ 0.13

Work with your crew What is the formula that we need to write in cell J2
in order to calculate HSTperUnit?

• Work with your crew to suggest and explain your formula.

• Copy and paste the formula to the rest of the rows.

When we create a new data set, we have to assign a name to each variable.
The name we select should be simple and meaningful. For example, the vari-
able named Price is more informative than variable1. Sometimes, we need
more than one word to describe a variable. However, most programming lan-
guages do not accept spaces in variable names. For example, if we name
a variable as Retail Price per Unit, it may be interpreted as four variables:
Retail, Price, per, and Unit. To avoid this problem we can use either an un-
derscore (e.g., Retail_Price_Unit) or upper-case letters as separators (e.g.,
retailPriceUnit).

In the rest of these notes – simply for consistency – we will use the upper-
case letter separator approach. Both approaches work equally well.

NB: Naming Variables

1.4.2 Data Preparation – Part 2: Mathematical Formulas
• Estimated completion time for this exercise is 10 to 15 minutes.

• For this exercise you will have to work with the worksheet storeSales in C01S.
It contains a sample of sales transactions from several stores.

Hands On: Data Preparation

1. Use the formula from the hands-on exercise in Section 1.4.1 (totalFromCustomer)
to calculate the total due from the customer for each product.

2. Use the formula from the hands-on exercise in Section 1.4.1 (productPriceUnit)
to calculate the product price per unit for each product.

3. Use the formula from the hands-on exercise in Section 1.4.1 (HSTperUnit) to
calculate the HST included in the retail price for each product.

Hands On: Business Problems

Use Google Quick Sums to answer the following business-related questions:

1. What was the sum of units sold (quantity) across all stores?

2. What was the sum of the total from customers (totalFromCustomer)?

1.4.3 Data Preparation – Part 3: Logical Formulas
• Estimated completion time for this exercise is 20 to 25 minutes.

• Continue working with the worksheet storeSales in C01S.

Hands On: Manual Classification of Products

Create a new column to manually classify products as premium or regular as follows:

1. Create a new column and name it productType.

2. Freeze the first row. If you don’t do this, the header row will move too when
you sort.

3. Use column productPriceUnit to sort data in ascending order (i.e., from
smallest to largest).

4. Write premium in the cells of productType if productPriceUnit is over $20
and regular if it is not.

Hands On: Conditional Statements Based Classification

Such manual coding is feasible when working with small data sets, but it is not
practical when dealing with large data sets. For large data sets we use an if statement.
Let’s revisit the logic behind the creation of the premium and regular products.

• In essence, we examine each cell from the column productPriceUnit and if
the entry is higher than $20 we classify this as premium, and if not we classify
this as regular.

• In other words, we evaluate the condition productPriceUnit > 20.

– If it is TRUE, we assign the value premium.
– If it is FALSE, we assign the value of regular.

• In Google Sheets the if statement is formatted as follows:
=IF(logical expression, value if true, value if false)

– The values inside the if statement must be in double quotation marks,
e.g., “premium”, “regular”.
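For comparison, R (which we will start using later in the course) expresses the
same logic with the ifelse() function. A minimal sketch with hypothetical prices:

# Classify each price as premium (over $20) or regular
productPriceUnit <- c(12.50, 31.75, 18.00, 26.85)
productType <- ifelse(productPriceUnit > 20, "premium", "regular")
productType
# [1] "regular" "premium" "regular" "premium"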

Work with your crew If we were to create a new column named productType_b
that could produce the same results as the productType column, what is the formula
that we should enter in cell L2?

1.5 Lecture 2

1.5.1 Seminar Debriefing
In the first lecture, we introduced some new terminology that is necessary for
properly communicating our understanding of data. We also learned the
process/logic for preparing/adding new variables in a data set. During
the seminar, we had an opportunity to apply this knowledge using hands-on exercises.
More specifically, we focused on applications related to the following topics:

1. Understand and organize data.

2. Create new variables out of existing data using mathematical formulas or logical
conditions.

1.5.2 Understand Business & Build Research Skills
To really understand, organize, prepare, and then analyze (model) data, we first need
to understand the business. The data set used in the seminar was for a liquor store
business, so we’ll use the LCBO as a proxy for the business. During the lecture,
we’ll learn how understanding the LCBO’s business is necessary to use deposit data
to create a model that the company needs for financial reporting purposes.

Work with your crew: LCBO Business Understanding

1. Run a web search for the annual report of LCBO.

2. Work with your crew to find the portion that describes the business and sum-
marize what the company does in a sentence.

Work with your crew: Business Understanding – LCBO Deposit

1. Discuss with your crew why the LCBO collects deposits and then refunds
them to customers.

2. Summarize the consensus of your discussion in one to two short sentences.

Work with your crew: 2018 LCBO Deposits

1. In the 2018 fiscal year, how much did LCBO collect for deposits?

(a) $72.2 million
(b) $69.1 million
(c) $56.8 million
(d) $44.7 million

2. In the 2018 fiscal year, how much did LCBO refund to customers?

(a) $72.2 million
(b) $69.1 million
(c) $56.8 million

(d) $44.7 million

Try to answer the question individually. Once each member has answered the
question, open it for discussion within the crew. Remember that your goal is not
only to have the right answer; you also need to be able to explain your answer to
others. For example, in this question you may want to show the source of the infor-
mation. Once you have reached a consensus you will have to prepare two statements
like the following:

• The 2018 deposits collected of $XX.X million is the total amount of the 10 and
20 cent deposits collected when the sale is made and recorded on the customer’s
receipt and in the LCBO’s information system.

• The 2018 deposits refunded of $XX.X million is the total cash paid to the Beer
Store4 based on data they capture when the customer returns the bottle to
the store.

Work with your crew: LCBO Deposits v. Refunds

1. Is there a discrepancy between the amount of deposits and refunds? If yes,
how much is this discrepancy?

2. Discuss with your crew the possible reasons why the amount collected is dif-
ferent than the amount refunded.

3. Prepare a short answer – one to two sentences.

1.6 Key Lessons from this Chapter
In this chapter, we have focused on the following two topics:

1. To understand and organize data, you first need to understand the business
and understand data sources and types.

2. Organizations build models to analyze data for various reasons, including mod-
els that provide estimates needed for financial reporting.

4In Ontario, customers who want to receive a refund for their deposit will have to return their
empty bottles to the Beer Store.


A question that may have crossed your mind is “Why does this matter?” Stated
slightly differently, why do we need to understand how to organize data? A relatively
easy way to answer this question is by recognizing that in order to perform any type
of data analytics, you need to have data, and that data must be well organized and
easy to access. As we have seen in the course orientation (see Figure 2), all different
types/stages of data analytics rely on the foundation of data management.

1.7 Preview of Next Chapter
Having created our foundation (i.e., having taken our first step towards data man-
agement with spreadsheets), in the following chapter we will have our first taste of
descriptive analytics. We will leverage the data that we have generated to understand
what happened.

1. High level introduction to the ideas of summary statistics and aggregate values.

2. Generate summary statistics for units, price, deposits, sales, and taxes.

3. Leverage summary statistics to answer questions related to the business. For
example, how much revenue does the business generate from selling products
to customers during the year?

4. Leverage visualizations to communicate findings and frame questions.

Chapter 2

Big Picture

2.1 Learning Objectives
Theme of the Week: See the Big Picture.1 Leverage descriptive statistics to
understand your data and highlight important issues. Leverage visualizations to
communicate findings and frame questions. By the end of this week, students should
be able to:

1. generate descriptive statistics for relevant data variables; e.g., sales units, prod-
uct price, deposits, and total from customers.

1The image is from https://www.jpl.nasa.gov/spaceimages/details.php?id=PIA17172.

2. generate summary statistics that total amounts for relevant data variables; e.g.,
total units sold, total HST taxes and deposits collected, and total sales dollars.

3. describe the information communicated by each of the four descriptive statistics
and by summary statistics that aggregate data values.

4. leverage visualizations to communicate findings and frame questions.

5. use the CRISP-DM model to explain when and why we use descriptive and
summary statistics.

2.2 Students: Advance Preparation
The worksheet studentPreparation provides suggested videos/reading material
that you should view/review. Students should prepare the following material for
this chapter:

1. Using quick sum
2. Types of data and graphs
3. Selecting the right chart type
4. Creating charts in Google Sheets

2.3 Lecture 1

2.3.1 Model: LCBO Deposits for Financial Reporting
A deposit is cash collected from the customer (e.g. 10 or 20 cents), held in trust
for the customer by the LCBO, and then returned to the customer when the empty
bottle is returned (days, weeks, months, and sometimes years later).2

• HELD IN TRUST means the LCBO can NOT spend the cash on something
else IF they believe the customer is likely to return the empty bottle and claim
the deposit amount.

• However, for financial reporting purposes, the LCBO must ESTIMATE the
amount of deposits collected during a fiscal year that will NOT be refunded
(never redeemed).

2See the lower part of the receipt – Figure 1.5.

• To make this estimate, the LCBO developed a model based on historical re-
demption patterns. This is the terminology used in the LCBO annual report.

– In the Foundations of Data Mining class you will learn how to generate
such estimates (forecasting) based on historical data.

• The estimate of deposits collected during the 2018 fiscal year that will not be
refunded (never redeemed) totals $14.6 million or 20.2% of the $72.2 million
collected.3

• As shown in worksheet estimates 2017-2018 in C02L1a (https://docs.google.com/spreadsheets/d/1PJEuPwLqM09NPC_EOYpU5i6i97upOc5aFYzms6RTvJ0/edit?usp=sharing),
the estimate can and does change to reflect changes in customer behaviour.
The comparable estimate of amounts not refunded/returned for fiscal year 2017
was 24.2% ($16.7 million/$69.1 million), indicating that people in Ontario seem
to have returned more of their empty bottles in 2018. A small win for the
company’s environmental goals.

2.3.2 Illustration of Descriptive Statistics
LCBO does not refund 100% of deposits to customers. The worksheet estimates
2008-2018 in C02L1a includes historical data extracted from the LCBO’s annual
report for the years 2008 to 2018. Since our data set has only 11 observations (years),
it is relatively easy to just see the answers to questions such as:

1. What was the lowest percentage estimate of unredeemed deposits (proxy for
non-recycling) over that time period?

• Note that we call this value the MINIMUM.

2. What was the highest percentage estimate of unredeemed deposits over that
time period?

• Note that we call this value the MAXIMUM.

3. If we sort the values in sequential order, what value would fall in the middle?

• Note that we call this value the MEDIAN, NOT the AVERAGE.

3LCBO did not provide the breakdown between deposits and refunds in their annual reports
for 2018-19 and 2019-20. They simply report that the ODPR container deposit breakage income
(i.e., the difference between deposits collected and refunds) was $14.6 million in 2019 and $14.7
million for 2020. Thanks to Saru Aggarwal for providing this research update.

Measures of central tendency capture the location/center of a data set (distribution).
The most commonly used measures are the average (also known as the mean), the
median (i.e., the middle value that separates the top from the bottom half of the
distribution), and the mode (i.e., the most frequent data point).

NB: Measures of Central Tendency

Manual Calculation of Median

1. Copy the range of data that you want to analyze, i.e., Estimate % of unredeemed
deposits (B8:L8)

2. In a new worksheet: Select Edit … Paste Special … Paste Values Only.
This will paste only the values, not the formula used to calculate the percent-
ages.

3. Copy the pasted cells and … Select … Paste Special … Paste Transposed.
This will convert your data from a row to a column.

4. Highlight the data in the column … Select Data … Sort Range … Leave
the default selection A to Z and click … Sort. This will sort your data from
smallest to largest.

5. Find the value that falls in the middle. This is the median.
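The same procedure can be scripted in R (which we will use later in the course). A
minimal sketch, using hypothetical percentages in place of the 11 yearly estimates:

# Hypothetical stand-ins for the 11 yearly estimates (percent)
est <- c(19.5, 21.0, 20.2, 19.8, 20.5, 19.9, 20.8, 20.1, 20.6, 20.3, 21.1)
sort(est)                          # steps 1-4: arrange from smallest to largest
sort(est)[(length(est) + 1) / 2]   # step 5: the middle (6th of 11) value
median(est)                        # built-in shortcut; returns the same answer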

Work with your crew: Median

1. What is the median of the Estimate % of unredeemed deposits (proxy for non-
recycling)?

(a) 19.9%
(b) 38.3%
(c) 22.7%
(d) 20.1%
(e) 249.8%


2.3.3 Visualizations – Illustration of Potential Graphs
Time-series Graph with Trend Line

Many people prefer to see data presented as a graph instead of in data tables. For
example, when dealing with values/observations over time (e.g., estimates 2008-
2018 in C02L1a) a time-series graph with trend line (Figure 2.1) can be very useful.

Figure 2.1: Time Series with Trend Line: Estimates of Unredeemed Deposits

The trend line (red line in Figure 2.1) is a line that goes as close as possible to
all data points. It tries to capture the overall trend in a data set. In the introduction
to statistics course, you will learn how to fit such lines using linear regression.
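
As a small preview of that course, a trend line like the red line in Figure 2.1 can be sketched in R with lm(); the vector below reuses the hypothetical estimates from the earlier sketch, not the actual worksheet data:

# Fit and draw a linear trend through yearly estimates (hypothetical data)
years <- 2008:2018
unredeemed <- c(0.202, 0.242, 0.199, 0.227, 0.201, 0.215,
                0.208, 0.198, 0.221, 0.233, 0.205)
fit <- lm(unredeemed ~ years)          # simple linear regression
plot(years, unredeemed, type = "b",
     xlab = "Year", ylab = "Estimate % of unredeemed deposits")
abline(fit, col = "red")               # overlay the fitted trend line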

Work with your crew: Time-Series Trend

1. Which of the following statements captures the correct interpretation of the trend line (red line) of unredeemed deposits in Figure 2.1?

(a) There is a strong upward trend.
(b) There is a strong downward trend.
(c) There is a mild upward trend.
(d) There is a mild downward trend.


(e) The trend in unredeemed deposits has been flat.

Many times, when you hear about data visualization, you also hear about
storytelling. That is because the most effective visualizations incorporate a
storytelling approach. Why? There are three main reasons.

• Memorable: Stories will make it easier for the audience to connect and
remember the information you are trying to convey.

• Relatable: Stories lead to emotional coupling. Both the storyteller and
the audience go through and relate to the same experience.

• Lead to action: Research shows that storytelling can engage parts of the
brain that lead to action.

Source: EY-ARC Introduction to Data Visualization.

NB: Story Telling

Work with your crew: Story Telling

1. To describe Figure 2.1, which of the following two options would you consider more memorable and more relatable?

(a) There seems to be an upward trend in the estimates of unredeemed de-
posits during the period 2008 to 2018.

(b) Over the last ten years (2008 to 2018), the percentage of people who did
not return their empty bottles for refund at the Beer Store seems to have
been increasing.

2. Discuss with your crew how to improve the description that you have selected.

3. Remember that a storytelling approach should focus on implications derived from the graph (i.e., lead to action). You may want to add a short sentence to capture this.


Histogram

A histogram is a graph that captures the distribution of data. Typically, a histogram
(distribution) is described as symmetric, skewed left, skewed right, unimodal, bimodal,
or multimodal.

Figure 2.2: Types of Histograms

For example, the first graph in Figure 2.2 is considered unimodal (i.e., it has one spike) and symmetric. The second from the left is unimodal and skewed to the right. The third one is unimodal and skewed to the left. The fourth one is bimodal (i.e., it has two spikes), and the last one is multimodal.4
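
In R, a histogram takes one line. A minimal sketch with the hypothetical estimates from earlier (hist() picks the buckets automatically, much like the Auto bucket size in Google Sheets):

# Histogram of hypothetical estimates of unredeemed deposits
unredeemed <- c(0.202, 0.242, 0.199, 0.227, 0.201, 0.215,
                0.208, 0.198, 0.221, 0.233, 0.205)
hist(unredeemed,
     main = "Histogram - Estimate of Unredeemed Deposits",
     xlab = "Estimate % of unredeemed deposits")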

Work with your crew: Histogram Interpretation As we can see from the time series graph (Figure 2.1), the estimates of unredeemed deposits (i.e., non-recycling) have been fluctuating around 20%. An alternative way to visualize this is to create a histogram, like the one shown in Figure 2.3.

1. How will you describe the distribution of the histogram shown in Figure 2.3?

(a) Unimodal and Symmetric
(b) Unimodal and Skewed Left
(c) Unimodal and Skewed Right
(d) Bimodal
(e) Multimodal

2. What does the count represent in Figure 2.3?
4The graphs shown in Figure 2.2 are from the Wikipedia article on histograms (https://en.wikipedia.org/wiki/Histogram).


Figure 2.3: Histogram – Estimate of Unredeemed Deposits

3. Which one of the following statements captures the correct interpretation of the highest column in the histogram (Figure 2.3)?

(a) In 5% of the years in the data set, the percentage of unredeemed deposits was less than 22%.
(b) In 5% of the years in the data set, the percentage of unredeemed deposits was between 16% and 22%.
(c) None of the suggested answers is correct.
(d) In 5 out of the 11 years in the data set, the percentage of unredeemed deposits was less than 22%.
(e) In 5 out of the 11 years in the data set, the percentage of unredeemed deposits was between 16% and 22%.

When working with time-series it is very important to capture and understand
the overall trend. It is impossible to see the time-series trend from a histogram.
Therefore, the most appropriate and most useful graph to visualize data over
time is a time-series graph like the one shown in Figure 2.1.

NB: Time Series Graphs


2.3.4 Descriptive and Summary Statistics
When we deal with just a handful of data points (observations), it is very easy to understand and describe what these data are showing. However, this becomes more and more difficult as the number of observations increases. Descriptive statistics allow us to communicate the properties/distribution of our data.

For example, when we deal with hundreds or thousands of products sold by a liquor store, we would like to know things such as the average price of the products we sell, as well as the most expensive and the cheapest product we sell. In other words, we want to know about the center of our distribution (average price) and the range of prices (min and max values).

Work with your crew: Quick Sum The quick sum is a feature in Google
Sheets for generating summary statistics for numeric variables. As shown in Figure
2.4, you need to highlight/select one of the variables/columns, and use the drop-down
menu at the bottom right of the screen.

The worksheet L1 QuickSum in C02L1b contains a sample of 33 sales transactions from various stores.

Figure 2.4: Quick Sum – Total from Customer

1. Select the entire column for total from customer.

(a) What is the average of the total from customer?
(b) What is the minimum?


(c) What is the maximum?
(d) What was the sum of all total from customer? What does this number represent for the liquor store?
(e) Notice the difference between count: 34 (it includes the row of labels) and count numbers: 33 (it includes only numeric values).

2. Sort the data based on the column ProductType in ascending order (A
to Z). Select/highlight all the rows in the column total from customer that
correspond to premium products.

(a) What is average for premium products?
(b) What is the minimum?
(c) What is the maximum?
(d) What is the sum?

3. Repeat above with regular products.

(a) What is average for regular products?
(b) What is the minimum?
(c) What is the maximum?
(d) What is the sum?

4. As a group (premium versus regular products), which generates more TotalFromCustomer?

Trying to answer the last couple of questions poses a problem, since the quick sum disappears the moment we deselect the cells. To solve this problem, in the following paragraph we will introduce formulas that calculate these summary statistics, and in week 04 we will introduce pivot tables as a way of organizing data in categories (e.g., premium versus regular products).

Descriptive and Summary Stats Formulas

Table 2.1 summarizes the commands used to create the statistics for Total from
Customers shown at the bottom of the worksheet L1 Formulas in C02L1b.


Descriptive Statistic        Formula

Sum                          =sum(H2:H34)
Average                      =average(H2:H34)
Median                       =median(H2:H34)
Minimum                      =min(H2:H34)
Maximum                      =max(H2:H34)
Count (numeric values)       =count(H2:H34)

Table 2.1: Formulas for Quick Statistics
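
For reference, the R equivalents of the formulas in Table 2.1, applied to a short hypothetical vector of Total from Customer values (in R the average is called mean, and length() plays the role of =count() for a numeric vector):

# R equivalents of the spreadsheet functions in Table 2.1
x <- c(25.00, 12.50, 40.00, 18.75)  # hypothetical Total from Customer values
sum(x); mean(x); median(x); min(x); max(x)
length(x)                           # number of values in the vector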

2.3.5 Visualizations
There are numerous graphs that one can make (see the Data Visualisation Catalogue). Making good visualizations is a science and an art. You need to select the appropriate graph and make the visualization appealing to your audience. As shown in Figure 2.5, the primary driver for choosing a graph is your objective. Do you want to show the distribution of data, make a comparison, show composition (share of total), or show the relation between two variables?

Analyzing LCBO’s estimates of unredeemed deposits (see section 2.3.3) we have
created two graphs. The first one (Figure 2.1) is a line chart that enables the visual
comparison of estimates over time. The second one (Figure 2.3) is a histogram
that provides a visual representation of the distribution of estimates.

Work with your crew: Histogram – Retail Prices

1. Open the worksheet L1 QuickSum in C02L1b. Highlight the retail price and
generate quick sum stats. Are these stats useful in terms of what is the most
common range of prices?

2. Does the histogram L1 Histogram4retailPrice in C02L1b solve this prob-
lem?

3. Select Edit Chart, select Customize, Histogram and change Bucket Size
from Auto to 5.5

5It is likely that someone may ask about the Outlier Percentile option. Unfortunately, Google's documentation is not helpful: it does not explain how it works or what it is supposed to do. Better not to make a change unless you know what you are doing.


Figure 2.5: Graph Selection Guide

4. Interpret the new results.

Work with your crew: Bar Chart – Size

1. Using the worksheet L1 QuickSum in C02L1b, highlight the different sizes of
products.

2. Can you guess which size is more common?
3. Show how the bar graph L1 BarChart4Size in C02L1b solves this problem.

2.3.6 Seminar Preview
Even when we deal with hundreds or thousands of products sold by a liquor store over a full year, we still want to know things such as the average price of the products we sell, as well as the most expensive and the cheapest product we sell. We can answer these and many more questions by generating descriptive statistics, summary statistics, and basic visualizations/graphs.



Data Understanding

We will work with the worksheet C02S. The data set has 1,000 observations and the
following variables:

1. Store = store number
2. SalesDate = the sales date
3. Description = name of product, e.g., Capt Morgan White, Martell VS Cognac.
4. Size = classification of product sold in terms of size, e.g., 750mL.
5. SalesQuantity = units/bottles sold.
6. RetailPriceUnit = price per unit. This includes the sales tax (HST)!
7. DepositUnit = deposit per unit.
8. TotalFromCustomer = amount collected from customer. This includes HST and deposit.
9. ProductPriceUnit = price per unit. This does not include HST.
10. HSTperUnit = sales tax per unit.
11. ProductType = premium versus regular products.

The goal is to generate the descriptive statistics, summary statistics, and basic visualizations/graphs demonstrated during this lecture.

2.4 Seminar

Work with your crew: Quick Sums

• Estimated completion time for this exercise is 10 minutes.

1. Use quick sums to answer the following questions:

(a) What is the total of units sold (salesQuantity)?
(b) What is the highest (max) units sold (salesQuantity)?
(c) What is the lowest (min) units sold (salesQuantity)?

2. Create a new column named HSTfromCustomer. The new column should reflect the HST that was collected from the customer.


(a) Describe the process that you will use.
(b) Convert the process into a math formula.
(c) Convert the math formula into a spreadsheet formula.

3. Use quick sums to answer the following questions:

(a) What is the total HST collected from customers?
(b) What is the highest (max) HST collected from a customer?
(c) What is the lowest (min) HST collected from a customer?

Work with your crew: Summary Statistics

• Continue working with the worksheet C02S.
• Estimated completion time for this exercise is 15 minutes.

1. Create the following variable: DepositFromCustomer.
2. Use formulas at the bottom of the data set to generate summary statistics (sum, average, median, min, and max) for the following variables:

(a) SalesQuantity
(b) ProductPriceUnit
(c) DepositFromCustomer, and
(d) HSTFromCustomer

Work with your crew: Visualizations

• Continue working with the worksheet C02S.
• Estimated completion time for this exercise is 20 minutes.

1. What graph will you create if your objective is to showcase the distribution of
sales quantity?

(a) Column/Bar Chart
(b) Histogram
(c) Line Chart


(d) Pie Chart
(e) Scatter plot
(f) Other

2. Create the appropriate graph.
3. Create the graph to showcase the distribution of product price per unit.

2.4.1 Visualization – Aggregate Column/Bar Chart
The objective of this exercise is to compare categories with a focus on another variable. For example, we want to create a graph that will let us know which product size category generates the most sales (sum of sales quantity), or compare product size categories in terms of the median sales quantity. To achieve this, we need to work with a categorical variable (i.e., product size) and aggregate the information that we want to focus on (i.e., sum or median of sales) for each category of product size.

Work with your crew Complete the exercise below to generate a graph that
shows aggregate (sum) sales quantity for each product size category.

1. Given that your objective is to compare among product size categories in terms
of sales quantity, which column(s) would you select? Select all that apply.

(a) Store
(b) Sales Date
(c) Description
(d) Size
(e) Sales Quantity
(f) Retail Price Unit
(g) Deposit Unit
(h) Total From Customer
(i) Product Price Unit
(j) HST per Unit
(k) Product Type


2. Select the two columns and select Insert Chart. The Chart Editor will open.

3. Given your objective, what graph will you choose/create?

(a) Column/Bar Chart
(b) Histogram
(c) Line Chart
(d) Pie Chart
(e) Scatter plot
(f) Other

4. In the Chart Editor …

• select the appropriate … Chart.
• select that you want to use column D (it contains the product size categories) as labels.
• select that you want to use Row 1 as headers.
• select Aggregate and leave the default selection sum of sales quantity.

The link Seminar Graph in C02L1b provides the solution based on Lecture
1 data. Your answer will be similar.

5. Work with your crew to prepare a description of your graph.

6. Repeat the above exercise/create a new graph. This time instead of Aggregate
select Median.

7. Work with your crew to prepare a description of the new graph and explain
how it differs from the previous one.

Sometimes, the variables that we want to use to make a chart may not be in adjacent columns. We can select non-adjacent columns by holding down the CTRL key on Windows machines (the Command key on Macs) and highlighting/selecting the specific columns or cells that we want to include in the chart with the mouse.

NB: Selecting Non-Adjacent Columns


2.5 Lecture 2

2.5.1 Seminar Debriefing
Open the Google Sheet that you prepared during the seminar. We are going to use the descriptive statistics and graphs you completed to understand the data for two purposes:

1. to check the quality of the data in advance of the data preparation stage.
2. to gain a better understanding of the business.

• Relate the above to the CRISP-DM model.

2.5.2 Understand Data Quality (Reliability)
Business analytics means leveraging data to make better decisions. Are you making better decisions if the quality of the data is questionable? Are you familiar with the saying "garbage in, garbage out"? What does it mean in the context of data analytics? What is the input we use to perform data analytics?

Typically, when we talk about data quality, we tend to consider the following
attributes: accuracy/consistency, timeliness, and completeness.

1. Accuracy/consistency: A data set is accurate and consistent if it is as free
as possible from intentional or unintentional errors. The values in a data set
are accurate, if they capture what the decision maker would consider as the
true value. The values of a data set are consistent if they do not change across
occurrences. For example:

• A person’s weight was recorded while the person was wearing a heavy
winter jacket.

• A person’s height was recorded while the person was wearing high-heeled
shoes.

• The recorded sales quantity for a product – which is sold in packages of twelve bottles – was one.

• The retail price per unit for liquor store data was supposed to include
tax, but not deposit. However, due to a coding error, the retail price per
unit for all products supplied by a certain vendor (Perfecta Wines) did
not include the tax.


Outliers are observations in a data set which are far away from the bulk
of the observations. They are typically too large or too small in value
compared to the rest of the data. A distribution with a few very large
values is likely to be skewed right. A distribution with a few very small
values is likely to be skewed left. A data set may have outliers due to
measurement errors (e.g., someone typed the wrong number). If this is
the case it makes sense to remove them from the data set before we do
any statistical analysis. Removing outliers may not be a good idea, if
they capture the behaviour of a sub-group within the population of data.
We will revisit outliers in chapter 8.

NB: Outliers
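
One common rule of thumb for flagging outliers, the 1.5 × IQR rule, can be sketched in R as follows. The sales quantities are hypothetical, and this is just one possible screen, not the textbook's prescribed method:

# Flag values far from the bulk of the data using the 1.5*IQR rule
x <- c(1, 2, 2, 3, 1, 2, 48, 2, 3, 1, 36)   # hypothetical sales quantities
q <- quantile(x, c(0.25, 0.75))             # first and third quartiles
iqr <- q[2] - q[1]                          # interquartile range
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]  # the flagged values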

2. Timeliness: A data set is timely if it contains information which is time-
relevant to the business problem that it will be used to address.

• Take a look at the worksheet estimates 2008-2018 in C02L1a. What would happen if you were to estimate the percentage of unredeemed deposits for 2019 and 2020, and the last year of available data was 2014?

3. Completeness: A data set is complete, if all the data points that are needed
to capture a transaction are available (i.e., there are no missing values).

• A complete transaction in our liquor store data (C02S) must contain
entries for the following fields: Store, Sales Date, Product Description,
Product Size, Sales Quantity, Retail Price per Unit, Deposit Unit and
Product Type.

• If any of these points is missing, it is going to be problematic and it
will inhibit management’s ability to make a decision based on available
data. For example, what would be the implication of missing product
description or sales quantity or store number?

2.5.3 Evaluate Data Quality (Reliability)

Work with your crew: Evaluate salesQuantity Let’s check the reliability
of the units sold (salesQuantity) data within the seminar data set (C02S).

1. What was the total number of units sold during the year (SUM salesQuantity)?


2. What was the highest number of units sold per transaction (MAX salesQuantity)?

3. What was the lowest number of units sold per transaction (MIN salesQuantity)?

4. What was the average number of units sold per transaction (AVERAGE salesQuantity)?

5. Review the histogram that you have created for salesQuantity.

(a) What terms will you use to describe it (symmetric, skewed left, skewed
right, unimodal, bimodal, and/or multimodal)?

(b) Create a filter to isolate transactions with salesQuantity > 15. How many
records do you have?

(c) Does it seem that these values are due to measurement error, or do they represent a sub-group of transactions?

It is quite common that, when people complete surveys, they may not answer all questions. This will result in observations where some piece of information is missing. In other cases, some information may be missing because the person collecting the data was unable to find it. If data errors have occurred during data collection, the data analyst may decide to remove these values from the data set (i.e., introducing missing values).

Missing data can be a serious problem when performing statistical analysis.
In chapter 9, we work with a data set that has a lot of missing values and learn
one possible way of dealing with missing data.

NB: Missing Values
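
In R, missing values are recorded as NA, and counting them is a one-liner; the vector below is hypothetical:

# Counting missing and non-missing values in R
x <- c(12.99, NA, 8.49, 10.25, NA)  # hypothetical prices with two gaps
sum(is.na(x))    # number of missing values (here: 2)
sum(!is.na(x))   # non-missing values, similar to =COUNT() on numeric cells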

Work with your crew : Evaluate productPriceUnit Continue checking
the reliability of the product price data within the data set C02S.

1. What was the highest product price per unit (MAX ProductPriceUnit NOT
RetailPriceUnit since retail price includes HST)?

2. What was the lowest product price per unit (MIN ProductPriceUnit)?
3. What was the median product price per unit (MEDIAN ProductPriceUnit)?


4. What was the average product price per unit (AVERAGE ProductPriceUnit)?
5. Review the histogram that you have created for ProductPriceUnit.

(a) What terms will you use to describe it (symmetric, skewed left, skewed
right, unimodal, bimodal, and/or multimodal)?

(b) Create a filter to isolate transactions with ProductPriceUnit > 70. How
many records do you have?

(c) Does it seem that these values are due to measurement error, or do they represent a sub-group of transactions?

6. Are there any missing values in this data set? Hint: Use the function COUNT or COUNTA to find the number of non-missing values in each column of the data set.

Do you have any concerns about the reliability of sales units and prices and the
data set in general?

2.5.4 Understand Business: Total Revenue
The SUM of the TotalfromCustomer represents all cash received from customers by
the Liquor Store. This SUM or total cash that the Liquor Store collects from a
customer can be split into three categories:

1. A deposit, which the company refunds when the customer returns the empty
bottle.

2. The sales tax (HST), which the company collects and transfers to the gov-
ernment.

3. Revenue: Out of all the money that LCBO collects from each customer, a portion is set aside to make payments through the Beer Store to customers who return their bottles. Another portion (i.e., taxes collected) goes to the government. What is left represents what accountants refer to as Revenue: the amount a company generates by selling products and/or services to customers.

Work with your crew: Total Revenue There are several ways that we
can generate this amount (Total Revenue). The first one is to generate all cash


generated from customers, and from this subtract the sum of deposits and the sum of taxes. This is shown in the following formula:

TotalRevenue = sum(TotalFromCustomer)
− sum(TotalDepositFromCustomer)
− sum(TotalHSTFromCustomer)

• The implementation of this solution using a small data set (33 transactions) is
shown in the worksheet L2 summaryTotalRevenue in C02L2.

• Replicate it with seminar data C02S.

The second approach would be to calculate the Revenue for each transaction and then take the sum, as shown in the following formula. Look very carefully at the position of the parentheses and what it means regarding the order of operations!

Revenue = TotalFromCustomer
− TotalDepositFromCustomer
− TotalHSTFromCustomer

Hence,

TotalRevenue = sum(Revenue)

• The sum shown at the bottom of the column Revenue in the worksheet L2 Formulas in C02L2 shows the implementation of this second solution using a small data set (33 transactions).

• Replicate it with seminar data C02S.
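
Both approaches can be sketched in R with a few hypothetical transactions (three made-up rows, not the worksheet data); because subtraction distributes over the sum, the two answers are identical:

# Hypothetical per-transaction amounts for three sales
totalFromCustomer   <- c(25.00, 12.50, 40.00)
depositFromCustomer <- c(0.20, 0.10, 0.40)
hstFromCustomer     <- c(2.85, 1.43, 4.56)

# Approach 1: take the totals first, then subtract
sum(totalFromCustomer) - sum(depositFromCustomer) - sum(hstFromCustomer)

# Approach 2: revenue per transaction, then sum
revenue <- totalFromCustomer - depositFromCustomer - hstFromCustomer
sum(revenue)  # same answer as approach 1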

Food for Thought – Discuss this with your crew Can you think of another way of calculating the same amount? Hint: Think about what the amount ProductPriceUnit represents! Can you leverage this?

Revenue Decomposition – Graphical Pie Chart

As we have seen in Figure 2.5, the pie chart is recommended in cases where we want to show composition (share of total of different groups). The objective of the following exercise is to show how to create a pie chart that shows the percentages of the total cash collected by the Liquor Store for each variable (i.e., TotalRevenue, HSTfromCustomer, and DepositfromCustomer).


1. Open the worksheet L2 summaryTotalRevenue in C02L2.

2. The three items TotalRevenue, HSTfromCustomer, and DepositfromCustomer are the components of totalFromCustomer. Therefore, these three items make up 100% of totalFromCustomer.

3. Highlight the range A3:B5. These cells contain the TotalRevenue, HSTfromCustomer, and DepositfromCustomer.

4. Click Insert and select Chart.

5. Click the three dots in the upper right corner of the graph and select Edit Chart.

6. If your chart is not a pie chart, use the Chart type to select pie chart.

7. Click Customize, select Chart & axis titles and change the chart title to "Composition of Total from Customer".

8. What is the percentage of the Total Cash from Customers that the liquor store retained as revenue?

Work with your crew Replicate this graph using seminar data C02S.

2.6 Key Lessons from this Chapter
1. Descriptive statistics help provide an understanding of a data set and highlight potential issues with the reliability of data.

2. Great financial professionals use data (e.g., averages based on history) to inform estimates or assumptions.

3. Great financial professionals use visualizations (graphs) to communicate data and frame questions for additional analysis.

2.7 Preview of Next Chapter
We will spend the week learning how to drill down on data variables and categories
to better understand the business using pivot tables; e.g., Understanding sales units,
prices, and total revenue for premium versus regular products.

The graph in Figure 2.6 was created using the same approach that you learned in the seminar exercise 2.4.1 (p. 34). A problem with such a graph is that there are 79 stores. That is a lot of data to try and absorb. Start discussing with your crew what kind of aggregation or categories we could create to resolve this problem (i.e., understanding sales units, prices, and total revenue by store categories).

Figure 2.6: Revenue Per Store


Chapter 3

Segment Analysis

3.1 Learning Objectives
The theme for this chapter is segment analysis.1 Approaching this from a purely
heuristic standpoint, this means using pivot tables to slice and dice our data. However, our approach will be to go beyond the simple keystrokes needed to create a pivot table. We will approach the segmentation analysis as the creation of an algorithmic solution to a business problem. By the end of this chapter, students should be able to:

1The image is from the Wikipedia article about the Pink Floyd album The Dark Side of the Moon. By Source, Fair use, https://en.wikipedia.org/w/index.php?curid=18421376

1. Understand generally accepted good practices for writing simple algorithms.

2. Use such practices to develop new variables based on mathematical formulas
or logical conditions (If statements).

3. Use logical IF statements to assign categorical values (e.g., ProductType of
premium or regular) to rows in a data set.

4. Generate pivot tables that use categories to answer specific questions about
the business (e.g., which product type generated the most revenue?)

5. Use the CRISP-DM model to explain when, why, and how we use categorical
values and pivot tables.

3.2 Students: Advance Preparation
The worksheet studentPreparation provides suggested videos/reading material
that you should view/review. Students should prepare the following material for
this chapter:

1. Using pivot tables
2. Using if functions and nested if functions.
3. Using formulas and functions.

3.3 Lecture 1

3.3.1 Model: Business Segment Analysis
Picture the unsolved Rubik’s cube in Figure 3.1. It’s colourful and fun to play with.
We know if we make an appropriate set of moves, we get a complete picture of one
side of the cube (e.g., all the red squares on the same side). Then if we make more
moves, we get a complete cube with the same colour on each of the six sides of the
cube.

Figure 3.1: Segment Analysis Compared With Solving Rubik's Cube

A dataset is like an unsolved Rubik's cube. It is a mixture of different data variables (coloured squares) just waiting for an analyst to provide insight into the business. Imagine that the red squares are product segments (e.g., premium or
regular products) and the green squares are sales channels (e.g. store 1, store 2,
store 3, etc.). The analyst creates a model to analyze these business segments by
making a series of moves:

1. Understand the business and its potential segments: e.g., products, customers,
channels, resources, and activities.

2. Understand the data: e.g., available variables, the type of data represented by the values for each variable, and any available categorical values.

3. Identify available categories and potential categories that could logically be
created using the variables in the data set (document each category/business
segment using a flow chart).

4. Prepare the data for analysis by adding a category as needed to the data set,
then using logical IF statements to assign categorical values for each record in
the data set.

5. HAVE FUN! Use pivot tables to model questions about each category/business


segment until we have an interesting picture for each category (i.e., the equiv-
alent of solving for the colours in the Rubik’s cube).

6. Communicate the questions and answers from the analysis, highlighting key
issues and opportunities.

3.3.2 Illustration of Segment Analysis
Typically, segment analysis is performed based on categorical variables. For this
illustration, we will use the categorical variable that we created in the seminar for
chapter 1 (see p. 15). We created a simple categorization of product types: Premium
versus Regular. This means that we have already completed most of the first four steps:

1. We first understood the business by reading the LCBO annual report, learning
that the company sells alcoholic products to customers in Ontario through its
retail stores (sales channel).

2. We understood the data set provided: e.g., sales transactions for many stores with variables that included ProductDescription, salesQuantity, RetailPriceUnit, DepositUnit, and ProductPriceUnit.

3. We used an available variable – ProductPriceUnit – to logically create premium and regular category values. More specifically, we prepared the data for analysis when we:

(a) added the ProductType category variable/column to the data set,
(b) created a logical IF statement (IF ProductPriceUnit is greater than 20, THEN Premium, ELSE Regular; see the example formula after this list), and
(c) used the IF statement to assign categorical values for each record in the data set.
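
For reference, one way to write that IF statement as a spreadsheet formula, assuming (hypothetically) that ProductPriceUnit sits in column I and the first data row is row 2, is =IF(I2>20, "Premium", "Regular"), copied down the ProductType column.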

We can now HAVE FUN playing with a pivot table that lets us organize the
data by category and for each category see any kind of descriptive statistic that we
want to examine. Figure 3.2 lets us compare the sum, average, min, and max of the
Revenue for premium products versus regular products.

Work with your crew: Interpret/Communicate Pivot Table Results
What are the answers to the following questions that we can communicate to com-
pany management based on this pivot table?


Figure 3.2: Pivot Table: Comparison of Premium vs. Regular Products

1. Which of the two product types – premium or regular – generated more revenue?

2. What is the average revenue per transaction collected from a customer who purchased a premium product?

3. What is the lowest revenue per transaction collected from a customer who purchased a regular product?

4. Which is larger: the maximum revenue per transaction for a premium product, or the maximum revenue per transaction for a regular product?

3.3.3 The Logic Behind the Creation of a Pivot Table
A pivot table is a table that summarizes the data from a larger data set. The logic
behind the creation of a pivot table is the same as the logic behind the creation of
an aggregate query that lets you extract and summarize data from a database.2 Queries are very popular because their structure is very simple: we select the variables that we want to focus on, we specify how we want to group them, and we state what kind of summary statistics we want to see for each group. If we apply this logic to the pivot table shown in Figure 3.2, we have the following steps:

1. Select the categorical or binary variable that we want to focus on (i.e., pro-
ductType) from your data set.

2. Since the variable productType takes two values (Regular or Premium), we can
group by this variable. This is the equivalent of creating a subset of data that
captures all sales of Regular products and another one for Premium products.

2We will learn about queries in the Foundation of Data Mining course.


3. For each one of these groups, we can aggregate the revenue data (i.e., generate
summary and/or descriptive statistics). In other words, we have a line that
shows the summary and/or descriptive statistics for revenue from premium
products, and another one for regular products.
The pivot table shown in Figure 3.2 simply puts together these two lines.
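
The same group-and-aggregate logic can be sketched in R; the four-row data frame below is hypothetical, and aggregate() plays the role of the pivot table:

# Group by ProductType and aggregate Revenue (hypothetical data)
sales <- data.frame(
  ProductType = c("Premium", "Regular", "Premium", "Regular"),
  Revenue     = c(45.10, 12.30, 88.00, 9.75)
)
aggregate(Revenue ~ ProductType, data = sales,
          FUN = function(x) c(sum = sum(x), avg = mean(x),
                              min = min(x), max = max(x)))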

We can translate this logic into the steps that we will have to follow in order to
create this pivot table.

1. From the Google Sheet menu select Data and from the drop-down menu Pivot
Table.

2. In the new window we see that the data set is selected and we are asked to
choose where to place the pivot table. Choosing a new sheet is recommended.

3. A new sheet will open and the Pivot table editor will show on the right.

(a) Select the categorical or binary variable that we want to focus on. Click on Add in Rows and select productType from the drop-down menu. Two rows will be added in the pivot table (Premium and Regular). As mentioned above, this means that we are creating the equivalent of two subsets (Group by): Premium and Regular.

(b) For each group we want to aggregate (sum) Revenue. Click on Add in Values and select Revenue from the drop-down menu. The default under Summarize by is SUM. The sum of revenue will show on the pivot table next to each group.

(c) To add the average revenue to the pivot table, repeat the above process (i.e., Add in Values and select Revenue) and select AVERAGE under Summarize by.

3.3.4 Understand Seminar Data Set
For the seminar, we will use the data set C03S F2021. It is the same data set that was used to create the pivot table (Fig. 3.2). It has 2220 lines (observations or sales transactions) and the following nine columns (variables):

1. Store = store number
2. Description = name of product, e.g., F Coppola Dmd Ivry Cab Svgn, Terroirs du Rhone.
3. Size = classification of product sold in terms of size, e.g., 750mL.


4. salesQuantity = units sold.
5. RetailPriceUnit = price per unit. This INCLUDES the sales tax (HST)!
6. SalesDate = The sales date.
7. PackageSize = product size in mL, e.g., 750, 1000, 5000
8. Classification = 1 for liquor and 2 for wine.
9. VendorName = the name of the supplier.

3.3.5 Developing a Simple Algorithm
In order to create a new variable such as productPrice (see p. 13) that does not include HST, we used the information that the retailPrice contains HST, which is 13%. Developing a new variable using mathematical or logical operations is by itself a problem to solve that requires writing a simple algorithm. Therefore, it is a good idea to follow some generally accepted good practices for writing algorithms.

Define the problem by identifying the list of inputs (relevant data), list of
outputs, and process (i.e., actions needed to produce the output). Check
the solution.

NB: Good practice for writing algorithms

In the context of creating the new variable, we have the following:

1. Inputs: retailPrice, and HST=13%

2. Output: productPrice

3. Process: Remove HST from retailPrice.
We can further break down the process into the following steps:

(a) Summarize/describe the process (problem) into a simple sentence. For
example, create a new variable productPrice that does not include the
sales tax (i.e., remove HST from retailPrice).

(b) Convert the description into a mathematical formula. You can think of
this stage as writing a pseudo-code. Given that HST is 13% on top of
the product price, we can express the retail price as a function of product
price and HST as follows:


RetailPriceUnit = ProductPriceUnit ∗ 1.13

Therefore,

ProductPriceUnit = RetailPriceUnit / 1.13

Alternatively, we can simply state our pseudo-code as follows: To create
the ProductPriceUnit divide retailPrice by 1.13.

(c) Write the actual code (spreadsheet formula). For example, in the context
of our data set (C03S F2021) the formula would be =E2/1.13

4. Check the Solution

(a) On a piece of paper, write down a few observations from retailPrices (input).
(b) Next to each one of them, and using your proposed solution, write down the value of the productPrice (output).
(c) Compare the answers in your hand-written solution with those in the spreadsheet.
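
The check can also be done outside the spreadsheet; here is a minimal R sketch with three hypothetical retail prices chosen so the expected outputs are easy to verify by hand:

# Remove the 13% HST from hypothetical retail prices
retailPrice <- c(11.30, 22.60, 45.20)
productPrice <- retailPrice / 1.13
productPrice  # expect 10, 20, 40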

Logical Statements for Product Classification

The column (Classification) in our data set (C03S F2021) categorizes products as
either liquor or wine.

1. If the product is a liquor product, then the value in the Classification column
equals “1”.

2. If the product is a wine product, then the value in the Classification column
equals “2”.

These two categories are mutually exclusive: each product MUST be 1 or 2. This
categorization scheme is an example of a binary type of data (see 1.3.3 for discussion
on data types). This means there are only two possible categories.


Work with your crew: Logical Statements for Products Sold in Large
Quantities The objective of this guided/interactive exercise is to develop a new
categorical variable, while following the generally accepted good practices for writing
algorithms (3.3.5). This means to define input(s), output(s), process and check the
solution.

Problem description: Create a new variable/column (highQuantity) to classify products in terms of the units sold. If units sold are 1 or 2, classify as "lowQuantity". If units sold are more than 2, classify as "highQuantity".

1. Select input(s) from the list of variables. If necessary, we may want to select more than one.

(a) Store
(b) Description
(c) Size
(d) salesQuantity
(e) RetailPriceUnit
(f) SalesDate
(g) PackageSize
(h) Classification
(i) VendorName

2. Provide one or more words to capture the desired output(s).

3. Work with your crew to develop the process:

(a) Summarize the problem into a simple sentence.
(b) Prepare the pseudo-code.
(c) Write the actual code (spreadsheet formula).

4. Check your Solution.

3.3.6 Seminar Preview
We have identified and/or created three possible ways of categorizing the data set
for this chapter:

• Product Type: Premium or Regular

• Product Classification: Liquor (1) or Wine (2)

• High Quantity: highQuantity or lowQuantity


We have the equivalent of three (3) sides of a Rubik’s cube structured for analysis.
There are other additional categorization schemes possible with the data set for this
chapter. In the seminar, we will use one of these other categories. The variable
(Store) column only has four possible values: 1, 2, 3, or 4. Each store number (1 to 4) is a categorical value, representing a sales channel used to sell products to customers.

We will HAVE FUN with some of these categories, learning to use pivot tables
in the seminar to analyze business segments.

3.4 Seminar
As mentioned in 3.3.4, the data set for this chapter is C03S F2021. The data
set contains sales from four stores (1-4) and for one day (June 30th, 2017).3 The
objective of the exercises is to analyze the performance of different types of products
and the performance of the four stores. For example, we want to find which product
type and which store generates the most and least sales (units), the most and least
revenues, and the average selling price. More information will be provided below.

3.4.1 Prepare the Data Set: Create New Variables

Work with your crew: The data set is missing some of the variables needed
to perform the analysis. The objective of this exercise is to prepare the data set for
segment analysis.

• Estimated completion time for this exercise is 10 minutes.

1. Add the column/variable “ProductPriceUnit.” Remember that the Retail-
PriceUnit includes HST (cash collected for taxes remitted to the government)
so we need to calculate the product price excluding HST.

(a) Define the input(s).
(b) Describe the problem and convert it into a mathematical formula that we want to use. Hint: RetailPriceUnit = ProductPriceUnit x 1.13.
(c) Translate your mathematical formula to a spreadsheet formula.
(d) Apply the formula to a few lines and manually check the results.
(e) Use the spreadsheet formula to populate ProductPriceUnit in all rows.

3See p. 49 for variable description.


2. Add the column/variable "ProductType." Remember we categorize each row as Premium (IF ProductPriceUnit is greater than 20) ELSE as Regular.

(a) Define the input(s).
(b) Describe the problem and convert it into a logical formula that we want to use.
(c) Translate your formula to a spreadsheet formula.
(d) Apply the formula to a few lines and manually check the results.
(e) Use the spreadsheet formula to populate ProductType in all rows.

3. Add a column/variable named "Revenue." Remember companies generate Revenue by selling products to customers, so Revenue does NOT include HST or Deposits.

(a) Define the input(s).
(b) Describe the problem and convert it into a mathematical formula that we want to use. Hint: Revenue for this type of business model is driven by the Product Price per unit and the number of units of each product sold.
(c) Translate your formula to a spreadsheet formula.
(d) Apply the formula to a few lines and manually check the results.
(e) Use the spreadsheet formula to populate Revenue in all rows.

4. Add a column/variable named "highQuantity." Remember we categorize each row based on units sold. An order needs to have a sales quantity of more than two to be classified as highQuantity; otherwise it is lowQuantity.

(a) Define the input(s).
(b) Describe the problem and convert it into a mathematical or logical formula that you want to use.
(c) Translate your formula to a spreadsheet formula.
(d) Apply the formula to a few lines and manually check the results.
(e) Use the spreadsheet formula to populate highQuantity in all rows.

3.4.2 Segment Analysis/Modeling: Create Pivot Tables

Work with your crew: Segment Analysis by Store The objective of this
exercise is to create a pivot table that organizes data by store. Structure the table
so “stores” 1, 2, 3, 4 are the rows in your pivot table.


• Estimated completion time for this exercise is 15 minutes.

For each one of the stores, you will need to calculate the following information:

1. total revenues (sum of Revenue)
2. total units sold (sum of salesQuantity)
3. maximum sales quantity
4. average sales quantity
5. minimum product price per unit
6. maximum product price per unit
7. average product price per unit

Review your Store category pivot table results. Do they make sense? What did you learn about each of the four (4) stores?

Work with your crew: Segment Analysis by Product Type The objec-
tive of this exercise is to create a pivot table that organizes data by product type.
Each product type – premium and regular – should be a row in the pivot table.

• Estimated completion time for this exercise is 15 minutes.

For each product type you should calculate the following information:

1. total revenues (sum of Revenue)
2. total units sold (sum of salesQuantity)
3. maximum sales quantity
4. average sales quantity
5. minimum product price
6. maximum product price
7. average product price

Review your ProductType category pivot table results. Do they make sense? What did you learn about each of the two (2) product categories?

We can copy the spreadsheet containing the stores pivot table and create a new spreadsheet. To create a new pivot table to analyze product types, edit the stores pivot table to add the "productType" field into Rows and remove the "store" field from Rows.

NB: Time-Saving Hint


Work with your crew: Segment Analysis by Product Classification
The objective of this exercise is to create a pivot table that organizes data by product classification: liquor or wine. Remember that 1 = liquor and 2 = wine.

• Estimated completion time for this exercise is 5 minutes.

For each classification generate the same information as you generated for the stores:

1. total revenues (sum of Revenue)
2. total units sold (sum of salesQuantity)
3. maximum sales quantity
4. average sales quantity
5. minimum product price per unit
6. maximum product price per unit
7. average product price per unit

3.4.3 Segment Analysis/Modeling: Two Way Pivot Table

Work with your crew: Sales by Store and Product Classification The
objective of this exercise is to create a pivot table that organizes data by store (in
rows) and product classification (in columns).

• To create a two-way pivot table you will have to use two binary or categorical variables. We will have to Add one of them (i.e., Store) in Rows, and the second one (i.e., Classification) in Columns.

• Typically, you want to have only one aggregate value.
• Estimated completion time for this exercise is 5 minutes.

1. For each store and classification you must calculate the total revenues (sum of total revenue).

2. Review your two-way pivot table results. Do they make sense? What additional
insights did you learn about each of the stores and the liquor and wine product
classifications?

If you finish before the end of the seminar, PLAY with other combinations of the ProductType, Stores, and Classifications segments to generate additional pivot tables and business insights.


It is possible to Add two or more categorical variables in Rows and/or Columns of a pivot table. This does NOT mean that you should do it. The pivot table becomes difficult to read and your audience may not understand it. Try to keep your two-way pivot table simple: just one categorical variable in rows, one categorical variable in columns, and one aggregate variable in values.

NB: Keep Two-Way Tables Simple

3.5 Lecture 2

3.5.1 Class Exercises: Seminar Debriefing
Review all of the steps completed during the seminar exercises. Focus on asking questions you can answer using pivot tables to communicate the results of the analysis.

1. What is the input for the creation of the variable productPrice?

(a) Store
(b) Description
(c) Size
(d) Sales Quantity
(e) Retail Price Unit
(f) Sales Date
(g) Volume
(h) Classification
(i) Vendor Name

2. The variable highQuantity is binary. It takes only two values.

(a) True
(b) False

3. Answer Based on Pivot Table for Store Sales: Which store generated the highest
sum of revenues?

(a) store 1
(b) store 2
(c) store 3
(d) store 4


4. Answer Based on Pivot Table for Store Sales: The ratio of revenues between
the store with the highest sum of revenues and the lowest sum of revenues is
approximately …

(a) 2 to 1
(b) 10 to 1
(c) 20 to 1
(d) 100 to 1

5. Answer Based on Pivot Table for Product Type: The firm generates more revenues from high-priced products (i.e., premium products) than regular products.

(a) True
(b) False

6. Answer Based on Two-Way Pivot Table for Store and Product classification.

(a) How much revenue does the company generate from the sales of liquor
(classification =1)?

(b) How much revenue does the company generate from the sales of wine
(classification =2)?

(c) How much revenue does store 1 generate from the sales of wine (classifi-
cation =2)?

(d) How much revenue does store 1 generate from the sales of liquor (classifi-
cation =1)?

7. Approximately what percentage of revenues of store 2 are coming from sales of
liquor (classification =1)? Report just a number, for example 90%.

3.5.2 Show As Pivot Table Options
Figure 3.3 shows the two-way pivot table that we created in the seminar (3.4.3). It
shows stores 1 to 4 as rows and product classification (liquor = 1 and wine = 2) as
columns. Note that the values in the pivot table are dollar value amounts calculated
using the SUM of Revenue.

When we create a pivot table, one of the choices that we have to make is
Summarize by. The available options are SUM, COUNTA, AVG, MIN, MAX, etc.


Figure 3.3: Two Way Pivot Table

Figure 3.4: Pivot Table: Show as

Another choice that we can make is Show as. This provides us with the following options: Default, % of row, % of column, and % of grand total. See Figure 3.4.

In the seminar debriefing exercise 7 (p. 58), we wanted to calculate what percentage of revenues of store 2 are coming from sales of liquor (product type = 1). This means that we wanted to express the revenues that the store generated from the sales of liquor as a percentage of the total store revenue. Since stores are represented in rows of the pivot table, we can answer this question by selecting to show our results as % of row.
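
The three Show as options correspond to row, column, and grand-total proportions of a two-way table; a minimal R sketch with hypothetical revenue figures:

# "Show as" percentages via prop.table() (hypothetical revenues)
tab <- matrix(c(120, 80, 60, 90), nrow = 2,
              dimnames = list(store = c("1", "2"),
                              classification = c("liquor", "wine")))
prop.table(tab, margin = 1)  # % of row
prop.table(tab, margin = 2)  # % of column
prop.table(tab)              # % of grand total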

1. Work with your Crew: Make a copy of the two-way pivot table store – classifications – revenues. Change the Show as from Default to % of row. Use the results to answer the following question: Which store generates most of its revenues from liquor? Remember: liquor = 1, wine = 2.

(a) Store 1
(b) Store 2
(c) Store 3


(d) Store 4

2. Work with your Crew: Make a copy of the two-way pivot table store – classifications – revenues. Change the Show as from Default to % of column. Use the results to answer the following question: Out of all revenues generated from sales of wine, which store contributes the largest percentage?

(a) Store 1
(b) Store 2
(c) Store 3
(d) Store 4

3. Work with your Crew: Make a copy of the two-way pivot table 'store – classifications – revenues'. Change the Show as from Default to % of grand total. Use the results to answer the following question: Write a short sentence to describe the entry at the cell in the upper left (store = 1 and product classification = 1).

3.5.3 Categories: Other
The (Size) column in this chapter’s data set has several values that repeat frequently
and a few values that repeat very infrequently. The frequently repeating values (e.g.,
750, 1000, 1500, 1750) could each be a categorical value. Creating a graph or pivot
table based on such data makes the resulting graph or table difficult to read.

Work with your crew: Create a Bar Chart for the variable Size

To simplify the presentation of results (and minimize the number of rows in a pivot table), an "Other" categorical value could be included that groups together all package sizes that are not common. By including the "Other" categorical value, we ensure that the categorization structure is collectively exhaustive; in other words, nothing is left out.

Work with your crew

• Work with your crew and try to understand the logic of the formula used to create the variable sizeOther:
=IF(OR(C2="750ml", C2="1.5L", C2="Liter", C2="1.75L", C2="50mL"), C2, "Other")4

4Similarly, in situations where we need to combine two conditions that must both be true, we can use =IF(AND(…), …, …).


In the context of the good practice for writing algorithms, this exercise is, in a way, reverse engineering. You have the final code (spreadsheet formula), and your objective is to translate it into a plain-English description.

• Create a bar chart based on this new variable
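
A rough R translation of the same logic, using %in% in place of the chain of OR conditions (the Size values below are hypothetical):

# Keep common sizes as-is, lump everything else into "Other"
size <- c("750ml", "3L", "1.5L", "Liter", "250mL", "1.75L")
common <- c("750ml", "1.5L", "Liter", "1.75L", "50mL")
sizeOther <- ifelse(size %in% common, size, "Other")
sizeOther  # "750ml" "Other" "1.5L" "Liter" "Other" "1.75L"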

Remember that when we prepare formulas in spreadsheets, we must enclose text in double quotation marks. We do NOT enclose numbers in double quotation marks.

3.6 Key Lessons from chapters 1-3: Product Sales
Review the CRISP-DM steps:

1. business understanding

2. data understanding

3. data preparation

4. modeling

5. evaluation and communication.

Highlight the key concepts and related activities for each of the following CRISP-
DM steps completed during chapters 1, 2 and 3 in the course.

Business Understanding

• chapter 1: Learned to identify products and key activities by reading a com-
pany’s annual report.

• chapter 1: Learned about the concept of a deposit, specific to the liquor store’s
business model.

• chapter 2: Learned that companies collect sales tax (HST in Ontario) and
remit the taxes to the government.


• chapter 3: Learned that we can analyze a business by creating categories for
each type of business segment; for example, products can be split into premium
or regular categories AND products can also be split into classification (liquor
or wine).

Data Understanding

• chapter 1: Learned that there are many possible sources of data (e.g., business
transactions, location aware devices, and social media).

• chapter 1: Learned about the different types of data: binary, categorical, ordi-
nal, and numerical.

• chapter 2: Learned to use descriptive statistics and visualizations to understand
available data.

• chapter 3: Learned to identify variables that can be used as categories (e.g., store category with categorical values of 1, 2, 3, and 4).

• chapter 3: Learned to identify variables that can be used to create categories (e.g., Product Price per Unit used to create ProductType categories of Premium or Regular).

Data Preparation

• chapter 1: Learned how to organize data using rows, columns, and cells in a
spreadsheet.

• chapter 1: Learned to translate the description of a variable into a mathemat-
ical formula and then into a spreadsheet formula.

• chapter 3: Learned to use flow charts and IF statements to create Mutually
Exclusive Collectively Exhaustive (MECE) categorical values.

Modeling, Evaluation and Communication

• chapter 2: Learned how a company (LCBO) uses data and a model (to account
for deposits) for financial reporting purposes.

• chapter 3: Learned to use pivot tables to create models that analyze business
segments (e.g., products and stores) using various categories.


• chapter 3: Learned that great financial professionals STOP and THINK after
creating a model (pivot table). Do the results make sense? What did we learn
from the analysis?

3.7 Preview of Next Chapter: Product Cost Data
In the previous chapters, we applied the CRISP-DM steps using data for product
sales for the liquor store company. In the next chapter, we reinforce what we learned
by using a new data set that provides data on product costs for the liquor store
company.


Chapter 4

Applying CRISP-DM

Synergy is the creation of a whole that is greater than the simple sum of
its parts.1

4.1 Learning Objectives
The theme for this chapter is to put together all steps in CRISP-DM. Approaching a
data analytics problem from a CRISP-DM standpoint means that we can provide an
integrated solution to a problem that is more useful than the isolated presentation
of just one or two steps (e.g., just the presentation of the model). By the end of this
chapter, students should:

1. Feel more comfortable applying each of the CRISP-DM steps.

2. Feel more confident in using logic, mathematical formulae, spreadsheet formu-
lae, descriptive statistics, and pivot tables.

3. Describe a simulated experience in which they completed the CRISP-DM steps
to: Assess a Situation; Identify and Analyze “Unstated” Issues; and, Commu-
nicate Results.

4.2 Students: Advance Preparation
The worksheet studentPreparation provides suggested videos/reading material that you should view/review. If you have not done this yet, you should complete

1From the Wikipedia article on synergy (https://en.wikipedia.org/wiki/Synergy)


https://docs.google.com/spreadsheets/d/1Ekt0bE4LqABehMM9EyPdhshBaZlAOWN4ZfJ0QYDyAEA/


the suggested DataCamp assignments and review/prepare the following material:

1. Using pivot tables.
2. Using IF functions and nested IF functions.
3. Using formulas and functions.

4.2.1 Overview of the Week
In this chapter, we will walk through the CRISP-DM steps using an updated liquor store data set to demonstrate how you use analytics in a general problem-solving process (e.g., from AFM 111). Table 4.1 provides the mapping between CRISP-DM and the five-step process. We will perform the following problem-solving and CRISP-DM steps:

CRISP-DM (AFM 112)                          Five-Stage Problem-Solving Process (AFM 111)

1. Business Understanding                   1. Assess the Situation

2. Data Understanding,
3. Data Preparation, and
4. Modeling (to identify issues), or        2. Identify and Analyze Issues

4. Modeling (to compare alternatives)       3. Develop and Analyze Alternatives

5. Evaluation and Communication             4. Decide, Recommend, Communicate

6. Deployment                               5. Implementation

Table 4.1: Mapping: CRISP-DM to Five-Stage Process

1. Assess the Situation: Lecture 1

• Role and requirement (in mini-case)
• Business Understanding (CRISP-DM step 1)

2. Identify and Analyze Issues: Lecture 1

• Data Understanding (CRISP-DM step 2)
• Data Preparation (CRISP-DM step 3)


3. Identify and Analyze Issues: Seminar

• Complete Data Preparation (if needed)
• Modeling (CRISP-DM step 4): to answer questions and identify issues and opportunities

4. Communication: Lecture 2

• Evaluation (CRISP-DM step 5): discuss findings from seminar
• Communication of findings (CRISP-DM step 5): practice visualization lessons

4.2.2 Liquor Store Limited Mini-Case
You are in your first work term as a Business Analyst on the Finance Team at
Liquor Store Limited (LSL), a retailer that sells liquor and wine products. LSL does
not produce or make any products; it purchases the products from many different
suppliers and sells those products to consumers who shop at one of the company’s
stores. LSL rents space in busy shopping malls and hires knowledgeable staff to work
in the stores.

Your first assignment at LSL was to analyze product sales for four (4) of the com-
pany’s stores. You completed this assignment quickly and reliably. Ann Bosch, your
Finance Team Manager, was impressed with your work so when the Product Man-
agement Team Leader sent a message requesting additional analysis, Ann assigned
you to do the work.

Product Management Team Leader’s Message: I just looked at the recent
income statement (Figure 4.1) and the Gross Profit is lower than expected. I need
to understand why it’s lower to help my team identify what issues are causing the
problem. Please analyze the attached data set (C04S)2 and answer these questions:

1. What are the total Sales Quantities, Revenues, Product Costs, and Gross Profit
for each of the four stores and the total for all four stores?

2. For each of the four stores, what is the average, maximum, minimum, and
median Gross Profit per transaction?

2Given the size of the dataset, you will have to save a local copy of the file on your hard drive and then upload/open it in your Google drive. If your browser automatically opens this document with Excel, you should change your settings to be prompted to save the file.

https://docs.google.com/spreadsheets/d/1RCjoxquWCkO79uRiStiOA7AwacQH0g4vU8lG16LW09Y/


3. For each of the four stores, what are the total Sales Quantities, Revenues, and
Gross Profit for each product type (regular vs premium) and each product
classification (liquor vs wine)?

4. What are the total units sold on each day of the week for each of the stores?
What are the total units sold on each day of the week for each product type
(premium vs regular)?

Figure 4.1: Extract from LSL Income Statement

Required: Use the CRISP-DM process from your AFM 112 course to Assess the
Situation and Identify and Analyze Issues to answer the Team Leader’s questions.

4.3 Lecture 1

4.3.1 Assess the Situation: Role and Required Output
Start at the beginning of the problem-solving process, after reading the Liquor Store Limited (LSL) mini-case, by Assessing the Situation. In this situation, think about who you are (role) and what you've been asked to do (required output).


1. What is your role in the LSL mini-case?

2. What are you required to prepare? What are you not required to do?

3. Who are the key stakeholders? What are they expecting? In your role, what
do you want to achieve?

Let’s use the CRISP-DM steps to:

• Finish Assessing the Situation by understanding the business (CRISP-DM step
1).

• Identify and analyze the data set by understanding the data (CRISP-DM step
2), preparing the data (CRISP-DM step 3), and preparing segment analy-
sis/modeling (CRISP-DM step 4) using pivot tables to answer the questions.

4.3.2 Assess the Situation: Business Understanding
Stop and think. What facts do we have in the case about LSL? What else have we learned about the business in the previous chapters?

1. What does the business do? How does it operate?

• Liquor Store Limited (LSL) is a retailer that sells liquor and wine prod-
ucts.

• LSL does not produce or make any products. It purchases the products
from many different suppliers, incurring a product cost, which will be
accounted for in Cost of Goods Sold.

• LSL sells to consumers who shop in one of the company’s stores, incurring
additional expenses/costs to operate the business; e.g., rent for the stores
and wages for knowledgeable staff who work in the stores.

• So LSL must sell the product to the consumer at a price (retailPriceUnit)
higher than the amount paid to the supplier (productCostUnit) in an
attempt to earn a profit (net income = revenue – expenses).

2. What measure/variable did the team leader ask you to analyze?

• The team leader is focused on the Gross Profit measure, also called Gross
Margin.


• Gross Profit is a measure on a company’s income statement, but it is NOT
the bottom-line Net Income.

• Gross Profit for a retailer is calculated as Revenue less the purchase cost
of the products actually sold (typically called Cost of Goods Sold).

• An illustrative income statement for LSL is provided in Figure 4.1 showing
Revenue, Cost of Goods Sold, and Gross Profit.

• So the logical description of the measure/variable you need to analyze is:
Gross Profit equals Revenue less the Product Cost for quantities actually
sold.

3. What categories/categorical variables did the team leader mention in the ques-
tions?

• Stores: the data set has data for four (4) stores
• Product Type: premium vs regular products
• Product Classification: liquor (1) vs wine (2)
• Day of the Week (NEW): e.g., Monday, Tuesday, etc.

4.3.3 Identify and Analyze Issues: Data Understanding
The data set for this chapter (C04S) was provided by the Product Management
Team leader. It has 10,853 lines (transactions) and the following ten columns (vari-
ables):

1. Store = store number.
2. Description = name of product, e.g., F Coppola Dmd Ivry Cab Svgn, Terroirs du Rhone.
3. Size = classification of product sold in terms of size, e.g., 750mL.
4. salesQuantity = units sold.
5. retailPriceUnit = price per unit. This INCLUDES the sales tax (HST)!
6. salesDate = the sales date.
7. Volume = product size in mL, e.g., 750, 1000, 5000.
8. Classification = 1 for liquor and 2 for wine.
9. vendorName = the name of the supplier.


10. productCostUnit = the purchase cost paid to the supplier for a unit of the
product.

The data set does NOT include a variable called Gross Profit, but we have learned
how to create new variables.

• In chapter 3, we developed a simple algorithm to calculate productPriceUnit
(see p. 50).

• During the seminar for chapter 3, we added new variables to the data set for
productPriceUnit and Revenue (see p. 53).

• The data set includes productCostUnit, which we can use to calculate a new
variable called productCost for the salesQuantity of the product sold.

• Then we can calculate a new variable called grossProfit as follows:
grossProfit = Revenue − productCost.

The data set includes a variable called salesDate (e.g., 2019-06-27), which we
can convert into a day of the week – Monday, Tuesday, etc. – using a spreadsheet
function.

We now understand the data available in the data set, the data required to address
the team leader’s request, and how we can prepare the data to address any gaps. It’s
time to prepare the data.

4.3.4 Identify and Analyze Issues: Data Preparation
Data Preparation Tasks

To prepare the data for modeling, we need to complete the following steps:

1. Add three variables/columns to the new data set provided: productPriceUnit, productType (premium or regular), and Revenue.3

2. Develop a simple algorithm to calculate two new variables: productCost and
grossProfit.

3. Add the two new variables/columns to the new data set provided: productCost
and grossProfit.

4. Add one more new variable/column to the new data set to capture the day of
the week: Day.

3If you want to save some time, copy and paste the headings and formulae created during the
seminar of chapter 3.


Simple Algorithm for productCost & grossProfit

In chapter 3, we created a simple algorithm to create a new variable productPrice that does not include HST. We now need to create a new variable grossProfit, using the understanding that the variable equals Revenue less productCost. We can write a simple algorithm to calculate grossProfit and follow the generally accepted good practices for writing algorithms.

1. Define the problem by identifying the list of inputs (relevant data), list of
outputs, and processes (i.e., actions needed to produce the output). In the
context of creating the new variable, we have the following:

(a) Inputs: Revenue, productCostUnit, and salesQuantity
(b) Output: grossProfit
(c) Process:

i. Calculate productCost and subtract productCost from Revenue.

ii. Translate the description into simple mathematical formulas (pseudocode). Calculate productCost first:

productCost = productCostUnit × salesQuantity

Then subtract productCost (productCostUnit × salesQuantity) from Revenue:

grossProfit = Revenue − productCost

iii. Write the spreadsheet formulae (see the sketch after this list).

2. Check the Solution.

(a) Write down a few observations for the three inputs: Revenue, productCostUnit, and salesQuantity.

(b) Next to each one, using your proposed solution, write down the value of productCost (productCostUnit × salesQuantity) and grossProfit (the output).

(c) Compare the answers in your hand-written solution with those in the spreadsheet.
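For illustration only, here is one way the spreadsheet formulae could look. The column letters are assumptions (salesQuantity in column D and productCostUnit in column J, as in the original data set, with Revenue in column M and productCost in column N), so adjust them to your own layout:

=D2*J2     productCost (e.g., placed in column N): salesQuantity times productCostUnit
=M2-N2     grossProfit: Revenue minus productCost

After entering the formulas in row 2, populate all rows and spot-check a few results by hand, as described above.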

4.3.5 Seminar Preview
In the seminar, we will complete the data preparation tasks (see p. 70) and then use
pivot tables to model answers to the team leader’s questions.


4.4 Seminar

4.4.1 Complete Data Preparation Tasks
• The objective of the exercise is to prepare the data set (C04S), adding variables needed for segment analysis to answer the team leader's questions.

• Estimated completion time is 5–10 minutes.

Work with your crew: Prepare new variables

1. Add the new variables/columns to the data set: productPriceUnit, productType (premium or regular), Revenue, productCost, and grossProfit.

2. When adding the variables follow the same steps used during the chapter 3
seminar:

(a) Define the input(s)
(b) Describe the problem and convert it into a logical formula.
(c) Translate your formula to a spreadsheet formula.
(d) Apply the formula to a few lines and manually check the results.
(e) Use the spreadsheet formula to populate the variable in all rows.

3. Add one more new variable/column (Day) to the data set to capture the day
of the week (see hint below).

(a) Define the spreadsheet formula: =WEEKDAY(XX).
(b) Apply the formula to a few lines (e.g., enter =WEEKDAY(F2) if salesDate is in cell F2) and check the results.
(c) Use the spreadsheet formula to populate the variable in all rows.

You can enter the function “=WEEKDAY(cell reference)” into a Cell so the
value that appears in the cell is a number referencing the day of the week for
the salesDate in the “cell reference”. The number “1” refers to a “Sunday”.

NB: Spreadsheet Hint
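As an optional aside (not required for the case): if you would rather see day names than day numbers, both Excel and Google Sheets also offer the TEXT function. Assuming salesDate is in cell F2:

=TEXT(F2, "dddd")     returns the day name, e.g., "Sunday", instead of a number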


4.4.2 Model: Segment Analysis Using Pivot Tables

Work with your crew on Question 1: Segment Analysis by Store

• The objective of this exercise is to create a pivot table that answers question:
What are the total Sales Quantities, Revenues, Product Costs, and Gross Profit
for each of the four stores and the total for all four stores?

• To answer the question, organize data by store. Structure the table so “stores”
1, 2, 3, 4 are the rows in your pivot table.

• Estimated completion time for this exercise is 10 minutes.

For each one of the stores, you will need to calculate the following information:

1. total units sold (sum of salesQuantity)

2. total revenues (sum of Revenues)

3. total product costs (sum of productCost)

4. total gross profit (sum of grossProfit)

Review your Store category pivot table results. Do they make sense? What did
you learn about Gross Profit for each of the four (4) stores?
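A quick way to manually check one cell of your pivot table, in the spirit of checking your solutions: a hedged SUMIF sketch, assuming Store is in column A and salesQuantity in column D of the prepared sheet:

=SUMIF(A:A, 1, D:D)     total units sold for store 1; this should match the store 1 row of the pivot table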

Work with your crew on Question 2: Additional Segment Analysis by
Store

• The objective of this exercise is to create a pivot table that answers the question:
For each of the four stores, what is the average, maximum, minimum, and
median Gross Profit per transaction?

• To answer the question, organize data by store. Structure the table so “stores”
1, 2, 3, 4 are the rows in your pivot table and add more information to the
pivot table created to answer question 1.

• Estimated completion time for this exercise is 10 minutes.

For each one of the stores, you will need to calculate the following information:


1. total units sold (sum of salesQuantity)

2. total revenues (sum of Revenue)

3. total product costs (sum of productCost)

4. total gross profit (sum of grossProfit)

5. average gross profit per transaction

6. minimum gross profit per transaction

7. maximum gross profit per transaction

8. median gross profit per transaction

Review your expanded Store category pivot table results. Do they make sense?
What else did you learn about the gross profit for each of the four (4) stores?

4.4.3 Model: Segment Analysis Using Two-Way Pivot Table

Work with your crew on Question 3: Analysis of Stores by Product
Type and Classification

• The objective of this exercise is to create pivot tables that answer the question:
For each of the four stores, what are the total Sales Quantities, Revenues,
and Gross Profit for each product type (regular vs premium) and each product
classification (liquor vs wine)?

• To answer the question, create a pivot table that organizes data by store (in
rows) and product type (in columns).

• Estimated completion time for this exercise is 10 minutes.

1. For each store and product type, calculate the total sales quantities (sum of
salesQuantity)

2. For each store and product type, calculate the total revenues (sum of Revenue)

3. For each store and product type, calculate the total gross profit (sum of
grossProfit)


4. Review your two-way pivot table results. Do they make sense? What addi-
tional insights did you learn about each of the stores and the gross profit from
premium and regular product types?

5. IF you finish early, ANALYZE product Classification segments (liquor and
wine), generating additional pivot tables and business insights.

4.4.4 Model: Analysis by Day of the Week

Work with your crew on Question 4: Units Sold on Each Day by Store
and Product Type

• The objective of this exercise is to create pivot tables that answer questions:
(a) What are the total units sold on each day of the week for each of the stores?
(b) What are the total units sold on each day of the week for each product type
(premium vs regular)?

• To answer the question, first create a pivot table that organizes data by day
(days 1 to 7 in rows) and store (stores 1 to 4 in columns).

• Estimated completion time for this exercise is 10 minutes.

1. For each day and store, calculate the total sales quantities (sum of salesQuan-
tity).

2. Use the time-saving hint from the chapter 3 seminar (see p. 56), copying the
spreadsheet with the day-store two-way pivot table.

3. Edit the copy of the pivot table to organize data by day (in rows) and product
type (premium vs regular in columns); then calculate the total sales quantities
(sum of salesQuantity)

4. Review your two-way pivot table results. Do they make sense? What additional
insights did you learn about each of the stores and the premium and regular
product types?


4.5 Lecture 2

4.5.1 Seminar Debriefing
Review all of the steps completed during the seminar exercises. Focus on questions
that you can answer using the pivot tables that you created in the seminar to com-
municate answers to the Product Management Team Leader’s four questions from
the case.

1. Based on Data Set: What are the inputs for the creation of the variable pro-
ductCost?

(a) Store
(b) Description
(c) Size
(d) Sales Quantity
(e) Retail Price per Unit

(f) Sales Date
(g) Volume
(h) Classification
(i) Vendor Name
(j) Product Cost per Unit

2. Based on Data Set: The variable Day is numerical because we can convert a
date into a number from 1 to 7.

(a) True (b) False

3. Based on Pivot Table for Segment Analysis by Store: When you analyzed LSL
case question 1, which store generated the lowest sum of gross profit?

(a) store 1
(b) store 2

(c) store 3
(d) store 4

4. Based on Pivot Table for Expanded Segment Analysis by Store: When you
analyzed LSL case question 2, you saw that Store 4 has the lowest minimum
gross profit per transaction. This minimum value was:

(a) $3.57
(b) $2.46

(c) -$10.58
(d) -$996.61

5. Based on Two-Way Pivot Table for Store and Product Type (premium vs
regular). When you analyzed LSL case question 3, you saw that:


(a) One store generated more gross profit from regular products than premium
products.

(b) Two stores generated more gross profit from regular products than pre-
mium products.

(c) Three stores generated more gross profit from regular products than pre-
mium products.

(d) Four stores generated more gross profit from regular products than pre-
mium products.

6. Based on one of the Pivot Tables for Units Sold on Each Day. When you
analyzed LSL case question 4, which day did the four stores in total sell the
most units?

(a) Day 1 = Sunday
(b) Day 2 = Monday
(c) Day 3 = Tuesday
(d) Day 4 = Wednesday

(e) Day 5 = Thursday

(f) Day 6 = Friday

(g) Day 7 = Saturday

7. Based on the Pivot Table for Units Sold on Each Day by Store. When you
analyzed LSL case question 4, which day did Store 3 sell the most units?

(a) Day 1 = Sunday
(b) Day 2 = Monday
(c) Day 3 = Tuesday
(d) Day 4 = Wednesday

(e) Day 5 = Thursday

(f) Day 6 = Friday

(g) Day 7 = Saturday

8. Reviewing the results from the pivot tables can help the Product Management Team Leader identify POTENTIAL ISSUES AND OPPORTUNITIES. The Product Management Team can now focus their work on looking for the underlying business (or data error) causes of the problem and developing alternative solutions. For each one of the pivot tables that you have created, identify a potential issue or opportunity that you can share with the Product Management Team Leader.

4.5.2 Visualize Answers to Questions
In chapter 2, we discussed the art and science of visualizations. Figure 2.5 provided a subset of possible graphs and a way to help you select a graph appropriate for what you want to show. During this exercise, we will practice by creating graphs that


could help you communicate answers to the Product Management Team Leader’s
questions.

Work with your crew: Make a copy of the pivot table with segment analysis
by store (units sold, revenues, product costs, and gross profit variables). To answer
LSL case question 1, show the team leader a comparison of the four stores using
these four measures. Which type of graph will you use?

1. Histogram

2. Line Chart

3. Bar Chart

4. Pie Chart

5. Scatter Plot

Work with your crew: Make a copy of the pivot table with Units Sold Each
Day by Store. To answer LSL case question 4 and show the Team Leader how the
units sold vary during the week by store, which type of graph would you create?

1. Histogram

2. Line Chart

3. Bar Chart

4. Pie Chart

5. Scatter Plot

The Product Management Team Leader is unlikely to be a finance professional, and so will be more comfortable with a visualization than with numbers. If the fictional co-op student creates data tables for the Finance Team Manager (Ann likes numbers) and graphs for the Product Management Team Leader (who likes pictures), both will be impressed.

4.5.3 Filter Data

Work with your crew: Make a copy of the pivot table with expanded segment analysis by store (includes average, maximum, minimum, and median gross profit per transaction). Since the answer to LSL case question 2 shows that all four stores have negative results for the minimum value, create filters to help the Product Management Team identify the specific transactions or products with negative gross profit values.
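If your crew prefers a formula-based alternative to the filter menu, here is a sketch using the FILTER function (Google Sheets syntax; the range is an assumption, with the prepared data in columns A:O and grossProfit in column O):

=FILTER(A2:O, O2:O < 0)     returns only the rows with negative gross profit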


4.6 Lessons from Problem-Solving Case
• Learn to follow the CRISP-DM steps, since they help you use business analytics for data-enabled problem-solving, using models to help identify unstated issues. The CRISP-DM steps are:

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling (to identify issues and opportunities)
5. Evaluation (and Communication)

• When faced with a new situation (in a simulated case world or on a real-world co-op term), STOP and THINK before you start. Assess the Situation first:

– What's your role?
– What are you required to do or prepare?
– Who are the key stakeholders/audience and what are they expecting?
– When you can answer these questions, THEN start to perform the CRISP-DM steps.

• Great financial professionals focus their analysis to answer questions about the
business; they don’t just play with pivot tables and charts.

• To communicate results effectively, create relevant and simple visualizations
that answer specific questions and/or highlight issues and opportunities.

4.7 Preview of Next Chapter
This is the end of the first part of the course, which was based on the use of spreadsheets to perform data analytics. After the midterm, we will transition to the second part and the use of R to analyze much larger and more comprehensive data sets.


Chapter 5

Midterm Review

5.1 Learning Objectives & Design
To succeed in your exam, you will need to understand the goal and logic behind the design/format of the exam. The main goal of the exam is to encourage critical thinking in the context of communicating the stages of CRISP-DM for a data analytics project. The format of the exam is motivated by the typical expectations for new employees who aspire to ascend to the upper echelons of their firm's management. That is: manage yourself, manage up, and manage down.

The exam assumes that you are part of a data analytics team and you have
just completed the data analysis part of your assignment. Each member of the
team will prepare a short document (executive summary) for the team manager.
In this part you will practice writing for a superior so you’ll want to make your


summary easy to follow and concise. The executive summary must have a brief
statement of the business problem, background information (data understanding and
data preparation), concise description of the analysis that you have done (model),
and main take-away points of the analysis.

Each member of the team must provide anonymous feedback to peers. Here you
will practice coaching others and helping them to improve: essential skills for success
managing down and across an organization. The feedback must be reinforcing (i.e.,
indicate what parts of the summary were done well and why) and constructive (i.e.,
indicate what part of the summary could have been improved and offer suggestions
on how to make these improvements).

Therefore, the learning objectives of the exam are as follows:

1. Prepare an executive summary that covers the following points:

(a) Communicates the business problem and the data that you are using.
(b) Explains the data preparation that you have done.
(c) Describes the analysis that you have done (modelling).
(d) Communicates the key findings in non-technical terms.

2. Review/evaluate an executive summary prepared by someone else and provide
constructive feedback (i.e., areas of strength, as well as areas for improvement).

5.2 How to Prepare for the Midterm
A simple way to demonstrate your ability to approach a business analytics prob-
lem using CRISP-DM is by preparing an executive summary to communicate the
problem and data description, data preparation, analysis (model), and key findings.
Therefore, in your midterm you should be prepared to demonstrate your ability to
prepare such an executive summary and evaluate (provide constructive feedback to)
the executive summary prepared by someone else.

In this chapter, we will review the midterm from Fall 2020. We will use this
as a foundation for practicing the preparation of an executive summary, as well as
providing feedback.

The most effective way to practice for the midterm would be to work with your
crew as follows:

1. Executive Summary Submission Each member of the crew should prepare
an executive summary and share it with other members of the crew. Preferably,
you want to share an anonymous copy of the executive summary.


The executive summary should cover the four points mentioned under learning objectives. This means that the executive summary should help you communicate the problem and data description, data preparation, analysis (model), and key findings. To make the executive summary easier to read, you should divide it into four paragraphs: one short introductory paragraph for the business problem and data, one for data preparation, one for modelling, and a last one for communicating your findings. Use standard Times New Roman font (size 12) and single spacing in your executive summary. Your executive summary should not be more than 3/4 of a page.

2. Review/Feedback Each member should prepare a review for each executive summary submitted by other crew members. In each review, you should try to explain how the submitted executive summary has done in the four areas, using the following criteria/weights:

(a) The executive summary clearly communicates the business problem and
helps me understand the data

(b) The executive summary clearly explains the data preparation
(c) The executive summary clearly describes the analysis done (model)
(d) The executive summary clearly communicates the key findings.

Each review should be about a couple of paragraphs long and try to provide con-
structive feedback to the person who prepared the executive summary. When
you provide constructive feedback you will need to identify areas of strength,
as well as areas for improvement.

• The feedback that students receive on their submission will determine their score on the submission.

3. Score Reviews/Feedback After you have prepared the reviews you should
send the reviews to your crew members and ask them to rate the positive
comments offered in this review (i.e., does the reviewer offer insightful positive
feedback, does it tell us what is effective and why?), and rate the constructive
comments offered in this review (i.e., does the reviewer provide comments that
would improve the quality of the outline?).

• The scores that students receive on the feedback that they have provided will determine their score on feedback.


5.3 Midterm Fall 2020
You (Chris Fansworth) are in your first co-op work term as a Business Analyst for
AFM Consulting. Bibitor, a wine and liquor retailer with multiple stores across
the USA, is one of the major clients of AFM Consulting. Your team has already
completed a couple of assignments quickly and reliably. Leela Powell, your manager
was impressed with your work so when Alex Fry the manager of one of the Bibitor
stores (store #1) sent a message requesting analysis of March 2020 data from store
#1, Leela assigned this project to your team.

As general information, Leela reminded your team that by the middle of March 2020, in an attempt to contain the spread of COVID-19, most of the world's economies went into a state of lockdown. While the measure proved very effective and saved lives, it affected many businesses. Store #1 is located in the city of Hardsfield, state of Lincoln. Lockdown started in the state of Lincoln on March 16th. In store #1, they classify products as premium or regular based on their price. They use prices greater than $30 for liquor and $22 for wine to classify products as premium.

5.3.1 Data
Alex provided a data set (dt4mtF2020) with sales transactions of store #1 for March
2020 (see screenshot in Figure 5.1). The data set has 18,462 transactions and the
following variables:

1. Store = store number
2. Brand = product number
3. salesQuantity = units sold.
4. salesPrice = price per unit. This does NOT include sales tax
5. salesDate = The sales date.
6. Description = name of product.
7. Classification = 1 for liquor and 2 for wine.

5.3.2 Analysis
The work that your team has done is summarised in Figures 5.2-5.5.

https://docs.google.com/spreadsheets/d/1MemZJEH0rb38h8EMpseLueK2Dv1wg3Jzm7f1LpS6yxM/


Figure 5.1

Figure 5.2

Figure 5.3

Figure 5.4


Figure 5.5


5.3.3 Practice Exam Deliverable
Follow the directions in 5.1 and 5.2 in order to prepare the two deliverables:

1. Each member of the team must prepare a short document (executive summary)
for the manager (Leela Powell).

2. Each member of the team must provide anonymous feedback to the executive
summaries prepared by their peers.

In the next few days I will post examples of executive summaries and examples of feedback on Piazza. I encourage you to post your examples so we can have a discussion.

• While you have access to the data and you could replicate the analysis, you do
NOT have to replicate the analysis in order to prepare your executive summary.
You should be able to prepare your executive summary based on just the work
shown in Figures 5.2-5.5.

5.4 Preview for Next Chapter
A message from the Center for Teaching Excellence.

Fall Reading Week is a chance for you to recharge, regroup, or reset. With
the current pandemic, you need to care for yourself and your loved ones
more than ever. Some of you may use the break to catch up on studies,
self-care, caring for loved ones, engaging in activities you enjoy, or just
catching up on sleep. Consider what YOU really need.


Chapter 6

Midterm

6.1 Learning Objectives
1. Apply the CRISP-DM process to analyze data

2. Prepare an executive summary to communicate results of the analysis

3. Provide constructive feedback to executive summaries prepared by your peers

The case description, data, and analysis will become available on Learn.


Chapter 7

Applying CRISP-DM with R

Robot guard #1: Halt!1

Robot guard #2: Be you robot or human?
Leela: Robot…we be.
Fry: Uh yup! Just two robots out roboting it up! Eh?
Robot guard #1: Administer the test.
Robot guard #2: Which of the following would you most prefer? A:
A puppy? B: A pretty flower from your sweetie or C: A large properly
formatted data file?

7.1 Learning Objectives
Transitioning from spreadsheets to a language like R can be challenging. The primary objective for this and the next chapter is to make this transition as smooth as possible, so students can appreciate and leverage the huge potential that R offers. To achieve this goal, we will continue working with two familiar business settings and data sets. In this chapter, we will revisit the Liquor Store Limited mini-case (4.2.2).

By the end of this chapter, students should be able to:

1. Understand the set-up and line-by-line R script that someone else has prepared
in order to analyze a business problem (Focus of Lecture 1).

2. Replicate the R script that someone else has prepared (Seminar).

3. Modify an existing R script to make it work with a new data set (Lecture 2).

4. Identify outliers using the inter-quartile range (IQR).

5. Interpret/communicate results based on an R output.

1The dialog is from the Futurama episode "Fear of a Bot Planet." Leela and Fry face two giant robot guards.

7.2 Students: Advance Preparation
During the second half of the course, we will transition from spreadsheets to R. If you are not familiar with R, it is very important to watch the suggested videos for this chapter (see studentPreparation). These videos provide step-by-step directions on the following topics:

1. Installing R on your computer
2. Using RStudio
3. Getting started with the R Environment
4. Using and managing packages

• Make sure to start working on the DataCamp assignments before your semi-
nar.

R is open source software. This means that it is free and there are numerous free resources available. For example, the book "R for Data Science" by Garrett Grolemund and Hadley Wickham is available online (https://r4ds.had.co.nz/index.html). Sections 4.1-4.3 from the book of Grolemund and Wickham are a good starting place that supplements the LinkedIn videos and DataCamp assignments.

• NB If you are not familiar with R, it is very important to watch the LinkedIn
videos before the lecture.

7.3 Lecture 1
Your first major data analytics assignment (see Chapter 4 – Liquor Store Limited
(LSL) Mini-Case) was very successful. Ann Bosch, your Finance Team Manager,
and the Product Management Team Leader would like you to continue working on
the same project, but with a much larger data set. Your first assignment was to
analyze performance at four (4) stores, but now they want you to analyze ALL of
the company’s stores, which means you need to work with a data set that has over 1
million lines. The data set is so large that you cannot use spreadsheet pivot tables.
Now what?



7.3.1 Overview of the chapter
In this chapter we will transition from spreadsheets to R and demonstrate one way
to learn a new data analytics tool.

• In Lecture 1, we will load the SAME data set used for the four stores into RStudio. We will then demonstrate how to understand the data, prepare the data, and create a model to answer the first question that we answered using spreadsheets in chapter 4.

• In the Seminar, you will have an opportunity to replicate the R code on your
own and answer the rest of the LSL case questions for the four stores. You
will then check the answers generated using R with our chapter 4 pivot table
answers.

• In Lecture 2, we will re-run the entire analysis with a new data set for ALL stores, which has over 1 million lines, and see how easy it is to re-run an analysis with new data sets using R.

7.3.2 Why Transition from Spreadsheets to R
The advantage of using spreadsheets is that they are very easy to learn and everybody
is using them.2 However, …

1. According to Wikipedia “[The] concept of an electronic spreadsheet was out-
lined in the 1961 paper ‘Budgeting Models and System Simulation’ by Richard
Mattessich. The subsequent work by Mattessich (1964a, Accounting and Ana-
lytical Methods) and its companion volume, Mattessich (1964b, Simulation of
the Firm through a Budget Computer Program) applied computerized spread-
sheets to accounting and budgeting systems (on mainframe computers pro-
grammed in FORTRAN IV).”

2. A spreadsheet is not designed to work like a database management system.
The solutions that we create (using spreadsheets) are idiosyncratic and are not
easily transferable to other users.

3. As spreadsheets proliferate in organizations, they increase the likelihood of
making mistakes, make errors more difficult to detect and make it hard to

2This discussion is based on my presentation at the Intensive Data and Analytics Summer
Workshop, which was organized by the American Accounting Association in Orlando, FL from
June 4-7, 2018 (https://tinyurl.com/y2wyylgs).



identify changes to functionality or data. This is usually called Spreadsheet
Risk.

4. At early stages, it takes little effort to complete a variety of tasks/applications. However, the effort needed to perform advanced data analytics with spreadsheets is very high. This means that the number and variety of applications we can do with spreadsheets plateaus (see red line in Fig. 7.1).3

Figure 7.1: Spreadsheets vs R

Working with a language like R or Python is more demanding to learn. However,

1. It will help you think about what you are doing.
2. You are more likely to realize that when dealing with large data sets visual

inspection of your data (opening it in Excel or Google Sheets) will not work.
Working with R will help you think about the structure of the data you are
using.

3. The spectrum of R applications you can use is constantly growing. That is not necessarily the case with spreadsheets. Using R initially entails a steep learning curve. However, the sky is the limit once you have passed the initial training stage (see blue line in Fig. 7.1).

4. The solution you create with R is easily transferable to others who understand
R or Python.

3Figure 7.1 is an adaptation of a graph prepared by Gordon Shotwell.


5. Compared to spreadsheets, by learning R you can more readily transition and
become a power user of other software packages, such as Tableau or Alteryx.

7.3.3 Prepare R Environment
In the beginning, when you first start working on data analytics projects using R, you
may need some help remembering all the steps and R commands. The following link
(R Project Check List) provides a template that you can use. Feel free to customize
it to the specific requirements of each project. To prepare the R environment, we
perform the following steps:

1. Create a new directory/folder to store your data and analysis. I have named
my folder c07.

2. Download the original LSL mini-case data for chapter 4 (C04S) and save it in the case-analysis folder (e.g., c07) as C04S.csv. Choose the correct format when saving (the file extension matters!).

3. Start RStudio, and create a new R file. Name the file c07L1.R and save it in
the same directory. It is important to make sure that your data set and the R
file are in the same folder.

4. Set the working directory to this folder as follows:

(a) From the R Studio menu select Session
(b) select Set Working Directory
(c) select To Source File Location

5. Load the appropriate packages. To perform the data analysis of the LSL mini-case, we will use the package tidyverse. If this is the first time that you are using the package, you must install it before you can use it (see the sketch after this list).

6. Load the package(s) using the function library()

library(tidyverse)
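If tidyverse is not yet installed, the one-time install looks like this (install once; load with library() in every new session):

install.packages("tidyverse")   # one-time download and installation
library(tidyverse)              # load the package in each new R session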

• If you are not familiar with R, it is very important to watch the
LinkedIn videos before the lecture.

https://docs.google.com/spreadsheets/d/1UpWq-gvsQulxLBobwO1iq2g_T0EskyV-Zw2Nr7-U3-g/
https://docs.google.com/spreadsheets/d/1RCjoxquWCkO79uRiStiOA7AwacQH0g4vU8lG16LW09Y/


7.3.4 LSL Mini-Case: Business Understanding
With the R environment ready, we are going to follow the CRISP-DM steps to answer
the same four questions:

1. What are the total Sales Quantities, Revenues, Product Costs, and Gross Profit
for each of the four stores and the total for all four stores?

2. For each of the four stores, what is the average, maximum, minimum, and
median Gross Profit per transaction?

3. For each of the four stores, what are the total Sales Quantities, Revenues, and
Gross Profit for each product type (regular vs premium) and each product
classification (liquor vs wine)?

4. What are the total units sold on each day of the week for each of the stores?
What are the total units sold on each day of the week for each product type
(premium vs regular)?

We will demonstrate how to answer question 1 during the lecture by completing the CRISP-DM steps. In the Seminar, you will answer questions 2, 3, and 4, and compare the answers generated using R with the answers generated in chapter 4 using spreadsheet pivot tables to test your code.

7.3.5 LSL Mini-Case: Data Understanding
Load Data

Using the command read_csv, we load the data into RStudio.4 The name of the new data set is dt4C04S.

dt4C04S <- read_csv("C04S.csv")

Please notice that the data set will show in the Environment pane of RStudio.

4Please notice that we use the underscore version (i.e., read_csv()), NOT the dot version (read.csv()).

Viewing the Data Structure

R provides several options for reviewing/understanding the data set. With the function names(), we can view the variable names.

names(dt4C04S)

[1] "Store"           "Description"     "Size"
[4] "SalesQuantity"   "RetailPriceUnit" "SalesDate"
[7] "Volume"          "Classification"  "VendorName"
[10] "productCostUnit"

With the function glimpse(), we can get a more detailed understanding of the data set.

glimpse(dt4C04S)

Observations: 10,853
Variables: 10
$ Store            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Description      <chr> "Clayhouse Syrah Paso Robles", "F Co…
$ Size             <chr> "750mL", "750mL", "750mL", "750mL", …
$ SalesQuantity    <dbl> 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, …
$ RetailPriceUnit  <dbl> 14.99, 13.99, 13.99, 13.99, 8.99, 22…
$ SalesDate        <date> 2019-06-28, 2019-06-24, 2019-06-25,…
$ Volume           <dbl> 750, 750, 750, 750, 750, 750, 750, 7…
$ Classification   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ VendorName       <chr> "MARTIGNETTI COMPANIES", "SOUTHERN W…
$ productCostUnit  <dbl> 7.89, 9.26, 9.26, 9.26, 7.58, 12.49,…

Based on the above output, we can see the number of observations, the number of variables, as well as a few sample values for each variable. Next to each variable name, R shows the data type: <dbl> for variables that take numeric values, <chr> for variables that are text-based, and <date> for date-based variables.

Work with your crew to answer the following questions: How many
variables are in this data set? How many observations? R indicates that the variable


Store takes numeric values (i.e., <dbl>). Does this mean that the variable Store is numeric?

Another option for understanding the structure of the data set is to use the function slice_head() to view the top six observations of the data set, or the function slice_tail() to view the bottom six observations.

dt4C04S %>% slice_head(n=6)
dt4C04S %>% slice_tail(n=6)
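If you prefer base R, the equivalent calls are head() and tail(), where six observations is the default:

head(dt4C04S)   # first six rows
tail(dt4C04S)   # last six rows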

In R we use the assign operator (<-) to define a new object, such as a new data set or a new variable. We use the pipe operator (%>%) to establish a sequence of operations/actions. For example: start with a data set, find the missing values, and count the missing values in each column. These two operators are used a lot.

• The shortcut for the <- operator is ALT + minus sign (Windows) and Option + minus sign (Mac).

• The shortcut for the %>% operator is CTRL + SHIFT + M.

NB: Useful Shortcuts

Data Understanding – Missing Values

To find the number of missing values in each variable in a data set, we start with the data set (dt4C04S) and use two R functions: is.na() and colSums(). First, the function is.na() examines each value of each variable and returns one of two values: TRUE (= 1) if the value is missing, and FALSE (= 0) if it is not. Second, the function colSums() adds up the TRUE (= 1) and FALSE (= 0) values in each column. The result is the number of missing values in each variable in the data set.

dt4C04S %>%
is.na() %>%
colSums()

Store Description Size SalesQuantity
0 0 0 0


RetailPriceUnit SalesDate Volume Classification
0 0 0 0

VendorName productCostUnit
0 0

Work with your crew to answer the following questions: Are there
missing values in any of the variables in this data set? If yes, list the variables with
missing values.

Data Understanding – Summary Statistics for All Variables

dt4C04S %>%
summary()

Store Description Size
Min. :1.000 Length:10853 Length:10853
1st Qu.:1.000 Class :character Class :character
Median :2.000 Mode :character Mode :character
Mean :2.067
3rd Qu.:2.000
Max. :4.000

SalesQuantity RetailPriceUnit SalesDate
Min. : 1.000 Min. : 0.79 Min. :2019-06-24
1st Qu.: 1.000 1st Qu.: 8.95 1st Qu.:2019-06-26
Median : 1.000 Median : 12.99 Median :2019-06-27
Mean : 2.394 Mean : 15.62 Mean :2019-06-27
3rd Qu.: 2.000 3rd Qu.: 19.99 3rd Qu.:2019-06-29
Max. :96.000 Max. :219.99 Max. :2019-06-30

Volume Classification VendorName
Min. : 50.0 Min. :1.000 Length:10853
1st Qu.: 750.0 1st Qu.:1.000 Class :character
Median : 750.0 Median :1.000 Mode :character
Mean : 882.2 Mean :1.358
3rd Qu.: 800.0 3rd Qu.:2.000


Max. :5000.0 Max. :2.000

productCostUnit
Min. : 0.38
1st Qu.: 5.71
Median : 8.79
Mean : 11.30
3rd Qu.: 14.49
Max. :139.70

The output includes all variables in the data set. For each of the variables with a <dbl> format, it shows a series of descriptive statistics. In addition to the descriptive statistics that we have seen in Chapter 2 (i.e., min, max, average, and median), it includes the 1st Quartile (Q1) and the 3rd Quartile (Q3). Their calculation and interpretation are similar to the median (p. 23). More specifically, if we arrange all observations from smallest to largest, the 1st quartile leaves the lowest 25% of observations below it; the remaining 75% are above. Similarly, the 3rd quartile leaves the highest 25% of observations above it and the remaining 75% below.
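If you want these cut-points for a single variable in one call, base R's quantile() with its default settings returns the minimum, Q1, median, Q3, and maximum (the values below match the summary output for sales quantity):

quantile(dt4C04S$SalesQuantity)

  0%  25%  50%  75% 100%
   1    1    1    2   96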

Data Understanding – Summary Statistics for Selected Variables

Typically, we generate descriptive statistics and look for outliers for the numeric variables. With the following R script, we use the function select() to focus on a subset of three variables (i.e., SalesQuantity, RetailPriceUnit, and productCostUnit) and generate their summary/descriptive statistics.

dt4C04S %>%
  select(SalesQuantity, RetailPriceUnit,
         productCostUnit) %>%
  summary()

SalesQuantity RetailPriceUnit productCostUnit
Min. : 1.000 Min. : 0.79 Min. : 0.38
1st Qu.: 1.000 1st Qu.: 8.95 1st Qu.: 5.71
Median : 1.000 Median : 12.99 Median : 8.79
Mean : 2.394 Mean : 15.62 Mean : 11.30
3rd Qu.: 2.000 3rd Qu.: 19.99 3rd Qu.: 14.49
Max. :96.000 Max. :219.99 Max. :139.70


Work with your crew to answer the following questions:

1. What percentage of customers buy no more than 2 bottles? Hint: What is the value of Q3 for sales quantity?

2. What percentage of products have a retail price between $8.95 and $19.99?

3. What percentage of products have a product cost per unit above $14.49?

4. The difference between the 3rd and 1st quartile is called the Inter-Quartile Range (IQR = Q3 − Q1). If you round Q1 and Q3 to the nearest integer, what is the IQR for each one of the three variables?

7.3.6 LSL Mini-Case: Detecting Outliers
In Chapter 3 (p. 37), we learned that data values that are unusually large or unusually small compared to the rest of the data values are considered outliers. This definition is relatively subjective, as it depends on the viewer's assessment of what counts as a very large or very small value. A more precise approach is to use the inter-quartile range (IQR). More specifically:

• Any value above the upper whisker (uw) is an outlier, where uw = Q3 + 1.5 × IQR.

• Similarly, any value below the lower whisker (lw) is an outlier, where lw = Q1 − 1.5 × IQR.

Detecting Outliers with IQR

The functions for calculating the quartiles and IQR for sales quantity, as well as
the calculated values, are given below:

• Q1:
quantile(dt4C04S$SalesQuantity, .25) = 1

• Q3:
quantile(dt4C04S$SalesQuantity, .75) = 2


• IQR:
IQR(dt4C04S$SalesQuantity) = 1

• Upper whisker (uw):
uw = quantile(dt4C04S$SalesQuantity, .75) + 1.5 * IQR(dt4C04S$SalesQuantity) = 3.5
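Once the upper whisker is known, here is a short sketch of how you might list the outlier transactions themselves (this uses the tidyverse filter() verb; the object name uw is our own):

uw <- quantile(dt4C04S$SalesQuantity, .75) +
  1.5 * IQR(dt4C04S$SalesQuantity)    # uw = 3.5, as calculated above

dt4C04S %>%
  filter(SalesQuantity > uw)          # keep only transactions with outlier quantities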

It is relatively easy to impress your colleagues in a meeting by using mental math to provide a quick-and-dirty approach to detecting outliers. Consider the following scenario. You are in a meeting and someone has just projected on the screen the summary stats for the three variables (i.e., SalesQuantity, RetailPriceUnit, and productCostUnit) (see p. 97).

You can make the following statement: 'Products with a retail price approximately above $35 are outliers.' Everyone will want to know how you came up with this number.

Here is the trick: round the Q1 generously from 8.95 to 10. Round the Q3 generously from 19.99 to 20. This means that IQR = Q3 − Q1 = 20 − 10 = 10 and 1.5 × IQR is 15. Therefore, the upper whisker is 20 + 15 = 35.

NB: Using Mental Math to Identify Outliers

Work with your crew to answer the following questions:

• Use the quantile() and IQR() functions to calculate the uw for retail price.

• How does your answer compare to the one calculated in Using Mental Math to Identify Outliers?

• Use mental math to identify the outliers (above uw) for product cost per unit.

• Use the quantile() and IQR() functions to evaluate how good your mental math estimate was.

7.3.7 LSL Mini-Case: Data Preparation
Create New Variables

As we have seen in 3.4.1, in order to generate the pivot tables needed to answer the
four business questions, we need to create the following variables: Product price per
unit, Product type, Revenue, Product cost, Gross profit, and Day.


In the script below, we use the assign operator (<-) to redefine the data set to include the original variables, as well as the new variables created using the function mutate().

names(dt4C04S)

[1] "Store"           "Description"     "Size"
[4] "SalesQuantity"   "RetailPriceUnit" "SalesDate"
[7] "Volume"          "Classification"  "VendorName"
[10] "productCostUnit"

dt4C04S <- dt4C04S %>%
  mutate(productPriceUnit = RetailPriceUnit / 1.13,
         productType =
           ifelse(productPriceUnit > 20, "Premium", "Regular"),
         Revenue = SalesQuantity * productPriceUnit,
         productCost = SalesQuantity * productCostUnit,
         grossProfit = Revenue - productCost,
         Day = weekdays(SalesDate))

Work with your crew to answer the following questions:

• Understand and explain to each other the process used to create each variable.

• Compare the formulas used in R with the corresponding spreadsheet formulas and highlight differences or similarities.

• R formulas/functions are vectorized. Run a search, see if you can find what the term means, and try to explain it to each other. Hint: You may want to leverage your answer to the previous question. (A minimal illustration follows this list.)
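As a minimal illustration for the vectorization question (a hint, not the full answer): in R, an arithmetic operation applied to a vector is applied to every element at once, which is why the mutate() call above needs no row-by-row loop.

x <- c(1, 2, 3)
x * 2    # returns 2 4 6: one operation, applied to all three elements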

Data Preparation – Review the New Variables

names(dt4C04S)

[1] “Store” “Description” “Size”
[4] “SalesQuantity” “RetailPriceUnit” “SalesDate”


[7] “Volume” “Classification” “VendorName”
[10] “productCostUnit” “productPriceUnit” “productType”
[13] “Revenue” “productCost” “grossProfit”
[16] “Day”

glimpse(dt4C04S)

Observations: 10,853
Variables: 16
$ Store            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ Description      <chr> "Clayhouse Syrah Paso Robles", "F C…
$ Size             <chr> "750mL", "750mL", "750mL", "750mL",…
$ SalesQuantity    <dbl> 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1,…
$ RetailPriceUnit  <dbl> 14.99, 13.99, 13.99, 13.99, 8.99, 2…
$ SalesDate        <date> 2019-06-28, 2019-06-24, 2019-06-25…
$ Volume           <dbl> 750, 750, 750, 750, 750, 750, 750, …
$ Classification   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ VendorName       <chr> "MARTIGNETTI COMPANIES", "SOUTHERN …
$ productCostUnit  <dbl> 7.89, 9.26, 9.26, 9.26, 7.58, 12.49…
$ productPriceUnit <dbl> 13.265487, 12.380531, 12.380531, 12…
$ productType      <chr> "Regular", "Regular", "Regular", "R…
$ Revenue          <dbl> 13.265487, 12.380531, 24.761062, 12…
$ productCost      <dbl> 7.89, 9.26, 18.52, 9.26, 22.74, 12.…
$ grossProfit      <dbl> 5.375487, 3.120531, 6.241062, 3.120…
$ Day              <chr> "Friday", "Monday", "Tuesday", "Thu…

Data Preparation – Summary Statistics for New Variables

As we have seen in 7.3.5, we can use the function select() to specify the variables for which we want to generate descriptive statistics.

dt4C04S %>%
  select(Revenue, productCost, grossProfit) %>%
  summary()

Revenue productCost grossProfit
Min. : 0.6991 Min. : 0.38 Min. :-996.611


1st Qu.: 8.8407 1st Qu.: 7.14 1st Qu.: 1.108
Median : 15.9204 Median : 12.59 Median : 2.896
Mean : 28.0486 Mean : 23.17 Mean : 4.878
3rd Qu.: 28.3009 3rd Qu.: 23.48 3rd Qu.: 5.851
Max. :2173.7788 Max. :1655.01 Max. : 518.769

Work with your crew to answer the following questions: Do any of
the above variables seem to have outliers on both sides? Use mental math to find
the uw and lw for the variable that you have identified.

7.3.8 LSL Mini-Case: Model
Q1a: Segment Analysis by Store What are the total Sales Quantities, Rev-
enues, Product Costs, and Gross Profit for each of the four stores?

If we approach the creation of a pivot table as an algorithm (see p. 48), we need
to specify the input, process, and output.

Inputs: In the context of Q1a, the input(s) (i.e., variables that we need to
select) are: Store, SalesQuantity, Revenue, productCost, and grossProfit.

Process: Since the variable Store takes four values/categories, one for each store,
we can group by this variable. This is the equivalent of creating four subsets. For
each one of these subsets (groups/stores), we can aggregate (i.e., use the command
summarize) SalesQuantity, Revenue, productCost, and grossProfit. In addition to
these, and for each store, we can calculate the gross profit margin as the ratio of
store gross profit over store revenue.

The output will be a 4×6 table/dataset that captures the information needed to
answer Q1a. The R script and output are shown below.

dt4C04S %>%
  select(Store, SalesQuantity, Revenue,
         productCost, grossProfit) %>%
  group_by(Store) %>%
  summarize(storeSalesQ=sum(SalesQuantity),
            storeRevenue=sum(Revenue),
            storeCOGS=sum(productCost),
            storeGP=sum(grossProfit),
            storeGPM=(storeGP/storeRevenue)*100)


# A tibble: 4 x 6
Store storeSalesQ storeRevenue storeCOGS storeGP storeGPM

1 1 9618 104446. 85527. 18919. 18.1
2 2 10308 141012. 116223. 24789. 17.6
3 3 900 6928. 5721. 1207. 17.4
4 4 5156 52025. 44001. 8024. 15.4

Even the most seasoned R users come across problems and/or need help. The
following words of advice come from two very well-respected R developers,
Garrett Grolemund and Hadley Wickham (see ‘R for Data Science’,
https://r4ds.had.co.nz/data-visualisation.html#common-problems).

As you start to run R code, you’re likely to run into problems.
Don’t worry — it happens to everyone. I have been writing R code
for years, and every day I still write code that doesn’t work!
Start by carefully comparing the code that you’re running to the
code in the book. R is extremely picky, and a misplaced character
can make all the difference. Make sure that every ( is matched with
a ) and every “ is paired with another ”. Sometimes you’ll run the
code and nothing happens. Check the left-hand of your console:
if it’s a +, it means that R doesn’t think you’ve typed a complete
expression and it’s waiting for you to finish it. In this case, it’s
usually easy to start from scratch again by pressing ESCAPE to
abort processing the current command.

NB: Error Messages

Q1b: All Four Stores What are the total Sales Quantities, Revenues, Product
Costs, and Gross Profit for all four stores?

dt4C04S %>%
  select(Store, SalesQuantity, Revenue,
         productCost, grossProfit) %>%
  summarize(storeSalesQ=sum(SalesQuantity),
            storeRevenue=sum(Revenue),
            storeCOGS=sum(productCost),
            storeGP=sum(grossProfit))



# A tibble: 1 x 4
storeSalesQ storeRevenue storeCOGS storeGP

1 25982 304411. 251472. 52939.

Work with your crew to answer the following questions: Look carefully
to see the difference between the R script above and the one used to answer Q1a.
Hint: What are you grouping by? What change do you need to make to the above
R script to include the gross profit margin across all four stores? (See the sketch
below.)
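
One possible modification (a sketch): because there is no grouping, the summary
already spans all four stores, so we only need to add the ratio at the end.

dt4C04S %>%
  select(Store, SalesQuantity, Revenue,
         productCost, grossProfit) %>%
  summarize(storeSalesQ=sum(SalesQuantity),
            storeRevenue=sum(Revenue),
            storeCOGS=sum(productCost),
            storeGP=sum(grossProfit),
            storeGPM=(storeGP/storeRevenue)*100)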

7.4 Seminar
The primary objective of the seminar is to simply replicate the above R script on
your own. Use the R Project Check List to make sure that you are completing each
step.

To answer questions 3 and 4, you will need to create a two-way pivot table.

1. Revisit your notes from 4.4.3 and 4.4.4 and prepare the input-process-output
needed to answer each question.

2. Prepare the pseudo-code needed to answer each question. Hint: Pay attention
to the group by portion of your pseudo-code.

3. Convert your pseudo-code into the R script.

R is designed for performing data analysis rather than presentation. This means
that it does not create a visually appealing pivot table. Instead, it formats the
output in long (panel) form, because that is the form you will use to further analyze
large data sets.
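
If a two-way, pivot-style layout is easier to read, the long output can always be
reshaped. Here is a sketch using spread() (which reappears in Chapter 8); the
grouping is just an illustration:

dt4C04S %>%
  group_by(Store, Day) %>%
  summarize(units=sum(SalesQuantity)) %>%
  spread(Day, units)   # one row per store, one column per weekday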

7.5 Lecture 2

7.5.1 Replicate LSL Analysis for ALL Stores
Working with Larger Data Sets One of the advantages of using R is that you
can re-run the entire analysis with a new data set, by simply changing the name of
the data set in your R script.

https://docs.google.com/spreadsheets/d/1ittaLs6PIKZGb-1WdV8j1eMKgw-4nG4Ml5PLh30B5zo/edit?usp=sharing


1. Download the new data set (c07L2.csv) from the course website (Learn). The
new data set has all stores and sales for September 2019, and it has over a
million lines. It is very important that you do NOT open the data set in
Excel: Excel will not import (it will delete) the observations beyond the
1-million-row limit.

2. Create a copy of the seminar R file and save it with the name c07L2.R
3. Open the new file and make the following changes:

(a) Change the name of the imported data set from C04S.csv to c07L2.csv
(b) Change the name of the new data set from dt4C04S to dt4c07L2. You can
do this as follows:
(c) Highlight/select dt4C04S and press CTRL + F (Windows) or Command
+ F (Mac).
(d) A new menu will show on top of the source area. The selected text
(dt4C04S) will appear on the left side.
(e) Type the new name (dt4c07L2) in the Replace area and click All.

4. Your R script has been updated and you can run it (Source with Echo) to
view the results based on the new data set.
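
After the find-and-replace, the top of your script should look roughly like this (a
sketch; the rest of the pipeline stays the same):

dt4c07L2 <- read_csv("c07L2.csv")   # new data set name and new file name
glimpse(dt4c07L2)                   # quick check before re-running the analysis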

There are two main reasons why you should NOT open a large data set with
Excel. First, if the data set has over 1 million lines, Excel will delete all lines
above 1 million. Second, if the data set has dates, Excel will change the way
dates are recorded, and R may not read the dates as <date>.

NB: Excel Opening of Large Files

7.5.2 LSL Analysis for ALL Stores: Data Understanding
Viewing the Larger Data set Work with your crew to use the options for review-
ing/understanding the new data set (see the sketch below). For example: use the
function names() to view the variable names; use the function glimpse() to get a
more detailed understanding of the new data set; use the function summary() to
see descriptive statistics for the numeric variables in the new data set.
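
A minimal sketch of the three review steps on the new data set:

names(dt4c07L2)     # variable names
glimpse(dt4c07L2)   # variable types and sample values
summary(dt4c07L2)   # descriptive stats for numeric variables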


Work with your crew to answer the following questions:

1. How many variables are in the new data set?
2. How many observations are in the new data set?
3. How many stores can you analyze using the new data set? Hint: Try the
following and see which one works best. Consider the pros and cons of each
method.

• Use the function table() to generate a frequency table (number of trans-
actions) per store:
dt4c07L2 %>% select(Store) %>% table()

• Treat Store as a numeric variable and generate summary statistics:
dt4c07L2 %>% select(Store) %>% summary()

• Try to count the number of unique stores:
dt4c07L2 %>% select(Store) %>% distinct() %>%
summarise(count = n())

4. How many days worth of data are included in the new data set? What specific
dates show the beginning and end of this range of days?
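
For questions 3 and 4, one more option worth knowing is a sketch based on
n_distinct(), which counts unique values directly:

dt4c07L2 %>%
  summarise(nStores = n_distinct(Store),        # unique stores
            firstDate = min(SalesDate),         # start of the range
            lastDate = max(SalesDate),          # end of the range
            nDays = n_distinct(SalesDate))      # days with data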

7.5.3 Understand & Communicate Findings for ALL Stores

Work with your crew to answer the following questions:

1. What are the total Sales Quantities, Revenues, Product Costs, and Gross Profit
for ALL stores?

2. For store number 10, what is the average, maximum, minimum, and median
Gross Profit per transaction?

• Hint: Use the function filter() to limit the results of your pivot table to a
specified store. You can add the filter function using a pipe at the end
of an existing R script as follows: … %>% filter(Store==10)

3. For store number 25, what are the total Sales Quantities, Revenues, and Gross
Profit for premium products?


• Hint: You can specify two conditions inside the function filter() using
the & sign.5 Add the filter function using a pipe at the end of an
existing R script as follows:
… %>% filter(Store==25 & productType=="Premium")

4. For store number 40, what are the total Sales Quantities, Revenues, and Gross
Profit for liquor?

5. What are the total units sold on each day of the week for store number 55?
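
For example, question 2 could be approached like the following sketch, assuming
the mutate() step that created grossProfit has been re-run on dt4c07L2:

dt4c07L2 %>%
  filter(Store==10) %>%
  summarize(avgGP=mean(grossProfit),
            maxGP=max(grossProfit),
            minGP=min(grossProfit),
            medianGP=median(grossProfit))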

7.6 Key Lessons from this Chapter
Transitioning from spreadsheets to R increased our degrees of freedom in terms of
data we can analyze and the kind of analysis we can do. In this chapter, we have seen
how easy it is to transition from a data set of four stores and one week of transactions
to all stores (82) and one month of transactions or all stores and a year or more of
transactions.6

Nevertheless, even the most powerful tools and the largest data sets are not useful
unless they address real business questions and you can understand/communicate
the results of the analysis to support decision making.

7.7 Preview of Next Chapter: The Toy Store
The Toy Store (TTS) is a retailer in the United States of America (USA) that sells
a wide range of toys. TTS does not produce or make any products; it purchases
the products from suppliers (vendors) and sells those products to consumers. TTS
operates seven (7) stores located in one of two states in the USA: Massachusetts
(MA) and New York (NY).

Next chapter, we will start with a data set for TTS that has 11,914 lines (each
line is an observation of the daily sales total for a specific product in each store) for
two (2) days: 2016-October-02 and 2016-December-20. We will finish with a data
set that has sales for an entire year.

The primary objective for next chapter is to make the transition to R as smooth
as possible, so students can appreciate and leverage the huge potential that R offers.

5The symbol for OR is a vertical line |.
6The data set that has all stores and a year's worth of transactions has over 13 million lines.


Chapter 8

The Toy Store Mini-Case

8.1 Learning Objectives
This week we will work on The Toy Store (TTS) mini-case. By the end of this week,
students should be able to:

1. Understand the set-up and line-by-line R script that someone else has prepared
in order to analyze a business problem (Focus of Lecture 1).

2. Replicate the R script that someone else has prepared and introduce variations
(Seminar).

3. Work on a new set of business problems (Lecture 2).
4. Describe the process (data analysis with R) needed to answer a business
question.

5. Understand and explain the logic behind the model that you have prepared
to answer a business question.

8.2 Students: Advance Preparation
Watch the videos for chapters 7 and 8 (see studentPreparation). These videos
will help you better understand how the package tidyverse works. If you have not
done this yet, make sure to complete the DataCamp assignment.


https://docs.google.com/spreadsheets/d/1Ekt0bE4LqABehMM9EyPdhshBaZlAOWN4ZfJ0QYDyAEA/


8.3 Lecture 1

8.3.1 Company Description
The Toy Store (TTS) is a retailer in the United States of America (USA) that sells
a wide range of toys. TTS does not produce or make any products; it purchases
the products from suppliers (vendors) and sells those products to consumers. The
company is headquartered in Harrisburg, Pennsylvania, and operates 20 retail toy
stores from Maryland to Maine. The headquarters location also serves as the location
for the online store. Sales are approximately 450 million dollars and costs of goods
sold are approximately 300 million dollars.

The management of TTS would like to compare business performance in terms
of revenues, cost, and gross profit across seven (7) stores located in one of two states:
Massachusetts (MA) and New York (NY) on two sales days: 2016-October-02 and
2016-December-20.

The data set, dt4TTSL1.csv, is available on Learn. Each line in the TTS
data set is an observation of the daily sales total for a specific product in each store.

8.3.2 Data Understanding and Preparation
We load and review the TTS data set, and look for missing values.

dt0 <- read_csv("dt4TTSL1.csv")
glimpse(dt0)

Rows: 11,914
Columns: 9
$ Store             <chr> "MA-1384", "MA-1384", "MA-1384", "MA-˜
$ State             <chr> "MA", "MA", "MA", "MA", "MA", "MA", "˜
$ SalesDate         <date> 2016-12-20, 2016-12-20, 2016-10-02, ˜
$ Description       <chr> "Little People Musical Zoo Train", "M˜
$ SalesQuantity     <dbl> 1, 3, 2, 1, 1, 2, 1, 4, 11, 7, 1, 4, ˜
$ SalesPriceUnit    <dbl> 20.75, 4.99, 34.99, 11.99, 11.99, 64.˜
$ VendorNumber      <dbl> 55185, 24967, 55185, 24967, 24967, 55˜
$ VendorName        <chr> "FisherPrice", "Hasbro", "FisherPrice˜
$ PurchasePriceUnit <dbl> 14.14, 3.37, 23.89, 8.43, 8.43, 45.59˜

As we can see, the data set has 11,914 observations and the following 9 variables:

1. Store = store number.


2. State = state where the store is located.
3. SalesDate = the sales date.
4. Description = name of product.
5. SalesQuantity = units sold.
6. SalesPriceUnit = price per unit. This does NOT include sales tax.
7. VendorNumber = supplier identification number.
8. VendorName = supplier name.
9. PurchasePriceUnit = price that The Toy Store pays to buy the product from
the supplier.

As we can see from the R output below, there are no missing values in the data
set.

dt0 %>%
is.na() %>%
colSums()

Store State SalesDate
0 0 0

Description SalesQuantity SalesPriceUnit
0 0 0

VendorNumber VendorName PurchasePriceUnit
0 0 0

Work with your crew: Analyze the summary statistics for the following
variables: SalesQuantity, SalesPriceUnit, and PurchasePriceUnit. Are there any
outliers in these variables? Explain why.

• The R Script and R output for this problem are on p. 126.

Data Preparation

Given the focus of the business problem on revenues, cost, and gross profit, we will
leverage the existing variables to add these three new variables to the data set.


dt1 <- dt0 %>%
  mutate(Revenue = SalesQuantity*SalesPriceUnit,
         productCost = SalesQuantity*PurchasePriceUnit,
         GrossProfit =
           SalesQuantity*(SalesPriceUnit-PurchasePriceUnit))

Work with your crew: Analyze the summary statistics for the three new
variables.

• The R Script and R output for this problem are on p. 126.

8.3.3 Models to Address Business Questions
First Business Question: Which state (MA or NY) generated the most revenue
in total during the two days? To answer the question, we need to create a model
(e.g., a pivot table) based on the following input, process, and output:

1. Input:

• Select the variables State and Revenue

2. Process:

• First: Group by State
• Second: Aggregate/sum revenues for each category/state

3. Output

• Pivot table showing sum of revenue for each state (MA, NY)

dt1 %>%
select(State, Revenue) %>% # input
group_by(State) %>% # process – grouping
summarize(StateRevenue=sum(Revenue)) # process – aggregate


# A tibble: 2 x 2
State StateRevenue

1 MA 697436.
2 NY 649867.

Based on these results, we can see that MA generated the most revenue: $697,436.

Second Business Question: TTS operates four (4) stores in Massachusetts (MA)
and three (3) stores in New York (NY). The management of TTS would like to
see the total revenue, as well as the average, min, and max revenue for each store.

dt4q2 <- dt1 %>%
  select(State, Store, Revenue) %>%
  group_by(State, Store) %>%
  summarize(sumRevenue=sum(Revenue),
            avgRevenue=mean(Revenue),
            maxRevenue=max(Revenue),
            minRevenue=min(Revenue))

dt4q2

# A tibble: 7 x 6
# Groups: State [2]

State Store sumRevenue avgRevenue maxRevenue minRevenue

1 MA MA-1384 106608. 91.9 5199. 0.57
2 MA MA-1738 125060. 94.2 3500. 0.99
3 MA MA-2647 226203. 104. 12600. 0.45
4 MA MA-5262 239564. 123. 15750. 0.5
5 NY NY-3349 187604. 123. 7000. 2.06
6 NY NY-3458 136982. 92.7 4399. 0.9
7 NY NY-7283 325282. 141. 13650. 0.4

Work with your crew: Review the above results and answer the following
questions:

1. Which of the 7 stores generated the most revenue?


2. Which store in Massachusetts (MA) had the lowest average revenue per prod-
uct?

3. What amount is the lowest revenue per product for a store in New York (NY)?
4. Which is larger: the maximum revenue per product for a store in Massachusetts

(MA); or, the maximum revenue per product for a store in New York (NY)?

• The R Script and R output for these questions are on p. 126.

Third Business Question – Part A

Work with your crew: What might be a business question that you could
address with the data analysis shown below? Be succinct: one to two short
sentences. Suggest possible business questions.

dt1 %>%
select(State, SalesDate, Revenue) %>%
group_by(State, SalesDate) %>%
summarize(sumRevenue=sum(Revenue))

# A tibble: 4 x 3
# Groups: State [2]

State SalesDate sumRevenue

1 MA 2016-10-02 281140.
2 MA 2016-12-20 416296.
3 NY 2016-10-02 272782.
4 NY 2016-12-20 377086.

• Suggested answers for this question are on p. 127


Third Business Question – Part B

Work with your crew: Summarize one key take-away point that you want to
communicate/share with the management of TTS based on the data analysis shown
below:

State SalesDate sumRevenue pctTlRevenue

1 MA 2016-10-02 281140. 20.9
2 MA 2016-12-20 416296. 30.9
3 NY 2016-10-02 272782. 20.2
4 NY 2016-12-20 377086. 28.0

• Suggested answers for this question are on p. 128

8.3.4 What is the Logic Behind the R Script?
As part of your training towards becoming a tech-savvy financial professional, it
is critical that you can communicate the results of data analytics in simple non-
technical terms. An equally important skill is that of communicating/translating a
business question into a technical problem. Collectively, the two of them point to
the need to become the conduit/bridge between management and computer/data
scientists. In the following paragraphs, we will work on an exercise that aims to help
you understand the logic behind the R script used to generate the output for the
previous crew exercise (p. 114).

Work with your crew: Use the input-process-output approach to describe the
algorithm that you could have used to generate the above table. Do not prepare the
actual R script. Just describe in plain English the input and process for generating
the output. Think of this as the instructions you would provide to a data analyst
for generating this output (shown on p. 114).

The R Script: The above output was generated using the following R script.
Notice that in order to generate this output, the R script uses the commands
summarize() and mutate() to create new variables. In the following paragraphs
we will go step-by-step through each one of the lines of the R script and try to
understand the difference between summarize() and mutate(), and when to use
each one.

dt_pctTtlRev <- dt1 %>%
  select(State, SalesDate, Revenue) %>%
  group_by(State, SalesDate) %>%
  summarize(sumRevenue=sum(Revenue)) %>%
  ungroup() %>%
  mutate(pctTlRevenue=
           100*sumRevenue/sum(sumRevenue))

1. The first four lines of this script are similar to what we have seen in prior
exercises (Chapter 7). The model groups the data by state and day (i.e.,
creates a group for each state and day) and for each one of these four groups
(2 states x 2 days) generates the sum of revenues (sumRevenue). The following
R output shows the result of running just the first four lines.

dt_pctTtlRev <- dt1 %>%
select(State, SalesDate, Revenue) %>%
group_by(State, SalesDate) %>%
summarize(sumRevenue=sum(Revenue))

dt_pctTtlRev

# A tibble: 4 x 3
# Groups: State [2]

State SalesDate sumRevenue

1 MA 2016-10-02 281140.
2 MA 2016-12-20 416296.
3 NY 2016-10-02 272782.
4 NY 2016-12-20 377086.

2. In order to express sumRevenue as a percentage of total revenue, we need
to calculate total revenue (sum(sumRevenue)) over all records. However, the
data set was already grouped by State and SalesDate. This means that if we
were to apply the formula sum(sumRevenue), we would end up with the
group-level results (the sum of revenue for each one of the four groups).


dt_pctTtlRev <- dt1 %>%
  select(State, SalesDate, Revenue) %>%
  group_by(State, SalesDate) %>%
  summarize(sumRevenue=sum(Revenue),
            pctTlRevenue=
              100*sumRevenue/sum(sumRevenue))

dt_pctTtlRev

# A tibble: 4 x 4
# Groups: State [2]

State SalesDate sumRevenue pctTlRevenue

1 MA 2016-10-02 281140. 100
2 MA 2016-12-20 416296. 100
3 NY 2016-10-02 272782. 100
4 NY 2016-12-20 377086. 100

3. To avoid this, we need to ungroup() before we can create the new aggregate
variable (i.e., sum(sumRevenue)). We can then use this aggregate to express
the revenues of each one of the four groups as a percentage of total revenue
(i.e., pctTlRevenue = 100*sumRevenue/sum(sumRevenue)). Below, this is
done with the function summarize().

dt_pctTtlRev <- dt1 %>%
  select(State, SalesDate, Revenue) %>%
  group_by(State, SalesDate) %>%
  summarize(sumRevenue=sum(Revenue)) %>%
  ungroup() %>%
  summarize(pctTlRevenue=
              100*sumRevenue/sum(sumRevenue))

dt_pctTtlRev

# A tibble: 4 x 1
pctTlRevenue


1 20.9
2 30.9
3 20.2
4 28.0

4. Using summarize() generated the desired results, but in the process it removed
all the other variables. We can avoid this problem by using the command
mutate() instead of summarize().

dt_pctTtlRev <- dt1 %>%
  select(State, SalesDate, Revenue) %>%
  group_by(State, SalesDate) %>%
  summarize(sumRevenue=sum(Revenue)) %>%
  ungroup() %>%
  mutate(pctTlRevenue=
           100*sumRevenue/sum(sumRevenue))

dt_pctTtlRev

# A tibble: 4 x 4
State SalesDate sumRevenue pctTlRevenue

1 MA 2016-10-02 281140. 20.9
2 MA 2016-12-20 416296. 30.9
3 NY 2016-10-02 272782. 20.2
4 NY 2016-12-20 377086. 28.0

Work with your crew: The above explanation of the logic behind the R script
is a bit technical. Work with your team to create a couple of paragraphs that
provide the same explanation in non-technical terms. You may want to use your
answer to the crew problem on p. 114 as an outline for your explanation.

8.3.5 Models to Address Business Questions – Continued


Fourth Business Question: What percentage of the firm’s December 20th prod-
uct cost (COGS) was contributed by store MA-2647?

To answer this question, we will need to perform the following steps:

1. Calculate the COGS of each store on each day (sumCOGS).

2. Calculate the total COGS for each day (dayCOGS).

3. Express the sumCOGS as percentage of dayCOGS.

4. Limit our results to just one store (MA-2647) and one day (December 20th).

dt1 %>%
  select(Store, SalesDate, productCost) %>%
  group_by(SalesDate, Store) %>%
  summarize(sumCOGS=sum(productCost)) %>%
  ungroup() %>%
  group_by(SalesDate) %>%
  mutate(dayCOGS=sum(sumCOGS),
         pctDayCOGS = 100*sumCOGS/dayCOGS) %>%
  filter(SalesDate=="2016-12-20" &
         Store=="MA-2647")

# A tibble: 1 x 5
# Groups: SalesDate [1]

SalesDate Store sumCOGS dayCOGS pctDayCOGS

1 2016-12-20 MA-2647 96467. 548469. 17.6

Based on the above results, we can see that the firm's COGS on December 20th
was $548,469. Store MA-2647's COGS for the same day was $96,467, which
represents 17.6% of the day's COGS.

• For an explanation of the logic behind the above R script see p. 128.

Fifth Business Question: What percentage of the firm’s total product cost (COGS)
was contributed by sales on October 2nd in store NY-3458?

To answer this question, we will need to perform the following steps: 1) calculate
the COGS of each store on each day (sumCOGS); 2) calculate the total COGS
for all stores and days (totalCOGS), i.e., the COGS for the entire firm; 3) express
the daily COGS of each store (sumCOGS) as a percentage of the firm's COGS
(totalCOGS); 4) limit our results to just one store (NY-3458) and one day
(October 2nd).

The model and results are shown below; the answer is 4.27%.

dt1 %>%
  select(Store, SalesDate, productCost) %>%
  group_by(Store, SalesDate) %>%
  summarize(sumCOGS=sum(productCost)) %>%
  ungroup() %>%
  mutate(totalCOGS=sum(sumCOGS),
         pctTotalCOGS = 100*sumCOGS/totalCOGS) %>%
  filter(SalesDate=="2016-10-02" &
         Store=="NY-3458")

# A tibble: 1 x 5
Store SalesDate sumCOGS totalCOGS pctTotalCOGS

1 NY-3458 2016-10-02 39769. 931033. 4.27

• For an explanation of the logic behind the above R script see p. 130.

8.4 Seminar
The primary objective of the seminar is to replicate – with some small variations –
the R script from Lecture 1.

8.4.1 Prepare R Environment for Seminar
Since you are still getting comfortable with R, you may need some help remembering
all the steps and R commands. Remember that the following link (R Project Check
List) provides a template that you can use.

Complete the following steps before you start working on the content of the
seminar.

1. If you have not done this yet, create a folder/directory on your computer and
name it something like c08 or c08_TTS.

https://docs.google.com/spreadsheets/d/1UpWq-gvsQulxLBobwO1iq2g_T0EskyV-Zw2Nr7-U3-g/


2. Download/save the data set dt4TTSL1 (available on Learn) in the c08 folder.
3. Start RStudio and clear the source area of any existing files, the console
area of any existing R script and/or results, and clear any existing data from
the environment.

4. Start a new R file, name it c08S.R and save it in the c08 folder.
5. Set the working directory in R to ‘source file location’.
6. Run/load the tidyverse library, using library(tidyverse). Do not re-run
the installation of the tidyverse package.

7. Load and review the data set.

• How many observations? Are there missing values?
• Generate summary stats for numeric variables.
• Find the store and transaction that generated the highest sales quantity.
Name the store, day, and product description. Hint: Use the function …
filter(SalesQuantity==max(SalesQuantity)) (see the sketch after this
list).

8. Data preparation. Add the following variables to your data set:

• Revenue is SalesQuantity times SalesPriceUnit,
• productCost is SalesQuantity times PurchasePriceUnit, and
• GrossProfit is sales quantity times the difference between sales price
and purchase price.
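
For step 7, the hint expands to something like the following sketch (assuming the
data were loaded into dt0):

dt0 %>%
  filter(SalesQuantity==max(SalesQuantity)) %>%
  select(Store, SalesDate, Description, SalesQuantity)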

8.4.2 Seminar Questions

Work with your crew on the following questions:

1. Generate total revenue by state. Name the new variable StateRevenue. Which
state generated the most revenues?

2. Generate total revenue (sumRevenue), average revenue (avgRevenue), as well
as the minimum and maximum revenue (minRevenue and maxRevenue,
respectively) for each state/store.

3. For each state, generate the total revenue per day. Name the variable sumRev-
enue.



• Filter the above results to find the day when each state generated the
highest volume of revenues. Hint: set the filter to
sumRevenue == max(sumRevenue)

• Filter the above results to find the day when each state generated the
lowest volume of revenues. Hint: set the filter to
sumRevenue == min(sumRevenue)

4. For each state, generate the total revenue per day and express it as a percentage
of the firm's total revenues. Name the new variable pctTtlRevenue. Save the
results as a new data set named dt_pctTtlRev.

• On the day that had the highest (max) contribution to the firm’s total
revenue, what was this contribution (percentage of total revenue)?

• On the day that had the smallest (min) contribution to the firm’s total
revenue, what was this contribution (percentage of total revenue)?

5. For each store, calculate the daily COGS and express it as a percentage of the
daily COGS across all stores. Name the new variable pctDayCOGS. Save the
results in a new data set named dt_pctDayCOGS.

• Arrange the data in descending order of pctDayCOGS and review the top
six observations. Hint: Use the functions arrange(desc()) and head().

• Arrange the data in descending order of pctDayCOGS and review the
bottom ten observations.

6. For each store, calculate the daily COGS and express it as a percentage of the
total COGS across all stores. Name the new variable pctTotalCOGS. Save the
results in a new data set named dt_pctTotalCOGS.

• Generate summary statistics for the pctTotalCOGS.
• Use mental math to say whether you agree/disagree with the following
statement: Any value above 10% is an outlier, and there are no outliers
in the lower end of the distribution.
• Use the formulas (Q1 - 1.5*IQR, Q3 + 1.5*IQR) to support your position.


8.5 Lecture 2

8.5.1 The Toy Store Mini-case
As a Business Analyst working at TTS during a co-op work term, you receive a
message from the TTS Financial Analytics Team Leader, who asks you to analyze a
large data set and provide insights for two upcoming meetings.

The new TTS data set has the same seven (7) stores and sales for an entire year.
The data set has over a million lines. It is very important that you do NOT open
the data set in Excel: Excel will not import (it will delete) the observations beyond
the 1-million-row limit. Download the new data set (dt4TTSL2.csv) from Learn.

1. For a meeting with the TTS Purchasing Manager, we need to understand the
revenue and gross profit for products that we sold from each of our four vendors:
EA, FisherPrice, Hasbro, and Mattel. This information helps the Purchasing
Manager negotiate volume discounts for future purchases.

2. For a meeting with the TTS Store Operations Manager, we need to understand
the sales performance for all seven (7) stores. The Sales Manager wants to
review daily revenue, daily sales quantities, and the count of unique products
sold each day before she books review meetings with the Store Managers.

The large data set covers the fiscal year that ends in January 2017. Like many retailers,
TTS has a January year-end date since sales volumes in December are typically high.
The data set has over 1 million lines (each line is an observation of the daily sales
total for a specific product in each store for all days in the fiscal year).

The Financial Analytics Team Leader’s message concludes as follows: “I pulled
data for two days to generate sample pivot tables for each meeting. Please use the
data set that contains data for the entire fiscal year and generate a pivot table like
Figure 8.1 for the purchasing meeting.”

“You should also create a pivot table like the one shown in Figure 8.2 for the
sales meeting.”

Work with your crew: Summarize one key take-away point that we should
communicate during our meeting with the TTS Purchasing Manager, and another
one for the TTS Store Operations Manager.


Figure 8.1: Sample Vendor Analysis

Figure 8.2: Sample Store Sales Analysis


8.6 Key Lessons from this Chapter
The ability to communicate the results of data analytics in simple non-technical terms
is a very important skill. An equally important skill is that of translating a business
question into a technical problem. Collectively, the two of them point to the need for
employees who can become the conduit/bridge between management and computer/data
scientists. The consensus among senior managers and recruiters seems to be that demand
for people with such skills is much larger than the supply.

People with such skills are in high demand because the mindset and approach to
problem solving differ between the two groups. On the one hand, managers tend to
focus on the monetary implications of adopting/implementing a new technology, with
little or no understanding of the technical constraints. On the other hand, computer/data
scientists tend to focus on the technical attributes of the proposed solution, with little
or no understanding of the economic or user implications.

One of the main objectives this week was to help students develop the skills that
would allow them to become the conduit/bridge between management and com-
puter/data scientists. More specifically, from a technical standpoint, the focus this
week was on understanding what needs to be done and explaining it to others, instead
of memorizing keystrokes and R commands. To achieve this objective, in the midterm
question 5 (p. ??), we introduced the business question, described the steps that one
would have to take to answer the question, introduced the R script, and explained
step-by-step the logic in the R script. We assigned the creation of the input-process-
output as a crew exercise.

8.7 Preview of Next Chapter: Pet Adoption
The primary objective in the previous chapters was to learn how to leverage tools like
spreadsheets and R to develop models that would help us answer specific business
questions. To help students focus on this objective, we used ‘clean’ data sets. This
means data sets which contain relatively simple and well-defined variables and have
no missing values. However, this is not always the case.

In our next project, we are going to work on a pet adoption project. We will
try to see if we can predict the likelihood that a rescue dog (like Penny) would be
adopted. However, in order to do this we will need to clean and prepare the data
set. It is a messy job, but …

Millions of stray animals suffer on the streets or are euthanized in shelters


every day around the world. If homes can be found for them, many
precious lives can be saved — and more happy families created.2

Figure 8.3: Penny

Working with a clean and well-prepared data set is important and consistent with
our goal, i.e., estimating the likelihood that a rescue dog (like Penny) would be
adopted. The answer to that question has business implications for planning shelter
operations; e.g., how many animals will they have in the upcoming month, and
therefore how much food do they need to order?

While developing a predictive model is beyond the scope of this course,3 this
case will give us an opportunity to discuss the different forms of data analytics (i.e.,
descriptive, diagnostic, predictive, prescriptive). It is also an opportunity to demo
the limitations of trying to “guess” the likelihood that Penny is adopted and to point
out the importance of building good prediction models.

2https://www.kaggle.com/c/petfinder-adoption-prediction/overview/description
3You will learn about probability theory in your introduction to statistics class, and you will
learn how to generate predictive models in the foundation of data mining class.



8.8 Appendix: Answers to Selected Questions
Crew Exercise: Analyze Summary Statistics in Original Data

Use the following R script and R output to answer the question on p. 110.

dt0 %>%
select(SalesQuantity, SalesPriceUnit, PurchasePriceUnit) %>%
summary()

SalesQuantity SalesPriceUnit PurchasePriceUnit
Min. : 1.000 Min. : 0.40 Min. : 0.28
1st Qu.: 1.000 1st Qu.: 11.99 1st Qu.: 8.11
Median : 2.000 Median : 18.83 Median : 12.82
Mean : 4.182 Mean : 26.75 Mean : 18.48
3rd Qu.: 4.000 3rd Qu.: 26.99 3rd Qu.: 18.88
Max. :199.000 Max. :535.66 Max. :378.36

Hint: Use mental math to evaluate if there are outliers in these variables.

Crew Exercise: Analyze Summary Statistics in New Variables

Use the following R script and R output to answer the question on p. 111.

dt1 %>%
select(Revenue, productCost, GrossProfit) %>%
summary()

Revenue productCost GrossProfit
Min. : 0.40 Min. : 0.28 Min. : 0.12
1st Qu.: 17.99 1st Qu.: 12.64 1st Qu.: 5.66
Median : 35.97 Median : 24.80 Median : 11.23
Mean : 113.09 Mean : 78.15 Mean : 34.94
3rd Qu.: 86.27 3rd Qu.: 59.77 3rd Qu.: 26.47
Max. :15749.55 Max. :10836.00 Max. :4913.55

Crew Exercise: Second Business Question

Use the following R script and R output to answer the question on p. 112.


1. Which of the 7 stores generated the most revenue?

• dt4q2 %>% filter(sumRevenue==max(sumRevenue))

# A tibble: 1 x 6
State Store sumRevenue avgRevenue maxRevenue minRevenue

1 NY NY-7283 325282. 141. 13650. 0.4

2. Which store in Massachusetts (MA) had the lowest average revenue per prod-
uct?

• dt4q2 %>% filter(State==”MA” & avgRevenue==min(avgRevenue))

# A tibble: 1 x 6
State Store sumRevenue avgRevenue maxRevenue minRevenue

1 MA MA-1384 106608. 91.9 5199. 0.57

3. What amount is the lowest revenue per product for a store in New York (NY)?

• Use appropriate filter to generate the correct answer.

4. Which is larger: the maximum revenue per product for a store in Massachusetts
(MA); or, the maximum revenue per product for a store in New York (NY)?

• Use appropriate filter to generate the correct answer.

Crew Exercise: Third Business Question – Part A

Suggested Answers for output associated with the part A of the third business ques-
tion on p. 113.

• Compare revenues on two days (October 2nd & December 20th) between NY
and MA

• Which state (MA or NY) generates the most/least amount of revenues on each
day (October 2nd & December 20th)

• Are there differences in revenues in states (MA and NY) on two given days
(October 2nd & December 20th)?


Crew Exercise: Third Business Question – Part B

Suggested Answers for output associated with part B of the third business ques-
tion on p. 114.

• Although both states contribute similarly to the firm’s revenues on October 2nd,
MA stores contribute more (30.9% vs 28%) on December 20th. The manage-
ment may want to explore why and see if we can replicate their success with
NY stores.

• On December 20th, stores in NY contributed 28% versus 30.9% from MA stores.
The management may want to explore why there is this difference and if there
is something they can do to increase sales in NY in December.

Fourth Business Question – Understand the Logic of R Script

In the following paragraphs we have a step-by-step explanation of the logic behind
the R script used to answer the fourth business question (p. 118).

The first four lines – repeated below – are similar to what we have seen in prior
exercises (Chapter 7). The model groups the data by store and day (i.e., creates a
group for each store and day) and for each one of these groups generates the total
COGS (i.e., sumCOGS).

dt1 %>%
select(Store, SalesDate, productCost) %>%
group_by(SalesDate,Store) %>%
summarize(sumCOGS=sum(productCost))

The second component of this analysis focuses on generating the COGS by day
(dayCOGS) and using it to create the new variable (pctDayCOGS). The new
variable expresses the store's COGS as a percentage of the day's COGS.

Notice that in order to generate the new aggregate variable pctDayCOGS, we
need to group by SalesDate. However, the data set was already grouped by Store
and SalesDate. Therefore, we need to ungroup() before we do the new grouping
(i.e., group_by(SalesDate)).

For each one of these SalesDate-based groups, we calculate the aggregate variable
– the sum of the day's COGS (i.e., dayCOGS=sum(sumCOGS)) – as well as the store's
COGS expressed as a percentage of dayCOGS (i.e., pctDayCOGS = 100*sumCOGS/dayCOGS).

The version of the R script up to this point, as well as the results, are shown
below.


dt1 %>%
  select(Store, SalesDate, productCost) %>%
  group_by(SalesDate, Store) %>%
  summarize(sumCOGS=sum(productCost)) %>%
  ungroup() %>%
  group_by(SalesDate) %>%
  mutate(dayCOGS=sum(sumCOGS),
         pctDayCOGS = 100*sumCOGS/dayCOGS)

# A tibble: 14 x 5
# Groups: SalesDate [2]

SalesDate Store sumCOGS dayCOGS pctDayCOGS

1 2016-10-02 MA-1384 29434. 382565. 7.69
2 2016-10-02 MA-1738 38284. 382565. 10.0
3 2016-10-02 MA-2647 59988. 382565. 15.7
4 2016-10-02 MA-5262 66216. 382565. 17.3
5 2016-10-02 NY-3349 59619. 382565. 15.6
6 2016-10-02 NY-3458 39769. 382565. 10.4
7 2016-10-02 NY-7283 89255. 382565. 23.3
8 2016-12-20 MA-1384 44257. 548469. 8.07
9 2016-12-20 MA-1738 48058. 548469. 8.76

10 2016-12-20 MA-2647 96467. 548469. 17.6
11 2016-12-20 MA-5262 98992. 548469. 18.0
12 2016-12-20 NY-3349 70057. 548469. 12.8
13 2016-12-20 NY-3458 54905. 548469. 10.0
14 2016-12-20 NY-7283 135732. 548469. 24.7

The third and final component of this analysis (R script) is to focus on the
particular store and day by using filter().

dt1 %>%
  select(Store, SalesDate, productCost) %>%
  group_by(SalesDate, Store) %>%
  summarize(sumCOGS=sum(productCost)) %>%
  ungroup() %>%
  group_by(SalesDate) %>%
  mutate(dayCOGS=sum(sumCOGS),
         pctDayCOGS = 100*sumCOGS/dayCOGS) %>%
  filter(SalesDate=="2016-12-20" &
         Store=="MA-2647")

Fifth Business Question – Understand the Logic of R Script

In the following paragraphs we have a step-by-step explanation of the logic behind
the R script used to answer the fifth business question (p. 119).

The main new component in this analysis is that we want the store cost expressed
as a percentage of total firm cost. To achieve this, after the ungroup function, we
don’t need to do another group by. We simply generate the sum across all records
(i.e., totalCOGS=sum(sumCOGS)).

There are several ways we can implement this process. The first approach –
and simplest one – is to generate the results by grouping based on store and day, and
visually inspect/select the store that had the largest difference. Visual inspection of
the results shown below indicates that store NY-7283 seems to have had the biggest
difference, but visual inspection may be difficult in a case with many stores.

dt1 %>%
select(Store, SalesDate, productCost) %>%
group_by(Store, SalesDate) %>%
summarize(sumCOGS=sum(productCost))

# A tibble: 14 x 3
# Groups: Store [7]

Store SalesDate sumCOGS

1 MA-1384 2016-10-02 29434.
2 MA-1384 2016-12-20 44257.
3 MA-1738 2016-10-02 38284.
4 MA-1738 2016-12-20 48058.
5 MA-2647 2016-10-02 59988.
6 MA-2647 2016-12-20 96467.
7 MA-5262 2016-10-02 66216.
8 MA-5262 2016-12-20 98992.
9 NY-3349 2016-10-02 59619.
10 NY-3349 2016-12-20 70057.
11 NY-3458 2016-10-02 39769.
12 NY-3458 2016-12-20 54905.
13 NY-7283 2016-10-02 89255.
14 NY-7283 2016-12-20 135732.

The second approach is to make the visual inspection easier by presenting the
results in the form of a pivot table. To achieve this, we add the function
spread(SalesDate, sumCOGS). You can think of the function spread() as telling R
to treat the values of the variable SalesDate as columns and the variable sumCOGS
as the cell values. Again, visual inspection shows that store NY-7283 had the
biggest difference.

dt1 %>%
select(Store, SalesDate, productCost) %>%
group_by(Store, SalesDate) %>%
summarize(sumCOGS=sum(productCost)) %>%
spread(SalesDate, sumCOGS)

# A tibble: 7 x 3
# Groups: Store [7]

Store `2016-10-02` `2016-12-20`

1 MA-1384 29434. 44257.
2 MA-1738 38284. 48058.
3 MA-2647 59988. 96467.
4 MA-5262 66216. 98992.
5 NY-3349 59619. 70057.
6 NY-3458 39769. 54905.
7 NY-7283 89255. 135732.

The third approach is to have the model calculate the differences directly. Based
on the results, store NY-7283 clearly had the biggest difference.

dt1 %>%
  select(Store, SalesDate, productCost) %>%
  group_by(Store, SalesDate) %>%
  summarize(sumCOGS=sum(productCost)) %>%
  ungroup() %>%
  group_by(Store) %>%
  mutate(delta=diff(sumCOGS)) %>%
  select(Store, delta) %>%
  unique()

# A tibble: 7 x 2
# Groups: Store [7]

Store delta

1 MA-1384 14823.
2 MA-1738 9774
3 MA-2647 36479.
4 MA-5262 32776.
5 NY-3349 10438.
6 NY-3458 15137.
7 NY-7283 46477.

What is the logic behind this analysis (R script)? In this final version, the
focus is on generating for each store the difference in COGS between the two days.

To achieve this goal, we ungroup and then group by store. Within each group
(store) we have two values (one for each day). Therefore, we can use the function
diff() to calculate the difference between these two values.
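
To see what diff() returns, here is a tiny example using the two COGS values for
store MA-1384:

diff(c(29434, 44257))   # returns 14823: the second value minus the first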

Work with your crew: A more useful approach would be to identify the
exact store that had the maximum delta. See if you can revise the code to
produce just one line that shows the store with the maximum delta. Hint: use
the filter function.


Chapter 9

Pet Adoption Mini-Case

pet noun [C] (ANIMAL) an animal that is kept in the home as a com-
panion and treated kindly (Cambridge Dictionary)

9.1 Learning Objectives
Our main focus this week will be on binary variables and how to use them in order
to classify/predict the likelihood that an event may occur. The case and data for
this chapter are based on pet adoption. More specifically, our target variable will be
whether a pet will be adopted or not. However, this type of modeling/analysis has
many other accounting and finance applications, such as bankruptcy prediction and
fraud detection.

By the end of this week, students should be able to:

1. Create binary variables from categorical or numeric variables.
2. Generate frequency tables for categorical variables.
3. Generate conditional probabilities based on one or more categorical variables.
4. Interpret conditional probabilities.
5. Work on a simple business simulation that uses estimated probabilities as an
input. For example, given a shelter's capacity and estimated ability to place
pets with families, decide on how many new pets to bring in.



9.2 Students: Advance Preparation
We will continue working with the tidyverse package. If you have not done this
yet, please watch the videos for chapters 7 and 8 (see studentPreparation). These
videos will help you better understand how the package tidyverse works. If you
have not done the DataCamp assignment yet, make sure to work on it.

9.3 Lecture 1
When working on data analytics projects that aim to generate some form of predic-
tion, the data set is divided into two parts: a training set and a test set. The idea
behind this distinction is that analysts can use the training portion of the data to
generate the model, and the testing portion to evaluate the model. Typically, the
training portion of the data set is in the neighbourhood of 70% to 80% of total
observations. The remaining observations are the test portion.

The data set has a variable of interest, which is known as the target variable.
For example, in the pet adoption case the target (variable) is the pet adoption or
the probability that a pet would be adopted. The data analyst will try to generate a
model based on training data to predict the probability that a pet would be adopted.
Once the model has been built, it will be tested with test data.

The target variable in the pet adoption case is a binary variable, i.e., a variable
that takes two values. If a pet is adopted, the variable takes the value of Yes or
1 or TRUE. If the pet is not adopted, the variable takes the value of No or 0 or
FALSE. In this chapter, we will approach this problem from an exploratory
standpoint. This means that we will use common sense to identify variables that
we feel may be good predictors of the likelihood of adoption and generate the
probability of adoption (frequency based) using the actual data.

In upper level classes, you will learn more advanced techniques, such as logistic
regression, association rules, and/or decision trees. These models tend to generate
better predictions because they consider the effect of multiple variables and their
interaction on the likelihood of adoption.

There are numerous business applications that use a setting similar to pet adoption.
For example, we can try to predict the probability that a firm will file for bankruptcy
by using a target variable that is binary (firm filed for bankruptcy: yes or no). You
can try to predict the probability that management has committed fraud during a
financial audit. You can try to predict the probability that a person will default on
a loan that they have received.

https://docs.google.com/spreadsheets/d/1Ekt0bE4LqABehMM9EyPdhshBaZlAOWN4ZfJ0QYDyAEA/


9.3.1 PetPivot Mini-Case
PetPivot connects dogs and cats with new families, creating happy homes for people
and pets. After a recent hurricane, a volunteer group helping with rescue efforts
found a large number of dogs abandoned when their owners fled the island for safety.
Given the severity of the hurricane damage, the owners cannot return home or keep
these pets. PetPivot management volunteered to accept some of the dogs and find
them new homes. However, PetPivot only has enough shelter space to accept a
maximum of 1,000 dogs. The first batch of pets would arrive on the first of January
and a second batch could arrive on the first of February. PetPivot will accept 1,000
dogs in January. PetPivot management wants to accept a second batch, but needs a
way to predict how many of the first 1,000 dogs would be adopted during January,
freeing up space for a second batch. As pet lovers, we have volunteered to help
PetPivot using data that the company currently tracks.

1. In Lecture 1, we will understand and prepare the data, then use summary
stats to explore possible models to predict the likelihood that a pet would be
adopted.

2. In the Seminar, you will replicate the R script from Lecture 1 and generate any
additional combinations of variables that you think that might be important.

3. In Lecture 2, we will help management estimate how many dogs it can accept
in a second batch given the space limitation for 1,000 dogs in total at the start
of February.

9.3.2 Data Understanding and Preparation
The PetPivot manager has provided us with the data set (petPivot.csv – available
on the course website), which has a training and a test component. As you can see
from the list below, the majority of the variables are categorical.

1. dataSet – training or test observations
2. PetID – Unique hash ID of pet profile
3. Type – Type of animal (dog, cat)
4. AdoptionSpeed – Categorical speed of adoption. Lower is faster.

(a) 0 – Pet was adopted on the same day as it was listed.
(b) 1 – Pet was adopted between 1 and 7 days (1st week) after being listed.


(c) 2 – Pet was adopted between 8 and 30 days (1st month) after being listed.
(d) 3 – Pet was adopted between 31 and 90 days (2nd & 3rd month) after
being listed.
(e) 4 – No adoption after 100 days of being listed.

5. Name – Name of pet. The value is empty (NA) if the pet is not named.
6. Age – Age of pet when listed, in months
7. Gender – Gender of pet (Male, Female; xplePets represents a group of pets).
8. BreedName1 – Primary breed of pet.
9. BreedName2 – Secondary breed of pet, if pet is of mixed breed.
10. ColorName1 – Color 1 of pet.
11. ColorName2 – Color 2 of pet.
12. MaturitySize – Size at maturity (Small, Medium, Large, Extra Large, Not
Specified).
13. FurLength – Fur length (Short, Medium, Long, Not Specified).
14. Vaccinated – Pet has been vaccinated (Yes, No, Not Specified).
15. Dewormed – Pet has been dewormed (Yes, No, Not Specified).
16. Sterilized – Pet has been spayed / neutered (Yes, No, Not Specified).
17. Health – Health Condition (Healthy, minorIssues, majorIssues, Not Specified).
18. Quantity – Number of pets represented in profile.
19. VideoAmt – Total uploaded videos for this pet.
20. PhotoAmt – Total uploaded photos for this pet.
21. Description – Profile write-up for this pet. The primary language used is
English, with some in Malay or Chinese.
22. Fee – Adoption fee (0 = Free).

Load and Review Data

library(tidyverse)
dt0 <- read_csv("petPivot.csv")
glimpse(dt0)

Observations: 18,941
Variables: 22
$ dataSet       <chr> "train", "train", "train", "train", "t…
$ PetID         <chr> "86e1089a3", "6296e909a", "3422e4906",…
$ Type          <chr> "cat", "cat", "dog", "dog", "dog", "ca…
$ AdoptionSpeed <dbl> 2, 0, 3, 2, 2, 2, 1, 3, 1, 4, 1, 1, 2,…
$ Name          <chr> "Nibble", "No Name Yet", "Brisco", "Mi…
$ Age           <dbl> 3, 1, 1, 4, 1, 3, 12, 0, 2, 12, 2, 3, …
$ Gender        <chr> "Male", "Male", "Male", "Female", "Mal…
$ BreedName1    <chr> "Tabby", "Domestic Medium Hair", "Mixe…
$ BreedName2    <chr> NA, NA, NA, NA, NA, NA, "Domestic Long…
$ ColorName1    <chr> "Black", "Black", "Brown", "Black", "B…
$ ColorName2    <chr> "White", "Brown", "White", "Brown", NA…
$ MaturitySize  <chr> "Small", "Medium", "Medium", "Medium",…
$ FurLength     <chr> "Short", "Medium", "Medium", "Short", …
$ Vaccinated    <chr> "No", "Not Specified", "Yes", "Yes", "…
$ Dewormed      <chr> "No", "Not Specified", "Yes", "Yes", "…
$ Sterilized    <chr> "No", "Not Specified", "No", "No", "No…
$ Health        <chr> "Healthy", "Healthy", "Healthy", "Heal…
$ Quantity      <dbl> 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 1, 1, 1,…
$ Description   <chr> "Nibble is a 3+ month old ball of cute…
$ PhotoAmt      <dbl> 1, 2, 7, 8, 3, 2, 3, 9, 6, 2, 7, 2, 1,…
$ VideoAmt      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Fee           <dbl> 100, 0, 0, 150, 0, 0, 300, 0, 0, 0, 0,…

There are a few things that we need to pay attention to in this data set, in addition
to the fact that it has 18,941 observations and 22 variables:

1. The variable AdoptionSpeed has the format <dbl>, which indicates that the
variable takes numeric values. However, it is clear from the variable description
that the variable is categorical, not numeric. Since this is the focal point of
our analysis, we will need to revisit it during the data preparation stage.

2. There is a variable called dataSet that takes values train and test. We will
use filter to create the training and testing portion of the data set using these
values.

3. There are several variables that take such values as Not Specified. Techni-
cally, these values are not missing, but they are not available from an analy-
sis/decision making standpoint.


Work with your crew: According to the following output, there are no missing
values associated with the variable MaturitySize. How will you find out how many
observations in the data set take the value Not Specified? (See the sketch after
the output below.)

dt0 %>% is.na() %>% colSums()

dataSet PetID Type AdoptionSpeed
0 0 0 3948

Name Age Gender BreedName1
1560 0 0 5

BreedName2 ColorName1 ColorName2 MaturitySize
13840 0 5548 0

FurLength Vaccinated Dewormed Sterilized
0 0 0 0

Health Quantity Description PhotoAmt
0 0 0 0

VideoAmt Fee
0 0
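
One possible starting point (a sketch): the function count() tabulates every value
the variable takes, so Not Specified shows up as its own row with its own frequency.

dt0 %>% count(MaturitySize)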

Data Preparation

In the first stage of our analysis, we want to focus on the probability that a pet
would be adopted. Looking at the missing values in our data set, we realize that for
3,948 observations the adoption speed is missing. This is a very common practice
in the design of the training and test portions of a data set: the target variable is
missing in the test data. We can verify this as follows:

dt0 %>% filter(dataSet==”test”) %>% count(AdoptionSpeed)

# A tibble: 1 x 2
AdoptionSpeed n

1 NA 3948

As we can see, all 3,948 values of the variable AdoptionSpeed in the testing
portion of the data set are missing. This means that we will need to generate the
probability that a pet would be adopted using the training data. To do this, we
need to create a new binary variable, Adopted, that takes two values: Adopted
and Not adopted.


Work with your crew: What is the logic behind the following R script? Try
to explain in non-technical terms why you filter the data set and what the new
variable that you have created represents.

dt1 <- dt0 %>%
  filter(dataSet=="train") %>%
  mutate(Adopted =
           if_else(
             AdoptionSpeed == 4, "Not adopted", "Adopted"))

Understand Target Variable: Pet Adoption Since our target variable is bi-
nary, one of the first things that we can do is to generate the count (n) of pets that
have been adopted or not adopted.

dt1 %>%
count(Adopted)

# A tibble: 2 x 2
  Adopted         n
  <chr>       <int>
1 Adopted     10796
2 Not adopted  4197

As we can see, the number of pets that have been placed is more than double the number of pets that have not been adopted. With a small addition (mutate) to the above script, we can generate a new variable that shows the relative frequency (freq).

dt1 %>%
count(Adopted) %>%
mutate(freq = n / sum(n))

# A tibble: 2 x 3
  Adopted         n  freq
  <chr>       <int> <dbl>
1 Adopted     10796 0.720
2 Not adopted  4197 0.280


Therefore, 72% of the pets in the training data have been adopted and 28% had not been adopted after 100 days of being listed.

Work with your crew: Replicate the above R script using summarise() instead of mutate(). Explain the advantage of using mutate() in this case.
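A sketch of the summarise() variant, for comparison:

dt1 %>%
  count(Adopted) %>%
  summarise(freq = n / sum(n))

Depending on your version of dplyr, this either produces an error or returns the two frequencies without the Adopted and n columns, so you can no longer tell which frequency belongs to which group; mutate() keeps the existing columns and simply adds freq.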

9.3.3 Explore (Model)
Summary Stats of Numeric Variables: Adopted vs Non-Adopted Pets
The goal is to explore whether the age, the number of pets available (Quantity), the number of photos (PhotoAmt) or videos (VideoAmt), or the Fee make a difference in the likelihood that a pet will be adopted.1

Adopted Pets: With the following R script, we focus on (filter) pets that have
been adopted, and generate summary statistics for each one of the numeric variables.

dt1 %>%
select(Adopted, Age, Quantity,PhotoAmt, VideoAmt, Fee) %>%
filter(Adopted==”Adopted”) %>%
summary()

   Adopted               Age             Quantity
 Length:10796       Min.   :  0.000   Min.   : 1.000
 Class :character   1st Qu.:  2.000   1st Qu.: 1.000
 Mode  :character   Median :  3.000   Median : 1.000
                    Mean   :  9.202   Mean   : 1.516
                    3rd Qu.:  7.000   3rd Qu.: 1.000
                    Max.   :212.000   Max.   :20.000

    PhotoAmt         VideoAmt            Fee
 Min.   : 0.000   Min.   :0.00000   Min.   :   0.00
 1st Qu.: 2.000   1st Qu.:0.00000   1st Qu.:   0.00
 Median : 3.000   Median :0.00000   Median :   0.00
 Mean   : 4.111   Mean   :0.06086   Mean   :  21.24
 3rd Qu.: 5.000   3rd Qu.:0.00000   3rd Qu.:   0.00
 Max.   :30.000   Max.   :8.00000   Max.   :3000.00

1 In your stats class you will learn how to perform hypothesis testing in order to evaluate, for example, whether the average age of pets that are adopted is less than that of pets that are not adopted. In other words, do people prefer to adopt younger pets?

Non-Adopted Pets: With the following R script, we focus on (filter) pets that
have not been adopted.

dt1 %>%
select(Adopted, Age, Quantity,PhotoAmt, VideoAmt, Fee) %>%
filter(Adopted==”Not adopted”) %>%
summary()

   Adopted               Age            Quantity
 Length:4197        Min.   :  0.00   Min.   : 1.00
 Class :character   1st Qu.:  3.00   1st Qu.: 1.00
 Mode  :character   Median :  6.00   Median : 1.00
                    Mean   : 13.67   Mean   : 1.73
                    3rd Qu.: 15.00   3rd Qu.: 2.00
                    Max.   :255.00   Max.   :20.00

    PhotoAmt        VideoAmt            Fee
 Min.   : 0.00   Min.   :0.00000   Min.   :  0.00
 1st Qu.: 1.00   1st Qu.:0.00000   1st Qu.:  0.00
 Median : 3.00   Median :0.00000   Median :  0.00
 Mean   : 3.32   Mean   :0.04622   Mean   : 21.32
 3rd Qu.: 4.00   3rd Qu.:0.00000   3rd Qu.:  0.00
 Max.   :30.00   Max.   :8.00000   Max.   :750.00

Contrasting the results of non-adopted pets with those of adopted pets, we can
see that non-adopted pets have a median age which is approximately three months
older than that of adopted pets. Non-adopted pets tend to have an average age of
13.7 months versus 9.2 for adopted pets.

Work with your crew: Examine and contrast the rest of the variables.

1. What is the average and median of the variable Quantity for adopted vs non-
adopted pets?


2. Do you think that this would be useful to make predictions regarding the
likelihood that a pet would be adopted?

3. What is the average and median of the variable PhotoAmt for adopted vs non-adopted pets?

4. Do you think that this would be useful to make predictions regarding the
likelihood that a pet would be adopted?

5. In your opinion, which variable produces the strongest contrast between pets
that have been adopted and non-adopted pets?

9.3.4 Conditional Probabilities
In this section, we will explore (contrast) frequency-based probabilities between
adopted and non-adopted pets. The goal is to see which variable or combination
of variables produces the strongest contrast between adopted and non-adopted pets.

Does the Type of the Pet Matter? To explore this question, we will group
our data based on pet type (dog or cat) and within each group we will generate the
percentage of pets that have been adopted or non-adopted.

Since this is the first time we are working on this type of question, we will split the R script into two stages. First, we want to focus on the variable Adopted, which takes two values (Adopted or Not adopted), and Type, which takes two values (dog or cat). Therefore, the function count will produce four values, one for each group (Adopted cat, Adopted dog, Not adopted cat, and Not adopted dog).

dt1 %>%
count(Adopted, Type) %>%
mutate(freq = n / sum(n))

# A tibble: 4 x 4
  Adopted     Type      n  freq
  <chr>       <chr> <int> <dbl>
1 Adopted     cat    5078 0.339
2 Adopted     dog    5718 0.381
3 Not adopted cat    1783 0.119
4 Not adopted dog    2414 0.161

Second, we want to group pets by type (dog vs cat) in order to see if there are differences in the likelihood of adoption based on type. This means that we want to find probabilities within each group. For example, given that a pet is a cat, what is the probability that the pet would be adopted? It also means that the probabilities add to 100% within each type (dog or cat). This kind of probability is known as a conditional probability.

dt1 %>%
count(Adopted, Type) %>%
mutate(freq = n / sum(n)) %>%
group_by(Type) %>%
mutate(condProb = freq/sum(freq)) %>%
arrange(Type)

# A tibble: 4 x 5
# Groups:   Type [2]
  Adopted     Type      n  freq condProb
  <chr>       <chr> <int> <dbl>    <dbl>
1 Adopted     cat    5078 0.339    0.740
2 Not adopted cat    1783 0.119    0.260
3 Adopted     dog    5718 0.381    0.703
4 Not adopted dog    2414 0.161    0.297

Interpretation of results:

1. The variable n provides the count of dogs or cats that have been adopted or
non-adopted.

(a) There are 5078 cats that have been adopted within the first 100 days.
(b) There are 1783 cats that have not been adopted within the first 100 days.

2. The variable freq converts the above count (n) into a percentage of all pets. Notice that the sum of all freq values is equal to 1.0 or 100%.

(a) Out of all pets in our data set, 33.9% are cats and have been adopted
within the first 100 days.

(b) Out of all pets in our data set, 16.1% are dogs that have not been adopted
within the first 100 days.

(c) If we randomly pick a pet from our data set (training data), the probability that it is a cat that has not been adopted within the first 100 days is 11.9%.


3. The variable condProb converts the count (n) into a percentage of all cats or of all dogs. Notice that the sum of condProb is 100% for cats and 100% for dogs.

(a) If there is a cat in the shelter, the probability that it will be adopted is
74%.

Adopted Type n freq condProb
1 Adopted cat 5078 0.339 0.740
2 Not adopted cat 1783 0.119 0.260

(b) If there is a dog in the shelter, the probability that it will not be adopted
is 29.7%.

Adopted Type n freq condProb
1 Adopted dog 5718 0.381 0.703
2 Not adopted dog 2414 0.161 0.297

(c) If a new dog arrives at the shelter (i.e., given that it is a dog), the probability that it would be adopted within the first 100 days is 70.3%.

Does the Gender of the Pet Matter? To explore this question, we group the
pets by gender (female, male, xplePets). The results are shown below.

dt1 %>%
count(Adopted, Gender) %>%
mutate(freq = n / sum(n)) %>%
group_by(Gender) %>%
mutate(condProb = freq/sum(freq)) %>%
arrange(Gender)

# A tibble: 6 x 5
# Groups:   Gender [3]
  Adopted     Gender       n   freq condProb
  <chr>       <chr>    <int>  <dbl>    <dbl>
1 Adopted     Female    5152 0.344     0.708
2 Not adopted Female    2125 0.142     0.292
3 Adopted     Male      4130 0.275     0.746
4 Not adopted Male      1406 0.0938    0.254
5 Adopted     xplePets  1514 0.101     0.694
6 Not adopted xplePets   666 0.0444    0.306


Work with your crew to interpret these results: Provide one to two
examples of interpreting count (n), frequency (freq), and conditional probability
(condProb). Make sure that your interpretation of conditional probabilities states
the grouping (i.e., what is given).

Does the Type and Gender of the Pet Matter? To explore this question, we group the pets by both type (dog vs cat) and gender (Female, Male, xplePets). The results are shown below.

dt1 %>%
  count(Adopted, Type, Gender) %>%
  mutate(freq = n / sum(n)) %>%
  group_by(Type, Gender) %>%
  mutate(condProb = freq/sum(freq)) %>%
  arrange(Type, Gender)

# A tibble: 12 x 6
# Groups:   Type, Gender [6]
   Adopted     Type  Gender       n   freq condProb
   <chr>       <chr> <chr>    <int>  <dbl>    <dbl>
 1 Adopted     cat   Female    2213 0.148     0.732
 2 Not adopted cat   Female     812 0.0542    0.268
 3 Adopted     cat   Male      1929 0.129     0.762
 4 Not adopted cat   Male       602 0.0402    0.238
 5 Adopted     cat   xplePets   936 0.0624    0.717
 6 Not adopted cat   xplePets   369 0.0246    0.283
 7 Adopted     dog   Female    2939 0.196     0.691
 8 Not adopted dog   Female    1313 0.0876    0.309
 9 Adopted     dog   Male      2201 0.147     0.732
10 Not adopted dog   Male       804 0.0536    0.268
11 Adopted     dog   xplePets   578 0.0386    0.661
12 Not adopted dog   xplePets   297 0.0198    0.339

Work with your crew to interpret these results: Provide one to two
examples of interpreting count (n), frequency (freq), and conditional probability
(condProb). Make sure that your interpretation of conditional probabilities states
the grouping (i.e., what is given).


9.3.5 Prepare for Seminar
Complete the following steps before you arrive at the seminar.

1. If you have not done this yet, create a folder/directory on your computer and name it c09.

2. Download/save the data set petPivot.csv in the c09 folder.

3. Start RStudio and clear the source area of any existing files, the console area of any existing R script and results, and clear any existing data from the environment.

4. Start a new R file, name it c09S.R and save it in the c09 folder.

5. Set the working directory in R to 'source file location'.

6. Load the tidyverse library.

7. Replicate the R script from Lecture 1.

9.4 Seminar
The primary objective of the seminar is to continue working on the R script from
Lecture 1 and add a new target variable.

9.4.1 Extend the Existing Model
1. Does the Type and Health of the pet matter? Generate conditional probabilities.

2. Create a new binary variable (pureBred) that takes the value pureBred if the pet is a pure breed and the value mixed if it is a mixed breed (see the sketch after this list).

   • Hint: … if_else(BreedName1 == "Mixed Breed", "mixed", "pureBred"))

3. Does the Type of pet and whether the pet is pureBred or not matter when it comes to adoption? Generate conditional probabilities.

4. Explore another combination of two variables that you think may generate good predictions. See if you can come up with a combination that has a probability of adoption above 82%.
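For question 2, a minimal sketch of how the hint could be completed (assuming the dt1 data set from Lecture 1):

dt1 <- dt1 %>%
  mutate(pureBred =
           if_else(BreedName1 == "Mixed Breed", "mixed", "pureBred"))

The same count()/group_by()/mutate() pattern from Lecture 1 can then generate the conditional probabilities for questions 1, 3, and 4.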


9.4.2 New Target Variable & Model: Monthly Adoption
The PetPivot manager would be interested in creating a new target variable that shows the probability a pet would be adopted in the first, second, or third month, or not adopted after three months. Name the new variable monthlyAdopt.

1. Create monthlyAdopt as follows: The variable takes the value of 1 if the pet was adopted in the first month, the value of 3 if it was adopted in the second or third month, and the value of 4 if it was not adopted after three months.

   • Hint: When we have to choose among more than two or three values, instead of using if_else() we can use case_when(). The function case_when() does not limit how many cases one can use. The following R script shows two different ways of implementing case_when() to create the new variable.

dt1 <- dt1 %>%
  mutate(monthlyAdopt = case_when(AdoptionSpeed < 3 ~ 1,
                                  AdoptionSpeed == 3 ~ 3,
                                  AdoptionSpeed == 4 ~ 4))

# OR

dt1 <- dt1 %>%
  mutate(monthlyAdopt = case_when(AdoptionSpeed < 3 ~ 1,
                                  AdoptionSpeed == 3 ~ 3,
                                  TRUE ~ 4))

2. Create the frequency table showing the percentage of pets adopted based on the variable monthlyAdopt.

3. Does the Type and Health matter for monthlyAdopt? Create conditional probabilities.

4. Does the Type and pureBred matter for monthlyAdopt? Generate conditional probabilities.

9.4.3 New Target Variable & Model: First Week Adoption
The PetPivot manager would be interested in finding what factors contribute to the likelihood that a pet would be adopted within the first week. Create a new variable that captures pets adopted within the first week and name it FWAdopt.

1. Create FWAdopt as follows: The variable takes the value of Yes if the pet was adopted in the first week and the value of No for all other cases.

2. Create the frequency table showing the percentage of pets adopted in the first week.

3. Does the Type and Health matter for first-week adoptions? Create conditional probabilities.

4. Does the Type and pureBred matter for first-week adoptions? Generate conditional probabilities.

9.5 Lecture 2

9.5.1 PetPivot Mini-Case
PetPivot management has provided a link to the basic forecast model that it will use to plan for the two batches of dogs from the hurricane (Link to PetPivot Hurricane Pet Forecast Model: https://docs.google.com/spreadsheets/d/1-yA_v76euI25bSLFXEoBdo-_v5LITx3KCWV-1VrCc04/). To complete the forecast, management needs estimates of the likelihood that dogs in each of the first and second batches will be adopted: (a) during the first month; and (b) during months 2 and 3. Experience has shown that the rate of adoption during months 2 and 3 is similar during each month, so the model allocates half of the months 2 and 3 rate to month 2 and half to month 3. PetPivot now has additional information about the dogs found after the hurricane.

1. The 1,000 dogs in the first batch arriving on January 1st will be those most at risk: dogs with some health problems, and/or older dogs.

2. The dogs available for a second batch will be similar to the dogs in the training set in petPivot.csv.

First Batch: Estimate of Dogs Adopted

1. Create a new training data set (i.e., a subset of the existing training set) that resembles the composition of the dogs expected to arrive in the first batch. In the new data set the representation of older dogs and dogs with minor or major health issues should be higher than in the training set used for Lecture 1 and the Seminar. Name the new training set dt4L2_4B1. The process for creating the new training set is shown in the following R script.

dt4L2_4B1a <- dt0 %>%
  filter(dataSet=="train", Type=="dog", Health!="Healthy")

dt4L2_4B1a %>% nrow()

dt4L2_4B1b <- dt0 %>%
  filter(dataSet=="train", Type=="dog", Age>36, Quantity==1,
         Health=="Healthy")

dt4L2_4B1b %>% nrow()

dt4L2_4B1 <- bind_rows(dt4L2_4B1a, dt4L2_4B1b)
dt4L2_4B1 %>% nrow()

First, we create a data set (dt4L2_4B1a) that selects all dogs in the original training set that are NOT healthy (Health != "Healthy").2 This subset has 287 observations/rows.
Second, we create a data set (dt4L2_4B1b) that selects all dogs in the original training set that are older (Age > 36 months), are not in groups (Quantity == 1), and are healthy (Health == "Healthy"). This new set has 647 observations/rows. The choice of 36 months was made through trial and error; the aim was to generate a data set that, when combined with the one above, comes as close as possible to 1,000 observations.
Third, we combine these two sets (bind_rows()) to create a training set that has 934 observations and comes as close as possible to the composition of the first batch of rescue dogs that the shelter is hoping to place in January.

2. Use the new data set (dt4L2_4B1) to estimate the likelihood (the percentage of dogs with the given attributes in the first batch) that a dog would be adopted in the first 30 days, and in months 2 & 3 (see the sketch after this list).

3. Use your above estimates to populate the entries "Batch 1: percent adopted during period" (cells B13 and C13 respectively) in the Link to PetPivot Hurricane Pet Forecast Model.

4. How many pets from the first batch would be adopted in January, February,
and March?

5. How many new pets can the shelter accept in February?
2 Notice that in R, the expression != stands for "not equal".
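For item 2, one possible approach (a sketch) is to recreate monthlyAdopt on the new training set, exactly as in the Seminar, and tabulate the frequencies:

dt4L2_4B1 %>%
  mutate(monthlyAdopt = case_when(AdoptionSpeed < 3 ~ 1,
                                  AdoptionSpeed == 3 ~ 3,
                                  TRUE ~ 4)) %>%
  count(monthlyAdopt) %>%
  mutate(freq = n / sum(n))

The freq value for monthlyAdopt == 1 estimates the first-30-days adoption rate, and the value for monthlyAdopt == 3 estimates the months 2 & 3 rate.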


Second Batch: Estimate of Dogs Adopted

Since the second batch is likely to have the same composition as the original training
set, we can use the training portion of the petPivot.csv data to generate predictions
for the second batch.

1. Estimate the likelihood (the percentage of dogs with the given attributes in the second batch) that a dog would be adopted in the first 30 days, and in months 2 & 3.

2. Use your above estimates to populate the entries "Batch 2: percent adopted during period" (cells C18 and D18 respectively) in the Link to PetPivot Hurricane Pet Forecast Model.

3. How many pets from the second batch would be adopted in February and in
March?

9.6 Key Lessons from this Chapter
1. Conditional probabilities can help identify the subset of categorical variables that are relevant in predicting a certain outcome: for example, the outcomes in PetPivot (adopted or not adopted) are similar to The Toy Store predicting whether a certain product would be sold or not sold.

2. Predicting how many products will be sold and when they will be sold (e.g., 1
week, 1 month, 2 months, or later) is an important step in inventory manage-
ment for retailers and manufacturers.

3. As you’ll learn in management accounting, great financial professionals use
data – conditional probabilities – to inform estimates or assumptions used in
planning for inventory management.

9.7 Preview of Next Chapter: OK Cupid
Next week’s analysis and discussion is based on a data set from OK Cupid. The
data set contains anonymized information from thousands of users that posted their
information on OK Cupid’s web site in June 2012. The data set includes typical user
information, lifestyle variables, as well as text responses to 10 essay questions.

While the main objective of OK Cupid is to match users, the idea has numerous other applications. For example, think of matching venture capitalists with startups,


matching insurance companies with people who want to buy insurance, or matching
banks with bank loan applicants.

In addition to the above, working with the OkCupid data opens the door to a discussion of the ethical implications of working with big data and data analytics. Read the article by Zimmer (2016). What is your position on the issue raised in the article? Does public equal consent?


Chapter 10

OkCupid Mini-Case

10.1 Learning Objectives
Our main focus for this week is on working with information provided by cus-
tomers/users. More specifically, we will work on anonymized data collected by an
online dating service: OkCupid.

By the end of this week, students should be able to:

1. understand the ethical implications of using data and data analytics technology.
2. learn how to work with missing (incomplete) data.
3. leverage existing data to draw inferences regarding the possible values of observations with missing or incomplete information.
4. understand the logic of how an algorithm like the one used by OkCupid works, and its limitations.

10.2 Students: Advance Preparation
We will continue working with the tidyverse package. If you have not done this yet, please watch the videos for chapters 7 and 8 (see studentPreparation: https://docs.google.com/spreadsheets/d/1Ekt0bE4LqABehMM9EyPdhshBaZlAOWN4ZfJ0QYDyAEA/).

To understand the logic of how an algorithm like the one used by OkCupid works, and its limitations, watch the following two TED talk videos:

1. Christian Rudder (co-founder of OkCupid), 'Inside OkCupid – The math of online dating.' Available from the following URL: https://www.ted.com/talks/christian_rudder_inside_okcupid_the_math_of_online_dating


2. Amy Webb, 'How I hacked online dating.' https://www.ted.com/talks/amy_webb_how_i_hacked_online_dating?language=en#t-1031384

10.3 Lecture 1
1. In Lecture 1, we will discuss the ethical issues around data analytics and learn how to work with incomplete or messy data.

2. In the Seminar, you will replicate the R script from Lecture 1 and work on variations of working with incomplete data.

3. In Lecture 2, we will review the answers to the questions from the seminar, and understand the logic of an algorithm like the one used by OkCupid.

10.3.1 Ethical Issues Around Data Analytics
Data is Valuable In the last two decades, technological changes have enabled
firms to leverage large volumes of data (i.e., big data) and advanced data analytics
to deliver personalized services that most of us now take for granted. For example, a
web search through Google will produce results which are more likely to be relevant in
terms of the attributes, location, and even preferences of the person who is searching.
Similarly, Facebook has enabled people to connect with friends and relatives all over
the world, and even be re-united with long lost friends.

The paradoxical aspect of these benefits of data analytics is that, from a purely monetary standpoint, these services/benefits are free. However, as Paul Samuelson – a Nobel laureate economist – once said, "There's no such thing as a free lunch." This means that we (consumers of these services, customers of these companies) provide something in exchange. Our payment is in the form of the personal information that we provide to these companies.

Think for a second about the web searches that you have done in the last few days and what these searches reveal about you. It is very likely that some of these searches relate to very personal matters or very sensitive business information. The GPS in your cell phone helps you visit places – without getting lost – but how would you feel if someone could re-create a map of all the places you visited? Again, some of these places may be related to very sensitive personal or professional matters.

In the comic series Dilbert – Dogbert Consults (2010), Dogbert proposes that:1

1The cartoon is available from the following url: https://dilbert.com/strip/2010-10-13.



Customer data is an asset you can sell. It is totally ethical because our
customers would do the same to us if they could. In phase one, we will
dehumanize the enemy by calling them ‘data’.

Why Ethical Considerations Matter Chessell (2014) explains that, historically, laws and regulations provided organizations with guidance around privacy and the use of personal data. However, recent technological developments have widened the gap between:

• what is technologically possible, and

• what is legally allowed.

Chessell argues that this gap provides opportunities as well as risks. For example, it has been well documented that Facebook and OkCupid ran experiments without notifying their users.2 In both cases, this generated a significant backlash among users and a loss of credibility for both firms in terms of their ethical standards. Zimmer (2016) raises similar concerns about the group of researchers that made the OkCupid data available for data mining without anonymizing it.

Given the risks associated with what stakeholders perceive to be the inappro-
priate use of data, it is not sufficient to simply ask: what is legal? People must
stop and consider what is appropriate or ethical use of data. Chessell (2014) raises
the question: “As an organization looks towards applying analytics and big data to
enhance the way they operate, how do they know that their use of this technology is
ethical?” One could even ask what is an acceptable code of data ethics?

Data Ethics is a Concept Under Development According to Wikipedia (2019), data ethics refers to systemising, defending, and recommending concepts of right and wrong conduct in relation to data, in particular personal data. Data ethics is concerned with the following principles:

1. Ownership – Individuals own their own data.

2. Transaction Transparency – If an individual's personal data is used, they should have transparent access to the algorithm design used to generate aggregate data sets.

2See Grandoni (2014): ‘You May Have Been A Lab Rat In A Huge Facebook Experiment’, and
Selterman (2014): ‘The Ethics of OkCupid’s Dating Experiment’.


3. Consent – If an individual or legal entity would like to use personal data, one needs the informed and explicitly expressed consent of the data's owner regarding what personal data moves to whom, when, and for what purpose.

4. Privacy – If data transactions occur, all reasonable effort needs to be made to preserve privacy.

5. Currency – Individuals should be aware of financial transactions resulting from
the use of their personal data and the scale of these transactions.

6. Openness – Aggregate data sets should be freely available.

The Wikipedia article provides just one illustration of principles to help dis-
tinguish right from wrong. Data ethics is a new and quickly evolving space with
many organizations, including several consulting companies, rushing to provide guid-
ance. The one-page overview titled “Ethics of Big Data and Analytics” by Chessell
(2014) highlights the distinction between what is legal and what is ethical. The
ethical awareness framework described by Chessell (2014) reinforces principles in the
Wikipedia definition like Ownership and Consent, then adds additional principles
and questions including:

• Substantiated – Are the sources of data used appropriate, authoritative, com-
plete and timely for the application?

• Fair – How equitable are the results of the application to all parties?

• Accountable – How are mistakes and unintended consequences detected and
repaired?

Implications for Use of Data and Analytics Technology Since different or-
ganizations and stakeholders may have differing principles and opinions on what is
“right”, we must consult widely and review available policies when we face ethical
questions around what is acceptable.

10.3.2 OkCupid Data
The data set contains anonymized information from thousands of users that posted
their information on OkCupid’s web site in June 2012. The data set includes typical
user information, lifestyle variables, as well as text responses to 10 essay questions.
The data set for OkCupid is based on the cleaned and anonymized version provided
by user rudeboybert in GitHub.


10.3.3 Data Understanding (1)
Perform the standard analysis for data understanding. We start by loading the tidyverse library.

library(tidyverse)

We use the function options(scipen=) to avoid scientific notation in the reporting of results. The exact number that you set scipen equal to does not matter much, as long as it is large.

options(scipen = 99)
dt0 <- read_csv("c10.csv")
glimpse(dt0)

Observations: 59,946
Variables: 31
$ age         <dbl> 22, 35, 38, 23, 29, 29, 32, 31, 24, 37, …
$ body_type   <chr> "a little extra", "average", "thin", "th…
$ diet        <chr> "strictly anything", "mostly other", "an…
$ drinks      <chr> "socially", "often", "socially", "social…
$ drugs       <chr> "never", "sometimes", NA, NA, "never", N…
$ education   <chr> "high school", "space cadet", "masters p…
$ essay0      <chr> "about me:\n\ni would love t…
$ essay1      <chr> "currently working as an international a…
$ essay2      <chr> "making people laugh.\nranting abo…
$ essay3      <chr> "the way i look. i am a six foot half as…
$ essay4      <chr> "books:\nabsurdistan, the republic…
$ essay5      <chr> "food.\nwater.\ncell phone.…
$ essay6      <chr> "duality and humorous things", NA, NA, "…
$ essay7      <chr> "trying to find someone to hang out with…
$ essay8      <chr> "i am new to california and looking for …
$ essay9      <chr> "you want to be swept off your feet!…
$ ethnicity   <chr> "asian, white", "white", NA, "white", "a…
$ height      <dbl> 75, 70, 68, 71, 66, 67, 65, 65, 67, 65, …
$ income      <dbl> -1, 80000, -1, 20000, -1, -1, -1, -1, -1…
$ job         <chr> "transportation", "hospitality / travel"…
$ last_online <chr> "2012-06-28-20-30", "2012-06-29-21-41", …
$ location    <chr> "south san francisco, california", "oakl…
$ offspring   <chr> "doesn't have kids, but might want…
$ orientation <chr> "straight", "straight", "straight", "str…
$ pets        <chr> "likes dogs and likes cats", "likes dogs…
$ religion    <chr> "agnosticism and very serious about it",…
$ sex         <chr> "m", "m", "m", "m", "m", "m", "f", "f", …
$ sign        <chr> "gemini", "cancer", "pisces but it doesn…
$ smokes      <chr> "sometimes", "no", "no", "no", "no", "no…
$ speaks      <chr> "english", "english (fluently), spanish …
$ status      <chr> "single", "single", "available", "single…

dt0 %>% is.na() %>% colSums()

age body_type diet drinks drugs
0 5296 24395 2985 14080

education essay0 essay1 essay2 essay3
6628 5485 7571 9638 11476

essay4 essay5 essay6 essay7 essay8
10537 10847 13771 12450 19214

essay9 ethnicity height income job
12602 5680 3 0 8198

last_online location offspring orientation pets
0 0 35561 0 19921

religion sex sign smokes speaks
20226 0 11056 5512 50

status
0
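A small variation (a sketch) that makes the variables with the most missing values easier to spot is to sort the counts:

dt0 %>% is.na() %>% colSums() %>% sort(decreasing = TRUE)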

Most of the variables in the data set are self-explanatory.

• age
• body_type – description provided by the user regarding body type
• diet – description provided by the user regarding eating
• drinks – description provided by the user regarding drinking
• drugs – description provided by the user regarding drug use
• education – level of education
• essay0–essay9 – essays prepared by users
• ethnicity – ethnic background


• height – height in inches
• income
• job – job title
• last_online – last time the user was online
• location – geographic area/address
• offspring – description provided by the user regarding children
• orientation – sexual orientation
• pets – description provided by the user regarding pets
• religion – description provided by the user regarding religion
• sex – gender
• sign – zodiac sign
• smokes – description provided by the user regarding smoking
• speaks – language(s)
• status – description provided by the user regarding their status (e.g., single)

The data set has profiles for close to 60,000 users and 31 variables. The majority of the variables are text based. The analysis of missing values shows that there are a lot of variables for which users have not provided any information. Interestingly, it seems that practically all users have reported information for the three primary numeric variables: age, height, and income.

Summary Stats: Numeric Variables The results for age and height seem to
be reasonable with the possibility of some outliers that need to be further explored.
However, the variable that deserves closer attention is income.

For a lot of people, talking about or disclosing their income is taboo. The results show that this is the case with OkCupid users: a large portion of users have reported an income of $-1. These are people who may not want to report/disclose their income. It is possible that they think income should not be a factor when one is looking for a soul mate.

dt0 %>%
select(age, height, income) %>%
summary()


      age            height          income
 Min.   : 18.00   Min.   : 1.0   Min.   :     -1
 1st Qu.: 26.00   1st Qu.:66.0   1st Qu.:     -1
 Median : 30.00   Median :68.0   Median :     -1
 Mean   : 32.34   Mean   :68.3   Mean   :  20033
 3rd Qu.: 37.00   3rd Qu.:71.0   3rd Qu.:     -1
 Max.   :110.00   Max.   :95.0   Max.   :1000000
                  NA's   :3

Work with your crew … to find how many users have reported an income
of $-1 in their profile.

1. Create a new binary variable reportIncome that takes the values: Yes or No.
Assign users with income of $-1 as No.

2. Find how many have not reported their income (count()).
3. Calculate frequency count as a percentage of the entire population.

10.3.4 Data Understanding (2)
Given the results for income, it makes sense to consider and analyze two separate
paths of further exploration. First, focus on those who have reported their income.
Second, see what we can infer about users who have not reported their income.

Let’s start with the group of users who reported their income. We create a new
data set (dt1 ) and review summary statistics for income.

dt1 <- dt0 %>% filter(income!=-1)

dt1 %>%
  select(income) %>%
  summary()

income
Min. : 20000
1st Qu.: 20000
Median : 50000
Mean : 104395
3rd Qu.: 100000
Max. :1000000


Based on the above, we can see that the majority of users reported an income
between $20,000 and $100,000. To better visualize these results, we will create a box
plot.

NB: Box plot – A graph used to show the distribution of a variable by using quartiles (see Fig. 10.1).* The typical box plot has a box, which is defined by the first (Q1) and third (Q3) quartiles. Its width is equal to the inter-quartile range (IQR). The position of the second quartile (median) is shown inside the box. The positions of the upper whisker (Q3 + 1.5 * IQR) and the lower whisker (Q1 - 1.5 * IQR) are defined by lines extending above Q3 and below Q1, respectively. Dots above the upper whisker or below the lower whisker indicate the existence of outliers.

* Fig. 10.1 is from Wikipedia (Jhguch at en.wikipedia, CC BY-SA 2.5, https://creativecommons.org/licenses/by-sa/2.5).

Figure 10.1: Box Plot vs Normal Distribution

The function boxplot, shown below, has three arguments:

1. the name of the data set and variable, i.e., dt1$income,


2. the argument horizontal = TRUE to indicate that the graph should be shown in a horizontal, rather than the default vertical, position, and

3. the title of the graph, i.e., main="Income".

boxplot(dt1$income, horizontal = TRUE,
main=”Income”)

Figure 10.2: Box Plot for Income

The box plot (Figure 10.2) shows that there are outliers on the upper end of the distribution. Please be aware that the graph is trying to capture the general picture of the distribution; it does not show the actual number of outliers. We can find the outliers by computing the upper whisker of the distribution.

Work with your crew Use mental math to estimate the upper whisker. See
how your answer compares to the actual value shown below.


quantile(dt1$income, .75)+1.5*IQR(dt1$income)

75%
220000

Therefore, the upper whisker is $220,000, and any user with an income above $220,000 is an outlier. To find how many users are outliers, we create a new data set that is limited to observations above the upper whisker, and use the function count.

dt1a <- dt1 %>% filter(income>220000)
dt1a %>% select(income) %>% count()

# A tibble: 1 x 1
      n
  <int>
1   718

Given the relatively high number of observations (over 1% of the entire population of about 60,000 profiles), it makes sense to generate summary statistics and a box plot to better understand the income distribution within the set of outliers, i.e., users with very high income.

The summary statistics show that half of these users reported an income of 1 million dollars, which suggests that users may have had to choose their income from a set of brackets.

dt1a %>%
select(income) %>% summary()

income
Min. : 250000
1st Qu.: 500000
Median :1000000
Mean : 810933
3rd Qu.:1000000
Max. :1000000
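One way to check the bracket hypothesis (a sketch): list the distinct income values among the outliers. Free-form entries would produce many distinct values; brackets would produce only a handful.

dt1a %>% count(income)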

Work with your crew: Can you explain why/how the box plot (Fig. 10.3) visually supports the argument that half of the high-income users have reported an income equal to $1 million?

boxplot(dt1a$income, horizontal = TRUE,
main=”Income – Top Percentile”)

Figure 10.3: Box Plot for Very High Income Users

10.3.5 Data Preparation and Understanding
In the process of trying to understand the variable income, we have realized that behind this numeric variable there are different categories of users/profiles. For example, there are people who do not want to disclose their income, people who have an income above or below the median, people whose income is around the median, as well as users who, according to their reported income, belong to the top one percent of all users.

With the following R script, we will create/add two new variables to our data set (dt0). First, a binary variable (reportIncome) that shows whether a user has included income as part of their profile. Second, a categorical variable (incomeGroup) that tries to capture all the different income groups/categories.


dt0 <- dt0 %>%
  mutate(
    reportIncome =
      ifelse(income==-1, "No", "Yes"),
    incomeGroup =
      case_when(
        income==-1    ~ "doNotReport",
        income<=20000 ~ "belowMedian",
        income<100000 ~ "aroundMedian",
        income<220000 ~ "aboveMedian",
        TRUE          ~ "top1Pct"))

Using these two new variables, we can calculate counts as well as general probabilities for each combination of the variables reportIncome and incomeGroup. In addition, we calculate conditional probabilities within each reportIncome group and arrange the results based on reportIncome. The R script and results are shown below.

dt0 %>% count(reportIncome, incomeGroup) %>%
mutate(freq=n/sum(n)) %>%
group_by(reportIncome) %>%
mutate(condProb=freq/sum(freq)) %>%
arrange(reportIncome)

# A tibble: 5 x 5
# Groups:   reportIncome [2]
  reportIncome incomeGroup      n   freq condProb
  <chr>        <chr>        <int>  <dbl>    <dbl>
1 No           doNotReport  48442 0.808    1
2 Yes          aboveMedian   2252 0.0376   0.196
3 Yes          aroundMedian  5582 0.0931   0.485
4 Yes          belowMedian   2952 0.0492   0.257
5 Yes          top1Pct        718 0.0120   0.0624

Work with your crew:


1. Based on the above results, we can see that 80.8% of the users did not include income in their profiles. For those who have reported their income, the conditional probabilities add to 100%. Explain why. Hint: Which part of the R script determines this? Which part of the R script indicates that we want to calculate a conditional, rather than a general, probability?

2. Looking at the conditional probabilities (i.e., within the group that has reported their income), we see that approximately 25% of users have an income below Q1, around 50% reported an income between Q1 and Q3, but only 19.6% have an income above Q3. Does this mean that there is something wrong with our analysis?

10.3.6 Seminar Preparation – Playing Detective
The objective during the seminar will be to first replicate the content of lecture one,
and then see if we can come up with other attributes (from a user’s profile) that
would let us estimate the likely income group for a person who did not report their
income.

Complete the following steps before you arrive at the seminar.

1. If you have not done this yet, create a folder/directory on your computer and name it c10.

2. Download/save the data set c10.csv in the c10 folder.

3. Start RStudio and clear the source area of any existing files, the console area of any existing R script and/or results, and clear any existing data from the environment.

4. Start a new R file, name it c10.R and save it in the c10 folder.

5. Set the working directory in R to 'source file location'.

6. Load the tidyverse library.

7. Replicate the R script from Lecture 1.

10.4 Seminar

10.4.1 Predict Income Group from Job
It is very likely that a person's job is associated with their income group. Therefore, if we know a person's job, we can guess that person's income category. To do this we need to work in two stages.


1. First, see if the distribution of jobs between users who reported their income
is similar to the distribution of users who did not report their income.

2. Second, if there is a similar distribution, generate conditional probabilities showing the income group of a person given that person's job.

Use the following R script to perform the first step. Lines 1–6 of this R script are similar to what we have seen in the pet adoption case. Lines 7 and 8 contain new R functions. In line 7, we use the function spread to convert the results into the form of a pivot table. The function spread has two arguments: the first one (key = reportIncome) specifies which variable will be shown in columns instead of rows; the second one (value = condProb) specifies the values that we would like to see inside the pivot table. With line 8, we simply view the results as a pivot table in RStudio.

dt0 %>%
count(reportIncome, job) %>%
mutate(freq=n/sum(n)) %>%
group_by(reportIncome) %>%
mutate(condProb=freq/sum(freq)) %>%
select(reportIncome, job, condProb) %>%
spread(key = reportIncome, value = condProb) %>%
view()
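Note: in recent versions of the tidyverse, spread() has been superseded by pivot_wider(). A sketch of the equivalent line:

pivot_wider(names_from = reportIncome, values_from = condProb)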

Work with your crew:

1. Replicate the above R script.

2. How does the distribution of jobs compare between those who reported their income and those who did not?

3. If you feel that the distribution is comparable, replicate and run the R script – shown below – to calculate the conditional probability of income category given job. The script filters the results to users who have reported their income and shows the results as a pivot table.

4. What is the strongest prediction (highest probability) that you can make for someone's income category based on these results? In other words, given that a person's job description is (select one from the list below), this person has a … probability of being in the … income category.


(a) artistic / musical / writer
(b) construction / craftsmanship
(c) military
(d) unemployed

dt0 %>% filter(reportIncome==”Yes”) %>%
count(incomeGroup, job) %>%
mutate(freq=n/sum(n)) %>%
group_by(job) %>%
mutate(condProb=freq/sum(freq)) %>%
select(incomeGroup, job, condProb) %>%
spread(key = incomeGroup, value = condProb) %>%
view()

10.4.2 Predict Income Group from Education

Work with your crew: Repeat the above analysis to see if you can make
predictions about income group based on education. You may want to pay attention
to space cadets.

1. How does the distribution of education compare between those who reported their income and those who did not?

2. If you feel that the distribution is comparable, calculate conditional probability
of income category given education.

3. What is the strongest prediction (highest probability) that you can make for someone's income category based on these results? In other words, given that a person's education is (select one from the list below), this person has a … probability of being in the … income category.

(a) high school
(b) college/university
(c) masters
(d) medical school,
(e) space cadet
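A sketch of the adapted script for step 1 – the same pattern as the job script, with education swapped in:

dt0 %>%
  count(reportIncome, education) %>%
  mutate(freq=n/sum(n)) %>%
  group_by(reportIncome) %>%
  mutate(condProb=freq/sum(freq)) %>%
  select(reportIncome, education, condProb) %>%
  spread(key = reportIncome, value = condProb) %>%
  view()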


NB: Space Cadet – About 3% of the users have reported their education as 'space cadet'. According to the Merriam-Webster Dictionary, a space cadet is a flaky, lightheaded, or forgetful person.

Space cadet has been used derogatorily since the late 1970s, but long before then it referred to the rank that the character Matt Dodson hoped to achieve in Robert Heinlein's 1948 novel Space Cadet. Other writers of futuristic fiction followed Heinlein's lead, using the word in reference to young astronauts. From there the meaning broadened to cover any space travel enthusiast. Today the word is occasionally used as a slang word for a pilot who shows off, but it most commonly refers to those of us who may seem to have our minds in outer space while our bodies remain earthbound. (https://www.merriam-webster.com/dictionary/space%20cadet)

10.4.3 Predict Income Group from Other Categories

Work with your crew: Repeat the above analysis to see if you can make predictions about income group based on any other categorical variable that you think is appropriate. Does a person's astrological sign, sexual orientation, or status relate to their income group? Just have fun exploring these and other possibilities.

10.5 Lecture 2

10.5.1 Seminar Debriefing
1. Revisit the logic of the proposed detective work.

2. Why is it important to look for similarity of distributions between those who reported their income and those who did not, for each of the proposed categories?

3. Interpret conditional probabilities.

4. Will our predictions become stronger or weaker if we were to reduce the number of income groups?


10.5.2 OkCupid and Dating Algorithms
According to the classic study by Byrne and Nelson (1965), people are attracted to other people who are similar to them. A quick review of the description/focus of dozens of online dating companies3 demonstrates that a huge industry has been built on the premise that people are attracted to people with similar attributes, including among others education, race, age, religion, and socio-economic status. The following quote is from a blog by Dr. Sadie Leder-Elder (August 13, 2015):4

… people are attracted to similar others. … Despite the overabundance of
scientifically-validated literature on the topic, I think no one has described
the phenomenon better than Jerry Seinfeld when he said, “I know what
I’ve been looking for all these years. Myself. I’ve been waiting for me to
come along. And now I’ve swept myself off my feet!”

OkCupid – like many other dating sites – uses its own algorithm to match users. In a TED talk (https://www.ted.com/talks/christian_rudder_inside_okcupid_the_math_of_online_dating), Christian Rudder – one of the founders of OkCupid – provides a basic introduction to algorithms and explains the math behind the algorithm used by OkCupid.

Figure 10.4: OkCupid – Web Page (2019)

3 See https://en.wikipedia.org/wiki/Comparison_of_online_dating_services
4 The blog contains a link to a short video from the Seinfeld episode. See https://www.luvze.com/seinfeld-and-similarity-relevant-relationship-advice-from-th/


However, there is no such thing as a good one-size-fits-all algorithm for dating, and Amy Webb explains why in her TED talk: https://www.ted.com/talks/amy_webb_how_i_hacked_online_dating?language=en#t-1031384
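Before tackling the crew questions below, here is a rough sketch of the type of calculation Rudder describes in his talk. The structure (importance-weighted satisfaction scores combined by a geometric mean) follows the talk; the example numbers are assumptions for illustration only, not OkCupid's actual data.

# Sketch of an OkCupid-style match score (based on Rudder's TED talk).
# Each user answers questions and rates how important each question is;
# satisfaction = importance points earned / total points possible.
match_score <- function(sat_a_to_b, sat_b_to_a) {
  # Geometric mean: both sides must be reasonably satisfied
  # for the overall match score to be high.
  sqrt(sat_a_to_b * sat_b_to_a)
}

# Example (assumed numbers): A earns 90% of B's importance points,
# while B earns only 50% of A's.
match_score(0.9, 0.5)   # approximately 0.67, i.e., a 67% match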

Work with your crew:

1. Can you explain to someone how the OkCupid algorithm works?

2. Randomly select one of the users from the data set. For this person, try to understand what might be the most important factors for selecting a match.

3. Use these criteria to generate a list of potential matches.

10.6 Key Lessons from this Chapter
1. In assessing the use of data, you will get different answers to the following questions: (a) what is technologically possible? (b) what is legally allowed? (c) what is ethical?

2. Data ethics – assessing right and wrong conduct for the use of data – relies on principles that are a work in progress and vary across organizations and stakeholders.

3. When working with data sets, you will encounter situations where there is missing (incomplete) data.

4. You may leverage existing data to draw inferences regarding the possible values of observations with missing or incomplete information.

5. While an algorithm may appear logical, it may also have limitations.

10.7 Preview of Next Chapter: Dashboards
In Chapter 8, we worked with a Liquor Store Limited data set that was much too
large for a spreadsheet application. Using R, we analyzed that data set to extract
information for the company’s 77 stores; however, R does not produce nicely format-
ted pivot tables, which means the output can be difficult to interpret. Next week, we
will learn how to build an interactive dashboard using lessons on data visualization
from earlier in the term and the pivot tables generated in Chapter 8. The interactive dashboard will allow the Liquor Store Company's Product Management Team Leader to answer a series of questions and gain insight into the performance of any of the company's 77 stores.


Bibliography

Bogost, Ian (Aug. 23, 2018). "Welcome to the Age of Privacy Nihilism". In: The Atlantic. URL: https://www.theatlantic.com/technology/archive/2018/08/the-age-of-privacy-nihilism-is-here/568198/ (visited on 08/26/2018).

Byrne, Donn and Don Nelson (1965). "Attraction as a linear function of proportion of positive reinforcements". In: Journal of Personality and Social Psychology 1.6, pp. 659–663. ISSN: 1939-1315 (Electronic), 0022-3514 (Print). DOI: 10.1037/h0022073.

Chessell, Mandy (2014). Ethics for big data and analytics. IBM, p. 1. URL: https://www.ibmbigdatahub.com/sites/default/files/whitepapers_reports_file/TCG%20Study%20Report%20-%20Ethics%20for%20BD&A.pdf.

Davenport, Thomas H. and Jinho Kim (June 11, 2013). Keeping Up with the Quants: Your Guide to Understanding and Using Analytics. Boston, Massachusetts: Harvard Business Review Press. 240 pp. ISBN: 978-1-4221-8725-8.

Grandoni, Dino (June 29, 2014). "You May Have Been A Lab Rat In A Huge Facebook Experiment". In: HuffPost Canada. URL: https://www.huffingtonpost.com/2014/06/29/facebook-experiment-psychological_n_5540018.html (visited on 11/05/2019).

Parmar, Hema (July 24, 2019). "From Fitbits to Rokus, Hedge Funds Mine Data for Consumer Habits". In: Bloomberg.com. URL: https://www.bloomberg.com/news/articles/2019-07-24/from-fitbits-to-rokus-hedge-funds-mine-data-for-consumer-habits (visited on 09/05/2020).

Selterman, Dylan (Aug. 18, 2014). The Ethics of OKCupid's Dating Experiment. Luvze. URL: https://www.luvze.com/the-ethics-of-okcupids-dating-experiment/ (visited on 11/05/2019).

Wikipedia (Nov. 1, 2019). Big data ethics. In: Wikipedia. Page Version ID: 924021478. URL: https://en.wikipedia.org/w/index.php?title=Big_data_ethics&oldid=924021478 (visited on 11/05/2019).


Zimmer, Michael (May 14, 2016). "OkCupid Study Reveals the Perils of Big-Data Science". In: Wired. ISSN: 1059-1028. URL: https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/ (visited on 11/05/2019).


Alphabetical Index

association rules, 134

box plot, 160
business analytics, 3

conditional probability, 143
cost of goods sold, 69
CRISP-DM, i

data analytics, 3
data point, 8
data preparation

conditional statement, 16
logical formulas, 15
mathematical formulas, 11

data quality, 36
data reliability, 36
data types, 8

binary, 9
categorical, 9
nominal, 9
numerical, 8
ordinal, 9
quantitative, 8

data value, 8
decision trees, 134
descriptive statistics, 22

1st quartile, 97
3rd quartile, 97

average, 29
max, 29
median, 22, 29
min, 29

graphs
aggregate charts, 34
bar/column chart, 31
graph selection guide, 31
histogram, 26
pie chart, 40
story telling, 25
trend line, 24

gross profit, 69, 70

Income Statement, 67
inter-quartile range (IQR), 98, 160
IQR, 160

logistic regression, 134
lower whisker, 98, 160

maximum, 22
measurement error, 37
median, 22, 160
mental math, 99
minimum, 22
missing values, 38

net income, 69


observation, 8
order of operations, 12
outliers, 37, 98
outliers – mental math, 99

pivot table, 48
% of column, 59
% of grand total, 59
% of row, 59
average, 49
group by, 49
max, 58
min, 58
show as, 59
sum, 58
summarize by, 49
two-way, 56

profit margin, 69

quartiles, 160

R functions
AND = & sign, 107
colSums(), 95
count(), 139
filter(), 106
glimpse(), 94, 105
group_by(), 102
head(), 95
ifelse(), 100
IQR(), 99
is.na(), 95
library(), 92
mutate(), 100
names(), 94, 105
OR = | sign (vertical line), 107
quantile(), 98
read_csv(), 93
select(), 97

spread(), 166
summarise(), 102
summarize(), 102
summary(), 96, 105
table(), 106
tail(), 95
ungroup(), 116
weekdays(), 100

R packages
tidyverse, 92

record, 8

sales channel, 47
segment analysis, 45
simple algorithm, 50

inputs, 50
output, 50
process, 50
pseudo-code, 50

spreadsheet functions
AND, 60
average, 30
count, 39
count (numeric values), 30
counta, 39
if statement, 16
IF(AND), 60
IF(OR), 60
max, 30
median, 30
min, 30
OR, 60
quick sum, 28
weekday, 72
sum, 30

target variable, 134
testing set, 134
text analytics, 4


Toy Store Case
Store Sales Analysis, 123
Vendor Analysis, 123

training set, 134

upper whisker, 98, 160

variable, 8
vectorized, 100
