8/27/2020 Assignment #1
Assignment #1
Submit Assignment
Due Friday by 10am Points 100 Submitting a file upload File Types rmd and pdf Available after Aug 21 at 8pm
All manipulation of the data must be done with R. If you do not include an R chunk for each task, you will receive no points for that task.
The Charles Book Club (¡°CBC¡±) was established in December of 1986, on the premise that a book club could differentiate itself through a deep understanding of its customer base and by delivering uniquely tailored offerings. CBC focused on selling specialty books by direct marketing through a variety of channels, including media advertising (TV, magazines, newspapers) and mailing. CBC is strictly a distributor and does not publish any of the books that it sells. In line with its commitment to understanding its customer base, CBC built and maintained a detailed database of its club members. Upon enrollment, readers were required to fill out an insert and mail it to CBC. Through this process, CBC has created an active database of 500,000 readers. CBC acquired most of these customers through advertising in specialty magazines.
CBC sent mailings to its club members each month containing its latest offering.
On the surface, CBC looked like they were very successful, mailing volume was increasing, book selection was diversifying and growing, their customer database was increasing; however, their bottom line profits were falling. The decreasing profits led CBC to revisit their original plan of using database marketing to improve its mailing yields and to stay profitable.
The dataset is presented to you in a format that you will encounter frequently in practice, a Microsoft Excel
spreadsheet: Charles_BookClub_modified.xls . You will need to first clean the data before attempting to do any analytics or visualization.
Gender: 0 = Female and 1 = Male.
M=Monetary value: Total money spent on CBC books.
F=Frequency: Total number of purchases in the chosen period (used as a proxy for frequency.) R=Last purchase (recency of purchase): Months since last purchase.
First purchase: Months since first purchase.
ChildBks: Number of children¡¯s books purchased. YouthBks: Number of youth books purchased. CookBks: Number of cookbooks purchased. DoltYBks: Number of do-it-yourself books purchased. RefBks: Number of reference books purchased. ArtBks: Number of art books purchased.
GeogBks: Number of geography books purchased. ItalCook: Number of Italian cookbooks purchased. ItalHAtlas: Number of Italian atlases purchased. ItalArt: Number of Italian art books purchased.
Florence: A choice variable indicating whether the customer purchased the The Art History of Florence. 1 corresponds to a purchase and 0 corresponds to a nonpurchase.
Your submission will consist of an .Rmd and a .pdf document. Use headings formatted with R markdown syntax for each of the following tasks and questions. Use the R markdown text blocks to comment on your code in lieu of R comments in the {r} chunks. On all your plots be sure to put a meaningful title. Be sure to repeat the task instruction or question and identify the task or question by the numbers in this list:
1. How many observations are in the dataset?
2. List the variables and show their type.
3. How many observations have no books purchased (including Florence)? Delete those observations before proceding to the next step.
4. Recode the gender variable as a factor with two levels: “M” and “F”.
5. Recode the Florence variable as a factor with two levels: “Yes” and “No”.
6. Calculate the mean, standard deviation, skewness and kurtosis for monetary value.
7. Are there any outliers? If yes, how do you know?
8. If you detect the presence of any outliers, change them to missing.
9. Are there any missing values? If yes, count them and report the counts for each variable that has missing values.
10. Impute the missing values using mean substitution. Do this with one of the apply() family of functions (apply, sapply, lapply, etc.)
11. Produce a histogram of the continuous variables and display them all on one page.
12. Produce a bar chart for the counts of each of the ten books purchased.
13. Using dplyr filter() and mutate() verbs, construct 27 categories of RFM where each category is composed of combinations of values of “High”,
“Medium” and “Low” for R, F, and M. To do this you must divide each of these three variables into thirds (boundaries at 33 and 66 percentiles for each). At the end of the process you’ll have three new factors, one for R, one for F, and one for M with three levels each, High, Medium, and
https://smu.instructure.com/courses/78193/assignments/401065?module_item_id=718967
1/4
8/27/2020 Assignment #1
Low. Use these factors to produce three tables of means of M, one for each level of M. Each table will be a 3X3 cross tabulation of frequency
level by recency level.
14. What is the median monetary value per visit by gender?
15. Make a bubble plot of recency on the x-axis, monetary value on the y-axis, where the size of the points are proportional to first purchase and the
points are colored by gender.
Assignment Rubric
https://smu.instructure.com/courses/78193/assignments/401065?module_item_id=718967
2/4
8/27/2020 Assignment #1
Criteria
Ratings
Pts
Submitted ontime
20.0 pts Full Marks
0.0 pts No Marks
20.0 pts
Hadley Wikham Style Guide Compliance
functions are verbs; variables are nouns; indents correct;
curly braces positioned correctly;
variable and function names all lower case; correct use of spaces;
line lengths <= 80
15.0 pts Full Marks
0.0 pts No Marks
15.0 pts
All chunks run without errors
35.0 pts Full Marks
0.0 pts No Marks
35.0 pts
Q1
How many observations are in the dataset?
1.5 pts Full Marks
0.0 pts No Marks
1.5 pts
Q2
List the variables and show their type.
1.0 pts Full Marks
0.0 pts No Marks
1.0 pts
Q3
How many observations have no books purchased (including Florence)? Delete those observations before proceding to the next step.
2.0 pts Full Marks
0.0 pts No Marks
2.0 pts
Q4
Recode the gender variable as a factor with two levels: "M" and "F".
1.5 pts Full Marks
0.0 pts No Marks
1.5 pts
Q5
Recode the Florence variable as a factor with two levels: "Yes" and "No".
1.5 pts Full Marks
0.0 pts No Marks
1.5 pts
Q6
Calculate the mean, standard deviation, skewness and kurtosis for monetary value.
1.5 pts Full Marks
0.0 pts No Marks
1.5 pts
Q7
Are there any outliers? If yes, how do you know?
1.0 pts Full Marks
0.0 pts No Marks
1.0 pts
Q8
If you detect the presence of any outliers, change them to missing.
2.0 pts Full Marks
0.0 pts No Marks
2.0 pts
Q9
Are there any missing values? If yes, count them and report the counts for each variable that has missing values.
1.0 pts Full Marks
0.0 pts No Marks
1.0 pts
Q10
Impute the missing values using mean substitution. Do this with one of the apply() family of functions (apply, sapply, lapply, etc.)
2.0 pts Full Marks
0.0 pts No Marks
2.0 pts
https://smu.instructure.com/courses/78193/assignments/401065?module_item_id=718967
3/4
8/27/2020 Assignment #1
Criteria
Ratings
Pts
Q11
Produce a histogram of the continuous variables and display them all on one page.
2.0 pts Full Marks
0.0 pts No Marks
2.0 pts
Q12
Produce a bar chart for the counts of each of the ten books purchased.
2.0 pts Full Marks
0.0 pts No Marks
2.0 pts
Q13
Using dplyr filter() and mutate() verbs, construct 27 categories of RFM where each category is composed of combinations of values of "High", "Medium" and "Low" for R, F, and M. To do this you must divide each of these three variables into thirds (boundaries at 33 and 66 percentiles for each). At the end of the process you'll have three new factors, one for R, one for F, and one for M with three levels each, High, Medium, and Low. Use these factors to produce three tables of means of M, one for each level of M. Each table will be a 3X3 cross tabulation of frequency level by recency level.
5.0 pts Full Marks
0.0 pts No Marks
5.0 pts
Q14
What is the median monetary value per visit by gender?
2.0 pts Full Marks
0.0 pts No Marks
2.0 pts
Q15
Make a bubble plot of recency on the x-axis, monetary value on the y-axis, where the size of the points are proportional to first purchase and the points are colored by gender.
4.0 pts Full Marks
0.0 pts No Marks
4.0 pts
Total Points: 100.0
https://smu.instructure.com/courses/78193/assignments/401065?module_item_id=718967
4/4