Homework 2 (Deadline: Feb 16, 2022)
1. The Institute for Statistics Education at Statistics.com offers online courses in statistics and business analytics, and is seeking information that will help in packaging and sequencing courses. Consider the data in the file CourseTopics.csv. These data are for purchases of online statistics courses at Statistics.com. Each row represents the courses attended by a single customer. The firm wishes to assess alternative sequencings and bundling of courses. Use association rules to analyze these data (with support = 0.01, and confidence = 0.5), and interpret the first two of the resulting rules (ranked by the lift ratio).
rm(list=ls())
# Load and clean data.
course.df <- read.csv("Coursetopics.csv")
course.mat <- as.matrix(course.df)
head(course.mat, 10)
##       Intro DataMining Survey Cat.Data Regression Forecast DOE SW
##  [1,]     1          1      0        0          0        0   0  0
##  [2,]     0          0      1        0          0        0   0  0
##  [3,]     0          1      0        1          1        0   0  1
##  [4,]     1          0      0        0          0        0   0  0
##  [5,]     1          1      0        0          0        0   0  0
##  [6,]     0          1      0        0          0        0   0  0
##  [7,]     1          0      0        0          0        0   0  0
##  [8,]     0          0      0        1          0        1   1  1
##  [9,]     1          0      0        0          0        0   0  0
## [10,]     0          0      0        1          0        0   0  0
library(arules)
## Loading required package: Matrix
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## abbreviate, write
# Recast the incidence matrix into a transactions object.
course.trans <- as(course.mat, "transactions")
# Generate rules with support = 0.01 and confidence = 0.5, then sort by lift.
options(digits = 2, scipen = 1)
rules <- apriori(course.trans, parameter = list(supp = 0.01, conf = 0.5, target = "rules"))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 3
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8 item(s), 365 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [54 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(sort(rules, by = "lift"), 5))
##     lhs                                rhs          support confidence coverage lift count
## [1] {Intro, Regression, Forecast}   => {DataMining}   0.014       0.71    0.019  4.0     5
## [2] {Intro, Survey, DOE}            => {Cat.Data}     0.011       0.80    0.014  3.8     4
## [3] {Intro, DataMining, Cat.Data}   => {Regression}   0.016       0.75    0.022  3.6     6
## [4] {Intro, DataMining, Regression} => {Forecast}     0.014       0.50    0.027  3.6     5
## [5] {Intro, Survey, Cat.Data}       => {Forecast}     0.014       0.50    0.027  3.6     5
Interpretation of the top two rules: Rule [1] says that customers who took Intro, Regression, and Forecast also took DataMining 71% of the time, 4.0 times the overall rate at which DataMining is taken, so DataMining is a natural course to bundle with those three. Rule [2] says that customers who took Intro, Survey, and DOE also took Cat.Data 80% of the time, 3.8 times its overall rate. Both rules rest on very few transactions (counts of 5 and 4), so they should be interpreted with caution.
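As a sanity check on how these columns are computed, support, confidence, and lift can be reproduced directly from a 0/1 incidence matrix with base R. The sketch below uses a small made-up matrix, not the actual CourseTopics data:

```r
# Sketch: compute support, confidence, and lift for the rule {A, B} => {C}
# from a toy 0/1 incidence matrix (made-up data, not CourseTopics.csv).
mat <- matrix(c(1, 1, 1,
                1, 1, 0,
                1, 1, 1,
                0, 1, 1,
                1, 0, 0), ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("A", "B", "C")))
n <- nrow(mat)
lhs  <- mat[, "A"] == 1 & mat[, "B"] == 1         # rows containing the antecedent
rule <- lhs & mat[, "C"] == 1                     # rows containing antecedent and consequent
support    <- sum(rule) / n                       # P(A, B, C)
coverage   <- sum(lhs) / n                        # P(A, B)
confidence <- support / coverage                  # P(C | A, B)
lift       <- confidence / (sum(mat[, "C"]) / n)  # P(C | A, B) / P(C)
c(support = support, confidence = confidence, coverage = coverage, lift = lift)
```

A lift above 1 means the consequent is more common among antecedent holders than in the data overall, which is why the rules above are ranked by it.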
2. The file UniversalBankFull.csv contains data on 5000 customers of Universal Bank. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal.Loan). Among these 5000 customers, only 480 (9.6%) accepted the personal loan offered to them in the earlier campaign. In this question, we focus on two predictors, Online (whether the customer is an active user of online banking services) and CreditCard (whether the customer holds a credit card issued by the bank), and the outcome Personal.Loan. Partition the data into history (60%) and future (40%) sets. Consider the task of classifying a new customer (in the future set) who owns a bank credit card and actively uses online banking services. Using the naïve Bayes classifier, find P(Personal.Loan = 1 | CreditCard = 1, Online = 1) and P(CreditCard = 0 | Personal.Loan = 1).
rm(list=ls())
#load the data
bank.df <- read.csv("UniversalBankFull.csv")
#consider only the required variables
bank.df <- bank.df[ , c(13, 14, 10)]
bank.df$Online <- as.factor(bank.df$Online)
bank.df$CreditCard <- as.factor(bank.df$CreditCard)
bank.df$Personal.Loan <- as.factor(bank.df$Personal.Loan)
str(bank.df)
## 'data.frame': 5000 obs. of 3 variables:
##  $ Online       : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 2 1 ...
##  $ CreditCard   : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
##  $ Personal.Loan: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
#partition the data into history (60%) and future (40%) sets
#set the seed for the random number generator for reproducing the partition.
set.seed(12345)
ntotal <- length(bank.df$Personal.Loan)
#Sample row numbers randomly.
nhistory.index <- sort(sample(ntotal, round(ntotal * 0.6)))
history.df <- bank.df[nhistory.index, ]
future.df <- bank.df[-nhistory.index, ]
#check if variables in the dataset are correctly identified for their types
str(bank.df)
## 'data.frame': 5000 obs. of 3 variables:
##  $ Online       : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 2 1 ...
##  $ CreditCard   : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
##  $ Personal.Loan: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
str(history.df)
## 'data.frame': 3000 obs. of 3 variables:
##  $ Online       : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 1 1 2 ...
##  $ CreditCard   : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 2 1 ...
##  $ Personal.Loan: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
# Find P(Personal.Loan = 1 | CreditCard = 1, Online = 1)
library(e1071)
loan.nb <- naiveBayes(Personal.Loan ~ Online + CreditCard, data = history.df)
## predict probabilities
loan.pred.prob <- predict(loan.nb, newdata = future.df, type = "raw")
loan.combined.df <- data.frame(actual = future.df$Personal.Loan, loan.pred.prob)
str(loan.combined.df)
## 'data.frame': 2000 obs. of 3 variables:
##  $ actual: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ X0    : num 0.903 0.903 0.903 0.903 0.904 ...
##  $ X1    : num 0.0974 0.0974 0.0974 0.0974 0.0963 ...
head(loan.combined.df[future.df$Online == 1 & future.df$CreditCard == 1, ])
##   actual   X0   X1
# Find P(CreditCard = 0 | Personal.Loan = 1)
loan.nb
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
##     0     1
## 0.902 0.098
##
## Conditional probabilities:
##    Online
## Y      0    1
##   0 0.41 0.59
##   1 0.41 0.59
##
##    CreditCard
## Y      0    1
##   0 0.71 0.29
##   1 0.70 0.30
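The posterior can also be checked by hand with Bayes' rule under the naive independence assumption, plugging in the rounded a-priori and conditional probabilities printed above (so the result is approximate, and the exact tables depend on the random partition seed):

```r
# Hand check of the naive Bayes posterior, using the rounded
# probabilities printed by the fitted model above.
prior     <- c("0" = 0.902, "1" = 0.098)  # P(Personal.Loan)
p.online1 <- c("0" = 0.59,  "1" = 0.59)   # P(Online = 1 | Personal.Loan)
p.cc1     <- c("0" = 0.29,  "1" = 0.30)   # P(CreditCard = 1 | Personal.Loan)
num  <- prior * p.online1 * p.cc1         # unnormalized posterior for each class
post <- num / sum(num)                    # normalize over Personal.Loan = 0, 1
round(post, 3)
```

This gives P(Personal.Loan = 1 | CreditCard = 1, Online = 1) of roughly 0.10, consistent with the X1 column above, and P(CreditCard = 0 | Personal.Loan = 1) = 0.70 can be read directly from the CreditCard conditional-probability table.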
3. Draw 40000 random variables following the standard normal distribution. Plot the histogram.
rm(list=ls())
set.seed(100)
r40000 <- rnorm(40000)
hist(r40000, breaks= 200, probability=T, xlab="value", ylab="density")
[Figure: Histogram of r40000, density scale]
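As an optional visual check (not asked for in the question), the theoretical N(0, 1) density can be overlaid on the histogram:

```r
# Sketch: overlay the theoretical N(0, 1) density on the histogram
# as a visual check of the simulated draws.
set.seed(100)
r40000 <- rnorm(40000)
hist(r40000, breaks = 200, probability = TRUE, xlab = "value", ylab = "density")
curve(dnorm(x), from = -4, to = 4, col = "red", lwd = 2, add = TRUE)
```

With 40000 draws the histogram should track the red curve closely.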
4. A human resource manager at a small university in the US has been considering a change to the structure of employee benefits (in terms of healthcare coverage and pension savings). To get an idea of how receptive the faculty, administrators, and staff members might be to the proposed changes, she has decided to conduct a survey in which n = 188 respondents could register their support or opposition.
Use R and the data set benefits.csv to answer the following questions:
a. Find the 95% confidence interval estimate of p.
b. What sample size would you recommend to achieve a margin of error of 0.02, with confidence 0.99?
(use the planning value p = 1/2)
rm(list=ls())
benefit.df <- read.csv('./benefits.csv') # read data
head(benefit.df)
##   agree
## 1     1
## 2     0
## 3     0
## 4     1
## 5     1
## 6     1
str(benefit.df)
## 'data.frame': 188 obs. of 1 variable:
##  $ agree: int 1 0 0 1 1 1 1 1 1 1 ...
t.test(benefit.df, conf.level = 0.95)
## One Sample t-test
## data: benefit.df
## t = 18, df = 187, p-value <2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.56 0.70
## sample estimates:
## mean of x
## 0.63
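As a cross-check on part (a), the interval can be reproduced by hand from the one-sample t formula, mean plus or minus t(alpha/2, n-1) times s/sqrt(n). The sketch below uses the rounded sample mean from the t.test output (0.63); exact agreement with t.test would require the raw benefit.df$agree data:

```r
# Sketch: hand computation of the 95% t-interval for the proportion,
# using the rounded sample mean from the t.test output (p.hat = 0.63).
n     <- 188
p.hat <- 0.63
s     <- sqrt(p.hat * (1 - p.hat) * n / (n - 1))  # sample sd of a 0/1 variable
se    <- s / sqrt(n)
tcrit <- qt(0.975, df = n - 1)
ci    <- p.hat + c(-1, 1) * tcrit * se
round(ci, 2)
```

This recovers the same interval, 0.56 to 0.70, reported by t.test above.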
# Part b: sample size for a margin of error of 0.02 at 99% confidence.
PME <- 0.02
conf.level <- 0.99
alpha <- 1 - conf.level
z <- qnorm(1 - alpha/2)
p <- 0.5  # planning value
n <- ceiling(z^2 * p * (1 - p) / PME^2)
n
## [1] 4147
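The same calculation can be wrapped in a small helper for trying other margins and confidence levels (a hypothetical function name, not part of the assignment):

```r
# Hypothetical helper (illustration only): required sample size for
# estimating a proportion with margin of error pme at the given
# confidence level, using planning value p.
sample.size.prop <- function(pme, conf.level, p = 0.5) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  ceiling(z^2 * p * (1 - p) / pme^2)
}
sample.size.prop(pme = 0.02, conf.level = 0.99)  # the setting in part b
```

The conservative planning value p = 1/2 maximizes p(1 - p), so the returned n is sufficient whatever the true proportion turns out to be.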