• Technical Details of how to download the dataset and point R to this dataset.
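One way to point R at the downloaded file is sketched below. The file name `Advertising.csv` and its location are assumptions (the notes do not say where the file is saved), so adjust `csv_path` as needed.

```r
# Hypothetical file name: adjust to wherever the CSV was saved.
csv_path <- "Advertising.csv"

if (file.exists(csv_path)) {
  Advertising <- read.csv(csv_path)  # load the table into a data frame
  str(Advertising)                   # column names and types
  head(Advertising)                  # first six rows
} else {
  # Show where R is currently looking so the path can be corrected.
  message("Not found: ", csv_path, " (working directory: ", getwd(), ")")
}
```

If the file lives elsewhere, either pass the full path to `read.csv()` or change the working directory with `setwd()` first.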
• We want to measure the impact of each marketing technique (TV ads, newspaper ads, and radio ads) on units sold.
• What kind of model is most appropriate and why?
• Linear regression, because we ultimately want to predict the number of sales given the dollars we spend on each ad type. Linear regression fits here because the outcome is a continuous number; logistic regression is used when the outcome is binary and we are predicting the odds or probability of something happening.
• Ho (the null hypothesis) is that all three of these ad types have zero impact on sales, i.e. every coefficient equals zero and none of these marketing strategies work.
Step 1 for students: look at your data set; how do you interpret it?
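Step 2: What is the outcome variable? (sales) What are the predictors? (spend on each type of ad)
Code: Advertising (typing the name of the data frame displays the whole table, like select * in SQL)
Code: summary(Advertising) (R is case-sensitive, so the function is summary, not Summary; for each numeric column it reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum; it does not report the mode)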
What do the summary statistics tell us?
• Average values of each marketing type; TV has the highest average spend
Step 3: Create histogram of the outcome variable (sales) and respective frequency “bins”
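-Why do we do this? We want to visualize the normality of sales, i.e. is it shaped like a bell curve? (Strictly speaking, linear regression assumes normally distributed residuals rather than a normal outcome, but a strongly skewed outcome is an early warning sign.)
If not, we may need a transformation to adjust for the skewness (log transform, square, 1/x, etc.)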
Code: hist(Advertising$sales)
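To illustrate what a transformation does to a skewed histogram, here is a minimal sketch on synthetic right-skewed values (not the real sales column). The `skewness` helper is a hypothetical convenience function defined here, not part of base R.

```r
set.seed(1)
skewed <- rlnorm(200)  # synthetic right-skewed values, NOT real sales data

# Hypothetical helper: simple moment-based sample skewness.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

par(mfrow = c(1, 2))  # two plots side by side
hist(skewed, main = "Raw", xlab = "value")
hist(log(skewed), main = "Log-transformed", xlab = "log(value)")
par(mfrow = c(1, 1))  # restore the default layout

# Skewness closer to 0 means a more symmetric, bell-like shape.
c(raw = skewness(skewed), logged = skewness(log(skewed)))
```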
Step 5: Build three simple linear models, then two multiple regression models
TV ads on sales
Radio ads on sales
Newspaper ads on sales
TV and Radio on sales
All three on sales
What do you see, how do you interpret this?
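The five models listed above can be sketched with `lm()`. The data frame below is a synthetic stand-in so the example runs on its own; the column names `TV`, `radio`, and `newspaper` are assumptions (only `sales` appears in these notes), and the numbers used to generate it are arbitrary.

```r
# Synthetic stand-in for the Advertising table (NOT the real data).
set.seed(1)
n <- 200
Advertising <- data.frame(TV = runif(n, 0, 300),
                          radio = runif(n, 0, 50),
                          newspaper = runif(n, 0, 100))
Advertising$sales <- 3 + 0.045 * Advertising$TV +
  0.19 * Advertising$radio + rnorm(n, sd = 1.5)

# Three simple models: one predictor each.
fit_tv        <- lm(sales ~ TV, data = Advertising)
fit_radio     <- lm(sales ~ radio, data = Advertising)
fit_newspaper <- lm(sales ~ newspaper, data = Advertising)

# Two multiple regression models.
fit_tv_radio  <- lm(sales ~ TV + radio, data = Advertising)
fit_all       <- lm(sales ~ TV + radio + newspaper, data = Advertising)

summary(fit_tv)  # coefficients, RSE, R^2 and F test for one model
```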
How to assess which simple linear model is better:
Residual Standard Error (RSE): the standard deviation of the residuals, i.e. roughly how far a typical data point falls from the regression line. The lower the better.
Adjusted R^2: the proportion of the variability in sales that is explained by the model, e.g. by TV ads in the TV-only model (the higher the R^2, the better the fit; on its own it does not prove that TV ads cause the extra sales).
Use the adjusted R^2 instead of the multiple R^2, since it includes a penalty for the number of predictors; multiple R^2 never decreases when a predictor is added, even a useless one. The higher the better: we can explain X% of the variation in sales by ad type.
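A minimal sketch of pulling RSE and adjusted R^2 out of each simple model for a side-by-side comparison. It uses a synthetic stand-in data frame with assumed column names (`TV`, `radio`, `newspaper`), so the printed values say nothing about the real data.

```r
# Synthetic stand-in for the Advertising table (NOT the real data).
set.seed(1)
n <- 200
Advertising <- data.frame(TV = runif(n, 0, 300),
                          radio = runif(n, 0, 50),
                          newspaper = runif(n, 0, 100))
Advertising$sales <- 3 + 0.045 * Advertising$TV +
  0.19 * Advertising$radio + rnorm(n, sd = 1.5)

fits <- list(TV        = lm(sales ~ TV, data = Advertising),
             radio     = lm(sales ~ radio, data = Advertising),
             newspaper = lm(sales ~ newspaper, data = Advertising))

comparison <- data.frame(
  model  = names(fits),
  rse    = sapply(fits, sigma),  # residual standard error
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared)
)

comparison[order(comparison$rse), ]  # lowest RSE first
```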
Beta 1 Coefficient:
For every 1-unit increase in TV ad spend, you should expect sales to increase by 1,000 x Beta 1 units (the factor of 1,000 is because sales are recorded in thousands of units)
P Value and F Test (F test becomes more relevant with many predictors)
Testing whether there is a relationship (P Value from the F Test): taken one at a time, each ad type shows a statistically significant relationship with sales. But what happens when we look at all three together?
P Value is the probability of seeing results at least as extreme as the observed ones if Ho is true (.05 is the standard cutoff; the lower the P Value, the stronger the evidence against Ho)
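The Beta 1 coefficient and the F-test P Value can also be read out of the fitted model programmatically. This is a sketch on a synthetic stand-in data frame with assumed column names; `summary()` prints the same P Value on its last line.

```r
# Synthetic stand-in for the Advertising table (NOT the real data).
set.seed(1)
n <- 200
Advertising <- data.frame(TV = runif(n, 0, 300),
                          radio = runif(n, 0, 50),
                          newspaper = runif(n, 0, 100))
Advertising$sales <- 3 + 0.045 * Advertising$TV +
  0.19 * Advertising$radio + rnorm(n, sd = 1.5)

fit_tv <- lm(sales ~ TV, data = Advertising)
s <- summary(fit_tv)

beta1 <- coef(fit_tv)["TV"]  # slope: change in sales per 1-unit change in TV

# The F statistic is stored as (value, numerator df, denominator df);
# the P Value is the upper tail of the F distribution at that value.
f <- s$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

beta1
p_value
```

With a single predictor, the F-test P Value equals the P Value of the t test on the slope; the two only diverge once more predictors are added.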
TV and Radio:
Beta 1: On the TV line, this means that holding Radio constant, every 1-unit increase in TV ads increases sales by 46 units
On the Radio line, holding TV constant, every 1-unit increase in Radio ads increases sales by 188 units
We can reject Ho since the P Value is essentially zero, i.e. TV and Radio both have an impact on sales.
All Three Ad Types: What changed?
Takeaway: Purchasing Newspaper ads does not meaningfully increase sales. We can determine this from the negligible increase in R^2, the small Newspaper coefficient (Beta), and the P Value of its t test, which is .86 (far greater than .05). In this case, the t test measures whether the Newspaper coefficient is statistically different from 0.
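A sketch of how the "what changed?" comparison might be run. It uses a synthetic stand-in data frame in which newspaper has no effect by construction; the column names are assumptions and the output will not match the real figures quoted above.

```r
# Synthetic stand-in for the Advertising table (NOT the real data).
set.seed(1)
n <- 200
Advertising <- data.frame(TV = runif(n, 0, 300),
                          radio = runif(n, 0, 50),
                          newspaper = runif(n, 0, 100))
Advertising$sales <- 3 + 0.045 * Advertising$TV +
  0.19 * Advertising$radio + rnorm(n, sd = 1.5)

fit_two <- lm(sales ~ TV + radio, data = Advertising)
fit_all <- lm(sales ~ TV + radio + newspaper, data = Advertising)

# Per-coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coef(summary(fit_all))

# Did adding newspaper improve adjusted R^2?
c(two_predictors   = summary(fit_two)$adj.r.squared,
  three_predictors = summary(fit_all)$adj.r.squared)

# Formal comparison of the two nested models.
anova(fit_two, fit_all)
```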
Predictive Phase: Can we predict sales given the units of TV and Radio we purchase?
We have 200 rows in the table; we divide the data set into 75% Train (150 rows) and 25% Test (50 rows)
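One common way to make the 75/25 split is random sampling of row indices, sketched here on a synthetic stand-in data frame. The seed value and the names `train` and `test` are arbitrary choices, not taken from these notes.

```r
# Synthetic stand-in for the Advertising table (NOT the real data).
set.seed(1)
n <- 200
Advertising <- data.frame(TV = runif(n, 0, 300),
                          radio = runif(n, 0, 50),
                          newspaper = runif(n, 0, 100))
Advertising$sales <- 3 + 0.045 * Advertising$TV +
  0.19 * Advertising$radio + rnorm(n, sd = 1.5)

set.seed(42)  # arbitrary seed so the split is reproducible
train_idx <- sample(nrow(Advertising), size = floor(0.75 * nrow(Advertising)))

train <- Advertising[train_idx, ]   # 75% of the rows
test  <- Advertising[-train_idx, ]  # the remaining 25%

c(train = nrow(train), test = nrow(test))
```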
Step 1: Create a linear model of sales on TV and Radio using the Training set
Step 2: create summary of the final model
-when looking at the r^2 and RSE value, how do they compare to the full data set?
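-RSE and R^2 are higher with the training data set. Why is this so? The model is now fitted on fewer rows (150 instead of 200), so its estimates are noisier and these statistics shift somewhat from the full-data values.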
Step 3: Calculate RSE and R^2 on the Train dataset by hand in order to check the values against the summary function. They should match. We do this to check our math, since we will then apply the same formulas to the Test dataset.
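A sketch of the hand calculation, using the standard formulas RSE = sqrt(RSS / (n - p - 1)) and R^2 = 1 - RSS / TSS, where p is the number of predictors. The data frame is a synthetic stand-in with assumed column names; note that the hand-computed R^2 corresponds to the multiple R^2 in `summary()`, not the adjusted one.

```r
# Synthetic stand-in for the Advertising table (NOT the real data).
set.seed(1)
n <- 200
Advertising <- data.frame(TV = runif(n, 0, 300),
                          radio = runif(n, 0, 50),
                          newspaper = runif(n, 0, 100))
Advertising$sales <- 3 + 0.045 * Advertising$TV +
  0.19 * Advertising$radio + rnorm(n, sd = 1.5)

set.seed(42)
train_idx <- sample(nrow(Advertising), size = floor(0.75 * nrow(Advertising)))
train <- Advertising[train_idx, ]

fit <- lm(sales ~ TV + radio, data = train)
p <- 2  # number of predictors (TV and radio)

res_train <- train$sales - predict(fit, newdata = train)
rss <- sum(res_train^2)                          # residual sum of squares
tss <- sum((train$sales - mean(train$sales))^2)  # total sum of squares

rse_train <- sqrt(rss / (nrow(train) - p - 1))
r2_train  <- 1 - rss / tss

# Both checks should print TRUE if the formulas are right.
all.equal(rse_train, sigma(fit))
all.equal(r2_train, summary(fit)$r.squared)
```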
Step 4: Evaluating the Model on the Test Set
-calculate RSE and R^2 values;
-compare these values to the training data set; are they good enough?
-Generally, we expect RSE to go up a bit, since our model was calibrated on the Train data set, which did not contain the exact values in the Test set. If RSE goes down, consider yourself lucky
-In this case, RSE measures the typical difference between predicted sales and actual sales
-If TEST RSE is significantly higher than the TRAIN RSE, this is a red flag that the model is over-fitting.
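The test-set evaluation can be sketched by applying the same formulas to predictions on the held-out rows. This uses a synthetic stand-in data frame and keeps the n - p - 1 divisor for consistency with the training calculation; some practitioners divide by n instead (RMSE), which is a design choice rather than something these notes specify.

```r
# Synthetic stand-in for the Advertising table (NOT the real data).
set.seed(1)
n <- 200
Advertising <- data.frame(TV = runif(n, 0, 300),
                          radio = runif(n, 0, 50),
                          newspaper = runif(n, 0, 100))
Advertising$sales <- 3 + 0.045 * Advertising$TV +
  0.19 * Advertising$radio + rnorm(n, sd = 1.5)

set.seed(42)
train_idx <- sample(nrow(Advertising), size = floor(0.75 * nrow(Advertising)))
train <- Advertising[train_idx, ]
test  <- Advertising[-train_idx, ]

fit <- lm(sales ~ TV + radio, data = train)  # fitted on Train only
p <- 2                                       # number of predictors

pred_test <- predict(fit, newdata = test)    # predictions for unseen rows
res_test  <- test$sales - pred_test          # actual minus predicted

rse_test <- sqrt(sum(res_test^2) / (nrow(test) - p - 1))
r2_test  <- 1 - sum(res_test^2) / sum((test$sales - mean(test$sales))^2)

# Side-by-side comparison of Train and Test.
data.frame(set = c("train", "test"),
           rse = c(sigma(fit), rse_test),
           r2  = c(summary(fit)$r.squared, r2_test))
```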
Potentially adding an Interaction Term (TV, Radio, TVxRadio): we can also explore adding interaction terms to check whether the effect of one predictor depends on the level of another, i.e. whether TV and Radio act on sales independently or reinforce each other.
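In R's formula syntax, `TV * radio` expands to `TV + radio + TV:radio`, so the interaction model can be sketched as below. The data frame is a synthetic stand-in generated without any interaction, so it only illustrates the mechanics, not the real result.

```r
# Synthetic stand-in for the Advertising table (NOT the real data).
set.seed(1)
n <- 200
Advertising <- data.frame(TV = runif(n, 0, 300),
                          radio = runif(n, 0, 50),
                          newspaper = runif(n, 0, 100))
Advertising$sales <- 3 + 0.045 * Advertising$TV +
  0.19 * Advertising$radio + rnorm(n, sd = 1.5)

fit_main <- lm(sales ~ TV + radio, data = Advertising)  # main effects only
fit_int  <- lm(sales ~ TV * radio, data = Advertising)  # adds TV:radio

# t test on the interaction coefficient.
coef(summary(fit_int))["TV:radio", ]

# F test comparing the model with and without the interaction.
anova(fit_main, fit_int)
```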
Step 5: You have a new market coming on board; you plan to spend 250 on TV and 40 on Radio. What are the predicted sales?
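A sketch of the prediction with `predict()`. The model here is fitted on a synthetic stand-in data frame (full data, for brevity), so the number it prints is not the answer for the real dataset; in the workflow above you would pass the model from the training set. The name `new_market` is a hypothetical choice.

```r
# Synthetic stand-in for the Advertising table (NOT the real data).
set.seed(1)
n <- 200
Advertising <- data.frame(TV = runif(n, 0, 300),
                          radio = runif(n, 0, 50),
                          newspaper = runif(n, 0, 100))
Advertising$sales <- 3 + 0.045 * Advertising$TV +
  0.19 * Advertising$radio + rnorm(n, sd = 1.5)

fit <- lm(sales ~ TV + radio, data = Advertising)

# Column names must match the predictor names used in the model formula.
new_market <- data.frame(TV = 250, radio = 40)

# Point prediction plus a 95% prediction interval (columns fit, lwr, upr).
pred <- predict(fit, newdata = new_market, interval = "prediction")
pred
```

The interval gives a sense of how far actual sales in a single new market could plausibly fall from the point prediction.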