
PMDA: Practice exam (Solutions)
Your name here
QUESTION 1: (4 sub-questions, 20 points)
The 4 sub-questions of this question are not related to each other.
QUESTION 1.1 (5 points)
You want to study the effect of class sizes on student achievement. You have data from an experiment in which Tennessee schools created some smaller classes and randomly assigned each student to either a small class (13-17 students) or to a regular-sized class (22-25 students). It was, however, left to each school to decide how many small classes they wanted to have. Thus, the randomization only assured that within schools students were randomly assigned to classes of different sizes.
Download and view Table 2 from the academic study that analyzes data from this experiment; the table is provided in the PDF file “star_table.pdf”. The columns labeled (1) and (2) present two regressions run by the study’s authors. The table reports standard errors in parentheses below the coefficient estimates.
Why did the authors run these regressions? What do we learn from these results? Be sure to explain why columns (1) and (2) differ and which one is the more appropriate one to use given the nature of the randomization.
ANSWER: The table shows regression output for two regressions of treatment on various demographics; thus, these regressions provide a randomization check.
Given that schools can choose their own proportion of small vs regular-sized classes, column (2) provides the more appropriate regression to use to assess the quality of the randomization. That is, because student assignment to small-vs-large classes is random within a school but not across schools, it is necessary to condition on school in the randomization check.
The coefficient estimates for the demographics differ between columns (1) and (2) because (2) includes school fixed effects. The change in coefficient estimates must result from the fact that demographics vary by school.
All coefficients for demographics in column (2) are within 2 standard errors of zero (i.e., no demographic is statistically significant), which is consistent with random assignment of treatment.
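As an illustration (not taken from the study's replication files), such a randomization check could be run in R along the following lines, assuming a hypothetical data frame star_data with a treatment dummy small_class, demographic dummies such as female, black and free_lunch, and a school identifier school_id:

# Column (1): regress the treatment dummy on demographics without school fixed effects
rand_check_1 <- lm(small_class ~ female + black + free_lunch, data = star_data)

# Column (2): add school fixed effects, since randomization was within schools
rand_check_2 <- lm(small_class ~ female + black + free_lunch + factor(school_id), data = star_data)

summary(rand_check_2)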
QUESTION 1.2 (5 points)
You are tasked to implement a randomized experiment in order to test the effectiveness (measured in grade point averages, “GPA”) of new schooling material for high school students. At one high school, you are planning to randomly give the material to a subset of students, but not other students at that same school. From historic data (i.e. before the intervention) you know that the variance of the GPA score is equal to 1.3. The organization expects an increase in GPA scores of 0.1 due to the intervention. Assume that the treatment / control proportions are equal to 80% / 20%.
How large does the sample size have to be in order to be able to find a positive effect on GPA scores at the 97.5 percent level of confidence (i.e., the t-statistic on the estimated effect must be at least 1.96)?
ANSWER: We want $\hat{\delta}/SE(\hat{\delta}) \geq 1.96$, with $SE(\hat{\delta}) = \sqrt{\sigma^2/(N\,p\,(1-p))}$, where $\delta = 0.1$ is the expected effect and $p = 0.8$ is the treatment share. Using the historical variance of GPA as a proxy for $\sigma^2$, we have $N \geq 1.96^2 \cdot 1.3 / (0.1^2 \cdot 0.8 \cdot 0.2) \approx 3121$, i.e. $N \geq 3122$. You can thus pick any sample size of at least 3122 students.
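A minimal R sketch of this calculation (illustrative only, using the numbers from the question):

sigma2 <- 1.3   # historical variance of GPA
delta  <- 0.1   # expected effect of the intervention
p      <- 0.8   # treatment share (control share is 1 - p)
z      <- 1.96  # critical value for 97.5 percent confidence

# required sample size: delta / sqrt(sigma2 / (N * p * (1 - p))) >= z
N <- z^2 * sigma2 / (delta^2 * p * (1 - p))
ceiling(N)  # approximately 3122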
QUESTION 1.3 (5 points)
You are running an A/B test to assess whether a new online ad campaign will increase revenue relative to your current campaign. Suppose you find that your new campaign does not outperform the old one. You wonder if the new campaign might only appeal to certain demographic groups. You therefore split the sample into male/female and 10 age bins for each gender (i.e. 20 bins in total). Only for 1 of the 20 demographic groups (women between ages 20 and 25) are you able to reject the null hypothesis that the new campaign has the same effect as the old one.
Based on this result, would you advise your marketing manager to target the new campaign specifically to this demographic group (and show the old ad to everybody else)? Explain your reasoning.
ANSWER: Here only 1 of the 20 tests (i.e., 5%) is statistically significant at a 95% confidence level, which is exactly the expected Type I error rate (i.e., the probability that we erroneously find a statistically significant result by chance when in fact there is no true effect is 5%).
You thus conclude that there is no statistically-detectable difference between the performance of the new and old ad campaigns.
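To see why one significant subgroup out of 20 is unsurprising, here is a quick illustrative calculation in R (not part of the original answer):

# With 20 independent tests at the 5% level and no true effect in any subgroup,
# we expect one false rejection, and the chance of at least one is high
20 * 0.05    # expected number of false positives: 1
1 - 0.95^20  # probability of at least one false positive: about 0.64

# A Bonferroni-style correction would require a much smaller p-value per subgroup
0.05 / 20    # 0.0025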
QUESTION 1.4 (5 points)
You are trying to optimize pricing for a product sold at a large number of stores. You implement an A/B test where you randomly vary price. However, you are only able to randomly vary prices for a given store relative to the historical price level at that store. Because prices differed across stores in the past, the A/B test only generates random variation in prices within stores, whereas price differences across stores are not random. Assume that the only variable that is correlated with price across stores is advertising and that stores with higher prices advertise more. Suppose you have a panel data set on these randomized prices and on sales at the week/store level.
Explain why a regression of sales on price would give you a biased estimate of the price coefficient and explain which regression you would run instead to remove the bias. Then use the omitted variables bias formula to guess the sign of the bias in the first regression.
ANSWER: Price is not randomly assigned across stores, so some of the price variation used in this regression is non-random. Thinking of advertising as the omitted variable: price and advertising are positively correlated across stores, and advertising positively affects demand, so the omitted variables bias is positive and the price coefficient is overestimated (biased upward) when store fixed effects are not included. A regression with store fixed effects, which uses only the randomized within-store price variation, removes the bias.
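A hedged sketch of the two regressions in R, assuming a hypothetical data frame store_panel with columns sales, price, and store_id (these names are illustrative, not from an actual data set):

# Biased regression: uses both the randomized within-store and the non-random
# across-store price variation
biased_reg <- lm(sales ~ price, data = store_panel)

# Regression with store fixed effects: uses only the randomized within-store variation
fe_reg <- lm(sales ~ price + factor(store_id), data = store_panel)

summary(fe_reg)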
QUESTION 2. Predicting credit default (4 sub-questions, 20 points)
You are working at a FinTech start-up which is trying to use demographic data to predict default risk for small loans. You have access to historic data on loans, including information on a set of demographics, whether the loan was repaid, and the interest rate (coded as a percentage, i.e. between 0 and 100) that was charged to the customer. Demographics are coded as 100 mutually exclusive groups (each constitutes a unique demographic characteristic).
The company used demographics to predict default risk in the past and charged different interest rates based on the default predictions. Your task is to update the targeting algorithm based on new data (the credit_default data-set).
QUESTION 2.1 (5 points)
The following output is from a cross-validated Lasso where the candidate X-variables are the dummies for all 100 demographic groups. Interpret the results from this regression.
library(readr)
credit_default <- read_csv("credit_default.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   consumer_id = col_double(),
##   demo_group = col_double(),
##   default_dummy = col_double(),
##   interest_rate = col_double()
## )

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1

# X variables to try for LASSO
X_demo <- sparse.model.matrix(~ factor(demo_group),credit_default)

# Y variable for LASSO
Y = credit_default$default_dummy

# simple version of lasso (with default parameters)
cv.lasso_demo <- cv.glmnet(X_demo,Y)

# plot as a function of lambda
plot(cv.lasso_demo)

# getting the coefficients and picking those that are not zero
coefficients <- coef(cv.lasso_demo,s="lambda.min")
coeffnames <- rownames(coefficients)[which(coefficients != 0)]
coeffvalues <- coefficients[which(coefficients != 0)]

# printing non-zero ones
cbind(coeffnames,coeffvalues)
##       coeffnames              coeffvalues
## [1,] "(Intercept)" "0.0591691922315344" 
## [2,] "factor(demo_group)9" "0.00451257403323839" 
## [3,] "factor(demo_group)10" "0.0347329598399265" 
## [4,] "factor(demo_group)12" "0.017570205255731" 
## [5,] "factor(demo_group)15" "0.0264241035054739" 
## [6,] "factor(demo_group)17" "0.0364704893063275" 
## [7,] "factor(demo_group)21" "-0.0242463070542091" 
## [8,] "factor(demo_group)23" "0.00689789878387241" 
## [9,] "factor(demo_group)25" "-0.000802310208010512"
## [10,] "factor(demo_group)27" "-0.0458168925309699" 
## [11,] "factor(demo_group)28" "-0.0434557010344263" 
## [12,] "factor(demo_group)30" "-0.0229106049914261" 
## [13,] "factor(demo_group)33" "-0.000511747451607942"
## [14,] "factor(demo_group)37" "-0.00250414684358087" 
## [15,] "factor(demo_group)39" "-0.00374160780379664" 
## [16,] "factor(demo_group)40" "-0.00826826103099875" 
## [17,] "factor(demo_group)41" "0.00708628214541337" 
## [18,] "factor(demo_group)42" "0.000518900947168005" 
## [19,] "factor(demo_group)43" "0.0024176433303336" 
## [20,] "factor(demo_group)45" "0.0420474421015577" 
## [21,] "factor(demo_group)46" "-0.00639424133762686" 
## [22,] "factor(demo_group)50" "0.00489058609201252" 
## [23,] "factor(demo_group)51" "0.000416173638422472" 
## [24,] "factor(demo_group)53" "0.0120699803932205" 
## [25,] "factor(demo_group)54" "0.00734354051106706" 
## [26,] "factor(demo_group)56" "-0.0228860283769864" 
## [27,] "factor(demo_group)66" "0.000806810705838218" 
## [28,] "factor(demo_group)69" "0.00170162788680598" 
## [29,] "factor(demo_group)79" "0.0143847043032036" 
## [30,] "factor(demo_group)83" "-0.0151760342652063" 
## [31,] "factor(demo_group)87" "-0.0327258836534178" 
## [32,] "factor(demo_group)93" "0.0019730844941684" 
## [33,] "factor(demo_group)95" "-0.000389186536051923"

ANSWER: Only a limited number of demographics are retained in the regression, suggesting that including all demographic dummies would lead to overfitting. Demographics 27, 28 and 45 are the ones most strongly associated (in absolute value) with defaults.

QUESTION 2.2 (5 points)
The following cross-validated Lasso regression includes the interest rate as well as the demographics in the set of X-variables to select from. How do the results differ from the ones in question 2.1? Why are fewer demographics retained when also including the interest rate among the candidate X variables?

# X variables to try for LASSO
X_interest <- sparse.model.matrix(~ factor(demo_group) + interest_rate,credit_default)

# simple version of lasso (with default parameters)
cv.lasso_interest <- cv.glmnet(X_interest,Y)

# plot as a function of lambda
plot(cv.lasso_interest)

# getting the coefficients and picking those that are not zero
coefficients <- coef(cv.lasso_interest,s="lambda.min")
coeffnames <- rownames(coefficients)[which(coefficients != 0)]
coeffvalues <- coefficients[which(coefficients != 0)]

# printing non-zero ones
cbind(coeffnames,coeffvalues)
##       coeffnames              coeffvalues
## [1,] "(Intercept)" "-0.0409076245635271" 
## [2,] "factor(demo_group)9" "0.00107180487544013" 
## [3,] "factor(demo_group)15" "0.0229642993527313" 
## [4,] "factor(demo_group)21" "-0.0197956128017048" 
## [5,] "factor(demo_group)23" "0.00333461639561381" 
## [6,] "factor(demo_group)27" "-0.00808285565272376"
## [7,] "factor(demo_group)28" "-0.00565597201129263"
## [8,] "factor(demo_group)40" "-0.00361996200995594"
## [9,] "factor(demo_group)41" "-0.0113255128127821" 
## [10,] "factor(demo_group)46" "-0.00202889926820448"
## [11,] "factor(demo_group)50" "0.00138389106091957" 
## [12,] "factor(demo_group)53" "0.00847731759612992" 
## [13,] "factor(demo_group)54" "0.00376767149927589" 
## [14,] "factor(demo_group)79" "0.0107315477718471" 
## [15,] "factor(demo_group)83" "-0.0106641180024546" 
## [16,] "interest_rate" "0.016601791083859"

ANSWER: The interest rate was set based on demographics (and how they predict default) in the past. Hence we would expect the interest rate to be highly correlated with the set of demographics that predict default in the current data. After controlling for past interest rates, some demographics might not provide additional explanatory power and hence Lasso does not retain them.

QUESTION 2.3 (5 points)
The following code runs a regression of the default dummy on the interest rate and computes the (out-of-sample) R-squared for the Lasso regression in question 2.2. Based on the comparison of the R-squared from the first regression and the Lasso R-squared, do you think that interest rates were well targeted in your data (i.e., past interest rates are a good predictor of default)?

# simple regression r-squared
interest_reg <- lm(default_dummy ~ interest_rate, data=credit_default)
summary(interest_reg)
## 
## Call:
## lm(formula = default_dummy ~ interest_rate, data = credit_default)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -0.11532 -0.05855 -0.05855 -0.05855 0.97930 
## 
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) -0.054991 0.009103 -6.041 1.54e-09 ***
## interest_rate 0.018923 0.001499 12.625 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2356 on 49998 degrees of freedom
## Multiple R-squared: 0.003178, Adjusted R-squared: 0.003158 
## F-statistic: 159.4 on 1 and 49998 DF,  p-value: < 2.2e-16

# Lasso r-squared
credit_default$prediction <- predict(cv.lasso_interest,X_interest,s="lambda.min")
credit_default$residual <- credit_default$default_dummy - credit_default$prediction
1 - var(credit_default$residual)/var(credit_default$default_dummy)
##           1
## 1 0.004085379

ANSWER: The R-squared from the simple regression (0.003178) is only slightly lower than the out-of-sample R-squared from the Lasso regression (0.0041). This suggests that the past interest rate alone captures almost all of the predictive power of the demographics, so it is well targeted, in the sense that higher-risk customers receive higher interest rates. However, the interest rate alone does not perfectly capture the relevant demographics (if it did, only the interest rate would have been retained in the Lasso regression).

QUESTION 2.4 (5 points)
Which Lasso regression (with or without the interest rate) would you use to decide how to set interest rates in the future for NEW customers?

ANSWER: Interest rates for new customers have to be set based on the Lasso regression that does not include the interest rate. By construction, a new customer will not have been charged an interest rate yet and hence this variable cannot be used for prediction.

QUESTION 3. Yelp Star Ratings (5 sub-questions, 25 points)
You want to study the impact of star ratings (yelp_star) on clicks to restaurant websites on the Yelp platform. You have data on average daily clicks across a range of restaurants and the star ratings for those restaurants (which vary in half-star increments and take on values in the range 5, 4.5, …, 2.5). The data also contain the actual average Yelp score underlying the star rating (yelp_score). The star rating is equal to the actual average Yelp score (e.g., 4.1374) rounded to the nearest half-star value (e.g., 4.0).

QUESTION 3.1 (5 points)
You regress clicks on yelp_star and yelp_score and obtain the following estimated relationship (two stars next to a coefficient indicate significance at the 5% level; no stars indicate no significance at the 5% level):

Interpret the coefficients.

ANSWER: Holding the yelp_score constant, a half-star increase in the yelp_star rating is associated with an increase of 2.9 average daily clicks. The intercept is not meaningful here as there are no restaurants with zero stars in the dataset. The yelp_score is positively correlated with clicks.

QUESTION 3.2 (5 points)
A regression of clicks on yelp_star only gives the following estimates (two stars next to a coefficient indicate significance at the 5% level; no stars indicate no significance at the 5% level):

Use the omitted variables bias formula to explain the change in the coefficient on the yelp_star variable between the two regressions in questions 3.1 and 3.2.

ANSWER: The coefficient decreases when controlling for yelp_score. This is because the score is positively correlated with clicks (as we see from the regression in 3.1) and is also positively correlated with yelp_star by definition (the higher the score, the higher the rounded number, i.e., the higher the star rating). The univariate regression thus suffers from positive omitted variable bias, so the coefficient on yelp_star goes down once we control for yelp_score.

QUESTION 3.3 (5 points)
Under which assumption does the yelp_star variable have a causal interpretation?

ANSWER: We would like to know if star ratings cause more clicks. Under the assumption that, after controlling for yelp_score, yelp_star only affects clicks directly (and is not correlated with any other omitted variable that affects clicks), the yelp_star coefficient has a causal interpretation. Here, this assumption probably holds because the star rating is a function of the score (and nothing else).
You can also think of this as a regression of clicks on yelp_star, where yelp_score is the only omitted variable impacting the yelp_star coefficient. Once we control for yelp_score, the coefficient on yelp_star has a causal interpretation.

QUESTION 3.4 (5 points)
Suppose you have data for all the stores in the Yelp dataset at two different months in a given year, and suppose that for each store you can measure clicks and yelp_star ratings at each month, but cannot measure yelp_score. Explain how you would use these data to estimate the causal effect of yelp_star on clicks. Which condition on the time variation in yelp_star ratings must be satisfied for you to be able to estimate this causal effect?

ANSWER: You can use a panel data regression with store fixed effects to control for the unobservable omitted variable "store quality" (since this is plausibly constant within a year). You need at least some stores to change their rating between the two months: if all ratings stayed constant, the yelp_star variable would be fully explained by the store fixed effects and dropped from the regression.
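A sketch of such a panel regression in R, assuming a hypothetical data frame yelp_panel with columns clicks, yelp_star, store_id, and month (names are illustrative):

# Store fixed effects absorb store quality, which is plausibly constant within the year;
# identification comes from stores whose star rating changes between the two months
panel_reg <- lm(clicks ~ yelp_star + factor(store_id) + factor(month), data = yelp_panel)
summary(panel_reg)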