Columbia Business School
Business Analytics Prof. Daniel Guetta
Homework 2
(See syllabus and course website for due date)
Attention: Please prepare three files for this homework assignment: a .pdf file containing your answers, including relevant figures; an .R file for your relevant R scripts; and an .xlsx file for the optimization question. File names should be uni.pdf and uni.R, e.g., crg2133.pdf and crg2133.R. Your submissions must be based on your own original work. Late submissions will not be accepted.
Question 1 (Quality of Classification)
Tahoe Healthcare has been approached by a healthcare analytics company, Xaltra, about a new system for managing readmissions. The CEO of Tahoe believes the initial success with the CareTracker system could be enhanced through the use of better predictive analytics and is intrigued by Xaltra's approach. The Xaltra system merges data from a variety of hospital systems to provide state-of-the-art predictions of readmissions risk. It uses up to 50,000 variables and generates its predictions using advanced machine learning algorithms. Xaltra has been adopted by a number of hospitals throughout the country. The price for the Xaltra system is $250,000 in up-front licence and integration fees plus an ongoing $45,000 per year fee for maintenance and support. The expected lifetime of the system is 10 years.
Xaltra requested a sample of data to test their system's performance against Tahoe's current logistic regression method (the one we developed together in class). Tahoe provided Xaltra with the same dataset used in our class session, which includes admissions and outcome data for all AMI patients treated by Tahoe in the last three years.
The ROC curve results for the Xaltra test are provided in the file xaltra.csv. For convenience, these are shown alongside the ROC curve for Tahoe's current logistic prediction system. (Note: for this problem, all the data required for the analysis is provided in the problem. There is no need to use data from the corresponding class session.)
You will remember that the dataset used to generate these ROC curves had 998 positive cases (patients readmitted) and 3,384 negative cases (patients not readmitted). The cost of CareTracker is $1,200 per patient, and it reduces the chances of readmission by 40%. The financial penalty for a readmission is $8,000.
(a) Given these ROC test results, what is your estimate of the total readmissions and CareTracker costs for AMI patients for the past three years if Tahoe had used the Xaltra system? Explain your estimate.
(b) What is the reduction in cost relative to Tahoe's current system? Do the savings justify the fees Xaltra is charging? Why or why not?
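As a rough illustration of the cost accounting behind this question, the Python sketch below evaluates the combined CareTracker and readmission cost at a single ROC operating point. It assumes that only patients in the positive class are ever readmitted, that the 40% reduction applies to flagged positives, and that each ROC point is given as a (false positive rate, true positive rate) pair. The three ROC points are invented for illustration and are not taken from xaltra.csv; the function names are hypothetical.

```python
# Figures stated in the problem.
POSITIVES = 998            # patients readmitted
NEGATIVES = 3384           # patients not readmitted
CARETRACKER_COST = 1200    # dollars per enrolled patient
READMIT_PENALTY = 8000     # dollars per readmission
REDUCTION = 0.40           # drop in readmission chance under CareTracker


def total_cost(tpr, fpr):
    """Total cost if every patient flagged at this ROC point gets CareTracker.

    Assumption: only the positive class readmits, and CareTracker prevents
    a fraction REDUCTION of readmissions among flagged positives.
    """
    treated_pos = tpr * POSITIVES
    treated_neg = fpr * NEGATIVES
    untreated_pos = (1 - tpr) * POSITIVES
    program_cost = CARETRACKER_COST * (treated_pos + treated_neg)
    readmissions = untreated_pos + (1 - REDUCTION) * treated_pos
    return program_cost + READMIT_PENALTY * readmissions


def best_point(roc_points):
    """Return the (fpr, tpr) pair with the lowest total cost."""
    return min(roc_points, key=lambda pt: total_cost(tpr=pt[1], fpr=pt[0]))


# Hypothetical ROC points, for illustration only.
example_roc = [(0.0, 0.0), (0.2, 0.6), (1.0, 1.0)]
fpr, tpr = best_point(example_roc)
print(fpr, tpr, round(total_cost(tpr, fpr)))
```

Under these assumptions, each correctly flagged patient saves 0.40 × $8,000 − $1,200 on average, while each incorrectly flagged patient costs $1,200, so the cheapest threshold depends on how steeply the ROC curve rises.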
Question 2 (Skill vs. Luck and DiD)
Hillside is a small charter school in an inner-city neighborhood of Newark, New Jersey. Feeling pressure from its board to increase student scores on the state standardized test, the school administration recently piloted an intervention program called SIS (Student Intervention for Success) aimed at improving the scores of the lowest-performing students by providing tutoring. SIS provides intensive help for students, including tutoring, an after-school study skills workshop and peer advising.
The state test has a maximum score of 25. Students who receive a score of 11 or less are considered to be performing significantly under grade level. For its pilot, Hillside decided to enroll any student who had a 2011 score of 11 or lower in the SIS program at the start of 2012. The 2012 academic year was now over and Hillside administrators wanted to evaluate the results of SIS and report back to the board.
The file hillside data.csv contains a sample of 100 Hillside students. Their scores on the past three standardized tests (2010, 2011 and 2012) are reported, along with an indicator of whether the student was enrolled in SIS for 2012. Based on these data, answer the following questions:
(a) Using 2011 as the "before" period and 2012 as the "after" period, perform a difference-in-difference analysis on the change in the average test scores of the SIS students. Based on your DiD estimate, what is the increase in test scores from SIS?
(b) You suspect the results in part (a) may be overly optimistic because of the effects of regression to the mean. That is, because only the students who performed poorly on the 2011 exam were enrolled in SIS, some increase in their 2012 scores would be expected simply due to regression to the mean. To test this idea, consider the performance of the students between 2010 and 2011. Use the data from 2010 and the data from 2011 to determine whether there was regression to the mean. If so, what is the shrinkage coefficient?
(c) Using the shrinkage coefficient you obtained in part (b), construct a shrinkage estimate of 2012 scores based on the 2011 test results. What is the RMSE of your predictions?
(d) Now, use the results from (b) and (c) to correct the DiD analysis so it accounts for the shrinkage effect. To do that, compute the average of the estimated and actual 2012 scores for both the SIS students and non-SIS students. Considering the estimated 2012 scores as the "before" scores and the actual 2012 scores as the "after" scores, perform another DiD analysis of the SIS program. With this correction for shrinkage, what is your new estimate of the increase in test scores from SIS? Make sure you completely understand this technique before you apply it.
(e) Briefly comment on what was 'wrong' with the first method, and how the second method 'fixed' this problem.
(f) How might you use a regression discontinuity framework to estimate the effect of SIS on grades? Obtain such an estimate using the data provided.
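The quantities named in parts (a) to (d) can be sketched in a few lines of Python. The sketch assumes one common definition of the shrinkage coefficient, namely the least-squares slope of the later year's deviations from its mean on the earlier year's deviations from its mean; the definition used in class may differ. All function names and the toy scores are hypothetical and are not taken from the Hillside data file.

```python
from statistics import mean


def did(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences of group means."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


def shrinkage_coefficient(earlier, later):
    """Slope of later-year deviations on earlier-year deviations.

    Assumption: deviations are measured from each year's own mean.
    A value below 1 suggests regression to the mean.
    """
    m_early, m_late = mean(earlier), mean(later)
    num = sum((x - m_early) * (y - m_late) for x, y in zip(earlier, later))
    den = sum((x - m_early) ** 2 for x in earlier)
    return num / den


def shrink(scores, coefficient, center):
    """Pull each score toward `center` by the shrinkage coefficient."""
    return [center + coefficient * (s - center) for s in scores]


def rmse(predicted, actual):
    """Root mean squared error between two equal-length score lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return (sum(squared) / len(squared)) ** 0.5


# Toy scores, for illustration only.
print(did([10, 10], [14, 14], [20, 20], [21, 21]))
print(shrinkage_coefficient([0, 10], [4, 6]))
print(shrink([0, 10], 0.5, 5))
```

For the corrected analysis in part (d), the shrunken 2011 scores would take the place of the "before" lists passed to did.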
Question 3 (Clustering and PCA)
The file protein.csv contains data about protein consumption patterns in various European countries. In this question, we will explore relationships between these countries in terms of their protein consumption patterns.
(a) First, consider each country only in terms of red meat and white meat protein (ignore all other data). Cluster the countries into three clusters based on these data, and plot the resulting cluster memberships.
(b) Now, consider all the data. Pick an appropriate number of clusters to group these countries. Print out cluster memberships.
(c) Perform hierarchical clustering on these data, and plot the results. Are the results as you would expect?
(d) Perform a principal component analysis on these data, and plot the countries in terms of their first two principal components. How much of the variance is explained by these two principal components?
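For readers who want to see what the clustering step in part (a) does internally, here is a minimal Python sketch of k-means (Lloyd's algorithm) on two-dimensional points. It is a simplified stand-in, not a replacement for a library routine: it seeds the centroids with the first k points, which can get stuck in a poor solution, whereas library implementations typically use several random starts. The six points are made-up (red meat, white meat) pairs and are not taken from protein.csv.

```python
import math


def kmeans(points, k, max_iters=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical (red meat, white meat) pairs, for illustration only.
toy = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0), (1.2, 0.8), (5.1, 4.9), (9.2, 1.1)]
labels, centroids = kmeans(toy, k=3)
print(labels)
```

Because k-means relies on Euclidean distance, variables measured on very different scales are usually standardized before clustering.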
Question 4 (Simulation)
A bakery needs to make croissants every day to satisfy its customers. Based on historical data, they estimate that the demand each morning is normally distributed with mean 50 and standard deviation 10. In the afternoon, demand is uniformly distributed between 60 and 80 on rainy days and uniformly distributed between 20 and 50 on sunny days. The probability of a sunny day is 0.4.
(a) Simulate 10,000 days of total demand and create a histogram of daily demand. What are the 10th and 90th percentiles of the demand?
(b) Assume that each croissant costs $1 to make, and sells for $4. All croissants are made daily before the store opens. What is the expected profit if 120 croissants are made every day?
(c) What is the optimal number of croissants to make every day? What is the corresponding optimal profit? Use simulation to find your answer.
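The simulation described in this question can be sketched as follows in Python, using only the standard library. The sketch makes several assumptions that the problem leaves open: demand is treated as continuous rather than rounded to whole croissants, unsold croissants are worth nothing, unmet demand is simply lost, and the search range of 60 to 160 croissants is an arbitrary choice. The function names and the fixed seed are hypothetical.

```python
import random
from statistics import quantiles


def simulate_demand(n_days, seed=0):
    """Draw total daily demand: normal morning plus weather-dependent afternoon."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        morning = rng.gauss(50, 10)
        if rng.random() < 0.4:               # sunny day
            afternoon = rng.uniform(20, 50)
        else:                                # rainy day
            afternoon = rng.uniform(60, 80)
        days.append(morning + afternoon)
    return days


def expected_profit(demand, made, cost=1.0, price=4.0):
    """Average profit when `made` croissants are baked every day.

    Assumption: leftovers have no value and unmet demand is lost.
    """
    profits = [price * min(made, d) - cost * made for d in demand]
    return sum(profits) / len(profits)


demand = simulate_demand(10_000)
deciles = quantiles(demand, n=10)            # nine cut points
p10, p90 = deciles[0], deciles[-1]
profit_120 = expected_profit(demand, 120)
best_q = max(range(60, 161), key=lambda q: expected_profit(demand, q))
print(p10, p90, profit_120, best_q)
```

Reusing the same simulated demand for every candidate quantity keeps the comparison between quantities free of extra sampling noise.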
Question 5 (Optimization)
A manufacturer makes 1500 laptops and 1000 desktops per month. Any laptop or desktop can be customized. The demand for standard laptops is 1200 a month, and for customized laptops is 1000 a month. The net profit for a standard laptop is $100, and for a customized laptop is $200. The demand for standard desktops is 700 a month, and for customized desktops is 400 a month. The net profit for a standard desktop is $150, and for a customized desktop is $400. Due to labor limitations, only 500 machines can be customized in a month.
(a) Write down a mathematical formulation to optimize the total net profit. Is it linear, nonlinear, or discrete?
(b) Solve this problem and describe the optimal strategy and the optimal net profit.
(c) What is the benefit of being able to customize 200 more machines?
(d) What is the benefit from being able to make 300 more desktops?
(e) What happens if we manufacture 100 fewer laptops?
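One way to sanity-check a spreadsheet solution to this question is a brute-force search over whole-number production plans, sketched below in Python. This is an enumeration, not the linear-programming approach a solver would take. It assumes that every machine built is sold as either standard or customized, and that, for a given number of customized machines, selling as many standard machines as capacity and demand allow is best (both standard profits are positive). The function name and its parameters are hypothetical.

```python
def best_plan(laptop_cap=1500, desktop_cap=1000, custom_cap=500):
    """Enumerate customized quantities; fill the remaining capacity with standard units.

    Returns (profit, std_laptops, custom_laptops, std_desktops, custom_desktops).
    """
    best = None
    for cl in range(0, min(1000, laptop_cap, custom_cap) + 1):
        for cd in range(0, min(400, desktop_cap, custom_cap - cl) + 1):
            sl = min(1200, laptop_cap - cl)   # standard laptops: demand vs. capacity
            sd = min(700, desktop_cap - cd)   # standard desktops: demand vs. capacity
            value = 100 * sl + 200 * cl + 150 * sd + 400 * cd
            if best is None or value > best[0]:
                best = (value, sl, cl, sd, cd)
    return best


base = best_plan()
more_custom = best_plan(custom_cap=700)       # part (c)
more_desktops = best_plan(desktop_cap=1300)   # part (d)
fewer_laptops = best_plan(laptop_cap=1400)    # part (e)
print(base)
print(more_custom[0] - base[0])
print(more_desktops[0] - base[0])
print(fewer_laptops[0] - base[0])
```

Rerunning the search with one capacity changed gives the same what-if comparison that a solver's sensitivity report provides, at least for changes of the sizes asked about here.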