Python machine learning assignment writing: QBUS3830

QBUS3830 Advanced Analytics

Semester 2, 2018

Homework Task 1

1 Instructions

Using the code scaffold and data provided:

1. Implement a Python class for a logistic regression. The class should compute the MLE numerically and be able to compute predictions for the class label and probabilities. Follow the instructions in the notebook.
2. Run the test code provided and check whether you get the same coefficients and standard errors as the StatsModels package.
3. Implement a Python class for a Robit regression.
4. Run the code to evaluate which method seems to be more accurate for prediction. Discuss the results.
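To make the first part concrete, the block below is a minimal sketch of a logistic regression fitted by Newton-Raphson, with standard errors taken from the inverse information matrix at the estimate. It is not the course scaffold: the class name `LogisticRegressionMLE`, its method names, and the simulated data are all assumptions made for illustration, and the scaffold may ask for a different optimiser.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping the linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


def add_intercept(X):
    """Prepend a column of ones to the design matrix."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(X.shape[0]), X])


class LogisticRegressionMLE:
    """Hypothetical class: logistic regression MLE by Newton-Raphson."""

    def __init__(self, max_iter=50, tol=1e-8):
        self.max_iter = max_iter
        self.tol = tol

    def fit(self, X, y):
        Z = add_intercept(X)
        y = np.asarray(y, dtype=float)
        beta = np.zeros(Z.shape[1])
        for _ in range(self.max_iter):
            p = sigmoid(Z @ beta)
            score = Z.T @ (y - p)                      # gradient of the log-likelihood
            info = (Z * (p * (1 - p))[:, None]).T @ Z  # negative Hessian
            step = np.linalg.solve(info, score)
            beta = beta + step
            if np.max(np.abs(step)) < self.tol:
                break
        # Information matrix evaluated at the final estimate
        p = sigmoid(Z @ beta)
        info = (Z * (p * (1 - p))[:, None]).T @ Z
        self.coef_ = beta
        self.se_ = np.sqrt(np.diag(np.linalg.inv(info)))
        return self

    def predict_proba(self, X):
        """Estimated P(y = 1 | x) for each row of X."""
        return sigmoid(add_intercept(X) @ self.coef_)

    def predict(self, X, threshold=0.5):
        """Class labels obtained by thresholding the probabilities."""
        return (self.predict_proba(X) >= threshold).astype(int)


# Simulated data (an assumption, not the course dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
true_beta = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, sigmoid(true_beta[0] + X @ true_beta[1:]))

model = LogisticRegressionMLE().fit(X, y)
print("coefficients:", model.coef_)
print("standard errors:", model.se_)
```

Under this sketch, a Robit regression would follow the same structure with the logistic function replaced by a Student's t distribution function, although the log-likelihood then generally needs a general-purpose numerical optimiser.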

2 Rules

Do not look up similar answers on the internet. The point of the exercise is to understand the statistical and mathematical logic and to practise translating it into code. You must follow the structure provided in the scaffold.

3 Rubric

You will get full marks if you follow the instructions and obtain the correct numbers in parts 2 and 4.

Homework Task 2: Hypothesis Testing

1 Case study: Benford’s Law

1. Implement a Python function that performs Pearson’s χ² test for multinomial data and returns the test statistic and p-value.
2. The GDP dataset contains a list of countries ranked by their GDP in 2017 (in millions of dollars), according to the International Monetary Fund (IMF). Make a basic table to discuss how well the data conforms to Benford’s law for the first digit. Perform the χ² test and discuss the results.
3. The Fraud dataset contains three series. One is a real financial variable for a random sample of companies listed on the New York Stock Exchange (NYSE). The other two are the same series, but with random modifications of digits. Repeat the exercise above for each series, and identify the two “fraudulent” series.
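The following is a rough sketch of how the first-digit comparison and Pearson's χ² test might fit together, using only the standard library. The function names (`chi2_sf`, `pearson_chi2_test`, `first_digit`) are hypothetical, and the input is a synthetic series (the first 200 powers of 2), not the GDP or Fraud data. The tail probability uses the finite-sum formulas that hold for integer degrees of freedom; a library distribution function could be substituted if the rules allow it.

```python
import math
from collections import Counter

# Benford's law: P(first digit = d) = log10(1 + 1/d), d = 1, ..., 9
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def chi2_sf(x, df):
    """P(X > x) for a chi-squared variable with integer df (closed-form sums)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite sum over odd double factorials
    term = math.sqrt(2.0 * x / math.pi)
    total = 0.0
    for i in range(1, (df - 1) // 2 + 1):
        if i > 1:
            term *= x / (2 * i - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


def pearson_chi2_test(observed, expected_probs):
    """Pearson's test for multinomial counts; returns (statistic, p-value)."""
    n = sum(observed)
    stat = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_probs))
    return stat, chi2_sf(stat, len(observed) - 1)


def first_digit(value):
    """Leading non-zero digit of a non-zero number, read off scientific notation."""
    return int(f"{abs(value):.15e}"[0])


# Synthetic example: first digits of 2**0, ..., 2**199
digits = Counter(first_digit(2 ** k) for k in range(200))
counts = [digits[d] for d in range(1, 10)]
n = sum(counts)

print("digit  observed  expected")
for d, (obs, prob) in enumerate(zip(counts, BENFORD), start=1):
    print(f"{d:>5}  {obs:>8}  {n * prob:>8.1f}")

statistic, p_value = pearson_chi2_test(counts, BENFORD)
print(f"chi-squared = {statistic:.3f}, p-value = {p_value:.3f}")
```

With nine digit categories and fully specified probabilities, the test has 8 degrees of freedom, which is why the function uses `len(observed) - 1`.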

2 Case study: Verizon Repair Times

The Verizon dataset contains data from a court case that involved the American telecommunications company. Verizon is the primary local telephone company (incumbent local exchange carrier, ILEC) for a large area of the Eastern United States. As such, it is responsible for providing repair services for the customers of other telephone companies, known as competing local exchange carriers (CLECs). Verizon is subject to fines if the repair times for CLEC customers are worse than those for Verizon customers.

Assume a significance level of 1%.

1. Conduct a two-sample test based on large-sample theory and discuss the results.
2. Implement a Python function that conducts a permutation test based on the mean.

3. Conduct the permutation test, plot the permutation distribution, and discuss the results. Compare the permutation test to the large-sample test, and discuss which one is more appropriate for this problem.
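The two tests in this list can be sketched as follows, using only the standard library. Both functions are hypothetical names, and both assume a one-sided alternative in which the second sample (playing the role of the CLEC repair times) has the larger mean; the handout does not state the direction of the alternative. The samples below are simulated, not the Verizon data.

```python
import math
import random
from statistics import variance


def _mean(values):
    return sum(values) / len(values)


def large_sample_test(x, y):
    """Two-sample z-test with unpooled variances; H1: mean(y) > mean(x)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    z = (_mean(y) - _mean(x)) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under N(0, 1)
    return z, p_value


def permutation_test_mean(x, y, n_perm=5000, seed=0):
    """Permutation test on the difference in means; H1: mean(y) > mean(x).

    Returns the observed difference, the p-value, and the permuted
    differences (useful for plotting the permutation distribution).
    """
    rng = random.Random(seed)
    observed = _mean(y) - _mean(x)
    pooled = list(x) + list(y)
    n_x = len(x)
    diffs = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reallocation of group labels
        diffs.append(_mean(pooled[n_x:]) - _mean(pooled[:n_x]))
    # Adding one to numerator and denominator keeps the p-value above zero
    p_value = (1 + sum(d >= observed for d in diffs)) / (n_perm + 1)
    return observed, p_value, diffs


# Simulated skewed samples of unequal size (an assumption for illustration)
sim = random.Random(1)
ilec = [sim.expovariate(1 / 8) for _ in range(200)]
clec = [sim.expovariate(1 / 16) for _ in range(25)]

z, p_z = large_sample_test(ilec, clec)
diff, p_perm, perm_diffs = permutation_test_mean(ilec, clec)
print(f"z = {z:.3f}, large-sample p-value = {p_z:.4f}")
print(f"difference = {diff:.3f}, permutation p-value = {p_perm:.4f}")
```

The returned list of permuted differences can be passed to any histogram routine to plot the permutation distribution, with the observed difference marked on it.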

3 Rules

Do not use package versions of the χ² and permutation tests. Do not look for similar code on the internet. The code must be your own work.

4 Rubric

You will get full marks if you follow the instructions, obtain the correct p-values, and interpret the results correctly.

Homework Task 3: The Bootstrap

1 Bootstrapping regression models

Forecasting the equity premium is one of the most important problems in empirical asset pricing. The dataset for this task is an updated version of the data used by Welch and Goyal (2007), who studied this question in detail. While several predictor variables have been proposed in the literature, there is a large degree of uncertainty and instability in estimates arising from equity premium forecasting models. See Gu et al. (2018) for a recent view on measuring asset risk premia.

Use the code provided to get started. Fit a linear regression and obtain confidence intervals for the coefficients based on the Central Limit Theorem. Write functions that build confidence intervals for linear regression coefficients by bootstrapping the residuals and by bootstrapping the observations. Compute the bootstrap confidence intervals based on the data. Compare the results.
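As an illustration of the two resampling schemes, the sketch below builds percentile intervals by bootstrapping the residuals and by bootstrapping the observations (pairs). Several choices here are assumptions rather than requirements of the task: the function names, the use of percentile intervals (other bootstrap intervals are possible), `np.linalg.lstsq` standing in for the regression package, and the simulated data.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients; stands in for a regression package."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def bootstrap_ci(X, y, method="residuals", n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for the regression coefficients.

    method="residuals": keep X fixed, resample residuals onto fitted values.
    method="pairs":     resample whole observations (rows of X and y).
    Returns an array with one row per coefficient: [lower, upper].
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    beta_hat = ols(X, y)
    fitted = X @ beta_hat
    resid = y - fitted
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        if method == "residuals":
            draws[b] = ols(X, fitted + resid[idx])
        elif method == "pairs":
            draws[b] = ols(X[idx], y[idx])
        else:
            raise ValueError("method must be 'residuals' or 'pairs'")
    lower = np.quantile(draws, alpha / 2, axis=0)
    upper = np.quantile(draws, 1 - alpha / 2, axis=0)
    return np.column_stack([lower, upper])


# Simulated data (an assumption, not the equity premium dataset)
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

ci_resid = bootstrap_ci(X, y, method="residuals")
ci_pairs = bootstrap_ci(X, y, method="pairs")
print("residual bootstrap:\n", ci_resid)
print("pairs bootstrap:\n", ci_pairs)
```

Resampling residuals treats the predictors as fixed and assumes exchangeable errors, while resampling pairs does not, so the two sets of intervals can differ noticeably when the errors are heteroskedastic.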

2 Rules

The code for bootstrapping the regression must be your own work. You can and should use a package to fit the linear regression.

3 Rubric

You will get full marks if you follow the instructions and obtain the correct confidence intervals.

References

Gu, S., B. T. Kelly, and D. Xiu (2018). Empirical asset pricing via machine learning.

Welch, I. and A. Goyal (2007). A comprehensive look at the empirical performance of equity premium prediction. The Review of Financial Studies 21 (4), 1455–1508.

Homework Task 4: Additive Models

1 Task description

In this task you will work with the California Housing Dataset. Our objective is to model the average house price in a geographic area as a function of characteristics of the housing stock, demographics, and location. An interesting feature of this dataset is the presence of spatial information: we have the latitude and longitude for each census tract.

The accompanying Jupyter Notebook provides a standard analysis. We find that the gradient boosting method has substantially better generalisation performance than a linear regression, which is not surprising since the latter cannot account for spatial effects.

Your task is to implement a Python class that estimates the following model:

$$Y_i = \beta_0 + \sum_{j=1}^{6} \beta_j x_{ij} + g(\text{latitude}_i, \text{longitude}_i) + \varepsilon_i,$$

where $g$ is an unknown function. Fit the model and compare the test performance to that of the benchmark models.

The estimation will probably be very slow to run, maybe hours. Test your code with a small sample, and use the print function to keep track of the progress.

A generalised additive model can outperform gradient boosting for this problem. However, to obtain this result you would need to estimate a model of the type

$$Y_i = \beta_0 + \sum_{j=1}^{6} f_j(x_{ij}) + g(\text{latitude}_i, \text{longitude}_i) + \varepsilon_i.$$

You can also try a neural network for your own knowledge and practice.
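One possible way to estimate the model with linear terms plus an unknown spatial function is backfitting: alternate between a least-squares fit of the linear part and a two-dimensional smoother applied to the partial residuals. The sketch below is only an assumption about how this could be done. The handout does not say which estimator the scaffold expects, and the Gaussian kernel smoother, the fixed bandwidth, the function names, and the simulated data are all illustrative choices.

```python
import numpy as np


def kernel_smooth(coords, values, bandwidth):
    """Nadaraya-Watson smoother with a Gaussian kernel, evaluated at the
    training locations. Builds an n-by-n weight matrix, so it is only
    practical for small samples."""
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    return weights @ values / weights.sum(axis=1)


def backfit_partially_linear(X, coords, y, bandwidth=0.1, n_iter=10, verbose=False):
    """Backfitting for y = b0 + X b + g(coords) + error.

    Returns the coefficient vector (intercept first) and the fitted
    values of g at the training locations.
    """
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]  # start from plain OLS
    g = np.zeros(n)
    for it in range(n_iter):
        # Smooth the residuals of the linear part over space
        g = kernel_smooth(coords, y - Z @ beta, bandwidth)
        g -= g.mean()  # centre g so that the intercept stays identified
        # Refit the linear part with the spatial component removed
        beta = np.linalg.lstsq(Z, y - g, rcond=None)[0]
        if verbose:
            rss = np.sum((y - Z @ beta - g) ** 2)
            print(f"iteration {it + 1}: RSS = {rss:.4f}")
    return beta, g


# Simulated data on the unit square (an assumption, not the housing data)
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 2))
coords = rng.uniform(size=(n, 2))
g_true = np.sin(2 * np.pi * coords[:, 0]) * np.cos(2 * np.pi * coords[:, 1])
y = 1.0 + X @ np.array([2.0, -1.0]) + g_true + 0.1 * rng.normal(size=n)

beta, g = backfit_partially_linear(X, coords, y, verbose=True)
print("estimated coefficients:", beta)
```

Predicting on test data would additionally require evaluating the smoother at new locations, and the bandwidth would normally be chosen by validation; both are left out of this sketch.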

2 Rules

The code for fitting the generalised additive model must be your own work.

3 Rubric

You will get full marks if you follow the instructions and fit the model correctly, as measured by performance on the test data.
