Homework Assignment #4
Due Date: 11/17/21
Notes: Assignments are due at the start of class. Late assignments will not be accepted
unless you have communicated in advance with an appropriate reason (medical, conference, etc.).
1. COVID-19 Data
Analyze the data at https://github.com/nytimes/covid-19-data, which is a series of data files
with cumulative counts of coronavirus cases in the United States over time from the New
York Times.
(1) Please use an ARIMA model to analyze the national-level data (i.e., the us.csv file).
Using the most up-to-date data, what is the order of the model? Please visualize the
data and report the fitted model.
(2) Building on the first question, we now want to test the forecasting performance of the
ARIMA model. Please hold out the last 1 day, last 5 days, and last 10 days of data, respectively,
and use the remaining data to fit an ARIMA model (in total, 3 models will be fitted). Then
perform 1-step, 5-step, and 10-step forecasting using the corresponding fitted ARIMA model,
respectively. Report the forecasted values and the forecasting root mean squared errors.
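A minimal sketch of the hold-out scheme in (2), written in R. It makes several assumptions that are not part of the assignment: a simulated series stands in for the us.csv data, the ARIMA(1,1,0) order is fixed purely for illustration, and `holdout_rmse` is a hypothetical helper name. With a single held-out day, the RMSE reduces to the absolute forecast error.

```r
# Hold out the last h observations, fit on the rest, forecast h steps ahead,
# and score the forecasts with the root mean squared error.
# NOTE: holdout_rmse is a hypothetical helper; the order is an illustrative assumption.
holdout_rmse <- function(y, h, order = c(1, 1, 0)) {
  n <- length(y)
  train <- y[seq_len(n - h)]
  test <- y[(n - h + 1):n]
  fit <- arima(train, order = order)                 # base R (stats) ARIMA fit
  fc <- as.numeric(predict(fit, n.ahead = h)$pred)   # h-step-ahead point forecasts
  list(forecast = fc, rmse = sqrt(mean((fc - test)^2)))
}

# Simulated stand-in for a cumulative series (assumption, not the NYT data):
# an integrated AR(1) process.
set.seed(1)
y <- as.numeric(cumsum(arima.sim(model = list(ar = 0.5), n = 300)))

# One model per hold-out length: 1, 5, and 10 observations.
results <- lapply(c(1, 5, 10), function(h) holdout_rmse(y, h))
```

For the actual analysis, `y` would be replaced by the case counts read from us.csv, and the order would come from your own model selection in (1).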
(3) Describe your observations from the above analysis. Is ARIMA a good model for
this analysis? If not, please discuss how you could better model the data (there is no need to
implement the approaches you discuss).
2. Motor Cycle Data
Analyze the “motor cycle data” (use “library(MASS)”, then load “data(mcycle)”; the data
are x=times, y=accel). Use smoothing splines to fit the data (see the help page for
smooth.spline). Try different degrees of freedom (df) in [5, 20]. Find the optimal degrees of
freedom in [5, 10] according to the cross-validation criterion (in the function
“smooth.spline”, specify “cv=T”). What are the λ and cross-validation error of the best fit?
Please provide the following:
(1) A plot of the observation points and the optimal smoothing spline fit.
(2) A plot of the observation points and the three smoothing splines with df=5, 10, 15
(three different colored curves). Add a legend to identify these
curves.
(3) Plot the cross-validation errors against different df values from 5 to 20 (show both points and
lines). Use a step size of 0.5 for df. (Hint: from this plot you can find the optimal df.)
(4) Using the “wd” function in the library “wavethresh”, perform a wavelet analysis with any
wavelet basis and resolution for the first 128 points of “accel”. Then soft-threshold the
wavelet coefficients with the function “threshold”. Finally, reconstruct the profile from the thresholded
coefficients with “wr”. Compare the reconstructed profile with the spline profile in (1).
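The cross-validation sweep described in (3) can be sketched as follows. This is only an illustration under stated assumptions: the variable names `df_grid`, `cv_err`, and `best_df` are hypothetical, and the warning suppression is a convenience (mcycle contains tied time values, so smooth.spline warns when cv is requested). The fitted object also carries `lambda` and `df` components.

```r
library(MASS)   # provides the mcycle data frame (columns: times, accel)
data(mcycle)

# Candidate degrees of freedom from 5 to 20 in steps of 0.5.
df_grid <- seq(5, 20, by = 0.5)

# For each df, fit a smoothing spline with leave-one-out cross-validation
# and keep the CV criterion stored in the fit.
cv_err <- sapply(df_grid, function(d) {
  fit <- suppressWarnings(
    smooth.spline(mcycle$times, mcycle$accel, df = d, cv = TRUE)
  )
  fit$cv.crit
})

# The df value with the smallest CV error on this grid.
best_df <- df_grid[which.min(cv_err)]
```

Plotting `cv_err` against `df_grid` with `type = "b"` shows both points and lines.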
3. Functional Regression
An electrocardiogram (ECG) signal reflects the condition of the heart, and ECG sensors are often
worn during labor-intensive tasks. In this problem, by treating
the ECG signals as functional data, please extract relevant features and then use a
classification method to discriminate between normal and abnormal ECGs.
We will use a public dataset available at the UCR Time Series Classification Archive
(https://www.cs.ucr.edu/~eamonn/time_series_data_2018/). The training data set can be
found as ’ECG200TRAIN’, and the testing data set as ’ECG200TEST’. In the data, each
row is an observation. The first column is the class label, and the remaining columns are
measured ECG data.
Please extract functional features from the ECG signals using B-splines, and then classify the signals
as normal or abnormal with logistic regression. Report the classification
model and results.
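A minimal sketch of the B-spline feature idea, in R. Everything specific here is an assumption for illustration: the curves are simulated rather than read from the ECG200 files, the basis size (8 functions) is arbitrary, labels are coded 0/1, and the accuracy is computed on the same data used for fitting. Each curve is projected onto a common B-spline basis by least squares, and the resulting coefficients serve as features for logistic regression.

```r
library(splines)   # bs() builds a B-spline basis

# Simulated stand-in for the ECG curves (assumption): n curves on m time points,
# two classes that differ by a linear trend.
set.seed(1)
n <- 60
m <- 96
time_grid <- seq(0, 1, length.out = m)
label <- rep(c(0, 1), each = n / 2)
X <- t(sapply(label, function(cl) {
  rnorm(1, mean = 1, sd = 0.3) * sin(2 * pi * time_grid) +
    cl * 0.3 * time_grid + rnorm(m, sd = 0.5)
}))                                             # n x m matrix, one curve per row

# Common cubic B-spline basis with 8 basis functions (illustrative choice).
B <- bs(time_grid, df = 8, intercept = TRUE)    # m x 8 basis matrix

# Least-squares basis coefficients for every curve: one row of features per curve.
coefs <- t(qr.solve(B, t(X)))                   # n x 8 feature matrix

# Logistic regression on the B-spline coefficients.
dat <- data.frame(label = label, coefs)
fit <- glm(label ~ ., data = dat, family = binomial)
acc <- mean((predict(fit, type = "response") > 0.5) == label)
```

With the real data, the basis would be evaluated on the ECG sampling grid, the coefficients computed separately for the training and testing curves using the same basis, and the accuracy reported on the testing set.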