East China Normal University Final Project, Second Semester of the 2019-2020 Academic Year. Course: Geographic Modelling and Geocomputation
Please submit your take-home exam online, on the Daxia Xuetang platform (during week 16), by the end of 30 June 2020.
Please complete the take-home examination individually. In all cases, summarize your results and discussion in a Word document, giving brief, to-the-point answers to each question with the main ideas and major steps identified, and submit the Word document online together with the code file.
If you need to modify your answer after the first submission, you may resubmit the examination once. The score will be based on the latest submission.
The take-home exam contains three parts. You are required to answer the questions in all parts.
If you complete the take-home exam in English, you will receive 5 bonus points. The total score will not exceed 100 points.
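Submissions made after the deadline (including second submissions) will be penalized accordingly. No points are deducted for submissions up to 30 minutes late; 5 points are deducted for 30 minutes to 6 hours late, 10 points for 6 to 12 hours late, and 15 points for 12 to 24 hours late. A submission more than 24 hours late is treated as forfeiting the exam and receives no points.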
In this study, we evaluate how the El Niño-Southern Oscillation (ENSO) affects austral summer rainfall. The summer season in the Southern Hemisphere is December-January-February (DJF).
In this exam, we focus on a single station to explore appropriate models. The summer total rainfall data come from a high-quality gauge in Australia and are provided in the file rain.039037.summer.txt. The first two columns are the start and end dates of the summer season, and the third column is the summer total rainfall in mm.
Several indices measure the intensity of ENSO. The one used here is the Southern Oscillation Index (SOI). The DJF-averaged SOI is provided in the file SOI.csv. The first column gives the year in which the January of each season falls.
Our objective is to find an appropriate model describing how the DJF-averaged SOI affects the summer total rainfall at this station. Such a model might later be extended to other stations.
Part I. Data Characteristics (25 points)
1. The file “rain.039037.summer.txt” gives the summer total rainfall, from which the summer average daily rainfall is easy to calculate. If we had to choose one of the two, summer average daily rainfall or summer total rainfall, as the target variable for modelling, which do you consider more scientifically reasonable? Give your reason. Hint: you may consider the influence of leap years. (5 points)
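Illustration of the leap-year hint (not a model answer): the length of a DJF season can be computed from calendar dates. The helper name djf_days is hypothetical, and its argument is assumed to be the year in which the January of the season falls.

```r
# Hypothetical helper: number of days in the DJF season whose January falls in `year`.
djf_days <- function(year) {
  start <- as.Date(sprintf("%d-12-01", year - 1))  # 1 December of the previous year
  end   <- as.Date(sprintf("%d-03-01", year))      # first day after the season
  as.integer(end - start)                          # Date difference is in days
}

djf_days(2019)  # 90: February 2019 has 28 days
djf_days(2020)  # 91: February 2020 has 29 days
```

The same total rainfall therefore corresponds to a slightly different daily average depending on whether the season contains 29 February.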
2. Whatever your answer to the first question, we will work with the summer total rainfall in this exam. Let y denote the time series of the summer total rainfall, and x the time series of the DJF-averaged SOI. Discuss numerically and graphically whether x and y are likely to be normally distributed. You may consider the functions summary(), qqnorm() and qqline(). (5 points)
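Illustration (synthetic data, not the exam data): a minimal sketch of how the three functions named above can be combined. The variable z is a made-up right-skewed sample used only to show the calls.

```r
set.seed(1)
z <- rgamma(100, shape = 2, scale = 50)  # synthetic stand-in, NOT the rainfall data

summary(z)  # a mean clearly above the median hints at right skew

# Points that bend away from the reference line indicate departure from Normality
qqnorm(z, main = "Normal Q-Q plot (synthetic data)")
qqline(z, col = "red")
```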
3. Do you think that taking the log of y will bring it closer to a Normal distribution? Create a new variable log.y=log(y) and then repeat the same check as in Question 2. (5 points)
4. For x, y and log.y, use a statistical test to assess the hypothesis that the Normal distribution is appropriate for the data, at a significance level of your choice. State the hypotheses you are testing and explain the results of the test. (10 points)
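Illustration (synthetic data): the question does not prescribe a test. One common choice is the Shapiro-Wilk test, shapiro.test() in base R, sketched here on a made-up log-normal sample. The significance level 0.05 is an assumption for the example only.

```r
set.seed(2)
z <- rlnorm(80, meanlog = 5, sdlog = 0.6)  # synthetic log-normal sample, NOT the exam data

alpha <- 0.05  # significance level chosen for this illustration

# H0: the sample comes from a Normal distribution; H1: it does not.
test.raw <- shapiro.test(z)
test.log <- shapiro.test(log(z))

test.raw$p.value < alpha  # TRUE would mean H0 is rejected for z
test.log$p.value < alpha  # TRUE would mean H0 is rejected for log(z)
```

A p-value above the chosen level means only that Normality is not rejected, not that it has been demonstrated.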
Part II. Exploratory models (32 points)
In this part, we will try to build a linear model to describe the relationship between the summer total rainfall and ENSO.
5. Check the relationship between y and x using a scatterplot. Then fit a linear regression and describe its result. Add the regression line to the scatterplot. (10 points)
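Illustration (synthetic data): a minimal sketch of the scatterplot, regression and overlaid line using lm() and abline(). The variables x.demo and y.demo and all numeric values are made up for the example.

```r
set.seed(3)
x.demo <- rnorm(60, 0, 8)                       # synthetic predictor
y.demo <- 300 + 10 * x.demo + rnorm(60, 0, 80)  # synthetic response

fit <- lm(y.demo ~ x.demo)
summary(fit)  # coefficients, standard errors, p-values, R-squared

plot(x.demo, y.demo, xlab = "x", ylab = "y")
abline(fit, col = "blue")  # adds the fitted regression line
```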
6. Is the fit of this linear regression adequate? Justify your answer. (5 points)
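Illustration (synthetic data): adequacy is usually judged from the residuals. The sketch below shows two common diagnostic plots, assuming a made-up skewed response; it is one possible approach, not the required one.

```r
set.seed(4)
x.demo <- rnorm(60, 0, 8)
y.demo <- exp(5.5 + 0.03 * x.demo + rnorm(60, 0, 0.4))  # synthetic, skewed response

fit <- lm(y.demo ~ x.demo)
res <- residuals(fit)

par(mfrow = c(1, 2))
# Residuals vs fitted values: look for curvature or a changing spread
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Q-Q plot of residuals: checks the Normal-error assumption
qqnorm(res)
qqline(res)
par(mfrow = c(1, 1))
```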
7. Replace y by log.y, then repeat Questions 5 and 6. (5 points)
8. Describe the model outcomes of Question 7, i.e. describe the relationship between y and x implied by the model. (5 points)
9. If we have the following two models for y, which one is more appropriate? Explain why. (7 points)
a. y = exp(α0 + α1 x + ε)
b. y = exp(α0 + α1 x) + ε
where α0 and α1 are the intercept and slope, and ε is the random error, which follows a Normal distribution N(0,σ) with standard deviation σ.
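Illustration (synthetic data): the two ways the error ε can enter an exponential model, inside or outside exp(), can be compared by simulation. The values of a0, a1 and sigma are arbitrary and chosen only for the example.

```r
set.seed(5)
a0 <- 5.5; a1 <- 0.03; sigma <- 0.5  # illustrative values only
x.demo <- rnorm(1000, 0, 8)
eps <- rnorm(1000, 0, sigma)

y.inside  <- exp(a0 + a1 * x.demo + eps)  # error inside exp(): multiplicative on y
y.outside <- exp(a0 + a1 * x.demo) + eps  # error outside exp(): additive on y

all(y.inside > 0)  # always TRUE, because exp() is strictly positive

# With the error inside exp(), log(y) is linear in x with Normal errors
sd(log(y.inside) - (a0 + a1 * x.demo))  # close to sigma
```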
Part III. Asymmetric models (43 points)
Some scientists have found that the relationship between ENSO and rainfall is not symmetric, which means that the relationship can differ between the El Niño phase and the La Niña phase. To simplify the problem, we treat positive SOI (SOI≥0) as the La Niña phase and negative SOI (SOI<0) as the El Niño phase. We will build the model separately for the El Niño and La Niña phases. Figure 1(b) illustrates the asymmetric model, i.e. we obtain two slopes, one per phase, instead of one.
Figure 1: (a) Symmetric linear model; (b) asymmetric linear model
In this part, we still focus on log.y.
10. Separate the data into El Niño and La Niña phases, and carry out a Pearson correlation test between log.y and SOI for each phase. State the hypotheses you are testing and explain your findings. (5 points)
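Illustration (synthetic data): a sketch of splitting by the sign of SOI and running cor.test() in each phase. The variables soi.demo and logy.demo are made up, and the synthetic relationship was deliberately built to be asymmetric.

```r
set.seed(6)
soi.demo  <- rnorm(100, 0, 8)
logy.demo <- 5.5 + 0.04 * pmax(soi.demo, 0) + rnorm(100, 0, 0.4)  # synthetic

la.nina <- soi.demo >= 0  # SOI >= 0, as defined in the text
el.nino <- soi.demo < 0   # SOI < 0

# H0: the true correlation is zero; H1: it is not zero.
ct.nina <- cor.test(soi.demo[la.nina], logy.demo[la.nina], method = "pearson")
ct.nino <- cor.test(soi.demo[el.nino], logy.demo[el.nino], method = "pearson")

c(r = unname(ct.nina$estimate), p = ct.nina$p.value)
c(r = unname(ct.nino$estimate), p = ct.nino$p.value)
```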
11. Now fit linear regressions separately for the El Niño and La Niña phases. Are the slopes significantly different from zero? What does that mean in terms of the relationship between the summer total rainfall and ENSO? Put the regression lines of both phases on the same scatterplot. What final model do you suggest for the El Niño and La Niña phases? Hint: when suggesting the final model, think about what you should do if a slope is not significant. (15 points)
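Illustration (synthetic data): a sketch of fitting one regression per phase, reading the slope p-values, and drawing each line over its own phase only. All names and values are made up. The intercept-only fit at the end is just one possible way to handle a non-significant slope, not necessarily the expected answer.

```r
set.seed(7)
soi.demo  <- rnorm(100, 0, 8)
logy.demo <- 5.5 + 0.04 * pmax(soi.demo, 0) + rnorm(100, 0, 0.4)  # synthetic
d <- data.frame(soi = soi.demo, logy = logy.demo)

fit.nina <- lm(logy ~ soi, data = d, subset = soi >= 0)
fit.nino <- lm(logy ~ soi, data = d, subset = soi < 0)

# p-value of each slope (t-test of H0: slope = 0)
p.nina <- summary(fit.nina)$coefficients["soi", "Pr(>|t|)"]
p.nino <- summary(fit.nino)$coefficients["soi", "Pr(>|t|)"]

plot(d$soi, d$logy, xlab = "SOI", ylab = "log.y")
xs.nina <- c(0, max(d$soi))
xs.nino <- c(min(d$soi), 0)
lines(xs.nina, predict(fit.nina, data.frame(soi = xs.nina)), col = "blue")
lines(xs.nino, predict(fit.nino, data.frame(soi = xs.nino)), col = "red")

# One option when a slope is not significant: a constant (intercept-only) model
fit.nino.const <- lm(logy ~ 1, data = d, subset = soi < 0)
```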
12. When we put the regressions for the El Niño and La Niña phases together, we obtain the asymmetric linear model shown in Figure 1(b). We would normally need to check the goodness-of-fit, but let us skip this step here and assume that it is acceptable. The remaining question is whether the symmetric linear model built in Question 7 or the asymmetric linear model built in Question 11 is better. Here, we compare them using two measures, mean square error (MSE) and Akaike information criterion (AIC). (15 points)
a. Calculate the MSE for log.y using both symmetric and asymmetric linear models. According to MSE, which model is better?
b. Calculate the AIC for log.y for both models. According to AIC, which model is better? Hint: when calculating the AIC for the asymmetric linear model, you may need to go back to the formula and think about how to obtain the likelihood for log.y.
c. Suggest a final model for log.y, and a corresponding final model for y.
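Illustration (synthetic data): a sketch of computing MSE and a Gaussian AIC from pooled residuals. The helper gauss.aic is hypothetical. It assumes a single error variance shared by both phases, estimated by maximum likelihood (RSS/n), and the parameter count of four coefficients assumes both phases keep an intercept and a slope. A different final model would change that count.

```r
set.seed(8)
soi.demo  <- rnorm(100, 0, 8)
logy.demo <- 5.5 + 0.04 * pmax(soi.demo, 0) + rnorm(100, 0, 0.4)  # synthetic
d <- data.frame(soi = soi.demo, logy = logy.demo)

# Hypothetical helper: AIC = 2k - 2 log L for Normal errors with one shared sigma
gauss.aic <- function(res, n.coef) {
  sigma.hat <- sqrt(mean(res^2))  # maximum-likelihood estimate of sigma
  loglik <- sum(dnorm(res, mean = 0, sd = sigma.hat, log = TRUE))
  2 * (n.coef + 1) - 2 * loglik   # +1 because sigma is also estimated
}

fit.sym  <- lm(logy ~ soi, data = d)
fit.nina <- lm(logy ~ soi, data = d, subset = soi >= 0)
fit.nino <- lm(logy ~ soi, data = d, subset = soi < 0)

res.sym  <- residuals(fit.sym)
res.asym <- c(residuals(fit.nina), residuals(fit.nino))  # pooled over both phases

mse.sym  <- mean(res.sym^2)
mse.asym <- mean(res.asym^2)

aic.sym  <- gauss.aic(res.sym, n.coef = 2)
aic.asym <- gauss.aic(res.asym, n.coef = 4)  # two intercepts + two slopes (assumption)

all.equal(aic.sym, AIC(fit.sym))  # sanity check against R's built-in AIC
```

Because separate per-phase fits can never have a larger residual sum of squares than the single line, MSE alone always favours the asymmetric model or ties, while AIC adds a penalty for the extra parameters.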
13. In the last question, we want to see how random noise affects our results. Suppose that the variable ID holds your student number. Add some noise to log.y to create log.y2 using the following code:
set.seed(888); log.y2 = log.y +rnorm(length(log.y), 0, (ID %% 10)/10+0.1)
Repeat Question 11 with log.y2 instead of log.y. Do you still find a result similar to that for log.y? Describe possible reasons why you find a similar or a different result when noise is added to the original data. Hint: you may think about the signal-to-noise ratio. (8 points)
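Illustration: the standard deviation of the added noise depends only on the last digit of ID. The student number below is hypothetical and used purely to show the arithmetic.

```r
ID <- 10195901234                  # hypothetical student number, for illustration only
noise.sd <- (ID %% 10) / 10 + 0.1  # last digit 4 gives 0.4 + 0.1
noise.sd                           # 0.5

# Noise standard deviation for each possible last digit 0, 1, ..., 9
(0:9) / 10 + 0.1
```

The noise standard deviation therefore ranges from 0.1 to 1.0, so different students face different signal-to-noise ratios.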