STAT 3280 Spring 2020 HW3
Due on Apr 12.
Submit your homework by sending it to both tianxili@virginia.edu and rm9dd@virginia.edu, with the subject “STAT 3280-Section#-HW3: names”, where the section number is “001” if you are in the 11-12:15 section and “002” if you are in the 12:30-1:45 section, and “names” is replaced by the last name(s) of your group members. Each group only has to submit once, but make sure you include everyone’s name. Please use a separate page for each problem. The answer to each problem cannot be longer than one page (with reasonable font size, line spacing, margins, etc.). You can explain how you did it in R by submitting your code with detailed explanations, but only include this part in an appendix. For this homework, you can use whatever R packages you want.
Due to the recent COVID-19 situation, I have decided to replace some of the problems I originally had in mind with a new problem on analyzing COVID-19 data. I strongly encourage you to work on the problems with careful thinking and sufficient effort. For that reason, I have reduced the number of problems in this homework (I originally planned for 5, but now there are only 3). For Q1, you are supposed to give a single figure. For each of Q2 and Q3, you can use up to 2 pages for your analysis and add text (including paragraphs) to explain your results. However, your component cannot exceed one page for each of Q2 and Q3.
I will include both Q2 and Q3 as options for the presentation, and you can select either one of the two to present (but you need to work on both for your submission). We will work out a plan for the presentation in this remote teaching scenario later on. The presentation will count for 5 points. Q1 will be 5 points, and Q2 and Q3 will be 15 points each. Together with the presentation, you will have 40 points for HW3. I strongly recommend starting early on this homework, especially because, for the foreseeable future, most of you will be working remotely with your team members. The deadline for HW3 is a hard deadline. We need to reserve enough time for HW4 and the group presentation, and there is no way we can extend it.
1. (5 pts) We will introduce how to generate the US airline plot in class. Now, based on the larger data set in the airport folder, generate the global airline network plot, similar to the one in the slides.
2. (15 pts) In the folder “statisticians”, you can find data about statisticians’ publications in 4 journals during 2002-2012. Look at the ReadMe file in the folder to understand the data set. You can also explore the paper at https://arxiv.org/abs/1410.2840 for a detailed analysis if you want. We will explore the data set a bit in class and test basic visualizations. You are supposed to extend the analysis as follows:
1. Use the abstract keywords and visualize the keywords over time. Do you observe any trend?
2. Ego network exploration: pick one of the few statisticians with a large number of citations and/or collaborations.
(a) Visualize how his/her citation network changes over time. Notice that it makes more sense to visualize it in a cumulative way (e.g., in 2010, you take all citation relations he/she had before 2010). Highlight and comment on the expansion pattern.
(b) Similarly, visualize his/her collaboration network change over time. Highlight and comment on the expansion pattern. Comment on the difference between this one and the citation networks.
(c) Visualize the keywords from his/her articles in the data set. Is the trend for this person similar to the global trend?
(d) Using Google Scholar, find all the paper titles under his/her name. Again, visualize the word cloud from the titles over time, for the person’s whole academic career. Does the period of 2002-2012 seem to match what you observe from (c)? If not, give a brief intuitive explanation. (Hint: you do not have to segment the data by each year, if you feel that is too long or you do not have enough data for each year. It is fine to aggregate your data across several years.)
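The cumulative snapshots described in 2(a) and 2(b) can be sketched as follows. This is only a minimal illustration in Python with made-up records; the same filtering logic carries over to R. The record layout (year, citing author, cited author), the author labels, and the function name `cumulative_ego_edges` are assumptions for illustration and are not taken from the data set. Whether the cut-off year itself is included is also an assumption; this sketch includes it.

```python
# Hypothetical toy records: (year, citing_author, cited_author).
# The labels and years are made up for illustration only.
CITATIONS = [
    (2003, "B", "A"),
    (2005, "C", "A"),
    (2005, "B", "C"),  # does not involve the ego "A"
    (2008, "A", "D"),
    (2010, "E", "A"),
]


def cumulative_ego_edges(records, ego, years):
    """For each cut-off year, return the set of (citing, cited) pairs
    that involve `ego` and were recorded in that year or earlier."""
    snapshots = {}
    for cutoff in years:
        snapshots[cutoff] = {
            (src, dst)
            for year, src, dst in records
            if year <= cutoff and ego in (src, dst)
        }
    return snapshots


snapshots = cumulative_ego_edges(CITATIONS, ego="A", years=[2004, 2008, 2012])
for cutoff, edges in snapshots.items():
    # Each snapshot contains every earlier snapshot, so the network only grows.
    print(cutoff, sorted(edges))
```

Each snapshot can then be drawn as its own panel, so the expansion of the network is visible from one cut-off year to the next.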
3. (15 pts) In class, we will extract the latest coronavirus incident data from Johns Hopkins University’s Center for Systems Science and Engineering. There are several data science tools around the world following the progress of this disease, as well as an R package, coronavirus: The 2019 Novel Coronavirus COVID-19 (2019-nCoV), that provides a daily summary of the Coronavirus (COVID-19) cases by state/province. We will also learn the join operation in class, as preparation for databases later on.
1. In class, we will visualize the number of confirmed cases over time for countries. Generate a graph to visualize infected per million (instead of total infected number). The World Bank is a tremendous source of global socio-economic data and can be accessed via the R package wbstats. Have a look at their HTML on CRAN.
2. Similar to (1), visualize the ratio of deaths to recovered cases over time for different countries.
3. (Only for presentation) As an extension, give an animated illustration of the results in (1) and (2).
4. Imagine you are part of a working group that is to provide the US government with recommendations on various policies or strategies to face the challenge of the virus. Is the current spread of the virus in the US still under control? If not, what else should be done? Would a travel ban (international, domestic, or both) be effective? Would the lockdown strategy of China be effective? As part of the team, your task is to use the data provided by Johns Hopkins University to generate a few key visualizations that will contribute towards making this decision. Please justify why these graphics are useful and suggest the potential “decision” they may lead to. Hints: you might start by checking the following aspects of the data
• Are there certain states in the US that are more at risk compared to others?
• Compare the pattern or rate of increase in different countries, in different periods.
• Compare the cross-over point between the number of recovered and active cases.
• Examine the geographical distribution of confirmed cases around the world.
• Perhaps use some external information. For example, you might want to learn the overall categories of strategies used by other countries so far, especially those countries whose status is ahead of the US. As another example, the openflight dataset you use for Q1 may provide helpful information about international travel flows between countries, and thus give you some information about the impact of travel.
Remark: note that to rigorously discover the true effects of a policy, the only way is causal inference. However, in practice, causal inference has too many strong requirements and may not be feasible for such emergencies. Therefore, exploring such observational data is usually the only option for data scientists. There are hundreds of teams working intensively on evaluating various strategies. For example, a recent paper published in Science uses statistical infection models to evaluate the potential effects of travel restrictions within China and out of China. URL: https://science.sciencemag.org/content/early/2020/03/05/science.aba9757. In this problem, you do not need to use such advanced methods. Exploring your data through carefully designed operations can already give you very insightful findings.
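The arithmetic behind 3.1 and 3.2 (join the case counts with population, then compute cases per million and the death-to-recovered ratio) can be sketched as below. This is a minimal Python illustration with made-up numbers, not real data; in R the same step would typically be a join followed by a computed column. The field names, the country labels, and the function name `add_rates` are assumptions for illustration, as is the choice to drop countries without a population figure and to leave the ratio undefined when no recoveries are recorded.

```python
# Hypothetical toy inputs; all figures are made up for illustration.
cases = [
    {"country": "X", "date": "2020-03-01", "confirmed": 500, "deaths": 10, "recovered": 40},
    {"country": "X", "date": "2020-03-02", "confirmed": 800, "deaths": 12, "recovered": 0},
    {"country": "Y", "date": "2020-03-01", "confirmed": 50, "deaths": 1, "recovered": 5},
    {"country": "Z", "date": "2020-03-01", "confirmed": 7, "deaths": 0, "recovered": 1},
]
population = {"X": 10_000_000, "Y": 2_000_000}  # no entry for "Z"


def add_rates(case_rows, pop_by_country):
    """Join case rows with population by country (inner join) and add
    confirmed cases per million and the death-to-recovered ratio."""
    enriched_rows = []
    for row in case_rows:
        pop = pop_by_country.get(row["country"])
        if pop is None:
            continue  # inner join: skip countries without a population figure
        enriched = dict(row)
        enriched["confirmed_per_million"] = row["confirmed"] * 1_000_000 / pop
        # The ratio is undefined when no recoveries are recorded yet.
        enriched["death_recovered_ratio"] = (
            row["deaths"] / row["recovered"] if row["recovered"] > 0 else None
        )
        enriched_rows.append(enriched)
    return enriched_rows


rates = add_rates(cases, population)
for r in rates:
    print(r["country"], r["date"], r["confirmed_per_million"], r["death_recovered_ratio"])
```

Normalizing by population makes countries of very different sizes comparable on one axis, and flagging the zero-recovery case explicitly avoids a misleading spike in the ratio early in an outbreak.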