---
title: "Homework 3 - Report"
author: ""
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---
## Basic instructions
To obtain the maximum number of points, use whenever possible:

* the pipe operator `%>%`,
* `dplyr` verbs,
* `kable` to print tables,
* `ggplot` to produce the figures,
* `purrr`’s functions to minimize code duplication,
* `broom`’s functions to format the results of model fits.

Also:

* Pay attention to the general aesthetics of your report (i.e., formatting of
the tables, sizing of the figures, etc.).
* Use inline code to display single numbers, and `echo = FALSE` for code chunks when you display
a table or produce a figure (i.e., we don’t want to see code unless it is necessary).
* Minimize the amount of code duplication by using iteration concepts and list-columns.
* Comment your code to explain what you are doing.
<!-- The following code chunk loads the packages and dataset that you will need for this assignment: -->
```{r setup, echo = FALSE, warning = FALSE, message = FALSE}
library(knitr)
library(tidyverse)
library(lubridate)
library(modelr)
library(broom)
library(ggrepel)
temperature_data <- read_csv("/Users/spencer/Desktop/dsfba_homework3_skeleton/data/temperature_data.csv")
```
<!-- The following code chunk sets a few options for nice visualization: -->
```{r setup2, echo = FALSE, warning = FALSE, message = FALSE}
theme_set(theme_light())
opts_chunk$set(fig.width = 8,
               fig.asp = 0.618,
               out.width = "70%",
               fig.align = "center",
               fig.show = "hold",
               message = FALSE)
```
## The data
For this assignment, you will use temperature data provided by [Berkeley Earth](http://berkeleyearth.org/data/).
More specifically, you will use time series of average monthly air temperatures over land in every country between 1743 and today.
The dataset contains `r nrow(temperature_data)` observations and `r ncol(temperature_data)` variables (`r names(temperature_data)`).
## A note on time series
A time series is a series of data points indexed by (a specific) time.
For example, monthly time series are series with monthly data points. Such series can often be decomposed into three main components: a trend, which reflects the long-term behaviour of the series (e.g., a series has an upward trend if its values increase over time); a seasonal pattern, when some specific behaviour recurs every x days/months/years (e.g., sales data might show a seasonal pattern in December, when people buy Christmas presents); and a remaining pattern, not captured by the previous components.
Formally, if $y_t$ is the observation at time $t$, one can write $y_t=T_t + S_t + R_t$ where $T_t$ corresponds to the trend at time $t$, $S_t$ corresponds to the seasonality at time $t$ and $R_t$ to the remaining pattern.
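
To make the decomposition concrete, here is a minimal sketch on simulated data. All numbers (slope, amplitude, noise level) and variable names are made up for illustration and have nothing to do with the temperature dataset; a linear model with a time index and a month factor is only one of several ways to estimate $T_t$ and $S_t$.

```r
# Simulated monthly series y_t = T_t + S_t + R_t (all values are invented)
set.seed(1)
n_years  <- 10
time_idx <- seq_len(12 * n_years)                        # time index in months
month    <- factor(rep(month.abb, n_years), levels = month.abb)

trend    <- 0.02 * time_idx                              # T_t: slow linear increase
seasonal <- 8 * sin(2 * pi * (time_idx - 4) / 12)        # S_t: repeats every 12 months
noise    <- rnorm(length(time_idx), sd = 0.5)            # R_t: what is left over
y        <- trend + seasonal + noise

# Linear model with a time trend and one effect per calendar month
fit <- lm(y ~ time_idx + month)

coef(fit)["time_idx"]   # estimated slope of the trend
sd(residuals(fit))      # spread of the estimated remaining pattern
```

If the model captures the trend and the seasonality, the residuals should look like unstructured noise when plotted against time.
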
In this assignment, you will analyze time series and try to describe their features:
* You will start by building a simple model for the monthly average temperature of a single time series, and examine its characteristics and the effects (or patterns) it captures. Remember that looking at the residuals is often useful to assess whether the model captures the important features of the data.
* You will then apply the same simple model to every time series in the dataset.
## Wrangling and exploratory data analysis
__a__ Add two columns corresponding to the year and month.
Encode month as a factor and make sure that the factor is properly ordered
(i.e., January first and December last).
Furthermore:

* Filter your data to focus on the 20th and 21st centuries.
* Remove countries without any data over this period.

```{r}
## your code goes here
```
__b__ Provide a high-level description of the country and region variables. In particular, your answer must provide (and summarise) all the important information about these two variables for someone who would not have access to the data.
You can use either words, or tables, or figures, or a bit of each.
```{r}
## your code goes here
```
__c__ Produce a worldwide map of the temperature in March 2010, as well as another map
with the differences in temperatures between March 2010 and March 1900, and describe what you see.
Hint: ggplot2 includes two helpful commands for this part: `map_data()` to retrieve a map and `geom_map()` to draw a map on a plot. Use `expand_limits()` to make sure you display the whole map of the world.
```{r}
## your code goes here
```
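
As an illustration of how the functions named in the hint fit together, here is a minimal sketch that fills a world map with random numbers. The `toy` data frame and its `value` column are hypothetical, `map_data("world")` assumes that the `maps` package is installed, and the region names used by the map may not match the country names in the temperature dataset.

```r
library(ggplot2)  # map_data() additionally requires the `maps` package

world <- map_data("world")  # polygon outlines with columns long, lat, group, region, ...

# Hypothetical values, one per map region (random numbers, not real temperatures)
set.seed(1)
toy <- data.frame(region = unique(world$region))
toy$value <- runif(nrow(toy), min = -10, max = 30)

toy_map <- ggplot(toy, aes(map_id = region, fill = value)) +
  geom_map(map = world) +                          # match rows to polygons via map_id
  expand_limits(x = world$long, y = world$lat) +   # make the axes cover the whole world
  labs(fill = "Toy value")

toy_map
```

`geom_map()` only draws the regions present in the data, so `expand_limits()` is what guarantees that the axes span the full extent of the map.
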
## A model for a single time series
__d__ Plot the evolution of the monthly temperature in Switzerland.
Then propose a model for the average monthly temperature, display its predictions and
the residuals, and describe what you see.
```{r}
## your code goes here
```
__e__ Improve your model by including the effect of the year of observation,
display its predictions, the residuals, and describe what you see.
```{r}
## your code goes here
```
## Many models
__f__ Use nested data and list-columns to fit the same model to
every country in the dataset, as well as to add predictions and residuals for
each fitted model.
```{r}
## your code goes here
```
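
For readers unfamiliar with the pattern, the following sketch shows nested data and list-columns on a toy dataset. The `toy` data, the grouping variable `group`, and the helper `fit_one()` are all hypothetical; the sketch only illustrates the general mechanics with `tidyr::nest()`, `purrr::map()` and `modelr`, and is not meant as the expected model for the temperature data.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

# Toy data: two hypothetical groups with different linear relationships
set.seed(1)
toy <- tibble(
  group = rep(c("a", "b"), each = 20),
  x     = rep(1:20, times = 2),
  y     = if_else(group == "a", 1 + 2 * x, 5 - x) + rnorm(40, sd = 0.1)
)

# Hypothetical helper: fit the same model to any subset of the data
fit_one <- function(df) lm(y ~ x, data = df)

by_group <- toy %>%
  group_by(group) %>%
  nest() %>%                                      # one row per group, data in a list-column
  mutate(
    model = map(data, fit_one),                   # list-column of fitted models
    data  = map2(data, model, add_predictions),   # adds a `pred` column to each data frame
    data  = map2(data, model, add_residuals)      # adds a `resid` column to each data frame
  )

# Back to one row per observation, now with predictions and residuals
toy_results <- by_group %>%
  select(group, data) %>%
  unnest(data)

toy_results
```

Keeping the models in a list-column next to the data they were fitted on makes it easy to add further per-group summaries later (for example with `broom`) without duplicating code.
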
__g__ Describe the results obtained for each continent
(i.e., `country %in% c("Europe", "Asia", "Africa", "South America", "Oceania", "North America")`).
Emphasize the description of:

* the trend,
* the seasonal patterns,
* the residuals.

```{r}
## your code goes here
```
__h__ Describe the results obtained “per” country for each continent (you can discard countries in the “Other” category).
Emphasize the description of the trend and seasonal patterns. Then, assess the model quality/performance by using a measure of fit quality of your choice. In particular, this assessment can be done at the continent level (i.e., by averaging the $R^2$ values) and/or by country.
```{r}
## your code goes here
```