程序代写代做代考 # Overview

# Overview

The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data is updated on a monthly basis and available for download from [Melbourne Open Data Portal](https://data.melbourne.vic.gov.au/Transport/Pedestrian-Counting-System-2009-to-Present-counts-/b2ak-trbp). The **rwalkr** package provides an API in R to easily access sensor counts and geographic locations.

There are three parts to this work:

1. Data wrangling and data visualisation of the pedestrian data
2. Joining data together weather data
3. Performing preliminary modelling

“`{r pkgs, echo = FALSE}
library(broom)
library(ggmap)
library(knitr)
library(lubridate)
library(rwalkr)
library(sugrrants)
library(timeDate)
library(tsibble)
library(here)
library(readr)
library(tidyverse)
“`

# Part 1: Data wrangling and data visualisation of the pedestrian data

> Amelia: I’ve downloaded a map chunk of Melbourne. Can you take the map I made, and plot the location of the sensors on top of it? We want to be able to see all of the sensors, but we also want to create different shapes for the following sensors:

– “Southern Cross Station”
– “Melbourne Central”
– “Flinders Street Station Underpass”
– “Birrarung Marr”

First we download the data on the pedestrian sensor locations around Melbourne.

“`{r ped-sensor, echo = FALSE}
# only select active sensors
ped_loc <- pull_sensor() %>% filter(status == “A”)

selected_sensors <- c("Southern Cross Station", "Melbourne Central", "Flinders Street Station Underpass", "Birrarung Marr") # identify those sensors that are selected <- ped_loc %>% filter(sensor %in% selected_sensors)
nonselected <- ped_loc %>% filter(!(sensor %in% selected_sensors))
“`

And now we draw a plot on a map tile of the pedestrian sensor locations

“`{r ped-sensor2, echo = FALSE}
melb_map <- read_rds(here::here("data-raw/melb-map.rds")) ggmap(melb_map) + geom_point(data = nonselected, aes(x = longitude, y = latitude), colour = "#2b8cbe", alpha = 0.6, size = 3) + geom_point(data = selected, aes(x = longitude, y = latitude, colour = sensor), size = 4, shape = 17) + labs(x = "Longitude", y = "Latitude") + scale_colour_brewer(palette = "Dark2", name = "Sensor") + guides(col = guide_legend(nrow = 2, byrow = TRUE)) + theme(legend.position = "bottom") ``` ## Q1A Tell me what this map shows us? (0.5 Mark) !> Answer:

## Q1B Create a markdown table of the number of sensors installed each year (0.5 Mark)

“`{r install-each-year}

ped_loc %>%
# calculate the year from the date information
mutate(year = ___) %>%
# count up the number of sensors
___(___) %>%
# then use `kable()` to create a markdown table
kable()
“`

Additionally, how many sensors were added in 2016, and in 2017?

!> Answer:
> …

## Q1C Filter the data down to just look at the selected four sensors (1 mark)

We would like you to focus on the foot traffic at 4 sensors:

– “Southern Cross Station”
– “Melbourne Central”
– “Flinders Street Station Underpass”
– “Birrarung Marr”

We’ve already downloaded the data for you, so your task is to:

– Filter the data down to include only the four sensors above
– add variables that contain the day of the month, month, year, and day in the year.

“`{r bind-rows-filter}
walk_2018 <- read_rds( here::here("data-raw/walk_2018.rds") ) %>%
# Filter the data down to include only the four sensors above
filter(___) %>%
# now add four columns, containing month day, month, year, and day of the year
# using functions from lubridate.
mutate(mday = mday(___),
___)
“`

Now we can plot the pedestrian count information for January – April in 2018

“`{r plot-temporal-patterns}
ggplot(walk_2018,
aes(x = Date_Time,
y = Count)) +
geom_line(size = 0.3) +
facet_grid(Sensor ~ .,
# this code presents the facets ina nice way
labeller = labeller(Sensor = label_wrap_gen(20))) +
# this code mades the x axis a bit nicer to read
scale_x_datetime(date_labels = “%d %b %Y”,
date_minor_breaks = “1 month”) +
labs(x = “Date Time”)
“`

We can see that there are quite different patterns in each of the sensors. Let’s explore this further.

## Q1D How many types of activities might this capture at the four selected places? (0.5 marks)

!> Answer:

## Q1E Create a plot of the counts of each sensor over the day (2 Marks)

We’re primarily interested in exploiting pedestrian patterns at various time resolutions and across different locations. In light of people’s daily schedules, let’s plot the **counts** against **time of day** for each **sensor**.

“`{r ped-lineplots}
ggplot(___,
aes(x = ___,
y = ___,
group = ___,
colour = ___)) +
geom_line() +
facet_wrap(___,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = “Dark2”,
name = “Sensor”) +
theme(legend.position = “none”))

“`

Write a short paragraph that describe what the plot shows:

– What is plotted on the graph? What does each line represent?
– Is the data in these sensors similar, or different?
– Does each panel show the same trend within it, or is there variation?
– What do you learn?

!> Answer:

## Q1F Exploring non work days (2 marks)

Use the data inside the `hols_2018` data to identify weekdays and weekends, and holidays.

“`{r add-2018-holidays}
hols_2018 <- tsibble::holiday_aus(year = ___, state = "VIC") walk_2018_hols <- walk_2018 %>%
mutate(weekday = ___(___, label = TRUE, week_start = 1),
workday = if_else(
condition = ___ | ___,
true = ___,
false = ___
))
“`

Now create a plot to compare the workdays to the non workdays.

“`{r workdays-vs-non-work}
ggplot(___,
aes(___)) +
geom_line(size = 0.3,
alpha = 0.3) +
facet_grid(___,
labeller = labeller(Sensor = label_wrap_gen(20))) +
scale_colour_brewer(palette = “Dark2”, name = “Sensor”) +
theme(legend.position = “none”)
“`

Write a short paragraph that describe what the plot shows, and helps us answer these questions:

– What is plotted on the graph? What does each line represent?
– How are the data in these sensors similar or different?
– Does each panel show the same trend within it, or is there variation?
– What do you learn?

!> Answer:

## Q1G Calendar plot (0.5 mark)

To locate those unusual moments, Flinders Street Station data is calendarised on the canvas, using the **sugrrants** package. We can spot the unusual weekday patterns on public holidays using their color. Using the calendar plot, try to spot another unusual pattern, do a google search to try to explain the change in foot traffic. (Hint: Look at the plot carefully, does a particular day show a different daily pattern? Is there a specific event or festival happening on that day?)

“`{r calendar-plot}

# filter to just look at flinders st station
flinders <- walk_2018_hols %>% ___

flinders_cal <- flinders %>%
frame_calendar(x = ___, y = Count, date = ___)
gg_cal <- flinders_cal %>%
ggplot(aes(x = ___, y = ___, colour = ___, group = ___)) +
geom_line()
prettify(gg_cal) +
theme(legend.position = “bottom”)
“`

!> Answer:

# Part Two: Combining data sources

## Q2A We’ve got pedestrian count data for 2020, can you add that to the data? (2 marks)

You’ll need to ensure that you follow the steps we did earlier to filter the data and add the holiday information.

“`{r read the data and combine}
walk_2020 <- read_rds( here::here("data-raw/walk_2020.rds") ) %>%
filter(Sensor ___)
) %>%
# now add four using `mutate` columns which contain the day of the month, month, and year, and day of the year using functions from lubridate.
mutate(__)

“`

Now add the holiday data

“`{r read more data and combine}
# also the steps for adding in the holiday info
hols_2020 <- tsibble::holiday_aus(year = ___, state = "VIC") walk_2020_hols <- walk_2020 %>%
mutate(weekday = ___,
workday = ___)

melb_walk_hols <- bind_rows(___, ___) ``` ## Q2B There is some repetition in the code above from previous, describe two (or more) ways to limit that repetition, and demonstrate the use of this as a function (1 Mark) ```{r functions} filter_sensor <- function(data, sensors){ data %>% filter(Sensor %in% ___)
}

add_day_info <- function(data){ # now add four using `mutate` columns which contain the day of the month, month, and year, and day of the year using functions from lubridate. data %>%
mutate(___)
}

add_working_day <- function(data){ walk_years <- unique(data$___) hols <- tsibble::holiday_aus(year = walk_years, state = "VIC") data %>%
mutate(weekday = ___,
workday = if_else(
___,
___,
___
))
}
“`

“`{r applied function}

# Step one, combine the walking data
bind_rows(walk_2018, walk_2020) %>%
# Step two, filter the sensors
filter_sensor(sensors = ___) %>%
# step three, add the info on day of the year, month, etc
add_day_info() %>%
# strep four, add info on working days.
add_working_day()
“`

## Q2C For Flinders st, in the month of April, can you compare 2018 to 2020? (2 Marks)

Write a paragraph that describe what you learn from these plots. Can you describe any similarities, and differences amongst the plots, and why they might be similar or different? (You might need to play with the plot output size to clearly see the pattern)

“`{r flinders-april-over-years}
melb_walk_hols_flinders_april <- melb_walk_hols %>%
filter(___, ___)

ggplot(___,
aes(___,
colour = as.factor(year))) +
geom_line() +
facet_wrap(~ ___, ncol = 5) +
theme(legend.position = “bottom”) +
labs(colour = “Year”)
“`

!> Answer:

## Q2D Produce a similar plot (to 2C) that will also allow us to contrast the patterns across the four sensors. (2 marks)

What do you learn? Which Sensors seem the most similar? Or the most different?

“`{r all-senesors}
melb_walk_hols_april <- melb_walk_hols %>% filter(month == ___)

ggplot(___,
aes(___) +
___ +
facet_grid(___) +
theme(legend.position = “bottom”) +
labs(colour = “Year”)

“`

!> Answer:

**Combining weather data with pedestrian counts**

One question we want to answer is: “Does the weather make a difference to the number of people walking out?”

Time of day and day of week are the predominant driving force of the number of pedestrian, depicted in the previous data plots. Apart from these temporal factors, the weather condition could possibly affect how many people are walking in the city. In particular, people are likely to stay indoors, when the day is too hot or too cold, or raining hard.

Daily meteorological data as a separate source, available on [National Climatic Data Center](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/), is used and joined to the main pedestrian data table using common dates.

Binary variables are created to serve as the tipping points

We have pulled information on weather stations for the Melbourne area – can you combine it together into one dataset?

## Q2E Read in and add flagging variables to the weather data (2 marks)

– high_prcp if `prcp` > 5 (if yes, “rain”, if no, “none”)
– high_temp if `tmax` > 33 (if yes, “hot”, if no, “not”)
– low_temp if `tmin` < 6 (if yes, "cold", if no, "not") ```{r join weather-data} # Now create some flag variables melb_weather_2018 <- read_csv( here::here("data-raw/melb_ncdc_2018.csv") ) %>%
mutate(
high_prcp = if_else(condition = prcp > 5,
true = ___,
false = ___),
high_temp = if_else(condition = ___
true = ___,
false = ___),
low_temp = ___
)
)
“`

## Q2F summarise the pedestrian count data and join it to the weather data (2 marks)

The weather data is per day, and the pedestrian count data is every hour. One way to explore this data is to collapse the pedestrian count data down to the total daily counts, so we can compare the total number of people each day to the weather for each day. This means each row is the total number of counts at each sensor, for each day.

Depending on how you do this, you will likely need to merge the pedestrian count data back with the weather data. Remember that we want to look at the data for 2018 only

“`{r summarise-sensor-weather-data}
melb_daily_walk_2018 <- melb_walk_hols %>%
filter(___) %>%
group_by(Sensor, Date) %>%
summarise(___) %>%
ungroup()

melb_daily_walk_weather_2018 <- melb_daily_walk_2018 %>%
___(___, by = ___)

melb_daily_walk_weather_2018
“`

## Q2H Exploring sensor count against weather flagging variables (4 marks)

Create a few plots that look at the spread of the daily totals for each of the sensors, according to the weather flagging variables (`high_prcp`, `high_temp`, and `low_temp`).
Write a paragraph that tells us what you learn from these plots, how you think weather might be impacting how people go outside. Make sure to discuss the strengths and limitations of the plots summarised like this, what assumption do they make?

“`{r plot-one}
# Plot of count for each sensor against high rain
ggplot(melb_daily_walk_weather_2018,
aes(y = Count,
x = Sensor,
colour = ___)) +
geom_boxplot() +
theme(legend.position = “bottom”)
“`

“`{r plot-two}
# Plot against high temperature
ggplot(___,
aes(y = ___,
x = ___,
colour = ___)) +
___
“`

“`{r plot-three}
# Plot of low temperature
ggplot(___) +
___

“`

!> Answer:

## Q2F Combine the weather data with the pedestrian count data (2 marks)

The visualisations tell us something interesting about the data, but to really understand the data, we need to perform some modelling. To do this, you need to combine the weather data with the pedestrian data. We have provided the weather data for 2018 and 2020, combine with the pedestrian data for 2018 and 2020.

“`{r combine weather-data}

melb_weather_2018 <- read_csv(here::here("data-raw/melb_ncdc_2018.csv")) melb_weather_2020 <- read_csv(here::here("data-raw/melb_ncdc_2020.csv")) # task: combine the weather data together into an object, `melb_weather` melb_weather <- bind_rows(___, ___) # remember to add info about high precipitation, high temperature, + low temps mutate( high_prcp = ___, high_temp = ___, low_temp = ___ ) # now combine this weather data with the walking data melb_walk_weather <- melb_walk_hols %>%
___(melb_weather, by = ___)

“`

# Part 3: Modelling Count data

We have been able to start answering the question, “Does the weather make a difference to the number of people walking out?” by looking at a few exploratory plots. However, we want to get a bit more definitive answer by performing some statistical modelling.

We are going to process the data somewhat so we can fit a linear model to the data. First, let’s take a set relevant variables to be factors. This ensures that the linear model interprets them appropriately.

We also add one to count and then take the natural log of it. The reasons for this are a bit complex, but essentially a linear model is not the most ideal model to fit for this data, and we can help it be more ideal by taking the log of the counts, which helps stabilise the residuals (predictions – observed) when we fit the model.

“`{r process-walking-data}
melb_walk_weather_prep_lm <- melb_walk_weather %>%
mutate_at(.vars = vars(Sensor,
Time,
month,
year,
workday,
high_prcp,
high_temp,
low_temp),
as_factor) %>%
mutate(log_count = log1p(Count))
“`

## Q3A: Fit a linear model (1 mark)

Now we fit a linear model, predicting logCount using Time, Month, weekday and the weather flag variables (`high_prcp`, `high_temp`, and `low_temp`)

“`{r fit-lm}
walk_fit_lm <- lm ( formula = log_count ~ Time + ___ + ___ + ___ + ___ + ___, data = ___ ) ``` ## Q3B: Evaluate the model fit statistics (1 mark) Provide some summary statistics on how well this model fits the data? What do you see? What statistics tell us about our model fit? ```{r glance-fit-lm} glance(___) ``` !> Answer:

## Q3C: Look at the model predictions (1 mark)

We have had one look at the model fit statistics, but let’s now look at the fitted and observed (`log_count`) values, for each sensor:

“`{r aug-broom-plot}
peds_aug_lm <- augment(walk_fit_lm, data = melb_walk_weather_prep_lm) ggplot(peds_aug_lm, aes(x = ___, y = ___)) + geom_point() + facet_wrap(~___) ``` There is actually a lot of variation. Looking at this, you might assume that the model does a bad job of fitting the residuals. However, we must remember that there is an inherent time structure to the data. A better way to explore this is to look directly at the temporal components. We can do this directly with a calendar plot. ## Q3D: Arrange the data so we can examine predictions (1 Mark) Before we can get the data into the right format for analysis, we need to pivot the data into a longer format, so that we have columns of Date, Time, Count, Model, and log_count. ```{r calendar-fits} flinders_lm <- peds_aug_lm %>%
# filter the data to only look at flinder st
___() %>%
# Select the Date, Time, Count, .fitted, and log_count
___() %>%
# Now pivot the log count and fitted columns into a column called “model”
# data so we have columns of Date, Time, Count, Model,
# and log_count.
pivot_longer(___) %>%
# Now we’re going to undo the intiial data transformation
mutate(Count = expm1(log_count))
“`

## Q3F: Plot the observed and fitted values in a calendar plot (3 Marks)

“`{r calendar-fits-observed}
flinders_lm_cal <- flinders_lm %>%
# Let’s just look at 2020 to make it a bit easier
filter(year(Date) == “2020”) %>%
frame_calendar(x = ___, y = ___, date = ___)

gg_cal <- ggplot(___) + # Part of the interface to overlaying multiple lines in a calendar plot # is drawing two separate `geom_line`s - # See https://pkg.earo.me/sugrrants/articles/frame-calendar.html#use-in-conjunction-with-group_by # for details geom_line(data = filter(flinders_lm_cal, model == ___), aes(x = .Time, y = .Count, colour = model, group = Date)) + geom_line(data = filter(flinders_lm_cal, model == ___), aes(x = .Time, y = .Count, colour = model, group = Date)) prettify(___) + theme(legend.position = "bottom") ``` Write a paragraph answering these questions: - What do you see in the difference between the fitted and the observed (log_count)? - What is a difference between 2020 and 2018? - Is there a difference across sensors? - What other variables might you consider adding to the model, and why? !> Answer:

## Q3G: Fit one more (or more!) models to explore and compare (2 marks)

What sort of improvements do you think you could make to the model? Fit two more models, Try adding more variables to the linear model.

“`{r fitting-a-few-models}
walk_fit_lm_*** <- lm( ___ ) walk_fit_lm_*** <- lm( ___ ) ``` Why did you add those variables? !> Answer:

## Q3H Compare the model fit statistics (2 marks)

“`{r fitting-a-few-more-models}
bind_rows(
first = glance(___),
sensor = ___,
year = ___,
.id = “type”
)
“`

Which model does the best? Why do you think that is? What statistics are you basing this decision off?

!> Answer:

## Q3I Recreate the same calendar plot for your best model and explore the model fit. (2 Marks)

(Suggestion – Perhaps write this as a function to speed up comparison)

“`{r recreate-calednar}
peds_aug_lm_sensor_*** <- augment(___, ___) ``` ```{r recreate-calendar-function-optional} pivot_sensor <- function(lm_fit, sensor = "Flinders Street Station Underpass"){ lm_fit %>%
filter(Sensor == sensor) %>%
select(___) %>%
pivot_longer(___) %>%
mutate(Count = expm1(log_count))
}

calendar_fit_obs <- function(lm_fit_aug){ data_cal <- lm_fit_aug %>%
frame_calendar(___)

gg_cal <- ggplot(___) + geom_line(data = filter(___), aes(___)) + geom_line(data = filter(___), aes(___)) prettify(___) + theme(legend.position = "bottom") } pivot_sensor(___, ))) %>%
filter(year(Date) == “2020”) %>%
calendar_fit_obs()

“`

What do you see? How does it compare to the previous model?

!> Answer:

## Q3J Compare model residuals against the fit (3 marks)

Compare the fitted against the residuals, perhaps write a function to help you do this in a more readable way.

“`{r fitting-even-more-models}
walk_fit_lm_*** <- augment(___, ___) walk_fit_lm_*** <- augment(___, ___) walk_fit_lm_s*** <- augment(___, ___) plot_fit_resid <- function(data){ ggplot(data, aes(x = ___, y = ___)) + geom_point() + facet_wrap(~Sensor) } plot_fit_resid(___) plot_fit_resid(___) plot_fit_resid(___) ``` We've looked at all these models, now pick your best one, and compare the predicted values against the actual observed values. What do you see? Is the model good? Is it bad? Do you have any thoughts on why it is good or bad? !> Answer:

# References

Make sure to reference all of the R packages that you used here, along with any links or blog posts that you used to help you answer these questions

# Extra code

This code below here is what was used to retrieve the data in the `data-raw` folder.

“`{r ped-data, eval = FALSE}
#
# walk_2018 <- melb_walk(from = ymd("2018-01-01"), to = ymd("2018-04-30")) # walk_2020 <- melb_walk(from = ymd("2020-01-01"), to = ymd("2020-04-30")) ## # # # write_rds(walk_2018, # here::here("2020/assignment-2/data-raw/walk_2018.rds"), # compress = "xz") # # write_rds(walk_2020, # here::here("2020/assignment-2/data-raw/walk_2020.rds"), # compress = "xz") ``` ```{r get-map-data, eval = FALSE} # melb_bbox <- c(min(ped_loc$longitude) - .001, # min(ped_loc$latitude) - 0.001, # max(ped_loc$longitude) + .001, # max(ped_loc$latitude) + 0.001) # # melb_map <- get_map(location = melb_bbox, source = "osm") # write_rds(melb_map, # path = here::here("2020/assignment-2/data-raw/melb-map.rds"), # compress = "xz") ``` ```{r get-station-data, eval = FALSE} # code to download the stations around the airport and the weather times # this is purely here so you can see how we downloaded this data # it is not needed for you to complete the assignment, so it is commented out # melb_stns <- read_table( # file = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt", # col_names = c("ID", # "lat", # "lon", # "elev", # "state", # "name", # "v1", # "v2", # "v3"), # skip = 353, # n_max = 17081 # ) %>%
# filter(state == “MELBOURNE AIRPORT”)
# #
# get_ncdc <- function(year){ # vroom::vroom( # glue::glue( # "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/{year}.csv.gz" # ), # col_names = FALSE, # col_select = 1:4 # ) # } # # clean_ncdc <- function(x){ # x %>%
# filter(X1 == melb_stns$ID, X3 %in% c(“PRCP”, “TMAX”, “TMIN”)) %>%
# rename_all(~ c(“station”, “date”, “variable”, “value”)) %>%
# mutate(date = ymd(date), value = value / 10) %>%
# pivot_wider(names_from = variable, values_from = value) %>%
# rename_all(tolower)
# }

# ncdc_2018 <- get_ncdc(2018) # melb_ncdc_2018 <- clean_ncdc(ncdc_2018) # write_csv(melb_ncdc_2018, # path = here::here("2020/assignment-2/data-raw/melb_ncdc_2018.csv")) # # ncdc_2020 <- get_ncdc(2020) # beepr::beep(sound = 4) # melb_ncdc_2020 <- clean_ncdc(ncdc_2020) # beepr::beep(sound = 4) # write_csv(melb_ncdc_2020, # path = here::here("2020/assignment-2/data-raw/melb_ncdc_2020.csv")) ```