# HW7: weather
#
# In the whole exercise, do NOT use a `for`, `while`, or `repeat` loop!
#
# 1. Load the `tidyverse`, `lubridate`, and `rattle` packages and use
# `theme_set(theme_light())` to set the theme for the rest of the exercise.
# Then, use `data(weather)` to read the dataset `weather` from the
# `rattle` package, and convert it into a tibble `weather` using
# `as_tibble()`.
#
# To check your answer:
#
# The output of `print(weather, n = 5)` is
#
# ```
# # A tibble: 366 x 24
# Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
#
# 1 2007-11-01 Canberra 8 24.3 0 3.4 6.3
# 2 2007-11-02 Canberra 14 26.9 3.6 4.4 9.7
# 3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3
# 4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1
# 5 2007-11-05 Canberra 7.6 16.1 2.8 5.6 10.6
# # … with 361 more rows, and 17 more variables: WindGustDir
# # WindGustSpeed
# # WindSpeed9am
# # Humidity3pm
# # Cloud9am
# # RainToday
# ```
#
## Do not modify this line!
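# A possible sketch for this step (assuming the packages are installed):
library(tidyverse)
library(lubridate)
library(rattle)
theme_set(theme_light())
data(weather)
weather <- as_tibble(weather)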
# 2. From `weather`, extract two tibbles `weather_9am` and `weather_3pm`.
# `weather_9am` should contain the column `Date`, followed by all the
# variables whose names end with `"9am"`.
# Note that:
# - `"9am"` should be removed from the names of the variables.
# - `Date` should be converted into a date-time object as follows:
# instead of `"Y-M-D"`, it should be `"Y-M-D 09:00:00 AEST"`,
# where `"AEST"` is the time zone code for Australian Eastern Standard Time
# and `09:00:00` corresponds to 9am.
#
# To check your answer:
#
# The output of `print(weather_9am, n = 5)` is
#
# ```
# # A tibble: 366 x 7
# Date WindDir WindSpeed Humidity Pressure Cloud Temp
#
# 1 2007-11-01 09:00:00 SW 6 68 1020. 7 14.4
# 2 2007-11-02 09:00:00 E 4 80 1012. 5 17.5
# 3 2007-11-03 09:00:00 N 6 82 1010. 8 15.4
# 4 2007-11-04 09:00:00 WNW 30 62 1006. 2 13.5
# 5 2007-11-05 09:00:00 SSE 20 68 1018. 7 11.1
# # … with 361 more rows
# ```
#
# The output of `print(weather_3pm, n = 5)` is
#
# ```
# # A tibble: 366 x 7
# Date WindDir WindSpeed Humidity Pressure Cloud Temp
#
# 1 2007-11-01 15:00:00 NW 20 29 1015 7 23.6
# 2 2007-11-02 15:00:00 W 17 36 1008. 3 25.7
# 3 2007-11-03 15:00:00 NNE 6 69 1007. 7 20.2
# 4 2007-11-04 15:00:00 W 24 56 1007 7 14.1
# 5 2007-11-05 15:00:00 ESE 28 49 1018. 7 15.4
# # … with 361 more rows
# ```
#
## Do not modify this line!
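# A possible sketch. The Olson time zone name "Australia/Sydney" is an
# assumption standing in for Australian Eastern Time; adjust it if the
# checker expects a different zone.
weather_9am <- weather %>%
  select(Date, ends_with("9am")) %>%
  rename_with(~ str_remove(.x, "9am$")) %>%
  mutate(Date = make_datetime(year(Date), month(Date), mday(Date), 9,
                              tz = "Australia/Sydney"))
weather_3pm <- weather %>%
  select(Date, ends_with("3pm")) %>%
  rename_with(~ str_remove(.x, "3pm$")) %>%
  mutate(Date = make_datetime(year(Date), month(Date), mday(Date), 15,
                              tz = "Australia/Sydney"))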
# 3. From the tibbles `weather_9am` and `weather_3pm`, create a tibble
# `humidity_temp` by
# - full-joining them,
# - creating `hour`, a factor variable corresponding to the hour of day,
# - creating `yday`, a numeric variable corresponding to the day of year,
# - selecting the `yday`, `hour`, `Humidity`, and `Temp` columns,
# - making the data longer by pivoting around the `Humidity` and `Temp`
# columns,
# - sorting by day of year and hour.
# To do so, you can use `full_join()` (or `bind_rows()`), `mutate()`,
# `dplyr::select()`, `pivot_longer()`, and `arrange()`.
#
# To check your answer:
#
# The output of `print(humidity_temp, n = 5)` is
#
# ```
# # A tibble: 1,464 x 4
# yday hour name value
#
# 1 1 9 Humidity 50
# 2 1 9 Temp 21.9
# 3 1 15 Humidity 20
# 4 1 15 Temp 31.8
# 5 2 9 Humidity 43
# # … with 1,459 more rows
# ```
#
## Do not modify this line!
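# A possible sketch using `bind_rows()` (the full-join route also works):
humidity_temp <- bind_rows(weather_9am, weather_3pm) %>%
  mutate(hour = factor(hour(Date)), yday = yday(Date)) %>%
  select(yday, hour, Humidity, Temp) %>%
  pivot_longer(c(Humidity, Temp)) %>%
  arrange(yday, hour)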
# 4. Produce a smooth regression plot of `value` against `yday`,
# colored by `hour` and with line type by `name`.
# To do this, use `humidity_temp` and:
# - `ggplot` to initialize the plot,
# - `geom_smooth()` to get the smooth regression plot, and set `se`
# as `FALSE` and `method` as `"loess"`,
# - `labs()` to set the
# - title as `"Temperature and humidity are negatively correlated"`,
# - x-axis as `"Day of the year"`,
# - y-axis as `"Humidity (percent) and temperature (degrees C)"`,
# - color as `"Hour"`,
# - line type as `"Measurement"`,
# - `theme` to set the `legend.position` as `"bottom"`.
# Store the plot into a `ggplot` object `temp_humidity_plot`.
#
## Do not modify this line!
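# A possible sketch of the plot:
temp_humidity_plot <- humidity_temp %>%
  ggplot(aes(yday, value, color = hour, linetype = name)) +
  geom_smooth(se = FALSE, method = "loess") +
  labs(title = "Temperature and humidity are negatively correlated",
       x = "Day of the year",
       y = "Humidity (percent) and temperature (degrees C)",
       color = "Hour",
       linetype = "Measurement") +
  theme(legend.position = "bottom")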
# 5. From `weather_9am` and `weather_3pm`, extract two tibbles
# `weather_wind_9am` and `weather_wind_3pm` by
# - selecting the variables `WindDir` and `Temp`,
# - adding a variable `Hour` (respectively `"9am"` and `"3pm"`).
# To do that, you can use `dplyr::select()` and `mutate()`.
# Then, from the tibbles `weather_wind_9am` and `weather_wind_3pm`,
# create a tibble `weather_wind` by
# - full-joining them,
# - dropping the NAs,
# - creating `WindMainDir`, a character variable corresponding to the
# first letter of `WindDir`,
# - transforming `Hour` into a factor with levels `c("9am", "3pm")`.
# To do so, you can use `full_join()`, `drop_na()`, `str_sub()`,
# and `mutate()`.
#
# To check your answer:
#
# The output of `print(weather_wind_9am, n = 5)` is
#
# ```
# # A tibble: 366 x 3
# WindDir Temp Hour
#
# 1 SW 14.4 9am
# 2 E 17.5 9am
# 3 N 15.4 9am
# 4 WNW 13.5 9am
# 5 SSE 11.1 9am
# # … with 361 more rows
# ```
#
# The output of `print(weather_wind_3pm, n = 5)` is
#
# ```
# # A tibble: 366 x 3
# WindDir Temp Hour
#
# 1 NW 23.6 3pm
# 2 W 25.7 3pm
# 3 NNE 20.2 3pm
# 4 W 14.1 3pm
# 5 ESE 15.4 3pm
# # … with 361 more rows
# ```
#
# The output of `print(weather_wind, n = 5)` is
#
# ```
# # A tibble: 700 x 4
# WindDir Temp Hour WindMainDir
#
# 1 SW 14.4 9am S
# 2 E 17.5 9am E
# 3 N 15.4 9am N
# 4 WNW 13.5 9am W
# 5 SSE 11.1 9am S
# # … with 695 more rows
# ```
## Do not modify this line!
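# A possible sketch. Because `Hour` differs between the two tibbles, the
# full join never matches rows and simply stacks them:
weather_wind_9am <- weather_9am %>%
  dplyr::select(WindDir, Temp) %>%
  mutate(Hour = "9am")
weather_wind_3pm <- weather_3pm %>%
  dplyr::select(WindDir, Temp) %>%
  mutate(Hour = "3pm")
weather_wind <- weather_wind_9am %>%
  full_join(weather_wind_3pm, by = c("WindDir", "Temp", "Hour")) %>%
  drop_na() %>%
  mutate(WindMainDir = str_sub(WindDir, 1, 1),
         Hour = factor(Hour, levels = c("9am", "3pm")))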
# 6. Finally, use `weather_wind` to produce a boxplot of `Temp` against
# `WindMainDir`, facetted by `Hour`.
# To do this, use `weather_wind` and:
# - `ggplot` to initialize the plot,
# - `geom_boxplot()` to get the boxplot,
# - `facet_grid()` for the faceting,
# - `labs()` to set the
# - title as `"Temperature is higher in the afternoon"`,
# - subtitle as
#
# ```
# str_wrap("In the morning, it decreases when
# the winds turns from East to West", width = 70)
# ```
#
# - x-axis as `"Wind direction"`,
# - y-axis as `"Temperature (degree C)"`,
# - `theme` to set the `legend.position` as `"bottom"`.
# Store the plot into a `ggplot` object `wind_boxplot`.
#
## Do not modify this line!
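# A possible sketch of the plot (the subtitle string is kept exactly as
# the instructions specify it):
wind_boxplot <- weather_wind %>%
  ggplot(aes(WindMainDir, Temp)) +
  geom_boxplot() +
  facet_grid(. ~ Hour) +
  labs(title = "Temperature is higher in the afternoon",
       subtitle = str_wrap("In the morning, it decreases when
                           the winds turns from East to West", width = 70),
       x = "Wind direction",
       y = "Temperature (degree C)") +
  theme(legend.position = "bottom")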
# HW7: relational1
#
# This exercise was inspired by exercise 6 in Chapter 2 of
# [Bit By Bit: Social Research in the Digital Age]
# (https://www.bitbybitbook.com/en/1st-ed/observing-behavior/observing-activities/)
# by Matt Salganik.
#
# “In a widely discussed paper, Michel and colleagues
# ([2011](https://doi.org/10.1126/science.1199644)) analyzed the content of
# more than five million digitized books in an attempt to identify long-term
# cultural trends. The data that they used has now been released as the Google
# NGrams dataset, and so we can use the data to replicate and extend some of
# their work.
#
# In one of the many results in the paper, Michel and colleagues argued that we
# are forgetting faster and faster. For a particular year, say “1883,” they
# calculated the proportion of all terms published in each year between 1875
# and 1975 that were “1883”. They reasoned that this proportion is a measure of
# the interest in events that happened in that year. In their figure 3a, they
# plotted the usage trajectories for three years: 1883, 1910, and 1950. These
# three years share a common pattern: little use before that year, then a
# spike, then decay.”
#
# They noticed the rate of decay for each year mentioned seemed to increase
# with time and they argued that this means that we are forgetting the past
# faster and faster.
#
# The full paper can be found
# [here](https://aidenlab.org/papers/Science.Culturomics.pdf), and you are
# going to replicate part of figure 3a.
#
# To do so we will focus on the mention of terms that can represent years
# (strings like “1765”, “1886”, “1897”, “1937”…). The raw data was fetched
# for you from the [Google Books NGram Viewer website]
# (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) and
# preprocessed into two files:
# - `mentions_yearly_counts.tsv` contains the number of mentions of
# different terms per year and the number of books retrieved where the term
# appeared each year (one row per term per year).
# - `yearly_total_counts.csv` contains the total number of mentions of
# all terms per year as well as the number of pages and books retrieved each
# year (one row per year).
#
# In the whole exercise, do NOT use a `for`, `while`, or `repeat` loop!
#
# 1. Load the `tidyverse` and `scales` packages and use
# `theme_set(theme_light())` to set the theme for the rest of the exercise.
# Then, use `read_tsv` to load `/course/data/mentions_yearly_counts.tsv` and assign
# the resulting tibble to `terms_mentions`.
# Set the parameters of `read_tsv()` in order to make sure of the following:
# - Column names should be `"term"`, `"year"`, `"n_mentions"`,
# and `"book_count"`.
# - Column types should be `character`, `integer`, `integer`, `integer`.
# Hint: you can use parameters `col_names` and `col_types` to achieve this.
#
# To check your answer:
#
# The output of `print(terms_mentions, n = 5)` is
#
# ```
# # A tibble: 53,393 x 4
# term year n_mentions book_count
#
# 1 1817 1524 31 1
# 2 1817 1575 17 1
# 3 1817 1607 3 1
# 4 1817 1637 2 1
# 5 1817 1662 1 1
# # … with 5.339e+04 more rows
# ```
#
## Do not modify this line!
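# A possible sketch, assuming the file ships without a header row (hence
# the explicit `col_names`); `"ciii"` is the compact column-type spec:
library(tidyverse)
library(scales)
theme_set(theme_light())
terms_mentions <- read_tsv(
  "/course/data/mentions_yearly_counts.tsv",
  col_names = c("term", "year", "n_mentions", "book_count"),
  col_types = "ciii"
)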
# 2. Similarly, load `/course/data/yearly_total_counts.csv` using `read_csv()` and
# assign the resulting tibble to `total_mentions`.
# Set the parameters of `read_csv()` in order to make sure of the following:
# - Column names should be `"year"`, `"total_mentions"`,
# `"total_page_count"`, `"total_book_count"`.
# - Column types should be `integer`, `double`, `integer`, `integer`.
# Hint: you can use parameters `col_names` and `col_types` to achieve this.
# Note: the reason you should read in `total_mentions` as a `double`
# column is that it contains very large integers that don't fit within the
# bounds of numbers represented by the `integer` type in R. Using a
# double-precision number is our only recourse.
#
# To check your answer:
#
# The output of `print(total_mentions, n = 5)` is
#
# ```
# # A tibble: 425 x 4
# year total_mentions total_page_count total_book_count
#
# 1 1505 32059 231 1
# 2 1507 49586 477 1
# 3 1515 289011 2197 1
# 4 1520 51783 223 1
# 5 1524 287177 1275 1
# # … with 420 more rows
# ```
#
## Do not modify this line!
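# A possible sketch (`"idii"` = integer, double, integer, integer):
total_mentions <- read_csv(
  "/course/data/yearly_total_counts.csv",
  col_names = c("year", "total_mentions", "total_page_count",
                "total_book_count"),
  col_types = "idii"
)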
# 3. From `terms_mentions`, extract a new tibble `mentions` by left-joining
# the `total_mentions` tibble.
#
# To check your answer:
#
# The output of `print(mentions, n = 5)` is
#
# ```
# # A tibble: 53,393 x 7
# term year n_mentions book_count total_mentions total_page_count
#
# 1 1817 1524 31 1 287177 1275
# 2 1817 1575 17 1 186706 1067
# 3 1817 1607 3 1 381763 1600
# 4 1817 1637 2 1 681719 2315
# 5 1817 1662 1 1 239762 1471
# # … with 5.339e+04 more rows, and 1 more variable:
# # total_book_count
# ```
#
## Do not modify this line!
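# A possible sketch; `year` is the only shared column, so it is the key:
mentions <- terms_mentions %>%
  left_join(total_mentions, by = "year")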
# 4. Check that your join was successful by using `anti_join()` to drop all
# observations in `terms_mentions` that have a match in `mentions` and
# assign the resulting tibble to `diagnosis`. If the join went as expected
# `diagnosis` should be an empty tibble with the same columns as
# `terms_mentions`.
#
## Do not modify this line!
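# A possible sketch; with the default `by`, `anti_join()` uses all columns
# shared between the two tibbles:
diagnosis <- anti_join(terms_mentions, mentions)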
# 5. From `mentions`, extract a new tibble `relative_mention_counts` by adding
# a column `frac_total` containing the frequency of mentions of each term
# per year. This frequency is obtained by dividing the number of
# mentions of each term per year by the total number of mentions of all
# terms that year.
# Then, select only columns `term`, `year`, `n_mentions`,
# `total_mentions` and `frac_total`.
# Hint: you can use `mutate()` and `select()`.
#
# To check your answer:
#
# The output of `print(relative_mention_counts, n = 5)` is
#
# ```
# # A tibble: 53,393 x 5
# term year n_mentions total_mentions frac_total
#
# 1 1817 1524 31 287177 0.000108
# 2 1817 1575 17 186706 0.0000911
# 3 1817 1607 3 381763 0.00000786
# 4 1817 1637 2 681719 0.00000293
# 5 1817 1662 1 239762 0.00000417
# # … with 5.339e+04 more rows
# ```
#
## Do not modify this line!
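# A possible sketch:
relative_mention_counts <- mentions %>%
  mutate(frac_total = n_mentions / total_mentions) %>%
  dplyr::select(term, year, n_mentions, total_mentions, frac_total)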
# 6. From `relative_mention_counts`, extract a new tibble
# `examples_mention_counts` by
# - keeping only the terms `"1883"`, `"1910"`, and `"1950"`,
# - transforming the terms from character to a factor whose levels
# are in reverse alphabetical order (`"1950"`, `"1910"`, `"1883"`).
# Hint: you can use `filter()`, `mutate()` and `fct_rev()`.
# Note: the order matters to reproduce the same colors as the original
# figure without setting them explicitly when generating the plot.
#
# To check your answer:
#
# The output of `print(examples_mention_counts, n = 5)` is
#
# ```
# # A tibble: 825 x 5
# term year n_mentions total_mentions frac_total
#
# 1 1883 1515 1 289011 0.00000346
# 2 1883 1520 1 51783 0.0000193
# 3 1883 1524 15 287177 0.0000522
# 4 1883 1574 4 62235 0.0000643
# 5 1883 1575 3 186706 0.0000161
# # … with 820 more rows
# ```
#
## Do not modify this line!
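# A possible sketch; `fct_rev()` reverses the default alphabetical order:
examples_mention_counts <- relative_mention_counts %>%
  filter(term %in% c("1883", "1910", "1950")) %>%
  mutate(term = fct_rev(factor(term)))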
# 7. Generate a plot to reproduce the large window of figure 3a and assign the
# result to `paper_figure`.
# To do this, use `examples_mention_counts` and:
# - `ggplot` to initialize the plot,
# - `geom_line()` to get the lineplot,
# - `scale_y_continuous(label = percent)` to set the y-axis labels to
# the percent format,
# - `coord_cartesian()` and its argument `xlim` to limit the coordinates
# to show only the mentions across the timeframe `1850`-`2012`,
# - `labs()` to set the
# - title as `"Are we neglecting the past faster?"`,
# - x-axis as `"Year"`,
# - y-axis as `"Frequency of mention of each term"`,
# - color as `"Term"`.
# Store the plot into a `ggplot` object `paper_figure`.
#
## Do not modify this line!
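# A possible sketch of the plot:
paper_figure <- examples_mention_counts %>%
  ggplot(aes(year, frac_total, color = term)) +
  geom_line() +
  scale_y_continuous(label = percent) +
  coord_cartesian(xlim = c(1850, 2012)) +
  labs(title = "Are we neglecting the past faster?",
       x = "Year",
       y = "Frequency of mention of each term",
       color = "Term")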
# HW7: relational2
#
# In the whole exercise, do NOT use a `for`, `while`, or `repeat` loop!
#
# 1. Load the `tidyverse`, `scales` and `lubridate` packages and use
# `theme_set(theme_light())` to set the theme for the rest of the exercise.
# Then, use `read_csv` to load
# - `/course/data/guest_house/booking.csv` and assign it to a tibble `booking`,
# - `/course/data/guest_house/guest.csv` and assign it to a tibble `guest`,
# - `/course/data/guest_house/rate.csv` and assign it to a tibble `rate`,
# - `/course/data/guest_house/room.csv` and assign it to a tibble `room`.
#
# To check your answer:
#
# The output of `print(booking, n = 5)` is
#
# ```
# # A tibble: 347 x 8
# booking_id booking_date room_no guest_id occupants room_type_reque…
#
# 1 5001 11/3/2016 101 1027 1 single
# 2 5002 11/3/2016 102 1179 1 double
# 3 5003 11/3/2016 103 1106 2 double
# 4 5004 11/3/2016 104 1238 1 double
# 5 5005 11/3/2016 105 1540 3 family
# # … with 342 more rows, and 2 more variables: nights
# # arrival_time
# ```
#
# 2. From `booking`, extract a tibble `room_earning` by left-joining the
# `rate` tibble. Then, add a column `earning` containing the earning of
# each booking, defined as `amount` times `nights`.
# Finally, some tidying:
# - Convert `booking_date` to class `"Date"`.
# - Convert `room_no` to a factor with levels ordered by median `earning`.
# - Sort by `booking_date`, breaking ties by `room_no`.
# - Select the columns `booking_date`, `room_no`, `guest_id`,
# `nights`, and `earning`.
# Hint: to do that, you can use `left_join()`, `mutate()`, `mdy()`,
# `fct_reorder()`, `arrange()` and `select()`.
#
# To check your answer:
#
# The output of `print(room_earning, n = 5)` is
#
# ```
# # A tibble: 347 x 5
# booking_date room_no guest_id nights earning
#
# 1 2016-11-03 309 1060 1 56
# 2 2016-11-03 102 1179 2 112
# 3 2016-11-03 301 1406 4 192
# 4 2016-11-03 108 1136 1 56
# 5 2016-11-03 210 1626 5 360
# # … with 342 more rows
# ```
#
# The output of `print((room_earning %>% pull(room_no))[1:5], max.levels = 3)` is
#
# ```
# [1] 309 102 301 108 210
# 30 Levels: 309 102 ... 203
# ```
#
## Do not modify this line!
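# A possible sketch; with the default `by`, `left_join()` uses the columns
# that `booking` and `rate` share (their exact names depend on the data):
room_earning <- booking %>%
  left_join(rate) %>%
  mutate(earning = amount * nights,
         booking_date = mdy(booking_date),
         room_no = fct_reorder(factor(room_no), earning, median)) %>%
  arrange(booking_date, room_no) %>%
  dplyr::select(booking_date, room_no, guest_id, nights, earning)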
# 3. Let's visualize the earnings per room using a boxplot.
# To do this, use `room_earning` and:
# - `ggplot` to initialize the plot,
# - `geom_boxplot()` to get the boxplot,
# - `scale_y_continuous()` with `label_dollar()` to set the y-axis labels
# to dollar format,
# - `coord_flip()` to flip the axes,
# - `labs()` to set the
# - title as `"Most rooms have earnings between $100 and $300"`,
# - x-axis as `"Room number"`,
# - y-axis as `"Earnings"`.
# Store the plot into a `ggplot` object `room_earning_plot`.
#
## Do not modify this line!
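# A possible sketch of the plot:
room_earning_plot <- room_earning %>%
  ggplot(aes(room_no, earning)) +
  geom_boxplot() +
  scale_y_continuous(labels = label_dollar()) +
  coord_flip() +
  labs(title = "Most rooms have earnings between $100 and $300",
       x = "Room number",
       y = "Earnings")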
# 5. From `guest`, extract a tibble `guest_spending` that stores, for the 10
# “biggest clients”:
# - `name`, the full name of the guest,
# - `nights`, the total number of nights spent by the guest,
# - `spending`, the total dollar amount spent by the guest.
# Here, the 10 biggest clients are the 10 that spent the most.
# Make sure that:
# - `name` is a factor whose levels are ordered by median `spending`,
# - you sort the data by descending total spending.
# You can use:
# - `mutate()` and `paste()` to concatenate `first_name` and
# `last_name` in `guest`,
# - `right_join()` to join the aforementioned tibble with
# `room_earning`,
# - `group_by()` and `summarize()` to compute the total number of
# `nights` and `spending` for each guest,
# - `arrange()` and `head()` to sort and select the biggest spenders,
# - `fct_reorder()` to transform `name` into a factor with appropriate
# levels ordering.
#
# To check your answer:
#
# The output of `print(guest_spending, n = 5)` is
#
# ```
# # A tibble: 10 x 3
# name nights spending
#
# 1 Sir Edward Garnier 11 780
# 2 Robert Halfon 10 768
# 3 Angela Rayner 14 744
# 4 Karin Smyth 11 696
# 5 Sir Alan Haselhurst 11 680
# # … with 5 more rows
# ```
#
# The output of `print((guest_spending %>% pull(name))[1:3], max.levels = 3)` is
#
# ```
# [1] Sir Edward Garnier Robert Halfon Angela Rayner
# 10 Levels: Craig Tracey Hannah Bardell ... Sir Edward Garnier
# ```
#
## Do not modify this line!
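# A possible sketch; the join key "guest_id" and the `first_name` /
# `last_name` columns are assumptions about the `guest` table and may need
# adjusting to its actual layout:
guest_spending <- guest %>%
  mutate(name = paste(first_name, last_name)) %>%
  right_join(room_earning, by = "guest_id") %>%
  group_by(name) %>%
  summarize(nights = sum(nights), spending = sum(earning)) %>%
  arrange(desc(spending)) %>%
  head(10) %>%
  mutate(name = fct_reorder(name, spending, median))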
# 6. Produce a barplot of the spendings of the 10 biggest clients.
# The bars should be horizontal and ordered by total spending
# from highest (on top of the figure) to lowest.
# To do this, use `guest_spending` and:
# - `ggplot` to initialize the plot,
# - `geom_col()` to get the barplot,
# - `scale_y_continuous()` with `label_dollar()` to set the y-axis labels
# to dollar format,
# - `coord_flip()` to flip the axes,
# - `labs()` to set the
# - title as `"Top 10 guests all spent more than $600"`,
# - x-axis as `"Name"`,
# - y-axis as `"Spending"`.
# Store the plot into a `ggplot` object `guest_spending_plot`.
#
## Do not modify this line!
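# A possible sketch of the plot; because the factor levels are ordered by
# spending, `coord_flip()` puts the biggest spender on top:
guest_spending_plot <- guest_spending %>%
  ggplot(aes(name, spending)) +
  geom_col() +
  scale_y_continuous(labels = label_dollar()) +
  coord_flip() +
  labs(title = "Top 10 guests all spent more than $600",
       x = "Name",
       y = "Spending")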
# HW7: relational3
#
# In this exercise, we'll look at data about a university.
# This dataset contains four tables:
# - `department.csv` contains information about the university's departments,
# including `Department_ID`, `Department_Name` and `DOE` (date of
# establishment).
# - `employee.csv` contains information about the faculty, including
# `Employee_ID`, `DOB` (date of birth), `DOJ` (date of joining the
# university) and the `Department_ID` to which the person belongs.
# - `s_admission.csv` contains information about students' admissions,
# including `Student_ID`, `DOA` (date of admission), `DOB`,
# `Department_Choices` (choices the student made during counselling) and
# `Department_Admission` (department offered to the student).
# - `s_performance.csv` contains information about students' performance,
# including `Student_ID`, `Semester_Name`, `Paper_ID`, `Paper_Name` and
# `Marks` (marks scored in examination).
# After preprocessing each table, we will focus on the Engineering School.
#
# In the whole exercise, do NOT use a `for`, `while`, or `repeat` loop!
#
# 1. Load the `tidyverse` and `lubridate` packages and use
# `theme_set(theme_light())` to set the theme for the rest of the exercise.
# Then, use `read_csv` to load
# - `/course/data/department_info.csv` and assign it to a tibble
# `department`,
# - `/course/data/employee_info.csv` and assign it to a tibble `employee`,
# - `/course/data/student_counselling_info.csv` and assign it to a tibble
# `s_admission`,
# - `/course/data/student_performance_info.csv` and assign it to a tibble
# `s_performance`.
#
# To check your answer:
#
# The output of `print(department, n = 5)` is
#
# ```
# # A tibble: 40 x 3
# Department_ID Department_Name DOE
#
# 2 IDEPT5528 Biosciences and Bioengineering 1943-06-28 00:00:00
# 3 IDEPT3115 Chemical Engineering 1940-05-01 00:00:00
# 4 IDEPT5881 Chemistry 2013-06-08 00:00:00
# 5 IDEPT4938 Civil Engineering 1941-10-27 00:00:00
# # … with 35 more rows
# ```
#
# The output of `print(employee, n = 5)` is
#
# ```
# # A tibble: 1,000 x 4
# Employee_ID DOB DOJ Department_ID
#
# 1 IU196557 1983-02-23 00:00:00 2009-10-31 00:00:00 IDEPT4938
# 2 IU449901 1985-09-02 00:00:00 2009-06-07 00:00:00 IDEPT2357
# 3 IU206427 1971-07-30 00:00:00 2008-05-09 00:00:00 IDEPT4670
# 4 IU688905 1973-07-20 00:00:00 2002-01-17 00:00:00 IDEPT2601
# 5 IU634582 1991-11-16 00:00:00 2000-02-13 00:00:00 IDEPT7626
# # … with 995 more rows
# ```
#
# The output of `print(s_admission, n = 5)` is
#
# ```
# # A tibble: 4,000 x 5
# Student_ID DOA DOB Department_Choi…
#
# 1 SID201311… 2013-07-01 00:00:00 1996-02-05 00:00:00 IDEPT7783
# 2 SID201311… 2013-07-01 00:00:00 1995-07-31 00:00:00 IDEPT6347
# 3 SID201311… 2013-07-01 00:00:00 1995-09-05 00:00:00 IDEPT1836
# 4 SID201311… 2013-07-01 00:00:00 1996-01-12 00:00:00 IDEPT8473
# 5 SID201311… 2013-07-01 00:00:00 1995-07-30 00:00:00 IDEPT5528
# # … with 3,995 more rows, and 1 more variable:
# # Department_Admission
# ```
#
# The output of `print(s_performance, n = 5)` is
#
# ```
# # A tibble: 209,611 x 5
# Student_ID Semster_Name Paper_ID Paper_Name Marks
#
# 1 SID20131143 Sem_1 SEMI0012995 Paper 1 44
# 2 SID20131143 Sem_1 SEMI0015183 Paper 2 74
# 3 SID20131143 Sem_1 SEMI0018371 Paper 3 80
# 4 SID20131143 Sem_1 SEMI0015910 Paper 4 44
# 5 SID20131143 Sem_1 SEMI0016208 Paper 5 95
# # … with 2.096e+05 more rows
# ```
#
## Do not modify this line!
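# A possible sketch:
library(tidyverse)
library(lubridate)
theme_set(theme_light())
department <- read_csv("/course/data/department_info.csv")
employee <- read_csv("/course/data/employee_info.csv")
s_admission <- read_csv("/course/data/student_counselling_info.csv")
s_performance <- read_csv("/course/data/student_performance_info.csv")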
# 2. From `employee`, extract a tibble `employee_new` with columns
# `Employee_ID`, `Department_ID`, `age`, `seniority` and `seniority_level`,
# where
# - `age` is the age of the professor (in years),
# - `seniority` is the number of years since joining,
# - and `seniority_level` is a factor with two levels, namely `"junior"`
# (`seniority` <= 10) and `"senior"` (`seniority` > 10).
# To do this, you can use `mutate()` to create the columns
# - `age` and `seniority` from `DOB` and `DOJ` respectively by
# - creating an interval with `%--%`, `today()` and `dyears()`,
# - flooring the years to integers with `floor()`,
# - `seniority_level` to categorize `seniority` by
# - cutting `seniority` with `cut()`, setting
# `breaks = c(min(seniority), 10, max(seniority))`,
# `labels = c("junior", "senior")`, and `include.lowest = TRUE`.
# Then, you can use `dplyr::select()` to select the desired columns.
#
# To check your answer:
#
# The output of `print(employee_new, n = 5)` is
#
# ```
# # A tibble: 1,000 x 5
# Employee_ID Department_ID age seniority seniority_level
#
# 1 IU196557 IDEPT4938 37 10 junior
# 2 IU449901 IDEPT2357 34 10 junior
# 3 IU206427 IDEPT4670 48 11 senior
# 4 IU688905 IDEPT2601 46 18 senior
# 5 IU634582 IDEPT7626 28 20 senior
# # … with 995 more rows
# ```
#
## Do not modify this line!
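# A possible sketch; `%--%` builds an interval, and dividing by `dyears(1)`
# converts it to (fractional) years:
employee_new <- employee %>%
  mutate(age = floor((DOB %--% today()) / dyears(1)),
         seniority = floor((DOJ %--% today()) / dyears(1)),
         seniority_level = cut(seniority,
                               breaks = c(min(seniority), 10, max(seniority)),
                               labels = c("junior", "senior"),
                               include.lowest = TRUE)) %>%
  dplyr::select(Employee_ID, Department_ID, age, seniority, seniority_level)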
# 3. From `s_admission`, extract a tibble `s_admission_new` with columns
# `Student_ID`, `Department_Admission`, `age` and `school_year`, where
# - `age` is the age of the student,
# - `school_year` is the class standing, a factor with five levels, namely
# `"graduate"`, `"senior"`, `"junior"`, `"sophomore"` and `"freshman"`.
# To do this, you can use `mutate()` to create the column `age` from
# `DOB` similarly as above, and `school_year` by
# - transforming `DOA` into a factor with `factor()`,
# - collapsing the levels `"2013-07-01"` and `"2014-07-01"` into
# `"graduate"` with `fct_collapse()`,
# - recoding the other levels with `fct_recode()` to make
# - `"2015-07-01"` into `"senior"`,
# - `"2016-07-01"` into `"junior"`,
# - `"2017-07-01"` into `"sophomore"`,
# - `"2018-07-01"` into `"freshman"`.
# Then, you can use `dplyr::select()` to select the desired columns.
#
# To check your answer:
#
# The output of `print(s_admission_new, n = 5)` is
#
# ```
# # A tibble: 4,000 x 4
# Student_ID Department_Admission age school_year
#
# 1 SID20131143 IDEPT7783 24 graduate
# 2 SID20131151 IDEPT6347 24 graduate
# 3 SID20131171 IDEPT1836 24 graduate
# 4 SID20131176 IDEPT8473 24 graduate
# 5 SID20131177 IDEPT5528 24 graduate
# # … with 3,995 more rows
# ```
#
# The output of `print(tail(s_admission_new), n = 5)` is
#
# ```
# # A tibble: 6 x 4
# Student_ID Department_Admission age school_year
#
# 1 SID20189917 IDEPT6347 19 freshman
# 2 SID20189921 IDEPT7005 18 freshman
# 3 SID20189926 IDEPT1836 19 freshman
# 4 SID20189949 IDEPT4938 18 freshman
# 5 SID20189982 IDEPT6347 19 freshman
# # … with 1 more row
# ```
#
## Do not modify this line!
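# A possible sketch; `as_date()` drops any time component so the factor
# levels look like "2013-07-01" (an assumption about how `DOA` was parsed):
s_admission_new <- s_admission %>%
  mutate(age = floor((DOB %--% today()) / dyears(1)),
         school_year = factor(as_date(DOA)) %>%
           fct_collapse(graduate = c("2013-07-01", "2014-07-01")) %>%
           fct_recode(senior = "2015-07-01",
                      junior = "2016-07-01",
                      sophomore = "2017-07-01",
                      freshman = "2018-07-01")) %>%
  dplyr::select(Student_ID, Department_Admission, age, school_year)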
# 4. From `s_performance`, extract a tibble `s_performance_new` with columns
# `Student_ID`, `mean_score`, `min_score` and `max_score`. `mean_score`,
# `min_score` and `max_score` are aggregated across semesters and papers.
# To do this, you can use:
# – `group_by()` to group the data by `Student_ID`.
# – `summarize()` to calculate the mean, minimum and maximum of `Marks`.
#
# To check your answer:
#
# The output of `print(s_performance_new, n = 5)` is
#
# ```
# # A tibble: 3,819 x 4
# Student_ID mean_score min_score max_score
#
# 1 SID20131143 71.7 22 100
# 2 SID20131151 72.6 40 100
# 3 SID20131171 70.9 20 99
# 4 SID20131176 70.9 42 100
# 5 SID20131177 70.3 29 100
# # … with 3,814 more rows
# ```
#
## Do not modify this line!
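# A possible sketch:
s_performance_new <- s_performance %>%
  group_by(Student_ID) %>%
  summarize(mean_score = mean(Marks),
            min_score = min(Marks),
            max_score = max(Marks))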
# 5. From `department`, extract a tibble `department_engie` with columns
# `Department_ID` and `Department_Name`, keeping only the rows where
# `Department_Name` matches the string `"Engineering"` (case insensitive).
# To do this, you can use:
# - `filter()` to filter `Department_Name` for the Engineering School
# with `str_detect()` to match the string `"Engineering"`, either with
# a regexp and the alternation `"|"`, or with `fixed()` and the
# `ignore_case` argument,
# - `dplyr::select()` to select the desired columns.
#
# To check your answer:
#
# The output of `print(department_engie, n = 5)` is
#
# ```
# # A tibble: 17 x 2
# Department_ID Department_Name
#
# 1 IDEPT4670 Aerospace Engineering
# 2 IDEPT5528 Biosciences and Bioengineering
# 3 IDEPT3115 Chemical Engineering
# 4 IDEPT4938 Civil Engineering
# 5 IDEPT1423 Computer Science & Engineering
# # … with 12 more rows
# ```
#
## Do not modify this line!
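# A possible sketch using `fixed()` with `ignore_case`:
department_engie <- department %>%
  filter(str_detect(Department_Name,
                    fixed("engineering", ignore_case = TRUE))) %>%
  dplyr::select(Department_ID, Department_Name)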
# 6. From `employee_new`, extract a tibble `employee_engie` by inner-joining
# `department_engie` and transforming `Department_Name` into a factor with
# levels ordered by frequency.
# Hint: you can use `fct_infreq()`.
# Note that the inner-join is used because there is no employee record
# for three engineering departments. Hence, another join would create NAs.
#
# To check your answer:
#
# The output of `print(employee_engie, n = 5)` is
#
# ```
# # A tibble: 422 x 6
# Employee_ID Department_ID age seniority seniority_level
#
# 1 IU196557 IDEPT4938 37 10 junior
# 2 IU449901 IDEPT2357 34 10 junior
# 3 IU206427 IDEPT4670 48 11 senior
# 4 IU572796 IDEPT1142 42 12 senior
# 5 IU393717 IDEPT1825 47 10 junior
# # … with 417 more rows, and 1 more variable: Department_Name
# ```
#
# The output of `print(employee_engie$Department_Name[1:2], max.levels = 2)` is
#
# ```
# [1] Civil Engineering Energy Science and Engineering
# 14 Levels: Aerospace Engineering ... Centre of Studies in Resources Engineering (CSRE)
# ```
#
## Do not modify this line!
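# A possible sketch:
employee_engie <- employee_new %>%
  inner_join(department_engie, by = "Department_ID") %>%
  mutate(Department_Name = fct_infreq(Department_Name))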
# 7. One important aspect of faculty make-up is seniority.
# Draw horizontal stacked bar charts for `Department_Name` to visualize the
# distribution of `seniority_level` for each department of the Engineering
# school.
# To do this, use `employee_engie` and:
# - `ggplot` to initialize the plot,
# - `geom_bar()` to get the barplot,
# - `coord_flip()` to flip the axes,
# - `labs()` to set the
# - title as
#
# ```
# str_wrap("Most departments have more senior professors
# than juniors", width = 40)
# ```
#
# - subtitle as `"Exceptions are CASDE and Chemical Engineering"`,
# - x-axis as `"Department's name"`,
# - y-axis as `"Number of professors"`,
# - fill as `"Seniority level"`.
# Store the plot into a `ggplot` object `g1`.
#
## Do not modify this line!
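# A possible sketch of the plot:
g1 <- employee_engie %>%
  ggplot(aes(Department_Name, fill = seniority_level)) +
  geom_bar() +
  coord_flip() +
  labs(title = str_wrap("Most departments have more senior professors
                        than juniors", width = 40),
       subtitle = "Exceptions are CASDE and Chemical Engineering",
       x = "Department's name",
       y = "Number of professors",
       fill = "Seniority level")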
# 8. Inner-join the tibbles `s_admission_new`, `s_performance_new` and
# `department_engie` to create a new tibble `student_engie`, with
# the `school_year` column transformed into a factor with levels ordered
# by median `mean_score`.
# To do this, you can use:
# - `inner_join()` with the key `"Student_ID"` to join
# `s_admission_new` with `s_performance_new`,
# - `inner_join()` with the key `"Department_Admission"` in the left table
# and `"Department_ID"` in the right table to further join
# `department_engie`,
# - `mutate()` and `fct_reorder()` to transform and reorder `school_year`
# appropriately.
#
# To check your answer:
#
# The output of `print(student_engie, n = 5)` is
#
# ```
# # A tibble: 1,740 x 8
# Student_ID Department_Admi… age school_year mean_score min_score
#
# 1 SID201311… IDEPT7783 24 graduate 71.7 22
# 2 SID201311… IDEPT8473 24 graduate 70.9 42
# 3 SID201311… IDEPT5528 24 graduate 70.3 29
# 4 SID201311… IDEPT8473 24 graduate 64.3 40
# 5 SID201311… IDEPT3115 24 graduate 69.8 21
# # … with 1,735 more rows, and 2 more variables: max_score
# # Department_Name
# ```
#
# The output of `print(student_engie$school_year[1:2], max.levels = 2)` is
#
# ```
# [1] graduate graduate
# 5 Levels: senior ... graduate
# ```
#
## Do not modify this line!
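# A possible sketch:
student_engie <- s_admission_new %>%
  inner_join(s_performance_new, by = "Student_ID") %>%
  inner_join(department_engie,
             by = c("Department_Admission" = "Department_ID")) %>%
  mutate(school_year = fct_reorder(school_year, mean_score, median))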
# 9. Using `student_engie`, let's draw horizontal boxplots of
# `mean_score` against `school_year`.
# To do this, use `student_engie` and:
# - `ggplot` to initialize the plot,
# - `geom_boxplot()` to get the boxplot,
# - `coord_flip()` to flip the axes,
# - `labs()` to set the
# - title as `"Score distribution is similar across years"`,
# - x-axis as `"Class standing"`,
# - y-axis as `"Mean paper score"`.
# Store the plot into a `ggplot` object `g2`.
## Do not modify this line!
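# A possible sketch of the plot:
g2 <- student_engie %>%
  ggplot(aes(school_year, mean_score)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Score distribution is similar across years",
       x = "Class standing",
       y = "Mean paper score")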
# HW7: nycstock
#
# In this exercise, you will conduct a data analysis on NYC stock prices.
# The dataset consists of the following files:
# - `prices.csv`: daily prices, with most of the data spanning from 2010 to
# the end of 2016 (for newer companies, the date range is shorter).
# The dataset doesn't account for stock splits.
# - `securities.csv`: general description of each company (e.g., sector).
# - `fundamentals.csv`: metrics extracted from annual SEC 10-K filings
# (2012-2016), which allow deriving most of the popular fundamental
# indicators.
#
# In the whole exercise, do NOT use a `for`, `while`, or `repeat` loop!
#
# 1. Load the `tidyverse` and `lubridate` packages and use
# `theme_set(theme_light())` to set the theme for the rest of the exercise.
# Then, use `read_csv` to load
# - `/course/data/prices.csv` (with `locale = locale(tz = "America/New_York")`
# to set the timezone properly) and assign it to a tibble `prices`,
# - `/course/data/securities.csv` and assign it to a tibble `securities`,
# - `/course/data/fundamentals.csv` and assign it to a tibble `fund`.
#
# To check your answer:
#
# The output of `print(prices, n = 5)` is
#
# ```
# # A tibble: 851,264 x 7
# date symbol open close low high volume
#
# 1 2016-01-05 00:00:00 WLTW 123. 126. 122. 126. 2163600
# 2 2016-01-06 00:00:00 WLTW 125. 120. 120. 126. 2386400
# 3 2016-01-07 00:00:00 WLTW 116. 115. 115. 120. 2489500
# 4 2016-01-08 00:00:00 WLTW 115. 117. 114. 117. 2006300
# 5 2016-01-11 00:00:00 WLTW 117. 115. 114. 117. 1408600
# # … with 8.513e+05 more rows
# ```
#
# The output of `print((prices %>% pull(date))[1:3])` is
#
# ```
# [1] "2016-01-05 EST" "2016-01-06 EST" "2016-01-07 EST"
# ```
#
# The output of `print(securities, n = 5)` is
#
# ```
# # A tibble: 505 x 8
# Ticker symbol Security SEC filings GICS Sector
#
# 1 MMM 3M Comp… reports Industrials
# 2 ABT Abbott … reports Health Care
# 3 ABBV AbbVie reports Health Care
# 4 ACN Accentu… reports Information …
# 5 ATVI Activis… reports Information …
# # … with 500 more rows, and 4 more variables: GICS Sub
# # Industry
# # added
# ```
#
# The output of `print(fund, n = 5)` is
#
# ```
# # A tibble: 1,781 x 78
# Ticker Symbol Period Ending Accounts Payab… Accounts Recei…
#
# 1 AAL 12/31/12 3068000000 -222000000
# 2 AAL 12/31/13 4975000000 -93000000
# 3 AAL 12/31/14 4668000000 -160000000
# 4 AAL 12/31/15 5102000000 352000000
# 5 AAP 12/29/12 2409453000 -89482000
# # … with 1,776 more rows, and 74 more variables: Add'l
# # income/expense items
# # Expenditures
# # Cash and Cash Equivalents
# # Inventories
# # Revenue
# # Charges
# # Depreciation
# # Earnings Before Tax
# # Equity Earnings/Loss Unconsolidated Subsidiary
# # Assets
# # Profit
# # Interest Expense
# # Liabilities
# # Investments
# # Stocks
# # Net Cash Flow-Operating
# # Flows-Financing
# # Income
# # Applicable to Common Shareholders
# # Operations
# # Items
# # Other Assets
# # Current Liabilities
# # Financing Activities
# # Other Liabilities
# # Other Operating Items
# # ROE
# # and Development
# # Purchase of Stock
# # Short-Term Debt / Current Portion of Long-Term Debt
# # Short-Term Investments
# # Current Assets
# # Equity
# # Equity
# # For Year
# # Outstanding
# ```
#
## Do not modify this line!
library(tidyverse)
library(lubridate)
theme_set(theme_light())
prices <- read_csv("/course/data/prices.csv", locale = locale(tz = "America/New_York"))
securities <- read_csv("/course/data/securities.csv")
fund <- read_csv("/course/data/fundamentals.csv")
print(prices, n = 5)
print(securities, n = 5)
print(fund, n = 5)
# 2. In `securities`, let's keep the companies belonging to the top 6 sectors
# (by frequency of occurrence), as well as those whose `GICS Sub Industry`
# falls into `"Gold"` or `"Real Estate"` (i.e., `GICS Sub Industry` contains
# either `"Gold"` or `"REITs"`).
# You need to do it in two steps:
# - First, create a tibble named `securities_sectored` that contains
# only the companies that you do NOT want to keep (see
# below).
# Note that `securities_sectored` should contain an additional column
# `GICS Sector truncated` that contains the top 6 factors in `GICS Sector`
# and all the others lumped into an additional level `"Other"`.
# - Second, use `anti_join()` on `securities` and `securities_sectored` to
# create `securities_selected`, which contains only the rows that
# meet the requirements above by deleting the rows from `securities`
# that are in `securities_sectored`.
# Then, use `select()` to only select the columns `Ticker symbol`,
# `Security`, and `GICS Sector`.
# To extract `securities_sectored` from `securities`, you can use:
# - `mutate()` along with `fct_infreq()` and `fct_lump()` to
# reorder the sectors by frequency of occurrence and lump
# all except the top 6 into a single level `"Other"`.
# - `filter()` to select rows such that
# - `GICS Sector truncated` does not belong to the top 6 (i.e., the ones
# with the level `"Other"`),
# - `GICS Sub Industry` does not contain the words `"Gold"` or
# `"REITs"` (e.g., with `str_detect()` using a regular expression
# with OR represented by the alternation symbol `"|"`).
# To check your answer:
#
# The output of `print(securities_sectored, n = 5)` is
#
# ```
# # A tibble: 94 x 9
# Ticker symbol Security SEC filings GICS Sector
#
# 1 AES AES Corp reports Utilities
# 2 APD Air Pro… reports Materials
# 3 ALB Albemar… reports Materials
# 4 LNT Alliant… reports Utilities
# 5 AEE Ameren … reports Utilities
# # … with 89 more rows, and 5 more variables: GICS Sub
# # Industry
# # added
# ```
#
# The output of `print(securities_selected, n = 5)` is
#
# ```
# # A tibble: 411 x 3
# Ticker symbol Security GICS Sector
#
# 1 MMM 3M Company Industrials
# 2 ABT Abbott Laboratories Health Care
# 3 ABBV AbbVie Health Care
# 4 ACN Accenture plc Information Technology
# 5 ATVI Activision Blizzard Information Technology
# # … with 406 more rows
# ```
#
## Do not modify this line!
securities_sectored <- securities %>%
  mutate(`GICS Sector truncated` = fct_lump(fct_infreq(`GICS Sector`), n = 6)) %>%
  filter(`GICS Sector truncated` == "Other",
         !str_detect(`GICS Sub Industry`, "Gold|REITs"))
securities_selected <- securities %>%
  anti_join(securities_sectored, by = "Ticker symbol") %>%
  select(`Ticker symbol`, Security, `GICS Sector`)
print(securities_sectored, n = 5)
print(securities_selected, n = 5)
# 3. From `fund`, extract a tibble `fund_time` by converting the column name
# of `fund` from `Ticker Symbol` to `Ticker symbol` (for consistency between
# the column names of the different tables).
# Then:
# - create a new column `Period Ending Year` by extracting the year from
# `Period Ending`,
# - drop the rows containing `NA` values,
# - select the columns `Ticker symbol`, `Period Ending Year` and
# `Gross Margin`.
# To do that, you can use:
# - `rename()` to convert the column name,
# - `mutate()`, `mdy()` and `year()` to create `Period Ending Year`,
# - `drop_na()` to drop the rows that contain `NA` values,
# - `dplyr::select()` to select the required columns.
#
# To check your answer:
#
# The output of `print(fund_time, n = 5)` is
#
# ```
# # A tibble: 1,299 x 3
# Ticker symbol Period Ending Year Gross Margin
#
# 1 AAL 2012 58
# 2 AAL 2013 59
# 3 AAL 2014 63
# 4 AAL 2015 73
# 5 AAP 2012 50
# # … with 1,294 more rows
# ```
#
## Do not modify this line!
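# One possible sketch of this step, assuming `fund` was read in step 1 and
# the column names match the check output above (not necessarily the graded
# solution):

```r
# Rename the ticker column, derive the year with mdy() + year(),
# drop incomplete rows, then keep the three requested columns.
fund_time <- fund %>%
  rename(`Ticker symbol` = `Ticker Symbol`) %>%
  mutate(`Period Ending Year` = year(mdy(`Period Ending`))) %>%
  drop_na() %>%
  dplyr::select(`Ticker symbol`, `Period Ending Year`, `Gross Margin`)
print(fund_time, n = 5)
```

# Note: `drop_na()` must run before the `select()` so rows with `NA`
# anywhere in the 78 columns are removed, matching the 1,299-row output.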
# 4. Extract a tibble `securities_fund` by joining `securities_selected`
# and `fund_time`: `securities_fund` should contain all the rows in
# `securities_selected` where there is a match in `fund_time`, using
# `Ticker symbol` as key.
#
# To check your answer:
#
# The output of `print(securities_fund, n = 5)` is
#
# ```
# # A tibble: 988 x 5
# Ticker symbol Security GICS Sector Period Ending …
#
# 1 MMM 3M Comp… Industrials 2013
# 2 MMM 3M Comp… Industrials 2014
# 3 MMM 3M Comp… Industrials 2015
# 4 ABT Abbott … Health Care 2012
# 5 ABT Abbott … Health Care 2013
# # … with 983 more rows, and 1 more variable: Gross Margin
# ```
#
## Do not modify this line!
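# A sketch of the join: an inner join keeps exactly the rows of
# `securities_selected` that have a match in `fund_time` (one row per
# matching year), which is consistent with the 988 x 5 check output.

```r
# Join on the shared `Ticker symbol` key.
securities_fund <- securities_selected %>%
  inner_join(fund_time, by = "Ticker symbol")
print(securities_fund, n = 5)
```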
# 5. Generate histograms of `Gross Margin`, faceted by sector.
# To do this, use `securities_fund` and:
# - `geom_histogram()` with `binwidth = 10`,
# - `facet_wrap()` to facet by `GICS Sector`,
# - `labs()` to set the
# - title as `"The gross margin distribution varies by sector"`,
# - x-axis as `"Gross Margin (%)"`,
# - y-axis as `"Count (n)"`.
# Store the plot into a `ggplot` object `gross_margin`.
#
## Do not modify this line!
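# A sketch of the plot, using exactly the layers and labels listed above:

```r
# Histogram of Gross Margin, one panel per sector.
gross_margin <- securities_fund %>%
  ggplot(aes(x = `Gross Margin`)) +
  geom_histogram(binwidth = 10) +
  facet_wrap(~ `GICS Sector`) +
  labs(title = "The gross margin distribution varies by sector",
       x = "Gross Margin (%)",
       y = "Count (n)")
```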
# 6. Let's now look at stock prices.
# First, from `securities_fund`, extract a tibble `tickers_sectors` that
# contains the `Ticker symbol`, `Security`, and `GICS Sector` columns,
# keeping only the `unique()` combinations and `arrange()` them by
# `Ticker symbol`.
# Second, from `prices`, extract a tibble `full_stock` that contains the
# `Ticker symbol` (renamed from `symbol`), `close`, `open`, `date`,
# and `year` (extracted from `date`), and add to it the `Security` and
# `GICS Sector` information from `tickers_sectors`.
# Filter out the rows where the `GICS Sector` is not available.
# To do that, you can use:
# - `rename()` to convert `symbol` into `Ticker symbol`,
# - `mutate()` and `year()` to extract the year,
# - `dplyr::select()` to select the relevant variables,
# - `left_join()` to add the security name and sector,
# - `filter()` to delete the rows where the sector information is
# not available.
#
# To check your answer:
#
# The output of `print(tickers_sectors, n = 5)` is
#
# ```
# # A tibble: 276 x 3
# Ticker symbol Security GICS Sector
#
# 1 AAL American Airlines Group Industrials
# 2 AAP Advance Auto Parts Consumer Discretionary
# 3 AAPL Apple Inc. Information Technology
# 4 ABBV AbbVie Health Care
# 5 ABC AmerisourceBergen Corp Health Care
# # … with 271 more rows
# ```
#
# The output of `print(full_stock, n = 5)` is
#
# ```
# # A tibble: 468,588 x 7
# Ticker symbol close open date year Security
#
# 1 AAL 4.77 4.84 2010-01-04 00:00:00 2010 America…
# 2 AAP 40.4 40.7 2010-01-04 00:00:00 2010 Advance…
# 3 AAPL 214. 213. 2010-01-04 00:00:00 2010 Apple I…
# 4 ABC 26.6 26.3 2010-01-04 00:00:00 2010 Ameriso…
# 5 ABT 54.5 54.2 2010-01-04 00:00:00 2010 Abbott …
# # … with 4.686e+05 more rows, and 1 more variable: GICS
# # Sector
# ```
#
## Do not modify this line!
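# One way this step could look, assuming `securities_fund` and `prices`
# exist from the earlier steps (a sketch, not the graded solution):

```r
# Unique ticker/security/sector triples, sorted by ticker.
tickers_sectors <- securities_fund %>%
  dplyr::select(`Ticker symbol`, Security, `GICS Sector`) %>%
  unique() %>%
  arrange(`Ticker symbol`)

# Add year and sector info to the prices, then drop unmatched rows.
full_stock <- prices %>%
  rename(`Ticker symbol` = symbol) %>%
  mutate(year = year(date)) %>%
  dplyr::select(`Ticker symbol`, close, open, date, year) %>%
  left_join(tickers_sectors, by = "Ticker symbol") %>%
  filter(!is.na(`GICS Sector`))
print(tickers_sectors, n = 5)
print(full_stock, n = 5)
```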
# 7. Plot the evolution of the (closing) stock price in 2010~2016 for the
# following companies:
# `"Aetna Inc"`, `"Amazon.com Inc"`, `"Facebook"`, `"Whole Foods Market"`,
# `"FedEx Corporation"`, `"Boeing Company"`, `"The Walt Disney Company"`.
# To do that, use:
# - `filter()` to select the rows of the right companies,
# - `ggplot()` to initialize the plot,
# - `geom_line()` to get the lineplot (color by `Security`),
# - `geom_smooth()` with `method = "loess"` to add a smoothing trend,
# - `labs()` to set the
# - title as `"Amazon's stock price more than doubled!"`,
# - x-axis as `"Date"`,
# - y-axis as `"Daily close price (USD)"`.
# Store the plot into a `ggplot` object `trend`.
## Do not modify this line!
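# A sketch of the plot, matching companies against the `Security` column
# with the names listed above (assumes `full_stock` from step 6):

```r
# Line plot of daily closes for the selected companies.
selected_companies <- c("Aetna Inc", "Amazon.com Inc", "Facebook",
                        "Whole Foods Market", "FedEx Corporation",
                        "Boeing Company", "The Walt Disney Company")
trend <- full_stock %>%
  filter(Security %in% selected_companies) %>%
  ggplot(aes(x = date, y = close, color = Security)) +
  geom_line() +
  geom_smooth(method = "loess") +
  labs(title = "Amazon's stock price more than doubled!",
       x = "Date",
       y = "Daily close price (USD)")
```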
# 10. Calculate the annual "Rate of Return" (RoR, or return) on the securities
# in `full_stock`. The RoR is defined as the net gain or loss on an
# investment over a specified time period, calculated as a percentage of the
# investment's initial cost.
# Formally, RoR = (current value - initial value) / initial value.
# From `full_stock`, extract a tibble `return_stock` grouping by
# `year`, `Ticker symbol`, `GICS Sector`.
# Then, for each year and stock, extract the open corresponding to the first
# (i.e. `min()`) date, and the close corresponding to the last
# (i.e. `max()`) date.
# Finally, compute `return` as `(close - open) / open`.
# Hint: this can be done simply using `group_by()` and `summarize()`.
#
# To check your answer:
#
# The output of `print(return_stock, n = 5)` is
#
# ```
# # A tibble: 1,866 x 6
# # Groups: year, Ticker symbol [1,866]
# year Ticker symbol GICS Sector open close return
#
# 1 2010 AAL Industrials 4.84 10.0 1.07
# 2 2010 AAP Consumer Discretionary 40.7 66.2 0.625
# 3 2010 AAPL Information Technology 213. 323. 0.511
# 4 2010 ABC Health Care 26.3 34.1 0.298
# 5 2010 ABT Health Care 54.2 47.9 -0.116
# # … with 1,861 more rows
# ```
#
## Do not modify this line!
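# One possible sketch, assuming `full_stock` from step 6. Inside
# `summarize()`, later expressions can reference summaries created earlier,
# so `return` can use the freshly computed `open` and `close`:

```r
# First open / last close per (year, ticker, sector), then the RoR.
return_stock <- full_stock %>%
  group_by(year, `Ticker symbol`, `GICS Sector`) %>%
  summarize(open = open[which.min(date)],
            close = close[which.max(date)],
            return = (close - open) / open)
print(return_stock, n = 5)
```

# `summarize()` drops the last grouping variable, leaving the tibble grouped
# by `year` and `Ticker symbol` as in the check output.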
# 11. From `return_stock`, extract a tibble `summary_stock` by calculating
# the 0.25 quantile, mean, and 0.75 quantile of `return`
# for each `GICS Sector`.
# Hint: you can do that using `group_by()` and `summarize()`.
#
# To check your answer:
#
# The output of `print(summary_stock, n = 5)` is
#
# ```
# # A tibble: 8 x 4
# GICS Sector q25 mean_return q75
#
# 1 Consumer Discretionary -0.0357 0.150 0.318
# 2 Consumer Staples 0.00668 0.0902 0.192
# 3 Financials -0.0286 0.100 0.243
# 4 Health Care -0.0195 0.150 0.256
# 5 Industrials -0.0422 0.140 0.299
# # … with 3 more rows
# ```
## Do not modify this line!
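# A sketch of the summary, with the column names q25 / mean_return / q75
# taken from the check output (assumes `return_stock` from step 10):

```r
# Per-sector 25th percentile, mean, and 75th percentile of annual returns.
summary_stock <- return_stock %>%
  group_by(`GICS Sector`) %>%
  summarize(q25 = quantile(return, 0.25),
            mean_return = mean(return),
            q75 = quantile(return, 0.75))
print(summary_stock, n = 5)
```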