1
HW7: Dates Manipulation on CitiBike data
1. Load the packages readr.
Use readcsv to create a tibble from the csv file at
coursedatacitybike.csv taken from sampled data from
Citibike usage datahttps:www.citibikenyc.comsystemdata and
assign it to variable trips. The file contains data corresponding to
every bike trip taken by users in August 2016.
Peek at the first rows of trips using head to get a feel of
what your data is like and assign the result to tibble headtrips.
headtrips should print to something like
A tibble: 6 x 15
tripduration starttime stoptime start station ??? start station ???
dbl chr chr dbl chr
1 5945 852016??? 85201??? 228 E 48 St 3 Ave
2 1494 832016??? 83201??? 460 S 4 St Wythe ???
3 826 830201??? 83020??? 3301 Columbus Ave ???
4 278 815201??? 81520??? 3256 Pier 40 Hudso???
5 729 882016??? 88201??? 347 Greenwich St ???
6 502 822201??? 82220??? 382 University Pl ???
??? with 10 more variables: start station latitude dbl, start station
longitude dbl, end station id dbl, end station name chr,
end station latitude dbl, end station longitude dbl,
bikeid dbl, usertype chr, birth year dbl, gender dbl
Do not modify this line!
2. Load the dplyr package.
Modify the gender column to a factor column with levels ordered from 0
to 1 and the labels Unknown for 0, Male for 1, Female for 2.
Assign the resulting tibble to tripswithgenders.
Hint: you can use mutate and factor, and judiciously use the
levels and labels arguments of factor.
The gender column of tripswithgenders should print to:
A tibble: 100,000 x 1
gender
fct
1 Male
2 Male
3 Male
4 Female
5 Female
6 Male
7 Male
8 Male
9 Male
10 Male
??? with 99,990 more rows
And the rest of the tibble should be the same as in question 1.
Do not modify this line!
3. Load the lubridate package.
Starting with trips, modify the starttime and stoptime columns to
contain dates in the format monthdayyearhourminutesecond.
Assign the resulting tibble to tripswithdates.
Hint: you can use mutate and mdyhms.
The starttime and stoptime columns of tripswithdates should print
to:
A tibble: 100,000 x 2
starttime stoptime
dttm dttm
1 20160805 14:15:11 20160805 15:54:16
2 20160803 22:56:34 20160803 23:21:28
3 20160830 07:41:07 20160830 07:54:54
4 20160815 20:39:47 20160815 20:44:26
5 20160808 17:40:31 20160808 17:52:40
6 20160822 07:26:03 20160822 07:34:26
7 20160831 16:32:42 20160831 16:40:08
8 20160810 19:40:36 20160810 19:59:15
9 20160824 11:29:54 20160824 11:32:12
10 20160830 12:16:21 20160830 12:21:09
??? with 99,990 more rows
And the rest of the tibble should be the same as in question 1.
Do not modify this line!
4. Create a tibble containing the starttime column from tripswithdates
as well as two additional columns startymd and starthour containing
the starting time day and hour of each trip.
To do so, you need to:
starting with tripswithdates, keep only the starttime column,
you can use select for this
then create the two columns from starttime, you can use mutate
along with floordate and hour for this
assign the result to tibble tripsstarttimes
A tibble: 100,000 x 3
starttime startymd starthour
dttm dttm int
1 20160805 14:15:11 20160805 00:00:00 14
2 20160803 22:56:34 20160803 00:00:00 22
3 20160830 07:41:07 20160830 00:00:00 7
4 20160815 20:39:47 20160815 00:00:00 20
5 20160808 17:40:31 20160808 00:00:00 17
6 20160822 07:26:03 20160822 00:00:00 7
7 20160831 16:32:42 20160831 00:00:00 16
8 20160810 19:40:36 20160810 00:00:00 19
9 20160824 11:29:54 20160824 00:00:00 11
10 20160830 12:16:21 20160830 00:00:00 12
??? with 99,990 more rows
Do not modify this line!
5. Compute 1 the total number of trips, 2 the number of days in which there
was at least one trip and 3 the average number of trips that started
for each hour of the day over the whole month. Assign the resulting
tibble to tripsperhour.
To do so, you need to:
starting with tripsstarttimes, group by the hour of the day, you
can use groupby for this
then compute the three statistics listed above you can use
summarize for this
Hint: The function n returns the number of rows, and
ndistinctcolumn returns the number of distinct values in column.
tripsperhour should print to:
A tibble: 24 x 4
starthour numtrips numdays meantrips
int int int dbl
1 0 986 31 31.8
2 1 506 31 16.3
3 2 322 31 10.4
4 3 193 30 6.43
5 4 195 31 6.29
6 5 587 31 18.9
7 6 2262 31 73.0
8 7 4784 31 154.
9 8 7958 31 257.
10 9 6528 31 211.
??? with 14 more rows
Do not modify this line!
2
HW7: Strings and Dates
In this exercise, you will manipulate a dataset with strings and factors.
Throughout the exercise:
Use themelight for the plots.
Do not change the default position of the plot title.
Do not print the plot.
In this exercise, you have to recreate the figures found at
the left of the instructions.
We suggest functions you can use to create the plots, but
you are free to use the method you are the most comfortable with.
Make sure that the figures look exactly like the ones you are supposed to create.
1. Load the tidyverse, rattle, ggplot2 and lubridate packages.
Use dataweather to read the dataset weather from the
rattle package.
Store it in a tibble weather using astibble.
Do not modify this line!
2. Create two tibbles weather9am and weather3pm.
weather9am should contain all the variables which names end with
9am and the column Date. 9am should be removed from the
names of its variables and the format of Date should be changed as follow :
Instead of YMD, add YMD h:m:s AECT where AECT is the time zone
code for Australian Eastern Time and h:m:s is 09:00:00 9 am.
For example, the Date column first value will be 20071101 09:00:00 AEDT.
weather3pm should contain all the variables which names end with
3pm and the column Date. 3pm should be removed from the
names of its variables and the format of Date should be changed as follow :
Instead of YMD, add YMD h:m:s AECT where AECT is the time zone
code for Australian Eastern Time and h:m:s is 15:00:00 3 pm.
For example, the Date column first value will be 20071101 15:00:00 AEDT.
To do so, you can use :
select and endswith to select the right columns
renameall and strremove to rename the columns correctly
mutate, ymdhms and paste to change the format of Date
Hint : If you use ymdhms, set tz to AustraliaCanberra
weather9am should print to :
A tibble: 366 x 7
Date WindDir WindSpeed Humidity Pressure Cloud Temp
dttm ord dbl int dbl int dbl
1 20071101 09:00:00 SW 6 68 1020. 7 14.4
2 20071102 09:00:00 E 4 80 1012. 5 17.5
3 20071103 09:00:00 N 6 82 1010. 8 15.4
4 20071104 09:00:00 WNW 30 62 1006. 2 13.5
5 20071105 09:00:00 SSE 20 68 1018. 7 11.1
6 20071106 09:00:00 SE 20 70 1024. 7 10.9
7 20071107 09:00:00 SE 19 63 1025. 4 12.4
8 20071108 09:00:00 SE 11 65 1026. 6 12.1
9 20071109 09:00:00 E 19 70 1026. 7 14.1
10 20071110 09:00:00 S 7 82 1024. 7 13.3
??? with 356 more rows
Do not modify this line!
3. Join the tibbles weather9am and weather3pm to create a tibble
weatherarrange that contains the data in both weather9am and
weather3pm. However, its Date should change to become the
corresponding day of the year, and be named Day of the year.
For example, 20071123 15:00:00 AEDT will become 23.
Then, order the tibble from the earliest Day of the year to the latest.
To do so, you can use :
fulljoin to join the datasets
mutate and yday to change the Date
rename to change the name of Date
arrange to reorder the tibble
weatherarrange should print to :
A tibble: 732 x 7
Day of the year WindDir WindSpeed Humidity Pressure Cloud Temp
dbl ord dbl int dbl int dbl
1 1 ESE 2 50 1016. 7 21.9
2 1 W 11 20 1012. 6 31.8
3 2 ESE 2 43 1013. 0 23
4 2 WSW 9 14 1009. 1 33.6
5 3 ESE 20 69 1017. 8 19.2
6 3 ESE 24 55 1016. 7 22.3
7 4 ESE 24 56 1016. 6 20.3
8 4 ESE 26 45 1013 7 23.9
9 5 SSE 7 69 1012. 7 18.6
10 5 E 15 40 1007. 2 26.8
??? with 722 more rows
Do not modify this line!
4. Plot both the smoothed temperature Temp multiplied by 3,
and Humidity as function of the date.
To do that, you can use :
ggplot to initialize a ggplot object. You can set its arguments
data and mapping to plot from weatherarrange, with
Day of the year as its xaxis
geomsmooth to plot the smoothed Humidity, with col set
to green, method set to loess, and formula to y x
These are the approximation methods for the smoothing process
geomsmooth to plot the smoothed 3Temp, with col set
to yellow, method set to loess, and formula to y x
labs to set :
title to Temperature and Humidity are negatively correlated in Canberra
x to Day of the year
y to Humidity in green and rescaled Temperature in yellow
themelight
Save the plot into temphumidityplot.
Do not modify this line!
5. Create a tibble weatherwind that contains the wind direction WindDir,
the temperature Temp, and the main wind direction WindMainDir, for each
date and time we have two recordings a day at 9am and 15pm.
It should be ordered from lowest temperature to highest, and not contain
any NA value. WindMainDir should correspond to the first letter of
WindDir and be converted to factor. For example, SW main direction is S.
To do that, please create two tibbles weatherwind9am and weatherwind3pm
that contain the columns WindDir and Temp from weather9am ans
weatherwind3pm. You can use select.
weatherwind9am should print to :
A tibble: 366 x 2
WindDir Temp
ord dbl
1 SW 14.4
2 E 17.5
3 N 15.4
4 WNW 13.5
5 SSE 11.1
6 SE 10.9
7 SE 12.4
8 SE 12.1
9 E 14.1
10 S 13.3
??? with 356 more rows
weatherwind3pm has the same columns, but with different values.
Then, join these two tibbles into a tibble weatherwind using fulljoin.
You can then :
use dropna to drop the NA values
use mutate, factor and strsub to create the column WindMainDir.
use arrange to reorder the dataset.
weatherwind should print to :
A tibble: 667 x 3
WindDir Temp WindMainDir
ord dbl fct
1 N 0.1 N
2 E 0.8 E
3 SSE 1 S
4 SE 1.2 S
5 SE 1.4 S
6 NNW 1.4 N
7 NW 1.8 N
8 N 2.1 N
9 ESE 2.6 E
10 SE 2.7 S
??? with 657 more rows
Do not modify this line!
6. Plot the facetted histogram of the temperature Temp for each main
wind direction WindMainDir. Add a vertical line for temperature
29 degrees Celsius to see what winds bring hot temperatures.
To do this, you can use :
ggplot to initialize a ggplot object. You can set its arguments
data and mapping to plot from weatherarrange, with
Temp as its xaxis
geomhistogram to create the histograms of Temperature, with
color set as black, fill as orange and binwidth as 0.5.
geomvline to draw the vertical line with aesxintercept 29
and color set to green
facetwrap to facet over WindMainDir
labs to set :
title to The winds going west and north bring the highest temperatures
x to Temperature Celsius
themelight
Save this ggplot object to windhistogram.
Do not modify this line!
3
HW7: Reproducing a paper figure
This exercise was inspired by exercise 6 in Chapter 2 of
Bit By Bit: Social Research in the Digital Age
https:www.bitbybitbook.comen1stedobservingbehaviorobservingactivities
by Matt Salganik.
In a widely discussed paper, Michel and colleagues
2011https:doi.org10.1126science.1199644 analyzed the content of
more than five million digitized books in an attempt to identify longterm
cultural trends. The data that they used has now been released as the Google
NGrams dataset, and so we can use the data to replicate and extend some of
their work.
In one of the many results in the paper, Michel and colleagues argued that we
are forgetting faster and faster. For a particular year, say ???1883,??? they
calculated the proportion of all terms published in each year between 1875
and 1975 that were ???1883???. They reasoned that this proportion is a measure of
the interest in events that happened in that year. In their figure 3a, they
plotted the usage trajectories for three years: 1883, 1910, and 1950. These
three years share a common pattern: little use before that year, then a
spike, then decay.
They noticed the rate of decay for each year mentioned seemed to increase
with time and they argued that this means that we are forgetting the past
faster and faster.
The full paper can be found
herehttps:aidenlab.orgpapersScience.Culturomics.pdf, and you are
going to replicate part of figure 3a.
To do so we will focus on the mention of terms that can represent years
strings like 1765, 1886, 1897, 1937…. The raw data was fetched
for you from the Google Books NGram Viewer website
http:storage.googleapis.combooksngramsbooksdatasetsv2.html and
preprocessed into two files:
datamentionsyearlycounts.tsv contains the number of mentions of
different terms per year and the number of books retrieved where the term
appeared each year
one row per term per year
datayearlytotalcounts.csv contains the total number of mentions of all
terms per year as well as the number of pages and books retrived each year
one row per year
1. Load the readr package.
Read in datamentionsyearlycounts.tsv using readtsv and assign
the resulting tibble to termsmentions.
Set the parameters of readtsv in order to make sure of the following:
column names should be, in order: term, year, nmentions,
bookcount
column types should be, in order: character, integer, integer,
integer
Hint: you can use parameters colnames and coltypes to achieve this.
termsmentions should print to:
A tibble: 53,393 x 4
term year nmentions bookcount
chr int int int
1 1817 1524 31 1
2 1817 1575 17 1
3 1817 1607 3 1
4 1817 1637 2 1
5 1817 1662 1 1
6 1817 1675 5 1
7 1817 1693 8 1
8 1817 1705 1 1
9 1817 1708 1 1
10 1817 1713 1 1
??? with 53,383 more rows
Do not modify this line!
2. Read in datayearlytotalcounts.csv using readcsv and assign the
resulting tibble to totalmentions.
Set the parameters of readcsv in order to make sure of the following:
column names should be, in order: year, totalmentions,
totalpagecount, totalbookcount
column types should be, in order: integer, double, integer,
integer
Hint: you can use parameters colnames and coltypes to achieve this.
Note: the reason you should read in the totalmentions as a double
column is that it contains very large integers that dont fit within the
bounds of numbers represented by the integer type in R. Using a
doubleprecision number is our only recourse.
totalmentions should print to:
A tibble: 425 x 4
year totalmentions totalpagecount totalbookcount
int dbl int int
1 1505 32059 231 1
2 1507 49586 477 1
3 1515 289011 2197 1
4 1520 51783 223 1
5 1524 287177 1275 1
6 1525 3559 69 1
7 1527 4375 39 1
8 1541 5272 59 1
9 1563 213843 931 1
10 1564 70755 387 1
??? with 415 more rows
Do not modify this line!
3. Load the dplyr package. In order to join the totalmentions
Left join the totalmentions on termsmentions by year and assign
the resulting tibble to mentions.
Hint: you can use leftjoin.
mentions should print to:
A tibble: 53,393 x 7
term year nmentions bookcount totalmentions totalpagecount totalbookcount
chr int int int dbl int int
1 1817 1524 31 1 287177 1275 1
2 1817 1575 17 1 186706 1067 1
3 1817 1607 3 1 381763 1600 2
4 1817 1637 2 1 681719 2315 3
5 1817 1662 1 1 239762 1471 3
6 1817 1675 5 1 1644156 8918 14
7 1817 1693 8 1 1038415 7426 16
8 1817 1705 1 1 4908749 28840 60
9 1817 1708 1 1 6481151 37416 70
10 1817 1713 1 1 4720647 25961 77
??? with 53,383 more rows
Do not modify this line!
4. Check that your join was successful by using antijoin to drop all
observations in termsmentions that have a match in mentions and
assign the resulting tibble to diagnosis. If the join went as expected
diagnosis should be an empty tibble and print to:
A tibble: 0 x 4
??? with 4 variables: term chr, year int, nmentions int,
bookcount int
Do not modify this line!
5. Do the following:
starting with mentions, add a column fractotal that computes the
frequency of mentions of each term per year divides the number of
mentions of each term per year by the total number of mentions of all
terms that year,
select only columns term, year, nmentions, totalmentions and
fractotal.
Assign the resulting tibble to relativementioncounts.
Hint: you can use mutate and select.
relativementioncounts should print to:
A tibble: 53,393 x 5
term year nmentions totalmentions fractotal
chr int int dbl dbl
1 1817 1524 31 287177 0.000108
2 1817 1575 17 186706 0.0000911
3 1817 1607 3 381763 0.00000786
4 1817 1637 2 681719 0.00000293
5 1817 1662 1 239762 0.00000417
6 1817 1675 5 1644156 0.00000304
7 1817 1693 8 1038415 0.00000770
8 1817 1705 1 4908749 0.000000204
9 1817 1708 1 6481151 0.000000154
10 1817 1713 1 4720647 0.000000212
??? with 53,383 more rows
Do not modify this line!
6. Load the forcats package.
To prepare the tibble to build the figure with:
keep only the terms 1883, 1910 and 1950,
transform the terms from characters to a factor in which the levels
are in reversed alphabetical order 1950, 1910 and 1883
Assign the result to examplesmentioncounts.
Hint: you can use filter, mutate and fctrev to reverse the
order of levels of a factor.
Note: the order matters to us to reproduce the same colors as the original
figure without setting them explicitely when generating the plot.
examplesmentionscounts should print to:
A tibble: 825 x 5
term year nmentions totalmentions fractotal
fct int int dbl dbl
1 1883 1515 1 289011 0.00000346
2 1883 1520 1 51783 0.0000193
3 1883 1524 15 287177 0.0000522
4 1883 1574 4 62235 0.0000643
5 1883 1575 3 186706 0.0000161
6 1883 1584 1 151925 0.00000658
7 1883 1607 2 381763 0.00000524
8 1883 1637 5 681719 0.00000733
9 1883 1643 1 177489 0.00000563
10 1883 1644 3 1018174 0.00000295
??? with 815 more rows
Do not modify this line!
7. Load the ggplot2 and scales packages.
Generate a plot to reproduce the large window of figure 3a and assign the
result to paperfigure.
To do so, you can, in the following order:
create a plot from examplesmentioncounts using ggplot with the
appropriate columns assigned in the aesthetics x, y and color
arguments,
add the lines using geomline,
add scaleycontinuouslabel percent to set the yaxis ticks to
the percent format,
limit the coordinates to show only the mentions across the timeframe
18502012 using coordcartesian and its argument xlim,
use labs to:
set title to Are we forgetting the past faster?
set x to Year
set y to Frequency of mention of each term
set color to Term
finally add themelight for a clear plot
Do not modify this line!
4
HW7: Relational data
In this exercise, you will familiarize yourself with
relational data and its manipulation.
Throughout the exercise:
Use themelight for the plots.
Do not change the default position of the plot title.
Do not print the plot.
In this exercise, you have to recreate the figures found at
the left of the instructions.
We suggest functions you can use to create the plots, but
you are free to use the method you are the most comfortable with.
Make sure that the figures look exactly like the ones you are supposed to create.
1. Load the tidyverse, readr, and lubridate packages.
Use the function readcsv to read the dataset
booking.csv, guest.csv, rate.csv, and room.csv
from path dataguesthouse.
Store the corresponding dataframe into a tibble
booking, guest, rate, and room.
Do not modify this line!
2. Create a tibble roomearning from table booking and rate
using leftjoin.
For each bookingdate and roomno, there should be a column
earning containing the earning of this room.
The column bookingdate should be class Date.
To do that, you can use:
leftjoin to join tables booking and rate.
Note that you should join by two columns.
mutate to create a new column earning that
is the multiplication of nights and amount.
mdy to transform the bookingdate to class Date.
arrange to order the tibble.
select to select the columns.
The first rows of its print should be :
A tibble: 347 x 5
bookingdate roomno guestid nights earning
date dbl dbl dbl dbl
1 20161103 101 1027 7 336
2 20161103 102 1179 2 112
3 20161103 103 1106 2 144
4 20161103 104 1238 3 168
5 20161103 105 1540 7 588
6 20161103 106 1021 3 168
7 20161103 107 1623 3 168
8 20161103 108 1136 1 56
9 20161103 109 1585 4 288
10 20161103 201 1613 6 288
??? with 337 more rows
Do not modify this line!
3. Create a tibble roomearningfiltered from roomearning.
This tibble will only have data whose roomno is
smaller than 200 and bookingdate between 20161115
and 20191215including both.
To do that, you can use:
filter to filter the roomno that is smaller than
200.
filter to filter the date.
mutate and as.character to transform the roomno
from numeric to character.
The first rows of its print should be :
A tibble: 76 x 5
bookingdate roomno guestid nights earning
date chr dbl dbl dbl
1 20161115 101 1344 1 48
2 20161115 105 1127 5 280
3 20161115 103 1041 4 288
4 20161115 109 1624 1 72
5 20161115 102 1598 2 144
6 20161115 106 1208 5 280
7 20161116 101 1185 1 48
8 20161116 109 1249 3 216
9 20161117 101 1187 5 240
10 20161117 104 1477 2 144
??? with 66 more rows
Do not modify this line!
4. Plot line plot: the earnings of room by date
using roomearningfiltered.
The x axis should be date labelled as Booking Date
with ticks 1115, 1201, 1215.
The y axis should be earning labelled as Earning dollars
with ticks 100, 200, 300 default setting.
To do that, you can use:
ggplot to initialize a ggplot object. You can set its arguments
data and mapping to plot the bookingdate and earning column
of the roomearningfiltered.
Use aes to set parameters mapping.
geomline to plot the line plot.
scalexdate to set the ticks of x axis to
1115, 1201, 1215.
labs to format the labels such that:
title Most rooms have earning around 100 to 300
x Booking date
y Earning dollars
facetwrap to be faceted by roomno.
themelight to set light theme i.e. a light backgroung.
Save the plot into roomearningplot.
Do not modify this line!
5. Create a tibble guestspending that stores the total spending and
total number of nights staying for each guest by their full name
from tibble guest and roomearning using rightjoin.
You can use :
mutate and paste to concatenate firstname and
lastname in guest.
rightjoin to join the aforementioned tibble with
roomearning.
groupby to group by name.
summarize to compute the total spending spending
and total number of nights staying nights.
ungroup to ungroup the tibble.
topn to select 10 rows with top spendings.
arrange to order the tibble by descending order of
total spending.
fctreorder to set name to factor and order the
level by ascending order of total spending.
The first rows of its print should be :
A tibble: 10 x 3
name nights spending
fct dbl dbl
1 Sir Edward Garnier 11 780
2 Robert Halfon 10 768
3 Angela Rayner 14 744
4 Karin Smyth 11 696
5 Sir Alan Haselhurst 11 680
Do not modify this line!
6. Plot the bar plot of spending by name using
dataset guestspending.
The bars should be horizontal and ordered by total spending
from highest on top of the figure to lowest.
The x axis should be Spending dollars without label
with ticks 0, 200, 400, 600, 800default setting.
The y axis should be Name with ticks
Sir Edward Garnier, Robert Halfon, Angela Rayner,
…, Craig Tracey from top to bottomdefault setting.
To do that, use:
ggplot to initialize a ggplot object.
Set the date and mapping parameter correctly.
geomcol to plot the box plot.
coordflip to flip the axises.
labs to format the labels such that:
title Top 10 guests all spent more than 600
y Spending dollars
x Name
themelight to set light theme.
Save the plot into guestspendingplot.
Do not modify this line!
5
HW7: Relational data
In this exercise, you will familiarize yourself with
relational data and its manipulation.
Throughout the exercise:
Use themelight for the plots.
Do not change the default position of the plot title.
Do not print the plot.
In this exercise, you have to recreate the figures found at
the left of the instructions.
We suggest functions you can use to create the plots, but
you are free to use the method you are the most comfortable with.
Make sure that the figures look exactly like the ones you are supposed to create.
1. Load the tidyverse, readr, and lubridate packages.
Use the function readcsv to read the dataset
booking.csv, guest.csv, rate.csv, and room.csv
from path dataguesthouse.
Store the corresponding dataframe into a tibble
booking, guest, rate, and room.
Do not modify this line!
2. Create a tibble roomearning from table booking and rate
using leftjoin.
For each bookingdate and roomno, there should be a column
earning containing the earning of this room.
The column bookingdate should be class Date.
To do that, you can use:
leftjoin to join tables booking and rate.
Note that you should join by two columns.
mutate to create a new column earning that
is the multiplication of nights and amount.
mdy to transform the bookingdate to class Date.
arrange to order the tibble.
select to select the columns.
The first rows of its print should be :
A tibble: 347 x 5
bookingdate roomno guestid nights earning
date dbl dbl dbl dbl
1 20161103 101 1027 7 336
2 20161103 102 1179 2 112
3 20161103 103 1106 2 144
4 20161103 104 1238 3 168
5 20161103 105 1540 7 588
6 20161103 106 1021 3 168
7 20161103 107 1623 3 168
8 20161103 108 1136 1 56
9 20161103 109 1585 4 288
10 20161103 201 1613 6 288
??? with 337 more rows
Do not modify this line!
3. Create a tibble roomearningfiltered from roomearning.
This tibble will only have data whose roomno is
smaller than 200 and bookingdate between 20191115
and 20191215including both.
To do that, you can use:
filter to filter the roomno that is smaller than
200.
filter to filter the date.
mutate and as.character to transform the roomno
from numeric to character.
The first rows of its print should be :
A tibble: 76 x 5
bookingdate roomno guestid nights earning
date chr dbl dbl dbl
1 20161115 101 1344 1 48
2 20161115 105 1127 5 280
3 20161115 103 1041 4 288
4 20161115 109 1624 1 72
5 20161115 102 1598 2 144
6 20161115 106 1208 5 280
7 20161116 101 1185 1 48
8 20161116 109 1249 3 216
9 20161117 101 1187 5 240
10 20161117 104 1477 2 144
??? with 66 more rows
Do not modify this line!
4. Plot line plot: the earnings of room by date
using roomearningfiltered.
The x axis should be date labelled as Booking Date
with ticks 1115, 1201, 1215.
The y axis should be earning labelled as Earning dollars
with ticks 100, 200, 300 default setting.
To do that, you can use:
ggplot to initialize a ggplot object. You can set its arguments
data and mapping to plot the bookingdate and earning column
of the roomearningfiltered.
Use aes to set parameters mapping.
geomline to plot the line plot.
scalexdate to set the ticks of x axis to
1115, 1201, 1215.
labs to format the labels such that:
title Most rooms have earning around 100 to 300
x Booking Date
y Earning dollars
facetwrap to be faceted by roomno.
themelight to set light theme i.e. a light backgroung.
Save the plot into roomearningplot.
Do not modify this line!
5. Create a tibble guestspending that stores the total spending and
total number of nights staying for each guest by their full name
from tibble guest and roomearning using rightjoin.
You can use :
mutate and paste to concatenate firstname and
lastname in guest.
rightjoin to join the aforementioned tibble with
roomearning.
groupby to group by name.
summarize to compute the total spending spending
and total number of nights staying nights.
ungroup to ungroup the tibble.
topn to select 10 rows with top spendings.
arrange to order the tibble by descending order of
total spending.
fctreorder to set name to factor and order the
level by ascending order of total spending.
The first rows of its print should be :
A tibble: 10 x 3
name nights spending
fct dbl dbl
1 Sir Edward Garnier 11 780
2 Robert Halfon 10 768
3 Angela Rayner 14 744
4 Karin Smyth 11 696
5 Sir Alan Haselhurst 11 680
Do not modify this line!
6. Plot the bar plot of spending by name using
dataset guestspending.
The bars should be horizontal and ordered by total spending
from highest on top of the figure to lowest.
The x axis should be Spending dollars without label
with ticks 0, 200, 400, 600, 800default setting.
The y axis should be Name with ticks
Sir Edward Garnier, Robert Halfon, Angela Rayner,
…, Craig Tracey from top to bottomdefault setting.
To do that, use:
ggplot to initialize a ggplot object.
Set the date and mapping parameter correctly.
geomcol to plot the box plot.
coordflip to flip the axises.
labs to format the labels such that:
title Top 10 guests all spent more than 600
y Spending dollars
x Name
themelight to set light theme.
Save the plot into guestspendingplot.
Do not modify this line!
6
HW7: relational regex ggplot
Throughout the exercise:
Do NOT use for, while or repeat loops.
Use to structure your operations.
Use themelight for the plots.
For graphs with titles, make the format as
themeplot.title elementtexthjust 0.5, plot.subtitle elementtexthjust 0.5.
1. Load the packages tidyverse and lubridate.
Use readcsv to read the datasets from data folder:
movies.csv into a tibble movie.
ratings.csv into a tibble ratings.
To check your solution, movies prints to:
A tibble: 27,254 x 3
movieId title genres
dbl chr chr
1 1 Toy Story 1995 AdventureAnimationChildrenComedyFanta..
2 2 Jumanji 1995 AdventureChildrenFantasy
3 3 Grumpier Old Men 1995 ComedyRomance
4 4 Waiting to Exhale 1995 ComedyDramaRomance
5 5 Father of the Bride Part II 1995 Comedy
6 6 Heat 1995 ActionCrimeThriller
7 7 Sabrina 1995 ComedyRomance
8 8 Tom and Huck 1995 AdventureChildren
9 9 Sudden Death 1995 Action
10 10 GoldenEye 1995 ActionAdventureThriller
??? with 27,244 more rows
ratings prints to:
A tibble: 50,000 x 4
userId movieId rating timestamp
dbl dbl dbl dbl
1 36660 280 4 834049053
2 91867 2657 5 971578971
3 107259 2566 2 944169497
4 129338 4161 2.5 1137405482
5 128693 2355 4 984620012
6 97984 4052 3.5 1112061639
7 97700 3949 4 1274039326
8 10443 94839 2 1420951497
9 81462 4105 3 1092108684
10 6031 593 4 834163850
??? with 49,990 more rows
Do not modify this line!
2. Turn timestamp in the ratings dataset into normal format
e.g. yearmonthday.
To do this, you can use:
as.POSIXct to turn timestamp into
normal format, and specify the argument origin 19700101.
withtzto change the time zone to UTC ,by passing
tzone UTC.
Store the returned dataset into ratings.
To check your result, ratings prints to:
A tibble: 50,000 x 5
userId movieId rating timestamp date
dbl dbl dbl dbl dttm
1 36660 280 4 834049053 19960606 08:17:33
2 91867 2657 5 971578971 20001015 03:02:51
3 107259 2566 2 944169497 19991202 21:18:17
4 129338 4161 2.5 1137405482 20060116 09:58:02
5 128693 2355 4 984620012 20010315 01:33:32
6 97984 4052 3.5 1112061639 20050329 02:00:39
7 97700 3949 4 1274039326 20100516 19:48:46
8 10443 94839 2 1420951497 20150111 04:44:57
9 81462 4105 3 1092108684 20040810 03:31:24
10 6031 593 4 834163850 19960607 16:10:50
??? with 49,990 more rows
Do not modify this line!
3. Create a tibble averageratingindifferentyears of dimension 20×2.
The procedure should first create a new column called yearofrating that
gets the year of the rating and converted into a factor variable. Then we
group the dataset by yearofrating and summarize the dataset such that
we calculate the average ratings of
each year and store averages in a column called averageratingofyear.
To do that, you can use:
mutate and as.factor to coerce the year of date into a
factor variable.
groupby to group by yearofrating.
summarize to summarize the mean of ratings of different year and
store the value into averageratingofyear.
To check your result, averageratingindifferentyears prints to:
A tibble: 20 x 2
yearofrating averageratingofyear
fct dbl
1 1996 3.56
2 1997 3.56
3 1998 3.49
4 1999 3.59
5 2000 3.59
6 2001 3.52
7 2002 3.51
8 2003 3.50
9 2004 3.42
10 2005 3.41
??? with 10 more rows
Do not modify this line!
4. Draw a point plot of averageratingofyear vs. yearofrating.
Name the title as 2004 and 2005 have lower average ratings,
subtitle as While 2014 has highest average ratings.
To do this, you can use:
geompoint to draw a point plot of averageratingofyear
vs. yearofrating.
labs to name the title as
2004 and 2005 have lower average ratings,
the subtitle as While 2014 has highest average ratings,
the xaxis as Year,
the yaxis as Average rating.
Store the plot into a variable g1.
Do not modify this line!
5. Extract the year of movies from title column in the movies dataset.
Store the values into a new column called year and convert values into numeric.
To do this, you can use:
mutate, mapchr and as.numeric to generate a new column called year
and transform values into numeric.
hint: pay attention to substr to extract the characters that we want.
Store the returned dataset into movies.
Optional material: The title column format is relatively clean here. What if
we have 4digit years and 2digit years mix e.g. 1995, 96, 02, 2005, etc.?
Then the method above will not work and how should we modify the code or use other methods?
Way 1: Locate the index of and then use substr to extract years.
Way 2: Use regular expression. For instance, you can combine gsub with
pattern .09, replacement 1, and x title in order to
extract the year.
Here, gsub matches to argument pattern within each element of a character
vector, and then replace the designed pattern.
The pattern is .09.
. means the string starts with a character and . means there are one or more characters.
and means there really exist parenthese. is just the syntax for escaping.
09 means there exists a string of characters that can be converted to numeric.
The replacement is 1.
1 means we only need to keep the characters that can be converted to numeric.
If interested, please go to the following website for more detials:
https:stringr.tidyverse.orgarticlesregularexpressions.html.
To check your result, movies prints to:
A tibble: 27,254 x 4
movieId title genres year
dbl chr chr dbl
1 1 Toy Story 1995 AdventureAnimationChildrenComedyFa??? 1995
2 2 Jumanji 1995 AdventureChildrenFantasy 1995
3 3 Grumpier Old Men 1995 ComedyRomance 1995
4 4 Waiting to Exhale 1995 ComedyDramaRomance 1995
5 5 Father of the Bride Part II 1??? Comedy 1995
6 6 Heat 1995 ActionCrimeThriller 1995
7 7 Sabrina 1995 ComedyRomance 1995
8 8 Tom and Huck 1995 AdventureChildren 1995
9 9 Sudden Death 1995 Action 1995
10 10 GoldenEye 1995 ActionAdventureThriller 1995
??? with 27,244 more rows
Do not modify this line!
6. Convert the genres column in movies dataset into characters.
Then separate the genres of a same movie such that the same movie is split
into several entries with different kinds of genres.
To do this, you can use:
mutate and as.character to convert genres into characters.
separaterows to split up the genres.
Store the returned dataset into movies.
To check your result, movies prints to:
A tibble: 54,374 x 4
movieId title genres year
dbl chr chr dbl
1 1 Toy Story 1995 Adventure 1995
2 1 Toy Story 1995 Animation 1995
3 1 Toy Story 1995 Children 1995
4 1 Toy Story 1995 Comedy 1995
5 1 Toy Story 1995 Fantasy 1995
6 2 Jumanji 1995 Adventure 1995
7 2 Jumanji 1995 Children 1995
8 2 Jumanji 1995 Fantasy 1995
9 3 Grumpier Old Men 1995 Comedy 1995
10 3 Grumpier Old Men 1995 Romance 1995
??? with 54,364 more rows
Do not modify this line!
7. Join the movies and ratings together by movieId. Remove movieId, userId
and timestamp columns after join.
To do this, you can use:
innerjoin to join two dataset.
select to remove specified columns.
Store returned dataset to ratingsofmovies.
To check your result, ratingsofmovies prints to:
A tibble: 132,725 x 5
title genres year rating date
chr chr dbl dbl dttm
1 Toy Story 1995 Adventure 1995 3.5 20050317 20:22:12
2 Toy Story 1995 Adventure 1995 3 19990706 18:41:32
3 Toy Story 1995 Adventure 1995 3 20090818 18:31:29
4 Toy Story 1995 Adventure 1995 4.5 20090820 06:36:02
5 Toy Story 1995 Adventure 1995 5 19970126 14:23:23
6 Toy Story 1995 Adventure 1995 5 20010118 17:22:03
7 Toy Story 1995 Adventure 1995 3.5 20040306 07:59:24
8 Toy Story 1995 Adventure 1995 4 19961024 22:12:42
9 Toy Story 1995 Adventure 1995 4 19971120 12:59:43
10 Toy Story 1995 Adventure 1995 4 20050127 20:49:25
??? with 132,715 more rows
Do not modify this line!
8. Draw a horizontal boxplot of rating vs. genres.
To do this, you can use:
geomboxplot to draw a boxplot of rating vs. genres.
labs to name the title as:
Most movie genres have median ratings from 3.5 to 4,
name the xaxis as Genres,
name the yaxis as Ratings.
coordflip to flip x and y.
Store the plot into a variable g2.
Do not modify this line!
9. Group by the genres columns in ratingsofmovies. Then summrize the mean of
rating and number of rating of different genres. Store average of rating
values into a column averagerating, number of rating
into a column numberofratings.
Finally convert the genres into a factor variable.
To do this, you can use:
groupby to group by genres.
summarize to calculate the mean of
rating and number of rating of different genres. Store average of
rating values into a column averagerating, number of rating
into a column numberofratings.
mutate and as.factor to convert genres into a factor variable.
Store the returned dataset into averageratingofgenres.
To check your result, averageratingofgenres prints to:
A tibble: 20 x 3
genres averagerating numberofratings
fct dbl int
1 no genres listed 4.5 1
2 Action 3.43 13970
3 Adventure 3.49 10875
4 Animation 3.59 2807
5 Children 3.38 4133
6 Comedy 3.42 18701
7 Crime 3.68 8265
8 Documentary 3.74 588
9 Drama 3.67 22331
10 Fantasy 3.49 5252
??? with 10 more rows
Do not modify this line!
10.Draw a point plot of averagerating vs. numberofratings, colored by genres.
Draw a smooth curve that passes through points using loess method.
Name the title as
When number of ratings is over 5000, the average ratings start to be constant around 3.5.
To do this, you can use:
geompoint to draw a point plot of averagerating vs. numberofratings,
color by genres.
geomsmooth to draw a smooth curve using method as loess.
labs to name the title as
When number of ratings is over 5000, the average ratings start to be constant around 3.5.
the xaxis as Number of ratings,
the yaxis as Average rating,
the color legend as Genres.
Store the plot into a variable g3.
Do not modify this line!
11.Calculate the interval of time passed between the date corresponding the
rating and the release year from ratingsofmovies dataset,
and store this interval into a column called interval. You can assume
that a film released in a given year was actually released on January 1st.
Then, convert the units of interval into years and store its
values into a new column called yearspassed.
Finally, filter out rows with yearspassed less or equal to 60.
To do this, you can use:
mutate to create two required columns.
hint 1: pay attention to makedatetime to convert the release year
into a date and remembers that datesdatetimes can
be substracted to create intervals
hint 2: intervals can be divided by durations, and the dyears
function can help your compute the number of years that have passed
filter to filter out rows with yearspassed less or equal to 60.
Store the returned dataset to ratingsovertime.
To check your result, ratingsovertime prints to:
A tibble: 130,087 x 7
title genres year rating date interval yearspassed
chr chr dbl dbl dttm drtn dbl
1 Toy Story 199??? Adventure 1995 3.5 20050317 20:22:12 3728.8488 da??? 10.2
2 Toy Story 199??? Adventure 1995 3 19990706 18:41:32 1647.7788 da??? 4.51
3 Toy Story 199??? Adventure 1995 3 20090818 18:31:29 5343.7719 da??? 14.6
4 Toy Story 199??? Adventure 1995 4.5 20090820 06:36:02 5345.2750 da??? 14.6
5 Toy Story 199??? Adventure 1995 5 19970126 14:23:23 756.5996 da??? 2.07
6 Toy Story 199??? Adventure 1995 5 20010118 17:22:03 2209.7236 da??? 6.05
7 Toy Story 199??? Adventure 1995 3.5 20040306 07:59:24 3352.3329 da??? 9.18
8 Toy Story 199??? Adventure 1995 4 19961024 22:12:42 662.9255 da??? 1.82
9 Toy Story 199??? Adventure 1995 4 19971120 12:59:43 1054.5415 da??? 2.89
10 Toy Story 199??? Adventure 1995 4 20050127 20:49:25 3679.8677 da??? 10.1
??? with 130,077 more rows
Do not modify this line!
12.Draw a straight line plot of rating vs. yearspassed, colored by genres.
Name the title as The average rating usually increases with time,
subtitle as One exception is animation movies,
xaxis as Years between release and rating,
yaxis as Average rating,
legend as Genres.
To do this, you can use:
ggplot to setup your plot rating vs. yearspassed, colored by genres.
geomsmooth to draw a straight line plot using method as lm.
labs to name the title The average rating usually increases with time,
subtitle as One exception is animation movies,
xaxis as Years between release and rating,
yaxis as Average rating,
color legend as Genres.
Store the plot into a variable g4.
Do not modify this line!
7
HW7: NYC stock analysis
In this exercise, you will conduct complete data analysis on NYC stock price.
Dataset consists of following files:
prices.csv: raw, asis daily prices. Most of data spans from 2010 to the end 2016,
for companies new on stock market date range is shorter.
There have been approximmately 140 stock splits in that time,
this set doesnt account for that.
securities.csv: general description of each company with division on sectors
fundamentals.csv: metrics extracted from annual SEC 10K fillings 20122016,
should be enough to derive most of popular fundamental indicators.
1. Do the following:
load the readr, dplyr and tidyr package
load coursedataprices.csv using readcsv and save it to raw.
load datasecurities.csv using readcsv and save it to sectors.
load datafundamentals.csv using readcsv and save it to fund.
raw should look like:
A tibble: 851,264 x 7
date symbol open close low high volume
dttm chr dbl dbl dbl dbl dbl
1 20160105 00:00:00 WLTW 123. 126. 122. 126. 2163600
2 20160106 00:00:00 WLTW 125. 120. 120. 126. 2386400
3 20160107 00:00:00 WLTW 116. 115. 115. 120. 2489500
4 20160108 00:00:00 WLTW 115. 117. 114. 117. 2006300
5 20160111 00:00:00 WLTW 117. 115. 114. 117. 1408600
6 20160112 00:00:00 WLTW 116. 116. 114. 116. 1098000
7 20160113 00:00:00 WLTW 116. 113. 113. 117. 949600
8 20160114 00:00:00 WLTW 114. 114. 110. 115. 785300
9 20160115 00:00:00 WLTW 113. 113. 112. 115. 1093700
10 20160119 00:00:00 WLTW 114. 110. 110. 116. 1523500
??? with 851,254 more rows
securities should look like:
A tibble: 505 x 8
Ticker symbol Security SEC filings GICS Sector
chr chr chr chr
1 MMM 3M Comp??? reports Industrials
2 ABT Abbott ??? reports Health Care
3 ABBV AbbVie reports Health Care
4 ACN Accentu??? reports Information ???
5 ATVI Activis??? reports Information ???
6 AYI Acuity ??? reports Industrials
7 ADBE Adobe S??? reports Information ???
8 AAP Advance??? reports Consumer Dis???
9 AES AES Corp reports Utilities
10 AET Aetna I??? reports Health Care
??? with 495 more rows, and 4 more variables: GICS Sub
Industry chr, Address of Headquarters chr, Date first
added date, CIK chr
fund should print to:
A tibble: 1,781 x 79
X1 Ticker Symbol Period Ending Accounts Payab???
dbl chr date dbl
1 0 AAL 20121231 3068000000
2 1 AAL 20131231 4975000000
3 2 AAL 20141231 4668000000
4 3 AAL 20151231 5102000000
5 4 AAP 20121229 2409453000
6 5 AAP 20131228 2609239000
7 6 AAP 20150103 3616038000
8 7 AAP 20160102 3757085000
9 8 AAPL 20130928 36223000000
10 9 AAPL 20140927 48649000000
??? with 1,771 more rows, and 75 more variables: Accounts
Receivable dbl, Addl incomeexpense items dbl, After Tax
ROE dbl, etc…
Do not modify this line!
2. Load the stringr and lubridate packages.
Currently, the date in raw are in UTC time. We want to convert them
to New York time zone.
To do so, you can:
use mutate to convert the date column
use forcetz to format the date, with argument
tz set to AmericaNewYork.
Save the generated tibble into rawtime.
rawtimedate should be in this format:
1 20160105 EST 20160106 EST 20160107 EST 20160108 EST..
6 20160112 EST 20160113 EST 20160114 EST 20160115 EST..
Do not modify this line!
3. Load the package forcats.
In securities, keep the companies belonging to the top 6 sectors
by frequency of occurence, as well as those whose GICS Sub Industry
falls into Gold or Real Estate i.e., GICS Sub Industry contains
either Gold or REITs.
You need to do it in three steps:
First, use mutate and factor to convert the GICS Sector
variable of securities from character to a factor. Its levels should
be the unique values of GICS Sector.
Second, create a tibble named securitiessectored that contain
only the companies that do not belong to those that you want see below.
Note that securitiessectored should contain an additional column
GICS Sector truncated that contains the top 6 factors in GICS Sector
and all the others lumped into an additional level Other.
Third, use antijoin on securities and securitiessectored to
create securitiesselected, which contains only the rows that actually
meet the requirements above by deleting the rows from securities
that are in securitiessectored.
To achieve the second step, you can use:
mutate along with fctinfreq and fctlump to
reorder the sectors by frequency of occurence and lump
all except the top 6 into a single level Other.
filter to select the sectors that do not belong to the
top 6 i.e., the ones with the level Other.
filter along with strdetect to additionally filter out
the observations whose GICS Sub Industry contains neither
Gold nor REITs. In the pattern, you can use a
regular expression with or represented by the alternation
symbol to do that.
To help you, securities is as in part 1, except that the GICS Sector
column is a factor whose levels print to
Levels: Industrials … Telecommunications Services.
securitiessectored should print to:
A tibble: 94 x 9
Ticker symbol Security SEC filings GICS Sector
chr chr chr fct
1 AES AES Corp reports Utilities
2 APD Air Pro??? reports Materials
3 ALB Albemar??? reports Materials
4 LNT Alliant??? reports Utilities
5 AEE Ameren ??? reports Utilities
6 AEP America??? reports Utilities
7 AWK America??? reports Utilities
8 APC Anadark??? reports Energy
9 APA Apache ??? reports Energy
10 T ATT Inc reports Telecommunic???
??? with 84 more rows, and 5 more variables: GICS Sub
Industry chr, Address of Headquarters chr, Date first
added date, CIK chr, GICS Sector truncated fct
securitiesselected should print to:
A tibble: 411 x 8
Ticker symbol Security SEC filings GICS Sector
chr chr chr fct
1 MMM 3M Comp??? reports Industrials
2 ABT Abbott ??? reports Health Care
3 ABBV AbbVie reports Health Care
4 ACN Accentu??? reports Information ???
5 ATVI Activis??? reports Information ???
6 AYI Acuity ??? reports Industrials
7 ADBE Adobe S??? reports Information ???
8 AAP Advance??? reports Consumer Dis???
9 AET Aetna I??? reports Health Care
10 AMG Affilia??? reports Financials
??? with 401 more rows, and 4 more variables: GICS Sub
Industry chr, Address of Headquarters chr, Date first
added date, CIK chr
Do not modify this line!
4. Convert the column name of fund from Ticker Symbol
to Ticker symbol. This makes sure there is consistency between
column names of the different tables!.
Then create new column Period Ending Year to extract the year from
Period Ending. Then Drop NA values of fund. Select columns Ticker symbol,
Period Ending Year and Gross Margin.
Save the new tibble as fundtime.
To do that, you can use:
dplyr::rename to convert the column name.
mutate to create column Period Ending Year, inside mutate:
use strreplaceall to first convert to .
mdy with tz set to AmericaNewYork to convert the date
to EDT time zone.
year to extract the year.
You can use pipe on Period Ending inside mutate
dropna to drop the rows that contain NA values.
dplyr::select to select the intested columns.
fundtime should look like:
A tibble: 1,299 x 3
Ticker symbol Period Ending Year Gross Margin
chr dbl dbl
1 AAL 2012 58
2 AAL 2013 59
3 AAL 2014 63
4 AAL 2015 73
5 AAP 2012 50
6 AAP 2013 50
7 AAP 2015 45
8 AAP 2016 45
9 AAPL 2013 38
10 AAPL 2014 39
??? with 1,289 more rows
Do not modify this line!
5. Select the following columns from securitiesselected:
Ticker symbol, Security, GICS Sector
Join the two tibbles securitiesselected and fundtime by Ticker symbol.
Drop the fundtime rows with NA if the corresponding Ticker symbol
in securitiesselected is not in fundtime.
To do that, you can use:
dplyr::select to select correponding columns.
innerjoin to automatically drop rows in a tibble when not matched
with other tibble. Set the argument by to Ticker symbol.
Save the concatenated tibble to securitiesfund.
securitiesfund should print to:
A tibble: 988 x 5
Ticker symbol Security GICS Sector Period Ending ???
chr chr fct dbl
1 MMM 3M Comp??? Industrials 2013
2 MMM 3M Comp??? Industrials 2014
3 MMM 3M Comp??? Industrials 2015
4 ABT Abbott ??? Health Care 2012
5 ABT Abbott ??? Health Care 2013
6 ABT Abbott ??? Health Care 2014
7 ABT Abbott ??? Health Care 2015
8 ABBV AbbVie Health Care 2013
9 ABBV AbbVie Health Care 2014
10 ABBV AbbVie Health Care 2015
??? with 978 more rows, and 1 more variable: Gross Margin dbl
Do not modify this line!
6. Load the library ggplot2.
Generate histograms of Gross Margin in different sectors in different periods
and assign the plot to variable grossmargin.
To do so, you can:
create the plot by calling ggplot on securitiesfund,
with mapping aes, in which argument fill should set to Period
Ending Year and x should set to Gross Margin.
adding geomhistogram with binwidth set to 10, color set to
black and fill set to orange.
then facet on GICS Sector using facetwrap.
labs to format the labels such that:
title Gross margin distributed differently in different sectors
x Gross Margin
y Count n
then add the light theme using themelight.
Do not modify this line!
7. Join the two tibbles rawtime with securitiesfund to get each
companys stock price trend in each year with its corresponding
Gross Margin in that year.
Note: the column you want to join the tables by is Ticker symbol,
make sure they have the exact same name before joining.
To do that, you can do the following:
first convert the column name symbol of rawtime to Ticker symbol.
Hint: use dplyr::rename.
use mutate, and year to extract the year of the price happening.
use dplyr::select to select only the following columns:
Ticker symbol, close, open, date,year
use leftjoin with argument by set to Ticker symbol.
use filter to delete the rows where the Perior Ending Year is not
corresponding to the date of the price.
Save the generated tibble into fullstock.
fullstock should print to:
A tibble: 246,615 x 9
Ticker symbol close open date year Security
chr dbl dbl dttm dbl chr
1 AAL 5.12 5.2 20120103 00:00:00 2012 America???
2 AAP 69.1 71.1 20120103 00:00:00 2012 Advance???
3 ABT 56.7 56.6 20120103 00:00:00 2012 Abbott ???
4 ADS 103. 103. 20120103 00:00:00 2012 Allianc???
5 AKAM 32.9 33.0 20120103 00:00:00 2012 Akamai ???
6 ALK 73.9 76.4 20120103 00:00:00 2012 Alaska ???
7 AME 42.2 43.3 20120103 00:00:00 2012 AMETEK ???
8 AMT 58.8 60.5 20120103 00:00:00 2012 America???
9 APH 46.0 46.5 20120103 00:00:00 2012 Ampheno???
10 ARNC 9.23 8.94 20120103 00:00:00 2012 Arconic???
??? with 246,605 more rows, and 3 more variables: GICS
Sector fct, Period Ending Year dbl, Gross Margin dbl
Do not modify this line!
8. Generate the stock close price trend plot in 20102016 of the
following company:
Aetna Inc, Amazon.com Inc, Facebook, Whole Foods Market,
FedEx Corporation, Boeing Company, The Walt Disney Company.
To do that, please do data manipulations first and create
a tibble filteredcompany. You can use:
filter and in to filter the selected companies.
filteredcompany should print to:
A tibble: 5,292 x 9
Ticker symbol close open date year Security
chr dbl dbl dttm dbl chr
1 AMZN 257. 256. 20130102 00:00:00 2013 Amazon.???
2 BA 77.1 76.6 20130102 00:00:00 2013 Boeing ???
3 DIS 51.1 50.8 20130102 00:00:00 2013 The Wal???
4 FB 28 27.4 20130102 00:00:00 2013 Facebook
5 FDX 94.2 93.5 20130102 00:00:00 2013 FedEx C???
6 WFM 92.0 93.1 20130102 00:00:00 2013 Whole F???
7 AMZN 258. 257. 20130103 00:00:00 2013 Amazon.???
8 BA 77.5 77.0 20130103 00:00:00 2013 Boeing ???
9 DIS 51.2 51.0 20130103 00:00:00 2013 The Wal???
10 FB 27.8 27.9 20130103 00:00:00 2013 Facebook
??? with 5,282 more rows, and 3 more variables: GICS Sector fct,
Period Ending Year dbl, Gross Margin dbl
Do not modify this line!
9. Generate the stock close price trend plot in 2016 of the following company:
Aetna Inc, Amazon.com Inc, Facebook, Extra Space Storage,
FedEx Corporation, JPMorgan Chase Co., Oracle Corp..
To do that, use:
ggplot on filteredcompany, with mapping aes,
in which argument y should set to close and x should set to date.
geomlineaes in which color is set to Security.
labs to format the labels such that:
title Six company daily stock close price from 2010 2016
x date
y Daily close price USD
use themelight to set the theme.
Do not modify this line!
10. Caculate the annual Rate of Return RoR on the securities in fullstock.
Rate of Return is defined as the net gain or loss on an investment
over a specified time period, calculated as a percentage of
the investment???s initial cost.
Namely, RoR current value initial value initial value
To calculate this index on securities in fullstock, you can:
group the stock prices by Period Ending Year and Ticker symbol
using groupby
select the record of start of the year and end of the year
by using filter to select date equal to mindate or
maxdate
mutate the date to open if it is equal to mindate,
otherwise close. use ifelse inside mutate
use pivotlonger to extract open and close from date.
To do that, inside pivotlonger, set copen, close
as first argument, and then namesto as status, valuesto
as price. This will add two columns recording the opening
and closing price for each row.
use filter to select the right status for each row by
condition date status.
usesummarize to calculate the RoR for each stock in
each year. Inside summarize, use diff to calculate the
price difference of open price and close price and divide the
difference by price1 which represents the open price.
use leftjoin to add the annual RoR to securitiesfund
by joining with securitiesfund on Ticker symbol and
Period Ending Year.
Set by to cTicker symbol, Period Ending Year.
use droplevels drop the unselected sectors.
Save the generated tibble into returnstock.
returnstock should print to:
A tibble: 980 x 6
Groups: Period Ending Year 5
Period Ending ??? Ticker symbol Return Security GICS Sector
dbl chr dbl chr fct
1 2012 AAL 1.60 America??? Industrials
2 2012 AAP 0.0170 Advance??? Consumer Dis???
3 2012 ABT 0.158 Abbott ??? Health Care
4 2012 ADS 0.406 Allianc??? Information ???
5 2012 AKAM 0.241 Akamai ??? Information ???
6 2012 ALK 0.436 Alaska ??? Industrials
7 2012 AME 0.133 AMETEK ??? Industrials
8 2012 AMT 0.278 America??? Real Estate
9 2012 APH 0.390 Ampheno??? Information ???
10 2012 ARNC 0.0291 Arconic??? Industrials
??? with 970 more rows, and 1 more variable: Gross Margin dbl
Do not modify this line!
11. Calculate the mean, 0.25 quantile and 0.75 quantile of Return for
each GICS Sector.
To do that, use:
groupby to group the stocks by GICS Sector.
summarize to calculate meanreturn using mean,
25 quantile q1 using quantile with probs set to 0.25,
75 quantile q2 using quantile with probs set to 0.75.
mutate to reorder the factor GICS Sector using fctreorder
according to meanreturn.
Save the generated tibble into summarystock.
The first four lines of summarystock should print to:
A tibble: 8 x 4
GICS Sector meanreturn q1 q2
fct dbl dbl dbl
1 Industrials 0.153 0.0291 0.315
2 Health Care 0.192 0.0325 0.323
3 Information Technology 0.196 0.00665 0.350
4 Consumer Discretionary 0.121 0.0513 0.294
Do not modify this line!
8
HW7: email
In this exercise, you will perform data analysis with emails within 184 people
from year 1998 to 2001.
1. Lets first read in the required datasets.
Load the readr package.
Use readcsv to load the people.csv data set from data folder
and assign it to a tibble people. This data set contains information
about the 184 people who sent emails between each other.
Use readcsv to load the email.csv data set from data folder
and assign it to a tibble email. This data set contains information about
each email sent: time, sender and receiver.
Do not modify this line!
2. onset variable in email is the time when the email is sent, but we can
see that it is shown in a weird way. It turns out that it represents how
many seconds from the start time. Lets convert it to normal
time stamps.
Note: 19980101 is encoded as 883612800 in onset.
Load the dplyr package.
Load the lubridate package
Use mutate to create a new variable time in email dataset.
We can obtain time by following the next three steps:
Substract the onset by 883612800 to get the seconds difference with
19980101.
Use as.POSIXct to change the seconds difference into normal timestamp.
Specify the argument origin 19980101.
Use withtzto change the time zone to UTC ,by passing
tzone UTC.
Save your generated tibble into emailwtime whose first few rows should
look like:
A tibble: 38,131 x 9
onset terminus tail head onset.censored terminus.censor??? duration edge.id
dbl dbl dbl dbl lgl lgl dbl dbl
9.58e8 9.58e8 30 30 FALSE FALSE 0 1
9.59e8 9.59e8 30 30 FALSE FALSE 0 1
9.59e8 9.59e8 30 30 FALSE FALSE 0 1
9.64e8 9.64e8 30 30 FALSE FALSE 0 1
9.70e8 9.70e8 30 30 FALSE FALSE 0 1
9.70e8 9.70e8 30 30 FALSE FALSE 0 1
9.73e8 9.73e8 30 30 FALSE FALSE 0 1
9.74e8 9.74e8 30 30 FALSE FALSE 0 1
9.79e8 9.79e8 30 30 FALSE FALSE 0 1
9.85e8 9.85e8 30 30 FALSE FALSE 0 1
??? with 38,121 more rows, and 1 more variable: date dttm
Do not modify this line!
3. Now lets take a look into the people dataset.
We can see some missing values in personname column, but we can get a
persons name using hisher emailid.
For example, Albert Meyerss email ID is just albert.meyers.
To fill in missing values with email ID, lets first create a function
emailidtoname to tranform albert.meyers into Albert Meyers.
You need to:
Load the stringr package.
Load the purrr package.
Create a function function emailidtoname that take a string input
named emailid and returns the name extracted from the email id.
To do that, you can use:
strsplit to extract the first and last name from emailid,
split by ..
mapchar and paste0 to combine the first and last name using
collapse .
strtrim to remove whitespace from start and end of string.
strtotitle to capitalize each word.
Name your.
Do not modify this line!
4. Now lets implement the function you just created to our people tibble.
If personname is not missing, do not modify it. If it is missing, change
personname to the output of emailidtoname by taking emailid as
input, you can use emailid emailidtoname.
To do that, you can use:
mutate and ifelse to change the column personname.
is.na to check if the personname is missing.
Save your generated tibble into peoplenew, whose first few rows should
look like:
A tibble: 184 x 5
vertex.names emailid personname role dept
dbl chr chr chr chr
1 albert.meyers Albert Meyers Employee Specialist
2 a..martin Thomas Martin Vice President NA
3 andrea.ring Andrea Ring NA NA
4 andrew.lewis Andrew Lewis Director NA
5 andy.zipper Andy Zipper Vice President Enron Online
6 a..shankman Jeffrey Shankman President Enron Global Mkts
7 barry.tycholiz Barry Tycholiz Vice President NA
8 benjamin.rogers Benjamin Rogers Employee Associate
9 bill.rapp Bill Rapp NA NA
10 bill.williams Bill Williams NA NA
??? with 174 more rows
Do not modify this line!
5. We still have NA in role and dept. This time, we want to fill in the
missing values in role with Employee and missing values in dept as
General. Lets create a new tibble peoplenew2 that fills such gaps…
and more!
First, load the tidyr package and create a vector roleorder
cEmployee, Trader, Manager, Managing Director, Director,
In House Lawyer, Vice President, President, CEO
Now, can fill the missing roles and make it a factor using the levels
in roleorder. To do that, you can use:
replacena to fill in the missing values in these two columns.
Remember the missing values in role should now become Employee
and the missing values in dept should now become General.
mutate to change role into a factor with factor and
specify levels roleorder to change role into a factor according
to our order.
Save your generated tibble into peoplenew2 whose first few rows should
look like:
A tibble: 184 x 5
vertex.names emailid personname role dept
dbl chr chr fct chr
1 albert.meyers Albert Meyers Employee Specialist
2 a..martin Thomas Martin Vice President General
3 andrea.ring Andrea Ring Employee General
4 andrew.lewis Andrew Lewis Director General
5 andy.zipper Andy Zipper Vice President Enron Online
6 a..shankman Jeffrey Shankman President Enron Global Mkts
7 barry.tycholiz Barry Tycholiz Vice President General
8 benjamin.rogers Benjamin Rogers Employee Associate
9 bill.rapp Bill Rapp Employee General
10 bill.williams Bill Williams Employee General
??? with 174 more rows
Do not modify this line!
6. Lets combine the two datasets emailwtime and peoplenew2 together,
we want to keep the information about every email and add the name, email ID,
role and department for the sender as well as for the receiver.
The numbers in tail and head represent different people. The key to join
this two tibbles are tail and head from emailwtime, and vertex.names
from peoplenew2. In other words, you need to use two leftjoin to
add first the information of the receiver and then of the sender.
You will also need to update the names of the columns added by the joins.
To add information of a receiver, you can use:
leftjoin to combine emailwtime and peoplenew2, specify
by ctailvertex.names because tail represents the receiver
of an email.
rename to change the column names to specify that theyre the
receiver information: change emailid to receiveremail, personname
to receiver, role to receiverrole and dept to receiverdept.
Then, to add information of a sender, you can similarly use:
leftjoin to combine with peoplenew again, this time with
by cheadvertex.names .
rename to change the column names to specify that theyre the
sender information: change emailid to senderemail, personname
to sender, role to senderrole and dept to senderdept.
Finally, you can use:
select to only keep the date, sender information and receiver
information each has 4 columns with email ID, name, role and department.
startswith to select all the four columns of sender receiver
information.
Save your generated tibble into t1, which should have 9 columns, in the order of
date, senderemail sender, senderrole, senderdept, receiveremail,
receiver, receiverrole, receiverdept.
The first fews rows of t1 should look like:
A tibble: 38,131 x 9
date senderemail sender senderrole senderdept receiveremail
dttm chr chr fct chr chr
20000515 08:35:00 debra.perli??? Debra??? Employee General debra.perling???
20000518 04:15:00 debra.perli??? Debra??? Employee General debra.perling???
20000524 02:58:00 debra.perli??? Debra??? Employee General debra.perling???
20000719 07:09:00 debra.perli??? Debra??? Employee General debra.perling???
20000928 02:45:00 debra.perli??? Debra??? Employee General debra.perling???
20000928 02:52:00 debra.perli??? Debra??? Employee General debra.perling???
20001027 04:38:00 debra.perli??? Debra??? Employee General debra.perling???
20001110 02:52:00 debra.perli??? Debra??? Employee General debra.perling???
20010105 03:17:00 debra.perli??? Debra??? Employee General debra.perling???
20010323 02:02:00 debra.perli??? Debra??? Employee General debra.perling???
??? with 38,121 more rows, and 3 more variables: receiver chr,
receiverrole fct, receiverdept chr
Do not modify this line!
7. We noticed that there are emails that one person sent to himher self
with the same sendemail and receiveremail. We do not care about these
emails and want to filter them out.
Use filter to filter out rows in t1 whose senderemail is exactly
the same as receiveremail.
To simply our analysis with time the email is sent, we want to further
parse the information in date by creating new columns named year, month,
day and hour. To do that, you can use:
mutate to create the new columns year, month, day and hour.
year to extract year in date.
month to extract month in date.
day to extract day in date.
hour to extract hour in date.
Save your generated tibble into t2. The first few rows should look like:
A tibble: 34,427 x 13
date senderemail sender senderrole senderdept receiveremail
dttm chr chr fct chr chr
20010315 02:43:00 jeffrey.sha??? Jeffr??? President Enron Glob??? greg.whalley
20010402 13:44:00 jeffrey.sha??? Jeffr??? President Enron Glob??? greg.whalley
20010605 22:40:00 jeffrey.sha??? Jeffr??? President Enron Glob??? greg.whalley
20010611 05:20:00 jeffrey.sha??? Jeffr??? President Enron Glob??? greg.whalley
20010308 02:52:00 kim.ward Kim W??? Employee General jason.williams
20010328 07:40:00 kim.ward Kim W??? Employee General jason.williams
20010328 18:40:00 kim.ward Kim W??? Employee General jason.williams
20010402 07:53:00 kim.ward Kim W??? Employee General jason.williams
20010402 17:53:00 kim.ward Kim W??? Employee General jason.williams
20010403 07:24:00 kim.ward Kim W??? Employee General jason.williams
??? with 34,417 more rows, and 7 more variables: receiver chr,
receiverrole fct, receiverdept chr, year dbl, month dbl, day int,
hour int
Now we have all the information we want. Lets try to answer some insteresting
questions in the following exercises:
Do not modify this line!
8. Who are the top 3 people who sent emails the most? What are their roles and
departments?
To answer this, you can use:
groupby to group by sender, senderrole and senderdept
summarize by specify countn to count the number of emails
arrange to sort the tibble by count of emails in decreasing order
head to extract the first three rows.
Save your generated tibble into p1 which should have the following structure:
A tibble: 3 x 4
sender senderrole senderdept count
chr fct chr int
Do not modify this line!
9. During which period of day are people tend to send emails?
Use:
ggplot to initialize a ggplot object.
Set its arguments data and mapping.
geomhistogram to plot a histogram for hour.
Specify bins 24 to set the bins
labs to format the labels such that:
title People send more emails during noon.
x Hour.
y Countn.
themelight to change the theme of plots.
theme to change the subtitle to the middle of the plot as well.
Set its argument plot.title using elementtexthjust 0.5.
Save your plot to g1.
Do not modify this line!
10. What is the trend of using emails? Do people use it more frequently in 1999 or 2001?
Lets first create a tibble to store the information.
Use:
filter to only keep the emails sent before 20020101.
groupby to group the data by year, month sender and
senderrole
summarize such that
datemindate keep record of the earliest date in each group.
countn the number of emails sent in that period.
arrange to order the rows by date.
Save your generated tibble into p2.
Do not modify this line!
11.Then lets visualize it. We want to plot the count against date
with colors spliting by senderrole.
Use:
ggplot to initialize a ggplot object.
Set its arguments data, mapping.
geompoint to plot a histogram for hour.
geomsmooth to add a smoothing regression line.
labs to format the labels such that:
title People are using emails more frequently in 2001 than 1999.
x Date.
y Countn,
color Sender Role
themelight to change the theme of plots.
theme to change the subtitle to the middle of the plot as well.
Set its argument plot.title using elementtexthjust 0.5.
Save your plot to g2.
Do not modify this line!