ETX2250/ETF5922: Data Visualization and Analytics
Data manipulation
Lecturer:
Department of Econometrics and Business Statistics
Week 03
2/67
Data Import
Data often stored in a single table.
Columns are separated by a comma or a tab.
The readr package is very useful for reading les into R as data frame objects.
3/67
Data frame
A data frame is an object with:
Each row corresponding to an observation or case, Each column corresponding to a variable.
Two types of data frames in R are: The data.frame in base R The tibble in the tidyverse
4/67
The readr package
Two of the more useful functions are read_csv and read_tsv for comma delimited and tab delimited les respectively.
Generally the defaults for this function work quite well.
One argument that is worth discussing a little is col_types
5/67
Reading in long numbers
The largest integer that many computers will accurately store is about 19 digits long.
Some identi cation numbers are longer than that (for instance IBAN numbers used for bank transfers). In some cases, R may try to read these numbers in as integers.
To avoid errors these should be read in as characters.
6/67
Column types
Specifying col_types=’c’ overrides the default behaviour of readr. You can use this to force everything to be read in as a character.
If needed these can subsequently be converted to numeric variables.
7/67
Getting Data
Although data will sometimes be provided to you, it is useful to know some ways to get data off the web. This can be integrated into you R work ow.
Techniques range from commands to download les to more sophisticated web scraping.
Websites can change so always make sure to save data as well.
8/67
Downloading les in R
As an example consider wholesale electricity prices which can be downloaded online.
9/67
Downloading les in R
The download.file function in the utils package can be used to download a csv le.
download.file(‘https://www.aemo.com.au/aemo/data/nem/priceanddemand/PRICE_AND_D
‘data/data_202006_NSW.csv’)
#url loc is (all one line)
#’https://www.aemo.com.au/aemo/data/
#nem/priceanddemand/PRICE_AND_DEMAND_
#202006_NSW1.csv’
dat<-read_csv('data/data_202006_NSW.csv')
The rst argument is the URL the second argument is the location where you want to store the data.
10/67
E
Bene ts
Location were data is retrieved is kept. Usually URL are created systematically.
Data for Victoria will have VIC as part of URL.
Data for May 2020 will have 202005 as part of URL.
A large number of les can be downloaded and combined using a loop.
11/67
Web Scraping
Some websites do not simply include csv les in the URL. Instead the data is embedded in the html of the website itself. The R package rvest can be used to scrape data from websites. The example on the next slides scrape data from Yahoo Finance.
12/67
Data from Yahoo Finance
13/67
library(rvest) html<-read_html('https://au.finance.yahoo.com/quote/AAPL/analysis?p=AAPL') tab<-html_table(html)
print(tab[[1]])
## # A tibble: 5 x 5
## `Earnings estimat~ `Current qtr. (Jun 2~ `Next qtr. (Sep 2~ `Current year
##
## 1 No. of analysts
## 2 Avg. Estimate
## 3 Low estimate
## 4 High estimate
## 5 Year ago EPS
27 4
14/67
( d 0 5 4 5 3
Web scraping
This is only the tip of the iceberg
Data from many websites will not be as easy to download.
Some knowledge of html can be useful when using the html_node and html_attr functions. Also useful is the Selector Gadget tool.
Comply with a website’s Terms and Conditions when web scraping especially if doing so for commercial reasons.
15/67
Data is messy
Data
Data rarely comes in the fomat we want: May not want to use all variables, May not want to use all observations, May need to transform variables.
17/67
A few simple functions
Much can be done with a few simple functions from the dplyr package: Choose variables with select
Choose observations with filter
Transform variables with mutate
In all cases both input and output is a data frame.
18/67
Diamonds
The diamonds data is a tibble
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
##
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
19/67
Select variables
The select function can be used if we only want to focus on a subset of variables. For the diamonds dataset we may only be interested in carat, cut and price.
selected<-select(diamonds,carat, cut, price)
selected
## # A tibble: 53,940 x 3
## carat cut price
##
## 1 0.23 Ideal 326
## 2 0.21 Premium 326
## 3 0.23 Good 327
## 4 0.29 Premium 334
## 5 0.31 Good 335
## 6 0.24 Very Good 336
## 7 0.24 Very Good 336
## 8 0.26 Very Good 337
20/67
Select variables
To drop variables, use a – sign
select(diamonds,-carat, -cut, -price)
## # A tibble: 53,940 x 7
## color clarity depth table x y z
##
## 1 E
## 2 E
## 3 E
## 4 I VS2 62.4 58 4.2 4.23 2.63
SI2 61.5
SI1 59.8
VS1 56.9
SI2 63.3
VVS2 62.8
VVS1 62.3
SI1 61.9
VS2 65.1
VS1 59.4
58 4.34 4.35 2.75
57 3.94 3.96 2.48
57 3.95 3.98 2.47
55 4.07 4.11 2.53
61 3.87 3.78 2.49
61 4 4.05 2.39
## 5 J
## 6 J
## 7 I
## 8 H
## 9 E
## 10 H
## # … with 53,930 more rows
21/67
55 3.95 3.98 2.43
61 3.89 3.84 2.31
65 4.05 4.07 2.31
Filter variables
Suppose we only want to consider diamonds worth more than $1000
filter(diamonds,(price>1000))
## # A tibble: 39,416 x 10
1 0.7 Ideal
color clarity depth table price x y z
E SI1 62.5 57 2757 5.7 5.72 3.57
##
##
##
## 2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
carat cut
## 3 0.7 Ideal G
## 4 0.71 Very Good E
## 5 0.78 Very Good G
## 6 0.7 Good E
## 7 0.7 Good F
## 8 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
## 9 0.73 Very Good E SI1 61.6 59 2760 5.77 5.78 3.56
## 10 0.8 Premium H SI1 61.5 58 2760 5.97 5.93 3.66
## # … with 39,406 more rows
VS2 61.6
VS2 62.4
SI2 63.8
VS2 57.5
VS1 59.4
56 2757 5.7 5.67 3.5
57 2759 5.68 5.73 3.56
56 2759 5.81 5.85 3.72
58 2759 5.85 5.9 3.38
62 2759 5.71 5.76 3.4
22/67
Logical stamtents
The term (price>1000) is an example of a logical statement. It can be true or false. Other examples are
Price less than 1000 (price<1000)
Price less than or equal to 1000 (price<=1000) Price not 1000 (price!=1000)
Price exactly 1000 (price==1000)
Note two equals signs.
In general ! is the negation of a statement.
23/67
And operator
If two statements need to be satis ed use & filter(diamonds,
(price>1000)&(cut==’Ideal’))
## # A tibble: 14,700 x 10
## carat cut color clarity depth table price x y z
##
## 1 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
## 2 0.7 Ideal G VS2 61.6 56 2757 5.7 5.67 3.5
## 3 0.74 Ideal G SI1 61.6 55 2760 5.8 5.85 3.59
## 4 0.8 Ideal I VS1 62.9 56 2760 5.94 5.87 3.72
## 5 0.75 Ideal G SI1 62.2 55 2760 5.87 5.8 3.63
## 6 0.74 Ideal I
## 7 0.81 Ideal F
## 8 0.59 Ideal E
## 9 0.8 Ideal F SI2 61.4 57 2761 5.96 6 3.67
## 10 0.74 Ideal E SI2 62.2 56 2761 5.8 5.84 3.62
VVS2 62.3
SI2 58.8
VVS2 62
55 2761 5.77 5.81 3.61
57 2761 6.14 6.11 3.6
55 2761 5.38 5.43 3.35
24/67
Or operator
If either one or the other statement needs to be satis ed use | filter(diamonds,(cut==’Ideal’)|color==’E’)
## # A tibble: 27,445 x 10
## carat cut color clarity depth table price x y z
##
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 5 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 7 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
## 8 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
## 9 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## 10 0.23 Very Good E VS2 63.8 55 352 3.85 3.92 2.48
## # … with 27,435 more rows
25/67
In operator
Another useful operator is %in% filter(diamonds,
(cut %in% c(‘Ideal’,’Fair’)))
## # A tibble: 23,161 x 10
## carat cut color clarity depth table price x y z
##
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 3 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 4 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 5 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## 6 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 7 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
## 8 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## 9 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
## 10 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
26/67
Not operator
Use ! as not filter(diamonds,
!(cut %in% c(‘Ideal’,’Fair’)))
## # A tibble: 30,779 x 10
## carat cut color clarity depth table price
##
## 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 3 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 8 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## 9 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## 10 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
x y z
27/67
Your turn
Write R code for the following:
price greater than 2000 and carat less than 3.
price greater than 2000 and cut either Very Good, Premium or Ideal. carat less than 2 or price greater than 500.
carat less than 2 and cut not equal to Premium.
28/67
Creating New Variables
Mpg
Mpg is also a tibble
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl cl
##
## 1 audi
## 2 audi
## 3 audi
## 4 audi
## 5 audi
## 6 audi
## 7 audi
## 8 audi
## 9 audi
## 10 audi
## # … with 224 more rows