ETX2250/ETF5922: Data Visualization and Analytics
Basic Visualization
Lecturer:
Department of Econometrics and Business Statistics 
 Week 2

The grammar of graphics
At first, using ggplot2 can seem too complicated.
Once mastered, it can be used to create detailed plots very easily.
It is built on the ideas of the text The Grammar of Graphics.
The objective is to find an abstract set of rules for creating almost any graphic.
2/87

Data
The starting point for all visualisation is a dataset.
In these slides, we will consider the datasets diamonds, mpg and economics, which come built in with the ggplot2 package.
Later on we will learn how to read in data.
The diamonds data contains data on the price, size and quality of over 50000 diamonds.
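A quick way to confirm that these datasets are available is to load the package and inspect them. This is only a sketch and assumes ggplot2 is already installed; the choice of inspection functions is ours.

```r
library(ggplot2)

# Number of rows and columns in each built-in dataset
dim(diamonds)
dim(mpg)
dim(economics)

# First few rows and the variable names of diamonds
head(diamonds, 3)
names(diamonds)
```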
3/87

Aes and Geom
Think of an aesthetic (or aes) as a way of perceiving a variable:
Position on x or y axis
Color
Size
Think of a geometry (or geom) as a way of representing a variable:
Points
Lines
ggplot maps aesthetics to geometries
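One way to see the separation between aes and geom is to keep the aesthetic mapping fixed and swap only the geometry. The sketch below does this with the economics data; the object names are made up for illustration.

```r
library(ggplot2)

# The mapping says which variables are perceived through which aesthetics
base_plot <- ggplot(data = economics, mapping = aes(x = date, y = unemploy))

# The geom says how the mapped values are drawn
points_plot <- base_plot + geom_point()
lines_plot  <- base_plot + geom_line()

points_plot   # print either object to draw it
```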
4/87

Histogram

Histogram
Consider a histogram of the variable price
In a histogram, values of the variable we are interested in lie along the horizontal (x) axis. The histogram creates bins then counts the number of observations in each bin.
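The binning and counting step can be sketched by hand in base R. The prices and the bin width of 1000 below are made up purely for illustration.

```r
# A handful of made-up prices
prices <- c(300, 450, 700, 1200, 1300, 2600)

# Create bins of width 1000, then count the observations in each bin
bins <- cut(prices, breaks = seq(0, 3000, by = 1000))
counts <- table(bins)
counts   # 3 observations in the first bin, 2 in the second, 1 in the third
```

A histogram draws one bar per bin, with the bar height equal to the count.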
To get started type
ggplot(data = diamonds,mapping = aes(x=price))
6/87

What do we see?
7/87

What do we see?
We do have an x axis with a label price and some values. Otherwise we see nothing.
We need to add a geometry to the plot.
We do this with the geom_histogram function.
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram()
8/87

What do we see?
9/87

Modification
Suppose we want to use a different number of bins or change the color of the bins.
These are not features of the data or the aes.
These are features of the geom.
So these are controlled by arguments in the geom_histogram function.
10/87

Change bins
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(bins = 5)
11/87

Change boundary
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(bins = 5, boundary=0)
12/87

Change binwidth
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(binwidth = 500)
13/87

Change color
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(binwidth = 500,fill = 'red')
14/87

Change border color
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(binwidth = 500,color='white',
fill = 'blue')
15/87

An aside on colour

Customise colour
Many colours come built in to R.
In some cases you may wish to select your own color.
Customising colour requires appreciating how a computer understands color. We will do this by looking at RGB hex codes.
Using this system, to a computer #ff0000 is red.
17/87

The RGB system
One color model used by computers encodes every colour by the amount of red, green and blue light mixed to make that colour.
This is called the RGB color model.
A value between 0 and 255 indicates the strength of red, green and blue. These values between 0 and 255 are represented in two hexadecimal digits.
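Base R can build the hex code from the three strengths. The sketch below uses the rgb function from grDevices with maxColorValue = 255 so that the inputs are on the 0 to 255 scale; the variable names are ours.

```r
# Full red, no green, no blue
red_hex <- rgb(255, 0, 0, maxColorValue = 255)
red_hex    # "#FF0000"

# No red, no green, full blue
blue_hex <- rgb(0, 0, 255, maxColorValue = 255)
blue_hex   # "#0000FF"
```

R prints the codes in upper case; upper-case and lower-case hex codes describe the same color.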
18/87

Hexadecimal
In hexadecimal:
a is ten,
b is eleven,
c is twelve…
f is fifteen.
To convert a two-digit hexadecimal number to decimal, take the first digit, multiply by 16 and add the second digit.
Hexadecimal is used since each digit corresponds to 4 bits in computer memory.
19/87

Examples
10 in hexadecimal is 1 × 16 + 0 = 16 in decimal
1a in hexadecimal is 1 × 16 + 10 = 26 in decimal
2b in hexadecimal is 2 × 16 + 11 = 43 in decimal
What is e4 in decimal?
20/87
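These conversions can be checked in R. The sketch below uses strtoi from base R, which parses a string in a given base; the helper name hex_to_dec is made up for this example.

```r
# Convert hexadecimal strings to decimal integers
hex_to_dec <- function(hex) strtoi(hex, base = 16L)

hex_to_dec(c("10", "1a", "2b"))   # 16 26 43

# The same rule by hand for "2b": first digit times 16 plus the second digit
2 * 16 + 11                        # 43
```

The same helper can be used to check your own answer for e4.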

Color picker
One online tool to find the hex code of a color is here.
Suppose we want the histogram to be this brown color.
The hex code is #b35900, which is 179 out of 255 red, 89 out of 255 green and no blue.
This can be provided as a string to the fill or color argument of geom_histogram.
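The breakdown of a hex code into its red, green and blue strengths can be verified with col2rgb from grDevices. This is a small sketch; the variable name brown is ours.

```r
# Decompose the hex code into red, green and blue strengths (0 to 255)
brown <- col2rgb("#b35900")
brown

# The same strengths as fractions of the maximum value 255
brown[, 1] / 255
```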
21/87

Brown histogram
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(binwidth = 500,color='white',
fill = '#b35900')
22/87

Finding hex codes
It is useful to know hex codes since at times you may want to match colors for a specific purpose. For instance you may want the colors to match the brand colors of a client.
For example, a simple online search tells us that Coca-Cola red is #f40000.
The green color worn by NBA team the Milwaukee Bucks is #00471b.
23/87

Histograms (Bucks Colors)
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(fill='#00471b',color='#eee1c6')
24/87

An exercise
Find the hex codes for a color(s) associated with:
A brand you like, or
A sports team you like, or
Your country’s flag, or
Anything else
Construct a histogram of the variable carat with these colors.
25/87

Density plot

Density
For a smoother version of a histogram we can use a different geom called geom_density. This in fact computes a kernel density estimate of the variable.
The level of smoothness is controlled by a bandwidth parameter
All the computation is done by ggplot2.
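To get a feel for the bandwidth parameter, the sketch below computes a rule-of-thumb bandwidth with bw.nrd0 from the stats package and passes smaller and larger values to geom_density through its bw argument. Halving and doubling are arbitrary choices for illustration, and the object names are ours.

```r
library(ggplot2)

# Rule-of-thumb bandwidth for price, on the same scale as the data
bw_rule <- bw.nrd0(diamonds$price)
bw_rule

# Smaller bandwidth gives a bumpier curve, larger bandwidth a smoother one
bumpy  <- ggplot(diamonds, aes(x = price)) + geom_density(bw = bw_rule / 2)
smooth <- ggplot(diamonds, aes(x = price)) + geom_density(bw = bw_rule * 2)

bumpy   # print either object to draw it
```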
27/87

Density plot
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density()
28/87

Density plot (thicker)
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density(size=3) # ggplot2 3.4.0 and later prefer linewidth = 3 for line thickness
29/87

How is density calculated?
The kernel density estimate is a popular nonparametric technique that estimates a density as
$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i)$$
Here, $K_h()$ is a kernel function that depends on a bandwidth $h$.
30/87
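As a rough illustration of the formula, the sketch below evaluates the estimate at a single point by averaging kernel values. The function names kde_at and box_kernel, the sample values, and the choice of a simple box-shaped kernel (1/h within h units, 0 otherwise) are assumptions made for this example; it is not a description of how ggplot2 computes the estimate internally.

```r
# Box-shaped kernel used purely for illustration: 1/h when |u| < h, else 0
box_kernel <- function(u, h) ifelse(abs(u) < h, 1 / h, 0)

# Estimate at a single point x: the average of K_h(x - x_i) over the sample
kde_at <- function(x, sample, h, kernel = box_kernel) {
  mean(kernel(x - sample, h))
}

sample_points <- c(0.5, 2, 2.5, 7)

# With h = 1, two of the four points lie within 1 unit of x = 2
kde_at(2, sample_points, h = 1)   # 0.5

# With h = 2, three of the four points count, each contributing 1/2
kde_at(2, sample_points, h = 2)   # 0.375
```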

Uniform kernel
The simplest kernel function is the uniform kernel
$$K_h(u) = \begin{cases} 1/h & \text{if } |u| < h \\ 0 & \text{otherwise} \end{cases}$$
At a point $x$, the estimated density is proportional to the number of points that are close to $x$. By close, we mean within $h$ units of $x$.
31/87

Extremes
If the bandwidth gets extremely large then for any $x$, all sample points are considered close. The kernel density estimate becomes a flat line.
If the bandwidth gets extremely small then for any $x$ we choose, the density is just the number of points in the sample equal to $x$. The kernel density is made up of spikes at the sample points.
32/87

Defaults
By default, geom_density
Uses a Gaussian kernel
Selects the bandwidth using Silverman's rule of thumb
The same principles apply:
Large bandwidth leads to more smoothness
Small bandwidth leads to more bumpiness
33/87

Density plot: Low bandwidth
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density(bw=100)
34/87

Density plot: High bandwidth
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density(bw=2000)
35/87

Density plot: Low bandwidth
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density(bw=0.0001)
36/87

Density plot: High bandwidth
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density(bw=80000)
37/87

Summary
With both histograms and density plots
If the bin width or bandwidth is too small the plot may look bumpy. This can exaggerate features that are not significant.
If the bin width or bandwidth is too large the plot may smooth over important features like local modes.
Always try a few different values of bin width or bandwidth.
38/87

Finding outliers

Outliers
Histograms and density plots give a good idea of shape and local modes.
Sometimes they can obscure outliers.
For finding outliers a rug plot can be useful
For finding outliers while still getting a good idea of skew, boxplots can be useful.
We can investigate using the variable carat
40/87

Carat: Histogram
ggplot(data = diamonds,mapping = aes(x=carat))+
geom_histogram()
41/87

Carat: Rug plot
ggplot(data = diamonds,mapping = aes(x=carat))+
geom_rug()
42/87

Box plot
The box plot summarises 5 numbers
Median
First quartile $Q_1$
Third quartile $Q_3$
Upper Fence $U = Q_3 + 1.5 \times (Q_3 - Q_1)$
Lower Fence $L = Q_1 - 1.5 \times (Q_3 - Q_1)$
Anything lying outside the fences is represented as dots.
When no points lie outside the fence, the fence is set to the maximum or minimum.
43/87

Carat: Boxplot
ggplot(data = diamonds,mapping = aes(y=carat))+
geom_boxplot()
44/87

Change of aesthetic
Notice that the aesthetic changed!
In the boxplot, the value of the variable is represented by the vertical (or y) axis.
We can change the definition of the upper and lower fence by passing the coef argument to geom_boxplot.
This changes the 1.5 used in calculating the fence to whatever you specify
45/87

Changing fences
ggplot(data = diamonds,mapping = aes(y=carat))+
geom_boxplot(coef=4)
46/87

Notches
Notches can be added to a boxplot
These are set to $\frac{1.58 \times (Q_3 - Q_1)}{\sqrt{n}}$
This roughly gives a 95% confidence interval for the median.
We will use a smaller dataset on the mileage of cars for this example to clearly illustrate the notches.
47/87

Notches
ggplot(data = mpg,mapping = aes(y=cty))+
geom_boxplot(notch = TRUE)
48/87

One Non-Metric Variable

Nominal v Ordinal
Non-metric variables are made up of nominal and ordinal variables.
Nominal variables have no ordering in the categories of data: Manufacturer of car (Audi, Toyota, etc).
Ordinal variables do have an ordering in the categories: Quality of diamonds (Fair, Good, etc).
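In R, the nominal versus ordinal distinction corresponds to unordered versus ordered factors. The sketch below uses a few made-up values; the variable names maker and quality are ours, and the level order follows the cut variable in diamonds.

```r
# Nominal: categories with no inherent order
maker <- factor(c("audi", "toyota", "audi"))

# Ordinal: categories with an order, stored as an ordered factor
quality <- factor(c("Good", "Fair", "Ideal"),
                  levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                  ordered = TRUE)

is.ordered(maker)      # FALSE
is.ordered(quality)    # TRUE

# Comparisons respect the level order of an ordered factor
quality < "Premium"    # TRUE TRUE FALSE
```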
50/87

Non-metric variables in R
Non-metric variables can be stored in R as
Character variables (nominal data)
Factors (nominal data)
Ordered factors (ordinal data)
You can check with the str function
51/87

Diamonds data
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
52/87

Mpg data
str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 20...
## $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv         : chr [1:234] "f" "f" "f" "f" ...
## $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl          : chr [1:234] "p" "p" "p" "p" ...
## $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
53/87

Bar plot
A common plot for non-metric data is the bar plot for the frequency of observations for each level of the factor.
The height of each bar indicates the number of observations in a particular category.
This can be done using geom_bar
54/87

Bar plot
ggplot(data = diamonds, mapping = aes(x=cut))+
geom_bar()
55/87

Bar plot
ggplot(data = mpg, mapping = aes(x=manufacturer))+
geom_bar()
56/87

Mosaic plot
Used to visualize the counts of two discrete variables (can be difficult to read!)
Need to use an additional package ggmosaic
library(ggmosaic)
ggplot(data = mpg)+
geom_mosaic(mapping = aes(x = product(drv,cyl), fill = drv))
57/87

Two Continuous Variables

What to look for
Outliers
Dependence or correlation
Remember that correlation does not imply causation!
Non linear relationships.
59/87

Scatter plot
For two metric variables use a scatter plot
One variable is represented by the x aesthetic
The other is represented by the y aesthetic
The geometry we use is geom_point.
We will continue to use the diamonds dataset
60/87

Scatterplot
ggplot(data = diamonds, mapping = aes(x=carat,y=price))+geom_point()
61/87

Overplotting
When using big datasets, sometimes the points cover one another or are too close.
This is sometimes called overplotting.
Some solutions:
Try smaller points (size)
Try more transparent points (alpha)
Try a different geom
62/87

Changing size
ggplot(data = diamonds, mapping = aes(x=carat,y=price))+
geom_point(size=0.1)
63/87

Changing alpha
ggplot(data = diamonds, mapping = aes(x=carat,y=price))+
geom_point(alpha=0.2)
64/87

Changing geom
ggplot(data = diamonds, mapping = aes(x=carat,y=price))+
geom_bin2d()
65/87

Hexagonal bins
ggplot(data = diamonds, mapping = aes(x=carat,y=price))+
geom_hex()
66/87

Changing geom
ggplot(data = diamonds, mapping = aes(x=carat,y=price))+
geom_density2d()
67/87

Time series plots
When the x variable is time, it often makes more sense to join dots with a line.
This way we can see
Trend
Seasonality
Outliers
Structural break
68/87

Economics dataset
We will use the economics dataset (comes with ggplot2)
str(economics)
## spec_tbl_df [574 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date    : Date[1:574], format: "1967-07-01" "1967-08-01" ...
## $ pce     : num [1:574] 507 510 516 512 517 ...
## $ pop     : num [1:574] 198712 198911 199113 199311 199498 ...
## $ psavert : num [1:574] 12.6 12.6 11.9 12.9 12.8 11.8 11.7 12.3 11.7 12.3 ...
## $ uempmed : num [1:574] 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
## $ unemploy: num [1:574] 2944 2945 2958 3143 3066 ...
Notice date is its own type of variable
69/87

Unemployed persons
ggplot(economics, aes(x=date, y=unemploy))+
geom_line()
70/87

An aside on log scales

Scale
For variables that are heavily skewed it can be better to look at a log scale.
For a regular scale you add as you move up the scale.
For a log scale you multiply as you move up the scale.
The log scale has the effect of putting more distance between smaller values and compressing higher values.
72/87

Regular scale
ggplot(data = diamonds, mapping = aes(x=carat,y=price))+
geom_point()
73/87

Log scale
ggplot(data = diamonds, mapping = aes(x=carat,y=price))+
geom_point()+scale_x_log10()+scale_y_log10()
74/87

Metric and Non-Metric Data

Side by side plots
When one variable is metric and the other non-metric we can easily put plots next to one another side by side.
Simply map the non-metric variable to the x aesthetic and the metric variable to the y aesthetic.
76/87

Boxplots
ggplot(data = diamonds, mapping = aes(x=cut,y=price))+
geom_boxplot()
77/87

Change axes
ggplot(data = diamonds, mapping = aes(x=price,y=cut))+
geom_boxplot()
78/87

With notches
Recall that the notches provide a confidence interval around the median.
These are particularly useful when comparing boxplots to one another.
In general, if the confidence intervals overlap then the medians are not significantly different.
This is NOT a formal test, but still gives a useful indication.
79/87

Boxplots (no overlap)
ggplot(data = mpg, mapping = aes(x=drv,y=hwy))+
geom_boxplot(notch=TRUE)
80/87

Boxplots (some overlap)
81/87

Violin plot
A violin plot is a newer visualisation.
A kernel density is mirrored then arranged vertically.
Specify the same way but use geom_violin
82/87

Violin plot
ggplot(data = diamonds, mapping = aes(x=cut,y=price))+
geom_violin()
83/87

Violin plot
ggplot(data = diamonds, mapping = aes(x=cut,y=price))+
geom_violin()+coord_flip()
84/87

Jittering
A scatter plot can be used for non-metric data but can easily suffer from overplotting (one point on another).
ggplot(data = mpg, mapping = aes(x=cyl,y=cty))+
geom_point()
85/87

Jittering
Add random noise by jittering
ggplot(data = mpg, mapping = aes(x=cyl,y=cty))+
geom_point(position = 'jitter')
86/87

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.