ETX2250/ETF5922: Data Visualization and Analytics
Basic Visualization
Lecturer:
Department of Econometrics and Business Statistics
Week 2
The grammar of graphics
At first, using ggplot2 can seem too complicated.
Once mastered, it can be used to create detailed plots very easily.
It is built on the ideas of Grammar of Graphics a text by .
The objective is to find an abstract set of rules for creating almost any graphic.
Data
The starting point for all visualisation is a dataset.
In these slides, we will consider the datasets diamonds, mpg and economics which come built in with
the ggplot2 package.
Later on we learn how to read in data.
The diamonds data contains data on the price, size and quality of over 50000 diamonds.
Aes and Geom
Think of an aesthetic (or aes) as a way of perceiving a variable:
Position on x or y axis
Color
Size
Think of a geometry (or geom) as a way of representing a variable:
Points
Lines
ggplot2 maps variables to aesthetics and draws them with geometries.
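As a minimal sketch of these ideas (not taken from the slides), the call below maps three variables from the built-in mpg data to the x position, y position and color aesthetics, and uses points as the geometry. The choice of displ, hwy and class is only for illustration.

```r
library(ggplot2)

# Map variables to aesthetics: displ -> x, hwy -> y, class -> color.
# geom_point() then draws one point per row of mpg.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()
p
```

Swapping geom_point() for another geom keeps the same mapping but changes how it is drawn.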
Histogram
Histogram
Consider a histogram of the variable price
In a histogram, values of the variable we are interested in lie along the horizontal (x) axis. The histogram creates bins then counts the number of observations in each bin.
To get started type
library(ggplot2)
ggplot(data = diamonds,mapping = aes(x=price))
What do we see?
We do have an x axis with a label price and some values. Otherwise we see nothing.
We need to add a geometry to the plot.
We do this with the geom_histogram function.
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram()
What do we see?
Modification
Suppose we want to use a different number of bins or change the color of the bins.
These are not features of the data or the aes.
These are features of the geom.
So these are controlled by arguments in the geom_histogram function.
Change bins
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(bins = 5)
Change boundary
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(bins = 5, boundary=0)
Change binwidth
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(binwidth = 500)
Change color
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(binwidth = 500,fill = 'red')
Change border color
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(binwidth = 500,color='white',
fill = 'blue')
An aside on colour
Customise colour
Many colours come built in to R.
In some cases you may wish to select your own color.
Customising colour requires appreciating how a computer understands color. We will do this by looking at RGB hex codes.
Using this system, to a computer #ff0000 is red.
The RGB system
One color model used by computers encodes every colour by the amount of red, green and blue light mixed to make that colour.
This is called the RGB color model.
A value between 0 and 255 indicates the strength of red, green and blue. These values between 0 and 255 are represented in two hexadecimal digits.
Hexadecimal
In hexadecimal: a is ten, b is eleven, c is twelve… f is fifteen.
Take the first digit, multiply it by 16 and add the second digit.
Hexadecimal is used since each digit corresponds to 4 bits in computer memory.
Examples
10 in hexadecimal is 1 × 16 + 0 = 16 in decimal
1a in hexadecimal is 1 × 16 + 10 = 26 in decimal
2b in hexadecimal is 2 × 16 + 11 = 43 in decimal
What is e4 in decimal?
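These conversions can be checked in R. The sketch below uses the base function strtoi, which reads a string as an integer in a given base; the input strings are the worked examples from this slide.

```r
# Convert two-digit hexadecimal strings to decimal (base 16 -> base 10).
hex_values <- c("10", "1a", "2b")
decimals <- strtoi(hex_values, base = 16L)
decimals
# 16 26 43

# The same rule by hand for "2b": first digit times 16 plus second digit.
2 * 16 + 11
# 43
```

The same call can be used to check your answer for e4.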
Color picker
One online tool to find the hex code of a color is here.
Suppose we want the histogram to be this brown color.
The hex code is #b35900, which is 179 out of 255 red, 89 out of 255 green and no blue.
This can be provided as a string to the fill or color argument of geom_histogram.
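To see how R itself decodes a hex code, base R provides col2rgb, which returns the red, green and blue strengths on the 0 to 255 scale, and rgb, which goes the other way. This is a small sketch using the brown color from this slide; the object names are only for illustration.

```r
# Decode the hex string into red, green and blue strengths (0-255).
brown <- col2rgb("#b35900")
brown["red", 1]    # 179
brown["green", 1]  # 89
brown["blue", 1]   # 0

# Rebuild the hex code from the three strengths.
rebuilt <- rgb(179, 89, 0, maxColorValue = 255)
rebuilt            # "#B35900"
```

R accepts hex codes in either upper or lower case, so the rebuilt string can be passed to fill or color directly.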
Brown histogram
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(binwidth = 500,color='white',
fill = '#b35900')
Finding hex codes
It is useful to know hex codes since at times you may want to match colors for a specific purpose. For instance, you may want the colors to match the brand colors of a client.
For example, a simple online search tells us that Coca-Cola red is #f40000.
The green color worn by NBA team the Milwaukee Bucks is #00471b.
Histograms (Bucks Colors)
ggplot(data = diamonds,mapping = aes(x=price))+
geom_histogram(fill='#00471b',color='#eee1c6')
An exercise
Find the hex codes for one or more colors associated with:
A brand you like, or
A sports team you like, or
Your country’s flag, or
Anything else
Construct a histogram of the variable carat with these colors.
Density plot
Density
For a smoother version of a histogram we can use a different geom called geom_density. This in fact computes a kernel density estimate of the variable.
The level of smoothness is controlled by a bandwidth parameter
All the computation is done by ggplot2.
Density plot
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density()
Density plot (thicker)
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density(size=3)
How is density calculated?
The kernel density estimate is a popular nonparametric technique that estimates a density as
$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i)$$
Here, $K_h$ is a kernel function that depends on a bandwidth $h$.
Uniform kernel
The simplest kernel function is the uniform kernel
$$K_h(u) = \begin{cases} 1/h & \text{if } |u| < h \\ 0 & \text{otherwise} \end{cases}$$
At a point $x$, the estimated density is proportional to the number of points that are close to $x$. By close, we mean within $h$ units of $x$.
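The following is a minimal sketch of the estimator with the uniform kernel, written directly from the definitions on these slides rather than from ggplot2's internals. The function names and the toy sample are hypothetical. With the 1/h scaling used here the result is proportional to a density; it is meant to show the counting idea, not to replace geom_density.

```r
# Uniform kernel: 1/h when |u| < h, and 0 otherwise.
uniform_kernel <- function(u, h) ifelse(abs(u) < h, 1 / h, 0)

# Kernel density estimate at each point in x:
# the average of the kernel evaluated at (x - x_i) over the sample.
kde <- function(x, sample, h) {
  sapply(x, function(x0) mean(uniform_kernel(x0 - sample, h)))
}

toy <- c(1, 2, 2.5, 6)          # hypothetical sample of four points
est <- kde(c(2, 6, 10), toy, h = 1)
est
# 0.50 0.25 0.00
```

At x = 2 two of the four sample points lie strictly within one unit, at x = 6 only one does, and at x = 10 none do, which is why the estimates fall from 0.5 to 0.25 to 0.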
Extremes
If the bandwidth gets extremely large then for any $x$, all sample points are considered close. The formula for the kernel density becomes a flat line.
If the bandwidth gets extremely small then for any $x$ we choose, the density is just the number of points in the sample equal to $x$.
The kernel density is made up of spikes at the sample points.
Defaults
By default, geom_density
Uses a Gaussian kernel
Selects the bandwidth using Silverman's rule of thumb
The same principles apply:
Large bandwidth leads to more smoothness
Small bandwidth leads to more bumpiness
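As a sketch of what these defaults mean in practice, the base R function bw.nrd0 implements Silverman's rule of thumb, so the value it returns can be inspected and passed to geom_density explicitly. The variable name bw_default is just for illustration.

```r
library(ggplot2)

# Bandwidth chosen by Silverman's rule of thumb for the price variable.
bw_default <- bw.nrd0(diamonds$price)
bw_default

# Passing the bandwidth and the Gaussian kernel explicitly should give
# the same picture as calling geom_density() with no arguments.
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_density(bw = bw_default, kernel = "gaussian")
```

Multiplying bw_default by a number larger or smaller than one is an easy way to try smoother or bumpier versions.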
Density plot: Low bandwidth
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density(bw=100)
Density plot: High bandwidth
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density(bw=2000)
Density plot: Low bandwidth
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density(bw=0.0001)
Density plot: High bandwidth
ggplot(data = diamonds,mapping = aes(x=price))+
geom_density(bw=80000)
Summary
With both histograms and density plots
If the bin width or bandwidth is too small the plot may look bumpy. This can exaggerate features that are not significant.
If the bin width or bandwidth is too large the plot may smooth over important features like local modes.
Always try a few different values of bin width or bandwidth.
Finding outliers
Outliers
Histograms and density plots give a good idea of shape and local modes. Sometimes they can obscure outliers.
For finding outliers a rug plot can be useful.
For finding outliers while still getting a good idea of skew, boxplots can be useful.
We can investigate using the variable carat.
Carat: Histogram
ggplot(data = diamonds,mapping = aes(x=carat))+
geom_histogram()
Carat: Rug plot
ggplot(data = diamonds,mapping = aes(x=carat))+
geom_rug()
Box plot
The box plot summarises 5 numbers:
Median
First quartile $Q_1$
Third quartile $Q_3$
Upper fence $U = Q_3 + 1.5 \times (Q_3 - Q_1)$
Lower fence $L = Q_1 - 1.5 \times (Q_3 - Q_1)$
Anything lying outside the fences is represented as dots.
When no points lie outside the fence, the fence is set to the maximum or minimum.
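The fence formulas can be evaluated by hand before drawing the plot. The sketch below does this for carat using base R's quantile function; the object names are illustrative, and no particular output is claimed here.

```r
library(ggplot2)

# First and third quartiles of carat, and the interquartile range.
q1 <- quantile(diamonds$carat, 0.25, names = FALSE)
q3 <- quantile(diamonds$carat, 0.75, names = FALSE)
iqr <- q3 - q1

# Lower and upper fences, using the multiplier 1.5.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Number of observations outside the fences, which would be drawn as dots.
n_outside <- sum(diamonds$carat < lower_fence | diamonds$carat > upper_fence)
c(lower = lower_fence, upper = upper_fence, outside = n_outside)
```

Replacing 1.5 with another multiplier shows how many points would still be flagged under a wider or narrower definition.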
Carat: Boxplot
ggplot(data = diamonds,mapping = aes(y=carat))+
geom_boxplot()
Change of aesthetic
Notice that the aesthetic changed!
In the boxplot, the value of the variable is represented by the vertical (or y axis).
We can change the definition of the upper and lower fence by passing the coef argument to geom_boxplot.
This changes the 1.5 used in calculating the fence to whatever you specify.
Changing fences
ggplot(data = diamonds,mapping = aes(y=carat))+
geom_boxplot(coef=4)
Notches
Notches can be added to a boxplot.
These are set to
$$\frac{1.58 \times (Q_3 - Q_1)}{\sqrt{n}}$$
on either side of the median.
This roughly gives a 95% confidence interval for the median.
We will use a smaller dataset on the mileage of cars for this example to clearly illustrate the notches.
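As a rough check of the notch formula, the sketch below computes the interval by hand for the cty variable in mpg, assuming the notch is centred on the median. The object names are illustrative.

```r
library(ggplot2)

med <- median(mpg$cty)   # centre of the notch
iqr <- IQR(mpg$cty)      # Q3 - Q1
n   <- length(mpg$cty)   # number of observations

# Half-width of the notch: 1.58 * (Q3 - Q1) / sqrt(n).
half_width <- 1.58 * iqr / sqrt(n)
notch <- c(lower = med - half_width, upper = med + half_width)
notch
```

Because the half-width shrinks with the square root of n, a dataset as large as diamonds gives notches that are too narrow to see clearly.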
Notches
ggplot(data = mpg,mapping = aes(y=cty))+
geom_boxplot(notch = TRUE)
One Non-Metric Variable
Nominal v Ordinal
Non-metric variables are made up of nominal and ordinal variables.
Nominal variables have no ordering in the categories of data:
Manufacturer of car (Audi, Toyota, etc).
Ordinal variables do have an ordering in the categories:
Quality of diamonds (Fair, Good, etc).
Non-metric variables in R
Non-metric variables can be stored in R as
Character variables (nominal data)
Factors (nominal data)
Ordered factors (ordinal data)
You can check with the str function.
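The three storage types can be compared on a small made-up vector. The sketch below is only an illustration; the values and object names are hypothetical and not taken from the diamonds data.

```r
# A hypothetical character vector of quality labels.
quality <- c("Good", "Fair", "Ideal", "Fair")

# A factor treats the labels as categories with no ordering (nominal).
nominal <- factor(quality)

# An ordered factor also records which category ranks higher (ordinal).
ordinal <- factor(quality, levels = c("Fair", "Good", "Ideal"), ordered = TRUE)

is.ordered(nominal)       # FALSE
is.ordered(ordinal)       # TRUE
ordinal[1] > ordinal[2]   # TRUE: "Good" ranks above "Fair"

str(quality)
str(nominal)
str(ordinal)
```

Comparisons such as > only make sense for the ordered factor, which is why the distinction matters when plotting.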
Diamonds data
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Mpg data
str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 ...
## $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv         : chr [1:234] "f" "f" "f" "f" ...
## $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl          : chr [1:234] "p" "p" "p" "p" ...
## $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
Bar plot
A common plot for non-metric data is the bar plot for the frequency of observations for each level of the factor.
The height of each bar indicates the number of observations in a particular category. This can be done using geom_bar
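The counts behind the bars can be inspected directly. As a small sketch, base R's table function produces the same frequencies that geom_bar computes for the cut variable; the name counts is just for illustration.

```r
library(ggplot2)

# Number of diamonds in each level of cut; these are the bar heights.
counts <- table(diamonds$cut)
counts

# The counts add up to the number of rows in the data.
sum(counts) == nrow(diamonds)
```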
Bar plot
ggplot(data = diamonds, mapping = aes(x=cut))+
geom_bar()
Bar plot
ggplot(data = mpg, mapping = aes(x=manufacturer))+
geom_bar()
Mosaic plot
Used to visualize the counts of two discrete variables (can be difficult to read!).
Need to use an additional package, ggmosaic.
library(ggmosaic)
ggplot(data = mpg)+
geom_mosaic(mapping = aes(x = product(drv,cyl), fill = drv))
Two Continuous Variables
What to look for
Outliers
Dependence or correlation
Remember that correlation does not imply causation!
Non-linear relationships.
Scatter plot
For two metric variables use a scatter plot.
One variable is represented by the x aesthetic.
The other is represented by the y aesthetic.
The geometry we use is geom_point.
We will continue to use the diamonds dataset.
Scatterplot
ggplot(data = diamonds,
mapping = aes(x=carat,y=price))+geom_point()
Overplotting
When using big datasets, sometimes the points cover one another or are too close. This is sometimes called overplotting.
Some solutions:
Try smaller points (size)
Try more transparent points (alpha)
Try a different geom
Changing size
ggplot(data = diamonds,
mapping = aes(x=carat,y=price))+
geom_point(size=0.1)
Changing alpha
ggplot(data = diamonds,
mapping = aes(x=carat,y=price))+
geom_point(alpha=0.2)
Changing geom
ggplot(data = diamonds,
mapping = aes(x=carat,y=price))+
geom_bin2d()
Hexagonal bins
ggplot(data = diamonds,
mapping = aes(x=carat,y=price))+
geom_hex()
Changing geom
ggplot(data = diamonds,
mapping = aes(x=carat,y=price))+
geom_density2d()
Time series plots
When the x variable is time, it often makes more sense to join dots with a line. This way we can see
Trend
Seasonality
Outliers
Structural break
Economics dataset
We will use the economics dataset (comes with ggplot2)
str(economics)
## spec_tbl_df [574 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date    : Date[1:574], format: "1967-07-01" "1967-08-01" ...
## $ pce     : num [1:574] 507 510 516 512 517 ...
## $ pop     : num [1:574] 198712 198911 199113 199311 199498 ...
## $ psavert : num [1:574] 12.6 12.6 11.9 12.9 12.8 11.8 11.7 12.3 11.7 12.3 ...
## $ uempmed : num [1:574] 4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
## $ unemploy: num [1:574] 2944 2945 2958 3143 3066 ...
Notice date is its own type of variable.
Unemployed persons
ggplot(economics, aes(x=date, y=unemploy))+
geom_line()
An aside on log scales
Scale
For variables that are heavily skewed it can be better to look at a log scale. For a regular scale you add as you move up the scale.
For a log scale you multiply as you move up the scale.
The log scale has the effect of putting more distance between smaller values and compressing higher values.
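A small numeric sketch may help here. The values below are made up for illustration: each is ten times the previous one, so the gaps grow on the regular scale but are constant after taking logs.

```r
# Hypothetical values that grow by a factor of 10 each time.
values <- c(1, 10, 100, 1000)

# On the regular scale the gaps between neighbours get larger and larger.
diff(values)
# 9 90 900

# On the log10 scale every multiplication by 10 is a step of the same size.
log_steps <- diff(log10(values))
log_steps
# 1 1 1
```

This is why small values are spread out and large values are pulled together when scale_x_log10 or scale_y_log10 is used.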
Regular scale
ggplot(data = diamonds,
mapping = aes(x=carat,y=price))+
geom_point()
Log scale
ggplot(data = diamonds,
mapping = aes(x=carat,y=price))+
geom_point()+scale_x_log10()+scale_y_log10()
Metric and Non-Metric Data
Side by side plots
When one variable is metric and the other non-metric we can easily put plots next to one another side by side.
Simply map the non-metric variable to the x aesthetic and the metric variable to the y aesthetic.
Boxplots
ggplot(data = diamonds,
mapping = aes(x=cut,y=price))+
geom_boxplot()
Change axes
ggplot(data = diamonds,
mapping = aes(x=price,y=cut))+
geom_boxplot()
With notches
Recall that the notches provide a confidence interval around the median.
These are particularly useful when comparing boxplots to one another.
In general, if the confidence intervals overlap then the medians are not significantly different. This is NOT a formal test, but still gives a useful indication.
Boxplots (no overlap)
ggplot(data = mpg,
mapping = aes(x=drv,y=hwy))+
geom_boxplot(notch=TRUE)
Boxplots (some overlap)
Violin plot
A violin plot is a newer visualisation.
A kernel density is mirrored then arranged vertically. Specify the same way but use geom_violin
Violin plot
ggplot(data = diamonds,
mapping = aes(x=cut,y=price))+
geom_violin()
Violin plot
ggplot(data = diamonds,
mapping = aes(x=cut,y=price))+
geom_violin()+coord_flip()
Jittering
A scatter plot can be used for non-metric data but can easily suffer from overplotting (one point on another).
ggplot(data = mpg,
mapping = aes(x=cyl,y=cty))+
geom_point()
Jittering
Add random noise by jittering
ggplot(data = mpg,
mapping = aes(x=cyl,y=cty))+
geom_point(position = 'jitter')
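The amount of noise can also be controlled. The sketch below uses position_jitter in place of the 'jitter' shortcut; the width, height and seed values are assumptions chosen for illustration, not values from the slides.

```r
library(ggplot2)

# Jitter only horizontally (height = 0) so that cty keeps its true value,
# and fix the seed so the same plot is produced each time.
p_jitter <- ggplot(data = mpg, mapping = aes(x = cyl, y = cty)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))
p_jitter
```

Keeping height at zero is a design choice: the random noise then separates points within each cyl group without changing the mileage that is read from the y axis.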
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.