Stat 260, Lecture 3, Data Visualization II
David Stenning
1 / 29
Load packages
library(ggplot2)
data(diamonds)
library(dplyr)
library(gapminder)
data(gapminder)
gapminder <- mutate(gapminder,
log10Pop = log10(pop),
log10GdpPercap = log10(gdpPercap))
2 / 29
References
I Chapter 3 of online text.
I ggplot2 cheatsheet at
[https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf]
I Wickham (2009) ggplot2: Elegant graphics for data analysis,
Chapters 4 and 5. Book available at
[https://ggplot2-book.org/]
I Chang (2012) R graphics cookbook. Available at
[http://www.cookbook-r.com/Graphs/]
I R Graph Gallery (goes beyond ggplot). Available at
[https://www.r-graph-gallery.com/]
3 / 29
More ggplot Examples
The best way to learn data visualization is to look at examples, such as
those in the references on the previous slide. In this lecture, we will
go through more examples to illustrate working with ggplot layers:
I data,
I aesthetic mapping,
I geom,
I statistical transformation and
I position adjustment
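As a rough sketch of how these five components fit together, the layer() function in ggplot2 lets you name each one explicitly; geom_point() is a shortcut that fills in the geom, stat and position for you. The object names p_explicit and p_short are ours, for illustration only.

```r
library(ggplot2)

# Spell out all five layer components for a scatterplot of the diamonds data.
p_explicit <- ggplot() +
  layer(
    data     = diamonds,                   # data
    mapping  = aes(x = carat, y = price),  # aesthetic mapping
    geom     = "point",                    # geom
    stat     = "identity",                 # statistical transformation (none)
    position = "identity"                  # position adjustment (none)
  )

# The usual shortcut: geom_point() supplies the same geom, stat and position.
p_short <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
```

Printing either object should draw the same scatterplot of price versus carat.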
4 / 29
Diamonds dataset
p <- ggplot(diamonds,aes(x=carat,y=price,colour=cut)) +
geom_point()
names(p)
## [1] "data" "layers" "scales" "mapping" "theme"
## [6] "coordinates" "facet" "plot_env" "labels"
p
[Figure: scatterplot of price versus carat for the diamonds data, points coloured by cut (Fair, Good, Very Good, Premium, Ideal).]
5 / 29
Data
I The data must be a data frame
I A copy of the data is stored in the plot object (later changes to the source
data frame do not change the plot)
I Possible to change the data of a plot object with %+%, as in
set.seed(123)
subdiamonds <- sample_n(diamonds,size=100)
p <- p %+% subdiamonds
p
[Figure: scatterplot of price versus carat for the 100 sampled rows in subdiamonds, points coloured by cut.]
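A small sketch of the two points about data, using a hypothetical toy data frame (toy, q and q_new are illustrative names, not from the lecture): the plot object keeps the data it was built with, and %+% swaps in a new data frame.

```r
library(ggplot2)

# Hypothetical toy data frame, for illustration only.
toy <- data.frame(x = 1:3, y = c(2, 4, 6))
q <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# Modify the source data frame after the plot object has been built.
toy$y <- toy$y * 100

q$data$y              # the plot still holds the y values it was built with
q_new <- q %+% toy    # explicitly replace the plot's data
q_new$data$y          # the new plot holds the modified y values
```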
6 / 29
Different data in different layers
I Can specify data for a layer to use.
p + geom_smooth(data=diamonds, method="lm")
[Figure: subdiamonds scatterplot of price versus carat, coloured by cut, with a linear smooth for each cut fitted to the full diamonds data.]
7 / 29
Aesthetic mappings
I We have seen that we can specify a default mapping in the
initialization, or specific mappings in the layers
I Warning: Specifying a mapping twice can have unexpected
consequences!
p + geom_point(aes(color=clarity))
[Figure: scatterplot of price versus carat for subdiamonds. The legend is still titled cut but mixes the levels of cut and clarity: Fair, Good, Ideal, IF, Premium, SI1, SI2, Very Good, VS1, VS2, VVS1, VVS2.]
8 / 29
Exercise
Write R code to produce a scatterplot of price versus carat, with
different colors for values of clarity, using the subdiamonds
dataset.
9 / 29
Setting vs mapping
I An alternative to mapping aesthetics to variables is to set the
aesthetic to a constant.
I Set with the layer parameter, rather than mapping with aes()
I E.G., set the color of points on a plot to dark blue
ggplot(subdiamonds,aes(x=carat,y=price)) + geom_point(color="darkblue")
[Figure: scatterplot of price versus carat for subdiamonds, all points dark blue.]
I Exercise: Redo the above plot with the mapping aes(color="darkblue") in
geom_point() rather than the parameter color="darkblue".
10 / 29
Grouping
I Many geoms in ggplot2 group observations (rows of the dataframe).
I E.G., a boxplot of a quantitative variable by a categorical
variable groups observations by the categorical variable.
I Default group is combinations (interaction) of all categorical
variables in the aesthetic specification.
I If this is not what you want, or if there are no categorical variables,
specify group yourself.
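To see the default grouping at work, here is a minimal sketch with a hypothetical toy data frame in which the unit identifier id is numeric, so there is no categorical variable to group by. layer_data() returns the data a layer actually draws, including its group column.

```r
library(ggplot2)

# Hypothetical toy data: 3 units (id) measured at 4 time points each.
toy <- data.frame(
  id = rep(1:3, each = 4),
  t  = rep(1:4, times = 3),
  y  = c(1, 2, 3, 4,  2, 2, 3, 5,  4, 3, 2, 1)
)

# No categorical variable in the mapping: all rows fall into one group,
# so geom_line() joins every point into a single line.
one_line <- ggplot(toy, aes(x = t, y = y)) + geom_line()

# Specifying group ourselves gives one line per unit.
three_lines <- ggplot(toy, aes(x = t, y = y, group = id)) + geom_line()

# Number of distinct groups each layer ends up with.
length(unique(layer_data(one_line)$group))
length(unique(layer_data(three_lines)$group))
```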
11 / 29
gapminder data again: Grouping to plot time series
I For plotting time series (multiple measurements on each
observational unit) we want to group by observational unit.
ggplot(gapminder,aes(x=year,y=lifeExp,group=country)) +
geom_line(alpha=.5)
[Figure: life expectancy (lifeExp) versus year, one semi-transparent line per country.]
12 / 29
Different groups on different layers
I To add summaries by continent, we need to specify different groups
on different layers.
ggplot(gapminder,aes(x=year,y=lifeExp,color=continent)) +
geom_line(aes(group=country),alpha=0.2) +
geom_smooth(aes(group=continent),se=FALSE)
[Figure: lifeExp versus year, faint lines per country and a smooth per continent, coloured by continent (Africa, Americas, Asia, Europe, Oceania).]
I Exercise: Redo the above, but remove the mapping in geom_line() and specify
group=country in the default mapping. What grouping variable is used by
geom_line()? What grouping variable is used by geom_smooth()?
13 / 29
Using interaction() to specify groups
I Could do boxplots of life expectancy by year and continent. (Not
recommended – too busy – just using for illustration.)
ggplot(gapminder,aes(x=year,y=lifeExp,
group=interaction(year,continent),color=continent)) +
geom_boxplot()
[Figure: boxplots of lifeExp for each year and continent, coloured by continent (Africa, Americas, Asia, Europe, Oceania).]
14 / 29
Overriding group on a layer
I Boxplots of life expectancy by year and continent on one layer,
smoother by continent alone on another.
ggplot(gapminder,aes(x=year,y=lifeExp, group=interaction(year,continent),
color=continent)) + geom_boxplot() +
geom_smooth(aes(group=continent), se=FALSE)
[Figure: boxplots of lifeExp by year and continent, with a smooth for each continent overlaid, coloured by continent.]
15 / 29
Geoms
I These are the shapes you want on the plot.
I See the list of geoms on the cheatsheet.
I Each has a set of aesthetics that are required for drawing and a set of
aesthetics that it understands.
I E.G., geom_point() requires x and y position, and understands
color, size and shape.
I Aesthetics can also be passed to geoms as parameters.
I Recall difference between geom_point(color="darkblue")
and geom_point(aes(color="darkblue"))
I Each geom also has a default statistic (stat) and positional
adjustment.
I More on these next.
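One way to check these properties for a particular geom is to inspect the ggplot2 objects directly. The sketch below relies on the Geom objects (e.g. GeomPoint) and the layer fields stat and position that ggplot2 exposes; output is omitted because it can differ between ggplot2 versions.

```r
library(ggplot2)

# Aesthetics geom_point() requires, and those it has defaults for.
GeomPoint$required_aes
names(GeomPoint$default_aes)

# Default stat and position adjustment of a few geoms.
class(geom_point()$stat)[1]          # stat used by geom_point()
class(geom_histogram()$stat)[1]      # stat used by geom_histogram()
class(geom_histogram()$position)[1]  # position used by geom_histogram()
class(geom_boxplot()$position)[1]    # position used by geom_boxplot()
```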
16 / 29
Stats
I Stats are statistical summaries of groups of data points. E.G.,
I stat_smooth() is a smoothed conditional mean of the y positions given x
(loosely, a moving average of y over windows of x positions).
I stat_bin() is a binning of a quantitative variable into bins of
a given width
I See the cheatsheet for a list.
I Stats create new variables, and these variables can be mapped to
aesthetics.
I E.G., stat_bin() creates the variables count, density and x
I Enclose the derived variable name in .. (e.g., ..density..) to use it.
p <- ggplot(gapminder,aes(x=lifeExp)) +
geom_histogram(aes(y= ..density..))
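Recent versions of ggplot2 (3.3.0 and later) also provide after_stat() for referring to variables computed by a stat, and the ..name.. notation is deprecated as of version 3.4.0. The sketch below uses the diamonds data from earlier in the lecture; p_density and bins are illustrative names, and the binwidth is an arbitrary choice.

```r
library(ggplot2)

# Density histogram using after_stat() instead of ..density..
p_density <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1)

# layer_data() shows the variables stat_bin() computed for each bin.
bins <- layer_data(p_density)
head(bins[, c("x", "count", "density")])

# On the density scale, bar areas (height times bin width) should sum to 1.
sum(bins$density * (bins$xmax - bins$xmin))
```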
17 / 29
Stats
p <- ggplot(gapminder,aes(x=lifeExp)) +
geom_histogram(aes(y= ..density..))
p
[Figure: density histogram of lifeExp for the gapminder data.]
18 / 29
Position adjustment
I See cheatsheet or “Position Adjustments” section of text for more
information.
I Default for most geoms is no adjustment (“identity”)
I Adjustments to x and/or y position, such as “jitter”, can reduce
overplotting.
I Boxplots in recent plot of gapminder data were “dodge”d.
I Histograms of a continuous variable by a categorical grouping
variable are “stack”ed by default.
gdat <- filter(gapminder,year==1952)
ggplot(gdat,aes(x=lifeExp,color=continent)) + geom_histogram()
[Figure: stacked histogram of lifeExp (counts) in 1952, bar outlines coloured by continent (Africa, Americas, Asia, Europe, Oceania).]
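To compare position adjustments on histograms, here is a minimal sketch with a hypothetical toy data frame. It maps the grouping variable to fill (rather than color, as above) so that the bars themselves are shaded; the object names are ours.

```r
library(ggplot2)

# Hypothetical toy data: six values in two groups.
toy <- data.frame(
  value = c(1, 1, 2, 2, 2, 3),
  grp   = c("a", "b", "a", "a", "b", "b")
)
base <- ggplot(toy, aes(x = value, fill = grp))

stacked  <- base + geom_histogram(binwidth = 1)                   # default: "stack"
overlaid <- base + geom_histogram(binwidth = 1, alpha = 0.5,
                                  position = "identity")          # bars overlap
dodged   <- base + geom_histogram(binwidth = 1, position = "dodge")  # side by side

# Stacking piles the groups on top of each other, so the tallest stacked
# bar is higher than the tallest bar when the groups are simply overlaid.
max(layer_data(stacked)$ymax)
max(layer_data(overlaid)$ymax)
```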
19 / 29
Tools for working with layers
I Many possible topics
I In the interest of time, focus on a few
I displaying distributions
I adding error bars and other measures of uncertainty
I annotating a plot
20 / 29
Displaying distributions
I The standard histogram is geom_histogram().
I Displays counts by default, but we have seen how to display as
density
I Density is better for comparing the shape of distributions
I To include group information, can stack bars (see previous
example), use faceting to produce separate histograms, or
superpose density histograms.
21 / 29
Histograms with faceting
gdat <- filter(gdat,continent != "Oceania")
h <- ggplot(gdat,aes(x=lifeExp))
h + geom_histogram(aes(y= ..density..), binwidth=5) + facet_grid(continent ~ .)
[Figure: density histograms of lifeExp in 1952, one panel per continent (Africa, Americas, Asia, Europe), stacked vertically.]
22 / 29
Histograms superposed
h + geom_freqpoly(aes(y=..density..,color=continent), binwidth=5)
[Figure: frequency polygons of lifeExp in 1952 on the density scale, one line per continent (Africa, Americas, Asia, Europe).]
23 / 29
Density estimation
I geom_density() plots a density estimate
I Think of adding up small normal densities centred at each
observed datapoint, with the width of each distribution a
user-defined parameter
I Can superpose multiple density plots.
h + geom_density(aes(color=continent))
[Figure: kernel density estimates of lifeExp in 1952, one curve per continent (Africa, Americas, Asia, Europe).]
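The "adding up small normal densities" description can be checked by hand in base R. The sketch below uses hypothetical observations and an arbitrary bandwidth of 3, and compares the hand-computed average of normal densities with density(), which is the estimator geom_density() uses as far as we know. In geom_density() the width is controlled through the bw and adjust arguments.

```r
# Hypothetical observations, for illustration only.
x_obs <- c(45, 50, 52, 60, 71)
bw    <- 3                                # kernel standard deviation (the "width")
grid  <- seq(30, 85, length.out = 200)    # points at which to evaluate the estimate

# Average of normal densities centred at each observation.
kde_by_hand <- sapply(grid, function(g) mean(dnorm(g, mean = x_obs, sd = bw)))

# Base R's density() with a Gaussian kernel and the same bandwidth.
kde_builtin <- density(x_obs, bw = bw, kernel = "gaussian",
                       from = 30, to = 85, n = 200)

# Largest pointwise difference between the two estimates (should be small;
# density() uses a fast binned approximation, so it need not be exactly 0).
max(abs(kde_by_hand - kde_builtin$y))
```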
24 / 29
Violin plots
I Instead of superposing, dodge density estimates with a violin plot.
I A violin is a density estimate and its mirror image, displayed
vertically.
h + geom_violin(aes(x=continent,y=lifeExp))
[Figure: violin plots of lifeExp in 1952 for Africa, Americas, Asia and Europe.]
25 / 29
Adding measures of uncertainty
I geom_smooth() includes pointwise error bands by default.
I For factors, we can add error bars.
gfit <- lm(lifeExp ~ continent,data=gdat)
newdat <- data.frame(continent=c("Africa","Americas","Asia","Europe"))
mm <- data.frame(newdat,predict(gfit,newdata=newdat,interval="confidence"))
ggplot(mm,aes(x=continent,y=fit)) +
geom_point() + geom_errorbar(aes(ymin=lwr,ymax=upr))
[Figure: fitted mean lifeExp (fit) with confidence-interval error bars for Africa, Americas, Asia and Europe.]
26 / 29
Measures of uncertainty with stat summaries
I Variety of built-in summaries, or can write your own (not covered).
I Summarize y for different values of x or bins of x values.
library(Hmisc) # Need to have Hmisc package installed
ggplot(gdat,aes(x=continent,y=lifeExp)) +
geom_violin() + # superpose over violin plot
stat_summary(fun.data="mean_cl_normal", color="red")
[Figure: violin plots of lifeExp by continent, with the mean and its confidence limits overlaid in red.]
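Although writing your own summaries is not covered here, the idea is simple enough to sketch: stat_summary(fun.data = ...) accepts a function that takes the y values in a group and returns a data frame with columns y, ymin and ymax. The function mean_ci below is a hypothetical name of ours; it computes a t-based confidence interval for the mean, which we believe is what mean_cl_normal reports with its default settings. The toy data are simulated.

```r
library(ggplot2)

# Hypothetical summary function: mean and t-based confidence limits.
mean_ci <- function(y, conf = 0.95) {
  n <- length(y)
  half_width <- qt((1 + conf) / 2, df = n - 1) * sd(y) / sqrt(n)
  data.frame(y = mean(y), ymin = mean(y) - half_width, ymax = mean(y) + half_width)
}

# Simulated toy data: three groups of 20 observations each.
set.seed(1)
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 20),
  value = rnorm(60, mean = rep(c(50, 55, 60), each = 20), sd = 5)
)

# Superpose the custom summary over violin plots, as in the slide above.
ci_plot <- ggplot(toy, aes(x = group, y = value)) +
  geom_violin() +
  stat_summary(fun.data = mean_ci, color = "red")
```

This version does not need the Hmisc package, since the interval is computed with base R's qt() and sd().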
27 / 29
Annotating a plot
I Basic tools for annotating are
I geom_text() and geom_label() to add text
I Geoms such as geom_abline() to add line segments (see
cheatsheet)
I labs() for adding axis labels, titles, and captions
I annotate() to add annotations using aesthetics passed as
vectors to the function, rather than mapped from a dataframe.
I Can add annotations one at a time or many at a time
I To add many at a time, create a data frame of the annotations
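A minimal sketch of adding annotations one at a time, using a thinned-out copy of the diamonds data. The threshold of 5000, the label text and the object names are hypothetical choices made for illustration.

```r
library(ggplot2)

# Keep every 500th row so that the plot is not overcrowded.
small <- diamonds[seq(1, nrow(diamonds), by = 500), ]

annotated <- ggplot(small, aes(x = carat, y = price)) +
  geom_point() +
  # A reference line added with a geom.
  geom_hline(yintercept = 5000, linetype = "dashed") +
  # A single text annotation; x, y and label are passed directly
  # as values rather than mapped from a data frame.
  annotate("text", x = 0.25, y = 5600, hjust = 0,
           label = "hypothetical price threshold") +
  # Axis labels, title and caption.
  labs(x = "Carat", y = "Price (US dollars)",
       title = "Price versus carat",
       caption = "Data: every 500th row of diamonds")
```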
28 / 29
Many annotations
gm07 <- filter(gapminder, year ==2007)
topOilbyGDP <- c("Kuwait","Guinea","Norway","Saudi Arabia")
gdpOil <- filter(gm07,country %in% topOilbyGDP)
ggplot(gm07,aes(x=gdpPercap,y=lifeExp)) + geom_point() +
geom_text(data=gdpOil,aes(label=country))
[Figure: scatterplot of lifeExp versus gdpPercap for 2007, with text labels for Guinea, Kuwait, Norway and Saudi Arabia.]
29 / 29