Stat 260, Lecture 2, Data Visualization I
David Stenning
Florence Nightingale
I Florence Nightingale was a true pioneer in the field of data
visualization.
I She invented (what is now known as) the polar area diagram to
illustrate seasonal sources of patient mortality in the hospital
she managed.
Data Visualization
I Reading: Chapters 2 and 3 of the (online) text.
I The key steps of a data analysis are illustrated in the following
figure from the text:
I We will jump into the middle of this program and discuss data
visualization, because it is often the most interesting.
I We will use the tidyverse package ggplot2 to visualize data.
Loading ggplot2
I Before you start, make sure you install the tidyverse
collection of R packages. You can do this:
I With the Tools -> Install Packages menu item in
RStudio. Type tidyverse into the text box and click Install.
I By typing install.packages("tidyverse") into the R
console.
I Install once, load every time with library():
library(tidyverse)
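The "install once, load every time" pattern can be sketched as below. This is only an illustration: the helper name ensure_installed is hypothetical, and ggplot2 is used because it is the tidyverse package this lecture plots with. Passing "tidyverse" instead should work the same way.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # runs once, the first time the package is missing
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ensure_installed("ggplot2")

# Loading is still needed in every new R session.
library(ggplot2)
```

Keeping the install step conditional means a script can be re-run without downloading the package again.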
Example: Car mileage
I Do cars with big engines use more fuel than cars with small
engines?
data(mpg)
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
##
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
I Type View(mpg) to view the dataset and ?mpg for details on the variables.
First ggplot
I displ is the engine displacement in litres and hwy is the
highway mileage in miles per gallon.
I We can plot hwy versus displ as follows:
ggplot(data=mpg) + geom_point(mapping = aes(x=displ,y=hwy))
[Figure: scatterplot of hwy (y-axis) versus displ (x-axis).]
I Generally a negative linear trend, though there are some cars with large engines
that get better-than-expected mileage.
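One way to put a number on the negative trend seen in the scatterplot is a correlation and a least-squares slope. This is a sketch that goes slightly beyond the slide: it assumes a straight line is a reasonable first summary, and the names r, fit and slope are only illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Correlation between engine displacement and highway mileage.
r <- cor(mpg$displ, mpg$hwy)

# Least-squares line: change in hwy (miles per gallon) per extra litre of displacement.
fit <- lm(hwy ~ displ, data = mpg)
slope <- unname(coef(fit)["displ"])

print(r)
print(slope)
```

Both quantities come out negative, consistent with the downward trend. The cars with large engines and better-than-expected mileage show up as points well above the fitted line.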
Second ggplot
I We can plot hwy versus displ with colors to represent
different kinds of cars:
ggplot(data=mpg) + geom_point(mapping = aes(x=displ,y=hwy,color=class))
[Figure: scatterplot of hwy versus displ, with points colored by class (2seater, compact, midsize, minivan, pickup, subcompact, suv).]
I The 2seaters are sports cars, with large engines but light bodies.
I Exercise: Redo the above scatterplot but using the aesthetic shape=class to
plot different shapes for different kinds of cars.
Components of our ggplot
ggplot(data=mpg) +
geom_point(mapping = aes(x=displ,y=hwy,color=class))
I Our plot requires a dataset, objects to plot for each observation
in the dataset, and a mapping of the variables in that dataset
to features of the plot.
I In ggplot(data = mpg) we use the dataset mpg,
I we plot points for each observation in the dataset with
geom_point(),
I and mapping = aes(x=displ,y=hwy,color=class) maps
engine displacement to the x-axis, highway mileage to the
y-axis and vehicle class to the color of each point.
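To see what the mapping described above produces, one option is to inspect the data that ggplot2 computes for the layer. The sketch below uses layer_data(), a ggplot2 helper not mentioned on the slide. The object names p and pts are only illustrative.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# layer_data() returns one row per plotted point for the given layer.
pts <- layer_data(p, 1)

# x and y hold displ and hwy; colour holds the color assigned to each class.
head(pts[, c("x", "y", "colour")])

# One plotted point per observation, and one distinct color per car class.
nrow(pts) == nrow(mpg)
length(unique(pts$colour)) == length(unique(mpg$class))
```

Note that ggplot2 stores the aesthetic under the spelling colour, whichever spelling is used in aes().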
Components of our ggplot
ggplot(data=mpg) +
geom_point(mapping = aes(x=displ,y=hwy,color=class))
[Figure: scatterplot of hwy versus displ, with points colored by class.]
Geometric objects
I The geometric objects to plot, or “geoms”, are specified by the
functions geom_XXX().
I Example: scatterplot smooths for each class of car.
ggplot(data=mpg) + geom_smooth(mapping = aes(x=displ,y=hwy,color=class))
[Figure: smooth curves of hwy versus displ, one per car class, colored by class.]
Multiple geoms
I Scatterplot smooths are more often plotted over top of points,
as a summary of the trend.
I In the following we specify a default aesthetic mapping that is
used by both the points and smooth.
ggplot(data=mpg, mapping = aes(x=displ,y=hwy,color=class)) +
geom_point() +
geom_smooth()
[Figure: scatterplot of hwy versus displ with a smooth curve for each class, colored by class.]
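A default mapping can also be combined with layer-specific mappings. The sketch below is a variation that is not in the original slides: x and y are set as defaults, while color is mapped only in the point layer, so the smoother sees a single group and draws one overall curve. The names p_one_smooth and smooth_groups are illustrative.

```r
library(ggplot2)

p_one_smooth <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +       # color applies to the points only
  # method and formula are given explicitly to avoid the informational message;
  # loess is what geom_smooth() picks by default for fewer than 1,000 observations.
  geom_smooth(method = "loess", formula = y ~ x)

# The smooth layer (layer 2) contains a single group, i.e. one curve.
smooth_groups <- unique(layer_data(p_one_smooth, 2)$group)
length(smooth_groups)
```

Moving color = class back into the ggplot() call would restore one smooth per class, as in the slide above.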
Facets
I Instead of using colors to distinguish car class on one
scatterplot, we can split into multiple scatterplots or “facets”.
ggplot(data=mpg, mapping = aes(x=displ,y=hwy,color=class)) +
geom_point() + geom_smooth() +
facet_wrap(~ class,nrow=2)
[Figure: scatterplots with smooths of hwy versus displ, one panel per class in two rows: 2seater, compact, midsize, minivan (top row); pickup, subcompact, suv (bottom row).]
I (The colors are no longer necessary, but are included to emphasize how our
earlier scatterplot has been split.)
Faceting
I The facets by car class were added with facet_wrap().
I The variable to facet on was given in the “formula” ~ class.
Formulas will make more sense in the next faceting example.
I facet_wrap() would have made three rows in this
example. We force two with nrow=2.
I Exercise: Repeat the above example, but omit the nrow=2
argument to facet_wrap().
Faceting on two variables
I The variable drv is four-, front- or rear-wheel drive.
I The variable cyl is the number of cylinders.
ggplot(data=mpg) +
geom_point(mapping = aes(x=displ,y=hwy,color=class)) +
facet_grid(drv ~ cyl)
[Figure: grid of scatterplots of hwy versus displ, colored by class, with one row per value of drv (4, f, r) and one column per value of cyl (4, 5, 6, 8).]
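The formula in facet_grid() reads as rows ~ columns, and a dot on either side means "do not facet in this direction". The sketch below illustrates this with the same data. The helper count_panels is hypothetical and counts only panels that contain at least one point.

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

p_grid <- base + facet_grid(drv ~ cyl)  # rows by drv, columns by cyl
p_cols <- base + facet_grid(. ~ cyl)    # columns only
p_rows <- base + facet_grid(drv ~ .)    # rows only

# Hypothetical helper: number of panels that actually contain data.
count_panels <- function(p) length(unique(layer_data(p)$PANEL))

count_panels(p_grid)
count_panels(p_cols)
count_panels(p_rows)
```

In the grid, some drv and cyl combinations never occur in mpg, so the corresponding panels are drawn but left empty.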
ggplot layers
I The g’s stand for Grammar of Graphics.
I Just as English grammar is the way in which words are put
together to form sentences, a grammar of graphics is a way to
put together basic graphical elements to make a graph.
I ggplots are built in layers, each composed of data, a mapping, a
geom and optionally stats, such as a scatterplot smoother.
I The layers are arranged and labelled on the graph by scales
and coords (next lecture).
I The data can also be broken into subsets and displayed in
separate graphs by a facet specification.
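The components of a layer can be spelled out explicitly with ggplot2's lower-level layer() function. The sketch below suggests that geom_point() is a shortcut for a layer with a point geom and the "identity" stat, which leaves the data unchanged. The names p_geom and p_layer are illustrative, and this level of detail is not something the lecture relies on.

```r
library(ggplot2)

# The usual shortcut.
p_geom <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# The same layer with its components written out.
p_layer <- ggplot(data = mpg) +
  layer(
    geom = "point",          # what to draw
    stat = "identity",       # no statistical transformation
    position = "identity",   # no position adjustment
    mapping = aes(x = displ, y = hwy),
    params = list(na.rm = FALSE)
  )

# Both plots compute the same point coordinates.
all.equal(layer_data(p_geom)[, c("x", "y")],
          layer_data(p_layer)[, c("x", "y")])
```

In practice the geom_XXX() and stat_XXX() functions are the usual interface. The explicit form is shown only to make the parts of a layer visible.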
Another example: The gapminder data
I The gapminder dataset contains life expectancy, population
size and GDP per capita for countries throughout the world
from 1952 to 2007 in five-year intervals.
library(gapminder)
data(gapminder)
head(gapminder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
##
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
Subsetting the gapminder data
I Plot life expectancy versus GDP per capita for 2007.
I Need to subset, or “filter” the data to observations from 2007.
gm07 <- filter(gapminder,year==2007)
ggplot(gm07, aes(x=gdpPercap,y=lifeExp,color=continent)) + geom_point()
[Figure: scatterplot of lifeExp versus gdpPercap for 2007, with points colored by continent (Africa, Americas, Asia, Europe, Oceania).]
Transform GDP per capita
I GDP per capita differs by several orders of magnitude across
countries. For such explanatory variables, the log scale may be
more appropriate.
gm07 <- mutate(gm07,log10GdpPercap = log10(gdpPercap))
ggplot(gm07, aes(x=log10GdpPercap,y=lifeExp,color=continent)) + geom_point()
[Figure: scatterplot of lifeExp versus log10GdpPercap for 2007, with points colored by continent.]
Build a plot by layer
I In this example we build the plot by layer and do not show the
plot until all layers are added.
I Our plot will be for the entire gapminder dataset.
I Set the mapping:
gapminder <- mutate(gapminder, log10GdpPercap = log10(gdpPercap))
p <- ggplot(gapminder, aes(x=log10GdpPercap,y=lifeExp,color=continent))
Add the geoms
I Overplotting means we can’t tell how many points there are per
area of the plot.
I Set a transparency, or “alpha”, value to make points
semi-transparent. Then many overlapping points will add up to
solid, and few points will show as semi-transparent.
p2 <- p + geom_point(alpha=0.1)
I alpha is the transparency aesthetic, between 0 (fully
transparent) and 1 (fully opaque), best applied directly to the geom.
Add statistical transformations
I Statistical transformations, or stats, summarize the data; e.g.,
a scatterplot smoother.
p2 + stat_smooth()
[Figure: semi-transparent scatterplot of lifeExp versus log10GdpPercap with a smooth curve for each continent.]
Exercise:
I The variable year is quantitative, but can still be used as a
grouping variable. Make scatterplots of lifeExp versus
log10GdpPercap with points colored by year. Add a scatterplot
smoother with (i) no grouping variable and (ii) year as the
grouping variable. Set se = FALSE for your smoothers.
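The overplotting point from the "Add the geoms" slide can be checked on the mpg data from earlier in the lecture, which avoids needing any package beyond ggplot2. This is a sketch with illustrative names, and the alpha value 0.3 is an arbitrary choice. It compares the number of observations with the number of distinct (displ, hwy) positions, then sets alpha as a constant outside aes().

```r
library(ggplot2)

# Several cars share the same (displ, hwy) pair, so points sit on top of each other.
n_obs <- nrow(mpg)
n_distinct <- nrow(unique(mpg[, c("displ", "hwy")]))
c(observations = n_obs, distinct_positions = n_distinct)

# alpha is set directly on the geom as a constant, not mapped to a variable.
p_alpha <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3)

# Every point receives the same alpha value.
alpha_used <- unique(layer_data(p_alpha)$alpha)
alpha_used
```

With a constant alpha, darker spots on the plot indicate positions shared by several observations.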
ggplot scales
I The scales are mappings from the data to the graphics device.
I The domain of continent is the five continents; the range is the
hexadecimal codes of the five colors shown on the graph.
I The domain of lifeExp is 23.599 to 82.603; the range is [0,1],
which grid converts to a range of vertical pixels on the graph.
I Legends and axes provide the inverse mapping.
p2
[Figure: semi-transparent scatterplot of lifeExp versus log10GdpPercap, with points colored by continent.]
ggplot coordinate system
I The coordinate system is another layer in how the data get
mapped to the graphics device.
I Usually Cartesian, but could be, e.g., polar coordinates, or a
transformation.
ggplot(gapminder,aes(x=gdpPercap,y=lifeExp,color=continent)) +
geom_point(alpha=0.5) + coord_trans(x="log10")
[Figure: scatterplot of lifeExp versus gdpPercap on a log10-transformed x-axis labelled in original units, with points colored by continent.]
ggplot faceting
I How to break up the data into subsets and arrange multiple plots
on the graphics device.
p2 + facet_grid(continent ~ .)
[Figure: scatterplots of lifeExp versus log10GdpPercap in one row panel per continent (Africa, Americas, Asia, Europe, Oceania).]
Why so many components?
I A framework for the components of a graph is summarized by this
figure from the online text:
I Gives the user the ability to change individual components one
at a time.
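As an illustration of changing one component at a time, the log transformation of GDP per capita can be moved from the data into the x scale. The sketch below uses scale_x_log10(), a ggplot2 function that does not appear in the slides above, and assumes the gapminder package is installed. The names p_scale and x_plotted are illustrative.

```r
library(ggplot2)
library(gapminder)

# Map the raw gdpPercap and let the x scale apply the log10 transformation.
p_scale <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.1) +
  scale_x_log10()

# The x positions computed for the layer are the log10 values,
# matching the log10GdpPercap column created with mutate() earlier.
x_plotted <- layer_data(p_scale)$x
all.equal(sort(as.numeric(x_plotted)), sort(log10(gapminder$gdpPercap)))
```

Only the scale changed here. The data, mapping and geom are the same as in the untransformed plot, and the axis is labelled in the original units of gdpPercap.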