Stat 260, Lecture 2, Data Visualization I

David Stenning


Florence Nightingale

- Florence Nightingale was a true pioneer in the field of data visualization.
- She invented (what is now known as) the polar area diagram to illustrate seasonal sources of patient mortality in the hospital she managed.

Data Visualization
- Reading: Chapters 2 and 3 of the (online) text.
- The key steps of a data analysis are illustrated in the following figure from the text:
- We will jump into the middle of this program and discuss data visualization, because it is often the most interesting.
- We will use the tidyverse package ggplot2 to visualize data.

Loading ggplot2

- Before you start, make sure you install the tidyverse collection of R packages. You can do this:
  - With the Tools -> Install Packages menu item in RStudio. Type tidyverse into the text box and click Install.
  - By typing install.packages("tidyverse") into the R console.
- Install once, load every time with library():

library(tidyverse)


Example: Car mileage

- Do cars with big engines use more fuel than cars with small engines?

data(mpg)
head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

- Type View(mpg) to view the dataset and ?mpg for details on the variables.
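
Beyond head() and View(), a few base R calls give a quick overview of the dataset before plotting. The following is only a sketch; the name class_counts is introduced here for illustration and is not part of the lecture code.

```r
library(ggplot2)  # the mpg dataset ships with ggplot2

# Number of rows (cars) and columns (variables)
dim(mpg)

# How many cars fall into each vehicle class
class_counts <- table(mpg$class)
class_counts

# Storage type of each variable (character, integer, numeric)
sapply(mpg, class)
```
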

First ggplot
- displ is the engine displacement in litres and hwy is the highway mileage in miles per gallon.
- We can plot hwy versus displ as follows:

ggplot(data=mpg) + geom_point(mapping = aes(x=displ,y=hwy))

[Figure: scatterplot of hwy (y-axis) versus displ (x-axis)]

- Generally a negative linear trend, though there are some cars with large engines that get better-than-expected mileage.
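
One way to put a number on the trend seen in the scatterplot is a least-squares line and a correlation coefficient. This is a rough sketch rather than part of the lecture's workflow; the names fit and r are chosen here for illustration.

```r
library(ggplot2)  # for the mpg dataset

# Fit a straight line: highway mileage as a function of engine displacement
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)  # intercept and slope; a negative slope means mileage falls as displ grows

# Correlation between the two variables
r <- cor(mpg$displ, mpg$hwy)
r
```

A straight line is only a first approximation here: the large-engined cars with better-than-expected mileage sit above the fitted line.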

Second ggplot
- We can plot hwy versus displ with colors to represent different kinds of cars:

ggplot(data=mpg) + geom_point(mapping = aes(x=displ,y=hwy,color=class))

[Figure: scatterplot of hwy versus displ, points colored by class (2seater, compact, midsize, minivan, pickup, subcompact, suv)]

- The 2seaters are sports cars, with large engines but light bodies.
- Exercise: Redo the above scatterplot but using the aesthetic shape=class to plot different shapes for different kinds of cars.

Components of our ggplot

ggplot(data=mpg) +
  geom_point(mapping = aes(x=displ,y=hwy,color=class))

- Our plot requires a dataset, objects to plot for each observation in the dataset, and a mapping of the variables in that dataset to features of the plot.
- In ggplot(data = mpg) we use the dataset mpg,
- we plot points for each observation in the dataset with geom_point(),
- and mapping = aes(x=displ,y=hwy,color=class) maps engine displacement to the x-axis, highway mileage to the y-axis and vehicle class to the color of each point.
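
It may help to contrast mapping a variable to an aesthetic with setting an aesthetic to a fixed value. In the sketch below (the names p_mapped and p_set are illustrative only), color inside aes() varies with class, whereas color outside aes() is a constant applied to every point.

```r
library(ggplot2)

# Mapped: color is tied to the class variable, so a legend is drawn
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Set: color is a fixed value outside aes(), so all points are blue
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

p_mapped
p_set
```
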


Geometric objects

- The geometric objects to plot, or “geoms”, are specified by the functions geom_XXX().
- Example: scatterplot smooths for each class of car.

ggplot(data=mpg) + geom_smooth(mapping = aes(x=displ,y=hwy,color=class))

[Figure: smooths of hwy versus displ, one curve per class]
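
To illustrate that the geom can be swapped while the data and mapping stay fixed, here is a small sketch using two other geoms from ggplot2, geom_boxplot() and geom_violin(). The choice of class on the x-axis and the names base, p_box and p_violin are assumptions made for this example.

```r
library(ggplot2)

# Shared data and mapping: highway mileage within each vehicle class
base <- ggplot(data = mpg, mapping = aes(x = class, y = hwy))

# Same data and mapping, two different geometric objects
p_box <- base + geom_boxplot()
p_violin <- base + geom_violin()

p_box
p_violin
```
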

Multiple geoms
- Scatterplot smooths are more often plotted over top of points, as a summary of the trend.
- In the following we specify a default aesthetic mapping that is used by both the points and smooth.

ggplot(data=mpg, mapping = aes(x=displ,y=hwy,color=class)) +
  geom_point() +
  geom_smooth()

[Figure: scatterplot of hwy versus displ with a smooth for each class, points and smooths colored by class]
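
A default mapping can also be combined with a mapping given to a single geom. As a sketch (p_mixed is a name chosen here), the plot below keeps x and y as defaults but maps color only in geom_point(), so the points are colored by class while a single smooth summarizes all cars.

```r
library(ggplot2)

p_mixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +  # color applies to the points only
  geom_smooth(se = FALSE)                     # inherits x and y, but not color

p_mixed
```
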

Facets
- Instead of using colors to distinguish car class on one scatterplot, we can split into multiple scatterplots or “facets”.

ggplot(data=mpg, mapping = aes(x=displ,y=hwy,color=class)) +
  geom_point() + geom_smooth() +
  facet_wrap(~ class,nrow=2)

[Figure: scatterplots of hwy versus displ with smooths, one panel per class; top row 2seater, compact, midsize, minivan; bottom row pickup, subcompact, suv]

- (The colors are no longer necessary, but are included to emphasize how our earlier scatterplot has been split.)

Faceting

- The facets by car class were added with facet_wrap().
- The variable to facet on was given in the “formula” ~ class. Formulas will make more sense in the next faceting example.
- By default, facet_wrap() would have made three rows in this example. We force two with nrow=2.
- Exercise: Repeat the above example, but omit the nrow=2 argument to facet_wrap().
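
Since each panel already identifies its class, the color mapping can be dropped. The sketch below shows the faceted scatterplot without color and, to keep it short, without the smooths; p_facet is a name introduced here.

```r
library(ggplot2)

# One panel per class; the panel titles replace the color legend
p_facet <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

p_facet
```
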

Faceting on two variables.

- The variable drv is the drive train: 4 (four-wheel), f (front-wheel) or r (rear-wheel) drive.
- The variable cyl is the number of cylinders.

ggplot(data=mpg) +
  geom_point(mapping = aes(x=displ,y=hwy,color=class)) +
  facet_grid(drv ~ cyl)

[Figure: grid of scatterplots of hwy versus displ; rows are drv (4, f, r), columns are cyl (4, 5, 6, 8); points colored by class]
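
In facet_grid() the formula reads rows ~ columns. The sketch below first cross-tabulates the two faceting variables, which shows how many cars land in each panel, and then facets on columns only by putting a dot on the row side of the formula. The names drv_by_cyl and p_cols are illustrative.

```r
library(ggplot2)

# Rows are drive train, columns are number of cylinders;
# any cell with a count of 0 corresponds to an empty panel in facet_grid(drv ~ cyl)
drv_by_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_by_cyl

# "." on the left means no row variable: one row of panels, one column per cyl value
p_cols <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

p_cols
```
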

ggplot layers

- The gg in ggplot2 stands for Grammar of Graphics.
- Just as English grammar is the way in which words are put together to form sentences, a grammar of graphics is a way to put together basic graphical elements to make a graph.
- ggplots are built in layers, each composed of data, a mapping, a geom and optionally a stat, such as a scatterplot smoother.
- The layers are arranged and labelled on the graph by scales and coords (next lecture).
- The data can also be broken into subsets and displayed in separate graphs by a facet specification.
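
To make the parts of a layer explicit, a layer can be spelled out with ggplot2's layer() function instead of a geom_XXX() shortcut. This is a sketch for illustration only (p_layer is a name chosen here); in practice the shortcut functions are what one would normally write.

```r
library(ggplot2)

# A single layer written out in full: data, mapping, geom, stat and position
p_layer <- ggplot() +
  layer(
    data = mpg,
    mapping = aes(x = displ, y = hwy),
    geom = "point",
    stat = "identity",      # plot the data as they are, with no summary
    position = "identity"   # do not adjust the point positions
  )

p_layer
```

This should draw the same scatterplot as ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)).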

Another example: The gapminder data

- The gapminder dataset contains life expectancy, population size and GDP per capita for countries throughout the world from 1952 to 2007 in five-year intervals.

library(gapminder)
data(gapminder)
head(gapminder)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
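
A quick check of which years the dataset covers, and how they are spaced, can be sketched as follows. The name yrs is introduced here for illustration.

```r
library(gapminder)

# Distinct survey years, in increasing order
yrs <- sort(unique(gapminder$year))
yrs

# Gap between consecutive survey years
diff(yrs)

# Number of distinct countries
length(unique(gapminder$country))
```
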

Subsetting the gapminder data

- Plot life expectancy versus GDP per capita for 2007.
- We need to subset, or “filter”, the data to observations from 2007.

gm07 <- filter(gapminder,year==2007)
ggplot(gm07, aes(x=gdpPercap,y=lifeExp,color=continent)) + geom_point()

[Figure: scatterplot of lifeExp versus gdpPercap for 2007, points colored by continent (Africa, Americas, Asia, Europe, Oceania)]

Transform GDP per capita

- GDP per capita differs by several orders of magnitude across countries. For such explanatory variables, the log scale may be more appropriate.

gm07 <- mutate(gm07,log10GdpPercap = log10(gdpPercap))
ggplot(gm07, aes(x=log10GdpPercap,y=lifeExp,color=continent)) + geom_point()

[Figure: scatterplot of lifeExp versus log10GdpPercap for 2007, points colored by continent]

Build a plot by layer

- In this example we build the plot by layer and do not show the plot until all layers are added.
- Our plot will be for the entire gapminder dataset.
- Set the mapping:

gapminder <- mutate(gapminder, log10GdpPercap = log10(gdpPercap))
p <- ggplot(gapminder, aes(x=log10GdpPercap,y=lifeExp,color=continent))

Add the geoms

- Overplotting means we can’t tell how many points there are per area of the plot.
- Set a transparency, or “alpha”, value to make points semi-transparent. Then many points will add up to solid, and few points will show as semi-transparent.

p2 <- p + geom_point(alpha=0.1)

- alpha is the transparency aesthetic, ranging from 0 (fully transparent) to 1 (fully opaque), and is best applied directly to the geom.

Add statistical transformations

- Statistical transformations or stats summarize the data; e.g., a scatterplot smoother

p2 + stat_smooth()

[Figure: semi-transparent scatterplot of lifeExp versus log10GdpPercap for all years, with a smooth for each continent]

Exercise:

- The variable year is quantitative, but can still be used as a grouping variable. Make scatterplots of lifeExp versus log10GdpPercap with points colored by year. Add a scatterplot smoother with (i) no grouping variable and (ii) year as the grouping variable. Set se = FALSE for your smoothers.

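
Returning to the 2007 subset, the filter and mutate steps can be chained into one pipeline with the pipe operator %>% from dplyr. This is just one possible way to write it, not the lecture's own code. The sketch also prints the range of GDP per capita before and after the log transform.

```r
library(dplyr)
library(gapminder)

# Keep the 2007 observations, then add the log10 of GDP per capita
gm07 <- gapminder %>%
  filter(year == 2007) %>%
  mutate(log10GdpPercap = log10(gdpPercap))

# Compare the spread on the raw and log10 scales
range(gm07$gdpPercap)
range(gm07$log10GdpPercap)
```
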
ggplot scales

- The scales are mappings from the data to the graphics device
  - domain of continent is the five continents, range is the hexadecimal codes of the five colors represented on the graph
  - domain of lifeExp is 23.599 to 82.603, range is [0,1], which grid converts to a range of vertical pixels on the graph.
- legends and axes provide the inverse mapping

p2

[Figure: semi-transparent scatterplot of lifeExp versus log10GdpPercap, points colored by continent]

ggplot coordinate system

- The coordinate system is another layer in how the data get mapped to the graphics device.
- Usually Cartesian, but could be, e.g., polar coordinates, or a transformation.

ggplot(gapminder,aes(x=gdpPercap,y=lifeExp,color=continent)) +
  geom_point(alpha=0.5) + coord_trans(x="log10")

[Figure: scatterplot of lifeExp versus gdpPercap on a log10-transformed x-axis, points colored by continent]

ggplot faceting

- How to break up the data into subsets and arrange multiple plots on the graphics device.

p2 + facet_grid(continent ~ .)

[Figure: scatterplots of lifeExp versus log10GdpPercap, one row panel per continent (Africa, Americas, Asia, Europe, Oceania)]

Why so many components?

- A framework for the components of a graph is summarized by this figure from the online text:
- Gives the user the ability to change individual components one at a time.
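
As a sketch of changing one component at a time, the code below starts from a single scatterplot and then swaps only the x scale, and after that only the color scale, leaving the data, mapping and geom untouched. The functions scale_x_log10() and scale_color_brewer() are from ggplot2; the object names and the "Set1" palette are choices made for this example.

```r
library(ggplot2)
library(gapminder)

# Data, mapping and geom
p_base <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5)

# Change only the x scale: log10 axis instead of a linear one
p_log <- p_base + scale_x_log10()

# Change only the color scale: a different palette for the continents
p_brewer <- p_log + scale_color_brewer(palette = "Set1")

p_base
p_log
p_brewer
```
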