Stat 260, Lecture 3, Data Visualization II

David Stenning

Load packages

library(ggplot2)
data(diamonds)
library(dplyr)
library(gapminder)
data(gapminder)
gapminder <- mutate(gapminder, log10Pop = log10(pop), log10GdpPercap = log10(gdpPercap))

References

- Chapter 3 of the online text.
- ggplot2 cheatsheet at [https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf]
- Wickham (2009) ggplot2: Elegant graphics for data analysis, Chapters 4 and 5. Book available at [https://ggplot2-book.org/]
- Chang (2012) R graphics cookbook. Available at [http://www.cookbook-r.com/Graphs/]
- R Graph Gallery (goes beyond ggplot). Available at [https://www.r-graph-gallery.com/]

More ggplot Examples

The best way to learn data visualization is to look at examples, such as those in the references on the previous slide. In this lecture, we will go through more examples to illustrate working with ggplot layers:

- data,
- aesthetic mapping,
- geom,
- statistical transformation, and
- position adjustment.

Diamonds dataset

p <- ggplot(diamonds,aes(x=carat,y=price,colour=cut)) + geom_point()
names(p)
## [1] "data"        "layers"      "scales"      "mapping"     "theme"
## [6] "coordinates" "facet"       "plot_env"    "labels"
p
[Figure: scatterplot of price against carat for the diamonds data, points coloured by cut (Fair, Good, Very Good, Premium, Ideal).]

Data

- The data must be a data frame.
- A copy of the data is stored in the plot object (later changes to the source data frame do not change the plot).
- The data of a plot object can be replaced with %+%, as in

set.seed(123)
subdiamonds <- sample_n(diamonds,size=100)
p <- p %+% subdiamonds
p
[Figure: the same scatterplot of price against carat, coloured by cut, for the 100 sampled diamonds.]

Different data in different layers

- A layer can be given its own data.

p + geom_smooth(data=diamonds, method="lm")
[Figure: price against carat for the sampled diamonds, coloured by cut, with linear fits computed from the full diamonds data.]

Aesthetic mappings

- We have seen that we can specify a default mapping when the plot is initialized, or layer-specific mappings in the layers.
- Warning: specifying a mapping twice can have unexpected consequences!

p + geom_point(aes(color=clarity))
[Figure: price against carat; a single legend titled cut mixes the levels of cut and clarity (Fair, Good, Ideal, IF, Premium, SI1, SI2, Very Good, VS1, VS2, VVS1, VVS2).]

Exercise

Write R code to produce a scatterplot of price versus carat, with different colors for values of clarity, using the subdiamonds dataset.

Setting vs mapping

- An alternative to mapping an aesthetic to a variable is to set the aesthetic to a constant.
- Set with a layer parameter, rather than mapping with aes().
- E.g., set the color of the points on a plot to dark blue:

ggplot(subdiamonds,aes(x=carat,y=price)) + geom_point(color="darkblue")
[Figure: price against carat for the sampled diamonds, all points dark blue.]

- Exercise: Redo the above plot with the mapping color="darkblue" in aes() for geom_point(), rather than the parameter color="darkblue".

Grouping

- Many geoms in ggplot2 group observations (rows of the data frame).
- E.g., a boxplot of a quantitative variable by a categorical variable groups observations by the categorical variable.
- The default group is the combination (interaction) of all categorical variables in the aesthetic specification.
- If this is not what you want, or if there are no categorical variables, specify group yourself.

gapminder data again: Grouping to plot time series

- For plotting time series (multiple measurements on each observational unit) we want to group by observational unit.

ggplot(gapminder,aes(x=year,y=lifeExp,group=country)) + geom_line(alpha=.5)
[Figure: lifeExp against year, one line per country.]

Different groups on different layers

- To add summaries by continent, we need to specify different groups on different layers.

ggplot(gapminder,aes(x=year,y=lifeExp,color=continent)) +
  geom_line(aes(group=country),alpha=0.2) +
  geom_smooth(aes(group=continent),se=FALSE)
[Figure: lifeExp against year, one faint line per country and one smoother per continent, coloured by continent (Africa, Americas, Asia, Europe, Oceania).]

- Exercise: Redo the above, but remove the mapping in geom_line() and specify group=country in the default mapping. What grouping variable is used by geom_line()? What grouping variable is used by geom_smooth()?

Using interaction() to specify groups

- We could draw boxplots of life expectancy by year and continent. (Not recommended because the plot is too busy; used here only for illustration.)

ggplot(gapminder,aes(x=year,y=lifeExp,
                     group=interaction(year,continent),color=continent)) +
  geom_boxplot()
[Figure: boxplots of lifeExp by year and continent, coloured by continent.]

Overriding group on a layer

- Boxplots of life expectancy by year and continent on one layer, smoother by continent alone on another.

ggplot(gapminder,aes(x=year,y=lifeExp,
                     group=interaction(year,continent),
                     color=continent)) +
  geom_boxplot() +
  geom_smooth(aes(group=continent), se=FALSE)
[Figure: boxplots of lifeExp by year and continent with one smoother per continent, coloured by continent.]

Geoms

- These are the shapes you want on the plot.
- See the list of geoms on the cheatsheet.
- Each geom has a set of aesthetics that are required for drawing and a set of aesthetics that it understands.
  - E.g., geom_point() requires x and y position, and understands color, size and shape.
- Aesthetics can also be passed to geoms as parameters.
  - Recall the difference between geom_point(color="darkblue") and geom_point(aes(color="darkblue")).
- Each geom also has a default statistic (stat) and position adjustment.
  - More on these next.

Stats

- Stats are statistical summaries of groups of data points. E.g.,
  - stat_smooth() computes a smoothed summary of y given x; think of a moving average of the y positions over windows of x positions.
  - stat_bin() bins a quantitative variable into bins of a given width.
- See the cheatsheet for a list.
- Stats create new variables, and these variables can be mapped to aesthetics.
  - E.g., stat_bin() creates the variables count, density and x.
- To use a derived variable, enclose its name in .. on both sides, as in ..density.. below.

p <- ggplot(gapminder,aes(x=lifeExp)) + geom_histogram(aes(y= ..density..))
p
[Figure: density histogram of lifeExp.]

Position adjustment

- See the cheatsheet or the “Position Adjustments” section of the text for more information.
- The default for most geoms is no adjustment (“identity”).
- An adjustment to the x and/or y position, such as “jitter”, can reduce overplotting.
- The boxplots in the recent plot of the gapminder data were “dodge”d.
- Histograms of a continuous variable by a categorical grouping variable are “stack”ed by default.

gdat <- filter(gapminder,year==1952)
ggplot(gdat,aes(x=lifeExp,color=continent)) + geom_histogram()
[Figure: stacked histogram of count against lifeExp for 1952, with a legend for continent (Africa, Americas, Asia, Europe, Oceania).]

Tools for working with layers

- There are many possible topics.
- In the interest of time, we focus on a few:
  - displaying distributions,
  - adding error bars and other measures of uncertainty,
  - annotating a plot.

Displaying distributions

- The standard histogram is geom_histogram().
- It displays counts by default, but we have seen how to display density instead.
  - Density is better for comparing the shapes of distributions.
- To include group information, we can stack bars (see previous example), use faceting to produce separate histograms, or superpose density histograms.

Histograms with faceting

gdat <- filter(gdat,continent != "Oceania")
h <- ggplot(gdat,aes(x=lifeExp))
h + geom_histogram(aes(y= ..density..), binwidth=5) +
  facet_grid(continent ~ .)
[Figure: density histograms of lifeExp in four stacked panels, one per continent (Africa, Americas, Asia, Europe).]

Histograms superposed

h + geom_freqpoly(aes(y=..density..,color=continent), binwidth=5)
[Figure: superposed frequency polygons of density against lifeExp, coloured by continent.]

Density estimation

- geom_density() plots a density estimate.
  - Think of adding up small normal densities centred at each observed data point, where the width of each small density is a user-defined parameter.
- Multiple density plots can be superposed.

h + geom_density(aes(color=continent))
[Figure: superposed density estimates of lifeExp, coloured by continent.]

Violin plots

- Instead of superposing, dodge the density estimates with a violin plot.
- A violin is a density estimate and its mirror image, displayed vertically.

h + geom_violin(aes(x=continent,y=lifeExp))
[Figure: violin plots of lifeExp by continent (Africa, Americas, Asia, Europe).]

Adding measures of uncertainty

- geom_smooth() includes pointwise error bands by default.
- For factors, we can add error bars.

gfit <- lm(lifeExp ~ continent,data=gdat)
newdat <- data.frame(continent=c("Africa","Americas","Asia","Europe"))
mm <- data.frame(newdat,predict(gfit,newdata=newdat,interval="confidence"))
ggplot(mm,aes(x=continent,y=fit)) + geom_point() +
  geom_errorbar(aes(ymin=lwr,ymax=upr))
[Figure: fitted mean lifeExp (fit) with confidence-interval error bars for each continent.]

Measures of uncertainty with stat summaries

- There is a variety of built-in summaries, or you can write your own (not covered).
- Summarize y for different values of x or for bins of x values.

library(Hmisc) # Need to have Hmisc package installed
ggplot(gdat,aes(x=continent,y=lifeExp)) +
  geom_violin() + # superpose over violin plot
  stat_summary(fun.data="mean_cl_normal", color="red")
[Figure: violin plots of lifeExp by continent with the mean and its normal-theory confidence limits superposed in red.]

Annotating a plot

- The basic tools for annotating are
  - geom_text() and geom_label() to add text,
  - geoms such as geom_abline() to add line segments (see cheatsheet),
  - labs() for adding axis labels, titles, and captions,
  - annotate() to add annotations using aesthetics passed as vectors to the function, rather than mapped from a data frame.
- Annotations can be added one at a time or many at a time.
  - To add many at a time, create a data frame.

Many annotations

gm07 <- filter(gapminder, year ==2007)
topOilbyGDP <- c("Kuwait","Guinea","Norway","Saudi Arabia")
gdpOil <- filter(gm07,country %in% topOilbyGDP)
ggplot(gm07,aes(x=gdpPercap,y=lifeExp)) + geom_point() +
  geom_text(data=gdpOil,aes(label=country))
[Figure: lifeExp against gdpPercap for 2007, with text labels for Guinea, Kuwait, Norway and Saudi Arabia.]
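
One annotation at a time

The annotation slide lists annotate() and labs() but only geom_text() gets an example. The following is a minimal sketch of the other two, assuming the packages from the Load packages slide are installed. The label text, its position (x = 35000, y = 45), the dashed line at the mean, and the axis titles are illustrative choices and are not taken from the lecture.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gm07 <- filter(gapminder, year == 2007)

p_annot <- ggplot(gm07, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  # annotate() takes aesthetics as plain vectors, not mapped from a data frame
  # (the position and wording here are hypothetical, chosen by eye)
  annotate("text", x = 35000, y = 45, label = "Each point is one country") +
  # a reference line added with a geom: mean life expectancy across countries
  geom_hline(yintercept = mean(gm07$lifeExp), linetype = "dashed") +
  # labs() sets axis labels, title and caption
  labs(x = "GDP per capita", y = "Life expectancy (years)",
       title = "Life expectancy versus GDP per capita, 2007",
       caption = "Data: gapminder package")
p_annot
```

Because annotate() does not read from the plot data, it suits a single label or marker. For several labels tied to rows of the data, a data frame passed to geom_text(), as on the Many annotations slide, is usually less work.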