Lecture 3 – GGR376
Spatial Data Science II
Dr. Adams

Today’s Lecture
- Skewness & Kurtosis
- Visualization
- ggplot2

Skewness and Kurtosis

Skewness
- Rightward / Positive Skew
  - Long tail of high values
  - mean > median > mode
- Leftward / Negative Skew
  - Long tail of low values
  - mean < median < mode
- Symmetric
  - When there is no skewness in the data.

Skewness Visual
[Figure: Density Distributions of Skewness. Density curves for negative skew, positive skew, and symmetrical distributions; x-axis: value, y-axis: density.]

Skewness Calculations
Pearson’s moment coefficient of skewness:
$$Sk = \frac{\sum_{i} (x_i - \bar{x})^3 / N}{\mathrm{Var}(x)^{1.5}}$$
Many skewness formulas exist, generally though:
- Sk = 0, Symmetrical
- Sk < 0, Negative Skewness
- Sk > 0, Positive Skewness

Skewness in R
# Need to load a package with a skewness function
library(moments)
moments::skewness(x)  # x is a numeric vector
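The moment coefficient can also be computed by hand, which makes it clearer what a packaged skewness function is doing. This is a minimal base R sketch, assuming that Var(x) in the formula is the population variance (dividing by N rather than N - 1). The function name `moment_skewness` and the example vectors are made up for illustration.

```r
# Moment coefficient of skewness: (sum((x - mean(x))^3) / N) / Var(x)^1.5
# Assumption: Var(x) is the population variance (divide by N, not N - 1).
moment_skewness <- function(x) {
  n <- length(x)
  dev <- x - mean(x)        # deviations from the mean
  m2 <- sum(dev^2) / n      # second central moment (population variance)
  m3 <- sum(dev^3) / n      # third central moment
  m3 / m2^1.5
}

# Hypothetical example vectors
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)   # long tail of high values
left_skewed  <- -right_skewed          # mirror image: long tail of low values

moment_skewness(symmetric)      # zero: no skew
moment_skewness(right_skewed)   # positive
moment_skewness(left_skewed)    # negative
```

If the moments package is installed, `moments::skewness(right_skewed)` should agree with the hand-rolled value, since it is also based on divide-by-N moments.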

Kurtosis
- Sharpness of the peak of a frequency-distribution curve
- Leptokurtic
  - High central peak
- Mesokurtic
  - Medium central peak
  - Standard normal curve
- Platykurtic
  - Low central peak

Kurtosis Visual
[Figure: Density Distributions of Kurtosis. Density curves for leptokurtic, mesokurtic, and platykurtic distributions; x-axis: value, y-axis: density.]
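The three shapes can also be compared numerically. The sketch below computes the fourth-moment ratio in base R for three stand-in distributions. The helper name `moment_kurtosis` and the choice of distributions (uniform, normal, Student t with 5 degrees of freedom) are illustrative assumptions, and Var(x) is assumed to be the population variance. Note that this ratio is about 3 for a normal curve, so subtracting 3 gives an "excess" value that is near zero for a mesokurtic distribution.

```r
# Fourth standardized moment: (sum((x - mean(x))^4) / N) / Var(x)^2
# Assumption: Var(x) is the population variance (divide by N).
moment_kurtosis <- function(x) {
  n <- length(x)
  dev <- x - mean(x)
  m2 <- sum(dev^2) / n      # second central moment
  m4 <- sum(dev^4) / n      # fourth central moment
  m4 / m2^2
}

# Deterministic stand-ins for the three shapes: evenly spaced quantiles
p <- ppoints(1000)
flat   <- qunif(p)          # uniform: platykurtic
normal <- qnorm(p)          # normal: mesokurtic reference
peaked <- qt(p, df = 5)     # Student t, 5 df: leptokurtic (heavier tails)

k <- c(flat = moment_kurtosis(flat),
       normal = moment_kurtosis(normal),
       peaked = moment_kurtosis(peaked))
k        # raw ratio: the normal case is close to 3
k - 3    # excess kurtosis: negative, near zero, positive
```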

Kurtosis Calculations
Pearson’s measure of kurtosis:
$$Ku = \frac{\sum_{i} (x_i - \bar{x})^4 / N}{\mathrm{Var}(x)^{2}}$$
Units are not meaningful (as with skewness). As written, this ratio equals 3 for a normal distribution; the thresholds below are for excess kurtosis (the ratio minus 3):
- Ku ~ 0, Mesokurtic
- Ku >> 0 and up to infinity, Leptokurtic
- Ku < 0 to -2.75, Platykurtic

Kurtosis in R
# Need to load a package with a kurtosis function
library(moments)
moments::kurtosis(x)  # x is a numeric vector; returns about 3 for normal data

Visualization
No more ugly plots.

Scientific Visualizations
“The simple graph has brought more information to the data analyst’s mind than any other device.” - John Tukey

Memorable Visualizations
What makes a memorable (effective) visualization? (M. A. Borkin et al. 2015)
1. Titles and supporting text need to convey the message of a visualization.
2. Pictograms may be included to improve recognition; they are unlikely to hinder understanding.
3. Redundancy helps effectively communicate the message.
“A memorable visualization is often also an effective one.”

Why do we use visualizations?

Alcohol use and long-term cognitive decline
The relationship between alcohol use and long-term cognitive decline in middle and late life: a longitudinal analysis using UK Biobank
Giovanni Piumatti, Simon C Moore, Damon M Berridge, Chinmoy Sarkar, John Gallacher
Journal of Public Health, https://doi.org/10.1093/pubmed/fdx186

Research Question
Explain the causal (not merely correlational) relationship between cognitive decline and alcohol intake in middle-aged and older adults in the UK.

Background
- UK Department of Health: drinkers should not consume more than 16 g/day to minimize the risk of alcohol to health.

Methods
- Data from 13,342 men and women, aged between 40 and 73 years.
- How often and how much alcohol was consumed.
- Regression analysis testing the functional relationship and impact of alcohol on cognitive performance.
- Performance was measured by responses in a card-matching task (do a pair of cards match?).
- Covariates included body mass index, physical activity, tobacco use, socioeconomic status, education and baseline cognitive function.
  - Additional variables that may be predictive of the outcome
  - A covariate may be of direct interest, or it may be a confounding or interacting variable

Conclusions
“UK Department of Health guidelines are that drinkers should not consume more than 16 g/day to minimize the risk of alcohol to health. Our findings suggest that to preserve cognitive performance 10 g/day is a more appropriate upper limit.”

Results
- To preserve cognitive performance: 10 g/day limit

A Layered Grammar of Graphics
“What is a graphic? How can we succinctly describe a graphic? And how can we create the graphics that we have described?” (Wickham 2010)
One answer to these questions was to develop a grammar of graphics.
- Grammar refers to a set of rules or constructs that govern the process.

What is different about the grammar of graphics?
Traditionally, you will select the plot by name.

Produce a Plot

And try your best to modify the plot

We can do better

ggplot: The pieces
- A dataset and the mapping of those data to aesthetics
- At least one layer, including:
  - geometric object
  - statistical transformation
- A scale for each aesthetic mapping
- A coordinate system for the plot
- Faceting

How are plots constructed?
Using the grammar of graphics we construct our plots one component at a time.

Updating Plots
Using a grammar of graphics, we can adjust each piece in isolation.

Gapminder example

gapminder: Data from Gapminder
An excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.
library(gapminder)
colnames(gapminder::gapminder)[1:3]
## [1] "country" "continent" "year"
colnames(gapminder::gapminder)[4:6]
## [1] "lifeExp" "pop" "gdpPercap"

Subset the data to 1997
library(dplyr)  # provides %>% and filter()
gap <- gapminder::gapminder %>% dplyr::filter(year == 1997)

Review the subset
    continent        year         lifeExp           pop              gdpPercap
 Africa  :52    Min.   :1997   Min.   :36.09   Min.   :1.456e+05   Min.   :  312.2
 Americas:25    1st Qu.:1997   1st Qu.:55.63   1st Qu.:3.770e+06   1st Qu.: 1366.8
 Asia    :33    Median :1997   Median :69.39   Median :9.735e+06   Median : 4781.8
 Europe  :30    Mean   :1997   Mean   :65.01   Mean   :3.884e+07   Mean   : 9090.2
 Oceania : 2    3rd Qu.:1997   3rd Qu.:74.17   3rd Qu.:2.431e+07   3rd Qu.:12022.9
                Max.   :1997   Max.   :80.69   Max.   :1.230e+09   Max.   :41283.2

Examine Life Expectancy and GDP
[Figure: lifeExp (y-axis) plotted against gdpPercap (x-axis).]

ggplot template
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
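One way to fill in the template's three slots (the data, the geom function, and the aesthetic mappings) is sketched below. The choice of the `mpg` dataset and of `displ` and `hwy` as the mapped variables is an assumption made for illustration.

```r
library(ggplot2)

# Template slots filled in for this sketch:
#   data          = mpg (example data shipped with ggplot2)
#   geom function = geom_point
#   mappings      = x = displ, y = hwy
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

p  # printing the object draws the plot
```

Swapping `geom_point` for another geom, or changing the variables inside `aes()`, changes one piece of the plot while the rest of the call stays the same.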

GEOM_FUNCTION
- Geometric objects include points, lines, and polygons (similar to our vector data model).
The grammar of graphics requires that we specify the geometric feature that will be used to render the plot.
In ggplot, they are known as geoms (geometric objects).

geoms
- Every geom is associated with a default statistic.
- geom_histogram uses the bin statistic
  - Counts values by bin
- The same is true in reverse: each statistic has an associated geom

geom_histogram & stat_bin
geom_histogram(mapping = NULL, data = NULL, stat = "bin", position = "stack", ..., binwidth = NULL, bins = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
Using stats is an alternative approach to creating the layer
stat_bin(mapping = NULL, data = NULL, geom = "bar", position = "stack", ..., binwidth = NULL, bins = NULL, center = NULL, boundary = NULL, breaks = NULL, closed = c("right", "left"), pad = FALSE, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
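The two signatures describe the same layer from different starting points. The sketch below builds a histogram layer both ways and compares the computed bin counts. The bin width of 0.1 and the object names are arbitrary choices for this example.

```r
library(ggplot2)

# Same layer built two ways: geom first vs. stat first
by_geom <- ggplot(diamonds, aes(carat)) +
  geom_histogram(stat = "bin", binwidth = 0.1)
by_stat <- ggplot(diamonds, aes(carat)) +
  stat_bin(geom = "bar", binwidth = 0.1)

# layer_data() returns the data frame computed for a layer
counts_geom <- layer_data(by_geom)$count
counts_stat <- layer_data(by_stat)$count

# TRUE when both routes produce the same bin counts
identical(counts_geom, counts_stat)
```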

mapping = aes()
aes()
“Aesthetics, things that we can perceive on the graphic.” (Wickham 2010)
- x-location (x)
- y-location (y)
- alpha
- colour
- fill
- group
- linetype
- size
- weight

mapping = aes()
mapping = defines how we associate each element (variable) with an aesthetic (visual element of the plot).
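A short sketch may help separate mapping an aesthetic to a variable from setting it to a fixed value. The contrast with a fixed value is added here for illustration. The dataset (`mpg`), the variables, and the colour "blue" are arbitrary choices.

```r
library(ggplot2)

# Mapped: colour is tied to a variable inside aes(), so it varies with the data
mapped <- ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = class))

# Set: colour is a fixed value outside aes(), so every point looks the same
fixed <- ggplot(mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "blue")

length(unique(layer_data(mapped)$colour))  # one colour per class
unique(layer_data(fixed)$colour)           # a single colour
```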

One Variable geoms
Continuous Variables
- Kernel Density (smooth histogram)
  - geom_density()
  - Useful if data come from a smooth distribution
- Dot-plot
  - geom_dotplot()
- Histogram
  - geom_histogram()
Discrete Variable
- Bar
  - geom_bar()

geom_density()
library(ggplot2)
ggplot(diamonds, aes(carat)) +
  geom_density() +
  theme(text = element_text(size = 20))
[Figure: density curve; x-axis: carat, y-axis: density.]
stat_density()

geom_dotplot()
# Use fixed-width bins
ggplot(mtcars, aes(x = mpg)) +
  geom_dotplot(method = "histodot", binwidth = 1.5) +
  theme(text = element_text(size = 20))
[Figure: dot plot; x-axis: mpg, y-axis: count.]
stat_identity(): the identity statistic leaves the data unchanged.

geom_histogram()
ggplot(diamonds, aes(carat)) +
  geom_histogram(bins = 200) +
  theme(text = element_text(size = 20))
[Figure: histogram; x-axis: carat, y-axis: count.]
stat_bin(..., binwidth = NULL, bins = NULL, ...): continuous data, counts values by bin

geom_bar()
# Car counts per class
ggplot(mpg, aes(class)) +
  geom_bar()
[Figure: bar chart; x-axis: class (2seater, compact, midsize, minivan, pickup, subcompact, suv), y-axis: count.]
stat_count(): counts observations by group (discrete)
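To see what the count statistic contributes, the bar heights can be compared with a tally done directly in base R. This is a small sketch. The object names are made up, and `layer_data()` is used only to inspect what the layer computed.

```r
library(ggplot2)

bars <- ggplot(mpg, aes(class)) + geom_bar()

# Counts computed by the count statistic for the bar layer, in x-axis order
ld <- layer_data(bars)
stat_counts <- ld$count[order(ld$x)]

# The same tally done directly in base R (classes sorted alphabetically)
manual_counts <- as.vector(table(mpg$class))

all(stat_counts == manual_counts)
```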

Common two variable geoms
- Points
  - geom_point()
- Points with a little noise
  - geom_jitter()
- Quantile lines
  - geom_quantile()
  - Performs quantile regression and draws the fitted quantiles with lines
- Smoothed conditional means
  - geom_smooth()
  - Include standard errors

geom_point()
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme(text = element_text(size = 20))
[Figure: scatter plot; x-axis: wt, y-axis: mpg.]
stat_identity()

geom_jitter()
ggplot(mtcars, aes(wt, mpg)) +
  geom_jitter() +
  theme(text = element_text(size = 20))
[Figure: jittered scatter plot; x-axis: wt, y-axis: mpg.]
- Used for handling overplotting caused by discreteness in smaller datasets
stat_identity()
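Overplotting from discreteness is easier to see with whole-number data. The sketch below uses the `cty` and `hwy` columns of `mpg` rather than the `mtcars` example, and counts distinct positions before and after jittering. The jitter widths and the seed are arbitrary assumptions chosen for this illustration.

```r
library(ggplot2)

# City and highway mileage are whole numbers, so many cars share a position
n_cars      <- nrow(mpg)
n_positions <- nrow(unique(mpg[, c("cty", "hwy")]))
c(cars = n_cars, distinct_positions = n_positions)

# Jitter adds a small random offset to each point; the seed and the
# width/height values here are arbitrary choices for this sketch
jittered <- ggplot(mpg, aes(cty, hwy)) +
  geom_jitter(position = position_jitter(width = 0.3, height = 0.3, seed = 376))

# After jittering, the plotted positions no longer coincide
n_jittered <- nrow(unique(layer_data(jittered)[, c("x", "y")]))
n_jittered
```

Keeping the offsets small relative to the spacing of the data values preserves the overall pattern while still separating the stacked points.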

geom_quantile()
# geom_quantile() requires the quantreg package to be installed
ggplot(mpg, aes(displ, 1 / hwy)) +
  geom_point() +
  geom_quantile() +
  theme(text = element_text(size = 20))
[Figure: scatter plot with fitted quantile lines; x-axis: displ, y-axis: 1/hwy.]
stat_quantile(..., quantiles = c(0.25, 0.5, 0.75), ...): quantile regression

geom_smooth()
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth() +
  theme(text = element_text(size = 20))
[Figure: scatter plot with a smoothed line; x-axis: displ, y-axis: hwy.]
stat_smooth(..., level = 0.95, ...): conditional mean
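The `level` argument controls how wide the band around the smoothed line is. The sketch below compares a 95% band with a 99% band. It uses a straight-line fit (`method = "lm"`) only to keep the example simple, which is an assumption and differs from the default smoother used on the slide.

```r
library(ggplot2)

# Linear fit with a 95% band (the default level) and a wider 99% band
band95 <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.95)
band99 <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.99)

# Layer 2 is the smooth; ymin/ymax bound the band around the fitted y
width95 <- with(layer_data(band95, 2), mean(ymax - ymin))
width99 <- with(layer_data(band99, 2), mean(ymax - ymin))
c(width95 = width95, width99 = width99)
```

Setting `se = FALSE` in `geom_smooth()` drops the band and draws only the fitted line.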

Stats
ggplot requires that we specify either a geom_* or a stat_* when we construct our plots.

Scales
Scales are used to determine how each aesthetic will appear.
- Where in graphical space should the points occur?
- What is the necessary range in colours?
- What is the range in size of points?
As you progress you may begin customizing scales, but not at this time.

Coordinate Systems
Known as coord in ggplot, the coordinate system maps the position of objects in graphical space.
Coordinate systems affect how geoms look in the plot. Changing the coord from coord_cartesian to coord_polar for a geom_bar draws the bars around a circle instead of along a straight axis.
coord_cartesian()
The Cartesian coordinate system specifies each point uniquely in a plane by a pair of numerical coordinates (x, y), which are the signed distances to the point from two fixed perpendicular directed lines (the axes, which cross at the origin (0,0)), measured in the same unit of length.

geom_boxplot()
ggplot(diamonds, aes(cut, price)) +
  geom_boxplot() +
  theme(text = element_text(size = 20))
[Figure: box plots of price (y-axis) by cut (x-axis: Fair, Good, Very Good, Premium, Ideal).]

geom_boxplot()+coord_flip()
ggplot(diamonds, aes(cut, price)) +
  geom_boxplot() +
  coord_flip() +
  theme(text = element_text(size = 20))
[Figure: the same box plots with the axes flipped; cut on the vertical axis, price on the horizontal axis.]

geom_bar
ggplot(mpg, aes(class))+
geom_bar()+
theme(text = element_text(size=20))
[Figure: bar chart; x-axis: class (2seater, compact, midsize, minivan, pickup, subcompact, suv), y-axis: count.]

geom_bar + coord_polar()
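ggplot(mpg) +
  geom_bar(mapping = aes(x = class, fill = class)) +
  # The element_text() arguments are cut off on the slide;
  # size = 20 is assumed from the other examples.
  theme(axis.text.x = element_blank(),
        text = element_text(size = 20)) +
  coord_polar()
[Figure: bar counts by class drawn in polar coordinates; fill legend: class (2seater, compact, midsize, minivan, pickup, subcompact, suv).]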

Multiple Variables
Three variables is the maximum recommended number of variables that should be visualized with a single plot.
- Two variables visualized by position.
- One variable by another aesthetic.

More than three leads us to using facets
- Creating subplots
- Facet based on one variable, facet_wrap()
- Facet based on two variables, facet_grid()

# The theme() call is cut off on the slide; its arguments are assumed
# to match the other examples.
ggplot(data = gap) +
  geom_point(mapping = aes(gdpPercap, lifeExp,
                           colour = continent)) +
  theme(text = element_text(size = 20))
[Figure: lifeExp (y-axis) against gdpPercap (x-axis), points coloured by continent (Africa, Americas, Asia, Europe, Oceania).]

facet_wrap()
ggplot(data = gap) +
  geom_point(mapping = aes(gdpPercap, lifeExp)) +
  facet_wrap(~continent, nrow = 2) +
  theme(text = element_text(size = 20))
[Figure: lifeExp (y-axis) against gdpPercap (x-axis), one panel per continent: Africa, Americas, Asia, Europe, Oceania.]

facet_grid()
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl) +
  theme(text = element_text(size = 20))
[Figure: hwy (y-axis) against displ (x-axis); panel columns by cyl (4, 5, 6, 8), panel rows by drv (4, f, r).]

General issues with plots
- Too many variables, which can be overcome through facets
- Overplotting (discrete outcomes), we can jitter the values
- Alphabetical ordering, we can define our own order using factors or by mean/median values
- Polar coordinates, you must be careful when using these as humans struggle to interpret them.

Ordering demo
# Default order is alphabetical
ggplot(iris, aes(x = Species, y = Sepal.Width)) +
  geom_boxplot() +
  theme(text = element_text(size = 20))
[Figure: box plots of Sepal.Width (y-axis) by Species (x-axis), in alphabetical order: setosa, versicolor, virginica.]

Order by median
ggplot(iris, aes(x = reorder(Species, Sepal.Width, FUN = median),
                 y = Sepal.Width)) +
  geom_boxplot() +
  xlab("Species")
[Figure: box plots of Sepal.Width (y-axis) by Species (x-axis), ordered by median: versicolor, virginica, setosa.]
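What `reorder()` does to the factor can be inspected without drawing anything. This base R sketch prints the per-species medians and the factor levels before and after reordering. The variable name `by_median` is made up for illustration.

```r
# Median sepal width per species (iris ships with base R)
tapply(iris$Sepal.Width, iris$Species, median)

# Default factor levels are alphabetical
levels(iris$Species)

# reorder() sorts the levels by a summary of the second argument,
# here the median sepal width within each species
by_median <- reorder(iris$Species, iris$Sepal.Width, FUN = median)
levels(by_median)
```

The reordered levels run from the smallest median to the largest, which is the left-to-right order of the boxes in the plot.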

Homework
Read: Chapters 2 & 3, R4DS

References
Borkin, Michelle A., Zoya Bylinskii, Nam Wook Kim, Constance May Bainbridge, Chelsea S. Yeh, Daniel Borkin, Hanspeter Pfister, and Aude Oliva. 2015. “Beyond Memorability: Visualization Recognition and Recall,” 1–10.
Groeneveld, R. A., and G. Meeden. 1984. “Measuring Skewness and Kurtosis.” Journal of the Royal Statistical Society. Series D (The Statistician) 33 (4): 391–99.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28. doi:10.1198/jcgs.2009.07098.