Introduction to information system
Exploratory Graphs and
Base Plotting System in R
Bowei Chen
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M
Data Science 2016 – 2017 Workshop
Today’s Objectives
• Study the following slides:
– Part I: Exploratory Graphs (If you are familiar with statistical graphical
representations, please skip Part I and jump to Part II directly!)
– Part II: Base Plotting System in R
• Do Exercises 1-5 (all are important!)
• Do the additional exercises (which can help you understand and review
last week's lecture)
Part I:
Exploratory Graphs
If you are familiar with statistical graphical representations,
please skip Part I and jump to Part II directly!
Pie Chart
[Figure: pie chart of the variable taxs for 20 US states (AL through MI); each wedge is between 3% and 7%. Dataset: Cigarette]
A pie chart shows the relative
frequencies or percentages of the
levels of a categorical variable as
wedges of a pie/circle.
It is very useful in a well-designed
document intended for people who
will not read the data themselves
(e.g., management).
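As a minimal sketch of the idea, the base R function pie() can draw such a chart. The Cigarette data is not assumed to be available here, so the example uses the built-in mtcars dataset instead (an assumption made purely for illustration), treating the number of cylinders as the categorical variable.

```r
# Count the cars in each cylinder class (a categorical variable).
# mtcars is a stand-in dataset, not the Cigarette data from the slide.
counts <- table(mtcars$cyl)

# Convert the counts to rounded percentages for the wedge labels
pct <- round(100 * counts / sum(counts))
wedge_labels <- paste0(names(counts), " cyl: ", pct, "%")

# Draw one wedge per level of the categorical variable
pie(counts, labels = wedge_labels, main = "Cars by number of cylinders")
```

Each wedge's angle is proportional to the share of observations in that level, which is why a pie chart suits a small number of categories.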
Scatter Plot
In a scatter plot, a mark,
usually a dot or small circle,
represents a single data point.
With one mark (point) for every
data point, the distribution
of the data can be seen visually.
Depending on how tightly the
points cluster together, you may
be able to discern a clear trend
in the data.
[Figure: scatter plot of AirPassengers with a fitted linear trend line, y = 31.887x - 62057. x-axis: Date (1949 to 1959); y-axis: Number of air passengers (0 to 700). Dataset: AirPassengers]
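A scatter plot with a linear trend line like the one described above can be sketched in base R as follows. This is one possible approach, assuming the built-in AirPassengers time series and an ordinary least-squares fit with lm(); the original figure may have been produced with a different tool or a different date range, so the fitted coefficients need not match the slide exactly.

```r
# AirPassengers is a monthly time series; time() returns each date as a decimal year
year <- as.numeric(time(AirPassengers))
passengers <- as.numeric(AirPassengers)

# One mark per data point
plot(year, passengers,
     xlab = "Date", ylab = "Number of air passengers",
     main = "AirPassengers")

# Fit a straight line by least squares and overlay it on the scatter plot
fit <- lm(passengers ~ year)
abline(fit, col = "red")

# Intercept and slope of the fitted trend line
coef(fit)
```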
Line Plot
A line plot provides an excellent
way to map independent and
dependent variables that are both
quantitative.
The rises and falls of the line
make it easy to see how the
values change.
[Figure: line plot of AirPassengers. x-axis: Date (1949 to 1958); y-axis: Number of air passengers (0 to 700). Dataset: AirPassengers]
Multiple Line Plot
Multiple line plots have space-
saving characteristics. Because
the data values are marked by
small marks (points) and not
bars, they do not have to be
offset from each other (only
when data values are very dense
does this become a problem).
[Figure: multiple line plot of Stock1, Stock2 and Stock3. x-axis: Day; y-axis: Cumulative percentage (0 to 1). Dataset: StockShare]
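One way to sketch a multiple line plot in base R is matplot(), which draws one line per column of a matrix. The StockShare data is not assumed to be available, so the example below uses the first 100 days of three indices from the built-in EuStockMarkets dataset and rescales each series to start at 1; both choices are assumptions made for illustration.

```r
# Stand-in data: first 100 days of three indices (not the StockShare data)
idx <- data.frame(EuStockMarkets)[1:100, c("DAX", "SMI", "CAC")]

# Divide each column by its first value so all series start at 1
rel <- sweep(as.matrix(idx), 2, unlist(idx[1, ]), "/")

line_cols <- c("red", "blue", "darkgreen")

# type = "o" draws lines with small marks at each data value
matplot(rel, type = "o", pch = 1:3, lty = 1, cex = 0.5, col = line_cols,
        xlab = "Day", ylab = "Price relative to day 1")
legend("topleft", legend = colnames(rel), col = line_cols, lty = 1, pch = 1:3)
```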
Area Chart/Graph
An area chart/graph displays
quantitative data graphically,
with the area under each line filled in.
[Figure: stacked area chart of Stock1, Stock2 and Stock3. x-axis: Day; y-axis: Cumulative percentage (0% to 100%). Dataset: StockShare]
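Base R has no dedicated area chart function, but a stacked area chart can be sketched with polygon(). The example below is only an illustration under stated assumptions: it uses three indices from the built-in EuStockMarkets dataset (not the StockShare data) and defines each index's "share" as its fraction of the daily sum of the three.

```r
# Stand-in data: first 60 days of three indices (an assumption for illustration)
prices <- as.matrix(EuStockMarkets)[1:60, c("DAX", "SMI", "CAC")]

# Share of each index in the daily total, so every row sums to 1
shares <- prices / rowSums(prices)

# Cumulative shares give the upper edge of each band; the lower edge
# is the upper edge of the band beneath it (0 for the first band)
upper <- t(apply(shares, 1, cumsum))
lower <- cbind(0, upper[, -ncol(upper), drop = FALSE])

day <- seq_len(nrow(shares))
band_cols <- c("steelblue", "orange", "grey60")

# Empty plot region, then one filled polygon per band
plot(range(day), c(0, 1), type = "n",
     xlab = "Day", ylab = "Cumulative percentage", main = "Stacked area chart")
for (j in seq_len(ncol(shares))) {
  polygon(c(day, rev(day)), c(upper[, j], rev(lower[, j])),
          col = band_cols[j], border = NA)
}
legend("bottomright", legend = colnames(shares), fill = band_cols, bg = "white")
```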
Bar Chart
A bar plot is a chart that shows
grouped data with rectangular
bars with lengths proportional to
the values that they show. The
bars can be plotted vertically or
horizontally.
It is one of the best methods to
summarise categorical data.
[Figure: grouped bar chart of Stock1, Stock2 and Stock3. x-axis: Day (1 to 10); y-axis: Percentage (0 to 0.9). Dataset: StockShare]
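A grouped bar chart of categorical data can be sketched with the base R function barplot(). The example uses the built-in mtcars dataset rather than StockShare (an assumption for illustration), counting cars by number of gears and cylinders.

```r
# Cross-tabulate two categorical variables from the stand-in mtcars data
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# beside = TRUE places the bars of each group side by side instead of stacking;
# adding horiz = TRUE would draw the bars horizontally
mids <- barplot(counts, beside = TRUE,
                xlab = "Number of gears", ylab = "Number of cars",
                legend.text = paste(rownames(counts), "cyl"),
                args.legend = list(x = "topleft"))
```

barplot() returns the bar midpoints (here stored in mids), which is handy when adding text labels above the bars.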
Histogram
A histogram is a graphical
representation of the distribution of
quantitative data. It is an estimate
of the probability distribution of a
quantitative variable and was first
introduced by Karl Pearson.
[Figure: histogram of the MSFT adjusted closing price. x-axis: Adjusted closing price (40 to 60); y-axis: Frequency (0 to 25). Dataset: MSFT]
Histogram with
Distribution Fit
A histogram with a distribution
fit overlays a fitted theoretical
distribution on the histogram, so
that it can be compared with the
empirical distribution of the
variable. A common choice is the
Normal/Gaussian distribution.
[Figure: histogram of the MSFT adjusted closing price with a fitted distribution curve. x-axis: Adjusted closing price (40 to 60); y-axis: Frequency (0 to 25). Dataset: MSFT]
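A histogram with a Normal fit can be sketched in base R by drawing the histogram on the density scale and overlaying dnorm() with the sample mean and standard deviation. The MSFT data is not assumed to be available, so the built-in iris dataset stands in for it here (an assumption for illustration).

```r
# Stand-in variable (not the MSFT prices from the slide)
w <- iris$Sepal.Width

# freq = FALSE plots densities instead of counts, so the bars are
# on the same scale as a probability density curve
hist(w, freq = FALSE, col = "grey90",
     main = "Histogram with Normal fit", xlab = "Sepal.Width")

# Estimate the Normal parameters from the sample
mu <- mean(w)
sigma <- sd(w)

# Overlay the fitted Normal density
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, col = "red", lwd = 2)
```

Note that the figure on the slide uses frequencies on the y-axis; the density scale is used here only so that the curve and the bars are directly comparable.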
Part II:
Base Plotting System in R
Dataset (1/3)
> data(Chem97, package = "mlmRev")
> head(Chem97)
  lea school student score gender age gcsescore   gcsecnt
1   1      1       1     4      F   3     6.625 0.3393157
2   1      1       2    10      F  -3     7.625 1.3393157
3   1      1       3    10      F  -4     7.250 0.9643157
4   1      1       4    10      F  -2     7.500 1.2143157
5   1      1       5     8      F  -1     6.444 0.1583157
6   1      1       6    10      F   4     7.750 1.4643157
Dataset (2/3)
> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Dataset (3/3)
> data(EuStockMarkets)
> EuStockMarkets <- data.frame(EuStockMarkets)
> head(EuStockMarkets)
      DAX    SMI    CAC   FTSE
1 1628.75 1678.1 1772.8 2443.6
2 1613.63 1688.5 1750.5 2460.2
3 1606.51 1678.6 1718.0 2448.2
4 1621.04 1684.1 1708.1 2470.4
5 1618.16 1686.6 1723.1 2484.7
6 1610.61 1671.6 1714.3 2466.8
Histogram (1/2)
> hist(Chem97$gcsescore)
Histogram (2/2)
> hist(
+ Chem97$gcsescore,
+ main = "Histogram",
+ xlab = "gcsescore",
+ ylab = "Frequency",
+ col = "green"
+ )
Boxplot (1/2)
> boxplot(Chem97$gcsescore,
+ main = 'title',
+ ylab = 'gcsescore')
Boxplot (2/2)
> boxplot(
+ Chem97$gcsescore,
+ Chem97$age,
+ main = 'title',
+ ylab = 'value',
+ names = c('gcsescore', 'age')
+ )
Scatter Plot (1/3)
> plot(
+ Chem97$gcsescore,
+ Chem97$gcsecnt,
+ main = "title",
+ xlab = "gcsescore",
+ ylab = 'gcsecnt',
+ col = "blue"
+ )
Scatter Plot (2/3)
> pairs(iris)
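Scatter Plot (3/3)
> # unclass() turns the Species factor into integer codes 1, 2, 3,
> # which pick one background colour per species
> pairs(iris, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])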
Line Plot (1/3)
> plot(
+ EuStockMarkets$DAX,
+ type = "l",
+ main = 'EuStockMarkets',
+ xlab = 'Day',
+ ylab = 'DAX'
+ )
Line Plot (2/3)
> plot(
+ EuStockMarkets$DAX,
+ type = "l", col = 'red',
+ xlab = 'Day', ylab = 'Price'
+ )
> lines(EuStockMarkets$FTSE,
+ type = "l", col = 'blue')
> title("EuStockMarkets", cex.main = 1.1)
> legend(
+ 100, 5500, c("DAX", "FTSE"),
+ col = c('red', 'blue'),
+ text.col = "black",
+ lty = c(1,1), merge = TRUE
+ )
Line Plot (3/3)
> plot(
+ EuStockMarkets$DAX,
+ EuStockMarkets$CAC,
+ type = "l",
+ main = 'EuStockMarkets',
+ xlab = 'DAX',
+ ylab = 'CAC'
+ )
Exercise 1/5
1) Create a vector x from the series 1 to 1000
2) Create a vector y from the series 12 to 10002
3) Generate the following scatter plot, with x on the x-axis and y on the y-axis
Exercise 2/5
1) Create a data frame df that contains 3 variables: x, y, z (i.e., 3 columns)
2) Each variable has 500 observations (i.e., 500 rows)
3) x follows a standard normal distribution N(0,1)
4) y follows a continuous uniform distribution U[0,1]
5) z follows a Poisson distribution Poisson(0.5)
6) Generate a pairs plot for x, y, z
Please look up the pairs function (e.g., run ?pairs in R)
Exercise 3/5
Plot the following figure, where x is in [0, 2π] and y = sin(x)
Exercise 4/5
1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard
2) Import the dataset into R using openxlsx package
3) Save the data frame into df
4) Plot the histogram of df (the same as the one shown on the right)
Hint: a) bandwidth; b) values on the x-axis
Exercise 5/5
1) Download the file AMZN.csv from Blackboard
2) Import the dataset into R
3) Plot the multiple-line figure shown below
Additional Exercises
Well done if you've completed the exercises. Once you complete these
additional exercises, you can leave the workshop session.
Additional Exercise (1/2)
If a random variable 𝑋 follows an exponential distribution, 𝑋~𝐸𝑥𝑝(𝜆), please
prove the following equations:
1) $\int_{0}^{\infty} \lambda e^{-\lambda x} \, dx = 1$
2) $\mathbb{E}[X] = \frac{1}{\lambda}$
3) $\mathbb{V}[X] = \frac{1}{\lambda^{2}}$
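Before attempting the proofs, the three identities can be checked numerically. The sketch below is not a proof; it uses R's integrate() with an arbitrarily chosen rate, lambda = 2 (an assumption for illustration), and computes the variance from the first two moments.

```r
lambda <- 2  # arbitrary rate, chosen only for this numerical check

# Exponential density f(x) = lambda * exp(-lambda * x) for x >= 0
f <- function(x) lambda * exp(-lambda * x)

# 1) Total probability
total <- integrate(f, 0, Inf)$value

# 2) Mean: integral of x * f(x)
m1 <- integrate(function(x) x * f(x), 0, Inf)$value

# 3) Variance: E[X^2] - (E[X])^2
m2 <- integrate(function(x) x^2 * f(x), 0, Inf)$value
variance <- m2 - m1^2

# Compare with 1, 1 / lambda and 1 / lambda^2
c(total = total, mean = m1, variance = variance)
```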
Additional Exercise (2/2)
If a random variable 𝑋 follows a Normal distribution, 𝑋 ∼ 𝒩(𝜇, 𝜎2), please
prove the following equation:
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) dx = 1$$
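The same kind of numerical sanity check applies here. It does not replace the proof, and the parameter values mu = 1.5 and sigma = 0.7 are arbitrary assumptions chosen for illustration.

```r
mu <- 1.5     # arbitrary mean for the check
sigma <- 0.7  # arbitrary standard deviation for the check

# Normal density written out exactly as in the equation above
normal_pdf <- function(x) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / sqrt(2 * pi * sigma^2)
}

# Numerical integral over the whole real line; should be close to 1
area <- integrate(normal_pdf, -Inf, Inf)$value
area
```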
Thank You!
bchen@lincoln.ac.uk