Introduction to information system

Exploratory Graphs and Base Plotting System in R

Bowei Chen

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M

Data Science 2016 – 2017 Workshop

Today’s Objectives

• Study the following slides:

– Part I: Exploratory Graphs (if you are familiar with statistical graphical representations, skip Part I and go straight to Part II)

– Part II: Base Plotting System in R

• Do exercises 1-5 (all are important!)

• Do the additional exercises (they will help you understand and review last week's lecture)

Part I: Exploratory Graphs

If you are familiar with statistical graphical representations, please skip Part I and jump to Part II directly!

Pie Chart

[Figure: pie chart of taxs by state for AL, AR, AZ, CA, CO, CT, DE, FL, GA, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME and MI; each wedge is between 3% and 7% of the total. Dataset: Cigarette]

A pie chart shows the relative frequencies or percentages of the levels of a categorical variable as wedges of a circle.

It is very useful in a well-designed document intended for people who will not read the data themselves (e.g., management).
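As a minimal sketch, a pie chart of a categorical variable can be drawn in base R with pie(). The Cigarette data is not bundled with base R, so this example assumes the built-in mtcars dataset as a stand-in and treats the number of cylinders as the categorical variable; the label text is illustrative only.

```r
# Assumption: mtcars (bundled with R) stands in for the Cigarette data.
counts <- table(mtcars$cyl)                  # frequency of each level
pct <- round(100 * prop.table(counts))       # relative frequency in percent
wedge_labels <- paste0(names(counts), " cylinders: ", pct, "%")

# One wedge per level, sized by its frequency.
pie(counts, labels = wedge_labels, main = "Cars by number of cylinders")
```

Passing a table of counts is enough: pie() rescales the values to fractions of the circle itself.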

Scatter Plot

In a scatter plot, a mark (usually a dot or small circle) represents a single data point. With one mark for every data point, the distribution of the data becomes visible.

Depending on how tightly the points cluster together, you may be able to discern a clear trend in the data.

[Figure: scatter plot of AirPassengers with a linear trend line, y = 31.887x - 62057; x-axis: Date (1949 to 1959), y-axis: Number of air passengers (0 to 700). Dataset: AirPassengers]
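A scatter plot with a straight trend line, similar in spirit to the AirPassengers slide, can be sketched as follows. It assumes the AirPassengers series bundled with R and an ordinary least-squares fit via lm(); the slide does not say how its trend line was computed, so the fitted coefficients may differ from the equation shown there.

```r
# AirPassengers is a monthly time series bundled with R.
year <- as.numeric(time(AirPassengers))      # fractional year, e.g. 1949.083
passengers <- as.numeric(AirPassengers)

plot(year, passengers,
     xlab = "Date", ylab = "Number of air passengers",
     main = "AirPassengers")

fit <- lm(passengers ~ year)                 # least-squares straight line
abline(fit, col = "red")                     # add the trend line to the plot
print(coef(fit))                             # intercept and slope
```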

Line Plot

A line plot is an excellent way to map an independent variable against a dependent variable when both are quantitative.

The rises and falls of the line make it easy to see how the dependent variable changes.

[Figure: line plot of AirPassengers; x-axis: Date (1949 to 1958), y-axis: Number of air passengers (0 to 700). Dataset: AirPassengers]

Multiple Line Plot

Multiple line plots save space. Because the data values are shown as small marks (points) rather than bars, the series do not have to be offset from each other; this only becomes a problem when the data values are very dense.

[Figure: multiple line plot of Stock1, Stock2 and Stock3; x-axis: Day (1 to 96), y-axis: Cumulative percentage (0 to 1). Dataset: StockShare]

Area Chart/Graph

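An area chart (or area graph) displays quantitative data graphically. It is based on the line plot, with the area between each line and the axis (or the line below it) filled in.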

[Figure: stacked area chart of Stock1, Stock2 and Stock3; x-axis: Day (1 to 96), y-axis: Cumulative percentage (0% to 100%). Dataset: StockShare]
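Base R has no dedicated area chart function, but a stacked area chart can be sketched with polygon(). The StockShare data is not available here, so this example assumes the first 100 rows of the built-in EuStockMarkets data as a stand-in and converts the four index levels to shares that sum to 1 on each day; the colours are arbitrary.

```r
# Assumption: EuStockMarkets (bundled with R) stands in for StockShare.
prices <- as.matrix(EuStockMarkets)[1:100, ]
share <- prices / rowSums(prices)            # each row now sums to 1
cum_share <- t(apply(share, 1, cumsum))      # cumulative share per day
day <- seq_len(nrow(cum_share))
fill <- c("steelblue", "orange", "grey60", "gold")

# Empty plot first, then one filled band per series.
plot(day, cum_share[, ncol(cum_share)], type = "n", ylim = c(0, 1),
     xlab = "Day", ylab = "Cumulative percentage", main = "Area chart")
lower <- rep(0, length(day))
for (j in seq_len(ncol(cum_share))) {
  # Fill the band between the previous cumulative curve and this one.
  polygon(c(day, rev(day)), c(lower, rev(cum_share[, j])),
          col = fill[j], border = NA)
  lower <- cum_share[, j]
}
legend("bottomright", legend = colnames(prices), fill = fill, bg = "white")
```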

Bar Chart

A bar chart shows grouped data with rectangular bars whose lengths are proportional to the values they represent. The bars can be plotted vertically or horizontally.

It is one of the best ways to summarise categorical data.

[Figure: grouped bar chart of Stock1, Stock2 and Stock3; x-axis: Day (1 to 10), y-axis: Percentage (0 to 0.9). Dataset: StockShare]
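A grouped bar chart can be sketched with barplot(). Since StockShare is not available, this example assumes the small VADeaths matrix bundled with R as a stand-in (rows are age groups, columns are population groups); the axis labels are illustrative.

```r
# Assumption: VADeaths (bundled with R) stands in for StockShare.
# beside = TRUE places the bars of each column side by side instead of stacking.
mids <- barplot(VADeaths, beside = TRUE,
                legend.text = rownames(VADeaths),
                args.legend = list(x = "topleft", bty = "n"),
                xlab = "Population group", ylab = "Deaths per 1000",
                main = "Grouped bar chart")

# horiz = TRUE draws the same bars horizontally.
barplot(VADeaths, beside = TRUE, horiz = TRUE, las = 1)
```

barplot() returns the bar midpoints, which is handy when adding text or error bars on top of the bars.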

Histogram

A histogram is a graphical representation of the distribution of quantitative data. It is an estimate of the probability distribution of a quantitative variable and was first introduced by Karl Pearson.

[Figure: histogram of the adjusted closing price; x-axis: Adjusted closing price (40 to 60), y-axis: Frequency (0 to 25). Dataset: MSFT]

Histogram with Distribution Fit

A histogram with a distribution fit shows the empirical distribution of a variable together with a fitted theoretical distribution. The Normal (Gaussian) distribution is sometimes used for the fit.

[Figure: histogram of the adjusted closing price with a fitted distribution curve; x-axis: Adjusted closing price (40 to 60), y-axis: Frequency (0 to 25). Dataset: MSFT]
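One way to overlay a Normal fit on a histogram in base R is sketched below. The MSFT prices are not available here, so the example assumes iris$Sepal.Width as a stand-in, and it estimates the Normal parameters by the sample mean and standard deviation, which is only one of several possible fitting choices.

```r
# Assumption: iris$Sepal.Width stands in for the MSFT price series.
values <- iris$Sepal.Width

# freq = FALSE plots densities, so the bars are on the same scale as dnorm().
hist(values, freq = FALSE, col = "lightgrey", xlab = "Sepal.Width",
     main = "Histogram with Normal fit")

mu <- mean(values)
sigma <- sd(values)

# Overlay the Normal density with the estimated mean and standard deviation.
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, col = "red", lwd = 2)
```

Without freq = FALSE the histogram shows counts, and the density curve would sit almost flat along the bottom of the plot.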

Part II: Base Plotting System in R

Dataset (1/3)

> data(Chem97, package = "mlmRev")
> head(Chem97)
  lea school student score gender age gcsescore   gcsecnt
1   1      1       1     4      F   3     6.625 0.3393157
2   1      1       2    10      F  -3     7.625 1.3393157
3   1      1       3    10      F  -4     7.250 0.9643157
4   1      1       4    10      F  -2     7.500 1.2143157
5   1      1       5     8      F  -1     6.444 0.1583157
6   1      1       6    10      F   4     7.750 1.4643157

Dataset (2/3)

> data(iris)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Dataset (3/3)

> data(EuStockMarkets)
> EuStockMarkets <- data.frame(EuStockMarkets)
> head(EuStockMarkets)
      DAX    SMI    CAC   FTSE
1 1628.75 1678.1 1772.8 2443.6
2 1613.63 1688.5 1750.5 2460.2
3 1606.51 1678.6 1718.0 2448.2
4 1621.04 1684.1 1708.1 2470.4
5 1618.16 1686.6 1723.1 2484.7
6 1610.61 1671.6 1714.3 2466.8

Histogram (1/2)

> hist(Chem97$gcsescore)

Histogram (2/2)

> hist(
+ Chem97$gcsescore,
+ main = "Histogram",
+ xlab = "gcsescore",
+ ylab = "Frequency",
+ col = "green"
+ )
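The bin layout of a histogram can also be controlled through the breaks argument of hist(). The sketch below assumes faithful$eruptions (bundled with R) as a stand-in for Chem97$gcsescore so that it runs without installing mlmRev; the bin width of 0.25 is an arbitrary illustrative choice.

```r
# Assumption: faithful$eruptions stands in for Chem97$gcsescore.
durations <- faithful$eruptions

# Default binning chosen by hist().
h_default <- hist(durations, main = "Default bins", xlab = "eruptions")

# Explicit bin edges every 0.25 units; the edges must cover the data range.
edges <- seq(floor(min(durations)), ceiling(max(durations)), by = 0.25)
h_narrow <- hist(durations, breaks = edges, col = "green",
                 main = "Bin width 0.25", xlab = "eruptions")
```

hist() returns its bin edges and counts invisibly, so h_narrow$breaks and h_narrow$counts can be inspected after plotting.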

Boxplot (1/2)

> boxplot(Chem97$gcsescore,
+ main = 'title',
+ ylab = 'gcsescore')

Boxplot (2/2)

> boxplot(
+ Chem97$gcsescore,
+ Chem97$age,
+ main = 'title',
+ ylab = 'value',
+ names = c('gcsescore', 'age')
+ )
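Besides passing several vectors, boxplot() also accepts a formula, which draws one box per level of a grouping factor. This sketch assumes the iris data introduced earlier rather than Chem97, so that it runs without extra packages.

```r
# Formula interface: Sepal.Length split by the levels of Species.
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              main = "Sepal length by species",
              xlab = "Species", ylab = "Sepal.Length")

bp$names   # group labels, in plotting order
bp$stats   # per group: lower whisker, lower hinge, median, upper hinge, upper whisker
```

Comparing one variable across groups keeps all boxes on a common scale, which is not guaranteed when unrelated variables such as gcsescore and age share an axis.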

Scatter Plot (1/3)

> plot(
+ Chem97$gcsescore,
+ Chem97$gcsecnt,
+ main = "title",
+ xlab = "gcsescore",
+ ylab = 'gcsecnt',
+ col = "blue"
+ )

Scatter Plot (2/3)

> pairs(iris)

Scatter Plot (3/3)

> pairs(iris, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
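The colouring in the pairs() call works by indexing a colour vector with the integer codes of the Species factor. The sketch below spells this out step by step; the intermediate variable names are illustrative, and it plots only the four numeric columns, which is a small departure from the slide.

```r
species_code <- unclass(iris$Species)        # integer codes 1, 2, 3
palette3 <- c("red", "green3", "blue")
point_bg <- palette3[species_code]           # one fill colour per row of iris

# Check which colour each species received.
print(table(iris$Species, point_bg))

# pch = 21 is a filled circle whose fill colour is taken from bg.
pairs(iris[, 1:4], pch = 21, bg = point_bg)
```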

Line Plot (1/3)

> plot(
+ EuStockMarkets$DAX,
+ type = "l",
+ main = 'EuStockMarkets',
+ xlab = 'Day',
+ ylab = 'DAX'
+ )

Line Plot (2/3)
> plot(
+ EuStockMarkets$DAX,
+ type = "l", col = 'red',
+ xlab = 'Day', ylab = 'Price'
+ )
> lines(EuStockMarkets$FTSE,
+ type = "l", col = 'blue')
> title("EuStockMarkets", cex.main = 1.1)
> legend(
+ 100, 5500, c("DAX", "FTSE"),
+ col = c('red', 'blue'),
+ text.col = "black",
+ lty = c(1,1), merge = TRUE
+ )
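As an alternative sketch, matplot() draws several columns in one call and sets the y-axis range to cover all of them, so no series is clipped by the limits of the first plot() call. The variable names and the legend position are illustrative choices, not taken from the slides.

```r
# data.frame() works whether or not EuStockMarkets was already converted.
stocks <- data.frame(EuStockMarkets)
line_cols <- c("red", "blue")

# One line per column; the y-axis limits are computed from both series.
matplot(as.matrix(stocks[, c("DAX", "FTSE")]), type = "l", lty = 1,
        col = line_cols, xlab = "Day", ylab = "Price",
        main = "EuStockMarkets")

# A keyword position avoids hard-coding legend coordinates.
legend("topleft", legend = c("DAX", "FTSE"), col = line_cols, lty = 1)
```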

Line Plot (3/3)

> plot(
+ EuStockMarkets$DAX,
+ EuStockMarkets$CAC,
+ type = "l",
+ main = 'EuStockMarkets',
+ xlab = 'DAX',
+ ylab = 'CAC'
+ )


Exercise 1/5

1) Create a vector x containing the series 1 to 1000

2) Create a vector y containing a series from 12 to 10002 (y needs the same length as x to be plotted against it)

3) Generate a scatter plot with x on the x-axis and y on the y-axis

Exercise 2/5

1) Create a data frame df that contains 3 variables: x, y, z (i.e., 3 columns)

2) Each variable has 500 observations (i.e., 500 rows)

3) x follows a standard Normal distribution N(0,1)

4) y follows a continuous uniform distribution U[0,1]

5) z follows a Poisson distribution Poisson(0.5)

6) Generate a pairs plot for x, y, z

Hint: look up the pairs function (for example, run ?pairs in R).

Exercise 3/5

Plot the following figure, where x is in [0, 2π] and y = sin(x)

Exercise 4/5

1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard

2) Import the dataset into R using the openxlsx package

3) Save the data frame as df

4) Plot a histogram of df (matching the one shown on the right)

Hint: a) bin width; b) values on the x-axis

Exercise 5/5

1) Download the file AMZN.csv from Blackboard

2) Import the dataset into R

3) Plot the multiple-line figure shown below

Additional Exercises

Well done if you've completed the exercises. Once you complete these additional exercises, you can leave the workshop session.

Additional Exercise (1/2)

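If a random variable $X$ follows an exponential distribution, $X \sim \mathrm{Exp}(\lambda)$, please prove the following equations:

1) $\int_{0}^{\infty} \lambda e^{-\lambda x} \, dx = 1$

2) $\mathbb{E}[X] = \frac{1}{\lambda}$

3) $\mathbb{V}[X] = \frac{1}{\lambda^{2}}$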

Additional Exercise (2/2)

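If a random variable $X$ follows a Normal distribution, $X \sim \mathcal{N}(\mu, \sigma^{2})$, please prove the following equation:

$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) dx = 1$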

Thank You!

bchen@lincoln.ac.uk
