Introduction to information system

Introduction to R

Bowei Chen

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M

Data Science 2016 – 2017 Workshop

What is R?

• R is a free software environment for

statistical computing and graphics.

• R compiles and runs on a wide

variety of UNIX platforms, Windows

and MacOS.

• R can be downloaded at:

https://cran.r-project.org/

Comprehensive R Archive Network (CRAN)

• CRAN includes packages which provide additional functionalities.

• Over 7,801 additional packages (as of January 2016) are available at CRAN,

Bioconductor, Omegahat, GitHub, and other repositories.

• R packages are written mainly by academics and company staff.

• The R Foundation is based in Vienna, Austria, and is currently hosted by

the Vienna University of Economics and Business. It is a registered

association under Austrian law and active worldwide.

Short History of R (1/2)

• S is a statistical programming language developed primarily by John

Chambers, Rick Becker and Allan Wilks at Bell Laboratories since 1976.

• The two modern implementations of S are:

– R: part of the GNU free software project

– S-PLUS (or S+): A commercial product sold by TIBCO Software

Short History of R (2/2)

• S-PLUS is a commercial implementation of the S programming language sold

by TIBCO Software Inc.

• R was created by Ross Ihaka and Robert Gentleman at the University of

Auckland, New Zealand, and is currently developed by the R Development

Core Team, of which John Chambers is a member. R is named partly after

the first names of the first two R authors and partly as a play on the name of S.

What Can You Do Using R? (1/2)

• Data entry and manipulation

– Input data

• from keyboard

• from spreadsheet

• from another statistics package

– Manipulate data

• Statistical analysis

– Descriptive statistics

– Statistical inference

What Can You Do Using R? (2/2)

• Graphical display

– Predefined plots for some models

– Flexible, powerful options

– Save to image files in various formats

• Write new functions

– Make a change to an existing function

– Create new functions tailored to your exact needs

– Contribute a new package

• Create documents (with Sweave, knitr)

– PDF (article and slides)

– HTML

Why Use R for Data Science Computing? (1/2)

• Open source (R is a GNU implementation of S)

• Good visualisations (ggplot2, lattice, standard plot library)

• Easier for writing custom packages and functions

• Closer to the statistics and machine learning community

• Better LaTeX support (Sweave, knitr)

• Works with big data (RHadoop, SparkR, Rcpp)

By Gregory Piatetsky, KDnuggets

http://www.kdnuggets.com/author/gregory-piatetsky

Limitations of R

• The quality of some packages is less than perfect. They are not error-free!

• Many R commands give little thought to memory management, and so R

can very quickly consume all available memory. This can be a restriction

when doing data mining. There are various solutions, including using 64-bit

operating systems that can access much more memory than 32-bit ones.

• Documentation is sometimes patchy and terse, and impenetrable to

the non-statistician. However, some very high-standard books are

increasingly plugging the documentation gaps.

RGui

When R is waiting
for us to tell it what
to do, it begins the
line with >

Type
• ‘demo()’ for some demos
• ‘help()’ for on-line help
• ‘help.start()’ for an HTML browser interface
• ‘q()’ to quit R

Editors and IDEs

• RStudio

• Jupyter Notebook

• Vim

• Emacs (ESS)

• Eclipse (StatET)

• Tinn-R

• Notepad++

• LaTeX/LyX (knitr, Sweave)

• …

https://www.rstudio.com/

R source editor (Ctrl+1)

R console (Ctrl+2)

Environment (Ctrl+8)
History (Ctrl+4)

Help (Ctrl+3)
Files (Ctrl+5)
Plots (Ctrl+6)

Packages (Ctrl+7)

Objects

• Everything in R is an object, having a class.

• Data and intermediate results are stored in R objects.

• The class of the object describes both what the object contains and what

many standard functions do with it.

• Objects are usually accessed by name.
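
The points above can be tried at the console. The following is a minimal sketch; the object names (num, txt, flag, fit_fun) are made up for illustration and are not part of the original slides.

```r
# Every value is an object with a class; class() reports it.
num <- 3.14
txt <- "hello"
flag <- TRUE
fit_fun <- sqrt            # functions are objects too

classes <- c(class(num), class(txt), class(flag), class(fit_fun))
print(classes)

# A standard function such as summary() behaves differently
# depending on the class of the object it is given.
summary_num <- summary(c(1, 2, 3, 4))
summary_fac <- summary(factor(c("a", "b", "a")))
print(summary_num)         # quartiles and mean for numbers
print(summary_fac)         # counts per level for a factor
```

Objects are then accessed by the names they were given, for example print(num).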

R Commands

• R commands are either assignments or expressions

• Commands are separated either by a semicolon ; or newline

Assignment Operations

An assignment command evaluates an expression and passes the value to a variable but the result is not printed.

x <- 1+2

`<-`(x, 1+2) #same thing

x = 1+2 #same thing

Expression Operations

An expression command is evaluated and (normally) printed. If the statement results in a value, R will print that value automatically.

> 1+2

[1] 3

> 1+2*3

[1] 7

> (1+2)*3

[1] 9
In R, any number that you print
out in the console is interpreted as
a vector. A vector is an ordered
collection of numbers. The “[1]”
means that the index of the first
item displayed in the row is 1.
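
As a rough illustration of the difference between assignments and expressions, and of the bracketed index in the output, one might run the sketch below. The variable names are illustrative only.

```r
# Assignment: evaluates the right-hand side and stores it; nothing is printed.
x <- 1 + 2
y = 1 + 2            # equivalent at the top level
`<-`(z, 1 + 2)       # the operator written as an ordinary function call

# Expression: evaluated and printed automatically at the console.
x * 3

# Wrapping an assignment in parentheses makes R print the stored value too.
(w <- (1 + 2) * 3)

# With a longer vector the output wraps, and each row starts with the
# index of its first item.
long_vec <- 101:130
print(long_vec)
```

How many items fit on each output row depends on the width of the console, so the bracketed indices on later rows will vary.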

Workspace

• R stores objects in a workspace that is kept in memory.

• When you quit, R asks whether you want to save that workspace.

• The workspace containing all objects you work on can then be restored next

time you work with R along with a history of the used commands.
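
The save-and-restore cycle can be sketched with base R's save() and load(). This is an assumption-laden illustration: it uses a separate environment (ws) as a stand-in for the real workspace and a temporary file, so that it does not touch your own objects; at the console you would normally use ls(), save.image() and load() directly.

```r
# A stand-in for the workspace: an environment holding two objects.
ws <- new.env()
assign("doses", c(3, 5, 2), envir = ws)
assign("label", "trial A", envir = ws)
print(ls(ws))                         # names of the stored objects

# Save the objects to a file, much as R does when you agree to save on quitting.
ws_file <- tempfile(fileext = ".RData")
save(list = ls(ws), envir = ws, file = ws_file)

# Restore them into a fresh environment, much as R does at the next start-up.
restored <- new.env()
load(ws_file, envir = restored)
print(get("doses", envir = restored))

unlink(ws_file)                       # remove the temporary file
```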

Variables (1/3)

A variable is a symbol that holds a

value, which can be any R object.

The types of variables are:

• Integer

• Double

• Character

• Logical

• Factor or categorical

Variables (2/3)

Integer, double (numerical values)

> a = 49

> sqrt(a)

[1] 7

> a <- pi

> print(a)

[1] 3.141593

Character, string, logical

> a = "The dog ate my homework"

> sub("dog","cat",a)

[1] "The cat ate my homework"

> a = (1+1==3)

> a

[1] FALSE

Variables (3/3)

Factor

> a <- factor(c("H", "e", "l", "l", "o"))

> print(a)

[1] H e l l o

Levels: e H l o

> class(a)

[1] "factor"
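
To see how the variable types listed above differ, typeof() and class() can be compared side by side. The sketch below uses made-up variable names; note that a factor is stored internally as integer codes plus a set of levels.

```r
n_int <- 49L               # integer (note the L suffix)
n_dbl <- 49                # double: the default for numbers
msg   <- "The dog ate my homework"
ok    <- (1 + 1 == 3)      # logical
grp   <- factor(c("low", "high", "low"))

types <- c(typeof(n_int), typeof(n_dbl), typeof(msg), typeof(ok), typeof(grp))
print(types)

print(class(grp))          # "factor"
print(levels(grp))         # the distinct categories, sorted
print(as.integer(grp))     # the integer code behind each element
```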

Types of Numerical Variables (1/2)

When we use numerical objects, in

mathematical terms, variables can be

classified as:

• Scalars

• Vectors

• Matrices

A scalar is a single number

> x <- 5

> Y <- 100

Types of Numerical Variables (2/2)

A vector is a sequence of numbers

> x <- c(3, 5, 2)

> x

[1] 3 5 2

A matrix is a two-way table of numbers

> x <- matrix(c(2, 3, 4, 5, 6, 7), nrow=3, ncol=2)

> x

     [,1] [,2]
[1,]    2    5
[2,]    3    6
[3,]    4    7
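
The matrix above is filled column by column, which is easy to miss. A short sketch of indexing and of the byrow argument (variable names are illustrative) might look like this:

```r
s <- 5                                                   # scalar: a vector of length 1
v <- c(3, 5, 2)                                          # vector
m <- matrix(c(2, 3, 4, 5, 6, 7), nrow = 3, ncol = 2)     # filled column by column

print(length(s))
print(dim(m))          # number of rows, number of columns
print(m[2, 1])         # row 2, column 1
print(m[, 2])          # the whole second column

# byrow = TRUE fills the matrix row by row instead.
m_by_row <- matrix(c(2, 3, 4, 5, 6, 7), nrow = 3, ncol = 2, byrow = TRUE)
print(m_by_row)

# Arithmetic is element-wise; t() transposes.
print(v * 2)
print(t(m))
```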

Variable Names

• You can use simple variable names like x, y, A, and a (note that A and a are
different variable names). You can also use longer names like counter,
index1, or subject_id.

• A variable name can contain digits, but it cannot begin with a digit.

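• Be careful about reusing the names of built-in functions or symbols for your

own variables! For example, you could create a variable named log; R would
still find the logarithm function when you call log(x), but the code becomes
confusing, and a function of your own named log would mask the built-in one.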
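
The naming rules can be checked with a few lines. This is a sketch only; the names are examples, and make.names() is used here merely to show how R would repair an invalid name.

```r
A <- 1
a <- 2
print(A == a)                  # FALSE: names are case-sensitive

subject_id <- 17               # longer names with underscores are fine
index1 <- 3                    # digits are allowed after the first character

# A name may not begin with a digit; make.names() shows the repaired form.
print(make.names("1index"))

# A variable that shares its name with a built-in function: when the name
# is used in a call, R looks for a function and skips the numeric value.
log <- 100
print(log(log, base = 10))
rm(log)                        # remove our variable again
```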

Comments

A comment is anything you write in

your program code that is ignored by

the computer.

Comments help others understand

your code. Anything following a “#”

character is a comment in R.

> x <- c(3, 5, 2) ## These are the doses of the new drug formulation.

Arithmetic Operators

Addition                             +
Subtraction                          -
Multiplication                       *
Division                             /
Exponentiation                       ^ or **
Modulus (x mod y), 5%%2 is 1         x %% y
Integer division, 5%/%2 is 2         x %/% y

Comparison Operators

Equal                                ==
Not equal                            !=
Greater than                         >
Greater than or equal                >=
Less than                            <
Less than or equal                   <=

Logical Operators

x and y                              x & y
x or y                               x | y
Not x                                !x
Test if x is TRUE                    isTRUE(x)

Numeric Functions

Absolute value                       abs(x)
Square root                          sqrt(x)
ceiling(3.475) is 4                  ceiling(x)
floor(3.475) is 3                    floor(x)
round(3.475, digits=2) is 3.48       round(x, digits=n)
signif(3.475, digits=2) is 3.5       signif(x, digits=n)
Cosine, sine, tan, …                 cos(x), sin(x), tan(x)
Natural logarithm                    log(x)
Common logarithm                     log10(x)
Exponential of x                     exp(x)

Control Structures: if

Syntax:

if(cond1==true) { cmd1 }

> if (TRUE) {

+ "this will be printed if it is TRUE"

+ }

[1] "this will be printed if it is TRUE"
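
The arithmetic, comparison and logical operators and the numeric functions listed earlier can be tried out together with if. The values below are arbitrary examples, and the sketch only uses base R.

```r
# Arithmetic
print(7 + 2); print(7 - 2); print(7 * 2); print(7 / 2)
print(2 ^ 3)
print(5 %% 2)          # modulus
print(5 %/% 2)         # integer division
print((-5) %/% 2)      # integer division rounds towards minus infinity

# Comparison and logical operators work element by element on vectors.
x <- c(3, 5, 2)
print(x > 2)
print(x > 2 & x != 5)
print(!(x > 2))
print(isTRUE(all(x > 1)))

# Numeric functions
print(ceiling(3.475))
print(floor(3.475))
print(signif(3.475, digits = 2))
print(sqrt(49))
print(exp(0))

# A comparison result can drive an if statement.
if (5 %% 2 == 1) {
  parity <- "odd"
}
print(parity)
```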

Control Structures: if-else

Syntax:

if(cond1==true) { cmd1 } else { cmd2 }

> if(1==0) {

+ print(1)

+ } else {

+ print(2)

+ }

[1] 2
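
Several conditions can be chained with else if. The function describe_sign below is a hypothetical helper written for this sketch, not something from the slides.

```r
# Hypothetical helper: classify a single number by its sign.
describe_sign <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

print(describe_sign(5))
print(describe_sign(-2))
print(describe_sign(0))
```

When typing at the console, keep else on the same line as the closing brace of the if block; otherwise R treats the if statement as finished before it sees the else.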

Control Structures: ifelse

Syntax:

ifelse(cond, yes, no)

> ifelse(1 == 0,

+ "this will be printed if 1==0",

+ "this will be printed if 1!=0")

[1] "this will be printed if 1!=0"
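
Unlike if, ifelse works on a whole vector of conditions at once. A small sketch with made-up exam scores:

```r
scores <- c(35, 72, 50, 48)

# One test per element; the result has the same length as the condition.
result <- ifelse(scores >= 50, "pass", "fail")
print(result)

# The yes and no arguments may themselves be vectors: here values above 60
# are capped at 60 and the rest are left unchanged.
capped <- ifelse(scores > 60, 60, scores)
print(capped)
```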

Control Structures: for

Syntax:

for (var in seq) { expr }

> x <- c("a", "a", "a", "a", "a")

> for (i in x){

+ print(i)

+ }

[1] "a"

[1] "a"

[1] "a"

[1] "a"

[1] "a"
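
A for loop can run over the values of a vector or over its positions. The sketch below shows both, plus a running total; the vector fruits and the other names are illustrative.

```r
fruits <- c("apples", "oranges", "bananas")

# Loop over the values directly.
for (fruit in fruits) {
  print(fruit)
}

# Loop over positions; seq_along() is safe even when the vector is empty.
lengths_found <- integer(length(fruits))
for (i in seq_along(fruits)) {
  lengths_found[i] <- nchar(fruits[i])
}
print(lengths_found)

# Accumulate a running total.
total <- 0
for (k in 1:10) {
  total <- total + k
}
print(total)
```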

Control Structures: repeat

Syntax:

repeat { expr; if (cond) break }

> i <- 10
> repeat {
+ if (i > 25)
+ break
+ else {
+ print(i); i <- i + 5;
+ }
+ }
[1] 10
[1] 15
[1] 20
[1] 25

Control Structures: while

Syntax:

while (cond) { expr }

> i <- 10
> while (i <= 25) {
+ print(i); i <- i + 5
+ }
[1] 10
[1] 15
[1] 20
[1] 25

Control Structures: switch

Syntax:

switch(expr, ...)

> AA = 'foo'
> switch(AA,
+ foo = {
+ print('foo') # case 'foo'
+ },
+ bar = {
+ print('bar') # case 'bar'
+ },
+ {
+ print('default')
+ })
[1] "foo"
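
The while, repeat and switch structures can be compared in one short script. In this sketch the results are collected into vectors so they can be inspected afterwards; pick is a hypothetical helper, not part of the slides.

```r
# while: the condition is tested before each pass.
i <- 10
seen_while <- c()
while (i <= 25) {
  seen_while <- c(seen_while, i)
  i <- i + 5
}
print(seen_while)

# repeat: loops until break is reached.
i <- 10
seen_repeat <- c()
repeat {
  if (i > 25) break
  seen_repeat <- c(seen_repeat, i)
  i <- i + 5
}
print(seen_repeat)

# switch: picks a branch by name; the unnamed last argument is the default.
pick <- function(key) {
  switch(key,
         foo = "matched foo",
         bar = "matched bar",
         "default")
}
print(pick("foo"))
print(pick("other"))
```

Both loops visit the same values; repeat simply moves the exit test inside the body.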

Installing R and RStudio on Your Machine

• Download R from https://cran.r-project.org/

• Download RStudio at https://www.rstudio.com/

Exercise 1/10

demo(graphics)

demo(plotmath)

demo(Japanese)

demo(lm.glm)

demo(hclColors)

Exercise 2/10

x<-c(4,2,6)

y<-c(1,0,-1)

length(x)

sum(x)

sum(x^2)

x+y

x*y

x-2

x^2

Exercise 3/10

7:11

seq(2,9)

seq(4,10,by=2)

seq(3,30,length=10)

seq(6,-4,by=-2)

Exercise 4/10

rep(2,4)

rep(c(1,2),4)

rep(c(1,2),c(4,4))

rep(1:4,4)

rep(1:4,rep(3,4))

Exercise 5/10

c(T,T,F,F) & c(T,F,F,T)

x <- as.logical(0); !x

x <- seq(-3,3,length=200) > 0

1:3 + c(T,F,T)

intersect(1:10,5:15)

drinks <- factor(c("beer","beer","wine","water"))

Exercise 6/10

x<-c(5,7,9); y<-c(6,3,4); z<-cbind(x,y); print(z)

c(1, 2, 3, . . . , 19, 20)

x <- c(3,6,8); y <- c(2,5,1); x[y>1.5]

x <- c(3,6,8); y <- c(2,5,1); y[x==6]

Exercise 7/10

x <- 1:15

if (sample(x, 1) <= 10) {
  print("x is less than 10")
} else {
  print("x is greater than 10")
}

Clear all the variables (the workspace)

rm(list=ls())

Clear one variable

rm(x)

Exercise 8/10

x <- c("apples", "oranges", "bananas", "strawberries")

for (i in x) {
  print(i)
}

for (i in 1:4) {
  print(x[i])
}

for (i in seq(x)) {
  print(x[i])
}

for (i in 1:4) print(x[i])

Exercise 9/10

i <- 1

while (i < 10) {
  print(i)
  i <- i + 1
}

Exercise 10/10

z <- c("Alec", "Dan", "Rob", "Karthik"); typeof(z)

x <- c(0.5, 0.7)

x <- c(TRUE, FALSE)

x <- c("a", "b", "c", "d", "e")

x <- 9:100

x <- c(1 + (0+0i), 2 + (0+4i))

Additional Exercises

1) Create a number series that repeats 1 to 10 ten times

2) Create a number series that repeats each number from 1 to 10 ten times

3) Find the numbers that are the same (i.e. same integer and same index) in the series from 1) and 2)

4) Create a series from 1 to 30

5) Create a geometric progression of 30 numbers with ratio 1.2 (starting from 1)

6) Compare the series from 4) and 5), and get a series of TRUE/FALSE values stating whether each number in series 4) is larger than the number in series 5)

7) Find the numbers in series 4) that are larger than the number with the same index in series 5)

References

• W. Venables, D. Smith, and the R Core Team (2015) An Introduction to R.

• P. Teetor (2011) R Cookbook. O’Reilly.

• J. Adler (2012) R in a Nutshell, 2nd Edition. O’Reilly.

Thank You!

bchen@lincoln.ac.uk