Introduction to information system

Handling and Processing Strings/Texts in R

Bowei Chen, Deema Hafeth and Jingmin Huang

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M

Data Science 2016 – 2017 Workshop

Today’s Objectives

• Study the following slides:

– Regular expression functions

• Do exercises 1-4

• Do additional exercises 1-2

References

• G.Sanchez (2014) Handling and Processing Strings in R

Regular Expression

A regular expression (regex or regexp for short) is a pattern that describes a set of strings. It is a way for a user or programmer to tell a program which pattern to look for in text and what to do each time a match is found.

Regular Expression Functions in R

Function     Description
grep()       Find regex matches and return the matching indices or values
grepl()      Find regex matches and return a logical vector (TRUE or FALSE per element)
sub()        Replace the first match
gsub()       Replace all matches
regexpr()    Find regex matches (position of the first match)
gregexpr()   Find regex matches (positions of all matches)
regexec()    Find regex matches (position of the first match and of its parenthesised sub-expressions)
strsplit()   Split a string at regex matches
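As a quick illustration of how these functions differ, the sketch below runs most of them on one small vector. The vector `desserts` and the patterns are made up for this example and are not part of the workshop data.

```r
# Toy input (hypothetical, for illustration only)
desserts <- c("apple pie", "banana split", "cherry tart")

idx     <- grep("an", desserts)                # indices of the matching elements
vals    <- grep("an", desserts, value = TRUE)  # the matching elements themselves
hits    <- grepl("an", desserts)               # one TRUE/FALSE per element
first   <- sub("a", "A", desserts)             # replace only the first "a" in each element
all_a   <- gsub("a", "A", desserts)            # replace every "a"
pos1    <- regexpr("a", desserts)              # position of the first "a" (-1 if none)
pos_all <- gregexpr("a", desserts)             # list: positions of every "a" per element
parts   <- strsplit(desserts, " ")             # list: each element split at spaces

print(idx)      # 2
print(hits)     # FALSE TRUE FALSE
print(first)    # "Apple pie" "bAnana split" "cherry tArt"
print(all_a)    # "Apple pie" "bAnAnA split" "cherry tArt"
print(parts[[1]])  # "apple" "pie"
```

Note that grep() and grepl() answer "which elements match?", whereas regexpr() and gregexpr() answer "where inside each element is the match?".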

Metacharacters in R (1/2)

There are some special characters that have a reserved status; they are known as metacharacters.

The metacharacters in Extended Regular Expressions (EREs) are:

. \ | ( ) [ { $ * + ? ^

In R, we need to escape them with a double backslash \\ when we want to match them literally in a regex pattern.

Metacharacter   Escape in R
.               \\.
$               \\$
*               \\*
+               \\+
?               \\?
|               \\|
\               \\\\
^               \\^
[               \\[
]               \\]
{               \\{
}               \\}
(               \\(
)               \\)

Metacharacters in R (2/2)

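> money = "$money"
> # unescaped: $ is the end-of-string anchor, so the replacement is appended
> sub(pattern = "$", replacement = "XXXXXX", x = money)
[1] "$moneyXXXXXX"
>
> # escaped: \\$ matches the literal dollar sign
> sub(pattern = "\\$", replacement = "XXXXXX", x = money)
[1] "XXXXXXmoney"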
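Escaping with a double backslash is not the only way to match a metacharacter literally. The sketch below shows two alternatives that base R also supports: putting the character inside a bracket expression, and passing fixed = TRUE so the pattern is treated as plain text. The variable `version_string` is a made-up example value.

```r
money <- "$money"

# fixed = TRUE switches regex interpretation off entirely
by_fixed   <- sub("$", "XXXXXX", money, fixed = TRUE)
# inside [ ], the dollar sign is an ordinary character
by_bracket <- sub("[$]", "XXXXXX", money)

# The dot is a common trap: unescaped, it matches ANY character
version_string <- "R.3.0.1"                    # hypothetical example value
dot_unescaped <- gsub(".", "-", version_string)
dot_escaped   <- gsub("\\.", "-", version_string)
dot_fixed     <- gsub(".", "-", version_string, fixed = TRUE)

print(by_fixed)       # "XXXXXXmoney"
print(by_bracket)     # "XXXXXXmoney"
print(dot_unescaped)  # "-------"
print(dot_escaped)    # "R-3-0-1"
print(dot_fixed)      # "R-3-0-1"
```

fixed = TRUE is usually the simplest choice when the whole pattern is a literal string; escaping is needed when literal characters are mixed with real regex syntax.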

Sequences (1/2)
Sequence   Description
\\d        Match a digit character
\\D        Match a non-digit character
\\s        Match a space character
\\S        Match a non-space character
\\w        Match a word character
\\W        Match a non-word character
\\b        Match a word boundary
\\B        Match a non-(word boundary)
\\h        Match a horizontal space
\\H        Match a non-horizontal space
\\v        Match a vertical space
\\V        Match a non-vertical space

Sequences (2/2)

> sub("\\d", "_", "the dandelion war 2010")
[1] "the dandelion war _010"
> gsub("\\d", "_", "the dandelion war 2010")
[1] "the dandelion war ____"
>
> sub("\\D", "_", "the dandelion war 2010")
[1] "_he dandelion war 2010"
> gsub("\\D", "_", "the dandelion war 2010")
[1] "__________________2010"
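The digit sequences above have counterparts for spaces, word characters and word boundaries. The following sketch applies \\s, \\w and \\b to short strings; the second input string is invented for this example.

```r
title <- "the dandelion war 2010"

# \\s matches each space character
spaces_marked <- gsub("\\s", "_", title)

# \\w matches letters, digits and underscore, so only the spaces survive
words_masked <- gsub("\\w", "_", title)

# \\b anchors a match at a word boundary: "war" is replaced as a whole
# word, but the "war" inside "warts" is left alone
whole_word <- gsub("\\bwar\\b", "WAR", "the war on warts")

print(spaces_marked)  # "the_dandelion_war_2010"
print(words_masked)
print(whole_word)     # "the WAR on warts"
```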

Some Regex Character Classes (1/2)

Class          Description
[aeiou]        Match any one lower case vowel
[AEIOU]        Match any one upper case vowel
[0123456789]   Match any digit
[0-9]          Match any digit (same as previous class)
[a-z]          Match any lower case ASCII letter
[A-Z]          Match any upper case ASCII letter
[a-zA-Z0-9]    Match any of the above classes
[^aeiou]       Match anything other than a lower case vowel
[^0-9]         Match anything other than a digit

Some Regex Character Classes (2/2)

> # some strings
> transport = c("car", "bike", "plane", "boat")
> # look for e or i
> grep(pattern = "[ei]", transport, value = TRUE)
[1] "bike" "plane"
>
> # some numeric strings
> numerics = c("123", "17-April", "I-II-III", "R 3.0.1")
> grep(pattern = "[01]", numerics, value = TRUE)
[1] "123" "17-April" "R 3.0.1"
> grep(pattern = "[0-9]", numerics, value = TRUE)
[1] "123" "17-April" "R 3.0.1"
> grep(pattern = "[^0-9]", numerics, value = TRUE)
[1] "17-April" "I-II-III" "R 3.0.1"
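Character classes work with the replacement functions as well as with grep(). The sketch below uses gsub() and grepl() with a few of the classes from the table; all input strings are made up for illustration.

```r
# Remove every lower case vowel
no_vowels <- gsub("[aeiou]", "", "regular expression")

# Keep only the digits by deleting everything that is NOT a digit
digits_only <- gsub("[^0-9]", "", "R 3.0.1")

# Does each element contain at least one letter or digit?
has_alnum <- grepl("[a-zA-Z0-9]", c("abc", "---", "42"))

# A hyphen placed first inside [ ] is a literal hyphen, not a range
has_sep <- grepl("[-/]", c("2011/12", "2010-2013", "2014"))

print(no_vowels)    # "rglr xprssn"
print(digits_only)  # "301"
print(has_alnum)    # TRUE FALSE TRUE
print(has_sep)      # TRUE TRUE FALSE
```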

POSIX Character Classes (1/2)
Notation       Description
[[:lower:]]    Lower-case letters
[[:upper:]]    Upper-case letters
[[:alpha:]]    Alphabetic characters ([[:lower:]] and [[:upper:]])
[[:digit:]]    Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
[[:alnum:]]    Alphanumeric characters ([[:alpha:]] and [[:digit:]])
[[:blank:]]    Blank characters: space and tab
[[:cntrl:]]    Control characters
[[:punct:]]    Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:space:]]    Space characters: tab, newline, vertical tab, form feed, carriage return, and space
[[:xdigit:]]   Hexadecimal digits: 0-9 A B C D E F a b c d e f
[[:print:]]    Printable characters ([[:alnum:]], [[:punct:]] and space)
[[:graph:]]    Graphical characters ([[:alnum:]] and [[:punct:]])

POSIX Character Classes (2/2)

> # la vie (string)
> la_vie = "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"
> # if you print la_vie
> print(la_vie)
[1] "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"
> # if you cat la_vie
> cat(la_vie)
La vie en #FFC0CB (rose);
Cest la vie! tres jolie
>
> # remove blank characters (spaces and tabs)
> gsub(pattern = "[[:blank:]]", replacement = "", la_vie)
[1] "Lavieen#FFC0CB(rose);\nCestlavie!tresjolie"
> # remove punctuation characters
> gsub(pattern = "[[:punct:]]", replacement = "", la_vie)
[1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"
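A few more POSIX classes can be tried on the same la_vie string. This is a small sketch; it uses regmatches(), a base R helper not covered on the slides, to pull out the matched text.

```r
la_vie <- "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"

# [[:digit:]] removes the single 0 inside the colour code
no_digits <- gsub("[[:digit:]]", "", la_vie)

# [[:space:]] also covers the newline, which [[:blank:]] does not
spaces_marked <- gsub("[[:space:]]", "_", la_vie)

# Extract every upper-case letter as a character vector
uppers <- regmatches(la_vie, gregexpr("[[:upper:]]", la_vie))[[1]]

print(no_digits)
print(spaces_marked)  # "La_vie_en_#FFC0CB_(rose);_Cest_la_vie!__tres_jolie"
print(uppers)         # "L" "F" "F" "C" "C" "B" "C"
```

The double underscore before "tres" comes from the space followed by the tab, each of which is replaced separately.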

Quantifiers (1/2)

Notation   Description
*          The preceding item will be matched zero or more times
+          The preceding item will be matched one or more times
?          The preceding item is optional and will be matched at most once
{n}        The preceding item is matched exactly n times
{n,}       The preceding item is matched n or more times
{n,m}      The preceding item is matched at least n times, but not more than m times

Quantifiers (2/2)

> strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
> grep("ac*b", strings, value = TRUE)
[1] "ab" "acb" "accb" "acccb" "accccb"
> grep("ac*b", strings, value = FALSE)
[1] 2 3 4 5 6
> grepl("ac*b", strings)
[1] FALSE TRUE TRUE TRUE TRUE TRUE
> grep("ac+b", strings, value = TRUE)
[1] "acb" "accb" "acccb" "accccb"
> grep("ac?b", strings, value = TRUE)
[1] "ab" "acb"
> grep("ac{2}b", strings, value = TRUE)
[1] "accb"
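The two range forms, {n,} and {n,m}, can be checked on the same vector. The sketch below also shows ? making a single character optional; the spelling example is invented for illustration.

```r
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")

# {2,}: two or more "c" between "a" and "b"
two_or_more <- grep("ac{2,}b", strings, value = TRUE)

# {1,3}: between one and three "c"
one_to_three <- grep("ac{1,3}b", strings, value = TRUE)

# ?: the "u" may be present or absent, but some spelling of the word is required
spelling <- grepl("colou?r", c("color", "colour", "colr"))

print(two_or_more)   # "accb" "acccb" "accccb"
print(one_to_three)  # "acb" "accb" "acccb"
print(spelling)      # TRUE TRUE FALSE
```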

Exercises

Exercise 1/4

# dollar
sub("\\$", "", "$Peace-Love")

# dot
sub("\\.", "", "Peace.Love")

# plus
sub("\\+", "", "Peace+Love")

# caret
sub("\\^", "", "Peace^Love")

# vertical bar
sub("\\|", "", "Peace|Love")

# opening round bracket
sub("\\(", "", "Peace(Love)")

# closing round bracket
sub("\\)", "", "Peace(Love)")

# opening square bracket
sub("\\[", "", "Peace[Love]")

# closing square bracket
sub("\\]", "", "Peace[Love]")

# opening curly bracket
sub("\\{", "", "Peace{Love}")

Exercise 2/4

# replace a word character with "_"

sub("\\w", "_", "the dandelion war 2010")

gsub("\\w", "_", "the dandelion war 2010")

# replace a non-word character with "_"

sub("\\W", "_", "the dandelion war 2010")

gsub("\\W", "_", "the dandelion war 2010")

Exercise 3/4

# people names
people = c(
  "rori", "emilia", "matteo", "mehmet", "filipe", "anna", "tyler", "rasmus",
  "jacob", "youna", "flora", "adi"
)
# match "m" at most once
grep(pattern = "m?", people, value = TRUE)
# match "m" exactly once
grep(pattern = "m{1}", people, value = TRUE, perl = FALSE)
# match "m" zero or more times, and "t"
grep(pattern = "m*t", people, value = TRUE)
# match "t" zero or more times, and "m"
grep(pattern = "t*m", people, value = TRUE)

Exercise 4/4

For the following vectors:

C1 = c("apple", "orange", "shrubbery", "blackberry")
C2 = c("grape", "melon", "kiwi", "apple")

a) Set union

b) Set intersection

c) Set difference

d) Set equality

e) Exact equality

f) Sorting C1 (decreasing order and increasing order)

Additional Exercises

Well done if you've completed the exercises. Once you complete these additional exercises, you can leave the workshop session.

Additional Exercise 1/2 (1/2)

• For this exercise, we want you to clean the "Time Period" values of the public health data by using the regular expression functions of R.

• So for values like "Aug 2014 -Oct 2013" or " 2014 -013" or "2011/12" or "2010/11-12/13", we want you to transform them into the "yyyy:yyyy" form, where the first 'yyyy' is the first year and the second 'yyyy' is the last year. So a value like "2010/11 - 12/13" should become "2010:2013".

• Now go to the next slide

Additional Exercise 1/2 (2/2)

a) Read the dataset "PublicHealthEnglandDataTableDistrict.xlsx"

b) How many distinct/unique values do we have (for "Time Period")? And how many observations are there of each unique value?

c) How many observations of "Time Period" contain month words (e.g. 'Aug')?

d) Could you remove these month words from "Time Period"?

e) Now could you change these modified values into our target form?

f) Now how many observations of "Time Period" contain the '/' or '-' character?

g) Could you change these "Time Period" values with '/' or '-' into our target form?

h) Have you found bad cases like '2011:2012:2013:2014'? Could you fix them?

Additional Exercise 2/2

A researcher states that the heights of students at the University of Lincoln follow a normal distribution 𝒩(177, 10).

a) If this is true, how likely are we to get an average height >= 180 when we randomly collect the height records of 20 students?

b) If you think the real average height should be higher, what is the critical region for a test sample of 20 students (𝛼 = 0.05 significance level)?

c) The normal distribution assumption in this hypothesis is not correct. Can you think of the reasons why?

Thank You!

bchen@lincoln.ac.uk
