Exercise_one_NLP_Jan_2017
1
CS909:
2016-‐2017
Exercise
one:
Regular
expressions,
Segmentation
&
Edit
distance,
FSA
&
FST,
N-‐grams
and
language
models.
Submission:
12
pm
Thursday
March
2nd
2017
Notes:
a) This
exercise
will
contribute
towards
15%
of
your
overall
mark.
b) Submission
should
include
one
or
more
iPython
notebooks
and
a
report,
where
requested,
as
a
.zip.
Preparation:
Getting
to
know
Python
Practice
using
the
iPython
notebooks
from
the
module
website.
Part
A:
Regular
expressions
(20
marks)
1. Using
Python,
write
regular
expressions
to
capture
the
following
sets
of
strings
and
a
function
that
will
match
a
regex
against
a
string
to
evaluate
it.
Sets
of
regular
expressions
to
evaluate:
a. the
set
of
all
alphabetic
strings;
b. the
set
of
all
lower
case
alphabetic
strings
ending
in
a
s;
c. the
set
of
all
strings
with
two
consecutive
repeated
words
(e.g.,
“Mirror
mirror”
and
“the
the”
but
not
“the
bang”
or
“the
big
bang”);
d. all
strings
that
start
at
the
beginning
of
the
line
with
an
integer
and
that
end
at
the
end
of
the
line
with
a
word;
e. all
strings
that
have
both
the
word
“like”
and
the
word
“duck”
in
them
(but
not,
e.g.,
words
such
as
“likes”
that
merely
contain
the
word
“like”;
f. write
a
pattern
that
places
the
first
word
of
a
sentence
in
the
complete
words
of
Shakespeare
in
a
file.
Deal
with
punctuation.
By
“word”,
we
mean
an
alphabetic
string
separated
from
other
words
by
whitespace,
any
relevant
punctuation,
line
breaks,
and
so
forth.
Show
in
your
iPython
notebook
your
work
in
debugging
these
regular
expressions
by
illustrating
examples
that
match
or
don’t
match
the
patterns.
(10
marks)
2. Use
the
ART
Corpus
as
introduced
in
the
Seminars.
Write
a
Python
Program
that
will
(a)
recognize
chemical
compounds
(e.g.
[Fe(Rtpen)(η1-‐OOH)]2+,
CH3CH2).
Examples
of
chemical
compounds
can
be
found
in
paper
b103844n_mode2.Andrew.xml,
and
you
can
also
use
these
expressions
to
evaluate
your
chemical
compound
regular
expressions;
(b)
identify
which
papers
contain
chemical
compound
expressions
and
how
many
sentences
in
each
paper
contain
a
chemical
expression;
(c)
Identify
which
of
the
11
CoreSC
categories
most
chemical
expressions
appear
in.
(10
marks)
Part
B:
Minimum
Edit
distance
(25
marks)
1. Compute
the
minimum
edit
distance
(using
insertion
cost
1,
deletion
cost
1,
2
substitution
cost
2)
of
“refa”
to
“fear”.
Show
your
work
(using
the
edit
distance
grid).
What
are
the
possible
alignments?
(pen
and
paper
exercise)
(5
marks)
2. Figure
out
whether
“drive”
is
closer
to
“brief”
or
to
“divers”
and
what
the
edit
distance
is
to
each.
(use
the
above
assumption
regarding
cost,
deletion,
substitution.
Pen
and
paper
exercise)
(5
marks)
3. Now
implement
a
minimum
edit
distance
algorithm
in
Python
(making
the
same
assumption
about
costs
for
the
three
operations)
and
use
your
hand-‐
computed
results
to
check
your
code.
(10
marks)
4. Expand
the
minimum
edit
distance
algorithm
you
have
written
to
output
an
alignment.
For
this
you
will
need
to
store
pointers
and
add
a
stage
to
compute
the
backtrace.
(5
marks)
Part
C:
Morphology
&
FSTs
(10
marks)
Write
a
finite
state
transducer
for
the
consonant
doubling
spelling
rule
for
single
syllable
verbs
in
English.
The
rule
should
accept
“stop”,
“stops”,
“stopped”
and
“stopping”
but
not
“aimming”
or
“aimmed”.
(pen
and
paper
exercise)
(10
marks)
Below
is
a
link
explaining
the
consonant
doubling
rule:
http://speakspeak.com/resources/english-‐grammar-‐rules/english-‐spelling-‐
rules/double-‐consonant-‐ed-‐ing
Note:
If
you
enjoy
this
exercise
and
are
interested
in
how
you
can
write
an
FST
that
will
automatically
parse
text
visit
the
following
link
for
a
manual
on
using
the
Xerox
FST
tools:
http://web.stanford.edu/~laurik/fsmbook/home.html
Part
D:
N-‐grams
and
language
models
(30
marks)
1. We
are
given
the
following
corpus,
modified
from
the
one
in
lecture
6:
I
am
Sam
Sam
I
am
I
am
Sam
I
do
not
like
green
eggs
and
Sam
If
we
use
linear
interpolation
smoothing
between
a
maximum-‐likelihood
bigram
model
and
a
maximum-‐likelihood
unigram
model
with
lambda1
=
½
and
lambda2=½,
what
is
P(Sam|am)?
Include
and
in
your
counts
just
like
any
other
token.
(3
marks)
2. Write
a
program
to
compute
unsmoothed
unigrams
and
bigrams.
(10
marks)
3. Run
your
N-‐gram
program
on
the
ART
corpus.
Don’t
consider
any
of
the
XML
tags
apart
from
(sentence
boundaries).
What
are
the
most
common
unigrams
and
bigrams?
Give
a
list
of
the
50
most
common
in
each
case.
(10marks)
4. Add
an
option
to
your
program
to
generate
random
sentences.
(7
marks)