Semester 2 2021
Lecture 4, Part 1: Unstructured Data – Text Preprocessing

Text (string) search
String Matching and String Comparison

Exact string search
• Given a string, is some substring contained within it?
• Given a string, find all occurrences of some substring.
For example, find Exxon in:
In exes for foxes rex dux mixes a pox of waxed luxes. An axe, and an axon, to exo Exxon max oxen. Grexit or Brexit as quixotic haxxers with buxom rex taxation.
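Both exact-search questions above can be sketched in a few lines of Python. This is only an illustration: the helper name find_all is made up for this example, and the search is case-sensitive, so "exxon" would not match "Exxon".

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Step forward by one character so overlapping matches are found too.
        start = text.find(pattern, start + 1)
    return positions


text = (
    "In exes for foxes rex dux mixes a pox of waxed luxes. "
    "An axe, and an axon, to exo Exxon max oxen. "
    "Grexit or Brexit as quixotic haxxers with buxom rex taxation."
)

print("Exxon" in text)          # question 1: is the substring contained?
print(find_all(text, "Exxon"))  # question 2: where does it occur?
print(find_all(text, "exon"))   # an absent substring gives an empty list
```

The in operator answers the containment question directly; the loop around str.find is one simple way to collect every occurrence.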

Approximate string search/matching
Find exon in:
In exes for foxes rex dux mixes a pox of waxed luxes. An axe, and an axon, to exo Exxon max oxen. Grexit or Brexit as quixotic haxxers with buxom rex taxation.
Not present!
…But what is the “closest” or “best” match?

Approximate string search/matching
Find approximate match(es) for exon in:
In exes for foxes rex dux mixes a pox of waxed luxes.
An axe, and an axon, to exo Exxon max oxen.
Grexit or Brexit as quixotic haxxers with buxom rex taxation.
exon → Exxon: insert x
exon → exo: delete n
exon → axon: replace e with a
exon → oxen: transpose e and o (not covered)
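One common way to make "closest" concrete is to count the insert, delete and replace operations needed to turn one string into another (often called Levenshtein edit distance). The sketch below assumes that measure, which the slides do not name as the chosen method, and also assumes lowercased words and a simple letters-only tokeniser. Transposition is not a single operation here, so exon → oxen costs two replacements.

```python
import re


def edit_distance(source: str, target: str) -> int:
    """Minimum number of inserts, deletes and replaces turning source into target."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i characters reaches the empty prefix of target
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # match, or replace s_char with t_char
            ))
        previous = current
    return previous[-1]


text = (
    "In exes for foxes rex dux mixes a pox of waxed luxes. "
    "An axe, and an axon, to exo Exxon max oxen. "
    "Grexit or Brexit as quixotic haxxers with buxom rex taxation."
)

# Assumption: compare case-insensitively, treating each run of letters as a word.
words = sorted(set(re.findall(r"[a-z]+", text.lower())))
query = "exon"
best = min(edit_distance(query, w) for w in words)
closest = [w for w in words if edit_distance(query, w) == best]
print(best, closest)
```

Under these assumptions the three single-operation matches from the slide (exxon, exo, axon) tie at distance 1, while oxen sits at distance 2.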

Why do string matching?
Application – Spelling Correction
Need the notion of a dictionary:
• Here, a list of words (entries) that are “correct” with respect to our (expectations of our) language
• We can break our input into words (substrings) that we wish to match, and compare each of them against the entries in the dictionary
• A word (item) in the input that doesn’t appear in the dictionary is misspelled
• A word (item) in the input that does appear in the dictionary might be correctly spelled or might be misspelled (beyond the scope of this subject)
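The detection step described above (split the input into words, then look each one up) might look like the following sketch. DICTIONARY is a tiny hypothetical word list invented for this example, and the tokeniser simply lowercases the input and keeps runs of letters; both are assumptions rather than part of the lecture.

```python
import re

# Hypothetical toy dictionary; a real one would hold many thousands of entries.
DICTIONARY = {"an", "axe", "and", "axon", "to", "max", "oxen"}


def tokenise(text: str) -> list[str]:
    """Lowercase the input and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def flag_unknown_words(text: str, dictionary: set[str]) -> list[str]:
    """Return the input words that have no entry in the dictionary."""
    return [word for word in tokenise(text) if word not in dictionary]


print(flag_unknown_words("An axe, and an axon, to exo Exxon max oxen.", DICTIONARY))
```

A lookup like this only flags words missing from the dictionary. It cannot tell whether a word that is present was the one the writer intended, which is the case the last bullet sets aside.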

Why do string matching?
Application – Spelling Correction
Therefore, the problem here:
Given some item of interest that does not appear in our dictionary, which entry from the dictionary was truly intended?
Depends on the person who wrote the original string!

Why do string matching?
Application – Detecting Neologisms
Word Blending:
• forming novel words by “blending” two other words
• not simple concatenation (e.g., football is not a blend word)
breakfast + lunch → brunch
fork + spoon → spork
Britain + exit → Brexit
• Language changes continuously
• New terms are often coined in colloquial language (e.g., Twitter)
• Social media is fertile ground for linguistic innovation.

Why do string matching?
Application – Detecting Equivalence
Street and place name conventions
boulevard|blvd|bd|bde|blv|bl|blvde|blvrd|boulavard|boul|bvd
apartment|apt|ap|aprt|aptmnt
village|vil|vge|vill|villag|villg|vlg|vlge|vllg
• Data cleaning (e.g., deduplication)
• Query repair
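For deduplication, one simple approach is to map every known variant to a single canonical form before comparing records. The sketch below assumes that the first entry in each group is the canonical spelling and that addresses can be tokenised on letters and digits; the names CANONICAL and normalise are invented for this example.

```python
import re

# Variant groups copied from the lists above; the first entry of each group
# is assumed to be the canonical form.
VARIANT_GROUPS = [
    "boulevard|blvd|bd|bde|blv|bl|blvde|blvrd|boulavard|boul|bvd",
    "apartment|apt|ap|aprt|aptmnt",
    "village|vil|vge|vill|villag|villg|vlg|vlge|vllg",
]

# Map every form (including the canonical one) to its canonical spelling.
CANONICAL = {}
for group in VARIANT_GROUPS:
    forms = group.split("|")
    for form in forms:
        CANONICAL[form] = forms[0]


def normalise(address: str) -> str:
    """Lowercase, drop punctuation, and replace known variants."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(CANONICAL.get(token, token) for token in tokens)


a = normalise("12 Sunset Blvd, Apt 3")
b = normalise("12 sunset boulevard apartment 3")
print(a)
print(a == b)  # the two records now compare equal
```

A fixed lookup table only covers variants that were listed in advance. An unseen variant would still need approximate matching against the canonical forms.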