Text preprocessing
School of Computing and Information Systems
@University of Melbourne 2022
Text processing for machine learning
• Text is a sequence of characters.
• ML algorithms understand numeric vectors.
• The aim
“Cars are driven on the road.”
COMP20008 Elements of Data Processing
Text processing for machine learning
• Features retain as much meaning/information as possible
• Reduce the sparsity of the feature vector
Preprocessing – Tokenisation
• Granularity of a token
– Sentence
• Token separators
– “The speaker did her Ph.D. in Germany. She now works at UniMelb.”
– “The issue—and there are many—is that text is not consistent.”
Preprocessing – Tokenisation
• Split continuous text into a list of individual tokens
• English words are often separated by white spaces but not always
• Tokens can be words, numbers, hashtags, etc.
• Can use regular expression
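A minimal sketch of regular-expression tokenisation, using only Python's standard `re` module. The pattern itself is an assumption chosen for illustration (it is not from the slides): it keeps hashtags, dotted abbreviations such as "Ph.D.", and numbers as single tokens, and treats everything else that is not a word character as a separator.

```python
import re

# Assumed token pattern (illustrative only). Alternatives are tried left to right:
#   hashtags | dotted abbreviations like "Ph.D." | numbers | words with inner ' or -
TOKEN_PATTERN = re.compile(
    r"#\w+"
    r"|(?:[A-Za-z]+\.){2,}"
    r"|\d+(?:\.\d+)?"
    r"|\w+(?:['-]\w+)*"
)

def tokenise(text):
    """Split continuous text into a list of tokens; unmatched characters act as separators."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenise("The speaker did her Ph.D. in Germany. She now works at UniMelb.")
print(tokens)
# ['The', 'speaker', 'did', 'her', 'Ph.D.', 'in', 'Germany', 'She', 'now', 'works', 'at', 'UniMelb']
```

A plain `text.split()` would instead leave "Germany." and "UniMelb." with their trailing full stops attached, which is one reason white space alone is not a reliable separator.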
Preprocessing – Case folding
• Convert text to consistent cases
• Simple and effective for many tasks
• Reduce sparsity (many map to the same lower-case form)
• Good for search
I had an AMAZING trip to Italy, Coffee is only 2 bucks, sometimes three! The coffee is so nice.
i had an amazing trip to italy, coffee is only 2 bucks, sometimes three! the coffee is so nice.
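The example above can be reproduced with `str.lower()`. The sketch below also counts distinct word forms before and after folding to show the sparsity reduction; the `vocabulary` helper and its `\w+` pattern are assumptions made for this illustration.

```python
import re

text = ("I had an AMAZING trip to Italy, Coffee is only 2 bucks, "
        "sometimes three! The coffee is so nice.")

# Case folding: map every character to its lower-case form
folded = text.lower()

def vocabulary(s):
    """Distinct word forms in s (hypothetical helper: runs of word characters)."""
    return set(re.findall(r"\w+", s))

print(folded)
print(len(vocabulary(text)), len(vocabulary(folded)))  # 18 17
```

"Coffee" and "coffee" collapse into one form, so the vocabulary shrinks by one. The cost is that distinctions carried by case are lost as well.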
Preprocessing – Stemming
• Words in English are derived from a root or stem
– inexpensive → in+expense+ive
Driving my car
Taking a drive in my car
He drives my car
Preprocessing – Stemming
• Stemming attempts to undo the processes that lead to word formation
• Remove and replace word suffixes to arrive at a common root form
• Result does not necessarily look like a proper ‘word’
• Porter stemmer: one of the most widely used stemming algorithms
• Suffix stripping (Porter stemmer)
– sses → ss
– tional → tion
– tion → t
Preprocessing – Stemming
https://text-processing.com/demo/stem/
troubles → troubl
troubled → troubl
trouble → troubl
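Suffix stripping can be sketched as an ordered list of (suffix, replacement) rules. This is a toy illustration and not the Porter algorithm, which applies several rule phases with conditions on the stem. The `sses` and `tional` rules are taken from the slides; the remaining rules and the `min_stem` guard are assumptions added so that the "troubl" example works.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
SUFFIX_RULES = [
    ("sses", "ss"),      # from the slides
    ("tional", "tion"),  # from the slides
    ("ed", ""),          # assumption for this sketch
    ("es", ""),          # assumption for this sketch
    ("ss", "ss"),        # assumption: protects words ending in "ss" from the "s" rule
    ("s", ""),           # assumption for this sketch
    ("e", ""),           # assumption for this sketch
]

def toy_stem(word, min_stem=3):
    """Apply the first rule whose suffix matches and leaves at least min_stem characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["troubles", "troubled", "trouble", "caresses", "conditional"]:
    print(w, "->", toy_stem(w))
```

All three forms of "trouble" map to "troubl", which is a shared stem but not a dictionary word.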
Preprocessing – Lemmatization
• To remove inflections and map a word to its proper root form (lemma)
• It does not just strip suffixes, it transforms words to valid roots:
running → run
runs → run
ran → run
• Python's NLTK provides a WordNet lemmatizer (WordNetLemmatizer) that uses the WordNet database to look up the lemmas of words.
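The lookup idea can be sketched with a plain dictionary. This is only a stand-in for a real lexical database such as WordNet: `LEMMA_TABLE` and `toy_lemmatise` are hypothetical names, and the table holds just the three forms listed above.

```python
# Tiny hand-made lemma table (assumption: a real lemmatizer consults a lexical database)
LEMMA_TABLE = {
    "running": "run",
    "runs": "run",
    "ran": "run",  # irregular form: no suffix rule turns "ran" into "run"
}

def toy_lemmatise(word, table=LEMMA_TABLE):
    """Return the lemma if the word is in the table, otherwise the lower-cased word."""
    word = word.lower()
    return table.get(word, word)

print([toy_lemmatise(w) for w in ["running", "runs", "ran", "walked"]])
# ['run', 'run', 'run', 'walked']
```

The "ran" entry shows the difference from stemming: the mapping comes from a lookup, not from stripping a suffix.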
Preprocessing – Stop Word Removal
• Stop words are ‘function’ words that structure sentences; they are low information words and some of them are very common
• ‘the’, ‘a’, ‘is’,…
• Exclude them from being processed; helps to reduce the number of features/words
• Commonly applied for search, text classification, topic modelling, topic extraction, etc.
• A stopword list can be custom-made for a specific context/domain
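A minimal sketch of stop word filtering on an already tokenised, case-folded sentence. The stop word set is an assumption: 'the', 'a' and 'is' come from the slide, while 'on' and 'are' were added for this example. A real list would be much longer or tailored to the domain.

```python
# Tiny illustrative stop word list (assumption, see note above)
STOP_WORDS = {"the", "a", "is", "on", "are"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["cars", "are", "driven", "on", "the", "road"]
print(remove_stop_words(tokens))  # ['cars', 'driven', 'road']
```

Passing a different `stop_words` set is how a custom, domain-specific list would be used.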
Stop Word Removal
Text Normalisation
• Transforming a text into a canonical (standard) form
• Important for noisy text, e.g., social media comments, text
• Used when there are many abbreviations, misspellings and out-of-vocabulary (OOV) words
2moro → tomorrow
2mrw → tomorrow
tomrw → tomorrow
B4 → before
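One simple way to apply such mappings is a lookup table from noisy forms to canonical forms. The sketch below uses only the four mappings listed above; the function name, the lower-casing of keys and the test sentence are assumptions for illustration.

```python
# Noisy form -> canonical form (entries taken from the examples above)
NORMALISATION_MAP = {
    "2moro": "tomorrow",
    "2mrw": "tomorrow",
    "tomrw": "tomorrow",
    "b4": "before",
}

def normalise(tokens, mapping=NORMALISATION_MAP):
    """Replace each token that has a known canonical form; leave the rest unchanged."""
    return [mapping.get(t.lower(), t) for t in tokens]

print(normalise("see you 2moro B4 lunch".split()))
# ['see', 'you', 'tomorrow', 'before', 'lunch']
```

A fixed table only covers variants seen in advance; unseen misspellings pass through unchanged.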
Noise Removal
• Remove unnecessary spacing
• Remove punctuation and special characters (regular expressions)
• Unify numbers
• Highly domain dependent
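Two of these steps, punctuation removal and whitespace clean-up, can be sketched with regular expressions. Which characters count as noise is an assumption here: the pattern drops everything that is not a word character or white space, which would also split "don't" or remove a meaningful "#" or "@". That is one reason the step is domain dependent.

```python
import re

def remove_noise(text):
    """Replace punctuation/special characters with spaces, then collapse spacing."""
    text = re.sub(r"[^\w\s]", " ", text)  # assumption: anything but word chars/space is noise
    text = re.sub(r"\s+", " ", text)      # collapse runs of white space
    return text.strip()

print(remove_noise("  Coffee   is only 2 bucks,   sometimes three!!! :) "))
# Coffee is only 2 bucks sometimes three
```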
So far… Unstructured Text Data
• Text search – approximate string matching
• Preprocessing
– Regular expressions
– Tokenisation
– Case folding
– Stemming
– Lemmatization
– Stop word removal
– Text normalization
– Noise removal
Text representation – I
Structured data example (Iris dataset): columns Sepal length, Sepal width, Petal length, Petal width, Species (label); species shown: Iris setosa, Iris versicolor, Iris setosa, Iris virginica
How To Represent Text?
https://cis.unimelb.edu.au/about/school/
Text Representation – BoW
• Bag-of-words: simplest vector space representational model for text
• Disregards word order and grammatical features such as POS
• Each text document as a numeric vector
• each dimension is a specific word from the corpus
• the value could be its frequency in the document or occurrence (denoted by 1 or 0).
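A minimal bag-of-words sketch using only the standard library. The vocabulary is the sorted set of all tokens in the corpus, and each document becomes a vector of counts (or 0/1 occurrence flags). Whitespace tokenisation and the two sample sentences are assumptions for brevity; stop words are deliberately kept here.

```python
from collections import Counter

def bag_of_words(docs, binary=False):
    """Return (vocabulary, vectors): one count or 0/1 vector per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)  # missing words count as 0
        if binary:
            vectors.append([1 if counts[t] > 0 else 0 for t in vocab])
        else:
            vectors.append([counts[t] for t in vocab])
    return vocab, vectors

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]
vocab, vectors = bag_of_words(docs)
print(vocab)       # ['car', 'driven', 'highway', 'is', 'on', 'road', 'the', 'truck']
print(vectors[0])  # [1, 1, 0, 1, 1, 1, 2, 0]
print(vectors[1])  # [0, 1, 1, 1, 1, 0, 2, 1]
```

Word order is lost: any permutation of the words in a document yields the same vector.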
Prepare for BoW
• Word tokenisation
• Case-folding
• Abstraction of number (#num#, #year#)
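Number abstraction can be sketched with two regular expressions that replace concrete values with the placeholders #year# and #num#. What counts as a year is an assumption made here (a four-digit number from 1500 to 2099); the sample sentence is also invented for illustration.

```python
import re

# Assumption: a standalone four-digit number between 1500 and 2099 is treated as a year
YEAR_PATTERN = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")
# Any other integer or decimal (with '.' or ',' between digit groups)
NUM_PATTERN = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def abstract_numbers(text):
    """Replace years first, then all remaining numbers, with placeholder tokens."""
    text = YEAR_PATTERN.sub("#year#", text)
    return NUM_PATTERN.sub("#num#", text)

print(abstract_numbers("In 2022 coffee cost 2 bucks and 3.50 for a large"))
# In #year# coffee cost #num# bucks and #num# for a large
```

Years are replaced first so that they are not swallowed by the more general number pattern.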
Prepare for BoW
• Stop word removal
Prepare for BoW
• Stemming
• How would this look different if we lemmatised instead?
• Removed punctuation
• Word counts
Bag of Words
Term Frequency
• What if a rare word occurs in a document?
• e.g. ‘catarrh’ is less common than ‘mucus’ or ‘stuffiness’
• What if a word occurs in many documents?
• Maybe we want to avoid raw counts?
• Raw frequencies vary with document length
• Don’t capture important (rare) features that would be telling of a type of document
Text representation – II
Raw Frequencies
• What are the problems?
• What are the alternatives?
play grace crowd
play grace audience
Discourse on Floating Bodies
– Galileo Galilei
Treatise on Light
– Christiaan Huygens
Experiments with Alternate Currents of High Potential and High Frequency
– Nikola Tesla
Relativity: The Special and General Theory
• TF-IDF stands for Term Frequency-Inverse Document Frequency
• Each text document as a numeric vector
• each dimension is a specific word from the corpus
• A combination of two metrics to weight a term (word)
• term frequency (tf): how often a given word appears within a document
• inverse document frequency (idf): down-weights words that appear in many documents.
• Main idea: reduce the weight of frequent terms and increase the weight of rare and indicative ones.
Term frequency (TF):
• $\mathrm{tf}(t, d)$ = the raw count of term $t$ in document $d$.
Inverse Document Frequency (IDF):
• $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$ or $\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}_t} + 1$
• $N$ is the number of documents in the collection,
• $\mathrm{df}_t$ is the document frequency, the number of documents containing the term $t$.
TF-IDF (L2 normalised):
• $\mathrm{tf\_idf}(t, d) = \frac{w_t}{\sqrt{\sum_{t' \in d} w_{t'}^2}}$, where $w_t = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$
Example TF-IDF
Two documents: A – ‘the car is driven on the road’
B – ‘the truck is driven on the highway’
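$\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$
$w_t = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$
$\mathrm{tf\_idf}(t, d) = \frac{w_t}{\sqrt{\sum_{t' \in d} w_{t'}^2}}$
With $N = 2$, each term that occurs in only one document has $\mathrm{df}_t = 1$:
• car: $\ln\frac{3}{2} + 1 = 1.405$
• road: $\ln\frac{3}{2} + 1 = 1.405$
• truck: $\ln\frac{3}{2} + 1 = 1.405$
• highway: $\ln\frac{3}{2} + 1 = 1.405$
'driven' occurs in both documents, so $\mathrm{idf} = \ln\frac{3}{3} + 1 = 1$.
L2 norm of each document's weights: $\sqrt{1.405^2 + 1^2 + 1.405^2} = 2.225$
* stop words removed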
Example TF-IDF – cont.
• Two documents, A and B.
A. ‘the car is driven on the road’
B. ‘the truck is driven on the highway’
* stop words removed
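The two-document example can be checked with a short standard-library sketch of L2-normalised TF-IDF. The function and variable names are assumptions, stop words are taken as already removed, and the smoothed IDF form ln((1 + N) / (1 + df)) + 1 is the default.

```python
import math
from collections import Counter

def tf_idf(docs, smooth=True):
    """docs: list of token lists. Returns (vocab, idf, L2-normalised TF-IDF vectors)."""
    n_docs = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # Document frequency: number of documents containing each term
    df = Counter(t for doc in docs for t in set(doc))
    if smooth:
        idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    else:
        idf = {t: math.log(n_docs / df[t]) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)                               # raw term counts
        weights = [tf[t] * idf[t] for t in vocab]       # w_t = tf * idf
        norm = math.sqrt(sum(w * w for w in weights))   # L2 norm of the document
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, idf, vectors

doc_a = ["car", "driven", "road"]       # 'the car is driven on the road'
doc_b = ["truck", "driven", "highway"]  # 'the truck is driven on the highway'
vocab, idf, vectors = tf_idf([doc_a, doc_b])

print(vocab)  # ['car', 'driven', 'highway', 'road', 'truck']
print({t: round(v, 3) for t, v in idf.items()})
print([round(x, 3) for x in vectors[0]])
```

For document A this gives roughly 0.632 for 'car' and 'road' (1.405 / 2.225) and 0.449 for 'driven' (1 / 2.225). The same function can be applied to a larger corpus by passing more token lists.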
• Text features for machine learning
• 3 documents:
A: ‘the car is driven on roads’
B: ‘the truck is driven on a highway’
C: ‘a bike can not be ridden on a highway’
* stop words removed
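$\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1$
$w_t = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$
$\mathrm{tf\_idf}(t, d) = \frac{w_t}{\sqrt{\sum_{t' \in d} w_{t'}^2}}$
$\ln\frac{4}{2} + 1 = 1.693$
$\ln\frac{4}{2} + 1 = 1.693$
$\ln\frac{4}{3} + 1 = 1.288$
= ?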
Features from unstructured text
Features for structured data
TF-IDF features for unstructured text
Revision – Finding similar texts
• Edit distance
• N-gram distance
• Jaccard similarity
• Sørensen-Dice similarity
• Cosine similarity
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \times \lVert d_2 \rVert} \approx 0.202$
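The value of about 0.202 can be reproduced with a few lines of Python. The sketch assumes that d1 and d2 are the TF-IDF weight vectors of documents A and B from the two-document example, over the vocabulary (car, driven, highway, road, truck). Cosine similarity is unaffected by vector length, so the unnormalised weights are used.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

rare = math.log(3 / 2) + 1  # idf of a term found in one of the two documents (about 1.405)
shared = 1.0                # idf of 'driven', found in both documents

# Vocabulary order: car, driven, highway, road, truck
d1 = [rare, shared, 0.0, rare, 0.0]  # A: car, driven, road
d2 = [0.0, shared, rare, 0.0, rare]  # B: truck, driven, highway

similarity = cosine_similarity(d1, d2)
print(round(similarity, 3))  # 0.202
```

Only 'driven' is shared, and it carries the lowest weight, so the two documents end up with a low similarity.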
So far… Unstructured Text Data
• Text search – approximate string matching
• Preprocessing
– Regular expressions
– Tokenisation
– Case folding
– Stemming
– Lemmatization
– Stop word removal
– Text normalization
– Noise removal
• Text representation
Other Text Features
• Part-of-speech tagging
• She saw a bear. bear: NOUN
• Your efforts will bear fruit. bear: VERB
• bear_NN; bear_VB
• N-grams (bag of n-grams)
Example bigram features: value_curiosity, curiosity_,, passion_and
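Bigram features of this form can be generated by sliding a window of n tokens over the token list and joining each window with an underscore. The token sequence below is an assumption reconstructed from the feature names above; punctuation is kept as a token, which is why "curiosity_," appears.

```python
def ngrams(tokens, n=2, sep="_"):
    """Return all contiguous n-token windows, each joined into one feature string."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["value", "curiosity", ",", "passion", "and"]  # assumed token sequence
print(ngrams(tokens))
# ['value_curiosity', 'curiosity_,', ',_passion', 'passion_and']
```

Counting these strings in the same way as single words gives a bag of n-grams, which retains some local word order.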
Advanced text representation
• Surface value vs Semantics
– ‘A is better than B’
– ‘B is better than A’
• ‘Distributed Representations of Sentences and Documents’ Quoc Le and (Doc2Vec)
• Text search – approximate string matching
• Preprocessing
– Regular expressions
– Tokenisation
– Case folding
– Stemming
– Lemmatization
– Stop word removal
– Text normalization
– Noise removal
• Text representation
– Part of Speech Tagging
– Bag-of-n-grams
– Distributed representation learning