Introduction to information system
Machine Learning with Documents…
Gerhard Neumann
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M Data Science
Bag-of-words
• We need a way to represent textual information (what is x?)
• We usually speak of Documents and Terms
• Think of a document as a bag-of-words (BoW)
Document:
• Texas Instruments said it has developed the first 32-bit computer chip designed
specifically for artificial intelligence applications […]
Representation:
Bag-of-words
• Bag-of-Words is a histogram representation of the data
• Need a dictionary (of length L)
• Histograms are often used to transform set data into an aggregated fixed-length
representation, e.g. local image descriptors
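The histogram idea can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the helper names (tokenize, build_dictionary, bag_of_words), the regular-expression tokenizer, and the two toy documents are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_dictionary(documents):
    """Map each distinct term in the corpus to a fixed index 0..L-1."""
    terms = sorted({term for doc in documents for term in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}

def bag_of_words(document, dictionary):
    """Return a length-L histogram of term counts; terms outside the dictionary are ignored."""
    counts = Counter(tokenize(document))
    histogram = [0] * len(dictionary)
    for term, count in counts.items():
        if term in dictionary:
            histogram[dictionary[term]] = count
    return histogram

# Toy corpus, made up for illustration
documents = [
    "the chip is a computer chip",
    "the computer learns",
]
dictionary = build_dictionary(documents)
histograms = [bag_of_words(doc, dictionary) for doc in documents]
```

Every document maps to a vector of the same length L regardless of how many words it contains, which is what makes the representation usable as an input x for standard learning algorithms. Word order is discarded.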
Some preprocessing…
• Pre-Processing before converting to Bag-of-Words histogram representation
• Stopword removal: remove un-informative words (several lists available)
– a, i, her, it, is, the, to, …
• Porter Stemming: Groups of rules that transform words to a common stem
– remove ‘ed’, ‘ing’, ‘ly’, e.g. visiting, visited -> visit
– libraries, library -> librari
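The two preprocessing steps can be sketched as follows. Note that toy_stem is a deliberately simplified stand-in that only covers the suffix examples listed above; it is NOT the full Porter algorithm, which applies several ordered rule groups with extra conditions. The stopword set is just the short list from the slide, and all function names are hypothetical.

```python
# Short illustrative stopword list (real lists are much longer)
STOPWORDS = {"a", "i", "her", "it", "is", "the", "to"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def toy_stem(word):
    """Very rough suffix stripping; a simplified stand-in, not the Porter stemmer."""
    if word.endswith("ies"):
        return word[:-3] + "i"          # libraries -> librari
    for suffix in ("ing", "ed", "ly"):
        # Only strip if at least three characters of stem remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]  # visiting, visited -> visit
    if word.endswith("y") and len(word) > 3:
        return word[:-1] + "i"          # library -> librari
    return word

def preprocess(tokens):
    """Remove stopwords first, then stem the remaining tokens."""
    return [toy_stem(t) for t in remove_stopwords(tokens)]

tokens = "she is visiting the libraries".split()
processed = preprocess(tokens)
```

After this step, different surface forms of a word share one dictionary entry, so the histogram has fewer and more informative bins. In practice an existing stemmer implementation would normally be used rather than hand-written rules.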
TF-IDF representation
• Re-weighting the entries according to importance
– n(d;w) number of occurrences of word w in document d
– d(w) number of documents that contain word w
• Term frequency (TF) (of one document):
• Inverse document frequency (IDF) (of a corpus)
– Less frequent words get higher IDF
– Less frequent words are more descriptive / important
• tf-idf:
– Weight term frequency with word importance (idf)
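The re-weighting can be sketched in Python. The exact formulas are not given in the text above, so this sketch assumes one common variant: tf(d, w) = n(d; w) divided by the total number of words in d, idf(w) = log(N / d(w)) with N the number of documents and the natural logarithm, and tf-idf(d, w) = tf(d, w) * idf(w). Other variants (for example with smoothing or a different log base) exist. The function name and the toy count matrix are made up for illustration.

```python
import math

def tf_idf(counts):
    """counts[d][w] = n(d; w). Returns tf-idf weights with the same shape.

    Assumed variant: tf = n(d; w) / sum_w' n(d; w'), idf = log(N / d(w)).
    """
    num_docs = len(counts)
    num_terms = len(counts[0])
    # d(w): number of documents that contain word w
    doc_freq = [sum(1 for row in counts if row[w] > 0) for w in range(num_terms)]
    # Words that occur in no document get idf 0 to avoid division by zero
    idf = [math.log(num_docs / df) if df > 0 else 0.0 for df in doc_freq]
    weights = []
    for row in counts:
        total = sum(row)
        tf = [n / total if total > 0 else 0.0 for n in row]
        weights.append([t * i for t, i in zip(tf, idf)])
    return weights

# Toy bag-of-words counts: 2 documents, dictionary of length L = 3
counts = [
    [2, 1, 0],
    [1, 0, 1],
]
weights = tf_idf(counts)
```

In the toy example the first word occurs in both documents, so its idf is log(2/2) = 0 and it is weighted to zero everywhere, while the words that occur in only one document keep a positive weight.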