Introduction to information system

Machine Learning with Documents…

Gerhard Neumann

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M Data Science

Bag-of-words

• We need a way to represent textual information (what is x?)

• We usually speak of Documents and Terms

• Think of a document as a Bag-Of-Words (bow)

Document:

• Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications […]

Representation:

Bag-of-words

• Bag-of-Words is a histogram representation of the data

• Need a dictionary (of length L)

• Histograms are often used to transform set data into an aggregated fixed-length representation, e.g. local image descriptors
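
The histogram idea can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name `bag_of_words`, the five-term dictionary and the whitespace tokenisation are assumptions made for the example. Each document becomes a count vector of length L, one entry per dictionary term.

```python
from collections import Counter

def bag_of_words(tokens, dictionary):
    """Return a fixed-length count vector, one entry per dictionary term."""
    counts = Counter(tokens)
    # Terms that are not in the dictionary are ignored;
    # dictionary terms missing from the document get a count of 0.
    return [counts[term] for term in dictionary]

# Hypothetical dictionary of length L = 5 (chosen for illustration only)
dictionary = ["artificial", "chip", "computer", "intelligence", "texas"]

# Naive whitespace tokenisation of part of the example document
text = "texas instruments said it has developed the first 32-bit computer chip"
tokens = text.split()

vector = bag_of_words(tokens, dictionary)
print(vector)
```

Word order is discarded: any permutation of the tokens gives the same vector, which is why the representation is called a bag.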

Some preprocessing…

• Pre-Processing before converting to Bag-of-Words histogram representation

• Stopword removal: remove un-informative words (several lists available)

– a, i, her, it, is, the, to, …

• Porter stemming: groups of rules that reduce words to a common stem

– remove ‘ed’, ‘ing’, ‘ly’, e.g. visiting, visited -> visit

– libraries, library -> librari
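
The two preprocessing steps can be sketched as follows. This is a toy illustration under stated assumptions: `STOPWORDS` contains only the words listed above, and `toy_stem` is a hypothetical handful of suffix rules that reproduces the examples on this slide. It is not the Porter algorithm, which has many more rules and conditions.

```python
import re

# Assumption: a tiny stopword list taken from the examples above;
# real stopword lists are much longer.
STOPWORDS = {"a", "i", "her", "it", "is", "the", "to"}

# Assumption: simplified suffix rules, checked in order. NOT the full Porter stemmer.
SUFFIX_RULES = (("ies", "i"), ("ing", ""), ("ed", ""), ("ly", ""), ("y", "i"))

def toy_stem(word):
    """Apply the first matching suffix rule, keeping at least 3 stem characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

def preprocess(text):
    """Lowercase, tokenise on letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("She is visiting the library"))
```

After this step, "visiting" and "visited" map to the same dictionary entry, as do "libraries" and "library", which shrinks the dictionary before the histogram is built.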

TF-IDF representation

• Re-weighting the entries according to importance

– n(d;w) number of occurrences of word w in document d

– d(w) number of documents that contain word w

• Term frequency (TF) (of one document):

• Inverse document frequency (IDF) (of a corpus)

– Less frequent words get higher IDF

– Less frequent words are more descriptive / important

• tf-idf:

– Weight term frequency with word importance (idf)
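
The formulas for TF and IDF are not reproduced in the text above, so the sketch below assumes one common choice: term frequency as n(d;w) divided by the total number of words in d, and inverse document frequency as log(N / d(w)), where N is the number of documents. Other variants (for example logarithmic or smoothed versions) are in wide use, and the lecture may define them differently. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document.

    Assumed definitions (one common variant, not necessarily the lecture's):
      tf(d, w)  = n(d;w) / total number of words in d
      idf(w)    = log(N / d(w)), N = number of documents
      tf-idf    = tf(d, w) * idf(w)
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]            # n(d;w) for every document
    doc_freq = Counter(w for c in counts for w in c)   # d(w): documents containing w
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            w: (n / total) * math.log(n_docs / doc_freq[w])
            for w, n in c.items()
        })
    return weights

# Hypothetical toy corpus of three preprocessed documents
corpus = [
    ["chip", "chip", "computer"],
    ["computer", "librari"],
    ["computer", "visit"],
]
for doc_weights in tf_idf(corpus):
    print(doc_weights)
```

Under these assumed definitions, "computer" occurs in every document, so its IDF is log(1) = 0 and it receives zero weight everywhere, while "chip", which occurs in only one document, receives the largest weight.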