Introduction to Information Systems
Topic Models for Information Retrieval
Gerhard Neumann
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M Data Science
The poor feedbag
Feed the feedbag!
• It is starving… almost dead…
• Feedback (through the "backdoors"):
– Too much math!
• Yes… it's data science. But we reduced the math level significantly.
– The assignment uses a standard data set:
• Yes… but the questions are very specific. If you want to copy-paste the
assignment from the internet… then good luck!
• One request: please come to us first before "complaining" to the "big guys"
Motivation
Information retrieval is needed almost everywhere on the internet
Ad-hoc retrieval
• Goal: identify relevant documents for a given text query
• Input: a short, ambiguous and incomplete query (think of your usual Google search)
• Needs to be done extremely fast
• By far the most popular form of information access
Under the hood…
• What happens if you press the search button?
Agenda for Today
• Recap: Machine Learning with Documents
– How can we represent documents as vectors?
– What are the problems connected to these representations?
• Topic Models:
– Can be seen as clusters of documents
– A document can have several topics (clusters)
• Algorithms:
– Latent Semantic Analysis
– Non-Negative Matrix Factorization
Bag-of-words
• We need a way to represent textual information (what is x?)
• We usually speak of Documents and Terms
• Think of a document as a Bag-Of-Words (bow)
Document:
• Texas Instruments said it has developed the first 32-bit computer chip designed
specifically for artificial intelligence applications […]
Representation:
Bag-of-words
• Bag-of-Words is a histogram representation of the data
• Need a dictionary (of length L)
• Histograms are often used to transform set data into an aggregated fixed-length
representation
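As a concrete illustration, here is a minimal Python sketch of a bag-of-words histogram; the toy dictionary and the example document are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy dictionary of length L; a real system builds it from the whole corpus
dictionary = ["texas", "instruments", "chip", "computer", "artificial", "intelligence"]

def bag_of_words(document, dictionary):
    """Represent a document as a fixed-length histogram of word counts."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in dictionary]

doc = "Texas Instruments said it has developed the first 32-bit computer chip"
print(bag_of_words(doc, dictionary))  # -> [1, 1, 1, 1, 0, 0]
```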
Some preprocessing…
• Pre-Processing before converting to Bag-of-Words histogram representation
• Stopword removal: remove uninformative words (several lists available)
– a, i, her, it, is, the, to, …
• Porter stemming: groups of rules to transform words to a common stem
– remove 'ed', 'ing', 'ly', e.g. visiting, visited -> visit
– libraries, library -> librari
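A minimal sketch of this preprocessing step, assuming NLTK is installed and using a tiny hand-written stopword list (a real pipeline would use a full list):

```python
from nltk.stem import PorterStemmer  # assumes the nltk package is installed

stopwords = {"a", "i", "her", "it", "is", "the", "to"}  # tiny illustrative list
stemmer = PorterStemmer()

def preprocess(document):
    """Lower-case, remove stopwords, and reduce each word to its Porter stem."""
    tokens = document.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in stopwords]

print(preprocess("She is visiting the libraries"))
# -> ['she', 'visit', 'librari']
```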
TF-IDF representation
• Re-weighting the entries according to importance
– n(d, w): number of occurrences of word w in document d
– d(w): set of documents that contain word w
• Term frequency (TF) (of one document): tf(d, w) = n(d, w) / Σ_w' n(d, w')
• Inverse document frequency (IDF) (of a corpus of n documents): idf(w) = log( n / |d(w)| )
– Less frequent words get higher IDF
– Less frequent words are more descriptive / important
– If |d(w)| = n, then idf(w) = log(1) = 0
• tf-idf: tf-idf(d, w) = tf(d, w) · idf(w)
– Weight term frequency with word importance (idf)
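The tf-idf weighting can be computed directly from the word counts; the following sketch uses a made-up three-document corpus just to show the mechanics.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each document is already a list of preprocessed tokens
corpus = [
    ["chip", "computer", "intelligence"],
    ["chip", "chip", "design"],
    ["football", "match", "goal"],
]

n = len(corpus)
doc_freq = Counter(w for doc in corpus for w in set(doc))  # |d(w)|: documents containing w

def tf_idf(doc):
    """tf-idf weights for one document: term frequency times inverse document frequency."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {w: (c / total) * math.log(n / doc_freq[w]) for w, c in counts.items()}

print(tf_idf(corpus[1]))  # 'chip' gets a lower idf because it appears in two documents
```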
Ask questions!!!
Documents as matrices
Document-Term matrix:
• tf-idf representation of every document in the
database
• A is the matrix of n documents
• each document is represented as an m-dimensional vector
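In practice the document-term matrix is rarely built by hand; a rough sketch using scikit-learn's TfidfVectorizer (an assumption about tooling, not something prescribed by the slides) looks like this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer  # assumes scikit-learn is installed

# Hypothetical mini-corpus; a real database would hold n documents
documents = [
    "Texas Instruments developed a computer chip for artificial intelligence",
    "The football team won the match with a late goal",
    "New chip designs speed up artificial intelligence applications",
]

vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(documents)   # sparse n x m document-term matrix (tf-idf weighted)
print(A.shape)                            # (3, number of distinct terms)
```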
Document-Term Matrices are huge!
Typical values:
• Corpus: 1,000,000 documents
• Vocabulary: 100,000 words
• Sparseness: < 0.1%
• Think of the internet: documents = websites, words = every possible word
Beyond keyword-based search
Why not simply match the keywords that are typed in a query?
• Vocabulary mismatch
– different people use different vocabulary to describe the same concept
– matching queries and documents based on keywords is insufficient
• A webpage can be important although it does not contain the keywords
Challenges
Why is it a hard problem?
• Compactness: few search terms, rarely redundancy
– Average number of search terms per query = 2.5
• Variability: synonyms and semantically related terms, expressions, writing styles, etc.
• Ambiguity: terms with multiple senses (polysemes), e.g. Java, jaguar, bank, head
Latent Semantic Analysis
Document-Term matrix: potential problems:
• too large
• too complicated
• missing entries
• noisy entries
• lack of structure
Is there a better way to represent the documents?
• There may be a latent structure underlying the data
• Possible structure: semantic topics
– news, sports, mountain-biking news
Latent Semantic Analysis
General idea
• Map documents to a low-dimensional representation (topics)
• Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space)
• Compute document similarity based on topics
Goals
• Similar terms map to similar locations in the low-dimensional space
• Noise reduction by dimensionality reduction
Semantic Topic Models
We can approximate the documents by using semantic topics
• Topic matrix R:
– Let's assume we have q topics (q << m)
– Each topic vector r_j describes a "document prototype"
– Average word counts for this prototype
• Topic representation of document i:
– Document-topic factor l_ij: how much document i "belongs" to topic j
– Low-dimensional representation of a document
Recap: Multiplication of a vector with a matrix
• Matrix-vector product: a_i ≈ l_i R = Σ_j l_ij r_j
• Think of it as:
– We sum over the topic vectors r_j, weighted by their topic factors l_ij for document i
Matrix Decomposition
• Factorize the document matrix: A ≈ L R
• Document-topic factors L (n × q):
– Contains a low-dimensional representation for each document
• Topic matrix R (q × m):
– Contains q document prototypes
• Reduction: store only n·q + q·m numbers instead of n·m
Ask questions!!!
How do we find the decomposition?
Find a decomposition L and R that minimizes the sum of squared errors (SSE)
– Compute error matrix: E = A − L R
– Compute SSE: the sum of the squared entries of E
• Both matrices can be found by singular value decomposition (SVD)
If you do not know SVD, just remember:
• SVD gives you L and R
• It can be shown that SVD minimizes the SSE
Similarity of 2 documents
For new documents (query search string):
• Use the document-topic factor l_i (q-dimensional) instead of the word vector a_i (m-dimensional)
• Compute similarity in the low-dimensional space
– For example with the inner product
• Select the documents with the lowest distance
Non-negative matrix factorization
• Using SVD, L and R can contain negative numbers
– Why?... Why not?
– Does that make sense?
• R represents word counts of document prototypes
• L represents how much a document is characterized by a topic
– Nope… negative values make no sense for either
Non-negative matrix factorization
Non-negative matrix factorization (NMF) wants to find L and R
• … that minimize the SSE
• … and contain only non-negative entries
There exist efficient algorithms for doing this…
Example: News
• 3 of 100 extracted concepts/topics from AP news
Example: Science articles
• 10 of 128 extracted concepts/topics from articles published in Science magazine (12k)
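To tie the pieces together, here is a rough sketch of both decompositions on a toy corpus, using scikit-learn's TruncatedSVD for LSA and its NMF implementation for the non-negative variant; the corpus, the choice of q = 2, and the library calls are illustrative assumptions, not the exact setup behind the AP news or Science examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer   # assumes scikit-learn is installed
from sklearn.decomposition import TruncatedSVD, NMF
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for the document database
documents = [
    "Texas Instruments developed a computer chip for artificial intelligence",
    "New chip designs speed up artificial intelligence applications",
    "The football team won the match with a late goal",
    "The striker scored two goals in the final match",
]

A = TfidfVectorizer(stop_words="english").fit_transform(documents)  # n x m tf-idf matrix

q = 2  # assumed number of latent topics, q << m

# LSA: truncated SVD yields the document-topic factors L (n x q);
# the topic matrix R (q x m) is available as svd.components_
svd = TruncatedSVD(n_components=q)
L = svd.fit_transform(A)
R = svd.components_

# NMF: the same factorization A ~ L R, but both factors are constrained to be non-negative
nmf = NMF(n_components=q, init="nndsvd")
L_nmf = nmf.fit_transform(A)
R_nmf = nmf.components_

# Compare documents in the q-dimensional topic space instead of the m-dimensional word space
print(cosine_similarity(L[0:1], L[1:2]))  # two chip documents: expected to be similar
print(cosine_similarity(L[0:1], L[2:3]))  # chip vs. football: expected to be dissimilar
```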
What have we learned today?
• Bag-of-words, TF-IDF: representation of documents with vectors
– Size of the vector = number of words
– Huge dimensionality
– Very sparse
• Information retrieval with search queries
– Much more robust in a low-dimensional space
– Topic models:
• Prototypes for documents
• Can be linearly scaled and combined
– Compare documents in the low-dimensional space