

Topic Models for Information Retrieval

Gerhard Neumann

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M Data Science

The poor feedbag

Feed the feedbag!

• It is starving… almost dead…

• Feedback (through the "backdoors"):

– Too much math!

• Yes… it's data science. But we reduced the math level significantly.

– The assignment uses a standard data set:

• Yes… but the questions are very specific. If you want to copy-paste the assignment from the internet… then good luck!

• One request: please come to us first before "complaining" to the "big guys"

Motivation

Information retrieval is needed almost everywhere on the internet

Ad-hoc retrieval

• Goal: identify relevant documents for a given text query

• Input: a short, ambiguous and incomplete query (think of your usual Google search)

• Needs to be done extremely fast

• By far the most popular form of information access

Under the hood…

• What happens if you press the search button?

Agenda for Today

• Recap: Machine Learning with Documents

– How can we represent documents as vectors?

– What are the problems connected to these representations?

• Topic Models:

– Can be seen as clusters of documents

– A document can have several topics (clusters)

• Algorithms:

– Latent Semantic Analysis

– Non-Negative Matrix Factorization

Bag-of-words

• We need a way to represent textual information (what is x?)

• We usually speak of Documents and Terms

• Think of a document as a Bag-of-Words (BoW)

Document:

• Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications […]

Representation:

Bag-of-words

• Bag-of-Words is a histogram representation of the data

• Need a dictionary (of length L)

• Histograms are often used to transform set data into an aggregated fixed-length

representation
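
As a rough sketch (not code from the lecture), a document can be turned into such a fixed-length histogram with a few lines of Python; the dictionary and document below are made up purely for illustration:

```python
# Minimal bag-of-words sketch: every document becomes a fixed-length
# histogram over a shared dictionary of L words.
from collections import Counter

dictionary = ["chip", "computer", "intelligence", "texas", "instruments", "football"]

def bag_of_words(text, dictionary):
    """Count how often each dictionary word occurs in the document."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]

doc = "Texas Instruments said it has developed the first 32-bit computer chip"
print(bag_of_words(doc, dictionary))   # [1, 1, 0, 1, 1, 0]
```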

Some preprocessing…

• Pre-Processing before converting to Bag-of-Words histogram representation

• Stopword removal: remove uninformative words (several lists available)

– a, i, her, it, is, the, to, …

• Porter stemming: groups of rules that transform words to a common stem

– remove 'ed', 'ing', 'ly', e.g. visiting, visited -> visit

– libraries, library -> librari
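
A minimal preprocessing sketch, assuming the nltk package is available for Porter stemming; the stopword list here is just a tiny illustrative subset:

```python
# Stopword removal + Porter stemming before building the BoW histogram.
from nltk.stem import PorterStemmer

stopwords = {"a", "i", "her", "it", "is", "the", "to"}   # tiny illustrative list
stemmer = PorterStemmer()

def preprocess(text):
    tokens = [t for t in text.lower().split() if t not in stopwords]  # stopword removal
    return [stemmer.stem(t) for t in tokens]                          # Porter stemming

print(preprocess("She is visiting the libraries"))  # ['she', 'visit', 'librari']
```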

TF-IDF representation

• Re-weighting the entries according to importance

– n(d, w): number of occurrences of word w in document d

– d(w): the documents that contain word w (|d(w)| is their number)

• Term frequency (TF) (of one document):

– e.g. tf(d, w) = n(d, w) / Σ_w' n(d, w')

• Inverse document frequency (IDF) (of a corpus of n documents):

– idf(w) = log( n / |d(w)| )

– Less frequent words get a higher IDF

– Less frequent words are more descriptive / important

– If |d(w)| = n, then idf(w) = 0

• tf-idf:

– tf-idf(d, w) = tf(d, w) · idf(w)

– Weight the term frequency with the word importance (idf)
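
The definitions above can be translated directly into a few lines of Python; the toy corpus and vocabulary are made up for illustration:

```python
# Illustrative tf-idf computation following the definitions above.
import math

docs = [["chip", "computer", "chip"],        # already preprocessed documents
        ["computer", "football"],
        ["football", "football", "chip"]]
vocab = ["chip", "computer", "football"]
n = len(docs)

def tf(doc, w):                              # term frequency within one document
    return doc.count(w) / len(doc)

def idf(w):                                  # inverse document frequency over the corpus
    dw = sum(1 for doc in docs if w in doc)  # |d(w)|
    return math.log(n / dw)                  # = 0 if the word appears in every document

tfidf = [[tf(doc, w) * idf(w) for w in vocab] for doc in docs]
print(tfidf)
```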

Ask questions!!!

Documents as matrices

Document-Term matrix:

• tf-idf representation of every document in the
database

• A is an n × m matrix (one row per document, n documents)

• each document is represented as an m-dimensional vector
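
In practice the document-term matrix A is usually built with a library; a short sketch assuming scikit-learn is installed (the corpus is made up for illustration):

```python
# Build the (sparse) n x m document-term matrix A with tf-idf entries.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Texas Instruments developed a 32-bit computer chip",
          "The chip is designed for artificial intelligence applications",
          "Football results from the weekend"]

vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(corpus)   # shape: (n documents, m terms), stored sparsely
print(A.shape)
```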

Document-Term Matrices are huge!

Typical values:

• Corpus: 1,000,000 documents

• Vocabulary: 100,000 words

• Sparseness: < 0.1%

• Think of the internet: documents = websites, words = every possible word

Beyond keyword-based Search

Why not just find the keywords that are typed in the query?

• Vocabulary mismatch

– different people use different vocabulary to describe the same concept

– matching queries and documents based on keywords is insufficient

• A webpage can be important although it does not contain the keywords

Challenges

Why is it a hard problem?

• Compactness: few search terms, rarely redundancy

– Average number of search terms per query = 2.5

• Variability: synonyms and semantically related terms, expressions, writing styles, etc.

• Ambiguity: terms with multiple senses (polysemes), e.g. Java, jaguar, bank, head

Latent Semantic Analysis

Document-Term matrix: potential problems:

• too large

• too complicated

• missing entries

• noisy entries

• lack of structure

Is there a better way to represent the documents?

• There may be a latent structure underlying the data

• Possible structure: semantic topics

– news, sports, mountain-biking news

Latent Semantic Analysis

General idea

• Map documents to a low-dimensional representation (topics)

• Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space)

• Compute document similarity based on topics

Goals

• Similar terms map to similar locations in the low-dimensional space

• Noise reduction by dimensionality reduction

Semantic Topic Models

• We can approximate the documents by using semantic topics

Semantic Topic Models

We can approximate the documents by using semantic topics

• Topic matrix R:

– Let's assume we have q topics (q << m)

– Each topic vector r describes a "document prototype"

– Average word counts for this prototype

• Topic representation of document i:

– Document-topic factor: how much document i "belongs" to topic j

– Low-dimensional representation of a document

Recap: Multiplication of a vector with a matrix

• Matrix-Vector Product

• Think of it as:

– We sum over the topic vectors (the rows of R), weighted by their topic factors for document i

Matrix Decomposition

• Factorize the document matrix: A ≈ L R

• Document-Topic Factors L (n × q):

– Contains a low-dimensional representation for each document

• Topic Matrix R (q × m):

– Contains q document prototypes

Matrix Decomposition

• Factorize the document matrix: A ≈ L R

• Reduction: store n·q + q·m numbers instead of n·m

Ask questions!!!

How do we find the decomposition?

Find a decomposition L and R that minimizes the sum of squared errors (SSE)

– Compute the error matrix: E = A − L R

– Compute the SSE: the sum of the squared entries of E

• Both matrices can be found by singular value decomposition (SVD)

If you do not know SVD, just remember:

• SVD gives you L and R

• It can be shown that SVD minimizes the SSE

Similarity of 2 documents

For new documents (query search string):

• Use the document-topic factor (q-dimensional) instead of the tf-idf vector (m-dimensional)

• Compute similarity in the low-dimensional space

– For example with the inner product

• Select the documents with the lowest distance

Non-negative matrix factorization

• Using SVD, L and R can contain negative numbers

– Why?... Why not? Does that make sense?

• R represents word counts of document prototypes

• L represents how much a document is characterized by a topic

– Nope…

Non-negative matrix factorization

Non-negative matrix factorization (NMF) wants to find L and R…

• … that minimize the SSE

• … and contain only non-negative entries

There exist efficient algorithms for doing this… (a code sketch follows after the examples below)

Example: News

• 3 of 100 extracted concepts/topics from AP news

Example: Science articles

• 10 of 128 extracted concepts/topics from articles published in Science magazine (12k articles)
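
A compact sketch of how LSA and NMF could be run in practice, assuming scikit-learn; the corpus, the number of topics q, and the use of cosine similarity for ranking are illustrative choices, not the lecture's exact setup:

```python
# LSA (truncated SVD) and NMF on a tiny toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["texas instruments developed a new computer chip",
          "the chip is designed for artificial intelligence",
          "the football team won the championship game",
          "the match ended with a late goal"]

vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(corpus)        # document-term matrix (n x m)

# --- LSA: factorize A into q topics with a truncated SVD ---
q = 2
lsa = TruncatedSVD(n_components=q)
L = lsa.fit_transform(A)                    # document-topic factors (n x q)
R = lsa.components_                         # topic matrix (q x m)

# Project a query into topic space and rank documents by similarity there
query = vectorizer.transform(["artificial intelligence chip"])
query_topics = lsa.transform(query)         # q-dimensional representation
scores = cosine_similarity(query_topics, L)[0]
print(scores.argsort()[::-1])               # document indices, most similar first

# --- NMF: same factorization idea, but L and R are non-negative ---
nmf = NMF(n_components=q, init="nndsvd", max_iter=500)
L_nmf = nmf.fit_transform(A)                # non-negative document-topic factors
R_nmf = nmf.components_                     # non-negative topic prototypes
```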
What have we learned today?

• Bag-of-words, TF-IDF: representation of documents as vectors

– Size of the vector = number of words

– Huge dimensionality

– Very sparse

• Information retrieval with search queries

– Much more robust in a low-dimensional space

– Topic models:

• Prototypes for documents

• Can be linearly scaled

– Compare documents in the low-dimensional space