
Search Engines

Text, Web And Media Analytics


Beyond Bag of Words

Yuefeng Li  |  Professor
School of Electrical Engineering and Computer Science
Queensland University of Technology
S Block, Level 10, Room S-1024, Gardens Point Campus
ph 3138 5212 | email 

1. Overview of Bag of Words
2. Feature-Based Retrieval Models
   Inference Network Model
   Linear Feature-Based Models
3. Term Dependence Models
   Markov Random Field (MRF)
   Latent Concept Expansion
4. Topic Modeling for Latent Topics
   Probabilistic Latent Semantic Analysis (PLSA)
   Latent Dirichlet Allocation (LDA)
   Pattern Enhanced LDA
5. Expert Search
   Entity Search
   Probabilistic Model for Expert Search
6. Word Embedding

This week we discuss some research methods that go beyond the bag of words representation. There are two important motivations here.
The first one is to provide more knowledge to understand complex research issues in real applications.
The second is to acquire research capabilities to solve difficult problems.

1. Overview of Bag of Words
The “bag of words” representation is a simple representation of text data. It has been very successful in retrieval experiments compared to more complex representations of text content.
In this representation, a document is considered to be an unordered collection of words with no relationships, either syntactic or statistical, between them.
From a linguistic point of view, “bag of words” is extremely limited since no one could read a “bag of words” representation and get the same meaning as the original normal text.
In some cases, incorporating simple phrases and word proximity into a word-based representation would seem to have obvious benefits.
Many applications (e.g., web search), however, have evolved beyond the stage where a bag of words representation of documents or queries would be adequate.
For these applications, representations and ranking based on many different features are required.
Features derived from the bag of words are still important, but linguistic, structural, metadata, and non-textual content features can also be used effectively.

2. Feature-Based Retrieval Models
Effective retrieval requires the combination of many pieces of evidence or features about a document’s potential relevance.
For example, we may consider
whether words occur in particular document structures, such as section headings or titles, or
whether words are related to each other.
In addition, evidence such as the date of publication, the document type, or,
in the case of web search, the PageRank value will also be important.

Inference Network Model
The inference network retrieval model is a framework where we can describe the different types of features or evidence, their relative importance, and how they should be combined.
It is based on Bayesian networks, which are probabilistic models used to specify a set of events and the dependencies between them.
The networks are directed, acyclic graphs (DAGs), where the nodes in the graph represent events with a set of possible outcomes and arcs represent probabilistic dependencies between the events.
When used as a retrieval model, the nodes represent events such as observing a particular document, observing a particular feature, or observing some combination of features. These events are all binary, meaning that TRUE and FALSE are the only possible outcomes.

Example 1. Inference network model
D – a document (web page) node. The features being combined are words in a web page’s title, body, and headings; the ri nodes are document feature (representation) nodes.
The probabilities associated with the features are based on language models θ estimated using parameters μ.
qi – query nodes are used to combine (e.g., AND & OR) features from representation nodes and other query nodes.
I – an information need. The network as a whole computes P(I|D, μ), the probability that I is met given D and μ

The probability estimate
It is based on the multiple-Bernoulli distribution and has the same form as the estimate for the multinomial distribution with Dirichlet smoothing:

P(ri | D, μ) = (fri,D + μ P(ri|C)) / (|D| + μ)

where fri,D is the number of times feature ri occurs in D, P(ri|C) is the collection (C) probability for feature ri, and μ is the Dirichlet smoothing parameter.
Please note that D and μ are random variables; for example, if fri,D is the number of times feature ri occurs in a document title, the collection probabilities would be estimated from the collection of all title texts, and the μ parameter would be specific to titles.
The query nodes specify how to combine evidence using Boolean operators and the above probabilities (you can find more details in Chapter 7 of the textbook).
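
To make this concrete, here is a minimal Python sketch (illustrative only, not Galago's implementation) of the Dirichlet-smoothed belief for a representation node and the standard inference-network #and/#or combinations; all counts, collection probabilities, and μ values are made-up numbers.

def feature_belief(f_riD, doc_len, p_riC, mu=1500):
    """Dirichlet-smoothed estimate P(ri | D, mu) for a representation node."""
    return (f_riD + mu * p_riC) / (doc_len + mu)

def combine_and(beliefs):
    """#and query node: product of the incoming beliefs."""
    result = 1.0
    for b in beliefs:
        result *= b
    return result

def combine_or(beliefs):
    """#or query node: one minus the product of (1 - belief)."""
    result = 1.0
    for b in beliefs:
        result *= (1.0 - b)
    return 1.0 - result

# Hypothetical numbers: a title feature occurring twice in a 12-word title field,
# and a body feature occurring five times in a 300-word body field.
b_title = feature_belief(f_riD=2, doc_len=12, p_riC=0.001, mu=100)
b_body = feature_belief(f_riD=5, doc_len=300, p_riC=0.002, mu=1500)
print(combine_and([b_title, b_body]), combine_or([b_title, b_body]))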

Linear Feature-Based Models
The scoring function of a linear feature-based model is a weighted sum of features:

SΛ(D, Q) = Σj λj fj(D, Q) + Z

where the fj(D, Q) are the features, the weights λj make up the parameter set Λ, and Z is a constant that does not depend on the document (so it does not affect the ranking).
Some models support non-linear functions, but linear is more common.
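
As a concrete illustration, here is a minimal Python sketch of a linear feature-based scoring function; the feature names and weight values are hypothetical.

# Hypothetical feature values for one (document, query) pair and hand-picked weights.
features = {"title_match": 0.7, "body_bm25": 12.3, "pagerank": 0.4, "exact_phrase": 1.0}
weights = {"title_match": 2.0, "body_bm25": 0.1, "pagerank": 1.5, "exact_phrase": 0.8}

def linear_score(features, weights):
    """S_Lambda(D, Q) = sum_j lambda_j * f_j(D, Q); the constant Z is dropped."""
    return sum(weights[name] * value for name, value in features.items())

print(linear_score(features, weights))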

Parameter Setting
To find the best values for the parameters, we need
a set of training data T, and
an evaluation function E(RΛ, T),
where RΛ is the set of rankings produced by the scoring function for all the queries in T.
E produces a real-valued output given the set of ranked lists and the training data.
E is only required to consider the document rankings and not the document scores. This is a standard characteristic of the evaluation measures we described in week 6, such as NDCG.
(Normalized DCG is the DCG value divided by the ideal DCG value, computed from the perfect ranking, so NDCG <= 1.)
The goal of a linear feature-based retrieval model is to find a parameter setting Λ that maximizes E on the training data.
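
This kind of parameter search can be sketched as a simple coordinate ascent over the weights. The sketch below is illustrative only (it is not the textbook's or Galago's implementation); score_docs and evaluate are caller-supplied functions standing in for the ranking function RΛ and the evaluation function E (e.g., mean NDCG over the training queries).

def coordinate_ascent(score_docs, evaluate, feature_names, steps=(0.5, 0.25, 0.1), sweeps=5):
    """Greedy search for weights that maximize the evaluation function E on training data.

    score_docs(weights) should return the rankings R_Lambda for all training queries,
    and evaluate(rankings) should return a single real value such as mean NDCG.
    """
    weights = {name: 1.0 for name in feature_names}
    best = evaluate(score_docs(weights))
    for _ in range(sweeps):
        for name in feature_names:
            for step in steps:
                for delta in (+step, -step):
                    trial = dict(weights)
                    trial[name] += delta
                    value = evaluate(score_docs(trial))
                    if value > best:
                        best, weights = value, trial
    return weights, best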

3. Term Dependence Models
Retrieval models that make use of term relationships are often called term dependence models, because they do not assume that words occur independently of each other.
Term dependence can be incorporated into a number of features that are used as part of the ranking algorithm.
One example is the Markov Random Field (MRF) model. MRFs are undirected graphical models:
the model first constructs a graph that consists of a document node and one node per query term;
it then models the joint distribution over the document random variable and the query term random variables.

Example 2. Markov Random Field model assumptions:
full independence (top left), sequential dependence (top right),
full dependence (bottom left),
general dependence (bottom right)

Markov Random Field (MRF) Potential Functions
In practice, however, it has been shown that using the sequential dependence assumption is the best option.
After the MRF graph has been constructed, a set of potential functions must be defined over the cliques of the graph.
The potential functions are meant to measure the compatibility between the observed values of the random variables in a clique.
For example, in the sequential dependence graph in Example 2, a potential function over the clique consisting of the terms q1, q2, and D might compute how many times the exact phrase “q1 q2” occurs in document D, or how many times the two terms occur within some window of each other.
These potential functions are quite general and can compute a variety of different features of the text. In this way, the MRF model is more powerful than models such as language models or BM25.

Example 3. Galago queries & MRF models
Assume we have the query “president abraham lincoln”.
The full independence MRF model as a Galago query (it does not consider any dependencies between terms):
#combine(president abraham lincoln)
The sequential dependence MRF model as a Galago query has three components, and each component is weighted (od – ordered window, uw – unordered window):
The first component scores the contribution of matching individual terms, and is weighted 80%.
The second component scores the contribution of matching ordered subphrases within the query, and is weighted 10%.
The third component scores the contribution of query terms co-occurring within unordered windows, and is weighted 10%.
(A small code sketch that builds such a query string is given at the end of this section.)

Latent Concept Expansion
To summarize, the MRF model is a linear feature-based retrieval model and an effective method of incorporating features based on term dependence to rank documents.
Latent concept expansion (LCE) supports pseudo-relevance feedback in the MRF framework.
Latent concept expansion can be viewed as a “feature expansion” technique, in that it enriches the original feature set by including new features based on the expanded query.
Latent, or hidden, concepts are words or phrases that users have in mind but do not mention explicitly when they express a query.
The latent concept expansion graph shows dependencies between query words and expansion words;
it gives better probability estimates for expansion terms;
and the expansion features are not just single terms.

Example 4. Pseudo-Relevance Feedback (PRF) Graphs
Relevance model – words occur independently, and the question marks are expansion words estimated using PRF.
LCE model – dependencies are represented between query words and expansion words. The estimated probabilities for expansion words use these dependencies and can produce multi-word phrases as expansion terms rather than only single words.
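
As mentioned in Example 3, here is a minimal Python sketch that builds a sequential dependence query string in Galago-style syntax. The 0.8/0.1/0.1 weights follow the 80%/10%/10% weighting above; the #od:1 and #uw:8 window settings are common defaults and are assumptions here, not taken from the slides.

def sequential_dependence_query(query, w_terms=0.8, w_ordered=0.1, w_unordered=0.1):
    """Build a Galago-style sequential dependence query string from a keyword query."""
    terms = query.split()
    pairs = list(zip(terms, terms[1:]))  # adjacent query term pairs
    unigrams = "#combine(" + " ".join(terms) + ")"
    ordered = "#combine(" + " ".join(f"#od:1({a} {b})" for a, b in pairs) + ")"
    unordered = "#combine(" + " ".join(f"#uw:8({a} {b})" for a, b in pairs) + ")"
    return f"#weight({w_terms} {unigrams} {w_ordered} {ordered} {w_unordered} {unordered})"

print(sequential_dependence_query("president abraham lincoln"))
# prints (as a single line):
# #weight(0.8 #combine(president abraham lincoln) 0.1 #combine(#od:1(president abraham) #od:1(abraham lincoln)) 0.1 #combine(#uw:8(president abraham) #uw:8(abraham lincoln)))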
4. Topic Modeling for Latent Topics
Improved representations of documents can also be viewed as improved smoothing techniques:
they improve the estimates for words that are related to the topic(s) of the document, instead of just using background probabilities.
Approaches
Latent Semantic Indexing (LSI)
Probabilistic Latent Semantic Indexing (pLSI), also called Probabilistic Latent Semantic Analysis (PLSA)
Latent Dirichlet Allocation (LDA) – models a document as being generated from a mixture of topics

Probabilistic Latent Semantic Analysis (PLSA)
PLSA models the probability of a co-occurrence (w, d) of a word and a document as a mixture of conditionally independent multinomial distributions:

P(d, w) = P(d) P(w|d),  where  P(w|d) = Σc P(w|c) P(c|d),
or equivalently the symmetric formulation  P(d, w) = Σc P(c) P(d|c) P(w|c)

where c is a latent topic; w and d are both generated from the latent topic c in similar ways (using the conditional probabilities P(d|c) and P(w|c)); M is the number of documents and N is the length of d.
Model fitting is done with Expectation Maximization (EM).
By Bayes’ rule, P(d|c) = P(d)P(c|d)/P(c).

Latent Dirichlet Allocation (LDA) – plate notation
D is a corpus consisting of M documents, each of length Ni.
α is the Dirichlet prior on the per-document topic distributions.
β is the Dirichlet prior on the per-topic word distributions.
θi is the multinomial distribution over topics in document i (i.e., the topic distribution).
zij is the topic for the jth word in document i, and wij is the specific word.
φk is the multinomial distribution over words for topic k (i.e., the topic representation).
K denotes the number of topics considered in the model.
Document i is represented by a topic distribution θi.
Collection D is represented by a set of topics, each of which is represented by a topic representation φk for topic k.
LDA generates word-topic assignments for each document based on θi and φk during its sampling process for iteratively estimating θi and φk.
The topic distribution θi indicates which topics are important for a particular document i.
The topic representation φk indicates which words are important to a topic k at the collection level.
Φ is a K×V matrix (V is the size of the vocabulary), a Markov (row-stochastic) matrix each row of which denotes the word distribution of a topic.

LDA – the generative process
Choose θi ~ Dir(α), the Dirichlet distribution (a prior distribution in Bayesian statistics) with parameter α
Choose φk ~ Dir(β)
For each of the word positions i (document), j (word in document i) {
    choose a topic zi,j ~ Multinomial(θi)   // the multinomial is a generalization of the binomial distribution
    choose a word wi,j ~ Multinomial(φzi,j)
}

Table 1. A possible result of LDA: word-topic assignments

LDA – term probability
LDA gives language model probabilities for each word in a document:

Plda(w|D) = Σk P(w | topic k) P(topic k | D) = Σk φk,w θD,k

These can be used to smooth the document representation by mixing them with the query likelihood probability as follows:

P(w|D) = λ (fw,D + μ P(w|C)) / (|D| + μ) + (1 − λ) Plda(w|D)
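
A minimal Python sketch of the smoothed estimate above; the topic distributions, counts, λ, and μ values are made-up illustrative numbers.

def lda_term_probability(word, theta_d, phi):
    """P_lda(w|D) = sum over topics k of theta_D[k] * phi[k][w]."""
    return sum(theta_d[k] * phi[k].get(word, 0.0) for k in theta_d)

def smoothed_term_probability(word, f_wD, doc_len, p_wC, theta_d, phi, lam=0.7, mu=1000):
    """Mix the Dirichlet-smoothed query likelihood estimate with the LDA estimate."""
    query_likelihood = (f_wD + mu * p_wC) / (doc_len + mu)
    return lam * query_likelihood + (1 - lam) * lda_term_probability(word, theta_d, phi)

# Hypothetical two-topic model for one document.
theta_d = {0: 0.8, 1: 0.2}                       # topic distribution theta_D
phi = {0: {"lincoln": 0.02, "president": 0.03},  # topic representations phi_k
       1: {"lincoln": 0.001, "car": 0.04}}
print(smoothed_term_probability("lincoln", f_wD=3, doc_len=250, p_wC=0.0005,
                                theta_d=theta_d, phi=phi))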
If the LDA probabilities are used directly as the document representation, effectiveness is significantly reduced because the features are too smoothed:
e.g., in a typical TREC experiment, only 400 topics are used for the entire collection;
generating LDA topics is expensive.
When used for smoothing, effectiveness is improved.

LDA Example
Top words from 4 LDA topics from TREC news.

Topic Modeling with Scikit Learn
https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730
Documents: 20 Newsgroups dataset

# Imports and the term-frequency matrix tf are added here so the snippet runs; no_topics = 20 is illustrative.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
no_topics = 20
tf = CountVectorizer(max_df=0.95, min_df=2, stop_words='english').fit_transform(documents)
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online',
                                learning_offset=50., random_state=0).fit(tf)

Pattern Enhanced LDA
Pattern-based representations contain structural information which can reveal the associations between words.
A post-LDA approach:
firstly, construct a new transactional dataset from the LDA model results of the document collection D;
secondly, generate pattern-based representations from the transactional dataset to represent user needs over the collection D.
Example 4. Topical document transactions (TDT): transactional datasets generated from Table 1, where duplicate words are removed.

5. Expert Search
Expert search is a method for finding people who have expertise in a particular area or topic.
It is a kind of entity search, which uses entity structures (e.g., entities such as people, organizations, and locations) to provide a ranked list of entities in response to a query.

Entity Search
Identify entities in text, as we discussed for information extraction.
Entity representation
Construct “pseudo-documents” to represent entities, based on the words occurring near the entity over the whole corpus;
these are also called “context vectors”.
For example, if the organization entity “california institute of technology” occurred 65 times in a corpus, every word within 20 words of those 65 occurrences would be accumulated into the pseudo-document representation.
Retrieve ranked lists of entities instead of documents.

Example 5. Entity Search (organization search based on a TREC news corpus)

Probabilistic Model for Expert Search
Let D be a set of documents and q be a query.
Rank candidate entities e by the joint distribution P(e, q) of entities and query terms:

P(e, q) = Σd∈D P(q|e, d) P(e|d) P(d)

The P(q|e, d) component involves ranking entities in those documents with respect to a query.
The P(e|d) component corresponds to finding documents that provide information about an entity.

Estimation of Probabilities for Expert Search
Assuming words and entities are independent in a given document d, the two components could be computed separately by using q and e as queries for probabilistic retrieval.
This normally leads to poor performance.
Instead, estimate the strength of association between e and q using the proximity of co-occurrence of the query words and the entities.

6. Word Embedding
Word embedding is a term used for the representation of words for text analysis and NLP.
It encodes the meaning of a word in the form of a real-valued vector, such that words that are closer in the vector space are expected to be similar in meaning.
For example, Word2vec embeddings in the Python package gensim:
the word2vec algorithm uses a neural network model to learn word associations from a large corpus of text;
the word2vec algorithms include the skip-gram and CBOW models, using either hierarchical softmax or negative sampling.
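
A minimal gensim sketch (assuming gensim 4.x); the toy corpus and vector_size=10 are illustrative, chosen to match the 10-dimensional example vector below.

from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens (illustrative only).
sentences = [["search", "engines", "rank", "documents"],
             ["machines", "learn", "word", "associations"],
             ["word", "embeddings", "encode", "meaning"]]

# sg=1 selects the skip-gram model (sg=0 would use CBOW).
model = Word2Vec(sentences, vector_size=10, window=5, min_count=1, sg=1, seed=1)

print(model.wv["machines"])               # the 10-dimensional vector for "machines"
print(model.wv.most_similar("machines"))  # nearest words in the embedding space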
E.g., the vector of “machines” = [-0.0366 -0.0750 0.1216 -0.0339 0.0837 0.0024 0.0983 0.0981 -0.1342 0.0736]

References
Chapters 7 & 11 of the textbook: W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Pearson, 2010.
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
T. Hofmann, Probabilistic Latent Semantic Analysis, arXiv preprint arXiv:1301.6705, 2013.
Y. Gao, Y. Xu, and Y. Li, Pattern-based Topics for Document Modelling in Information Filtering, IEEE Transactions on Knowledge & Data Engineering, 2015, vol. 27, no. 6, pp. 1629-1642.