
Search Engines

Text, Web And Media Analytics


Beyond Bag of Words

Yuefeng Li  |  Professor
School of Electrical Engineering and Computer Science
Queensland University of Technology
S Block, Level 10, Room S-1024, Gardens Point Campus
ph 3138 5212 | email 

1. Overview of Bag of Words
2. Feature-Based Retrieval Models
   Inference Network Model
   Linear Feature-Based Models
3. Term Dependence Models
   Markov Random Field (MRF)
   Latent Concept Expansion
4. Topic Modeling for Latent Topics
   Probabilistic Latent Semantic Analysis (PLSA)
   Latent Dirichlet Allocation (LDA)
   Pattern Enhanced LDA
5. Expert Search
   Entity Search
   Probabilistic Model for Expert Search
6. Word Embedding

This week we discuss some research methods that go beyond the bag of words representation. There are two important motivations here.
The first one is to provide more knowledge to understand complex research issues in real applications.
The second is to acquire research capabilities to solve difficult problems.

1. Overview of Bag of Words
The “bag of words” representation is a simple representation of text data. It has been very successful in retrieval experiments compared to more complex representations of text content.
In this representation, a document is considered to be an unordered collection of words with no relationships, either syntactic or statistical, between them.
From a linguistic point of view, “bag of words” is extremely limited since no one could read a “bag of words” representation and get the same meaning as the original normal text.
In some cases, incorporating simple phrases and word proximity into a word-based representation would seem to have obvious benefits.
Many applications (e.g., web search), however, have evolved beyond the stage where a bag of words representation of documents or queries would be adequate.
For these applications, representations and ranking based on many different features are required.
Features derived from the bag of words are still important, but linguistic, structural, metadata, and non-textual content features can also be used effectively.

2. Feature-Based Retrieval Models
Effective retrieval requires the combination of many pieces of evidence or features about a document’s potential relevance.
For example, we may consider
whether words occur in particular document structures, such as section headings or titles, or
whether words are related to each other.
In addition, evidence such as the date of publication, the document type, or,
in the case of web search, the PageRank value will also be important.

Inference Network Model
The inference network retrieval model is a framework where we can describe the different types of features or evidence, their relative importance, and how they should be combined.
It is based on Bayesian networks, which are probabilistic models used to specify a set of events and the dependencies between them.
The networks are directed, acyclic graphs (DAGs), where the nodes in the graph represent events with a set of possible outcomes and arcs represent probabilistic dependencies between the events.
When used as a retrieval model, the nodes represent events such as observing a particular document, observing a particular feature, or observing some combination of features. These events are all binary, meaning that TRUE and FALSE are the only possible outcomes.

Example 1. Inference network model
D – a document (web page) node. The features being combined are words in a web page’s title, body, and headings; the ri nodes are document feature (representation) nodes.
The probabilities associated with the features are based on language models θ estimated using parameters μ.
qi – query nodes are used to combine (e.g., AND & OR) features from representation nodes and other query nodes.
I – an information need. The network as a whole computes P(I|D, μ), the probability that I is met given D and μ

The probability estimate
It is based on the multiple-Bernoulli distribution and has the same form as the estimate for the multinomial distribution with Dirichlet smoothing:

P(ri | D, μ) = (fri,D + μ P(ri|C)) / (|D| + μ)

where fri,D is the number of times feature ri occurs in D, P(ri|C) is the collection (C) probability for feature ri, and μ is the Dirichlet smoothing parameter.
Please note that D and μ are random variables; for example, if fri,D is the number of times feature ri occurs in a document title, the collection probabilities would be estimated from the collection of all title texts, and the μ parameter would be specific to titles.
The query nodes specify how to combine evidence using Boolean operators and the above probabilities (you can find more details in Chapter 7 of the textbook).
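
To make this concrete, here is a minimal Python sketch (illustrative only, not Galago's implementation) of the Dirichlet-smoothed belief for a representation node and the standard inference-network #and/#or combinations; all counts, collection probabilities, and μ values are made-up numbers.

def feature_belief(f_riD, doc_len, p_riC, mu=1500):
    """Dirichlet-smoothed estimate P(ri | D, mu) for a representation node."""
    return (f_riD + mu * p_riC) / (doc_len + mu)

def combine_and(beliefs):
    """#and query node: product of the incoming beliefs."""
    result = 1.0
    for b in beliefs:
        result *= b
    return result

def combine_or(beliefs):
    """#or query node: one minus the product of (1 - belief)."""
    result = 1.0
    for b in beliefs:
        result *= (1.0 - b)
    return 1.0 - result

# Hypothetical numbers: a title feature occurring twice in a 12-word title field,
# and a body feature occurring five times in a 300-word body field.
b_title = feature_belief(f_riD=2, doc_len=12, p_riC=0.001, mu=100)
b_body = feature_belief(f_riD=5, doc_len=300, p_riC=0.002, mu=1500)
print(combine_and([b_title, b_body]), combine_or([b_title, b_body]))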

Linear Feature-Based Models
The scoring function of a linear feature-based model is a weighted sum of features:

SΛ(D, Q) = Σj λj fj(D, Q) + Z

where the fj(D, Q) are the features, the weights λj make up the parameter set Λ, and Z is a constant that does not depend on the document (so it does not affect the ranking).
Some models support non-linear functions, but linear is more common.
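
As a concrete illustration, here is a minimal Python sketch of a linear feature-based scoring function; the feature names and weight values are hypothetical.

# Hypothetical feature values for one (document, query) pair and hand-picked weights.
features = {"title_match": 0.7, "body_bm25": 12.3, "pagerank": 0.4, "exact_phrase": 1.0}
weights = {"title_match": 2.0, "body_bm25": 0.1, "pagerank": 1.5, "exact_phrase": 0.8}

def linear_score(features, weights):
    """S_Lambda(D, Q) = sum_j lambda_j * f_j(D, Q); the constant Z is dropped."""
    return sum(weights[name] * value for name, value in features.items())

print(linear_score(features, weights))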

Parameter Setting
To find the best values for the parameters, we need
a set of training data T, and
an evaluation function E(RΛ, T),
where RΛ is the set of rankings produced by the scoring function for all the queries in T.
E produces a real-valued output given the set of ranked lists and the training data.
E is only required to consider the document rankings and not the document scores. This is a standard characteristic of the evaluation measures we described in week 6, such as NDCG.
(Normalized DCG is the DCG value divided by the ideal DCG value, computed from the perfect ranking, so NDCG <= 1.)
The goal of a linear feature-based retrieval model is to find a parameter setting Λ that maximizes E on the training data.
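
This kind of parameter search can be sketched as a simple coordinate ascent over the weights. The sketch below is illustrative only (it is not the textbook's or Galago's implementation); score_docs and evaluate are caller-supplied functions standing in for the ranking function RΛ and the evaluation function E (e.g., mean NDCG over the training queries).

def coordinate_ascent(score_docs, evaluate, feature_names, steps=(0.5, 0.25, 0.1), sweeps=5):
    """Greedy search for weights that maximize the evaluation function E on training data.

    score_docs(weights) should return the rankings R_Lambda for all training queries,
    and evaluate(rankings) should return a single real value such as mean NDCG.
    """
    weights = {name: 1.0 for name in feature_names}
    best = evaluate(score_docs(weights))
    for _ in range(sweeps):
        for name in feature_names:
            for step in steps:
                for delta in (+step, -step):
                    trial = dict(weights)
                    trial[name] += delta
                    value = evaluate(score_docs(trial))
                    if value > best:
                        best, weights = value, trial
    return weights, best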

3. Term Dependence Models
Retrieval models that make use of term relationships are often called term dependence models, because they do not assume that words occur independently of each other.
Term dependence can be incorporated into a number of features that are used as part of the ranking algorithm.
One example is the Markov Random Field (MRF) model. MRFs are undirected graphical models:
the model first constructs a graph that consists of a document node and one node per query term;
it then models the joint distribution over the document random variable and the query term random variables.

Example 2. Markov Random Field model assumptions:
full independence (top left), sequential dependence (top right),
full dependence (bottom left),
general dependence (bottom right)

Markov Random Field (MRF) Potential Functions
In practice, however, it has been shown that using the sequential dependence assumption is the best option.
After the MRF graph has been constructed, a set of potential functions must be defined over the cliques of the graph.
The potential functions are meant to measure the compatibility between the observed values of the random variables in a clique.
For example, in the sequential dependence graph in Example 2, a potential function over the clique consisting of the terms q1, q2, and D might compute how many times the exact phrase “q1 q2” occurs in document D, or how many times the two terms occur within some window of each other.
These potential functions are quite general and can compute a variety of different features of the text. In this way, the MRF model is more powerful than models such as language models or BM25.

Example 3. Galago queries & MRF models
Assume we have the query “president abraham lincoln”.
The full independence MRF model as a Galago query (it does not consider any dependencies between terms):
#combine(president abraham lincoln)
The sequential dependence MRF model as a Galago query has three components, and each component is weighted (od – ordered window, uw – unordered window):
The first component scores the contribution of matching individual terms, and is weighted 80%.
The second component scores the contribution of matching ordered subphrases within the query, and is weighted 10%.
The third component scores the contribution of query terms co-occurring within unordered windows, and is weighted 10%.
(A small code sketch that builds such a query string is given at the end of this section.)

Latent Concept Expansion
To summarize, the MRF model is a linear feature-based retrieval model and an effective method of incorporating features based on term dependence to rank documents.
Latent concept expansion (LCE) supports pseudo-relevance feedback in the MRF framework.
Latent concept expansion can be viewed as a “feature expansion” technique, in that it enriches the original feature set by including new features based on the expanded query.
Latent, or hidden, concepts are words or phrases that users have in mind but do not mention explicitly when they express a query.
The latent concept expansion graph shows dependencies between query words and expansion words;
it gives better probability estimates for expansion terms;
and the expansion features are not just single terms.

Example 4. Pseudo-Relevance Feedback (PRF) Graphs
Relevance model – words occur independently, and the question marks are expansion words estimated using PRF.
LCE model – dependencies are represented between query words and expansion words. The estimated probabilities for expansion words use these dependencies and can produce multi-word phrases as expansion terms rather than only single words.
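
As mentioned in Example 3, here is a minimal Python sketch that builds a sequential dependence query string in Galago-style syntax. The 0.8/0.1/0.1 weights follow the 80%/10%/10% weighting above; the #od:1 and #uw:8 window settings are common defaults and are assumptions here, not taken from the slides.

def sequential_dependence_query(query, w_terms=0.8, w_ordered=0.1, w_unordered=0.1):
    """Build a Galago-style sequential dependence query string from a keyword query."""
    terms = query.split()
    pairs = list(zip(terms, terms[1:]))  # adjacent query term pairs
    unigrams = "#combine(" + " ".join(terms) + ")"
    ordered = "#combine(" + " ".join(f"#od:1({a} {b})" for a, b in pairs) + ")"
    unordered = "#combine(" + " ".join(f"#uw:8({a} {b})" for a, b in pairs) + ")"
    return f"#weight({w_terms} {unigrams} {w_ordered} {ordered} {w_unordered} {unordered})"

print(sequential_dependence_query("president abraham lincoln"))
# prints (as a single line):
# #weight(0.8 #combine(president abraham lincoln) 0.1 #combine(#od:1(president abraham) #od:1(abraham lincoln)) 0.1 #combine(#uw:8(president abraham) #uw:8(abraham lincoln)))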
4. Topic Modeling for Latent Topics
Improved representations of documents can also be viewed as improved smoothing techniques:
they improve the estimates for words that are related to the topic(s) of the document, instead of just using background probabilities.
Approaches
Latent Semantic Indexing (LSI)
Probabilistic Latent Semantic Indexing (pLSI), also called Probabilistic Latent Semantic Analysis (PLSA)
Latent Dirichlet Allocation (LDA) – models a document as being generated from a mixture of topics

Probabilistic Latent Semantic Analysis (PLSA)
PLSA models the probability of a co-occurrence (w, d) of a word and a document as a mixture of conditionally independent multinomial distributions:

P(d, w) = P(d) P(w|d),  where  P(w|d) = Σc P(w|c) P(c|d),
or equivalently the symmetric formulation  P(d, w) = Σc P(c) P(d|c) P(w|c)

where c is a latent topic; w and d are both generated from the latent topic c in similar ways (using the conditional probabilities P(d|c) and P(w|c)); M is the number of documents and N is the length of d.
Model fitting is done with Expectation Maximization (EM).
By Bayes’ rule, P(d|c) = P(d)P(c|d)/P(c).

Latent Dirichlet Allocation (LDA) – plate notation
D is a corpus consisting of M documents, each of length Ni.
α is the Dirichlet prior on the per-document topic distributions.
β is the Dirichlet prior on the per-topic word distributions.
θi is the multinomial distribution over topics in document i (i.e., the topic distribution).
zij is the topic for the jth word in document i, and wij is the specific word.
φk is the multinomial distribution over words for topic k (i.e., the topic representation).
K denotes the number of topics considered in the model.
Document i is represented by a topic distribution θi.
Collection D is represented by a set of topics, each of which is represented by a topic representation φk for topic k.
LDA generates word-topic assignments for each document based on θi and φk during its sampling process for iteratively estimating θi and φk.
The topic distribution θi indicates which topics are important for a particular document i.
The topic representation φk indicates which words are important to a topic k at the collection level.
Φ is a K×V matrix (V is the size of the vocabulary), a Markov (row-stochastic) matrix each row of which denotes the word distribution of a topic.

LDA – the generative process
Choose θi ~ Dir(α), the Dirichlet distribution (a prior distribution in Bayesian statistics) with parameter α
Choose φk ~ Dir(β)
For each of the word positions i (document), j (word in document i) {
    choose a topic zi,j ~ Multinomial(θi)   // the multinomial is a generalization of the binomial distribution
    choose a word wi,j ~ Multinomial(φzi,j)
}

Table 1. A possible result of LDA: word-topic assignments

LDA – term probability
LDA gives language model probabilities for each word in a document:

Plda(w|D) = Σk P(w | topic k) P(topic k | D) = Σk φk,w θD,k

These can be used to smooth the document representation by mixing them with the query likelihood probability as follows:

P(w|D) = λ (fw,D + μ P(w|C)) / (|D| + μ) + (1 − λ) Plda(w|D)
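
A minimal Python sketch of the smoothed estimate above; the topic distributions, counts, λ, and μ values are made-up illustrative numbers.

def lda_term_probability(word, theta_d, phi):
    """P_lda(w|D) = sum over topics k of theta_D[k] * phi[k][w]."""
    return sum(theta_d[k] * phi[k].get(word, 0.0) for k in theta_d)

def smoothed_term_probability(word, f_wD, doc_len, p_wC, theta_d, phi, lam=0.7, mu=1000):
    """Mix the Dirichlet-smoothed query likelihood estimate with the LDA estimate."""
    query_likelihood = (f_wD + mu * p_wC) / (doc_len + mu)
    return lam * query_likelihood + (1 - lam) * lda_term_probability(word, theta_d, phi)

# Hypothetical two-topic model for one document.
theta_d = {0: 0.8, 1: 0.2}                       # topic distribution theta_D
phi = {0: {"lincoln": 0.02, "president": 0.03},  # topic representations phi_k
       1: {"lincoln": 0.001, "car": 0.04}}
print(smoothed_term_probability("lincoln", f_wD=3, doc_len=250, p_wC=0.0005,
                                theta_d=theta_d, phi=phi))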
If the LDA probabilities are used directly as the document representation, effectiveness is significantly reduced because the features are too smoothed:
e.g., in a typical TREC experiment, only 400 topics are used for the entire collection;
generating LDA topics is expensive.
When used for smoothing, effectiveness is improved.

LDA Example
Top words from 4 LDA topics from TREC news.

Topic Modeling with Scikit Learn
https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730
Documents: 20 Newsgroups dataset

# Imports and the term-frequency matrix tf are added here so the snippet runs; no_topics = 20 is illustrative.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
no_topics = 20
tf = CountVectorizer(max_df=0.95, min_df=2, stop_words='english').fit_transform(documents)
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online',
                                learning_offset=50., random_state=0).fit(tf)

Pattern Enhanced LDA
Pattern-based representations contain structural information which can reveal the associations between words.
A post-LDA approach:
firstly, construct a new transactional dataset from the LDA model results of the document collection D;
secondly, generate pattern-based representations from the transactional dataset to represent user needs over the collection D.
Example 4. Topical document transactions (TDT): transactional datasets generated from Table 1, where duplicate words are removed.

5. Expert Search
Expert search is a method for finding people who have expertise in a particular area or topic.
It is a kind of entity search, which uses entity structures (e.g., entities such as people, organizations, and locations) to provide a ranked list of entities in response to a query.

Entity Search
Identify entities in text, as we discussed for information extraction.
Entity representation
Construct “pseudo-documents” to represent entities, based on the words occurring near the entity over the whole corpus;
these are also called “context vectors”.
For example, if the organization entity “california institute of technology” occurred 65 times in a corpus, every word within 20 words of those 65 occurrences would be accumulated into the pseudo-document representation.
Retrieve ranked lists of entities instead of documents.

Example 5. Entity Search (organization search based on a TREC news corpus)

Probabilistic Model for Expert Search
Let D be a set of documents and q be a query.
Rank candidate entities e by the joint distribution P(e, q) of entities and query terms:

P(e, q) = Σd∈D P(q|e, d) P(e|d) P(d)

The P(q|e, d) component involves ranking entities in those documents with respect to a query.
The P(e|d) component corresponds to finding documents that provide information about an entity.

Estimation of Probabilities for Expert Search
Assuming words and entities are independent in a given document d, the two components could be computed separately by using q and e as queries for probabilistic retrieval.
This normally leads to poor performance.
Instead, estimate the strength of association between e and q using the proximity of co-occurrence of the query words and the entities.

6. Word Embedding
Word embedding is a term used for the representation of words for text analysis and NLP.
It encodes the meaning of a word in the form of a real-valued vector, such that words that are closer in the vector space are expected to be similar in meaning.
For example, Word2vec embeddings in the Python package gensim:
the word2vec algorithm uses a neural network model to learn word associations from a large corpus of text;
the word2vec algorithms include the skip-gram and CBOW models, using either hierarchical softmax or negative sampling.
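
A minimal gensim sketch (assuming gensim 4.x); the toy corpus and vector_size=10 are illustrative, chosen to match the 10-dimensional example vector below.

from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens (illustrative only).
sentences = [["search", "engines", "rank", "documents"],
             ["machines", "learn", "word", "associations"],
             ["word", "embeddings", "encode", "meaning"]]

# sg=1 selects the skip-gram model (sg=0 would use CBOW).
model = Word2Vec(sentences, vector_size=10, window=5, min_count=1, sg=1, seed=1)

print(model.wv["machines"])               # the 10-dimensional vector for "machines"
print(model.wv.most_similar("machines"))  # nearest words in the embedding space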
E.g., the vector of “machines” = [-0.0366 -0.0750 0.1216 -0.0339 0.0837 0.0024 0.0983 0.0981 -0.1342 0.0736]

References
Chapters 7 & 11 of the textbook: W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Pearson, 2010.
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
T. Hofmann, Probabilistic Latent Semantic Analysis, arXiv preprint arXiv:1301.6705, 2013.
Y. Gao, Y. Xu, and Y. Li, Pattern-based Topics for Document Modelling in Information Filtering, IEEE Transactions on Knowledge & Data Engineering, 2015, vol. 27, no. 6, pp. 1629-1642.