Search Engines
Text, Web And Media Analytics
Week 7 – Lecture
Text mining and user information needs
Yuefeng Li | Professor
School of Computer Science
Queensland University of Technology
S Block, Level 10, Room S-1024, Gardens Point Campus
ph 3138 5212 | email
1. Concepts of Text Mining and Information Needs
2. Query based approaches for acquiring information needs
Stem classes
Spelling correction
Noisy channel model
3. Query Expansion
Association Measures
Context vectors
4. Relevance Feedback
5. Query Context and Personalization
6. Applications
Snippet Generation – information summary
Searching Advertisements
1. Concepts of Text Mining and Information Needs
Text mining (or text analytics, the application of data mining to textual data) is the process of deriving high-quality information from text (https://en.wikipedia.org/wiki/Text_mining), where
High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.
Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.
‘High quality’ in text mining usually refers to some combination of relevance, novelty, and interest.
Typical text mining tasks include Information filtering, text classification (categorization), clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
The steps for text mining
Define the problem and specific goals
need to acquire subject matter expertise sufficient to define the problem and the outcome in an appropriate manner (understand user information needs)
Identify the text that needs to be collected (data collection)
Represent the text
Extract features in text (high-quality information)
Text analysis
Reach an insight or recommendation
The end results of the analysis are to apply the output to the problem definition or expected goal or user information needs.
Example of high-quality information
Hotel on-line reviews:
The hotel has a great location, but it was a horrible experience! Only stayed here because it was the pre-accommodation choice for one of our tours.
Everyone was extremely warm and welcoming. We’ve stayed at some top-notch places and this is definitely in our top 2. Great for a romantic getaway.
This place is really over the top from a luxury standpoint and overall.
OPINION_UNIT The hotel has a great location
OPINION_UNIT but it was a horrible experience!
OPINION_UNIT Only stayed here because it was the pre-accommodation choice for one of our tours.
OPINION_UNIT Everyone was extremely warm and welcoming.
OPINION_UNIT We’ve stayed at some top notch places and this is definitely in our top 2.
OPINION_UNIT Great for a romantic getaway.
OPINION_UNIT This place is really over the top from a luxury standpoint and overall
Entities – Great Location = place ? (support from three reviewers).
Also, we can highlight negative aspects, such as “horrible experience”, or positive aspects, such as “Great for a romantic getaway”. So, understanding user information needs is very important for extracting high-quality information from text.
Information Needs
If something is relevant for a person in relation to a given topic, we might say that the person needs the information for that topic, i.e., has an information need on that topic.
For example, an information need is the underlying cause of the query that a person submits to a search engine.
We can describe it as the motivation for a person to search for information on the Web.
Sometimes called information problem to emphasize that an information need is generally related to a task.
Sometimes it is difficult for people to define exactly what their information need is since that information is a gap in their knowledge.
Information needs can be categorized using a variety of dimensions, e.g.,
number of relevant documents being sought
type of information that is needed
type of task that led to the requirement for information
2. Query based approaches for acquiring Information Needs
A query can represent very different information needs
May require different search techniques and ranking algorithms to produce the best rankings
A query can be a poor representation of the information need
Users may find it difficult to express their information needs, as they are encouraged to enter short queries both by the search engine interface and by the fact that long queries often don’t work well
User Interaction
It is used to discuss relevance in IR models.
To produce a query that is a better representation of the information need.
Interaction with the system occurs
during query formulation and reformulation
while browsing the result
Key aspect of effective retrieval
users can’t change ranking algorithm but can change results through interaction
helps refine description of information need
e.g., same initial query, different information needs
how does user describe what they don’t know?
ASK Hypothesis
Belkin et al (1982) proposed a model called Anomalous State of Knowledge (ASK)
ASK hypothesis:
difficult for people to define exactly what their information need is, because that information is a gap in their knowledge
Search algorithms should look for information that fills those gaps
Interesting ideas, but little practical impact (yet)
We discuss another popular idea in this lecture
Query Transformation and Refinement
Keyword Queries
Query languages in the past were designed for professional searchers (intermediaries)
Read the textbook, pages 188-189, for the following example, which uses a query language (e.g., wildcard operators and various forms of proximity operators) to specify the information need.
A wildcard operator is used to define the minimum string match required for a word to match the query. For example, NEGLECT! will match “neglected”, “neglects”, or just “neglect”.
A proximity operator is used to define constraints on the distance between words that are required for them to match the query.
One type of proximity constraint is adjacency. For example, the quotes around ”CHANNEL MARKER” specify that the two words must occur next to each other.
The more general window operator specifies a width (in words) of a text window that is allowed for the match. For example, /5 specifies that the words must occur within five words of each other.
Other typical proximity operators are sentence and paragraph proximity. For example, /P specifies that the words must occur in the same paragraph.
In this query language, if no constraint is specified, it is assumed to be a Boolean OR.
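For illustration, a hypothetical query combining these operators might be: NEGLECT! /5 “CHANNEL MARKER”, which matches any word beginning with “neglect” occurring within five words of the exact phrase “channel marker” (the exact syntax varies by system).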
Keyword Queries cont.
Simple, natural language queries were designed to enable everyone to search
Current search engines do not perform well (in general) with natural language queries
People trained (in effect) to use keywords
compare average of about 2.3 words/web query to average of 30 words/CQA query
Keyword selection is not always easy
query refinement techniques can help
Query-Based Stemming
Make decision about stemming at query time rather than during indexing (where the words in documents are stemmed during indexing, the words in the queries must also be stemmed)
improved flexibility, effectiveness.
Query is expanded using word variants
e.g., “rock climbing” expanded with “climb” (as a new long query “rock climbing climb”), not stemmed to “climb” (as a new short query “rock climb”).
Please note documents are not stemmed if you use query expansion.
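A minimal Python sketch of query-time expansion with word variants (the stem classes here are hypothetical, hand-built examples):

```python
# Expand a query with word variants at query time; documents stay unstemmed.
STEM_CLASSES = {
    "climbing": ["climb", "climbs", "climbed", "climber"],
    "rock": ["rocks"],
}

def expand_query(query):
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(v for v in STEM_CLASSES.get(t, []) if v not in expanded)
    return " ".join(expanded)

print(expand_query("rock climbing"))
# -> "rock climbing rocks climb climbs climbed climber"
```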
Stem Classes
A stem class is the group of words that will be transformed into the same stem by the stemming algorithm
generated by running stemmer on large corpus
e.g., Porter stemmer on TREC News
Stem Classes
As we know, stem classes are often too big and inaccurate.
They contain a number of errors; e.g.,
The words relating to “police” and “policy” should not be in the same class.
Other words are not errors, but may be used in different contexts; e.g., “banked” is more often used in discussions of flying and pool, but this stem class will add words that are more common in financial discussions.
The length of the lists is an issue if the stem classes are used to expand the query; e.g.,
Adding 22 words to a simple query will certainly negatively impact response time and, if not done properly using a synonym operator, could cause the search to fail.
Modify using Analysis of Word Co-occurrence
Assumption:
Word variants that could substitute for each other should co-occur often in documents
Modifying Stem Classes
Dice’s Coefficient is an example of a term association measure:
Dice(a, b) = 2·n_ab / (n_a + n_b), where n_x is the number of windows containing x
Two vertices are in the same connected component of a graph if there is a path between them
forms word clusters
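A minimal sketch of this modification step, assuming co-occurrence counts over text windows have already been collected (the threshold value is illustrative):

```python
from itertools import combinations

def dice(n_a, n_b, n_ab):
    # Dice's coefficient: 2 * n_ab / (n_a + n_b)
    return 2 * n_ab / (n_a + n_b) if (n_a + n_b) else 0.0

def split_stem_class(words, window_count, cooccur_count, threshold=0.2):
    # Connect word pairs whose association exceeds the threshold.
    adj = {w: set() for w in words}
    for a, b in combinations(words, 2):
        n_ab = cooccur_count.get((a, b), cooccur_count.get((b, a), 0))
        if dice(window_count[a], window_count[b], n_ab) >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    # Connected components of the graph form the new, smaller stem classes.
    seen, clusters = set(), []
    for w in words:
        if w in seen:
            continue
        stack, component = [w], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            component.append(u)
            stack.extend(adj[u] - seen)
        clusters.append(component)
    return clusters
```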
Example output of modification
Spell Checking
Important part of query processing
10-15% of all web queries have spelling errors
Errors include typical word processing errors but also many other simple types, e.g.
Spell Checking
Basic approach: suggest corrections for words not found in spelling dictionary
Suggestions found by comparing word to words in dictionary using similarity measure
Most common similarity measure is edit distance
number of operations required to transform one word into the other
Edit Distance
Damerau-Levenshtein distance
counts the minimum number of insertions, deletions, substitutions, or transpositions of single characters required
e.g., Damerau-Levenshtein distance 1
distance 2 (doceration -> decoration)
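A minimal sketch of the Damerau-Levenshtein calculation (dynamic programming over the two words):

```python
def damerau_levenshtein(s, t):
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution (or match)
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(s)][len(t)]

print(damerau_levenshtein("doceration", "decoration"))  # 2
```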
Edit Distance cont.
Number of techniques used to speed up calculation of edit distances
restrict to words starting with same character
restrict to words of same or similar length
restrict to words that sound the same
Last option uses a phonetic code to group words
e.g., Soundex, a type of phonetic encoding that was originally used for the problem of matching names in medical records.
Soundex Code
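A minimal sketch of a standard Soundex encoder (simplified; assumes the usual letter-to-digit mapping and ignores unusual inputs such as non-alphabetic characters):

```python
# Standard Soundex letter-to-digit codes; vowels and y map to "0",
# h and w are ignored entirely.
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word):
    word = word.lower()
    first = word[0].upper()
    encoded = [CODES.get(c, "0" if c in "aeiouy" else "") for c in word]
    result, prev = [], encoded[0]
    for code in encoded[1:]:
        if code and code != "0" and code != prev:
            result.append(code)        # keep new consonant codes only
        if code != "":
            prev = code                # h and w do not reset the previous code
    return (first + "".join(result) + "000")[:4]

print(soundex("Robert"), soundex("Jackson"))   # R163 J250
```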
Spelling Correction Issues
Ranking corrections
“Did you mean…” feature requires accurate ranking of possible corrections, e.g., present them in decreasing order of their frequency in the language.
Choosing right suggestion depends on context (other words)
e.g., lawers → lowers, lawyers, layers, lasers, lagers but trial lawers → trial lawyers
Run-on errors (where word boundaries are skipped or mistyped)
e.g., “mainscourcebank”
Missing spaces can be considered as another single character error.
Noisy Channel for Spelling Correction
It is a general framework that can address the issues of ranking, context, and run-on errors
The user chooses word w based on a probability distribution P(w)
called the language model
can capture context information, e.g., P(w1|w2)
The user tries to write word w, but the noisy channel causes the person to write word e instead, with a probability P(e|w)
called error model
represents information about the frequency of spelling errors; e.g., the probabilities for words (e) that are edit distance 1 away from the word w will be quite high.
Noisy Channel Model
Now the problem is that
The corrector (algorithm) observes an error e (golf curse) and tries to suggest a correct word w (course), i.e.,
the probability that the correct word is w given that we can see the person wrote e.
We need to estimate probability of correction
P(w|e) ∝ P(e|w)P(w) (by Bayes’ rule, dropping the constant P(e))
It is the product of the error model probability and the language model probability.
Estimate language model probability using context in query log or a collection
e.g., P(w) = λP(w) + (1 − λ)P(w|wp)
wp is the previous word; e.g.,
“fish tink”
“tank” and “think” both likely corrections, but P(tank|fish) > P(think|fish)
Noisy Channel Model cont.
Estimating error model probability P(e|w)
The simple approach: assume all words with same edit distance have same probability, only edit distance 1 and 2 considered.
More complex approach: incorporate estimates based on common typing errors.
These estimates are derived from large collections of text by finding many pairs of correctly and incorrectly spelled words.
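A minimal sketch of ranking corrections under the noisy channel model; the error-model and language-model probabilities below are hypothetical toy values, standing in for estimates from query logs and error corpora:

```python
# Rank candidate corrections for an observed word e by P(e|w) * P(w).
def rank_corrections(error, candidates, error_model, lm, prev_word=None, lam=0.5):
    scored = []
    for w in candidates:
        p_e_given_w = error_model.get((error, w), 1e-9)
        # Interpolated language model: lambda*P(w) + (1-lambda)*P(w|prev_word)
        p_w = lam * lm.get(w, 1e-9)
        if prev_word is not None:
            p_w += (1 - lam) * lm.get((prev_word, w), 1e-9)
        scored.append((p_e_given_w * p_w, w))
    return [w for _, w in sorted(scored, reverse=True)]

# Toy example for the query "fish tink".
error_model = {("tink", "tank"): 0.01, ("tink", "think"): 0.01}
lm = {"tank": 0.001, "think": 0.01,
      ("fish", "tank"): 0.04, ("fish", "think"): 0.0001}
print(rank_corrections("tink", ["tank", "think"], error_model, lm, prev_word="fish"))
# -> ['tank', 'think'] because P(tank|fish) dominates in context
```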
Spellcheck Process and Query Suggestion
Query Logs
Query logs are the best source of information about queries and related terms
short pieces of text and click data
e.g., most frequent words in queries containing “tropical fish” from MSN log:
stores, pictures, live, sale, types, clipart, blue, freshwater, aquarium, supplies
query suggestion based on finding similar queries
group based on click data
Some studies have shown that the error model is more important when the language model is based just on the collection of documents, rather than the query log.
Experiments with this spelling corrector show that the language model from the query log is the most important component in terms of correction accuracy.
3. Query Expansion
A variety of automatic or semi-automatic query expansion techniques have been developed
goal is to improve effectiveness by matching related terms
semi-automatic techniques require user interaction to select best expansion terms
The key to effective expansion is to choose words that are appropriate for the context, or topic, of the query.
For example, “aquarium” may be a good expansion term for “tank” in the query “tropical fish tanks”, but not appropriate for the query “armor for tanks”.
Query suggestion is a related technique
alternative queries, not necessarily more terms
Query Expansion cont.
Approaches usually based on an analysis of term co-occurrence
either in the entire document collection, a large collection of queries, or the top-ranked documents in a result list
query-based stemming also an expansion technique
Automatic expansion based on general thesaurus (e.g., WordNet https://wordnet.princeton.edu/ ) not effective
does not take context into account ( see the previous example for “aquarium”)
To address this problem, people use all the words in a query to find related words rather than expanding each word separately.
Term Association Measures
Dice’s Coefficient. For two words (or terms) a and b:
Dice(a, b) = 2·n_ab / (n_a + n_b), which is rank equivalent to n_ab / (n_a + n_b)
Mutual Information:
MIM(a, b) = log( P(a, b) / (P(a)·P(b)) )
P(a, b) is the probability that a and b occur in the same text window, P(a) is the probability that word a occurs in a text window, and n_a is the number of occurrences, so P(a) = n_a/N, where a can be a query or a word.
“Rank equivalent” means that the simplified formula produces the same ranking of terms as the original; dropping the log and constants, MIM is rank equivalent to n_ab / (n_a·n_b).
Term Association Measures cont.
Mutual Information measure favors low frequency terms
Expected Mutual Information Measure (EMIM) addresses this problem by weighting the mutual information value using the probability P(a, b).
We are interested in the case where both terms occur, giving the formula:
EMIM(a, b) = P(a, b) · log( N·n_ab / (n_a·n_b) )
Term Association (χ2) Measure
Pearson’s Chi-squared (χ2) measure
compares the number of co-occurrences of two words with the expected number of co-occurrences if the two words were independent
normalizes this comparison by the expected number:
χ²(a, b) = (n_ab − N·(n_a/N)·(n_b/N))² / (N·(n_a/N)·(n_b/N))
also a limited form, focused only on the case where both words co-occur
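A minimal Python sketch of these four measures, where n_a and n_b are the counts of windows (or documents) containing each word, n_ab the count containing both, and N the total number of windows:

```python
import math

def dice(n_a, n_b, n_ab):
    return 2 * n_ab / (n_a + n_b)

def mim(n_a, n_b, n_ab):
    # Rank-equivalent form of mutual information: n_ab / (n_a * n_b)
    return n_ab / (n_a * n_b)

def emim(n_a, n_b, n_ab, N):
    # Expected mutual information: P(a,b) * log(N * n_ab / (n_a * n_b))
    return (n_ab / N) * math.log(N * n_ab / (n_a * n_b)) if n_ab else 0.0

def chi_squared(n_a, n_b, n_ab, N):
    expected = N * (n_a / N) * (n_b / N)
    return (n_ab - expected) ** 2 / expected
```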
Association Measure Summary
Association Measure Example
Most strongly associated words for “tropical” in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
Association Measure Example
Most strongly associated words for “fish” in a collection of TREC news stories.
Association Measure Example
Most strongly associated words for “fish” in a collection of TREC news stories. Co-occurrence counts are measured in windows of 5 words.
Summary of Association Measures
In summary, both automatic and semi-automatic query expansion methods have been discussed.
Some term association measures are better than others, but term association based on single words does not produce good expansion terms, because it does not capture the context of the query.
The best way to capture query context is to use a query log, both to analyse word associations and to find similar queries based on clickthrough data.
E.g., associated words are of little use for expanding the query “tropical fish”
Expansion based on whole query words takes context into account, e.g., using Dice with term “tropical fish” gives the following highly associated words:
goldfish, reptile, aquarium, coral, frog, exotic, stripe, regent, pet, wet
If there is no query log available, the best alternatives are to use context vectors or Relevance Feedback (or Pseudo-Relevance Feedback), as described below.
Context vectors
Represent words by the words that co-occur with them (a virtual document)
e.g., top 35 most strongly associated words for “aquarium” (using Dice’s coefficient or word embedding techniques https://en.wikipedia.org/wiki/Word_embedding ):
Rank words (e.g., “aquarium”) for a query (e.g., “tropical fish”) by ranking context vectors (virtual documents) to decide expansion terms.
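A minimal sketch of this ranking, assuming context vectors (word → association weight) have already been built for each vocabulary word:

```python
import math

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * w for k, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_expansion_terms(query_terms, context_vectors, candidates):
    # The query's context vector combines the vectors of its words.
    query_vec = {}
    for t in query_terms:
        for w, weight in context_vectors.get(t, {}).items():
            query_vec[w] = query_vec.get(w, 0.0) + weight
    scored = [(cosine(context_vectors.get(c, {}), query_vec), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]
```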
4. Relevance Feedback
The user identifies relevant (and maybe non-relevant) documents to form a training set.
The system modifies query or finding terms from those documents.
E.g., words that occur more frequently in the relevant documents than in the non-relevant documents, or in the collection as a whole, are added to the query or increased in weight.
Modifying the query is in fact equivalent to learning a classifier that distinguishes between relevant and nonrelevant documents.
but, the problem is very little training data!
The alternative technique is pseudo-relevance feedback, in which we assume top-ranked documents are relevant to the initial query:
where instead of asking the user to identify relevant documents, the system assumes that the top-ranked documents are relevant
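A minimal sketch of pseudo-relevance feedback over result snippets (the stopword list is a small placeholder; real systems use a much larger one):

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "for", "in", "on", "is"}

def expand_with_prf(query, top_snippets, k=5):
    # Assume the top-ranked snippets are relevant; add their most frequent
    # non-stopword, non-query terms to the query.
    counts = Counter()
    query_terms = set(query.lower().split())
    for snippet in top_snippets:
        for term in re.findall(r"[a-z]+", snippet.lower()):
            if term not in STOPWORDS and term not in query_terms:
                counts[term] += 1
    expansion = [t for t, _ in counts.most_common(k)]
    return query + " " + " ".join(expansion)
```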
Relevance Feedback Example
Top 10 documents
for “tropical fish”
Relevance Feedback Example
If we assume top 10 are relevant, most frequent terms are (with frequency):
a (926), td (535), href (495), http (357), width (345), com (343), nbsp (316), www (260), tr (239), htm (233), class (225), jpg (221)
too many stopwords and HTML expressions
Use only snippets and remove stopwords
tropical (26), fish (28), aquarium (8), freshwater (5), breeding (4), information (3), species (3), tank (2), Badman’s (2), page (2), hobby (2), forums (2)
Relevance Feedback Example
If document 7 (“Breeding tropical fish”) is explicitly indicated to be relevant, the most frequent terms are:
breeding (4), fish (4), tropical (4), marine (2), pond (2), coldwater (2), keeping (1), interested (1)
Specific weights and scoring methods used for relevance feedback depend on retrieval model
Relevance Feedback
Both relevance feedback and pseudo-relevance feedback are effective, but not used in many applications
pseudo-relevance feedback has reliability issues, especially with queries that don’t retrieve many relevant documents
Some applications use relevance feedback
filtering, “more like this”
Query suggestion more popular
may be less accurate, but can work if initial query fails
5. Query Context and Personalization
If a query has the same words as another query, results will be the same regardless of
who submitted the query
why the query was submitted
where the query was submitted
what other queries were submitted in the same session
All that matters is what words were used to describe the query.
The other factors (query contexts, e.g., user profiles, privacy issues, location) could have a significant impact on relevance
But it is difficult to incorporate into ranking
User Models
Much research has been done, in particular, on learning user models or profiles to represent a person’s interests so that a search can be personalized.
User models are used to generate user profiles based on documents that the person looks at
such as web pages visited (or clicked), email messages, or word processing documents on the desktop
Modify queries using words from user profiles
Generally, not effective.