Search Engines
Text, Web And Media Analytics
Textual Data Pre-Processing
1. Processing Text
Text Statistics
Tokenizing
Stopping and stemming
Phrases and N-grams
2. Information Extraction
Named Entity
HMM – Hidden Markov Model
3. Document Structure and Markup
Hyperlinks
1. Processing Text
Converting documents to index terms
Matching the exact string of characters typed by the user is too restrictive
i.e., it doesn’t work very well in terms of effectiveness
Not all words are of equal value in a search
Sometimes not clear where words begin and end
Not even clear what a word is in some languages
e.g., Chinese, Korean
Text Statistics
Huge variety of words used in text but
Many statistical characteristics of word occurrences are predictable
e.g., distribution of word counts
Retrieval models and ranking algorithms depend heavily on statistical properties of words
e.g., important words occur often in documents but are not high frequency in collection
Zipf’s Law
Distribution of word frequencies is very skewed
a few words occur very often, many words hardly ever occur
e.g., two most common words (“the”, “of”) make up about 10% of all word occurrences in text documents
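A minimal sketch (standard library only, assuming a plain-text file named corpus.txt) that checks Zipf's law by printing rank × frequency for the most common words, which the law predicts to be roughly constant:
>>> import re, collections
>>> text = open('corpus.txt', encoding='utf-8').read().lower()
>>> counts = collections.Counter(re.findall(r'[a-z0-9]+', text))
>>> for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
...     print(rank, word, freq, rank * freq)   # rank * freq should stay roughly constant
...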
Zipf’s Law
Vocabulary Growth
As corpus grows, so does vocabulary size
Fewer new words when corpus is already large
Observed relationship (Heaps’ Law): v = k · n^β
where v is vocabulary size (number of unique words), n is the number of words in the corpus, and k, β are parameters that vary for each corpus (typical values are 10 ≤ k ≤ 100 and β ≈ 0.5)
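A minimal sketch of using Heaps' law to predict vocabulary size; the parameter values below are illustrative assumptions only, not fitted to any particular corpus:
>>> k, beta = 50, 0.5              # illustrative parameters (10 <= k <= 100, beta ~ 0.5)
>>> def heaps(n):
...     return k * n ** beta       # predicted number of unique words in an n-word corpus
...
>>> heaps(1_000_000)
50000.0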
Web Example
Heaps’ Law works with very large corpora
new words still occurring even after seeing 30 million documents
parameter values different than typical TREC values
New words come from a variety of sources
spelling errors, invented words (e.g. product, company names), code, other languages, email addresses, etc.
Search engines must deal with these large and growing vocabularies
Document Parsing
Document parsing involves the recognition of the content and structure of text documents.
Forming words from sequence of characters is called tokenizing.
Surprisingly complex in English, can be harder in other languages
Definition of Words in Early IR systems:
any sequence of alphanumeric characters of length 3 or more
terminated by a space or other special character
upper-case changed to lower-case
Tokenizing Example
“Bigcorp’s 2007 bi-annual report showed profits rose 10%.”
“bigcorp 2007 annual report showed profits rose”
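A minimal sketch of this early-IR tokenizer (alphanumeric sequences of length 3 or more, lower-cased), which reproduces the example above:
>>> import re
>>> def tokenize(text):
...     return [t.lower() for t in re.findall(r'\w+', text) if len(t) >= 3]
...
>>> tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%.")
['bigcorp', '2007', 'annual', 'report', 'showed', 'profits', 'rose']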
Too simple for search applications or even large-scale experiments
Why? Too much information lost
Small decisions in tokenizing can have major impact on effectiveness of some queries
Tokenizing Problems
Small words can be important in some queries, usually in combinations
xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II, VW (Volkswagen)
Both hyphenated and non-hyphenated forms of many words are common
Sometimes hyphen is not needed
e-bay, wal-mart, active-x, cd-rom, t-shirts
At other times, hyphens should be considered either as part of the word or a word separator
winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking
Tokenizing Problems cont.
Special characters are an important part of tags, URLs, code in documents
Capitalized words can have different meaning from lower case words
e.g., Bush
Apostrophes can be a part of a word, a part of a possessive, or just a mistake
rosie o’donnell, can’t, don’t, 80’s, 1890’s, men’s straw hats, master’s degree, england’s ten largest cities, shriner’s
Tokenizing Problems
Numbers can be important, including decimals
nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
Note: tokenizing steps for queries must be identical to steps for documents
Tokenizing Process
First step is to use parser (for a specific markup language, e.g. HTML) to identify appropriate parts of document to tokenize
Defer complex decisions to other components
word is any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lower-case
everything indexed
example: 92.3 → 92 3 but search finds documents with 92 and 3 adjacent
incorporate some rules to reduce dependence on query transformation components
Tokenizing Process cont.
Examples of rules used with TREC
Apostrophes in words ignored
o’connor → oconnor bob’s → bobs
Periods in abbreviations ignored
I.B.M. → ibm Ph.D. → ph d
>>> import string
>>> line = "The British Fashion Awards will be held on October 22 at London's Royal Albert Hall."
>>> line = line.replace("\n", "").replace("\r", "")
>>> line = line.translate(str.maketrans('', '', string.digits)).translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
>>> line
'The British Fashion Awards will be held on October  at London s Royal Albert Hall '
>>> words = line.split()
>>> words
['The', 'British', 'Fashion', 'Awards', 'will', 'be', 'held', 'on', 'October', 'at', 'London', 's', 'Royal', 'Albert', 'Hall']
>>> terms = [word.lower() for word in words if len(word) > 2]
>>> terms
['the', 'british', 'fashion', 'awards', 'will', 'held', 'october', 'london', 'royal', 'albert', 'hall']
Stopping
Function words (determiners, prepositions) have little meaning on their own and have high occurrence frequencies
Treated as stopwords (i.e. removed)
reduce index space, improve response time, improve effectiveness
Can be important in combinations
e.g., “to be or not to be”
Stopping cont.
Stopword list can be created from high-frequency words or based on a standard list
Lists are customized for applications, domains, and even parts of documents
e.g., “click” is a good stopword for anchor text
Best policy is to index all words in documents, make decisions about which words to use at query time
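A minimal sketch of stopword removal; the stopword list here is a tiny illustrative one (real systems use longer, application- and domain-specific lists, or query-time stopping as noted above):
>>> stopwords = {'the', 'a', 'an', 'of', 'to', 'and', 'in', 'on', 'at', 'will', 'be'}
>>> terms = ['the', 'british', 'fashion', 'awards', 'will', 'held', 'october', 'london', 'royal', 'albert', 'hall']
>>> [t for t in terms if t not in stopwords]
['british', 'fashion', 'awards', 'held', 'october', 'london', 'royal', 'albert', 'hall']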
Stemming
Many morphological variations of words
inflectional (plurals, tenses)
derivational (making verbs nouns etc.)
In most cases, these have the same or very similar meanings
Stemmers attempt to reduce morphological variations of words to a common stem
usually involves removing suffixes
Can be done at indexing time or as part of query processing (like stopwords)
Stemming cont.
Generally a small but significant effectiveness improvement
can be crucial for some languages
e.g., 5-10% improvement for English, up to 50% in Arabic
Words with the Arabic root ktb
Two basic types
Dictionary-based: uses lists of related words
Algorithmic: uses program to determine related words
Algorithmic stemmers
suffix-s: remove ‘s’ endings assuming plural
e.g., cats → cat, lakes → lake, wiis → wii
Many false negatives: related words are left with different stems (e.g., triangle / triangular)
Some false positives: unrelated words are reduced to the same stem (e.g., policy / police)
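A minimal sketch of the suffix-s stemmer described above; the length guard is only there to leave very short words alone:
>>> def suffix_s(word):
...     return word[:-1] if word.endswith('s') and len(word) > 3 else word
...
>>> [suffix_s(w) for w in ['cats', 'lakes', 'wiis', 'triangular', 'policy']]
['cat', 'lake', 'wii', 'triangular', 'policy']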
Porter Stemmer
Algorithmic stemmer used in IR experiments since the 70s
Consists of a series of rules designed to strip off the longest possible suffix at each step
Effective in TREC
Produces stems not words
Makes a number of errors and difficult to modify
>>> import os
>>> os.chdir('C:\\2020\\Python_code\\workshop_tutor\\wk_solutions')
>>> os.getcwd()
'C:\\2020\\Python_code\\workshop_tutor\\wk_solutions'
>>> from stemming.porter2 import stem
>>> stems = [stem(term) for term in terms]
>>> stems
['the', 'british', 'fashion', 'award', 'will', 'held', 'octob', 'london', 'royal', 'albert', 'hall']
Porter Stemmer
Example step (1 of 5)
Porter Stemmer
Porter2 stemmer addresses some of these issues
Approach has been used with other languages
Krovetz Stemmer
Hybrid algorithmic-dictionary approach
Word checked in dictionary
If present, either left alone or replaced with “exception”
If not present, word is checked for suffixes that could be removed
After removal, dictionary is checked again
Produces words not stems
Comparable effectiveness
Lower false positive rate, somewhat higher false negative
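A minimal sketch of the dictionary-check-then-strip idea described above; the tiny word list and suffix set are illustrative assumptions, not the actual stemmer's data:
>>> dictionary = {'calculus', 'fish', 'fishing'}          # illustrative word/exception list
>>> def hybrid_stem(word):
...     if word in dictionary:                            # present: leave it alone
...         return word
...     for suffix in ('ies', 'es', 's', 'ing', 'ed'):    # suffixes that could be removed
...         if word.endswith(suffix) and word[:-len(suffix)] in dictionary:
...             return word[:-len(suffix)]                # dictionary checked again after removal
...     return word
...
>>> hybrid_stem('fishes'), hybrid_stem('calculus')
('fish', 'calculus')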
Stemmer Comparison
Which stemmer is better?
Phrases and N-grams
Many queries are 2-3 word phrases
Phrases are
More precise than single words
e.g., documents containing “black sea” vs. two words “black” and “sea”
Less ambiguous
e.g., “big apple” vs. “apple”
Can be difficult for ranking when we use them
e.g., Given query “fishing supplies”, how do we score documents with
exact phrase many times, exact phrase just once, individual words in same sentence, same paragraph, whole document, variations on words?
Text processing issue – how are phrases recognized?
Three possible approaches:
Identify syntactic phrases using a part-of-speech (POS) tagger
Use word n-grams
Store word positions in indexes and use proximity operators in queries
POS Tagging
POS taggers use statistical models of text to predict syntactic tags of words
Example tags:
NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun), and MD (modal auxiliary, e.g., “can”, “will”).
Phrases can then be defined as simple noun groups, for example
POS Tagging Example
Python nltk
> pip3 install nltk
>>> import nltk
>>> from nltk.tokenize import word_tokenize
You may need to do
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
>>> text = word_tokenize("Document will describe marketing strategies carried out by U.S. companies for their agricultural chemicals, report predictions for market share of such chemicals, or report market statistics for agrochemicals, pesticide, herbicide, fungicide, insecticide, fertilizer, predicted sales, market share, stimulate demand, price cut, volume of sales.")
>>> text = [x for x in text if len(x) > 1]
>>> text
['Document', 'will', 'describe', 'marketing', 'strategies', 'carried', 'out', 'by', 'U.S.', 'companies', 'for', 'their', 'agricultural', 'chemicals', 'report', 'predictions', 'for', 'market', 'share', 'of', 'such', 'chemicals', 'or', 'report', 'market', 'statistics', 'for', 'agrochemicals', 'pesticide', 'herbicide', 'fungicide', 'insecticide', 'fertilizer', 'predicted', 'sales', 'market', 'share', 'stimulate', 'demand', 'price', 'cut', 'volume', 'of', 'sales']
>>> pos_results = nltk.pos_tag(text)
>>> pos_results
[('Document', 'NNP'), ('will', 'MD'), ('describe', 'VB'), ('marketing', 'NN'), ('strategies', 'NNS'), ('carried', 'VBD'), ('out', 'RP'), ('by', 'IN'), ('U.S.', 'NNP'), ('companies', 'NNS'), ('for', 'IN'), ('their', 'PRP$'), ('agricultural', 'JJ'), ('chemicals', 'NNS'), ('report', 'VBP'), ('predictions', 'NNS'), ('for', 'IN'), ('market', 'NN'), ('share', 'NN'), ('of', 'IN'), ('such', 'JJ'), ('chemicals', 'NNS'), ('or', 'CC'), ('report', 'NN'), ('market', 'NN'), ('statistics', 'NNS'), ('for', 'IN'), ('agrochemicals', 'NNS'), ('pesticide', 'JJ'), ('herbicide', 'NN'), ('fungicide', 'JJ'), ('insecticide', 'JJ'), ('fertilizer', 'NN'), ('predicted', 'VBD'), ('sales', 'NNS'), ('market', 'NN'), ('share', 'NN'), ('stimulate', 'JJ'), ('demand', 'NN'), ('price', 'NN'), ('cut', 'NN'), ('volume', 'NN'), ('of', 'IN'), ('sales', 'NNS')]
The POS tagger in the NLTK library outputs Penn Treebank tags; the full tag set can be listed with nltk.help.upenn_tagset().
Example Noun Phrases
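A minimal sketch of pulling out simple noun groups (maximal runs of consecutive NN* tags) from the pos_results produced above:
>>> noun_groups, current = [], []
>>> for word, tag in pos_results:
...     if tag.startswith('NN'):
...         current.append(word)
...     elif current:
...         noun_groups.append(' '.join(current))
...         current = []
...
>>> if current: noun_groups.append(' '.join(current))
...
>>> noun_groups[:5]
['Document', 'marketing strategies', 'U.S. companies', 'chemicals', 'predictions']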
Word N-Grams
POS tagging too slow for large collections
Simpler definition – phrase is any sequence of n words – known as n-grams
bigram: 2 word sequence, trigram: 3 word sequence, unigram: single words
N-grams also used at character level for applications such as OCR
N-grams typically formed from overlapping sequences of words
i.e. move n-word “window” one word at a time in document
>>> stems = ['the', 'british', 'fashion', 'award', 'will', 'held', 'octob', 'london', 'royal', 'albert', 'hall']
>>> bigrams = [stems[i] + ' ' + stems[i+1] for i in range(len(stems)-1)]
>>> bigrams
['the british', 'british fashion', 'fashion award', 'award will', 'will held', 'held octob', 'octob london', 'london royal', 'royal albert', 'albert hall']
>>> trigrams = [stems[i] + ' ' + stems[i+1] + ' ' + stems[i+2] for i in range(len(stems)-2)]
>>> trigrams
['the british fashion', 'british fashion award', 'fashion award will', 'award will held', 'will held octob', 'held octob london', 'octob london royal', 'london royal albert', 'royal albert hall']
Frequent n-grams are more likely to be meaningful phrases
N-grams form a Zipf distribution
Better fit than words alone
Could index all n-grams up to specified length
Much faster than POS tagging
Uses a lot of storage
e.g., document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
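A minimal sketch of generating and counting all word n-grams of length 2 ≤ n ≤ 5, continuing from the stems list above; a document of L words contains L − n + 1 n-grams of each length n, which is exactly the window arithmetic behind the 3,990 figure quoted above:
>>> from collections import Counter
>>> def ngrams(words, n):
...     return [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]
...
>>> counts = Counter(g for n in range(2, 6) for g in ngrams(stems, n))   # frequent n-grams suggest phrases
>>> len(counts)        # 10 + 9 + 8 + 7 n-grams from the 11 stems
34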
N-gram Examples from Several Disciplines
Google N-Grams
Web search engines index n-grams
Google sample:
Most frequent trigram in English is “all rights reserved”
How to select a small set of useful n-grams [2]
2. Information Extraction
Automatically extract structure from text
annotate document using tags to identify extracted structure
Named entity recognition
identify words that refer to something of interest in a particular application
e.g., people, companies, locations, dates, product names, prices, etc.
Named Entity Recognition
Example showing semantic annotation of text using XML tags
Information extraction also includes document structure and more complex features such as relationships and events
Approaches for Named Entity Recognition
Rule-based
Uses lexicons (lists of words and phrases) that categorize names
e.g., locations, peoples’ names, organizations, etc.
Rules also used to verify or find new entity names
e.g., “<number> <word> street” for addresses
“<street address>, <city>” or “in <city>” to verify city names
“<street address>, <city>, <state>” to find new city names
“<title> <name>” to find new names
Approaches for Named Entity Recognition cont.
Rules either developed manually by trial and error or using machine learning techniques
Statistical
uses a probabilistic model of the words in and around an entity
probabilities estimated using training data (manually annotated text)
Hidden Markov Model (HMM) is one approach
HMM for Extraction
Resolve ambiguity in a word using context (the words that surround it)
e.g., “marathon” is a location or a sporting event, “boston marathon” is a specific sporting event
Model context using a generative model of the sequence of words
Markov property: the next word in a sequence depends only on a small number of the previous words
HMM for Extraction cont.
Markov Model describes a process as a collection of states with transitions between them
each transition has a probability associated with it
next state depends only on current state and transition probabilities
Hidden Markov Model
each state has a set of possible outputs
outputs have probabilities
Examples of MM and HMM
The left is a MM to describe the possible states (categories) of a sentence. It can be extended to an HMM if each state is associated with a probability distribution over words (the output).
The right one is an HMM which assumes a patient has two states: “Healthy” and “Fever”. A doctor cannot see the (hidden) states directly; but the patient can tell the doctor that she/he is “normal”, “cold”, or “dizzy” (the observations).
HMM for Extraction
Could generate sentences with this model
To recognize named entities, find sequence of “labels” that give highest probability for the sentence
only the outputs (words) are visible or observed
states are “hidden”
Viterbi algorithm used for recognition
Inputs and output of HMM
The observation space O, e.g., O is a set of words.
The state space S, e.g., S contains entity categories such as “location”, “person”, “organization”, and “not-an-entity”, together with the initial probability pi of each state si.
Transition matrix A, where Aij is the probability of moving from state si to state sj.
Emission matrix B, where Bij is the probability of observing oj from state si.
A sequence of observations Y (elements of O, in order).
Output: the most likely hidden state sequence X.
Inputs and output of HMM cont.
Initial probabilities (one per state):
    states:         s1   s2   …   sK
    probabilities:  p1   p2   …   pK
Transition matrix A (K × K), where Aij is the probability of moving from state si to state sj:
          s1    s2    …    sK
    s1   A11   A12   …   A1K
    s2   A21   A22   …   A2K
    …
    sK   AK1   AK2   …   AKK
Emission matrix B (K × N), where Bij is the probability of observing oj from state si, over all observations o1 … oN:
          o1    o2    …    oN
    s1   B11   B12   …   B1N
    s2   B21   B22   …   B2N
    …
    sK   BK1   BK2   …   BKN
The input sequence Y is decoded against these parameters to produce the most likely hidden state sequence X.
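A minimal sketch of the Viterbi algorithm applied to the Healthy/Fever example above; the probability values are illustrative assumptions, not estimates from real data:

states = ['Healthy', 'Fever']
Y = ['normal', 'cold', 'dizzy']                         # observed input sequence
pi = {'Healthy': 0.6, 'Fever': 0.4}                     # initial probabilities
A = {'Healthy': {'Healthy': 0.7, 'Fever': 0.3},         # transition matrix
     'Fever':   {'Healthy': 0.4, 'Fever': 0.6}}
B = {'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},   # emission matrix
     'Fever':   {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}}

def viterbi(Y, states, pi, A, B):
    # V[t][s] = (probability of the best state path ending in s at time t, best previous state)
    V = [{s: (pi[s] * B[s][Y[0]], None) for s in states}]
    for t in range(1, len(Y)):
        V.append({s: max((V[t-1][r][0] * A[r][s] * B[s][Y[t]], r) for r in states)
                  for s in states})
    # backtrack from the most probable final state
    path = [max(V[-1], key=lambda s: V[-1][s][0])]
    for t in range(len(Y) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(Y, states, pi, A, B))   # ['Healthy', 'Healthy', 'Fever']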
Named Entity Recognition
Accurate recognition requires about 1M words of training data (1,500 news stories)
may be more expensive than developing rules for some applications
Both rule-based and statistical can achieve about 90% effectiveness for categories such as names, locations, organizations
others, such as product name, can be much worse
3. Document Structure and Markup
Some parts of documents are more important than others
Document parser recognizes structure using markup, such as HTML tags
Headers, anchor text, bolded text all likely to be important
Metadata can also be important
Links used for link analysis
HTML Example
HTML files show explicit structure: each field or element is indicated by a start tag (such as <h2>) and an optional end tag (e.g., </h2>).
Elements can have attributes (with values), given by attribute_name=”value” pairs.
The <head> element contains metadata that is not displayed by the browser, while the <body> element contains the content that is displayed.
The main heading is indicated by the <h2> tag, and terms that should be displayed in bold or italic are indicated by <b> or <i> tags.
Links are commonly used, such as <a href="...">fish</a>.
Hyperlinks
Links are a key component of the Web.
Important for navigation, but also for Web search
e.g., <a href="http://example.com">Example website</a>
“Example website” is the anchor text
“http://example.com” is the destination link
both are used by search engines
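A minimal sketch (standard library only) of extracting destination links and anchor text from HTML with html.parser; the sample HTML string is illustrative:
>>> from html.parser import HTMLParser
>>> class LinkParser(HTMLParser):
...     def __init__(self):
...         super().__init__()
...         self.links, self._href = [], None
...     def handle_starttag(self, tag, attrs):
...         if tag == 'a':
...             self._href = dict(attrs).get('href')    # remember the destination link
...     def handle_data(self, data):
...         if self._href is not None:
...             self.links.append((self._href, data.strip()))   # (destination, anchor text)
...             self._href = None
...
>>> p = LinkParser()
>>> p.feed('<a href="http://example.com">Example website</a>')
>>> p.links
[('http://example.com', 'Example website')]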
Anchor Text
Used as a description of the content of the destination page
i.e., collection of anchor text in all links pointing to a page used as an additional text field
Anchor text tends to be short, descriptive, and similar to query text
Retrieval experiments have shown that anchor text has significant impact on effectiveness for some types of queries
[1] W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Pearson, 2010 (Chapter 4).
[2] M. Albathan, Y. Li, and Y. Xu, Using extended random set to find specific patterns, in Proceedings of the 2014 IEEE/WIC/ACM International Conference on Web Intelligence (Vol. 2), Warsaw, Poland, 11–14 August 2014, pp. 30–37 (Paper Award, https://dl.acm.org/citation.cfm?id=2682826).
Indexing Text and Ranking Documents