Week 3 Lecture Review and Questions
Professor Yuefeng Li
School of Computer Science, Queensland University of Technology (QUT)
1. Processing Text: From Words to Terms, and Text Statistics
The first step in text analysis is to manipulate the text in some way so that information retrieval or text analysis can be carried out. Text processing is a way to find useful index ‘terms’ or text features from words.
There are several issues to consider when you decide on the terms. The first is case sensitivity: people normally do not want a search function that can only do exact matching, and most search engines do not distinguish between uppercase and lowercase letters. The second is that not all words are of equal value in a search or text analysis. Usually, tokenization is used to split text into words. Some words, called stopwords, may be ignored entirely. We may also use stemming so that variants of a word (like “run” and “running”) are treated as having the same semantic meaning. Web documents, such as web pages, may also include formatting information (like bold or large text) and explicit structure (like HTML or XML tags), which can be taken into account when selecting terms.
Text processing can have a profound effect on the results of text analysis. These text processing techniques may be simple, but it can take a lot of time for data engineers to produce high-quality terms or text features. Understanding the statistical nature of text is fundamental to understanding the meaning of high-quality terms or text features.
Statistical models of word occurrences are very important in IR (Information Retrieval). One interesting observation is that the distribution of word frequencies is very skewed: a few words have very high frequencies, while many words have low frequencies.
Question 1. Please open a text or an XML file (e.g., 6146.xml) and represent it as a list of paragraphs or sentences, text. You may remove any non-relevant information (e.g., markup tags such as ‘<p>’ and ‘</p>’, and ‘\n’). After that, you need to find all terms and their frequencies (the number of occurrences in the file) if their length > 2, and represent them using a dictionary, doc; print the total number of terms (e.g., 137). Then print the top-10 terms in doc in descending order, e.g.,
[('the', 8), ('technical', 2), ('bounce', 2), ('said', 2), ('and', 2), ('not', 2), ('due', 2), ('rose', 2), ("argentina's", 2), ('argentine', 1)]
Finally, please plot the distribution of the top-10 terms (see Fig 1 for an example).
Fig 1. Example of distribution of top-k frequent words.
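A minimal sketch of one possible solution; the file name, the tag-stripping regex, and the use of matplotlib are assumptions to adapt to the actual workshop data:

import re
import matplotlib.pyplot as plt

# Read the file and drop markup tags and newlines (tag handling is assumed).
text = open('6146.xml', encoding='utf8').read()
text = re.sub(r'<[^>]+>', ' ', text).replace('\n', ' ')

# Find all terms of length > 2 and count their frequencies in a dictionary.
doc = {}
for term in re.findall(r"[a-z0-9'-]+", text.lower()):
    if len(term) > 2:
        doc[term] = doc.get(term, 0) + 1
print('Number of terms:', len(doc))   # e.g., 137

# Top-10 terms in descending order of frequency.
top10 = sorted(doc.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)

# Plot the distribution of the top-10 terms (cf. Fig 1).
words, freqs = zip(*top10)
plt.bar(words, freqs)
plt.xlabel('term')
plt.ylabel('frequency')
plt.show()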
For a very large data collection or corpus, Zipf's law is likely to hold, and it can be used to predict the relationship between the size of the vocabulary and the size of the corpus.
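Zipf's law says the frequency of the r-th most frequent word is roughly proportional to 1/r, so rank × frequency is approximately constant. A tiny sketch with synthetic counts (assumed values, chosen to follow the law) shows the pattern:

# Synthetic frequency counts that follow Zipf's law (frequency ~ c / rank).
freqs = [1000, 500, 333, 250, 200, 167, 143, 125, 111, 100]
for rank, f in enumerate(freqs, start=1):
    print(rank, f, rank * f)   # rank * frequency stays roughly constant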
Document Parsing
Document parsing is the process of recognizing the content and structure of text documents. The primary content of most documents is the words.
Recognizing each word occurrence in a text document is called tokenizing or lexical analysis. Apart from these words, there may be other types of content, such as metadata, images, graphics, code, tables or hyperlinks; where metadata is information about a document that is not part of the text content. Metadata content includes document attributes such as date and author, and, most importantly, the tags that are used by markup languages, such as HTML or XML.
The parser uses the tags and other metadata recognized in the document to interpret the document’s structure based on the syntax of the markup language (syntactic analysis) and to produce a representation of the document that includes both the structure and content.
Text pre-processing usually includes the following components (a combined code sketch follows this list):
• Tokenizing – the process of forming words from the sequence of characters in a document.
• Stopping – removing words that have little meaning apart from other words.
• Stemming – capturing the relationships between different variations of a word.
• Finding phrases or n-grams – these are useful in IR and text analysis.
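A minimal sketch of these components in sequence; the stopword list is a tiny illustrative sample, and NLTK's PorterStemmer is one possible stemmer (assuming the nltk package is installed):

import re
from nltk.stem import PorterStemmer   # assumes nltk is installed

stopwords = {'the', 'a', 'an', 'of', 'and', 'to', 'in', 'are'}   # tiny sample list
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenizing
    tokens = [t for t in tokens if t not in stopwords]    # stopping
    return [stemmer.stem(t) for t in tokens]              # stemming

terms = preprocess('Fishkeepers often keep tropical fish in aquariums.')
print(terms)
# n-grams are overlapping word sequences over the processed terms:
print(list(zip(terms, terms[1:])))   # bigrams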
Question 2. Which of the following is FALSE? And explain why it is FALSE.
(1) Stemming is a component of text processing that captures the relationships between different variations of a word.
(2) Stemming reduces the different forms of a word that occur because of inflection (e.g., plurals, tenses) or derivation (e.g., making a verb into a noun by adding the suffix -ation) to a common stem.
(3) In general, using a stemmer for search applications with English text produces a small but noticeable improvement in the quality of results.
(4) A dictionary-based stemmer uses a small program to decide whether two words are related, usually based on knowledge of word suffixes for a particular language.
Question 3. (N-grams)
Typically, n-grams are formed from overlapping sequences of words, i.e., by moving an n-word “window” one word at a time through a document. For example, bigrams are 2-word sequences, and trigrams are 3-word sequences.
The definition of tropical fish is described in the following document:
“Tropical fish are generally those found in aquatic environments around the world, including both freshwater and saltwater species. Fishkeepers often keep tropical fish in freshwater and saltwater aquariums.”
Please design a Python program to print all bigrams and trigrams of the above document that contain at least one of the highlighted key words (‘fish’, ‘tropical’, ‘freshwater’, ‘saltwater’, ‘aquariums’).
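A minimal sketch of one possible solution (the simple whitespace tokenization and punctuation stripping are assumptions):

text = ('Tropical fish are generally those found in aquatic environments '
        'around the world, including both freshwater and saltwater species. '
        'Fishkeepers often keep tropical fish in freshwater and saltwater aquariums.')
keywords = {'fish', 'tropical', 'freshwater', 'saltwater', 'aquariums'}

# Tokenize: lowercase and strip punctuation from each word.
words = [w.strip('.,') for w in text.lower().split()]

# Slide an n-word window one word at a time; keep n-grams with a key word.
for n in (2, 3):
    for i in range(len(words) - n + 1):
        ngram = words[i:i + n]
        if keywords & set(ngram):
            print(ngram)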
2. Information Extraction
Information extraction is a language technology that focuses on extracting structure from text, e.g., identifying noun phrases, titles, or even bolded text.
Most of the recent research in information extraction has been concerned with features that have specific semantic content, such as named entities, relationships, and events.
Named entity
A named entity is a word or sequence of words (such as people’s names, company or organization names, locations, time and date expressions, quantities, and monetary values) that is used to refer to something of interest in a particular application; the process of recognizing named entities and tagging them in text is sometimes called semantic annotation.
Example of tagging information
…, who lives at 10 Water Street, Springfield, MA, is a long-time collector of tropical fish.
…, is a long-time collector of tropical fish.
Two main approaches have been used to build named entity recognizers: rule based and statistical.
A rule-based recognizer uses one or more lexicons (lists of words and phrases) that categorize names, e.g., people’s names (given names, family names). In many cases, however, rules or patterns are used to verify an entity name or to find new entities that are not in the lists. For example, new person names could be recognized by rules such as a known title (e.g., “Mr.” or “Dr.”) followed by a capitalized word.
A statistical entity recognizer uses a probabilistic model of the words in and around an entity, e.g., the Hidden Markov Model (HMM) approach.
Hidden Markov Model
In natural language, words can have many different meanings, e.g.,
“Marathon” could be the name of a race or a location in Greece.
Human beings infer the meaning of a word from its context, that is, the words that surround it. For instance, in “Boston Marathon”, “Marathon” is preceded by “Boston”, so the text is almost certainly describing a race.
The context of a word can be described by modeling the generation of the sequence of words in a text as a process with the Markov property, meaning that the next word in the sequence depends on only a small number of the previous words. A Markov model describes such a process as a collection of states with transitions between them; each transition has an associated probability.
Question 4. (Markov chain)
Assume that when John is sad today, which isn’t very usual, he either goes for a run, gobbles down ice cream, or takes a nap the next day. The Markov chain depicted in the following state diagram has 3 possible states: “Sleep”, “Run”, and “Ice Cream”.
Fig 2. Markov chain example.
(1) According to the above diagram, if John spent a sad day sleeping, what is the probability that he will go for a run the next day?
(2) The transition matrix will be a 3 × 3 matrix. To represent the matrix simply, we can use the following variables, for example:
SS – Sleep→Sleep
SR – Sleep→Run
SI – Sleep→Ice Cream
Then the transition matrix for the states is represented as a list of lists in Python: [["SS","SR","SI"], ["RS","RR","RI"], ["IS","IR","II"]]
Please write the corresponding transition matrix for probabilities as a list of lists.
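A sketch of the expected shape of the answer in Python. The probability values below are placeholders only, since the state diagram is not reproduced here; read the real values off Fig 2, making sure each row sums to 1. The answer to (1) is simply the Sleep→Run entry:

states = ['Sleep', 'Run', 'Ice Cream']

transition_name = [['SS', 'SR', 'SI'],
                   ['RS', 'RR', 'RI'],
                   ['IS', 'IR', 'II']]

# Placeholder probabilities; replace with the values from Fig 2.
transition_prob = [[0.2, 0.6, 0.2],   # from Sleep
                   [0.1, 0.6, 0.3],   # from Run
                   [0.2, 0.7, 0.1]]   # from Ice Cream

# (1) P(Run tomorrow | Sleep today) is the 'SR' entry.
print(transition_prob[states.index('Sleep')][states.index('Run')])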
Although Markov models can be used to generate new sentences, they are more commonly used to recognize entities in a sentence.
For example, for the following sentence:
…, who lives at 10 Water Street, Springfield, MA, is a long-time collector of tropical fish.
Based on training data, the recognizer may find the most likely sequence of states for this sentence. The words associated with entity categories in that sequence would then be tagged.
The key aspect of the above approach to entity recognition is that the probabilities in the sentence model must be estimated from training data. We can generate training data that consists of text manually annotated with the correct entity tags.
From this training data, we can work out the probability of words associated with a given category (i.e., output probabilities), and the probability of transitions between categories.
Please note that we can only observe a sentence. The underlying states (entity categories, e.g., “location” or “person”) are hidden. Fig 3 shows a Hidden Markov Model (diagram). It assumes a patient has two states: “Healthy” and “Fever”. A doctor cannot see the (hidden) states directly, but the patient can tell the doctor that she/he is “normal”, “cold”, or “dizzy” (the observations).
Fig 3. An example of a Hidden Markov Model.
The observations (e.g., normal, cold, dizzy) along with a hidden state (e.g., healthy, fever) form a hidden Markov model.
More formally, we have the following inputs and output for finding the most likely sequence of states in an HMM: the inputs are the sequence of observations, the set of possible states, the initial state probabilities, the state transition probabilities, and the output (emission) probabilities; the output is the most likely sequence of hidden states (a sketch is given below).
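As a concrete illustration, here is a compact Viterbi sketch for the Healthy/Fever example; all probability values below are assumed for illustration and are not taken from Fig 3:

states = ('Healthy', 'Fever')
observations = ('normal', 'cold', 'dizzy')

start_p = {'Healthy': 0.6, 'Fever': 0.4}                    # assumed initial probs
trans_p = {'Healthy': {'Healthy': 0.7, 'Fever': 0.3},       # assumed transitions
           'Fever':   {'Healthy': 0.4, 'Fever': 0.6}}
emit_p  = {'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},   # assumed outputs
           'Fever':   {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}}

# V[t][s] = (best probability of any state path ending in s at step t, that path)
V = [{s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}]
for obs in observations[1:]:
    V.append({})
    for s in states:
        prob, path = max(
            (V[-2][p][0] * trans_p[p][s] * emit_p[s][obs], V[-2][p][1])
            for p in states)
        V[-1][s] = (prob, path + [s])

best_prob, best_path = max(V[-1].values())
print(best_path, best_prob)   # the most likely hidden state sequence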
3. Document Structure and Markup
For web search, queries usually do not refer to document structure or fields, but that does not mean that document structure is unimportant. Some parts (e.g., the main heading for the page, and the anchor text for links) of the structure of web pages, indicated by HTML markup, are useful. The document parser must recognize this structure and make it available for indexing.
Example: HTML source for a Wikipedia page
https://en.wikipedia.org/wiki/Tropical_fish
Tropical fish
Tropical fish include fish found in tropical environments around the world, including both freshwater and saltwater species. Fishkeepers often use the term tropical fish to refer only to those requiring fresh water, with saltwater tropical fish referred to as marine fish.
Tropical fish are popular aquarium fish, due to their often bright coloration. In freshwater fish, this coloration typically derives from iridescence, while saltwater fish are generally pigmented.
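A small sketch of how a parser can make this structure available for indexing, using Python's standard html.parser to pull out title, heading, and anchor text; the HTML snippet below is an assumption standing in for the real page source:

from html.parser import HTMLParser

class StructureParser(HTMLParser):
    """Collects title, heading, and anchor text while parsing HTML."""
    def __init__(self):
        super().__init__()
        self.fields = {'title': [], 'heading': [], 'anchor': []}
        self.current = None
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.current = 'title'
        elif tag in ('h1', 'h2', 'h3'):
            self.current = 'heading'
        elif tag == 'a':
            self.current = 'anchor'
    def handle_endtag(self, tag):
        self.current = None
    def handle_data(self, data):
        if self.current and data.strip():
            self.fields[self.current].append(data.strip())

html = ('<html><head><title>Tropical fish</title></head><body>'
        '<h1>Tropical fish</h1>'
        '<p>Fishkeepers often keep <a href="/wiki/Aquarium">aquarium</a> fish.</p>'
        '</body></html>')
parser = StructureParser()
parser.feed(html)
print(parser.fields)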