Search Engine
Homework #5
DATA 1050 Fall 2020
Due 23:59 EST, Wednesday, October 28, 2020
Optional: Coding Problem 3 Resubmission due by 23:59 EST, Monday, October 19, 2020
Setup & Submission
1. Use this GitHub Assignment Invitation Link: https://classroom.github.com/a/Vbl7I9d9
2. Run pip install --user pytest pytest-timeout flake8 black nltk (or the equivalent conda install command) in a terminal to install all required packages.
3. Push to master as much as you want before the homework deadline.
4. If your final push will be late, remember to use the Late Day Form *twice*, once before the deadline, and once
again after your final push.
5. When you are ready to submit, visit gradescope and submit your repo. Your report should be included in the repo
as report.md
6. Be sure to also include a copy-and-paste of the test results and coverage information in coverage.md in each of
your stencil directories. Two students created a helpful utility script to do this for you.
7. For Coding Problem 3 Resubmission, please only include the coding problems you want to be
re-graded in a stencil_CodingProblems3 subdirectory of your hw5 repo and submit the entire repo via the corresponding gradescope assignment “Coding Problems3 Resubmission”
Overview
Search engines retrieve a list of documents (typically web pages) that match a given query. Since multiple documents may satisfy the query, most search engines also attempt to list more relevant documents first. In this project, you will use the TF-IDF and PageRank algorithms to create part of a web search engine.
Operationally, each searchable document must be analyzed in advance (in a process called indexing) to gather and integrate statistics about its content. This data is used to form an index (like the one found at the end of a book) that allows user search queries to be performed in real-time without the need to revisit all of the original documents.
Document term count information is enough to perform simple query matching since it allows one to find all documents that contain a particular set of terms. It also allows one to rank documents relative to one another based on their term frequencies. TF-IDF, otherwise known as term frequency–inverse document frequency, is one way to do this.
Web search engines use “spiders” to “crawl” the web in order to automatically visit and index web pages. At each page, a count of the number of times each word (“term”) appears in a document is recorded (i.e., the term histogram is recorded), a list of outbound links (outlinks) is collected, and then the spider moves on.
When present, link information defines a document graph, which can be used for document ranking. For example, documents that are referred to by many other documents may be more important than ones with very few incoming links (inlinks). Today, Google and most other web search engines use PageRank on link data to help determine web page importance.
The PageRank algorithm was created by Larry Page and Sergey Brin (the founders of Google) while they were still graduate students at Stanford. While Page’s original technical report suggested using it to rank web page importance based on link information, it has since been used to analyze many other networks. Identifying social media influencers (based on follower networks), and gauging the importance of scientific papers (via citation networks) are two examples.
A Note on Vocabulary
In this project handout,
● A “document” refers to a Wikipedia page
● A “term” refers to a word (or alphanumeric sequence containing no white space)
Details
We used the Wikipedia download interface to download some of its web pages. These documents (web pages) are stored as XML records. In order to create your search engine, you will write code to process these XML files.
In the world of search and natural language processing, a collection of texts is called a corpus. A corpus (plural corpora) is a collection of written texts (web pages with links in this case). Thus the corpora you will be working with are subsets of the pages of the English-language Wikipedia.
This project has four main parts, which are described in detail in the sections below:
1. Calculating TF-IDF
2. Calculating PageRanks
3. Processing the XML files
4. Search Engine and user interface
Note: We suggest you implement your solution project in the sequence listed above. TF-IDF and PageRank are both somewhat math-heavy, and it is important you get them working with artificial data for which you are sure of the answers before you start processing real data. Specifically, this data will be used to create analytic test cases, some of which we provide, and some of which you will provide. We encourage you to make git commits once a subcomponent is working. This makes it easier for you to revert to a previous version.
Proceeding in this order will also help the course staff know which parts of the problem you have completed if you have questions.
Stencil files provided:
● TFIDF.py, PageRank.py: Task: implement the TF-IDF and PageRank algorithms.
● processing.py: Task: index the Wikipedia corpus
● SearchEngine.py: Task: Implement an API that takes search queries and returns search results ranked by the
chosen algorithm
● app.py: A simple command-line interface (CLI) we supply that will allow you to use your search engine
interactively
● report.md: Task: Write your report here
Datasets provided:
(note that all datasets have been zipped together into wiki_data.zip. You’ll have to extract them into your stencil directory before you can proceed)
● MiniWiki.xml, SmallWiki.xml: XML files with 2 and 107 Wikipedia articles respectively. As you work on adding functionality to your search engine, we suggest you start by testing with these!
● MedWiki.xml: An XML containing 871 Wikipedia articles. You can try testing with this file once you get your code working with SmallWiki.
● PageWiki.xml: An XML file containing 100 documents with titles 1 to 100. Each document contains nothing except for a link to document 100. You can use this file to test your implementation of PageRank. Document 100 should have a much higher PageRank than the other documents.
Using VSCode
Jupyter notebooks are often used for visualization and analysis, and they work well for one-time coding tasks. In this project, and more generally in DATA 1050, we will not be coding in Jupyter notebooks but rather in Python files. To help with this, we recommend you try using the Visual Studio Code Integrated Development Environment (IDE) with the Microsoft Python Development extension.
Please read the VS Code Instructions document to learn how to do the following:
1. Start the VS Code IDE on data.jupyter.brown.edu
2. Install the Python Development Extension
3. Install and configure other helpful packages and extensions
4. Run the debugger
VS Code is an immensely powerful software development tool. Learning how to use its many features takes a long time. However, because it also provides integrated terminal window support, you can continue to use all of the command line based python development skills you already have.
Since this is the first time we have used VS Code in this course, we will update the VS Code Instructions as needed. We suggest you turn on notifications for changes to the VS Code Instructions document.
Coding Style Requirements
Using a consistent coding style is a critical part of code quality. For this project, make your style consistent with PEP 8, the style guide for Python developers from Guido himself. There are many tools for checking whether your code is styled correctly. You can periodically run flake8 *.py in a terminal under your code directory. You can also install a flake8 extension to give you PEP 8 style warnings as you code. Either way, resolve all style warnings you see to avoid receiving point deductions.
An auto-formatting tool may also come in handy. From the terminal you can run black -l 100 *.py to auto-format your code.
Part 5.1 TF-IDF
As the name suggests, TF-IDF has two parts: TF (term frequency) and IDF (inverse document frequency) . TF-IDF allows us to capture the following two ideas:
● A term that appears many times within a document is likely to be more relevant to that document than a term that appears fewer times (this is captured by TF). We calculate the term frequency as

TF(term, doc) = (# of times term appears in the document) / (total # of terms in the document) = P(term | doc)

● A term that occurs in fewer documents in the corpus is likely to be a better signal of relevance than a common term like “the” (this is captured by IDF). A rare term often tells more about the document. The inverse document frequency of a term is defined as

IDF(term) = log( (# of documents in the corpus) / (# of documents in which the term appears) )

Please use the natural log (which is what Python’s math.log does).
● The two are combined as TFIDF(term, doc) = TF(term, doc) × IDF(term)
Examples
Consider a document containing 100 terms wherein the term cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the term cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as ln(10,000,000 / 1,000) = 9.210. Thus, the tf-idf weight for the term cat in this document is the product of these quantities: 0.03 * 9.210 = 0.2763.
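To make the arithmetic concrete, here is a minimal standalone sketch (not the stencil's TFIDF class) that reproduces the cat example above:

```python
import math

def tf(term_count_in_doc, total_terms_in_doc):
    # TF(term, doc) = count of term in doc / total terms in doc
    return term_count_in_doc / total_terms_in_doc

def idf(num_docs_in_corpus, num_docs_containing_term):
    # IDF(term) = ln(corpus size / documents containing the term)
    return math.log(num_docs_in_corpus / num_docs_containing_term)

# The "cat" example: 3 occurrences in a 100-term document,
# and 1,000 of 10,000,000 documents contain "cat".
tf_cat = tf(3, 100)                # 0.03
idf_cat = idf(10_000_000, 1_000)   # ln(10,000) ≈ 9.210
print(tf_cat * idf_cat)            # ≈ 0.2763
```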
You may also find a more sophisticated example here. Complete the following tasks in TFIDF.py.
Task 5.1
● calc_term_frequency: For each document and each term, calculate the term frequency. Store the results in a dictionary that maps each term to a secondary dict; the secondary dict maps a document title doc to TF(term, doc), so indexing looks like result[term][doc] → frequency of that term in that document. Pay close attention to the docstring. See the provided function test_tfidf_basic for example inputs and outputs.
● calc_tf_idf: Iterate through the dictionary created for TF and multiply each term frequency by the term’s inverse document frequency. Note that the returned dictionary can also serve as a “reverse index” (i.e., given a term, you can use a single dictionary lookup to get the documents that contain that term). A rough sketch of these data structures appears after this list.
● search: Given a list of keywords, return a list of matching documents paired with their tf-idf scores. Each returned document should contain at least one of the keywords. The returned list should be sorted by document score, highest first. When a limit is specified, only return that many (doc, tf-idf-score) pairs (e.g., if limit = 10, you would only return the top 10 matches). Compute a document's tf-idf score by summing the tf-idf values of each query word found in the document:

score(doc, keywords) = ∑_{term ∈ keywords} TFIDF(term, doc)
● Include your pytest compatible tests in TFIDF.py.
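The following is a minimal standalone sketch of the data structures and scoring described above; it is only an illustration, and the stencil's actual method signatures and docstrings take precedence:

```python
import math

def calc_term_frequency(corpus):
    """corpus maps doc title -> list of processed terms.
    Returns tf[term][doc] = frequency of term in doc."""
    tf = {}
    for doc, terms in corpus.items():
        for term in terms:
            tf.setdefault(term, {}).setdefault(doc, 0.0)
            tf[term][doc] += 1 / len(terms)
    return tf

def calc_tf_idf(tf, num_docs):
    """Multiply each TF entry by the term's IDF; the result doubles as a reverse index."""
    return {
        term: {doc: freq * math.log(num_docs / len(doc_freqs))
               for doc, freq in doc_freqs.items()}
        for term, doc_freqs in tf.items()
    }

def search(tf_idf, keywords, limit=None):
    """Sum TF-IDF over the query words found in each document; sort by score, highest first."""
    scores = {}
    for term in keywords:
        for doc, value in tf_idf.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + value
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit] if limit is not None else ranked
```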
Notes
● Read carefully through the docstring (prose, parameter description). The docstrings most precisely specify the behavior of each method.
● Use test_tfidf_basic as a starting point to help you understand the expected input and output.
● Some methods are labeled as @staticmethod. This means they do not depend on instance or class data, notice they do not include a self variable. @staticmethods can be invoked without a class instance as illustrated in test_tfidf_basic. Their output does not depend on instance or class variables (i.e. they are regular stateless functions). This is sometimes helpful because each method can be tested separately. (For interested readers, see here for a discussion on the difference between @classmethods (which can depend on class
variables) and @staticmethods.)
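For illustration, here is a tiny self-contained example of the distinction (the names are made up, not from the stencil):

```python
class Example:
    base = 10  # class variable

    @staticmethod
    def add(a, b):
        # No self/cls: depends only on its arguments, so it can be tested in isolation.
        return a + b

    @classmethod
    def add_to_base(cls, a):
        # cls gives access to class variables such as `base`.
        return cls.base + a

print(Example.add(2, 3))        # 5, no instance needed
print(Example.add_to_base(2))   # 12
```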
Report Q&A: Ordering and Logs
Hw5.1 Your boss has been reviewing your code and wants you to do the following:
“store word frequencies the way you compute them”, i.e., s/he wants you to use a dict of dicts which is indexed via
{document: {word: word frequency}}
Think carefully about this suggestion and give an appropriate response.
Part 5.2 PageRank
So far we have scored documents based only on their relevance to a given query. This is how early search engines like AltaVista worked. But then, in the late 90s, the founders of Google came up with an idea for ranking documents based on their position in the web itself. Their PageRank algorithm gives an overall importance score, and it pays no attention whatsoever to the documents’ content. Rather, the rankings it computes are based entirely on the link structure between documents: i.e., which documents link to which. Documents with high scores (i.e., high PageRanks) have links from other documents that also have high ranks (which are in turn linked to by other documents with high ranks). This is a kind of crazy circular idea, but when converted into the appropriate mathematics the algorithm works quite well.
There are many variants of the PageRank algorithm. In this project, we will implement one of the most basic. Key principles underlying the design of PageRank:
1. The more documents that link to a page, the more authoritative it should be.
○ For example, if my personal blog post on “Providence Internet Outages” is linked to by 5 different web
pages (5 different incoming links, aka 5 inlinks), and the Wikipedia document on “Providence, Rhode Island” is linked to by 1000s of web pages (1000s of inlinks), then it makes sense to consider the Wikipedia document more authoritative.
○ Many early search engines used incoming link counts, and at first this worked well, but over time people realized they could promote a page by making many fake pages that pointed to it.
○ The PageRank innovation was to also consider the authority score of incoming pages: authoritative pages should themselves be linked to by other authoritative pages.
2. The ability of an authoritative page to promote other pages should be constrained by the total number of outgoing links it has; fewer outgoing links make each individual link a stronger reference.
○ Assume “Research at Brown” is linked to by an NYTimes document which links to only 10 other pages, and my blog “About Me” page is linked to by “Rhode Island Technology Bloggers”, which links to 2000 other pages. Because the “NYTimes” page has only a few links, and “Rhode Island Blogger” has many links, a link from the “NYTimes” page lends a larger fraction of its importance than a link from the “Rhode
Island Technology Bloggers”, leading us to attribute more authority to “Research at Brown” than “About Me”
Let u be a document, B_u be the set of documents that have a link to u, and L(u) be the number of outgoing edges from u. The PageRank score for document u is defined as

PageRank(u) = ∑_{v ∈ B_u} PageRank(v) / L(v)

The trouble with this equation is that it is circular; in order to calculate the weight of document u, we have to know the weights of all documents linking to u. How do we begin calculating the document weights? In practice, we use an iterative approach and stop once the document weights have converged. Here is a suggested approach to implementing PageRank:
● Create a dict weights. Initialize the weights for all documents to 1/N, where N is the number of documents in the corpus.
● While the maximum # of iterations has not been reached:
○ Create a dict new_weights with weights for all documents set to 0.
○ Iterate through each document x. For each document y that document x links to, update
new_weights[y] += weights[x] / (# of documents x links to)
○ Calculate the L2-distance between weights and new_weights. Break out of the loop if distance < eps * N.
○ Set weights = new_weights.
Note:
● L2-distance is Euclidean distance
● If you sum up the PageRank value for each document in a corpus, you should get 1.
● Links from a document to itself are ignored. Links to documents outside the corpus are ignored. You may
need to filter such links.
● Documents that link to nothing (i.e. sinks, like B below) are considered linked (once) to all documents including
itself. Therefore, each document receives (weight of this sink) / N.
● Multiple links from one document to another are treated as a single link.
Examples
Let’s say that we have a corpus consisting of the documents (a.k.a. pages) A, B, and C, where A links to B and C, and C links to A. The table below shows the results of running the PageRank algorithm on this corpus in order to determine the weight (a.k.a. rank) of each document. Remember that since B does not link to anything, we treat it as if it links to every document, including itself.
| Iteration | Weight of A | Weight of B | Weight of C |
| --- | --- | --- | --- |
| Formula | B/3 + C/1 | A/2 + B/3 | A/2 + B/3 |
| Initial | 1/3 = .333 | 1/3 = .333 | 1/3 = .333 |
| 1 | .333/3 + .333/1 = .444 | .333/2 + .333/3 = .277 | .333/2 + .333/3 = .277 |
| 2 | .277/3 + .277/1 = .370 | .444/2 + .277/3 = .314 | .444/2 + .277/3 = .314 |
| 3 | .314/3 + .314/1 = .419 | .370/2 + .314/3 = .290 | .370/2 + .314/3 = .290 |
| 4 | .290/3 + .290/1 = .386 | .419/2 + .290/3 = .306 | .419/2 + .290/3 = .306 |
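For concreteness, here is a minimal standalone sketch of the iterative procedure above (not the stencil's PageRank class; eps and the iteration cap are illustrative values), applied to the A/B/C corpus from the example, including the sink handling for B:

```python
def page_rank(links, eps=1e-8, max_iterations=100):
    """links maps each document to the set of documents it links to
    (self-links, duplicates, and out-of-corpus targets already removed)."""
    docs = list(links)
    n = len(docs)
    weights = {doc: 1 / n for doc in docs}

    for _ in range(max_iterations):
        new_weights = {doc: 0.0 for doc in docs}
        for x, outlinks in links.items():
            if not outlinks:                 # sink: treat as linking once to every document
                for y in docs:
                    new_weights[y] += weights[x] / n
            else:
                for y in outlinks:
                    new_weights[y] += weights[x] / len(outlinks)
        distance = sum((new_weights[d] - weights[d]) ** 2 for d in docs) ** 0.5
        weights = new_weights
        if distance < eps * n:               # converged
            break
    return weights

# The A/B/C example: A links to B and C, B is a sink, C links to A.
ranks = page_rank({"A": {"B", "C"}, "B": set(), "C": {"A"}})
print(ranks)   # converged weights (they sum to 1); intermediate iterations match the table above
```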
Task 5.2
● calc_page_rank: Given links between the documents, calculate the PageRank values for each page as specified above.
● search: Search for documents containing at least one of the query words, sort them by PageRank value, and return up to `limit` of them.
● Write unit tests for PageRank.py.
● Use test_page_rank_basic as a starting point to help you understand the expected input and output.
● Add file docstring and class docstring in PageRank.py and remove all TODOs.
Report Q&A: Page Rank Understanding
Hw5.2 There is a Markov Chain interpretation of the PageRank algorithm in which the PageRank value represents the frequency with which web surfers arrive at a specific web page by following hyperlinks found on other pages.
Included in the model is the idea that a user can also visit a page without clicking a link (e.g., they can visit via a bookmark or a typed URL). This random jumping process can be incorporated into the iterative solution via a damping factor d, giving the standard damped update

PR(u) = (1 - d)/N + d · ∑_{v ∈ B_u} PR(v) / L(v)

Explain this equation in terms of probability notation, where PR(X) is the probability of arriving at node X and L(X) is the number of outbound links from X. (I.e., what does d represent, what does (1 - d)/N represent, and what does the sum represent?)
If 32% of traffic to a webpage comes from non-inlink sources, what should the value of d be? See https://en.wikipedia.org/wiki/PageRank if you are stuck.
Part 5.3 Processing the XML file
TF-IDF is computed from word counts. Before counting, irrelevant data (like punctuation and word-spelling variations) must be removed; the TF-IDF calculation is then done with respect to this simplified corpus. PageRank requires all of the links in the corpus.
It is now time to process the Wikipedia dump XML to extract each type of data.
Introduction to XML files
The Wikipedia corpora used for this project are stored as XML files. XML stands for eXtensible Markup Language. It is similar to HTML in that elements are surrounded by matching opening and closing tags: an opening tag looks like <tag> and its matching closing tag looks like </tag>.
Open SmallWiki.xml in a text editor and get familiar with the structure of XML files.
Task 5.3.1 Processing Text
You should work with process_text in processing.py.
● Tokenizing
Tokenization refers to the process of splitting a string into tokens. For our purposes, tokens are sequences of characters with no punctuation or whitespace. See the docstring for more guidance.
Ex: `Are you there?` should become `Are`, `you`, `there`
● Stop Words
After tokenizing, you will need to remove stop words. Stop words are words that appear too often to be meaningful for searching (e.g. “the”, “of”). In building a search engine, it makes sense to discard stop words. Say you search the cat in the hat, the shorter query cat hat gets at the essence of the longer query. The Python Natural Language Toolkit NLTK package includes a list of typical English stopwords that you should filter out.
● Stemming
The next issue to overcome when building a search engine is that we want to count the stem of each word in each document rather than each of its variants: walking, walked, and walks all essentially mean the same thing, so we should reduce all of them down to walk. This is what stemming accomplishes. In the stencil code, we instantiate a PorterStemmer for you. Note that its stem() method also lowercases all letters, which is desirable behavior.
Here is a suggested approach for the three steps mentioned above.
1. Use sent_tokenize and word_tokenize from nltk.tokenize for tokenizing. If the first execution of these methods gives you an error, follow the printed instructions and fix it. Here are some examples.
2. Strip punctuation from both ends of each word; str.strip might be useful.
3. Stem words with the `stemmer` above.
4. Filter stop words and empty words. We’ve created a set of stop words for you using the NLTK toolkit in the _english_stop_words variable; please filter using it. Doing so will let you (and us) test out different sets of stop words.
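A minimal sketch of this four-step pipeline is shown below. It assumes the `stemmer` and `_english_stop_words` objects described above; the process_text docstring in the stencil is authoritative, and details (such as return type) may differ.

```python
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# The first run may require nltk.download("punkt") and nltk.download("stopwords").
_english_stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def process_text_sketch(text):
    """Tokenize, strip punctuation, stem, and drop stop/empty words."""
    tokens = []
    for sentence in sent_tokenize(text):
        for word in word_tokenize(sentence):
            word = word.strip(string.punctuation)   # strip punctuation at both ends
            if not word:
                continue
            stem = stemmer.stem(word)               # stem() also lowercases
            if stem not in _english_stop_words:
                tokens.append(stem)
    return tokens
```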
Use test_processing_basic as a starting point to help you understand the expected input and output.
Task 5.3.2 Processing Links
The extract_links method is used for this part.
In the XML files you will be working with, links can take the following two forms:
1. [[Impeached Presidents]]: This is a normal link. Impeached Presidents is the text displayed on the page, and if you click on it, you are directed to the Impeached Presidents Wikipedia page.
2. [[Washington|Presidents]]: The text that comes before the | character is the text displayed on the page (we will call this “display text”), but the page that is linked to is the text that comes after the |. In this case we would see Washington displayed on the page, but when we click on it, we would be redirected to the Presidents Wikipedia page.
○ Note: When calculating TF-IDF, Washington should be considered as a word on the page, but Presidents should not.
○ Note: If a link contains more than one | (e.g. [[File:Essendon logo 2010.png|center|150px]]), then do not treat it as a link. The words should also not be counted as words on the page.
To isolate links that appear in the text, it is helpful to use a regex (regular expression) . The Python package that provides regex functionality is called re, which has been imported at the top of the stencil code.
Here is a useful regex: r'\[\[[^\[]+?\]\]'
● Meaning: Match two left brackets (\[\[) and two right brackets (\]\]), making sure there’s something in the middle, but also making sure there is not a left bracket in the middle, which would mean that somehow another link was starting inside this link ([^\[]+?).
If you want to play around with regex functions, take a look at this page. Feel free to take a look at the documentation for more information!
The code for this part has been given to you, so no coding is required here; however, we expect you to fully understand it.
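For intuition only (the stencil already contains the real extract_links, which is what you should study and describe in your report), here is a simplified sketch of how the regex and the | rules from this section could be applied:

```python
import re

LINK_RE = re.compile(r"\[\[[^\[]+?\]\]")

def extract_links_sketch(text):
    """Return (linked page titles, words from the display text)."""
    links, display_words = [], []
    for match in LINK_RE.findall(text):
        inner = match[2:-2]                  # drop the surrounding [[ ]]
        parts = inner.split("|")
        if len(parts) == 1:                  # [[Impeached Presidents]]
            links.append(parts[0])
            display_words.extend(parts[0].split())
        elif len(parts) == 2:                # [[Washington|Presidents]]: link to Presidents
            display, target = parts
            links.append(target)
            display_words.extend(display.split())
        # More than one '|' (e.g. [[File:...|center|150px]]): not a link, words not counted.
    return links, display_words
```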
Report Q&A: extract_links
Hw5.3 In report.md, explain how extract_links operates and what it returns. In addition, what would be different if we did not return the second return value (the words extracted from display texts)?
Task 5.3.3 Working with XML files in Python
You should work with build_corpus in this part.
You will notice that at the top of processing.py, there is a line that says from xml.etree import ElementTree. This is a Python package that makes it easy to work with XML files. Take a look at the documentation to learn how to use this package.
You should iterate through all of the page elements in the XML file, one per document.
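Here is a rough sketch of the pattern, assuming each document appears as a page element with title and text children; these tag names are assumptions, so open SmallWiki.xml to confirm the actual structure and adjust accordingly.

```python
from xml.etree import ElementTree

def build_corpus_sketch(xml_path):
    """Map each document title to its raw text.
    The 'page', 'title', and 'text' tag names are assumptions; check the XML."""
    corpus = {}
    root = ElementTree.parse(xml_path).getroot()
    for page in root.iter("page"):           # one element per document (assumed tag name)
        title = page.find("title").text
        text = page.find("text").text or ""
        corpus[title] = text
    return corpus
```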
Use test_processing_basic as a starting point to help you understand the expected input and output.
Part 5.4 SearchEngine: Optimizing Search Results
class SearchEngine (located in SearchEngine.py) allows users to query the corpus with various ranking schemes.
Command Line Interface (CLI)
Once you have implemented enough SearchEngine functionality, you can run python app.py to start an interactive command-line interface for your search engine. The available commands will be displayed when you run it. (The TAs will also use the CLI to evaluate your implementation as part of the grading process.)
Tasks
● Read through __init__ and search. Make sure you understand them.
● Implement the ‘smart’ search mode in smart_search. You should take both TF-IDF and PageRank into
account. This is where you can get creative! (One possible combination is sketched after this list.)
Keep track of how you develop your smart search. You will need to explain your approach to implementing the ‘smart’ search mode in report.md. You will also include 2 sample searches you performed (and their results) that demonstrate that your approach returns reasonable results. In addition, include a screenshot in that section of the report of you using the CLI to query your smart engine.
● Add file, class and method docstring in SearchEngine.py and remove all TODOs.
● Be sure that the Command Line Interface (described above) is working for you, since it will also be used to grade
your project.
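As a starting point only (smart_search is intentionally open-ended, and the function name, weights, and inputs here are illustrative, not the stencil's), one simple way to combine the two signals is a weighted blend of normalized scores:

```python
def smart_search_sketch(tfidf_scores, page_ranks, alpha=0.8, limit=10):
    """tfidf_scores: dict doc -> summed TF-IDF for the query;
    page_ranks: dict doc -> PageRank value.
    Blend the two after normalizing each to [0, 1] so neither dominates."""
    def normalize(scores):
        top = max(scores.values(), default=0) or 1
        return {doc: value / top for doc, value in scores.items()}

    tfidf_norm = normalize(tfidf_scores)
    rank_norm = normalize(page_ranks)
    combined = {
        doc: alpha * tfidf_norm[doc] + (1 - alpha) * rank_norm.get(doc, 0.0)
        for doc in tfidf_norm            # only documents that matched the query
    }
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)[:limit]
```

Your own scheme can (and should) differ; the point is that query relevance and document importance contribute jointly to the ranking.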
Report Q&A: process_text
Hw5.4 In report.md, explain why process_text is also applied to the query string.
Report Q&A: Incremental Updates
Hw5.5 In report.md: Web spiders continuously find new web pages and revisit old ones. When a new page is added or an old page is revisited, an incremental update to the TF-IDF and PageRank data must be made. Explain how you might implement an incremental SearchEngine update() method.
Testing
Thoroughly testing your project should involve a combination of unit testing and integration testing. Use the type of testing that fits the situation. Include tests that confirm the calculations computed by your search engine are correct and that the other important pieces are functioning as you expect. Once you have a working search engine, also verify that search results are reasonable. You should test various queries, keeping in mind that your search engine should handle all inputs gracefully. We encourage you to create your own XML files for testing.
You will only receive credit for testing that is clearly documented in your report. Test coverage is not the only goal (but you should still check coverage). Tell us why we should be confident that your code survives any user input / corpus.
Report
Write your project report in the file called report.md. Include the following under each section heading:
# Search Engine Report
## Overview
Overview of how your code works
## Known Bugs
Description and example inputs and outputs for any known bugs.
## Report Questions & Answers
Hw5.1 Respond to your boss’s suggestion to “store word frequencies the way you compute them”. (See the original question above for more detail.)
Hw5.2 Explain the equation from the Hw5.2 question above in terms of probability notation, where PR(X) is the probability of arriving at node X and L(X) is the number of outbound links from X. (I.e., what does d represent, what does (1 - d)/N represent, and what does the sum represent?)
If 32% of traffic to a webpage comes from non-inlink sources, what should the value of d be?
Hw5.3 Explain how extract_links operates and what it returns. In addition, what would be different if we did not return the second return value (the words extracted from display texts)?
Hw5.4 Explain why process_text is also applied to query strings.
Hw5.5 Web spiders continuously find new web pages and revisit old web pages. When a new page is added or an old page is revisited an incremental update to the TF-IDF and PageRank must be made. Explain how you might implement an incremental SearchEngine update() method.
## Smart Search
Description of how you implemented the ‘smart’ search mode. Include at least two examples that show that it makes sense, and explains why it can potentially be better than just PageRank or TF-IDF.
## Testing
Description of how you tested your code, include details on what you did beyond running the basic tests included with the stencil.
Include screenshot from using CLI.
Rubric
● Coding Style: 5%
○ Coding style is a critical part of code quality. Run flake8 *.py in a terminal under your code directory.
Resolve all style warnings to avoid point deduction.
● TF-IDF: 25%
● PageRank: 25%
● SearchEngine & CLI (with Smart Search): 15%
● Processing XML file: 10%
● Report: 10%
● Testing: 10%
Special thanks to CS18 for letting us borrow from their course materials.