4/16/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
p5-search-engine
EECS 485 Project 5: Search Engine
Due 11:59pm ET April 17, 2022. This is a group project to be completed in groups of two or three.
Copyright By PowCoder代写 加微信 powcoder
Change Log
Initial Release for W22
3/29: Updated test_style::test_pylint – no longer linting pipeline python files
3/29: Fixed typo in MapReduce Example to reflect how madoop deletes empty intermediate files
4/9: Fix typos in examples to match the correct doc IDs in each partition.
Introduction
A scalable search engine similar to Google or Bing.
The learning goals of this project are information retrieval concepts like PageRank and tf-idf, parallel data processing with MapReduce, and writing an end-to-end search engine.
Create a segmented inverted index of web pages using a pipeline of MapReduce programs. Then, wrap the inverted index in an Index server REST API service, which is segmented by document. Finally, provide a front-end user interface that communicates with the collection of Index segment servers and returns ranked search results.
Group registration
Register your group on the Autograder.
Project folder
Create a folder for this project (instructions). Your folder location might be different.
https://eecs485staff.github.io/p5-search-engine/
4/16/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
2 /Users/awdeorio/src/eecs485/p5-search-engine
2 /Users/awdeorio/src/eecs485/p5-search-engine
3 $ git status
4 On branch main
5 Your branch is up-to-date with ‘origin/main’.
7 nothing to commit, working tree clean
8 $ git remote -v
9 origin https://gitlab.eecs.umich.edu/awdeorio/p5-search-engine.git (fetch)
10 origin https://gitlab.eecs.umich.edu/awdeorio/p5-search-engine.git (push)
.gitignore
2 /Users/awdeorio/src/eecs485/p5-search-engine
3 $ head .gitignore
4 This is a sample .gitignore file that’s useful for EECS 485 projects.
env/bin/activate
https://eecs485staff.github.io/p5-search-engine/ 2/40
2 /Users/awdeorio/src/eecs485/p5-search-engine
3 $ls-denv
5 $ echo $VIRTUAL_ENV
6 /Users/awdeorio/src/eecs485/p5-search-engine/env
Version control
Set up version control using the Version control tutorial.
Be sure to check out the Version control for a team tutorial.
After you’re done, you should have a local repository with a “clean” status and your local repository should be connected to a remote GitLab repository.
You should have a file (instructions).
Python virtual environment
Create a Python virtual environment using the Python Virtual Environment Tutorial.
Check that you have a Python virtual environment, and that it’s activated (remember
4/16/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
Starter files
Download and unpack the starter files.
Move the starter files to your project directory and remove the original starter_files/ directory and tarball.
This project involves very few starter files. You have learned everything you need to build this project from (almost) scratch. Each major part of the spec will walk you through the expected directory structure. It’s up to you to move the starter files into the right places.
At the end of this project, your directory structure should look something like this: bin/ : Shell scripts. Init scripts and install scripts live here.
hadoop/ : MapReduce programs live here. You’ll write a pipeline of MapReduce programs whose output is an inverted index. See Inverted Index for more details.
index/ : Inverted Index REST API server. See Index server for more details.
search/ : Search server using server-side dynamic pages. See Search server for more details. tests/ : Public test cases.
2 /Users/awdeorio/src/eecs485/p5-search-engine
3 $ wget https://eecs485staff.github.io/p5-search-engine/starter_files.tar.gz
4 $ tar -xvzf starter_files.tar.gz
2 /Users/awdeorio/src/eecs485/p5-search-engine
3 $ mv starter_files/* .
4 $ rm -rf starter_files starter_files.tar.gz
2 /Users/awdeorio/src/eecs485/p5-search-engine
3 $ tree -I ‘env|__pycache__’
├── indexdb
├── install
└── search
10 ├── hadoop
├── hadoop-streaming-2.7.2.jar
├── inverted_index
│ ├── example_input
│ │ └── input.csv
s://eecs485staff.github.io/p5-search-engine/
/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
├── example_output
│ ├── part-00000
│ ├── part-00001
│ └── part-00002
│ └── input.csv ├── map0.py
├── map1.py ├──…
├── pipeline.sh ├── reduce0.py ├── reduce1.py ├──…
└── stopwords.txt
word_count
│ ├── input01.txt
│ └── input02.txt
├── map.py
└── reduce.py
│ ├── __init__.py │ ├──api
└── setup.py
requirements.txt
├── __init__.py └──*.py
inverted_index
├── inverted_index_0.txt ├── inverted_index_1.txt └── inverted_index_2.txt pagerank.out stopwords.txt
├── search
│ ├── __init__.py
│ ├── config.py
│ ├── model.py
│ │ └── index.sql
│ ├── static
│ │ └──css
│ │ └── style.css
s://eecs485staff.github.io/p5-search-engine/
https://eecs485staff.github.io/p5-search-engine/ 5/40
/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
Install Madoop
To execute the MapReduce programs in your Inverted Index Pipeline, use Madoop, EECS 485’s open-source, light weight MapReduce framework. Madoop has the same interface as Apache Hadoop, a large, scalable MapReduce framework.
Make sure your virtual environment is activated.
Install Madoop. Your version might be different.
If you want to play around with the real Hadoop, see Appendix: C: Installing Real Hadoop (optional). Install script
Installing all of the tool chain requires a lot of steps! Follow the Project 3 Tutorial and write a bash script bin/install to install your app. However, we need to make some small changes from your previous install script.
Unlike Project 3, this project does not have a client-side dynamic portion. As a result, you do not need to install nodeenv , chromedriver or any other client-side dynamic tools.
This project has a different folder structure, so the server installation instruction needs to change. You will also have two backend servers running for this project, so you will need to install two different python packages.
2 /Users/awdeorio/src/eecs485/p5-search-engine
3 $ echo $VIRTUAL_ENV
4 /Users/awdeorio/src/eecs485/p5-search-engine/env
1 $ pip install madoop
2 $ madoop –version
3 Madoop 0.4.0
64 │ └── setup.py
65 ├── var
66 │ └── index.sqlite3
tests └── …
└── *.html
├── __init__.py
4/16/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
Install back end
1 pip install -r requirements.txt
2 pip install -e index
3 pip install -e search
If running the install script on CAEN, pip install might throw PermissionDenied errors when updating packages. To allow installation and updates of packages using pip add the following lines before the first pip install command.
Tell pip to write to a different tmp directory.
You will also need to install Madoop. Install madoop
1 pip install madoop MapReduce Example
Next we will run a MapReduce word count program. See the Hadoop Streaming Tutorial for a detailed explanation. Your inverted index pipeline will consist of multiple MapReduce programs a lot like this example.
The example MapReduce program is in hadoop/word_count/ . The code is in map.py and reduce.py .
1 mkdir -p tmp
2 export TMPDIR=tmp
2 /Users/awdeorio/src/eecs485/p5-search-engine/hadoop/word_count 3 $tree
├── input01.txt
└── input02.txt
Take a look at the input files.
1 $ head input/input*
2 ==> input/input01.txt <==
3 Hello World
4 Bye World
s://eecs485staff.github.io/p5-search-engine/
/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
Run the MapReduce job. By default, there will be one mapper for each input file.
-input DIRECTORY input directory -output DIRECTORY output directory -mapper FILE mapper executable -reducer FILE reducer executable
If you’re using the real Hadoop, include jar hadoop/hadoop-streaming-2.7.2.jar . Madoop ignores this argument.
1 2 3 4 5 6 7 8 9
10 11 12 13
$ madoop \
-input input \
-output output \
-mapper ./map.py \
-reducer ./reduce.py
Starting map stage
INFO: Starting map stage
INFO: Finished map executions: 2
INFO: Starting group stage
INFO: Starting reduce stage
INFO: Finished reduce executions: 3
INFO: Output directory: output
Output directory: output
After the MapReduce job finishes, the output will be in the output directory. Each reducer produces a one part-XXXXX file.
6 ==> input/input02.txt <==
7 Hello Hadoop
8 Goodbye Hadoop
1 $ ls output/
2 part-00000 part-00001 part-00002
3 $ head output/part-*
4 ==> output/part-00000 <==
5 Goodbye 1
7 ==> output/part-00001 <==
9 Hadoop 2
10 World 2
12 ==> output/part-00002 <==
13 Hello 2
14 $ cat output/part-*
s://eecs485staff.github.io/p5-search-engine/
/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
Run the MapReduce job with the --verbose flag. This might come in handy for debugging.
15 Goodbye 1
17 Hadoop 2
18 World 2
19 Hello 2
1 $ madoop \
2 --verbose \
3 -input input \
4 -output output \
5 -mapper ./map.py \
6 -reducer ./reduce.py
7 DEBUG: tmpdir=/var/folders/17/knb89jy144sd874tp00636hr0000gn/T/madoop-hsa2mfki
8 DEBUG: input input/input01.txt size=22B partitions=1
9 DEBUG: partition input/input01.txt >> input/{part-00000}
10 DEBUG: input input/input02.txt size=28B partitions=1
11 DEBUG: partition input/input02.txt >> input/{part-00001}
12 DEBUG: total input size=50B
13 INFO: Starting map stage
14 DEBUG: map.py < input/part-00000 > mapper-output/part-00001
15 DEBUG: map.py < input/part-00001 > mapper-output/part-00002
16 INFO: Finished map executions: 2
17 INFO: Starting group stage
18 DEBUG: mapper-output/part-00001 unique_keys=3
19 DEBUG: mapper-output/part-00002 unique_keys=3
20 DEBUG: mapper-output all_unique_keys=5
21 DEBUG: partition mapper-output/part-00001 >> reducer-input/{part-00000,part-00001,part
22 DEBUG: partition mapper-output/part-00002 >> reducer-input/{part-00000,part-00001,part
23 DEBUG: sort reducer-input/part-00000
24 DEBUG: sort reducer-input/part-00001
25 DEBUG: sort reducer-input/part-00002
26 DEBUG: sort reducer-input/part-00003
27 DEBUG: reducer-input/part-00000 unique_keys=1
28 DEBUG: reducer-input/part-00002 unique_keys=3
29 DEBUG: reducer-input/part-00003 unique_keys=1
30 DEBUG: reducer-input all_unique_keys=5
31 INFO: Starting reduce stage
32 DEBUG: reduce.py < reducer-input/part-00000 >
33 DEBUG: reduce.py < reducer-input/part-00001 >
34 DEBUG: reduce.py < reducer-input/part-00002 >
35 DEBUG: reduce.py < reducer-input/part-00003 >
36 INFO: Finished reduce executions: 4
37 DEBUG: output/part-00000 size=10B
output/part-00000
output/part-00001
output/part-00002
output/part-00003
s://eecs485staff.github.io/p5-search-engine/
/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
Inverted Index Pipeline
Build an inverted index using a pipeline of MapReduce programs. Write each map and reduce program as a stand-alone Python program compatible with the Hadoop Streaming Interface. You’ll use Michigan Hadoop (madoop) to execute the MapReduce programs.
The input is a copy of Wikipedia pages (documents). The output is an inverted index containing inverse document frequency, term frequency, and document normalization factor. The output is several files segmented by document.
Before continuing, read the Hadoop Streaming Tutorial.
Directory structure
We’ve provided most of the files to get you started.
38 DEBUG: output/part-00001 size=0B
39 DEBUG: output/part-00002 size=23B
40 DEBUG: output/part-00003 size=8B
41 DEBUG: total output size=41B
42 INFO: Output directory: output
2 /Users/awdeorio/src/eecs485/p5-search-engine/hadoop
3 $ tree -I ‘env|__pycache__’
5 ├── hadoop-streaming-2.7.2.jar
6 ├── inverted_index
example_input
└── input.csv
example_output
├── part-00000
├── part-00001
└── part-00002
└── input.csv
map0.py # Write this
map1.py # Write this
pipeline.sh # Edit this
reduce0.py # Write this
reduce1.py # Write this
s://eecs485staff.github.io/p5-search-engine/
/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
25 26 27 28
└── stopwords.txt
word_count
│ ├── input01.txt
│ └── input02.txt
├── map.py
└── reduce.py
hadoop-streaming-2.7.2.jar
inverted_index
word_count
inverted_index
example_input example_output input/input.csv
map*.py reduce*.py pipeline.sh
stopwords.txt
word_count
hadoop/inverted_index/input/input.csv
“
https://eecs485staff.github.io/p5-search-engine/ 10/40
2 /Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index
3 $ grep ‘Gold mining in Alaska’ input/input.csv
4 “15322302”,”Gold mining in Alaska”,”Gold mining in Alaska, a state of the States, has
Pro-tip: Use the Python csv library and add the line csv.field_size_limit(sys.maxsize) ( doc_body is very large for some documents).
this file.
Within the directory:
: Only used if you’re running Apache Hadoop. Most won’t need
: Inverted Index MapReduce pipeline code, input, and output. : Sample MapReduce program, input, and output.
: A small input with correct output.
: Large input consisting of downloaded Wikipedia pages.
and : MapReduce programs that you will write
: MapReduce pipeline script that runs your MapReduce programs. We’ve provided a starter file with helpful comments.
: Words to ignore when building the inverted index. : Sample MapReduce program with small input
The input is a CSV file,
Each line is one document, and has 3 comma-separated values:
For example, the wiki article about Gold Mining in Alaska.
, representing a list of documents.
4/16/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
Follow these cleaning steps for one document. When you parse a query in the Index server, you’ll use the same procedure.
1. Combine both document title and document body by concatenating them, separated by a space.
2. Remove non-alphanumeric characters (that also aren’t spaces) like this:
3. The inverted index should be case insensitive. Convert upper case characters to lower case using casefold() .
4. Split the text into whitespace-delimited terms.
5. Remove stop words. We have provided a list of stop words in
hadoop/inverted_index/stopwords.txt .
The output is an inverted index, with one term on each line. For example, the term smelting from the instructor solution:
Term frequency
1 import re
2 text = re.sub(r”[^a-zA-Z0-9 ]+”, “”, text)
2 /Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index
3 $ grep ‘^smelting ‘ output/part-00000
4 smelting 1.5648920412154652 1097432 4 120183.00952255356 1103627 3 400677.2604829173 .
The elements in one line of the inverted index are separated by a space. The first item is the term, then the inverse document frequency. After that, items come in groups of 3, where each group represents the occurrence of the term in one document. For example:
Inverse document frequency
Document (ID)
Term frequency
Normalization factor
Document (ID)
smelting 1.564 11835570 1 5434.523 11936580 1
Pro-tip: Start with the small example input example_input/input.csv .
https://eecs485staff.github.io/p5-search-engine/
map.py doc_id % 3
doc_id % 3
cut -d’ ‘ -f1-5
2 /Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index
3 $ grep ‘^smelting ‘ output/part-* | cut -d’ ‘ -f1-5
4 output/part-00000:smelting 1.5648920412154652 1097432 4 120183.00952255356
5 output/part-00001:smelting 1.5648920412154652 11835570 1 5434.845639455667
6 output/part-00002:smelting 1.5648920412154652 12687106 1 27713.86561253472
1097432 % 3 == 2 11587145 % 3 ==
doc_id % 3 == 0
1 $ egrep ‘^(smelting|smeltingworks) ‘ output/part-00000 | cut -d’ ‘ -f1-5
2 smelting 1.5648920412154652 1097432 4 120183.00952255356
3 smeltingworks 3.514282047860378 11587145 1 19224.387102361983
4/16/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
Term appears in document Term appears in do One normalization factor applies to one document . The value stored in the inverted index omits
the square root. See the tf-idf calculation section for more details.
Terms must be in ascending sorted order, which means that the terms should be in alphabetical
Doc IDs must be in ascending sorted order. When a term appears in multiple documents, multiple doc IDs may appear on one line of an inverted index segment. Sort these doc IDs using string comparison, not numeric comparison, for example doc ID 1234 < doc ID 99. Pro-tip: The MapReduce grouper can do this for you.
The output inverted index is divided into several files, segmented by document. Specifically, 3 files partitioned by doc_id % 3 . To produce segments that match the instructor solution, the last
should output as the key.
The same term may appear in multiple segments when that term appears in several documents. This example shows only the first document on each line ( shows columns 1-5). Notice that appears in multiple segments.
Doc IDs with the same value appear in the same segment. For example, Doc IDs 1097432 and 11587145 are in the same segment because and
A small input might not have any occurrences of a term in a doc such that
example. That means your final MapReduce output may have less than 3 files for small inputs.
https://eecs485staff.github.io/p5-search-engine/
kjft jd |id| kift id kfdi kt
4/16/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
tf-idf calculation
Here’s a copy of the term frequency and inverse document frequency definitions.
tf-idf score for term in document Term
Frequency of term in document
Number of documents in collection Number of documents that contain term
*The normalization factor stored in the inverted index omits the square root. Use log base 10 in your calculations.
Your numbers may be slightly different due to floating point arithmetic. We compare values with a tolerance of 5%.
MapReduce pipeline
A MapReduce pipeline executes several MapReduce programs. The output of one MapReduce program is the input to the next MapReduce program.
We have provided hadoop/inverted_index/pipeline.sh , which has helpful comments for getting started.
The first MapReduce job counts the total number of documents in the collection. It should read input from input/ and write output to output0 .
The mapper and reducer executables should be named map0.py and reduce0.py . The output should be a single integer.
Inverse document frequency of term in collection
Normalization factor for one document over every term in that document*
https://eecs485staff.github.io/p5-search-engine/
kt )kfdi ∗ kift( 1=k∑√ = ki2w 1=k∑√ = |id| id 2 t t
kift i id k kt kfdi ∗ kift = kiw
)kn/N(gol = kfdi
4/16/22, 9:51 PM EECS 485 Project 5: Search Engine | p5-search-engine
In , copy the output (e.g., . In later pipeline jobs, read the document count from total_document_count.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com