Homework 4: Comparing Search Engine Ranking Algorithms Objectives:
o ExperienceusingSolr
o Investigatingrankingstrategies
Preparation
In this exercise you will use the Apache Solr software to import a set of web pages and investigate different ranking strategies. Solr can be installed on Unix or Windows machines. However, it is much easier to run Solr on a Unix computer. Therefore, we have provided instructions for installing an implementation of Unix, called Ubuntu, on a Windows computer. For details see
http://www-scf.usc.edu/~csci572/2019Spring/hw4/UbuntuVirtualBoxFinal.pdf
Once Ubuntu is successfully installed, or if you are using a Mac or some other Unix computer, you can then follow the instructions for installing Solr directly, which can be found here
http://lucene.apache.org/solr/quickstart.html
The above instructions are for solr-7.X.X.zip. You can either download the latest version of the file (8.0.x) from
http://lucene.apache.org/solr/mirrors-solr-latest-redir.html
or download the version used by our examples (solr-7.7.0.zip) from here
http://archive.apache.org/dist/lucene/solr/7.7.0/
Once Solr is installed you need to learn how to index the web pages that you saved. Instructions for doing this can be found here.
http://www-scf.usc.edu/~csci572/2019Spring/hw4/IndexingwithTIKAV3.pdf
[Note that, we will use the Solr in standalone mode for this assignment. You should start solr using command $ bin/solr start ]
Solr provides a simple user interface that you can use to explore your indexed web pages.
Description of the Exercise
Step 1. Now that you have some basic understanding about indexing using Solr. you can start indexing the web pages that are provided for you. You can download them from
https://drive.google.com/drive/u/0/folders/1EFkdy9cLjq0hNuC_bRQ0UO75djtvwrj7?usp=sharing
The news sites are: Yahoo News, The Guardian, and Reuters. You should download the folder for the news site that is assigned to you based on your USC ID.
1
Note: the website you use for HW#4 may be different from the one you used for HW#2
You should index only the news websites assigned to you.
USC ID ends with
News Sites to Crawl
NewsSite Name
Root URL
01~30
Yahoo News
yahoo1
https://news.yahoo.com/
31~60
The Guardian
guardian
https://www.theguardian.com/us
61~99
Reuters
reuters
https://www.reuters.com/
For each news website there is a folder that contains a set of html files downloaded from the news site and a mapping file for each news site. The mapping file contains a mapping between filenames (e.g.
) that is used to save the html webpages and their URL (e.g. ). You should download all the html files from the website to a
single folder and index these pages using Solr.
Once your set up is complete you need to have access to a web server that can deliver web pages and run scripts. Using this web server, you will create a web page with a text box which a user can retrieve and then enter a query. The user’s query will be processed by a program at your web server which formats the query and sends it to Solr. Solr will process the query and return some results in JSON format. A program on your web server will re-format the results and present them to the user as any search engine would do. Results should be clickable (i.e. open the actual web page on the internet). See ahead for a description of the queries you should use.
Below is a rough outline for how you could structure your solution to the exercise. All three elements: web browser, web server, and Solr would be located on your laptop. Your web server might be the Apache web server coupled with the PhP programming language. An alternative solution would be to use node.js as the server/programming component. In the case of node.js, the programming language is JavaScript. Whatever you use, your program would send the query to Solr and process the results returned by Solr.
0a1d42d7-8b2c-499c-b91a-a7184dc79356.html
https://www.guardiannews.com/
User’s web browser
Your web server Your Solr
query box
format query; send to Solr;
format results;
process query; return results;
1 Some news websites may produce fewer than 20,000 results, e.g. yahoo, which is not a problem
2
Solr server supports several clients (different languages). Clients use requests to ask Solr to do things like perform queries or index documents. Client applications can reach Solr by creating HTTP requests and parsing the HTTP responses. Client APIs encapsulate much of the work of sending requests and parsing responses, which makes it much easier to write client applications.
Clients use Solr’s five fundamental operations to work with Solr. The operations are query, index, delete, commit, and optimize. Queries are executed by creating a URL that contains all the query parameters. Solr examines the request URL, performs the query, and returns the results. The other operations are similar, although in certain cases the HTTP request is a POST operation and contains information beyond whatever is included in the request URL. An index operation, for example, may contain a document in the body of the request.
There are several client APIs available for Solr, refer https://wiki.apache.org/solr/IntegratingSolr. As an example, here we explain how to create a PHP client that accepts input from the user in a HTML form, and sends the request to the Solr server. After the Solr server processes the query, it returns the results which are parsed by the PHP program and formatted for display.
We are using the solr-php-client which is available here https://github.com/PTCInc/solr-php-client . Clone this repository on your computer in the folder where you are developing the user interface.
git clone https://github.com/PTCInc/solr-php-client.git
Note: Your User Interface should allow user to choose between the two searching algorithms. You can
implement it any way you want, radio buttons are one option.
Below is the sample code from the wiki of this repository.
search($query, 0, $limit);
}
catch (Exception $e)
{
} }
?>
response->numFound; $start = min(1, $total);
$end = min($limit, $total);
?>
-
response->docs as $doc) {
-
$value)
// in production you’d probably log or email this error to an admin
// and then show a special message to the user but for this example
// we’re going to show the full exception
die(“SEARCH EXCEPTION {$e->__toString()}
“);
4{ ?>
?>
In order to provide additional parameters to the Solr server, create an array of these parameters as shown below:
$additionalParameters = array(
‘fq’ => ‘a filtering query’,
‘facet’ => ‘true’,
// notice I use an array for a multi-valued parameter ‘facet.field’ => array(
‘field_1’,
‘field_2′
) );
$results = $solr->search($query, $start, $rows, $additionalParameters);
Step 2
In this step you are going to run a series of queries on the Solr index. The comparison now will be to use two different ranking algorithms.
Solr uses
Lucene to facilitate ranking. Lucene uses a combination of the Vector Space Model and the
Boolean model to determine how relevant a given document is to a user’s query. The vector space model
is based on term frequency. The Boolean model is first used to narrow down the documents that need to
be scored based on the use of Boolean logic in the query specification.
Solr permits us to change the ranking algorithm. This gives us an opportunity to use the PageRank
algorithm and see how the results differ. There are several ways to manipulate the ranking of a document
in Solr. Here, we explain the method which uses a field that refers to an External File that stores the
PageRank scores. This external file basically contains a mapping from a key field to the field value.
5
In order to create this External File you must first carry out the PageRank process on the set of web pages
assigned to you. There are several libraries available to help you compute the PageRank given a graph.
One of them is the NetworkX library –
https://pypi.python.org/pypi/networkx/
i.
a.
b.
c.
[We have provided detailed guidelines on how to compute PageRank in the appendix section.]
ii.
alpha=0.85, personalization=None, max_iter=30, tol=1e-06, nstart=None, weight=’weight’,dangling=None
This PageRank function takes a NetworkX graph (http://networkx.github.io/documentation/networkx-
1.10/reference/classes.digraph.html#networkx.DiGraph) as input and returns a dictionary of graph nodes
with corresponding PageRank scores. These are the steps you would need to perform:
Compute the incoming and outgoing links to the web pages, and create a NetworkX graph
Loop through each webpage copied to the directory and extract the outgoing links.
We have provided mapping files for each dataset, which contains mappings from
webpage files with encoded names to their true URLs. Use these mappings to filter out
URLs that are not present in the mapping file. This can arise if the URL is pointing to a
resource outside the news website domain or if it is not an HTML file.
Create a directed graph using the mapping file and extracted URLs. Vertices in the graph
are webpage files and detected edge between two files represent a link between two.
(i.e. There is an edge between a.html and b.html if a.html has an outgoing link for b.html)
Note: You can use an external library to extract links from download webpages. We have provided an
example on how to use JSoup in
http://www-scf.usc.edu/~csci572/2019Spring/hw4/
ExtractingLinks.pdf
Compute the PageRank for this graph and store this in a file in the format
Make sure the document id is the same as the one which is present in your index. In your Solr index, you
would typically have the filename as the id as shown below:
6
Once you have computed the PageRank scores, you need to place this file in the data folder of the core
you have defined. This can be found in the path solr-7.x.x/server/solr/core_name/. The name of the file
should be external_fieldname or external_fieldname.*. For this example the file could be
named external_pageRankFile or external_pageRankFile.txt.
The next step is to add the field in the managed-schema which refers to this score.
We are defining a field “pageRankFile” which is of the type “external”. The field name should be the suffix
after “external_” you provided for your file in the previous step. The keyField attribute defines the key
that will be defined in the external file. It is usually the unique key for the index. A defVal defines a default
value that will be used if there is no entry in the external file for a particular document.
The valType attribute specifies the actual type of values that will be found in the file. The type specified
must be either a float field type, so valid values for this attribute are pfloat, float or tfloat. This attribute
can be omitted.
Once the field has been defined, we need to make sure that when the index is reloaded, it is able to access
the rank file. In order to do that, there are some modifications required in the solrconfig.xml file. We need
to define eventListeners to reload the external file when either a searcher is loaded or when a new
searcher is started. The searcher is basically a Lucene class which enables searching across the Lucene
index. Define these listeners within the
Now reload the index, by going to the Solr Dashboard UI ->Core Admin and clicking on the “Reload”
button.
Now, you can run the queries and compare the results with and without PageRank scores. In the Solr
query view, there is a “sort” text field, where you can specify the field on which the results have to be
sorted along with the order (ascending, descending)
For example, in the below screenshot I have created an external file and added PageRank scores for a few
documents in the index.
“/home/solr-7.x.x/CrawlData/E0-001-086391282-0.html” is the id of the file that has been indexed. The
values indicate the PageRank scores.
The following screenshot consists of the results which uses Solr’s internal ranking algorithm based on tf-
idf for the default query “*:*”. For the sake of clarity, only the ids have been displayed.
7
The next screenshot is the result of the default query again, but with the ranking based on PageRank that
was defined in the external file.
8
Comparing the two results we can see that the order of the results has changed. The file ending with “E0-
001-079680609-8.html” is the first result in the PageRank scoring results, as we had defined a high
score(5.5) for it in the external file.
Once that is done you should run the same set of queries and save the results.
Step 3
Compare the results of the two ranking algorithms.
The Queries
All of the news sites that you have crawled divide their articles into the following set of categories: world, business, sports, tech, nation, politics, opinion, and travel (there may be other categories as well). You should use the queries below to test out the two ranking strategies.
9
Query
Venezuela
Senate
Democrats
Republicans
Patriot Movement
Oscar 2019
Channel
Wall
Appendix
The following section provides the summary of steps one can follow to compute PageRank for the downloaded files.
1) Iterate through each file in the directly that has all the downloaded html files to create the web graph structure.
In each file:
a) Extract outgoing links
b) Find the file name for each outgoing link using the provided map* files
c) Create edges between files that have links connecting them (You can either use the
edgelist or the adjacency list format when creating edges, see https://networkx.github.io/documentation/networkx-1.10/reference/readwrite.html)
Following is an example code that creates an edgelist from downloaded files.
10
Note: We have used the map* files to create the fileUrlMap and urlFileMap data structures.
2) Load the web graph to networkx
>> G = nx.read_edgelist(“edgeList.txt”, create_using=nx.DiGraph())
Note- provide absolute path for “edgeList.txt”
3) Compute PageRank
See: https://networkx.github.io/documentation/networkx- 1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html?highlight=pa gerank
11