Module 4
This is a single, concatenated file, suitable for printing or saving as a PDF for offline viewing. Please note that some animations or images may not work.
Module Learning Objectives
This module introduces you to web mining, which involves extracting content from the internet.
After successfully completing this module, you will be able to do the following:
1. Explain how content is downloaded by web crawlers to aid in forming an index of information on the internet.
2. Extract information by scraping web pages. We will use two different types of R packages to do this. One type is designed to interact with a specific website whose page layout is known in advance. The other type can interact with any web page, but you will need to specify the web page properties you are looking for.
3. Compute search performance metrics such as precision, recall, and the F score.
4. Develop a web application using the R package “shiny” (you will NOT deploy the application to the internet) that scrapes from web pages and executes some basic text analytics.
Module 4 Study Guide and Deliverables
Readings:
Lecture material
Discussions:
Discussion 4 postings end Tuesday, October 6 at 6:00 AM ET
Assignments:
Assignment 4 due Tuesday, October 6 at 6:00 AM ET
Live Classroom:
• Tuesday, September 29 at 9:00 – 10:15 PM ET
• Thursday, October 1 at 9:00 – 10:15 PM ET
Background of Web Mining
Most of the information we use today is stored online. In fact, there are claims that the amount of data generated over the last two years is several times larger than the amount generated previously in the entire history of mankind. Most of this newly generated data consists of text, images, and videos in the form of email, instant messaging, Google, YouTube, Facebook, Twitter, blogs, and most of the other technologies that define our digital age. The question arises of how to efficiently manage email, look for information in documents stored locally on your computer or online (such as via searches on Google, Yahoo!, or Bing), read blogs, and so on. We frequently think first of web searches, but there are many other settings, such as email search, a corporate knowledge base, or legal-information retrieval. In fact, by some statistics, email consumes an average of 13 hours per week per worker, and information searches consume 8.8 hours per week. On top of that, new communication tools such as social networks, instant messaging, Yammer, Twitter, Facebook, and LinkedIn undermine workers' productivity, too. By some general estimates, a third of our time is spent searching for information, and another quarter is spent analyzing it. It is widely believed that even more data will be generated in the near future, and the time spent managing this data must be as productive as possible.
The techniques for web mining are similar to the ones used for text mining, with the exception of the use of a search engine. The search engine has the following architecture:
• Crawling subsystem
• Indexing subsystem
• Search interface
Large text databases emerged in the early 1990s and, with the rapid progress of communication and network technologies, most data is now located online. Thus web-mining technology applies to mining data in the following forms:
• Web pages
• A collection of SGML/XML documents
• Genome databases (e.g., GenBank, Protein Information Resource)
• Online dictionaries (e.g., Oxford English Dictionary)
• Emails or plain texts on a file system
For huge, heterogeneous, unstructured data, traditional data-mining technology cannot work!
Information retrieval refers to finding material (documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). A simple example is getting a credit card out of your wallet so that you can type in the card number. Information retrieval used to be performed in only a few professions, whereas nowadays hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. These are becoming the dominant forms of information access, overtaking traditional database-style searching, and they mostly involve unstructured data.
The field of web information retrieval supports users in getting or filtering document collections (through web crawlers or indexing) or further processing the set of retrieved documents based on their contents, such as by clustering.
The subdomains of web information retrieval break down as follows:
• Semistructured data: hyperlinks and HTML tags
• Multimedia data types: text, image, audio, video
• Content management/mining as well as usage/traffic mining
For example, imagine you’re writing an application that allows a user to enter a query to search all the files in his or her Dropbox account, on GitHub, and on his or her computer. In order to create this application, you need to figure out how to make these files available for searching. This is not a trivial task, since you may have thousands, maybe even millions, of files in a variety of types—Word documents, PDFs, or other text-based files such as HTML and XML. Some may be in a proprietary format that you are not familiar with. Having a text-processing tool that can automatically group similar items and present the results with summarizing labels is a good way to wade through large amounts of text or search results without having to read all, or even most, of the content.
Web Crawlers, Indexing
The process of web crawling is related to gathering pages from the web and indexing them to support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, including the link structure that interconnects them.
Web Crawlers
A web (or hypertext) crawler is an automated script that systematically browses the web. The process of web crawling is illustrated in Figure 4.1.
Figure 4.1

The various components of a web search engine. The web crawler is sometimes referred to as a spider. It can copy all of the pages it visits so that they can be indexed and processed more quickly by the search engine.
A web crawler fetches, analyzes, and files information from web servers, typically for the purpose of web indexing. Sometimes a web crawler is referred to as a spider. Web-crawling or spidering software is also used by web sites to update their web content or indexes. As illustrated in Figure 4.1, web crawlers can copy all of the pages they visit for later processing by a search engine, which indexes the downloaded pages so that users can search them much more quickly.
The basic operational steps of a hypertext crawler follow:
• Begin with one or more URLs that constitute a seed set
• Fetch the web page from the seed set
• Parse the fetched web page to extract the text and the links
◦ Extracted text is fed to a text indexer
◦ Extracted links (URLs) are added to the list of URLs whose corresponding pages have yet to be fetched by the crawler (the URL frontier)
• The visited URLs are deleted from the seed set
The entire process may be viewed as recursive traversal of a web graph in which each node is a URL, as illustrated on the left of Figure 4.1.
A professional web crawler needs a multithreaded design to be able to process a large number of web pages quickly (a high fetch rate). As an illustration of this speed, fetching a billion pages (a small fraction of the static web at present) in a month-long crawl requires fetching several hundred pages each second. That is why massively distributed parallel computing is needed for a professional web crawl.
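To make these steps concrete, here is a minimal single-threaded crawler sketch in R using the rvest and xml2 packages. This is illustrative code only (the seed URL and the page limit are arbitrary choices), and a real crawler would also respect robots.txt, politeness delays, and duplicate-content detection.
library(rvest) # read_html(), html_nodes(), html_attr(), html_text()
library(xml2)  # url_absolute()
seed <- "https://www.r-project.org/" # seed set with a single URL (arbitrary choice)
frontier <- seed                     # URLs whose pages have yet to be fetched
visited <- character(0)
pages <- list()                      # extracted text, to be fed to a text indexer
while (length(frontier) > 0 && length(visited) < 10) { # small page limit for the demo
  url <- frontier[1]
  frontier <- frontier[-1]
  if (url %in% visited) next
  page <- tryCatch(read_html(url), error = function(e) NULL) # fetch the page
  visited <- c(visited, url)
  if (is.null(page)) next
  pages[[url]] <- html_text(page)                         # parse: extracted text
  links <- html_attr(html_nodes(page, "a"), "href")       # parse: extracted links
  links <- url_absolute(links[!is.na(links)], url)        # resolve relative URLs
  links <- links[grepl("^https?://", links)]
  frontier <- unique(c(frontier, setdiff(links, visited))) # queue unseen URLs
}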
Test Yourself 4.1
At an average fetch rate of 200 web pages per second, using a single-threaded program on a typical everyday computer, how long would it take to crawl one billion pages?
About a month.
About two months.
About three months.
About four months.
Show Answer
Ranking
As can be noted in Figure 4.1, the interconnection of web pages can be represented as a graph. A graph contains nodes (web pages) and edges (links, or endorsements). A hyperlink from page A to page B represents an endorsement of page B by the creator of page A. The web is full of instances in which page B does not provide an accurate description of itself. Thus, there is often a gap between the terms in a web page and how web users would describe that web page. This affects the HTML parsing performed in crawling and may prevent the extraction of text that is useful for indexing these pages, which in turn degrades the ranking of the retrieved documents. A possible way to deal with this problem is to introduce scoring and ranking measures derived from the graph's link structure.
A typical way to represent any graph is by using the adjacency matrix (AM). This is a square matrix of a size determined by the number of nodes. Its entries are zeros and ones; the nonzero elements in the AM indicate the connected nodes on the graph. The AM contains the topology of the graph. More specifically, on directed graphs, rows give the out-degree of a node, and columns give the in-degree.
Related to this aspect of graphs is the interlinking of the web pages. Web pages with a large out-degree can be considered hub nodes, and web pages with a large in-degree are considered authority nodes; the hub nodes point toward the authority nodes. In this kind of classification of the web pages, having an AM representation of the web topology enables us to locate the hub and authority nodes easily. This is done by decomposing the AM using singular value decomposition (SVD), which was described in more detail in Module 3. The SVD provides the most important "concepts," or components, of the matrix. An important application of SVD is principal component analysis (PCA), which projects onto the "best" axis (the most important component).
AM = (US) V^T = PC ⋅ V^T
Related to this is Kleinberg’s algorithm, HITS (hyperlink-induced topic search), which, for a given set of web pages and a query, finds the most “authoritative” web pages for the query. The authority vector is given by the principal right singular vector of the AM (equivalently, the principal eigenvector of AM^T AM). The authority score can be used to find similar pages.
The page-rank algorithms, given a directed graph, find its most interesting/central node. A node is important if it is connected with other important nodes. Related to this is a Markov chain and the term “steady-state probability.” A node will have a high steady-state probability if it is connected with other high–steady-state probability nodes. Typically, this is determined by finding the eigenvalues and eigenvectors of the AM. The most interesting/central node is obtained from the eigenvector that corresponds to the highest eigenvalue.
Matrices and linear algebra provide an elegant way to implement these algorithms. Real data are often high dimensional with multiple aspects. SVD and PCA both transform the data into some abstract space (specified by a set of basis vectors), and this may cause problems in interpreting the procedure. Another approach is to use the CUR and CMD algorithms, which specify (sample) a subspace of the data.
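As a small numerical illustration of these ideas (not part of the lecture code), the sketch below builds a toy adjacency matrix for an invented four-page web graph, computes HITS-style hub and authority scores from the SVD of the AM, and computes a PageRank-style steady-state score from the dominant eigenvector of a damped random-surfer matrix. The graph, the page names, and the damping factor are arbitrary choices for illustration.
# Toy web graph: entry [i, j] = 1 means page i links to page j
AM <- matrix(c(0, 1, 1, 0,
               0, 0, 1, 0,
               1, 0, 0, 1,
               0, 0, 1, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("page", 1:4), paste0("page", 1:4)))
s <- svd(AM)                 # AM = U S V^T
hub <- abs(s$u[, 1])         # principal left singular vector ~ hub scores
authority <- abs(s$v[, 1])   # principal right singular vector ~ authority scores (HITS)
# PageRank-style steady-state probabilities with damping d (0.85 is a common choice)
d <- 0.85
P <- AM / rowSums(AM)              # row-stochastic transition matrix (no dangling nodes here)
G <- d * t(P) + (1 - d) / nrow(AM) # damped random-surfer matrix
pr <- Re(eigen(G)$vectors[, 1])    # eigenvector of the largest eigenvalue
pr <- pr / sum(pr)                 # normalize to a probability distribution
round(cbind(hub, authority, pagerank = pr), 3)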
In conclusion, at its most basic, a search can be described in four parts:
• Indexing – Files, web sites, and database records are processed to make them searchable.
• User Input – Users enter their information need through some form of user interface.
• Ranking – The search engine compares the query to the documents in the index and ranks documents according to how closely they match the query.
• Results Display – This is the big payoff for the user: The final results are displayed via a user interface, whether it’s at the command prompt, in a browser, or on a mobile phone.
A good understanding of the structure and type of content in a document collection is necessary for an optimal search. No algorithm is going to give better results than a search based on the understanding of what is important in a document. After a user has gained a preliminary understanding of the content, indexing is the process of making it searchable by analyzing that content.
As an illustration of this and the previous topics, let's consider the following example.
Example 4.3
Based on user input, a search is performed of an index of Wikipedia's web pages. Their text content (documents) is analyzed. Documents in the collection are filtered and grouped (ranking) by subject similarity. The results are displayed in a dendrogram.
The Shiny app looks like this:
Figure: Shiny App Example (Most Basic Web Search of Indexed Wikipedia Web Pages)

A Shiny app illustration of a most basic web search of indexed Wikipedia web pages. Based on a user selection, the ranked results are displayed. A distance measure between the documents' term vectors is used for ranking.
On the left side is a list of 11 topics that have Wikipedia pages. All of these topics are selected in this display. The user can deselect any topic by clicking it and pressing delete, and deselected topics can be reselected again. The R code behind this app downloads the HTML content of these Wikipedia pages, extracts the text content, preprocesses it, and forms the term-document matrix that is finally used for distance measurement to create the dendrogram displayed here.
The code for this app consists of WikiSearch.R, in addition to the two regular scripts, ui.R and server.R. These scripts are displayed below.
The script ui.R is as follows:
# ui.R
library(shiny)
titles <- c("Web_analytics", "Text_mining", "Integral", "Calculus",
"Lists_of_integrals", "Derivative", "Alternating_series",
"Pablo_Picasso", "Vincent_van_Gogh", "Leo_Tolstoy", "Web_crawler")
#Define UI for application
shinyUI(fluidPage(
# Application title (Panel 1)
titlePanel("Wiki Pages"),
# Widget (Panel 2)
sidebarLayout(
sidebarPanel(h3("Search panel"),
# Where to search
selectInput("select",
label = h5("Choose from the following Wiki Pages on"),
choices = titles,
selected = titles, multiple = TRUE),
# Start Search
submitButton("Results")
),
# Display Panel (Panel 3)
mainPanel(
h1("Display Panel",align = "center"),
plotOutput("distPlot")
)
)
))
Compare this with the previous ui.R Shiny example and note the following:
• The object titles, which contains the titles of the several Wikipedia web pages
• The widget selectInput(), which has an argument multiple set to TRUE so multiple choices can be selected/deselected
• The plotOutput() in mainPanel(), enabling the display of a plot in the display panel
The script server.R is as follows:
# Example: Shiny app that searches Wikipedia web pages
# server.R
library(shiny)
library(tm)
library(stringi)
source("WikiSearch.R")
shinyServer(function(input, output) {
output$distPlot <- renderPlot({
result <- SearchWiki(input$select)
plot(result, labels = input$select, sub = "", main="Wikipedia Search")
})
})
Compare this with the previous server.R Shiny example and note the following:
• The reference to the used libraries and the script WikiSearch.R
The script WikiSearch.R is as follows:
# Wikipedia Search
library(tm)
library(stringi)
library(WikipediR)
SearchWiki <- function (titles) {
articles <- lapply(titles,function(i) page_content("en","wikipedia", page_name = i,as_wikitext=TRUE)$parse$wikitext)
docs <- VCorpus(VectorSource(articles)) # Get Web Pages' Corpus
remove(articles)
# Text analysis - Preprocessing
transform.words <- content_transformer(function(x, from, to) gsub(from, to, x))
temp <- tm_map(docs, transform.words, "<.+?>", " ") # Remove markup tags
temp <- tm_map(temp, transform.words, "\t", " ")
temp <- tm_map(temp, content_transformer(tolower)) # Conversion to Lowercase
temp <- tm_map(temp, PlainTextDocument)
temp <- tm_map(temp, stripWhitespace)
temp <- tm_map(temp, removeWords, stopwords("english"))
temp <- tm_map(temp, removePunctuation)
temp <- tm_map(temp, stemDocument, language = "english") # Perform Stemming
remove(docs)
# Create Dtm
dtm <- DocumentTermMatrix(temp)
dtm <- removeSparseTerms(dtm, 0.4)
dtm$dimnames$Docs <- titles
docsdissim <- dist(as.matrix(dtm), method = "euclidean") # Distance Measure
h <- hclust(as.dist(docsdissim), method = "ward.D2") # Group Results (returned)
}
The script WikiSearch.R uses lapply() to download all of the selected web pages and create a corpus.
• As before, the preprocessing is defined with content_transformer() functions and applied to the corpus with tm_map().
• The term-document matrix is formed and can be fairly large for a larger set of web pages.
• Note the reduction of the term-document matrix by removal of the sparse terms.
• In this particular case, the Euclidean distance between the document term vectors is used as the dissimilarity measure for grouping (ranking) the results.
Note that it may take some time to execute the WikiSearch.R script before the result is displayed. Since the app is interactive, changing the web page selection (removing or adding Wikipedia web pages) changes the plot, too.
Indexing
Web indexing (or internet indexing) refers to various methods of indexing the contents of a web site or of the internet as a whole. Search engines usually use keywords and metadata to provide a more useful vocabulary for internet or on-site searching.
Metadata web indexing involves assigning keywords or phrases to web pages or web sites within a metadata tag (or “metatag”) field, so that the web page or web site can be retrieved with a search engine that is customized to search the keywords field. This may or may not involve using keywords restricted to a controlled vocabulary list. This method is commonly used by search-engine indexing. Collections with frequent changes would require dynamic indexing. For very large data collections like the web, indexing has to be distributed over computer clusters with hundreds or thousands of machines.
The basic steps in constructing an index as term-document ID pairs were described in Module 3, Lecture 1. Index implementation for a large data collection consists of the following (a small illustration follows the list):
• Partitioning by terms, also known as global index organization.
• Partitioning by documents (more common), also known as local index organization.
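As a small base R illustration (not from the lecture code), the sketch below builds a toy inverted index as term → document-ID postings lists and then shows the two partitioning schemes; the documents and the machine assignment are invented.
# Three tiny invented documents
docs <- c(d1 = "web crawler fetches web pages",
          d2 = "the indexer builds an index of terms",
          d3 = "search engine ranks pages")
# Inverted index: for each term, the postings list of document IDs containing it
tokens <- lapply(strsplit(docs, "\\s+"), unique)
pairs <- data.frame(term = unlist(tokens), doc = rep(names(docs), lengths(tokens)))
postings <- split(pairs$doc, pairs$term) # term -> postings list
postings[["web"]]                        # documents containing the term "web"
# Partitioning by terms (global index organization): split the postings lists across machines
term.partition <- split(postings, cut(seq_along(postings), 2, labels = c("machine1", "machine2")))
# Partitioning by documents (local index organization): each machine indexes only its own documents
doc.partition <- lapply(split(names(docs), c("machine1", "machine1", "machine2")),
                        function(ids) {
                          p <- pairs[pairs$doc %in% ids, ]
                          split(p$doc, p$term)
                        })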
Web-Scraping Illustration
The process of web scraping, sometimes called web harvesting or web data extraction, is a technique for simulating human website exploration with the intent of extracting information from websites. This is achieved by low-level analysis of HTTP (Hypertext Transfer Protocol) traffic and of the HTML, XML (Extensible Markup Language), or other markup language used. The process can be considered a two-step process of "retrieval" and "parsing."
For the task of web scraping, R is easy to implement and debug, but it has fewer scraping libraries than other languages such as Python or Ruby. The following R packages are relevant for the task of web scraping (whether each can retrieve web content, and whether it can parse it):
• RCurl – Retrieve: Yes; Parse: No
• XML – Retrieve: Limited; Parse: Yes
• rjson – Retrieve: No; Parse: Yes
• RJSONIO – Retrieve: No; Parse: Yes
• httr – Retrieve: Yes; Parse: Yes
• selectr – Retrieve: No; Parse: Yes
• ROAuth – Retrieve: No; Parse: No
• rvest – Retrieve: Yes; Parse: Yes
Unlike the previous example of sports data websites, where the content is nicely tabulated, scraping the content of an arbitrary web page can be a very hard task. Here are a few examples of harvesting the information contained in several different types of web pages.
Example of Retrieving Journal Articles from arXiv
The arXiv website at arxiv.org is an open-access repository of articles in a variety of sciences, such as physics, computer science, and statistics. You can visit this website and run your own search. For example, you can look up articles written by the Bayesian statistician Andrew Gelman by going to arxiv.org in your web browser and running a simple author search:
Figure: arXiv Homepage

Alternatively, there is an R package called aRxiv that lets you do this. Note the package name capitalizes the R instead of the X.
install.packages("aRxiv")
library(aRxiv)
arxiv_count('au:"Andrew Gelman"') # How many articles by Andrew Gelman?
rec <- arxiv_search('au:"Andrew Gelman"') # Retrieve articles from Andrew Gelman (only 10 articles by default)
nrow(rec)
As of February 2020, the arxiv_count() function finds 59 articles by Andrew Gelman. It may help to reduce the search so fewer articles are returned. One way to do this is to look for articles by Dr. Gelman that mention a particular topic in the abstract, such as "hypothesis testing":
rec <- arxiv_search('au:"Andrew Gelman" AND abs:"hypothesis testing"')
nrow(rec)
This returns a manageable list of just 2 articles, so let’s look at them:
library(tidyverse)
as_tibble(rec)
# A tibble: 2 x 15
  id    submitted  updated  title  abstract authors affiliations  link_abstract link_pdf link_doi
1 1012… 2010-12-…  2011-1…  "Inh…  " For …  Andrew…  "Columbia U…  http://arxiv… http://… ""
2 1905… 2019-05-…  2019-0…  "Man…  " The …  Andrew…  ""            http://arxiv… http://… ""
# … with 5 more variables: comment, …, primary_category, …
At this point, we could ask to see these abstracts in our web browser:
arxiv_open(rec)
Or we could look to do text mining on the abstracts by building a corpus:
library(tm)
arxivCorpus <- VCorpus(VectorSource(rec$abstract))
Then, with this new Corpus, we could proceed with pre-processing and analysis, similar to what we learned in Module 3.
Example of Retrieving Movie Showtimes
Another relatively simple example of web page content scraping is gathering relevant information from a movie theater website. This task requires somewhat more detailed knowledge of the website's structure and its HTML. Let's consider the "Showtimes" webpage of the "Coolidge Corner Theater" website and try to create an R script that will, at any time, harvest the titles of the movies currently playing at the theater.
The "Showtimes" webpage, as it appeared on March 13, 2016, is illustrated below:
Figure: Coolidge Corner Theater Homepage

We will now illustrate how to harvest the titles of the movies currently playing at the theater, using the R package "rvest". This task requires familiarity with the webpage’s HTML source syntax, which for the relevant information we are trying to harvest is illustrated in the code segment below. You can see the HTML source by right clicking on the webpage and selecting “View page source”.
Each movie title is wrapped in markup of roughly the following form (simplified here):
<div class="film-card__title"><a class="film-card__link" href="...">Downton Abbey</a></div>
The relevant tags for our task are "div.film-card__title" and "a.film-card__link". The R implementation requires just a few lines of code:
library("rvest")
movies <- read_html("http://www.coolidge.org/showtimes")
titles <- movies %>% html_nodes("div.film-card__title") %>% html_nodes("a.film-card__link")
Note that class(titles) is “xml_nodeset”, and if you display the titles you will see some HTML/XML code. To extract just the text elements between the HTML tags, we use the html_text function like this:
> html_text(titles)
[1] "Downton Abbey"              "Linda Ronstadt: The Sound of My Voice"
[3] "Brittany Runs a Marathon"   "Official Secrets"
[5] "The Farewell"
Note that for any other relevant information in the web page you would need to specify the appropriate tags. The help files of package “rvest” contain much more on the package functionality.
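For example, to also harvest the link target (URL) of each title, one could extract the href attribute of the same anchor nodes; a small illustrative addition to the code above:
links <- movies %>% html_nodes("a.film-card__link") %>% html_attr("href") # URLs of the film pages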
Rather than searching for the correct tags in the web page source, it can be easier to use the SelectorGadget extension for the Chrome web browser.
Example of Retrieving and Replacing Information in XML Files
The XML file is a standard way of keeping information in many professional applications, such as GE's big-data platform for the industrial internet or DNA sequencing. R is typically one of the many analysis tools involved in DNA sequencing, which is the process of determining the precise order of nucleotides within a DNA molecule. Without loss of generality, for the purpose of illustrating how to scrape the content of an XML file, here we will consider a very small and simple XML file containing a very small "DNA sequence." Let's consider the following XML file containing two "sequence" tags, each containing one taxon (a group of one or more populations of nucleotides), i.e., "GCAGTTGACACCCTT" and "GACGGCGCGGACCAG".
# sequencing.xml
<sequences>
  <sequence>GCAGTTGACACCCTT</sequence>
  <sequence>GACGGCGCGGACCAG</sequence>
</sequences>
And as a task, let’s try to access and modify the XML information. Here is the R code that will accomplish that task.
library(XML)
# read the XML file located in folder "pth"
x = xmlParse(file.path(pth, "sequencing.xml"))
# returns a *list* of text nodes under the "sequence" tag
nodeSet = xpathApply(x, "//sequence/text()")
# loop over the list returned, and get and modify each node value:
zz <- sapply(nodeSet, function(G) {
  text = paste("Ggg", xmlValue(G), "CCaaTT", sep = "")
  text = toupper(text) # convert the added lowercase letters to uppercase for consistency
  xmlValue(G) = text
})
This code loads the XML file into the R object "x", identifies our "DNA sequences" using the XML package function xpathApply(), and replaces the sequences in the object "x" by pasting additional sequences around them. Note how the lowercase letters of the added sequences are converted to uppercase for consistency. The resulting R object "x" is displayed below and can be saved as a modified XML file.
> x
<?xml version="1.0"?>
<sequences>
  <sequence>GGGGCAGTTGACACCCTTCCAATT</sequence>
  <sequence>GGGGACGGCGCGGACCAGCCAATT</sequence>
</sequences>
Searching, Precision, and Recall
People can’t always find breaking news or current topics of public conversation with ordinary keyword searches of indexed web resources, and they already get frequent pointers to current information by the electronic equivalent of word of mouth. The phenomenon of finding content by searching user-contributed tags is perhaps one of the most familiar search experiences available online today.
Optionally, hints can be given to the user about the sorts of things that can be profitably searched for via a real-time search interface. For example, Twitter Search lists the top currently trending topics and offers subscription to search results—most commonly in the form of an RSS feed—enabling people to track a term or phrase and be notified almost immediately whenever it appears.
Real-time search tools that capture signals from the social web provide a method for finding extremely current information and news. These tools maintain search engines to keep up with the leading edge.
Searching
When searching on a web site, we always face the task of choosing or guessing the right keyword. This is a very hard task on web sites that don’t have faceted search capabilities. The typical user is not familiar with the content of the database being searched or the exact phrasing of the search target. If categories derived from the search results are presented to the user, these may be useful for narrowing down a search. This concept is commonly called a faceted search. A database of XML documents seems well suited to a faceted search because XML, by nature, is fit for storing metadata in the form of arguments or meta-elements.
Consider an example of a car database. Natural candidates for facets are “color,” “brand,” “price,” etc. Each of these facets can take a value—for example, for the “color” facet, the value may be “red,” “blue,” etc.; or for the “price range” facet, the value may be “0–500,” “500–1000,” etc. When a user starts searching by typing, for example, “sedan,” the respective amounts of retrieved “red” and “blue” sedan cars are displayed, along with all results for that query. This helps the user to choose with a simple mouse click on facet value “red” or “blue.” In a sense, this approach represents an attempt to bridge the gap between searching and browsing. There are technical challenges associated with this approach, especially with unstructured data, since simple precalculation of facet-value counts is generally problematic and, in some cases, infeasible.
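As a toy base R illustration of facet-value counts (illustrative only; the little car data frame below is invented), the counts for each facet value of the current result set can be tabulated like this:
# A tiny invented car database
cars <- data.frame(body = c("sedan", "sedan", "sedan", "suv", "sedan"),
                   color = c("red", "blue", "red", "red", "blue"),
                   price = c(450, 900, 1200, 800, 300))
hits <- cars[cars$body == "sedan", ]          # results for the query "sedan"
table(hits$color)                             # facet counts for "color"
table(cut(hits$price, c(0, 500, 1000, 1500),  # facet counts for "price range"
          labels = c("0-500", "500-1000", "1000-1500")))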
The search concepts are based on the following:
• Indexing – Files, websites, and database records are processed into indexed files to make them searchable.
• User Input – A user interface is provided for users to enter their information need.
• Ranking – The search engine compares the query to the documents in the index and ranks documents according to how closely they match the query.
• Results Display – The final results are displayed via a user interface, at the command prompt, in a browser, or on a mobile phone.
Understanding Search Performance
There are numerous metrics for judging how well a search system performs. They can be related to the hardware and to the analysis itself, focusing on quantities such as the following:
• Number of queries the system can process
• Average query-processing time
• Document throughput (processed documents per second)
• Index size (effectiveness of the indexing algorithm)
• Number of unique terms (size of the index)
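For instance, the average query-processing time could be estimated with a few lines of R (illustrative only; search_index() below is a hypothetical stand-in for whatever search routine is being measured):
queries <- c("crude oil prices", "web crawler", "precision recall") # invented test queries
times <- sapply(queries, function(q) system.time(search_index(q))["elapsed"]) # search_index() is hypothetical
mean(times) # average query-processing time in seconds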
Precision and Recall
The most common information-retrieval task is ad hoc retrieval: the system provides relevant documents from a collection in response to an arbitrary user query (each time on a new topic). The user conveys a query, but distinct from the query is the information need. This is the topic about which the user desires to know more.
A document is considered relevant if it contains information of value with respect to the information need, which is often more of a concept than an exact use of words. Related to this are the metrics we use to determine relevance.
How do we measure the effectiveness of an information-retrieval system? Let’s illustrate the effectiveness analysis with the following example.
Example 4.1
Consider the classification performance of an email spam filter based on a user-specified set of spam words. The filter analyzed 100 emails, of which 27 were spam. The filter classified 21 emails as spam, of which 11 were actual spam.
Let’s determine the effectiveness of this spam filter. First we will need to define some terms. It is typical to choose the rare event as “positive,” so let’s choose a spam email as a positive (or 1) event and a regular email as a negative (or 0) event. We also need to realize that we have two classes of events (emails)—actual and predicted (classified)—as illustrated in Figure: Two Classes of Events.
Figure: Two Classes of Events

Diagram from which the confusion matrix is created, used to estimate the effectiveness of a classification process—in this case, the process of classifying an email as spam.
The terms defined in the diagram would be used to assess the effectiveness of any classification process; in this context, the process is that of an information-retrieval system (i.e., we wish to determine the quality of its search results). The terms, related to the spam email–classification example, are as follows:
• TP – True positive: an actual spam email correctly classified as spam (positive)
• FP – False positive: an actual non-spam email incorrectly classified as spam (positive)
• FN – False negative: an actual spam email incorrectly classified as not spam (negative)
• TN – True negative: an actual non-spam email correctly classified as not spam (negative)
In general, these terms are related to the confusion matrix (CM), which is also known as a contingency table or an error matrix and is shown in Figure: Error Matrix.
Figure: Error Matrix

The elements of the confusion matrix (CM). The 1 and the 0 refer to positive and negative events. The precision is calculated from the quantities circled in green (i.e., TP and FP), while the recall is calculated from the quantities circled in red (i.e., TP and FN).
As can be seen in Figure: Error Matrix, the elements of the confusion matrix are TP, FP, FN, and TN. Note that the rows of the confusion matrix add up to the numbers of events classified as positive or negative, and the columns of the confusion matrix add up to the numbers of actual positive and negative events. The sum of the events classified as positive and negative equals the total number of events, and the same is true for the sum of the actual positive and negative events.
In various classification R packages, the confusion matrix can be calculated automatically and displayed for evaluating effectiveness.
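For example, in base R a confusion matrix can be tabulated directly from vectors of actual and predicted labels (a minimal sketch with invented labels; dedicated packages such as caret also provide a confusionMatrix() function that reports the metrics discussed below):
actual <- c(1, 1, 0, 0, 1, 0, 0, 1)    # 1 = spam, 0 = not spam (invented data)
predicted <- c(1, 0, 0, 1, 1, 0, 0, 0)
table(Classified = predicted, Actual = actual) # rows: classified as; columns: actual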
The 1 and the 0 refer to positive and negative events. A user will usually want to know two key statistics about the system’s returned results for a query—in this case, which emails are spam. These two quantities are the typical elements of a metric to assess the effectiveness of any classification process:
• Precision – The fraction of the returned results that are relevant to the information need
• Recall – The fraction of the relevant documents in the collection that were returned by the system
Mathematically, the CM terms relate to precision and recall through the following formulas:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Going back to our spam email example, from the information given in the example and the earlier-mentioned fact about the sums of the rows and the columns of the CM, we can determine all of the elements of the CM. What was given is the following:
Total = 100
#Actual 1 = TP + FN = 27
#Classified as 1 = TP + FP = 21
TP = 11
From the given information, the unknown elements of the CM can be found as follows:
FP = #Classified as 1 − TP = 10
FN = #Actual 1 − TP = 16
TN = Total − #Classified as 1 − FN = 63
Thus we have the following:
TP = 11, FP = 10, FN = 16, TN = 63
Then the precision and the recall for what we will call performance 1 are as follows:
Precision = TP / (TP + FP) = 11 / (11 + 10) × 100 = 52.4%
Recall = TP / (TP + FN) = 11 / (11 + 16) × 100 = 40.7%
High precision and recall give us confidence in the effectiveness of any classification process, including an information-retrieval system.
As noted earlier, before we assess the effectiveness of the classification, we typically divide the available data into 60% training, 20% cross-validation, and 20% test datasets. Note that the precision and the recall need to be estimated on the cross-validation dataset. Note also that these two measures together may not always be a good effectiveness indicator.
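A small R helper (illustrative, not part of the lecture code) that reproduces the performance 1 numbers from the confusion-matrix counts:
# Precision, recall, and F score (in percent) from confusion-matrix counts
cm.metrics <- function(TP, FP, FN, TN) {
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  f.score <- 2 * (precision * recall) / (precision + recall)
  round(100 * c(precision = precision, recall = recall, F = f.score), 1)
}
cm.metrics(TP = 11, FP = 10, FN = 16, TN = 63) # performance 1: precision 52.4, recall 40.7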
Some students find alternative explanations useful. One explanation of precision and recall that some students have liked can be found at Explaining Precision and Recall.
Test Yourself 4.2
Consider the following situation, which we will call performance 2:
TP = 2, FP = 34, FN = 1, TN = 63
Calculate the precision P and the recall R for this situation.
Show Answer
Test Yourself 4.3
The F score combines precision and recall into a single measure:
F score = 2 ⋅ (P ⋅ R) / (P + R)
Calculate the F score for both performance 1 and performance 2 and determine which is more effective.
Performance 1: P = 52.4% and R = 40.7%
Performance 2: P = 5.6% and R = 66.7%
Show Answer
Using RStudio’s Shiny & Shinydashboard
Relevant Links
• Shiny from R Studio
• shinydashboard
• Shiny Widgets Gallery
Using Package Shiny
The package Shiny, developed by RStudio, brings R to the web and combines the computational power of R with the interactivity of the modern web.
> install.packages("shiny")
Shiny is an R package that makes it easy to build interactive web applications (apps) and visualizations straight from R. Very little web-development skill is needed to use it.
Shiny apps have two components:
• user-interface script (ui.R)—controls the layout and appearance of the app.
• server script (server.R)—contains the instructions that the computer needs to build the app.

Shiny apps are created simply by making a new directory and saving a ui.R and server.R file inside it. Every Shiny app has the same structure: two R scripts saved together in a directory. The user-interface (ui) script controls the layout and appearance of the app. It is defined in a source script named ui.R. The server.R script contains the instructions that the computer needs to build the app. Note that each app will need its own unique directory.
Shiny apps collect values from the user through widgets (web elements). These widgets are built into functions of the form xxxxInput(). Some examples include textInput() and selectInput(), although there are many other types of input widget functions.
1. A widget is a way for users to send messages to the Shiny app.
2. Shiny converts the widgets into HTML content.
Another thing to notice is that Shiny apps automatically respond (reactive output) to user changes in the widgets. Here is the link to the Shiny Widgets Gallery.
The reactive output is achieved by adding an R object to the user interface in ui.R and calling an output function to display a result created in server.R. These ui.R output functions have the form xxxxOutput(). Some examples include htmlOutput() and plotOutput(); again, there are many other output functions. In addition to the two regular scripts ui.R and server.R, you can include other scripts too. You would reference them in server.R as source("MyRscript.R").
Building Your First Shiny App
The first step toward building a Shiny App is to create a folder with the two components in it.
• user-interface script (ui.R)
• server script (server.R)
Insert in the R script ui.R the following code:
# ui.R
library(shiny)
#Define UI for application
shinyUI(fluidPage(
titlePanel("Example 1 Shiny App"),
sidebarLayout(
sidebarPanel(h3("Widgets panel")
),
# Display Panel (Panel 3)
mainPanel(
h1("Display Panel", align = "center"),
htmlOutput("text1")
)
)
))
Insert in the R script server.R the following code:
# server.R
library(shiny)
# Define server logic required to implement search
shinyServer(function(input, output) {
output$text1 <- renderUI({
Str1 <- paste("You have selected:")
Str2 <- paste("and searched for:")
Str3 <- "Search Results:"
HTML(paste(Str1, Str2, Str3, sep = '<br/>')) # HTML line breaks between the strings
})
})
You run the app by clicking the "Run App" button in the code editor's toolbar. When you do, your app will open in a browser containing the three app panels ("Application", "Widgets", and "Display"), as illustrated in Figure: Example 1 Shiny App.
Figure: Example 1 Shiny App

Note that the display panel here is an htmlOutput. You can specify other options in the “Display” panel to display images, graphs, tables etc.
Notice also that the htmlOutput() function in ui.R refers to a property named "text1". This property is created in server.R using a renderUI() function and stored in the list named output as output$text1. Notice that text1 is identical in both the output list and the name given in htmlOutput(). Normally the ui.R function xxxxOutput() matches a server.R function renderXxxx(); for example, tableOutput() matches a server.R function called renderTable(). The pairing of htmlOutput() and renderUI() is an exception (although you can use uiOutput() with renderUI() instead of htmlOutput()).
Another thing to note is that the server script produces HTML output, which is why the text strings in the script's code are combined using the HTML() function.
Alternatively, you can combine both scripts, ui.R and server.R, into a single script called app.R, as sketched below. You can add widgets from the gallery to create a more complex app for your needs.
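A minimal single-file sketch of the same Example 1 app, with the ui and server objects from the two scripts above combined into one app.R:
# app.R -- single-script version of Example 1
library(shiny)
ui <- fluidPage(
  titlePanel("Example 1 Shiny App"),
  sidebarLayout(
    sidebarPanel(h3("Widgets panel")),
    mainPanel(h1("Display Panel", align = "center"),
              htmlOutput("text1"))
  )
)
server <- function(input, output) {
  output$text1 <- renderUI({
    HTML(paste("You have selected:", "and searched for:", "Search Results:", sep = '<br/>'))
  })
}
shinyApp(ui, server)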
You can also easily add tabs at the app level, in the sidebarPanel, or in the mainPanel by inserting tabsetPanel()/tabPanel() code, as illustrated with the following code.
# Display Panel (Panel 3)
mainPanel(
tabsetPanel(id = "tabsMain",
tabPanel('Main Panel Tab1',
h1("Display Panel", align = "center"),
htmlOutput("text1")
),
tabPanel('Main Panel Tab2',
h1(tags$i("Plot Data"), align = "center", style = "color:#08488A"),
plotOutput("myPlots1")
)
)
)
This modification in the script code will create tabs in the “Display” panel as illustrated in the Figure: Tabs.
Figure: Tabs

Note how you can specify the position and the CSS style of what is displayed in each panel. In addition, you can specify your application's CSS style template in a separate file.
In addition to tabs, images and logos can also be added relatively easily to your Shiny app to create a more professional look, if needed, as illustrated in Figure: More in Shiny App.
Figure: More in Shiny App

Building Your First Shinydashboard App
The first step is to install the Shinydashboard package to your system.
> install.packages("shinydashboard")
Similar to Shiny, shinydashboard provides a standard dashboard look for your apps. It also contains three panels:
• dashboardHeader
• dashboardSidebar
• dashboardBody
Here is the prototype code that can be used to create a more complex dashboard app.
# Single script app
# Shinydashboard Example
## app.R ##
library(shiny)
library(shinydashboard)
ui <- dashboardPage(
dashboardHeader(),
dashboardSidebar(),
dashboardBody()
)
server <- function(input, output) { }
shinyApp(ui, server)
Note that, as an illustration, in this example the two scripts "ui.R" and "server.R" are merged into a single script file, "app.R". Whether you use the single-script or the two-script approach is not related to the type of app you are creating; it is purely a matter of convenience, and the end result is the same. The few lines of code shown above produce the general dashboard outline illustrated in Figure: Dashboard.
Figure: Dashboard

In a similar fashion, tabs and menu items can be added to dashboardSidebar(). This is illustrated in Figure: Menu Items.
Figure: Menu Items

Here is the code that created the dashboard illustrated in Figure: Menu Items.
## app.R ##
# code for UI only
library(shiny)
library(shinydashboard)
ui <- dashboardPage(
dashboardHeader(title = "Management Strategies", titleWidth = 250,
dropdownMenuOutput("messageMenu")
),
dashboardSidebar(width = 250,
sidebarMenu(id = "menu",
menuItem("Data Import", tabName = "readdate", icon = icon("list-alt")),
menuItem("Energy Dashboard", tabName = "dashboard", icon = icon("dashboard")),
menuItem("Predictive Analytics", tabName = "analytics", icon = icon("spinner"))
)
),
dashboardBody()
)
server <- function(input, output) { } # does nothing
shinyApp(ui, server)
Note how the header is added with dashboardHeader(). A conditional panel can be created depending on the selected menu item, and as a result of the selection other widgets can be displayed for that menu item. The result of this more involved approach is illustrated in Figure: Other Widgets in Menu Item. More information about this feature and many other Shiny and shinydashboard features can be found at the relevant links mentioned earlier.
Figure: Other Widgets in Menu Item

Example 4.4
Historically, the classic Reuters-21578 collection was the main benchmark for text-classification evaluation. This is a collection of 21,578 newswire articles, originally collected and labeled by Carnegie Group, Inc., and Reuters, Ltd. A smaller subset of the Reuters-21578 collection, related to crude oil, is included with the tm package.
The following code turns the Reuters-21578 collection into a corpus and illustrates some aspects of the techniques discussed so far.
# Example: Reuters documents example
# Reuters Files Location
reut21578 <- system.file("texts", "crude", package = "tm")
# Get Corpus
reuters <- VCorpus(DirSource(reut21578),
readerControl = list(reader = readReut21578XMLasPlain))
# Text analysis - Preprocessing with tm package functionality
transform.words <- content_transformer(function(x, from, to) gsub(from, to, x))
temp <- tm_map(reuters, content_transformer(tolower)) # Conversion to Lowercase
temp <- tm_map(temp, stripWhitespace)
temp <- tm_map(temp, removeWords, stopwords("english"))
temp <- tm_map(temp, removePunctuation)
# Create Document Term Matrix
dtm <- DocumentTermMatrix(temp)
Note the following:
> writeCorpus(reuters)
If needed, this function writes a character representation of the documents in a corpus to multiple files on disk. The first document in the corpus can be examined with the following:
> reuters[[1]]
…or with the following:
> reuters[["127"]]
…given the following:
> meta(reuters[[1]], "id")
[1] "127"
The corpus metadata is used to annotate text documents or whole corpora. It can also be annotated with additional information using the meta() function.
> reuters [[1]]$meta
Metadata:
author : character(0)
datetimestamp: 1987-02-26 17:00:56
description :
heading : DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
id : 127
language : en
origin : Reuters-21578 XML
topics : YES
lewissplit : TRAIN
cgisplit : TRAINING-SET
oldid : 5670
places : usa
people : character(0)
orgs : character(0)
exchanges : character(0)
For example, we can annotate the corpus documents with additional metadata, such as an author.
> reuters[[1]]$meta$author <- "John Smith"
This can be used to create indices based on selections and to subset the corpus with them; namely, to select the documents that satisfy given properties, such as author and heading.
> indx <- meta(reuters, "author") == 'John Smith' |
+   meta(reuters, "heading") == 'SAUDI ARABIA REITERATES COMMITMENT TO OPEC ACCORD'
> filtered <- reuters[indx]
Displaying the filtered corpus, we can see that it contains only the documents matching the specified criteria.
Corpora in tm have two types of metadata:
• Metadata on the corpus level (corpus)
• Metadata related to the individual documents (indexed), stored in the form of a data frame for performance reasons (indexing)
This is useful in classification. The classification directly relates to the documents, but the set of classification levels forms its own entity. An illustration of corpus-level classification is given below:
> meta(filtered, tag = "Classification", type = "corpus") <- "on the corpus level"
> meta(filtered, type = "corpus")
$Classification
[1] "on the corpus level"
attr(,"class")
[1] "CorpusMeta"
Note the tag “Classification” that was assigned to the metadata on the corpus level.
Classification of individual documents (indexing) is illustrated below:
> meta(filtered, "ix") <- letters[1:2]
> meta(filtered)
ix
1 a
13 b
Note the indexing (1 and 13) of the filtered corpus: these are the indices of the reuters documents that matched the filtering condition. We labeled them a and b.
Once the term-document matrix is found, we can see the most frequent terms with the following:
> findFreqTerms(dtm, 10)
We can use the tm package function findAssocs() to find associations (i.e., terms that correlate) with a correlation of at least 0.8 to a given term (for example, "kuwait"—one of the most frequent terms).
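For instance, a call along the following lines (the 0.8 threshold comes from the text; the output depends on the corpus):
> findAssocs(dtm, "kuwait", 0.8) # terms correlated with "kuwait" at 0.8 or higher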
We can subset the corpus by some query as follows:
> inspect(DocumentTermMatrix(
+ temp, list(dictionary = c("prices", "crude", "oil"))))
This will list the documents containing the query and the frequency of appearance.
> inspect(DocumentTermMatrix(
+ temp, list(dictionary = c("prices", "crude", "oil"))))
<<DocumentTermMatrix (documents: 20, terms: 3)>>
Non-/sparse entries: 44/16
Sparsity : 27%
Maximal term length: 6
Weighting : term frequency (tf)
      Terms
Docs   crude oil prices
127 2 5 3
144 0 12 5
191 2 2 0
194 3 1 0
211 0 1 0
236 2 7 5
237 0 3 1
242 0 3 2
246 0 5 1
248 0 9 9
273 5 5 5
349 2 4 1
352 0 5 5
353 2 4 2
368 0 3 0
489 0 4 2
502 0 5 2
543 2 3 2
704 0 3 3
708 1 1 0
Exercise
Implement this approach in a Shiny app.