
Week 11 Lecture Review Questions
Professor Yuefeng Li
School of Computer Science, Queensland University of Technology (QUT)
Web Analytics Definition


Web analytics is about understanding and optimizing web usage. It consists of four steps: collecting data, processing data into information, developing key performance indicators, and developing an online strategy.
The basic goal of web analytics is to collect and analyse data related to website traffic and usage patterns. It is also the measurement and analysis of data to understand online user behaviour across web pages.
For example: how many users visit, how long they stay, how many pages they visit, which pages they visit, and whether they arrive via a link.
Businesses use web analytics platforms to measure and benchmark website performance and view key performance metrics that drive their business, such as purchase conversion rates.
Web usage mining is an area of research that discovers interesting usage patterns from web data. Usage data (for example, log files or page markup) can capture the identity or origin of web users and their browsing behaviour (or access to certain content) on a website.
Question 1. Web Analytics
Jansen’s lecture about “Understanding user-web interactions via web analytics” (Reference [2]) presented an overview of the Web analytics process, with a focus on providing insight and actionable outcomes from collecting and analysing Web data. Which of the following is false according to Jansen’s lecture? Justify your answer.
(1) From a user behaviour perspective, Web analytics is one of a class of unobtrusive methods which are those that allow data collection without directly contacting participants.
(2) A top page is the first page a user views on a Website. Top pages generally fall into three categories: top entry pages, top exit pages, and most popular pages. By using top entry pages, organizations can determine advertising effectiveness and search engine popularity.
(3) Transaction logging is an indirect method of recording data about behaviours, where a record is referred to as a trace. The greatest strength of trace data is that the collection of the data does not interfere with the natural flow of behaviour and events in the given context.
(4) Every Web server keeps a log of page requests that can include (but is not limited to) visitor Internet Protocol (IP) address, date and time of the request, requested page, referrer, and information on the visitor’s Web browser and operating system. The format of the log file is ultimately the decision of the company that runs the Web server.
Question 2. (PageRank)
A Web graph G = (P, L) consists of Web pages (vertices) and links (edges). The PageRank procedure takes a Web graph G as input and then outputs the better PageRank estimate PR using the following equation:

PR(u) = λ/N + (1 − λ) · Σv∈Bu PR(v)/Lv

where N = |P| is the number of pages, Bu is the set of pages that point to u, and Lv is the number of outgoing links from page v. The following figure G = (P, L) is a Web graph, where P = {A, B, C, D} and L = {(A, B), (A, D), (C, A), (D, B), (C, B)}.
Assume the initial estimate of each Web page is equal, i.e., PR(A) = PR(B) = PR(C) = PR(D) = 0.25, and λ = 0.15. Calculate the PageRank estimate (PR value) of each Web page after running the first iteration of procedure PageRank(G).
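A minimal sketch of this first iteration in Python, assuming the basic update formula above and ignoring the fact that page B is a dangling page (it has no outgoing links, so its rank is not redistributed):

```python
# One iteration of the PageRank update for the graph in Question 2:
# PR(u) = lam/N + (1 - lam) * sum over v in B_u of PR(v)/L_v.
# Dangling-page redistribution (page B) is deliberately ignored here.

links = [("A", "B"), ("A", "D"), ("C", "A"), ("D", "B"), ("C", "B")]
pages = ["A", "B", "C", "D"]
lam = 0.15
N = len(pages)

# L[v]: number of outgoing links from v; B[u]: set of pages that point to u
L = {p: sum(1 for s, _ in links if s == p) for p in pages}
B = {p: [s for s, t in links if t == p] for p in pages}

pr = {p: 1 / N for p in pages}  # initial estimate: 0.25 each
new_pr = {u: lam / N + (1 - lam) * sum(pr[v] / L[v] for v in B[u])
          for u in pages}
print(new_pr)
```

Under these assumptions one iteration gives PR(A) = PR(D) = 0.14375, PR(B) = 0.4625, and PR(C) = 0.0375; note the estimates no longer sum to 1 because B's rank is not redistributed.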
Online communities
Understanding online behaviour among individuals, organizations, and websites is another goal of web analytics. For example, hyperlink analysis can be used to analyse connections between
online users in a website or online community.
Online communities are groups of entities that interact in an online environment and share common goals, characteristics, or interests. An online community can be represented as an entity graph, where vertices are entities and edges represent interactions between entities.

Cluster Analysis
Cluster analysis is a popular unsupervised learning method (i.e., it does not require any training data), which provides a different approach to group data based on selected features.
As with classification, the clustering criteria largely depend on how the items are represented. Input instances are assumed to be feature vectors that represent some objects, such as documents. If you are interested in clustering according to some property, it is important to make sure that property is represented in the feature vectors.
After the clustering criteria have been determined (e.g., a set of features has been selected), we need to decide how to assign data objects to clusters, including how many clusters (K) to use. Finally, after we have assigned all objects to clusters, how do we quantify how well we did? That is, we must evaluate the clustering. This is often very difficult, although several automatic techniques have been proposed, such as the square-error criterion.
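As a small illustration of the assignment and evaluation steps (assuming a K-means-style nearest-centroid assignment and the square-error criterion; the 2-D points and centroids are made up):

```python
# Minimal sketch: assign 2-D points to the nearest of K=2 fixed centroids,
# then score the clustering with the square-error criterion (sum of squared
# distances from each point to its assigned centroid).

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 8.5)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# For each point, pick the index of the closest centroid
assignment = [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
              for p in points]
square_error = sum(sq_dist(p, centroids[k]) for p, k in zip(points, assignment))
print(assignment, square_error)
```

With these made-up values the first two points join cluster 0, the last two cluster 1, and the square error is 2.5.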
A very specific form of cluster, called a monothetic cluster, is defined by some fixed set of properties, such as a “red” cluster (grouping produce by colour, e.g., red grapes and red apples). However, most clustering algorithms generate polythetic clusters, where members of a cluster share many properties, but there is no single defining property.
In other words, membership in a cluster is typically based on the similarity of the feature vectors that represent the objects. This means that a crucial part of defining the clustering algorithm is specifying the similarity measure that is used. The classification and clustering literature often refers to a distance measure rather than a similarity measure. Any similarity measure, which typically has a value s from 0 to 1, can be converted into a distance measure by using 1 − s.
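For instance (a small illustrative sketch using cosine similarity, one common choice for document feature vectors):

```python
# Converting a similarity s in [0, 1] to a distance via d = 1 - s,
# shown with cosine similarity between two toy binary feature vectors.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1, 0, 1, 1]
b = [1, 1, 0, 1]
s = cosine_similarity(a, b)
d = 1 - s  # distance derived from the similarity
print(s, d)
```

Here the cosine similarity of a and b is 2/3, giving a distance of 1/3.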
Evaluation
Evaluating the output of a clustering algorithm can be challenging: since clustering is an unsupervised learning method, there is often little or no labeled data to use for the purpose of evaluation.

For example, if we group emails into two clusters, some of the emails would be assigned to cluster identifier 1 (C1), while the rest would be assigned to cluster 2 (C2). Not only do the cluster identifiers have no meaning, but the clusters may not have a meaningful interpretation. For example, one would hope that one of the clusters would correspond to “spam” emails and the other to “non-spam” emails, but this will not necessarily be the case.
If some labeled data exists, then it is possible to use slightly modified versions of standard IR metrics, such as precision and recall, to evaluate the quality of the clustering.
Let D be a set of instances (documents) and C be a set of K clusters produced by a cluster algorithm over D. For each cluster Ci in C, we define a cluster label label(Ci) to be the (human-assigned) class label (e.g., “spam”) associated with the most instances in Ci.
label(Ci) = argmax x∈Class |{d ∈ Ci : label(d) = x}|
Since it is associated with more of the instances in Ci than any other class label, it is assumed that it is the true label for cluster Ci.
Then we can define MaxClass(Ci), the set of instances in Ci whose label equals the cluster label:
MaxClass(Ci) = {d ∈ Ci : label(d) = label(Ci)}
Therefore, the precision for cluster Ci is the fraction of instances in the cluster with label(Ci), and the cluster precision for C is
cluster precision = (1/N) · Σi=1..K |MaxClass(Ci)|
where N = |D|.
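The computation can be sketched as follows, using a small made-up set of labeled documents (not the data of Question 4):

```python
# Sketch of the cluster-precision computation. The clusters and the
# (hypothetical) human-assigned labels below are illustrative only.
from collections import Counter

labels = {"d1": "spam", "d2": "spam", "d3": "not_spam",
          "d4": "not_spam", "d5": "spam", "d6": "not_spam"}
clusters = {"C1": ["d1", "d2", "d3"], "C2": ["d4", "d5", "d6"]}

N = len(labels)
total = 0
for docs in clusters.values():
    counts = Counter(labels[d] for d in docs)
    # label(Ci) is the majority class; its count is |MaxClass(Ci)|
    cluster_label, max_class_size = counts.most_common(1)[0]
    total += max_class_size
cluster_precision = total / N
print(cluster_precision)
```

With these made-up labels, each cluster's majority class covers 2 of its 3 documents, so the cluster precision is 4/6 = 2/3.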
The problem of choosing K
It is one of the most challenging issues involved with clustering, since there is really no good solution. The best choice of K largely depends on the task and data set being considered. Therefore, K is most often chosen experimentally.

A widely used heuristic to determine the K value is the elbow method. It consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. However, the elbow method does not always yield the “obvious” K. For this problem, people commonly use the silhouette score to decide a suitable K value. You may use the silhouette_score function from the sklearn Python package. This part is beyond the scope of this unit.
Clustering and Search
A possible hypothesis is that documents in the same cluster tend to be relevant to the same queries. We now think about how to use this hypothesis to design a retrieval model.
One solution is called cluster-based retrieval, which ranks clusters instead of individual documents in response to a query Q = q1 … qn, for example using the query likelihood score

P(Q|Cj) = Πi=1..n P(qi|Cj)

where the probabilities P(qi|Cj) are estimated using a smoothed unigram language model based on the frequencies of words in the cluster.
The intuition behind this ranking method is that a relevant document with no terms in common with the query could potentially be retrieved if it were a member of a highly ranked cluster with other relevant documents.
The solution can be extended as a cluster language model:

P(w|D) = (1 − λ − δ) · fw,D/|D| + δ · fw,Cj/|Cj| + λ · fw,Coll/|Coll|

where λ and δ are parameters, fw,D is the word frequency in the document D, fw,Cj is the word frequency in the cluster Cj that contains D, and fw,Coll is the word frequency in the collection Coll; |D|, |Cj|, and |Coll| are the corresponding total word counts.
The second term, which comes from the cluster language model, increases the probability estimates for words that occur frequently in the cluster and are likely to be related to the topic of the document.
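A minimal sketch of this estimate, with made-up frequencies and parameter values (the variable names are illustrative, not from the lecture):

```python
# Sketch of the smoothed cluster language model estimate
# P(w|D) = (1 - lam - delta)*f_wD/|D| + delta*f_wC/|Cj| + lam*f_wColl/|Coll|
# using made-up counts and parameters.

lam, delta = 0.1, 0.3

f_wD, D_len = 2, 100             # word frequency in D, length of D
f_wC, C_len = 30, 2000           # word frequency in D's cluster, cluster length
f_wColl, Coll_len = 200, 100000  # word frequency in the collection, its length

p_w_given_D = ((1 - lam - delta) * f_wD / D_len
               + delta * f_wC / C_len
               + lam * f_wColl / Coll_len)
print(p_w_given_D)
```

With these numbers the estimate is 0.6·0.02 + 0.3·0.015 + 0.1·0.002 = 0.0167; the cluster term lifts the probability of words frequent in D's cluster.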

Question 3. (Distance Function)
Table 1 shows an example of a binary information table.
      a1 a2 a3 a4 a5 a6 a7
d1     1  1  0  0  0  0  0
d2     0  0  1  1  0  1  0
d3     0  0  1  1  1  1  0
d4     0  0  1  1  1  1  0
d5     1  1  0  0  0  1  1
d6     1  1  0  0  0  1  1

Table 1. A binary information table
(1) Draw the binary contingency table of Table 1 for comparing documents d1 and d2.
(2) Evaluate the Jaccard distance between d1 and d2.
(3) Evaluate the Simple matching coefficient distance between d1 and d2.
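A sketch of the computations asked for in Question 3, using rows d1 and d2 from Table 1 (the contingency counts f11, f10, f01, f00 follow the usual convention: fxy counts attributes where d1 = x and d2 = y):

```python
# Binary contingency counts, Jaccard distance, and simple matching
# coefficient (SMC) distance for d1 and d2 from Table 1.

d1 = [1, 1, 0, 0, 0, 0, 0]
d2 = [0, 0, 1, 1, 0, 1, 0]

f11 = sum(a == 1 and b == 1 for a, b in zip(d1, d2))
f10 = sum(a == 1 and b == 0 for a, b in zip(d1, d2))
f01 = sum(a == 0 and b == 1 for a, b in zip(d1, d2))
f00 = sum(a == 0 and b == 0 for a, b in zip(d1, d2))

jaccard_distance = 1 - f11 / (f11 + f10 + f01)        # 1 - Jaccard similarity
smc_distance = (f10 + f01) / (f11 + f10 + f01 + f00)  # 1 - SMC
print((f11, f10, f01, f00), jaccard_distance, smc_distance)
```

Under these conventions f11 = 0, f10 = 2, f01 = 3, f00 = 2, giving a Jaccard distance of 1 and an SMC distance of 5/7.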
Question 4. (Cluster Precision)
Table 2 shows a collection of documents, which includes 10 documents (each with a unique document id) D = {d1, d2, …, d10}, two classes (spam and not spam) Class = {‘spam’, ‘not_spam’}, and a vocabulary V = {w1, w2, …, w5} = {‘cheap’, ‘buy’, ‘banking’, ‘dinner’, ‘the’}.
Table 2. A collection of documents
Assume your clustering algorithm produces two (K = 2) clusters C1 = {d2, d5, d10} and C2 = {d1, d3, d4, d6, d7, d8, d9}.
(1) Calculate MaxClass(Ci) for i = 1 and 2.
(2) Calculate the cluster precision of your cluster algorithm.
