IFN647 Text, Web And Media Analytics
Web Analytics
Yuefeng Li | Professor
School of Electrical Engineering and Computer Science
Queensland University of Technology
S Block, Level 10, Room S-1024, Gardens Point Campus
ph 3138 5212 | email
Web Analytics Definition
Logfile analysis
Link analysis
Link Quality
Online communities
Finding Communities
Cluster Analysis
Distance Measures
The k-medoids Technique
Hierarchical Methods
Evaluating Clustering Algorithms
Community Based Question Answering
1. Web Analytics Definition
Web analytics is the collection and analysis of web data in order to understand and optimize Web usage.
It provides information about the number of visitors to a website and the number of page views.
Steps of Web analytics
2. Logfile Analysis
It (web usage mining) discovers interesting usage patterns from Web server data (logfiles) in order to understand web users' browsing behaviour.
Logfiles contain information on visits from search engine spiders, which are generally excluded from analytics tools based on JavaScript page tagging.
A log file is a server log that contains a history of page requests for a website, from both humans and robots.
Typically, there will be one line per request, and this will include some or all of the following:
Time the request was made
URL requested
User agent
Response Code
Size of response
IP address of client making the request
Time taken to serve the request
Referrer, the page that provided the link to make this request
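As a concrete illustration, here is a minimal Python sketch that pulls these fields out of a single log line; it assumes the common Apache "combined" log format, so the regular expression and the sample line are illustrative assumptions rather than a universal parser.

```python
import re

# Regex for the Apache "combined" log format (an assumption; real logfiles
# vary, so adjust the pattern to your server's configuration).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of request fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +1000] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://example.com/start.html" "Mozilla/5.0"')
print(parse_log_line(sample)["url"])  # /index.html
```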
Adaptive Web Sites
Building a complex web site is a thorny problem in user interface design.
Web servers can use access logs to record users' interactions with a site over time.
Definition of adaptive web sites
“Sites that automatically improve their organization and presentation based on user access data”
The challenging issue is
“how to build a web site that can improve itself over time in response to user interactions with the site?”
Example of user access logs:
Approaches to Adaptive Web Sites
Two basic adaptive ways:
Customization – update the web site to meet the needs of individual users
Optimization – alter it to make the navigation easier for all users
Customization
Tailor the interface to each individual user
Manual customization
Allows users to specify display options that are remembered during the entire visit and from one visit to the next
Path prediction
To guess where the user wants to go
Optimization
The site should learn from all users to make the site easier to use.
The idea is to view a web site's design as a particular point in the vast space of possible designs (a search problem).
Possible quality metric
To measure the amount of effort a visitor needs to exert on average in order to find what the user wants, where
Effort is defined as a function of the number of links traversed and the difficulty of finding those links
The problem: the search space is very large
Solution: Improve the site’s organization
e.g., find common access patterns from logfiles, or clustering.
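As a small illustration of mining common access patterns, the sketch below counts the most frequent page-to-page transitions across sessions; the session data and page names are made up for the example.

```python
from collections import Counter

# Toy sessions (assumed data): each is the ordered list of pages one
# visitor requested. Frequent transitions suggest candidate shortcut links.
sessions = [
    ["/home", "/products", "/products/widget", "/checkout"],
    ["/home", "/products", "/products/widget"],
    ["/home", "/about"],
]

transitions = Counter((a, b) for s in sessions for a, b in zip(s, s[1:]))
print(transitions.most_common(2))
# [(('/home', '/products'), 2), (('/products', '/products/widget'), 2)]
```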
3. Link Analysis
It analyses the node and connection structure (document structure) of a web site, or extracts patterns from hyperlinks in the web.
Document parser recognizes structure using markup, such as HTML tags
Headers, anchor text, bolded text all likely to be important
Metadata and Links can also be important
Example 1. Web Page
Hyper Links
Links are a key component of the Web
Important for navigation, but also for search
e.g., <a href="http://example.com">Example website</a>
“Example website” is the anchor text
“http://example.com” is the destination link
both are used by search engines
Anchor Text
Used as a description of the content of the destination page
i.e., collection of anchor text in all links pointing to a page used as an additional text field
Anchor text tends to be short, descriptive, and similar to query text
Retrieval experiments have shown that anchor text has significant impact on effectiveness for some types of queries
i.e., more than PageRank
Billions of web pages, some more informative than others
Links can be viewed as information about the popularity (authority?) of a web page
can be used by ranking algorithm
Inlink (links pointing to a page) count could be used as simple measure
Link analysis algorithms like PageRank provide more reliable ratings
less susceptible to link spam
There are tens of billions of web pages, but most of them are not very interesting. Many of those pages are spam and contain no useful content at all. Other pages are personal blogs, wedding announcements, or family picture albums. These pages are interesting to a small audience, but probably not broadly. On the other hand, there are a few pages that are popular and useful to many people, including news sites and the websites of popular companies.
PageRank is associated with the Google search engine.
Random Surfer Model
Browse the Web using the following algorithm:
Choose a random number r between 0 and 1
If r < λ: go to a random page
If r ≥ λ: click a link at random on the current page
Start again
PageRank of a page is the probability that the “random surfer” will be looking at that page
links from popular pages will increase PageRank of pages they point to
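A minimal simulation of the random surfer, as a sketch: the three-page link graph is the one used in the PageRank example on the following slides, and λ = 0.15 is assumed. The fraction of steps the surfer spends on each page approximates its PageRank.

```python
import random

# Toy link graph: page -> pages it links to (A -> B, A -> C, B -> C, C -> A).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
LAMBDA = 0.15  # probability of a random jump

def random_surfer(steps=100_000):
    pages = list(links)
    counts = {p: 0 for p in pages}
    current = random.choice(pages)
    for _ in range(steps):
        counts[current] += 1
        if random.random() < LAMBDA or not links[current]:
            current = random.choice(pages)           # random jump
        else:
            current = random.choice(links[current])  # follow a random link
    return {p: c / steps for p, c in counts.items()}

print(random_surfer())  # time spent on each page ~ its PageRank
```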
Dangling Links
Random jump prevents getting stuck on pages that
have no outgoing links
contain only links that no longer point to other pages
have links forming a loop
Links that point to the first two types of pages are called dangling links
may also be links to pages that have not yet been crawled
PageRank (PR) of page C = PR(A)/2 + PR(B)/1
More generally,
PR(u) = Σ_{v ∈ Bu} PR(v) / Lv
where Bu is the set of pages that point to u, and Lv is the number of outgoing links from page v (not counting duplicate links)
For example
BC = {A, B}
LC = 1, LA = 2, LB = 1
Example 2. PageRank
Don’t know PageRank values at start
Assume equal values (1/3 in this case), then iterate:
first iteration: PR(C) = 0.33/2 + 0.33 = 0.5, PR(A) = 0.33, and PR(B) = 0.17
second: PR(C) = 0.33/2 + 0.17 = 0.33, PR(A) = 0.5, PR(B) = 0.17
third: PR(C) = 0.42, PR(A) = 0.33, PR(B) = 0.25
Converges to PR(C) = 0.4, PR(A) = 0.4, and PR(B) = 0.2
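The convergence above can be checked with a short sketch that applies the PR(u) = Σ PR(v)/Lv update (no random jump yet) to the same three-page graph:

```python
# Iterative PageRank for Example 2 (no random jump).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = {p: 1 / 3 for p in links}  # start with equal values

for _ in range(50):
    new = {p: 0.0 for p in links}
    for p, outs in links.items():
        for q in outs:
            new[q] += pr[p] / len(outs)  # p passes PR(p)/Lp to each q
    pr = new

print({p: round(v, 2) for p, v in pr.items()})
# {'A': 0.4, 'B': 0.2, 'C': 0.4}
```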
PageRank Algorithm
Taking random page jump into account, 1/3 chance of going to any page when r < λ
PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1)
More generally,
PR(u) = λ/N + (1 − λ) · Σ_{v ∈ Bu} PR(v) / Lv
where N is the number of pages and λ is typically set to 0.15
The variable names used in this procedure differ from the earlier definitions; for example, |Q| = Lp. The algorithm also does not directly compute Bq: instead, for every link p → q it adds (1 − λ)·Ip/|Q| to q's score.
Because the current PageRank estimate must be kept separate from the updated estimate in each iteration, the algorithm uses two arrays, I (input) and R (result).
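A hedged Python sketch of this procedure, keeping separate I (input) and R (result) arrays as described; the toy graph is the one from Example 2, the dangling-page handling (spreading the score evenly) is one common choice, and setting LAMBDA = 0 reproduces the values in Example 2.

```python
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # toy graph from Example 2
LAMBDA = 0.15
N = len(links)

I = {p: 1 / N for p in links}           # current PageRank estimates
for _ in range(100):
    R = {p: LAMBDA / N for p in links}  # everyone gets the random-jump share
    for p, Q in links.items():          # |Q| = Lp, p's outgoing links
        if Q:
            for q in Q:
                R[q] += (1 - LAMBDA) * I[p] / len(Q)
        else:                           # dangling page: spread evenly
            for q in R:
                R[q] += (1 - LAMBDA) * I[p] / N
    I = R                               # better estimates become the input

print({p: round(v, 3) for p, v in I.items()})
```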
Link Quality
Link quality is affected by spam and other factors
e.g., link farms to increase PageRank
trackback links in blogs can create loops
links from comments section of popular blogs
Blog services modify comment links to contain the rel="nofollow" attribute
e.g., <a rel="nofollow" href="…">Come visit my web page.</a>
Trackback Links
4. Online Communities
What is an online community?
Groups of entities that interact in an online environment and share common goals, traits, or interests
Baseball fan community
Digital photography community
Not all communities are made up of humans!
Web communities are collections of web pages that are all about a common topic
Finding Communities
What are the characteristics of a community?
Entities within a community are similar to each other
Members of a community are likely to interact more with other members of the community than those outside of the community
Can represent interactions between a set of entities as a graph
Vertices are entities
Edges (directed or undirected) indicate interactions between the entities
Directed Graph Representation
Hyperlink-Induced Topic Search (HITS) algorithm can be used to find communities
Link analysis algorithm, like PageRank
Each entity has a hub and authority score
Based on a circular set of assumptions
Good hubs point to good authorities
Good authorities are pointed to by good hubs
Iterative algorithm, repeatedly applying two update rules and then normalizing the scores:
A(p) = Σ_{q → p} H(q)   (authority score: sum of hub scores of pages pointing to p)
H(p) = Σ_{p → q} A(q)   (hub score: sum of authority scores of pages p points to)
HITS cont.
(a) The black node is an authority.
(b) The white nodes are hubs.
Example 3. HITS: three iterations are worked through on the slides, each showing the input scores, the score update, and the score normalization.
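A compact sketch of the HITS iteration; the edge list below is an assumed toy graph, not the one drawn on the slides.

```python
# HITS over a directed graph given as (source, target) edges (assumed example).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]
nodes = {n for e in edges for n in e}

auth = {n: 1.0 for n in nodes}
hub = {n: 1.0 for n in nodes}

for _ in range(20):
    # Good authorities are pointed to by good hubs.
    auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
    # Good hubs point to good authorities.
    hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
    # Normalize so the scores stay bounded.
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {n: v / a_norm for n, v in auth.items()}
    hub = {n: v / h_norm for n, v in hub.items()}

print("authority:", {n: round(v, 2) for n, v in auth.items()})
print("hub:      ", {n: round(v, 2) for n, v in hub.items()})
```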
Cluster Analysis
Clustering is the process of grouping documents (or objects) into clusters so that objects in a cluster are highly similar to each other, but very different from objects in other clusters.
Cluster analysis is a popular unsupervised learning method (i.e., it does not require any training data).
We usually use a distance function to measure the dissimilarities between objects.
Note a similarity measure, which typically has a value s from 0 to 1, can be converted into a distance measure by using 1 – s.
Evaluating the output of a clustering algorithm can be challenging: because clustering is unsupervised learning, there is often little or no labeled data to use for evaluation.
For information retrieval, we can rank clusters instead of individual documents in response to a query or "topics of interest", if we assume documents in the same cluster tend to be relevant to the same queries.
Information Table
As mentioned before, a training set of user feedback can be described as an information table (D, T), where
D is a set of documents in which each document is a set of terms (may include duplicate terms); and
T = {t1, t2, …, tn} is a set of selected terms for all documents in D.
Doc Terms Rel
d1 GERMAN VW GERMAN Yes
d2 US US ECONOM ESPIONAG Yes
d3 US BILL ECONOM ECONOM ESPIONAG ECONOM Yes
d4 US ECONOM ESPIONAG BILL Yes
d5 GERMAN MAN VW ESPIONAG Yes
d6 GERMAN GERMAN MAN VW ESPIONAG Yes
d7 GERMAN VW GERMAN VW No
d8 US ECONOM No
An example of an Information Table
D = {d1, d2, …, d8}
T = {GERMAN, VW, US, ECONOM, BILL, ESPIONAG, MAN}, where
terms are stemmed keywords.
Binary Variables
A binary variable has only two states: 0 or 1, where
0 denotes the variable is absent and
1 denotes presence.
We may view each term as a binary variable in the information table; here the mapping is
GERMAN : t1, VW : t2, US : t3, ECONOM : t4, BILL : t5, ESPIONAG : t6, MAN : t7
Example of Binary Information Table
Doc t1 t2 t3 t4 t5 t6 t7 Rel
d1 1 1 0 0 0 0 0 yes
d2 0 0 1 1 0 1 0 yes
d3 0 0 1 1 1 1 0 yes
d4 0 0 1 1 1 1 0 yes
d5 1 1 0 0 0 1 1 yes
d6 1 1 0 0 0 1 1 yes
d7 1 1 0 0 0 0 0 no
d8 0 0 1 1 0 0 0 no
E.g., given term “bill”, which is the binary variable t5 in the binary information table:
1 indicates that d3 and d4 include it, while
0 indicates that other documents do not include it.
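The binary table above can be rebuilt in a few lines; this sketch uses the documents from the information table and the t1–t7 term mapping noted earlier.

```python
# Rebuild the binary information table from the example documents.
docs = {
    "d1": "GERMAN VW GERMAN".split(),
    "d2": "US US ECONOM ESPIONAG".split(),
    "d3": "US BILL ECONOM ECONOM ESPIONAG ECONOM".split(),
    "d4": "US ECONOM ESPIONAG BILL".split(),
    "d5": "GERMAN MAN VW ESPIONAG".split(),
    "d6": "GERMAN GERMAN MAN VW ESPIONAG".split(),
    "d7": "GERMAN VW GERMAN VW".split(),
    "d8": "US ECONOM".split(),
}
terms = ["GERMAN", "VW", "US", "ECONOM", "BILL", "ESPIONAG", "MAN"]  # t1..t7

binary = {d: [1 if t in ts else 0 for t in terms] for d, ts in docs.items()}
print(binary["d3"])  # [0, 0, 1, 1, 1, 1, 0]
```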
Binary Contingency Table
A contingency table is a type of table in a matrix format that displays the frequency distribution of the variables.
We can use it to compare two documents di and dj based on their binary information table representations.
q is the number of variables that equal 1 in both documents dj and di;
t is the number of variables that equal 0 in both documents dj and di;
s is the number of variables that equal 0 in di but equal 1 in dj; and
r is the number of variables that equal 1 in di but equal 0 in dj.
u=q+r+s+t, the total number of variables.
                  Document dj
                   1      0     Sum
Document di   1    q      r     q+r
              0    s      t     s+t
             Sum  q+s    r+t     u
Distance Measures
Simple matching coefficient, usually for symmetric binary variables (no preference between 1 and 0):
disS(di, dj) = (r + s) / u
Jaccard coefficient, usually for asymmetric binary variables (1 carries more information than 0, so the 0/0 matches t are ignored):
disJac(di, dj) = (r + s) / (q + r + s)
                  Document d2
                   1      0     Sum
Document d1   1    0      2      2
              0    3      2      5
             Sum   3      4      7
disS(d1, d2) = 5/7
disJac(d1, d2) = 5/5 = 1
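Both measures can be checked in a few lines against the d1/d2 contingency table above:

```python
# Contingency counts and the two distance measures for binary vectors.
def contingency(di, dj):
    q = sum(a == 1 and b == 1 for a, b in zip(di, dj))  # 1 in both
    r = sum(a == 1 and b == 0 for a, b in zip(di, dj))  # 1 in di only
    s = sum(a == 0 and b == 1 for a, b in zip(di, dj))  # 1 in dj only
    t = sum(a == 0 and b == 0 for a, b in zip(di, dj))  # 0 in both
    return q, r, s, t

def dis_simple(di, dj):   # symmetric binary variables
    q, r, s, t = contingency(di, dj)
    return (r + s) / (q + r + s + t)

def dis_jaccard(di, dj):  # asymmetric binary variables: 0/0 matches ignored
    q, r, s, t = contingency(di, dj)
    return (r + s) / (q + r + s)

d1 = [1, 1, 0, 0, 0, 0, 0]
d2 = [0, 0, 1, 1, 0, 1, 0]
print(dis_simple(d1, d2))   # 0.714... = 5/7
print(dis_jaccard(d1, d2))  # 1.0 = 5/5
```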
Partitioning Approach
Given a set of documents (objects), a partitioning approach attempts to construct k partitions (clusters or groups) of objects such that
Each partition must contain at least one object; and
Each object must belong to exactly one partition.
The partitioning approach usually creates an initial partitioning. It then uses an iterative relocation technique to reshuffle objects in order to improve the initial partitioning.
The general criterion of a good partitioning is that objects in the same partition are close or related to each other, whereas objects in different partitions are dissimilar.
The k-medoids Technique
Each partition (cluster) is represented by one of its objects, located near the centre of the cluster.
The basic idea of the k-medoids technique is to find k clusters in n objects by
first arbitrarily selecting a representative object (the medoid) for each cluster;
assigning each remaining object to the cluster whose medoid it is most similar to; and
iteratively replacing one of the medoids by one of the non-medoids whenever the quality of the resulting partitioning is improved.
Medoids are representative objects of clusters within a data set whose average dissimilarity to all the objects in the cluster is minimal.
Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set.
The Square-Error Criterion
The quality is estimated using a cost function which measures the average dissimilarity between objects and their medoids.
Typically, a square-error criterion is used as the cost function:
E = Σ_{i=1..k} Σ_{p ∈ Ci} dis(p, mi)²
where E is the sum of the squared errors for all objects, the Ci are the clusters, and the mi are the medoids (i = 1, 2, …, k)
Determine New Medoids
To determine whether a non-medoid object o is a good replacement for a current medoid mj, we consider all possible cases for every other non-medoid object p:
Case 1: p is reassigned to another medoid mi
p currently belongs to mj. After mj is replaced by o as a medoid, p is closest to some other medoid mi (i ≠ j).
Case 2: p is reassigned to o
p currently belongs to mj. After mj is replaced by o as a medoid, p is closest to o.
Case 3: the assignment does not change
p currently belongs to mi (i ≠ j). After mj is replaced by o as a medoid, p is still closest to mi.
Case 4: p is reassigned to o
p currently belongs to mi (i ≠ j). After mj is replaced by o as a medoid, p is closest to o.
Four Cases for Evaluation of a Cost Function
Each time a reassignment occurs, a difference in square-error E is contributed to the cost function.
The cost function calculates the difference in square-error value if a current medoid is replaced by a non-medoid.
The total cost of swapping (S) is the sum of costs incurred by all non-medoid objects. If the total cost is negative, we swap mj and o since the square-error value E would be reduced.
K-medoids Algorithm
Input – the number of clusters k and a database containing n objects.
Output – A set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
procedure:
(1) Arbitrarily choose k objects (m1, …, mk) as the initial medoids;
(2) Repeat:
Assign each remaining object to the cluster with the nearest medoid;
Randomly select a non-medoid object, o;
Evaluate the total cost, S, of swapping a medoid mj with o;
If S < 0, swap mj with o to form the new set of k medoids;
(3) Until no change;
K-medoids Algorithm cont.
This algorithm attempts to determine k partitions for n documents (objects).
In step (1), the initial k medoids are chosen randomly;
The algorithm then repeatedly tries to find the best medoids for all partitions in step (2); all possible medoid/non-medoid pairs of objects are examined.
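A hedged Python sketch of this procedure (PAM-style swapping). The squared-Euclidean distance function and the sample points are illustrative assumptions:

```python
import random

def dis(a, b):
    """Squared Euclidean distance (an assumed dissimilarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def total_cost(objects, medoids):
    # The criterion E: each object's distance to its nearest medoid.
    return sum(min(dis(o, m) for m in medoids) for o in objects)

def k_medoids(objects, k, max_iter=100):
    medoids = random.sample(objects, k)      # step (1): arbitrary medoids
    for _ in range(max_iter):
        changed = False
        for m in list(medoids):              # step (2): try every swap
            for o in objects:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                # Swap only if the total cost decreases (S < 0).
                if total_cost(objects, candidate) < total_cost(objects, medoids):
                    medoids, changed = candidate, True
        if not changed:                      # until no change
            break
    clusters = {m: [] for m in medoids}
    for o in objects:
        clusters[min(medoids, key=lambda m: dis(o, m))].append(o)
    return clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(k_medoids(points, k=2))
```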
Hierarchical Methods
A hierarchical method builds a hierarchical classification of objects.
Hierarchical methods:
Agglomerative (also called bottom-up method)
It starts with each object in its own (singleton) group. It then merges the smaller groups into larger ones until a termination condition holds (or all objects are in the same group).
Divisive method (also called top-down method)
It starts with all objects in the same group. It then splits the larger groups into smaller ones until a termination condition holds (or each object is in a separate group).
Examples of Bottom-up and Top-down methods
Examples of Bottom-up and Top-down methods cont.
For the bottom-up method, at the start, each object is placed in a singleton cluster.
In step 1, it merges the two clusters {a} and {b} since they have the smallest minimum Euclidean distance.
For the top-down method, at the start, all objects form one initial cluster.
In step 1, the cluster is split into two clusters {a, b} and {c, d, e} since they have the largest maximum Euclidean distance.
Euclidean Distances
Let Ci and Cj be clusters. p and p’ denotes objects.
Minimum distance
dismin(Ci, Cj) = min { dis(p, p′) : p ∈ Ci, p′ ∈ Cj }
Maximum distance
dismax(Ci, Cj) = max { dis(p, p′) : p ∈ Ci, p′ ∈ Cj }
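Both distances can be written directly from the definitions; objects here are assumed to be coordinate tuples:

```python
from itertools import product
from math import dist  # Euclidean distance (Python 3.8+)

def dis_min(Ci, Cj):
    return min(dist(p, q) for p, q in product(Ci, Cj))

def dis_max(Ci, Cj):
    return max(dist(p, q) for p, q in product(Ci, Cj))

Ci, Cj = [(0, 0), (1, 0)], [(4, 0), (6, 0)]
print(dis_min(Ci, Cj), dis_max(Ci, Cj))  # 3.0 6.0
```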
Evaluating Clustering Algorithms
Clustering and Search
A possible hypothesis is that documents in the same cluster tend to be relevant to the same queries.
Cluster-based retrieval
The intuition behind this ranking method is that a relevant document with no terms in common with the query could potentially be retrieved if it were a member of a highly ranked cluster with other relevant documents.
Cluster language model
In cluster-based smoothing, the document language model is interpolated with a language model estimated from the document's cluster. The second term, which comes from the cluster language model, increases the probability estimates for words that occur frequently in the cluster and are likely to be related to the topic of the document.
Summary of Searching within Online Communities
Can apply HITS to entity interaction graph to find communities
Entities with large authority scores are the “core” or “authoritative” members of the community
Clustering
Apply a clustering algorithm to entity graph
How to choose K?
Evaluating community finding algorithms is hard
Can use communities in various ways to improve search, browsing, expert finding, recommendation, etc.
5. Community Based Question Answering
Some complex information needs can’t be answered by traditional search engines
Information from multiple sources
Human expertise
Community based question answering tries to overcome these limitations
Searcher enters question
Community members answer question
A Scenario of Community-Based Question Answering