Week 10 Question Solutions
Professor Yuefeng Li
School of Computer Science, Queensland University of Technology (QUT)
Social media analysis
Social media analysis is defined as “the art and science of extracting valuable hidden insights from vast amounts of semi-structured and unstructured social media data (e.g., Twitter, Facebook, etc.) to enable informed and insightful decision making.”
The lecture notes discussed two research fields: microblog retrieval and sentiment analysis. The models and algorithms discussed in previous lectures can be used here, but you should know the special characteristics of the two fields. For example, a tweet is a short text message that may include hashtags or @username mentions.
Question 1. Assume C is a collection of tweets. We can use the following query likelihood method to calculate the probability of a tweet d occurring for a given query Q.
\[ P(Q,d) = \sum_{w \in Q,d} c(w,Q) \log\left(1 + \frac{c(w,d)}{\mu \times P(w|C)}\right) + |Q| \log\frac{\mu}{\mu + |d|} \]
Let |d| and |Q| be the respective lengths (sizes) of the tweet d and the query Q, and let μ be the smoothing parameter. Interpret the meaning of c(w, d), c(w, Q), and P(w|C).
c(w, d) is the number of times the word w occurs in the given tweet d,
c(w, Q) is the number of times the word w occurs in the given query Q,
P(w|C) is the probability of the word w in the collection C (the collection language model), which is used to smooth and normalise the model.
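The formula in Question 1 can be sketched in Python as follows. This is a minimal illustration, not part of the original solution: the function names (collection_model, ql_score), the toy tweets, and the value μ = 100 are assumptions made for the example, and P(w|C) is estimated here simply as the relative frequency of w in the collection.

```python
import math
from collections import Counter

def collection_model(tweets):
    """Estimate P(w|C) as the relative frequency of w in the collection (an assumption)."""
    counts = Counter(w for tweet in tweets for w in tweet)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def ql_score(query, tweet, p_wc, mu=100.0):
    """Score a tweet for a query using the formula in Question 1."""
    c_q = Counter(query)  # c(w, Q)
    c_d = Counter(tweet)  # c(w, d)
    score = 0.0
    for w, cwq in c_q.items():
        cwd = c_d.get(w, 0)
        if cwd == 0:
            continue  # the sum only runs over words that occur in both Q and d
        score += cwq * math.log(1.0 + cwd / (mu * p_wc[w]))
    # length normalisation term: |Q| * log(mu / (mu + |d|))
    score += len(query) * math.log(mu / (mu + len(tweet)))
    return score

# Hypothetical toy collection of tokenised tweets
tweets = [["qut", "open", "day"], ["rain", "today"], ["qut", "exam", "today"]]
query = ["qut", "exam"]
p_wc = collection_model(tweets)
scores = [ql_score(query, t, p_wc) for t in tweets]
ranking = sorted(range(len(tweets)), key=lambda i: scores[i], reverse=True)
```

A tweet that shares no word with the query receives only the length normalisation term, so it ranks below tweets that match at least one query word.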
Question 2. Which of the following statements about sentiment analysis is false? Justify your answer.
(1) Applications of sentiment analysis (opinion mining) range from discovering users’ opinions about products or services in online reviews or feedback, and observing trends in public mood, to the analysis of clinical records.
(2) The orientation is the opinion provided about the entity and/or the aspect that was provided by the opinion holder at a specific time.
(3) The goal of emotion classification is to separate subjective from objective information, a binary classification task.
(4) Aspects are features, components or functions of the entity. They can be nouns and/or noun phrases.
(5) Polarity classification groups the opinion expressed in a document, a sentence or an entity feature/aspect into positive, negative or neutral classes.
Solution: (3)
The goal of emotion classification is to classify a piece of text according to a predefined set of basic emotions. The goal of subjectivity classification is to separate subjective from objective information, a binary classification task.
Social search
Social search is a term used to describe search applications that involve communities of people (users) to tag content or answer questions. It is fast becoming the key search paradigm on the web. Users can interact online in a number of ways. For example, a user might visit a social media site that has recently gained a lot of popularity.
The online world is a very social environment where users communicate with each other in various forms. These social interactions provide search systems with new and unique sources of data to exploit, as well as myriad privacy concerns.
Unlike the models we mentioned earlier, we also have a wealth of user interaction data that can help improve the overall user experience in new and interesting ways. One example is user tagging: many social media sites allow users to assign tags to items. Another is collaborative search, which involves a group of users with a common goal searching together in a collaborative environment.
Filtering and recommender systems
A filtering system has two key components. First, the long-term information needs of users must be accurately expressed. This is done by constructing a profile for every information need. Second, given a document that has just arrived in the system, a decision-making mechanism must be devised to identify which profiles are relevant to that document.
Not only must this decision-making mechanism be efficient, especially when there may be thousands of profiles, but it must also be highly accurate.
Therefore, the difficulty with a filtering system is that it should not miss relevant documents (high recall), and perhaps more importantly, it should not constantly alert the user to irrelevant documents (high precision).
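The two requirements can be made concrete with a small sketch that measures precision and recall for the documents a filter delivered to one profile. The function name and the document IDs are hypothetical, chosen only for illustration.

```python
def precision_recall(delivered, relevant):
    """Precision and recall of the documents delivered to a single profile."""
    delivered, relevant = set(delivered), set(relevant)
    hits = len(delivered & relevant)
    precision = hits / len(delivered) if delivered else 0.0  # share of alerts that were useful
    recall = hits / len(relevant) if relevant else 0.0       # share of relevant documents not missed
    return precision, recall

# Hypothetical IDs: 4 documents delivered, 3 documents actually relevant
delivered = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(delivered, relevant)
```

Here two of the four alerts were useful (precision 0.5) and one of the three relevant documents was missed (recall 2/3).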
Question 3. The textbook (Chapter 10) shows a concrete example of static filtering using a language modeling framework. The following equation calculates a word probability distribution for a given profile made up of T1,…,Tk (pieces of text, e.g., query descriptions, documents, or other information):
It then uses negative KL divergence between profile and document model to compute a relevance score as follows:
The equations above, as given in the textbook, are not well-formed mathematically because the symbol “P” is overloaded: it denotes both the profile variable and the probability function. Rewrite the two equations using correct mathematical notation.
Let Pr represent a probability function and variable P represent a profile. We have the following corrections:
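\[ Pr(w|P) = \frac{(1-\lambda)}{\sum_{i=1}^{k} \alpha_i} \sum_{i=1}^{k} \alpha_i \frac{f_{w,T_i}}{|T_i|} + \lambda \frac{c_w}{|C|} \]
\[ -KL(P||D) = \sum_{w \in V} Pr(w|P) \log Pr(w|D) - \sum_{w \in V} Pr(w|P) \log Pr(w|P) \]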
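The two corrected equations can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the textbook's implementation: f_{w,T_i} is taken to be the count of w in T_i, c_w the count of w in the collection C, and α_i a weight for text T_i. The document model Pr(w|D) is assumed to be smoothed in the same way as the profile model (a profile made of one text), the vocabulary V is taken to be the collection vocabulary, and all function names and data are hypothetical.

```python
import math
from collections import Counter

def profile_model(texts, alphas, coll_counts, lam=0.1):
    """Pr(w|P) over the collection vocabulary for texts T_1..T_k with weights alpha_i."""
    coll_len = sum(coll_counts.values())      # |C|
    alpha_sum = sum(alphas)
    text_counts = [Counter(t) for t in texts]  # f_{w,T_i}
    model = {}
    for w, c_w in coll_counts.items():
        mix = sum(a * tc[w] / len(t) for a, tc, t in zip(alphas, text_counts, texts))
        model[w] = (1 - lam) / alpha_sum * mix + lam * c_w / coll_len
    return model

def neg_kl(p_model, d_model):
    """-KL(P||D) = sum Pr(w|P) log Pr(w|D) - sum Pr(w|P) log Pr(w|P)."""
    cross = sum(p * math.log(d_model[w]) for w, p in p_model.items())
    entropy = sum(p * math.log(p) for p in p_model.values())
    return cross - entropy

# Hypothetical toy collection (tokenised documents)
docs = [["football", "match", "score"],
        ["stock", "market", "score"],
        ["football", "league", "news"]]
coll_counts = Counter(w for d in docs for w in d)

# Profile built from two pieces of text with equal weights
p_model = profile_model([["football", "news"], ["football", "league"]], [1.0, 1.0], coll_counts)
# Assumption: a document is modelled as a profile consisting of a single text
d_rel = profile_model([docs[2]], [1.0], coll_counts)
d_irr = profile_model([docs[1]], [1.0], coll_counts)
score_rel = neg_kl(p_model, d_rel)
score_irr = neg_kl(p_model, d_irr)
```

The score is at most 0 and increases as the document model moves closer to the profile model, so the football document scores higher for this profile than the stock market document.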
Question 4. (Recommender systems)
Collaborative filtering leverages relationships between users to improve how items (documents) are matched to users (profiles). The figure below shows a group of users and their ratings for an item. The user with the question mark above its head has not rated this item yet. The goal of the recommendation algorithm is to fill in these question marks.
Suppose your team wants to design functions to implement a collaborative filtering-based recommender system, and your task is to determine the function name and its input and output data structures. For privacy reasons, you can use only numbers or IDs to represent users, such as 1 or u1, 2 or u2, etc.; for items, you can use numbers or their names; and ratings are expressed as integers (e.g., 0 to 5).
# A common similarity measure used for clustering users is correlation.
# Typically, users are represented by their rating vectors over items.
# We use a list of lists, Rvs, to represent the rating vectors of all users.
# For example:
# I = [0,1,2,3]    # 4 items in I, numbered from 0 to 3 (a list)
# U = [0,1,2,3,4]  # 5 users in U, numbered from 0 to 4 (a list)
# Rvs = [[1,2,1,0], [0,1,1,1], [2,1,3,5], [1,0,2,0], [0,2,3,4]]  # a list of lists
# 5 users' rating vectors for 4 items, where 0 means the user has not yet rated the item.

# Returns U, I and Uc (the correlation matrix, a list of lists)
def my_correlation(Rvs):
    U = [i for i in range(len(Rvs))]
    I = [i for i in range(len(Rvs[0]))]
    Uc = [[1 for i in range(len(U))] for i in range(len(U))]
    …
    return (U, I, Uc)

# Groups users into clusters based on the correlation matrix Uc
def my_cluster(U, Uc):
    return C  # clusters: a list of user lists

# Predicts ratings of unseen items for all users
def my_prediction(Rvs, C, U, I):
    return Rvs  # updated Rvs
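One possible way to fill in these three interfaces is sketched below. The question only asks for function names and data structures, so every implementation detail here is an assumption made for illustration: Pearson correlation is computed over co-rated items only, clustering is a simple greedy threshold scheme (the threshold parameter and the helper _pearson are hypothetical additions), and a missing rating is predicted as the rounded mean rating given by the other users in the same cluster.

```python
import math

def _pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 when it is undefined."""
    if len(x) < 2:
        return 0.0
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def my_correlation(Rvs):
    U = list(range(len(Rvs)))
    I = list(range(len(Rvs[0])))
    Uc = [[1.0 for _ in U] for _ in U]
    for u in U:
        for v in U:
            if u != v:
                # only items rated by both users (0 means "not rated")
                common = [i for i in I if Rvs[u][i] > 0 and Rvs[v][i] > 0]
                Uc[u][v] = _pearson([Rvs[u][i] for i in common],
                                    [Rvs[v][i] for i in common])
    return U, I, Uc

def my_cluster(U, Uc, threshold=0.5):
    C = []
    for u in U:
        for cluster in C:
            # join the first cluster whose first member correlates strongly with u
            if Uc[u][cluster[0]] >= threshold:
                cluster.append(u)
                break
        else:
            C.append([u])  # no suitable cluster found: start a new one
    return C

def my_prediction(Rvs, C, U, I):
    predicted = [row[:] for row in Rvs]  # copy, so the input is left unchanged
    cluster_of = {u: cluster for cluster in C for u in cluster}
    for u in U:
        for i in I:
            if Rvs[u][i] != 0:
                continue
            peers = [Rvs[v][i] for v in cluster_of[u] if v != u and Rvs[v][i] > 0]
            if peers:
                predicted[u][i] = int(sum(peers) / len(peers) + 0.5)  # round half up
    return predicted

Rvs = [[1, 2, 1, 0], [0, 1, 1, 1], [2, 1, 3, 5], [1, 0, 2, 0], [0, 2, 3, 4]]
U, I, Uc = my_correlation(Rvs)
C = my_cluster(U, Uc)
predicted = my_prediction(Rvs, C, U, I)
```

With the example rating vectors, users 2, 3 and 4 end up in one cluster, so the missing ratings of users 3 and 4 are filled in from their cluster peers, while users 0 and 1 form singleton clusters and keep their 0 entries.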