IFN647 Text, Web And Media Analytics
Social Media Li | Professor
School of Electrical Engineering and Computer Science
Queensland University of Technology
S Block, Level 10, Room S-1024, Gardens Point Campus
ph 3138 5212 | email
1. Social media analytics
Microblog Retrieval
Sentiment analysis
2. Social search
Searching Tags
Inferring Missing Tags
Browsing and Tag Clouds
Searching within communities
3. Filtering and recommender systems
Static filtering
Adaptive filtering
Recommender systems
Collaborative Filtering
Rating using User Clusters
Rating using Nearest Neighbors
This week, we mainly discuss problems in social media analytics, social search, filtering and recommender systems.
We then discuss possible solutions to these problems, building on what you have learned in previous lectures.
1. Social media analytics
It is defined as, “the art and science of extracting valuable hidden insights from vast amounts of semi-structured and unstructured social media data (e.g., Twitter, Facebook) to enable informed and insightful decision making.”
It is also commonly used by marketers to track online conversations about products and companies.
There are three main steps in analysing social media:
Data identification: identifying the subsets of available data to focus on for analysis.
What content is of interest? In addition to the text of the content, we want to know:
Who wrote the text?
Where was it found, or on which social media venue did it appear?
Are we interested in information from a specific locale?
When did someone say something in social media?
Data analysis.
Information interpretation.
Social media
Web-based services that allow individuals, communities, and organisations to produce, share and engage with user-generated content.
Media platforms and technologies
e-commerce gateways;
microblogs (e.g., Tumblr, Instagram, Twitter);
social networking (e.g., LinkedIn, Facebook, MySpace);
multimedia portals (e.g., Vimeo, Twitter, Facebook, Periscope, TikTok, YouTube);
virtual worlds (e.g., Second Life);
review platforms (e.g., Tripadvisor, Foursquare); and
social gaming (e.g., World of Warcraft).
Microblog Retrieval
Different types of microblogging technologies are available within social media to help achieve goals
Twitter is a microblogging service introduced in March 2006. With over 125 million daily active users, Twitter is ranked among the most popular social media platforms.
The platform allows everyone to create and share information and ideas in real-time.
“Tweet” is a term that refers to a short text message that a Twitter user can produce.
This short plain text (tweet) can also include videos, photographs, and website URLs.
Until recently, Twitter allowed 140 characters for a plain text message; however, in November 2017, the length was expanded to 280 characters.
The mention feature (the “@” prefix) can be used to address a specific user for interaction, e.g., to build a user network.
The hashtag feature can be used to annotate messages: the “#” character prefixes a word or an unspaced phrase, e.g., for hashtag-based search.
Example 1: tweets in JSON format
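The following is a minimal sketch of reading one tweet in JSON format. The tweet itself is invented for illustration, and the field names (`text`, `user.screen_name`, `entities.hashtags`, `entities.user_mentions`) are assumed to follow the classic Twitter REST API tweet object; other exports may use different names.

```python
import json

# Hypothetical tweet, invented for illustration only.
raw = """
{
  "created_at": "Mon May 06 20:01:29 +0000 2019",
  "id": 1,
  "text": "Loving the new camera on my phone! #photography @some_user",
  "user": {"screen_name": "example_user"},
  "entities": {
    "hashtags": [{"text": "photography"}],
    "user_mentions": [{"screen_name": "some_user"}]
  }
}
"""

tweet = json.loads(raw)

# Hashtags and mentions are already extracted under "entities",
# so there is no need to parse "#" and "@" out of the text.
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]

print(tweet["user"]["screen_name"], hashtags, mentions)
```

The hashtag list supports hashtag-based search, and the mention list gives the edges of a user network.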
Query likelihood
It uses Bayesian (Dirichlet prior) smoothing and takes both the document length and the query length into account.
c(w, d) is the word count in the given document d,
c(w, Q) is the word count in the given query Q,
|d| and |Q| are the respective lengths (size) of the document and query
P(w|C) is the probability of the word in the collection that is used to normalise the model.
μ is the smoothing parameter
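As a rough sketch of how these quantities fit together, the code below scores a document with the standard Dirichlet-smoothed query likelihood, log P(Q|d) = Σ c(w, Q) · log((c(w, d) + μ · P(w|C)) / (|d| + μ)). Dividing the result by |Q| is an assumption made here to reflect the query size; the exact formula on the original slide may differ. The function name, the toy documents, and the value of μ are all illustrative.

```python
import math
from collections import Counter

def query_likelihood(query, doc, coll_counts, coll_len, mu=10.0):
    """Length-normalised log query likelihood with Dirichlet smoothing."""
    if not query:
        return 0.0
    c_d = Counter(doc)      # c(w, d)
    c_q = Counter(query)    # c(w, Q)
    score = 0.0
    for w, c_wq in c_q.items():
        p_wc = coll_counts.get(w, 0) / coll_len   # P(w|C)
        if p_wc == 0.0:
            continue  # word never seen in the collection: skipped in this sketch
        p_wd = (c_d[w] + mu * p_wc) / (len(doc) + mu)
        score += c_wq * math.log(p_wd)
    return score / len(query)   # assumed normalisation by |Q|

docs = [
    ["twitter", "launch", "new", "feature"],
    ["football", "match", "tonight"],
]
coll_counts = Counter(w for d in docs for w in d)
coll_len = sum(coll_counts.values())

query = ["twitter", "feature"]
s_relevant = query_likelihood(query, docs[0], coll_counts, coll_len)
s_other = query_likelihood(query, docs[1], coll_counts, coll_len)
print(s_relevant > s_other)
```

Because of smoothing, the second document still receives a finite score even though it contains none of the query words; it simply ranks lower.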
Sentiment Analysis
Sentiment analysis (opinion mining) has uses that range from discovering users’ opinions about products or services in online reviews or feedback, through observing trends in public mood, to the analysis of clinical records.
It is widely applied to voice-of-the-customer materials such as reviews and survey responses, to online and social media, and to healthcare materials, for applications that range from marketing to customer service to clinical medicine.
Opinion can be represented as a tuple of Entity, Aspect, Orientation, Opinion Holder and Time.
An entity is the named object being evaluated, for example a product.
An aspect can be a feature, component or function of the entity.
The orientation is the opinion provided about the entity and/or the aspect that was provided by the opinion holder at a specific time.
Tasks in Sentiment Analysis
Polarity classification
Classify the opinion expressed in a document, a sentence, or an entity feature/aspect as positive, negative, or neutral.
Subjectivity classification
Its goal is to separate subjective from objective information, a binary classification task.
It is regarded as a prerequisite to sentiment polarity classification. It may be tackled at different levels of granularity. For instance,
At the document level the aim is to distinguish review-like documents from non-review documents or factual newspaper articles from editorial comments.
On a more fine-grained level, the task is to identify individual text passages (e.g., sentences) as being subjective or objective.
Emotion classification
The goal is to classify a piece of text according to a predefined set of basic emotions.
It tries to identify more fine-grained differences in the expression of sentiment, e.g., six “basic” emotions – anger, disgust, fear, happiness, sadness, and surprise.
Tasks in Sentiment Analysis cont.
Source detection
It aims to identify the person, organisation, or more generally, the entity that is the source of subjective information, including named entity recognition and relationship extraction.
It is an information extraction task.
A typical application for sentiment source detection is a multi-perspective question answering system that tries to answer questions of the form: “What is X’s viewpoint/opinion on topic Y?”
Target detection
The goal of sentiment target detection is to determine the subject of a sentiment expression. For example,
Which blogs report positively and which negatively on the topic of settlement policy?
It is a problem of information retrieval, where sentences or documents are classified or ranked according to their relevance towards a given topic or a question.
E.g., Google’s Bidirectional Encoder Representations from Transformers (BERT), https://arxiv.org/abs/1810.04805
Extraction of Opinion Sentences
Aspects are nouns and/or noun phrases, for example, “face recognition”, “zoom”, and “touch screen” are aspects of the product “camera”.
Opinion words are mostly adjectives. They are the closest adjective to the aspects in the sentence. An opinion lexicon can be used to identify and extract opinion words along with their orientation.
Extraction of opinions:
Build a list of aspects from two sources: product specifications and word synonyms. The product specification is a list provided by the manufacturer for each product, while the synonyms are matching words taken from the WordNet dictionary.
Identify the aspects and opinions in sentences. You may need to group aspects based on frequency and synonyms.
Pattern mining is applied to find frequent tag sets, i.e., sets of POS tags that occur together. A tag set is defined as frequent if it appears in more than 1% (the minimum support) of the review sentences.
For example, with the aspect tag appearing first, the tag sequence [NN][VBZ][RB][JJ] corresponds to the sentence “software is absolutely terrible”.
Weight each sentence by summing the weights of its tags, then select the sentences with the highest scores.
Adjective, adverb and verb weights
Tag | Description | Weight
JJ | Adjective | 1
JJR | Comparative adjective | 2
JJS | Superlative adjective | 3
RB | Adverb | 1
RBR | Comparative adverb | 2
RBS | Superlative adverb | 3
Verb category | Orientation | Verbs | Comments
Tell verbs | Positive | tell | Positively reinforce an opinion
Chitchat verbs | Positive | argue, chatter, gab | Positively reinforce opinion is being
Advise verbs | Positive | advise, instruct | Positively reinforce an opinion
 | Negative | admonish, caution, warn | Negatively reinforce the degree of certainty about a given opinion
Verbs are weighted by category. If the sentence contains a verb from a positive category, “+1” is added to the total weight; if the verb is from a negative category, “-1” is added (i.e., 1 is subtracted from the total weight).
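The weighting scheme can be sketched as follows. This is a minimal illustration that assumes the sentence has already been POS-tagged into (word, tag) pairs, that verbs are matched in their base form without lemmatisation, and that the verb adjustment is applied once per matching verb; the slide does not say whether it applies per verb or per sentence.

```python
# Weights taken from the adjective/adverb table above.
TAG_WEIGHTS = {"JJ": 1, "JJR": 2, "JJS": 3, "RB": 1, "RBR": 2, "RBS": 3}

# Verb lists taken from the verb-category table above.
POSITIVE_VERBS = {"tell", "argue", "chatter", "gab", "advise", "instruct"}
NEGATIVE_VERBS = {"admonish", "caution", "warn"}

def sentence_weight(tagged_sentence):
    """Sum tag weights, then adjust by +1/-1 for each categorised verb."""
    weight = 0
    for word, tag in tagged_sentence:
        weight += TAG_WEIGHTS.get(tag, 0)
        if tag.startswith("VB"):
            verb = word.lower()
            if verb in POSITIVE_VERBS:
                weight += 1
            elif verb in NEGATIVE_VERBS:
                weight -= 1
    return weight

opinionated = [("software", "NN"), ("is", "VBZ"),
               ("absolutely", "RB"), ("terrible", "JJ")]
factual = [("the", "DT"), ("zoom", "NN"), ("works", "VBZ")]

print(sentence_weight(opinionated), sentence_weight(factual))
```

Sentences would then be ranked by this weight and the highest-scoring ones kept as opinion sentences.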
2. Social Search
Social search
Communities of users actively participating in the search process
Goes beyond classical search tasks
Key differences
Users interact with the system
Users interact with other users either implicitly or explicitly
Social search includes, but is not limited to, the so-called social media sites
Collectively referred to as “Web 2.0” (social Web) as opposed to the classical notion of the Web (“Web 1.0”)
Social media sites
User generated content
Users can tag their own and others’ content
Users can share favorites, tags, etc., with others
Digg, Twitter, Flickr, YouTube, Del.icio.us, CiteULike, MySpace, Facebook, and LinkedIn
Social Search Topics
Searching within communities
Document filtering
Recommender systems
Then: Library card catalogs
Indexing terms chosen with search in mind
Experts generate indexing terms
Terms are very high quality
Terms chosen from controlled vocabulary
Now: Social media tagging
Tags not always chosen with search in mind
Users generate tags
Tags can be noisy or even incorrect
Tags chosen from folksonomies https://en.wikipedia.org/wiki/Folksonomy
A Folksonomy is a classification system
The collective assemblage of tags assigned by many users
Make the use of public tags effective.
Types of User Tags
Content-based
car, woman, sky
Context-based
new york city, empire state building
Attribute
nikon (type of camera), black and white (type of movie), homepage (type of web page)
Subjective
pretty, amazing, awesome
Organizational
to do, my pictures, readme
Example: a social tagging site that is best known for its reference manager, which is used to manage and share research papers and to generate bibliographies for scholarly articles.
Searching Tags
Tags can be used to describe textual or non-textual items (e.g., images or videos) to provide a textual dimension to items.
These textual representations of items can be very useful for searching; however, tags are very sparse representations of very complex items.
Searching user tags is challenging
Most items have only a few tags
Tags are very short
Boolean, probabilistic, vector space, and language modeling approaches will fail if used naïvely
Must overcome the vocabulary mismatch problem between the query and tags. Possible ways to overcome this problem:
Stemming (e.g., stem classes in week 7)
Pseudo-relevance feedback for tag expansion
One unique property of tags is that they are almost exclusively textual keywords that are used to describe textual or non-textual items. Therefore, tags can provide a textual dimension to items that do not explicitly have a simple textual representation, such as images or videos.
These textual representations of non-textual items can be very useful for searching; however, tags are very sparse representations of very complex items.
The simplest way to search a set of tagged items is to use a Boolean retrieval model. However, it may fail.
For example, the query Q = “fish, bowl” can be read as “fish AND bowl”, which returns only items that are tagged with both “fish” and “bowl”. It is likely to produce high-quality results, but it may miss many relevant items.
Thus, the approach would have high precision but low recall.
If we instead use the disjunctive (OR) query “fish OR bowl”, it will match many more relevant items, but at the cost of precision.
Of course, it is highly desirable to achieve both high precision and high recall. However, doing so is very challenging.
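The precision/recall trade-off between the conjunctive and disjunctive readings can be seen on a toy collection. The items, their tags, and the relevance judgements below are all invented for illustration.

```python
# Hypothetical tagged items and relevance judgements for the need "pet fish".
items = {
    "photo1": {"fish", "bowl", "goldfish"},
    "photo2": {"fish", "aquarium"},
    "photo3": {"bowl", "soup"},
    "photo4": {"goldfish", "tank"},
}
relevant = {"photo1", "photo2", "photo4"}

def boolean_search(query_tags, mode):
    """Return the items matching all (AND) or any (OR) of the query tags."""
    if mode == "AND":
        return {i for i, tags in items.items() if query_tags <= tags}
    return {i for i, tags in items.items() if query_tags & tags}

def precision_recall(retrieved):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant)
    return precision, recall

query = {"fish", "bowl"}
and_p, and_r = precision_recall(boolean_search(query, "AND"))
or_p, or_r = precision_recall(boolean_search(query, "OR"))
print(and_p, and_r, or_p, or_r)
```

On this toy data the AND query retrieves a single relevant item, while the OR query finds more relevant items but also returns the soup bowl. Neither reading retrieves `photo4`, which is relevant but shares no tag with the query: this is the vocabulary mismatch problem.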
Tag Expansion
It uses search results (pseudo-relevance feedback) to enrich a tag representation.
It overcomes the vocabulary mismatch problem by expanding the tag representation with external knowledge.
Possible external sources
Web search results
Query logs
After tags have been expanded, we can use standard retrieval models
Age of Aquariums – Tropical Fish
Huge educational aquarium site for tropical fish hobbyists, promoting responsible fish keeping internationally since 1997.
The Krib (Aquaria and Tropical Fish)
This site contains information about tropical fish aquariums, including archived usenet postings and e-mail discussions, along with new …
Keeping Tropical Fish and Goldfish in Aquariums, Fish Bowls, and …
Keeping Tropical Fish and Goldfish in Aquariums, Fish Bowls, and Ponds at AquariumFish.net.
P(w | “tropical fish” )
Example 2.
Tag Expansion Procedure
Use tag “tropical fish” as a query Q to find top-k results;
Select terms with the highest probability, e.g., terms “fish”, “tropical”, “aquariums”, “goldfish”, and “bowls”;
Q is expanded to Q’ = “fish, tropical, aquariums, goldfish, bowls”;
Search by using the enriched query Q’.
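The procedure above can be sketched as follows. The snippets are adapted from the search results shown in Example 2; the tokeniser, the stopword list, and the simple relative-frequency estimate of P(w | tag) are assumptions made for illustration, so the selected terms will not necessarily match the slide exactly.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "of", "and", "for", "in", "at", "this", "about", "since", "a"}

def expand_tag(snippets, k=5):
    """Estimate P(w | tag) from top-ranked snippets and return the k best terms."""
    counts = Counter()
    for snippet in snippets:
        tokens = re.findall(r"[a-z]+", snippet.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    total = sum(counts.values())
    probs = {w: c / total for w, c in counts.items()}
    # Highest probability first; ties are broken alphabetically.
    return sorted(probs, key=lambda w: (-probs[w], w))[:k]

# Top results for the tag "tropical fish", used as the query Q.
snippets = [
    "Age of Aquariums - Tropical Fish. Huge educational aquarium site for "
    "tropical fish hobbyists, promoting responsible fish keeping "
    "internationally since 1997.",
    "The Krib (Aquaria and Tropical Fish). This site contains information "
    "about tropical fish aquariums, including archived usenet postings "
    "and e-mail discussions.",
    "Keeping Tropical Fish and Goldfish in Aquariums, Fish Bowls, and "
    "Ponds at AquariumFish.net.",
]

expanded_query = expand_tag(snippets)
print(expanded_query)
```

The expanded term list can then be passed to any standard retrieval model in place of the original two-word tag.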
Issues in Searching Tags
Even with tag expansion, searching tags is challenging.
Tags are inherently noisy and sometimes incorrect.
Many items may not even be tagged!
Typically, it is easier to find popular items with many tags than less popular items with few/no tags.
Inferring Missing Tags
As we just described, items that have no tags pose a challenge to a search system.
How can we automatically tag items with few or no tags?
Uses of inferred tags
Improved tag search
Automatic tag suggestions
Methods for Inferring Tags
TF*IDF if items are textual, such as books, or news articles.
where f_{w,D} is the number of times term w (tag) occurs in item D, N is the total number of items, and df_w is the number of items in which term w occurs.
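A small sketch of TF*IDF-based tag inference for textual items follows. It assumes the weight wt(w) = log(f_{w,D} + 1) · log(N / df_w), which is one common TF*IDF variant and may not be the exact form used on the slide; the item texts are invented.

```python
import math
from collections import Counter

def infer_tags(items, item_id, k=3):
    """Suggest the k terms of an item with the highest TF*IDF weight."""
    n_items = len(items)                      # N
    df = Counter()                            # df_w
    for tokens in items.values():
        df.update(set(tokens))
    tf = Counter(items[item_id])              # f_{w,D}
    weights = {
        w: math.log(f + 1) * math.log(n_items / df[w])
        for w, f in tf.items()
    }
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]

# Hypothetical textual items (e.g., book descriptions).
items = {
    "book1": "tropical fish aquarium care fish tank".split(),
    "book2": "history of the roman empire".split(),
    "book3": "care of the garden in the autumn".split(),
}

suggested = infer_tags(items, "book1")
print(suggested)
```

Terms that are frequent in the item but rare across the collection rise to the top, which is why "care" (shared with another item) is not suggested here.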
Classification
Train binary classifier for each tag (use all of the existing tag/item pairs as training data to train the classifiers, and represent an item as a feature vector)
Performs well for popular tags, but not as well for rare tags.
Maximal marginal relevance
Finds tags that are relevant to the item and novel with respect to existing tags (or not very similar to any of the other tags), where t is tag, i is an item and Ti is the current set of tags for item i.
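Maximal marginal relevance can be sketched as a greedy selection. The scoring form assumed here is MMR(t; T_i) = λ · Sim_item(t, i) − (1 − λ) · max over t' in T_i of Sim_tag(t, t'); the similarity values, the value of λ, and all names are hypothetical.

```python
def mmr_select(candidates, sim_item, sim_tag, existing, k=2, lam=0.7):
    """Greedily pick k tags that are relevant to the item but not redundant."""
    current = list(existing)          # T_i, grows as tags are added
    chosen = []
    pool = set(candidates) - set(existing)

    def score(tag):
        redundancy = max((sim_tag(tag, other) for other in current), default=0.0)
        return lam * sim_item[tag] - (1 - lam) * redundancy

    while pool and len(chosen) < k:
        best = max(sorted(pool), key=score)   # sorted() makes ties deterministic
        chosen.append(best)
        current.append(best)
        pool.remove(best)
    return chosen

# Hypothetical similarity of each candidate tag to the item.
sim_item = {"fish": 0.9, "fishes": 0.85, "aquarium": 0.6, "car": 0.1}

# Hypothetical tag-to-tag similarities; unlisted pairs are treated as 0.
pair_sim = {frozenset(("fish", "fishes")): 0.95}

def sim_tag(a, b):
    return pair_sim.get(frozenset((a, b)), 0.0)

new_tags = mmr_select(sim_item.keys(), sim_item, sim_tag, existing=["tropical"])
print(new_tags)
```

Although "fishes" is the second most relevant candidate, it is nearly identical to the already chosen "fish", so the novelty term pushes "aquarium" ahead of it.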
Browsing and Tag Clouds
Search is useful for finding items of interest
Browsing is more useful for exploring collections of tagged items
Various ways to visualize collections of tags
Tag clouds
Alphabetical order
Grouped by category
Formatted/sorted according to popularity
animals architecture art australia autumn baby band barcelona beach berlin
birthday black blackandwhite blue california cameraphone canada canon
car cat chicago china christmas church city clouds color concert day dog
england europe family festival film florida flower flowers food
france friends fun garden germany girl graffiti green halloween hawaii
holiday home house india ireland italy japan july kids lake landscape light live
london macro me mexico music nature new newyork night
nikon nyc ocean paris park party people portrait red river rock
sanfrancisco scotland sea seattle show sky snow spain spring street
summer sunset taiwan texas thailand tokyo toronto travel
tree trees trip uk usa vacation washington water wedding
Example Tag Cloud
Searching within communities
Traditional search assumes single searcher
Collaborative search involves a group of users, with a common goal, searching together in a collaborative setting
Example scenarios
Students doing research for a history report
Family members searching for information on how to care for an aging relative
Team members working to gather information and requirements for an industrial project
An online community – Groups of entities that interact in an online environment and that share common goals, traits, or interests.
Collaborative Search
Two types of collaborative search settings depending on where participants are physically located
Co-located
Participants in same location
CoSearch system
Remote collaborative
Participants in different locations
SearchTogether system
Co-located Collaborative Searching
Remote Collaborative Searching
Collaborative Search Scenarios
Collaborative Search cont.
Challenges
How do users interact with system?
How do users interact with each other?
How is data shared?
What data persists across sessions?
Very few commercial collaborative search systems.
Likely to see more of this type of system in the future.
[Figure: a document stream matched against user profiles (Profile 1.1, Profile 2.1)]
3. Filtering and Recommender Systems
Static Filtering
Adaptive Filtering
Represents long term information needs
Can be represented in different ways
Boolean or keyword query
Sets of relevant and non-relevant documents
Relational constraints
“published before 1990”
“price in the $10-$25 range”
Actual representation usually depends on underlying filtering model
Can be static (static filtering) or updated over time (adaptive filtering)
Static Filtering
Given a fixed profile, how can we determine if an incoming document should be delivered?
Treat as information retrieval problem
Vector space
Language modeling
Treat as supervised learning problem
Naïve Bayes
Support vector machines
Static Filtering with Language Models
Assume profile consists of K relevant documents (Ti), each with weight αi
Probability of a word given the profile (here the variable P denotes a profile language model, and the estimate is smoothed):
KL divergence between profile and document model is used as score:
If –KL(P||D) ≥ θ, then deliver D to P (profile)
Threshold (θ) can be optimized for some metric
Please note that the equations used in the textbook do not strictly follow standard mathematical notation, because the symbol “P” is overloaded.
P is used both as a probability function and as a variable that represents a profile.
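The sketch below illustrates static filtering with language models under explicit assumptions: the profile model is a weighted average of the maximum-likelihood models of the K relevant documents, linearly interpolated with the collection model using a hypothetical smoothing weight `lam`; documents are smoothed the same way; and a document is delivered when −KL(P||D) ≥ θ. The exact estimate in the textbook may differ, and the data and threshold are invented.

```python
import math
from collections import Counter

def smoothed_model(weighted_texts, coll_counts, lam=0.5):
    """Weighted average of ML models, interpolated with the collection model."""
    coll_len = sum(coll_counts.values())
    total_weight = sum(weight for _, weight in weighted_texts)
    counted = [(Counter(toks), len(toks), weight) for toks, weight in weighted_texts]
    model = {}
    for term, c_w in coll_counts.items():
        ml = sum(wt * cnt[term] / n for cnt, n, wt in counted) / total_weight
        model[term] = (1 - lam) * ml + lam * c_w / coll_len
    return model

def neg_kl(profile_model, doc_model):
    """-KL(P||D); values closer to 0 mean the document fits the profile better."""
    return -sum(p * math.log(p / doc_model[w]) for w, p in profile_model.items())

# K = 2 relevant documents T_i, each with weight alpha_i = 1.0.
profile_docs = [(["tropical", "fish", "aquarium"], 1.0),
                (["fish", "tank", "care"], 1.0)]
incoming = {
    "d1": ["fish", "aquarium", "tank"],
    "d2": ["football", "match", "score"],
}

# Toy collection statistics built from every text in this example.
coll_counts = Counter()
for tokens, _ in profile_docs:
    coll_counts.update(tokens)
for tokens in incoming.values():
    coll_counts.update(tokens)

profile_model = smoothed_model(profile_docs, coll_counts)
theta = -0.3   # hypothetical threshold; in practice tuned for some metric
scores = {
    doc_id: neg_kl(profile_model, smoothed_model([(tokens, 1.0)], coll_counts))
    for doc_id, tokens in incoming.items()
}
delivered = [doc_id for doc_id, s in scores.items() if s >= theta]
print(delivered)
```

Smoothing with the collection model keeps every probability positive, so the logarithm in the KL divergence is always defined.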
Adaptive Filtering
In adaptive filtering, profiles are dynamic
How can profiles change?
User can explicitly update the profile
User can provide (relevance) feedback about the documents delivered to the profile
Implicit user behavior can be captured and used to update the profile
Adaptive Filtering Models
Profiles treated as vectors ( P’ is the adapted profile)
Relevance-based language models
Profiles treated as language models
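For the vector view of profiles, a Rocchio-style update is a common choice and is assumed here, since the slide's equation is not reproduced: P' = P + α · (mean of relevant document vectors) − γ · (mean of non-relevant document vectors). The coefficient values and the toy vectors are illustrative only.

```python
from collections import defaultdict

def rocchio_update(profile, rel_docs, nonrel_docs, alpha=0.75, gamma=0.25):
    """Move the profile towards relevant and away from non-relevant documents."""
    adapted = defaultdict(float, profile)
    for docs, coeff in ((rel_docs, alpha), (nonrel_docs, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                adapted[term] += coeff * weight / len(docs)
    return dict(adapted)

# Sparse term-weight vectors; all values are hypothetical.
profile = {"fish": 1.0}
rel_docs = [{"fish": 1.0, "aquarium": 1.0}]      # user marked as relevant
nonrel_docs = [{"fish": 1.0, "recipe": 1.0}]     # user marked as non-relevant

adapted_profile = rocchio_update(profile, rel_docs, nonrel_docs)
print(adapted_profile)
```

After the update, "aquarium" enters the profile with a positive weight while "recipe" receives a negative one, so future recipe documents score lower against the profile.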
Summary of Filtering Models
Fast Filtering with Millions of Profiles
Real filtering systems
May have thousands or even millions of profiles
Many new documents will enter the system daily
How to efficiently filter in such a system?
Most profiles are represented as text or a set of features
Build an inverted index for the profiles
Distill incoming documents as “queries” and run against index
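The idea of indexing the profiles rather than the documents can be sketched as below. The profiles, the overlap-count matching rule, and the `min_overlap` parameter are simplifying assumptions; a real system would apply its own scoring model to the candidate profiles.

```python
from collections import Counter, defaultdict

# Hypothetical keyword profiles.
profiles = {
    "p1": {"tropical", "fish"},
    "p2": {"football", "scores"},
    "p3": {"fish", "recipes"},
}

# Inverted index: term -> set of profile ids containing that term.
index = defaultdict(set)
for profile_id, terms in profiles.items():
    for term in terms:
        index[term].add(profile_id)

def match_profiles(doc_tokens, min_overlap=1):
    """Treat the document as a query and return profiles sharing enough terms."""
    hits = Counter()
    for term in set(doc_tokens):
        for profile_id in index.get(term, ()):
            hits[profile_id] += 1
    return {pid for pid, n in hits.items() if n >= min_overlap}

print(match_profiles(["fresh", "fish", "recipes"], min_overlap=2))
```

Only profiles that share at least one term with the document are ever touched, so the cost per document depends on the posting lists of its terms rather than on the total number of profiles.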
Evaluation of Filtering Systems
Definition of “good” depends on the purpose of the underlying filtering system
Generic filtering evaluation measure:
α = 2, β = 0, δ = -1, and γ = 0 is widely used
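The generic measure is assumed here to be the linear utility U = α · TP + β · TN + δ · FP + γ · FN, where TP, TN, FP and FN count true positives, true negatives, false positives and false negatives; the pairing of each coefficient with its count is an assumption. Under the widely used setting above it reduces to 2 · TP − FP. The counts in the example are invented.

```python
def filtering_utility(tp, tn, fp, fn, alpha=2, beta=0, delta=-1, gamma=0):
    """Linear utility: reward delivered relevant docs, penalise delivered non-relevant ones."""
    return alpha * tp + beta * tn + delta * fp + gamma * fn

# Hypothetical outcome for one profile: 10 relevant documents delivered,
# 4 non-relevant delivered, 6 relevant missed, 80 non-relevant correctly withheld.
utility = filtering_utility(tp=10, tn=80, fp=4, fn=6)
print(utility)
```

With these coefficients, missed relevant documents and correctly withheld documents do not affect the score, so the measure favours precision-oriented delivery.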
Recommender Systems
Recommender systems recommend items (e.g., products, books or movies) that a user may be interested in.
Systems such as Amazon.com and Netflix use collaborative filtering to recommend items to users.
Collaborative Filtering
In static and adaptive filtering, users and their profiles are assumed to be independent of each other.
However, in the real world, similar users are likely to have similar preferences.
Collaborative filtering exploits relationships between users to improve how items (documents) are