Document Classification 1: Scenarios
This time:
Classification and NLP
Document Classification Scenarios:
Sentiment Analysis
Topic Relevance Detection
Document Filtering
Technology for Document Classification
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015

Classification Tasks
Many NLP tasks can be viewed as document classification:
document → classifier → class A, class B or class C

Document Classification Scenarios
Example cases we will consider here:
Sentiment analyser: determining whether a document expresses positive, negative or neutral sentiment about some entity
Topic Relevance tester: determining whether a document concerns a particular topic or topics that are of interest
Document filter: determining whether a document conforms to some notion of “acceptability”

Case 1: Sentiment Analysis
The analysis of the opinion being expressed in text
Is the Guardian’s review of last night’s U2 concert positive or negative?
Opinion being expressed about some entity
person, organisation, work of art, product, event, place

Sentiment Analysis
Opinion monitoring
What are people saying about a brand, a new product, a media campaign?
Opinion mining
Identifying & analysing expressions of sentiment in a collection of documents

Social Media Monitoring
Expert-sourced opinion
Newspaper’s political/art/technology/science/business correspondent
Crowd-sourced opinion
weblogs, microblogs, internet forums, customer reviews

Expert-Sourced versus Crowd-Sourced Analysis
Expert-sourced analysis:
Limited quantities of data available
Comparatively long documents
Text expresses highly nuanced semantics
Opinion expressed in each individual document has high value
Crowd-sourced analysis:
Huge quantities of data available
Comparatively short documents
Message is more explicit
Individual sources less reliable

Promise of Crowd-Sourced Analysis
Good fit with current language engineering technology
Suits shallow linguistic analysis
Exploits massive increases in computing power
Leverages the wisdom of the crowd
Aggregate text sentiment: approximate analysis on a large scale can be effective
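A minimal sketch of the aggregation idea, assuming each document has already been given a label by some classifier. The function name `aggregate_sentiment` and the sample labels are invented for illustration; the point is only that the overall proportions can be informative even when individual labels are noisy.

```python
from collections import Counter

def aggregate_sentiment(labels):
    """Return the share of each sentiment label across a collection of documents."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}

# Hypothetical per-document predictions (made-up data).
predicted = ["positive", "positive", "negative", "positive", "neutral"]
shares = aggregate_sentiment(predicted)
print(shares)
```

With many short documents, a summary like this is often what opinion monitoring needs, rather than a trustworthy verdict on any single message.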

Sentiment Analysis as a Classification Task
Binary Classification:
document → classifier → negative or positive

Sentiment Analysis as a Classification Task
Three-way Classification:
document → classifier → negative, neutral or positive

Issues
Is it appropriate to make one decision per document?
Is the document’s function an expression of sentiment?
a film review versus a report describing breaking news
Is the sentiment about a single identified entity?
a film review versus a review of latest film releases
Even a specific product review can include expressions of sentiment about other entities
the HTC One Max doesn’t quite measure up to the Galaxy Note 3

Beyond Classification
In addition to classification, an identification process may be needed:
To identify the parts of a document where sentiment is expressed
To identify the entities that are the subject of expressions of sentiment
To identify the specific characteristics of entities that are being evaluated
Example: the iPad’s battery life is particularly impressive
entity: the iPad; feature: battery life; quality: particularly impressive
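The entity / feature / quality breakdown of the iPad example can be written down as a small record. The sketch below is only illustrative: the `Opinion` class, the `extract_opinion` helper and the regular expression are all assumptions introduced here, and the pattern handles only sentences shaped like "the X's Y is Z". Real identification needs much more robust linguistic analysis.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Opinion:
    entity: str    # what the opinion is about
    feature: str   # the characteristic being evaluated
    quality: str   # the evaluation itself

# Toy pattern for "the X's Y is Z" (hypothetical; accepts ' or ’ as apostrophe).
PATTERN = re.compile(
    r"the (?P<entity>[\w ]+?)['’]s (?P<feature>[\w ]+?) is (?P<quality>.+)",
    re.IGNORECASE,
)

def extract_opinion(sentence: str) -> Optional[Opinion]:
    """Return an Opinion if the sentence fits the toy pattern, else None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return Opinion(match["entity"], match["feature"], match["quality"].rstrip(". "))

opinion = extract_opinion("the iPad’s battery life is particularly impressive")
print(opinion)
```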

Case 2: Topic Relevance Tester
Example scenarios:
Search corpus of recently published scientific articles for documents discussing some particular topic
Search all news articles published within last 24 hours for documents mentioning the CEO of Volkswagen Group
Monitor Twitter Firehose in search of tweets that mention X-Factor contestants

Topic Relevance Testing as a Classification Task
document → classifier → relevant or not relevant

Technical Issues to Address
Specifying what makes a document topically relevant: learn from positive and negative examples
Adapting to the dynamics of the data source: in some scenarios, discussion of topics changes significantly over time
Separating the wheat from the chaff: often only a very small proportion of the data is actually relevant
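The "wheat from the chaff" issue has a practical consequence for evaluation, sketched below under made-up numbers (10 relevant documents out of 1000). The helper `precision_recall` is a hypothetical name, not something from the course. A classifier that never says "relevant" scores high accuracy but finds nothing, which is why precision and recall on the relevant class are usually more telling than accuracy.

```python
def precision_recall(gold, predicted, target="relevant"):
    """Precision and recall for the target class; 0.0 where undefined."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up data: only 1% of the documents are relevant.
gold = ["relevant"] * 10 + ["not relevant"] * 990
always_no = ["not relevant"] * 1000   # a classifier that rejects everything

accuracy = sum(g == p for g, p in zip(gold, always_no)) / len(gold)
precision, recall = precision_recall(gold, always_no)
print(accuracy, precision, recall)
```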

Case 3: Document Filter
Example scenarios:
Filtering email: spam filter
Identifying breaking news
Filtering of user-generated content

Document Filtering as a Classification Task
document → classifier → acceptable or not acceptable

Document Filtering as a Classification Task
document → classifier → acceptable, problem A or problem B

Technology for Document Classification
Two things to consider:
Methods for classification
Linguistic preprocessing of document

Method of Classification
We will illustrate with respect to the sentiment analysis problem
Consider a number of approaches in the next two lectures:
Hand-crafted sentiment-bearing vocabulary lists
Computed sentiment-bearing vocabulary lists
Naïve Bayes classifier
Also applicable to other document classification tasks
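To give a flavour of the last approach, here is a rough sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. It is a generic textbook formulation, not necessarily the one developed in the following lectures; the function names and the four-document training set are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """labelled_docs: iterable of (list_of_tokens, label) pairs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # token counts per class
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def classify(tokens, model):
    """Return the class with the highest log-probability score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        # log P(class) + sum of log P(word | class), add-one smoothed
        score = math.log(doc_count / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # skip words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny made-up training set.
training = [
    ("a great and moving film".split(), "positive"),
    ("great acting great script".split(), "positive"),
    ("a dull and boring film".split(), "negative"),
    ("boring plot awful script".split(), "negative"),
]
model = train_naive_bayes(training)
print(classify("a great script".split(), model))
```

Nothing in the code is specific to sentiment: swapping the labels for "relevant" / "not relevant" or "acceptable" / "not acceptable" gives a classifier for the other two scenarios.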

Background Reading
Overview of the state-of-the-art:
Bo Pang and Lillian Lee (2008) Opinion Mining and Sentiment Analysis, Foundations and Trends in Information Retrieval, 2(1–2), 1–135.
Available online:
http://www.cs.cornell.edu/home/llee/omsa/omsa.pdf

Next Topic: Wordlists for Classification
Words as features
Using wordlists for classification
Document scoring and decision criteria
Handcrafted v. automatically derived lists
The ML approach
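As a preview of what document scoring with a decision criterion might look like, here is a small sketch. The two wordlists, the `margin` parameter and the function names are assumptions made up for this example; a real handcrafted or derived list would be far larger.

```python
# Illustrative wordlists only; not taken from any real lexicon.
POSITIVE_WORDS = {"good", "great", "impressive", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "boring", "awful"}

def score_document(text):
    """Score = number of positive words minus number of negative words."""
    tokens = [t.strip(".,!?;:") for t in text.lower().split()]
    pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    return pos - neg

def classify_sentiment(text, margin=0):
    """Decision criterion: beyond +/- margin is positive/negative, else neutral."""
    score = score_document(text)
    if score > margin:
        return "positive"
    if score < -margin:
        return "negative"
    return "neutral"

print(classify_sentiment("The battery life is impressive and the screen is great."))
```

Raising `margin` makes the classifier more cautious, sending more documents to the neutral class.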