
Queensland University of Technology


Text, Web And Media Analytics
Text Classification


Classification introduction
Text classification application
Classifiers
Naïve Bayes Classifier
Vector Space Classification

Rocchio classification
Feature Selection
Evaluating Classifiers


1. Classification introduction
Classification (a.k.a. categorization) is the process of classifying big data into categories using classifiers learned from training samples.
A classification problem can be:
A “binary” classification problem if there are exactly two classes;
A “multi-class” problem if there are more than two classes and each data object (e.g., a document) falls into exactly one class; or
A “multi-label categorization” problem if a data object (e.g., a document) may have more than one associated category in a classification scheme.
The process of classification generally consists of feature selection, classification model (classifier) development, and classification evaluation.


Classification vs. Clustering
Machine Learning Methods
Supervised learning – infers a classifier (a function) from labelled training data.
Semi-supervised learning – starts from a small labelled training set and enlarges it by using some unlabelled data.
Unsupervised learning – learns patterns from unlabelled data.
Classification and clustering are classical pattern recognition / machine learning problems
Classification
Asks “what class does this item belong to?”
Supervised learning task
Clustering
Asks “how can I group this set of items?”
Unsupervised learning task
Items can be documents, queries, emails, entities, images, etc.
Useful for a wide variety of search engine tasks


Document Classification
Document classification assigns a document to one or more classes or categories.
The documents to be classified may be texts, images, music, etc.
When not otherwise specified, text classification is implied.
Documents may be classified according to their subjects (contents) or other attributes (meta-data such as document type, author, printing year, etc.).
The content-based approach
The class assigned to a document is based on the subjects (topics) in the document.
The request-based approach
The class assigned to a document is based on “relevance”, i.e., the anticipated requests from users influence how documents are classified. It is targeted towards a particular audience or user group.
The statistical approach to categorization
It uses machine learning methods to learn automatic classification rules from human-labelled training documents.


2. Text Classification Application
Classification is widely used for dealing with big data.
Example applications
Spam detection
Sentiment classification
Semantic classification of advertisements
Many others not covered here!


Spam, Spam, Spam
Classification is widely used to detect various types of spam
There are many types of spam
Adding links to message boards
Link exchange networks
Link farming
URL term spam
Phrase stitching


Spam Detection
Useful features
Formatting (invisible text, flashing, etc.)
Misspellings
IP address
Different features are useful for different spam detection tasks
Email and web page spam are by far the most widely studied, well understood, and easily detected types of spam
SpamAssassin is an intelligent email filter used to identify spam.

https://cwiki.apache.org/confluence/display/SPAMASSASSIN/SpamAssassin


Example Spam Assassin Output


Sentiment Classification
Blogs, online reviews, and forum posts are often opinionated
Sentiment classification attempts to automatically identify the polarity of the opinion
Negative opinion
Neutral opinion
Positive opinion
Sometimes the strength of the opinion is also important
“Two stars” vs. “four stars”
Weakly negative vs. strongly negative
Useful features

Part of speech tags
Adjectives


Classifying Online Ads
Unlike traditional search, online advertising goes beyond “topical relevance”.
A user searching for ‘tropical fish’ may also be interested in pet stores, local aquariums, or even scuba diving lessons.
These are semantically related, but not topically relevant!
As we mentioned before, it is possible to use standard information retrieval techniques to find these semantic matches for advertising;
It is also possible to use a classifier that maps queries (and web pages) into semantic classes.
We can bridge the semantic gap by classifying ads and queries according to a semantic hierarchy.


Example 1 (Semantic Class)
Semantic hierarchy ontology
Example: Pets / Aquariums / Supplies
Training data
Large number of queries and ads are manually classified into the hierarchy
Nearest neighbor classification has been shown to be effective for this task.
Hierarchical structure of classes can be used to improve classification accuracy

For example, we have a web page about rainbow fish (a type of tropical fish) and an advertisement for tropical fish food.
The web page is classified as “Aquariums – Fish”
The ad is classified as “Supplies – Fish”.
Here, “Aquariums” is the least common ancestor.
Although the web page and ad do not share any terms in common (except fish), they can be matched because of their semantic similarity.

Rainbow Fish Resources

Discount Tropical Fish Food
Feed your tropical fish a gourmet diet for just pennies a day!
www.cheapfishfood.com
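As a small sketch (not part of the original example) of the least common ancestor matching described above, the shared ancestor of the page's and the ad's classes can be found by comparing their category paths; the full paths below are assumed from the Pets / Aquariums / Supplies hierarchy:

# Sketch: match a page and an ad via the least common ancestor of their
# semantic classes. The category paths below are assumed for illustration.
def least_common_ancestor(path_a, path_b):
    """Return the longest shared prefix of two category paths."""
    shared = []
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared.append(a)
    return shared

page_class = ["Pets", "Aquariums", "Fish"]             # rainbow fish page
ad_class = ["Pets", "Aquariums", "Supplies", "Fish"]   # fish food ad
lca = least_common_ancestor(page_class, ad_class)
print(lca[-1])   # 'Aquariums' – the least common ancestor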


How to Classify?
How do humans classify items?
For example, suppose you had to classify the “healthiness” of a food
Identify set of features indicative of health
fat, cholesterol, sugar, sodium, etc.
Extract features from foods
Read nutritional facts, chemical analysis, etc.
Combine evidence from the features into a hypothesis
Add health features together to get “healthiness factor”
Finally, classify the item based on the evidence
If “healthiness factor” is above a certain value, then deem it healthy
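A toy sketch of this kind of manual scoring, with invented feature weights and threshold (none of the numbers come from the lecture):

# Toy rule-based "healthiness" classifier: combine feature evidence into a
# single score and threshold it. Weights and threshold are made up.
def healthiness_factor(food):
    # lower fat, sugar, sodium and cholesterol => higher (less negative) score
    return -(0.5 * food["fat_g"] + 0.3 * food["sugar_g"]
             + 0.1 * food["sodium_mg"] / 100 + 0.1 * food["cholesterol_mg"] / 10)

def classify_food(food, threshold=-10.0):
    return "healthy" if healthiness_factor(food) > threshold else "unhealthy"

apple = {"fat_g": 0.3, "sugar_g": 10, "sodium_mg": 1, "cholesterol_mg": 0}
burger = {"fat_g": 30, "sugar_g": 9, "sodium_mg": 1000, "cholesterol_mg": 90}
print(classify_food(apple), classify_food(burger))   # healthy unhealthy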


3. Classifiers
Let C = {c1, c2, . . . , cJ} be a fixed set of classes, and U be the document space (or the set of incoming documents).
Let D be a training set that consists of labelled documents <d, c>, where <d, c> ∈ U × C.
E.g., <d, c> = <“Beijing joins the World Trade Organization”, China>
A classifier (or classification function) maps documents to classes:
cf: U → C
This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process.


Naïve Bayes Classifier

Bayes’ rule

P(c|d) = P(d|c)P(c) / P(d)

The law of total probability

P(d) = Σ_{ci∈C} P(d | ci) P(ci)


Probability: Random Variables
Random variables are non-deterministic
Can be discrete (finite number of outcomes) or continuous
Model uncertainty in a variable
P(X = x) means “the probability that random variable X takes on value x”
Let X be the outcome of a coin toss
P(X = heads) = P(X = tails) = 0.5
Example: Y = 5 – 2X
If X is random, then Y is random
If X is deterministic then Y is also deterministic
Note: “Deterministic” just means P(X = x) = 1.0!


NB Classifier
Documents are classified according to (argmax means return the class c, out of classes C, that maximizes P(c | d)):

c(d) = argmax_{c∈C} P(c | d) = argmax_{c∈C} P(d | c) P(c) / P(d) = argmax_{c∈C} P(d | c) P(c)

since P(d) is the same for every class and can be ignored when comparing classes.
Must estimate P(d | c) and P(c)
P(c) is the probability of observing class c
P(d | c) is the probability that document d is observed given the class is known to be c

Bayes’ rule

P(c|d) = P(d|c)P(c) / P(d)

The law of total probability

P(d) = Σ_{c∈C} P(d | c) P(c)


Estimating P(c)
P(c) is the probability of observing class c
Estimated as the proportion of training documents in class c:

P(c) = Nc / N
Nc is the number of training documents with class label c
N is the total number of training documents
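A minimal Python sketch of this estimate (the labels here are hypothetical: 3 “spam” and 7 “not spam” documents):

# Estimate P(c) = Nc / N from a list of training labels (hypothetical data).
from collections import Counter

labels = ["spam", "not spam", "spam", "not spam", "not spam",
          "not spam", "spam", "not spam", "not spam", "not spam"]
counts = Counter(labels)                      # Nc for each class
N = len(labels)                               # total training documents
priors = {c: nc / N for c, nc in counts.items()}
print(priors)                                 # {'spam': 0.3, 'not spam': 0.7}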


Estimating P(d | c)
P(d | c) is the probability that document d is observed given the class is known to be c
Estimate depends on the event space used to represent the documents
What is an event space?
The set of all possible outcomes for a given random variable
For a coin toss random variable the event space is S = {heads, tails}


Multiple Bernoulli Event Space
Documents are represented as binary vectors
One entry for every word in the vocabulary
Entry i = 1 if word i occurs in the document and is 0 otherwise
Multiple Bernoulli distribution is a natural way to model distributions over binary vectors
Same event space as used in the classical probabilistic information retrieval model.


Example 2. Multiple Bernoulli Document Representation
D = {d1, d2, …, d10}, C = {c1, c2} = {‘spam’, ‘not spam’},
V= {w1, w2, …, w5} = {‘cheep’, ‘buy’, ‘banking’, ‘dinner’, ‘the’}
Nc1 = 3, Nc2 = 7. P(c1) = 3/10 = 0.30, P(c2) = 7/10 = 0.70


Multiple-Bernoulli: Estimating P(d | c)
P(d | c) is computed as:

P(d | c) = Π_{w∈V} P(w | c)^δ(w,d) · (1 − P(w | c))^(1−δ(w,d))

where δ(w, d) is 1 if and only if term w occurs in document d, otherwise it is 0.
Laplacian smoothed estimate:

Collection smoothed estimate:

P(w | c) = (df_{w,c} + μ · N_w / N) / (N_c + μ)

where df_{w,c} is the number of training documents with class label c in which term w occurs, N_w is the total number of training documents in which term w occurs, N is the total number of training documents, N_c is the number of training documents with class label c, and μ is a single tuneable parameter.

P(d1|c1) = (1-P(w1|c1))*(1-P(w2|c1))*(1-P(w3|c1))*(1-P(w4|c1))*P(w5|c1)

df_{w1,c1} = 1, N_{w1} = 4, N_{c1} = 3, N = 10; let μ = 0.5

Then P(w1|c1) = [1 + 0.5 · (4/10)] / [3 + 0.5] ≈ 0.34
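A short Python sketch of these multiple-Bernoulli estimates; the counts follow the worked example where they are given, and everything else (including the binary document vector) is assumed:

# Collection-smoothed multiple-Bernoulli estimate:
#   P(w|c) = (df_{w,c} + mu * N_w / N) / (N_c + mu)
import math

mu = 0.5
N = 10          # total training documents
N_c1 = 3        # documents labelled 'spam' (c1)
df_w1_c1 = 1    # spam documents containing w1 = 'cheep'
N_w1 = 4        # all training documents containing w1

p_w1_c1 = (df_w1_c1 + mu * N_w1 / N) / (N_c1 + mu)
print(round(p_w1_c1, 2))   # 0.34, as in the example above

def bernoulli_log_likelihood(doc_vector, term_probs):
    """log P(d|c) for a binary document vector and per-term P(w|c) values."""
    score = 0.0
    for present, p in zip(doc_vector, term_probs):
        score += math.log(p if present else (1.0 - p))
    return score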


Multinomial Event Space
Documents are represented as vectors of term frequencies
One entry for every word in the vocabulary
Entry i = number of times that term i occurs in the document
Multinomial distribution is a natural way to model distributions over frequency vectors
Same event space as used in the language modeling retrieval model


Example 3. Multinomial Document Representation
D = {d1, d2, …, d10}, C = {c1, c2} = {‘spam’, ‘not spam’},
C1 = {d2, d4, d5}, C2 = {d1, d3, d6, d7, …, d10}.
V= {w1, w2, …, w5} = {‘cheep’, ‘buy’, ‘banking’, ‘dinner’, ‘the’}
|c1| = 20, |c2| = 15; the total number of term occurrences in the training set is therefore 35 (and |V| = 5).
where |c| is the total number of term occurrences in training documents with class label c.


Multinomial: Estimating P(d | c)
P(d | c) is computed as:

P(d | c) ∝ Π_{w∈V} P(w | c)^tf_{w,d}

where tf_{w,d} is the number of times term w occurs in document d (the multinomial coefficient is the same for all classes and can be ignored).

Laplacian smoothed estimate:

P(w | c) = (tf_{w,c} + 1) / (|c| + |V|)
Collection smoothed estimate:

P(w | c) = (tf_{w,c} + μ · cf_w / |C_total|) / (|c| + μ)

where tf_{w,c} is the number of times that term w occurs in class c in the training set, |c| is the total number of term occurrences in training documents with class label c, cf_w is the number of times w occurs in the entire training set, |C_total| is the total number of term occurrences in the training set, and μ is a single tuneable parameter.

Please note: the maximum likelihood estimate for the multinomial model is very similar to that of the multiple-Bernoulli model. It is computed as:

P(w|c) = tf_{w,c} / |c|

e.g., P(w5 | c1) = P(the | spam) = tf_{w5,c1} / |c1| = 4 / 20

P(the | not spam) = 9 / 15


Actual NB implementation


Naive Bayes Algorithm
(multinomial model): Training

procedure TRAIN_MULTINOMIAL_NB(C, D)
1. V ← EXTRACT_VOCABULARY(D)
2. N ← COUNT_DOCS(D)
3. for each c ∈ C do {
4. Nc ← COUNT_DOCS_IN_CLASS(D, c)
5. prior[c] ← Nc/N
6. textc ← CONCATENATE_TEXT_OF_ALL_DOCS_IN_CLASS(D, c)
7. for each w ∈ V do
8. tfw,c ← COUNT_TOKENS_OF_TERM(textc, w)
9. for each w ∈ V do
10. condprob[w][c] ← (tfw,c + 1) / (|c| + |V|) }
11. return V, prior, condprob


Naive Bayes Algorithm
(multinomial model): Testing
procedure APPLY_MULTINOMIAL_NB(C,V, prior, condprob, d)
1. W ← EXTRACT_TOKENS_FROM_DOC(V, d)
2. for each c ∈ C do
3. score[c] ← log prior[c]
4. for each w ∈ W do
5. score[c] += log condprob[w][c]
6. return argmaxc∈C score[c]
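A runnable Python sketch that mirrors the two procedures above (Laplacian smoothing, log-space scoring); the toy training documents are invented:

# Multinomial Naive Bayes: training and testing, following the pseudocode.
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (tokens, class_label) pairs."""
    vocab = {w for tokens, _ in docs for w in tokens}
    N = len(docs)
    prior, condprob = {}, defaultdict(dict)
    for c in {label for _, label in docs}:
        class_docs = [tokens for tokens, label in docs if label == c]
        prior[c] = len(class_docs) / N
        tf = Counter(w for tokens in class_docs for w in tokens)
        total = sum(tf.values())                                # |c|
        for w in vocab:
            condprob[w][c] = (tf[w] + 1) / (total + len(vocab))  # Laplacian
    return vocab, prior, condprob

def apply_multinomial_nb(vocab, prior, condprob, tokens):
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in tokens:
            if w in vocab:                                      # skip unseen terms
                score += math.log(condprob[w][c])
        scores[c] = score
    return max(scores, key=scores.get)

train = [(["cheap", "buy", "buy"], "spam"),
         (["dinner", "the", "banking"], "not spam"),
         (["cheap", "banking", "buy"], "spam")]
model = train_multinomial_nb(train)
print(apply_multinomial_nb(*model, ["cheap", "buy", "dinner"]))  # -> 'spam'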


Vector Space Classification
Documents are represented by vectors of term weights.


Vector Space Classifiers
Rocchio classification
It divides the vector space into regions centered on centroids (or prototypes), one for each class, computed as the center of mass of all documents in the class.
Support Vector Machines
kNN (or k nearest neighbors)
For 1NN we assign each document to the class of its closest neighbor.
For kNN we assign each document to the majority class of its k closest neighbors where k is a parameter.
The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.


Rocchio classifier

Given a list of vectors in category “A” and a list of vectors in category “B”, we can average the vectors of each category to form its corresponding centroid (average vector).
Then, for any document d submitted to the system, we can determine whether d belongs to category “A” or “B” by checking which category’s centroid is closest to d.

The training procedure of Rocchio classification for the input training set D uses centroids to represent classes. The centroid of a class is computed as the average vector of class members.
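A minimal nearest-centroid sketch of Rocchio-style classification; the term-weight vectors are hypothetical, and Euclidean distance is used for simplicity:

# Rocchio-style classification: represent each class by the centroid of its
# training vectors and assign a new document to the nearest centroid.
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rocchio_classify(doc, centroids):
    return min(centroids, key=lambda c: euclidean(doc, centroids[c]))

# Hypothetical 3-term weight vectors for two classes.
train = {"A": [[1.0, 0.1, 0.0], [0.8, 0.2, 0.1]],
         "B": [[0.0, 0.9, 1.0], [0.1, 1.0, 0.8]]}
centroids = {c: centroid(vs) for c, vs in train.items()}
print(rocchio_classify([0.9, 0.2, 0.0], centroids))   # -> 'A'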


Support Vector Machines (SVM)
It is based on geometric principles.
Given a set of inputs labeled ‘+’ and ‘-’, find the “best” hyperplane that separates the ‘+’s and ‘-’s
How is “best” defined?
What if no hyperplane exists such that the ‘+’s and ‘-’s can be perfectly separated?


For two-class, separable training data sets, there are lots of possible linear separators.
Intuitively, a decision boundary drawn in the middle of the gap between the two classes seems better than the alternatives; but the middle of the training set may not be the middle of a test set.
An SVM looks for a decision surface that is maximally far away from any data point.
This distance from the decision surface to the closest data point determines the margin of the classifier.
The decision function for an SVM is fully specified by a (usually small) subset of the data which defines the position of the separator. These points are referred to as the support vectors.
The goal of an SVM is to maximize the margin.


Formalize an SVM


Formalize an SVM cont.


Functional Margin


Separable vs. Non-Separable Data

(Figures: linearly separable data vs. non-separable data)


Linear Separable Case
In math if b=0:

In English:
Find the largest margin hyperplane that separates the ‘+’s and ‘-’s
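One standard way to write this as an optimization problem (a sketch, using the usual notation in which training example x_i carries label y_i ∈ {+1, −1}):

\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\|\mathbf{w}\|^2
\quad \text{subject to} \quad
y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \ \text{ for all training examples } (\mathbf{x}_i, y_i).

Maximizing the margin 2/||w|| is equivalent to minimizing (1/2)||w||²; in the b = 0 case above, the constraint reduces to y_i (w · x_i) ≥ 1.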


Solving the SVM optimization problem is not straightforward
Many good software packages exist
Matlab SVM Toolbox
scikit-learn (Machine Learning in Python)


Example 4. scikit-learn: SVM
from sklearn import svm

X = [[0, 0], [1, 1], [2, 2], [-1, 1], [1.5, 2.5]]
y = [0, 1, 1, 0, 1]

clf = svm.SVC(gamma='scale')                      # kernel SVM (RBF by default)
clf.fit(X, y)
result1 = clf.predict([[2., 2.], [3., 3.], [-2., 0.]])
print("SVC Prediction Results: %s" % result1)
supp_v = clf.support_vectors_
print("Support Vectors: %s" % supp_v)

lin_clf = svm.LinearSVC()                         # linear SVM
lin_clf.fit(X, y)
result2 = lin_clf.predict([[2., 2.], [3., 3.], [-2., 0.]])
print("Linear Prediction Results: %s" % result2)
dec = lin_clf.decision_function([[2., 2.], [3., 3.], [-2., 0.]])
print("Linear decision function values: %s" % dec)


kNN training and testing
procedure TRAIN-KNN(C, D)
1. D′ ← PRE_PROCESS(D)
2. k ← SELECT-K(C, D′)
3. return D′, k

procedure APPLY-KNN(C, D′, k, d)
1. Sk ← COMPUTE_NEAREST_NEIGHBORS(D′, k, d)
2. for each cj ∈ C do
3. pj ← |Sk ∩ cj |/k
4. return argmaxj pj

Given an incoming document, the nearest neighbour classifier finds the training example that is nearest (according to some distance metric) to it.
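A small runnable sketch using scikit-learn’s kNN classifier (toy two-dimensional data; k = 3 is an arbitrary choice):

# kNN with scikit-learn: majority vote among the k closest training examples.
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]   # toy training vectors
y = [0, 0, 0, 1, 1, 1]                                  # their class labels
knn = KNeighborsClassifier(n_neighbors=3)               # k = 3
knn.fit(X, y)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))            # expected: [0 1]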


Example 5.
Nearest Neighbor Classification

The pluses + and minuses – indicate positive and negative training examples, respectively. The solid gray line shows the actual decision boundary, which is highly non-linear.
SVMs, even with a non-linear kernel, would have a difficult time fitting a model to this data. The nearest neighbor classifier, by contrast, handles such highly non-linear boundaries well; in fact, it can be shown that the nearest neighbor classifier is optimal as the number of training examples approaches infinity.


Classes of Classifiers
Types of classifiers
Generative (Naïve-Bayes)
Discriminative (Rocchio, SVMs)
Non-parametric (nearest neighbor)
Types of learning
Supervised (Naïve-Bayes, kNN, SVMs)
Semi-supervised (Rocchio + relevance models)

The Naïve Bayes classifier is an example of a wider class of probabilistic models called generative models. These models assume that some underlying probability distribution generates both documents and classes.

As the number of training examples grows, however, the power of the generative model can be limited by simplifying distributional assumptions, such as the independence assumption in the Naïve Bayes classifier. In such cases, discriminative models, such as Rocchio and SVM, often outperform generative models. Discriminative models are those that do not model the generative process of documents and classes. Instead, they directly model the class assignment problem given a document as input.

Non-parametric classifiers let the data “speak for itself ” by eliminating all distributional assumptions. One simple example of a nonparametric classifier is the nearest neighbor classifier. Given an incoming document, the nearest neighbour classifier finds the training example that is nearest (according to some distance metric) to it.


4. Feature Selection
Document classifiers can have a very large number of features, but NOT all features are useful. Also, excessive features can increase computational cost of training and testing.
Feature Selection Methods reduce the number of features by choosing the most useful features
Filters that select terms or patterns by ranking them with correlation coefficients.
They do not incorporate learning (e.g., tf*idf, BM25 weights).
Wrapper methods that assess subsets of features according to their usefulness to a given predictor.
They use a learning method to measure the quality of subsets of features (e.g., cross-validation) without incorporating knowledge about the specific structure of the classification or regression function.
Embedded methods implement feature selection during the model training.
They do not separate the learning from the feature selection part.
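A brief scikit-learn sketch of the filter-style selection described above, ranking term features with a chi-squared score (the documents and labels are invented):

# Filter-style feature selection: rank term features by a chi-squared score
# against the class labels and keep the top k.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap buy now", "meeting agenda today", "buy cheap pills",
        "project meeting notes"]
labels = [1, 0, 1, 0]                         # 1 = spam, 0 = not spam

X = CountVectorizer().fit_transform(docs)     # term-frequency features
selector = SelectKBest(chi2, k=3).fit(X, labels)
X_reduced = selector.transform(X)             # keep only the 3 best features
print(X_reduced.shape)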


Other Measures
Information gain is a commonly used feature selection measure based on information theory
It tells how much “information” about the class label is gained (i.e., how much uncertainty about the class is reduced) by knowing whether a term occurs in a document.
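A small sketch of computing information gain for one binary term feature from hypothetical document counts:

# Information gain for a binary term feature w:
#   IG(w) = H(C) - P(w present) H(C | w present) - P(w absent) H(C | w absent)
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical counts of documents per (class, term-present) combination.
n_spam_with, n_spam_without = 8, 2
n_ham_with, n_ham_without = 1, 9
N = n_spam_with + n_spam_without + n_ham_with + n_ham_without

p_with = (n_spam_with + n_ham_with) / N
p_without = 1 - p_with
h_class = entropy([(n_spam_with + n_spam_without) / N,
                   (n_ham_with + n_ham_without) / N])
h_with = entropy([n_spam_with / (n_spam_with + n_ham_with),
                  n_ham_with / (n_spam_with + n_ham_with)])
h_without = entropy([n_spam_without / (n_spam_without + n_ham_without),
                     n_ham_without / (n_spam_without + n_ham_without)])
ig = h_class - p_with * h_with - p_without * h_without
print(round(ig, 3))   # higher means the term is more informative about the class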
