Week 9 Lecture Review Question Solution
© Professor Yuefeng Li
Question 2
Let C = {c0, c1} be a set of classes and D be a training set, where each element of D is a pair (dj, cj): dj is a document and cj is its class. The training procedure of Naive Bayes classification for the input training set D can be found in the lecture notes. It uses Bayes' Rule to compute the probability P(ci|d) (i = 0 or 1) of each class ci for a given document d, and then assigns to d the class label ci with the highest computed probability.
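Concretely, Bayes' Rule gives P(ci|d) = P(ci)P(d|ci)/P(d). Since P(d) is the same for both classes, the multinomial model scores each class by P(ci) multiplied by the product of P(w|ci) over the words w of d, and assigns d to the class with the larger score.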
The training process is to calculate P(w|c0) and P(w|c1) for all terms in V (the vocabulary of terms in D). Assume V is a list of terms numbered from 0 to n-1, e.g., V = ['cheep', 'buy', 'banking', 'dinner', 'the'], a document is represented as a vector of term counts in the document, D is represented as a list of documents, and the class set C is represented as a list of the class labels of the documents. For example,
D = [[0,0,0,0,2],[3,0,1,0,1],[0,0,0,0,1],[2,0,3,0,2],[5,2,0,0,1],[0,0,1,0,1],[0,1,1,0,1],[0,0,0,0,1],[0,0,0,0,1],[1,1,0,1,2]]
C = [0,1,0,1,1,0,0,0,0,0]
where c1 (spam) and c0 (not spam) are labelled as 1 and 0 respectively.
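For instance, the second document D[1] = [3,0,1,0,1] contains 'cheep' three times, 'banking' once and 'the' once, and its label C[1] = 1 marks it as spam.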
(1) Define a Python function TRAIN_MULTINOMIAL_NB(C, D, V) to calculate the prior probabilities P(c) and the conditional probabilities P(wj|ci). Please note the function parameters are different from those of the Naive Bayes algorithm in the lecture notes.
# Define a function to calculate the total number of words in the documents
# that are labelled as class c in D
def n_words(c, C, D):
    n_w = 0  # the counter must be initialized (missing in the original)
    for i in range(len(C)):
        if C[i] == c:
            for j in range(len(D[i])):
                n_w = n_w + D[i][j]
    return n_w
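For the example data above (assuming D and C have been assigned as shown), a quick sanity check of n_words gives:
print(n_words(0, C, D))  # 15 words in total over the seven non-spam documents
print(n_words(1, C, D))  # 20 words in total over the three spam documents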
# The training algorithm, where we assume there are only two classes.
# It is different to the one defined in the lecture notes, and it is good to
# understand the differences. For example, here we assume the document
# parsing is done and the set of terms V is provided.
def TRAIN_MULTINOMIAL_NB(C, D, V):
    # initialize variables; tf and condprob must be two separate lists
    # (the original "condprob = tf = [...]" made them aliases of one object)
    N = len(D)
    prior = []
    tf = [[0 for w in range(len(V))] for c in range(2)]
    condprob = [[0 for w in range(len(V))] for c in range(2)]
    for c in range(2):
        Nc = sum(1 for i in range(len(C)) if C[i] == c)  # documents in class c
        prior.append(Nc / N)
        for w in range(len(V)):  # calculate tf(c,w)
            for d in range(len(D)):
                if C[d] == c:
                    tf[c][w] = tf[c][w] + D[d][w]
        nc_words = n_words(c, C, D)  # total number of words in class c
        for w in range(len(V)):  # add-one (Laplace) smoothing
            condprob[c][w] = (tf[c][w] + 1) / (nc_words + len(V))
    return (prior, condprob)
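As a hand check of the smoothed estimates: the three spam documents contain 20 words in total, and 'cheep' occurs 3 + 2 + 5 = 10 times in them, so P('cheep'|c1) = (10 + 1) / (20 + 5) = 0.44, which matches the condprob output printed in the test below.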
(2) Define a Python function APPLY_MULTINOMIAL_NB(V, prior, condprob, d) to assign a class label to document d, where d is represented as a list of words.
import math

def APPLY_MULTINOMIAL_NB(V, prior, condprob, d):
    # W is a dict of term:index pairs for the vocabulary terms occurring in d;
    # words of d that are not in V are simply ignored
    W = {V[i]: i for i in range(len(V)) if V[i] in d}
    # print(W)
    score = {}
    for c in range(2):
        score[c] = math.log(prior[c])
        for (w, i) in W.items():
            score[c] = score[c] + math.log(condprob[c][i])
    # return the label with the higher score (the original had no return)
    return 1 if score[1] > score[0] else 0
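Why sum logarithms instead of multiplying the probabilities directly? A long product of small probabilities underflows to 0.0 in floating point, while the sum of logs stays usable for comparing classes. A minimal illustration (not part of the question):
p = 1.0
for _ in range(1000):
    p *= 0.01
print(p)                      # 0.0 due to floating-point underflow
print(1000 * math.log(0.01))  # about -4605.17, still fine for comparisons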
# test the functions
# Assume the training set in Table 1 is represented as two Python lists as follows, where
# X_docs is the list of document vectors of documents in D, and
# y_class is the list of the corresponding classes of documents in D.
X_docs = [[0,0,0,0,2],[3,0,1,0,1],[0,0,0,0,1],[2,0,3,0,2],[5,2,0,0,1],[0,0,1,0,1],[0,1,1,0,1],[0,0,0,0,1],[0,0,0,0,1],[1,1,0,1,2]]
y_class = [0,1,0,1,1,0,0,0,0,0]
# We get 10 documents from D, and each document is a 5-dimensional vector.
# The classes are c1 (spam) and c0 (not spam), labelled as 1 and 0.
V = ['cheep', 'buy', 'banking', 'dinner', 'the']
# test Q2 (1)
(prior, condprob) = TRAIN_MULTINOMIAL_NB(y_class, X_docs, V)
print(prior)
print(condprob)
# test Q2 (2)
d1 = ['cheep', 'buy', 'in', 'banking', 'ABC']
d2 = ['the', 'dinner', 'is', 'cheap']
y1 = APPLY_MULTINOMIAL_NB(V, prior, condprob, d1)
y2 = APPLY_MULTINOMIAL_NB(V, prior, condprob, d2)
print(y1, y2)  # 1 0: d1 is labelled spam, d2 not spam
# Output of the Q2 (1) test:
[0.7, 0.3]
[[0.1, 0.15, 0.15, 0.1, 0.5], [0.44, 0.12, 0.2, 0.04, 0.2]]
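Reading this output: prior = [0.7, 0.3] because seven of the ten training documents are labelled 0 (not spam) and three are labelled 1 (spam), and each row of condprob lists the smoothed P(w|c) values for the five vocabulary terms in order.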
Question 3
Let given_topic = 'R101'. Design a Python function train_Rocchio(document_vectors, relevance_judgements, given_topic) to calculate the centroid of the relevant class (named 'R101') and the centroid of the non-relevant class (named 'nR101' in the question; the code below labels it '!R101').
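Here the centroid of a class is just the average of its document vectors: for each term t, centroid(c)[t] = (1/|Dc|) * sum of d[t] over the documents d in Dc, where Dc is the set of training documents judged to belong to c and a term missing from a document counts as weight 0.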
# Rocchio training functions
def is_belong_to(givendoc, topic_code, topic_docs):
    # the documents judged relevant to the topic are listed under key '1'
    class_docs = topic_docs[topic_code]['1']
    return givendoc in class_docs

def train_Rocchio(train_set, topic_set, pos_cls):
    """Calculate the centroids of the positive and negative classes.
    train_set: {docid:{term:tfidf}}
    topic_set: judgement for the whole given collection
    pos_cls: the topic code, e.g. 'R101' """
    pos_n = 0           # counters must be initialized (missing in the original)
    neg_n = 0
    dvec_mean = {}      # running sum, then mean, of the relevant vectors
    neg_dvec_mean = {}  # running sum, then mean, of the non-relevant vectors
    for (doc, dvec) in train_set.items():
        if is_belong_to(doc, pos_cls, topic_set):
            pos_n += 1
            for (term, score) in dvec.items():
                try:
                    dvec_mean[term] += float(score)
                except KeyError:
                    dvec_mean[term] = float(score)
        else:  # the original lost this else branch
            neg_n += 1
            for (term, score) in dvec.items():
                try:
                    neg_dvec_mean[term] += float(score)
                except KeyError:
                    neg_dvec_mean[term] = float(score)
    for t in list(neg_dvec_mean.keys()):
        neg_dvec_mean[t] /= float(neg_n)
    for t in list(dvec_mean.keys()):
        dvec_mean[t] /= float(pos_n)
    return {pos_cls: dvec_mean, ("!%s" % pos_cls): neg_dvec_mean}
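The question only asks for the training step, but to show how the returned centroids could be used, here is a minimal classification sketch; apply_Rocchio and its cosine-similarity scoring are illustrative additions of mine, not part of the lecture code.
import math

def apply_Rocchio(model, pos_cls, dvec):
    # Hypothetical helper: label a document vector with the class whose
    # centroid is most similar to it under cosine similarity.
    def cosine(u, v):
        dot = sum(u[t] * v[t] for t in set(u) & set(v))
        norm_u = math.sqrt(sum(x * x for x in u.values()))
        norm_v = math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
    pos_sim = cosine(model[pos_cls], dvec)
    neg_sim = cosine(model['!%s' % pos_cls], dvec)
    return pos_cls if pos_sim >= neg_sim else '!%s' % pos_cls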
# test in Python
#curr_idx = {'18586': {'quot': 0.053, 'passat': 0.144, 'saxoni': 0.222, 'new': 0.031, 'onli': 0.027, 'fix': 0.072, 'repurchas': 0.0727, 'agreement': 0.072}, '22513': {'iranian': 0.274, 'spi': 0.058, 'quot': 0.051}, '26642': {'peopl': 0.177, 'vessel': 0.146, 'capsiz': 0.170, 'nagav': 0.262, 'river': 0.170, 'southern': 0.127}, '26847': {'point': 0.122, 'index': 0.149, 'close': 0.132, 'stock': 0.141, 'trade': 0.067, 'share': 0.111}}
#judge_set = {'R101': {'0': ['6146', '18586', '22170', '22513', '26642', '26847', '27577', '30647', '61329', '61780', '77909', '80425', '80950', '81463', '82912', '83167'], '1': ['39496', '46547', '46974', '62325', '63261', '82330', '82454']}}
document_vectors = {
    '39496': {'VM': 0.17, 'US': 0.01, 'Spy': 0.20, 'Sale': 0.00, 'Man': 0.02, 'GM': 0.20, 'Espionag': 0.12, 'Econom': 0.11, 'Chief': 0.00, 'Bill': 0.00},
    '46547': {'VM': 0.10, 'US': 0.21, 'Spy': 0.10, 'Sale': 0.00, 'Man': 0.00, 'GM': 0.00, 'Espionag': 0.22, 'Econom': 0.20, 'Chief': 0.00, 'Bill': 0.10},
    '46974': {'VM': 0.00, 'US': 0.23, 'Spy': 0.10, 'Sale': 0.20, 'Man': 0.05, 'GM': 0.10, 'Espionag': 0.10, 'Econom': 0.10, 'Chief': 0.01, 'Bill': 0.00},
    '62325': {'VM': 0.17, 'US': 0.01, 'Spy': 0.20, 'Sale': 0.00, 'Man': 0.02, 'GM': 0.20, 'Espionag': 0.12, 'Econom': 0.11, 'Chief': 0.00, 'Bill': 0.00},
    '6146': {'VM': 0.10, 'US': 0.00, 'Spy': 0.00, 'Sale': 0.30, 'Man': 0.10, 'GM': 0.20, 'Espionag': 0.00, 'Econom': 0.12, 'Chief': 0.10, 'Bill': 0.00},
    '18586': {'VM': 0.00, 'US': 0.30, 'Spy': 0.00, 'Sale': 0.30, 'Man': 0.20, 'GM': 0.00, 'Espionag': 0.00, 'Econom': 0.20, 'Chief': 0.15, 'Bill': 0.20},
    '22170': {'VM': 0.20, 'US': 0.00, 'Spy': 0.00, 'Sale': 0.15, 'Man': 0.20, 'GM': 0.25, 'Espionag': 0.00, 'Econom': 0.00, 'Chief': 0.00, 'Bill': 0.10}}
relevance_judgements = {'R101': {'0': ['6146', '18586', '22170'], '1': ['39496', '46547', '46974', '62325']}}
given_topic = 'R101'
training_model = train_Rocchio(document_vectors, relevance_judgements, given_topic)
print(training_model)
# Output:
{'R101': {'VM': 0.11000000000000001, 'US': 0.115, 'Spy': 0.15000000000000002, 'Sale': 0.05, 'Man': 0.022500000000000003, 'GM': 0.125, 'Espionag': 0.13999999999999999, 'Econom': 0.13, 'Chief': 0.0025, 'Bill': 0.025}, '!R101': {'VM': 0.10000000000000002, 'US': 0.09999999999999999, 'Spy': 0.0, 'Sale': 0.25, 'Man': 0.16666666666666666, 'GM': 0.15, 'Espionag': 0.0, 'Econom': 0.10666666666666667, 'Chief': 0.08333333333333333, 'Bill': 0.10000000000000002}}
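As a quick hand check of this output: the four relevant documents have 'VM' weights 0.17, 0.10, 0.00 and 0.17, so the 'VM' entry of the R101 centroid is (0.17 + 0.10 + 0.00 + 0.17) / 4 = 0.11, matching the printed value up to floating-point noise.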