Data Mining: Concepts and Techniques
— Chapter 8 —
Qiang (Chan) Ye, Faculty of Computer Science, Dalhousie University
Chapter 8. Classification: Basic Concepts
n Classification: Basic Concepts
n Decision Tree Induction
n Bayes Classification Methods
n Rule-Based Classification
n Model Evaluation and Selection
n Summary
Bayes Classification: Why?
n A statistical classifier: it performs probabilistic prediction, i.e., it predicts class membership probabilities. It is based on Bayes’ Theorem (described on the next slide).
n Performance:
n A simple Bayesian classifier, naïve Bayesian classifier, leads to performance that is comparable to decision tree and selected neural network classifiers.
n Bayesian classifiers are highly accurate and fast when applied to large data sets.
Bayes Classification: Why?
n Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data.
n Class-conditional Independence (key assumption in naive Bayesian classifier):
n Naive Bayesian classifier assumes that the effect of an attribute value on a given class is independent of the effect of other attributes’ values. This assumption is called class-conditional independence.
Bayes’ Theorem
n Let X be a specific data tuple. In Bayesian terms, X is considered “evidence.”
n Let H be some hypothesis that the data tuple X belongs to a specified class C.
n For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or observed data tuple X.
n In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.
Bayes’ Theorem
n In Bayesian statistics, P(H|X) is the posterior probability of H, conditional on X.
n For example, suppose that:
n The data set is about customers described using the attributes age and income.
n X is a 35-year-old customer with an income of $40,000.
n H is the hypothesis that the customer will buy a computer.
n Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer’s age and income.
Bayes’ Theorem
n In Bayesian statistics, P(H) is the prior probability of H.
n With the previous example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter.
n The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.
n Similarly, P(X|H) is the posterior probability of X, conditional on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
Bayes’ Theorem
n P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.
n “How are these probabilities estimated?”
n P(H), P(X|H), and P(X) may be estimated using the given data set, as we shall see in the next example.
n Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), using P(H), P(X|H), and P(X).
n Theoretically, Bayes’ theorem can be stated as:
P(H|X) = P(X|H) P(H) / P(X)   (Eq. 8.10)
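n As a quick illustration of how the theorem combines the three quantities, here is a minimal Python sketch (not part of the original slides); the numeric values are purely hypothetical:
```python
# Minimal sketch of Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X).
# The numbers below are illustrative only, not taken from the slides.

def posterior(p_x_given_h: float, p_h: float, p_x: float) -> float:
    """Return P(H|X) computed from P(X|H), P(H), and P(X)."""
    return p_x_given_h * p_h / p_x

# Hypothetical example: P(X|H) = 0.2, P(H) = 0.5, P(X) = 0.25
print(posterior(0.2, 0.5, 0.25))  # -> 0.4
```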
Naive Bayesian Classification
The naive Bayesian classifier, or simple Bayesian classifier, works in the following manner:
n 1) Let D be a training set of tuples, each associated with a class label. As usual, each tuple is represented by an n-dimensional vector, X = (x1, x2, …, xn), whose elements correspond to the attributes A1, A2, …, An, respectively.
n 2) Suppose that there are m classes, C1, C2, …, Cm. Given a tuple X, the naive Bayesian classifier concludes that X belongs to the class with the highest posterior probability conditional on X.
n That is, the naive Bayesian classifier concludes that tuple X belongs to the class Ci if and only if:
P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i
Naive Bayesian Classification
n Therefore, we need to maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis. By Bayes’ theorem (Eq. 8.10), we have:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)   (Eq. 8.11)
n 3) As P(X) is a constant for all classes, to maximize P(Ci|X), we only need to maximize P(X|Ci)P(Ci).
n If the class prior probabilities P(Ci) are not known, it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = … = P(Cm), and we would therefore only need to maximize P(X|Ci).
n Otherwise, we maximize P(X|Ci)P(Ci). Note that the class prior probability could be estimated using P(Ci) = |Ci,D|/|D|, where |Ci,D| is the number of training tuples belonging to class Ci.
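n The prior estimate P(Ci) = |Ci,D|/|D| is simple to compute in practice. A minimal Python sketch, assuming the class labels of the training tuples are available as a plain list (the variable names are illustrative):
```python
from collections import Counter

def estimate_priors(labels):
    """Estimate P(Ci) = |Ci,D| / |D| from a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return {c: n / total for c, n in counts.items()}

# Illustrative labels: with 9 "yes" and 5 "no" this gives about 0.643 and 0.357.
labels = ["yes"] * 9 + ["no"] * 5
print(estimate_priors(labels))
```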
Naive Bayesian Classification
n 4) Given a data set with many attributes, it would be extremely computationally expensive to compute P(X|Ci).
n To reduce computation in evaluating P(X|Ci), the assumption of class-conditional independence is made.
n This presumes that the attributes’ values are conditionally independent of one another, given the class label of the tuple (i.e., that there is no dependence relationship between attributes).
n Therefore, we have:
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)   (Eq. 8.12)
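n A minimal sketch of Eq. (8.12), assuming the per-attribute conditional probabilities P(xk|Ci) have already been estimated and are passed in as a list (the function name and values are illustrative):
```python
import math

def class_conditional(per_attribute_probs):
    """Eq. (8.12): P(X|Ci) as the product of the individual P(xk|Ci) terms."""
    return math.prod(per_attribute_probs)

# Illustrative values for one class Ci:
print(class_conditional([0.222, 0.444, 0.667, 0.667]))  # about 0.044
```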
Naive Bayesian Classification
n We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training tuples.
n Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we need to know whether the attribute is categorical or continuous-valued.
n a) If Ak is categorical, then P(xk|Ci) is the ratio of the number of tuples that belong to class Ci and satisfy Ak = xk to |Ci,D|. Note that |Ci,D| is the number of training tuples belonging to class Ci.
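n A minimal sketch of this ratio for a categorical attribute, assuming the training set is stored as a list of (attribute-dictionary, label) pairs; the data layout and names are illustrative assumptions:
```python
def categorical_conditional(data, attribute, value, class_label):
    """P(Ak = xk | Ci): tuples of class Ci that satisfy Ak = xk, divided by |Ci,D|."""
    in_class = [x for x, y in data if y == class_label]
    matching = [x for x in in_class if x[attribute] == value]
    return len(matching) / len(in_class)

# Illustrative usage with a tiny hypothetical data set:
data = [({"student": "yes"}, "yes"), ({"student": "no"}, "yes"),
        ({"student": "yes"}, "no")]
print(categorical_conditional(data, "student", "yes", "yes"))  # 1/2 = 0.5
```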
Naive Bayesian Classification
b) If Ak is continuous-valued, we need to do a bit more work, but the calculation is straightforward. A continuous-valued attribute is typically assumed to follow a Gaussian (normal) distribution with mean 𝜇 and standard deviation 𝜎, defined by:
g(x, 𝜇, 𝜎) = (1 / (√(2π) 𝜎)) · exp( −(x − 𝜇)² / (2𝜎²) )   (Eq. 8.13)
so that P(xk|Ci) = g(xk, 𝜇Ci, 𝜎Ci).
We simply need to compute 𝜇Ci and 𝜎Ci, which are the mean (i.e., average) and standard deviation, respectively, of the values of attribute Ak for the training tuples of class Ci. We then plug these two quantities into Eq. (8.13), together with xk, to estimate P(xk|Ci).
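n A minimal Python sketch of the Gaussian density in Eq. (8.13); here mu and sigma would be the mean and standard deviation of attribute Ak over the training tuples of class Ci:
```python
import math

def gaussian(x, mu, sigma):
    """Eq. (8.13): value of the Gaussian density with mean mu and std sigma at x."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
```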
Naive Bayesian Classification
n For example:
n Let X = (35, $40,000), where A1 and A2 are the attributes age and income, respectively.
n Let the class label attribute be buys_computer.
n The associated class label for X is yes (i.e., buys_computer = yes).
n Let’s suppose that age has not been discretized and therefore exists as a continuous-valued attribute.
n Suppose that from the training set, we find that customers in D who buy a computer are 38 ± 12 years of age. In other words, for attribute age and this class, we have 𝜇 = 38 years and 𝜎= 12. We can plug these quantities, along with x1 = 35 for our tuple X, into Eq. (8.13) to estimate P(age = 35 | buys_computer = yes).
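n A self-contained Python sketch of this plug-in step, using the density value as the probability estimate as the slides do (the specific numeric result is only approximate):
```python
import math

# mean 38, standard deviation 12, query value x1 = 35 (values from the slide).
mu, sigma, x1 = 38.0, 12.0, 35.0
density = math.exp(-((x1 - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
print(density)  # roughly 0.032
```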
Naive Bayesian Classification
n 5) To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if:
P(X|Ci) P(Ci) > P(X|Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i
In other words, the predicted class label for X is the class Ci for which P(X|Ci)P(Ci) is the maximum.
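n A minimal end-to-end sketch of this prediction rule, assuming the priors P(Ci) and the per-attribute conditionals P(xk|Ci) have already been estimated and are stored in plain dictionaries (the data layout and names are illustrative, not from the slides):
```python
import math

def predict(x, priors, conditionals):
    """Return the class Ci that maximizes P(X|Ci) * P(Ci).

    x            : dict mapping attribute name -> value for the tuple X
    priors       : dict mapping class -> P(Ci)
    conditionals : dict mapping class -> {attribute: {value: P(xk|Ci)}}
    """
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        likelihood = math.prod(conditionals[c][a][v] for a, v in x.items())
        score = likelihood * prior
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Illustrative usage with the probabilities from the worked example that follows:
priors = {"yes": 0.643, "no": 0.357}
conditionals = {
    "yes": {"age": {"youth": 0.222}, "income": {"medium": 0.444},
            "student": {"yes": 0.667}, "credit_rating": {"fair": 0.667}},
    "no":  {"age": {"youth": 0.600}, "income": {"medium": 0.400},
            "student": {"yes": 0.200}, "credit_rating": {"fair": 0.400}},
}
x = {"age": "youth", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(predict(x, priors, conditionals))  # -> "yes"
```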
Naive Bayesian Classification: Example
n To illustrate how naïve Bayesian classification works, we use the same data set that was used for decision tree induction (i.e., Table 8.1).
n Specifically:
n The data tuples are described by the attributes age, income, student, and credit_rating.
n The class label attribute, buys_computer, has two distinct values (namely, {yes, no}).
n Let C1 correspond to the class buys_computer = yes and C2 correspond to the class buys_computer = no.
n The tuple we wish to classify is:
X = (age = youth, income = medium, student = yes, credit_rating = fair)
Naive Bayesian Classification: Example
n We need to maximize P(X|Ci)P(Ci), for i = 1, 2.
n P(Ci), the prior probability of each class, can be computed using the training tuples:
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
n To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
P(age = youth | buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400
Naive Bayesian Classification: Example
n Using these probabilities, we obtain:
P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
n Similarly, we have:
P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
n To find the class, Ci, that maximizes P(X|Ci)P(Ci), we compute:
P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X | buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007
n Therefore, the naive Bayesian classifier predicts buys_computer = yes for tuple X .
Naive Bayesian Classification: A Trick
n Recall that Eq. (8.12) says:
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
n What if we encounter probability values of zero?
n Using the previous example, what if there are no training tuples representing students for the class buys_computer = no, which results in P(student = yes | buys_computer = no) = 0?
n Plugging this zero value into Eq. (8.12) would return a zero probability for P(X|Ci).
Naive Bayesian Classification: A Trick
n There is a simple trick to avoid this problem.
n We can assume that our training data set, D, is so large that adding one to each count that we need would only make a negligible difference to the estimated probability value, yet would conveniently avoid the case of probability values of zero.
n This technique for probability estimation is known as the Laplacian correction or Laplace estimator.
n Note that if we need to modify the counts, we have to revise the denominator used in the probability calculation accordingly.
Naive Bayesian Classification: A Trick
n Suppose that for the class buys_computer = yes in some training database, D, containing 1000 tuples, we have:
n 0 tuples with income = low,
n 990 tuples with income = medium,
n 10 tuples with income = high.
n The probabilities of these events, without the Laplacian correction, are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000), respectively.
n Using the Laplacian correction for the three quantities, we pretend that we have 1 more tuple for each income-value pair.
Naive Bayesian Classification: A Trick
n In this way, we instead obtain the following probabilities (rounded to three decimal places): 1/1003 = 0.001 for income = low, 991/1003 = 0.988 for income = medium, and 11/1003 = 0.011 for income = high.
n The “corrected” probability estimates are close to the “uncorrected” original probabilities, yet the zero probability value is avoided.
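n A minimal Python sketch of the Laplacian correction for this example; the function name and data layout are illustrative. One extra tuple is pretended for each attribute value, so the denominator grows by the number of distinct values (three income values here):
```python
def laplace_corrected(count, class_total, num_values, add=1):
    """Laplacian correction: add `add` to the count and grow the denominator
    by `add` for every distinct attribute value."""
    return (count + add) / (class_total + add * num_values)

# The income example from the slides: counts 0, 990, 10 out of 1000 tuples.
for count in (0, 990, 10):
    print(round(laplace_corrected(count, 1000, num_values=3), 3))
# -> 0.001, 0.988, 0.011
```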