
CS699 Lecture 5 Classification 2


Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  – IF antecedent/precondition/condition THEN consequent/conclusion
• Assessment of a rule: coverage and accuracy
  – A tuple is covered by R if it satisfies the antecedent of R
  – ncovers = # of tuples covered by R
  – ncorrect = # of tuples correctly classified by R
  – coverage(R) = ncovers / |D|   /* D: training data set */
  – accuracy(R) = ncorrect / ncovers
  – Refer to Example 8.6, page 356

Using IF-THEN Rules for Classification
• Another example:
  R2: IF income = high AND student = no THEN buys_computer = no
• The training data set:

  age     income   student  credit_rating  buys_computer
  <=30    high     no       fair           no
  <=30    high     no       excellent      no
  31..40  high     no       fair           yes
  >40     medium   no       fair           yes
  >40     low      yes      fair           yes
  >40     low      yes      excellent      no
  31..40  low      yes      excellent      yes
  <=30    medium   no       fair           no
  <=30    low      yes      fair           yes
  >40     medium   yes      fair           yes
  <=30    medium   yes      excellent      yes
  31..40  medium   no       excellent      yes
  31..40  high     yes      fair           yes
  >40     medium   no       excellent      no

• R2 covers three tuples, so coverage(R2) = 3/14 = 21.43%
• Among these three tuples, two are correctly classified by R2, so accuracy(R2) = 2/3 = 66.67%
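To make the two measures concrete, here is a minimal Python sketch that reproduces the numbers above. The list-of-dicts encoding of the table and all variable names are illustrative choices, not part of the lecture.

```python
# Minimal sketch: coverage and accuracy of an IF-THEN rule over a training set.
# The dataset encoding (list of dicts) is an illustrative choice.

D = [  # the 14-tuple buys_computer training set from the slide
    {"age": "<=30",   "income": "high",   "student": "no",  "credit_rating": "fair",      "buys_computer": "no"},
    {"age": "<=30",   "income": "high",   "student": "no",  "credit_rating": "excellent", "buys_computer": "no"},
    {"age": "31..40", "income": "high",   "student": "no",  "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit_rating": "excellent", "buys_computer": "no"},
    {"age": "31..40", "income": "low",    "student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "<=30",   "income": "medium", "student": "no",  "credit_rating": "fair",      "buys_computer": "no"},
    {"age": "<=30",   "income": "low",    "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": "<=30",   "income": "medium", "student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "31..40", "income": "medium", "student": "no",  "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "31..40", "income": "high",   "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit_rating": "excellent", "buys_computer": "no"},
]

# R2: IF income = high AND student = no THEN buys_computer = no
antecedent = {"income": "high", "student": "no"}
class_attr, predicted = "buys_computer", "no"

covered = [t for t in D if all(t[a] == v for a, v in antecedent.items())]
correct = [t for t in covered if t[class_attr] == predicted]

print(f"coverage(R2) = {len(covered)}/{len(D)} = {len(covered)/len(D):.2%}")        # 3/14 = 21.43%
print(f"accuracy(R2) = {len(correct)}/{len(covered)} = {len(correct)/len(covered):.2%}")  # 2/3 = 66.67%
```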


Using IF-THEN Rules for Classification
• If more than one rule is triggered, we need conflict resolution:
  – Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
  – Class-based ordering: decreasing order of prevalence or misclassification cost per class
  – Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
• When a tuple is not covered by any rule, a default rule is used, e.g., one that assigns the majority class.
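One way to picture rule-based ordering with a default rule is the small decision-list sketch below. The rules and names are hypothetical, chosen only to show the first-match-wins behavior; an empty antecedent acts as the catch-all default.

```python
# Sketch of rule-based ordering (a decision list) with a default rule.
# Rules are (antecedent, class) pairs kept in priority order; the FIRST
# rule whose antecedent is satisfied fires, which resolves any conflicts.

decision_list = [
    ({"age": "31..40"}, "yes"),                            # highest-priority rule
    ({"student": "yes", "credit_rating": "fair"}, "yes"),  # next in priority
    ({}, "no"),  # default rule: empty antecedent matches every tuple
]

def classify(tuple_):
    for antecedent, label in decision_list:
        if all(tuple_.get(a) == v for a, v in antecedent.items()):
            return label

# No earlier rule covers this tuple, so the default rule assigns "no".
print(classify({"age": ">40", "student": "no", "credit_rating": "excellent"}))
```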

Rule Extraction from a Decision Tree
 One rule is created for each path from the root to a leaf
 Each attribute‐value pair along a path forms a conjunction (AND): the leaf holds the class prediction


Rule Extraction from a Decision Tree
• Example: rule extraction from our buys_computer decision tree

  [Decision tree figure: the root tests age?; the <=30 branch leads to student? (no gives class no, yes gives class yes), the 31..40 branch leads directly to leaf yes, and the >40 branch leads to credit rating? (excellent gives class no, fair gives class yes)]

  IF age <= 30 AND student = no              THEN buys_computer = no
  IF age <= 30 AND student = yes             THEN buys_computer = yes
  IF age in 31..40                           THEN buys_computer = yes
  IF age > 40 AND credit_rating = excellent  THEN buys_computer = no
  IF age > 40 AND credit_rating = fair       THEN buys_computer = yes
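The same path-to-rule procedure can be automated for a tree learned by a library. Below is a sketch using scikit-learn's `tree_` attribute; the toy numeric encoding of age/student is an assumption made for illustration, since sklearn trees split on numeric thresholds rather than categorical tests.

```python
# Sketch: turn each root-to-leaf path of a fitted scikit-learn decision
# tree into an IF-THEN rule. The toy data below is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[25, 0], [28, 1], [35, 0], [45, 1], [50, 0], [52, 1]])  # [age, student]
y = np.array(["no", "yes", "yes", "yes", "no", "yes"])                # buys_computer
feature_names = ["age", "student"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
tree = clf.tree_

def extract_rules(node=0, conds=()):
    # A leaf has no children; sklearn marks both child pointers identically (-1).
    if tree.children_left[node] == tree.children_right[node]:
        label = clf.classes_[np.argmax(tree.value[node])]
        antecedent = " AND ".join(conds) if conds else "TRUE"
        print(f"IF {antecedent} THEN class = {label}")
        return
    name, thr = feature_names[tree.feature[node]], tree.threshold[node]
    extract_rules(tree.children_left[node],  conds + (f"{name} <= {thr:.2f}",))
    extract_rules(tree.children_right[node], conds + (f"{name} > {thr:.2f}",))

extract_rules()  # one rule per leaf: mutually exclusive and exhaustive
```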

Rule Extraction from the Training Data
• Sequential covering algorithm: extracts rules directly from the training data
• Rules are learned sequentially
• Each rule for a given class will ideally cover many tuples of that class and none (or few) of the tuples of other classes
• There are many sequential covering algorithms
• General strategy (a code sketch follows the worked example below):
  – Rules are learned one at a time
  – Each time a rule is learned, the tuples covered by the rule are removed
  – The process repeats on the remaining tuples until a termination condition holds, e.g., there are no more training examples, or the quality of a rule returned falls below a user-specified threshold
• In contrast, decision-tree induction learns a set of rules simultaneously

Simple Covering Algorithm (Example)
• Example: learn rules for class buys_computer = yes from the 14-tuple training set shown earlier
• First, yes rules. Start with an empty antecedent:
  IF ? THEN buys_computer = yes
• Each candidate condition is scored by its accuracy (correctly classified / covered) on the training tuples:

  age <= 30        2/5      student = no               3/7
  age = 31..40     4/4      student = yes              6/7
  age > 40         3/5      credit_rating = fair       6/8
  income = low     3/4      credit_rating = excellent  3/6
  income = medium  4/6
  income = high    2/4

• age = 31..40 has the best accuracy, so a new rule is created:
  IF age = 31..40 THEN buys_computer = yes
• The four tuples covered by this rule are removed, and the same process is repeated for the remaining yes tuples. Once all yes tuples are covered, the same is done for no tuples.
• The 10 remaining tuples:

  age     income   student  credit_rating  buys_computer
  <=30    high     no       fair           no
  <=30    high     no       excellent      no
  >40     medium   no       fair           yes
  >40     low      yes      fair           yes
  >40     low      yes      excellent      no
  <=30    medium   no       fair           no
  <=30    low      yes      fair           yes
  >40     medium   yes      fair           yes
  <=30    medium   yes      excellent      yes
  >40     medium   no       excellent      no


Simple Covering Algorithm (Example)
• Candidate conditions are scored again, this time on the 10 remaining tuples:

  age <= 30        2/5      student = no               1/5
  age > 40         3/5      student = yes              4/5
  income = low     2/3      credit_rating = fair       4/6
  income = medium  3/5      credit_rating = excellent  1/4
  income = high    0/2

• student = yes now has the best accuracy (4/5), so it becomes the first condition of the next rule:
  IF student = yes THEN buys_computer = yes
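A compact sketch of this covering loop, reusing the list-of-dicts table `D` from the coverage/accuracy example earlier. Note this simplified variant stops each rule after a single condition, mirroring the scoring in the slides, whereas a full sequential covering algorithm may keep conjoining tests until the rule is pure; the function names are illustrative.

```python
# Sketch of the simple covering loop for one class, using the dataset D
# defined earlier and the greedy "best single condition" scoring above.

def best_condition(tuples, target_class, class_attr="buys_computer"):
    """Score every attribute = value condition by its accuracy on `tuples`."""
    best, best_acc = None, -1.0
    attrs = [a for a in tuples[0] if a != class_attr]
    for a in attrs:
        for v in {t[a] for t in tuples}:
            covered = [t for t in tuples if t[a] == v]
            acc = sum(t[class_attr] == target_class for t in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (a, v), acc
    return best, best_acc

def cover(tuples, target_class, class_attr="buys_computer"):
    rules, remaining = [], list(tuples)
    while any(t[class_attr] == target_class for t in remaining):
        (a, v), acc = best_condition(remaining, target_class, class_attr)
        rules.append((a, v))
        print(f"IF {a} = {v} THEN {class_attr} = {target_class}  (accuracy {acc:.2f})")
        remaining = [t for t in remaining if t[a] != v]  # drop covered tuples
    return rules

cover(D, "yes")  # first picks age = 31..40 (4/4), then student = yes (4/5), ...
```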

Numeric Prediction
• Prediction is different from classification
  – Classification refers to prediction of categorical class labels
  – Prediction models continuous-valued functions
• (Numerical) prediction is similar to classification
  – construct a model
  – use the model to predict continuous values
• Major method for prediction: regression
  – model the relationship between one or more independent or predictor variables and a dependent or response variable
• Regression analysis
  – Linear and multiple regression
  – Non-linear regression
  – Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression
• Linear regression: involves a response variable y and a single predictor variable x
  y = w0 + w1 x
  where w0 (y-intercept) and w1 (slope) are the regression coefficients
• Method of least squares: estimates the best-fitting straight line
• Multiple linear regression: involves more than one predictor variable
  – Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
  – Ex.: for 2-D data, we may have y = w0 + w1 x1 + w2 x2
  – Many nonlinear functions can be transformed into the above
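A minimal sketch of the least-squares estimates for the single-predictor case: w1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2) and w0 = y_bar - w1 * x_bar. The salary-style data here is illustrative.

```python
# Sketch: least-squares estimates for simple linear regression y = w0 + w1*x.
import numpy as np

# Illustrative training data: x = years of experience, y = salary (in $1000s)
x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])
y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0, 20.0, 83.0])

x_bar, y_bar = x.mean(), y.mean()
w1 = ((x - x_bar) @ (y - y_bar)) / ((x - x_bar) @ (x - x_bar))  # slope
w0 = y_bar - w1 * x_bar                                         # y-intercept

print(f"y = {w0:.2f} + {w1:.2f} x")        # for this data: y = 23.21 + 3.54 x
print("prediction for x = 10:", w0 + w1 * 10)
```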

Other Regression
• Nonlinear regression: some nonlinear models can be modeled by an explicit mathematical function
  Example: a polynomial regression model
  y = w0 + w1 x + w2 x^2 + w3 x^3
  (this becomes a multiple linear regression by defining new variables x1 = x, x2 = x^2, x3 = x^3)
• Other regression-based models
  – Logistic regression
  – Regression trees
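To see that transformation at work, the sketch below fits the cubic model as a multiple linear regression in the transformed variables x, x^2, x^3. The synthetic data and true coefficients are illustrative.

```python
# Sketch: polynomial regression y = w0 + w1*x + w2*x^2 + w3*x^3 solved as
# a multiple LINEAR regression in the transformed variables x, x^2, x^3.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.25 * x**3 + rng.normal(0, 0.3, x.size)  # noisy cubic

# Design matrix with columns [1, x, x^2, x^3]: the model is now linear in w.
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

print("estimated w0..w3:", np.round(w, 2))  # approximately [1.0, 2.0, -0.5, 0.25]
```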

Classification Example
• Adult
• Car evaluation
• Vote
• Credit screening
• CPU (numeric prediction)
• All from the UCI Machine Learning Repository

Adult Dataset
• This dataset was extracted from a census bureau database.
• 32561 tuples, 15 attributes
• Class attribute: whether a person earns >50K or <=50K
• Attributes: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, class


Car Evaluation Dataset
• The model evaluates cars according to the following concept structure:
  CAR: car acceptability
  – PRICE: overall price
    • buying: buying price
    • maint: price of the maintenance
  – TECH: technical characteristics
    • COMFORT: comfort
      – doors: number of doors
      – persons: capacity in terms of persons to carry
      – lug_boot: the size of luggage boot
    • safety: estimated safety of the car


Car Evaluation Dataset
• 1728 tuples, 7 attributes
• Class attribute: car acceptability, with values unacc, acc, good, vgood
• Attributes: buying, maint, doors, persons, lug_boot, safety


Vote Dataset
• This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes
• 435 tuples, 17 attributes
• Class attribute: Democrat or Republican
• Attributes: handicapped-infants, water-project-cost-sharing, adoption-of-the-budget-resolution, physician-fee-freeze, el-salvador-aid, religious-groups-in-schools, anti-satellite-test-ban, aid-to-nicaraguan-contras, mx-missile, immigration, synfuels-corporation-cutback, education-spending, superfund-right-to-sue, crime, duty-free-exports, export-administration-act-south-africa


Credit Screening Dataset
• This file concerns credit card applications. All attribute names and values have been changed to protect confidentiality of the data.
• 690 tuples, 16 attributes
• Class attribute: approved or denied


CPU Dataset
• CPU performance
• 209 tuples, 7 attributes
• Class attribute: relative performance (numeric)
• Attributes:
  – MYCT: cycle time
  – MMIN: minimum main memory (KB)
  – MMAX: maximum main memory (KB)
  – CACH: cache (KB)
  – CHMIN: minimum channels
  – CHMAX: maximum channels


Hepatitis Dataset
• Downloaded from the UCI Data Repository
• Includes information about 155 chronic hepatitis patients, with 19 measured variables
• 33 died and 122 survived
• The goal of the original research was to understand the effect of the measured variables on the chance of patient survival
• 155 tuples and 19 attributes


Hepatitis Dataset
• Initial dataset: hepatitis.csv
• Preprocessing: missing values, attribute types, attribute selection, etc.
• Attribute selection methods: CfsSubset, Correlation, GainRatio, InfoGain, ReliefF, SymmetricalUncert
• Classifiers: Naïve Bayes, Logistic regression, SVM, MLP, J48, RandomForest, KNN, Bagging, Boosting


References
• Han, J., Kamber, M., and Pei, J., "Data Mining: Concepts and Techniques," 3rd Ed., Morgan Kaufmann, 2012. http://www.cs.illinois.edu/~hanj/bk3/
• Witten, I. H., Frank, E., and Hall, M. A., "Data Mining: Practical Machine Learning Tools and Techniques," 3rd Ed., Morgan Kaufmann, 2011.