CS计算机代考程序代写 data mining decision tree Excel algorithm CS699 Lecture 7 Other Classifiers

CS699 Lecture 7 Other Classifiers

Ensemble Methods: Increasing the Accuracy
 Ensemble methods
 Use a combination of models to increase accuracy
 Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
 Popular ensemble methods
 Bagging: averaging the prediction over a collection of
classifiers
 Boosting: weighted vote with a collection of classifiers
 Ensemble: combining a set of heterogeneous classifiers
2

Bagging: Boostrap Aggregating
 Analogy:Diagnosisbasedonmultipledoctors’majorityvote  Training
 Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap).
 A classifier model Mi is learned for each training set Di (using the same classifier method,. e.g., decision tree).
3

Bagging: Boostrap Aggregating
 Classification: classify an unknown sample X
 Each classifier Mi returns its class prediction
 The bagged classifier M* counts the votes and assigns the class with the most votes to X
 Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
 Accuracy
 Often significantly better than a single classifier derived
from D
 For noise data: not considerably worse, more robust
 Proved improved accuracy in prediction
4

Bagging on Weka
 Under Meta classifiers
 Select dataset size per each iteration (bagSizePercent)  Select base classifier
5

 Similar to bagging:
 Sampling with replacement is used  The same classifier method is used.  Voting is used.
Boosting
 Builds and combines multiple classifier models.
6

previously built models.
Boosting
 Different from bagging:
 Models are built sequentially and a model is influenced by
 A new model “pays more attention” to tuples that were misclassified by previous models.
 Each training tuple has a weight.
 If a tuple is misclassified, the weight of the tuple is
increased.
 The probability that a tuple is sampled is based on the weight of the tuple (i.e., if a tuple was misclassified in a previous model, it is more likely to be sampled).
 When combining models, a model with a higher performance is given a higher weight.
7

Boosting on Weka
 AdaBoostM1 under Meta classifiers  Select base classifier
8

measure.
Class Imbalance Problem
 The main class of interest (the positive class) is represented by a very small fraction of dataset.
 Example: 10% positive tuples and 90% negative tuples
 As discussed earlier, an accuracy is not a good
 Other measures, such as sensitivity, specificity, ROC curves can be used to better understand the performance of class‐unbalanced data.
 We will discuss two approaches to improve classification accuracy on class‐unbalanced data.
9

Class Imbalance Problem
 Oversampling and undersampling
 Both approaches manipulate the data.
 Example: 100 positives and 1000 negatives
 Oversampling: sample positive tuples, with replacement, until the number of positive tuples becomes 1000.
 Undersampling: randomly remove tuples from negative tuples until the number of negative tuples becomes 100.
10

dataset and classifies the new tuple.
 Also called instance based learner.
 K‐nearest‐neighbor classifier (KNN):
Lazy Learner
 Does not build a model from the training dataset
 When a new tuple comes in, it uses the stored training
 The class label of a new tuple is determined by the class labels of k nearest neighbor tuples, by majority voting.
11

Lazy Learner
 KNN Example, with k = 3, Manhattan distance Training dataset
OID X Y CLASS 122Y 234N 342Y 455Y 563N
Classify an object
Objects 1, 2, and 3 are 3 nearest neighbors.
Y is the majority class among these 3 objects. So, the new object is classified as Y.
12

Classification by Backpropagation
 Backpropagation: A neural network learning algorithm
 Started by psychologists and neurobiologists to develop and
test computational analogues of neurons
 A neural network: A set of connected input/output units
where each connection has a weight associated with it
 During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
 Also referred to as connectionist learning due to the connections between units
13

Multilayer Feed-forward Neural Network
14

How A Multi-Layer Neural Network Works
 Theinputstothenetworkcorrespondtotheattributesmeasuredforeach training tuple
 Inputs are fed simultaneously into the units making up the input layer
 Theyarethenweightedandfedsimultaneouslytoahiddenlayer
 Thenumberofhiddenlayersisarbitrary,althoughusuallyonlyone
 Theweightedoutputsofthelasthiddenlayerareinputtounitsmakingup the output layer, which emits the network’s prediction
 Thenetworkisfeed‐forward:Noneoftheweightscyclesbacktoaninput unit or to an output unit of a previous layer
15

Neuron: A Hidden/Output Layer Unit
 Ann‐dimensionalinputvectorxismappedintovariableybymeansofthe scalar product and a nonlinear function mapping
 Theinputstounitareoutputsfromthepreviouslayer.Theyaremultipliedby their corresponding weights to form a weighted sum, which is added to the bias associated with unit. Then a nonlinear activation function is applied to it.
16

Neuron: A Hidden/Output Layer Unit
 Netinputtounitjis: Ij wijOi j i
 Output of unit j is:
Here, activation function is logistic, or sigmoid function.
Oj 1 1eIj
17

Defining a Network Topology
 Decide the network topology: Specify # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer
 Numeric attribute: Normalize the input values for each attribute measured in the training tuples to [0.0—1.0]
 Categorical attribute: One input unit per domain value, each initialized to 0
 Output, if for classification and more than two classes, one output unit per class is used
 Once a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
18

Backpropagation
 Iterativelyprocessasetoftrainingtuples&comparethenetwork’sprediction with the actual known target value
 Foreachtrainingtuple,theweightsaremodifiedtominimizethemean squared error between the network’s prediction and the actual target value
 Modificationsaremadeinthe“backwards”direction:fromtheoutputlayer, through each hidden layer down to the first hidden layer, hence “backpropagation”
 Steps
 Initialize weights to small random numbers, associated with biases  Propagate the inputs forward (by applying activation function)
 Backpropagate the error (by updating weights and biases)
 Terminating condition (when error is very small, etc.)
19

Backpropagation
20

 Weakness
Neural Network as a Classifier
 Long training time
 Require a number of parameters typically best determined empirically, e.g., the network topology or “structure.”
 Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network
 Strength
 High tolerance to noisy data
 Ability to classify untrained patterns
 Well‐suited for continuous‐valued inputs and outputs
 Successful on an array of real‐world data, e.g., hand‐written letters
 Algorithms are inherently parallel
 Techniques have recently been developed for the extraction of rules from trained neural networks
21

 Multiple regression
Logistic Regression
 Used when the dependent variable is numeric.
 Dependent variable is expressed as a linear combination of independent variables
 Y: dependent variable (numeric)
X1, X2, . . . Xn: independent variables
Y=b0 +b1X1 +b2X2 +…+bnXn
22

Logistic Regression
 Logistic regression is used when the dependent variable is nominal.
 Let Y be a nominal variable with value 1 or 0.
 Let p be the probability Y = 1
 P, instead of Y, is related to independent variables.
 More specifically, p is expressed as a nonlinear function of independent variables:
This function is called logistic function
􏰀􏰁􏰂􏰃􏰄􏰂􏰅􏰆􏰅􏰄 …􏰂􏰇􏰆􏰇􏰈
23

Logistic Regression
 To have a linear expression on the right‐hand side, we use measure called odds.
 The odds of a tuple belonging to class 1 (Y = 1) is the ratio of the probability of belonging to class 1 to the probability of belonging to class 0:
24

Logistic Regression
 Substitute p with the logistic function (with simplification):
 Take the natural logarithm on both sides:
 Log(odds) is called logit and its range is ‐∞ (very low odds or very low p) and ∞ (very high odds or very high p).
 logit = 0 means odds = 1 means p = 0.5
􏰁􏰂􏰃􏰄􏰂􏰅􏰆􏰅􏰄 …􏰂􏰇􏰆􏰇􏰈
011
25

Logistic Regression
 When a logistic regression model is fitted to a data, all coefficients (b’s) are estimated.
 Then, given independent variable values, we can calculate p.
 Example:
 A heart disease dataset (from UCI Machine
Learning Repository)
 14 attributes and 270 tuples
26

 Part of the dataset
Logistic Regression
 Class attribute: heart_disease 1: has heart disease
2: does not have heart disease
27

Logistic Regression
 Build logistic regression model (with Weka) using only age:
b0 = ‐2.8088, b1 = 0.0474
 Given the age of a person, we can calculate the
probability that person has heart disease:
 P(class=1|age=x)=
1 / (1 + e‐(‐2.8088 + 0.0474x)) = 1 / (1 + e2.8088 ‐ 0.0474x)
28

Logistic Regression
 If the age of a person is 72,
P(class = 1 | age = 72) = 1 / (1 + e2.8088 ‐ 0.0474*72)
= 0.646
So, the probability that the person has heart disease is 64.6%.
If classification threshold is 0.5, then the person is classified as having heart disease
29

Logistic Regression
 If we use two independent variables age and cholesterol, the parameters (by Weka’s logistic regression):
b0 = ‐3.408, b1 = 0.0437, b2 = 0.0032 The fitted model is:
P(class = 1 | age = X1, cholesterol = X2)
= 1 / (1 + e3.408 ‐ 0.0437X1 ‐ 0.0032X2)
30

Support Vector Machine
 SVM for linearly separable data
 Linear decision boundary
 Find a decision boundary, called hyperplane, with the largest margin.
 Tuples that fall on side hyperplanes are called support vectors.
31

Support Vector Machine
 SVM for linearly separable data
32

Support Vector Machine
 SVM for linearly inseparable data
 Some data are not linearly separable
 Original data is transformed to a higher dimensional space.
 Find a linearly separating hyperplane in the new
space.
A2
A1
33

Support Vector Machine
 Properties
 Relatively recent (1992)
 Highly accurate
 Especially for nonlinear boundary decisions  Less prone to overfitting
 Slow
34

Random Forest
 An ensemble method
 Multiple trees are built (so a forest)
 Each tree classifies a given sample
 Final classification is determined by voting
35

 Assume k trees are built.
Random Forest
 Outline of a simple random forest algorithm:
 From dataset D, we generate k bootstrap training set D1, D2, . . ., Dk.
 We build one tree from each training set.
 When we build a tree, the test attribute at each node is determined from a subset of attributes that are randomly selected from all attributes. All k trees are built in this manner.
 When an unseen object is classified, each tree classifies the object and the final class of the object is determined by a voting
36

Random Forest
 Properties
 Usually less susceptible to errors and outliers.
 Usually overfitting is not an issue.
 When determining the test attribute at each node, only a subset of attributes is considered. So, it can be faster, especially for a large dataset.
37

Learning Classifier System (LCS)
 Machine learning technique which combines
 Reinforcement learning
 Evolutionary computing
 Other heuristics to produce adaptive system
38

Learning Classifier System (LCS)
 Reinforcement learning
 Learning through trial and error via the reception of
a numerical reward
 Maximize future reward
39

Learning Classifier System (LCS)
 Evolutionary computing
 Search algorithm based on the mechanisms of
Natural selection and genetics
 Apply Darwin’s principle of the survival of the fittest among the computational structure
 Apply stochastic process of gene mutation, recombination, etc.
40

Genetic Algorithm
 A local search algorithm (searching for a solution to an optimization problem, for example)
 Based on evolutionary biology
 Next generation individuals are generated from a
population by  Selection
 Crossover  Mutation
41

Selection
Crossover
Mutation
Parents selected based on fitness function
Parents are crossed over to generate offsprings
A digit is randomly selected and randomly mutated
Genetic Algorithm
 An example (with digit strings)
Note: Selection, crossover, and mutation can be done in different ways.
42

 A learning classifier system
 A model is represented as a set of rules
 A rule has an antecedent and a consequent
 A rule is usually represented as a bit string
XCS
 Set of rules (classifier model) is generated based on a genetic algorithm
43

 Example of a rule representation (From the Buys_computer dataset)
XCS
 First, attribute values are encoded into binary values  Age has three values so three bits are used:
100: <= 30 010: 31..40 001: > 40
44

XCS
 Income has three values so three bits are used: 100: low
010: medium 001: high
 Student has two values so two bits are used: 10: no
01: yes
 Credit_rating has two values so two bits are used:
10: fair
01: excellent
45

 Then, the following rule
If age >40 AND Income = high AND Student = no
AND Credit_rating = excellent, Then buys_computer = no
Is represented as: 0010011001 10
 NOTE:Thereareotherrepresentations.
XCS
 Buys_computer (class label) has two values so two bits are used:
10: no 01: yes
46

 Simplified Schematic (as a classifier)
XCS
Note: 1. “action” is the class label 2. # matches either 0 or 1
47

 Example Population
Sample: 1101 10
match reward 1000
#10# 10 ##01 10 10## 10 1#0# 01 #01# 10 10#1 01
Match Set #10# 10 ##01 10 1#0# 01
compare action 10 is chosen
Action Set #10# 10 ##01 10
new offsprings inserted into population
Genetic algorithm is performed to generate new offsprings, if needed
XCS
48

 Example Population
Sample: 1011 01
mismatch reward 0
#10# 10 ##01 10 10## 10 1#0# 01 #01# 10 10#1 01
Match Set 10## 10 #01# 10 10#1 01
compare action 10 is chosen
Action Set #10# 10 ##01 10
new offsprings inserted into population
Genetic algorithm is performed to generate new offsprings, if needed
XCS
49

 Experimental result (preliminary)
adult
diabetes
breast cancer
breast cancer wisconsin cmc
ionosphere
iris
liver disorder lymphography
lung cancer
primary tumor
voting
credit approval(crx) balance‐scale heartDisease‐cleveland Average
79.94 87.58 99.57 56.62 92.00 98.00 83.19 52.03 80.48 53.05 93.07 88.26 78.88 66.67 79.41
data Set
Naïve‐Bayse J48 Logistic 83.48 86.22 85.18 76.30 73.83 77.21 71.68 75.53 68.88 95.99 94.56 96.57 49.29 53.22 51.60 82.62 91.45 88.89 94.00 94.00 93.33 55.36 68.70 68.12 37.84 36.49 28.38 78.13 78.13 75.00 54.98 47.91 48.23 75.53 77.34 75.53 77.68 86.09 85.22 23.84 32.80 26.72 55.45 55.45 57.76 67.48 70.11 68.44
ANN SVM 83.27 75.90 75.39 65.10 64.69 73.08 95.28 95.71 54.51 56.21 91.17 93.45 97.33 96.67 71.59 59.42 39.86 43.92 65.63 71.88 45.02 44.05 78.25 81.27 82.75 54.78 21.60 14.08 55.78 49.51 68.14 65.00
XCS 81.78
XCS
50

 LCS applications
 Data mining (besides classification)
 Optimization
 Shape optimization
XCS
 Extract compact descriptions of interesting phenomenon usually described by multidimensional data
 Predict next note in different type of music
 The profit optimization of a batch chemical reaction
51

 LCS applications
 Modeling
 Some categorization process
 Used in Computational Economics
 Control
 Control of traffic signal
 Control of robot arms
 Control of underwater vehicle
 Many others
XCS
52

• •
G.Shmueli,P.C.Bruce,M.L.Stephens,N.R.Patel,‘Data Mining for Business Analytics; Concepts, Techniques, and Applications with JMP PRO,” Wiley, 2017
53
References
• Han, J., Kamber, M., Pei, J., “Data mining: concepts and techniques,” 3rd Ed., Morgan Kaufmann, 2012
• http://www.cs.illinois.edu/~hanj/bk3/
Geneticalgorithm:S.RussellandP.Norvig,“Artificial intelligence a modern approach,” 3rd Ed., 2010, Prentice Hall, pp. 126 – 129.


LCS and XCS
54
References
– J.H. Holland, “Adaptation in natural and artificial systems,” 1975, University of Michigan Press.
– S.W. Wilson, “Classifier fitness based on accuracy,” Evolutionary Computation, 3, 1995, pp. 149‐175.
– S.W. Wilson, “Generalization in the XCS Classifier System,” Genetic Programming 1998, Proc. 3rd Annual Conference.
– M Butz and S.W. Wilson, “An Algorithmic Description of XCS,” IlliGAL Report No. 2000017, May 2000, Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana‐Champaign