CS699 Lecture 7 Other Classifiers
Ensemble Methods: Increasing the Accuracy
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
Popular ensemble methods
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregating
Analogy: diagnosis based on multiple doctors' majority vote
Training
Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap).
A classifier model Mi is learned for each training set Di (using the same classifier method, e.g., decision tree).
Bagging: Bootstrap Aggregating
Classification: classify an unknown sample X
Each classifier Mi returns its class prediction
The bagged classifier M* counts the votes and assigns the class with the most votes to X
Prediction: can be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
For noisy data: not considerably worse, more robust
Proven to improve accuracy in prediction
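The following short Python sketch illustrates the bagging procedure (an illustration, not Weka's implementation; it assumes scikit-learn's DecisionTreeClassifier as the base classifier, NumPy arrays X and y, and integer class labels):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=10, seed=0):
    # Build k models, each trained on a bootstrap sample of D (sampled with replacement).
    rng = np.random.default_rng(seed)
    models = []
    d = len(X)
    for _ in range(k):
        idx = rng.integers(0, d, size=d)              # bootstrap training set Di of d tuples
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X_new):
    # Each classifier votes; the class with the most votes is assigned to each tuple.
    votes = np.array([m.predict(X_new) for m in models]).astype(int)   # integer labels assumed
    return np.array([np.bincount(col).argmax() for col in votes.T])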
Bagging on Weka
Under Meta classifiers
Select dataset size per iteration (bagSizePercent)
Select base classifier
Boosting
Builds and combines multiple classifier models.
Similar to bagging:
Sampling with replacement is used.
The same classifier method is used.
Voting is used.
Boosting
Different from bagging:
Models are built sequentially and a model is influenced by previously built models.
A new model “pays more attention” to tuples that were misclassified by previous models.
Each training tuple has a weight.
If a tuple is misclassified, the weight of the tuple is increased.
The probability that a tuple is sampled is based on the weight of the tuple (i.e., if a tuple was misclassified in a previous model, it is more likely to be sampled).
When combining models, a model with a higher performance is given a higher weight.
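A minimal AdaBoost-style sketch of this reweighting idea (illustrative only, not Weka's AdaBoostM1; it assumes scikit-learn decision stumps as base models and class labels coded as -1/+1, and it passes the weights directly to the learner instead of resampling by weight):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10):
    # y must be coded as -1 / +1 for this sketch.
    n = len(X)
    w = np.full(n, 1.0 / n)                       # each training tuple starts with equal weight
    models, alphas = [], []
    for _ in range(k):
        m = DecisionTreeClassifier(max_depth=1)   # a decision "stump" as the weak learner
        m.fit(X, y, sample_weight=w)
        pred = m.predict(X)
        err = max(np.sum(w[pred != y]), 1e-10)    # weighted error of this model
        if err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)     # better models get a higher weight
        w *= np.exp(-alpha * y * pred)            # misclassified tuples get larger weights
        w /= w.sum()
        models.append(m)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X_new):
    # Weighted vote of all models.
    score = sum(a * m.predict(X_new) for m, a in zip(models, alphas))
    return np.sign(score)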
Boosting on Weka
AdaBoostM1 under Meta classifiers
Select base classifier
Class Imbalance Problem
The main class of interest (the positive class) is represented by a very small fraction of the dataset.
Example: 10% positive tuples and 90% negative tuples
As discussed earlier, accuracy is not a good measure.
Other measures, such as sensitivity, specificity, and ROC curves, can be used to better understand the performance on class-imbalanced data.
We will discuss two approaches to improve classification accuracy on class-imbalanced data.
Class Imbalance Problem
Oversampling and undersampling
Both approaches manipulate the data.
Example: 100 positives and 1000 negatives
Oversampling: sample positive tuples, with replacement, until the number of positive tuples becomes 1000.
Undersampling: randomly remove tuples from negative tuples until the number of negative tuples becomes 100.
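A toy NumPy sketch of both approaches (the arrays X and y and the label coding are assumptions; in practice a dedicated library such as imbalanced-learn is often used):

import numpy as np

def rebalance(X, y, positive=1, mode="over", seed=0):
    # Equalize class counts by over- or undersampling (see the 100 / 1000 example above).
    rng = np.random.default_rng(seed)
    pos = np.where(y == positive)[0]
    neg = np.where(y != positive)[0]
    if mode == "over":
        # Sample positive tuples with replacement until they match the negatives.
        pos = rng.choice(pos, size=len(neg), replace=True)
    else:
        # Randomly remove negative tuples until they match the positives.
        neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]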
Lazy Learner
Does not build a model from the training dataset
When a new tuple comes in, it uses the stored training dataset and classifies the new tuple.
Also called an instance-based learner.
K-nearest-neighbor classifier (KNN):
The class label of a new tuple is determined by the class labels of its k nearest neighbor tuples, by majority voting.
Lazy Learner
KNN Example, with k = 3, Manhattan distance
Training dataset:
OID  X  Y  CLASS
1    2  2  Y
2    3  4  N
3    4  2  Y
4    5  5  Y
5    6  3  N
Classify an object
Objects 1, 2, and 3 are the 3 nearest neighbors.
Y is the majority class among these 3 objects. So, the new object is classified as Y.
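A small sketch of this example in plain Python. The query point (3, 3) is an assumption chosen for illustration, since the new object's coordinates appear only in the slide's figure; with it, objects 2, 1, and 3 are the nearest neighbors, as above:

from collections import Counter

# Training dataset as (OID, (X, Y), CLASS) tuples.
train = [(1, (2, 2), "Y"), (2, (3, 4), "N"), (3, (4, 2), "Y"),
         (4, (5, 5), "Y"), (5, (6, 3), "N")]

def knn_classify(query, data, k=3):
    # Sort training tuples by Manhattan distance to the query, then take a majority vote of the k nearest.
    manhattan = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
    nearest = sorted(data, key=lambda t: manhattan(t[1], query))[:k]
    votes = Counter(t[2] for t in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((3, 3), train))   # assumed query point; nearest objects are 2, 1, 3 -> "Y"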
Classification by Backpropagation
Backpropagation: A neural network learning algorithm
Started by psychologists and neurobiologists to develop and test computational analogues of neurons
A neural network: a set of connected input/output units where each connection has a weight associated with it
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
Also referred to as connectionist learning due to the connections between units
Multilayer Feed-forward Neural Network
How A Multi-Layer Neural Network Works
The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction
The network is feed-forward: none of the weights cycles back to an input unit or to an output unit of a previous layer
Neuron: A Hidden/Output Layer Unit
An n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
The inputs to the unit are outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit. Then a nonlinear activation function is applied to it.
Neuron: A Hidden/Output Layer Unit
Net input to unit j is: I_j = Σ_i (w_ij * O_i) + θ_j
Output of unit j is: O_j = 1 / (1 + e^(-I_j))
Here, the activation function is the logistic, or sigmoid, function.
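A minimal sketch of these two equations for a single unit j (the NumPy arrays for the previous layer's outputs, the weights, and the bias value are assumptions for illustration):

import numpy as np

def unit_output(O_prev, w_j, theta_j):
    # Net input: I_j = sum_i(w_ij * O_i) + theta_j
    I_j = np.dot(w_j, O_prev) + theta_j
    # Output: logistic (sigmoid) activation O_j = 1 / (1 + e^(-I_j))
    return 1.0 / (1.0 + np.exp(-I_j))

# Example with three outputs from the previous layer:
print(unit_output(np.array([0.2, 0.6, 0.1]), np.array([0.4, -0.3, 0.9]), theta_j=0.05))  # ~0.51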
Defining a Network Topology
Decide the network topology: Specify # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer
Numeric attribute: Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
Categorical attribute: One input unit per domain value, each initialized to 0
Output: for classification with more than two classes, one output unit per class is used
If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
Backpropagation
Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation"
Steps
Initialize weights to small random numbers, associated with biases
Propagate the inputs forward (by applying the activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
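A minimal sketch of these steps for a one-hidden-layer network with sigmoid units (the layer sizes, learning rate, and XOR-style toy data are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs
y = np.array([0, 1, 1, 0], dtype=float)                       # toy targets (XOR)

W1 = rng.uniform(-0.5, 0.5, (2, 3)); b1 = np.zeros(3)   # small random input->hidden weights, biases
W2 = rng.uniform(-0.5, 0.5, (3, 1)); b2 = np.zeros(1)   # small random hidden->output weights, biases
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(10000):
    for x, t in zip(X, y):
        # Propagate the inputs forward.
        h = sigmoid(x @ W1 + b1)
        o = sigmoid(h @ W2 + b2)
        # Backpropagate the error: output layer first, then the hidden layer.
        err_o = (t - o) * o * (1 - o)
        err_h = h * (1 - h) * (W2[:, 0] * err_o)
        # Update weights and biases.
        W2 += lr * np.outer(h, err_o); b2 += lr * err_o
        W1 += lr * np.outer(x, err_h); b1 += lr * err_h

# Predictions typically approach [0, 1, 1, 0] after training.
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel(), 2))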
Neural Network as a Classifier
Weakness
Long training time
Requires a number of parameters that are typically best determined empirically, e.g., the network topology or "structure"
Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well‐suited for continuous‐valued inputs and outputs
Successful on an array of real‐world data, e.g., hand‐written letters
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of rules from trained neural networks
Logistic Regression
Multiple regression
Used when the dependent variable is numeric.
The dependent variable is expressed as a linear combination of independent variables:
Y = b0 + b1X1 + b2X2 + … + bnXn
Y: dependent variable (numeric)
X1, X2, . . ., Xn: independent variables
Logistic Regression
Logistic regression is used when the dependent variable is nominal.
Let Y be a nominal variable with value 1 or 0.
Let p be the probability Y = 1
p, instead of Y, is related to the independent variables.
More specifically, p is expressed as a nonlinear function of the independent variables:
p = 1 / (1 + e^-(b0 + b1X1 + b2X2 + … + bnXn))
This function is called the logistic function.
Logistic Regression
To have a linear expression on the right-hand side, we use a measure called odds.
The odds of a tuple belonging to class 1 (Y = 1) is the ratio of the probability of belonging to class 1 to the probability of belonging to class 0:
odds = p / (1 - p)
Logistic Regression
Substitute p with the logistic function (with simplification):
odds = e^(b0 + b1X1 + b2X2 + … + bnXn)
Take the natural logarithm on both sides:
ln(odds) = b0 + b1X1 + b2X2 + … + bnXn
Log(odds) is called the logit, and its range is from -∞ (very low odds, or very low p) to ∞ (very high odds, or very high p).
logit = 0 means odds = 1, which means p = 0.5
Logistic Regression
When a logistic regression model is fitted to a dataset, all coefficients (b's) are estimated.
Then, given independent variable values, we can calculate p.
Example:
A heart disease dataset (from the UCI Machine Learning Repository)
14 attributes and 270 tuples
Logistic Regression
Part of the dataset
Class attribute: heart_disease
1: has heart disease
2: does not have heart disease
Logistic Regression
Build logistic regression model (with Weka) using only age:
b0 = ‐2.8088, b1 = 0.0474
Given the age of a person, we can calculate the probability that the person has heart disease:
P(class = 1 | age = x) = 1 / (1 + e^-(-2.8088 + 0.0474x)) = 1 / (1 + e^(2.8088 - 0.0474x))
Logistic Regression
If the age of a person is 72,
P(class = 1 | age = 72) = 1 / (1 + e^(2.8088 - 0.0474*72))
= 0.646
So, the probability that the person has heart disease is 64.6%.
If the classification threshold is 0.5, then the person is classified as having heart disease.
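A small Python sketch of this calculation using the coefficients above:

import math

def p_heart_disease(age, b0=-2.8088, b1=0.0474):
    # P(class = 1 | age) = 1 / (1 + e^-(b0 + b1*age)), with the Weka coefficients above.
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))

p = p_heart_disease(72)
print(p)                                        # approximately 0.6466, the ~64.6% computed above
print("class 1" if p >= 0.5 else "class 2")     # classification with a 0.5 threshold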
Logistic Regression
If we use two independent variables, age and cholesterol, the parameters (from Weka's logistic regression) are:
b0 = -3.408, b1 = 0.0437, b2 = 0.0032
The fitted model is:
P(class = 1 | age = X1, cholesterol = X2) = 1 / (1 + e^(3.408 - 0.0437X1 - 0.0032X2))
Support Vector Machine
SVM for linearly separable data
Linear decision boundary
Find a decision boundary, called hyperplane, with the largest margin.
Tuples that fall on the side hyperplanes (the margin boundaries) are called support vectors.
Support Vector Machine
SVM for linearly inseparable data
Some data are not linearly separable
Original data is transformed to a higher dimensional space.
Find a linearly separating hyperplane in the new space.
(Figure: example data plotted on attributes A1 and A2 that is not linearly separable in the original space.)
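An illustrative scikit-learn sketch of this idea (the toy dataset and parameters are assumptions): a linear-kernel SVM cannot separate circular data, while an RBF-kernel SVM, which implicitly maps the data to a higher-dimensional space, separates it well:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data that is not linearly separable in the original two attributes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)        # implicit mapping to a higher-dimensional space

print("linear kernel accuracy:", linear_svm.score(X, y))   # poor: no separating hyperplane exists
print("RBF kernel accuracy:", rbf_svm.score(X, y))         # near 1.0: separable in the new space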
Support Vector Machine
Properties
Relatively recent (1992)
Highly accurate
Especially for nonlinear decision boundaries
Less prone to overfitting
Slow
Random Forest
An ensemble method
Multiple trees are built (so a forest)
Each tree classifies a given sample
Final classification is determined by voting
Random Forest
Outline of a simple random forest algorithm (assume k trees are built):
From dataset D, we generate k bootstrap training sets D1, D2, . . ., Dk.
We build one tree from each training set.
When we build a tree, the test attribute at each node is determined from a subset of attributes that are randomly selected from all attributes. All k trees are built in this manner.
When an unseen object is classified, each tree classifies the object and the final class of the object is determined by voting.
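A sketch of this outline using scikit-learn decision trees (an assumption; Weka's RandomForest differs in its details). Setting max_features="sqrt" makes each split consider only a random subset of the attributes:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, k=50, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    d = len(X)
    for _ in range(k):
        idx = rng.integers(0, d, size=d)     # bootstrap training set Di
        # Each node's test attribute is chosen from a random subset of attributes.
        trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X_new):
    votes = np.array([t.predict(X_new) for t in trees]).astype(int)   # integer labels assumed
    return np.array([np.bincount(col).argmax() for col in votes.T])   # final class by voting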
Random Forest
Properties
Usually less susceptible to errors and outliers.
Usually overfitting is not an issue.
When determining the test attribute at each node, only a subset of attributes is considered. So, it can be faster, especially for a large dataset.
Learning Classifier System (LCS)
A machine learning technique which combines
Reinforcement learning
Evolutionary computing
Other heuristics
to produce an adaptive system
Learning Classifier System (LCS)
Reinforcement learning
Learning through trial and error via the reception of a numerical reward
Maximize future reward
Learning Classifier System (LCS)
Evolutionary computing
Search algorithms based on the mechanisms of natural selection and genetics
Apply Darwin's principle of the survival of the fittest among computational structures
Apply stochastic processes of gene mutation, recombination, etc.
Genetic Algorithm
A local search algorithm (searching for a solution to an optimization problem, for example)
Based on evolutionary biology
Next-generation individuals are generated from a population by
Selection
Crossover
Mutation
Genetic Algorithm
An example (with digit strings)
Selection: parents are selected based on a fitness function
Crossover: parents are crossed over to generate offspring
Mutation: a digit is randomly selected and randomly mutated
Note: Selection, crossover, and mutation can be done in different ways.
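A toy Python sketch of one generation; the initial digit strings, the fitness function (sum of digits), and the mutation rate are assumptions for illustration:

import random

random.seed(1)
population = ["3724", "8190", "5566", "2481", "9035", "1247"]   # assumed initial digit strings
fitness = lambda s: sum(int(c) for c in s)                      # assumed fitness: sum of digits

def next_generation(pop, n_offspring=6, mutation_rate=0.1):
    offspring = []
    for _ in range(n_offspring // 2):
        # Selection: two parents are picked, biased toward higher fitness.
        p1, p2 = random.choices(pop, weights=[fitness(s) for s in pop], k=2)
        # Crossover: the parents' tails are swapped at a random cut point.
        cut = random.randrange(1, len(p1))
        for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
            # Mutation: occasionally one randomly selected digit is replaced by a random digit.
            if random.random() < mutation_rate:
                i = random.randrange(len(child))
                child = child[:i] + str(random.randrange(10)) + child[i + 1:]
            offspring.append(child)
    return offspring

print(next_generation(population))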
XCS
A learning classifier system
A model is represented as a set of rules
A rule has an antecedent and a consequent
A rule is usually represented as a bit string
The set of rules (the classifier model) is generated based on a genetic algorithm
XCS
Example of a rule representation (from the Buys_computer dataset)
First, attribute values are encoded into binary values.
Age has three values, so three bits are used:
100: <= 30
010: 31..40
001: > 40
XCS
Income has three values, so three bits are used:
100: low
010: medium
001: high
Student has two values, so two bits are used:
10: no
01: yes
Credit_rating has two values, so two bits are used:
10: fair
01: excellent
XCS
Buys_computer (class label) has two values, so two bits are used:
10: no
01: yes
Then, the following rule
If age > 40 AND Income = high AND Student = no AND Credit_rating = excellent, Then buys_computer = no
is represented as: 0010011001 10
NOTE: There are other representations.
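A small Python sketch of this encoding; the dictionaries simply restate the bit assignments above:

# Encode a Buys_computer rule as a bit string, using the assignments above.
AGE     = {"<=30": "100", "31..40": "010", ">40": "001"}
INCOME  = {"low": "100", "medium": "010", "high": "001"}
STUDENT = {"no": "10", "yes": "01"}
CREDIT  = {"fair": "10", "excellent": "01"}
BUYS    = {"no": "10", "yes": "01"}

def encode_rule(age, income, student, credit, buys):
    antecedent = AGE[age] + INCOME[income] + STUDENT[student] + CREDIT[credit]
    consequent = BUYS[buys]
    return antecedent + " " + consequent

# If age > 40 AND Income = high AND Student = no AND Credit_rating = excellent,
# then buys_computer = no:
print(encode_rule(">40", "high", "no", "excellent", "no"))   # 0010011001 10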
XCS
Simplified Schematic (as a classifier)
Note: 1. "action" is the class label. 2. "#" matches either 0 or 1.
XCS
Example
Population: #10# 10, ##01 10, 10## 10, 1#0# 01, #01# 10, 10#1 01
Sample: 1101 10
Match Set: #10# 10, ##01 10, 1#0# 01
Actions are compared and action 10 is chosen.
Action Set: #10# 10, ##01 10
The chosen action matches the sample's class, so a reward of 1000 is given.
A genetic algorithm is performed to generate new offspring, if needed; new offspring are inserted into the population.
XCS
Example
Population: #10# 10, ##01 10, 10## 10, 1#0# 01, #01# 10, 10#1 01
Sample: 1011 01
Match Set: 10## 10, #01# 10, 10#1 01
Actions are compared and action 10 is chosen.
Action Set: 10## 10, #01# 10
The chosen action does not match the sample's class (a mismatch), so the reward is 0.
A genetic algorithm is performed to generate new offspring, if needed; new offspring are inserted into the population.
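A short Python sketch of how the match set and action set in these two examples are formed ("#" matches either 0 or 1):

# Population of (condition, action) classifiers from the examples above.
population = [("#10#", "10"), ("##01", "10"), ("10##", "10"),
              ("1#0#", "01"), ("#01#", "10"), ("10#1", "01")]

def matches(condition, sample_bits):
    # '#' is a wildcard: it matches either 0 or 1 at that position.
    return all(c == "#" or c == b for c, b in zip(condition, sample_bits))

def match_and_action_sets(sample_bits, chosen_action):
    match_set = [cl for cl in population if matches(cl[0], sample_bits)]
    action_set = [cl for cl in match_set if cl[1] == chosen_action]
    return match_set, action_set

# First example: sample 1101 with class 10; action 10 is chosen, so the reward is 1000.
print(match_and_action_sets("1101", "10"))
# Second example: sample 1011 with class 01; action 10 is chosen, so the reward is 0.
print(match_and_action_sets("1011", "10"))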
XCS
Experimental result (preliminary)

Data set                   Naïve-Bayes   J48     Logistic   ANN     SVM     XCS
adult                      83.48         86.22   85.18      83.27   75.90   81.78
diabetes                   76.30         73.83   77.21      75.39   65.10   79.94
breast cancer              71.68         75.53   68.88      64.69   73.08   87.58
breast cancer wisconsin    95.99         94.56   96.57      95.28   95.71   99.57
cmc                        49.29         53.22   51.60      54.51   56.21   56.62
ionosphere                 82.62         91.45   88.89      91.17   93.45   92.00
iris                       94.00         94.00   93.33      97.33   96.67   98.00
liver disorder             55.36         68.70   68.12      71.59   59.42   83.19
lymphography               37.84         36.49   28.38      39.86   43.92   52.03
lung cancer                78.13         78.13   75.00      65.63   71.88   80.48
primary tumor              54.98         47.91   48.23      45.02   44.05   53.05
voting                     75.53         77.34   75.53      78.25   81.27   93.07
credit approval (crx)      77.68         86.09   85.22      82.75   54.78   88.26
balance-scale              23.84         32.80   26.72      21.60   14.08   78.88
heartDisease-cleveland     55.45         55.45   57.76      55.78   49.51   66.67
Average                    67.48         70.11   68.44      68.14   65.00   79.41
LCS applications
Data mining (besides classification)
Extract compact descriptions of interesting phenomena usually described by multidimensional data
Predict the next note in different types of music
Optimization
Shape optimization
The profit optimization of a batch chemical reaction
LCS applications
Modeling
Some categorization process
Used in Computational Economics
Control
Control of traffic signal
Control of robot arms
Control of underwater vehicle
Many others
References
• Han, J., Kamber, M., Pei, J., "Data Mining: Concepts and Techniques," 3rd Ed., Morgan Kaufmann, 2012
• http://www.cs.illinois.edu/~hanj/bk3/
• G. Shmueli, P. C. Bruce, M. L. Stephens, N. R. Patel, "Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro," Wiley, 2017
• Genetic algorithm: S. Russell and P. Norvig, "Artificial Intelligence: A Modern Approach," 3rd Ed., Prentice Hall, 2010, pp. 126-129.
• LCS and XCS
– J. H. Holland, "Adaptation in Natural and Artificial Systems," University of Michigan Press, 1975.
– S. W. Wilson, "Classifier Fitness Based on Accuracy," Evolutionary Computation, 3, 1995, pp. 149-175.
– S. W. Wilson, "Generalization in the XCS Classifier System," Genetic Programming 1998: Proceedings of the 3rd Annual Conference.
– M. Butz and S. W. Wilson, "An Algorithmic Description of XCS," IlliGAL Report No. 2000017, Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign, May 2000.