RESEARCH METHODS FOR INFORMATION PROFESSIONALS
Predictive modeling
Support Vector Machine
Random Forest
Support Vector Machine (1)
The main idea is to find a separating surface (a hyperplane, i.e. a multidimensional plane) that separates the categories with the maximum distance (margin)
Maximum margin hyperplane
Support Vector Machine (2)
Many possible lines can separate the two groups, but line a separates them with a wider “margin” than line b does, so line a is retained.
Figure 2-1. Two-dimensional illustration of an SVM classifier (Joachims, 1998)
Support Vector Machine (3)
Support vectors
The hyperplane is determined using support vectors
Source: Witten, I. H., & Frank, E. (2000). Data mining. San Francisco: Morgan Kaufmann.
Support Vector Machine (4)
Linear model
Formula for a maximum margin hyperplane:
X = b + Σi αi yi (a(i) · a)
X represents category membership
b & the αi are coefficients to be determined by machine learning
The index i runs over the support vectors
yi is the class value of the training case a(i) (i.e. a support vector): 1 = yes, -1 = no
a(i) & a are vectors: a(i) is a support vector, a is a new case to be classified
(a(i) · a) is the inner-product similarity between a(i) & a (see the sketch below)
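To make the formula concrete, here is a minimal sketch (not part of the original slides) using scikit-learn: after fitting a linear SVM, the decision value for a new case can be recomputed by hand as b + Σi αi yi (a(i) · a), using only the stored support vectors. The toy data and variable names are illustrative assumptions.

```python
# A minimal sketch, assuming scikit-learn and toy data (not from the slides):
# after fitting a linear SVM, recompute the decision value for a new case as
# b + sum_i alpha_i * y_i * (a(i) . a), using only the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)            # two linearly separable groups

clf = SVC(kernel="linear").fit(X, y)

a_new = np.array([[1.5, 1.0]])               # a new case to be classified
# dual_coef_ stores alpha_i * y_i for each support vector a(i)
manual = clf.dual_coef_ @ clf.support_vectors_ @ a_new.T + clf.intercept_
print(manual.ravel(), clf.decision_function(a_new))  # the two values agree
```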
Support Vector Machine (5)
Constructing nonlinear classifiers
Transform the input using higher-degree factors (products of the original attributes)
Calculate all possible n-factor (say 3-factor) products of the original attributes, and replace the original set of attributes with the set of n-factor products
E.g. for 2 attributes, height (x) and weight (y), the 3-factor products are: x³ (xxx), x²y (xxy), xy² (xyy), y³ (yyy)
Hyperplane formula: X = b + Σi αi yi (a(i) · a)^n
(a(i) · a)^n is called a kernel function (here, a polynomial kernel); a numeric check follows below
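As a small numeric check (my own illustration, not from the slides), the 3-factor example above can be verified directly: for two attributes, (x · z)³ equals an ordinary inner product over the weighted 3-factor products xxx, xxy, xyy, yyy, so the classifier never has to construct the higher-dimensional space explicitly.

```python
# A numeric check (my own illustration): for 2 attributes, the kernel value
# (x . z)^3 equals an ordinary inner product over the weighted 3-factor
# products xxx, xxy, xyy, yyy, so the higher-dimensional space is never
# constructed explicitly.
import numpy as np

def phi(v):
    """Map (x1, x2) to its weighted 3-factor products."""
    x1, x2 = v
    return np.array([x1**3,
                     np.sqrt(3) * x1**2 * x2,
                     np.sqrt(3) * x1 * x2**2,
                     x2**3])

x = np.array([1.0, 2.0])       # e.g. height and weight (illustrative values)
z = np.array([0.5, -1.5])

print(np.dot(x, z) ** 3)           # kernel function (a(i).a)^3
print(np.dot(phi(x), phi(z)))      # same value via explicit 3-factor products
```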
Support Vector Machine (6)
Advantages
Only vectors at the margin of both groups (i.e. support vectors) contribute to the classifier construction.
This makes SVM less computationally expensive compared with some other classifiers
Term selection is not needed.
The overfitting protection of SVM makes it capable of handling high dimensional spaces.
SVM finds the globally optimal solution
like linear regression, & unlike neural networks and decision tree induction
SVM does not need manual parameter tuning; the “default” parameter setting is effective
However, we still need to choose a good mapping function (“kernel function”) and, in text categorization, a term weighting scheme
Support vector regression
Source: https://www.saedsayad.com/support_vector_machine_reg.htm
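A brief sketch of support vector regression with scikit-learn's SVR (the toy one-dimensional data are my own, not taken from the cited page): the epsilon parameter defines a tube around the fitted surface, and only cases outside the tube become support vectors, mirroring the margin idea in classification.

```python
# A brief sketch, assuming scikit-learn's SVR and a toy 1-D dataset
# (illustrative, not taken from the cited page).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 5, (40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 40)   # noisy target values

# epsilon sets the width of the tube around the regression surface; only
# cases outside the tube become support vectors
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(reg.predict([[2.5]]))
```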
Decision Tree Induction
Example decision tree used to make a prediction
What is decision tree induction?
It is a supervised learning technique for categorization/ classification
Based on the training sample, a decision tree is derived for categorizing new cases
A decision tree is a tree structure that is used like a flow-chart
To categorize a new case, the attribute values of the case are tested against the decision tree. A path is traced from the root to a leaf node that holds the class prediction for that sample.
Each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent categories.
A decision tree can be converted into a set of categorization rules.
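The following minimal sketch (illustrative dataset, not from the slides) shows these ideas with scikit-learn: a tree is induced from a training sample, printed as a set of if/then categorization rules, and a case is classified by tracing a path from the root to a leaf.

```python
# A minimal sketch, assuming scikit-learn and the built-in iris sample
# (illustrative, not from the slides): induce a tree, print it as if/then
# categorization rules, and classify a case by tracing root to leaf.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(criterion="entropy",   # information-gain style split measure
                              max_depth=3, random_state=0)
tree.fit(data.data, data.target)

print(export_text(tree, feature_names=list(data.feature_names)))  # rule form
print(tree.predict(data.data[:1]))   # trace one case from root to a leaf node
```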
Example decision tree
Induction process
Decision Tree Induction
Attribute selection measure (1)
Also called measure of the goodness of split
One measure used to select the test attribute at each node in the tree is information gain, or reduction of entropy
The measure of information is based on Shannon’s Entropy theory
Suppose there are m categories in the categorization,
s cases in the sample, and si is the no. of cases in category i.
The expected information needed to classify a set of cases:
I(s1, s2, …, sm) = − Σ(i=1 to m) pi log2(pi)
where pi is the probability that a case belongs to category i (estimated by si/s)
information needed to make correct prediction = measure of entropy (or disorder)
Measure of entropy (disorder)
For a sample with 90% of cases in one category and 10% in the other:
− 0.9 × log2 0.9 − 0.1 × log2 0.1 ≈ 0.14 + 0.33 ≈ 0.47 bits
Why this formula? Suppose a sample of 100 (90 vs. 10 cases):
Probability of the observed assignment: 0.9 × 0.9 × … [90 times] × 0.1 × 0.1 × … [10 times] = 0.9^90 × 0.1^10
Normalize by sample size: 0.9^(90/100) × 0.1^(10/100) = 0.9^0.9 × 0.1^0.1
Take the log: 0.9 log2 0.9 + 0.1 log2 0.1, and negate to obtain the entropy (a worked check follows below)
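A small worked check of the figures above (my own code, assuming the same 90%/10% split): the entropy function below reproduces the ≈ 0.47 bits result and shows the two boundary cases.

```python
# A worked check of the figures above (my own code): a sample split 90% / 10%
# between two categories needs about 0.47 bits of information per case.
import math

def entropy(probabilities):
    """I(s1, ..., sm) = -sum_i p_i * log2(p_i), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.9, 0.1]))   # ~0.469
print(entropy([0.5, 0.5]))   # 1.0  -> maximum disorder for two categories
print(entropy([1.0, 0.0]))   # 0.0  -> no information needed
```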
Random Forest
An ensemble learning method
Applied to classification and regression
Constructs many decision trees from the training set (usually C&R Trees using the Gini coefficient)
Prediction is by aggregating the predictions of many trees
classification: majority vote
regression: mean prediction
Random forest corrects for the tendency of individual decision trees to overfit the training set
Reference: https://www.listendata.com/2014/11/random-forest-with-r.html
Random Forest
Combines “bagging” (bootstrap aggregating) + random variable selection
Bagging generates ntree new training sets, each of size N, by sampling with replacement from the original sample
Sampling with replacement: some records may be repeated in each new sample
If N = the original sample size, each new sample is expected to contain about 63% of the unique records in the original sample. This kind of sample is called a bootstrap sample (a quick simulation of the 63% figure follows below).
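A quick simulation (my own illustration) of the 63% figure: drawing N records with replacement from an N-record sample leaves roughly 1 − 1/e ≈ 63% of the unique original records in each bootstrap sample.

```python
# A quick simulation (my own illustration) of the ~63% figure: drawing N
# records with replacement keeps roughly 1 - 1/e of the unique originals.
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
bootstrap = rng.integers(0, N, size=N)    # N index draws, with replacement
print(len(np.unique(bootstrap)) / N)      # approximately 0.632
```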
Random Forest
The records not in the bootstrap sample are called Out-Of-Bag (OOB) sample
For each tree, the misclassification rate is calculated on the OOB sample (about 37% of the original sample); this is called the OOB error rate
OOB error rates from all the trees are aggregated to form the overall OOB error rate
Random Forest
Random Variable Selection
At each split during tree building, a small number of IVs (mtry) are selected at random out of all the IVs
Default value of mtry:
Sqrt of no. of IVs for classification task
1/3 of IVs for regression task
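A short sketch (illustrative dataset, not from the slides) of these defaults in scikit-learn: n_estimators corresponds to ntree, max_features corresponds to mtry, and oob_score=True reports the Out-Of-Bag accuracy described earlier.

```python
# A short sketch, assuming scikit-learn and a built-in sample dataset
# (illustrative): n_estimators corresponds to ntree, max_features to mtry,
# and oob_score=True computes the Out-Of-Bag accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,        # ntree
    max_features="sqrt",     # mtry default for a classification task
    oob_score=True,
    random_state=0,
).fit(X, y)

print("OOB error rate:", 1 - rf.oob_score_)
```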
Random Forest
Disadvantages
Cannot extrapolate to new data (beyond the range of current sample)
Biased towards categorical variables with many levels (categories)
Gini impurity reduction is biased to variables with more categories
Random Forest
Fine-tuning
Two parameters are important:
Number of trees (ntree)
Number of random variables used in each tree (mtry).
Random Forest
Fine-tuning
Set mtry to default value, and search for the optimal ntree value.
Try different ntree values (100, 200, 300, …, 1,000)
Record OOB error rate
Identify the number of trees where the OOB error rate is minimum
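A sketch of this ntree search (dataset and candidate values are illustrative assumptions): fit forests of increasing size, record the OOB error rate for each, and keep the number of trees with the lowest error.

```python
# A sketch of the ntree search (dataset and candidate values are illustrative
# assumptions): fit forests of increasing size, record each OOB error rate,
# and pick the number of trees with the lowest error.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

oob_errors = {}
for ntree in range(100, 1100, 100):              # 100, 200, ..., 1,000
    rf = RandomForestClassifier(n_estimators=ntree, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)
    oob_errors[ntree] = 1 - rf.oob_score_

best_ntree = min(oob_errors, key=oob_errors.get)
print(best_ntree, oob_errors[best_ntree])
```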
Random Forest
Fine-tuning
Find the optimal mtry
Try different values of mtry
Reducing mtry reduces both the correlation between trees and the strength of each individual tree; increasing it increases both. Somewhere in between is an “optimal” value of mtry