9. General Data mining objectives and algorithms
5b. General Data mining objectives and algorithms
5b.1 Components of Data mining algorithms
5b.2 Data mining tasks
5b.3 Credit scoring as part of data mining
5b.4 Boosting to improve algorithms
5b.5 Density estimation, Principal component analysis and mixtures
5b.6 Clustering and segmenting
5b.7 Hierarchical clustering
5b.8 Choosing number of clusters
5b.9 Probabilistic methods of clustering
5b.10 Predictive modelling using regression
5b.11 General linear and additive models
5b.12 Finding patterns and rules
5b.13 Association rules
5b.14 Prediction rules
5b.1 Components of Data Mining Algorithms
A data mining application consists of five components:
Task: what the data mining application is trying to do, e.g.
visualisation, density estimation, classification, clustering
Structure: what model or pattern we are trying to fit to the data, e.g.
linear regression model, classification tree, hierarchical clustering
Measurement function: how to judge the quality of our model or pattern.
Goodness of fit: how well does it describe the existing data (like R²), or
prediction: how does it perform on data not yet collected / not used in model development?
Optimization method: finding best structure/model and parameters in that model.
This is usually what one thinks of as the data mining algorithm
Data management technique: how to store/index/retrieve the data needed.
For small data sets this is not important, but for massive data sets it is essential:
where the data is stored, and how often it needs to be accessed as part of the optimization method, is critical to the application being feasible.
5b.2 Data Mining Tasks
Exploratory Data Analysis (EDA)
Explore data with no clear idea of what is being looked for
Visualisation and interactive methods
p low: use histograms, graphs, pie charts, coxcomb plots (Nightingale)
p high: use projections onto lower dimensions; principal component analysis gives the “most interesting” projections
Descriptive modelling
Describe all the data, or the process that generates it
Density estimation which describes overall probability distributions
Partitioning p-space into groups – cluster analysis and segmentation
Relationship between variables – dependency modelling
Predictive modelling
Predict the value of one variable given the values of the other variables
Classification: if the predicted variable is categorical
Regression: if the predicted variable is continuous
Discovering Patterns and Rules
Detect connections between variables in some parts of the data.
Association rules: what variables tend to occur together (basket analysis)
Prediction rules: to predict the value of an unknown variable
5b.3 Credit Scoring as part of data mining
If one thinks of credit scoring as a data mining algorithm its components are
Task: classify whether customer will default or not
Structure: almost anything: logistic regression, classification tree, linear programming, etc.
Measurement Function: default rate among accept, acceptance rate, classification error, Gini coefficient, ROC curve
Optimization Method: coarse classification splits, scorecard values, choice of segments
Data management technique: usually flat files with ID connections to credit bureau data. Since samples are 50,000 to 100,000 records, no major data management is needed.
5b.4 Boosting to improve algorithms
One way of avoiding overfitting is to develop a number of models and to average their predictions (in machine learning this is majority voting, and some models may be given more votes than others).
In boosting you start with each data point counting the same and develop the first classifier. Then increase the weights of the points misclassified by the first classifier, build a second classifier, and repeat.
Although not used much in credit scoring, it has proved very successful in other classification and regression applications.
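A minimal sketch of this reweighting idea in the style of AdaBoost, assuming scikit-learn and NumPy are available; the depth-1 trees, the synthetic data and all names below are illustrative assumptions, not part of the notes:

# AdaBoost-style boosting sketch: reweight misclassified points and refit.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    """y must be coded as +1/-1. Returns (classifiers, classifier_weights)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # every data point starts with the same weight
    clfs, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # number of "votes" given to this classifier
        w *= np.exp(-alpha * y * pred)          # increase the weight of misclassified points
        w /= w.sum()
        clfs.append(stump)
        alphas.append(alpha)
    return clfs, alphas

def boosted_predict(clfs, alphas, X):
    """Weighted majority vote of the individual classifiers."""
    votes = sum(a * c.predict(X) for a, c in zip(clfs, alphas))
    return np.sign(votes)

# Tiny synthetic illustration (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
clfs, alphas = boost(X, y)
print((boosted_predict(clfs, alphas, X) == y).mean())   # training accuracy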
5b.5 Density estimation, PCA and mixtures
If you plot all the data of a large data set on a diagram, it will just be a black mass.
Overprinting means one cannot see what is going on.
Better if you can draw density contours, i.e. estimate probability density functions.
Principal Component Analysis
If p the number of data variables is very large it can be impossible to build visualisations except in a few dimensions
Which dimensions?
Can project onto each ( or a small subset – scatterplots, star icons)
but maybe the interesting thing is not in the original variables
Look for connections in the data (x1, x2) with points (1,2), (2,4), (3,6), (4,8), (5,10).
The most interesting direction is x1 + 2x2, while the most boring direction is the one perpendicular to it, 2x1 - x2.
i.e. the interesting direction is the one in which the sample variance is biggest.
So if we plot onto two dimensions, the first is the dimension where the sample variance is largest; the second is the dimension, perpendicular to the first, with the biggest sample variance among such perpendicular dimensions.
In the same way one can define the 1st, 2nd, 3rd, etc. principal components.
(Relatively easy to calculate: if V is the p×p covariance matrix of the data, these are the eigenvectors x with Vx = λx, corresponding to the highest eigenvalues.)
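As a sketch of this calculation (assuming NumPy; the code is illustrative, not part of the notes), the principal components of the tiny (x1, x2) example above can be read off from the eigenvectors of its covariance matrix:

# PCA sketch: principal components are eigenvectors of the covariance matrix V,
# ordered by eigenvalue (Vx = lambda*x); largest eigenvalue = largest sample variance.
import numpy as np

data = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]], dtype=float)
V = np.cov(data, rowvar=False)              # 2x2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(V)        # eigh: symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]           # sort eigenvalues descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)         # first value is the variance in the "interesting" direction; second ~0
print(eigvecs[:, 0])   # ~ (1, 2)/sqrt(5): the x1 + 2x2 direction
print(eigvecs[:, 1])   # ~ (2, -1)/sqrt(5): the perpendicular, zero-variance direction

# Project the data onto the first principal component.
scores = (data - data.mean(axis=0)) @ eigvecs[:, 0]
print(scores)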
The mixtures part of density estimation and mixtures
In some cases one might think the density function describes / is explained by more than one type of object.
Example: supermarket store card usage divides into those who use store regularly and those who use it occasionally.
The general form of a mixture distribution is f(x) = Σ_k p(k) f_k(x | θ_k), where p(k) is the proportion of type k and f_k its density with parameters θ_k.
To estimate this, one needs to estimate both the p(k) and the θ_k.
EM (Expectation-Maximisation) algorithm does this in steps
First choose some arbitrary starting values θ^0 of the θ_k and find the p(k)^1 that best fit the data.
Then, fixing the p(k)^1, find the values θ^1 that best fit the data.
Repeat until there is “convergence”, i.e. little change in the log-likelihood.
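A minimal sketch of the EM idea for a two-component 1-D Gaussian mixture, assuming NumPy and SciPy are available; the data (a hypothetical mix of regular and occasional store-card users), the starting values, and the update order (both p(k) and θ refitted in each pass) are illustrative assumptions:

# EM sketch for a mixture of two 1-D normals: alternately compute responsibilities
# and refit the mixing proportions p(k) and the parameters theta_k = (mean_k, sd_k).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Hypothetical data: "regular" and "occasional" store-card users.
x = np.concatenate([rng.normal(20, 5, 300), rng.normal(2, 1, 700)])

p = np.array([0.5, 0.5])            # initial p(k)
mu = np.array([1.0, 10.0])          # arbitrary starting values theta^0
sd = np.array([5.0, 5.0])

for _ in range(100):
    # E-step: posterior probability that each point came from each component.
    dens = np.vstack([p[k] * norm.pdf(x, mu[k], sd[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: refit p(k) and theta_k given these responsibilities.
    nk = resp.sum(axis=1)
    p = nk / len(x)
    mu = (resp * x).sum(axis=1) / nk
    sd = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)

print(p, mu, sd)   # should recover roughly (0.7, 0.3) with means near 2 and 20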
5b.6 Clustering and Segmentation
The obvious question after using mixture models is: given a data set, are there different types or “clusters”, so that the units in a cluster exhibit similar behaviour which is very different from that of other clusters?
Two types of clustering – segmentation and cluster analysis
Segmentation: partition the data into convenient groups
In behaviour scoring: less than 6 months' history / more than 6 months' history
Cluster analysis: find the natural partitions in the data
Many different methods for cluster analysis
Partition based clustering
Hierarchical clustering
Probabilistic model based clustering
Partition based clustering
Task: partition the data into K (prespecified) disjoint sets
Measurement: use score functions which are combinations/variants of
within-cluster variation: the sum of distances from each point to its cluster centre
between-cluster variation: the sum of distances between cluster centres
Structure/Optimization:
One example is the k-means algorithm (a code sketch follows the illustration below):
Randomly pick K cluster centres
Assign each point to the cluster with the nearest centre
Compute the new centre of the points assigned to each cluster
Repeat until there is no or little change
K-Means clustering
User starts by specifying the number of clusters (K)
K datapoints are randomly selected
Repeat until no change:
Hyperplanes separating K points are generated
K Centroids of each cluster are computed
[Figure: k-means illustration, scatterplot with axes Age (0 to 100) and Dose (0 to 1000 cc's)]
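A minimal NumPy sketch of the k-means steps above (random centres, assign each point to the nearest centre, recompute centres, repeat); the two-group Age/Dose-style data is hypothetical:

# k-means sketch: pick K random centres, assign points to the nearest centre,
# recompute centres as the mean of the assigned points, repeat until stable.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), K, replace=False)]       # random initial centres
    for _ in range(n_iter):
        # distance of every point to every centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                        # nearest-centre assignment
        new_centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centres, centres):                # little or no change: stop
            break
        centres = new_centres
    within = sum(((X[labels == k] - centres[k]) ** 2).sum() for k in range(K))
    return labels, centres, within                           # within-cluster variation

# Hypothetical two-group data (e.g. Age vs Dose as in the illustration above).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([30, 200], [5, 50], (50, 2)),
               rng.normal([70, 800], [5, 50], (50, 2))])
labels, centres, within = kmeans(X, K=2)
print(centres, within)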
Measuring how good the clustering is
5b.7 Hierarchical Clustering
Clusters are created in levels, giving a set of clusters at each level, so K is not prespecified
Agglomerative
Initially each item is in its own cluster
Iteratively clusters are merged together
Divisive
Initially all items are in one cluster
Large clusters are successively divided
If splitting clusters on one variable, this is equivalent to a classification tree
If splitting on all variables the search is very complicated, so agglomerative methods are more common
Measuring distance in agglomerative methods
Take all possible pairs of clusters
Measure how far apart they are
Combine the pair which is closest together and repeat until the distance apart is above some prespecified level
Distance Measures ( usually use Euclidean distance d(x,y))
Hierarchical versus non-hierarchical clustering
Hierarchical clustering:
No one distance measure is the best
Ward is the most common but tends to produce clusters of roughly the same size and is sensitive to outliers
Hierarchical clustering methods do not scale up well (lots of calculations)
Once a mistake has been made in combining clusters it cannot be undone subsequently
Non-hierarchical clustering (e.g. k-means):
Have to specify the number of clusters
Needs initial seeds
Fast and works well with large data sets
5b.8 Choosing the number of clusters: the dendrogram, a tree data structure which illustrates hierarchical clustering techniques.
Each level shows clusters for that level.
Leaf – individual clusters
Root – one cluster
A cluster at level i is the union of its children clusters at level i+1.
Helps decide how many clusters one wants.
The vertical scale gives the “distance between the two clusters amalgamated”
Sometimes the diagram is drawn on its side
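A sketch of agglomerative clustering and its dendrogram using SciPy (assuming scipy and matplotlib are available; the data is hypothetical): linkage() records which clusters are merged at which distance, and dendrogram() draws the tree just described:

# Agglomerative clustering sketch: linkage() returns the merge history
# (which clusters were joined at which distance); dendrogram() draws the tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])   # hypothetical data

Z = linkage(X, method='ward')        # could also use 'single', 'complete' or 'average'
dendrogram(Z)                        # vertical axis = distance at which clusters merge
plt.show()

labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
print(labels)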
5b.8 Defining the number of clusters: elbow rule (1)
Elbow rule (2): the scree diagram
Worked example: agglomerative clustering of five cases
Stage 1 (from the Euclidean distance matrix below): the most similar pair of cases is 1 and 2; join them into cluster A
Stage 2: join 4 and 5 into cluster B
Stage 3: join B and 3 (distance 2.8)
Stage 4: join A and B (distance 6.4)
Euclidean distances between cases (rows for cases 3 to 5 shown):
      1    2    3    4    5
3   5.0  4.5  0.0
4   8.5  7.8  3.6  0.0
5   7.2  6.7  2.2  2.0  0.0
After joining 1 and 2 into cluster A, the distances become:
      A    3    4    5
4   8.1  3.6  0.0
5   6.9  2.2  2.0  0.0
5b.9 Probabilistic model based clustering using mixture models
Recall the form of the mixture distribution: f(x) = Σ_k p_k f_k(x | θ_k)
To estimate this, use EM to estimate both the p_k and the θ_k.
Obvious extension to clustering in that one prespecifies K and then lets the EM algorithm get to work
At the end, assign each data point x to the cluster k which maximises f_k(x | θ_k)
[Figures: Gaussian mixture example fitted by EM: initial distribution, after the first iteration, after the 20th iteration]
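As a sketch of this model-based clustering, assuming scikit-learn is available (the data is hypothetical): GaussianMixture runs EM and predict() assigns each point to its most probable component:

# Model-based clustering sketch: fit a K-component Gaussian mixture by EM,
# then assign each point to the component that gives it the highest posterior probability.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])   # hypothetical data

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)              # cluster assignment for each data point
print(gm.weights_)                  # estimated p_k
print(gm.means_)                    # estimated component means
print(labels[:10])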
5b.10 Predictive modelling using regression
In credit scoring we are predicting a binary event – default or not default ( or at most a k-class event)
When we moved to profit scoring we were trying to estimate a continuous variable – profit
Such prediction of continuous variables occurs often in data mining
Standard method is linear regression.
y(i) = w1x1(i) + w2x2(i) + … + wpxp(i) + e(i)
where the e(i) are assumed to be independent with a N(0, σ²) distribution.
This can be generalized to generalized linear and generalized additive models
5b.11 Generalized linear and additive models
Linear regression.
y(i) = w1x1(i) + w2x2(i) + … + wpxp(i) + e(i)
where the e(i) are assumed to be independent with a N(0, σ²) distribution.
Think of it as saying y(i) is N(m(i), σ²), where
s(i) = w1x1(i) + w2x2(i) + … + wpxp(i) is the linear score and
m(i) = s(i).
Generalized linear model says
y(i) can have any sort of distribution with parameter m(i), and
m(i) = g(s(i)), so the mean can be some function of the linear score.
Logistic regression can be thought of in this way
Generalized additive models go further and say
s(i) = w1h1(x1(i)) + w2h2(x2(i)) + … + wphp(xp(i)), with m(i) = g(s(i)),
so the “score” function is now no longer linear; SVMs do the same sort of thing.
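For example, logistic regression is the GLM with y(i) a 0/1 outcome and g the logistic function exp(s)/(1+exp(s)). A sketch assuming the statsmodels package is available; the data below is hypothetical:

# GLM sketch: logistic regression as a generalized linear model.
# m(i) = g(s(i)) with s(i) a linear score and g(s) = exp(s)/(1+exp(s)).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                       # hypothetical characteristics
s = 0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1]             # "true" linear score
y = rng.binomial(1, 1 / (1 + np.exp(-s)))           # default / no default

model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
result = model.fit()
print(result.params)                             # estimated weights (roughly 0.5, 1, -2, 0)
print(result.predict(sm.add_constant(X))[:5])    # fitted m(i) = P(y = 1)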
5b.12 Finding patterns and rules
So far in density estimation, clustering, and prediction we are using or trying to describe the whole data set.
Patterns try to describe particular aspects of the data – they are what is happening “locally”
One common example is “basket analysis”, where one tries to see patterns in what people buy together in a shop or supermarket.
If there were n items there could be 2^n possible patterns, so it is impossible to find all patterns.
A rule says “if X (or maybe X and W or Z) occurs then Y will occur”.
A probabilistic rule says “if X occurs then, with probability p, Y will occur”. This is the conditional probability P(Y|X) = p.
What is Market Basket Analysis?
Understanding behavior of shoppers
What items are bought together
What’s in each shopping cart/basket?
Basket data consist of a collection of transactions: the transaction date and the items bought in that transaction.
Retail organizations are interested in making decisions and strategy based on analysis of this transaction data:
what to put on sale, how to place merchandise on shelves to maximize profit, customer segmentation based on buying patterns.
5b.13 Association Rules
Rule form: LHS → RHS
IF a customer buys nappies, THEN they also buy beer:
nappies → beer
“Transactions that purchase bread and butter also purchase milk”:
bread and butter → milk
Customers who purchase maintenance agreements are very likely to purchase large appliances (not as good as the rule the other way around)
When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners
Support, Confidence and Lift
Support : measure of how often the collection of items in an association occur together as a percentage of all the transactions
In 2% of the purchases at hardware store, both pick and shovel were bought
Support of a rule = #tuples(LHS, RHS)/N
Support of a set of objects =#tuples(set appears)/N
Confidence : confidence of rule “B given A” is a measure of how likely it is that B occurs when A has occurred
100% meaning that B always occurs if A has occurred
Confidence = #tuples(LHS, RHS) / #tuples(LHS)
Example: bread and butter → milk [90% confidence, 1% support]
Lift: how much more likely the RHS is to occur when the LHS has occurred, compared with not knowing whether the LHS has occurred
Lift = (#tuples(LHS, RHS) / #tuples(LHS)) / (#tuples(RHS) / N)
     = #tuples(LHS, RHS) × N / (#tuples(LHS) × #tuples(RHS))
How Good is an Association Rule?
Is support and confidence enough?
Lift (improvement) tells us how much better a rule is at predicting the result than just assuming the result in the first place
Lift = P(LHS ∧ RHS) / (P(LHS) × P(RHS))
When lift > 1 then the rule is better at predicting the result than guessing
When lift < 1, the rule is doing worse than informed guessing and using the Negative Rule produces a better rule than guessing
The association rules mining problem
Generate all association rules from the given dataset that have
support greater than a specified minimum
confidence greater than a specified minimum
Rule form:
LHS → RHS [confidence, support]
nappies → beer [60%, 0.5%]
“90% of transactions that purchase bread and butter also purchase milk and 1% of people buy bread and butter”
bread and butter → milk [90%, 1%]
Association rule types:
Actionable Rules – contain high-quality, actionable information
Trivial Rules – information already well-known by those familiar with the business
Inexplicable Rules – no explanation and do not suggest action
Trivial and Inexplicable Rules occur most often
Large Itemsets with minsup = 30%
Transactions:
T1: Beer, Milk
T2: Bread, Butter
T3: Bread, Butter, Jelly
T4: Bread, Butter, Milk
T5: Beer, Bread
Itemset support: {Bread, Butter} appears in 3 of the 5 transactions, so its support is 60%; the large itemsets (support ≥ 30%) are {Bread} 80%, {Butter} 60%, {Beer} 40%, {Milk} 40% and {Bread, Butter} 60%
Consider the itemset {Bread, Butter} and the two possible rules
Bread → Butter
Butter → Bread
Confidence(Bread → Butter) = Support({Bread, Butter}) / Support({Bread}) = 0.6 / 0.8 = 0.75, i.e. 75%
Confidence(Butter → Bread) = Support({Bread, Butter}) / Support({Butter}) = 0.6 / 0.6 = 1, i.e. 100%
Lift(Bread → Butter) = 0.75 / 0.6 = 1.25; Lift(Butter → Bread) = 1 / 0.8 = 1.25
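A plain-Python sketch of these three measures, checked against the T1 to T5 baskets above (the function names are illustrative):

# Support, confidence and lift computed directly from a list of transactions.
def support(transactions, items):
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

baskets = [
    {"Beer", "Milk"},              # T1
    {"Bread", "Butter"},           # T2
    {"Bread", "Butter", "Jelly"},  # T3
    {"Bread", "Butter", "Milk"},   # T4
    {"Beer", "Bread"},             # T5
]

print(support(baskets, {"Bread", "Butter"}))        # 0.6
print(confidence(baskets, {"Bread"}, {"Butter"}))   # 0.75
print(confidence(baskets, {"Butter"}, {"Bread"}))   # 1.0
print(lift(baskets, {"Bread"}, {"Butter"}))         # 1.25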
POS transactions:
Customer   Items purchased
1          OJ, soda
2          Milk, OJ, window cleaner
3          OJ, detergent
4          OJ, detergent, soda
5          Window cleaner, soda
What is the confidence of this rule: if a customer purchases soda, then the customer also purchases OJ (soda → OJ)?
2 out of the 3 soda purchases also include OJ, so 67%
What about the confidence of this rule reversed (OJ → soda)?
2 out of the 4 OJ purchases also include soda, so 50%
Confidence = ratio of the number of transactions with all the items to the number of transactions with just the “if” items
Creating Association Rules
Choosing the right set of items
Generating rules by deciphering the counts in the co-occurrence matrix
Overcoming the practical limits imposed by thousands or tens of thousands of unique items
5b.14 Prediction Rules
Association rules look for patterns between any interesting variables
Prediction rules look for patterns between one particular variable and the others, so as to be able to classify/predict what this variable's value might be.
This is back to credit scoring, so it is no surprise that the main structure is the classification tree.
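A sketch of fitting a classification tree and reading its splits back as prediction rules, assuming scikit-learn is available; the data is hypothetical:

# Classification tree sketch: predict a target variable from the others
# and read the fitted splits back as if/then prediction rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                        # hypothetical characteristics
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)         # hypothetical good/bad outcome

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2", "x3"]))   # the rules as if/then splits
print(tree.predict(X[:5]))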
AT&T Classification of Telephone Numbers
Background
AT&T has about 100 million customers
It logs 300 million calls per day, 40 attributes each
350 million unique telephone numbers
Which are business and which are residential?
Proprietary model, using a few attributes, trained on known business customers to adaptively track p(business|data)
Significant systems engineering: data are downloaded nightly, model updated (20 processors, 6Gb RAM, terabyte disk farm)
invaluable evolving “snapshot” of phone usage in US for AT&T
basis for fraud detection, marketing, and other applications
Agglomeration schedule (example): distance at which clusters are merged at each stage
Stage   Clusters before   Clusters after   Distance
0       12                11               0.014823329
1       11                10               0.7078080821
2       10                9                0.9740618424
3       9                 8                1.0424687816
4       8                 7                1.1000105672
5       7                 6                3.6799831574
6       6                 5                3.4923992084
7       5                 4                6.7436785255
8       4                 3                8.2758815454
9       3                 2                8.7866973148
10      2                 1                11.4033035123
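A sketch of the corresponding scree/elbow plot, assuming matplotlib is available; the distances are the coefficients from the table above, rounded:

# Scree / elbow plot sketch: plot the merge distance against the number of clusters
# left after each stage and look for the "elbow" where the distance jumps.
import matplotlib.pyplot as plt

n_clusters = list(range(11, 0, -1))      # 11, 10, ..., 1 clusters remaining
distance = [0.0148, 0.7078, 0.9741, 1.0425, 1.1000, 3.6800,
            3.4924, 6.7437, 8.2759, 8.7867, 11.4033]

plt.plot(n_clusters, distance, marker="o")
plt.gca().invert_xaxis()                 # read from many clusters down to one
plt.xlabel("Number of clusters")
plt.ylabel("Merge distance (coefficient)")
plt.show()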
Measuring how good the clustering is
Suppose there are K clusters C_1, ..., C_K.
The centroid of cluster k (containing n_k data points) is r_k = (1/n_k) Σ_{x in C_k} x.
Within-cluster variation: w(C) = Σ_k wc(C_k), where wc(C_k) = Σ_{x in C_k} d(x, r_k)
and d(x, y) is the Euclidean distance from x to y.
Between-cluster variation: b(C) = Σ_{j<k} d(r_j, r_k).
The measure is some combination of the two: b(C)/w(C) or b(C) - a·w(C).
This can be extended using the variance-covariance matrix; variants of this can be used, e.g. w(C) = trace(W), where W is the within-cluster covariance matrix.
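A minimal NumPy sketch of these two quantities (the data and cluster labels are hypothetical):

# Within-cluster variation w(C) and between-cluster variation b(C) using
# Euclidean distance d(x, y) (squared distances are another common choice).
import numpy as np

def cluster_variation(X, labels):
    ks = np.unique(labels)
    centres = np.array([X[labels == k].mean(axis=0) for k in ks])
    # w(C): sum over clusters of distances from each point to its cluster centre
    w = sum(np.linalg.norm(X[labels == k] - centres[i], axis=1).sum()
            for i, k in enumerate(ks))
    # b(C): sum of distances between cluster centres
    b = sum(np.linalg.norm(centres[i] - centres[j])
            for i in range(len(ks)) for j in range(i + 1, len(ks)))
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])   # hypothetical data
labels = np.array([0] * 50 + [1] * 50)
w, b = cluster_variation(X, labels)
print(w, b, b / w)   # e.g. use b(C)/w(C) as an overall quality score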
Distance measures between clusters D(C_i, C_j), with d(x, y) the Euclidean distance between points
Nearest neighbour (single link): D(C_i, C_j) = min{ d(x, y) : x in C_i, y in C_j }
Average link: D(C_i, C_j) = average of d(x, y) over x in C_i, y in C_j
Furthest neighbour (complete link): D(C_i, C_j) = max{ d(x, y) : x in C_i, y in C_j }
Ward's minimum variance: the sum of squared distances within each cluster is calculated (this is an ANOVA-type calculation). Then the increase in this sum of squares if two clusters are put together is calculated for each pair of clusters, and the pair for which this increase is least is chosen.