
ETX2250/ETF5922: Data Visualization and Analytics
Market Basket Analysis
Lecturer:
Department of Econometrics and Business Statistics 
 Week 9

Market basket analysis
Also known as: Association analysis, Affinity analysis
When would we use market basket analysis? A supermarket would like to increase sales by:
Placing products in such a way that customers see items they are likely to add to their basket (e.g., chips and dip)
Providing vouchers to encourage the purchase of likely items (e.g., fancy cat toys to owners who buy super premium cat food)
What data do we have? Items purchased during each transaction:
What items are commonly purchased together
Does the purchase of item A make the purchase of item B more likely?

Can you think of an example
where you have experienced
the result of this analysis?
Online book stores increase sales by drawing customers' attention to particular books; Netflix helping you find the next movie

Terminology and Notation
A transaction can be thought of as a basket of items. The letters represent the items. This transaction contains items A, B, C, D and E:

Transaction = {A, B, C, D, E}

A particular group of items is called an itemset. The following group contains items A, C and F:

ItemSet = {A, C, F}

The data might suggest that where A and F occur in a basket, C is also likely to occur. This is represented with an Association Rule:

{A, F} → {C}

How do we find these association rules?

A transaction by product matrix

Transaction   A   B   C   D   E
     1        1   1   1   0   0
     2        1   1   1   1   1
     3        0   0   0   1   0
     4        1   0   0   0   0
     5        0   1   1   0   0
     6        1   0   1   1   0
     7        0   0   1   0   0
     8        0   1   1   0   0

Rows represent transactions (different baskets)
Columns represent different items that might be purchased; 1 means the item was in the basket, 0 means it was not
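If you want to play with this toy example yourself, here is a minimal R sketch (not from the original slides) that stores the matrix above; the object name basket_matrix is my own choice, and it is reused in the small sketches later in the deck.

# The toy transaction-by-product matrix above, as a 0/1 matrix in R
basket_matrix <- matrix(
  c(1, 1, 1, 0, 0,   # transaction 1
    1, 1, 1, 1, 1,   # transaction 2
    0, 0, 0, 1, 0,   # transaction 3
    1, 0, 0, 0, 0,   # transaction 4
    0, 1, 1, 0, 0,   # transaction 5
    1, 0, 1, 1, 0,   # transaction 6
    0, 0, 1, 0, 0,   # transaction 7
    0, 1, 1, 0, 0),  # transaction 8
  nrow = 8, byrow = TRUE,
  dimnames = list(as.character(1:8), c("A", "B", "C", "D", "E"))
)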

Consider the itemset {A, B, C}

There are several possible association rules (IF → THEN):

{A, B} → {C}    {A, C} → {B}    {B, C} → {A}

These are probabilistic. Baskets with A and B don't necessarily include C, etc.

Two baskets include A and B, and both include C, so the rule {A, B} → {C} works all the time.
Four baskets include B and C, and of those, two include A, so the rule {B, C} → {A} works half the time.
Three baskets include A and C, and of those, two include B, so the rule {A, C} → {B} works 2/3 of the time.

How often does the rule work?

{C, D} → {B}

More formal language

In the rule {A, B} → {C}:
{A, B} is called the antecedent
{C} is called the consequent

The support of the itemset {A, B, C} is the proportion of all transactions that include A, B and C
The support count of the itemset {A, B, C} is the number of transactions that include A, B and C
The support of the association rule {A, B} → {C} is the support of {A, B, C}
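As a rough check of these definitions, here is a small sketch using the hypothetical basket_matrix object defined under the toy table above:

# Support count of {A, B, C} = number of baskets containing A, B and C
in_ABC <- basket_matrix[, "A"] == 1 & basket_matrix[, "B"] == 1 & basket_matrix[, "C"] == 1
sum(in_ABC)    # support count = 2 (transactions 1 and 2)
# Support of {A, B, C} = proportion of all baskets containing A, B and C
mean(in_ABC)   # support = 2/8 = 0.25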

Confidence

How can we describe uncertainty about this rule?

Confidence is a conditional probability: Pr(Consequent | Antecedent)

Confidence = p(Antecedent AND Consequent) / p(Antecedent)

Confidence = Support Count{A, B, C} / Support Count{A, B}
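Continuing the same sketch (still assuming the basket_matrix object from earlier), the confidence of {A, B} → {C} can be computed directly from the two support counts:

# Confidence = Support Count{A, B, C} / Support Count{A, B}
in_AB  <- basket_matrix[, "A"] == 1 & basket_matrix[, "B"] == 1
in_ABC <- in_AB & basket_matrix[, "C"] == 1
confidence_AB_C <- sum(in_ABC) / sum(in_AB)   # 2 / 2 = 1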

Consider {C, D} → {B}

What is the association rule?
What is the antecedent?
What is the consequent?
What is the itemset?
What is the support?
What is the support count?
What is the confidence?

Lift
What if C is just super common in the data?
Are you more likely to find the consequent (C) in a basket that contains the antecedent (A and B)? How much more likely?
Numerator = likelihood of C in a basket that contains A and B
Denominator = likelihood of C in any basket
Lift measures how many times more likely the consequent is when the antecedent is present For the rule to be useful, lift needs to be more than 1
Lift of an association rule = Confidence of the association rule / Support of the consequent
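A matching sketch for lift, reusing confidence_AB_C and the hypothetical basket_matrix from the sketches above:

# Lift = Confidence of the rule / Support of the consequent
support_C <- mean(basket_matrix[, "C"] == 1)   # 6/8 = 0.75
confidence_AB_C / support_C                    # lift = 1 / 0.75 ≈ 1.33, i.e. above 1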

What is a good association rule?
An association rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets.
Ex. Wal-Mart mined its transactional data to uncover strong evidence of the association rule, “If a customer purchases a Barbie doll, then a customer also purchases a candy bar.”
An association rule is useful if
Its antecedents strongly increase the likelihood of the consequent (measured by the lift), and
It explains an important, previously unknown relationship
It should also have reasonably large support

Computational challenges & the apriori algorithm
How do we generate these candidate rules?

We could use all possible rules, but why couldn't we do this for very big data?

Take the three item set {A, B, C}. How many possible rules are there?

A → B, B → A, C → A, A → C, B → C, C → B, {A, B} → C, {A, C} → B, {B, C} → A, A → {B, C}, B → {C, A}, C → {A, B}

If p = number of items in the set, the number of possible rules is 3^p − 2^(p+1) + 1

In the case of 3 items this is 12; in the case of 6 items, 602; for 7 items, 1932

If we have a large number of items there will be too many rules! Often we focus on frequent item-sets (see support)
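A quick sketch to verify the rule-count formula for a few values of p (the helper name n_rules is mine):

# Number of possible association rules among p items: 3^p - 2^(p+1) + 1
n_rules <- function(p) 3^p - 2^(p + 1) + 1
n_rules(c(3, 6, 7))   # 12, 602, 1932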

A priori algorithm
1. Generate frequent itemsets with just one item (count how many transactions contain that item in the data)
2. Drop items that have support below the desired minimum support
3. Generate frequent two item sets using only the frequent one item sets.
4. Generate the frequent three item sets using only the frequent two item sets…
The idea is that if a smaller itemset doesn't have sufficient support, a larger itemset containing it won't have sufficient support either.
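A minimal illustration of steps 1 and 2, again using the hypothetical basket_matrix from earlier; the minimum support of 0.25 is just an example value:

# Step 1: support of every one-item itemset
item_support <- colMeans(basket_matrix)   # A = 0.5, B = 0.5, C = 0.75, D = 0.375, E = 0.125
# Step 2: keep only items meeting the minimum support
min_support <- 0.25
frequent_items <- names(item_support)[item_support >= min_support]
frequent_items                            # "A" "B" "C" "D" (E is pruned)
# Steps 3-4 would then build two- and three-item candidates only from these survivors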

An example of the power of associations: The Netflix Prize

Netflix offered a prize of $1 million to anyone who could improve their recommendations by 10%
Online selling provides data
Observe what movies are rented together or by the same individual at different times (this is before the monthly fee version of Netflix)
Netflix movies also have customers' ratings of movies
Develop recommendation systems
Suggest purchases based on what the individual has already purchased and their ratings

Two approaches:
The direct approach:
Rate movies by factors such as: amount of comedy, complication of plot, amount of blood and gore, how handsome the lead actor was, and a couple more.
Classify the viewer’s taste in the same way
Recommend movies that coincide well with the viewer's taste

Alternatively:
Start with random factors assigned to movies, each a combination of available linear variables
Tune the factors to align more and more with the viewer's tastes
The factors may not be as comprehensible as "how much comedy" but they predicted viewers' ratings of movies very well

Actually even more complex (http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/)

How to in R

Shopping basket transactions at the Hy-Vee store

Aims: to suggest:
How to increase sales through product placement
How to increase sales through cross-product promotions

First read in the data

transaction_matrix_v1 <- read_csv(here::here("data/HyVeetr.csv"), col_names = FALSE)

Is this a good format? Rows = baskets, columns = item (no particular order)

Shopping basket transactions at the Hy-Vee store

Let's use a special function from the arules package to read the data in as transactions

library(arules)
transaction_matrix <- read.transactions(here::here("data/HyVeetr.csv"), sep = ",")
transaction_matrix

## transactions in sparse format with
## 10 transactions (rows) and
## 15 items (columns)

This is a very different structure to what we have used before, what does it look like?

str(transaction_matrix)

## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i       : int [1:53] 1 3 6 7 8 10 1 2 3 4 ...
## .. .. ..@ p       : int [1:11] 0 6 15 19 25 31 38 42 46 49 ...
## .. .. ..@ Dim     : int [1:2] 15 10
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo   :'data.frame': 15 obs. of 1 variable:
## .. ..$ labels: chr [1:15] "Beer" "Bread" "Cheese" "Cream" ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables

Complex!

Getting back to a binary incidence matrix

The storage of the transaction matrix is designed for efficient memory and computation (remember the computation issues!)
This is a deviation from our usual tidy data, purely for computational efficiency
But I want to visually inspect our transaction matrix so I am going to extract it out. You won't be expected to do this in an exam

binary_incidence <- as(transaction_matrix, "matrix")
colnames(binary_incidence) <-

kableExtra::kable(binary_incidence)

[Table: the binary incidence matrix of the 10 transactions by 15 items (Beer, Bread, Cheese, Cream, ...) as TRUE/FALSE entries; the extracted values are too garbled to reproduce here]

Consider {Bread, Milk} → {Cheese}

What is the association rule?
What is the antecedent?
What is the consequent?
What is the itemset?
What is the support?
What is the support count?
What is the confidence?
Frequency of items

Let's get the most frequent items using the eclat function
Specify the support (supp) and the maximum number of items in each item set (maxlen)

frequentItems <- eclat(transaction_matrix, parameter = list(supp = 0.5, maxlen = 3))

## Eclat
##
## parameter specification:
##  tidLists support minlen maxlen            target ext
##     FALSE     0.5      1      3 frequent itemsets TRUE
##
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
##
## Absolute minimum support count: 5

inspect(frequentItems)

##      items             support transIdenticalToItemsets count
## [1]  {Fruit,Softdrink} 0.5     5                        5
## [2]  {Cream,Fruit,Jam} 0.5     5                        5
## [3]  {Fruit,Jam}       0.5     5                        5
## [4]  {Cream,Jam}       0.5     5                        5
## [5]  {Fruit,Milk}      0.6     6                        6
## [6]  {Cream,Fruit}     0.6     6                        6
## [7]  {Fruit}           0.9     9                        9
## [8]  {Cream}           0.6     6                        6
## [9]  {Milk}            0.6     6                        6
## [10] {Jam}             0.5     5                        5
## [11] {Softdrink}       0.5     5                        5

Item Frequency plot

Use the command itemFrequencyPlot
Specify that we only plot the 10 most frequent items (with the topN argument). Specify we want the absolute frequency (rather than the relative) using the type argument, and add a title using main.

itemFrequencyPlot(transaction_matrix, topN = 10, type = "absolute", main = "Item Frequency")

Mine the rules (find the most common)

Use the apriori function to run the apriori algorithm
Need to specify the minimum support (supp), the minimum confidence (conf) and the minimum length considered.

basket_apriori <- apriori(transaction_matrix, parameter = list(supp = 0.5, conf = 0.8, minlen = 2))

basket_apriori <- apriori(transaction_matrix, parameter = list(supp = 0.5, conf = 0.8, minlen = 2))

## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.5      2
##  maxlen target ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 5

Extract the candidate rules

inspect(basket_apriori)

##     lhs              rhs     support confidence coverage lift     count
## [1] {Softdrink}   => {Fruit} 0.5     1.0000000  0.5      1.111111 5
## [2] {Jam}         => {Cream} 0.5     1.0000000  0.5      1.666667 5
## [3] {Cream}       => {Jam}   0.5     0.8333333  0.6      1.666667 5
## [4] {Jam}         => {Fruit} 0.5     1.0000000  0.5      1.111111 5
## [5] {Milk}        => {Fruit} 0.6     1.0000000  0.6      1.111111 6
## [6] {Cream}       => {Fruit} 0.6     1.0000000  0.6      1.111111 6
## [7] {Cream,Jam}   => {Fruit} 0.5     1.0000000  0.5      1.111111 5
## [8] {Fruit,Jam}   => {Cream} 0.5     1.0000000  0.5      1.666667 5
## [9] {Cream,Fruit} => {Jam}   0.5     0.8333333  0.6      1.666667 5

What makes a rule good?
When the antecedents hold, the consequent is very likely to hold. That is, the confidence is high.
The antecedents strongly increase the likelihood of the consequent. That is, the lift is reasonably large.
It explains a previously unknown relationship.
This is beyond what the calculations can tell you.
A rule will not be useful if it applies to hardly any baskets. However, if the dataset (and your business) is sufficiently large, you may be interested in rules whose itemsets have relatively low support, as we will see with the lastfm example below.

How often does data come like this?
The form of the grocery data was that each row represented a transaction, and the items in that transaction were listed, separated by commas.
There are other common forms of data, but they take a little more data munging before we can use them.
One example is the data you will use in your tutorial. You will need more data munging tools than the ones we covered in week 3 (R has extensive ways of manipulating data, but the basics we covered in week 3 are sufficient in many scenarios).

Last fm artist data
lastfm.csv records 300,000 artist selections by 15,000 users
We also have each user's country and gender
library(tidyverse)
lastfm <- read_csv(here::here("lectures/data/week09/lastfm.csv")) %>%
mutate(user = as.factor(user))
head(lastfm)
## # A tibble: 6 x 4
##   user  artist                  sex   country
##   <fct> <chr>                   <chr> <chr>
## 1 1     red hot chili peppers   f     Germany
## 2 1     the black dahlia murder f     Germany
## 3 1     goldfrapp               f     Germany
## 4 1     dropkick murphys        f     Germany
## 5 1     le tigre                f     Germany
## 6 1     schandmaul              f     Germany

How is this scenario like the supermarket scenario?
user = shopper
artist = item bought
How is the data structure different from the structure of the supermarket scenario?

Group split
lastfm_split <- lastfm %>%
group_by(user) %>%
group_split()
Now we have a new data object type! This is something called a list, which lets us store a large number of dataframes in one object.
Here we use group_split() to make a dataframe (or a tibble, to be more accurate) for every user, which we store in this list for easy access

Using map from purrr
This list of dataframes (tibbles) means we can apply our usual tidyverse data munging to each dataframe (user).
To see what's going on, let's focus in on the first user. We use the double square brackets to index into a list
user1 <- lastfm_split[[1]]

Data munging for user1

head(user1)

We want to:
Select only the artist
Make sure we only have the unique artists (so drop any multiple entries)
Take it out of a dataframe

Data munging for user1

user1 %>%
  select(artist) %>%
  unique() %>%
  deframe()

Map from the purrr package.
The map function in purrr lets us do this over every dataframe (tibble) in our list, with just one function call! Neat!
We do this by creating a function that takes in x (in this case each data.frame/tibble in our list), and then runs our little cleaning routine over it.
playlist <- lastfm_split %>% purrr::map(function(x) x %>%
select(artist) %>%
unique() %>%
deframe())
Let’s check
playlist[[1]]
## [1] "red hot chili peppers"   "the black dahlia murder"
## [3] "goldfrapp"               "dropkick murphys"
## [5] "le tigre"                "schandmaul"
## [7] "edguy"                   "jack johnson"

Almost finished with the data munging

Now we have a data.frame of every artist the user has listened to, for every user, stored in a list. Fortunately this is sufficient to coerce into the transactions format from before – phew!

playlist_tr <- as(playlist, "transactions")
playlist_tr

## transactions in sparse format with
## 15000 transactions (rows) and
## 1004 items (columns)

A popular story (from 2012)

A father walks into a local Target store, furious. Turns out (as the story goes) his teenage daughter has been sent coupons and advertising surrounding maternity and newborn goods.
The store manager apologizes profusely and the man leaves.
A few days later, the father calls the store and explains that his daughter was in fact pregnant.
This was a widely publicized example, but I cannot find actual verification for it. Still, it is a good cautionary tale of both the benefits and ethical risks of these methods.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Lecturer:
Department of Econometrics and Business Statistics
Week 9