[Market Basket Analysis] Association Analysis Simplified

2015. 12. 11. 12:49 MBA

What is Association Analysis? 

 Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrences of other items in the transaction

Association Rule

An implication expression of the form X -> Y, where X and Y are item sets

Example:    {Milk, Diaper} -> {Beer}

Here X is {Milk, Diaper} and Y is {Beer}

 

TID   Items
1     Chips, Milk
2     Chips, Diaper, Beer, Cornflakes
3     Milk, Diaper, Beer, Pepsi
4     Chips, Milk, Diaper, Beer
5     Chips, Milk, Diaper, Pepsi

 

Association Rule Evaluation Metrics

Support (s) = Fraction of transactions that contain both X and Y i.e. how often Milk, Diaper and Beer occur together in the transactions. Milk, Diaper and Beer occur in 2 out of total 5 transactions, hence support =2/5=0.4

Confidence (c) = Measures how often each item in Y appears in transactions that contain X

C = Support (X + Y) / Support (X)

That is, how often Beer occurs in the transactions which contain Milk and Diaper. Milk and Diaper are together in 3 transactions (TID = 3, 4 and 5), and Beer is present in 2 of those 3, hence confidence = 2/3 = 0.67 (No. of transactions with Milk, Diaper and Beer / No. of transactions with Milk and Diaper).
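The support and confidence figures above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names `support` and `confidence` are illustrative choices, not part of any library, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

# The five example transactions from the table above (TID 1 to 5).
transactions = [
    {"Chips", "Milk"},
    {"Chips", "Diaper", "Beer", "Cornflakes"},
    {"Milk", "Diaper", "Beer", "Pepsi"},
    {"Chips", "Milk", "Diaper", "Beer"},
    {"Chips", "Milk", "Diaper", "Pepsi"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(x, y, transactions):
    """Support of X and Y together divided by the support of X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

x, y = {"Milk", "Diaper"}, {"Beer"}
rule_support = support(x | y, transactions)        # 2/5
rule_confidence = confidence(x, y, transactions)   # 2/3
print(float(rule_support), round(float(rule_confidence), 2))  # 0.4 0.67
```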

Lift: The lift of the rule X=>Y is the confidence of the rule divided by the expected confidence, assuming that the item sets are independent. The expected confidence is simply the support of Y.

 Interpretation of Lift:

A lift value greater than 1 indicates that X and Y appear more often together than expected; this means that the occurrence of X has a positive effect on the occurrence of Y or that X is positively correlated with Y.

A lift smaller than 1 indicates that X and Y appear less often together than expected; this means that the occurrence of X has a negative effect on the occurrence of Y or that X is negatively correlated with Y.

A lift value near 1 indicates that X and Y appear almost as often together as expected; this means that the occurrence of X has almost no effect on the occurrence of Y or that X and Y have zero correlation. Lift can take any value from 0 to infinity.

For all the values of lift which are > 1, actual lift= Lift value-1 and 

% Increase in those cases = (Lift value-1)*100

Coming back to our example:

Lift (X->Y) = Confidence (X->Y) / Support (Y)

= Support (X+Y) / (Support (X) * Support (Y))

= (2/3) / (3/5) = 0.667 / 0.60 = 1.11

Now, let us do a bit of math here: ((0.667 - 0.60) / 0.60) * 100 = 11.1, i.e. the probability of finding Beer in the transactions which have Milk and Diaper is about 11.1% greater than the normal probability of finding Beer in the above 5 transactions. (Rounding the confidence to 0.67 before dividing gives the slightly inflated figures 1.1167 and 11.67%.)

 How? Let’s solve further

 Probability= Favorable Number of Cases/Total Sample Space

Probability of finding beer in the above 5 transactions=3/5=0.60

Probability of finding beer in the transactions which have milk and diaper

Favorable Cases= Beer + Milk + Diaper

Sample Space=Milk + Diaper

= number of transactions which have Beer with Milk and Diaper / number of transactions which have Milk and Diaper = 2/3 = 0.67. Now 2/3 is about 11.1% more than 0.60 (0.667 / 0.60 = 1.11), i.e. there is a lift, or increase, of about 11.1% in the probability of finding Beer in the transactions which have Milk and Diaper
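The lift calculation can be checked the same way. The sketch below assumes the same five transactions and uses a hypothetical helper named `lift`. Working in exact fractions gives a lift of 10/9, about 1.111, or an increase of about 11.1%; the value 1.1167 only appears when the confidence is rounded to 0.67 before dividing.

```python
from fractions import Fraction

transactions = [
    {"Chips", "Milk"},
    {"Chips", "Diaper", "Beer", "Cornflakes"},
    {"Milk", "Diaper", "Beer", "Pepsi"},
    {"Chips", "Milk", "Diaper", "Beer"},
    {"Chips", "Milk", "Diaper", "Pepsi"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))

def lift(x, y):
    """Support(X+Y) / (Support(X) * Support(Y))."""
    return support(set(x) | set(y)) / (support(x) * support(y))

rule_lift = lift({"Milk", "Diaper"}, {"Beer"})   # (2/5) / ((3/5) * (3/5)) = 10/9
percent_increase = (rule_lift - 1) * 100         # 100/9
print(round(float(rule_lift), 3), round(float(percent_increase), 2))  # 1.111 11.11
```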

To Summarize:

Support: The support of the rule, that is, the relative frequency of transactions that contain X and Y.

                  Support(X->Y) = support(X+Y)

Confidence: The confidence of the rule.  Confidence(X->Y) = support(X+Y)/ support(X)

Lift: The confidence of the rule divided by the support of Y.  Lift (X->Y) = confidence(X->Y) / support(Y)

          = Support (X+Y) / (Support (X) * Support (Y))

Support of the rule X->Y is symmetric, i.e. Support (X->Y) = Support (Y->X)

Lift of the rule X->Y is symmetric, i.e. Lift (X->Y) = Lift (Y->X)
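A quick check on the example data shows that support and lift stay the same when a rule is reversed, while confidence generally does not. The rule {Milk} -> {Beer} is picked here purely for illustration, because Milk and Beer have different supports; the helper names are assumptions as well.

```python
from fractions import Fraction

transactions = [
    {"Chips", "Milk"},
    {"Chips", "Diaper", "Beer", "Cornflakes"},
    {"Milk", "Diaper", "Beer", "Pepsi"},
    {"Chips", "Milk", "Diaper", "Beer"},
    {"Chips", "Milk", "Diaper", "Pepsi"},
]

def support(itemset):
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))

def metrics(x, y):
    """Return (support, confidence, lift) for the rule X -> Y."""
    joint = support(set(x) | set(y))
    conf = joint / support(x)
    return joint, conf, conf / support(y)

forward = metrics({"Milk"}, {"Beer"})   # Milk -> Beer
reverse = metrics({"Beer"}, {"Milk"})   # Beer -> Milk

print("support   :", forward[0], reverse[0])  # same in both directions
print("confidence:", forward[1], reverse[1])  # differs
print("lift      :", forward[2], reverse[2])  # same in both directions
```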

 

Drawback of Confidence:

Confidence can sometimes be misleading, as shown in the example below

                     Credit Card
Savings Account      No     Yes    Total
No                   50     350    400
Yes                  100    500    600

 Rule: S=>C (People with Savings Account are likely to have a credit card)

The interpretation of implication (=>) in association rules can sometimes be misleading

As in Above: Support (S=>C) =500/1000=50%

Confidence (S=>C) = 500/600=83%

Expected Confidence (S=>C) = (350+500)/1000 = 85%

Lift (S=>C) = 0.83/0.85 = 0.98 < 1

Based on the Support and Confidence, it might be considered a strong rule. However, people without a savings account are even more likely to have a credit card (=350/400=87.5%).

Savings Account and Credit Card are in fact found to have a negative correlation. Thus, high confidence and support do not imply cause and effect; the two products at times might not even be positively correlated.

One has to exercise caution in making any recommendations in such cases and look closely at the lift values.
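The Savings Account / Credit Card numbers can be verified directly from the four cell counts. The nested dictionary layout below is just one convenient way to hold the table and is an assumption of this sketch.

```python
from fractions import Fraction

# counts[savings_account][credit_card], taken from the table above.
counts = {
    "No":  {"No": 50,  "Yes": 350},
    "Yes": {"No": 100, "Yes": 500},
}

total = sum(sum(row.values()) for row in counts.values())   # 1000 customers
savings_yes = sum(counts["Yes"].values())                   # 600
card_yes = counts["No"]["Yes"] + counts["Yes"]["Yes"]       # 850

support_sc = Fraction(counts["Yes"]["Yes"], total)          # 500/1000
confidence_sc = Fraction(counts["Yes"]["Yes"], savings_yes) # 500/600
expected_confidence = Fraction(card_yes, total)             # 850/1000
lift_sc = confidence_sc / expected_confidence               # below 1

# Share of customers WITHOUT a savings account who hold a credit card.
confidence_no_savings = Fraction(counts["No"]["Yes"], sum(counts["No"].values()))

print(f"support={float(support_sc):.2f} confidence={float(confidence_sc):.3f} "
      f"lift={float(lift_sc):.3f} no-savings={float(confidence_no_savings):.3f}")
```

The lift comes out just under 1 (50/51), which matches the observation that customers without a savings account hold a credit card slightly more often (87.5%) than those with one (83.3%).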

Possible Recommendations for X=>Y Rule (Where X and Y are 2 separate Products and have high support, high confidence and high positive lift > 1)

  1. Put X and Y Closer in the Store
  2. Package X with Y
  3. Package X and Y with a poorly selling item
  4. Give Discount on only one of X and Y
  5. Increase the Price of X and lower the price of Y (or vice versa)
  6. Advertise only one of X and Y i.e. do not advertise X and Y together
  7. Example: If X was a toy and Y a form of sweet, then offering sweets in the form of toy X could also be a good option.

  Example: Interpretation of Rules for a sample product transaction set:

The thresholds used were 1.5 % support and 20% confidence.

 

Product1   ==>   Product2   Support (%)   Confidence (%)   Lift
P          ==>   Q          2.18          26.33            1.49
R          ==>   Q          1.50          23.82            1.35
S          ==>   Q          2.42          23.45            1.33
T          ==>   U          1.79          21.06            1.23

 Interpretation of the first Rule:

Products P and Q together appear in 2.18 % of the transactions as indicated by Support.

If there are 100 transactions that contain Product P, then 26 of those also have Q as indicated by the Confidence.

Q is 49% more likely to occur when P is present than it is overall, as indicated by the Lift (1.49).

Or: the probability of finding Q in the transactions which have Product P is 49% more than the probability of finding Product Q in all the transactions.
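Since lift = confidence / support(Y), the overall share of transactions containing Y can be backed out of each row as confidence / lift. The sketch below does this for the four rules; the results are only approximate because the published figures are rounded, and the function name `implied_baseline` is hypothetical.

```python
# (X, Y, support %, confidence %, lift) rows copied from the rule table.
rules = [
    ("P", "Q", 2.18, 26.33, 1.49),
    ("R", "Q", 1.50, 23.82, 1.35),
    ("S", "Q", 2.42, 23.45, 1.33),
    ("T", "U", 1.79, 21.06, 1.23),
]

def implied_baseline(confidence_pct, lift):
    """Support of Y (in %) implied by lift = confidence / support(Y)."""
    return confidence_pct / lift

summary = []
for x, y, support_pct, confidence_pct, lift in rules:
    baseline = implied_baseline(confidence_pct, lift)
    increase_pct = (lift - 1) * 100
    summary.append((x, y, round(baseline, 2), round(increase_pct)))
    print(f"{x} ==> {y}: baseline {y} = {baseline:.2f}%, increase = {increase_pct:.0f}%")
```

The three rules that end in Q all imply nearly the same baseline for Q (roughly 17.6% of all transactions), which is a useful sanity check on the table.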

 Mathematics behind the Rule (Ex B->C):

Lift = Support (B + C) / (Support (B) * Support (C)) = Confidence (B->C) / Support (C)

 The Way Lift has been calculated is as below:

Say for Example if total transactions are 100

C is present in 25=> Probability of finding C in transactions=25/100=1/4=0.25

B is present in 50, but C is present with B in 25 of them. So Probability of finding C in all the transactions with B is = B + C together/ B alone = 25/50=0.50

It implies that the probability of finding C in the transactions with B is double the probability of finding C in all the transactions, i.e. Lift (B->C) = 0.50 / 0.25 = 2
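The B->C arithmetic above can be wrapped in a small function that works from raw counts. The function name and its argument order are illustrative assumptions, not an established API.

```python
from fractions import Fraction

def rule_metrics(n_total, n_x, n_y, n_xy):
    """Support, confidence and lift of X -> Y from raw transaction counts."""
    support_xy = Fraction(n_xy, n_total)   # share of transactions with X and Y
    confidence = Fraction(n_xy, n_x)       # share of X transactions that also have Y
    support_y = Fraction(n_y, n_total)     # baseline share of transactions with Y
    return support_xy, confidence, confidence / support_y

# B -> C: 100 transactions, B in 50, C in 25, B and C together in 25.
support_bc, confidence_bc, lift_bc = rule_metrics(100, 50, 25, 25)
print(float(support_bc), float(confidence_bc), float(lift_bc))  # 0.25 0.5 2.0
```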

Example: Interpretation of Rules for a sample Product by region transaction set

Summary of association rules: Min: support = 2.0%, confidence = 20.0%

Max. Size of an Item Set = 10

Support: Fraction of transactions that contain both X and Y. The threshold has been kept at 2%, i.e. at least 2% of the transactions contain both X and Y.

Confidence (c): Measures how often each item in Y appears in transactions that contain X. The threshold has been kept at 20%.
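To see how the three settings (minimum support, minimum confidence, maximum item set size) act as filters, here is a brute-force sketch on the five-transaction example from earlier in the post. It is not the algorithm used to produce the table below (real tools typically use Apriori or FP-Growth); the function `mine_rules`, the much higher thresholds, and the reading of "item set size" as the number of items in X and Y combined are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

transactions = [
    {"Chips", "Milk"},
    {"Chips", "Diaper", "Beer", "Cornflakes"},
    {"Milk", "Diaper", "Beer", "Pepsi"},
    {"Chips", "Milk", "Diaper", "Beer"},
    {"Chips", "Milk", "Diaper", "Pepsi"},
]

def support(itemset):
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def mine_rules(min_support, min_confidence, max_size):
    """Enumerate every rule X -> Y that passes both thresholds (brute force)."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset)
            if s < min_support:
                continue  # item set too rare, no rule is built from it
            # Split the item set into X -> Y for every non-empty proper subset X.
            for k in range(1, size):
                for left in combinations(combo, k):
                    x = frozenset(left)
                    y = itemset - x
                    c = s / support(x)
                    if c >= min_confidence:
                        rules.append((x, y, s, c, c / support(y)))
    return rules

rules = mine_rules(Fraction(2, 5), Fraction(2, 3), max_size=3)
for x, y, s, c, l in rules:
    print(sorted(x), "==>", sorted(y), float(s), round(float(c), 2), round(float(l), 2))
```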

 

Item Set 1 ( X )   ==>   Item Set 2 ( Y )   Support (%)   Confidence (%)   Lift
A1                 ==>   P1                 3.61          88.91            19.41
A2                 ==>   P2                 1.99          65.89            15.11

Consider the top rule:

Let X= A1 (Region)

Let Y= P1 (Product)

Why is the support the same for a rule and its reverse (X=>Y and Y=>X)? -> It follows directly from the formula

Support = Transactions that contain both X and Y/Total Transactions

Since X and Y are the same for both rules and only their orientation differs, the support comes to 3.61% for both, i.e. 3.61% of the transactions contain both X & Y

Interpretation of the Confidence Value:

X=>Y, confidence = (X Union Y)/X i.e. Support (X+Y)/Support (X)

Product P1 occurs in 88.91% of the transactions which contain A1 as the region.

Say for example there are 100 transactions which contain region- A1, among them 89 transactions contain the Product P1

Interpretation of the Lift value:

Lift (X->Y) = confidence(X->Y) / support(Y) = Support (X+Y) / (Support (X) * Support (Y))

For this rule, the lift is 19.41 => the probability of finding Product P1 is 19.41 times as high in the transactions where the region is A1

Or

The probability of P1 in the transactions which have region A1 is 19.41 times the probability of Product P1 in all the transactions, an increase of 18.41 times (lift - 1).
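As a rough cross-check of the top rule, the three published figures can be combined to recover the shares that are not shown in the table. The variable names are illustrative, and the results are approximate because the inputs are rounded.

```python
# A1 ==> P1 row from the table: support %, confidence %, lift.
support_pct, confidence_pct, lift = 3.61, 88.91, 19.41

# lift = confidence / support(P1), so support(P1) = confidence / lift.
baseline_p1_pct = confidence_pct / lift

# confidence = support(A1 and P1) / support(A1), so support(A1) = support / confidence.
share_a1_pct = support_pct / confidence_pct * 100

# "Times as likely" is the lift itself; the increase over the baseline is lift - 1.
increase_in_times = lift - 1

print(f"P1 overall: {baseline_p1_pct:.2f}% of transactions")
print(f"Region A1 : {share_a1_pct:.2f}% of transactions")
print(f"P1 given A1 is {lift} times as likely, an increase of {increase_in_times:.2f} times")
```

So P1 appears in only about 4.6% of all transactions but in about 89% of the A1 transactions, which is what a lift of 19.41 expresses.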

Mathematics behind the Rule (Ex T->S): 

Say for Example if total transactions are 100

S is present in 20=> Probability of finding S in transactions=20/100=0.20

T is present in 50, but S is present with T in 20 of them. So Probability of finding S in all the transactions with T is = T + S together/ T alone = 20/50=0.40

It implies that Probability of finding Product S in all the transactions with region T is double the probability of finding S alone in all the transactions


From : http://analyticstrainings.com/?p=151