Frequent Pattern Mining - Apriori Pt. 02
Posted by TRII in frequent-pattern-mining-course
Introduction / Overview¶
This post is the second in the series related to the Pattern Discovery in Data Mining course. In the first article, we looked into the Apriori Principle (Algorithm) and how it can be used to find frequent patterns in a dataset.
In this article, the various pieces from the first article are put together to form a complete implementation of the Apriori Algorithm.
After finding all frequent patterns in the dataset with a support of at least 5, we perform some basic analyses. More in-depth analyses are left for future articles.
import os
import sys
import time
import pickle
import itertools
from collections import defaultdict
import numpy
import scipy.stats
import pandas
from nltk import FreqDist
import ipyparallel as ipp
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
rnd = numpy.random.RandomState(8675309)
Goals for Current Dataset¶
To summarize, the goal of this post is to put all of the pieces constructed in part one together, and then process the dataset using the algorithm in order to extract all Frequent Patterns.
As before, the Talking Data Kaggle dataset will be utilized. The set of active applications associated with a user serves as the itemset (transaction) for that user. The Apriori Algorithm finds all frequent itemsets from these transactions.
Load Data¶
The data is loaded from the previous article's output, as some preprocessing was performed in order to convert the original dataset into the desired transaction format. As before, only a subset of the entire dataset is being analyzed here (7,500 out of about 25,000 records). This is to reduce the overall computational time necessary.
with open('transaction_list_subset.pkl', 'rb') as f:
    transactions = pickle.load(f)

with open('application_dictionary.pkl', 'rb') as f:
    apps_hash = pickle.load(f)
Implementation: Part II¶
Review¶
To review from the previous article, the steps in the iterative process are:
- Generate new candidates for Frequent-k patterns from Frequent-(k-1) patterns using the Apriori property
- Find the supports for each of the Frequent-k candidate patterns by iterating over the transaction dataset
- Determine which of the candidates have supports that meet the min_sup threshold; these are the Frequent-k itemsets
Steps 1-3 are repeated until no Frequent-k itemsets are found.
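For reference, the skeleton of this loop can be sketched in a few lines of plain, serial Python. This is a deliberately naive illustration (no Apriori pruning during candidate generation, no parallelism); the functions developed below are what we actually use.
# Naive, serial sketch of the iterative Apriori loop (illustration only)
def apriori_sketch(transactions, items, min_sup=5):
    frequent = {}
    candidates = [frozenset([item]) for item in items]   # Frequent-1 candidates
    k = 1
    while candidates:
        # Steps 1-2: count how many transactions contain each candidate
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        # Step 3: keep only the candidates meeting the min_sup threshold
        frequent[k] = [c for c, n in counts.items() if n >= min_sup]
        # Naive join: build Frequent-(k+1) candidates from the Frequent-k itemsets
        candidates = list({a | b for a in frequent[k] for b in frequent[k]
                           if len(a | b) == k + 1})
        k += 1
    return frequent
The real implementation below adds the Apriori prune when generating candidates and farms the support counting out to multiple engines.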
One other significant change from the previous article is that, while counting supports, we now record which transactions each itemset appears in. Knowing which transactions each pattern appears in will allow us to look at pattern co-occurrence, something we cannot do by observing supports alone. Most measures and metrics for sorting interesting patterns out of the huge number of frequent patterns found in the data also rely on some form of pattern co-occurrence. This is the primary reason for recording the transactions.
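To make this concrete, consider two hypothetical patterns and the (made-up) sets of transaction IDs they were found in; co-occurrence is then just a set intersection:
tids_a = set([0, 3, 5, 8, 13])        # hypothetical TID list for pattern A
tids_b = set([3, 5, 8, 21, 34])       # hypothetical TID list for pattern B
support_a = len(tids_a)               # support of A
support_b = len(tids_b)               # support of B
co_occurrence = len(tids_a & tids_b)  # transactions containing both A and B
print(support_a, support_b, co_occurrence)   # 5 5 3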
Find All Frequent Itemsets¶
Define Primary Work Functions¶
Some minor modifications have been made to the code below relative to what was used in the previous article. For starters, "section_loop_parallel" was modified to collect the set of transaction IDs in which each candidate itemset was found.
The "generateCandidates" function was also updated to generate desired results (ha), something that is always nice. Previously, we were not sorting the items post split. This resulted in missed matches in cases where matches actually occurred. Recall that we are utilizing the idea of consistent last and first items in order to speed up the matching process when looking for itemsets that complete larger itemsets.
Finally, we switched away from using the defaultdict class for storing results. (This is not technically a change from the previous code, as we had no need for defaultdict before, but we thought it was worth noting.) We seemed to encounter memory issues when relying on defaultdict after switching to the updated "section_loop_parallel" code that records transaction IDs and updating the "generateCandidates" code (which still has the defaultdict code present, but commented out). "Regular" Python dictionaries have built-in "defaultdict-like" functionality via the "setdefault" method (shown below). Our suspicion is that something associated with how defaultdict instances allocate space may be behind the memory issues.
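For anyone unfamiliar with "setdefault", the two approaches are interchangeable for our purposes; a quick toy comparison:
from collections import defaultdict

# defaultdict: missing keys are created with an empty list automatically
dd = defaultdict(list)
dd['pattern-a'].append(42)

# "regular" dict: setdefault inserts the default only when the key is missing
d = dict()
d.setdefault('pattern-a', []).append(42)

print(dd['pattern-a'], d['pattern-a'])   # [42] [42]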
def section_loop_parallel(section, patterns):

    # Record the transaction ID (row index) for every candidate pattern
    # found in each transaction
    def updateCounts(tid, transaction, patterns, curr_counts):
        new_items = checkMatches(transaction, patterns)
        for item in new_items:
            curr_counts.setdefault(item, []).append(tid)
        return(curr_counts)

    def patternInTransaction(transaction, pattern):
        return(pattern.intersection(transaction) == pattern)

    def checkMatches(transaction, patterns):
        new_items = [pattern for pattern in patterns \
                     if patternInTransaction(transaction, pattern)]
        new_items = ['-'.join([i for i in sorted(ni)]) for ni in new_items]
        return(new_items)

    counts = dict()

    _ = section.apply(lambda x: updateCounts(x.name,
                                             x['apps'],
                                             patterns,
                                             counts),
                      axis=1)

    return(counts)
def section_loop_parallel_v01(section, patterns):
    from nltk import FreqDist

    def updateCounts(transaction, patterns, curr_counts):
        new_items = checkMatches(transaction, patterns)
        curr_counts.update(new_items)
        return(curr_counts)

    def patternInTransaction(transaction, pattern):
        return(pattern.intersection(transaction) == pattern)

    def checkMatches(transaction, patterns):
        new_items = [pattern for pattern in patterns \
                     if patternInTransaction(transaction, pattern)]
        new_items = ['-'.join([i for i in sorted(ni)]) for ni in new_items]
        return(new_items)

    counts = FreqDist()

    _ = section.apply(lambda x: updateCounts(x['apps'],
                                             patterns,
                                             counts),
                      axis=1)

    return(counts)
def generateCandidates(frequent_k1_items, ii):

    if ii == 1:
        ncs = [set(combo) for combo in itertools.combinations(frequent_k1_items, 2)]
    else:
        # Map items in "frequent_k1_items" whose all-but-last items match
        # to the same list in the hash table
        #matchup_dict = defaultdict(list)
        #endwith_dict = defaultdict(set)
        matchup_dict = dict()
        endwith_dict = dict()

        for item in frequent_k1_items:
            n = item.split('-')
            key = '-'.join(n[:-1])

            #matchup_dict[key].append(n[-1])
            #endwith_dict[n[-1]].update([item])

            if key not in matchup_dict.keys():
                matchup_dict[key] = []
            #matchup_dict.setdefault(key, []).append(n[-1])
            matchup_dict[key].append(n[-1])

            if n[-1] not in endwith_dict.keys():
                endwith_dict[n[-1]] = set()
            #endwith_dict.setdefault(n[-1], set()).update([item])
            endwith_dict[n[-1]].update([item])

        # For each key in the hash table, construct new candidates
        # by iterating over all pairwise combinations of the corresponding
        # values. Merge each pairwise combo with the key to create
        # a new candidate. Check if the pattern consisting of the
        # last k-1 items in the candidate itemset constitutes a
        # frequent pattern in the k-1 Frequent itemsets. If yes,
        # add it as a candidate pattern.
        ncs = []
        for k,v in matchup_dict.items():
            k = k.split('-')
            base = k if type(k)==list else [k]
            for tail in itertools.combinations(v, 2):
                tail = sorted(tail)
                last = tail[-1]
                cand = base.copy()
                cand.extend(list(tail))
                test = '-'.join(cand[1:])
                if test in endwith_dict[last]:
                    ncs.append(set(cand))

    return(ncs)
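As a quick sanity check on the pruning logic, consider a tiny hand-built input. From the Frequent-2 itemsets "A-B", "A-C", and "B-C", the only surviving Frequent-3 candidate is {A, B, C}; drop "B-C" and the candidate is pruned, because one of its length-2 subsets is no longer frequent.
# Toy illustration of the candidate generation + Apriori prune above
print(generateCandidates(['A-B', 'A-C', 'B-C'], 2))   # [{'A', 'B', 'C'}] (set ordering may vary)
print(generateCandidates(['A-B', 'A-C'], 2))          # []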
Define Wrapper Functions¶
def parallelLoop(transactions, candidates, loop_func, view, n_sections, version):

    async_results = []
    len_section = transactions.shape[0] // n_sections
    start_cuts = [0 + len_section*i for i in range(n_sections)]

    # Split the transactions into sections and dispatch each to an engine
    for i, ind in enumerate(start_cuts):
        if i==(n_sections-1):
            section = transactions[ind:]
        else:
            section = transactions[ind:start_cuts[i+1]]

        ar = view.apply_async(loop_func, section, candidates)
        async_results.append(ar)

    # Wait until all tasks are done.
    for ar in async_results:
        ar.wait()
    results = [ar.get() for ar in async_results]

    # Merge the per-section results into a single counts structure
    if version == 1:
        counts = FreqDist()
        for result in results:
            counts.update(result.copy())
            result = []
            del result

    elif version == 2:
        counts = dict()
        for result in results:
            for k,v in result.items():
                counts.setdefault(k, []).extend(v.copy())
            result = []
            del result

    del results
    results = []

    return(counts)
def mainLoop(transactions, items_hash, min_sup=5, n_sections=8, version=2):

    rc = ipp.Client()
    view = rc.load_balanced_view()

    start = time.time()

    if version == 1:
        loop_func = section_loop_parallel_v01
    elif version == 2:
        loop_func = section_loop_parallel

    frequent = dict()
    candidates = [set([a]) for a in sorted(items_hash.values())]
    i = 1

    while candidates:
        # Status
        tic = time.time()
        print('Counting supports for Frequent-{0} items; {1} candidates'.format(i, len(candidates)))

        # Count supports for candidates
        counts = parallelLoop(transactions, candidates, loop_func, view, n_sections, version)

        # Update
        toc = time.time()
        print('Counted supports for Frequent-{0} items; {1:0.2f} min elapsed'.format(i, (toc-tic)/60))

        # Update frequent items dictionary
        if version == 1:
            frequent[i] = [(k,v) for (k,v) in counts.items() if v >= min_sup]
        elif version == 2:
            frequent[i] = [(k,v) for (k,v) in counts.items() if len(v) >= min_sup]
        print('{0} Frequent-{1} itemsets identified.'.format(len(frequent[i]), i))

        if frequent[i]:
            print('Generating candidates for Frequent-{0} items'.format(i+1))
            # Generate new candidates for k+1
            f_items = [k for k,v in frequent[i]]
            candidates = generateCandidates(f_items, i)
            # Update 'k'
            i += 1
            print()
        else:
            candidates = []

        # Clear engine namespaces in between iterations
        rc.clear(targets=rc.ids);
        rc.purge_everything();

    # Clear engine namespaces one last time
    rc.clear(targets=rc.ids);
    rc.purge_everything();

    stop = time.time()
    total_items = sum(len(f) for f in frequent.values())
    print('Process complete. {0:0.2f} minutes elapsed, total. {1} Frequent Patterns identified'\
          .format((stop-start)/60, total_items))
    print()

    return(frequent)
Run Process¶
Here the full process is executed. Note that about an hour and a half is required to run the full pattern extraction on the subset of the data we are looking at. The bulk of the processing time is associated with finding Frequent-(4, 5, 6, 7, 8) itemsets: one-third of the steps take up two-thirds of the time.
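Note that "mainLoop" assumes an ipyparallel cluster is already running (e.g., started from a terminal with "ipcluster start -n 8"). A quick check of the available engines before kicking off the long run:
# Sanity check that the ipyparallel engines are up
rc = ipp.Client()
print('{0} engines available'.format(len(rc.ids)))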
%%time
frequent_all = mainLoop(transactions, apps_hash, version=2)
with open('frequent_itemsets_dict_fill.pkl', 'wb') as f:
    pickle.dump(frequent_all, f)
Analysis¶
Candidates vs. Frequent¶
We would first like to note a couple of interesting trends observed in the data, which are visually represented in the plots below.
First, we see two fairly analogous "bell-shaped" curves. (Note: the distributions are not, and cannot be, Normal, as the data are (a) discrete counts with (b) a hard lower bound of 0, and (c) the plots are not PDFs, and interpreting them as PMFs would be dubious.)
The number of Frequent Patterns identified quickly climbs to a peak at around Frequent-6 patterns, and just as quickly returns to a minimum. The number of candidates -- other than candidates for Frequent-2 patterns -- follows a similar pattern, though it significantly leads the found frequent patterns until the peak, after which the two curves merge. With the full dataset, the peaks of these curves will likely shift to the right and higher-level patterns will be identified.
As the length of the patterns increases, the ratio of Frequent Patterns found to initial Candidate Patterns climbs toward unity. Perhaps even more surprisingly, the trend is almost linear from the Frequent-2 through Frequent-6 steps.
cand_counts = [2761, 1081185, 777612, 1108486, 1297796, 1302412, 1122931,
               790993, 429534, 172595, 49691, 9875, 1287, 104, 5]

for i in frequent_all.keys():
    tot_items = len(frequent_all[i])
    if tot_items > 999999:
        spacer = '\t'
    else:
        spacer = '\t\t'
    print('{0} {2}Frequent-{1} items found'.format(tot_items, i, spacer))

count_meta = pandas.DataFrame({'k' : range(1,16),
                               'Candidates' : cand_counts,
                               'Frequent' : [len(frequent_all[i]) for i in frequent_all.keys()]})
count_meta.index = count_meta.k
count_meta['Frac'] = count_meta.apply(lambda x: x['Frequent'] / x['Candidates'], axis=1)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.figure.set_size_inches(12, 8);
ax.scatter(x='k', y='Candidates', data=count_meta, label='Candidates',
           c='r', marker='s', s=100, linewidths=1);
ax.scatter(x='k', y='Frequent', data=count_meta, label='Frequent',
           c='b', marker='o', s=50, linewidths=1);
plt.legend(loc='best', fontsize='large')
plt.title("Total Counts for Candidates, Frequent Patterns Across k-Values", size=16);
plt.xlabel('Length of Pattern', size=14);
plt.ylabel('Counts', size=14);
plt.xlim([0.5,15.5]);
plt.show()
g = plt.scatter(x='k', y='Frac', data=count_meta, s=40, linewidths=1);
g.figure.set_size_inches(12, 8);
plt.title("Fraction of Candidates Found to be Frequent", size=16);
plt.xlabel('Length of Pattern', size=14);
plt.ylabel('Frequent / Candidates', size=14);
plt.xlim([0.5,15.5]);
plt.ylim([0,1.1]);
r = scipy.stats.linregress(count_meta.loc[2:6, 'k'],
                           count_meta.loc[2:6, 'Frac'])
r2 = r.rvalue ** 2
An $R^2$ of 0.999 is pretty crazy. This is essentially linear.
Simple Associations with Frequent-(14, 15) Patterns¶
In lieu of more advanced analysis (which will come in later articles), we now turn to observing some basic associations between Frequent-(14, 15) patterns.
# Convert patterns to dataframe
freq_patterns = pandas.DataFrame()
for i, section in frequent_all.items():
    if i in [14, 15]:
        freq_patterns = pandas.concat([freq_patterns,
                                       pandas.DataFrame(data = {'Pattern' : [p[0] for p in section],
                                                                'TransIDs' : [p[1] for p in section],
                                                                'k' : [i for _ in range(len(section))]})
                                      ])
print(freq_patterns.shape)
freq_patterns.head()
freq_1 = pandas.DataFrame(data = {'Pattern' : [p[0] for p in frequent_all[1]],
                                  'TransIDs' : [p[1] for p in frequent_all[1]],
                                  'k' : 1})
freq_1['Support'] = freq_1.apply(lambda x: len(x['TransIDs']), axis=1)
freq_1.sort_values(by='Support', ascending=False, inplace=True)
freq_1.index = range(1, freq_1.shape[0]+1)
freq_1.Support.describe([.25, .50, .75, .80, .85, .90, .95])
list(freq_patterns[freq_patterns.k == 15].Pattern)
freq_15_singletons = set()
_ = freq_patterns[freq_patterns.k == 15].apply(lambda x: freq_15_singletons.update(set(x['Pattern'].split('-'))), axis=1)
print('For Frequent-15 itemsets, {0} unique items make up all itemsets'.format(len(freq_15_singletons)))
freq_15_trans = set()
_ = freq_patterns[freq_patterns.k == 15].apply(lambda x: freq_15_trans.update(set(x['TransIDs'])), axis=1)
freq_15_trans_total = []
_ = freq_patterns[freq_patterns.k == 15].apply(lambda x: freq_15_trans_total.extend((x['TransIDs'])), axis=1)
print('A total of {0} transactions are recorded for Frequent-15 patterns'.format(len(freq_15_trans_total)))
print('{0} of these transactions are unique'.format(len(freq_15_trans)))
freq_1[freq_1.apply(lambda x: x['Pattern'] in freq_15_singletons, axis=1)]
The Frequent-15 patterns are highly similar in terms of which items they include; just 24 items make up all 5 Frequent-15 patterns. Surprisingly, not all of these items are the most frequent items in the dataset. While 8 of the top 10 items are included (the index indicates the item's rank, starting at 1), the included items range far beyond the top 24 items. All have supports in the top 20 percent, however, so they are among the most frequent items overall.
Perhaps even more surprisingly, only 49 items make up the 104 Frequent-14 itemsets! All of the Top 10 most frequent items and 16 of the Top 20 items appear in these itemsets. Again, the items range far beyond the top 50 items, but all of them have supports in the top 25 percent.
freq_14_singletons = set()
_ = freq_patterns[freq_patterns.k == 14].apply(lambda x: freq_14_singletons.update(set(x['Pattern'].split('-'))), axis=1)
print('For Frequent-14 itemsets, {0} unique items make up all itemsets'.format(len(freq_14_singletons)))
freq_14_trans = set()
_ = freq_patterns[freq_patterns.k == 14].apply(lambda x: freq_14_trans.update(set(x['TransIDs'])), axis=1)
freq_14_trans_total = []
_ = freq_patterns[freq_patterns.k == 14].apply(lambda x: freq_14_trans_total.extend((x['TransIDs'])), axis=1)
print('A total of {0} transactions are recorded for Frequent-14 patterns'.format(len(freq_14_trans_total)))
print('{0} of these transactions are unique'.format(len(freq_14_trans)))
freq_1[freq_1.apply(lambda x: x['Pattern'] in freq_14_singletons, axis=1)]
Both of these observations likely indicate some interesting associations between applications in the dataset. One common aspect of pattern mining that we have stumbled upon is the idea of redundant patterns at different levels. A large number of the Frequent-14 patterns are implied by the Frequent-15 patterns; this is a "backward" application of the Apriori Principle. While the supports (and therefore transaction sets) are not identical, much common information exists. This idea applies all the way down to Frequent-1 items.
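As a rough illustration of this redundancy, we can count how many of the Frequent-14 patterns are subsets of at least one Frequent-15 pattern; by the Apriori Principle, every length-14 subset of a Frequent-15 itemset must itself be frequent.
# How many Frequent-14 patterns are length-14 subsets of some Frequent-15 pattern?
patterns_15 = [set(p.split('-')) for p in freq_patterns[freq_patterns.k == 15].Pattern]
patterns_14 = [set(p.split('-')) for p in freq_patterns[freq_patterns.k == 14].Pattern]
implied = [p for p in patterns_14 if any(p.issubset(p15) for p15 in patterns_15)]
print('{0} of {1} Frequent-14 patterns are subsets of a Frequent-15 pattern'\
      .format(len(implied), len(patterns_14)))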
Additionally, the similarity of the Frequent-14 and Frequent-15 patterns to one another, both within and across the two levels, indicates the presence of larger "soft" patterns -- essentially, groups of frequent itemsets that are highly similar to one another, but whose "super-parent" itemset is not itself frequent in the dataset. Grouping similar patterns and finding and/or creating a representative pattern for each group is known as Pattern Compression.
Methods exist for mining Compressed Patterns. The primary goal is to find a subset of the overall patterns that efficiently represents (covers) the entire set of patterns. Such methods are inherently lossy (unless the dataset is very special!), so one must decide which trade-offs one is willing to accept when finding such meta-patterns.
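One common ingredient in such methods is a pattern distance defined directly on the transaction sets we recorded, e.g. $D(P_1, P_2) = 1 - |T(P_1) \cap T(P_2)| / |T(P_1) \cup T(P_2)|$, where $T(P)$ is the set of transactions containing pattern $P$. A minimal sketch using our TID lists:
# Pattern "distance" from recorded transaction sets:
# 0 -> the patterns appear in exactly the same transactions; 1 -> they never co-occur
def pattern_distance(tids_1, tids_2):
    t1, t2 = set(tids_1), set(tids_2)
    return 1.0 - len(t1 & t2) / len(t1 | t2)

# Example usage: distance between two of the Frequent-15 patterns
f15 = freq_patterns[freq_patterns.k == 15]
print(pattern_distance(f15.iloc[0]['TransIDs'], f15.iloc[1]['TransIDs']))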
Not only do items often repeat, but certain transactions also occur often in Frequent-(14, 15) patterns. While Frequent-15 pattern transactions do not show significant overlap, Frequent-14 transactions show a significant amount of overlap. Much of this is likely implied and / or accounted for by the Frequent-15 itemsets and their respective transactions. However, many instances of overlap may be indicative of pattern associations. That is, the presence of one pattern may be highly indicative of the presence of another pattern. Think (bread, milk) implies (eggs).
Many of these pattern-pattern associations are already captured in the observed patterns (consider, again, the Apriori Principle). Oftentimes, though, informative associations are lurking in the dataset, waiting to be discovered. Measures like Confidence, Lift, Cosine similarity, the Kulczynski measure, and many others can aid us in finding these relationships between patterns and lead us to insights about the overall data.
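Since the TID lists give us every support we need, these measures reduce to a bit of arithmetic. A brief sketch using their standard definitions, with N the total number of transactions:
import math

# Association measures for two patterns, computed from their TID lists (standard definitions)
def association_measures(tids_a, tids_b, N):
    a, b = set(tids_a), set(tids_b)
    sup_a, sup_b, sup_ab = len(a), len(b), len(a & b)
    conf_ab = sup_ab / sup_a                    # Confidence(A -> B)
    conf_ba = sup_ab / sup_b                    # Confidence(B -> A)
    return {'confidence_A_B' : conf_ab,
            'confidence_B_A' : conf_ba,
            'lift'           : (sup_ab * N) / (sup_a * sup_b),
            'cosine'         : sup_ab / math.sqrt(sup_a * sup_b),
            'kulczynski'     : 0.5 * (conf_ab + conf_ba)}

# Example usage with two of the Frequent-1 items
association_measures(freq_1.iloc[0]['TransIDs'], freq_1.iloc[1]['TransIDs'], transactions.shape[0])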
These topics, as well as others, will be covered in future articles.
Summary¶
In this article, we combined the various parts of the Apriori Algorithm developed in Article 1 to create a fully functioning implementation. We then mined all frequent patterns from the subset of the Talking Data dataset created in part one. Forgoing in-depth analyses for now, we briefly investigated relationships between the Frequent-(14, 15) patterns and queued up a variety of topics for future articles.