Frequent Pattern Mining - Apriori Pt. 02
Posted by TRII in frequent-pattern-mining-course
Introduction / Overview¶
This post is the second in the series related to the Pattern Discovery in Data Mining course. In the first article, we looked into the Apriori Principle (Algorithm) and how it can be used to find frequent patterns in a dataset.
In this article, the various pieces from the first article are put together to form a complete implementation of the Apriori Algorithm.
After finding all frequent patterns in the dataset with a support of at least 5, we perform some basic analyses. More in-depth analyses are left for future articles.
import os
import sys
import time
import pickle
import itertools
from collections import defaultdict
import numpy
import scipy.stats
import pandas
from nltk import FreqDist
import ipyparallel as ipp
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
rnd = numpy.random.RandomState(8675309)
Goals for Current Dataset¶
To summarize, the goal of this post is to put all of the pieces constructed in part one together, and then process the dataset using the algorithm in order to extract all Frequent Patterns.
As before, the Talking Data Kaggle dataset will be utilized. The set of active applications associated with a user serves as the itemset (transaction) for that user. The Apriori Algorithm finds all frequent itemsets from these transactions.
Load Data¶
The data is loaded from the previous article's output, as some preprocessing was performed in order to convert the original dataset into the desired transaction format. As before, only a subset of the entire dataset is being analyzed here (7,500 out of about 25,000 records). This is to reduce the overall computational time necessary.
with open('transaction_list_subset.pkl', 'rb') as f:
    transactions = pickle.load(f)

with open('application_dictionary.pkl', 'rb') as f:
    apps_hash = pickle.load(f)
Implementation: Part II¶
Review¶
To review from the previous article, the steps in the iterative process are:
- Generate new candidates for Frequent-k patterns from Frequent-(k-1) patterns using the Apriori property
- Find the supports for each of the Frequent-k candidate patterns by iterating over the transaction dataset
- Determine which of the candidates have supports that meet the min_sup threshold; these are the Frequent-k itemsets
Steps 1-3 are repeated until no Frequent-k itemsets are found.
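For reference, the skeleton of this loop can be sketched in a few lines of plain, serial Python. This is a deliberately naive illustration (no Apriori pruning during candidate generation, no parallelism); the functions developed below are what we actually use.
# Naive, serial sketch of the iterative Apriori loop (illustration only)
def apriori_sketch(transactions, items, min_sup=5):
    frequent = {}
    candidates = [frozenset([item]) for item in items]   # Frequent-1 candidates
    k = 1
    while candidates:
        # Steps 1-2: count how many transactions contain each candidate
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        # Step 3: keep only the candidates meeting the min_sup threshold
        frequent[k] = [c for c, n in counts.items() if n >= min_sup]
        # Naive join: build Frequent-(k+1) candidates from the Frequent-k itemsets
        candidates = list({a | b for a in frequent[k] for b in frequent[k]
                           if len(a | b) == k + 1})
        k += 1
    return frequent
The real implementation below adds the Apriori prune when generating candidates and farms the support counting out to multiple engines.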
One other significant change from the previous article is that, while counting supports, we now record which transactions each itemset appears in. Knowing which transactions each pattern appears in will allow us to look at pattern co-occurrence, something we cannot do by observing supports alone. Most measures and metrics for sorting interesting patterns out of the huge number of frequent patterns found in the data also rely on some form of pattern co-occurrence. This is the primary reason for recording the transactions.
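To make this concrete, consider two hypothetical patterns and the (made-up) sets of transaction IDs they were found in; co-occurrence is then just a set intersection:
tids_a = set([0, 3, 5, 8, 13])        # hypothetical TID list for pattern A
tids_b = set([3, 5, 8, 21, 34])       # hypothetical TID list for pattern B
support_a = len(tids_a)               # support of A
support_b = len(tids_b)               # support of B
co_occurrence = len(tids_a & tids_b)  # transactions containing both A and B
print(support_a, support_b, co_occurrence)   # 5 5 3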
Find All Frequent Itemsets¶
Define Primary Work Functions¶
Some minor modifications have been made to the code below relative to what was used in the previous article. For starters, "section_loop_parallel" was modified to collect the set of transaction IDs in which each candidate itemset was found.
The "generateCandidates" function was also updated to generate desired results (ha), something that is always nice. Previously, we were not sorting the items post split. This resulted in missed matches in cases where matches actually occurred. Recall that we are utilizing the idea of consistent last and first items in order to speed up the matching process when looking for itemsets that complete larger itemsets.
Finally, we switched away from using the defaultdict class for storing results. (This is not technically a change from the previous code, as we had no need for defaultdict before, but we thought it was worth noting.) We seemed to encounter memory issues when relying on defaultdict after switching to the updated "section_loop_parallel" code that records transaction IDs and updating the "generateCandidates" code (which still has the defaultdict code present, but commented out). "Regular" Python dictionaries have built-in "defaultdict-like" functionality via the "setdefault" method (shown below). Our suspicion is that something associated with how defaultdict instances allocate space may be behind the memory issues.
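For anyone unfamiliar with "setdefault", the two approaches are interchangeable for our purposes; a quick toy comparison:
from collections import defaultdict

# defaultdict: missing keys are created with an empty list automatically
dd = defaultdict(list)
dd['pattern-a'].append(42)

# "regular" dict: setdefault inserts the default only when the key is missing
d = dict()
d.setdefault('pattern-a', []).append(42)

print(dd['pattern-a'], d['pattern-a'])   # [42] [42]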
def section_loop_parallel(section, patterns):

    # Record the transaction ID (row index) for every candidate pattern
    # found in each transaction
    def updateCounts(tid, transaction, patterns, curr_counts):
        new_items = checkMatches(transaction, patterns)
        for item in new_items:
            curr_counts.setdefault(item, []).append(tid)
        return(curr_counts)

    def patternInTransaction(transaction, pattern):
        return(pattern.intersection(transaction) == pattern)

    def checkMatches(transaction, patterns):
        new_items = [pattern for pattern in patterns \
                     if patternInTransaction(transaction, pattern)]
        new_items = ['-'.join([i for i in sorted(ni)]) for ni in new_items]
        return(new_items)

    counts = dict()

    _ = section.apply(lambda x: updateCounts(x.name,
                                             x['apps'],
                                             patterns,
                                             counts),
                      axis=1)

    return(counts)
def section_loop_parallel_v01(section, patterns):
    from nltk import FreqDist

    def updateCounts(transaction, patterns, curr_counts):
        new_items = checkMatches(transaction, patterns)
        curr_counts.update(new_items)
        return(curr_counts)

    def patternInTransaction(transaction, pattern):
        return(pattern.intersection(transaction) == pattern)

    def checkMatches(transaction, patterns):
        new_items = [pattern for pattern in patterns \
                     if patternInTransaction(transaction, pattern)]
        new_items = ['-'.join([i for i in sorted(ni)]) for ni in new_items]
        return(new_items)

    counts = FreqDist()

    _ = section.apply(lambda x: updateCounts(x['apps'],
                                             patterns,
                                             counts),
                      axis=1)

    return(counts)
def generateCandidates(frequent_k1_items, ii):

    if ii == 1:
        ncs = [set(combo) for combo in itertools.combinations(frequent_k1_items, 2)]
    else:
        # Map items in "frequent_k1_items" whose all-but-last items match
        # to the same list in the hash table
        #matchup_dict = defaultdict(list)
        #endwith_dict = defaultdict(set)
        matchup_dict = dict()
        endwith_dict = dict()

        for item in frequent_k1_items:
            n = item.split('-')
            key = '-'.join(n[:-1])

            #matchup_dict[key].append(n[-1])
            #endwith_dict[n[-1]].update([item])

            if key not in matchup_dict.keys():
                matchup_dict[key] = []
            #matchup_dict.setdefault(key, []).append(n[-1])
            matchup_dict[key].append(n[-1])

            if n[-1] not in endwith_dict.keys():
                endwith_dict[n[-1]] = set()
            #endwith_dict.setdefault(n[-1], set()).update([item])
            endwith_dict[n[-1]].update([item])

        # For each key in the hash table, construct new candidates
        # by iterating over all pairwise combinations of the corresponding
        # values. Merge each pairwise combo with the key to create
        # a new candidate. Check if the pattern consisting of the
        # last k-1 items in the candidate itemset constitutes a
        # frequent pattern in the k-1 Frequent itemsets. If yes,
        # add it as a candidate pattern.
        ncs = []
        for k,v in matchup_dict.items():
            k = k.split('-')
            base = k if type(k)==list else [k]
            for tail in itertools.combinations(v, 2):
                tail = sorted(tail)
                last = tail[-1]
                cand = base.copy()
                cand.extend(list(tail))
                test = '-'.join(cand[1:])
                if test in endwith_dict[last]:
                    ncs.append(set(cand))

    return(ncs)
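As a quick sanity check on the pruning logic, consider a tiny hand-built input. From the Frequent-2 itemsets "A-B", "A-C", and "B-C", the only surviving Frequent-3 candidate is {A, B, C}; drop "B-C" and the candidate is pruned, because one of its length-2 subsets is no longer frequent.
# Toy illustration of the candidate generation + Apriori prune above
print(generateCandidates(['A-B', 'A-C', 'B-C'], 2))   # [{'A', 'B', 'C'}] (set ordering may vary)
print(generateCandidates(['A-B', 'A-C'], 2))          # []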
Define Wrapper Functions¶
def parallelLoop(transactions, candidates, loop_func, view, n_sections, version):

    async_results = []
    len_section = transactions.shape[0] // n_sections
    start_cuts = [0 + len_section*i for i in range(n_sections)]

    # Split the transactions into sections and dispatch each to an engine
    for i, ind in enumerate(start_cuts):
        if i==(n_sections-1):
            section = transactions[ind:]
        else:
            section = transactions[ind:start_cuts[i+1]]

        ar = view.apply_async(loop_func, section, candidates)
        async_results.append(ar)

    # Wait until all tasks are done.
    for ar in async_results:
        ar.wait()
    results = [ar.get() for ar in async_results]

    # Merge the per-section results into a single counts structure
    if version == 1:
        counts = FreqDist()
        for result in results:
            counts.update(result.copy())
            result = []
            del result

    elif version == 2:
        counts = dict()
        for result in results:
            for k,v in result.items():
                counts.setdefault(k, []).extend(v.copy())
            result = []
            del result

    del results
    results = []

    return(counts)
def mainLoop(transactions, items_hash, min_sup=5, n_sections=8, version=2):

    rc = ipp.Client()
    view = rc.load_balanced_view()

    start = time.time()

    if version == 1:
        loop_func = section_loop_parallel_v01
    elif version == 2:
        loop_func = section_loop_parallel

    frequent = dict()
    candidates = [set([a]) for a in sorted(items_hash.values())]
    i = 1

    while candidates:
        # Status
        tic = time.time()
        print('Counting supports for Frequent-{0} items; {1} candidates'.format(i, len(candidates)))

        # Count supports for candidates
        counts = parallelLoop(transactions, candidates, loop_func, view, n_sections, version)

        # Update
        toc = time.time()
        print('Counted supports for Frequent-{0} items; {1:0.2f} min elapsed'.format(i, (toc-tic)/60))

        # Update frequent items dictionary
        if version == 1:
            frequent[i] = [(k,v) for (k,v) in counts.items() if v >= min_sup]
        elif version == 2:
            frequent[i] = [(k,v) for (k,v) in counts.items() if len(v) >= min_sup]
        print('{0} Frequent-{1} itemsets identified.'.format(len(frequent[i]), i))

        if frequent[i]:
            print('Generating candidates for Frequent-{0} items'.format(i+1))
            # Generate new candidates for k+1
            f_items = [k for k,v in frequent[i]]
            candidates = generateCandidates(f_items, i)
            # Update 'k'
            i += 1
            print()
        else:
            candidates = []

        # Clear engine namespaces in between iterations
        rc.clear(targets=rc.ids);
        rc.purge_everything();

    # Clear engine namespaces one last time
    rc.clear(targets=rc.ids);
    rc.purge_everything();

    stop = time.time()
    total_items = sum(len(f) for f in frequent.values())
    print('Process complete. {0:0.2f} minutes elapsed, total. {1} Frequent Patterns identified'\
          .format((stop-start)/60, total_items))
    print()

    return(frequent)
Run Process¶
Here the full process is executed. Note that about an hour and a half is required to run the full pattern extraction on the subset of the data we are looking at. The bulk of the processing time is associated with finding Frequent-(4, 5, 6, 7, 8) itemsets: one-third of the steps take up two-thirds of the time.
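Note that "mainLoop" assumes an ipyparallel cluster is already running (e.g., started from a terminal with "ipcluster start -n 8"). A quick check of the available engines before kicking off the long run:
# Sanity check that the ipyparallel engines are up
rc = ipp.Client()
print('{0} engines available'.format(len(rc.ids)))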
%%time
frequent_all = mainLoop(transactions, apps_hash, version=2)
with open('frequent_itemsets_dict_fill.pkl', 'wb') as f:
    pickle.dump(frequent_all, f)
Analysis¶
Candidates vs. Frequent¶
We would first like to note a couple of interesting trends observed in the data, which are visually represented in the plots below.
First, we see two fairly analogous "bell-shaped" curves. (Note: the distributions are not, and cannot be, Normal, as the data are (a) discrete counts with (b) a hard lower bound of 0, and (c) the plots are not PDFs, and interpreting them as PMFs would be dubious.)
The number of Frequent Patterns identified quickly climbs to a peak at around Frequent-6 patterns, and just as quickly returns to a minimum. The number of candidates -- other than candidates for Frequent-2 patterns -- follows a similar pattern, though it significantly leads the found frequent patterns until the peak, after which the two curves merge. With the full dataset, the peaks of these curves will likely shift to the right and higher-level patterns will be identified.
As the length of the patterns increases, the ratio of Frequent Patterns found to initial Candidate Patterns climbs toward unity. Perhaps even more surprisingly, the trend is almost linear from the Frequent-2 through Frequent-6 steps.
cand_counts = [2761, 1081185, 777612, 1108486, 1297796, 1302412, 1122931,
               790993, 429534, 172595, 49691, 9875, 1287, 104, 5]

for i in frequent_all.keys():
    tot_items = len(frequent_all[i])
    if tot_items > 999999:
        spacer = '\t'
    else:
        spacer = '\t\t'
    print('{0} {2}Frequent-{1} items found'.format(tot_items, i, spacer))

count_meta = pandas.DataFrame({'k' : range(1,16),
                               'Candidates' : cand_counts,
                               'Frequent' : [len(frequent_all[i]) for i in frequent_all.keys()]})
count_meta.index = count_meta.k
count_meta['Frac'] = count_meta.apply(lambda x: x['Frequent'] / x['Candidates'], axis=1)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.figure.set_size_inches(12, 8);
ax.scatter(x='k', y='Candidates', data=count_meta, label='Candidates',
           c='r', marker='s', s=100, linewidths=1);
ax.scatter(x='k', y='Frequent', data=count_meta, label='Frequent',
           c='b', marker='o', s=50, linewidths=1);
plt.legend(loc='best', fontsize='large')
plt.title("Total Counts for Candidates, Frequent Patterns Across k-Values", size=16);
plt.xlabel('Length of Pattern', size=14);
plt.ylabel('Counts', size=14);
plt.xlim([0.5,15.5]);
plt.show()
g = plt.scatter(x='k', y='Frac', data=count_meta, s=40, linewidths=1);
g.figure.set_size_inches(12, 8);
plt.title("Fraction of Candidates Found to be Frequent", size=16);
plt.xlabel('Length of Pattern', size=14);
plt.ylabel('Frequent / Candidates', size=14);
plt.xlim([0.5,15.5]);
plt.ylim([0,1.1]);
r = scipy.stats.linregress(count_meta.loc[2:6, 'k'],
                           count_meta.loc[2:6, 'Frac'])
r2 = r.rvalue ** 2
An $R^2$ of 0.999 is pretty crazy. This is essentially linear.
Simple Associations with Frequent-(14, 15) Patterns¶
In lieu of more advanced analysis (which will come in later articles), we now turn to observing some basic associations between Frequent-(14, 15) patterns.
# Convert patterns to dataframe
freq_patterns = pandas.DataFrame()
for i, section in frequent_all.items():
    if i in [14, 15]:
        freq_patterns = pandas.concat([freq_patterns,
                                       pandas.DataFrame(data = {'Pattern' : [p[0] for p in section],
                                                                'TransIDs' : [p[1] for p in section],
                                                                'k' : [i for _ in range(len(section))]})
                                      ])
print(freq_patterns.shape)
freq_patterns.head()
freq_1 = pandas.DataFrame(data = {'Pattern' : [p[0] for p in frequent_all[1]],
                                  'TransIDs' : [p[1] for p in frequent_all[1]],
                                  'k' : 1})
freq_1['Support'] = freq_1.apply(lambda x: len(x['TransIDs']), axis=1)
freq_1.sort_values(by='Support', ascending=False, inplace=True)
freq_1.index = range(1, freq_1.shape[0]+1)
freq_1.Support.describe([.25, .50, .75, .80, .85, .90, .95])
list(freq_patterns[freq_patterns.k == 15].Pattern)
freq_15_singletons = set()
_ = freq_patterns[freq_patterns.k == 15].apply(lambda x: freq_15_singletons.update(set(x['Pattern'].split('-'))), axis=1)
print('For Frequent-15 itemsets, {0} unique items make up all itemsets'.format(len(freq_15_singletons)))
freq_15_trans = set()
_ = freq_patterns[freq_patterns.k == 15].apply(lambda x: freq_15_trans.update(set(x['TransIDs'])), axis=1)
freq_15_trans_total = []
_ = freq_patterns[freq_patterns.k == 15].apply(lambda x: freq_15_trans_total.extend((x['TransIDs'])), axis=1)
print('A total of {0} transactions are recorded for Frequent-15 patterns'.format(len(freq_15_trans_total)))
print('{0} of these transactions are unique'.format(len(freq_15_trans)))
freq_1[freq_1.apply(lambda x: x['Pattern'] in freq_15_singletons, axis=1)]
The Frequent-15 patterns are highly similar in terms of which items they include; just 24 items make up all 5 Frequent-15 patterns. Surprisingly, not all of these items are the most frequent items in the dataset. While 8 of the top 10 items are included (the index indicates the item's rank, starting at 1), the included items range far beyond the top 24 items. All have supports in the top 20 percent, however, so they are among the most frequent items overall.
Perhaps even more surprisingly, only 49 items make up the 104 Frequent-14 itemsets! All of the Top 10 most frequent items and 16 of the Top 20 items appear in these itemsets. Again, the items range far beyond the top 50 items, but all of them have supports in the top 25 percent.
freq_14_singletons = set()
_ = freq_patterns[freq_patterns.k == 14].apply(lambda x: freq_14_singletons.update(set(x['Pattern'].split('-'))), axis=1)
print('For Frequent-14 itemsets, {0} unique items make up all itemsets'.format(len(freq_14_singletons)))
freq_14_trans = set()
_ = freq_patterns[freq_patterns.k == 14].apply(lambda x: freq_14_trans.update(set(x['TransIDs'])), axis=1)
freq_14_trans_total = []
_ = freq_patterns[freq_patterns.k == 14].apply(lambda x: freq_14_trans_total.extend((x['TransIDs'])), axis=1)
print('A total of {0} transactions are recorded for Frequent-14 patterns'.format(len(freq_14_trans_total)))
print('{0} of these transactions are unique'.format(len(freq_14_trans)))
freq_1[freq_1.apply(lambda x: x['Pattern'] in freq_14_singletons, axis=1)]
Both of these observations likely indicate some interesting associations between applications in the dataset. One common aspect of pattern mining that we have stumbled upon is the idea of redundant patterns at different levels. A large number of the Frequent-14 patterns are implied by the Frequent-15 patterns; this is a "backward" application of the Apriori Principle. While the supports (and therefore transaction sets) are not identical, much common information exists. This idea applies all the way down to Frequent-1 items.
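As a rough illustration of this redundancy, we can count how many of the Frequent-14 patterns are subsets of at least one Frequent-15 pattern; by the Apriori Principle, every length-14 subset of a Frequent-15 itemset must itself be frequent.
# How many Frequent-14 patterns are length-14 subsets of some Frequent-15 pattern?
patterns_15 = [set(p.split('-')) for p in freq_patterns[freq_patterns.k == 15].Pattern]
patterns_14 = [set(p.split('-')) for p in freq_patterns[freq_patterns.k == 14].Pattern]
implied = [p for p in patterns_14 if any(p.issubset(p15) for p15 in patterns_15)]
print('{0} of {1} Frequent-14 patterns are subsets of a Frequent-15 pattern'\
      .format(len(implied), len(patterns_14)))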
Additionally, the similarity of the Frequent-14 and Frequent-15 patterns to one another, both within and across the two levels, indicates the presence of larger "soft" patterns -- essentially, groups of frequent itemsets that are highly similar to one another, but whose "super-parent" itemset is not itself frequent in the dataset. Grouping similar patterns and finding and/or creating a representative pattern for each group is known as Pattern Compression.
Methods exist for mining Compressed Patterns. The primary goal is to find a subset of the overall patterns that efficiently represents (covers) the entire set of patterns. Such methods are inherently lossy (unless the dataset is very special!), so one must decide which trade-offs one is willing to accept when finding such meta-patterns.
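One common ingredient in such methods is a pattern distance defined directly on the transaction sets we recorded, e.g. $D(P_1, P_2) = 1 - |T(P_1) \cap T(P_2)| / |T(P_1) \cup T(P_2)|$, where $T(P)$ is the set of transactions containing pattern $P$. A minimal sketch using our TID lists:
# Pattern "distance" from recorded transaction sets:
# 0 -> the patterns appear in exactly the same transactions; 1 -> they never co-occur
def pattern_distance(tids_1, tids_2):
    t1, t2 = set(tids_1), set(tids_2)
    return 1.0 - len(t1 & t2) / len(t1 | t2)

# Example usage: distance between two of the Frequent-15 patterns
f15 = freq_patterns[freq_patterns.k == 15]
print(pattern_distance(f15.iloc[0]['TransIDs'], f15.iloc[1]['TransIDs']))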
Not only do items often repeat, but certain transactions also occur often in Frequent-(14, 15) patterns. While Frequent-15 pattern transactions do not show significant overlap, Frequent-14 transactions show a significant amount of overlap. Much of this is likely implied and / or accounted for by the Frequent-15 itemsets and their respective transactions. However, many instances of overlap may be indicative of pattern associations. That is, the presence of one pattern may be highly indicative of the presence of another pattern. Think (bread, milk) implies (eggs).
Many of these pattern-pattern associations are already captured in the observed patterns (consider, again, the Apriori Principle). Oftentimes, though, informative associations are lurking in the dataset, waiting to be discovered. Measures like Confidence, Lift, Cosine similarity, the Kulczynski measure, and many others can aid us in finding these relationships between patterns and lead us to insights about the overall data.
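Since the TID lists give us every support we need, these measures reduce to a bit of arithmetic. A brief sketch using their standard definitions, with N the total number of transactions:
import math

# Association measures for two patterns, computed from their TID lists (standard definitions)
def association_measures(tids_a, tids_b, N):
    a, b = set(tids_a), set(tids_b)
    sup_a, sup_b, sup_ab = len(a), len(b), len(a & b)
    conf_ab = sup_ab / sup_a                    # Confidence(A -> B)
    conf_ba = sup_ab / sup_b                    # Confidence(B -> A)
    return {'confidence_A_B' : conf_ab,
            'confidence_B_A' : conf_ba,
            'lift'           : (sup_ab * N) / (sup_a * sup_b),
            'cosine'         : sup_ab / math.sqrt(sup_a * sup_b),
            'kulczynski'     : 0.5 * (conf_ab + conf_ba)}

# Example usage with two of the Frequent-1 items
association_measures(freq_1.iloc[0]['TransIDs'], freq_1.iloc[1]['TransIDs'], transactions.shape[0])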
These topics, as well as others, will be covered in future articles.
Summary¶
In this article, we combined the various parts of the Apriori Algorithm developed in Article 1 to create a fully functioning implementation. We then mined all frequent patterns from the subset of the Talking Data dataset created in part one. Forgoing in-depth analyses for now, we briefly investigated relationships between the Frequent-(14, 15) patterns and queued up a variety of topics for future articles.