immersinn-ds

Mon 10 October 2016

Improved VSM Instantiation - Okapi BM25

Posted by TRII in text-retrieval-and-search-engines   

Introduction / Overview

This is the second notebook related to work from the Coursera course, Text Retrieval and Search Engines. (See here for the first.) For the various topics covered in the course, the goal is to implement some of the methods and tools in order to gain some hands-on experience.

In this article, we will expand upon the "Simplest VSM Instantiation" and make several improvements that will (hopefully) produce better results than before. Our updates relate to weighting the terms used for scoring documents, or, equivalently, modifying how the documents are embedded into the vector space. We utilize the Okapi BM25 Ranking Function methodology for calculating the document embeddings and computing the query - document similarity scores. Relative to our previous model, we make four changes to how similarity scores are determined:

  • Use total word frequency, not just binary BOW
  • Inverse Document Frequency (IDF) weighting
  • Term Frequency (TF) transformation
  • Document Length Normalization

Our document and word preprocessing method will remain the same for consistency. That is:

  • No stop word removal
  • Tokens / features are words, not phrases
  • All words to lowercase
  • Remove all punctuation
  • Remove all numbers
  • No stemming
  • No mapping multiple related words onto a single feature / word (e.g., no LSI-like dimensionality reduction)

We do not revisit document preprocessing below, but instead load the preprocessed documents from the previous article.

In [1]:
import os
import sys
import pickle
In [2]:
import numpy
import scipy
import scipy.spatial   # `import scipy` alone does not load the spatial subpackage used below
import pandas
import nltk
In [3]:
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
In [4]:
# Load previously pre-processed documents
with open('data/preprocessedArticleContentDump.pkl', 'rb') as f:
    articles = pickle.load(f)
In [5]:
articles.head(3)
Out[5]:
URL category content source title tokenized bow tot_len unique_words title_bow doc_id
0 http://phys.org/news/2016-04-activists-appeal-... earth-news Environmental groups on Tuesday lodged a compl... PhysOrg Activists appeal to EU over Polish logging of ... [environmental, groups, on, tuesday, lodged, a... {steps, szafraniuk, species, hectares, wwf, br... 385 223 {to, over, forest, of, primeval, activists, eu... 0
1 http://phys.org/news/2016-04-seismic-ecuador.html earth-news A doctoral thesis developed at UPM analysed th... PhysOrg The seismic risk of Ecuador [a, doctoral, thesis, developed, at, upm, anal... {shown, earthquake, titled, results, april, ma... 269 152 {ecuador, of, the, seismic, risk} 1
2 http://phys.org/news/2016-04-japanese-ecuador-... earth-news A magnitude 6.5 earthquake struck Japan on Fri... PhysOrg Are the Japanese and Ecuador earthquakes related? [a, magnitude, earthquake, struck, japan, on, ... {earthquake, building, accumulated, aftershock... 865 382 {ecuador, japanese, the, related, are, and, ea... 2

Okapi BM25 Ranking Function Implementation Overview

Several components constitute the Okapi BM25 (or BM25 for short) Ranking Function. These are shown below, along with the combined ranking function.

First, there is the IDF function, which weights terms based on their overall prevalence in the collection. Terms that show up in a greater proportion of collection documents are weighted less. The base formula actually allows for negative weightings, specifically for those terms which appear in more than half of the documents. We cap the minimum value at '0', thus creating "silenced" words, or implied stop words.
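As a quick sketch of the cap's effect (the counts here are hypothetical: a collection of 100 documents, with a term appearing in 5 of them versus 60 of them), a term present in more than half the documents receives a negative raw IDF and is silenced:

def idf_capped(N, nq):
    """IDF with negative values capped at 0 (implied stop words)."""
    return max(0, numpy.log((N - nq + 0.5) / (nq + 0.5)))

idf_capped(100, 5)     # rare term: ~2.85, a large positive weight
idf_capped(100, 60)    # majority term: raw IDF is ~ -0.40, capped to 0.0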

The Term Frequency (TF) transformation reduces the weight of each subsequent occurrence of a word in a document after the first. A tunable parameter, $k_1$, controls how quickly repeated occurrences saturate; for our purposes, $k_1$ was set to 1.2, in accordance with typical implementations. With this value, the first occurrence of a term contributes a full 1 to the term value, while the second contributes only an additional 0.375. This prevents numerous occurrences of a single query term in a document from having a runaway effect on the total query - document score.
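A two-line check of those numbers, using the TF formula given below with $k_1 = 1.2$:

def tf_transform(f, k1=1.2):
    """Saturating TF transformation: f * (k1 + 1) / (f + k1)"""
    return f * (k1 + 1) / (f + k1)

tf_transform(1)                     # 1.0: the first occurrence contributes a full 1
tf_transform(2) - tf_transform(1)   # 0.375: the second occurrence adds only 0.375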

Pivot Document Length (PDL) Normalization (slightly) penalizes longer documents, while aiding shorter documents and leaving average length documents unaffected. Another tunable parameter, $b$, is present; here we set the value of $b$ to 0.75. Note that PDL is a linear function of document length, with shorter documents having smaller PDL values.

How does this aid short documents? In Okapi BM25, the PDL Normalization term multiplies the $k_1$ value in the denominator of the TF Transformation. For documents shorter than average, PDL is less than 1, which shrinks the effective $k_1$ and thus boosts the term frequency weights across the board relative to longer documents.
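To make this concrete, here is a small sketch (with $b = 0.75$ and $k_1 = 1.2$) of the effective $k_1 \cdot PDL$ value for documents at half, exactly, and twice the average length:

b, k1 = 0.75, 1.2

def pdl(doc_len, avgdl):
    """Pivot document length normalization: 1 - b + b * |D| / avgdl"""
    return 1 - b + b * doc_len / avgdl

for ratio in (0.5, 1.0, 2.0):        # document length as a multiple of avgdl
    print(ratio, k1 * pdl(ratio, 1.0))
# 0.5 -> effective k1 = 0.75 (short doc: TF weights boosted)
# 1.0 -> effective k1 = 1.2  (average doc: unchanged)
# 2.0 -> effective k1 = 2.1  (long doc: TF weights dampened)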

These three terms are combined to form the Okapi BM25 Ranking Function. For each query term that also appears in the document being scored, a score is computed from the frequency of the given term in the document and the document length. Scores across all query terms are summed together to form an overall score for the document. Documents in a collection are ranked by their BM25 scores. Higher scores indicate a higher degree of similarity with the query.

Inverse Document Frequency:
\begin{equation} IDF(q) = \max \left( 0, \log\frac{N - n(q) + 0.5}{n(q) + 0.5} \right) \end{equation}
TF Transformation:
\begin{equation} \frac{ f(q,D) \cdot (k_1 + 1) }{ f(q,D) + k_1 } \end{equation}
PDL Normalization:
\begin{equation} 1 - b + b \cdot \frac{|D|}{avgdl} \end{equation}
Okapi BM25 Score Function:
\begin{equation} score(D,Q) = \sum_{q \in Q} IDF(q) \cdot \frac{ f(q,D) \cdot (k_1 + 1) }{ f(q,D) + k_1 \cdot \left( 1 - b + b \cdot \frac{|D|}{avgdl} \right)} \end{equation}
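Putting the pieces together, a minimal, self-contained sketch of the score function might look like the following (the argument names are illustrative: doc_counts maps each term to $f(q,D)$, and doc_freqs maps each term to $n(q)$):

def bm25_score(query_terms, doc_counts, doc_len, doc_freqs, N, avgdl,
               k1=1.2, b=0.75):
    """Sum the per-term BM25 contributions over the query terms."""
    score = 0.0
    for q in query_terms:
        f = doc_counts.get(q, 0)
        if f == 0:
            continue                 # terms absent from the document contribute nothing
        nq = doc_freqs.get(q, 0)
        idf = max(0, numpy.log((N - nq + 0.5) / (nq + 0.5)))
        pdl = 1 - b + b * doc_len / avgdl
        score += idf * f * (k1 + 1) / (f + k1 * pdl)
    return score

Because our collection is fixed, the vectorized implementation below instead precomputes the full BM25 value for every document - term pair once, so that scoring a query reduces to a dot product.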

Construct Document & Query BOW Vectors

Document Vector Space Embeddings

This step is analogous to the creation of BOW vectors for documents in the first article, except that now, instead of recording only whether a word is present in a document, we use the Okapi BM25 transformation to embed the document in the vector space.

Fully transforming the documents into the vector space ahead of time is only feasible because we are working with a fixed collection: the IDF and PDL Normalization terms both depend on collection-level statistics. Without those statistics fixed, we could precompute nothing beyond the raw term - document frequency matrix.

Calculating the dictionary for all terms is the same as in the previous notebook.

In [6]:
# Create the dictionary from all terms in the contents, titles
dictionary = set()
_ = articles.apply(lambda x: dictionary.update(x['bow']), axis=1)
_ = articles.apply(lambda x: dictionary.update(x['title_bow']), axis=1)
dictionary = {w: i for (i, w) in enumerate(dictionary)}
In [7]:
# Document Word Frequencies
articles['word_freqs'] = articles.apply(lambda x: nltk.FreqDist(x['tokenized']), axis=1)

# Document frequencies: number of collection documents containing each word
col_word_freqs = nltk.FreqDist()
_ = articles.apply(lambda x: col_word_freqs.update(x['word_freqs'].keys()), axis=1)
In [8]:
# Functions for generating document embeddings

# Constants & params
k1 = 1.2
b = 0.75
N = articles.shape[0]
avgdl = articles.tot_len.mean()

# Calculate IDF values
idf = numpy.zeros((len(dictionary), ))
for w,i in dictionary.items():
    idf[i] = max(0,
                 numpy.log((N - col_word_freqs[w] + 0.5) / \
                       (col_word_freqs[w] + 0.5))
                 )


def pdlCalc(doc_len):
    """PDL Normalization Calculation"""
    return(1 - b + b * doc_len / avgdl)


def generateBM25BOWVec(x, doc_vecs):
    """Full BM25 calculation for a document"""
    doc_id = x['doc_id']
    doc_len = x['tot_len']
    d_pdl = pdlCalc(doc_len)
    for word, count in x['word_freqs'].items():
        w_ind = dictionary[word]
        w_idf = idf[w_ind]
        doc_vecs[doc_id, w_ind] = w_idf * (count * (k1 + 1)) / (count + k1 * d_pdl)

        
def generateBOWFreqVector(x, doc_vecs):
    """Simple frequency vector embedding for a document"""
    doc_id = x['doc_id']
    for word, count in x['word_freqs'].items():
        doc_vecs[doc_id, 
                 dictionary[word]] = count
In [9]:
# Generate Vector-Space Embeddings for Documents
# These embeddings will be pre-weighted; that is, 
# we calculate the BM25 value for each Document-Term pair
doc_vecs = numpy.zeros(shape=(articles.shape[0], len(dictionary)))
_ = articles.apply(lambda x: generateBM25BOWVec(x, doc_vecs), axis=1)

Query Vector Space Encodings

Queries are constructed from document titles in the same way as in the previous article.

In [10]:
# Document Query (Title) BOW vectors
query_vecs = numpy.zeros(shape=(articles.shape[0], len(dictionary)))
for di,bd in enumerate(articles['title_bow']):
    for w in bd:
        query_vecs[di, dictionary[w]] = 1

Calculating Query - Document Scores

As before, we use the dot product to compute the pairwise query - document similarity scores. In this case, though, the document vectors have been computed using the BM25 methodology.

Again, utilizing scipy's "spatial.distance" module makes calculating the pairwise similarities simple.

In [11]:
# Calculate all pairwise dot products between queries and documents
query_doc_sims = scipy.spatial.distance.cdist(query_vecs,
                                              doc_vecs,
                                              numpy.dot)
In [12]:
g = sns.distplot(query_doc_sims.flatten(), kde=False, color='purple');
g.figure.set_size_inches(12,8);
plt.title("Query - Document Similarity Scores Based on Title Query", size=14);
plt.xlabel("Query - Document Similarity Score", size=12);
plt.ylabel("Count", size=12);
plt.xlim([0,20]);
In [13]:
pandas.Series(query_doc_sims.flatten()).describe()
Out[13]:
count    1.205604e+06
mean     8.955459e-01
std      1.920135e+00
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      1.143815e+00
max      7.451137e+01
dtype: float64

"Implicit" Stop Words via IDF

Unlike in the previous example, the query - document scores are significantly more varied and contain many, many more zero scores. This is due to the implicit introduction of stop words into the calculations via the IDF function. To see this, let's first look at the distribution of IDF values, and then at some of the words on the lower end of the IDF value spectrum (i.e., values equal to 0).

In [14]:
g = sns.distplot(idf, kde=False, color='purple');
g.figure.set_size_inches(12,8);
plt.title("Inverse Document Frequency Weighting Distribution", size=14);
plt.xlabel("IDF", size=12);

At first, one may be struck by the proportion of words that do not have small IDF values. But recall that most words appear in very few documents (an extrapolation of Zipf's Law; as a fun exercise, see if you can find the Zipf's Law article in the PhysOrg collection), and therefore we should expect most words to have relatively large IDF values.
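One way to check this in our collection is to look directly at the document frequencies computed above; most dictionary words should appear in only a handful of documents:

df_counts = pandas.Series(list(col_word_freqs.values()))
df_counts.describe()       # heavily skewed: most words occur in very few documents
(df_counts == 1).mean()    # fraction of words appearing in exactly one document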

Note the slight "hump" at the 0 bin. Below, we compare these "implicit stop words" to the stop words defined in NLTK. Many of the NLTK stop words have IDFs of 0 in our data, while those that do not are reasonably excluded from our implicit list given the data set (many are partial contractions, pronouns, and prepositions).

Thus, unlike in the previous example, where all co-occurring words contributed something to the query - document similarity scores, here the most common words in the collection contribute nothing to the query - document scores. This results in the large quantity of zero-valued query - document scores in the dataset.

The last list is the fairly small set of implicit stop words unique to the PhysOrg dataset. Note that "university" and "research" make sense given the collection's context. Additionally, "said" is a common word in interview-style articles, and many PhysOrg articles involve some sort of quoted interview.

In [15]:
rev_dict = {v:k for (k,v) in dictionary.items()}
imp_stop_words = set([rev_dict[i] for i,v in enumerate(idf) if v==0.])
In [16]:
stops = set(nltk.corpus.stopwords.words('english'))
In [17]:
print('List of words shared by the NLTK stop words list and the "implicit" stop words:\n')
simi = stops.intersection(imp_stop_words)
print(simi)
List of words shared by the NLTK stop words list and the "implicit" stop words:

{'these', 'about', 'this', 'at', 'how', 'it', 'or', 'and', 'more', 'such', 'they', 'when', 'as', 'a', 'but', 'for', 'their', 'been', 'are', 'not', 'will', 'has', 'to', 'that', 'by', 's', 'with', 'an', 'all', 'in', 'was', 'we', 'its', 'have', 'on', 'the', 'into', 'of', 'be', 'which', 'other', 'is', 'can', 'than', 'from'}
In [18]:
print('List of words unique to the NLTK stop words list:\n')

diff = stops.difference(imp_stop_words)
print(diff)
List of words unique to the NLTK stop words list:

{'doesn', 'don', 've', 'until', 'her', 'few', 'then', 'up', 't', 'hers', 'won', 'again', 'same', 'whom', 'theirs', 'being', 'here', 'do', 'out', 'them', 'didn', 'because', 'some', 'after', 'your', 'me', 'am', 'were', 'mustn', 'what', 'now', 'shouldn', 'should', 'yours', 'having', 'he', 'isn', 'further', 'below', 'off', 'itself', 'against', 'down', 'wasn', 'wouldn', 'where', 'no', 'll', 'yourself', 'who', 'ourselves', 'through', 'you', 'if', 'under', 'his', 'nor', 'himself', 'doing', 'mightn', 'there', 'hadn', 'm', 'themselves', 'o', 'aren', 'did', 'before', 'y', 'hasn', 'ma', 'why', 'both', 'weren', 'very', 'most', 'so', 'during', 'myself', 'while', 'over', 'him', 'my', 'i', 'd', 'our', 'shan', 're', 'yourselves', 'herself', 'she', 'only', 'those', 'once', 'does', 'needn', 'had', 'ain', 'couldn', 'ours', 'haven', 'just', 'each', 'any', 'above', 'between', 'too', 'own'}
In [19]:
print('List of words unique to the "implicit" stop words list:\n')

diff = imp_stop_words.difference(stops)
print(diff)
List of words unique to the "implicit" stop words list:

{'could', 'new', 'also', 'said', 'research', 'university', 'one'}

Evaluation

We repeat the evaluation of our method as before. To summarize:

  • Analyze the Top 10 documents returned for each query by Document Category
  • Compute the F-Measure for each Category and across all Categories

Document Types Returned by Category

First, we determine the Top 10 highest scoring documents for each query. Then, we analyze those documents based on (1) whether or not the Top 10 documents contain the document associated with the query and (2) the number of documents from each category in the Top 10 documents.

The high-level summarization below shows an improvement across the board for (1) percentage of the time a query's respective document is returned first (up from 73.4%), (2) percentage of the time a query's respective document is in the Top 10 list (up from 90.3%), and (3) percentage of the time the highest-scoring document for a query is from the same category as the query's respective document (up from 86.7%).

Additionally, for the category heat-map, we see a stronger showing along the diagonal, indicating that a higher percentage of Top 10 documents for a query are from the same category as the query's associated document.

In [20]:
top10 = query_doc_sims.argsort(axis=1)[:,-10:]
top10 = top10[:, [i for i in reversed(range(10))]]
top10 = pandas.DataFrame(data=top10, columns=[i for i in range(1,11)])
top10['Doc'] = range(top10.shape[0])
top10['Category'] = top10.apply(lambda x: articles.category[x['Doc']], axis=1)
In [21]:
# Doc as first, Doc in top 10
doc_first = sum(top10[1]==top10.index) / top10.shape[0] * 100
doc_top10 = sum(top10.apply(lambda x: x['Doc'] in set(x[0:10]), axis=1)) / top10.shape[0] * 100
print("A query's respective document is returned first {0:0.1f}% of the time".format(doc_first))
print("A query's respective document is returned within the first 10 documents {0:0.1f}% of the time".format(doc_top10))

# Doc Category as first
cat_first = sum(top10.apply(lambda x: articles.category[x[1]]==articles.category[x['Doc']], axis=1)) / top10.shape[0] * 100
print("A query's respective document category is returned first {0:0.1f}% of the time".format(cat_first))
A query's respective document is returned first 88.5% of the time
A query's respective document is returned within the first 10 documents 96.9% of the time
A query's respective document category is returned first 96.1% of the time
In [22]:
def catLookup(x):
    return([articles.category[i] for i in x[1:11]])

def sameCat(x):
    cat = x['Category']
    cats = catLookup(x)
    return([c==cat for c in cats])

def sameCatCount(x):
    same_cats = sameCat(x)
    return(sum(same_cats))

top10['same_cat_count'] = top10.apply(lambda x: sameCatCount(x), axis=1)
In [23]:
# Count occurrences of each Major 8 type in the top 10 list for each Major 8 type
total_counts = []
inds = []
for ind,m8 in enumerate(pandas.unique(articles.category)):
    sub = numpy.array(articles.category==m8, dtype=bool)
    sub_tt = numpy.array(top10.loc[sub][[i for i in range(1,11)]]).flatten()
    sub_tt_categories = articles.category[sub_tt]
    sub_counts = nltk.FreqDist(sub_tt_categories)
    total_counts.append(sub_counts)
    inds.append(m8)
total_counts = pandas.DataFrame(total_counts)
total_counts.index = inds
total_counts = total_counts[inds]
In [24]:
# Frac of total documents in top 10 of each M8, by each M8
total_fracs = numpy.array(total_counts)
total_fracs = numpy.apply_along_axis(lambda x: x / sum(x), 1, total_fracs)
total_fracs = pandas.DataFrame(data = total_fracs, columns=inds, index=inds)
In [25]:
# Plot heatmap of the results
cmap = sns.cubehelix_palette(start=3, rot=0, dark=.12, light=.87, as_cmap=True)

g = sns.heatmap(total_fracs, annot=True, cmap=cmap);
g.figure.set_size_inches(10,8);

plt.xticks(size=12);
plt.yticks(size=12);

Evaluation with the F-Measure

We again calculate the F-Measure for a more objective comparison between the two methods. As the fractions along the diagonal in the heat-map above have increased, so have the respective counts in the table below. More importantly, the F-Measures for each category have also increased, all of them notably except for the "technology-news" category, whose small gain is expected given its very slight increase in the heat-map above. Thus, we also observe an overall increase in the F-Measure.
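For reference, the per-query F-Measure computed below treats the Top 10 list as the retrieved set: precision is the number of relevant (same-category) documents among the 10 retrieved, and recall is that same count divided by the total number of documents in the category:

\begin{equation} P = \frac{rel}{ret}, \qquad R = \frac{rel}{tot}, \qquad F = \frac{2 \cdot P \cdot R}{P + R} \end{equation}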

In [26]:
total_counts['total_docs'] = numpy.array([nltk.FreqDist(articles.category)[c] for c in total_counts.index])
In [27]:
total_counts
Out[27]:
earth-news science-news nanotech-news physics-news space-news technology-news biology-news chemistry-news total_docs
earth-news 384 51 24 19 75 115 114 28 81
science-news 93 516 34 43 51 385 177 51 135
nanotech-news 21 11 349 142 17 92 69 119 82
physics-news 32 30 180 565 51 163 52 97 117
space-news 51 30 14 43 460 79 34 19 73
technology-news 110 247 88 152 77 2450 163 123 341
biology-news 116 116 70 73 49 212 871 143 165
chemistry-news 33 38 145 108 29 142 182 363 104
In [28]:
def recall(rel, tot):
    return(rel / tot)

def precision(rel, ret):
    return(rel / ret)

def fMeasure(rel, ret, tot):
    r = recall(rel, tot)
    p = precision(rel, ret)
    fm = 2 * r * p / (r + p)
    return(fm)

def fFromX(x):
    cat = x['Category']
    tot = total_counts['total_docs'][cat]
    return(fMeasure(x['same_cat_count'], 10, tot))
In [29]:
top10['FMeasure'] = top10.apply(lambda x: fFromX(x), axis=1)
In [30]:
top10.groupby(['Category']).agg({'FMeasure' : 'mean'})
Out[30]:
FMeasure
Category
biology-news 0.060675
chemistry-news 0.062247
earth-news 0.105277
nanotech-news 0.093584
physics-news 0.077260
science-news 0.053231
space-news 0.153161
technology-news 0.041039
In [31]:
print('Overall F-Measure: {0:0.3f}'.format(numpy.mean(top10.FMeasure)))
Overall F-Measure: 0.067

Wrap-Up

In this notebook, we improved upon the "Simplest Vector Space Instantiation" implemented in the previous notebook, specifically by updating how documents are embedded into the vector space used for calculating query - document similarity scores: namely, by using a variant of the Okapi BM25 method.

By implementing the Okapi BM25 Ranking Function, we achieved more promising results than with the binary BOW method. These improvements showed up as an increase in the overall fraction of Top 10 documents that share the category of the query's respective document. As additional evidence of improvement, increases in both the category-level and overall F-Measures were observed.