immersinn-ds

Mon 10 October 2016

Improved VSM Instantiation - Okapi BM25

Posted by TRII in text-retrieval-and-search-engines   

Introduction / Overview

This is the second notebook related to work from the Coursera course, Text Retrieval and Search Engines. (See here for the first.) For the various topics covered in the course, the goal is to implement some of the methods and tools in order to gain some hands-on experience.

In this article, we will expand upon the "Simplest VSM Instantiation" and make several improvements that will (hopefully) produce better results than before. Our updates relate to weighting the terms used for scoring documents, or, equivalently, modifying how the documents are embedded into the vector space. We utilize the Okapi BM25 Ranking Function methodology for calculating the document embeddings and computing the query - document similarity scores. Relative to our previous model, we make four changes to how similarity scores are determined:

  • Use total word frequency, not just binary BOW
  • Inverse Document Frequency (IDF) weighting
  • Term Frequency (TF) transformation
  • Document Length Normalization

Our document and word preprocessing method will remain the same for consistency. That is:

  • No stop word removal
  • Tokens / features are words, not phrases
  • All words to lowercase
  • Remove all punctuation
  • Remove all numbers
  • No stemming
  • No mapping multiple related words onto a single feature / word (e.g., no LSI-like dimensionality reduction)

We do not revisit document preprocessing below, but instead load the preprocessed documents from the previous article.

In [1]:
import os
import sys
import pickle
In [2]:
import numpy
import scipy
import scipy.spatial   # `import scipy` alone does not load the spatial subpackage used below
import pandas
import nltk
In [3]:
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
In [4]:
# Load previously pre-processed documents
with open('data/preprocessedArticleContentDump.pkl', 'rb') as f:
    articles = pickle.load(f)
In [5]:
articles.head(3)
Out[5]:
URL category content source title tokenized bow tot_len unique_words title_bow doc_id
0 http://phys.org/news/2016-04-activists-appeal-... earth-news Environmental groups on Tuesday lodged a compl... PhysOrg Activists appeal to EU over Polish logging of ... [environmental, groups, on, tuesday, lodged, a... {steps, szafraniuk, species, hectares, wwf, br... 385 223 {to, over, forest, of, primeval, activists, eu... 0
1 http://phys.org/news/2016-04-seismic-ecuador.html earth-news A doctoral thesis developed at UPM analysed th... PhysOrg The seismic risk of Ecuador [a, doctoral, thesis, developed, at, upm, anal... {shown, earthquake, titled, results, april, ma... 269 152 {ecuador, of, the, seismic, risk} 1
2 http://phys.org/news/2016-04-japanese-ecuador-... earth-news A magnitude 6.5 earthquake struck Japan on Fri... PhysOrg Are the Japanese and Ecuador earthquakes related? [a, magnitude, earthquake, struck, japan, on, ... {earthquake, building, accumulated, aftershock... 865 382 {ecuador, japanese, the, related, are, and, ea... 2

Okapi BM25 Ranking Function Implementation Overview

Several components constitute the Okapi BM25 (or BM25 for short) Ranking Function. These are shown below, along with the combined ranking function.

First, there is the IDF function, which weights terms based on their overall prevalence in the collection. Terms that show up in a greater proportion of collection documents are weighted less. The base formula actually allows for negative weightings, specifically for those terms which appear in more than half of the documents. We cap the minimum value at '0', thus creating "silenced" words, or implied stop words.
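As a quick sketch of the cap's effect (the counts here are hypothetical: a collection of 100 documents, with a term appearing in 5 of them versus 60 of them), a term present in more than half the documents receives a negative raw IDF and is silenced:

def idf_capped(N, nq):
    """IDF with negative values capped at 0 (implied stop words)."""
    return max(0, numpy.log((N - nq + 0.5) / (nq + 0.5)))

idf_capped(100, 5)     # rare term: ~2.85, a large positive weight
idf_capped(100, 60)    # majority term: raw IDF is ~ -0.40, capped to 0.0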

The Term Frequency (TF) transformation reduces the weight of each subsequent occurrence of a word in a document after the first. A tunable parameter, $k_1$, controls how quickly repeated occurrences saturate; for our purposes, $k_1$ was set to 1.2, in accordance with typical implementations. With this value, the first occurrence of a term contributes a full 1 to the term value, while the second contributes only an additional 0.375. This prevents numerous occurrences of a single query term in a document from having a runaway effect on the total query - document score.
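A two-line check of those numbers, using the TF formula given below with $k_1 = 1.2$:

def tf_transform(f, k1=1.2):
    """Saturating TF transformation: f * (k1 + 1) / (f + k1)"""
    return f * (k1 + 1) / (f + k1)

tf_transform(1)                     # 1.0: the first occurrence contributes a full 1
tf_transform(2) - tf_transform(1)   # 0.375: the second occurrence adds only 0.375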

Pivot Document Length (PDL) Normalization (slightly) penalizes longer documents, while aiding shorter documents and leaving average length documents unaffected. Another tunable parameter, $b$, is present; here we set the value of $b$ to 0.75. Note that PDL is a linear function of document length, with shorter documents having smaller PDL values.

How does this aid short documents? In Okapi BM25, the PDL Normalization term multiplies the $k_1$ value in the denominator of the TF Transformation. For documents shorter than average, PDL is less than 1, which shrinks the effective $k_1$ and thus boosts the term frequency weights across the board relative to longer documents.
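To make this concrete, here is a small sketch (with $b = 0.75$ and $k_1 = 1.2$) of the effective $k_1 \cdot PDL$ value for documents at half, exactly, and twice the average length:

b, k1 = 0.75, 1.2

def pdl(doc_len, avgdl):
    """Pivot document length normalization: 1 - b + b * |D| / avgdl"""
    return 1 - b + b * doc_len / avgdl

for ratio in (0.5, 1.0, 2.0):        # document length as a multiple of avgdl
    print(ratio, k1 * pdl(ratio, 1.0))
# 0.5 -> effective k1 = 0.75 (short doc: TF weights boosted)
# 1.0 -> effective k1 = 1.2  (average doc: unchanged)
# 2.0 -> effective k1 = 2.1  (long doc: TF weights dampened)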

These three terms are combined to form the Okapi BM25 Ranking Function. For each query term that also appears in the document being scored, a score is computed from the frequency of the given term in the document and the document length. Scores across all query terms are summed together to form an overall score for the document. Documents in a collection are ranked by their BM25 scores. Higher scores indicate a higher degree of similarity with the query.

Inverse Document Frequency:
\begin{equation} IDF(q) = \max \left( 0, \log\frac{N - n(q) + 0.5}{n(q) + 0.5} \right) \end{equation}
TF Transformation:
\begin{equation} \frac{ f(q,D) \cdot (k_1 + 1) }{ f(q,D) + k_1 } \end{equation}
PDL Normalization:
\begin{equation} 1 - b + b \cdot \frac{|D|}{avgdl} \end{equation}
Okapi BM25 Score Function:
\begin{equation} score(D,Q) = \sum_{q \in Q} IDF(q) \cdot \frac{ f(q,D) \cdot (k_1 + 1) }{ f(q,D) + k_1 \cdot \left( 1 - b + b \cdot \frac{|D|}{avgdl} \right)} \end{equation}
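Putting the pieces together, a minimal, self-contained sketch of the score function might look like the following (the argument names are illustrative: doc_counts maps each term to $f(q,D)$, and doc_freqs maps each term to $n(q)$):

def bm25_score(query_terms, doc_counts, doc_len, doc_freqs, N, avgdl,
               k1=1.2, b=0.75):
    """Sum the per-term BM25 contributions over the query terms."""
    score = 0.0
    for q in query_terms:
        f = doc_counts.get(q, 0)
        if f == 0:
            continue                 # terms absent from the document contribute nothing
        nq = doc_freqs.get(q, 0)
        idf = max(0, numpy.log((N - nq + 0.5) / (nq + 0.5)))
        pdl = 1 - b + b * doc_len / avgdl
        score += idf * f * (k1 + 1) / (f + k1 * pdl)
    return score

Because our collection is fixed, the vectorized implementation below instead precomputes the full BM25 value for every document - term pair once, so that scoring a query reduces to a dot product.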

Construct Document & Query BOW Vectors

Document Vector Space Embeddings

This step is analogous to the creation of BOW vectors for documents in the first article, except that now, instead of recording only whether a word is present in a document, we use the Okapi BM25 transformation to embed the document in the vector space.

Fully transforming the documents into the vector space ahead of time is only feasible because we are working with a fixed collection: the IDF and PDL Normalization terms both depend on collection-level statistics. Without those statistics fixed, we could precompute nothing beyond the raw term - document frequency matrix.

Calculating the dictionary for all terms is the same as in the previous notebook.

In [6]:
# Create the dictionary from all terms in the contents, titles
dictionary = set()
_ = articles.apply(lambda x: dictionary.update(x['bow']), axis=1)
_ = articles.apply(lambda x: dictionary.update(x['title_bow']), axis=1)
dictionary = {w: i for (i, w) in enumerate(dictionary)}
In [7]:
# Document Word Frequencies
articles['word_freqs'] = articles.apply(lambda x: nltk.FreqDist(x['tokenized']), axis=1)

# Document frequencies: number of collection documents containing each word
col_word_freqs = nltk.FreqDist()
_ = articles.apply(lambda x: col_word_freqs.update(x['word_freqs'].keys()), axis=1)
In [8]:
# Functions for generating document embeddings

# Constants & params
k1 = 1.2
b = 0.75
N = articles.shape[0]
avgdl = articles.tot_len.mean()

# Calculate IDF values
idf = numpy.zeros((len(dictionary), ))
for w,i in dictionary.items():
    idf[i] = max(0,
                 numpy.log((N - col_word_freqs[w] + 0.5) / \
                       (col_word_freqs[w] + 0.5))
                 )


def pdlCalc(doc_len):
    """PDL Normalization Calculation"""
    return(1 - b + b * doc_len / avgdl)


def generateBM25BOWVec(x, doc_vecs):
    """Full BM25 calculation for a document"""
    doc_id = x['doc_id']
    doc_len = x['tot_len']
    d_pdl = pdlCalc(doc_len)
    for word, count in x['word_freqs'].items():
        w_ind = dictionary[word]
        w_idf = idf[w_ind]
        doc_vecs[doc_id, w_ind] = w_idf * (count * (k1 + 1)) / (count + k1 * d_pdl)

        
def generateBOWFreqVector(x, doc_vecs):
    """Simple frequency vector embedding for a document"""
    doc_id = x['doc_id']
    for word, count in x['word_freqs'].items():
        doc_vecs[doc_id, 
                 dictionary[word]] = count
In [9]:
# Generate Vector-Space Embeddings for Documents
# These embeddings will be pre-weighted; that is, 
# we calculate the BM25 value for each Document-Term pair
doc_vecs = numpy.zeros(shape=(articles.shape[0], len(dictionary)))
_ = articles.apply(lambda x: generateBM25BOWVec(x, doc_vecs), axis=1)

Query Vector Space Encodings

Queries are constructed from document titles in the same way as in the previous article.

In [10]:
# Document Query (Title) BOW vectors
query_vecs = numpy.zeros(shape=(articles.shape[0], len(dictionary)))
for di,bd in enumerate(articles['title_bow']):
    for w in bd:
        query_vecs[di, dictionary[w]] = 1

Calculating Query - Document Scores

As before, we use the dot product to compute the pairwise query - document similarity scores. In this case, though, the document vectors have been computed using the BM25 methodology.

Again, utilizing scipy's "spatial.distance" module makes calculating the pairwise similarities simple.

In [11]:
# Calculate all pairwise dot products between queries and documents
query_doc_sims = scipy.spatial.distance.cdist(query_vecs,
                                              doc_vecs,
                                              numpy.dot)
In [12]:
g = sns.distplot(query_doc_sims.flatten(), kde=False, color='purple');
g.figure.set_size_inches(12,8);
plt.title("Query - Document Similarity Scores Based on Title Query", size=14);
plt.xlabel("Query - Document Similarity Score", size=12);
plt.ylabel("Count", size=12);
plt.xlim([0,20]);
In [13]:
pandas.Series(query_doc_sims.flatten()).describe()
Out[13]:
count    1.205604e+06
mean     8.955459e-01
std      1.920135e+00
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      1.143815e+00
max      7.451137e+01
dtype: float64

"Implicit" Stop Words via IDF

Unlike in the previous example, the query - document scores are significantly more varied and contain many, many more zero scores. This is due to the implicit introduction of stop words into the calculations via the IDF function. To see this, let's first look at the distribution of IDF values, and then at some of the words on the lower end of the IDF value spectrum (i.e., values equal to 0).

In [14]:
g = sns.distplot(idf, kde=False, color='purple');
g.figure.set_size_inches(12,8);
plt.title("Inverse Document Frequency Weighting Distribution", size=14);
plt.xlabel("IDF", size=12);

At first, one may be struck by the proportion of words that do not have small IDF values. But recall that most words appear in very few documents (an extrapolation of Zipf's Law; as a fun exercise, see if you can find the Zipf's Law article in the PhysOrg collection), and therefore we should expect most words to have relatively large IDF values.
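One way to check this in our collection is to look directly at the document frequencies computed above; most dictionary words should appear in only a handful of documents:

df_counts = pandas.Series(list(col_word_freqs.values()))
df_counts.describe()       # heavily skewed: most words occur in very few documents
(df_counts == 1).mean()    # fraction of words appearing in exactly one document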

Note the slight "hump" at the 0 bin. Below, we compare these "implicit stop words" to the stop words defined in NLTK. Many of the NLTK stop words have IDFs of 0 in our data, while those that do not are reasonably excluded from our implicit list given the data set (many are partial contractions, pronouns, and prepositions).

Thus, unlike in the previous example, where all co-occurring words contributed something to the query - document similarity scores, here the most common words in the collection contribute nothing to the query - document scores. This results in the large quantity of zero-valued query - document scores in the dataset.

The last list is the fairly small set of implicit stop words unique to the PhysOrg dataset. Note that "university" and "research" make sense given the collection's context. Additionally, "said" is a common word in interview-style articles, and many PhysOrg articles involve some sort of quoted interview.

In [15]:
rev_dict = {v:k for (k,v) in dictionary.items()}
imp_stop_words = set([rev_dict[i] for i,v in enumerate(idf) if v==0.])
In [16]:
stops = set(nltk.corpus.stopwords.words('english'))
In [17]:
print('List of words shared by the NLTK stop words list and the "implicit" stop words:\n')
simi = stops.intersection(imp_stop_words)
print(simi)
List of words shared by the NLTK stop words list and the "implicit" stop words:

{'these', 'about', 'this', 'at', 'how', 'it', 'or', 'and', 'more', 'such', 'they', 'when', 'as', 'a', 'but', 'for', 'their', 'been', 'are', 'not', 'will', 'has', 'to', 'that', 'by', 's', 'with', 'an', 'all', 'in', 'was', 'we', 'its', 'have', 'on', 'the', 'into', 'of', 'be', 'which', 'other', 'is', 'can', 'than', 'from'}
In [18]:
print('List of words unique to the NLTK stop words list:\n')

diff = stops.difference(imp_stop_words)
print(diff)
List of words unique to the NLTK stop words list:

{'doesn', 'don', 've', 'until', 'her', 'few', 'then', 'up', 't', 'hers', 'won', 'again', 'same', 'whom', 'theirs', 'being', 'here', 'do', 'out', 'them', 'didn', 'because', 'some', 'after', 'your', 'me', 'am', 'were', 'mustn', 'what', 'now', 'shouldn', 'should', 'yours', 'having', 'he', 'isn', 'further', 'below', 'off', 'itself', 'against', 'down', 'wasn', 'wouldn', 'where', 'no', 'll', 'yourself', 'who', 'ourselves', 'through', 'you', 'if', 'under', 'his', 'nor', 'himself', 'doing', 'mightn', 'there', 'hadn', 'm', 'themselves', 'o', 'aren', 'did', 'before', 'y', 'hasn', 'ma', 'why', 'both', 'weren', 'very', 'most', 'so', 'during', 'myself', 'while', 'over', 'him', 'my', 'i', 'd', 'our', 'shan', 're', 'yourselves', 'herself', 'she', 'only', 'those', 'once', 'does', 'needn', 'had', 'ain', 'couldn', 'ours', 'haven', 'just', 'each', 'any', 'above', 'between', 'too', 'own'}
In [19]:
print('List of words unique to the "implicit" stop words list:\n')

diff = imp_stop_words.difference(stops)
print(diff)
List of words unique to the "implicit" stop words list:

{'could', 'new', 'also', 'said', 'research', 'university', 'one'}

Evaluation

We repeat the evaluation of our method as before. To summarize:

  • Analyze the Top 10 documents returned for each query by Document Category
  • Compute the F-Measure for each Category and across all Categories

Document Types Returned by Category

First, we determine the Top 10 highest scoring documents for each query. Then, we analyze those documents based on (1) whether or not the Top 10 documents contain the document associated with the query and (2) the number of documents from each category in the Top 10 documents.

The high-level summarization below shows an improvement across the board for (1) percentage of the time a query's respective document is returned first (up from 73.4%), (2) percentage of the time a query's respective document is in the Top 10 list (up from 90.3%), and (3) percentage of the time the highest-scoring document for a query is from the same category as the query's respective document (up from 86.7%).

Additionally, for the category heat-map, we see a stronger showing along the diagonal, indicating that a higher percentage of Top 10 documents for a query are from the same category as the query's associated document.

In [20]:
top10 = query_doc_sims.argsort(axis=1)[:,-10:]
top10 = top10[:, [i for i in reversed(range(10))]]
top10 = pandas.DataFrame(data=top10, columns=[i for i in range(1,11)])
top10['Doc'] = range(top10.shape[0])
top10['Category'] = top10.apply(lambda x: articles.category[x['Doc']], axis=1)
In [21]:
# Doc as first, Doc in top 10
doc_first = sum(top10[1]==top10.index) / top10.shape[0] * 100
doc_top10 = sum(top10.apply(lambda x: x['Doc'] in set(x[0:10]), axis=1)) / top10.shape[0] * 100
print("A query's respective document is returned first {0:0.1f}% of the time".format(doc_first))
print("A query's respective document is returned within the first 10 documents {0:0.1f}% of the time".format(doc_top10))

# Doc Category as first
cat_first = sum(top10.apply(lambda x: articles.category[x[1]]==articles.category[x['Doc']], axis=1)) / top10.shape[0] * 100
print("A query's respective document category is returned first {0:0.1f}% of the time".format(cat_first))
A query's respective document is returned first 88.5% of the time
A query's respective document is returned within the first 10 documents 96.9% of the time
A query's respective document category is returned first 96.1% of the time
In [22]:
def catLookup(x):
    return([articles.category[i] for i in x[1:11]])

def sameCat(x):
    cat = x['Category']
    cats = catLookup(x)
    return([c==cat for c in cats])

def sameCatCount(x):
    same_cats = sameCat(x)
    return(sum(same_cats))

top10['same_cat_count'] = top10.apply(lambda x: sameCatCount(x), axis=1)
In [23]:
# Count occurrences of each Major 8 type in the top 10 list for each Major 8 type
total_counts = []
inds = []
for ind,m8 in enumerate(pandas.unique(articles.category)):
    sub = numpy.array(articles.category==m8, dtype=bool)
    sub_tt = numpy.array(top10.loc[sub][[i for i in range(1,11)]]).flatten()
    sub_tt_categories = articles.category[sub_tt]
    sub_counts = nltk.FreqDist(sub_tt_categories)
    total_counts.append(sub_counts)
    inds.append(m8)
total_counts = pandas.DataFrame(total_counts)
total_counts.index = inds
total_counts = total_counts[inds]
In [24]:
# Frac of total documents in top 10 of each M8, by each M8
total_fracs = numpy.array(total_counts)
total_fracs = numpy.apply_along_axis(lambda x: x / sum(x), 1, total_fracs)
total_fracs = pandas.DataFrame(data = total_fracs, columns=inds, index=inds)
In [25]:
# Plot heatmap of the results
cmap = sns.cubehelix_palette(start=3, rot=0, dark=.12, light=.87, as_cmap=True)

g = sns.heatmap(total_fracs, annot=True, cmap=cmap);
g.figure.set_size_inches(10,8);

plt.xticks(size=12);
plt.yticks(size=12);

Evaluation with the F-Measure

We again calculate the F-Measure for a more objective comparison between the two methods. As the fractions along the diagonal in the heat-map above have increased, so have the respective counts in the table below. More importantly, the F-Measures for each category have also increased, all of them notably except for the "technology-news" category, whose small gain is expected given its very slight increase in the heat-map above. Thus, we also observe an overall increase in the F-Measure.
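For reference, the per-query F-Measure computed below treats the Top 10 list as the retrieved set: precision is the number of relevant (same-category) documents among the 10 retrieved, and recall is that same count divided by the total number of documents in the category:

\begin{equation} P = \frac{rel}{ret}, \qquad R = \frac{rel}{tot}, \qquad F = \frac{2 \cdot P \cdot R}{P + R} \end{equation}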

In [26]:
total_counts['total_docs'] = numpy.array([nltk.FreqDist(articles.category)[c] for c in total_counts.index])
In [27]:
total_counts
Out[27]:
earth-news science-news nanotech-news physics-news space-news technology-news biology-news chemistry-news total_docs
earth-news 384 51 24 19 75 115 114 28 81
science-news 93 516 34 43 51 385 177 51 135
nanotech-news 21 11 349 142 17 92 69 119 82
physics-news 32 30 180 565 51 163 52 97 117
space-news 51 30 14 43 460 79 34 19 73
technology-news 110 247 88 152 77 2450 163 123 341
biology-news 116 116 70 73 49 212 871 143 165
chemistry-news 33 38 145 108 29 142 182 363 104
In [28]:
def recall(rel, tot):
    return(rel / tot)

def precision(rel, ret):
    return(rel / ret)

def fMeasure(rel, ret, tot):
    r = recall(rel, tot)
    p = precision(rel, ret)
    fm = 2 * r * p / (r + p)
    return(fm)

def fFromX(x):
    cat = x['Category']
    tot = total_counts['total_docs'][cat]
    return(fMeasure(x['same_cat_count'], 10, tot))
In [29]:
top10['FMeasure'] = top10.apply(lambda x: fFromX(x), axis=1)
In [30]:
top10.groupby(['Category']).agg({'FMeasure' : 'mean'})
Out[30]:
FMeasure
Category
biology-news 0.060675
chemistry-news 0.062247
earth-news 0.105277
nanotech-news 0.093584
physics-news 0.077260
science-news 0.053231
space-news 0.153161
technology-news 0.041039
In [31]:
print('Overall F-Measure: {0:0.3f}'.format(numpy.mean(top10.FMeasure)))
Overall F-Measure: 0.067

Wrap-Up

In this notebook, we improved upon the "Simplest Vector Space Instantiation" implemented in the previous notebook, specifically by updating how documents are embedded into the vector space used for calculating query - document similarity scores: namely, by using a variant of the Okapi BM25 method.

By implementing the Okapi BM25 Ranking Function, we achieved more promising results than with the binary BOW method. These improvements showed up as an increase in the overall fraction of Top 10 documents that share the category of the query's respective document. As additional evidence of improvement, increases in both the category-level and overall F-Measures were observed.