immersinn-ds

Thu 10 November 2016

JM Smoothing Language Model For Ranking

Posted by TRII in text-retrieval-and-search-engines   

Introduction / Overview

We continue our work on the Text Retrieval and Search Engines course (see here for the last article). For the various topics covered in the course, the goal is to implement some of the methods and tools in order to gain some hands-on experience.

The previous articles looked at embedding documents and queries into an $n$-dimensional space, calculating the distances between query-document embeddings, and utilizing these distances as a measure of similarity between documents and queries.

In this article, we look at Statistical Language Models for developing a Ranking Function for Documents given a Query. While much of the model formulation shown below appears very similar to what was performed previously, the underlying concept is drastically different.

Specifically, our driving principle is developing a model that provides an estimate for the probability that a "relevance variable" is 1, or "positive", given a query and a document. That is, the goal is to construct a model that estimates $\mathsf{p}(R=1 \vert q, d)$, where $R$ is the relevance parameter.

In [1]:
import os
import sys
import pickle
In [2]:
import numpy
import scipy
import pandas
import nltk
In [3]:
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
In [4]:
# Load previously pre-processed documents
with open('data/preprocessedArticleContentDump.pkl', 'rb') as f:
    articles = pickle.load(f)
In [5]:
articles.head(3)
Out[5]:
URL category content source title tokenized bow tot_len unique_words title_bow doc_id
0 http://phys.org/news/2016-04-activists-appeal-... earth-news Environmental groups on Tuesday lodged a compl... PhysOrg Activists appeal to EU over Polish logging of ... [environmental, groups, on, tuesday, lodged, a... {as, previous, branch, of, pose, greenpeace, t... 385 223 {to, of, logging, eu, polish, appeal, forest, ... 0
1 http://phys.org/news/2016-04-seismic-ecuador.html earth-news A doctoral thesis developed at UPM analysed th... PhysOrg The seismic risk of Ecuador [a, doctoral, thesis, developed, at, upm, anal... {of, heavily, short-term, that, last, from, pr... 269 152 {risk, ecuador, the, seismic, of} 1
2 http://phys.org/news/2016-04-japanese-ecuador-... earth-news A magnitude 6.5 earthquake struck Japan on Fri... PhysOrg Are the Japanese and Ecuador earthquakes related? [a, magnitude, earthquake, struck, japan, on, ... {were, but, comparison, thrust, locked, evenin... 865 382 {ecuador, are, japanese, related, and, earthqu... 2

Language Models

Review of Similarity Methods

In the previous articles, we computed document rankings given a query using a Similarity based approach. That is, we made the assumption that a document's relevance with regards to a query should be proportional to a Similarity Score Function.

In the first case, the Score Function was a simple dot product between vector representations of documents and queries, where each vector encoded the presence or absence of terms.

The second case utilized the Okapi BM25 Ranking Function. This function provided various weightings for terms given individual document prevalence and overall collection prevalence. Word counts instead of word presence / absence were also utilized.

Both cases were conceptually focused around embedding documents and queries into the same $n$-dimensional space and calculating the distances between them. The distances between embeddings were then used as surrogates for calculating query - document similarities. In semi-mathematical terms, we can represent these methods as:

$$f(query, document) = \mathsf{similarity}(query, document)$$

Overview of Language Models

A Statistical Language Model, on the other hand, is a Probabilistic Model in which each document and query is treated as a Random Variable being sampled from some distribution(s).

Put another way, the string constituting a document or query has a probability given a particular vocabulary. For a simple Unigram Model, we treat each word in this string as an individual entity whose probability is independent of the other words present given the vocabulary. For the simplest case, the Maximum Likelihood Estimator (MLE) for the probability of observing a word is:

$$\mathsf{p_{ML}}(w_i|C) = \frac{\mathsf{count}(w_i,C)}{\sum_{w \in C} \mathsf{count}(w,C)}$$

where $\mathsf{count}(w_i,C)$ is the total number of times word $w_i$ appears in the collection $C$.

Given a collection and each word's MLE Probability, we know the probability for each word that appears in a document (query). The total probability of observing a document (query) is the product of the probabilities corresponding to the word sequence, or:

$$\mathsf{p_{ML}}(d \vert C) = \prod_{w \in d} \mathsf{p_{ML}}(w \vert C)$$

Note that if we establish a set of word probabilities based on a collection, and we attempt to estimate the probability of a document $d$ that contains a word $w_j$ foreign to the original collection, then the calculated probability of observing $d$ is 0 given the model described above, as $\mathsf{p}(w_j \vert C) = 0$.

We do not address (nor suffer from) this particular problem in this article, though we do encounter a similar issue. But 0's in models are bad news.
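As a quick aside, the toy sketch below (entirely separate from the notebook pipeline; the mini-collection is invented purely for illustration) computes unigram MLE probabilities and shows how a single out-of-vocabulary word zeroes out a document's probability:

from collections import Counter

# Toy "collection" of three tiny documents
collection = ["the cat sat", "the dog sat", "the cat ran"]
counts = Counter(w for doc in collection for w in doc.split())
total = sum(counts.values())
p_ml = {w: c / total for w, c in counts.items()}  # MLE: count(w,C) / total count

def doc_prob(doc):
    """Unigram probability of a document under the collection model."""
    p = 1.0
    for w in doc.split():
        p *= p_ml.get(w, 0.0)  # an unseen word contributes probability 0
    return p

print(doc_prob("the cat sat"))   # 0.0165 (= 3/9 * 2/9 * 2/9)
print(doc_prob("the cat flew"))  # 0.0 -- 'flew' never occurs in the collection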

Overview of Query Likelihood Ranking Function

As an imagined formulation for using probabilities as the driver for determining the relevance of a document for a particular query, we can picture a massive array that contains (query, document, $R$) triplets, where $R$ is a binary indicator describing whether the document was viewed as relevant for the given query.

For our table, the same (query, document) pair may appear many times, each originating from some different imaginary user, implying that no row is necessarily unique, and multiple instances of a particular (query, document, $R$) triplet can appear as well.

With this table, we are able to generate an estimator for $\mathsf{p}(R=1 \vert q, d)$, which is the probability of observing $R = 1$ given a (query, document) pair. The value is the average $R$ value observed across all transactions containing the (query, document) pair. Thus:

$$f(query, document) = \mathsf{p}(R=1 \vert query, document) = \frac{\mathsf{count}(R=1, query, document)}{\mathsf{count}(query, document)}$$

By assigning a cutoff value for $\mathsf{p}(R=1 \vert q, d)$, we would know which documents to return to a user who entered a query $q$.
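To make the thought experiment concrete, a hypothetical miniature version of that table (the rows below are invented) and the resulting estimator might look like:

import pandas

# Hypothetical (query, document, R) log; (q, d) pairs repeat across imaginary users
rel_log = pandas.DataFrame({'query':    ['q1', 'q1', 'q1', 'q1', 'q2', 'q2'],
                            'document': ['d1', 'd1', 'd2', 'd2', 'd1', 'd1'],
                            'R':        [1,    1,    1,    0,    0,    0]})

# p(R=1 | q, d) = count(R=1, q, d) / count(q, d), i.e. the mean R per (q, d) pair
p_rel = rel_log.groupby(['query', 'document'])['R'].mean()
print(p_rel)  # (q1, d1) -> 1.0, (q1, d2) -> 0.5, (q2, d1) -> 0.0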

This table, however, clearly does not exist in any form nearly comprehensible enough for use, nor would it be reasonable to attempt to construct it. Instead, we use the Language Model ideas described above, and make the following assumption:

$$f(query, document) = p(R = 1 | query, document) \approx p(query | R = 1, document) \approx p(query | document)$$

which is the probability of observing a query given a particular relevant document. Phrasing this a bit differently, we can treat this as the probability of a user entering a particular query $q$ when said user is thinking about or attempting to retrieve a particular document $d$ (the "imaginary relevant document").

The Language Model machinery is utilized to estimate $\mathsf{p}(q \vert d)$: a (smoothed) word distribution is estimated from each document, and the probability of the query is then evaluated under that distribution.

General Ranking Function & Smoothing

Below we show various equations related to calculating the probability of a particular query given a document. The first two relate to general versions of Ranking Functions relying on Language Model concepts, while the latter two show details related to the particular implementation utilized in this article.

Non-Smoothing

In the first equation, we take the basic expression of this ranking function, which is the product over individual word probabilities in the given query based on the probability of observing the word within a given document.

Note that we have taken the $\log(\cdot)$ of this function, as is typical, in order to avoid computational underflow when multiplying together many small values, which is what is typically encountered when using the probabilities of words occurring in documents.

For the last form in the first equation, note that the summation is over the unique words in the query, and each word probability is weighted by the total number of times the word occurs within the query.

A major issue with using the raw word-document probabilities is that words appearing in a query but unobserved in a particular document have $\mathsf{p}(w \vert d) = 0$. This results in $\mathsf{p}(q \vert d) = 0$ as well, which is highly problematic for calculating ranks, unless we can make the very strong assumption that every document we would want to deem as relevant will contain every word in the query. Typically this is neither a reasonable nor useful assumption, and so a work-around is needed to avoid "zeroing out" the conditional probability.

Smoothing

This can be accomplished by smoothing the word-document probability function to account for words in the vocabulary that are not present in the document. Various smoothing methods exist in the wild. All methods find work-arounds to the "data sparseness" issue of documents, and often leverage the collection as a whole, or subsets of the collection larger than a document. Some also make use of "preset" adjustments to guarantee non-zero probabilities for words.

Typically, smoothing methods utilize two distributions: a "seen" term for words observed in the document, and an "unseen" term for words not observed in the document. The "seen" term sums over only those words that appear in both the document and the query, while the "unseen" term sums over only words appearing in the query but not in the document. This split is shown in the first line of the second equation.

In the second line, the "unseen" component is extended to sum over all words appearing in the query, not just those terms that are not observed in the document. This addition is balanced by subtracting out $\mathsf{p_{Unseen}}(\cdot)$ from the "seen" component of the equation, which manifests itself as the denominator within the $\log(\cdot)$. This results in a term that is similar to a TF-IDF weighting.

The unseen word probability is typically assumed to be proportional to the overall collection frequency of the word. In the third line of the equation, $\mathsf{p_{Unseen}}(\cdot)$ is replaced with the term $\alpha_d \mathsf{p}(w \vert C)$, which is a composite of the collection language model, $\mathsf{p}(w \vert C)$, and a term that potentially depends on the document, $\alpha_d$.

The $\alpha_d$ term is extracted from the "unseen" component as a separate element of the equation in the third line, resulting in a third term that depends only on the query and the overall collection and not the document. Thus, this term can be discarded for ranking purposes.

As a note, $\mathsf{p}(w \vert C)$ is $\mathsf{p_{ML}}(w \vert C)$ defined above.

Generalized Language Model Ranking Function:

$$\mathsf{f}(q,d) = \log \mathsf{p}(q \vert d) = \log \left(\prod^{n}_{i = 1} \mathsf{p}(w_i \vert d) \right) = \sum^{n}_{i = 1} \log ( \mathsf{p}(w_i \vert d)) = \sum_{w \in q} \mathsf{c}(w, q) \cdot \log ( \mathsf{p}(w \vert d)) $$

Generalized Language Model Ranking Function with Smoothing:

$$\mathsf{f}(q,d) = \log \mathsf{p}(q \vert d) = \sum_{\substack{w \in q\\ c(w,d)>0}} \mathsf{c}(w, q) \cdot \log \mathsf{p_{Seen}}(w \vert d) + \sum_{\substack{w \in q\\ c(w,d)=0}} \mathsf{c}(w, q) \cdot \log \mathsf{p_{Unseen}}(w \vert d)$$

$$= \sum_{\substack{w \in q\\ c(w,d)>0}} \mathsf{c}(w, q) \cdot \log \frac{\mathsf{p_{Seen}}(w \vert d)}{\mathsf{p_{Unseen}}(w \vert d)} + \sum_{w \in q} \mathsf{c}(w, q) \cdot \log \mathsf{p_{Unseen}}(w \vert d)$$

$$= \sum_{\substack{w \in q\\ c(w,d)>0}} \mathsf{c}(w, q) \cdot \log \frac{\mathsf{p_{Seen}}(w \vert d)}{\alpha_d\, \mathsf{p}(w \vert C)} + \vert q \vert \log \alpha_d + \sum_{w \in q} \mathsf{c}(w, q) \cdot \log \mathsf{p}(w \vert C)$$

We note that this representation of the Smoothed Language Model permits analogies with the vector space models shown previously. Specifically, the ratio $\frac{\mathsf{p_{Seen}}(w \vert d)}{\alpha_d \mathsf{p}(w \vert C)}$ is proportional to the term frequency within the given document and inversely proportional to the term frequency in the overall collection. This latter factor differs a bit from before, however, as the BM25 model was concerned with only the proportion of documents in which a particular term appeared, not the overall number of appearances. Thus, we again see a TF-IDF-like structure in the model, enclosed within a sub-linear function, $\log(\cdot)$.

Additionally, the second term -- $\vert q \vert \log \alpha_d$ -- can be made to account for document length, as $\alpha_d$ is assumed to be dependent on the document in question. The formulation we use here does not take advantage of that option.

Jelinek-Mercer Smoothing

The two equations below describe the specific approach we use in this article. Jelinek-Mercer Smoothing is a linear interpolation of the document and collection word probabilities, where the coefficient $\lambda$ determines the weighting balance between the two terms. We use a value of 0.1 in this article.

To generate our Ranking Function, we need to substitute the JM Smoothing function into the general ranking function from above. The JM word probability function replaces the $\mathsf{p_{Seen}}(\cdot)$ term. In this case, then, $\mathsf{p_{Seen}}(\cdot)$ is a function of both the document and the collection.

The analogous term for $\alpha_d$ in the general equation should perform the same function. Note that both $\alpha_d$ and the $\lambda$ term in JM are responsible for adjusting the background collection word probability. Thus, we replace $\alpha_d$ with the constant coefficient $\lambda$, and do not let $\alpha_d$ vary for each document.

As $\alpha_d$ is now a constant, the second term in the original equation, $\vert q \vert\log\alpha_d$, is independent of any given document and so can be ignored for ranking purposes. Note that the term is still dependent on the query.

Substituting in the JM word probability function, we arrive at the JM Smoothing Ranking Function (after re-arranging and re-writing some terms). The TDM (term-document matrix) can be pre-weighted with the $\log(\cdot)$ term, as this is constant for a given document in a given collection.

Calculating the ranking for documents given a query becomes a dot product of the query's TDV with the weighted TDM. We observed this to be the case with the VSM models as well.

Linear Interpolation Word Probability Smoothing (Jelinek-Mercer):

$$\mathsf{p_\lambda}(w \vert d) = (1 - \lambda) \mathsf{p_{ML}}(w \vert d) + \lambda \mathsf{p}(w \vert C)$$
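For reference, the re-arrangement mentioned above works out as follows: substituting $\mathsf{p_\lambda}(\cdot)$ for $\mathsf{p_{Seen}}(\cdot)$ and $\lambda$ for $\alpha_d$, a word seen in the document contributes

$$\log \frac{\mathsf{p_\lambda}(w \vert d)}{\lambda\,\mathsf{p}(w \vert C)} = \log \frac{(1-\lambda)\,\mathsf{p_{ML}}(w \vert d) + \lambda\,\mathsf{p}(w \vert C)}{\lambda\,\mathsf{p}(w \vert C)} = \log \left( 1 + \frac{1-\lambda}{\lambda} \cdot \frac{\mathsf{p_{ML}}(w \vert d)}{\mathsf{p}(w \vert C)} \right)$$

and writing $\mathsf{p_{ML}}(w \vert d) = \mathsf{c}(w,d) / \vert d \vert$ yields the summand in the ranking function below.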

JM Smoothing Language Model Ranking Function:

$$\mathsf{f_{JM}}(q,d) = \sum_{\substack{w \in q\\ c(w,d)>0}} \mathsf{c}(w, q) \cdot \log \left( 1 + \frac{1-\lambda}{\lambda} \frac{\mathsf{c}(w, d)}{\vert d \vert \cdot \mathsf{p}(w \vert C)} \right)$$
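As a quick numerical sanity check (the toy counts below are invented), the ranking function should match the full smoothed log-likelihood $\sum_{w \in q} \mathsf{c}(w,q) \log \mathsf{p_\lambda}(w \vert d)$ up to the document-independent remainder $\vert q \vert \log \lambda + \sum_{w \in q} \mathsf{c}(w,q) \log \mathsf{p}(w \vert C)$:

import numpy

lam = 0.1
c_wd = numpy.array([2, 1, 0])        # word counts in a toy document d
d_len = c_wd.sum()                   # |d|
p_wC = numpy.array([0.5, 0.3, 0.2])  # toy collection model p(w|C)
c_wq = numpy.array([1, 1, 1])        # word counts in a toy query q

# Full smoothed log-likelihood
p_lam = (1 - lam) * (c_wd / d_len) + lam * p_wC
full = (c_wq * numpy.log(p_lam)).sum()

# JM ranking function: only words seen in d contribute
seen = c_wd > 0
f_jm = (c_wq[seen] * numpy.log(
    1 + (1 - lam) / lam * c_wd[seen] / (d_len * p_wC[seen]))).sum()

# Document-independent remainder
rest = c_wq.sum() * numpy.log(lam) + (c_wq * numpy.log(p_wC)).sum()

print(numpy.isclose(full, f_jm + rest))  # True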

Analysis

The steps below are identical to those shown previously, except that we replace the VSM Similarity Measure formulation with the JM Smoothing Language Model Ranking Function.

In [6]:
# Create the dictionary from all terms in the contents, titles
dictionary = set()
_ = articles.apply(lambda x: dictionary.update(x['bow']), axis=1)
_ = articles.apply(lambda x: dictionary.update(x['title_bow']), axis=1)
dictionary = {w : v for (v,w) in zip(range(len(dictionary)), dictionary)}
In [7]:
# Document Word Frequencies
articles['word_freqs'] = articles.apply(lambda x: nltk.FreqDist(x['tokenized']), axis=1)

# Document Total Word Frequencies
col_word_freqs = nltk.FreqDist()
_ = articles.apply(lambda x: col_word_freqs.update(x['word_freqs']), axis=1)
total_words = sum(col_word_freqs.values())

JM Smoothing Pre-Processing for Documents

In [8]:
lam = 0.1

# Calculate collection word probabilities, p(w|C)
cwp = numpy.zeros((len(dictionary), ))
for w,i in dictionary.items():
    cwp[i] = col_word_freqs[w] / total_words

def generateJMSmoothProbs(x, doc_vecs, lambda_param):
    """JM-Smoothed Calculation for Document"""
    doc_id = x['doc_id']
    doc_len = x['tot_len']
    for word, count in x['word_freqs'].items():
        w_ind = dictionary[word]
        w_prob = cwp[w_ind]
        doc_vecs[doc_id, w_ind] = numpy.log(1 + (1-lambda_param)/lambda_param * count / (doc_len * w_prob))
In [9]:
# Generate JM Smoothing Terms for all words in a document
# These vectors will then be used to calculate the query-doc probabilities
doc_vecs = numpy.zeros(shape=(articles.shape[0], len(dictionary)))
_ = articles.apply(lambda x: generateJMSmoothProbs(x, doc_vecs, lam), axis=1)

Query Word Count Vectors

In [10]:
# Document Query (Title) BOW vectors
# (note: title_bow is a set, so each title word contributes a count of 1;
#  since titles rarely repeat words, this matches c(w,q) in practice)
query_vecs = numpy.zeros(shape=(articles.shape[0], len(dictionary)))
for di, bd in enumerate(articles['title_bow']):
    for w in bd:
        query_vecs[di, dictionary[w]] += 1

Calculating Query-Document Rankings

As mentioned above, since the document-side operations of the JM Smoothing Ranking Function have already been applied, the remaining calculation is simply the dot product of the document representations with the query representations -- which are now total counts of all words in a query.

Below are the results for all pairwise dot products between documents and queries. Note the lack of a prominent peak for the "Score = 0" bin that was observed in the previous model. The current model does not contain the same "implicit stop word" feature that was present in BM25, so any shared word between document and query makes a non-zero -- though heavily "discounted" for more frequent words -- contribution to the score.

Overall, the distribution of scores for this model appears much less concentrated than the similarities observed with BM25.

In [11]:
# Calculate all pairwise dot products between queries and documents
query_doc_ranks = scipy.spatial.distance.cdist(query_vecs,
                                               doc_vecs,
                                               numpy.dot)
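Because the "metric" handed to cdist here is just numpy.dot, the call above reduces to all pairwise inner products; the equivalent (and typically faster) matrix-product form would be:

query_doc_ranks = query_vecs.dot(doc_vecs.T)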
In [12]:
g = sns.distplot(query_doc_ranks.flatten(), kde=False, color='purple');
g.figure.set_size_inches(12,8);
plt.title("Query - Document Rank Scores", size=14);
plt.xlabel("Query - Document Rank Score", size=12);
plt.ylabel("Count", size=12);
plt.xlim([-1,40]);
In [13]:
pandas.Series(query_doc_ranks.flatten()).describe()
Out[13]:
count    1.205604e+06
mean     6.237211e+00
std      4.571517e+00
min      0.000000e+00
25%      2.598635e+00
50%      5.290014e+00
75%      8.610751e+00
max      1.006037e+02
dtype: float64

Evaluation

We repeat the evaluation of our method as before. To summarize:

  • Analyze the Top 10 documents returned for each query by Document Category
  • Compute the F-Measure for each Category and across all Categories (see the formula below)
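For reference, the F-Measure computed below is the balanced harmonic mean of Precision and Recall:

$$F = \frac{2 \cdot P \cdot R}{P + R}, \qquad P = \frac{\text{relevant retrieved}}{\text{retrieved } (= 10)}, \qquad R = \frac{\text{relevant retrieved}}{\text{total relevant in category}}$$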

As a broad summary, results observed for the JM Smoothing Language Model Ranking Function are highly similar to results obtained from the Okapi BM25 model. It should be noted that BM25 (slightly) outperforms the current method across most of the article categories, as well as on the overall F-Measure for the collection. Using an alternative smoothing method that provides an $\alpha_d$ that varies by document, such as Dirichlet Prior Smoothing, could yield more comparable performance. Such an exercise is left for the interested reader :-)
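For the curious, here is a minimal sketch of what that substitution might look like, mirroring generateJMSmoothProbs above (the function name and the value of mu are hypothetical; Dirichlet Prior Smoothing sets $\mathsf{p_{Seen}}(w \vert d) = \frac{\mathsf{c}(w,d) + \mu\,\mathsf{p}(w \vert C)}{\vert d \vert + \mu}$ with $\alpha_d = \frac{\mu}{\vert d \vert + \mu}$):

mu = 2000  # hypothetical prior weight; values in the low thousands are common

def generateDirichletSmoothProbs(x, doc_vecs, alpha_terms, mu_param):
    """Dirichlet-Prior-Smoothed weights for a document (illustrative sketch)."""
    doc_id = x['doc_id']
    doc_len = x['tot_len']
    # alpha_d now varies with document length, so the |q| * log(alpha_d)
    # term can no longer be dropped from the ranking function
    alpha_terms[doc_id] = numpy.log(mu_param / (doc_len + mu_param))
    for word, count in x['word_freqs'].items():
        w_ind = dictionary[word]
        w_prob = cwp[w_ind]
        # log(p_Seen / (alpha_d * p(w|C))) simplifies to log(1 + c(w,d) / (mu * p(w|C)))
        doc_vecs[doc_id, w_ind] = numpy.log(1 + count / (mu_param * w_prob))

The final score for a (query, document) pair would then be the usual dot product plus $\vert q \vert$ times the document's alpha term.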

Document Types Returned by Category

In [14]:
top10 = query_doc_ranks.argsort(axis=1)[:,-10:]
top10 = top10[:, [i for i in reversed(range(10))]]
top10 = pandas.DataFrame(data=top10, columns=[i for i in range(1,11)])
top10['Doc'] = range(top10.shape[0])
top10['Category'] = top10.apply(lambda x: articles.category[x['Doc']], axis=1)
In [15]:
# Doc as first, Doc in top 10
doc_first = sum(top10[1]==top10.index) / top10.shape[0] * 100
doc_top10 = sum(top10.apply(lambda x: x['Doc'] in set(x[0:10]), axis=1)) / top10.shape[0] * 100
print("A query's respective document is returned first {0:0.1f}% of the time".format(doc_first))
print("A query's respective document is returned within the first 10 documents {0:0.1f}% of the time".format(doc_top10))

# Doc Category as first
cat_first = sum(top10.apply(lambda x: articles.category[x[1]]==articles.category[x['Doc']], axis=1)) / top10.shape[0] * 100
print("A query's respective document category is returned first {0:0.1f}% of the time".format(cat_first))
A query's respective document is returned first 87.2% of the time
A query's respective document is returned within the first 10 documents 96.1% of the time
A query's respective document category is returned first 95.0% of the time
In [16]:
def catLookup(x):
    return([articles.category[i] for i in x[1:11]])

def sameCat(x):
    cat = x['Category']
    cats = catLookup(x)
    return([c==cat for c in cats])

def sameCatCount(x):
    same_cats = sameCat(x)
    return(sum(same_cats))

top10['same_cat_count'] = top10.apply(lambda x: sameCatCount(x), axis=1)
In [17]:
# Count occurrences of each Major 8 type in the top 10 list for each Major 8 type
total_counts = []
inds = []
for ind,m8 in enumerate(pandas.unique(articles.category)):
    sub = numpy.array(articles.category==m8, dtype=bool)
    sub_tt = numpy.array(top10.ix[sub][[i for i in range(1,11)]]).flatten()
    sub_tt_categories = articles.category[sub_tt]
    sub_counts = nltk.FreqDist(sub_tt_categories)
    total_counts.append(sub_counts)
    inds.append(m8)
total_counts = pandas.DataFrame(total_counts)
total_counts.index = inds
total_counts = total_counts[inds]
In [18]:
# Frac of total documents in top 10 of each M8, by each M8
total_fracs = numpy.array(total_counts)
total_fracs = numpy.apply_along_axis(lambda x: x / sum(x), 1, total_fracs)
total_fracs = pandas.DataFrame(data = total_fracs, columns=inds, index=inds)
In [19]:
# Plot heatmap of the results
cmap = sns.cubehelix_palette(start=3, rot=0, dark=.12, light=.87, as_cmap=True)

g = sns.heatmap(total_fracs, annot=True, cmap=cmap);
g.figure.set_size_inches(10,8);

plt.xticks(size=12);
plt.yticks(size=12);

Evaluation with the F-Measure

In [20]:
total_counts['total_docs'] = numpy.array([nltk.FreqDist(articles.category)[c] for c in total_counts.index])
In [21]:
total_counts
Out[21]:
earth-news science-news nanotech-news physics-news space-news technology-news biology-news chemistry-news total_docs
earth-news 345 61 27 32 72 131 109 33 81
science-news 99 460 36 68 44 395 186 62 135
nanotech-news 24 21 305 151 23 107 71 118 82
physics-news 35 30 189 528 43 155 70 120 117
space-news 55 30 16 45 413 104 41 26 73
technology-news 107 229 119 152 98 2381 182 142 341
biology-news 104 128 86 97 65 271 761 138 165
chemistry-news 32 43 156 126 29 163 166 325 104
In [22]:
def recall(rel, tot):
    return(rel / tot)

def precision(rel, ret):
    return(rel / ret)

def fMeasure(rel, ret, tot):
    r = recall(rel, tot)
    p = precision(rel, ret)
    fm = 2 * r * p / (r + p)
    return(fm)

def fFromX(x):
    cat = x['Category']
    tot = total_counts['total_docs'][cat]
    return(fMeasure(x['same_cat_count'], 10, tot))
In [23]:
top10['FMeasure'] = top10.apply(lambda x: fFromX(x), axis=1)
In [24]:
top10.groupby(['Category']).agg({'FMeasure' : 'mean'})
Out[24]:
FMeasure
Category
biology-news 0.053126
chemistry-news 0.056511
earth-news 0.094695
nanotech-news 0.081919
physics-news 0.072145
science-news 0.047714
space-news 0.136986
technology-news 0.040020
In [25]:
print('Overall F-Measure: {0:0.3f}'.format(numpy.mean(top10.FMeasure)))
Overall F-Measure: 0.062

Wrap-Up

In this notebook, we took a different approach to evaluating a Document's relevance to a Query. Instead of embedding Documents and Queries in the same $n$-dimensional space as in the previous notebook, we instead used a Language Model approach and asked the question, "What is the probability that a user trying to find a particular document D will enter the query Q in an attempt to find that document?"

While the Language Model presented here performs very similarly to the Okapi BM25 model presented previously, a major benefit of the Probabilistic Model is that it was essentially derived from "first principles" related to our assumptions and not pieced together from components that either "made sense" or were added to offset model shortcomings. This is something rather typical of probabilistic models: they have intuitive components that are a consequence of the probability model itself and are not added heuristically.