Mon 10 October 2016

Simplest VSM Instantiation

Posted by TRII in text-retrieval-and-search-engines

Introduction¶

This notebook will (hopefully) be the first in a series of notebooks related to work from the Coursera course, Text Retrieval and Search Engines. For the various topics covered in the course, the goal is to impliment some of the methods and tools in order to gain some hands-on experience.

The current plan is to use the same set of data for all of the notebooks. The dataset consists of articles scraped from PhysOrg, which posts news stories related to new developments in various science / technology ares. The articles are associated with a variety of PhysOrg topic areas, though each article is associated with only a simple topic. Method / algorithm evaluation will utilize these categories.

For the first article, as one can observe from the title, we will go over the process for constructing the "Simplest Vector Space Model Instantiation". Key features of this model are:

Bag of Word feature encoding using binary hits for terms
No normalization or weighting applied to terms or documents
Dot product similarity for calculating similarity scores between documents / queries

Term and document preprocessing will be minimal, partially for simplicity, partially to exemplify various shortcomings / strengths of various models. Below is a list of things preprocessing will (not) include:

No stop word removal
Tokens / features are words, not phrases
All words to lowercase
Remove all punctuation
Remove all numbers
No stemming
No mapping multiple related words onto a single feature / word (e.g., no LSI-like dimensionality reduction)

Given the context of this mini-project, queries (or statements that look like queries) are needed to evaluate performance. As all articles have associated titles, we use the titles to generate queries for each document in the collection.

In [1]:

import os
import sys
import pickle

In [2]:

import numpy
import scipy
import pandas
import nltk

In [3]:

%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

Load and Preprocess¶

Having already scraped a set of PhysOrg articles and extracted the relevant textual content (see a previous post of for details), all we need to do is load the files from the stored pickle. We can see, below, that there are about 1,200 documents (articles) in this collection.

Each entry has the article URL, article category, the content (i.e., article text), the title, and the source (all are fro PhysOrg in this dataset). The article content will be used to generate the BOW vector, while the title entry will be used to generate the query. For grouping articles, the "category" feature will be used.

In [4]:

with open('data/tempArticleContentDump.pkl', 'rb') as f:
    articles = pickle.load(f)

In [5]:

articles = pandas.DataFrame(articles)
print(len(articles))
articles.head()

Out[5]:

	URL	category	content	source	title
0	http://phys.org/news/2016-04-post-wildfire-ero...	Earth Sciences	Erosion after severe wildfires can be the domi...	PhysOrg	Post-wildfire erosion can be major sculptor of...
1	http://phys.org/news/2016-04-copper-oxygen.html	Earth Sciences	A new study presents evidence that the rise of...	PhysOrg	Copper gives an answer to the rise of oxygen
2	http://phys.org/news/2016-04-nasa-category-tro...	Earth Sciences	Tropical Cyclone Fantala has become a major tr...	PhysOrg	NASA examines Category 5 Tropical Cyclone Fant...
3	http://phys.org/news/2016-04-international-tea...	Earth Sciences	(Phys.org)—A team with members from the U.K., ...	PhysOrg	International team gets a seismic look at Nort...
4	http://phys.org/news/2016-04-belly-glacier.html	Earth Sciences	Dropping into an ice cave is like entering ano...	PhysOrg	Into the belly of a glacier

Restrict by Category¶

In [6]:

g = sns.countplot(articles.category);
g.figure.set_size_inches(15,10);
plt.xticks(rotation='vertical', size=12);
plt.title('Total Document Counts per PhysOrg Category', size=14);
plt.ylabel("");

The vast majority of documents are associated with only eight topics, all with the pattern "NEWS_TYPE-news" as the name due to how the articles were originally obtained from PhysOrd. For our analysis, we will restrict the data to articles from the eight major categories ("Major 8"). Note that articles from the "technology-news" category far outnumber content from any other category, while counts from the other seven are fairly close.

In [7]:

articles = articles[articles.apply(lambda x: x['category'].endswith('-news'),
                                   axis=1)]
articles.index = range(articles.shape[0])

In [8]:

g = sns.countplot(articles.category);
g.figure.set_size_inches(8,6);
plt.xticks(rotation='vertical', size=12);
plt.title('Total Document Counts per PhysOrg Category', size=14);
plt.ylabel("");

Tokenize Content & Titles¶

In order to generate a (simpliest) Vector Space Representation for the content of each article, we must first preprocess the documents. For our simple case, this includes:

All words to lowercase
Remove punctuation
Remove numbers
No stop word removal
No stemming
Word tokenize each document (no "higher-order" features)
Obtain the set of words contained in each document (ignore word counts)

Removing punctuation and numbers reduces the number of features from an area not likley to provide much benefit. Some methods convert all numbers to a generic placeholder (e.g., "NUMBER", not to be conflated with the word token "number"), but again, our focus is on simplicity.

NLTK is used for tokenization and whitespace / punctuation / number removal all in one step by utilizing the RegexpTokenizer. For large datasets, using regular expressions could be disastrous given the potential for regex to be quite slow (though this may be outdated). In our case, the small dataset being used does not run into any issues.

Why no stop word removal? Again, partially for simplicity and partially to help highlight shortcomings and strengths of the various models. Due to the sheer simplicity of the model in this notebook, much of the "noise" generate by stop words is mitigated, as each word can only contribute a total count of '1' to the Query - Document similarity score. In future models, we will see that stop words are midigated in other ways.

Similarly, the lack of stemming and extracting higher-order phrase features for use in the document vector representation is primarily for simplicity.

Later in the article, we will also be using the binary BOW generated from the title to create queries. Therefore, the title BOW representation is generated now as well. Titles are preprocessed and tokenized using the same methodology as the document content.

In [9]:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"""[\s,\.\!\?0-9'"-\(\)]+""", gaps=True)
tk_func = lambda content: tokenizer.tokenize(content.lower())

In [10]:

# Content tokenization and Binary BOW
articles['tokenized'] = articles.apply(lambda x: tk_func(x['content']), axis=1)
articles['bow'] = articles.apply(lambda x: set(x['tokenized']), axis=1)

In [11]:

# Content total length and number of unique words
articles['tot_len'] = articles.apply(lambda x: len(x['tokenized']), axis=1)
articles['unique_words'] = articles.apply(lambda x:len(x['bow']), axis=1)

In [12]:

# Title Binary BOW
articles['title_bow'] = articles.apply(lambda x: set(tk_func(x['title'])), axis=1)

In [13]:

g = sns.distplot(articles.tot_len, kde=False);
g.figure.set_size_inches(12,8);
plt.title("Distribution of Total Word Count", size=14);
plt.xlabel('Total Words in Document', size=12);
plt.ylabel('Count', size=12);

In [14]:

articles.tot_len.describe()

Out[14]:

count    1098.000000
mean      584.921676
std       288.545918
min        38.000000
25%       389.000000
50%       532.500000
75%       722.000000
max      2324.000000
Name: tot_len, dtype: float64

In [15]:

g = sns.distplot(articles.unique_words, kde=False);
g.figure.set_size_inches(12,8);
plt.title("Distribution of Unique Word Count", size=14);
plt.xlabel('Unique Words in Document', size=12);
plt.ylabel('Count', size=12);

In [16]:

articles.unique_words.describe()

Out[16]:

count    1098.000000
mean      279.198543
std       108.684471
min        35.000000
25%       209.000000
50%       267.000000
75%       334.750000
max       952.000000
Name: unique_words, dtype: float64

The above plots display the distibutions for total document length and unique words per document, respectively. Both appear to be Negative Bionomial distributions, which would make sense (note that neither has been normalized at this point). Simplistically, we can imagine that a document is a sequence of words. With some probabilitiy $p$, we add an additional random term to the document (a success), thereby increasing its length. With probability $1-p$ we decide not to add a new word (failure) to the document, and thus the document ends. In terms of the Negative Binomial, our specified number of failures before halting the process in this case is 1.

At this point, we also pickle the "articles" data frame for use in later articles:

In [17]:

articles['doc_id'] = range(articles.shape[0])

In [18]:

with open("data/preprocessedArticleContentDump.pkl", "wb") as f:
    pickle.dump(articles, f)

Construct Document & Query BOW Vectors¶

Steps required for this process:

Generate global vocabulary dictionary
Convert each Binary BOW Document word into its respective index value generated from the dictionary
Generate vector for each Document representing the words it contains

To construct the dictionary, document contents and titles are used. Various methods exist for handling unobserved words -- such as those that may arrise from our queries if we did not construct the dictionary using the titles -- but for our method, the outcome is the same regardless. That is, words in queries (titles) that were not observed in any documents do not contribute to a document's score. For making the processing easier, however, we include all terms in the dictionary.

The dictionary (hash table) is then utilized to create a word to integer mapping. From the mapping, each document's binary BOW representation can be converted into an N-dimensional vector, where N is the size of the dictionary. Each word is mapped to a particular dimension in the space. A document containing a particular word will have a value of '1' entered for the word's corresponding dimension in the document's vector representation. If a document does not contain a particular word, a '0' will be entered for that word's respective dimension.

Our particular method takes advantage of the fact that most documents do not contain most of the words in the dictionary by initializing a matrix of 0's with shape (Number of documents, Size of Dictionary). Then, we iterate over each document and replace 0's with 1's in the matrix where appropriate. The result is a (very) sparse matrix of word occurences in documents. We can see that less than 0.8% of the matrix is populated with 1's.

Titles (queries) are converted into the same vector space using the same method. The resulting document and query vectors are then used to calculate query - document similarity scores.

In [19]:

dictionary = set()
_ = articles.apply(lambda x: dictionary.update(x['bow']), axis=1)
_ = articles.apply(lambda x: dictionary.update(x['title_bow']), axis=1)
dictionary = {w : v for (v,w) in zip(range(len(dictionary)), dictionary)}

In [20]:

# Document content BOW vectors
doc_vecs = numpy.zeros(shape=(articles.shape[0], len(dictionary)))
for di,bd in enumerate(articles['bow']):
    for w in bd:
        doc_vecs[di, dictionary[w]] = 1

In [21]:

# Document Query (Title) BOW vectors
query_vecs = numpy.zeros(shape=(articles.shape[0], len(dictionary)))
for di,bd in enumerate(articles['title_bow']):
    for w in bd:
        query_vecs[di, dictionary[w]] = 1

In [22]:

print("Total entries: {0}".format(str(doc_vecs.size)))
print("Total non-zero entries: {0:0.0f}".format(doc_vecs.sum()))
print("Percentage populated: {0:0.2f}%".format(100*(doc_vecs.sum() / doc_vecs.size)))

Total entries: 38263104
Total non-zero entries: 306560
Percentage populated: 0.80%

Calculating Query - Document Scores¶

In this section, we use the dot product to compute the pairwise query - document similarity scores. We use document titles to generate a query for each document. Thus, our query - document scores are proxies for document - document similarty scores mapped into a Text Retrieval paradigm.

This makes evaluating the results particularly simple and interesting, as we can make use of each document's associated category to generate the set of "expected" returned document. While this may greatly over-simplify the Text Retrieval problem, it makes for a particularly straight-forward and intuitive analysis.

Python's scipy library is used for easily calculating the pairwise similarities. Specifically, we make use of the "spatial.distance" module, which can be used for a wide variety of distance / similarity computations.

In [23]:

# Calculate all pairwise dot products between queries and documents
query_doc_sims = scipy.spatial.distance.cdist(query_vecs,
                                              doc_vecs,
                                              numpy.dot)
qds_counts = pandas.DataFrame([{'score':s, 'count':c} \
                               for (s,c) in nltk.FreqDist(query_doc_sims.flatten()).items()])

In [24]:

g = plt.scatter(x='score', y='count', s=40, c='purple', data=qds_counts)
g.figure.set_size_inches(12,8);
plt.title("Query - Document Similarity Scores Based on Title Query", size=14);
plt.xlabel("Query - Document Similarity Score", size=12);
plt.ylabel("Count", size=12)
plt.xlim([-0.5,17.5]);
plt.ylim([-10000, 350000]);

In [25]:

qds_counts.transpose()

Out[25]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17
count	86140.0	300030.0	330439.0	240908.0	134142.0	67403.0	28324.0	10753.0	4262.0	1907.0	787.0	331.0	106.0	44.0	18.0	8.0	1.0	1.0
score	0.0	1.0	2.0	3.0	4.0	5.0	6.0	7.0	8.0	9.0	10.0	11.0	12.0	13.0	14.0	15.0	16.0	17.0

Not so surprisingly given the data, similarity scores top out at 17. This is reasonable given that the maximum score has an upper-bound equal to the maximum title length due to the use of the dot product in calculating the similarity score and the maximum number of unique words in a title (shown below) is 19 words. Most scores are below 10, with only about 1000 hitting 10 or above. This is also in line with the data observed below. Overall, the distribution of scores seems reasonable given the title lengths and the basic idea that most words in the dataset are sparse and therefore not shared between documents (queries / titles).

In [26]:

title_lengths = articles.apply(lambda x: len(x['title_bow']), axis=1)
title_lengths.describe()

Out[26]:

count    1098.000000
mean        9.023679
std         2.562507
min         2.000000
25%         7.000000
50%         9.000000
75%        11.000000
max        19.000000
dtype: float64

Evaluation¶

Utilizing the pre-defined groups from PhysOrg, we can compare Query - Document scores for (Query, Document) pairs in which both the Query and the Document are associated with the same category to (Query, Document) pairs in which the Query is associated one category and the Document is associated with a different category.

Since we are performing these analyses within the context of Information Retrieval and Search Engines, we will evaluate the Top 10 highest scoring Documents for each Query in terms of whether the Query and Documents share the same PhysOrg category.

We will first review the fraction of document types returned for all queries from a given category, and then look to evalute the method explored in this article using the F-Measure.

In [27]:

top10 = query_doc_sims.argsort(axis=1)[:,-10:]
top10 = top10[:, [i for i in reversed(range(10))]]
top10 = pandas.DataFrame(data=top10, columns=[i for i in range(1,11)])
top10['Doc'] = range(top10.shape[0])
top10['Category'] = top10.apply(lambda x: articles.category[x['Doc']], axis=1)

Document Types Returned by Category¶

For this portion of the analysis, we will look into 4 questions:

What percentage of the time is the highest-scored document the document itself?
What percentage of the time is the document itself within the top 10 highest scored documents?
What percentage of the time is a document sharing the document in question's cetegory the top ranked document?
For each category, what is the categorical break-down of documents in the top 10 list across all documents in that category?

The first three questions are answered directly below. Intuitively, we should expect a "good" method to return the respective document for a query most of the time. 75% of the time seems somewhat reasonable, though it would be reasonable to assume that we could do better. In the same vein, a 90% hit rate for returning the associated document in the Top 10 List is respectable, but also leaves much room for improvement. We will primarily use these values to compare to the next method we impliment.

In [28]:

# Doc as first, Doc in top 10
doc_first = sum(top10[1]==top10.index) / top10.shape[0] * 100
doc_top10 = sum(top10.apply(lambda x: x['Doc'] in set(x[0:10]), axis=1)) / top10.shape[0] * 100
print("A query's respective document is returned first {0:0.1f}% of the time".format(doc_first))
print("A query's respective document is returned within the first 10 documents {0:0.1f}% of the time".format(doc_top10))

# Doc Category as first
cat_first = sum(top10.apply(lambda x: articles.category[x[1]]==articles.category[x['Doc']], axis=1)) / top10.shape[0] * 100
print("A query's respective document category is returned first {0:0.1f}% of the time".format(cat_first))

A query's respective document is returned first 73.4% of the time
A query's respective document is returned within the first 10 documents 90.3% of the time
A query's respective document category is returned first 86.7% of the time

In [29]:

def catLookup(x):
    return([articles.category[i] for i in x[1:11]])

def sameCat(x):
    cat = x['Category']
    cats = catLookup(x)
    return([c==cat for c in cats])

def sameCatCount(x):
    same_cats = sameCat(x)
    return(sum(same_cats))

top10['same_cat_count'] = top10.apply(lambda x: sameCatCount(x), axis=1)

In [30]:

# Count occurrences of each Major 8 type in the top 10 list for each Major 8 type
total_counts = []
inds = []
for ind,m8 in enumerate(pandas.unique(articles.category)):
    sub = numpy.array(articles.category==m8, dtype=bool)
    sub_tt = numpy.array(top10.ix[sub][[i for i in range(1,11)]]).flatten()
    sub_tt_categories = articles.category[sub_tt]
    sub_counts = nltk.FreqDist(sub_tt_categories)
    total_counts.append(sub_counts)
    inds.append(m8)
total_counts = pandas.DataFrame(total_counts)
total_counts.index = inds
total_counts = total_counts[inds]

In [31]:

# Frac of total documents in top 10 of each M8, by each M8
total_fracs = numpy.array(total_counts)
total_fracs = numpy.apply_along_axis(lambda x: x / sum(x), 1, total_fracs)
total_fracs = pandas.DataFrame(data = total_fracs, columns=inds, index=inds)

In [32]:

# Plot heatmap of the results
cmap = sns.cubehelix_palette(start=3, rot=0, dark=.12, light=.87, as_cmap=True)

g = sns.heatmap(total_fracs, annot=True, cmap=cmap);
g.figure.set_size_inches(10,8);

plt.xticks(size=12);
plt.yticks(size=12);

For a more comprehensive picture of the method's performance, we analyze the fraction of each document type returned in the Top 10 List for each document type. Each square in the heatmap above represents the fraction of the column category's documents returned in the Top 10 List for a query associated with the row category. For example, for queries associated with the "science-news" category (row 2), about 29% of the total documents returned in the Top 10 Lists were from the "technology-news" category (column 6).

While we see a fairly strong diagonal (i.e. queries returning documents from the same category), the overall performance is not particularly encouraging outside of "technology-news", which also happens to have the largest number of documents in the collection by far.

Evaluation with the F-Measure¶

Eyeballing the data is not particularly conducive for comparing results. Luckily, the IR field has provided some useful measures for evaluating an arbitrary method's performance, the most notable of which are Precision and Recall. Precision evaluates the fraction of returned documents that are from the target class or category, while recall looks at the fraction of target documents in the entire collection returned. More precisly:

precision = (total relevant documents returned) / (total documents returned)

recall = (total relevant documents returned) / (total relevant documents in the collection)

While both of these metrics help evaluate a method, having a single value for comparing methods would be even more useful. The F-Measure combines Precision and Recall into a single metric that can be used to compare methods. The traditional -- or balanced -- F-Measure is calculated as follows:

$$F = 2 \cdot \frac{precision \cdot recall}{precision + recall}$$

Note that this metric is meant to evaluate a single query. We will calculate the average F-Measure across all queries (1) in each category and (2) in our dataset to obtain average F-Measures.

To do this, we first add the total document counts for each category to the "total_counts" data frame. Then, for each query entry in the "top10" data frame, we calculate the F-Measure. After calculating F-Measures for individual queries, we find the mean F-Measure for all queries in a category, and then calculate the overall mean F-Measure.

In [33]:

total_counts['total_docs'] = numpy.array([nltk.FreqDist(articles.category)[c] for c in total_counts.index])

In [34]:

total_counts

Out[34]:

	earth-news	science-news	nanotech-news	physics-news	space-news	technology-news	biology-news	chemistry-news	total_docs
earth-news	284	64	40	40	65	155	124	38	81
science-news	96	395	45	84	66	394	200	70	135
nanotech-news	33	35	266	143	25	129	71	118	82
physics-news	49	50	178	489	47	163	71	123	117
space-news	63	38	22	59	356	109	50	33	73
technology-news	128	213	143	163	112	2275	187	189	341
biology-news	102	121	120	112	75	296	680	144	165
chemistry-news	40	58	146	121	32	200	161	282	104

In [35]:

def recall(rel, tot):
    return(rel / tot)

def precision(rel, ret):
    return(rel / ret)

def fMeasure(rel, ret, tot):
    r = recall(rel, tot)
    p = precision(rel, ret)
    fm = 2 * r * p / (r + p)
    return(fm)

def fFromX(x):
    cat = x['Category']
    tot = total_counts['total_docs'][cat]
    return(fMeasure(x['same_cat_count'], 10, tot))

In [36]:

top10['FMeasure'] = top10.apply(lambda x: fFromX(x), axis=1)

In [37]:

top10.groupby(['Category']).agg({'FMeasure' : 'mean'})

Out[37]:

	FMeasure
Category
biology-news	0.048485
chemistry-news	0.052126
earth-news	0.079772
nanotech-news	0.073966
physics-news	0.067703
science-news	0.043014
space-news	0.119162
technology-news	0.038533

In [38]:

print('Overall F-Measure: {0:0.3f}'.format(numpy.mean(top10.FMeasure)))

Overall F-Measure: 0.056

At this point, we can compare category performances to one another and see that queries from the "space-news" category obtained better F-Measures and any other category. Overall though, the F-Measures are rather disappointing, as the best possible score is 1.

It should be noted that obtaining a recall of '1' for any of the categories with our current dataset and methodology is impossible. For evaluating queries, the method only retrieves 10 documents. Each category contains at least 73 documents. For the dataset, the maximum possible score obtainable is about 0.24, which could potentially be obtained by the "space-news" category. Other categories have lower maximums because they have a larger number of associated documents in the collection.

Wrap-Up and Next Steps¶

In this notebook, we looked at one method for implimenting the "Simplest Vector Space Model Instantiation" using Python. This simple model has three main features:

Bag of Word feature encoding using binary hits for terms
No normalization or weighting applied to terms or documents
Dot product used for calculating similarity scores between documents and queries

A selection of PhysOrg articles was used to demonstrate the model. In our analysis, the dataset was restricted to articles that were associated with the eight most prevelant categories. Each document was associated with a query that was generated from its title.

In evaluating the method, we looked at the Top 10 highest-scoring documents for each query. Using this subset, we looked at the category breakdowns for the returned documents, as well as the average F-Measure for each category and for the entire dataset. Overall, the "Simplest Vector Space Model Instantiation" looked to be lacking in its ability to return documents associated with the same category as the query (i.e., relevant documents).

In a subsequent notebook, we will look to improve the VSM by implimenting various term and document weighting schemes discussed in lectures from the Text Retrieval and Search Engines Coursera course. We will compare that model's performance to the performance of the model from this notebook.

immersinn-ds