immersinn-ds

Mon 12 December 2016

NLP Pipelines with spaCy: Filter & Replace & Map

Posted by TRII in text-analytics   

Introduction / Overview

A semi-hasty post :-)

As a semi-follow up to the previous article, we expand upon the pipeline to build out some custom steps that we use for generating word contexts. spaCy documents lock token attributes, so it is a bit difficult to filter and replace tokens directly. Here, we build some custom "dummy" Token and Document classes that mimic basic functionality of spaCy Tokens and Documents while still being able to link back to the vocabulary (or generate a custom one).

Additionally, we make use of textacy for some document pre-pre-processing, like replacing numbers / digits with text (i.e., "*number*"). To steps of the pipeline used in this article are shown below. The code referenced by several of the steps below, along with some other not so useful stuff, is here in the (not so well named) "vocab_customizer" file.

Pipeline:

  1. Pre-process raw text with textacy
    • Replace contractions with "full" format
    • Replace numbers with common representation
    • Replace emails with common representation
    • Replace currency symbols with acronyms
  2. Convert raw text to spaCy docs
    • Only use "tagger" and "parser"
  3. Filter, Replace, Map:
    1. Filter out unwanted tokens / lexeme
      • Remove non-alpha-like characters
      • Remove stopwords
    2. Replace infrequent tokens
      • Replace infrequent words with POS tags or some other representative symbols
    3. Map
      • Convert the token keys to a shortened list; i.e., the size of the new vocab will be the number of unique token keys observed, not the total number in the spaCy nlp pipeline vocabulary
  4. Get Word Contexts from Documents' Sentences

Our implementation is currently a bit of a "loose" pipeline in that the textacy preprocessing and spaCy transformation steps are "hand coded" to an extent, and there is no code that tracks the overall process of the complete pipe. Such a wrapper may be constructed at a later date if it seems beneficial for future endeavors.

In [1]:
import os
import sys
import pickle
import itertools
In [2]:
from collections import defaultdict, Counter
In [3]:
from importlib import reload
In [4]:
import numpy
import scipy
import pandas
import spacy
from spacy import attrs
import textacy
In [5]:
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
In [6]:
from importlib import reload

Load Articles

Load data stuffs

cd ..

In [8]:
with open('Text Retrieval and Search Engines/data/preprocessedArticleContentDump.pkl', 'rb') as f:
    articles = pickle.load(f)
In [9]:
cd Text\ Mining\ and\ Analytics
/home/immersinn/Dropbox/Analytics/Text Mining and Analytics
In [10]:
articles.head(2)
Out[10]:
URL category content source title tokenized bow tot_len unique_words title_bow doc_id
0 http://phys.org/news/2016-04-activists-appeal-... earth-news Environmental groups on Tuesday lodged a compl... PhysOrg Activists appeal to EU over Polish logging of ... [environmental, groups, on, tuesday, lodged, a... {firs, protests, blocks, in, received, what, a... 385 223 {logging, activists, primeval, polish, to, of,... 0
1 http://phys.org/news/2016-04-seismic-ecuador.html earth-news A doctoral thesis developed at UPM analysed th... PhysOrg The seismic risk of Ecuador [a, doctoral, thesis, developed, at, upm, anal... {in, february, cárdenas, director, benito, wit... 269 152 {the, risk, ecuador, seismic, of} 1

Text to Word Contexts

Here we take each of the steps in the pipeline and carry out that task. The first two steps -- preprocess with textacy and create spaCy docs -- are fairly straightforward.

The components of the third step (Filter, Replace, Map) are covered separately first and then all together. The code shown directly in this file only covers some of the process. For the full code, see the GitHub repository referenced in the Intro.

A basic version of the last step -- create word contexts-- was seen previously.

Preprocess Texts with Textacy

Does what it says. There are several more options for things that can be removed / replaced (e.g. punctuation), but the set of items shown below were sufficient for our purposes.

In [11]:
textacy_preprocessor = lambda text: textacy.preprocess.preprocess_text(text, 
                                                                       no_contractions=True,
                                                                       no_numbers=True,
                                                                       no_emails=True,
                                                                       no_currency_symbols=True)
In [12]:
docs = [textacy_preprocessor(doc) for doc in list(articles.content)[:500]]

spaCy Pipeline

Essentially as shown in the previous article. We do not load the word vectors to save some space / time because we do not need them. Additionally, we truncate the pipeline to only Tagging and Parsing, as we do not need anything else for this particular exercise. These two steps also would allow for using Noun Phrases as Tokens in our Vocabulary / Word Documents below if we were so inclined.

In [13]:
nlp = spacy.load("en", add_vectors=False)
In [14]:
nlp.pipeline = [nlp.tagger, nlp.parser]
In [15]:
spacy_docs = [doc for doc in nlp.pipe(docs, batch_size=250, n_threads=4)]

Filter, Replace, Map

In [16]:
from text_analytics import vocab_customizer

In the previous post dealing with spaCy, an example of a vocabulary customizer class was shown. We now use an updated and modified version of that example to generate a restricted vocabulary for our data. As mentioned above, the code used can be found here. Each of the classes follows a scikit-learn style of "fit" and "transform", though this model is not entirely appropriate. Typically we will not be looking at documents outside of the original document set for this particular pipeline. Additionally functionality would be needed to account for this. For instance, we would need to cover the case of encountering unseen tokens.

First, we Filter out words we do not want to have in our vocabulary. For this example, we use a subset of the NLTK English Stopwords list. Notice that we add back in pronouns, prepositions, and verbs, as well as the possessive "ending", "'s". For finding Paradigmatic and Syntagmatic similarities, we feel that these tokens may be important.

Our second step is to Replace a subset of the vocabulary. Specifically, we replace words that appear in a small number of documents (original documents, not word documents) with the respective Part of Speech (POS) Tag for the Token. Note that we do not necessarily replace all occurrences of an infrequent word with the same POS Tag. Instead, we look at each occurrence of that Token and replace each instance of the Token with the respective POS Tag assigned to it by spaCy. Also note that we do not look at the total frequency of words, but the Document Frequency. This step removes a number of peoples' names, as well as infrequent words in general.

The third step takes our now Filtered and Replaced documents and creates a mapping between consecutive integers and the spaCy lemmas associated with the Tokens (POS tags are assigned to lemmas in the spaCy vocabulary as well). This allows for a compact representation of the remaining tokens in the documents, which enables us to easily map the words, documents, and word-documents to a set of indices for referencing a matrix (array).

Additionally, we only construct contexts and Word-Documents for words that were neither filtered nor replaced, nor were the resulting mapping for a replaced word (i.e. no Word Documents are constructed for POS Tags). This means we need an additional mapping from our truncated vocabulary to the Word-Document Identifiers. We cover this step below as well.

Filter

In [18]:
import nltk
In [19]:
custom_stop_words = nltk.corpus.stopwords.words('english')
In [20]:
stops_to_remove = ['i', 'me', 'we', 'you', 'he', 'him', 'himself', 'she', 'her', 'herself',
                   'it', 'itself', 'they', 'them', 'themselves',
                   'ourselves', 'yourselves'
                   'about', 'against', 'between', 'into', 'through',
                   'during', 'before', 'after', 'above', 'below',
                   'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off',
                   'over', 'under', 'again', 'further',
                   'has', 'had', 'having', 
                   'am', 'is', 'are', 'was', 'were', 'will', 'can',
                   'be', 'been', 'being',
                   'have', 'has', 'had', 'having',
                   'do', 'does', 'did', 'doing',
                   
                   "'s"]
In [21]:
custom_stop_words = [w for w in custom_stop_words if w not in stops_to_remove]

Here we set the list of custom stopwords we want to use. First, we need to un-flag all default stopwords in the spaCy vocabulary. Then, all words in the custom list are flagged as stops in the spaCy vocabulary.

In [22]:
# Remove existing stopwords...
for word in nlp.Defaults.stop_words:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = False
    word = word[0].upper() + word[1:]
    lexeme = nlp.vocab[word]
    lexeme.is_stop = False

# and set custom stopwords...
for word in custom_stop_words:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True
    word = word[0].upper() + word[1:]
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True
In [23]:
lex_filter = lambda l: vocab_customizer.lexeme_filter(l, 
                                                      filter_stop=True,
                                                      filter_digit=False,
                                                      filter_like_num=False,
                                                      filter_short=False)
In [24]:
filter01 = vocab_customizer.TokenFilter()
In [25]:
documents = [vocab_customizer.DummyDoc([[vocab_customizer.DummyToken(w) for w in sent] \
                                        for sent in doc.sents]) \
             for doc in spacy_docs]
In [26]:
[len(sent) for sent in documents[0].sents]
Out[26]:
[32, 49, 31, 49, 52, 26, 37, 21, 34, 20, 14, 4, 32, 46, 45, 1, 4]
In [27]:
filter01.fit(documents)
filter01.transform(documents)
In [28]:
[len(sent) for sent in documents[0].sents]
Out[28]:
[22, 32, 20, 34, 31, 14, 24, 12, 22, 13, 11, 2, 18, 26, 21, 0, 2]

Replace

In [29]:
min_freq = 10
replace_attr = 'pos'
In [30]:
key_2_attr_map = {'lower' : attrs.LOWER,
                  'orth' : attrs.ORTH,
                  'pos' : attrs.POS,}

kem = lambda x: vocab_customizer.count_by_transform(x, attr=key_2_attr_map['lower'])
In [31]:
docs_keys = [kem(document) for document in spacy_docs]
          
# Count total documents each key occurs in
doc_frequency = defaultdict(int)
for doc_keys in docs_keys:
    for key in doc_keys:
        doc_frequency[key] += 1
ift = [k for k,c in doc_frequency.items() if c < min_freq]
In [32]:
def replace_func(token, ift, replace_attr):
    if token.key_attr in ift:
        token.key_attr = token.__getattribute__(replace_attr)
        
token_replace=lambda x: replace_func(x, ift, replace_attr)
In [33]:
replace01 = vocab_customizer.TokenReplaceInfrequent(min_freq=min_freq, token_replace_attr=replace_attr)
In [34]:
documents = [vocab_customizer.DummyDoc([[vocab_customizer.DummyToken(w) for w in sent] \
                                        for sent in doc.sents]) \
             for doc in spacy_docs]
In [35]:
for sent in documents[0].sents[:3]:
    prt_wrds = [(w.lower_, w.key_attr) for w in sent[:7]]
    print(prt_wrds)
[('environmental', 7091), ('groups', 1924), ('on', 485), ('tuesday', 20152), ('lodged', 32735), ('a', 469), ('complaint', 5183)]
[('"', 481), ('we', 535), ('risk', 1568), ('turning', 2539), ('this', 496), ('forest', 5699), ('into', 586)]
[('poland', 46560), ("'s", 478), ('environment', 2376), ('minister', 11907), ('jan', 51381), ('szyszko', 1510247), ('last', 691)]
In [36]:
replace01.fit(documents)
replace01.transform(documents)
In [37]:
for sent in documents[0].sents[:3]:
    prt_wrds = [(w.lower_, w.key_attr) for w in sent[:7]]
    print(prt_wrds)
[('environmental', 7091), ('groups', 1924), ('on', 485), ('tuesday', 20152), ('lodged', 97), ('a', 469), ('complaint', 89)]
[('"', 481), ('we', 535), ('risk', 1568), ('turning', 2539), ('this', 496), ('forest', 89), ('into', 586)]
[('poland', 93), ("'s", 478), ('environment', 2376), ('minister', 93), ('jan', 93), ('szyszko', 93), ('last', 691)]
In [38]:
for id_ in [89, 93, 97]:
    print("ID: {}; POS: {}".format(id_, nlp.vocab[id_].lower_))
ID: 89; POS: noun
ID: 93; POS: propn
ID: 97; POS: verb

Map

In [39]:
documents = [vocab_customizer.DummyDoc([[vocab_customizer.DummyToken(w) for w in sent] \
                                        for sent in doc.sents]) \
             for doc in spacy_docs]
In [40]:
for sent in documents[0].sents[:3]:
    prt_wrds = [(w.lower_, w.key_attr) for w in sent[:7]]
    print(prt_wrds)
[('environmental', 7091), ('groups', 1924), ('on', 485), ('tuesday', 20152), ('lodged', 32735), ('a', 469), ('complaint', 5183)]
[('"', 481), ('we', 535), ('risk', 1568), ('turning', 2539), ('this', 496), ('forest', 5699), ('into', 586)]
[('poland', 46560), ("'s", 478), ('environment', 2376), ('minister', 11907), ('jan', 51381), ('szyszko', 1510247), ('last', 691)]
In [41]:
vocab_keys = set()
unit = "sents"
for doc in documents:
    if unit=="sents":
        keys = [w.key_attr for sent in doc.sents for w in sent]
    if unit=="words":
        keys = [w.key_attr for w in doc]
    vocab_keys.update(set(keys))
In [42]:
vocab_encoding = {key:i for i,key in enumerate(vocab_keys)}
rev_vocab_encoding = {i:k for k,i in vocab_encoding.items()}

vocab = {i : nlp.vocab[rev_vocab_encoding[i]] \
         for i in vocab_encoding.values()}

word_lookup = {i : nlp.vocab[rev_vocab_encoding[i]].lower_ \
               for (key,i) in vocab_encoding.items()}
In [43]:
def encode_word(token):
    token.key_attr = vocab_encoding[token.key_attr]

def encode_doc(doc):
    if unit=="sents":
        for sent in doc.sents:
            for w in sent:
                encode_word(w)
    elif unit=="words":
        for w in doc.tokens:
            encode_word(w)

def encode_docs(docs):
    for doc in documents:
        encode_doc(doc)
In [44]:
encode_docs(documents)
In [45]:
for sent in documents[0].sents[:3]:
    prt_wrds = [(w.lower_, w.key_attr) for w in sent[:7]]
    print(prt_wrds)
[('environmental', 6686), ('groups', 1581), ('on', 175), ('tuesday', 15118), ('lodged', 19967), ('a', 159), ('complaint', 4789)]
[('"', 171), ('we', 225), ('risk', 1238), ('turning', 2169), ('this', 186), ('forest', 5305), ('into', 276)]
[('poland', 11467), ("'s", 168), ('environment', 2011), ('minister', 10251), ('jan', 14327), ('szyszko', 2526), ('last', 379)]

Filter & Replace & Map

Here we throw the previous three steps together into as single transformation object. This object tracks the entire process and allows us to trace encodings back to words for those words that were not replaced. Replaced words are inherently "lost" in the replacement process because it is possible for multiple tokens to be encoded to the same token during the replace process.

For comparison, we show the original spaCy version and the processed version for the first document in the collection.

In [46]:
vm = vocab_customizer.VocabPrep(nlp, stopwords=custom_stop_words)
In [47]:
vm.fit(spacy_docs)
In [48]:
for sent in spacy_docs[0].sents:
    print(len(sent))
    prt_snt = [w.lex_id for w in sent][:10]
    print(prt_snt)
    print([w.lower_ for w in sent][:10])
32
[28656, 1467, 22, 5629, 32314, 6, 4734, 26, 3, 2753]
['environmental', 'groups', 'on', 'tuesday', 'lodged', 'a', 'complaint', 'with', 'the', 'european']
49
[18, 193, 1109, 2082, 33, 5252, 124, 6, 1950, 35261]
['"', 'we', 'risk', 'turning', 'this', 'forest', 'into', 'a', 'tree', 'plantation']
31
[8013, 15, 36042, 12393, 12690, 0, 230, 618, 682, 3]
['poland', "'s", 'environment', 'minister', 'jan', 'szyszko', 'last', 'month', 'gave', 'the']
49
[54, 18909, 11, 1941, 50, 3, 1503, 1191, 8, 3]
['the', 'commission', 'is', 'concerned', 'about', 'the', 'recent', 'decision', 'of', 'the']
52
[5804, 3, 190, 783, 2, 66193, 71, 10777, 66, 93]
['under', 'the', 'new', 'plan', ',', 'loggers', 'will', 'harvest', 'more', 'than']
26
[54, 41488, 166, 447, 26, 3, 489, 15, 14136, 2]
['the', 'environmentalists', 'take', 'issue', 'with', 'the', 'government', "'s", 'rationale', ',']
37
[18, 19677, 5, 61, 3, 11466, 8, 1919, 420, 2]
['"', 'contrary', 'to', 'what', 'the', 'minister', 'of', 'environment', 'says', ',']
21
[54, 1467, 171, 3, 9197, 783, 2, 125, 100, 1727]
['the', 'groups', 'said', 'the', 'logging', 'plan', ',', 'which', 'could', 'begin']
34
[18, 193, 48, 30, 2164, 33, 1191, 415, 10182, 675]
['"', 'we', 'can', 'not', 'challenge', 'this', 'decision', 'under', 'polish', 'law']
20
[18, 222, 3, 524, 2, 40192, 8, 2522, 1376, 675]
['"', 'in', 'the', 'past', ',', 'breaches', 'of', 'eu', 'nature', 'law']
14
[193, 363, 3, 11466, 71, 12346, 170, 33, 39343, 5252]
['we', 'hope', 'the', 'minister', 'will', 'reconsider', 'before', 'this', 'irreplaceable', 'forest']
4
[18, 326403, 863, 23]
['"', 'sprawling', 'across', '*']
32
[31202, 23, 99278, 43, 173, 23, 31202, 23, 19108, 24]
['number', '*', 'hectares', '(', 'around', '*', 'number', '*', 'acres', ')']
46
[78855, 6, 84861, 1755, 25621, 770, 14, 23, 31202, 23]
['designated', 'a', 'unesco', 'world', 'heritage', 'site', 'in', '*', 'number', '*']
45
[1749, 15, 23942, 2648, 2, 78085, 46382, 23, 31202, 23]
['europe', "'s", 'tallest', 'trees', ',', 'firs', 'towering', '*', 'number', '*']
1
[69433]
['©']
4
[23, 31202, 23, 64878]
['*', 'number', '*', 'afp']
In [49]:
new_docs = vm.transform(spacy_docs)
In [50]:
for sent in new_docs[0].sents:
    print(len(sent))
    print(sent[:10])
    print([vm.vocab2word[w] for w in sent][:10])
22
[2310, 1023, 63, 1727, 16, 11, 1724, 14, 138, 14]
['environmental', 'groups', 'on', 'tuesday', 'verb', 'noun', 'european', 'propn', 'over', 'propn']
32
[95, 814, 1304, 11, 127, 11, 11, 2277, 902, 1094]
['we', 'risk', 'turning', 'noun', 'into', 'noun', 'noun', 'reducing', 'natural', 'heritage']
20
[14, 1234, 14, 14, 14, 208, 501, 545, 120, 936]
['propn', 'environment', 'propn', 'propn', 'propn', 'last', 'month', 'gave', 'go', 'ahead']
34
[14, 60, 16, 1045, 863, 4, 19, 1423, 1234, 11]
['propn', 'is', 'verb', 'recent', 'decision', 'adj', 'authorities', 'eu', 'environment', 'noun']
31
[352, 177, 620, 11, 93, 16, 47, 4, 11, 47]
['under', 'new', 'plan', 'noun', 'will', 'verb', 'number', 'adj', 'noun', 'number']
14
[11, 159, 380, 416, 11, 238, 4, 11, 11, 60]
['noun', 'take', 'issue', 'government', 'noun', 'saying', 'adj', 'noun', 'noun', 'is']
24
[4, 56, 11, 1234, 355, 11, 11, 11, 106, 16]
['adj', 'to', 'noun', 'environment', 'says', 'noun', 'noun', 'noun', 'does', 'verb']
12
[1023, 163, 11, 620, 109, 1144, 529, 11, 16, 1423]
['groups', 'said', 'noun', 'plan', 'could', 'begin', 'early', 'noun', 'verb', 'eu']
22
[95, 75, 1333, 863, 352, 4, 539, 16, 56, 14]
['we', 'can', 'challenge', 'decision', 'under', 'adj', 'law', 'verb', 'to', 'propn']
13
[61, 439, 11, 1423, 979, 539, 62, 1539, 56, 4]
['in', 'past', 'noun', 'eu', 'nature', 'law', 'have', 'led', 'to', 'adj']
11
[95, 307, 11, 93, 16, 162, 4, 11, 60, 471]
['we', 'hope', 'noun', 'will', 'verb', 'before', 'adj', 'noun', 'is', 'lost']
2
[16, 668]
['verb', 'across']
18
[47, 11, 165, 47, 11, 14, 11, 16, 138, 4]
['number', 'noun', 'around', 'number', 'noun', 'propn', 'noun', 'verb', 'over', 'adj']
26
[16, 14, 244, 1094, 609, 61, 47, 11, 60, 327]
['verb', 'propn', 'world', 'heritage', 'site', 'in', 'number', 'noun', 'is', 'home']
21
[1122, 4, 1500, 6, 16, 47, 11, 259, 47, 975]
['europe', 'adj', 'trees', 'adv', 'verb', 'number', 'noun', 'high', 'number', 'feet']
2
[47, 1723]
['number', 'afp']
In [51]:
vm.n_words
Out[51]:
2428
In [52]:
vm.n_total
Out[52]:
2441

Generate Word Documents From Word Contexts

For determining Paradigmatic and Syntagmatic relationships, a document is generated for each word of interest in the vocabulary. The "word document" is constructed from all word contexts within the corpus of interest. From each sentence a particular word occurs in, a context is extracted. This is typically some window around the word, consisting of other words that are observed both before and after the target word. All contexts for the word are combined to construct a single BOW document for the word.

As a note, we only find contexts and construct Word Documents for words that were neither filtered out nor replaced. Thus, at this step, we need an additional mapping between the vocab and the documents. Word Documents are simply numbered from 1 to the total number of Word Documents.

Generate Word Docs TDM

In [187]:
def symWindowContext(target, sent, window_size=3):
    context = list(sent[max(0, target-window_size):target]) + \
    list(sent[min(len(sent),target+1):min(len(sent), (target+window_size+1))])
    return(context)


def contextGen(doc, vocab, wcc, ws, cutoff_len=2):
    for sent in doc.sents:
        if len(sent) > cutoff_len:
            for i,word in enumerate(sent):
                if word in vocab:
                    context = symWindowContext(i, sent, window_size=ws)
                    for c in context:
                        wcc[word][c] += 1
    return(wcc)


def wordBOWGenerator(docs, vocab, window_size=3):
    
    # Each word is a "document"
    word_bows = {k : defaultdict(int) for k in vocab}
    for doc in docs:
        word_bows = contextGen(doc, vocab,
                               word_bows,
                               ws=window_size)
    # "Lock" the defaultdicts
    word_bows = {w : dict(c) for w,c in word_bows.items()}
    
    return(word_bows)


def contextMatrixGenerator(word_docs, n_docs, n_vocab):
    
    data, iii, jjj = \
    zip(*[(c,doc['doc_id'],w2) for doc in word_docs for w2,c in doc['bow'].items()])
    
    mat = scipy.sparse.coo_matrix((data, (iii, jjj)),
                                  shape=((n_docs, n_vocab)),
                                  dtype=numpy.int32)
    mat = scipy.sparse.csr_matrix(mat)
    
    return(mat)
In [174]:
word_bows = wordBOWGenerator(new_docs, vm.vocab, window_size=1)
In [179]:
word_docs = [{'name':w, 'bow':word_bows[w], 'doc_id':i} for i,w in enumerate(word_bows.keys())]
In [189]:
word_contexts_matrix = contextMatrixGenerator(word_docs, len(word_docs), vm.n_total)
In [193]:
word_contexts_matrix.shape
Out[193]:
(2428, 2441)
In [127]:
sums = pandas.Series([word_contexts_matrix.getrow(i).toarray().sum() \
                      for i in range(word_contexts_matrix.shape[0])])
sums.index = range(sums.shape[0])
In [128]:
sums.describe()
Out[128]:
count     2428.000000
mean       124.437397
std        485.552693
min          0.000000
25%         34.000000
50%         54.000000
75%        104.000000
max      15271.000000
dtype: float64

Pre-process T(W)DM / Word Contexts Matrix with BM25

For calculating Paradigmatic Similarity Scores we want to use the BM25 method. To make pairwise similarity calculations more efficient, we can pre-compute the appropriate BM25 word weightings for each document in the collection. This is similar to the previous BM25 article. Weighted term vectors for each Word Document are stacked into a matrix.

In [72]:
def idfCalc(df, n_vocab, n_docs, method="smooth"):
    """Calculate the idf vector from the vocab"""
    idf = numpy.zeros((n_vocab
                       ,))
    if method == "smooth":
        for w,c in df.items():
            idf[w] = max(0,
                         numpy.log((n_docs - c + 0.5) / (c + 0.5))
                        )
    elif method=="basic":
        for w,c in df.items():
            idf[w] = c
        idf /= n_docs
    return(idf)
In [61]:
def transformBM25(vec, idf, avgdl, k1=1.2, b=0.75):
    """Weight a document BOW vetor based on BM25 methodology"""
    
    def pdlCalc(doc_len):
        """PDL Normalization Calculation"""
        return(1 - b + b * doc_len / avgdl)
    
    doc_len = vec.sum()
    d_pdl = pdlCalc(doc_len)
    vec = idf * (vec * (k1 + 1)) / (vec + k1 * d_pdl)
    vec = vec / vec.sum()
    return(vec)
In [129]:
doc_freqs = Counter()
for bow in word_bows.values():
    doc_freqs.update(bow.keys())
doc_freqs = dict(doc_freqs)
In [130]:
idf = scipy.sparse.lil_matrix((vm.n_total, vm.n_total))
In [131]:
idf.setdiag(idfCalc(doc_freqs, vm.n_total, vm.n_words))
In [132]:
avgdl = sums.sum() / len(sums)
In [133]:
bm25_trans_func = lambda x: transformBM25(x, idf.diagonal(), avgdl)
In [134]:
# Basic form, with removed '0-context' words
# May be a better way to loop this?
bm25_vecs = scipy.sparse.vstack(\
                                scipy.sparse.coo_matrix(bm25_trans_func(word_contexts_matrix.getrow(i).toarray())) \
                                for i in range(word_contexts_matrix.shape[0]) \
                                if word_contexts_matrix.getrow(i).sum() > 0
                               )
In [135]:
bm25_vecs = scipy.sparse.csr_matrix(bm25_vecs)
In [136]:
bm25_vecs.prune()
In [137]:
bm25_vecs.shape
Out[137]:
(2427, 2441)

Paradigmatic Similarity

Below, we explore some preliminary Paradigmatic similarity measures using the above data. By utilizing the BM25 precomputed matrix, we can calculate similarity scores for Word Documents. We also plot the distribution of log base 10 similarity scores below.

Generate Word Document Similarity Scores

In [138]:
%%time

# Sim(d1,d2) = sum_over_i
word_sims = numpy.dot(numpy.dot(bm25_vecs, idf), bm25_vecs.transpose())
CPU times: user 160 ms, sys: 4 ms, total: 164 ms
Wall time: 164 ms
In [139]:
word_sims.prune()
In [140]:
word_sims.shape
Out[140]:
(2427, 2427)

Get Unique Non-zero entries

The similarity matrix is symmetric and contains two entries for each pair or word-documents. Here we locate all non-zero similarity measures and remove duplicates. The strict "less than" requirement also removes Word-Document self-similarity scores as well.

In [141]:
nz = word_sims.nonzero()
In [142]:
len(nz[0])
Out[142]:
4404499
In [143]:
matches = nz[0] < nz[1]
In [144]:
nz = (nz[0][matches], nz[1][matches])
In [145]:
len(nz[0])
Out[145]:
2201036
In [146]:
dist = word_sims[nz]
In [147]:
dist = numpy.array(dist)[0,:]
In [148]:
g = sns.distplot(numpy.log10(dist), kde=False, color='purple');
g.figure.set_size_inches(12,8);
plt.title("Log10 of Word Similarity Scores, BM25", size=14);
plt.xlabel("Log10 BM25 Similarity Score", size=12);
plt.ylabel("Count", size=12);
In [149]:
pandas.Series(dist).describe([.25, .50, .75, .9, .95, 0.995])
Out[149]:
count    2.201036e+06
mean     2.768429e-03
std      4.462091e-03
min      7.221888e-07
25%      4.303217e-04
50%      1.494851e-03
75%      3.309476e-03
90%      6.512896e-03
95%      9.765315e-03
99.5%    2.674414e-02
max      4.357798e-01
dtype: float64