NLP Pipelines with spaCy: Filter & Replace & Map
Posted by TRII in text-analytics
Introduction / Overview¶
A semi-hasty post :-)
As a semi-follow up to the previous article, we expand upon the pipeline to build out some custom steps that we use for generating word contexts. spaCy documents lock token attributes, so it is a bit difficult to filter and replace tokens directly. Here, we build some custom "dummy" Token and Document classes that mimic basic functionality of spaCy Tokens and Documents while still being able to link back to the vocabulary (or generate a custom one).
Additionally, we make use of textacy for some document pre-pre-processing, like replacing numbers / digits with text (i.e., "*number*"). The steps of the pipeline used in this article are shown below. The code referenced by several of the steps below, along with some other not so useful stuff, is here in the (not so well named) "vocab_customizer" file.
Pipeline:
- Pre-process raw text with textacy
  - Replace contractions with "full" format
  - Replace numbers with common representation
  - Replace emails with common representation
  - Replace currency symbols with acronyms
- Convert raw text to spaCy docs
  - Only use "tagger" and "parser"
- Filter, Replace, Map:
  - Filter out unwanted tokens / lexemes
    - Remove non-alpha-like characters
    - Remove stopwords
  - Replace infrequent tokens
    - Replace infrequent words with POS tags or some other representative symbols
  - Map
    - Convert the token keys to a shortened list; i.e., the size of the new vocab will be the number of unique token keys observed, not the total number in the spaCy nlp pipeline vocabulary
- Get Word Contexts from Documents' Sentences
Our implementation is currently a bit of a "loose" pipeline in that the textacy preprocessing and spaCy transformation steps are "hand coded" to an extent, and there is no code that tracks the overall process of the complete pipe. Such a wrapper may be constructed at a later date if it seems beneficial for future endeavors.
import os
import sys
import pickle
import itertools
from collections import defaultdict, Counter
from importlib import reload
import numpy
import scipy
import pandas
import spacy
from spacy import attrs
import textacy
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
Load Articles¶
Load data stuffs¶
cd ..
with open('Text Retrieval and Search Engines/data/preprocessedArticleContentDump.pkl', 'rb') as f:
articles = pickle.load(f)
cd Text\ Mining\ and\ Analytics
articles.head(2)
Text to Word Contexts¶
Here we take each of the steps in the pipeline and carry out that task. The first two steps -- preprocess with textacy and create spaCy docs -- are fairly straightforward.
The components of the third step (Filter, Replace, Map) are covered separately first and then all together. The code shown directly in this file only covers some of the process. For the full code, see the GitHub repository referenced in the Intro.
A basic version of the last step -- create word contexts -- was seen previously.
Preprocess Texts with Textacy¶
Does what it says. There are several more options for things that can be removed / replaced (e.g. punctuation), but the set of items shown below was sufficient for our purposes.
textacy_preprocessor = lambda text: textacy.preprocess.preprocess_text(text,
no_contractions=True,
no_numbers=True,
no_emails=True,
no_currency_symbols=True)
docs = [textacy_preprocessor(doc) for doc in list(articles.content)[:500]]
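# A quick, hypothetical peek at what the preprocessing does to a made-up snippet;
# the exact placeholder strings (e.g. "*number*") depend on the textacy version.
textacy_preprocessor("I can't pay $5 for 3 apples -- email me at someone@example.com")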
spaCy Pipeline¶
Essentially as shown in the previous article. We do not load the word vectors, since we do not need them, which saves some space and time. Additionally, we truncate the pipeline to only Tagging and Parsing, as we do not need anything else for this particular exercise. These two steps would also allow for using Noun Phrases as Tokens in our Vocabulary / Word Documents below if we were so inclined.
nlp = spacy.load("en", add_vectors=False)
nlp.pipeline = [nlp.tagger, nlp.parser]
spacy_docs = [doc for doc in nlp.pipe(docs, batch_size=250, n_threads=4)]
Filter, Replace, Map¶
from text_analytics import vocab_customizer
In the previous post dealing with spaCy, an example of a vocabulary customizer class was shown. We now use an updated and modified version of that example to generate a restricted vocabulary for our data. As mentioned above, the code used can be found here. Each of the classes follows a scikit-learn style of "fit" and "transform", though this model is not entirely appropriate: typically we will not be looking at documents outside of the original document set for this particular pipeline, and additional functionality would be needed to account for that. For instance, we would need to cover the case of encountering unseen tokens.
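Before walking through the steps, it helps to have a rough mental model of the DummyToken and DummyDoc wrappers used below. They are not reproduced in full here; the sketch below is a simplified, hypothetical version (the real classes live in the vocab_customizer module), illustrating the idea of a mutable key_attr alongside read-only references back to the underlying spaCy Token.

# Simplified, hypothetical sketches of the DummyToken / DummyDoc classes used below.
class DummyToken(object):
    def __init__(self, token, key_attr='lower'):
        self.token = token                        # original spaCy Token
        self.lower_ = token.lower_                # convenience passthroughs
        self.lower = token.lower
        self.pos = token.pos
        self.key_attr = getattr(token, key_attr)  # mutable key used by Filter / Replace / Map

class DummyDoc(object):
    def __init__(self, sents):
        self.sents = sents                        # list of sentences, each a list of DummyTokens

    @property
    def tokens(self):
        # flat list of all tokens in the document
        return [w for sent in self.sents for w in sent]

    def __iter__(self):
        return iter(self.tokens)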
First, we Filter out words we do not want to have in our vocabulary. For this example, we use a subset of the NLTK English Stopwords list. Notice that we add back in pronouns, prepositions, and verbs, as well as the possessive "ending", "'s". For finding Paradigmatic and Syntagmatic similarities, we feel that these tokens may be important.
Our second step is to Replace a subset of the vocabulary. Specifically, we replace words that appear in a small number of documents (original documents, not word documents) with the respective Part of Speech (POS) Tag for the Token. Note that we do not necessarily replace all occurrences of an infrequent word with the same POS Tag. Instead, we look at each occurrence of that Token and replace each instance with the respective POS Tag assigned to it by spaCy. Also note that we do not look at the total frequency of words, but the Document Frequency. This step removes a number of people's names, as well as infrequent words in general.
The third step takes our now Filtered and Replaced documents and creates a mapping between consecutive integers and the spaCy lemmas associated with the Tokens (POS tags are assigned to lemmas in the spaCy vocabulary as well). This allows for a compact representation of the remaining tokens in the documents, which enables us to easily map the words, documents, and word-documents to a set of indices for referencing a matrix (array).
Additionally, we only construct contexts and Word-Documents for words that were neither filtered nor replaced; no Word Documents are constructed for the replacement symbols themselves (i.e., the POS Tags). This means we need an additional mapping from our truncated vocabulary to the Word-Document identifiers. We cover this step below as well.
Filter¶
import nltk
custom_stop_words = nltk.corpus.stopwords.words('english')
stops_to_remove = ['i', 'me', 'we', 'you', 'he', 'him', 'himself', 'she', 'her', 'herself',
'it', 'itself', 'they', 'them', 'themselves',
                   'ourselves', 'yourselves',
                   'about', 'against', 'between', 'into', 'through',
'during', 'before', 'after', 'above', 'below',
'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off',
'over', 'under', 'again', 'further',
'has', 'had', 'having',
'am', 'is', 'are', 'was', 'were', 'will', 'can',
'be', 'been', 'being',
'have', 'has', 'had', 'having',
'do', 'does', 'did', 'doing',
"'s"]
custom_stop_words = [w for w in custom_stop_words if w not in stops_to_remove]
Here we set the list of custom stopwords we want to use. First, we need to un-flag all default stopwords in the spaCy vocabulary. Then, all words in the custom list are flagged as stops in the spaCy vocabulary.
# Remove existing stopwords...
for word in nlp.Defaults.stop_words:
lexeme = nlp.vocab[word]
lexeme.is_stop = False
word = word[0].upper() + word[1:]
lexeme = nlp.vocab[word]
lexeme.is_stop = False
# and set custom stopwords...
for word in custom_stop_words:
lexeme = nlp.vocab[word]
lexeme.is_stop = True
word = word[0].upper() + word[1:]
lexeme = nlp.vocab[word]
lexeme.is_stop = True
lex_filter = lambda l: vocab_customizer.lexeme_filter(l,
filter_stop=True,
filter_digit=False,
filter_like_num=False,
filter_short=False)
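# Hypothetical sketch of what a lexeme_filter-style predicate might check, given the
# keyword arguments above (the real function lives in vocab_customizer and may differ,
# e.g. it presumably also drops non-alpha-like lexemes):
def lexeme_filter_sketch(lexeme, filter_stop=True, filter_digit=False,
                         filter_like_num=False, filter_short=False, min_len=2):
    if filter_stop and lexeme.is_stop:
        return False
    if filter_digit and lexeme.is_digit:
        return False
    if filter_like_num and lexeme.like_num:
        return False
    if filter_short and len(lexeme.lower_) < min_len:
        return False
    return True  # keep the lexeme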
filter01 = vocab_customizer.TokenFilter()
documents = [vocab_customizer.DummyDoc([[vocab_customizer.DummyToken(w) for w in sent] \
for sent in doc.sents]) \
for doc in spacy_docs]
[len(sent) for sent in documents[0].sents]
filter01.fit(documents)
filter01.transform(documents)
[len(sent) for sent in documents[0].sents]
Replace¶
min_freq = 10
replace_attr = 'pos'
key_2_attr_map = {'lower' : attrs.LOWER,
'orth' : attrs.ORTH,
'pos' : attrs.POS,}
kem = lambda x: vocab_customizer.count_by_transform(x, attr=key_2_attr_map['lower'])
docs_keys = [kem(document) for document in spacy_docs]
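# count_by_transform comes from the helper module; presumably it is a thin wrapper
# around spaCy's Doc.count_by, roughly equivalent to (an assumption, not the actual code):
#     docs_keys = [doc.count_by(attrs.LOWER) for doc in spacy_docs]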
# Count total documents each key occurs in
doc_frequency = defaultdict(int)
for doc_keys in docs_keys:
for key in doc_keys:
doc_frequency[key] += 1
ift = [k for k,c in doc_frequency.items() if c < min_freq]
def replace_func(token, ift, replace_attr):
if token.key_attr in ift:
token.key_attr = token.__getattribute__(replace_attr)
token_replace=lambda x: replace_func(x, ift, replace_attr)
replace01 = vocab_customizer.TokenReplaceInfrequent(min_freq=min_freq, token_replace_attr=replace_attr)
documents = [vocab_customizer.DummyDoc([[vocab_customizer.DummyToken(w) for w in sent] \
for sent in doc.sents]) \
for doc in spacy_docs]
for sent in documents[0].sents[:3]:
prt_wrds = [(w.lower_, w.key_attr) for w in sent[:7]]
print(prt_wrds)
replace01.fit(documents)
replace01.transform(documents)
for sent in documents[0].sents[:3]:
prt_wrds = [(w.lower_, w.key_attr) for w in sent[:7]]
print(prt_wrds)
for id_ in [89, 93, 97]:
print("ID: {}; POS: {}".format(id_, nlp.vocab[id_].lower_))
Map¶
documents = [vocab_customizer.DummyDoc([[vocab_customizer.DummyToken(w) for w in sent] \
for sent in doc.sents]) \
for doc in spacy_docs]
for sent in documents[0].sents[:3]:
prt_wrds = [(w.lower_, w.key_attr) for w in sent[:7]]
print(prt_wrds)
vocab_keys = set()
unit = "sents"
for doc in documents:
if unit=="sents":
keys = [w.key_attr for sent in doc.sents for w in sent]
if unit=="words":
keys = [w.key_attr for w in doc]
vocab_keys.update(set(keys))
vocab_encoding = {key:i for i,key in enumerate(vocab_keys)}
rev_vocab_encoding = {i:k for k,i in vocab_encoding.items()}
vocab = {i : nlp.vocab[rev_vocab_encoding[i]] \
for i in vocab_encoding.values()}
word_lookup = {i : nlp.vocab[rev_vocab_encoding[i]].lower_ \
for (key,i) in vocab_encoding.items()}
def encode_word(token):
token.key_attr = vocab_encoding[token.key_attr]
def encode_doc(doc):
if unit=="sents":
for sent in doc.sents:
for w in sent:
encode_word(w)
elif unit=="words":
for w in doc.tokens:
encode_word(w)
def encode_docs(docs):
    for doc in docs:
encode_doc(doc)
encode_docs(documents)
for sent in documents[0].sents[:3]:
prt_wrds = [(w.lower_, w.key_attr) for w in sent[:7]]
print(prt_wrds)
Filter & Replace & Map¶
Here we throw the previous three steps together into a single transformation object. This object tracks the entire process and allows us to trace encodings back to words for those words that were not replaced. Replaced words are inherently "lost" because multiple distinct Tokens can end up encoded to the same replacement Token.
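The VocabPrep class itself lives in the vocab_customizer module linked above. As a rough, simplified sketch of what such an object might do internally (hypothetical name and structure; the real class works on the DummyDoc wrappers and spaCy vocabulary keys rather than plain strings):

# Toy illustration of the Filter -> Replace -> Map sequence over spaCy docs.
# This is NOT the actual VocabPrep implementation, just a mental model.
from collections import defaultdict

class SimpleVocabPrep(object):
    def __init__(self, stopwords, min_freq=10):
        self.stopwords = set(stopwords)
        self.min_freq = min_freq

    def _keep(self, token):
        # Filter: drop stopwords and non-alpha-like tokens
        return token.is_alpha and token.lower_ not in self.stopwords

    def _key(self, token):
        # Replace: infrequent words are represented by their POS tag instead
        return token.pos_ if token.lower_ in self.infrequent else token.lower_

    def fit(self, docs):
        # Document frequency of each surviving word
        doc_freq = defaultdict(int)
        for doc in docs:
            for w in set(t.lower_ for t in doc if self._keep(t)):
                doc_freq[w] += 1
        self.infrequent = {w for w, c in doc_freq.items() if c < self.min_freq}
        # Map: assign consecutive integer ids to the surviving keys
        keys = {self._key(t) for doc in docs for t in doc if self._keep(t)}
        self.encoding = {k: i for i, k in enumerate(sorted(keys))}
        self.vocab2word = {i: k for k, i in self.encoding.items()}
        return self

    def transform(self, docs):
        # Each document becomes a list of sentences of integer ids
        return [[[self.encoding[self._key(t)] for t in sent if self._keep(t)]
                 for sent in doc.sents]
                for doc in docs]

The actual class additionally exposes things like n_words, n_total, and a vocab of non-replaced words, which are used when generating the Word Documents below.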
For comparison, we show the original spaCy version and the processed version for the first document in the collection.
vm = vocab_customizer.VocabPrep(nlp, stopwords=custom_stop_words)
vm.fit(spacy_docs)
for sent in spacy_docs[0].sents:
print(len(sent))
prt_snt = [w.lex_id for w in sent][:10]
print(prt_snt)
print([w.lower_ for w in sent][:10])
new_docs = vm.transform(spacy_docs)
for sent in new_docs[0].sents:
print(len(sent))
print(sent[:10])
print([vm.vocab2word[w] for w in sent][:10])
vm.n_words
vm.n_total
Generate Word Documents From Word Contexts¶
For determining Paradigmatic and Syntagmatic relationships, a document is generated for each word of interest in the vocabulary. The "word document" is constructed from all word contexts within the corpus of interest. From each sentence a particular word occurs in, a context is extracted. This is typically some window around the word, consisting of other words that are observed both before and after the target word. All contexts for the word are combined to construct a single BOW document for the word.
As a note, we only find contexts and construct Word Documents for words that were neither filtered out nor replaced. Thus, at this step, we need an additional mapping between the vocabulary and the Word Documents, which are simply numbered consecutively.
Generate Word Docs TDM¶
def symWindowContext(target, sent, window_size=3):
context = list(sent[max(0, target-window_size):target]) + \
list(sent[min(len(sent),target+1):min(len(sent), (target+window_size+1))])
return(context)
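# Quick sanity check of the window logic above, using made-up integer token ids:
# with window_size=1 the context of position 2 is just its immediate neighbours.
symWindowContext(2, [10, 11, 12, 13, 14], window_size=1)  # -> [11, 13]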
def contextGen(doc, vocab, wcc, ws, cutoff_len=2):
for sent in doc.sents:
if len(sent) > cutoff_len:
for i,word in enumerate(sent):
if word in vocab:
context = symWindowContext(i, sent, window_size=ws)
for c in context:
wcc[word][c] += 1
return(wcc)
def wordBOWGenerator(docs, vocab, window_size=3):
# Each word is a "document"
word_bows = {k : defaultdict(int) for k in vocab}
for doc in docs:
word_bows = contextGen(doc, vocab,
word_bows,
ws=window_size)
# "Lock" the defaultdicts
word_bows = {w : dict(c) for w,c in word_bows.items()}
return(word_bows)
def contextMatrixGenerator(word_docs, n_docs, n_vocab):
data, iii, jjj = \
zip(*[(c,doc['doc_id'],w2) for doc in word_docs for w2,c in doc['bow'].items()])
mat = scipy.sparse.coo_matrix((data, (iii, jjj)),
shape=((n_docs, n_vocab)),
dtype=numpy.int32)
mat = scipy.sparse.csr_matrix(mat)
return(mat)
word_bows = wordBOWGenerator(new_docs, vm.vocab, window_size=1)
word_docs = [{'name':w, 'bow':word_bows[w], 'doc_id':i} for i,w in enumerate(word_bows.keys())]
word_contexts_matrix = contextMatrixGenerator(word_docs, len(word_docs), vm.n_total)
word_contexts_matrix.shape
sums = pandas.Series([word_contexts_matrix.getrow(i).toarray().sum() \
for i in range(word_contexts_matrix.shape[0])])
sums.index = range(sums.shape[0])
sums.describe()
Pre-process T(W)DM / Word Contexts Matrix with BM25¶
For calculating Paradigmatic Similarity Scores we want to use the BM25 method. To make pairwise similarity calculations more efficient, we can pre-compute the appropriate BM25 word weightings for each document in the collection. This is similar to the previous BM25 article. Weighted term vectors for each Word Document are stacked into a matrix.
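Concretely, the weighting implemented by idfCalc (the "smooth" variant) and transformBM25 below is

$$\mathrm{BM25}(w, d) = \mathrm{IDF}(w)\,\frac{c(w,d)\,(k_1 + 1)}{c(w,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}, \qquad \mathrm{IDF}(w) = \max\!\left(0,\ \ln\frac{N - \mathrm{df}(w) + 0.5}{\mathrm{df}(w) + 0.5}\right)$$

where $c(w,d)$ is the context count of word $w$ in Word Document $d$, $|d|$ is the total count in $d$, $\mathrm{avgdl}$ is the average Word Document length, $\mathrm{df}(w)$ is the document frequency, $N$ is the number of documents (here, Word Documents), and $k_1 = 1.2$, $b = 0.75$ by default. Each weighted vector is then normalized to sum to one.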
def idfCalc(df, n_vocab, n_docs, method="smooth"):
"""Calculate the idf vector from the vocab"""
    idf = numpy.zeros((n_vocab,))
if method == "smooth":
for w,c in df.items():
idf[w] = max(0,
numpy.log((n_docs - c + 0.5) / (c + 0.5))
)
elif method=="basic":
for w,c in df.items():
idf[w] = c
idf /= n_docs
return(idf)
def transformBM25(vec, idf, avgdl, k1=1.2, b=0.75):
"""Weight a document BOW vetor based on BM25 methodology"""
def pdlCalc(doc_len):
"""PDL Normalization Calculation"""
return(1 - b + b * doc_len / avgdl)
doc_len = vec.sum()
d_pdl = pdlCalc(doc_len)
vec = idf * (vec * (k1 + 1)) / (vec + k1 * d_pdl)
vec = vec / vec.sum()
return(vec)
doc_freqs = Counter()
for bow in word_bows.values():
doc_freqs.update(bow.keys())
doc_freqs = dict(doc_freqs)
idf = scipy.sparse.lil_matrix((vm.n_total, vm.n_total))
idf.setdiag(idfCalc(doc_freqs, vm.n_total, vm.n_words))
avgdl = sums.sum() / len(sums)
bm25_trans_func = lambda x: transformBM25(x, idf.diagonal(), avgdl)
# Basic form, with removed '0-context' words
# May be a better way to loop this?
bm25_vecs = scipy.sparse.vstack(\
scipy.sparse.coo_matrix(bm25_trans_func(word_contexts_matrix.getrow(i).toarray())) \
for i in range(word_contexts_matrix.shape[0]) \
if word_contexts_matrix.getrow(i).sum() > 0
)
bm25_vecs = scipy.sparse.csr_matrix(bm25_vecs)
bm25_vecs.prune()
bm25_vecs.shape
Paradigmatic Similarity¶
Below, we explore some preliminary Paradigmatic similarity measures using the above data. By utilizing the BM25 precomputed matrix, we can calculate similarity scores for Word Documents. We also plot the distribution of log base 10 similarity scores below.
Generate Word Document Similarity Scores¶
%%time
# Sim(d1,d2) = sum over words w of BM25(w,d1) * IDF(w) * BM25(w,d2)
word_sims = numpy.dot(numpy.dot(bm25_vecs, idf), bm25_vecs.transpose())
word_sims.prune()
word_sims.shape
Get Unique Non-zero entries¶
The similarity matrix is symmetric and contains two entries for each pair of Word Documents. Here we locate all non-zero similarity measures and remove duplicates by keeping only the upper triangle. The strict "less than" requirement also removes Word-Document self-similarity scores.
nz = word_sims.nonzero()
len(nz[0])
matches = nz[0] < nz[1]
nz = (nz[0][matches], nz[1][matches])
len(nz[0])
dist = word_sims[nz]
dist = numpy.array(dist)[0,:]
g = sns.distplot(numpy.log10(dist), kde=False, color='purple');
g.figure.set_size_inches(12,8);
plt.title("Log10 of Word Similarity Scores, BM25", size=14);
plt.xlabel("Log10 BM25 Similarity Score", size=12);
plt.ylabel("Count", size=12);
pandas.Series(dist).describe([.25, .50, .75, .9, .95, 0.995])