spaCy First Steps: BOW Vectors, Word Contexts
Posted by TRII in text-analytics
Introduction / Overview¶
spaCy is an "Industrial-Strength Natural Language Processing" library built in Python. One might consider it a competitor to NLTK, though spaCy's creator would argue that the two occupy fairly different spaces in the NLP world.
In any case, we make use of spaCy to create a pipeline and extract information from a set of documents we want to analyze. First we construct a custom vocabulary from the documents. Then, we obtain bag-of-words (BOW) Vector representations for each of the docs using our vocabulary. Additionally, we extract word contexts.
This is meant to be a simple primer / introduction to using spaCy. We do not cover anything "deep", or anything particularly analytical for that matter.
import pickle
import numpy
import pandas
import spacy
import textacy
from spacy import attrs
Load Documents¶
with open('/home/immersinn/Dropbox/Analytics/Text Retrieval and Search Engines/data/preprocessedArticleContentDump.pkl', 'rb') as f:
articles = pickle.load(f)
articles.shape
articles.head(2)
Run spaCy Pipeline on Documents¶
With spaCy, it is straightforward to load a pipeline for processing documents. Below, we load the standard English-language pipeline with "spacy.load('en')", which is equivalent to "spacy.en.English()". The initial load of the pipeline takes a bit of time because it needs to load all of the "behind the scenes" workhorse data. After loading, we can view the default steps associated with the pipeline, which are the Tagger, the Entity Recognizer, the Matcher, and the Parser.
The first step -- which is left out of this list -- is the Tokenizer, which is responsible for initializing the spaCy document in the first place, as well as deciding what the individual tokens in the document will be. More on "tokens" below.
To be honest, the full default pipe is a bit of overkill for our purposes. A solid word and sentence parser does fine for obtaining BOW Vectors for documents and word contexts (which rely on sentences). So for now, we are actually going to reduce the pipe to a subset of the default pipe.
Ideally, we would like to use just the Parser. However, to correctly identify sentence boundaries, the Parser seems to need the POS tags (and / or whatever other types of tags) provided by the Tagger. Along with this, we also get Noun Chunks (but no "named entities", as expected given the decision not to run the Entity Recognizer).
Conveniently, spaCy also has built-in GIL-free parallelization by default (yay!!!) thanks to its Cython implementation. This means we can easily make use of multiple cores on our machine to do the initial "heavy lifting" of initializing each of the documents in the corpus.
# This part takes a little bit...
nlp = spacy.load("en")
nlp.pipeline
nlp.pipeline = [nlp.tagger, nlp.parser]
nlp.pipeline
docs = list(articles.content)
#... and this part takes about the same amount of time
docs = [doc for doc in nlp.pipe(docs, batch_size=250, n_threads=4)]
doc = docs[0]
count = 0
max_lines = 6
for sent in doc.sents:
    if count < max_lines:
        print(sent)
        print('\n')
    elif count == max_lines:
        print('......')
        break
    count += 1
# We have Noun Chunks...
[nc for nc in doc.noun_chunks][:10]
#...but no Named Entities
doc.ents
Initialization of a spaCy document first involves a tokenization step performed by the Tokenizer. Tokens are the fundamental components / building blocks of a spaCy document. When iterating over a spaCy document, a Token object is returned for each token in the document.
Each Token is associated with various characteristics. One such attribute is the Token's POS tag (if we have used the Tagger). Another is an ID that links the token to the associated Lexeme object, which is an entry in the underlying vocabulary utilized by the particular spaCy pipe used to parse the document.
So, each entity returned from the spaCy tokenization process is a Token. And each Token is associated with various attributes, one of which is a vocabulary item, a Lexeme. Lexemes contain information that allows users to index the particular vocabulary item being referred to.
One feature of this is the ability to link a token to a word-vector representation (word embedding). By default, spaCy uses the GloVe set of word embeddings, though users are able to set custom embeddings if they choose.
Additionally, Lexemes are identified as being punctuation, spaces, number-esque things, email-like strings, and so on. These attributes can be used to filter out unwanted Token / Lexeme types when creating a custom vocabulary for a corpus that is a subset of the default spaCy vocabulary. Below, we utilize Lexeme attributes to do just that in order to create a "condensed" BOW vocabulary.
token = docs[0][55]
type(token)
token.lower_
token.pos_
token.lex_id
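To make the Token / Lexeme relationship a bit more concrete, here is a small supplementary sketch (assuming the same "nlp" and "token" objects from above, and that the loaded model ships with word vectors as described earlier) that pulls up the Lexeme behind the token, inspects a few of the flags we will rely on for filtering later, and peeks at the word vector.
# Look up the Lexeme for the token's lowercase form in the pipeline's vocabulary
lex = nlp.vocab[token.lower]
# A few of the boolean flags attached to every Lexeme
print(lex.lower_, lex.is_punct, lex.is_space, lex.like_num, lex.like_email)
# The word vector (embedding) shared by the Lexeme and the Token
print(lex.vector.shape)
print(token.vector[:5])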
Create BOW Vector Encodings¶
Creation of BOW Vector representations for documents is fairly straightforward. spaCy documents have a built-in "count_by" method that allows Tokens to be tallied based on various attributes. For instance, we could tally Tokens based on POS tags. A potentially full but un-annotated list of these attributes can be found here.
For our purposes, we utilize the "LOWER" attribute. This is analogous to converting the document to lowercase characters, parsing the document by word boundaries, and tallying up the resulting tokens.
After counting Tokens for each document, the next step is creating a custom vocabulary. Specifically, numbers, white-space, punctuation -- in short, anything but content words -- are not of interest at the moment. These are filtered out by utilizing the Lexeme attributes. Once the filtering is complete, a mapping between the old (spaCy) and new vocabulary IDs is created.
Using the mapping, BOW vectors are created for each document. These vectors can be used to create a TDM (term-document / document-term matrix). For now, this is the end-goal.
Get Initial Representation¶
Using spaCy documents' built-in method "count_by" and the "attrs.LOWER" property, we generate the initial BOW representations for the documents in the collection. Since "attrs.LOWER" was the chosen property, the keys in the BOW hash tables are the Lexeme IDs corresponding to the lowercase representation for each token.
In all likelihood, there is no non-lowercase "behind the scenes" representation for tokens. Any special attributes that would be associated with capitalized words would instead be encapsulated in things like POS tagging or Entity Recognition (probably).
Interestingly, for spaCy, the keys returned by "count_by", regardless of attribute, all seem to be in the same "space" that is the spaCy vocabulary. For example, if the "POS" attribute is used instead of "LOWER", appropriate key labels can still be obtained via the "nlp.vocab" hash table. This is convenient, though potentially confusing.
bow_reprs = [doc.count_by(attrs.LOWER) for doc in docs]
vocab_keys = set()
for bow_rep in bow_reprs:
    vocab_keys.update(set(bow_rep.keys()))
count = 0
start = 1000
stop = 1006
for k in vocab_keys:
    if count <= start:
        count += 1
    elif count > start:
        print("spaCy ID: {};\t word: {}".format(str(k), nlp.vocab[k].lower_))
        count += 1
    if count >= stop:
        break
pos_reprs = [doc.count_by(attrs.POS) for doc in docs]
pos_keys = set()
for pos_rep in pos_reprs:
    pos_keys.update(set(pos_rep.keys()))
count = 0
start = 0
stop = 6
for k in pos_keys:
    if count <= start:
        count += 1
    elif count > start:
        print("spaCy ID: {};\t tag: {}".format(str(k), nlp.vocab[k].lower_))
        count += 1
    if count >= stop:
        break
Create Desired Vocabulary¶
Once all the Lexemes encountered in the collection of documents have been identified, undesirable ones can be filtered out. To do this, we create a function, "lexeme_filter" which checks each item for various attributes. Items possessing any of these attributes are flagged as "bad" and excluded from the new vocabulary.
Recall from above that the keys in the BOW representations are only the Lexeme object identifiers, not the Lexemes themselves. Thus, each key needs to be used to retrieve its respective Lexeme object in order to access the various attributes checked by the filter.
def lexeme_filter(lexeme):
    if lexeme.is_digit:
        return(False)
    if lexeme.is_punct:
        return(False)
    if lexeme.is_space:
        return(False)
    if lexeme.like_num:
        return(False)
    if lexeme.like_email:
        return(False)
    return(True)
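As a quick sanity check of the filter (the strings below are arbitrary examples, not taken from the corpus), we can pass a few hand-picked Lexemes through it; spaCy's vocab accepts plain strings as keys, so content words should come back True, while numbers, punctuation, and whitespace should come back False.
# Spot-check the filter on a few arbitrary lexemes
for w in ('fox', 'jumped', '42', '!', ' '):
    print(repr(w), lexeme_filter(nlp.vocab[w]))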
vocab_lexemes = [nlp.vocab[vk] for vk in vocab_keys]
vocab_lexemes_filtered = [vl for vl in vocab_lexemes if lexeme_filter(vl)]
print("Maximum vocabulary key found in documents: {}".format(max(vocab_keys)))
print("Original vocabulary size: {}".format(len(vocab_lexemes)))
print("Filtered vocabulary size: {}".format(len(vocab_lexemes_filtered)))
lexeme_encoding = {lexeme.lower : i for i,lexeme in enumerate(vocab_lexemes_filtered)}
rev_lexeme_encoding = {i:k for k,i in lexeme_encoding.items()}
lexeme_word_lookup = {lexeme.lower : lexeme.lower_ for lexeme in vocab_lexemes_filtered}
n_words = len(lexeme_encoding)
lexeme_word_lookup[rev_lexeme_encoding[555]]
Create BOW Vectors¶
Once the new vocabulary and key-to-key lookups have been created, the BOW representations output by the "count_by" method can be converted into the new vocabulary ids, and then into vectors. Note that this conversion process greatly reduces the size of the resulting BOW Vectors / TDM.
The original spaCy vocabulary contained over 1 million entries, while the reduced vocabulary encountered in the document set contains only around 30 thousand unique items. Most smaller document sets (i.e., under 1 million documents) likely benefit from a reduced vocabulary, especially if the documents are all from a fairly specific domain (e.g., "science" or "history" or "news").
After converting the original BOW Representations, we can look at the size of each document in terms of our reduced vocabulary. The average size is around 600 tokens, with the smallest consisting of only 38 tokens, and the largest having about 2300.
From here, the TDM could be used for document similarity measures, classification, clustering, or other analyses.
def lexeme_lower_bow_to_vec(lexeme_lower_bow, lexeme_encoding):
    bow_vec = numpy.zeros((len(lexeme_encoding),), dtype=numpy.int64)
    for k,v in lexeme_lower_bow.items():
        try:
            bow_vec[lexeme_encoding[k]] += v
        except KeyError:
            pass
    return(bow_vec)

tdm = numpy.vstack([lexeme_lower_bow_to_vec(bow_rep, lexeme_encoding)
                    for bow_rep in bow_reprs])
print(tdm.shape)
tdm[:15,:15]
pandas.Series(tdm.sum(axis=1)).describe()
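As a small illustration of the document-similarity use case mentioned above, the following sketch computes cosine similarities between the first document's BOW vector and every row of the TDM using plain numpy (raw counts, no TF-IDF weighting; this is just a sketch, not a full similarity analysis).
# Cosine similarity between document 0 and every document (raw term counts)
norms = numpy.linalg.norm(tdm, axis=1)
cos_sims = tdm.dot(tdm[0]) / (norms * norms[0])
# Indices of the most similar documents; the top hit is document 0 itself
numpy.argsort(cos_sims)[::-1][:5]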
Create Word Contexts Matrix¶
Word Contexts can be utilized for investigating Paradigmatic and Syntagmatic Similarity. In our example, we define a word's context to be the three words on either side of it (up to six words total) within a single sentence, so words close to sentence boundaries have truncated contexts for that particular sentence. For example, the context of "fox" in the following sentence is (The, quick, red, jumped, over, the), while the context of "brown" is (over, the, lazy, dog).
"The quick red fox jumped over the lazy brown dog"
Doing this for all sentences in all documents, we are able to generate what can be interpreted as a pseudo-document for each word. This is simply the total number of times each word in the vocabulary occurs within a given word's contexts across all sentences and documents.
Note that the identifiers given to words are from our vocabulary defined above.
Extract Word Contexts¶
def symWindowContext(target, sent, window_size=3):
    # Words within window_size positions on either side of the target,
    # truncated at the sentence boundaries
    context = list(sent[max(0, target-window_size):target]) + \
              list(sent[(target+1):min(len(sent), (target+window_size+1))])
    return(context)

def contextGen(sents, ws=3, sent_preprocess=lambda x : x):
    # Generate (word, context) pairs for every word in every sentence
    contexts = []
    for sent in sents:
        sent = sent_preprocess(sent)
        for i,w in enumerate(sent):
            contexts.append((w, symWindowContext(i, sent, window_size=ws)))
    return(contexts)

def sentPPv01(sent, dictionary):
    # Encode a spaCy sentence as a list of custom-vocabulary ids,
    # dropping tokens that are not in the vocabulary
    encoded_sent = []
    for word in sent:
        try:
            encoded_sent.append(dictionary[word.lower])
        except KeyError:
            pass
    return(encoded_sent)
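Before running these over the corpus, we can verify the window logic against the example sentence from earlier by feeding "symWindowContext" a plain list of strings (skipping the spaCy-specific preprocessing); the contexts for "fox" and "brown" should match the ones listed above.
# Check the window logic on the example sentence using plain strings
example = "The quick red fox jumped over the lazy brown dog".split()
print(symWindowContext(example.index('fox'), example))
print(symWindowContext(example.index('brown'), example))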
context_gen = lambda x: contextGen(x, ws=3, sent_preprocess=lambda y: sentPPv01(y, dictionary=lexeme_encoding))
word_contexts = [context_gen(doc.sents) for doc in docs]
word_contexts[0][:10]
# Flatten the list of word contexts, which are currently nested in documents
word_contexts = [item for sublist in word_contexts for item in sublist]
word_contexts[:10]
Compile Word Contexts¶
All that is left to do now is collect all of the contexts for each word and count the number of times each word in the vocabulary occurred in those contexts. We represent these counts in matrix form, where each row is the BOW representation of one word's pseudo-document.
word_contexts_matrix = numpy.zeros((n_words, n_words), dtype=numpy.int64)
for word,context in word_contexts:
    for c in context:
        word_contexts_matrix[word,c] += 1
word_contexts_matrix[:15,:15]
pandas.Series(word_contexts_matrix.sum(axis=1)).describe()
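To get a feel for what the matrix contains, the short sketch below takes the vocabulary entry at index 555 (the same entry we looked up earlier) and prints the words that occur most often in its contexts; this is just an inspection step, not an analysis.
# Most frequent context words for the vocabulary item at index 555
row = word_contexts_matrix[555]
top_context_ids = numpy.argsort(row)[::-1][:10]
print(lexeme_word_lookup[rev_lexeme_encoding[555]])
print([lexeme_word_lookup[rev_lexeme_encoding[i]] for i in top_context_ids])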
Summary & Follow-up¶
In this article, we have used (in a "tip of the iceberg" sense) the spaCy library for performing some basic NLP steps. First, we let spaCy do the heavy-lifting of converting raw text to a tokenized, tagged, and parsed document. The pipeline we utilized was a subset of the default pipeline.
Processing with the pipeline allowed us to utilize the individual tokens and sentences in a set of documents to extract a term-document matrix and word contexts based on words co-occurring in the same sentences. A precursor to these activities was creating a custom vocabulary by filtering out unwanted token types from the documents.
Note that during these activities, which took place after the initial pipeline, the spaCy documents were not modified or updated. One may consider this a good or a bad thing, depending on perspective.
Having a more direct association between the underlying documents and any derived representations may be desirable in many cases. Any steps carried out by the default or a customized initial pipeline have these associations by default, while steps performed afterwards do not. As a result, we cannot associate document representations based on a custom, document-derived vocabulary back to each document without another pass over the documents or forcing additional attributes onto them.
Additionally, there is no explicit overarching Corpus entity / object to track the process (pipeline) that takes us from the original documents to the TDM or the Word Contexts, or to store the vocabulary.
One potential option is using textacy, a library that builds on spaCy in a functional and organizational sense and provides some additional functionality, like various term filters and building a TDM from a set of documents.
Another option is creating some basic classes that perform the various tasks. Utilizing an sklearn-like model, one can stay with the pipeline concept as well. An example of this is shown below. This is potentially a bit "awkward" (clunky) in terms of how the vocabulary needs to be initialized and then applied to documents, though it offers a bit of flexibility. The assumption here is that the documents being passed in are spaCy-like.
class CustomVocab:

    def __init__(self, nlp, method='bow', attr=attrs.LOWER, token_filter=lambda x: True):
        self.nlp = nlp
        self._filter = token_filter
        self.method = method
        self._attr = attr

    def _call_count_by(self, texts):
        # Collect the attribute keys present in each document
        if self.method == 'bow':
            bow_reprs = [doc.count_by(self._attr) for doc in texts]
            doc_reprs = [[k for k in bow_rep.keys()] \
                         for bow_rep in bow_reprs]
        else:
            doc_reprs = []
        return(doc_reprs)

    def fit(self, texts):
        # Build the reduced vocabulary and the old-to-new id mappings
        doc_reprs = self._call_count_by(texts)
        vocab_keys = set()
        for doc_rep in doc_reprs:
            vocab_keys.update(set(doc_rep))
        vocab_keys = [vk for vk in vocab_keys \
                      if self._filter(self.nlp.vocab[vk])]
        self.vocab_encoding = {key:i for i,key in enumerate(vocab_keys)}
        self.rev_vocab_encoding = {i:k for k,i in self.vocab_encoding.items()}
        self.word_lookup = {i : self.nlp.vocab[key].lower_ \
                            for (key,i) in self.vocab_encoding.items()}
        self.n_vocab = len(self.vocab_encoding)

    def transform(self, texts, unit="sents"):

        def word_transform(word):
            # Look up the token's lowercase form, matching the default
            # attrs.LOWER used in fit; out-of-vocabulary tokens map to None
            try:
                return(self.vocab_encoding[word.lower])
            except KeyError:
                return(None)

        new_texts = []
        for text in texts:
            if unit == "sents":
                new_doc = []
                for sent in text.sents:
                    new_doc.append([w for w in [word_transform(w) for w in sent]
                                    if w is not None])
            elif unit == "words":
                new_doc = [word_transform(w) for w in text]
                new_doc = [w for w in new_doc if w is not None]
            new_texts.append(new_doc)
        return(new_texts)
cv = CustomVocab(nlp, token_filter=lexeme_filter)
cv.fit(docs)
cv.n_vocab
new_docs = cv.transform(docs)
new_docs[0][0][:12]
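To close the loop, the encoded sentences returned by "transform" can feed straight into the same downstream steps as before; for example, reusing "contextGen" on the already-encoded sentences (the default identity "sent_preprocess" is fine here, since "transform" has already done the encoding).
# Reuse the earlier context extraction on the already-encoded sentences
contexts = contextGen(new_docs[0], ws=3)
contexts[:5]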