immersinn-ds

Wed 30 November 2016

spaCy First Steps: BOW Vectors, Word Contexts

Posted by TRII in text-analytics   

Introduction / Overview

spaCy is an "Industrial-Strength Natural Language Processing" library built in Python. One may consider it a competitor to NLTK, though spaCy's creator would argue that the two occupy fairly different spaces in the NLP world.

In any case, we make use of spaCy in order to create a pipeline and extract information from a set of documents we want to analyze. First we construct a custom vocabulary from the documents. Then, we obtain BOW Vector representations for each of the docs using our vocabulary. Additionally, we extract word contexts.

This is meant to be a simple primer / introduction to using spaCy. We do not cover anything "deep", or anything particularly analytical for that matter.

In [1]:
import pickle
import numpy
import pandas
import spacy
import textacy
from spacy import attrs

Load Documents

In [2]:
with open('/home/immersinn/Dropbox/Analytics/Text Retrieval and Search Engines/data/preprocessedArticleContentDump.pkl', 'rb') as f:
    articles = pickle.load(f)
In [3]:
articles.shape
Out[3]:
(1098, 11)
In [4]:
articles.head(2)
Out[4]:
URL category content source title tokenized bow tot_len unique_words title_bow doc_id
0 http://phys.org/news/2016-04-activists-appeal-... earth-news Environmental groups on Tuesday lodged a compl... PhysOrg Activists appeal to EU over Polish logging of ... [environmental, groups, on, tuesday, lodged, a... {before, told, over, animal, last, millennia, ... 385 223 {logging, activists, of, over, polish, appeal,... 0
1 http://phys.org/news/2016-04-seismic-ecuador.html earth-news A doctoral thesis developed at UPM analysed th... PhysOrg The seismic risk of Ecuador [a, doctoral, thesis, developed, at, upm, anal... {situations, analysed, provided, ocean, cum, l... 269 152 {seismic, ecuador, the, of, risk} 1

Run spaCy Pipeline on Documents

With spaCy, it is straightforward to load a pipeline for processing documents. Below, we load the standard English-language pipeline with "spacy.load('en')", which is equivalent to "spacy.en.English()". The initial load of the pipeline takes a bit of time because it needs to load all of the "behind the scenes" workhorse stuff. After loading, we can view the default steps associated with the pipeline, which are the Tagger, Entity Recognizer, Matcher, and Parser.

The first step -- which is not listed in the pipeline -- is the general Tokenizer, which is responsible for creating the spaCy document in the first place and deciding what the individual tokens in the document will be. More on "tokens" below.

To be honest, the full default pipe is a bit of overkill here. A solid word and sentence parser is all we need for obtaining BOW Vectors for documents and word contexts (which rely on sentences). So for now, we are going to reduce the pipe to a subset of the default.

Ideally, we would like to use just the Parser. However, to correctly identify sentence boundaries, the Parser seems to need the POS tags (and / or whatever other types of tags) provided by the Tagger. Along with this, we also get Noun Chunks (but no "named entities", as expected, since we chose not to run the Entity Recognizer).

Conveniently, spaCy also has built-in GIL-free parallelization (yay!!!) thanks to its Cython internals. This means we can easily make use of multiple cores on our machine to do the initial "heavy lifting" of initializing each of the documents in the corpus.

In [5]:
# This part takes a little bit...
nlp = spacy.load("en")
In [6]:
nlp.pipeline
Out[6]:
[<Tagger object at 0x...>,
 <EntityRecognizer object at 0x...>,
 <Matcher object at 0x...>,
 <DependencyParser object at 0x...>]
In [7]:
nlp.pipeline = [nlp.tagger, nlp.parser]
nlp.pipeline
Out[7]:
[<Tagger object at 0x...>,
 <DependencyParser object at 0x...>]
In [8]:
docs = list(articles.content)
In [9]:
#... and this part takes about the same amount of time
docs = [doc for doc in nlp.pipe(docs, batch_size=250, n_threads=4)]
In [10]:
doc = docs[0]
In [11]:
count = 0
max_lines = 6
for sent in doc.sents:
    if count < max_lines:
        print(sent)
        print('\n')
    elif count == max_lines:
        print('......')
        break
    count += 1
Environmental groups on Tuesday lodged a complaint with the European Commission over Poland's large-scale logging plans in the Bialowieza forest, which includes Europe's last primeval woodland.


"We risk turning this forest into a tree plantation and reducing our natural heritage into blocks of wood," Greenpeace Poland head Robert Cyglicki told reporters, alongside representatives from six other groups including the Polish branch of the Worldwide Fund for Nature (WWF).


Poland's Environment Minister Jan Szyszko last month gave the go ahead for the large-scale logging—despite protests from scientists and ecologists—to combat a spruce bark beetle infestation. "


The Commission is concerned about the recent decision of the Polish authorities," EU environment spokeswoman Iris Petsa told AFP on Tuesday, adding that the institution had reached out to Warsaw and "will decide on any further steps" based on replies it received Monday.


Under the new plan, loggers will harvest more than 180,000 cubic metres (6.4 million cubic feet) of wood from non-protected areas of the forest over a decade, dwarfing previous plans to harvest 40,000 cubic metres over the same period.


The environmentalists take issue with the government's rationale, saying "the intensive wood extraction is a threat for priority habitats and species".


......
In [12]:
# We have Noun Chunks...
[nc for nc in doc.noun_chunks][:10]
Out[12]:
[Environmental groups,
 Tuesday,
 a complaint,
 the European Commission,
 Poland's large-scale logging plans,
 the Bialowieza forest,
 Europe's last primeval woodland,
 We,
 this forest,
 a tree plantation]
In [13]:
#...but no Named Entities
doc.ents
Out[13]:
()

Initialization of a spaCy document first involves a tokenization step performed by the Tokenizer. Tokens are the fundamental components / building blocks of a spaCy document. When iterating over a spaCy document, a Token object is returned for each token in the document.

Each Token is associated with various characteristics. One such attribute is the Token's POS tag (if we have used the Tagger). Another is an ID that links the token to the associated Lexeme object, which is an entry in the underlying vocabulary utilized by the particular spaCy pipe used to parse the document.

So, each entity that is returned from the spaCy tokenization process is a Token. And each Token is associated with various attributes, one of which is a vocabulary item, a Lexeme. Lexemes contain information that allows users to index the particular vocabulary item being referred to.

One feature of this is the ability to link a token to a word-vector representation (word embedding). By default, spaCy uses the GloVe set of word embeddings, though users are able to set custom embeddings if they choose.
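
As a quick aside, here is a minimal sketch (not part of the original notebook) of peeking at the embedding attached to a Lexeme; it assumes the loaded "en" model data ships with word vectors, and the words "forest" and "woodland" are just arbitrary examples.

# Peek at the word vector attached to a Lexeme and compare two entries;
# if the model data has no vectors, .has_vector will be False.
forest = nlp.vocab['forest']
woodland = nlp.vocab['woodland']
if forest.has_vector and woodland.has_vector:
    print(forest.vector.shape)           # dimensionality of the embedding
    print(forest.similarity(woodland))   # cosine similarity between the two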

Additionally, Lexemes are flagged as being punctuation, spaces, number-esque things, email-like strings, and so on. These attributes can be used to filter out unwanted Token / Lexeme types when creating a custom vocabulary for a corpus that is a subset of the default spaCy vocabulary. Below, we utilize Lexeme attributes to do just that in order to create a "condensed" BOW vocabulary.

In [14]:
token = docs[0][55]
type(token)
Out[14]:
spacy.tokens.token.Token
In [15]:
token.lower_
Out[15]:
'head'
In [16]:
token.pos_
Out[16]:
'NOUN'
In [17]:
token.lex_id
Out[17]:
434
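
To make the Token-to-Lexeme link concrete, here is a small sketch (using the "token" and "nlp" objects from the cells above) that retrieves the Lexeme behind the token and inspects a few of the boolean flags we use for filtering later on.

# Fetch the Lexeme entry that the Token's lowercase form points to,
# then inspect a few of the boolean flags used for filtering below.
lexeme = nlp.vocab[token.lower]
print(lexeme.lower_)      # 'head'
print(lexeme.is_punct)    # False
print(lexeme.is_space)    # False
print(lexeme.like_num)    # False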

Create BOW Vector Encodings

Creation of BOW Vector representations for documents is fairly straightforward. spaCy documents have a built-in "count_by" method that allows Tokens to be tallied based on various attributes. For instance, we could tally Tokens based on POS tags. A potentially full but un-annotated list of attributes can be found here.

We utilize the "LOWER" attribute. For our purposes, this is analogous to converting the document to lowercase characters, splitting the document on word boundaries, and tallying up the tokens.
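
As a tiny illustration of what "LOWER" buys us (a sketch, not part of the original notebook), differently-cased occurrences of the same word collapse into a single key:

# "The" and "the" collapse into one LOWER key; punctuation gets its own key.
toy = nlp("The cat sat on the mat. The cat slept.")
toy_counts = toy.count_by(attrs.LOWER)
print({nlp.vocab[k].lower_: v for k, v in toy_counts.items()})
# e.g. {'the': 3, 'cat': 2, 'sat': 1, 'on': 1, 'mat': 1, '.': 2, 'slept': 1}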

After counting Tokens for each document, the next step is creating a custom vocabulary. Specifically, numbers, white-space, punctuation -- in short, anything but content words -- are not of interest at the moment. These are filtered out by utilizing the Lexeme attributes. Once the filtering is complete, a mapping between the old (spaCy) and new vocabulary IDs is created.

Using the mapping, BOW vectors are created for each document. These vectors can be used to create a TDM (term-document / document-term matrix). For now, this is the end-goal.

Get Initial Representation

Using spaCy documents' built-in method "count_by" and the "attrs.LOWER" property, we generate the initial BOW representations for the documents in the collection. Since "attrs.LOWER" was the chosen property, the keys in the BOW hash tables are the Lexeme IDs corresponding to the lowercase representation for each token.

In all likelihood, there is no separate non-lowercase "behind the scenes" representation for tokens; any information that would be carried by capitalization is instead captured by things like POS tags or Entity Recognition (probably).

Interestingly, for spaCy, the keys returned by "count_by", regardless of attribute, all seem to be in the same "space" that is the spaCy vocabulary. For example, if the "POS" attribute is used instead of "LOWER", appropriate key labels can still be obtained via the "nlp.vocab" hash table. This is convenient, though potentially confusing.

In [18]:
bow_reprs = [doc.count_by(attrs.LOWER) for doc in docs]
In [19]:
vocab_keys = set()
for bow_rep in bow_reprs:
    vocab_keys.update(set(bow_rep.keys()))
In [20]:
count = 0
start = 1000
stop = 1006
for k in vocab_keys:
    if count <= start:
        count += 1
    elif count > start:
        print("spaCy ID: {};\t word: {}".format(str(k), nlp.vocab[k].lower_))
        count +=1
        if count >= stop:
            break
spaCy ID: 1361;	 word: lack
spaCy ID: 1362;	 word: telling
spaCy ID: 1363;	 word: fat
spaCy ID: 1364;	 word: himself
spaCy ID: 1365;	 word: edit
In [21]:
pos_reprs = [doc.count_by(attrs.POS) for doc in docs]
In [22]:
pos_keys = set()
for pos_rep in pos_reprs:
    pos_keys.update(set(pos_rep.keys()))
In [23]:
count = 0
start = 0
stop = 6
for k in pos_keys:
    if count <= start:
        count += 1
    elif count > start:
        print("spaCy ID: {};\t tag: {}".format(str(k), nlp.vocab[k].lower_))
        count +=1
        if count >= stop:
            break
spaCy ID: 83;	 tag: adp
spaCy ID: 84;	 tag: adv
spaCy ID: 86;	 tag: conj
spaCy ID: 87;	 tag: det
spaCy ID: 88;	 tag: intj

Create Desired Vocabulary

Once all the Lexemes encountered in the collection of documents have been identified, undesirable ones can be filtered out. To do this, we create a function, "lexeme_filter", which checks each item for various attributes. Items possessing any of these attributes are flagged as "bad" and excluded from the new vocabulary.

Recall from above that the keys in the BOW representations are just the Lexeme identifiers, not the Lexemes themselves. Thus, each key needs to be used to look up its respective Lexeme object (via "nlp.vocab[key]") in order to access the various attributes the filter checks.

In [24]:
def lexeme_filter(lexeme):
    if lexeme.is_digit:
        return(False)
    if lexeme.is_punct:
        return(False)
    if lexeme.is_space:
        return(False)
    if lexeme.like_num:
        return(False)
    if lexeme.like_email:
        return(False)
    return(True)
In [25]:
vocab_lexemes = [nlp.vocab[vk] for vk in vocab_keys]
In [26]:
vocab_lexemes_filtered = [vl for vl in vocab_lexemes if lexeme_filter(vl)]
In [27]:
print("Maximum vocabulary key found in documents: {}".format(max(vocab_keys)))
print("Original vocabulary size: {}".format(len(vocab_lexemes)))
print("Filtered vocabulary size: {}".format(len(vocab_lexemes_filtered)))
Maximum vocabulary key found in documents: 1517992
Original vocabulary size: 32304
Filtered vocabulary size: 31339
In [28]:
lexeme_encoding = {lexeme.lower : i for i,lexeme in enumerate(vocab_lexemes_filtered)}
rev_lexeme_encoding = {i:k for k,i in lexeme_encoding.items()}
lexeme_word_lookup = {lexeme.lower : lexeme.lower_ for lexeme in vocab_lexemes_filtered}
n_words = len(lexeme_encoding)
In [29]:
lexeme_word_lookup[rev_lexeme_encoding[555]]
Out[29]:
'single'

Create BOW Vectors

Once the new vocabulary and key-to-key lookups have been created, the BOW representations output by the "count_by" method can be converted into the new vocabulary ids, and then into vectors. Note that this conversion process greatly reduces the size of the resulting BOW Vectors / TDM.

The original spaCy vocabulary contained over 1 million entries, while the reduced vocabulary encountered in the document set contains only around 30 thousand unique items. Most smaller document sets (i.e., under 1 million documents) likely benefit from a reduced vocabulary, especially if the documents all come from a fairly specific domain (e.g., "science" or "history" or "news").
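
If you want to check the size of the full spaCy vocabulary for comparison (a quick sketch; the exact number depends on the model data that was downloaded), something like the following works:

# Number of Lexeme entries currently stored in the loaded spaCy vocabulary
print(len(nlp.vocab))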

After converting the original BOW Representations, we can look at the size of each document in terms of our reduced vocabulary. The average size is around 600 tokens, with the smallest consisting of only 38 tokens, and the largest having about 2300.

From here, the TDM could be used for document similarity measures, classification, clustering, or other analyses.

In [30]:
def lexeme_lower_bow_to_vec(lexeme_lower_bow, lexeme_encoding):
    bow_vec = numpy.zeros((len(lexeme_encoding,)), dtype=numpy.int64)
    for k,v in lexeme_lower_bow.items():
        try:
            bow_vec[lexeme_encoding[k]] += v
        except KeyError:
            pass
    return(bow_vec)
In [31]:
tdm = numpy.vstack(lexeme_lower_bow_to_vec(bow_rep, lexeme_encoding)
                   for bow_rep in bow_reprs)
In [32]:
print(tdm.shape)
tdm[:15,:15]
(1098, 31339)
Out[32]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
In [33]:
pandas.Series(tdm.sum(axis=1)).describe()
Out[33]:
count    1098.000000
mean      586.349727
std       288.792052
min        38.000000
25%       389.250000
50%       536.500000
75%       721.750000
max      2348.000000
dtype: float64
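
As a quick sketch of the sort of follow-up analysis mentioned above (document similarity), cosine similarity between rows of the TDM can be computed directly; nothing beyond the "tdm" array and numpy is assumed here.

# Cosine similarity between the BOW vectors of the first two documents
def cosine_sim(u, v):
    denom = numpy.linalg.norm(u) * numpy.linalg.norm(v)
    return numpy.dot(u, v) / denom if denom else 0.0

print(cosine_sim(tdm[0], tdm[1]))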

Create Word Contexts Matrix

Word Contexts can be utilized for investigating Paradigmatic and Syntagmatic Similarity. In our example, we define a word's context to be the three words on either side of it (for up to six total) within a single sentence. So, words close to sentence boundaries have truncated contexts for that particular sentence. For example, the context of "fox" in the following sentence is (The, quick, red, jumped, over, the), while the context of "brown" is (over, the, lazy, dog).

"The quick red fox jumped over the lazy brown dog"

Doing this for all sentences in all documents, we are able to generate what can be interpreted as a pseudo-document for each word. This is simply the total number of times each word in the vocabulary occurs within a given word's contexts across all sentences and documents.

Note that the identifiers given to words are from our vocabulary defined above.

Extract Word Contexts

In [34]:
def symWindowContext(target, sent, window_size=3):
    context = list(sent[max(0, target-window_size):target]) + \
              list(sent[(target+1):min(len(sent), (target+window_size+1))])
    return(context)

def contextGen(sents, ws=3, sent_preprocess=lambda x : x):
    contexts = []
    for sent in sents:
        sent = sent_preprocess(sent)
        for i,w in enumerate(sent):
            contexts.append((w, symWindowContext(i, sent, window_size=ws)))
    return(contexts)

def sentPPv01(sent, dictionary):
    encoded_sent = []
    for word in sent:
        try:
            encoded_sent.append(dictionary[word.lower])
        except KeyError:
            pass
    return(encoded_sent)
In [35]:
context_gen = lambda x: contextGen(x, ws=3, sent_preprocess=lambda y: sentPPv01(y, dictionary=lexeme_encoding))
In [36]:
word_contexts = [context_gen(doc.sents) for doc in docs]
In [37]:
word_contexts[0][:10]
Out[37]:
[(6487, [1488, 132, 16009]),
 (1488, [6487, 132, 16009, 22356]),
 (132, [6487, 1488, 16009, 22356, 120]),
 (16009, [6487, 1488, 132, 22356, 120, 4625]),
 (22356, [1488, 132, 16009, 120, 4625, 135]),
 (120, [132, 16009, 22356, 4625, 135, 117]),
 (4625, [16009, 22356, 120, 135, 117, 10862]),
 (135, [22356, 120, 4625, 117, 10862, 8820]),
 (117, [120, 4625, 135, 10862, 8820, 235]),
 (10862, [4625, 135, 117, 8820, 235, 26911])]
In [38]:
# Flatten the list of word contexts, which are currently nested in documents
word_contexts = [item for sublist in word_contexts for item in sublist]
In [39]:
word_contexts[:10]
Out[39]:
[(6487, [1488, 132, 16009]),
 (1488, [6487, 132, 16009, 22356]),
 (132, [6487, 1488, 16009, 22356, 120]),
 (16009, [6487, 1488, 132, 22356, 120, 4625]),
 (22356, [1488, 132, 16009, 120, 4625, 135]),
 (120, [132, 16009, 22356, 4625, 135, 117]),
 (4625, [16009, 22356, 120, 135, 117, 10862]),
 (135, [22356, 120, 4625, 117, 10862, 8820]),
 (117, [120, 4625, 135, 10862, 8820, 235]),
 (10862, [4625, 135, 117, 8820, 235, 26911])]

Compile Word Contexts

All that is left to do now is collect all of the contexts for each word and count the number of times each word in the vocabulary occurred in those contexts. We represent these counts in matrix form, where each row is the BOW representation of a single word's pseudo-document.

In [40]:
word_contexts_matrix = numpy.zeros((n_words, n_words), dtype=numpy.int64)
In [41]:
for word,context in word_contexts:
    for c in context:
        word_contexts_matrix[word,c] += 1
In [42]:
word_contexts_matrix[:15,:15]
Out[42]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
In [43]:
pandas.Series(word_contexts_matrix.sum(axis=1)).describe()
Out[43]:
count     31339.00000
mean        112.44309
std        1868.95456
min           0.00000
25%           6.00000
50%          12.00000
75%          36.00000
max      216108.00000
dtype: float64
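
To see what these pseudo-documents look like in practice, here is a small sketch that pulls the most frequent context words for one vocabulary entry; "forest" is just an arbitrary example word, assumed to be present in the reduced vocabulary.

# Look up the reduced-vocabulary id for an example word, then list the
# words that most often appear in its contexts.
query_key = nlp.vocab['forest'].lower        # spaCy id of the lowercase form
query_id = lexeme_encoding[query_key]        # id in our reduced vocabulary
row = word_contexts_matrix[query_id]
top_ids = row.argsort()[::-1][:10]
print([lexeme_word_lookup[rev_lexeme_encoding[i]] for i in top_ids])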

Summary & Follow-up

In this article, we have used (in a "tip of the iceberg" sense) the spaCy library for performing some basic NLP steps. First, we let spaCy do the heavy-lifting of converting raw text to a tokenized, tagged, and parsed document. The pipeline we utilized was a subset of the default pipeline.

Processing with the pipeline allowed us to utilize individual tokens and sentences in a set of documents for extracting a term-document matrix and word contexts based on words co-occurring in the same sentences. A precursor to these activities was creating a custom vocabulary by filtering unwanted token types out of the documents.

Note that during the activities that took place after the initial pipeline run, the spaCy documents themselves were not modified or updated. One may consider this a good or a bad thing, depending on perspective.

In many cases, a more direct association between the underlying documents and any derived representations may be desired. Steps carried out by the default (or a customized) initial pipeline have these associations built in, while steps performed afterwards do not. As a result, document representations based on a custom vocabulary derived from the corpus cannot be attached to the documents themselves without a second pass over them or forcing extra attributes onto the documents.

Additionally, there is no explicit overarching Corpus entity / object to track the process (pipeline) that takes us from the original documents to the TDM or the Word Contexts, or to store the vocabulary.

One potential option is textacy, a library that builds on spaCy in both a functional and an organizational sense and provides some additional functionality, such as various kinds of term filtering and generating a TDM from a set of documents.

Another option is creating some basic classes that perform various tasks. Utilizing an sklearn-like model, one can stay with the pipeline concept as well. An example of this is shown below. This is potentially a bit "awkward" (clunky) in terms of how the vocabulary needs to be initialized and then applied to the documents, though it offers a bit of flexibility. The assumption here is that the documents being passed in are spaCy-like.

In [44]:
class CustomVocab:
    
    def __init__(self, nlp, method='bow', attr=attrs.LOWER, token_filter=lambda x: True):
        self.nlp = nlp
        self._filter = token_filter
        self.method = method
        self._attr = attr
        
    def _call_count_by(self, texts):
        if self.method == 'bow':
            # Count tokens by the chosen attribute for each document passed in
            bow_reprs = [doc.count_by(self._attr) for doc in texts]
            doc_reprs = [[k for k in bow_rep.keys()] \
                         for bow_rep in bow_reprs]
        else:
            # Only the 'bow' method is implemented at the moment
            raise NotImplementedError(self.method)
        return(doc_reprs)
        
    def fit(self, texts):
        doc_reprs = self._call_count_by(texts)
        
        vocab_keys = set()
        for doc_rep in doc_reprs:
            vocab_keys.update(set(doc_rep))
        vocab_keys = [vk for vk in vocab_keys \
                      if self._filter(self.nlp.vocab[vk])]
        
        self.vocab_encoding = {key:i for i,key in enumerate(vocab_keys)}
        self.rev_vocab_encoding = {i:k for k,i in self.vocab_encoding.items()}
        self.word_lookup = {i : self.nlp.vocab[key].lower_ \
                            for (key,i) in self.vocab_encoding.items()}
        self.n_vocab = len(self.vocab_encoding)
    
    def transform(self, texts, unit="sents"):
        
        def word_transform(word):
            # Note: this looks up the Token's lemma ID; since the vocabulary
            # was keyed on the LOWER attribute, inflected forms whose lemma
            # differs from their lowercase form may map to a different entry
            # (or be dropped).
            try:
                return(self.vocab_encoding[word.lemma])
            except KeyError:
                pass
            
        new_texts = []
        for text in texts:
            if unit=="sents":
                new_doc = []
                for sent in text.sents:
                    new_doc.append([w for w in [word_transform(w) for w in sent] if w])
            elif unit=="words":
                new_doc = [word_transform(w) for w in text]
                new_doc = [w for w in new_doc if w]
            new_texts.append(new_doc)
        return(new_texts)
In [45]:
cv = CustomVocab(nlp, token_filter=lexeme_filter)
In [46]:
cv.fit(docs)
In [47]:
cv.n_vocab
Out[47]:
31339
In [48]:
new_docs = cv.transform(docs)
In [50]:
new_docs[0][0][:12]
Out[50]:
[6487, 634, 132, 16009, 17039, 120, 4625, 135, 117, 10862, 8820, 235]