Mixture Models Part 01: Two Unigram Language Models
Posted by TRII in text-analytics
Introduction / Overview
Continuing with the Coursera courses theme, this post kicks off a series of posts related to the Text Mining and Analytics course, a close companion to the Text Retrieval and Search Engines course.
In this post, we investigate a basic topic mining tool, the Mixture Model. More specifically, we look at a Two-Topic Unigram Mixture Model. This is similar to other language models we have covered previously.
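As a rough sketch of the idea, a two-component unigram mixture generates each word either from a "background" distribution (with probability λ) or from a document-specific topic distribution (with probability 1 − λ). The toy distributions and names below are illustrative assumptions, not values from the post:

```python
import math

# Hypothetical toy word distributions for the two components.
background = {"the": 0.5, "data": 0.2, "model": 0.2, "text": 0.1}
topic = {"the": 0.1, "data": 0.3, "model": 0.3, "text": 0.3}
lam = 0.5  # assumed mixing weight for the background component

def log_likelihood(words, lam, background, topic):
    """Log-likelihood of a word sequence under the two-component mixture.

    Each word is generated by the background model with probability lam,
    or by the topic model with probability 1 - lam; words are independent
    given the model (the unigram assumption)."""
    total = 0.0
    for w in words:
        p = lam * background[w] + (1 - lam) * topic[w]
        total += math.log(p)
    return total

doc = ["the", "data", "model", "text"]
print(log_likelihood(doc, lam, background, topic))
```

With λ fixed, fitting the topic distribution to maximize this log-likelihood is what pushes common background words ("the") out of the topic component.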
NLP Pipelines with spaCy: Filter & Replace & Map
Posted by TRII in text-analytics
Introduction / Overview
A semi-hasty post :-)
As a semi-follow-up to the previous article, we expand upon the pipeline to build out some custom steps that we use for generating word contexts. spaCy documents lock token attributes, so it is a bit difficult to filter and replace tokens directly. Here, we build some custom "dummy" Token and Document classes that mimic basic functionality of spaCy Tokens and Documents while still being able to link back to the vocabulary (or generate a custom one).
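A minimal sketch of that idea, with assumed class and method names (these are not spaCy's API, just plain-Python stand-ins that mirror a couple of Token/Doc attributes):

```python
class DummyToken:
    """Stand-in for a spaCy Token exposing only the attributes we need."""
    def __init__(self, text, lemma, is_stop=False):
        self.text = text
        self.lemma_ = lemma   # mirror spaCy's trailing-underscore convention
        self.is_stop = is_stop

class DummyDoc:
    """Stand-in for a spaCy Doc whose tokens can be filtered/replaced."""
    def __init__(self, tokens):
        self.tokens = list(tokens)

    def filter(self, predicate):
        # Return a new DummyDoc keeping only tokens that pass the predicate.
        return DummyDoc(t for t in self.tokens if predicate(t))

    def replace(self, mapping):
        # Return a new DummyDoc with token text remapped via `mapping`.
        return DummyDoc(
            DummyToken(mapping.get(t.text, t.text), t.lemma_, t.is_stop)
            for t in self.tokens
        )

doc = DummyDoc([
    DummyToken("the", "the", is_stop=True),
    DummyToken("cats", "cat"),
    DummyToken("ran", "run"),
])
kept = doc.filter(lambda t: not t.is_stop)
print([t.text for t in kept.tokens])
```

Because `filter` and `replace` return fresh objects rather than mutating spaCy's locked structures, the steps compose naturally in a pipeline.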
spaCy First Steps: BOW Vectors, Word Contexts
Posted by TRII in text-analytics
Introduction / Overview
spaCy is an "Industrial-Strength Natural Language Processing" library built in Python. One may consider it a competitor to NLTK, though spaCy's creator would argue that they occupy fairly different spaces in the NLP world.
In any case, we make use of spaCy to create a pipeline and extract information from a set of documents we want to analyze. First, we construct a custom vocabulary from the documents. Then, we obtain BOW vector representations for each of the documents using our vocabulary. Additionally, we extract word contexts.
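The vocabulary, BOW vector, and word-context steps can be sketched in plain Python on pre-tokenized toy documents (the token lists and helper names below are illustrative assumptions, not the post's actual data or code):

```python
from collections import Counter

# Hypothetical pre-tokenized documents.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]

# Custom vocabulary: every distinct word mapped to an integer id.
vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}

def bow_vector(tokens, vocab):
    """Bag-of-words count vector, ordered by vocabulary id."""
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in sorted(vocab, key=vocab.get)]

def contexts(tokens, window=1):
    """Map each word to the words within `window` positions of it."""
    ctx = {}
    for i, w in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        ctx.setdefault(w, []).extend(left + right)
    return ctx

print(bow_vector(docs[0], vocab))
print(contexts(docs[0]))
```

The BOW vectors discard word order entirely, while the context map keeps local order information, which is what makes the two representations complementary.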