immersinn-ds

Mon 19 December 2016

Mixture Models Part 01: Two Unigram Language Models

Posted by TRII in text-analytics   

Introduction / Overview

Continuing again with the Coursera theme, this post kicks off a series on the Text Mining and Analytics course, which is related to the Text Retrieval and Search Engines course.

In this post, we investigate a basic topic mining tool: the mixture model. More specifically, we look at a two-topic unigram mixture model. This is similar to other language models we have covered
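To make the idea of a two-topic unigram mixture concrete, here is a minimal sketch. It assumes one common setup, which may differ from the one used in the full post: each word is drawn either from a known background distribution or from an unknown topic distribution, the mixing weight `lam` is fixed, and EM estimates the topic distribution. The function name `fit_topic_em`, the toy document, and the background probabilities are all made up for illustration.

```python
from collections import Counter


def fit_topic_em(tokens, background, lam=0.5, n_iter=50):
    """Estimate p(w | topic) by EM, holding the background model fixed.

    tokens:     list of words from a single document
    background: dict word -> p(w | background); assumed known and
                non-zero for every word in `tokens`
    lam:        assumed fixed probability of drawing from the background
    """
    counts = Counter(tokens)
    vocab = list(counts)
    # Start from a uniform topic distribution over the observed words.
    topic = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(n_iter):
        # E-step: posterior probability that each word came from the topic.
        z = {}
        for w in vocab:
            p_topic = (1.0 - lam) * topic[w]
            p_bg = lam * background[w]
            z[w] = p_topic / (p_topic + p_bg)
        # M-step: renormalize the expected topic counts.
        expected = {w: counts[w] * z[w] for w in vocab}
        total = sum(expected.values())
        topic = {w: expected[w] / total for w in vocab}
    return topic


# Toy data (hypothetical): "the" is frequent in both document and background.
tokens = "the the the the text mining the text".split()
background = {"the": 0.8, "text": 0.1, "mining": 0.1}
topic = fit_topic_em(tokens, background)
```

Because the background already explains most occurrences of "the", the estimated topic distribution gives "text" and "mining" more weight than their raw relative frequencies in the document.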



Mon 12 December 2016

NLP Pipelines with spaCy: Filter & Replace & Map

Posted by TRII in text-analytics   

Introduction / Overview

A semi-hasty post :-)

As a semi-follow-up to the previous article, we extend the pipeline with some custom steps that we use for generating word contexts. spaCy documents lock token attributes, so filtering and replacing tokens directly is somewhat difficult. Here, we build custom "dummy" Token and Document classes that mimic the basic functionality of spaCy Tokens and Documents while still being able to link back to the vocabulary (or generate a custom one).
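A rough sketch of what such stand-in classes might look like is given below. This is not the code from the post: `SimpleToken` and `SimpleDoc` are hypothetical names, the attributes `text`, `lemma_`, and `is_stop` are chosen only to mirror attribute names found on spaCy tokens, and the tokens are built by hand rather than copied from a spaCy document.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimpleToken:
    """Hypothetical lightweight stand-in for a spaCy token."""
    text: str
    lemma_: str
    is_stop: bool = False


class SimpleDoc:
    """Hypothetical container of SimpleToken objects with filter and map steps."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def __iter__(self):
        return iter(self.tokens)

    def __len__(self):
        return len(self.tokens)

    def filter(self, keep):
        # Keep only the tokens for which `keep(token)` is true.
        return SimpleDoc(t for t in self.tokens if keep(t))

    def map(self, fn):
        # Replace every token with `fn(token)`.
        return SimpleDoc(fn(t) for t in self.tokens)


doc = SimpleDoc([
    SimpleToken("The", "the", is_stop=True),
    SimpleToken("cats", "cat"),
    SimpleToken("ran", "run"),
    SimpleToken("2016", "2016"),
])

# Drop stop words, then replace numeric tokens with a placeholder.
cleaned = doc.filter(lambda t: not t.is_stop).map(
    lambda t: replace(t, text="<NUM>", lemma_="<NUM>") if t.text.isdigit() else t
)
words = [t.lemma_ for t in cleaned]
```

Each step returns a new `SimpleDoc`, so filters and replacements can be chained without mutating the tokens produced by an earlier step.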



Wed 30 November 2016

spaCy First Steps: BOW Vectors, Word Contexts

Posted by TRII in text-analytics   

Introduction / Overview

spaCy is an "Industrial-Strength Natural Language Processing" library for Python. One may consider it a competitor to NLTK, though spaCy's creator would argue that the two libraries occupy fairly different spaces in the NLP world.

In any case, we use spaCy to create a pipeline and extract information from a set of documents we want to analyze. First, we construct a custom vocabulary from the documents. Then, we obtain bag-of-words (BOW) vector representations for each document using that vocabulary. Additionally, we extract word contexts.
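The three steps can be illustrated without spaCy at all. The sketch below is a simplification under stated assumptions: it tokenizes on whitespace instead of running a spaCy pipeline, the helper names `bow_vector` and `word_contexts` are invented for this example, and a "context" is taken to mean the words within a fixed window around each token.

```python
from collections import Counter

# Toy corpus (hypothetical); whitespace splitting stands in for a real tokenizer.
docs = ["the cat sat on the mat", "the dog sat"]
tokenized = [d.split() for d in docs]

# Step 1: custom vocabulary mapping each word to a column index.
all_words = sorted({w for toks in tokenized for w in toks})
vocab = {w: i for i, w in enumerate(all_words)}


def bow_vector(tokens, vocab):
    """Step 2: count vector over the vocabulary for one document."""
    vec = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in vocab:
            vec[vocab[word]] = count
    return vec


def word_contexts(tokens, window=1):
    """Step 3: pair each word with its neighbors inside the window."""
    pairs = []
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((word, left + right))
    return pairs


bows = [bow_vector(toks, vocab) for toks in tokenized]
contexts = word_contexts(tokenized[1], window=1)
```

Sorting the vocabulary keeps the column order deterministic, so BOW vectors from different runs stay comparable.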
