immersinn-ds


Mon 19 December 2016

Mixture Models Part 01: Two Unigram Language Models

Posted by TRII in text-analytics   

Introduction / Overview

Continuing once again with the Coursera Courses Theme, this post kicks off a series of posts related to the Text Mining and Analytics Course, which is closely related to the Text Retrieval and Search Engines Course.

In this post, we investigate a basic Topic Mining tool, the Mixture Model. More specifically, we look at a Two-Topic Unigram Mixture Model, which is similar to other language models we have covered in previous posts.
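As a quick preview of where that post goes, the general two-component unigram mixture assigns each word a probability that blends a background model $\theta_B$ and a document/topic model $\theta_d$ (the notation here is our own shorthand, not lifted from the course materials):

$$
p(w) = \lambda\, p(w \mid \theta_B) + (1 - \lambda)\, p(w \mid \theta_d),
\qquad
\log p(d) = \sum_{w \in V} c(w, d)\, \log p(w)
$$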

Read more...


Mon 12 December 2016

NLP Pipelines with spaCy: Filter & Replace & Map

Posted by TRII in text-analytics   

Introduction / Overview

A semi-hasty post :-)

As a semi-follow-up to the previous article, we expand the pipeline with some custom steps that we use for generating word contexts. spaCy documents lock token attributes, so it is a bit difficult to filter and replace tokens directly. Here, we build some custom "dummy" Token and Document classes that mimic the basic functionality of spaCy Tokens and Documents while still being able to link back to the vocabulary (or generate a custom one).
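A minimal sketch of the idea is below; the class and attribute names are illustrative stand-ins, not the actual classes built in the post:

```python
class DummyToken:
    """A lightweight stand-in for a spaCy Token with writable attributes."""
    def __init__(self, token):
        # Copy only the attributes we care about from the real spaCy token;
        # unlike spaCy's Token, these can be freely overwritten later.
        self.text = token.text
        self.lemma_ = token.lemma_
        self.pos_ = token.pos_
        self.is_stop = token.is_stop


class DummyDoc:
    """Holds a filtered / replaced list of DummyTokens for one document."""
    def __init__(self, doc, keep=lambda tok: not tok.is_stop,
                 replace=lambda tok: tok):
        # 'doc' can be a spaCy Doc; iterating it yields Tokens
        self.tokens = [replace(DummyToken(tok)) for tok in doc if keep(tok)]

    def __iter__(self):
        return iter(self.tokens)
```

The versions in the post also keep a link back to the vocabulary, per the description above, but the filter / replace pattern is the same.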

Read more...


Wed 30 November 2016

spaCy First Steps: BOW Vectors, Word Contexts

Posted by TRII in text-analytics   

Introduction / Overview

spaCy is an "Industrial-Strength Natural Language Processing" library built in python. One may consider it a competitor to NLTK, though spaCy's creator would argue that they occupy fairly different spaces in the NLP world.

In any case, we make use of spaCy to create a pipeline and extract information from a set of documents we want to analyze. First, we construct a custom vocabulary from the documents. Then we obtain BOW vector representations for each of the docs using our vocabulary. Additionally, we extract word contexts.
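Stripped of the spaCy machinery, the vocabulary / BOW step looks roughly like this (a toy sketch in plain Python, not the pipeline from the post):

```python
from collections import Counter

# Toy tokenized documents standing in for the spaCy pipeline output
docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]

# Custom vocabulary: each unique token gets a column index
vocab = {w: i for i, w in enumerate(sorted({w for doc in docs for w in doc}))}

def bow_vector(doc, vocab):
    """Count-based BOW vector for one tokenized document."""
    counts = Counter(w for w in doc if w in vocab)
    vec = [0] * len(vocab)
    for w, n in counts.items():
        vec[vocab[w]] = n
    return vec

vectors = [bow_vector(doc, vocab) for doc in docs]
```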

Read more...


Thu 10 November 2016

JM Smoothing Language Model For Ranking

Posted by TRII in text-retrieval-and-search-engines   

Introduction / Overview

We continue our work on the Text Retrieval and Search Engines course (see here for the last article). For the various topics covered in the course, the goal is to implement some of the methods and tools in order to gain some hands-on experience.

The previous articles looked at embedding documents and queries into an $n$-dimensional space, calculating distances between the query and document embeddings, and using those distances as a measure of similarity between documents and queries.
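This post shifts to a probabilistic ranking approach. As a reminder of the general form (the post's exact setup may differ in details), Jelinek-Mercer smoothing interpolates a maximum-likelihood document model with a collection (background) model, and documents are then scored by query likelihood:

$$
p_\lambda(w \mid d) = (1 - \lambda)\, \frac{c(w, d)}{|d|} + \lambda\, p(w \mid C),
\qquad
\text{score}(q, d) = \sum_{w \in q} c(w, q)\, \log p_\lambda(w \mid d)
$$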

Read more...


Tue 01 November 2016

Frequent Pattern Mining - Apriori Pt. 02

Posted by TRII in frequent-pattern-mining-course   

Introduction / Overview

This post is the second related to the Pattern Discovery in Data Mining Course. In the first article in the series, we looked at the Apriori Principle (Algorithm) and how it can be used to find frequent patterns in a dataset.

In this article, we put the various pieces from the first article together to form a complete implementation of the Apriori Algorithm.
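For flavor, here is a compact (and deliberately unoptimized) sketch of the basic Apriori loop; it is illustrative only, not the implementation walked through in the article:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support count is >= min_support."""
    transactions = [set(t) for t in transactions]
    # Start with candidate 1-itemsets
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count support for each candidate k-itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Candidate (k+1)-itemsets: unions of surviving k-itemsets, pruned so
        # that every k-subset is itself frequent (the Apriori principle)
        k += 1
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in survivors for s in combinations(c, k - 1))]
    return frequent
```

For example, `apriori([[1, 2, 3], [1, 2], [2, 3]], min_support=2)` returns the itemsets {1}, {2}, {3}, {1, 2}, and {2, 3} along with their support counts.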

Read more...


Mon 24 October 2016

Virginia Disc One Exploration

Posted by TRII in text-retrieval-and-search-engines   

Introduction / Overview

Virginia Disc One was "the first large-scale distribution of test collections" used in Information Retrieval. The goal was to create a large test collection that hundreds of researchers could contribute to and utilize for work in the IR field. While many larger, more comprehensive collections have been created and distributed since VD1 first appeared in 1990, we thought it would be interesting (and fun!) to take a look at some of the contents and use them for future notebooks / articles.

Read more...


Fri 21 October 2016

Frequent Pattern Mining - Apriori Pt. 01

Posted by TRII in frequent-pattern-mining-course   

Introduction / Overview

Continuing with the Coursera Courses Theme, this post kicks off a series of posts related to the Pattern Discovery in Data Mining Course. As one might expect, the course covers introductory topics related to finding interesting patterns in data. "Pattern" is a fairly loaded term, so for now we'll leave the definition and exploration of the general field to the course and other resources.

Read more...