immersinn-ds


Mon 19 December 2016

Mixture Models Part 01: Two Unigram Language Models

Posted by TRII in text-analytics   

Introduction / Overview

Continuing once again with the Coursera Courses Theme, this post kicks off a series of posts related to the Text Mining and Analytics Course, which is closely related to the Text Retrieval and Search Engines Course.

In this post, we investigate a basic Topic Mining tool, the Mixture Model. More specifically, we look at a Two-Topic Unigram Mixture Model, which is similar to other language models we have covered in previous posts.
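As a quick preview of where that post goes, the general two-component unigram mixture assigns each word a probability that blends a background model $\theta_B$ and a document/topic model $\theta_d$ (the notation here is our own shorthand, not lifted from the course materials):

$$
p(w) = \lambda\, p(w \mid \theta_B) + (1 - \lambda)\, p(w \mid \theta_d),
\qquad
\log p(d) = \sum_{w \in V} c(w, d)\, \log p(w)
$$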

Read more...


Mon 12 December 2016

NLP Pipelines with spaCy: Filter & Replace & Map

Posted by TRII in text-analytics   

Introduction / Overview

A semi-hasty post :-)

As a semi-follow-up to the previous article, we expand the pipeline with some custom steps that we use for generating word contexts. spaCy documents lock token attributes, so it is a bit difficult to filter and replace tokens directly. Here, we build some custom "dummy" Token and Document classes that mimic the basic functionality of spaCy Tokens and Documents while still being able to link back to the vocabulary (or generate a custom one).
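A minimal sketch of the idea is below; the class and attribute names are illustrative stand-ins, not the actual classes built in the post:

```python
class DummyToken:
    """A lightweight stand-in for a spaCy Token with writable attributes."""
    def __init__(self, token):
        # Copy only the attributes we care about from the real spaCy token;
        # unlike spaCy's Token, these can be freely overwritten later.
        self.text = token.text
        self.lemma_ = token.lemma_
        self.pos_ = token.pos_
        self.is_stop = token.is_stop


class DummyDoc:
    """Holds a filtered / replaced list of DummyTokens for one document."""
    def __init__(self, doc, keep=lambda tok: not tok.is_stop,
                 replace=lambda tok: tok):
        # 'doc' can be a spaCy Doc; iterating it yields Tokens
        self.tokens = [replace(DummyToken(tok)) for tok in doc if keep(tok)]

    def __iter__(self):
        return iter(self.tokens)
```

The versions in the post also keep a link back to the vocabulary, per the description above, but the filter / replace pattern is the same.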

Read more...


Wed 30 November 2016

spaCy First Steps: BOW Vectors, Word Contexts

Posted by TRII in text-analytics   

Introduction / Overview

spaCy is an "Industrial-Strength Natural Language Processing" library built in python. One may consider it a competitor to NLTK, though spaCy's creator would argue that they occupy fairly different spaces in the NLP world.

In any case, we make use of spaCy to create a pipeline and extract information from a set of documents we want to analyze. First, we construct a custom vocabulary from the documents. Then we obtain BOW vector representations for each of the docs using our vocabulary. Additionally, we extract word contexts.
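Stripped of the spaCy machinery, the vocabulary / BOW step looks roughly like this (a toy sketch in plain Python, not the pipeline from the post):

```python
from collections import Counter

# Toy tokenized documents standing in for the spaCy pipeline output
docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]

# Custom vocabulary: each unique token gets a column index
vocab = {w: i for i, w in enumerate(sorted({w for doc in docs for w in doc}))}

def bow_vector(doc, vocab):
    """Count-based BOW vector for one tokenized document."""
    counts = Counter(w for w in doc if w in vocab)
    vec = [0] * len(vocab)
    for w, n in counts.items():
        vec[vocab[w]] = n
    return vec

vectors = [bow_vector(doc, vocab) for doc in docs]
```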

Read more...


Thu 10 November 2016

JM Smoothing Language Model For Ranking

Posted by TRII in text-retrieval-and-search-engines   

Introduction / Overview

We continue our work on the Text Retrieval and Search Engines course (see here for the last article). For the various topics covered in the course, the goal is to implement some of the methods and tools in order to gain some hands-on experience.

The previous articles looked at embedding documents and queries into an $n$-dimensional space, calculating distances between the query and document embeddings, and using those distances as a measure of similarity between documents and queries.
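This post shifts to a probabilistic ranking approach. As a reminder of the general form (the post's exact setup may differ in details), Jelinek-Mercer smoothing interpolates a maximum-likelihood document model with a collection (background) model, and documents are then scored by query likelihood:

$$
p_\lambda(w \mid d) = (1 - \lambda)\, \frac{c(w, d)}{|d|} + \lambda\, p(w \mid C),
\qquad
\text{score}(q, d) = \sum_{w \in q} c(w, q)\, \log p_\lambda(w \mid d)
$$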

Read more...


Tue 01 November 2016

Frequent Pattern Mining - Apriori Pt. 02

Posted by TRII in frequent-pattern-mining-course   

Introduction / Overview

This post is the second related to the Pattern Discovery in Data Mining Course. In the first article in the series, we looked at the Apriori Principle (Algorithm) and how it can be used to find frequent patterns in a dataset.

In this article, we put the various pieces from the first article together to form a complete implementation of the Apriori Algorithm.
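For flavor, here is a compact (and deliberately unoptimized) sketch of the basic Apriori loop; it is illustrative only, not the implementation walked through in the article:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support count is >= min_support."""
    transactions = [set(t) for t in transactions]
    # Start with candidate 1-itemsets
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count support for each candidate k-itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Candidate (k+1)-itemsets: unions of surviving k-itemsets, pruned so
        # that every k-subset is itself frequent (the Apriori principle)
        k += 1
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in survivors for s in combinations(c, k - 1))]
    return frequent
```

For example, `apriori([[1, 2, 3], [1, 2], [2, 3]], min_support=2)` returns the itemsets {1}, {2}, {3}, {1, 2}, and {2, 3} along with their support counts.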

Read more...


Mon 24 October 2016

Virginia Disc One Exploration

Posted by TRII in text-retrieval-and-search-engines   

Introduction / Overview

Virginia Disc One was "the first large-scale distribution of test collections" used in Information Retrieval. The goal was to create a large test collection that hundreds of researchers could contribute to and utilize for work in the IR field. While many larger, more comprehensive collections have been created and distributed since VD1 first appeared in 1990, we thought it would be interesting (and fun!) to take a look at some of the contents and use them for future notebooks / articles.

Read more...


Fri 21 October 2016

Frequent Pattern Mining - Apriori Pt. 01

Posted by TRII in frequent-pattern-mining-course   

Introduction / Overview

Continuing with the Coursera Courses Theme, this post kicks off a series of posts related to the Pattern Discovery in Data Mining Course. As one might expect, the course covers introductory topics related to finding interesting patterns in data. "Pattern" is a fairly loaded term, so for now we'll leave the definition and exploration of the general field to the course and other resources.

Read more...