Bioinformatics Algorithms Chapter 1 Workthrough
Posted by TRII in bioinformatics
Introduction & Overview
In the continued spirit of learning and courses, this article is the first in a series related to Bioinformatics Algorithms. In particular, our goal here is to follow along with the text Bioinformatics Algorithms: An Active Learning Approach, which is also associated with a MOOC on Coursera.
Mixture Models Part 01: Two Unigram Language Models
Posted by TRII in text-analytics
Introduction / Overview
Again continuing with the Coursera Courses theme, this post kicks off a series of posts related to the Text Mining and Analytics Course, which is closely related to the Text Retrieval and Search Engines Course.
In this post, we investigate a basic topic-mining tool, the Mixture Model. More specifically, we look at a Two-Topic Unigram Mixture Model. This is similar to other language models we have covered.
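As a rough illustration of the idea (a minimal sketch, not code from the post), a two-topic unigram mixture scores each word as a weighted blend of two word distributions — for example, a topic model and a background model. The distributions and mixing weight below are toy assumptions:

```python
# Sketch of a two-topic unigram mixture model: each word is generated by
# topic 1 with probability lam, otherwise by topic 2 (e.g. a background model).
import math

def log_likelihood(words, p_topic1, p_topic2, lam):
    """Log-likelihood of a word sequence under the two-topic mixture."""
    total = 0.0
    for w in words:
        # mixture probability of a single word
        p_w = lam * p_topic1.get(w, 0.0) + (1 - lam) * p_topic2.get(w, 0.0)
        total += math.log(p_w)
    return total

# toy distributions (assumed, for illustration only)
topic = {"text": 0.5, "mining": 0.5}
background = {"text": 0.25, "mining": 0.25, "the": 0.5}

ll = log_likelihood(["text", "the", "mining"], topic, background, lam=0.5)
```

Fitting the mixing weight and topic distribution from data (e.g. via EM) is where the real work in the post lies; the sketch only shows the likelihood being maximized.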
NLP Pipelines with spaCy: Filter & Replace & Map
Posted by TRII in text-analytics
Introduction / Overview
A semi-hasty post :-)
As a semi-follow-up to the previous article, we expand the pipeline with some custom steps that we use for generating word contexts. spaCy documents lock token attributes, so it is a bit difficult to filter and replace tokens directly. Here, we build custom "dummy" Token and Document classes that mimic the basic functionality of spaCy Tokens and Documents while still being able to link back to the vocabulary (or generate a custom one).
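The shape of that workaround might look something like the following minimal sketch. The class and method names here are assumptions for illustration, not the post's actual classes; the point is that spaCy's own `Token` objects are read-only views into a `Doc`, so mutation has to happen on lightweight stand-ins:

```python
# Hypothetical lightweight stand-ins for spaCy's Token/Doc that allow
# filtering and replacement, which spaCy's read-only tokens do not.
class DummyToken:
    def __init__(self, text):
        self.text = text
        self.lower_ = text.lower()  # mirror spaCy's `lower_` attribute

class DummyDoc:
    def __init__(self, tokens):
        self.tokens = list(tokens)

    def filter(self, predicate):
        """Return a new DummyDoc keeping only tokens that pass `predicate`."""
        return DummyDoc(t for t in self.tokens if predicate(t))

    def replace(self, mapping):
        """Return a new DummyDoc with token text mapped through `mapping`."""
        return DummyDoc(DummyToken(mapping.get(t.lower_, t.text))
                        for t in self.tokens)

doc = DummyDoc(DummyToken(w) for w in ["The", "cat", "sat"])
filtered = doc.filter(lambda t: t.lower_ != "the")
replaced = doc.replace({"cat": "dog"})
```

Returning new `DummyDoc` objects from each step keeps the pipeline composable: filter, replace, and map stages can be chained without mutating shared state.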
spaCy First Steps: BOW Vectors, Word Contexts
Posted by TRII in text-analytics
Introduction / Overview
spaCy is an "Industrial-Strength Natural Language Processing" library built in Python. One may consider it a competitor to NLTK, though spaCy's creator would argue that they occupy fairly different spaces in the NLP world.
In any case, we make use of spaCy in order to create a pipeline and extract information from a set of documents we want to analyze. First we construct a custom vocabulary from the documents. Then, we obtain BOW Vector representations for each of the docs using our vocabulary. Additionally, we extract word contexts.
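In plain Python, the two artifacts described — a custom vocabulary with BOW count vectors, and per-word context counts — can be sketched as follows (function names and the toy documents are assumptions for illustration; the post builds these on top of spaCy's pipeline):

```python
from collections import Counter

def build_vocab(docs):
    """Map each unique token across all documents to an integer id."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def bow_vector(doc, vocab):
    """Dense term-count (bag-of-words) vector for one document."""
    vec = [0] * len(vocab)
    for tok, n in Counter(doc).items():
        if tok in vocab:
            vec[vocab[tok]] += n
    return vec

def word_contexts(doc, window=2):
    """Count the tokens appearing within `window` positions of each token."""
    contexts = {}
    for i, tok in enumerate(doc):
        neighbors = doc[max(0, i - window):i] + doc[i + 1:i + window + 1]
        contexts.setdefault(tok, Counter()).update(neighbors)
    return contexts

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
vocab = build_vocab(docs)
vectors = [bow_vector(d, vocab) for d in docs]
contexts = word_contexts(docs[0], window=1)
```

With a shared vocabulary, every document maps to a vector of the same length, which is what makes downstream comparisons (and the retrieval models in the related series) possible.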
JM Smoothing Language Model For Ranking
Posted by TRII in text-retrieval-and-search-engines
Introduction / Overview
We continue our work on the Text Retrieval and Search Engines course (see here for the last article). For the various topics covered in the course, the goal is to implement some of the methods and tools in order to gain some hands-on experience.
The previous articles looked at embedding documents and queries into an $n$-dimensional space, calculating the distances between query-document embeddings, and utilizing these distances as a measure of similarity between documents and queries.
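The post's title names Jelinek-Mercer smoothing, which replaces those geometric distances with a query log-likelihood: the document's maximum-likelihood word distribution is interpolated with a collection-wide background model. A minimal sketch of the standard formulation (toy data and parameter values assumed, not the post's code):

```python
import math

def jm_score(query, doc, collection, lam=0.1):
    """Jelinek-Mercer smoothed query log-likelihood of a document:
    p(w|d) = (1 - lam) * c(w, d) / |d| + lam * p(w|C)
    """
    doc_len = len(doc)
    coll_len = sum(len(d) for d in collection)
    score = 0.0
    for w in query:
        p_ml = doc.count(w) / doc_len                      # document MLE
        p_c = sum(d.count(w) for d in collection) / coll_len  # background
        score += math.log((1 - lam) * p_ml + lam * p_c)
    return score

# toy collection of pre-tokenized documents (assumed for illustration)
collection = [["a", "b"], ["a", "c"]]
score = jm_score(["a"], collection[0], collection, lam=0.1)
```

The background term keeps query words absent from a document from zeroing out the whole score, which is the practical reason smoothing is needed for ranking at all.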
Frequent Pattern Mining - Apriori Pt. 02
Posted by TRII in frequent-pattern-mining-course
Introduction / Overview
This is the second post related to the Pattern Discovery in Data Mining Course. In the first article in the series, we looked into the Apriori Principle (Algorithm) and how it can be used to find frequent patterns in a dataset.
In this article, we put the various pieces from the first article together to form a complete implementation of the Apriori Algorithm.
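For orientation, a compact end-to-end version of the algorithm might look like the sketch below (not the post's implementation — the function name and toy transactions are assumptions). It alternates candidate generation with support counting, pruning candidates via the Apriori principle:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset whose support count is >= min_support."""
    transactions = [set(t) for t in transactions]
    # start with candidate 1-itemsets
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    while current:
        # count support of each candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # join surviving k-itemsets into (k+1)-candidates; Apriori principle:
        # every subset of a frequent itemset must itself be frequent
        keys = list(survivors)
        next_size = len(keys[0]) + 1 if keys else 0
        current, seen = [], set()
        for a, b in combinations(keys, 2):
            cand = a | b
            if len(cand) == next_size and cand not in seen:
                if all(frozenset(s) in survivors
                       for s in combinations(cand, next_size - 1)):
                    seen.add(cand)
                    current.append(cand)
    return frequent

result = apriori([["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]],
                 min_support=2)
```

On the toy data, every pair is frequent but the triple {a, b, c} appears in only one transaction, so the subset-pruning and the final count both reject it.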
Virginia Disc One Exploration
Posted by TRII in text-retrieval-and-search-engines
Introduction / Overview
Virginia Disc One was "the first large-scale distribution of test collections" used in Information Retrieval. The goal was to create a large test collection that hundreds of researchers could contribute to and utilize for work in the IR field. While many larger, more comprehensive collections have been created and distributed since VD1 was first distributed in 1990, we thought it would be interesting (and fun!) to take a look at some of the contents and use them for future notebooks / articles.
Frequent Pattern Mining - Apriori Pt. 01
Posted by TRII in frequent-pattern-mining-course
Introduction / Overview
Continuing with the Coursera Courses theme, this post will kick off a series of posts related to the Pattern Discovery in Data Mining Course. As one might expect, the course covers introductory topics related to finding interesting patterns in data. "Pattern" is a fairly loaded term, so for now we'll leave the definition and exploration of the general field to the course and other resources.
Improved VSM Instantiation - Okapi BM25
Posted by TRII in text-retrieval-and-search-engines
Introduction / Overview
This is the second notebook related to work from the Coursera course, Text Retrieval and Search Engines. (See here for the first.) For the various topics covered in the course, the goal is to implement some of the methods and tools in order to gain some hands-on experience.
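The improved instantiation named in the title, Okapi BM25, is standard enough that its scoring function can be sketched independently of the notebook (toy documents and the default `k1`/`b` values below are assumptions, not the post's code):

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of `doc` for `query` over the collection `docs`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)            # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        tf = doc.count(term)
        # saturating TF component with document-length normalization
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["a", "b"], ["c", "d"]]
score = bm25_score(["a"], docs[0], docs)
```

The `k1` parameter controls how quickly repeated term occurrences saturate, and `b` controls how strongly long documents are penalized — the two knobs that distinguish BM25 from the simplest VSM weighting.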
Simplest VSM Instantiation
Posted by TRII in text-retrieval-and-search-engines
Introduction
This notebook will (hopefully) be the first in a series of notebooks related to work from the Coursera course, Text Retrieval and Search Engines. For the various topics covered in the course, the goal is to implement some of the methods and tools in order to gain some hands-on experience.