immersinn-ds

Mon 19 December 2016

Mixture Models Part 01: Two Unigram Language Models

Posted by TRII in text-analytics   

Introduction / Overview

Again continuing with the Coursera Courses Theme, this post will kick off a series of posts related to the Text Mining and Analytics Course, which is fairly related to the Text Retrieval and Search Engines Course.

In this post, we investigate a basic Topic Mining tool, the Mixture Model. More specifically, we look at a Two-Topic Unigram Mixture Model. This is similar to other language models we have covered in that the probability of observing a document is calculated as the product of the probabilities of the words in the document.

The general idea of a Topic Model is to assign topic labels / tags / classes to documents or portions of documents. Identifying Topics can serve a variety of goals. In this particular case, we are looking to identify words that are highly relevant to a particular topic and separate out "common" words that are not very descriptive from a content perspective.

For the Two-Topic Unigram Mixture Model, words can be associated with one of two language models, a Background model and a model related to the target topic ("Topic of Interest" or "Topic"). The intended function of the background model is to account for common language terms ("background" terms) that are not unique or specific to the topic being modeled.

For our example, we use the entire set of PhysOrg articles as the background model, and each of the eight main topics will be used as the target topic (a "Topic of Interest") individually. By utilizing a "science related" background corpus, we will hopefully be able to extract words that are highly relevant to each of the eight topics and not simply phrases that are more prevalent in science- and tech-like articles in general.

In [1]:
import pickle
In [2]:
import numpy
import scipy
import pandas
from nltk import FreqDist
import textacy
import spacy
In [3]:
from text_analytics import vocab_customizer
In [4]:
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
In [5]:
import plotly
from plotly.tools import FigureFactory as FF

Mixture of Two Unigram Language Models

The particular Mixture Model instance utilized here is a Mixture of Two Unigrams model. This model assumes that two underlying topics are used to generate documents. Each topic type has some probability of being observed, and the probabilities of both topic types must sum to 1.

Additionally, two word distributions exist, one for the Background and one for the Topic. For example, in the Background, "the" may make up 0.0005% of the observed words, and "token" makes up 0.00000013% of observed words, while in the Topic, "the" and "token" make up 0.000001% and 0.0003%, respectively.

Using this model, documents are generated by randomly sampling topics and words. To generate a word in a document, a topic is first selected at random (Background or Topic), and then a word is randomly selected based on the selected topic's word distribution. From this, we can see that the total probability of observing a word in a document is the sum of the topic-probability weighted topic-conditional word probabilities (the first equation below).
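As a concrete (and entirely invented) illustration of this generative process, the sketch below uses a toy four-word vocabulary: flip a weighted coin to choose the Background or the Topic, then sample a word from that model's distribution. None of these numbers are learned from the PhysOrg data.

# A minimal sketch of the two-topic generative process; the vocabulary and
# probabilities below are made up for illustration, not learned from the data.
rng = numpy.random.RandomState(0)

vocab = ['the', 'cell', 'quantum', 'data']
p_topic = 0.05                                          # p(theta_d)
p_background = 1 - p_topic                              # p(theta_B)
p_w_topic = numpy.array([0.05, 0.55, 0.30, 0.10])       # p(w | theta_d)
p_w_background = numpy.array([0.70, 0.05, 0.05, 0.20])  # p(w | theta_B)

def generate_word():
    # Pick a topic first, then draw a word from that topic's distribution
    if rng.rand() < p_topic:
        return rng.choice(vocab, p=p_w_topic)
    return rng.choice(vocab, p=p_w_background)

toy_doc = [generate_word() for _ in range(20)]

# Total probability of each word, marginalized over the topic choice
# (this is the first equation below)
p_w = p_topic * p_w_topic + p_background * p_w_background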

Thus, there are four components that specify a particular model:

  • The probability of observing / choosing the Topic ($p(\theta_d)$)
  • The probability of observing / choosing the Background ($p(\theta_{B})$)
  • For each word, the probability of observing the word in the Topic (conditional word probability given the Topic) ($p(w_i | \theta_{d})$)
  • For each word, the probability of observing the word in the Background (conditional word probability given the Background) ($p(w_i | \theta_{B})$)

The probability of observing a document within this model framework is the product of individual observed word probabilities. The "best" model is the one that provides the highest probability for the observed documents.

Total Word Probability / Word Generative Model:

$p(w) = p(\theta_d)p(w | \theta_d) + p(\theta_{B})p(w | \theta_{B})$

Model Parameters:

$\Lambda = (\{p(w | \theta_d)\}, \{p(w | \theta_{B})\}, \{p(\theta_{B})\}, \{p(\theta_d)\})$

Likelihood function:

$p(d | \Lambda) = \prod^{|V|}_{i=1} \left [ p(\theta_d)p(w_i | \theta_d) + p(\theta_{B})p(w_i | \theta_{B}) \right ] ^{c(w_i, d)}$

$\log(p(d | \Lambda)) = \sum^{|V|}_{i=1} c(w_i, d) \log \left [ p(\theta_d)p(w_i | \theta_d) + p(\theta_{B})p(w_i | \theta_{B}) \right ] $

ML Estimate:

$\Lambda^* = \arg\max_{\Lambda}p(d|\Lambda)$

where:

$\sum^{|V|}_{i=1} p(w_i | \theta_d) = \sum^{|V|}_{i=1} p(w_i | \theta_B) = 1$

$p(\theta_d) + p(\theta_B) = 1$

All four of these quantities are technically unobserved. However, for this particular model, the probabilities of observing the Topic and the Background are set by the user. This pair is referred to as the Mixing Weight.

Additionally, Conditional Word Probabilities given the Background topic are typically calculated from a reference corpus in the given language. In our case, we utilize all articles in the PhysOrg dataset to generate the Background word distributions.

This leaves only the Conditional Word Probabilities given the Topic to be determined. Our goal, then, is to find the Word Probabilities conditioned on the Topic that maximize the probability of the observed data. For each of the PhysOrg categories, the observed data is the overall word counts from all documents associated with the given category.

Because the Background word probabilities are fixed, trivially we see that the globally optimal model is arrived at if we set $p(\theta_d) = 1$ and allow $p(w | \theta_d)$ to be the observed word probabilities in the document(s) of interest. However, this undermines our goal of determining which words are indicative of a particular topic and which words are simply general components of the language.

Thinking about this problem another way, if we somehow knew which words were from the Background and which were from the Topic, it would be trivial to calculate the Topic Word Distribution from the data. While this information is not available, one way of getting around this issue is pretending that this information is known and working from there.

To do this, a latent variable $z$ is introduced that indicates whether the Background ($z = 1$) or Topic ($z = 0$) is currently being observed. For each word, the probability of $z = 0$ can be calculated (e.g., $p(z=0 | w)$), which indicates the likelihood of being "in the Topic" given a particular word.

Given probabilistic membership to the Background and Topic for each word (the $p(z=0 | w)$ values), the number of times each word appears in the Background and Topic can be calculated. In order to do this, the assumption is made that total observed word counts can be split across the two topics by utilizing $p(z=0 | w)$.

Specifically, if a word $w$ appears $c(w,d)$ times in the observed data, then $c(w,d)p(z=0 | w)$ of those occurrences are associated with the Topic. With this formal method for counting the occurrences of words in the Topic, the Topic Word Distribution can be calculated as before.
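As a small numerical sketch of this splitting (the counts and probabilities below are invented for illustration, not taken from the PhysOrg data): if a word appears 40 times and $p(z=0 | w) = 0.9$, then 36 of those occurrences are credited to the Topic and the remaining 4 to the Background.

# Invented example of splitting observed counts between Topic and Background
word_counts = numpy.array([40., 100., 7.])      # c(w, d) for three invented words
p_z0_given_w = numpy.array([0.90, 0.02, 0.95])  # p(z=0 | w) for those words

topic_counts = word_counts * p_z0_given_w       # occurrences credited to the Topic
background_counts = word_counts - topic_counts  # remainder credited to the Background

p_w_given_topic = topic_counts / topic_counts.sum()  # re-estimated p(w | theta_d)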

Utilizing the EM Algorithm, given a randomly initialized Topic Word Distribution, the hidden variables $z$ and Topic Word Distribution are iteratively updated (see below).

EM Algorithm

For the language model being used, the word probabilities conditional on the Topic and the hidden variable probabilities conditional on each word need to be updated at each step. The E Step updates the Hidden Variables ($p(z=0 | w)$ values), while the M Step updates the conditional word probabilities ($p(w | \theta_d)$ values).

The E Step calculates the probability that we have selected the "Topic" portion of the model given a particular word. Note that the denominator is the probability of observing a particular word in the model above. The numerator is the portion of this total probability that is associated with the "Topic".

For the M Step, the probability of a word given that we are in "Topic" is the normalized weighted count of the particular word. The word counts are all weighted by the probability that we have selected "Topic" given the word; that is, we weight by the results from the E Step.

In both cases, the process can be generalized as "counting and normalizing". First, an attribute is counted. Then, the attribute is normalized by dividing by the sum across all values of the attribute.

E Step:

$$p^{(n)}(z=0 | w) = \frac {p(\theta_d)p^{(n)}(w | \theta_d)} {p(\theta_d)p^{(n)}(w | \theta_d) + p(\theta_B)p(w | \theta_B)}$$

M Step:

$$p^{(n+1)}(w | \theta_d) = \frac {c(w, d)p^{(n)}(z=0 | w)} {\sum_{w' \in V} c(w', d)p^{(n)}(z=0 | w')}$$

where $n$ indicates the iteration number and $z$ is a hidden variable whose assignment indicates whether the word is associated with the Background or the Topic. For a word in the Topic, $z = 0$.

For the first calculation of $p(z=0 | w)$ (i.e., $n = 1$), it is typical to use a randomly generated set of probabilities for $p(w | \theta_d)$. It is important to note that these initial probabilities are not the fractional composition of the Topic Corpus associated with each word $w$.

To calculate an approximation for the Topic-Word Probabilities, the EM algorithm must be initialized with a random Topic-Word Probability vector. The starting point for this vector contributes to the eventual output of the algorithm. In some cases, starting conditions may lead to local, non-optimal maxima. Multiple runs are often conducted from different random starting points in order to mitigate the chances of this occurring.

The process is run until some user-defined convergence on the ML Estimate function is reached.

Note that the Mixing Weight Probabilities ($p(\theta_d)$ and $p(\theta_B)$, where $p(\theta_d) + p(\theta_B) = 1$) are not free parameters in the model. If they were allowed to update, the likelihood would be maximized by driving $p(\theta_d) \rightarrow 1$, with the Topic-Word Probabilities exactly matching the observed word probabilities, and the Background Model would play no part.

Below we show the functions used for the E and M Steps, as well as the Maximum Likelihood Estimate function.

In [6]:
def update_hidden_probs(topic_prob, topic_word_probs,
                        background_prob, background_word_probs):
    """
    topic_prob
    topic_word_probs
    background_prob
    background_word-probs
    
    hidden_probs --> |V| by 1
    """
    topic_weights = topic_prob * topic_word_probs
    background_weights = background_prob * background_word_probs
    hidden_probs = topic_weights / (topic_weights + background_weights)
    return(hidden_probs)

def update_topicword_probs(word_counts, hidden_probs):
    """
    Should only update probabilities for words in non-background model
    
    word_counts --> |V| by 1
    hidden_probs --> |V| by 1
    
    wordcat_probs --> |V| by 1
    """
    weighted_probs = word_counts * hidden_probs
    topic_word_probs = weighted_probs / numpy.sum(weighted_probs, axis=0)
    return(topic_word_probs)

def calc_loglikelihood(word_counts,
                       topic_prob, topic_word_probs,
                       background_prob, background_word_probs):
    """
    Log likelihood (base 10) of the observed word counts under the mixture:
    sum over w of c(w,d) * log10[ p(theta_d)p(w|theta_d) + p(theta_B)p(w|theta_B) ].
    """
    topic_weights = topic_prob * topic_word_probs
    background_weights = background_prob * background_word_probs
    inter = numpy.log10(topic_weights + background_weights)
    inter = numpy.multiply(word_counts, inter)
    loglike = inter.sum()
    return(loglike)
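
As a quick sanity check on these functions, a single EM iteration on an invented four-word vocabulary might look like the following (the counts and background probabilities are made up; the arrays are |V| by 1 column vectors to match the docstrings above):

# Toy sanity check for the E/M/likelihood functions above; all values invented
nV = 4
word_counts = numpy.array([[12.], [3.], [40.], [1.]])              # c(w, d), |V| x 1
background_word_probs = numpy.array([[0.5], [0.2], [0.2], [0.1]])  # fixed p(w | theta_B)

topic_prob = 0.05
background_prob = 1 - topic_prob

# Random initialization of p(w | theta_d), normalized to sum to 1
topic_word_probs = scipy.rand(nV, 1)
topic_word_probs = topic_word_probs / topic_word_probs.sum(axis=0)

# One E Step followed by one M Step
hidden_probs = update_hidden_probs(topic_prob, topic_word_probs,
                                   background_prob, background_word_probs)
topic_word_probs = update_topicword_probs(word_counts, hidden_probs)

# Track the log likelihood to monitor convergence
score = calc_loglikelihood(word_counts,
                           topic_prob, topic_word_probs,
                           background_prob, background_word_probs)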

Article Preprocessing & Data Preparation

In [7]:
with open('data/major8Articles_short.pkl', 'rb') as f:
    articles = pandas.DataFrame(pickle.load(f))
In [8]:
articles = articles.sample(250)
In [9]:
articles.index = [i for i in range(articles.shape[0])]
In [10]:
articles.head()
Out[10]:
URL category content source title
0 http://phys.org/news/2016-05-chance-production... biology-news An almost entirely accidental discovery by Uni... PhysOrg Chance finding could transform plant production
1 http://phys.org/news/2016-05-physicists.html physics-news Physicists from Trinity College Dublin's Schoo... PhysOrg Physicists discover a new form of light
2 http://phys.org/news/2016-05-tinder-users-vigi... technology-news Tinder says "people with bad intentions exist ... PhysOrg Tinder reminds users to be 'vigilant' amid kid...
3 http://phys.org/news/2016-05-neutrons-probe-en... chemistry-news A team led by the Department of Energy's Oak R... PhysOrg Neutrons probe structure of enzyme critical to...
4 http://phys.org/news/2016-05-nakamoto-fiasco-r... technology-news The world last week was treated to another epi... PhysOrg In the Nakamoto fiasco, Reddit proves a more r...

Pre-Process Steps:

  1. Preprocess Raw Text with Textacy
  2. Convert to spaCy docs and tag & parse
  3. Filter-Replace-Map (FRM)
  4. Construct BOWs

Textacy Preproc

In [11]:
textacy_preprocessor = lambda text: textacy.preprocess.preprocess_text(text, 
                                                                       no_contractions=True,
                                                                       no_numbers=True,
                                                                       no_emails=True,
                                                                       no_currency_symbols=True)
In [12]:
articles['content'] = articles.apply(lambda x: textacy_preprocessor(x['content']), axis=1)

spaCy Doc Gen

In [13]:
nlp = spacy.load("en", add_vectors=False)
In [14]:
nlp.pipeline = [nlp.tagger, nlp.parser]
In [15]:
articles['spacy_docs'] = [doc for doc in nlp.pipe(articles.content, batch_size=50, n_threads=5)]

FRM

In [16]:
vm = vocab_customizer.VocabPrep(nlp, stopwords=[], min_freq=3)
In [17]:
vm.fit(articles.spacy_docs)
In [18]:
articles['new_content'] = vm.transform(articles.spacy_docs)
In [19]:
count = 0
for sent in articles['spacy_docs'][0].sents:
    print([w.lower_ for w in sent][:7])
    count += 1
    if count > 5: break
['an', 'almost', 'entirely', 'accidental', 'discovery', 'by', 'university']
['by', 'tweaking', 'a', 'plant', "'s", 'genetic', 'profile']
['the', 'findings', 'were', 'published', 'in', 'the', 'march']
['the', 'team', 'studied', 'arabidopsis', ',', 'a', 'small']
['they', 'found', 'that', 'inserting', 'a', 'particular', 'corn']
['"', 'even', 'if', 'the', 'effects', 'in', 'a']
In [20]:
count = 0
for sent in articles['new_content'][0].sents:
    print([vm.vocab2word[w] for w in sent][:7])
    count += 1
    if count > 5: break
['an', 'almost', 'entirely', 'adj', 'discovery', 'by', 'university']
['by', 'verb', 'a', 'plant', 'genetic', 'profile', 'the']
['the', 'findings', 'were', 'published', 'in', 'the', 'march']
['the', 'team', 'studied', 'arabidopsis', 'a', 'small', 'noun']
['they', 'found', 'that', 'verb', 'a', 'particular', 'corn']
['even', 'if', 'the', 'effects', 'in', 'a', 'field']

BOW

In [21]:
def bow_from_doc(doc):
    bow = FreqDist([w for w in doc])
    bow = dict(bow)
    return(bow)
In [22]:
articles['bow'] = articles.apply(lambda x: bow_from_doc(x['new_content']), axis=1)
In [23]:
count = 1
for key in articles['bow'][0]:
    print('Count for token id {0}\t ({2}):\t {1}'.format(key, articles['bow'][0][key], vm.vocab2word[key]))
    count += 1
    if count > 7: break
Count for token id 0	 (similarity):	 1
Count for token id 1032	 (terms):	 1
Count for token id 2061	 (university):	 1
Count for token id 4110	 (engineered):	 1
Count for token id 1041	 (science):	 2
Count for token id 18	 (adj):	 2
Count for token id 21	 (increasingly):	 1

Topic Word Counts (BOW) & Background Word Fractions

  • Word Counts are cumulative Frequency Distributions across all items in a Topic
  • Background is the cumulative Word Counts / Frequencies across all Topics
  • Background Word Fractions are Word Probabilities conditioned on the Background Topic (no smoothing in this case)
In [35]:
topics_word_counts = {topic : FreqDist() for topic in set(articles.category)}
background_word_counts = FreqDist()
In [36]:
def topicWordcountUpdateFunc(doc, topic_fd):
    topic_fd.update(doc['bow'])
In [37]:
for topic in topics_word_counts.keys():
    sub = articles[articles.category==topic]
    _ = sub.apply(lambda x: topicWordcountUpdateFunc(x, 
                                                    topics_word_counts[topic]),
                 axis=1)
    background_word_counts.update(topics_word_counts[topic])
In [38]:
for topic in topics_word_counts.keys():
    print('Topic: {};\t Vocab Size: {}'.format(topic, len(topics_word_counts[topic])))
Topic: physics-news;	 Vocab Size: 2172
Topic: nanotech-news;	 Vocab Size: 1485
Topic: earth-news;	 Vocab Size: 1769
Topic: space-news;	 Vocab Size: 1859
Topic: technology-news;	 Vocab Size: 3562
Topic: chemistry-news;	 Vocab Size: 2299
Topic: science-news;	 Vocab Size: 2167
Topic: biology-news;	 Vocab Size: 2710
In [39]:
vm.n_total
Out[39]:
4459
Background Word Probabilities
In [40]:
background_word_probs = numpy.array([background_word_counts[w] for w in background_word_counts.keys()])
background_word_probs = background_word_probs / background_word_probs.sum()
background_word_probs = background_word_probs.reshape((background_word_probs.shape[0],1))
background_word_probs.shape
Out[40]:
(4459, 1)

Run EM Algorithm on each Topic

At this point, we are ready to run the EM algorithm on each of the 8 topics in order to find the optimal word probabilities for each topic. We fix the Topic Probability to the same value across all topics, using a fairly low value of $0.05$. That is, there is only a 5% chance of any given word being drawn from the Topic rather than the Background. While low, this value gives us nice results.

In [42]:
#### 00: Some General Variables
tol = 0.1
topic_prob = 0.05; background_prob = 1 - topic_prob;
topics_word_probs = {}
nV = background_word_probs.shape[0]
n_starting_conds = 5


for i,topic in enumerate(topics_word_counts.keys()):
    
    print("Learning word probs for topic {}".format(topic))
    
    # Generate the array corresponding to the Topic word counts
    word_counts = numpy.zeros_like(background_word_probs)
    for k,c in topics_word_counts[topic].items():
        word_counts[k] = c    
    wps = []
    
    for n in range(n_starting_conds):
        
        print("  Running Trial {}".format(n+1))

        #### 01: Initialize Starting Conditions
        topic_word_probs = scipy.rand(nV, 1)
        topic_word_probs = topic_word_probs / topic_word_probs.sum(axis=0)


        #### 02: Run EM Algorithm
        last_score = 9999
        delta = 999
        count = -1

        while delta > tol:
            count += 1
            
            # E-Step
            hidden_probs = update_hidden_probs(topic_prob, topic_word_probs,
                                               background_prob, background_word_probs)

            # M-Step
            topic_word_probs = update_topicword_probs(word_counts, hidden_probs)

            if count % 10 == 0:

                # Evaluate
                score = calc_loglikelihood(word_counts, 
                                           topic_prob, topic_word_probs,
                                           background_prob, background_word_probs)
                delta = numpy.abs(score - last_score)
                last_score = score

        score = calc_loglikelihood(word_counts, 
                                   topic_prob, topic_word_probs,
                                   background_prob, background_word_probs)
   
        wps.append(topic_word_probs.copy())
        
        
    # Evaluate whether the various starting points generated results that "look similar";
    # if all trials landed on (approximately) the same solution, keep the first one
    wps_distances = scipy.spatial.distance.pdist(scipy.hstack(wps).transpose())
    if numpy.all(wps_distances < (tol / nV)**0.5):
        topic_word_probs = wps[0].copy()
        del(wps)
        score = calc_loglikelihood(word_counts, 
                                   topic_prob, topic_word_probs,
                                   background_prob, background_word_probs)
        print("Final score: {}\n".format(score))
        
    
    topics_word_probs[topic] = topic_word_probs
Learning word probs for topic physics-news
  Running Trial 1
  Running Trial 2
  Running Trial 3
  Running Trial 4
  Running Trial 5
Final score: -44824.753389980804

Learning word probs for topic nanotech-news
  Running Trial 1
  Running Trial 2
  Running Trial 3
  Running Trial 4
  Running Trial 5
Final score: -20788.829119645314

Learning word probs for topic earth-news
  Running Trial 1
  Running Trial 2
  Running Trial 3
  Running Trial 4
  Running Trial 5
Final score: -28489.653552809963

Learning word probs for topic space-news
  Running Trial 1
  Running Trial 2
  Running Trial 3
  Running Trial 4
  Running Trial 5
Final score: -31863.837286180755

Learning word probs for topic technology-news
  Running Trial 1
  Running Trial 2
  Running Trial 3
  Running Trial 4
  Running Trial 5
Final score: -127349.9195143051

Learning word probs for topic chemistry-news
  Running Trial 1
  Running Trial 2
  Running Trial 3
  Running Trial 4
  Running Trial 5
Final score: -43988.47969186744

Learning word probs for topic science-news
  Running Trial 1
  Running Trial 2
  Running Trial 3
  Running Trial 4
  Running Trial 5
Final score: -38578.70141305159

Learning word probs for topic biology-news
  Running Trial 1
  Running Trial 2
  Running Trial 3
  Running Trial 4
  Running Trial 5
Final score: -64736.48096823133

Results

Below we show the Top 30 highest probability words from each of the topics and the Background dataset. Note that the Background "picks up" a number of common stopwords from the dataset, as well as a few POS Tags. Recall that the Background data is not "learned" in the same sense as each of the Topic Models. The Background Model is simply the observed word probabilities in the total dataset.

We also notice that "number", which is actually overloaded to be both the word "number" and any set of characters that looked like a number in the original data, is also captured in the Background dataset. It is reasonable to assume that science and tech articles typically contain numbers, and thus "number" in this sense functions as a sort of topical stopword. Such contextualized stopword identification is one of the primary uses of the Background model.

Looking over the words for each topic, we observe that words seem to be very representative of the topic they are associated with. Some words appear in multiple topics, such as "cells" in biology-news and nanotech-news.

Additionally, components of some phrases -- such as "el nino" and "carbon dioxide" -- appear as two separate entries due to the fact that we did not include (noun) phrases in the vocabulary. Similarly, because we did not stem words, or at the very least map plural words onto their singular counterpart, we observe both forms in some cases ("cell" and "cells", "earthquake" and "earthquakes").
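
If we wanted to collapse those singular/plural pairs, one lightweight option would be to key the bag-of-words on each token's lemma instead of its surface form. A sketch (assuming the same nlp pipeline loaded above; this step was not applied in this post):

# Sketch only: a lemma-based bag-of-words so that 'cell'/'cells' and
# 'earthquake'/'earthquakes' collapse to single entries; reuses the `nlp`
# pipeline loaded above and was not part of the pipeline actually run here.
def lemma_bow_from_text(text):
    doc = nlp(text)
    return dict(FreqDist([tok.lemma_ for tok in doc if not tok.is_punct]))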

Interestingly, the words "you" and "your" are highly associated with the technology topic, perhaps indicating that articles in this vein tend to relate their topics to the reader ("you can do X with Y" or "your phone may now be able to X", for instance). Such "reader relations" are likely not present in the other topics.

In [43]:
def getTopN(topic_word_probs, n=10):
    s = list(numpy.argsort(topic_word_probs[:,0]))
    s = [ind for ind in reversed(s)]
    return(s[:n])
In [44]:
n = 30
table = {}

for topic, word_probs in topics_word_probs.items():
    top10 = getTopN(word_probs, n)
    table[topic] = [vm.vocab2word[i] for i in top10]
    
top10 = getTopN(background_word_probs, n)
table["Background"] = [vm.vocab2word[i]  for i in top10]
table = pandas.DataFrame(table)

table = FF.create_table(table)
plotly.plotly.iplot(table, filename="top{}Words".format(n))
Out[44]:

One interesting side effect of finding the Topic Word Distributions for each of the PhysOrg categories is that we could potentially classify a new observed document by associating it with the PhysOrg category whose Word Distribution assigns it the highest probability.

In other words, making use of the document probability function above, $\log(p(d | \Lambda)) = \sum^{|V|}_{i=1} c(w_i, d) \log \left [ p(\theta_d)p(w_i | \theta_d) + p(\theta_{B})p(w_i | \theta_{B}) \right ] $, given the learned word distributions, we could perform this calculation using each Topic in turn. Simplistically, we would assign the document to the Topic that provided the largest probability score. While there are some shortcomings associated with this approach, it would suffice as a "first pass" method.
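
A minimal sketch of that "first pass" classifier, reusing calc_loglikelihood and the learned topics_word_probs from above, might look like the following. Here new_word_counts is assumed to be a hypothetical |V| by 1 count vector for a new document, built with the same preprocessing and vocabulary mapping (vm) used above.

# Sketch of a highest-likelihood "first pass" classifier using the learned
# per-topic word distributions; `new_word_counts` is a hypothetical |V| x 1
# count vector for a new document built with the same vocabulary mapping.
def classify_document(new_word_counts):
    scores = {topic: calc_loglikelihood(new_word_counts,
                                        topic_prob, word_probs,
                                        background_prob, background_word_probs)
              for topic, word_probs in topics_word_probs.items()}
    best_topic = max(scores, key=scores.get)
    return best_topic, scores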

Summary & Next Steps

One thing we may notice from these word lists is that the topics are not all that "pure": words bleed across topics in multiple cases, and some words that are highly associated with a topic do not seem to really belong to it. This is likely because, while the articles themselves are given a "hard" PhysOrg topic association, their content is not strictly limited to that class; science fields often bleed into one another, and the boundaries between them are soft, sometimes non-existent.

Perhaps we should re-examine what our goal was, though. If our goal was simply to characterize the PhysOrg topics as they are and try to understand what each topic covered, then this method succeeds in doing that. If, however, we wanted to discover the underlying themes and topics across all articles, then this method falls short. In order to do this, we would need to use something like Probabilistic Latent Semantic Analysis (PLSA). The next article in this series will analyze the PhysOrg data using PLSA.