immersinn-ds

Fri 30 December 2016

Bioinformatics Algorithms Chapter 1 Workthrough

Posted by TRII in bioinformatics   

Introduction & Overview

In the continued spirit of learning and courses, this article is the first in a series related to Bioinformatics Algorithms. In particular, our goal here is to follow along with the text Bioinformatics Algorithms: An Active Learning Approach -- which is also associated with a MOOC on Coursera -- and generate articles pertaining to each chapter and the topics covered in them. This first article relates to the chapter entitled, "Where in the Genome Does DNA Replication Begin?". Below we list the topics covered in the chapter.

Ch01: Location of oriC

Topics covered:

  • Frequent "Words" / Patterns / Motifs
    • Reverse Compliments of sub-sequences
  • $(L, t)$ - clumps
  • GC - skew
  • Approximate Pattern Matching
    • Hamming Distance

As a first pass, we use a randomly generated 'ACGT' sequence to develope code for attacking each of these problems. We then apply the tools in turn to re-find the oriC region and DnaA box for Vibro cholerae, which was covered in the text, to confirm that our methods can reproduce the results.

Then we turn our methods on two other circular bacterial genomes, those of Echera coli and Salmonella enterica, the former of which is also addressed in the text to a lesser extent, and the latter which is given as a challenge problem in the text.

We make use of the biopython library for handling the genome sequence data and also fetching the respective genomes from the Entrez site, which is a subset of the NCBI site

Additionally, we use ipyparallel for parallel processing. A nice overview of the toolkit can be found here

In [1]:
from collections import defaultdict
import itertools
In [2]:
import numpy
import scipy
import pandas
import ipyparallel as ipp
In [3]:
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
In [4]:
from Bio import Entrez
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
In [5]:
#Setup the parallel view
rc = ipp.Client()
print(rc[:])
lv = rc[:]

Generate Random "ACGT" Sequence

Random "ACTG" sequences are used for building and evaluating the methods addressed here. Each letter is given an equal likelihood of occurring as the next letter in the sequence. Sequences produced via this method are expected to have drastically different properties than naturally occuring sequences (i.e., genomes).

Thus, while from a "high-level" the random sequence looks like a real sequence and can be used to test code, the outputs and results observed will most likely not resemble each other at all.

In [6]:
def random_genome_generator(n, seed=8675309):
    numpy.random.seed(seed)
    for i in range(n):
        num = numpy.random.rand()
        if num < 0.25:
            yield 'A'
        elif num < 0.5:
            yield 'C'
        elif num < 0.75:
            yield 'G'
        else:
            yield 'T'
            
def generateSeq(length):
    seq = Seq(''.join([nuc for nuc in random_genome_generator(length)]))
    return(seq)
In [7]:
seq = generateSeq(10000)
In [8]:
seq[:50]
Out[8]:
Seq('TTCACAACAGGTCGGATTAGAATTAGATATTCCGAATGATTGCAGAGTTG', Alphabet())
In [9]:
record = SeqRecord(seq=seq)

Frequent Patterns

Perhaps one of the most straight-forward approaches to looking for important or meaningful patterns is to find all common clumps / phrases and go from there. In other words, look for all sequences in the data that occur "frequently", which is a bit of a loaded term.

Simplistically, we can think of "frequent" sequences as those that occur "a lot". In other words, we specify some minimum count threshold and look for all sequences that occur more frequently than the threshold.

Another way to think about "a lot" is "more than expected". The expected number of occurrences can be calculated by assuming that sequences are generated by randomly drawing letters from an alphabet with probabilities proportional to each letter's overall fraction in the dataset. Then, actual sequence occurrences can be compared to observed sequence occurrences.

Both methodologies have benefits and drawbacks. For simplicity, the "constant count criteria" method is used below.

In [10]:
# Simple method:
def find_motif_location(sequence, motif):
    location = -1
    len_s = len(sequence)
    len_m = len(motif)
    for i in range(len_s - len_m):
        if sequence[i:(i+len_m)] == motif:
            location = i+1
            break
    return(location)
            
            
def find_all_motif_locations(sequence, motif):
    len_s = len(sequence)
    len_m = len(motif)

    matches = []
    for i in range(len_s - len_m):
        if sequence[i:(i+len_m)] == motif:
            matches.append(i+1)  # Correct for index at 0/1
    return(matches)


# Simple Method:
def find_frequent_motifs(record, min_sup=4, limit=1000):
    seq = record.seq
    if limit == 0:
        seq = seq
    elif limit < len(seq):
        seq = seq[:limit]
    
    # We assume that all sequences contain all bp types
    alpha_beta = ["A", "C", "G", "T"]

    motifs = []
    candidates = alpha_beta
    while candidates:
        found = []
        for cand in candidates:
            count = len(find_all_motif_locations(seq, cand))
            if count >= min_sup:
                found.append(cand)
                motifs.append((cand, count))
        candidates = [f+a for a in alpha_beta \
                      for f in found \
                      if (f+a)[1:] in found]
    return(motifs)
In [11]:
# Push these functions to the view
lv.push(dict(find_all_motif_locations=find_all_motif_locations,
             find_frequent_motifs=find_frequent_motifs));

To find the total number of times a particular sub-sequence or motif occurs within a larger sequence, the entire sequence needs to be analyzed. This may be obvious, but it is an important point to make in terms of thinking about how to go about counting motifs.

Perhaps the most obvious way to do this is to move a window that is the same length as the motif being counted down the sequence and check if the sub-sequence in the window is the same as the motif being counted. This is precisly what is being done in "find_all_motif_locations", which also has an additional step of recording all locations the motif was found. This method is wrapped within the function "find_frequent_motifs" to count the occurrences of incrementially longer motifs in the master sequence.

Several approaches exist for counting motifs in a sequence in order to determine which are frequent. The brute-force, naive approach is to use the alphabet to generate all possible motifs and count each of them in turn. For a small sequence, this is doable. But this approach takes exponentially longer as the size of the master sequence increases.

One way to make the searching more efficient is to only count candidates that are possibly frequent given information about smaller candidates. In other words, do not check for longer motifs if some sub-part of that motif is not itself frequent. This idea is known as the Apriori Principle (we have previously looked at this idea. Note that this approach does not necessairly hold for methods that do not use constant cutoff criteria as infrequent motifs at one step may be sub-sets of frequent motifs at a future step.

To impliment a method utilizing the Apriori Principle, new candidates for the current step are constructed from the set of patterns deemed frequent in the previous step. The initial set of candidates is the alphabet "ACGT" in this case.

Typically, because sequence order is not necessairly important, building new candidates requires confirming that an ever-increasing collection of sub-parts of a larger pattern are frequent. Because sequence order is critical for this case, only two sub-parts need to be confirmed at each step, and candidates in step $k+1$ can be generated by appending each of "ACTG" to a frequent pattern of length $k$ and confirming that the last $k$ letters of each new sequence are also frequent.

In [12]:
start_inds = range(0,10000,1000)
inds = zip(start_inds, [s+1010 for s in start_inds])
sub_records = [record[x[0]:x[1]] for x in inds]
In [13]:
results = lv.map_sync(lambda x: find_frequent_motifs(x, min_sup=4, limit=0), sub_records)
In [14]:
motifs = defaultdict(int)
for result in results:
    for k,v in result:
        motifs[k] += v
motifs = dict(motifs)

Looking at the counts for motifs in our random sequence, we see that shorter motifs have counts expected from a random sequence. Each of the base letters make up approximatly $\frac {1} {4}$ of the total letters in the sequence, each of the letter pairs approximatly $\frac {1} {16}$, and so on.

We do note some anomolies for the longer motifs; for instance, "GGGGGGG" occurs 5 times in the data, while it is only expected to occur $0.25^7*10000 \approx 0.6$ times (see table below). Similarly, all of the length 6 frequent sequences also seem to apper more often than they should due to random chance (expected to occur about $2.44$ times). Interestingly, most of the possible length 5 sequences occur less than expected in the data. These are simply some anomolies associated with any (pseudo-) randomly generated sequences.

Also note that splitting the data into sections would normally undercount longer strands if they were split between sections. In this case, however, strands overlap by 10 nucleotides (the first 10 nucleotides of each segment are grafted onto the end of the previous segment), so short strands have the potential to be overcounted. One way around this is to instead look for sub-sequence start locations, and collect all of the start locations for each sub-sequence at the end. This is done below when we are investigating $(L,t)$-clumps

In [15]:
expected_seq_freq = pandas.DataFrame(data={'Seq Len':range(1,8),
                                           'Exp Count':['{:0.2f}'.format(0.25**i * len(record.seq)) \
                                                        for i in range(1,8)]})
expected_seq_freq
Out[15]:
Exp Count Seq Len
0 2500.00 1
1 625.00 2
2 156.25 3
3 39.06 4
4 9.77 5
5 2.44 6
6 0.61 7
In [16]:
def format_print(item, motifs, pre='', post=''):
    try:
        print(pre + 'Nuc seq {} was found {} times.'.format(item, motifs[item]) + post)
    except KeyError:
        print(pre + 'Nuc seq {} was not found in the sequence.'.format(item) + post)
In [17]:
for letter in 'ACGT':
    format_print(letter, motifs)
Nuc seq A was found 2498 times.
Nuc seq C was found 2468 times.
Nuc seq G was found 2563 times.
Nuc seq T was found 2551 times.
In [18]:
for pair in itertools.combinations('ACGT', 2):
    pair = ''.join(pair)
    riap = pair[1] + pair[0]
    format_print(pair, motifs)
    format_print(riap, motifs)
Nuc seq AC was found 603 times.
Nuc seq CA was found 606 times.
Nuc seq AG was found 630 times.
Nuc seq GA was found 640 times.
Nuc seq AT was found 665 times.
Nuc seq TA was found 653 times.
Nuc seq CG was found 641 times.
Nuc seq GC was found 630 times.
Nuc seq CT was found 594 times.
Nuc seq TC was found 607 times.
Nuc seq GT was found 631 times.
Nuc seq TG was found 633 times.
In [19]:
long_freq = defaultdict(list)
for k,v in motifs.items():
    if len(k) == 5 and v > 9:
        long_freq[len(k)].append(k)
    elif len(k) > 5 and v >=4:
        long_freq[len(k)].append(k)
for length in long_freq.keys():
    print('Sequences of len {}:'.format(length))
    for item in sorted(long_freq[length]):
        format_print(item, motifs, pre='  ')
    print
Sequences of len 5:
  Nuc seq TTTAG was found 10 times.
  Nuc seq TTTTA was found 10 times.

Sequences of len 6:
  Nuc seq CCATGC was found 4 times.
  Nuc seq CCCGTA was found 4 times.
  Nuc seq CTCCTG was found 4 times.
  Nuc seq GGGGGG was found 5 times.
  Nuc seq TAAAGC was found 4 times.
  Nuc seq TATTTA was found 4 times.

Sequences of len 7:
  Nuc seq GGGGGGG was found 4 times.

While the Apriori Principle significantly reduces the number of candidates investigated, and thus the time it takes to process a sequence, significanlty longer sequences -- such a genomes -- can still take a significant amount of time to process. Thus, the method of finding all frequent patterns is best reserved for shorter sequences.

This also requires that we think about patterns and the "frequent" idea in a slightly different manner. For instance, finding a 6-mer 4 times in a 500 nucleotide stratch is more significant than finding the motif the same number of times in an entire genome. One way of contextualizing fequency in terms of the size of the search space is the concept of an $(L, t)$-clump, which we look at next.

Additionally, we need methods that can reduce the search space so that our frequent pattern finding methods can run in a reasonable amount of time. Below we look at one such method for narrowing down the search space in circular bacterial genomes, GC Skew.

$(L, t)$ - clumps

$(L, t)$-clumps help scale the idea of Frequent Patterns to various sized search spaces. If within a strand of $L$ nucleotides, a $k$-mer is observed $t$ (or more) times, then it is an $(L, t)$. This provides both a frequency benchmark and a search space reference for evaluating and comparing patterns.

This also introduces the idea of a number of occurrances of a particular $k$-mer within the same region of a genome. For instance, a $k$-mer may appear too infrequently to make it an "interesting" pattern with regards to the the entire genome, but if all or most of those occurrences are in a very small region of the genome, then it becomes interesting.

As before for Frequent Patterns, searching the entire genome for $(L, t)$-clumps is problimatic. At first glance, one may consider simply splitting the entire sequence into clumps and finding the motifs that occur in each clump at least $t$ times. However, what about the case where a $k$-mer occurs at least $t$ times in a span of size $L$, but that span is split between two clumps? A "split-and-stich" approach could be suggested, but the problem is always present. So, we cannot get $(L, t)$-clumps from simply slitting the genome into parts and counting "intelligently".

Splitting up the data and counting $k$-mer occurrences in each sequence is only possible if, instead of simply getting the raw count for each $k$-mer, the start location of each $k$-mer instance is recorded. Then, the start locations for each $k$-mer can be analyzed to see if $t$ of them occur within a span of length $L$.

There is still an issue though. What if a $k$-mer starts in one clump and ends in the next clump? Even though that $k$-mer exists in the data, it is never recorded as it does not appear completly in any data subset analyzed. One way to get around this issue is to patch on the first $(k-1)$ nucleotides from the $(n+1)$-th clump onto the end of the $n$-th clump.

While this prevents missing out on $k$-mers, it adds in a slight complication for how $(L, t)$-clumps are determined. That is, a $k$-mer with at least $t$ start points in a span $L$ may not be completly contained within said span. This is mostly irrelevant for instances where $k \ll L$, though could cause issues when $k = O(L)$.

Because a method exists for locating $(L, t)$-clumps by splitting the overall sequence into clumps and processing them individually, parallel processing can be utilized. A similar notion applies to the post-counting processing of each $k$-mer found in the data in order to determine which have $(L, t)$-clumps.

In [20]:
def ltclump_check(clump, kmer, t):
    count = find_all_motif_locations(clump, kmer)
    if count >= t:
        return(True)
    else:
        return(False)
    
    
def count_kmers(sequence, k):
    kmer_counts = defaultdict(int)
    for start in range(len(sequence)-k):
        kmer_counts[str(sequence[start:(start+k)])] += 1
    kmer_counts = dict(kmer_counts)
    return(kmer_counts)


def find_kmer_starts(sequence, k, offset=0):
    from collections import defaultdict
    
    kmer_starts = defaultdict(list)
    for start in range(len(sequence)-k):
        kmer_starts[str(sequence[start:(start+k)])].append(start+offset)
    kmer_starts = dict(kmer_starts)
    
    return(kmer_starts)


def processClump(clump, clump_ind, clump_offset, k):
    clump = str(clump)
    L = len(clump)
    kmer_starts = find_kmer_starts(clump, k, offset=clump_offset)  
    return((clump_ind, kmer_starts))


def mergeClumps(clumps):
    # clumps = [(clump_id, kmer_starts)]
    kmer_starts = defaultdict(set)
    for clump in clumps:
        for kmer,starts in clump[1].items():
            kmer_starts[kmer].update(set(starts))
    kmer_starts = dict(kmer_starts)
    return(kmer_starts)


def detectLTClump(kmer, starts, L, t):
    lt_clumps = []
    
    if len(starts) >= t:
        starts = [s for s in sorted(list(starts))]
        ind = 0
        
        while ind <= (len(starts)-t):
            # Do at least t starts occur within L?
            if starts[ind+t-1] - starts[ind] <= L:
                span = t
                # How many starts are within L?
                while True:
                    if ind+span == len(starts):
                        break
                    elif starts[ind+span-1] - starts[ind] > L:
                        break
                    else:
                        span += 1
                lt_clumps.append(starts[ind:(ind+span)])
                
            else:
                span = 1
                
            ind += span
    return({kmer : lt_clumps})


def detectLTClumps(results, L, t):
    lt_clumps = {}
    for kmer,starts in results.items():
        lt_clumps.update(detectLTClump(kmer, starts, L, t))
    lt_clumps = {k : v for k,v in lt_clumps.items() if v}
    return(lt_clumps)
In [21]:
def findLTClumps(record, k, L, t, view):
    
    # Push these functions to the view
    view.push(dict(find_kmer_starts=find_kmer_starts,
                   processClump=processClump));
    
    # Define start, end locations for slices
    spacing = min(len(record)-k, 1000)
    start_inds = range(0,len(record),spacing)
    inds = zip(start_inds, [s+spacing+k for s in start_inds])
    sub_seqs = [record.seq[x[0]:x[1]] for x in inds]
    
    # Do stuff, merge results
    results = view.map_sync(lambda clump, clump_ind, clump_offset, k: processClump(clump, 
                                                                                   clump_ind,
                                                                                   clump_offset,
                                                                                   k),
                            sub_seqs, range(len(start_inds)), start_inds, [k for _ in start_inds])
    results = [(a,b) for a,b in results if b]
    if len(results) > 1:
        results = mergeClumps(results)
    else:
        results = results[0][1]
    
    # Get the lt_clumps
    lt_clumps = detectLTClumps(results, L, t)
    
    return(lt_clumps)
In [22]:
ltc = findLTClumps(record, k=8,  L=2000, t=3, view=lv)
In [23]:
ltc
Out[23]:
{'ATGGATCC': [[5259, 5514, 6460]],
 'GGGGCACG': [[2686, 3493, 4383]],
 'GGGGGGGG': [[1368, 1369, 1370]],
 'GTGGATCG': [[1926, 2851, 3544]],
 'GTGGGGCA': [[2684, 3491, 4381]],
 'TACACAGG': [[4569, 4980, 6231]],
 'TGGATCCA': [[4988, 5260, 6461]],
 'TGGGGCAC': [[2685, 3492, 4165, 4382]]}

GC Skew

In [24]:
def calcGCSkew(seq):
    current_skew = 0
    gc_skew = numpy.zeros((len(seq)))
    for i,nuc in enumerate(seq):
        if nuc == "G":
            current_skew += 1
        elif nuc == "C":
            current_skew -= 1
        gc_skew[i] = current_skew
    gc_skew = pandas.Series(gc_skew)
    return(gc_skew)

def find_minmax_gcskew(gc_skew):
    pos2neg = numpy.argmax(gc_skew)
    neg2pos = numpy.argmin(gc_skew)
    return({'oriC_pos' : neg2pos,
            'terC_pos' : pos2neg})
In [25]:
sequence_gcskew = calcGCSkew(record.seq)
In [26]:
# Init Plot
fig, ax = plt.subplots(figsize = (10, 6));
ax.set_title('GC Skew for Random Sequence');

# Axis Labels
ax.set_xlabel("Position");
ax.set_ylabel("GC Skew");


g,  = ax.plot(sequence_gcskew,
              c = "green",
              markersize=8);

# Set Axis limits
ax.set_xlim(0, sequence_gcskew.shape[0]+1000);

No real descernable pattern in the GC Skew plot for the random nucleotide sequence.

Approximate Matches

Because mutations of nucleotides are common in genomes, we may consider occurrences of highly-similar sequences to be one in the same for the purposes of detecting patterns in the genome.

By using approximate matching for motifs, we may be able to uncover trends and patterns that were not exposed by simply using exact sequence matches. Various methods exist for assessing partial matches. For our current purposes, the Hamming Distance is the metric we will utilize.

Hamming Distance

One method for evaluating the distance between sequences is the Hamming Distance. The Hamming Distance is the total number of mis-matched nucleotides for two sequences of the same length. This can be calculated by simply subtracting the total number of matched positions two sequences have from the length of the sequences. Motifs with small Hamming Distances can then be grouped together and occurrences of all motifs in a group can be used to determine if a Frequent Pattern or $(L,t)$-clump has been found.

In [27]:
seq1 = 'atcgatcg'
seq2 = 'acgtgtag'
In [28]:
def hammingDist(seq1, seq2):
    hd = len(seq1) - sum([a==b for a,b in zip(seq1, seq2)])
    return(hd)
In [29]:
hammingDist(seq1, seq2)
Out[29]:
5

Integrating Hamming Distance with Other Methods

Calculating the Hamming Distance for pairs of sequences is straight-forward. In order to utilize it in practice, though, it needs to be integrated with the methods discussed above. The first step in doing so is calculating the pairwise Hamming Distances for all $k$-mers of interest.

Once pairwise Hamming Distances are calculated, $k$-mers need to be grouped. Typically this is done by starting with a "base" $k$-mer and associating all other $k$-mers within a distance $d$ with that $k$-mer. Then, for each group of $k$-mers, the metric in question is calculated.

Note that to do this, the initial counting steps for each unique $k$-mer can proceede as follows. Only the count filtering and final summary steps need to be modified to account for all $k$-mers in each group.

For example, is the case of $(L,t$)-clumps, start positions are located as before. Then, for each $k$-mer group, the start locations from all distinct $k$-mers in the group are merged and ordered. The $(L,t)$-clump detection algorithm is then carried out on the new list of start positions. This process is carried out below. Note that the number of "hits" increases significantly from the previous case.

Some additional notes about the implimentation. Because such a large number of $k$-mers can be found in the data, calculating the pairwise Hamming Distances can get expensive if "for" loops are used. Thus, the "ACGT" strings are converted to numeric vectors, and scipy.spatial is utilized for calculating the distances between $k$-mers.

In [30]:
def hammingDistsLetters(k_mers):
    # Works for a small number of candidate k-mers
    dists = numpy.zeros((len(k_mers), len(k_mers)), dtype='int8')

    for i in range(len(k_mers)-1):
        seq1 = k_mers[i]
        for j in range(i+1, len(k_mers)):
            seq2 = k_mers[j]
            hd = hammingDist(seq1=seq1, seq2=seq2)
            dists[i,j] = hd
            
    dists += dists.transpose()
    
    return(dists)


def hammingDistsDigits(k_mers):
    k = len(k_mers[0])
    
    # Encode k-mers
    lookup = {'A':0, 'C':1, 'G':2, 'T':3}
    k_mers = numpy.array([[lookup[l] for l in k_mer] for k_mer in k_mers],
                         dtype=int)

    # Calculate distances
    dists = scipy.spatial.distance.squareform(\
                                              scipy.spatial.distance.pdist(k_mers, metric='hamming'))
    dists = dists * k
    dists = dists.astype('int16')
    return(dists)



def groupSimilarKMers(k_mers, dmax=1, dmetric='hamming'):
    
    # Calculate pairwise distances
    if dmetric == 'hamming':
        dists = hammingDistsDigits(k_mers)
    
    # Find all close k_mers to each k_mer (including self)
    kmer_groups = {}
    for i,kmer in enumerate(k_mers):
        close = [k_mers[k] for k,val in enumerate(list(dists[i,:] <= dmax)) if val]
        kmer_groups[kmer] = close

    return(kmer_groups)
In [31]:
def mergeSimilarKMerAttr(kmer_groups, attr_values, attr='starts', dmax=1):
        
    new_attr_values = {}
    for k_mer, sim_kmers in kmer_groups.items():
        
        if attr == 'counts':
            # count is an int
            new_value = 0
            for km in sim_kmers:
                new_value += attr_values[km]
        elif attr == 'starts':
            # Starts are a set
            new_value = set()
            for km in sim_kmers:
                new_value.update(attr_values[km])
                
        new_attr_values[k_mer] = new_value
            
    return(new_attr_values)
In [32]:
def findApproxLTClumps(record, k, L, t, view, dmax=1):
    
    # Push these functions to the view
    view.push(dict(find_kmer_starts=find_kmer_starts,
                 processClump=processClump));
    
    
    # Define start, end locations for slices
    spacing = min(len(record)-k, 1000)
    start_inds = range(0,len(record),spacing)
    inds = zip(start_inds, [s+spacing+k for s in start_inds])
    sub_seqs = [record.seq[x[0]:x[1]] for x in inds]
    
    # Do stuff, merge results
    results = view.map_sync(lambda clump, clump_ind, clump_offset, k: processClump(clump, 
                                                                                 clump_ind, 
                                                                                 clump_offset,
                                                                                 k),
                          sub_seqs, range(len(start_inds)), start_inds, [k for _ in start_inds])    
    results = [(a,b) for a,b in results if b]
    if len(results) > 1:
        results = mergeClumps(results)
    else:
        results = results[0][1]
    
    # Get groups and merge starts for approx k-mers
    k_mers = [km for km in results.keys()]
    kmer_groups = groupSimilarKMers(k_mers, dmax=dmax)
    results = mergeSimilarKMerAttr(kmer_groups, results, attr='starts', dmax=dmax)
    
    # Get the lt_clumps
    lt_clumps = detectLTClumps(results, L, t)
    
    # Add info about similar sequences
    lt_clumps = {k : {'clumps' : c,
                      'sim_kmers' : akm} \
                 for k,c,akm in zip(lt_clumps.keys(), 
                                    lt_clumps.values(),
                                    [kmer_groups[km] for km in lt_clumps.keys()])
                }
    
    return(lt_clumps)
In [33]:
ltc_approx = findApproxLTClumps(record=record, k=8, L=1000, t=6, view=lv)
In [34]:
for k in ltc_approx.keys():
    print('k-mer {} has clumps {}'.format(k, ltc_approx[k]['clumps']))
k-mer GGATAGCA has clumps [[1096, 1111, 1291, 1436, 1545, 1721, 7203]]
k-mer ATGCAGTC has clumps [[8447, 8574, 8871, 9049, 9120, 9239, 9313]]
k-mer TCAGCCCG has clumps [[6885, 7321, 7651, 7829, 7843, 7861]]
k-mer CGGCGGTA has clumps [[7623, 7633, 7929, 8057, 8157, 8227]]

Finding oriC in V. cholerae

Now that we have our tools in place, we will use them to attempt to locate the oriC and DnaA boxes in a strand of V. cholerae that is similar to but not the same as the stand used in the text as an example.

Retrieve Genome

The first step is to obtain the genome for the strand of V. cholera of interest. The NCBI page for the strand being used can be found here. Note that the strand is itentified by a unique ID -- CP013302.1 -- that can be used by the Biopython package to retrieve the genome from Entrez. This is done below.

In [35]:
#vibcho_id = "CP007653.1"
#vibcho_id = "CP007654.1"
vibcho_id = "CP013302.1"
In [36]:
db = "nucleotide"
rettype = "gb"
retmode = "text"
In [37]:
Entrez.email = "immersinn@gmail.com"
In [38]:
handle = Entrez.efetch(db=db, id=vibcho_id, rettype=rettype, retmode=retmode)
vibcho_rec = SeqIO.read(handle, rettype)
handle.close()
In [39]:
def print_record_descrip(record):
    print("{} ({}...) with {} features".format(record.id,
                                               record.description[:50],
                                               len(record.features)))
In [40]:
print_record_descrip(vibcho_rec)
CP013302.1 (Vibrio cholerae strain C5 chromosome 2, complete s...) with 2369 features

Step Through the Process

Once the genome has been obtained, the search for oriC and DnaA boxes can begin.

The first step here is to search for the oriC region by using GC Skew. After a reasonable region for oriC has been obtained via this method, the next step is to look for $(L,t)$-clumps in the region. Two different $L$ values are tested, 300 and 500.

Next, the nucleotide sequences that have $(L,t)$-clumps are checked to see if their revere complement also has a $(L,t)$-clump in the same region. This is a halmark feature of a DnaA box.

By using $L = 300$, the sequences (the main sequence and its reverse complement) corresponding to the DnaA box described in the text are recovered.

Calculate & Plot GC Skew

By using the GC Skew tools from above, the start of the oriC region seems to be around nucleotide 80.

In [41]:
vibcho_gcskew = calcGCSkew(vibcho_rec.seq)
In [42]:
# Init Plot
fig, ax = plt.subplots(figsize = (10, 6));
ax.set_title('GC Skew for V. cholerae');

# Axis Labels
ax.set_xlabel("Position");
ax.set_ylabel("GC Skew");


g,  = ax.plot(vibcho_gcskew,
              c = "green",
              markersize=8);

# Set Axis limits
ax.set_xlim(0, vibcho_gcskew.shape[0]+1000);

Approx oriC

In [43]:
vibcho_out = find_minmax_gcskew(vibcho_gcskew)
vibcho_out['oriC_pos']
Out[43]:
78

Look for (Approximate) $(L, t)$ - clumps in the proposed oriC region

In [44]:
t = 2
dmax = 2

$L = 500$

In [45]:
L = 500
In [46]:
start = vibcho_out['oriC_pos'] - 0
stop = vibcho_out['oriC_pos'] + L
In [47]:
ltcs_500 = {}
for k in range(8,11):
    ltcs_500[k] = findApproxLTClumps(vibcho_rec[start:stop], k=k,  L=L, t=t, view=lv, dmax=dmax)

$L = 300$

In [48]:
L = 300
In [49]:
start = vibcho_out['oriC_pos'] - 0
stop = vibcho_out['oriC_pos'] + L
In [50]:
ltcs_300 = {}
for k in range(8,11):
    ltcs_300[k] = findApproxLTClumps(vibcho_rec[start:stop], k=k,  L=L, t=t, view=lv, dmax=dmax)

Look for Rev Comp Pairs

In [51]:
for L,ltcs in [(500, ltcs_500), (300, ltcs_300)]:
    print('L={} results:'.format(L))
    
    for k in ltcs.keys():
        print('  results for k={}:'.format(k))
    
        for seq in ltcs[k]:
            try:
                rc = str(Seq(seq).reverse_complement())
                count = len(ltcs[k][rc])
                print("\tSeq {} ({}) has rev comp {} ({}) in data".format(str(seq), len(ltcs[k][seq]['clumps'][0]),
                                                                          str(rc), len(ltcs[k][rc]['clumps'][0])))
            except KeyError:
                pass
    print
L=500 results:
  results for k=8:
	Seq ATCTTCAA (7) has rev comp TTGAAGAT (6) in data
	Seq TTGAAGAT (6) has rev comp ATCTTCAA (7) in data
	Seq ATGATCAA (14) has rev comp TTGATCAT (9) in data
	Seq TTGATCAT (9) has rev comp ATGATCAA (14) in data
	Seq TGATCAAG (9) has rev comp CTTGATCA (5) in data
	Seq GATCTTCA (4) has rev comp TGAAGATC (7) in data
	Seq ATTGAAGA (7) has rev comp TCTTCAAT (5) in data
	Seq CTTGATCA (5) has rev comp TGATCAAG (9) in data
	Seq AAGATCTT (3) has rev comp AAGATCTT (3) in data
	Seq TGAAGATC (7) has rev comp GATCTTCA (4) in data
	Seq TCTTCAAT (5) has rev comp ATTGAAGA (7) in data
	Seq GAAGATCT (4) has rev comp AGATCTTC (3) in data
	Seq AGATCTTC (3) has rev comp GAAGATCT (4) in data
  results for k=9:
	Seq ATCTTCAAT (4) has rev comp ATTGAAGAT (3) in data
	Seq AGATCTTCA (2) has rev comp TGAAGATCT (2) in data
	Seq TTGAAGATC (2) has rev comp GATCTTCAA (3) in data
	Seq ATTGAAGAT (3) has rev comp ATCTTCAAT (4) in data
	Seq TGAAGATCT (2) has rev comp AGATCTTCA (2) in data
	Seq ATGATCAAG (5) has rev comp CTTGATCAT (2) in data
	Seq CTTGATCAT (2) has rev comp ATGATCAAG (5) in data
	Seq GATCTTCAA (3) has rev comp TTGAAGATC (2) in data
  results for k=10:

L=300 results:
  results for k=8:
	Seq CTTGATCA (3) has rev comp TGATCAAG (6) in data
	Seq ATGATCAA (7) has rev comp TTGATCAT (5) in data
	Seq TTGATCAT (5) has rev comp ATGATCAA (7) in data
	Seq TGATCAAG (6) has rev comp CTTGATCA (3) in data
  results for k=9:
	Seq ATGATCAAG (3) has rev comp CTTGATCAT (2) in data
	Seq CTTGATCAT (2) has rev comp ATGATCAAG (3) in data
  results for k=10:

For both $L = 300$ and $L = 500$, the sequence presented in the book -- "ATGATCAAG" -- is found. When the range is set to 500 away from the proposed oriC, the target DnaA box sequence is not as distinct, as 3 other sequences also appear with reverse compliments, though the target sequence is the most frequent 9-mer observed. Reducing the range to only 300 away from the proposed oriC removes all other candidates and the target sequence is a clear winner

An algorithm may be proposed from thses findings. First, $L$ can be increased within reason until a positive hit is found, perhaps using the values $(200, 300, 400, 500)$. Additionally, for each $L$ value choses, the $dmax$ value -- the value specifying the maximum Hamming Distance between grouped $k$-mers -- can be increased from 0 to 2. By varying these parameters from most to least restrictive, we should be able to find a reasonable candidate for the DnaA box in other circular bacterial genomes.

In [52]:
def locateDnaaBox(record, k=9, t=2):
    
    gcs = calcGCSkew(record.seq)
    start = find_minmax_gcskew(gcs)['oriC_pos']
    
    hits = []
    
    for L in [200, 300, 400, 500]:
        stop = start + L
        for dmax in [0, 1, 2]:
            candidates = findApproxLTClumps(record[start:stop], k=k,  L=L, t=t, view=lv, dmax=dmax)
            
            for seq in candidates:
                try:
                    seq_count = len(candidates[seq]['clumps'][0])
                    rc = str(Seq(seq).reverse_complement())
                    rc_count = len(candidates[rc]['clumps'][0])
                    
                    hits.append((L, dmax, seq, seq_count, rc, rc_count))
                except KeyError:
                    pass
    for hit in hits:
        print(hit)
In [53]:
locateDnaaBox(vibcho_rec)
(300, 2, 'ATGATCAAG', 3, 'CTTGATCAT', 2)
(300, 2, 'CTTGATCAT', 2, 'ATGATCAAG', 3)
(400, 2, 'ATCTTCAAT', 4, 'ATTGAAGAT', 3)
(400, 2, 'TTGAAGATC', 2, 'GATCTTCAA', 3)
(400, 2, 'AGATCTTCA', 2, 'TGAAGATCT', 2)
(400, 2, 'ATTGAAGAT', 3, 'ATCTTCAAT', 4)
(400, 2, 'TGAAGATCT', 2, 'AGATCTTCA', 2)
(400, 2, 'ATGATCAAG', 5, 'CTTGATCAT', 2)
(400, 2, 'CTTGATCAT', 2, 'ATGATCAAG', 5)
(400, 2, 'GATCTTCAA', 3, 'TTGAAGATC', 2)
(500, 2, 'ATCTTCAAT', 4, 'ATTGAAGAT', 3)
(500, 2, 'AGATCTTCA', 2, 'TGAAGATCT', 2)
(500, 2, 'TTGAAGATC', 2, 'GATCTTCAA', 3)
(500, 2, 'ATTGAAGAT', 3, 'ATCTTCAAT', 4)
(500, 2, 'TGAAGATCT', 2, 'AGATCTTCA', 2)
(500, 2, 'ATGATCAAG', 5, 'CTTGATCAT', 2)
(500, 2, 'CTTGATCAT', 2, 'ATGATCAAG', 5)
(500, 2, 'GATCTTCAA', 3, 'TTGAAGATC', 2)

Repeat for E. coli & S. enterica

E. Coli

Retrieve Genome

In [54]:
ecoli_id = "BA000007.2"
In [55]:
handle = Entrez.efetch(db=db, id=ecoli_id, rettype=rettype, retmode=retmode)
ecoli_rec = SeqIO.read(handle, rettype)
handle.close()
In [56]:
print_record_descrip(ecoli_rec)
BA000007.2 (Escherichia coli O157:H7 str. Sakai DNA, complete ...) with 11109 features

GC Skew Plot

In [57]:
ecoli_gcskew = calcGCSkew(ecoli_rec.seq)
In [58]:
# Init Plot
fig, ax = plt.subplots(figsize = (10, 6));
ax.set_title('GC Skew for E. coli');

# Axis Labels
ax.set_xlabel("Position");
ax.set_ylabel("GC Skew");


g,  = ax.plot(ecoli_gcskew,
              c = "purple",
              markersize=8);

# Set Axis limits
ax.set_xlim(0, ecoli_gcskew.shape[0]+1000);
In [59]:
locateDnaaBox(ecoli_rec)
(400, 1, 'GTGGATAAC', 2, 'GTTATCCAC', 2)
(400, 1, 'GTTATCCAC', 2, 'GTGGATAAC', 2)
(400, 1, 'TGTGGATAA', 2, 'TTATCCACA', 2)
(400, 1, 'TTATCCACA', 2, 'TGTGGATAA', 2)
(400, 2, 'TGGATAACT', 2, 'AGTTATCCA', 2)
(400, 2, 'AGTTATCCA', 2, 'TGGATAACT', 2)
(400, 2, 'TGTGGATAA', 4, 'TTATCCACA', 2)
(400, 2, 'TTATCCACA', 2, 'TGTGGATAA', 4)
(400, 2, 'GTGGATAAC', 2, 'GTTATCCAC', 2)
(400, 2, 'GTTATCCAC', 2, 'GTGGATAAC', 2)
(400, 2, 'CTGTGGATA', 4, 'TATCCACAG', 2)
(400, 2, 'TATCCACAG', 2, 'CTGTGGATA', 4)
(500, 1, 'GTGGATAAC', 2, 'GTTATCCAC', 2)
(500, 1, 'GCTGGGATC', 2, 'GATCCCAGC', 2)
(500, 1, 'GATCCCAGC', 2, 'GCTGGGATC', 2)
(500, 1, 'GTTATCCAC', 2, 'GTGGATAAC', 2)
(500, 1, 'TGTGGATAA', 2, 'TTATCCACA', 2)
(500, 1, 'TTATCCACA', 2, 'TGTGGATAA', 2)
(500, 2, 'TGTTGATCT', 3, 'AGATCAACA', 2)
(500, 2, 'GCTGGGATC', 2, 'GATCCCAGC', 2)
(500, 2, 'CTGTGGATA', 4, 'TATCCACAG', 2)
(500, 2, 'AAGATCAAC', 2, 'GTTGATCTT', 4)
(500, 2, 'AGATCAACA', 2, 'TGTTGATCT', 3)
(500, 2, 'GTTGATCTT', 4, 'AAGATCAAC', 2)
(500, 2, 'GATCCCAGC', 2, 'GCTGGGATC', 2)
(500, 2, 'TGGATAACT', 2, 'AGTTATCCA', 3)
(500, 2, 'AGTTATCCA', 3, 'TGGATAACT', 2)
(500, 2, 'TTATCCACA', 3, 'TGTGGATAA', 4)
(500, 2, 'GTGGATAAC', 2, 'GTTATCCAC', 3)
(500, 2, 'TGTGGATAA', 4, 'TTATCCACA', 3)
(500, 2, 'GTTATCCAC', 3, 'GTGGATAAC', 2)
(500, 2, 'TATCCACAG', 2, 'CTGTGGATA', 4)

Using the methodology proposed above from experince with the V. cholerae example, a number of candidates for the E. coli DnaA box are found. No clear "winner" emerges from the search, though the experimentally verified DnaA box -- "TGTGGATAA" (or "TTATCCACA") -- is contained within the highly likely candidates, as two copies exist within 400 nucleotides of the proposed oriC location.

From these observations, we can conclude that GC Skew paired with approximate $(L,t)$-clump finding is not sufficient to definatively locate DnaA boxes, though it does seem to greatly reduce the number of candidates for consideration.

S. enterica

Retrieve Genome

In [60]:
salent_id = "CP003278.1"
In [61]:
handle = Entrez.efetch(db=db, id=salent_id, rettype=rettype, retmode=retmode)
salent_rec = SeqIO.read(handle, rettype)
handle.close()
In [62]:
print_record_descrip(salent_rec)
CP003278.1 (Salmonella enterica subsp. enterica serovar Typhi ...) with 9674 features

GC Skew Plot

In [63]:
salent_gcskew = calcGCSkew(salent_rec.seq)
In [64]:
# Init Plot
fig, ax = plt.subplots(figsize = (10, 6));
ax.set_title('GC Skew for S. enterica');

# Axis Labels
ax.set_xlabel("Position");
ax.set_ylabel("GC Skew");


g,  = ax.plot(salent_gcskew,
              c = "red",
              markersize=8);

# Set Axis limits
ax.set_xlim(0, salent_gcskew.shape[0]+1000);
In [65]:
locateDnaaBox(salent_rec)
(500, 1, 'GGATCCGGG', 3, 'CCCGGATCC', 2)
(500, 1, 'CCCGGATCC', 2, 'GGATCCGGG', 3)
(500, 1, 'TGTGGATAA', 2, 'TTATCCACA', 2)
(500, 1, 'TTATCCACA', 2, 'TGTGGATAA', 2)
(500, 2, 'CGGATCCTG', 3, 'CAGGATCCG', 2)
(500, 2, 'GGATCCGGG', 3, 'CCCGGATCC', 3)
(500, 2, 'CCCGGATCC', 3, 'GGATCCGGG', 3)
(500, 2, 'TTATCCACA', 3, 'TGTGGATAA', 2)
(500, 2, 'TGTGGATAA', 2, 'TTATCCACA', 3)
(500, 2, 'CAGGATCCG', 2, 'CGGATCCTG', 3)

As with E. coli, the results for DnaA candidates is a bit mixed. No positive matches occur within 400 nucleotides of the proposed oriC location, though two candidates are present within 500 nucleotides for single-mismatch patterns. Again we find that one of the two matches -- "TTATCCACA" -- matches the proposed DnaA box for S. enterica based on where the chromosomal replication initiator protein DnaA binds to the genome.