The Source Code of Life: Using Python to Explore Our DNA¶

xkcd_dna Researchers just found the gene responsible for mistakenly thinking we've found the gene for specific things. It's the region between the start and the end of every chromosome, plus a few segments in our mitochondria.

Bioinformatics¶

HINT: If you're viewing this notebook as slides, press the "s" key to see a bunch of extra notes. dna_code Image: http://www.sciencemag.org/sites/default/files/styles/article_main_large/public/images/13%20June%202014.jpg?itok=DPBy5nLZ

What is DNA?¶

A long string of A, C, G, and T (bases)

ACGTTCATGG           <- Ten bases

ACGTTCATGGATGTGACCAG <- Twenty bases

etc...

Well, not really a string... Double helix Two strings stuck together? But if you know one you know the other...

The Human Genome¶

It's big. How big?

About 3,200,000,000bp (bp is base pairs). Actual file format commonly used in bioinformatics (FASTA):

>Sequence0
TTTCTGACTAACACTACAATTACCACTTGATGTTACCGACTAAGTGGTACGACTTGCTAGAACCGACTCTCGTACGTAT
CGCAGACTAGTGCGCGCGCTTAGTGACTATACTAGAATATACCTGGGGCCCAAGGAGTGTCGGGCGATCGTCCTTGAAA
TAAATATCTCAACCATCGTCATCTAGGGGGAACAGAGCGGTGGGCAGGTCCCAACCTGTTTATTTGTGTTGCTAACACT
ACGGCGCAGCTGCTCAAGTAGGTGCGATTATCGAGTAGAGGCTCCACCGGCTCTATGTGCCACGCATCTACTGAACCGA
ATTCTATCCCTGATACTCCAGAAGGTCGCAGGTTTACAGACACGTTTCAGCTCGAGAGGCCATCGATTATCTTAATATA
CCACACTGCCGAATAGCATGCCCGTAGAATCCAAGCCACGAGATAGCGTTACTTAATGAGTACCCAACGCAAATGAGGT
TGATTATCCCTAACCTGCAATCTAGGCCTTGTTCTGGAGGGGGTTATCCTTTATAGTTGATTACTTACACTCACCATGT
TCGTAGTCGGAACTCACCGATTAAGACCGATTTTACTATGGGAAGGCCAGGTTACACCTGTTTCGGGGGGGCCGCGGCG
GGTTACTTTAACCTGTCCATCCATCAGTCACTGGGCGCCAAGATTCTCCTATAGTTATATCCGCCCTTTGATTTAAACC
TAGGCCTACCTCAACGAACTGGGCCATGGGGTTCACACAGAAACAAGGGGGATAGACAGTCTTATTGAGCGCTTCTGAA
CAGCGTGTGTTCACGGTACGGCAATACCACCAGTAAACCGAGAACAGTGTTGAAGGTGATCGAACACGTGTTTTCTTCA
CCGTAGGGCTTCTAGGGAGTATCGCCCCCATATAGGCAGACGAGAAGGACTGTCACGCGCGGAGATCGATAATACGTAT
AACACAAGCACAGTAACTGCCCCGACCGGCTAAAGGACGTGGCCCAGTGTACCCAACGTACGTAATTGCAAGAGGTCTG
TCTGTCATCCCGAGGACTGCTTCTATAACTCGTTGAGGGCACTAGGCTTGAGACAATCAGCTTCGCTCGTCACGATTTT
ACTTTTTTCCTGGAAAAGCCCCCCCACAGACTATCAGGTCGCGCTTACCATACCAGTCCTTCTTGATAAGCCAATCCGT
ATTAGGTAGATTAAGCTGACAGTCGGGGCGACTCTTTGGAAACAGTATTCCCGTTTCGGGCACCTAGGATTCAGGCTTG
TACAACGATCATAGACGTCGCGGAAAGAAATAGCACAGTGTAGGAGCTGGTCGTGACCCGTGCTGTCAAGTTTATTGCA
CGGCTTGCTAAAAGGTACAGTGTAACGTTTCACAAACAAGCGAGACCCATTGTTGGTCTAACGCTATCGTACTTGATAC
CAGCCTGTGACGTCACGCGAAATCGTCTGTATAACTAGTTCTTCCCCGACTGCCACGGTATCCCAAAATTACATACTGA
CAGGACCTCTTCCATATTCATCAGGACTCGACGAAGCGCGCCCCGTGTAGTACGCGAAAATTATACCGTCCGTAGGTAC

Now picture this 2 million more times.

https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology

It's time to learn something called "The Central Dogma of Molecular Biology," since that's a lot cooler-sounding than "what DNA does." Don't worry, we'll keep to the very basics. There are two steps, "transcription" and "translation." Transcription basically involves making a copy of a chunk of DNA, except that you only copy one side of it (it looks like half a strand of DNA). We call this copy RNA. Translation involves reading the pattern of the DNA into protein. What is protein? Basically most of what we're made of is either protein or is made by proteins (excluding water, of course). So that's basically how your source code works: your DNA tells your body how to make proteins, which are like parts of a machine. These little parts make little machines, which are part of bigger machines, etc. etc. until all the parts fit together to make you! (But how do the parts know how to fit together you ask? Umm...great question...that we're not going to cover today. Next slide!)

Sequencing DNA¶

In [4]:

YouTubeVideo('fCd6B5HRaZ8', start=135)

Out[4]:

Ok, now you know something about DNA, so we can start getting into some of the fun puzzles that this leads to. So how do we find out what someone's DNA sequence is? There are several methods, including some newer ones that I'm really excited about, but we'll stick with the most popular here: Illuimina's sequencing by synthesis. It's probably so popular because it's fast and keeps getting cheaper. The reason it's so fast is because it's parallelized. It works by breaking the DNA up into little chunks and then looking at all the chunks at the same time. Basically you have a whole bunch of these fragments on a slide and then flood the slide with a whole bunch of one base (say A). The base is modified slightly so that it fluoresces a certain color when illuminated by a certain wavelength and also so that no other bases can attach after it. Then the excess bases are washed off, the slide is imaged to see which fragments got a base added, and then the fluorescent part of the new addition as well as the part that blocks the next base are removed and the process is repeated with another base (say C). Repeat until you've got most of the fragments.

Sequence Alignment¶

Now things start to get fun

That's a nice-looking lake... lake

Until now! lake_puzzle

First step: Assign scores¶

If a base matches with itself, it gets a score of one. Otherwise it gets a score of zero.

	A	C	G	T
A	1	0	0	0
C	0	1	0	0
G	0	0	1	0
T	0	0	0	1

So these two sequences:

AACTGTGGTAC
ACTTGTGGAAC
10011111011

have an alignment score of eight.

What about gaps?¶

Yes, unfortunately there are going to be gaps.

We'll add a "gap penalty" of -3.

So the score for this alignment:

ACTACA-ACGTTGAC
A-TAGAAACGCT-AC
1 1101 11101 11
-3   -3    -3

is just one

Scoring matrix¶

	-	A	C	G	T	T	T	G	T	C	G	C
-	0	-3	-6	-9	-12	-15	-18	-21	-24	-27	-30	-33
A	-3
C	-6
T	-9
T	-12
T	-15
C	-18
T	-21
G	-24
C	-27

$$ S_{m,n}=\left\{ \begin{array}{ll} S_{m-1,n} + gap\\ S_{m,n-1} + gap\\ S_{m-1,n-1} + B(a,b)\\ \end{array} \right. $$

	-	A	C	G	T	T	T	G	T	C	G	C
-	0	-3	-6	-9	-12	-15	-18	-21	-24	-27	-30	-33
A	-3	1	-2	-5	-8	-11	-14	-17	-20	-23	-26	-29
C	-6
T	-9
T	-12
T	-15
C	-18
T	-21
G	-24
C	-27

We need to keep track of where that score came from.

	-	A	C	G	T	T	T	G	T	C	G	C
-	0	←-3	←-6	←-9	←-12	←-15	←-18	←-21	←-24	←-27	←-30	←-33
A	↑-3	↖1	←-2	←-5	←-8	←-11	←-14	←-17	←-20	←-23	←-26	←-29
C	↑-6
T	↑-9
T	↑-12
T	↑-15
C	↑-18
T	↑-21
G	↑-24
C	↑-27

	-	A	C	G	T	T	T	G	T	C	G	C
-	0	←-3	←-6	←-9	←-12	←-15	←-18	←-21	←-24	←-27	←-30	←-33
A	↑-3	↖1	←-2	←-5	←-8	←-11	←-14	←-17	←-20	←-23	←-26	←-29
C	↑-6	↑-2	↖2	←1	←-4	←-7	←-10	←-13	←-16	↖-19	←-22	↖-25
T	↑-9	↑-5	↑-1	↖2	↖0	←-3	←-6	←-9	←-12	←-15	←-18	←-21
T	↑-12	↑-8	↑-4	↑-1	↖3	↖1	←-2	←-5	↖-8	←-11	←-14	←-17
T	↑-15	↑-11	↑-7	↖-4	↑0	↖4	↖2	←-1	↖-4	←-7	←-10	←-13
C	↑-18	↑-14	↖-10	↖-7	↑-3	↑1	↖4	↖2	↖-1	↖-3	←-6	←-9
T	↑-21	↑-17	↑-13	↑-10	↑-6	↑-2	↖2	↖4	↖3	←0	↖-3	←-6
G	↑-24	↑-20	↑-16	↖-12	↑-9	↑-5	↑-1	↖3	↖4	↖3	↖1	←-2
C	↑-27	↑-23	↖-19	↑-15	↖-12	↑-8	↑-4	↑0	↖3	↖5	↖3	↖2

Now just follow the arrows from that bottom-right corner back to the top-left zero.

ACGTTTGTCGC
|| ||| | ||
AC-TTTCT-GC

Let's code this!¶

See Python for Bioinformatics for the inspiration for this demo.

Here is the "substitution matrix" and its corresponding "alphabet":

In [5]:

dna_sub_mat = np.array(
    [[ 1, 0, 0, 0],
     [ 0, 1, 0, 0],
     [ 0, 0, 1, 0],
     [ 0, 0, 0, 1]])

dbet = 'ACGT'

And here are we calculate the scores and arrows as separate matrices

In [6]:

def nw_alignment(sub_mat, abet, seq1, seq2, gap=-8):
    # Get the lengths of the sequences
    seq1_len, seq2_len = len(seq1), len(seq2)
    # Create the scoring and arrow matrices
    score_mat = np.zeros((seq1_len+1, seq2_len+1), int)
    arrow_mat = np.zeros((seq1_len+1, seq2_len+1), int)
    # Fill first column and row of score matrix with scores based on gap penalty
    score_mat[0] = np.arange(seq2_len+1) * gap
    score_mat[:,0] = np.arange(seq1_len+1) * gap
    # Fill top row of arrow matrix with ones (left arrow)
    arrow_mat[0] = np.ones(seq2_len+1)
    for seq1_pos, seq1_letter in enumerate(seq1):
        for seq2_pos, seq2_letter in enumerate(seq2):
            f = np.zeros(3)
            # Cell above + gap penalty
            f[0] = score_mat[seq1_pos,seq2_pos+1] + gap
            # Cell to left + gap penalty
            f[1] = score_mat[seq1_pos+1,seq2_pos] + gap
            n1 = abet.index(seq1_letter)
            n2 = abet.index(seq2_letter)
            # Cell to upper-left + alignment score
            f[2] = score_mat[seq1_pos,seq2_pos] + sub_mat[n1,n2]
            score_mat[seq1_pos+1, seq2_pos+1] = f.max()
            arrow_mat[seq1_pos+1, seq2_pos+1] = f.argmax()
    return score_mat, arrow_mat

What does our result look like?

In [7]:

s1 = 'ACTTCTGC'

s2 = 'ACGTTTGTCGC'

score_mat, arrow_mat = nw_alignment(dna_sub_mat, dbet, s1, s2, gap=-3)
print(score_mat)
print(arrow_mat)

[[  0  -3  -6  -9 -12 -15 -18 -21 -24 -27 -30 -33]
 [ -3   1  -2  -5  -8 -11 -14 -17 -20 -23 -26 -29]
 [ -6  -2   2  -1  -4  -7 -10 -13 -16 -19 -22 -25]
 [ -9  -5  -1   2   0  -3  -6  -9 -12 -15 -18 -21]
 [-12  -8  -4  -1   3   1  -2  -5  -8 -11 -14 -17]
 [-15 -11  -7  -4   0   3   1  -2  -5  -7 -10 -13]
 [-18 -14 -10  -7  -3   1   4   1  -1  -4  -7 -10]
 [-21 -17 -13  -9  -6  -2   1   5   2  -1  -3  -6]
 [-24 -20 -16 -12  -9  -5  -2   2   5   3   0  -2]]
[[1 1 1 1 1 1 1 1 1 1 1 1]
 [0 2 1 1 1 1 1 1 1 1 1 1]
 [0 0 2 1 1 1 1 1 1 1 1 1]
 [0 0 0 2 2 1 1 1 1 1 1 1]
 [0 0 0 0 2 2 1 1 1 1 1 1]
 [0 0 0 0 0 2 2 1 1 2 1 1]
 [0 0 0 0 0 2 2 1 2 1 1 1]
 [0 0 0 2 0 0 0 2 1 1 2 1]
 [0 0 0 0 0 0 0 0 2 2 1 2]]

Now we need a way to get the sequences back out of our scoring matrix:

In [8]:

def backtrace(arrow_mat, seq1, seq2):
    align1, align2 = '', ''
    align1_pos, align2_pos = arrow_mat.shape
    align1_pos -= 1
    align2_pos -= 1
    selected = []
    while True:
        selected.append((align1_pos, align2_pos))
        if arrow_mat[align1_pos, align2_pos] == 0:
            # Up arrow, add gap to align2
            align1 += seq1[align1_pos-1]
            align2 += '-'
            align1_pos -= 1
        elif arrow_mat[align1_pos, align2_pos] == 1:
            # Left arrow, add gap to align2
            align1 += '-'
            align2 += seq2[align2_pos-1]
            align2_pos -= 1
        elif arrow_mat[align1_pos, align2_pos] == 2:
            # Up-arrow arrow, no gap
            align1 += seq1[align1_pos-1]
            align2 += seq2[align2_pos-1]
            align1_pos -= 1
            align2_pos -= 1
        if align1_pos==0 and align2_pos==0:
            break
    # reverse the strings
    return align1[::-1], align2[::-1], selected

In [9]:

a1, a2, selected = backtrace(arrow_mat, s1, s2)
print(a1)
print(a2)

AC-TTC-T-GC
ACGTTTGTCGC

Sometimes it's nice to see the scoring matrix, though, so here's a function to visualize it

In [10]:

def visual_scoring_matrix(seq1, seq2, score_mat, arrow_mat):
    visual_mat = []
    for i, row in enumerate(arrow_mat):
        visual_mat_row = []
        for j, col in enumerate(row):
            if col == 0:
                arrow = '↑'
            elif col == 1:
                arrow = '←'
            else:
                arrow = '↖'
            visual_mat_row.append(arrow + ' ' + str(score_mat[i,j]))
        visual_mat.append(visual_mat_row)
    visual_mat = np.array(visual_mat, object)

    tab = plt.table(
        cellText=visual_mat,
        rowLabels=['-'] + list(s1),
        colLabels=['-'] + list(s2),
        loc='center')
    tab.scale(2, 3)
    tab.set_fontsize(30)
    plt.axis('tight')
    plt.axis('off')
    
    align1, align2, selected = backtrace(arrow_mat, seq1, seq2)
    for pos in selected:
        y, x = pos
        tab._cells[(y+1, x)]._text.set_color('green')
        tab._cells[(y+1, x)]._text.set_weight(1000)
    plt.show()

In [11]:

visual_scoring_matrix(s1, s2, score_mat, arrow_mat)

Let's generate some sequences and see how fast this is

In [17]:

def random_dna_seq(length=100):
    seq = [random.choice(dbet) for x in range(length)]
    return ''.join(seq)

def mutate_dna_seq(seq, chance=1/5):
    mut_seq_base = [random.choice(dbet) if random.random() < chance else x for x in seq]
    mut_seq_indel = [random.choice(('', x + random.choice(dbet))) if random.random() < chance else x for x in mut_seq_base]
    return ''.join(mut_seq_indel)

In [19]:

s1 = random_dna_seq()
s2 = mutate_dna_seq(s1)
print(s1[:25], len(s1))
print(s2[:25], len(s2))
a = %timeit -o nw_alignment(dna_sub_mat, dbet, s1, s2, gap=-3)
print('{:.1f} years for the whole genome'.format(a.best * 2300000000 / 60 / 60 / 24 / 365.25))

ATGAACCTGAGGGATTGCCATGCCG 100
AATGACACTGAGGGCATTGCCCATA 104
10 loops, best of 3: 90 ms per loop
6.6 years for the whole genome

Let's make it faster! (Just because)

	-	A	C	G	T	T	T	G	T	C	G	C
-	0	←-3	←-6	←-9	←-12	←-15	←-18	←-21	←-24	←-27	←-30	←-33
A	↑-3	↖1	←-2
C	↑-6	↑-2
T	↑-9
T	↑-12
T	↑-15
C	↑-18
T	↑-21
G	↑-24
C	↑-27

We can't calculate a whole row or column at a time because the values depend on those in the same row/column. But what about diagonals?

We'll "get rid of" our nested loop (really just abstract it into a faster numpy "loop")

This is going to take a couple more steps, but it will be worth it in the end.

First we pre-calculate the "upper-left score" for each location.

In [21]:

def sub_values(sub_mat, abet, seq1, seq2):
    # convert the sequences to numbers
    seq1_ind = [abet.index(i) for i in seq1]
    seq2_ind = [abet.index(i) for i in seq2]
    sub_vals = np.array([[0] * (len(seq2)+1)] + [[0] + [sub_mat[y, x] for x in seq2_ind] for y in seq1_ind], int)
    return sub_vals

sub_values(dna_sub_mat, dbet, 'AACGTTA', 'AAGCTTAAAAAAAA')

Out[21]:

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]])

Then we get a list of all the diagonals in the matrix.

In [22]:

def diags(l1, l2):
    ys = np.array([np.arange(l1) + 1 for i in np.arange(l2)])
    xs = np.array([np.arange(l2) + 1 for i in np.arange(l1)])
    diag_ys = [np.flip(ys.diagonal(i), 0) for i in range(1-l2, l1)]
    diag_xs = [xs.diagonal(i) for i in range(1-l1, l2)]
    index_list = []
    for y, x in zip(diag_ys, diag_xs):
        index_list.append([y, x])
    return index_list

diags(6, 3)

Out[22]:

[[array([1]), array([1])],
 [array([2, 1]), array([1, 2])],
 [array([3, 2, 1]), array([1, 2, 3])],
 [array([4, 3, 2]), array([1, 2, 3])],
 [array([5, 4, 3]), array([1, 2, 3])],
 [array([6, 5, 4]), array([1, 2, 3])],
 [array([6, 5]), array([2, 3])],
 [array([6]), array([3])]]

And here's the actual function. It takes the same arguments and produces the same matrices.

In [25]:

def FastNW(sub_mat, abet, seq1, seq2, gap=-8):
    sub_vals = sub_values(sub_mat, abet, seq1, seq2)
    # Get the lengths of the sequences
    seq1_len, seq2_len = len(seq1), len(seq2)
    # Create the scoring and arrow matrices
    score_mat = np.zeros((seq1_len+1, seq2_len+1), int)
    arrow_mat = np.zeros((seq1_len+1, seq2_len+1), int)
    # Fill first column and row of score matrix with scores based on gap penalty
    score_mat[0] = np.arange(seq2_len+1) * gap
    score_mat[:,0] = np.arange(seq1_len+1) * gap
    # Fill top row of arrow matrix with ones (left arrow)
    arrow_mat[0] = np.ones(seq2_len+1)
    # Get the list of diagonals
    diag_list = diags(seq1_len, seq2_len)
    # fill in the matrix
    for diag in diag_list:
        # Matrix to hold all three possible scores for every element in the diagonal
        f = np.zeros((3, len(diag[0])), float)
        # Cell above + gap penalty for every cell in the diagonal
        x, y = diag[0]-1, diag[1]
        f[0] = score_mat[x, y] + gap
        # Cell to the left + gap penalty for every cell in the diagonal
        x, y = diag[0], diag[1]-1
        f[1] = score_mat[x, y] + gap
        # Cell to the upper left + alignment score for every cell in the diagonal
        x, y = diag[0]-1, diag[1]-1
        f[2] = score_mat[x,y] + sub_vals[diag]
        max_score = (f.max(0))
        max_score_pos = f.argmax(0)
        score_mat[diag] = max_score
        arrow_mat[diag] = max_score_pos
    return score_mat, arrow_mat

So how much faster is it?

In [24]:

s1 = random_dna_seq()
s2 = mutate_dna_seq(s1)
a = %timeit -o nw_alignment(dna_sub_mat, dbet, s1, s2)
print('{:.1f} years for the whole genome'.format(a.best * 2300000000 / 60 / 60 / 24 / 365.25))
a = %timeit -o FastNW(dna_sub_mat, dbet, s1, s2)
print('{:.1f} years for the whole genome'.format(a.best * 2300000000 / 60 / 60 / 24 / 365.25))

10 loops, best of 3: 64.2 ms per loop
4.7 years for the whole genome
100 loops, best of 3: 8.23 ms per loop
0.6 years for the whole genome

Now why did we use the substitution matrix?¶

Here's how DNA translates to protein:

Wikipedia

So we can align proteins, too!

In [16]:

blosum50 = np.array(
    [[ 5,-2,-1,-2,-1,-1,-1, 0,-2,-1,-2,-1,-1,-3,-1, 1, 0,-3,-2, 0],
     [-2, 7,-1,-2,-1, 1, 0,-3, 0,-4,-3, 3,-2,-3,-3,-1,-1,-3,-1,-3],
     [-1,-1, 7, 2,-2, 0, 0, 0, 1,-3,-4,-0,-2,-4,-2,-1, 0,-4,-2,-3],
     [-2,-2, 2, 8,-4, 0, 2,-1,-1,-4,-4,-1,-4,-5,-1, 0,-1,-5,-3,-4],
     [-1,-4,-2,-4,13,-3,-3,-3,-3,-2,-2,-3,-2,-2,-4,-1,-1,-5,-3,-1],
     [-1,-1, 0, 0,-3, 7, 2,-2, 1,-3,-2, 2, 0,-4,-1,-0,-1,-1,-1,-3],
     [-1, 0, 0, 2,-3, 2, 6,-3, 0,-4,-3, 1,-2,-3,-1,-1,-1,-3,-2,-3],
     [ 0,-3, 0,-1,-3,-2,-3, 8,-2,-4,-4,-2,-3,-4,-2, 0,-2,-3,-3,-4],
     [-2, 0, 1,-1,-3, 1, 0,-2,10,-4,-3, 0,-1,-1,-2,-1,-2,-3,-1, 4],
     [-1,-4,-3,-4,-2,-3,-4,-4,-4, 5, 2,-3, 2, 0,-3,-3,-1,-3,-1, 4],
     [-2,-3,-4,-4,-2,-2,-3,-4,-3, 2, 5,-3, 3, 1,-4,-3,-1,-2,-1, 1],
     [-1, 3, 0,-1,-3, 2, 1,-2, 0,-3,-3, 6,-2,-4,-1, 0,-1,-3,-2,-3],
     [-1,-2,-2,-4,-2, 0,-2,-3,-1, 2, 3,-2, 7, 0,-3,-2,-1,-1, 0, 1],
     [-3,-3,-4,-5,-2,-4,-3,-4,-1, 0, 1,-4, 0, 8,-4,-3,-2, 1, 4,-1],
     [-1,-3,-2,-1,-4,-1,-1,-2,-2,-3,-4,-1,-3,-4,10,-1,-1,-4,-3,-3],
     [ 1,-1, 1, 0,-1, 0,-1, 0,-1,-3,-3, 0,-2,-3,-1, 5, 2,-4,-2,-2],
     [ 0,-1, 0,-1,-1,-1,-1,-2,-2,-1,-1,-1,-1,-2,-1, 2, 5,-3,-2, 0],
     [-3,-3,-4,-5,-5,-1,-3,-3,-3,-3,-2,-3,-1, 1,-4,-4,-3,15, 2,-3],
     [-2,-1,-2,-3,-3,-1,-2,-3, 2,-1,-1,-2, 0, 4,-3,-2,-2, 2, 8,-1],
     [ 0,-3,-3,-4,-1,-3,-3,-4,-4, 4, 1,-3, 1,-1,-3,-2, 0,-3,-1, 5]])
pbet = 'ARNDCQEGHILKMFPSTWYV'

In [17]:

s1 = [random.choice(pbet) for _ in range(10)]
s2 = [random.choice(pbet) if random.random() < .25 else x for x in s1] + [random.choice(pbet) for _ in range(10)]
score_mat, arrow_mat = FastNW(blosum50, pbet, s1, s2)
visual_scoring_matrix(s1, s2, score_mat, arrow_mat)

Other things we can do with this algorithm:¶

Local alignment
Affine gap penalties

Other useful bioinformatics Python packages:¶

In [18]:

from pysam import FastaFile
from os.path import getsize

print(getsize('Homo_sapiens.GRCh38.dna.primary_assembly.fasta.gz')/1024**2, 'MiB')


with FastaFile('Homo_sapiens.GRCh38.dna.primary_assembly.fasta.gz') as myfasta:
    chr17len = myfasta.get_reference_length('17')
    print(chr17len, 'bp')
    seq = myfasta.fetch('17', int(chr17len/2), int(chr17len/2)+500)
print(seq)

853.3953523635864 MiB
83257441 bp
TGTTATTCTTTTGCGCCTGTTGCTGTCCCTAGGATACCGGAGATACTCAGCCTGTGCTTGCCAAGGGTATACATATGTATTCAAATACGCACATCCTGTAAACAGCTGTGGTGGAAACCCAAGACCAGACTTGTAGAGACCTGGATCTATTTTGCAGGCCACAGTGAAGGGCTGTGTGAGCTTGAGTCCTCTCTGAGCCTCACGTTTTACATCTAAAACATGAGGGGCTGGTCCAACTGGTTTCCAAAAGTCCTTCCAACTGGGACAGCCTGTGGATCTGCCAGTGCTCCAGCTGCACCAGCATGCAGCACATGTGCCCACACGTGTACACACCTGCTTACGCACGGCCACCCACCATGCTGAAGAACCAGCCCTCGGCATCCTTGCAGCTCTTCTCCACCAATGTCTTGTCCTGGTCACGCATCTCATTCAGGATGCAGCTCAGGTTCACTCCAGGCACAGTGTCCATCTTCACACTGACATCCTCATCCACCTGACCT