Hash tables are a very good general solution for providing quick access to values (e.g. k-mer positions) based on keys (e.g. k-mers). There is an even more elegant solution to the specific problem of finding every position of a k-mer in a sequence database; the FM-index (Ferragina and Manzini, 2000). Unlike a hash table of k-mer positions, the k-mer length “k” does not need to chosen in advance, so the FM-index can be reused again and again for different values of k.

The FM-index is built on the combination of the Burrows–Wheeler transformation of the original sequence string, plus the suffix array of the original sequence string. With the BW transformed string and suffix array we can in O(k) time find the positions of all exact matches to a k-mer in the database sequence.

Burrows–Wheeler transformation was originally developed for compression, as although the transformed string is the same length as the original string, it has proporties that tend to make it more easily compressible.

Before generating the suffix array and BW transformed string, we need to append a symbol to the end of the original sequence string that has an encoded value outside the range of the normal sequence alphabet, for example “$”. This extra symbol is used to identify the correct rotation of the original sequence when reversing the transformation.

Next fill in an n×n matrix where each row is a rotation of the sequence string i steps to the left. n is the length of the sequence string (including the special character), i is the zero based index of the row. Let’s see an example where our sequence is “banana”:

index 0 1 2 3 4 5 6
0 b a n a n a $
1 a n a n a $ b
2 n a n a $ b a
3 a n a $ b a n
4 n a $ b a n a
5 a $ b a n a n
6 $ b a n a n a

Notice that the rows and columns are interchangable at this point. Reorder the rows of this matrix lexigraphically, so that the rotations are in alphabetical order. “ana$ban” comes before “anana$b” because the dollar sign has a lower numeric code than any of the letters in the text encoding used for this example, which is ASCII:

index 0 1 2 3 4 5 6
6 $ b a n a n a
5 a $ b a n a n
3 a n a $ b a n
1 a n a n a $ b
0 b a n a n a $
4 n a $ b a n a
2 n a n a $ b a

The suffix array is simply the indices after the rows have been sorted, which in this example is “6, 5, 3, 1, 0, 4, 2”. We call this a suffix array because each row up to the dollar sign represents one of the suffixes of the sequence string, and the index is the original position of that suffix. For example, the suffix “nana” begins at index 2 of the original sequence string “banana”, and lexigraphically it is the last suffix.

Very importantly for finding k-mers, all identical substrings in the original sequence will be located in a contiguous block. For example the substring “ana” is found at positions 1 and 3 of the original string, and those indices are stored in the third and fourth positions of the suffix array.

The BW transformed string is simply the last column of this matrix, which in this example is “annb$aa”. Without any information other than the transformed string itself, we can reverse the transformation and get back the original string!

To reverse the transformation, initialize a new n×n matrix, and fill in the last column with the BW transformed string:

0 1 2 3 4 5 6
            a
            n
            n
            b
            $
            a
            a

Reversing the transformation involves alternating between filling in columns of the matrix and reordering the rows of the matrix lexigraphically. We have filled in the last column, so the next step is to sort the matrix:

0 1 2 3 4 5 6
            $
            a
            a
            a
            b
            n
            n

Now fill in the second last column. Columns are always filled with the BW transformed string:

0 1 2 3 4 5 6
          a $
          n a
          n a
          b a
          $ b
          a n
          a n

And reorder the rows lexigraphically:

0 1 2 3 4 5 6
          $ b
          a $
          a n
          a n
          b a
          n a
          n a

Fill in the third column:

0 1 2 3 4 5 6
        a $ b
        n a $
        n a n
        b a n
        $ b a
        a n a
        a n a

And reorder the rows:

0 1 2 3 4 5 6
        $ b a
        a $ b
        a n a
        a n a
        b a n
        n a $
        n a n

Continue until the entire matrix is filled and sorted lexigraphically:

0 1 2 3 4 5 6
$ b a n a n a
a $ b a n a n
a n a $ b a n
a n a n a $ b
b a n a n a $
n a $ b a n a
n a n a $ b a

This is the same matrix we generated the suffix array and the BW transformed string from! This matrix includes every suffix of the original sequence string. The suffix beginning at index zero includes the entire sequence string, and will be terminated with the special character. So simply find the row that ends with the special character, and that will contain the original sequence string: “banana”!

If instead of reversing the transformation, we want to find all occurences of a k-mer within the original string, we can use an algorithm that uses the BW transformed string and the suffix array.

This can be made more efficient, but even the basic implementation I detail below the index requires just 5 bytes per base pair for a 3Gb genome - 1 byte to store each character in the sequence string, and 4 bytes to store each string index in the suffix array (with 4 bytes we can store numbers from 0 to 232 ≈ 4 billion, 1 billion more than we need haha). The advantages of this index over BLAST or BLAT indices are:

  • It doesn’t require the k-mer hash table, which saves space
  • Without a hash table lookups are always O(k) time, as there are no hash collisions (multiple unique k-mer sequences in the same hash bucket)
  • The same index can be used to lookup k-mers of any length, and so only needs to be generated once for each sequence database (e.g. once per genome)

The algorithm for exact matches works as follows:

First, define a function C(c) to get the number of characters in the sequence string which are lexigraphically (alphabetically) “lower” than c. For example given the string “banana$” or its BW transform “annb$aa”, the possible results are

  • C("$") = 0 (there are no characters lower than “$”)
  • C("a") = 1 (the “$” character is lower and occurs once)
  • C("b") = 4 (the “$” and “a” characters are lower and occur once and thrice respectively)
  • C("n") = 5 (the “$”, “a”, and “b” characters are lower and occur once, thrice, and once respectively)

Second, define a function rank(c, j) to get the number of c characters in the BW transformed string truncated to be j characters in length. For example given the transformed string “annb$aa”, to get the result of rank("n", 2), truncate the string to get the substring “an”. There is one occurance of “n” in this substring, so rank(c, j) = 1.

Third, using the suffix array and the helper functions, we will iteratively refine an interval of the suffix array to arrive at the contiguous block containing all exact matches. Initialize the start and end of the interval as 0 and the length of the suffix array respectively. For each character c in the k-mer from the last to the first, update start to equal C(c) + rank(c, start), and update end to equal C(c) + rank(c, end).

For example if our k-mer of interest is “an”, there will only be two iterations:

  1. start = C("n") + rank("n", 0) = 5 + 0 = 5;
    end = C("n") + rank("n", 7) = 5 + 2 = 7
    
  2. start = C("a") + rank("a", 5) = 1 + 1 = 2;
    end = C("a") + rank("a", 7) = 1 + 3 = 4
    

The final interval is 2 to 4, which will contain all exact matches to our k-mer “an”! This is a zero-based half-open interval, therefore the third and fourth positions of the suffix array contain the indices to both occurances of “an” in “banana”. Let’s look back at the matrix we used to generate the suffix array:

index 0 1 2 3 4 5 6
6 $ b a n a n a
5 a $ b a n a n
3 a n a $ b a n
1 a n a n a $ b
0 b a n a n a $
4 n a $ b a n a
2 n a n a $ b a

You can see that the third and fourth rows specify the suffixes “ana$” and “anana$”, which both begin with “an”, exact matches to our k-mer of interest. The third and fourth values of the suffix array are 3 and 1 respectively, which are the zero based indices to “ana$” and “anana$”, or equivalently the zero based indices to the exact matches.

The FM-index introduces further time and space optimizations by compressing the BW transformed string, precalculating the character ranks at each position, and deriving suffix array values on the fly. But essentially it’s a fancier version of the same process as described above.

I have written an implementation of the Burrows–Wheeler transform in Python as three functions. The first function bwt transforms the original sequence string and returns the suffix array, the second function inverse_bwt reverses the transformation, and the third function exact_match finds exact matches in the original sequence string using the suffix array with the BW transformed string. You can try running it on your own computer with different sequences and kmers:

import numpy

# transform a sequence string and generate the suffix array
def bwt(original):
	# convert the sequence string into an array of 8-bit numbers
	string_array = numpy.fromstring(original, dtype = "uint8")

	# get the length "n" of the sequence string (now an array)
	n = len(string_array)

	# initialize the table for the BW transformation of length n x n
	offset_table = numpy.zeros((n, n), dtype = "uint8")

	# For every row in the table "i".
	for i in range(n):
		# For every column "j", set the value (character) to the value in the
		# string array, offset by i. The modulo "%" is used to wrap around back
		# to the beginning.
		for j in range(n):
			offset_table[i][j] = string_array[(j + i) % n]

	# Generate the suffix array. This is simply a series of numbers giving the
	# lexigraphic (alphabetical) order of the rows or columns in the offset
	# table (the rows read left-to-right are identical to the columns read
	# top-to-bottom). Before getting these numbers using numpy's "lexsort"
	# function, we have to flip the values in each column (so the first value
	# becomes the last, second becomes second last, etc.), because numpy
	# treats the last value as most significant, second last second most
	# significant, and so on.
	suffix_array = numpy.lexsort(numpy.flip(offset_table, axis = 0))

	# Sort the rows of the offset array lexigraphically according to the order
	# given by the suffix array, and then take the last column. This bwt_array
	# encodes the BW transformed string.
	bwt_array = offset_table[suffix_array, -1]

	return suffix_array, bwt_array

# reverse the transformation
def inverse_bwt(bwt_array):
	# get the length "n" of the bwt string (encoded as an array)
	n = len(bwt_array)

	# initialize the table to reverse the transformation (of length n x n)
	# we will treat this table as organized by columns and then by rows
	inverse_table = numpy.zeros((n, n), dtype = "uint8")

	# For every column "i", copy the bwt string into that column. We are
	# filling the table in reverse (left-to-right), because numpy's lexsort
	# function treats the last (rightmost) value as most significant, the
	# second last as second most significant and so on. We will flip the table
	# at the end to be the correct way around.
	for i in range(n):
		inverse_table[i] = bwt_array

		# Reorder the rows in the table so they are in lexigraphic
		# (alphabetical) order.
		sort_indices = numpy.lexsort(inverse_table)
		inverse_table = inverse_table[:, sort_indices]

	# Identify the row where the first value is the special symbol. This row
	# will encode the original string in reverse, so flip it and return it
	# (after conversion back to a string). If we had filled in the table in
	# the usual direction instead of in reverse, the characters would have
	# been in the right order, the special symbol would have been the last
	# value, and we would not have to flip it.
	for row in inverse_table.transpose():
		if row[0] == min(bwt_array):
			return numpy.flip(row).tostring().decode("utf-8")

# find the positions of all exact matches to a (k-mer) query
def exact_match(query, suffix_array, bwt_array):
	# Generate the C-map. This is the sum total number of appearances of any
	# character in the BW transformed string (or original string, it doesn't
	# make a difference) that come lexigraphically (alphabetically) earlier
	# than a given character "c".
	C = {}
	# For every character "c" and its index "i" in the alphabetically sorted
	# BW transformed string (stored as an array). This would be identical to
	# an alphabetically sorted original string, as all of the characters in
	# the original string are present in the BW transformed string.
	for i, c in enumerate(sorted(bwt_array)):
		# If this is the first appearance of c in the sorted string, its index
		# will correspond to the sum total number of characters which come
		# lexigraphically before c in the original or BW transformed string.
		if c not in C:
			C[c] = i

	# convert the query (k-mer) to an array of numbers
	query_array = numpy.fromstring(query, dtype = "uint8")

	# get the length of the query k, and the length of the BW transformed or
	# original string n
	k = len(query)
	n = len(bwt_array)

	# Define the start and end of our suffix array range, which will be
	# updated until the range corresponds to the query (k-mer) positions.
	start = 0
	end = n

	# For every character, in reverse, from the query (k-mer). So if the query
	# is "GSA", this will be "A" then "S" then "G".
	for c in numpy.flip(query_array):
		# Get the numbers of appearances of c in the bwt string that come
		# before the start or end of the suffix array range respectively. We
		# will call these the "ranks" of c.
		rank_start = numpy.sum(bwt_array[:start] == c)
		rank_end = numpy.sum(bwt_array[:end] == c)
		# update our range to be the sum of the corresponding C-map value and
		# corresponding ranks
		start = C[c] + rank_start
		end = C[c] + rank_end

	# Return the values in the suffix array within the final range, sorted
	# from the smallest value (first match) to largest (last match).
	return numpy.sort(suffix_array[start:end])

# The original string has to include a special symbol at the end which has a
# lower lexigraphic value than any of the characters in the "regular"
# alphabet. For example in the ASCII alphabet we are using in this program,
# the dollar sign has a decimal value of 36, the upper case alphabet occupies
# the range 65 to 90 inclusive, and the lower case alphabet occupies the range
# 97 to 122 inclusive.
original_string = "mississippi$"

# Get the suffix array and BW transformed string (as an array)
suffix_array, bwt_array = bwt(original_string)
bwt_string = bwt_array.tostring().decode("utf-8")

# Reconstruct the original string from the BW transformed string
reconstructed_string = inverse_bwt(bwt_array)

# Find the positions of all exact matches to a query (k-mer) substring
kmer = "ssi"
exact_match_positions = exact_match(kmer, suffix_array, bwt_array)

# Print all of the above
print("Original string = " + original_string)
print("BWT string = " + bwt_string)
print("Reconstructed string = " + reconstructed_string)
print("Substring = " + kmer)
print('Positions of exact matches = ' + repr(exact_match_positions))

Implementations of the Burrows–Wheeler transform and FM-index for short sequence alignment, such as the popular aligner BWA, also implement a modified (and more complicated) algorithm for finding inexact matches.