Indexed MegaBLAST, BLAT, and SSAHA

3 minute read

BLAST and FASTA are based on building an index from the query sequences, and scanning the database sequences for k-mers which are present in the query index. For both methods the top level of the index is a hash table.

Every k-mer and every position it occupies in the query must be linked to from the hash table. In addition, to keep the load factor of the hash table constant, the length of the hash must be increased as the size of the query is increased. For both reasons, the size of the hash table itself will be proportional to the size of the query.

Given a total database sequence length n, in the best case scenario where each hash bucket contains zero or one k-mer entries, the time complexity for the BLAST or FASTA k-mer search will be O(n). This is because the time taken for a single hash table lookup is O(1) in that best case scenario, meaning a constant time unrelated to the query or database size. This constant is multiplied by the length of the database being searched, as a hash table lookup is performed for almost every sequence position in the database, which is why the time complexity is O(n).

Of course if the query is much smaller than the database, it might be much faster to scan the query for k-mers present in an index of the database. The time complexity would then be O(l) where l is the length of the query.

In the opposite case where the query is very large, the index might not fit in available memory. Any index must fit inside memory so that the constant time of a hash table lookup is fast. Because of that requirement, indexing the database will also be faster if the query is much LARGER than the database and its index will not fit in memory (but the database index will).

Consider the case of genome resequencing. Typically researchers will resequence a genome to at least 30× coverage, that is the total number of characters across all sequence reads will equal the length of the (haploid) genome multiplied by thirty. In the case of the human genome which is about 3 billion base pairs in length, the total length of our query will be 90 billion nucleotides!

To store every position of every k-mer requires 12 billion bytes (3 billion base pairs × 4 bytes per position), and the hash table structure for quick access to those positions will use space as well. So unless the index is compressed, a human genome hash table index will be greater than 12GB in size.

This sounds a lot, but many computers today have 32GB or more memory, so it’s not insurmountable. But consider if we generated an index from the query sequences… in that case it will be greater than 360GB in size!

We could of course run BLAST or FASTA separately for every read, but in that case the time complexity of our search will be O(nm), where m is the number of reads. Since for genome resequencing there will be hundreds of millions of reads, the performance will be very poor.

Not only that, but building an index from all the queries will take more time than it will from the database, and then only be used a handful of times. The index for something like a human genome can be generated once and used in thousands of studies, as long as the k-mer length isn’t changed.

Building a hash table index of the database is the approach taken by the popular aligners BLAT and SSAHA. Both those aligners reduce the size of the index by only including non-overlapping k-mers, which helps if you are running it on a less powerful computer like a laptop. As well as those specialized aligners, when NCBI’s MegaBLAST is used to search human or mouse genomes, it will use a precompiled index of those genomes. This has been described as “Indexed MegaBLAST”.