Genome indexing - lesson plan

Learning outcomes

  • Understand why data-structures storing suffixes of a string are useful in answering string matching questions
  • Understand why the suffix trees require just linear space
  • Able to come up with algorithms based on suffix trees to answer basic string matching questions (repeats, matching, etc.)
  • Able to convert a suffix tree into a suffix array
  • Understand the additional information needed to make suffix arrays work (LCP array)
  • Able to use suffix arrays to match strings
  • Able to construct the Burrows-Wheeler transform of a string and to reconstruct a string from its BWT
  • Able to use the BWT to perform matching
  • Understand the need for and use of additional information to allow string matching with the BWT
  • Understand the tradeoff between speed and space inherent in  BWT-based indices

Class activities

Introduce the longest common substring problem and why it may be hard to do in linear time

Introduce the concept of sequence trie as a possible solution to matching many strings to the reference genome.

Discuss the use of a trie with a KMP-like algorithm to speed up alignment (Aho-Corasick)

Introduce the changes from trie to suffix trie to suffix tree and discuss memory usage

Discuss the use of suffix arrays to solve string matching. Also identify additional information that may be needed to do it efficiently.

Describe Burrows Wheeler Transform and why it is invertible.

Describe its use in matching and the need for additional information to speed up the process.