Genome indexing - lesson plan

Understand why data-structures storing suffixes of a string are useful in answering string matching questions
Understand why the suffix trees require just linear space
Able to come up with algorithms based on suffix trees to answer basic string matching questions (repeats, matching, etc.)
Able to convert a suffix tree into a suffix array
Understand the additional information needed to make suffix arrays work (LCP array)
Able to use suffix arrays to match strings
Able to construct the Burrows-Wheeler transform of a string and to reconstruct a string from its BWT
Able to use the BWT to perform matching
Understand the need for and use of additional information to allow string matching with the BWT
Understand the tradeoff between speed and space inherent in BWT-based indices

Introduce the longest common substring problem and why it may be hard to do in linear time

Introduce the concept of sequence trie as a possible solution to matching many strings to the reference genome.

Discuss the use of a trie with a KMP-like algorithm to speed up alignment (Aho-Corasick)

Introduce the changes from trie to suffix trie to suffix tree and discuss memory usage

Discuss the use of suffix arrays to solve string matching. Also identify additional information that may be needed to do it efficiently.

Describe Burrows Wheeler Transform and why it is invertible.

Describe its use in matching and the need for additional information to speed up the process.

....or call 1-833-566-3347 (staff/faculty)
1-877-399-4090 (students)