My expectations

You will be able to define/describe concepts you have learned in class (e.g., what is a codon)
You will be able to directly apply concepts learned in class (e.g., reverse complement a string, or translate RNA into protein)
You will be able to reproduce/apply algorithms learned in class (e.g., compute Z values for a given string)
You will be able to apply concepts/algorithms learned in class to a new type of problem (e.g., use Z values to compute a new type of preprocessing function, or devise a brute force algorithm for something else than protein sequencing).

Sample problems

These are some problems that have appeared on prior exams or homeworks.

Short/simple

Identify the longest open reading frame (ORF) in the following DNA sequence (note: start codon ATG, stop codons TAG, TGA, TAA)
ATGATATGATGTCACAGGATGTAGATGCTCCGACATAAGCTTAG
Reverse complement the following DNA string
ATGCGGGCAC GGAAGCCGTG TACGCGAGCG CGCTTGAGGG
Identify the longest open reading frame in the following DNA sequence and translate it into an amino-acid sequence TGCGTATGTATGTCAGACGGTGAGACGCTTGCGGGCTAAGCGACG
Please link the following terms in the correct order
RNA, translation, DNA, protein, transcription

Long/more complex

The genomes of many bacteria are circular. In order to represent the data in a linear form, genome assembly programs pick a random location in the genome to break the circle. Thus, it is possible that running the same program multiple times we would get different answers, corresponding to different circular rotations of the same string. Using the Z values described in class, describe a simple algorithm that identifies whether two DNA strings are circular rotations of each other. For example, string ACATTCAT is a circular rotation of string TTCATACA.
a) The steps outlined below represent an execution of the KMP algorithm (text on top, pattern on bottom)

(i) ABCABCDABABCDABDABDE (ii) ABCABCDABABCDABDABDE
|||X ||||||X
ABCDABD ABCDABD

(iii) ABCABCDABABCDABDABDE (iv) ABCABCDABABCDABDABDE
X |||||||
ABCDABD ABCDABD
In step (i), the pattern is matched until a mismatch occurs at the 4th character. Since no suffix of the matched portion matches a prefix of the whole pattern (no prefix and suffix of ABC match each other), the pattern is shifted beyond the aligned region and the matching continues from the beginning of the pattern. In (ii) a mismatch is found at position 7 in the pattern and the pattern is shifted (iii) so that the common prefix and suffix of the matched regions are aligned (the first AB in the pattern is placed where the second AB matched at step (ii)). The third position in the pattern is compared to the text and results in a mismatch. The pattern is then shifted past this position and a match is found.

Using this execution as an example, provide an argument why the overall running time of the algorithm proportional to the sum of the lengths of pattern and text.

b) During the execution of the algorithm described in class for computing Z values, when computing the value Z[i] we relied on the Z-value for a position j < i such that
(i) Z[j] extends the farthest in the string (j + Z[j] is maximum over all choices of j < i) and
(ii) j + Z[j] > i

Would the algorithm still work efficiently (linear time) if only the second condition were satisfied? Which part of the reasoning would fall apart?
What are the sp(i) values for the string below (think pattern in the KMP algorithm):
P = A C T T A C T A A C T C T A A T A G G A C T T A C T C T A
sp =
b) Assume you are matching the string above against a long text, e.g. the human genome. Assume you've matched all characters up to position 10 (starting the numbering from 0, string ending at AACT), but mismatched on the 4th C of the pattern.
How far will you have to shift the pattern before attempting to match the string again?
Which is the next character in the pattern that will be compared to the text?

(simply draw the next position of the pattern below and circle or underline the character that will be compared to the text)
P = A C T T A C T A A C T C T A A T A G G A C T T A C T C T A

c) With the help of the example from part 1, can you figure out some of the following relationships? Please explain.

Define one possible relationship between sp(i) and sp(i – 1).
Can you define a similar relationship between Z[i] and Z[i – 1]?
a) Write out the Z-values for the following string:

A G A G A G A C C A G C A G A G A G A G A C G T

b) Assume you are trying to compute Z[14] - please explain how you compute this value using the efficient algorithm described in class. Specifically:

a) what are the values for j and Z[j]?

b) which earlier value of Z do you need to make this computation?
c) do you need to perform any additional character comparisons?