Fall, 2009

COMP/MATH 452

Bioinformatics

Nick Peters

This assignment asks us to write a program that finds and prints out the longest common subsequence for two given strings. In addition, we are asked to asked to experiment with different values for mu and sigma where mu is the penality for mismatched characters and sigma is the penalty for indel characters.

For this program I used five test cases. Each case was ran four times: when mu and sigma are both zero, when mu and sigma are both equal to one, when mu is greater than sigma, and when sigma is greater than mu. The case when both mu and sigma are equal to zero is equivalent to the longest common subsequence algorithm given in section 6.5 of the book.

For each case, the program prints out five pieces of information: The case number, the values for mu and sigma, the similarity score, the longest common subsequence, and the alignment of the two given strings. The similarity score is defined as # Matches - MU * # Mismatches - SIGMA * # Indels (higher is better). When there are more mismatches/indels than matches, we get a negative score. Note that when we count the number of indels, we are counting only counting the number of indels in the first string. According to our book, this is because this score "corresponds to the minimum number of insertions and deletions needed to transform v into w."

The first observation I want to make is an obvious one. When sigma is greater than mu, there are more mismatches than indels and when mu is greater than sigma, there are more indels than mismatches. I also want to note that the similarity score is higher when we penalize mismatches rather than indels. This makes sense since when we think about what the score is telling us. When we have mismatches lining up, we can't exactly transform the first string into the other. On the other hand, when we have indels, it tells us to insert a character here to transform the first string into the second. In all cases, penalizing mismatches gives the same score when we have no penalization. In fact, in all cases, penalization in general does not seem to improve the score for any of the sets of the strings. This program runs adequately when optimized for when both mu and sigma both equal zero.

Cases

Other

I modified wrote my own PrintLCS function. Instead of using a recursive algorithm (like in the book), I wrote an iterative algorithm. Also, in addition to finding the LCS, the function also finds the alignment for the two given strings and prints that out.

Also the posmax(seq,key) function might be a bit mysterious. Essentially it gets passed a list of 2-3 values to choose from and chooses the largest one. In addition to returning the largest value of the list (called sequence), it also returns the index. The index corresponds to the direction arrow used in the book. If you would like to know more about how this works, look at python's enumerate function and lambda functions.