Fall, 2009
COMP/MATH 452
Bioinformatics
nick-peterss-imac:Project 4 Nick$ python GreedyMotifSearch.py
ATGCAACT
ATGCAACT
ATGCAACT
ATGCAACT
ATGCAACT
ATGCAACT
ATGCAACT
gCTGGGTC
GATGgAtC
GCTGGTTC
TGTCGGTC
GCTGCATG
ACATGATC
This shows the run of the two examples given in the book on page 94. The first result is for the set of DNA sequences from Figure 4.2b, and the second result is for the set of DNA sequences from Figure 4.2d. Each result lists the 8-mers found in that set of DNA sequences that maximize the consensus score.
The consensus string for the first set is ATGCAACT, which matches the string found in the previous assignment for the same example. This is a good indication that the program is working properly.
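As a quick check of the consensus string, here is a minimal sketch of the column-wise majority rule (take the most frequent character in each column). The function name consensus is hypothetical and is not taken from GreedyMotifSearch.py; the comparison is assumed to be case-insensitive, and a tie is broken in favor of the base seen first in that column.

```python
from collections import Counter


def consensus(motifs):
    """Return the most frequent base in each column of equal-length motifs.

    Assumption: case is ignored. On a tie, Counter.most_common keeps the
    base that was encountered first in the column.
    """
    columns = zip(*(m.upper() for m in motifs))
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)


# The first run above printed the same 8-mer seven times.
first_set = ["ATGCAACT"] * 7
print(consensus(first_set))
```

Since every row of the first result is identical, the consensus is that same 8-mer, ATGCAACT.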
The second set, on the other hand, is a bit different. The previous assignment got "ATTGATG" as the consensus string (or median string), but the consensus string for this assignment is "GCTGATC" (found by taking the most frequently occurring character in each column). Furthermore, the first two underlined 8-mers in the book are "ATcCAgCT" and "ggGCAACT", while the first two 8-mers found by this program are "gCTGGGTC" and "GATGgAtC". Is the program giving us incorrect results?
To answer this question, let's first take the two strings the book gives us and compute the Hamming distance between them (x marks a mismatch and : marks a match, ignoring case):
A T c C A g C T
x x x : : x : : 4
g g G C A A C T
Now let's take a look at the strings the program finds:
g C T G G G T C
: x : : : x : : 2
G A T G g A t C
The program is acting as it should, since the pair of strings it found has a lower Hamming distance (2) than the pair of strings in the book (4).
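The two comparisons above can be reproduced with a short sketch. The names hamming and match_line are hypothetical helpers, not part of GreedyMotifSearch.py, and the comparison is assumed to ignore case, as the hand-drawn alignments do (g against G is counted as a match).

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ.

    Assumption: upper and lower case are treated as the same base.
    """
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def match_line(a, b):
    """Build the marker row used above: ':' for a match, 'x' for a mismatch."""
    return " ".join(":" if x == y else "x" for x, y in zip(a.upper(), b.upper()))


book_pair = ("ATcCAgCT", "ggGCAACT")
program_pair = ("gCTGGGTC", "GATGgAtC")

print(match_line(*book_pair), hamming(*book_pair))        # x x x : : x : : 4
print(match_line(*program_pair), hamming(*program_pair))  # : x : : : x : : 2
```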
Since the algorithm uses these two strings as the starting point for finding the rest of the l-mers, it will find a set of strings that maximizes the consensus score together with the initial two strings.
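To illustrate why the choice of the first two strings shapes everything that follows, here is a minimal sketch of a greedy search of this kind. It is an assumption about the general approach, not the code in GreedyMotifSearch.py: the names score, lmers, and greedy_motif_search are hypothetical, case is ignored, and the three toy sequences are made up for illustration rather than taken from the book's figures.

```python
from collections import Counter
from itertools import product


def score(motifs):
    """Consensus score: for each column, count the most common base, then sum."""
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*motifs))


def lmers(seq, l):
    """All substrings of length l in seq, left to right."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def greedy_motif_search(dna, l):
    """Pick one l-mer per sequence greedily.

    Step 1: try every pair of l-mers from the first two sequences and keep
            the best-scoring pair.
    Step 2: for each remaining sequence, add the l-mer that gives the best
            score together with the l-mers already chosen.
    """
    dna = [s.upper() for s in dna]  # assumption: case is not significant
    motifs = list(max(product(lmers(dna[0], l), lmers(dna[1], l)), key=score))
    for seq in dna[2:]:
        motifs.append(max(lmers(seq, l), key=lambda m: score(motifs + [m])))
    return motifs


# Toy input (made up): the 4-mer ACGT is planted in every sequence.
toy = ["ttACGTtt", "gACGTggg", "ccccACGT"]
print(greedy_motif_search(toy, 4))
```

Because the later l-mers are chosen only relative to what has already been picked, a different starting pair can lead to a different final set, which is consistent with the program and the book disagreeing on the second example.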