Comparing Two Sequences
Objectives:
• Learn the basics of dot plots
• Know how to interpret the most common patterns in a dot plot
• Use Dotlet
• Use Lalign to extract local alignments
Why compare two?
• Database searches are useful for finding homologues
• Database searches don't provide precise comparisons
• More precise tools are needed to analyze the sequences in detail, including:
– Dot plots for graphic analysis
– Local or global alignments for residue/residue analysis
• The alignment of two sequences is called a pairwise alignment
Dot Plot:
• A dot plot is a graphic representation of pairwise similarity
• The simplicity of dot plots prevents artifacts
• Ideal for looking for features that may come in different orders
• Reveal complex patterns
• Benefit from the most sophisticated statistical-analysis tool in the universe . . . your brain
Choosing your two sequences:
• Making pairwise comparisons takes time
• Use BLAST to rapidly select your sequences
– More than 70% identity for DNA
– More than 25% identity for proteins
• If your sequences are too similar, comparing them yields no useful information
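The identity cutoffs above can be checked by hand on any aligned pair. The following is a minimal sketch, not BLAST's own code: the function names are hypothetical, and it assumes identity is counted as matching columns divided by the full alignment length, gap columns included (other conventions exist).

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences carry the same residue.

    Assumption: both strings come from the same alignment, use '-' for gaps,
    and gap columns count toward the alignment length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def passes_selection_cutoff(identity, is_dna):
    """Apply the cutoffs from the list above: >70% for DNA, >25% for proteins."""
    return identity > (70.0 if is_dna else 25.0)


if __name__ == "__main__":
    identity = percent_identity("ACGT-A", "ACCTTA")
    print(round(identity, 1), passes_selection_cutoff(identity, is_dna=True))
```

The cutoffs only say whether a pair is similar enough to be worth comparing; the slides give no upper limit for "too similar", so none is encoded here.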
What can you analyze with Dot Plot?
• Any pair of sequences
– DNA
– Proteins
– RNA
• DNA with proteins
– Dotlet is an appropriate tool
– To compare full genomes, install the program locally
• Sequences longer than 1000 symbols are hard to analyze online
• Divergent sequences where only a segment is homologous
• Long insertions and deletions
• Tandem repeats
– The square shape of the pattern is characteristic of these repeats
Using Dotlet:
• Dotlet is one of the handiest tools for making dot plots
• Dotlet is a Java applet
• Open and download the applet at the following site:
– www.isrec.isb-sib.ch/java/dotlet
• Dotlet slides a window along each sequence
• If the windows are more similar than the threshold, Dotlet prints a dot at their intersection
• You can control the similarity threshold with the little window on the left
• Every dot has a score given by the window comparison
• When the score is
– Below threshold 1 => black dot
– Between thresholds 1 and 2 => grey dot
– Above threshold 2 => white dot
• The blue curve is the distribution of scores in the sequences
• The peak => most common score
– Most common => less informative
• Window size and stringency control the appearance of your dot plot
– Very stringent = clean dot plot, little signal
– Not stringent enough = noisy dot plot, too much signal
• Play with the threshold until a usable signal appears
• The square shape is typical of tandem repeats
• The repeats are not perfect because the sequences have diverged after their duplication
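The sliding-window idea described above can be sketched in a few lines. This is an illustration only, not Dotlet's code: it assumes a simple identity count as the window score (Dotlet itself scores windows with a comparison matrix), prints text characters instead of pixels, and treats scores exactly equal to a threshold as grey. All function names are hypothetical.

```python
def window_score(window_a, window_b):
    """Number of identical positions between two equal-length windows."""
    return sum(1 for x, y in zip(window_a, window_b) if x == y)


def dot_plot(seq1, seq2, window=3, threshold=3):
    """Return one text row per window of seq1; '*' marks a dot, '.' no dot."""
    rows = []
    for i in range(len(seq1) - window + 1):
        row = ""
        for j in range(len(seq2) - window + 1):
            score = window_score(seq1[i:i + window], seq2[j:j + window])
            row += "*" if score >= threshold else "."
        rows.append(row)
    return rows


def shade(score, threshold1, threshold2):
    """Three-level shading following the black/grey/white rule above."""
    if score < threshold1:
        return "black"
    if score > threshold2:
        return "white"
    return "grey"


if __name__ == "__main__":
    # A sequence compared with itself: three tandem copies of "ABC".
    for line in dot_plot("ABCABCABC", "ABCABCABC"):
        print(line)
```

Comparing the repeated sequence with itself gives a main diagonal plus parallel diagonals spaced one repeat unit apart, which together fill a square block. Lowering the threshold below the window size lets imperfect (diverged) repeats show up, at the cost of more noise.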
Comparing a Gene and its Product:
• Eukaryotic genes are transcribed into RNA
• The RNA is then spliced to remove the intron sequences
• It may be necessary to compare the gene and its product
• Dotlet makes this comparative analysis easy
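On a dot plot of a gene against its spliced product, each exon appears as a diagonal segment and each intron as a jump between segments. The sketch below reproduces that on made-up data: the sequences are invented, the product is written as cDNA (T rather than U), matches are assumed to be exact, and the function name and `seed` parameter are hypothetical.

```python
def exon_blocks(gene, mrna, seed=4):
    """Greedy toy mapping of a spliced product back onto its gene.

    Returns (gene_start, mrna_start, length) for each exactly matching block.
    Assumes exons appear in the same order in both sequences.
    """
    blocks = []
    g = p = 0
    while p < len(mrna):
        # Look for the next few product letters downstream in the gene.
        start = gene.find(mrna[p:p + seed], g)
        if start == -1:
            break
        # Extend the match for as long as both sequences agree.
        length = 0
        while (start + length < len(gene) and p + length < len(mrna)
               and gene[start + length] == mrna[p + length]):
            length += 1
        blocks.append((start, p, length))
        g = start + length
        p += length
    return blocks


# Invented example: two exons separated by one intron.
exon1, intron, exon2 = "ATGGCCAAG", "GTAAGTTTTCAG", "CTGGTCTAA"
gene = exon1 + intron + exon2
mrna = exon1 + exon2

if __name__ == "__main__":
    for gene_start, mrna_start, length in exon_blocks(gene, mrna):
        print(f"gene {gene_start}-{gene_start + length - 1} "
              f"matches product {mrna_start}-{mrna_start + length - 1}")
```

The gap between the end of one block and the start of the next in gene coordinates is the intron. Real sequences need a tolerant comparison, because an exon boundary can be ambiguous when the intron happens to start with the same letters as the next exon.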
Aligning Sequences:
• Dotlet dot plots are a good way to provide an overview
• Dot plots don't provide residue/residue analysis
• For this analysis you need an alignment
• The most convenient tool for making precise local alignments is Lalign
Lalign and BLAST:
• Lalign is like a very precise BLAST
• It works on only two sequences at a time
• You must provide both sequences
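To show what a local alignment of two sequences involves, here is a minimal Smith-Waterman sketch. It is a textbook technique swapped in for illustration, not Lalign's implementation: the scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions, and only the single best local alignment is returned.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("TTTACGTACGTTT", "GGGACGTACGGGG")
    print(score)
    print(top)
    print(bottom)
```

Unlike a dot plot, the result pairs individual residues, which is the residue/residue view the slides ask for. Both sequences are passed in directly; there is no database search step.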
Going Farther:
• If you need to align coding DNA with a protein, try this site:
– www.tcoffee.org => protogene
• If you need to align very large sequences, try this site:
– www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi
• If you need a precise estimate of your alignment's statistical significance, use PRSS
– The program is available at fasta.bioch.virginia.edu
– Low E-value => good alignment
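The idea behind a shuffle-based significance estimate can be sketched as follows: score the real pair, then score the first sequence against many shuffled copies of the second, and see how often chance does as well. This is only an illustration under stated assumptions, not PRSS itself: it reports a simple empirical p-value rather than an E-value, fits no statistical distribution, and uses arbitrary scoring values and hypothetical function names.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (score only, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_p_value(a, b, n_shuffles=200, seed=0):
    """Fraction of shuffles of b that score at least as well as the real b.

    One is added to numerator and denominator so the estimate is never zero.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    real = local_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if local_score(a, "".join(letters)) >= real:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    seq = "ATGGCCAAGGTCTGACCTGATTGC"
    print(shuffle_p_value(seq, seq))
```

Shuffling keeps the composition of the second sequence but destroys its order, so a real score that shuffled copies rarely reach is unlikely to be a composition effect. As with the E-value, a low value points to a good alignment.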