BiRG Lab Blog: July 2008

Friday, July 25, 2008

BiRG minutes: 7-18-2008

Comparing Two Sequences

Objectives:

• Get the basics about dot plots

• Know how to interpret the most common patterns in a dot plot

• Use Dotlet

• Use Lalign to extract local alignments

Why compare two?

• Database searches are useful for finding homologues

• Database searches don't provide precise comparisons

• More precise tools are needed to analyze the sequences in detail including

– Dot plots for graphic analysis

– Local or global alignments for residue/residue analysis

• The alignment of two sequences is called a pairwise alignment

Dot Plot:

• A dot plot is a graphic representation of pairwise similarity

• The simplicity of dot plots prevents artifacts

• Ideal for looking for features that may come in different orders

• Reveal complex patterns

• Benefit from the most sophisticated statistical-analysis tool in the universe . . . your brain

Choosing your two sequences:

• Making pairwise comparisons takes time

• Use BLAST to rapidly select your sequences

– More than 70% identity for DNA

– More than 25% identity for proteins

• If your sequences are too similar, comparing them yields no useful information
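To make the identity thresholds above concrete, here is a minimal sketch of how percent identity can be computed from two already-aligned sequences. The function name `percent_identity` and the choice to count every non-empty alignment column in the denominator are assumptions made for illustration; BLAST and other tools may define the denominator differently.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns holding the same residue in both rows.

    Both inputs must be the same length and use '-' for gaps.
    Columns where both rows are gaps are ignored (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == "-" and b == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
    print(percent_identity("AC-T", "ACGT"))
```

A DNA pair scoring above 70 here, or a protein pair above 25, would pass the selection rule from the list above.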

What can you analyze with Dot Plot?

• Any pair of sequences

– DNA

– Proteins

– RNA

• DNA with proteins

– Dotlet is an appropriate tool

– To compare full genomes, install the program locally

• Sequences longer than 1000 symbols are hard to analyze online

• Divergent sequences where only a segment is homologous

• Long insertions and deletions

• Tandem repeats

The square shape of the pattern is characteristic of these repeats

Using Dotlet:

• Dotlet is one of the handiest tools for making dot plots

• Dotlet is a Java applet

• Open and download the applet at the following site:

– www.isrec.isb-sib.ch/java/dotlet

• Dotlet slides a window along each sequence

• If the windows are more similar than the threshold, Dotlet prints a dot at their intersection

• You can control the similarity threshold with the little window on the left

• Every dot has a score given by the window comparison

• When the score is

– Below threshold 1 => black dot

– Between thresholds 1 and 2 => grey dot

– Above threshold 2 => white dot

• The blue curve is the distribution of scores in the sequences

• The peak => most common score

– Most common => less informative

• Window size and stringency control the appearance of your dot plot

– Very stringent = clean dot plot, little signal

– Not stringent enough = noisy dot plot, too much signal

• Play with the threshold until a usable signal appears

• The square shape is typical of tandem repeats

• The repeats are not perfect because the sequences have diverged after their duplication
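The window-and-threshold idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: windows are scored by counting identical residues, whereas Dotlet scores windows with a scoring matrix, and the names `dot_plot`, `shade` and `render` are made up for this example. How scores exactly equal to a threshold are shaded is also an assumption.

```python
def dot_plot(seq_a, seq_b, window=3, threshold=3):
    """Return the set of (i, j) window starts whose score reaches threshold.

    The score is the number of identical positions in the two windows.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            score = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if score >= threshold:
                dots.add((i, j))
    return dots


def shade(score, threshold_1, threshold_2):
    """Map a window score to a dot colour using two thresholds."""
    if score < threshold_1:
        return "black"
    if score <= threshold_2:
        return "grey"
    return "white"


def render(seq_a, seq_b, dots):
    """Draw the plot as text rows: '*' for a dot, '.' for no dot."""
    return [
        "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        for i in range(len(seq_a))
    ]


if __name__ == "__main__":
    repeat = "ABCABC"
    for row in render(repeat, repeat, dot_plot(repeat, repeat)):
        print(row)
```

Comparing the repeated sequence with itself prints the main diagonal plus two shorter parallel diagonals, which is the tandem-repeat signature mentioned above.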

Comparing a Gene and its Product:

• Eukaryotic genes are transcribed into RNA

• The RNA is then spliced to remove the intron sequences

• It may be necessary to compare the gene and its product

• Dotlet makes this comparative analysis easy

Aligning Sequences:

• Dotlet dot plots are a good way to provide an overview

• Dot plots don't provide residue/residue analysis

• For this analysis you need an alignment

• The most convenient tool for making precise local alignments is Lalign

Lalign and BLAST:

• Lalign is like a very precise BLAST

• It works on only two sequences at a time

• You must provide both sequences
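For readers who want to see what a local alignment computes, here is a minimal sketch using the classic Smith-Waterman recurrence. It is not Lalign's own code: Lalign can report several local alignments, while this sketch returns only the single best one, and the scores (match 2, mismatch -1, linear gap -2) are arbitrary values chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below zero, which is
    # what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + sub,
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

Only the shared ACGT segment is reported; the unrelated flanking residues are left out, which is the difference between a local and a global alignment.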

Going Farther:

• If you need to align coding DNA with a protein, try these sites:

– www.tcoffee.org => protogene

– coot.embl.de/pal2nal

• If you need to align very large sequences, try this site:

– www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi

• If you need a precise estimate of your alignment's statistical significance, use PRSS

– The program is available at fasta.bioch.virginia.edu

– Low E-value => good alignment
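The idea behind a shuffle-based significance estimate can be sketched as follows: score the real pair, then score the first sequence against many shuffled copies of the second, and see how often a shuffled copy does as well. This is a simplified illustration, not PRSS itself. The toy score (`identity_score`), the function names, the number of shuffles and the fixed seed are all assumptions, and the result is an empirical p-value rather than an E-value.

```python
import random


def identity_score(a, b):
    """Toy similarity score: number of identical positions, no gaps."""
    return sum(1 for x, y in zip(a, b) if x == y)


def shuffle_p_value(seq_a, seq_b, score=identity_score, shuffles=200, seed=0):
    """Fraction of shuffled copies of seq_b scoring at least as well.

    One is added to both counts so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    real = score(seq_a, seq_b)
    letters = list(seq_b)
    as_good = 0
    for _ in range(shuffles):
        rng.shuffle(letters)
        if score(seq_a, "".join(letters)) >= real:
            as_good += 1
    return (as_good + 1) / (shuffles + 1)


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGC"
    print(shuffle_p_value(seq, seq))
    print(shuffle_p_value("AAAA", "AAAA"))
```

Shuffling keeps the composition of the sequence but destroys its order, so a real score that shuffled copies rarely reach is unlikely to be explained by composition alone.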

BiRG Minutes : June 11, 2008

Analyzing Protein Sequences

In-silico biochemistry

Sliding-window techniques – the oldest way of looking at sequences

-used if the strand of DNA was cut in the middle

-ND THE WAS where the A was cut off

NDT HEW AS

DTH EWA S

THE WAS

-Use past experiences and what proteins have been together in the past

-Hydrophobicity is the most popular analysis – a good indicator of transmembrane segments or core regions within a protein.

Predicting transmembrane domains

ProtScale allows one to compute and represent the profile produced by any amino acid scale on a selected protein.

An amino acid scale is defined by a numerical value assigned to each type of amino acid.
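As a rough illustration of a sliding-window profile over an amino acid scale, the sketch below averages Kyte-Doolittle hydropathy values in a window centred on each residue. The function name `hydropathy_profile` and the window of 9 are choices made for this example; this is not ProtScale's code, only the same general idea.

```python
# Kyte-Doolittle hydropathy scale: higher values are more hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9, scale=KYTE_DOOLITTLE):
    """Return (centre position, mean scale value) for each full window.

    The window size must be odd so that it has a central residue.
    Positions too close to either end get no value.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    profile = []
    for centre in range(half, len(protein) - half):
        segment = protein[centre - half:centre + half + 1]
        mean = sum(scale[aa] for aa in segment) / window
        profile.append((centre, mean))
    return profile


if __name__ == "__main__":
    # Hypothetical sequence: a hydrophobic stretch between charged ends.
    for position, value in hydropathy_profile("DDKKDLLIVVLLIAVKKDDK"):
        print(position, round(value, 2))
```

A run of consecutive high values in the printed profile is the kind of signal that suggests a transmembrane segment or a buried core region.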

TMHMM (Transmembrane Helix Prediction) is a method for predicting transmembrane helices based on a hidden Markov model (HMM).

HMM - a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications.
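To show what recovering the hidden states from the observations looks like in practice, here is a toy two-state HMM decoded with the Viterbi algorithm. Everything in it is invented for illustration: the states ("M" for membrane, "S" for soluble), the two-letter alphabet ("h" for a hydrophobic residue, "p" for a polar one) and all the probabilities. Real transmembrane predictors use far richer models.

```python
import math

STATES = ("M", "S")
START = {"M": 0.5, "S": 0.5}
TRANS = {"M": {"M": 0.9, "S": 0.1}, "S": {"M": 0.1, "S": 0.9}}
EMIT = {"M": {"h": 0.8, "p": 0.2}, "S": {"h": 0.2, "p": 0.8}}


def viterbi(observations):
    """Return the most probable state path for a non-empty string of h/p."""
    # best[t][s] is the log-probability of the best path ending in state s.
    best = [{
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(
                STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][s])
            )
            scores[s] = (
                best[-1][prev]
                + math.log(TRANS[prev][s])
                + math.log(EMIT[s][obs])
            )
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("pphhhhpp"))
```

The observed letters are the visible part of the model; the M/S labels returned by `viterbi` are the hidden part it reconstructs.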

TMHMM creates a prediction: what the protein is expected to contain

ProtScale takes parameters and only shows the profile as it is, without making a prediction

Predicting post-translational modifications w/PROSITE

Proteins get modified after they are translated

PROSITE motifs are written as patterns

– Short patterns are not very informative by themselves

– They only indicate a possibility

– Combine them with other information to draw a conclusion
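PROSITE patterns map closely onto regular expressions, so a short sketch can show how a pattern is matched against a sequence. The converter below handles only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors), and the function names and test sequence are made up for this example. The pattern used is the N-glycosylation site pattern, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                      # element separator
    regex = regex.replace("x", ".")                     # any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("<", "^").replace(">", "$")   # sequence ends
    return regex


def find_sites(pattern, sequence):
    """Return (start position, matched text) pairs, overlaps included."""
    # The lookahead lets matches that overlap each other all be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites("N-{P}-[ST]-{P}", "AANGSAANPTA"))
```

A match found this way only marks a candidate site. As the list above says, a short pattern indicates a possibility and needs other evidence before drawing a conclusion.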

NOT EVERYTHING IS IN PROSITE

Interpreting PROSITE patterns

Some patterns may suggest nonexistent protein features

Short patterns are more informative if they are conserved across homologous sequences

A domain is defined as an "independent globular folding unit". It is a portion of a protein that can keep its shape if you remove it from the rest of the protein. It consists of at least 50 amino acids. Domains are like the components of a kitchen, such as the oven, the microwave, and the refrigerator. Together they make up the complete kitchen, but they can also exist separately. You only need the microwave to make popcorn, which can be done outside the kitchen.

An average protein consists of 2 or 3 domains. Usually each domain plays a specific role in the function of the protein. It may interact with other proteins, bind an ion such as calcium or zinc, or contain an active site. It is common to have a catalytic domain associated with a binding domain and a regulatory domain. Imagine a toaster, where you have the grill [catalytic], the toast holder [binding], and the switch [regulation].

Domains are like independent functions that can be taken out of a program but still function

Researchers

A domain is described by a multiple sequence alignment, assembled much like a puzzle

Using Domain collections

Scientists have been discovering and characterizing protein domains for more than 20 years

Manual collections are precise but small, because the researchers must document everything themselves

Automatic collections are built by software that goes out and finds data in research documents and similar sources

It is probable that only one of these servers will have the information you need to understand your protein
