Friday, July 25, 2008

BiRG minutes: 7-18-2008

Comparing Two Sequences

 Objectives:

          Get the basics about dot plots

          Know how to interpret the most common patterns in a dot plot

          Use Dotlet

          Use Lalign to extract local alignments

 
Why compare two?

          Database searches are useful for finding homologues

          Database searches don't provide precise comparisons

          More precise tools are needed to analyze the sequences in detail, including:

        Dot plots for graphic analysis

        Local or global alignments for residue/residue analysis

          The alignment of two sequences is called a pairwise alignment

 
Dot Plot:

          A dot plot is a graphic representation of pairwise similarity

          The simplicity of dot plots prevents artifacts

          Ideal for looking for features that may come in different orders

          Reveal complex patterns

          Benefit from the most sophisticated statistical-analysis tool in the universe . . . your brain

 
Choosing your two sequences:

          Making pairwise comparisons takes time

          Use BLAST to rapidly select your sequences

        More than 70% identity for DNA

        More than 25% identity for proteins

          If your sequences are too similar, comparing them yields no useful information

 What can you analyze with Dot Plot?

          Any pair of sequences

        DNA

        Proteins

        RNA

          DNA with proteins

        Dotlet is an appropriate tool

        To compare full genomes, install the program locally

          Sequences longer than 1000 symbols are hard to analyze online

           Divergent sequences where only a segment is homologous

          Long insertions and deletions

          Tandem repeats

        The square shape of the pattern is characteristic of these repeats

Using Dotlet:

          Dotlet is one of the handiest tools for making dot plots

          Dotlet is a Java applet

          Open and download the applet at the following site:

        www.isrec.isb-sib.ch/java/dotlet

          Dotlet slides a window along each sequence

          If the windows are more similar than the threshold, Dotlet prints a dot at their intersection

          You can control the similarity threshold with the little window on the left

          Every dot has a score given by the window comparison

          When the score is

        Below threshold 1 = black dot

        Between thresholds 1 and 2 = grey dot

        Above threshold 2 = white dot

          The blue curve is the distribution of scores in the sequences

          The peak = most common score

        Most common = less informative

          Window size and stringency control the appearance of your dot plot

        Very stringent = clean dot plot, little signal

        Not stringent enough = noisy dot plot, too much signal

          Play with the threshold until a usable signal appears

 
          The square shape is typical of tandem repeats

          The repeats are not perfect because the sequences have diverged after their duplication
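
The sliding-window comparison described above can be sketched in a few lines of Python. This is only an illustration, not Dotlet's code: the function names are made up for this example, the window score is a simple count of identical positions (an assumption, since Dotlet's actual scoring is not detailed in these notes), and a single threshold replaces Dotlet's two.

```python
def dot_plot(seq_a, seq_b, window=3, threshold=3):
    """Return a grid of booleans: grid[i][j] is True when the window starting
    at seq_a[i] and the window starting at seq_b[j] score at or above the
    threshold. Score = number of identical positions (an assumption)."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            win_a = seq_a[i:i + window]
            win_b = seq_b[j:j + window]
            # Count positions where the two windows carry the same symbol.
            score = sum(a == b for a, b in zip(win_a, win_b))
            row.append(score >= threshold)
        grid.append(row)
    return grid


def render(grid):
    """Draw the grid as text: '*' for a dot, '.' for no dot."""
    return "\n".join("".join("*" if hit else "." for hit in row) for row in grid)


if __name__ == "__main__":
    # A sequence compared with itself: the main diagonal plus off-diagonal
    # lines caused by the repeated "ABC" unit.
    print(render(dot_plot("ABCABC", "ABCABC")))
```

Lowering `threshold` below `window` makes the plot noisier, and raising `window` cleans it up, which mirrors the stringency trade-off in the notes.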

Comparing a Gene and its Product:

          Eukaryotic genes are transcribed into RNA

          The RNA is then spliced to remove the intron sequences

          It may be necessary to compare the gene and its product

          Dotlet makes this comparative analysis easy

 
Aligning Sequences:

          Dotlet dot plots are a good way to provide an overview

          Dot plots don't provide residue/residue analysis

          For this analysis you need an alignment

          The most convenient tool for making precise local alignments is Lalign

 Lalign and BLAST:

          Lalign is like a very precise BLAST

          It works on only two sequences at a time

          You must provide both sequences
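
To give an idea of what a local alignment of two sequences involves, here is a minimal Smith-Waterman sketch. It is not Lalign's own implementation: the function name, the scoring values (match 2, mismatch -1, gap -2), and the simple linear gap penalty are all assumptions chosen for illustration, and it reports only the single best local alignment.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Best-scoring local alignment of a and b (Smith-Waterman, linear gaps).
    Returns (score, aligned_a, aligned_b). Scoring values are illustrative."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below 0, which is what
    # makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(local_align("TTTACGTGGG", "CCCACGTAAA"))
```

Only the shared segment is reported; the unrelated flanking residues are left out, which is the point of a local rather than a global alignment.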

 
Going Farther:

          If you need to align coding DNA with a protein, try these sites:

        www.tcoffee.org => protogene

        coot.embl.de/pal2nal

          If you need to align very large sequences, try this site:

        www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi

          If you need a precise estimate of your alignment's statistical significance, use PRSS

        The program is available at fasta.bioch.virginia.edu

        Low E-value = good alignment

BiRG Minutes: June 11, 2008

Analyzing Protein Sequences

 In-silico biochemistry

 Sliding-window techniques – the oldest way of looking at sequences

                -Used when a strand of DNA was cut in the middle

                -Example: "ND THE WAS", where the leading A was cut off; sliding the window along one letter at a time gives:

                                NDT HEW AS

                                DTH EWA S

                                THE WAS

                -Use past experiences and what proteins have been together in the past

-Hydrophobicity is the most popular analysis – a good indicator of transmembrane segments or core regions within a protein.

 
Predicting transmembrane domains

ProtScale allows one to compute and represent the profile produced by any amino acid scale on a selected protein.

An amino acid scale is defined by a numerical value assigned to each type of amino acid.
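
As a rough sketch of what such a profile is, the Python below averages an amino acid scale over a sliding window. It uses the Kyte-Doolittle hydropathy values as the scale; the function name and the default window of 9 are assumptions for this example, and this is not ProtScale's own code.

```python
# Kyte-Doolittle hydropathy scale: one numerical value per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(sequence, scale=KYTE_DOOLITTLE, window=9):
    """Average the scale values over each window of the sequence.
    Returns one value per window position (empty if the sequence is
    shorter than the window)."""
    values = [scale[residue] for residue in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Hydrophobic stretch followed by a charged stretch: the profile
    # starts high and ends low.
    for position, value in enumerate(scale_profile("IIIIRRRR", window=4)):
        print(position, round(value, 2))
```

Stretches where the averaged hydropathy stays high are the candidates for transmembrane segments or core regions mentioned above.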

TMHMM (transmembrane helix prediction) is a method for predicting transmembrane helices based on a hidden Markov model (HMM)

HMM - a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications.

TMHMM produces a prediction: where the model says the transmembrane helices should be

ProtScale takes parameters and shows a profile of the sequence as it is

Predicting post-translational modifications with PROSITE

                Proteins get modified after they are translated

                PROSITE motifs are written as patterns

      Short patterns are not very informative by themselves

      They only indicate a possibility

      Combine them with other information to draw a conclusion

NOT EVERYTHING IS IN PROSITE

                Interpreting PROSITE patterns

                                Some patterns may suggest nonexistent protein features

                                Short patterns are more informative if they are conserved across homologous sequences
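
Since PROSITE motifs are written as patterns, scanning a sequence for one is close to a regular-expression search. The sketch below converts the basic PROSITE syntax into a Python regular expression and uses the N-glycosylation pattern N-{P}-[ST]-{P} as the example. The function names are made up for this illustration, and anchors placed inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate basic PROSITE syntax into a Python regular expression.
    Handles x, [..], {..}, (n), (n,m), and leading '<' / trailing '>'."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "anything but P"; must be done before (n) -> {n}.
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) is a repetition count.
        element = element.replace("(", "{").replace(")", "}")
        # Lowercase x stands for any amino acid.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- or C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", "MKNGSAQNPSV"))
```

A four-residue pattern like this one matches by chance in many proteins, which is why a hit only indicates a possibility.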

A domain is defined as an "independent globular folding unit". It is a portion of a protein that can keep its shape if you remove it from the rest of the protein. It consists of at least 50 amino acids. Domains are like the various components of a kitchen, such as the oven, the microwave, and the refrigerator. All together they constitute the complete kitchen, but they can also exist separately. You only need the microwave to make popcorn, which can be done outside the kitchen.

An average protein consists of 2 or 3 domains. Usually each domain plays a specific role in the function of the protein. It may interact with other proteins, bind an ion such as calcium or zinc, or contain an active site. It is common to have a catalytic domain associated with a binding domain and a regulatory domain. Imagine a toaster, where you have the grill [catalytic], the toast holder [binding], and the switch [regulation].

Domains are like independent functions that can be taken out of a program but still function

Researchers

A domain is described by a multiple sequence alignment, pieced together like a puzzle

Using Domain collections

Scientists have been discovering and characterizing protein domains for more than 20 years

Manual collections are precise but small, because the researchers must document everything on their own

Automatic collections go out and find data in research documents and similar sources

It is probable that only one of these servers will have the information to help you understand your protein
