



Our BiRGer 'Mark' surprised me today when he walked into the BiRG lab with a big cake. Here are the photos of the cake and the BiRG lab.
Thank you very much for the cake! A very Merry Christmas and a very Happy New Year to you all.
Indiana University Southeast BioInformatics Research Group Web Log
BLOOMINGTON, Ind. -- Computerworld magazine announced in its August 2008 issue its "Top IT Schools to Watch 2008," and the Indiana University School of Informatics was among the 10 schools recognized in a feature article on graduate programs.
The schools, including institutions such as Carnegie Mellon University, Stanford University, the University of Pennsylvania, and the University of Virginia, were selected based on how well they were keeping pace with today's IT workplace, and the relevance of their curriculum to the ever-changing technology industry.
IU's School of Informatics was touted not only for providing students with real-world experience, but also for its interdisciplinary approach to the field and for the responsiveness of its faculty and students.
The list was compiled by a panel of more than two dozen IT executives, hiring managers, recruiters and academics who were asked to help identify the country's leading-edge schools for IT workers seeking to advance their careers. They considered graduate-level IT programs and schools that give graduates the best value in terms of salary increases or promotions vs. cost of tuition, and that best gear their curriculum to the everyday demands of today's IT workplace.
From the IT schools selected by the panel, Computerworld editors chose the innovative IT schools to profile. Finally, Computerworld partnered with Dice.com to survey alumni at the schools, asking for feedback on their satisfaction with their schools' program.
"We are honored to be part of Computerworld's list for 2008," said Bobby Schnabel, dean of the School. "It is gratifying to be in the company of schools that have long been considered at the top of the computing field, and to gain recognition for our still young set of graduate programs in informatics."
The complete story can be found in the August issue of Computerworld magazine and online at www.computerworld.com.
Founded in 2000 as the first school of its kind in the United States, the Indiana University School of Informatics is dedicated to research and teaching across a broad range of computing and information technology, with emphases on science, applications, and societal implications. The school includes the Departments of Computer Science and Informatics on the Bloomington campus and Informatics on the IUPUI campus.
The school administers a variety of bachelor's and master's degree programs in computer science and informatics, as well as Ph.D. programs in computer science and the first-ever doctorate in informatics. The School is dedicated to excellence in education and research, to partnerships that bolster economic development and entrepreneurship, and to increasing opportunities for women and underrepresented minorities in computing and technology. For more information, visit www.informatics.indiana.edu.
Comparing Two Sequences
Objectives:
• Get the basics about dot plots
• Know how to interpret the most common patterns in a dot plot
• Use Dotlet
• Use Lalign to extract local alignments
Why compare two?
• Database searches are useful for finding homologues
• Database searches don't provide precise comparisons
• More precise tools are needed to analyze the sequences in detail, including
– Dot plots for graphic analysis
– Local or global alignments for residue/residue analysis
• The alignment of two sequences is called a pairwise alignment
Dot Plot:
• A dot plot is a graphic representation of pairwise similarity
• The simplicity of dot plots prevents artifacts
• Ideal for looking for features that may come in different orders
• Reveal complex patterns
• Benefit from the most sophisticated statistical-analysis tool in the universe . . . your brain
Choosing your two sequences:
• Making pairwise comparisons takes time
• Use BLAST to rapidly select your sequences
– More than 70% identity for DNA
– More than 25% identity for proteins
• If your sequences are too similar, comparing them yields no useful information
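The identity cutoffs above can be checked on an existing pairwise alignment. The sketch below assumes the two sequences are already aligned, are the same length, and use "-" for gaps. It counts identity over the columns where neither sequence has a gap, which is only one of several conventions; BLAST and other tools may report a different figure. The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns that hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    # An alignment made only of gap columns has nothing to compare.
    return 100.0 * identical / compared if compared else 0.0


if __name__ == "__main__":
    print(percent_identity("ACGTACGT", "ACGAAC-T"))
```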
What can you analyze with Dot Plot?
• Any pair of sequences
– DNA
– Proteins
– RNA
• DNA with proteins
– Dotlet is an appropriate tool
– To compare full genomes, install the program locally
• Sequences longer than 1000 symbols are hard to analyze online
• Divergent sequences where only a segment is homologous
• Long insertions and deletions
• Tandem repeats
– The square shape of the pattern is characteristic of these repeats
Using Dotlet:
• Dotlet is one of the handiest tools for making dot plots
• Dotlet is a Java applet
• Open and download the applet at the following site:
– www.isrec.isb-sib.ch/java/dotlet
• Dotlet slides a window along each sequence
• If the windows are more similar than the threshold, Dotlet prints a dot at their intersection
• You can control the similarity threshold with the little window on the left
• Every dot has a score given by the window comparison
• When the score is
– Below threshold 1 => black dot
– Between thresholds 1 and 2 => grey dot
– Above threshold 2 => white dot
• The blue curve is the distribution of scores in the sequences
• The peak => most common score
– Most common => less informative
• Window size and stringency control the appearance of your dot plot
– Very stringent = clean dot plot, little signal
– Not stringent enough = noisy dot plot, too much signal
• Play with the threshold until a usable signal appears
• The square shape is typical of tandem repeats
• The repeats are not perfect because the sequences have diverged after their duplication
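The sliding-window procedure described above can be sketched in a few lines. This is a minimal illustration, not Dotlet's implementation. It assumes the window score is simply the number of identical positions, and it uses a single threshold rather than Dotlet's two grey-scale thresholds. The names dot_plot and render are hypothetical.

```python
def dot_plot(seq_a, seq_b, window=3, threshold=3):
    """Return (i, j) pairs where the windows starting at i and j are similar enough."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Window score: count of identical positions (an assumption of this sketch).
            score = sum(1 for k in range(window) if seq_a[i + k] == seq_b[j + k])
            if score >= threshold:
                dots.append((i, j))
    return dots


def render(seq_a, seq_b, dots, window=3):
    """Draw the plot as text: one row per window of seq_a, one column per window of seq_b."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    marked = set(dots)
    return ["".join("*" if (i, j) in marked else "." for j in range(cols))
            for i in range(rows)]


if __name__ == "__main__":
    repeat = "ABCABCABC"
    # Comparing a tandem repeat with itself gives several parallel diagonals.
    for line in render(repeat, repeat, dot_plot(repeat, repeat)):
        print(line)
```

Lowering threshold below window makes the plot noisier, and raising window makes it cleaner. This is the stringency trade-off described in the notes.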
Comparing a Gene and its Product:
• Eukaryotic genes are transcribed into RNA
• The RNA is then spliced to remove the introns' sequences
• It may be necessary to compare the gene and its product
• Dotlet makes this comparative analysis easy
Aligning Sequences:
• Dotlet dot plots are a good way to provide an overview
• Dot plots don't provide residue/residue analysis
• For this analysis you need an alignment
• The most convenient tool for making precise local alignments is Lalign
Lalign and BLAST:
• Lalign is like a very precise BLAST
• It works on only two sequences at a time
• You must provide both sequences
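To make "local alignment" concrete, here is a minimal sketch of the classic Smith-Waterman dynamic program. It is a stand-in for illustration only: it is not Lalign's algorithm or scoring, and it reports only one best local alignment. The match, mismatch and gap values are arbitrary assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b) for one local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start anywhere: this makes it local.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("XXACGTXX", "YYACGTYY"))
```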
Going Farther:
• If you need to align coding DNA with a protein, try these sites:
– www.tcoffee.org => protogene
• If you need to align very large sequences, try this site:
– www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi
• If you need a precise estimate of your alignment's statistical significance, use PRSS
– The program is available at fasta.bioch.virginia.edu
– Low E-value => good alignment
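The idea behind a shuffle-based significance estimate can be sketched as follows: score the real pair, then score the first sequence against many shuffled copies of the second, and see how often chance does as well. This is a simplified illustration, not the PRSS program. It reports an empirical p-value rather than an E-value. The scoring values, the number of shuffles and the function names are assumptions.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (score only, two rows of the table kept in memory)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def shuffle_p_value(a, b, shuffles=200, seed=0):
    """Fraction of shuffled copies of b that score at least as well as b itself."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = local_score(a, b)
    letters = list(b)
    at_least = 0
    for _ in range(shuffles):
        rng.shuffle(letters)  # same composition, random order
        if local_score(a, "".join(letters)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (shuffles + 1)


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCTAACCGATTGCAT"
    print(shuffle_p_value(seq, seq))
```

A small value means shuffled sequences rarely score as well as the real one. This is the same intuition as "low E-value, good alignment".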
Analyzing Protein Sequences
In-silico biochemistry
Sliding-window techniques – the oldest way of looking at sequences
-used if the strand of DNA was cut in the middle
-Example: ND THE WAS, where the A was cut off; sliding the reading window one letter at a time gives:
NDT HEW AS
DTH EWA S
THE WAS
-Use past experiences and what proteins have been together in the past
-Hydrophobicity is the most popular analysis – a good indicator of transmembrane segments or core regions within a protein.
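The ND THE WAS example above amounts to reading the same letters in groups of three from three different starting offsets. A minimal sketch, with a hypothetical function name:

```python
def frames(text, size=3):
    """Split the letters of text into groups of `size`, once for each starting offset."""
    letters = "".join(text.split())  # drop the original spacing
    return [
        [letters[i:i + size] for i in range(offset, len(letters), size)]
        for offset in range(size)
    ]


if __name__ == "__main__":
    for frame in frames("ND THE WAS"):
        print(" ".join(frame))
    # prints:
    # NDT HEW AS
    # DTH EWA S
    # THE WAS
```

Only the third offset recovers readable words. This is why all three frames are tried when the true start is unknown.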
Predicting transmembrane domains
ProtScale allows one to compute and represent the profile produced by any amino acid scale on a selected protein.
An amino acid scale is defined by a numerical value assigned to each type of amino acid.
TMHMM (Transmembrane Helix Prediction) is a method for predicting transmembrane helices based on a hidden Markov model (HMM)
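A scale profile of this kind can be sketched as a sliding-window average. The example below uses the Kyte-Doolittle hydropathy values as the scale. The window size of 9 and the unweighted mean are assumptions of this sketch, not ProtScale's settings, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values: one number per amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9, scale=KYTE_DOOLITTLE):
    """Average the scale value over every window of the sequence."""
    values = [scale[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical sequence: a polar stretch, a hydrophobic stretch, a polar stretch.
    seq = "KDERNQSTKLLIVAVLFIAGLVKRDENQ"
    for start, value in enumerate(hydropathy_profile(seq)):
        print(start, round(value, 2))
```

Windows with a high average suggest a hydrophobic stretch such as a candidate transmembrane segment. The cutoff that counts as "high" depends on the scale and window chosen.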
HMM - a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications.
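Recovering hidden states from observations can be shown with the Viterbi algorithm on a toy two-state model. Everything in the model is invented for illustration: the states ("loop" and "membrane"), the observation symbols ("h" for hydrophobic, "p" for polar) and all probabilities. It is far simpler than, and unrelated to, the model used by real transmembrane predictors.

```python
import math

STATES = ("loop", "membrane")
# All probabilities below are made-up values for this sketch.
START = {"loop": 0.5, "membrane": 0.5}
TRANS = {
    "loop": {"loop": 0.9, "membrane": 0.1},
    "membrane": {"loop": 0.1, "membrane": 0.9},
}
EMIT = {
    "loop": {"h": 0.2, "p": 0.8},
    "membrane": {"h": 0.8, "p": 0.2},
}


def viterbi(observations):
    """Return the most probable sequence of hidden states for the observations."""
    if not observations:
        return []
    # Log-probabilities avoid numerical underflow on long sequences.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
               for s in STATES}]
    back = []
    for obs in observations[1:]:
        row, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            row[s] = scores[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            pointers[s] = prev
        scores.append(row)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("pppphhhhhhpppp"))
```

A single "h" among polar residues is not enough to switch state, because the transition penalty outweighs the emission gain. This is how an HMM smooths a noisy signal.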
TMHMM creates a prediction, or what it should have been
ProtScale takes parameters and shows the profile as it is
Predicting post-translational modifications w/PROSITE
Proteins get modified after translation, between being made and reaching their destination
PROSITE motifs are written as patterns
– Short patterns are not very informative by themselves
– They only indicate a possibility
– Combine them with other information to draw a conclusion
NOT EVERYTHING IS IN PROSITE
Interpreting PROSITE patterns
Some patterns may suggest nonexistent protein features
Short patterns are more informative if they are conserved across homologous sequences
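Because PROSITE motifs are written as patterns, they can be turned into regular expressions and searched for in a sequence. The sketch below is an assumption-laden converter covering only the common syntax: x, [..], {..}, repeat counts such as x(2,4), and the < and > anchors at element boundaries. The example pattern N-{P}-[ST]-{P} is the N-glycosylation motif, the test sequence is made up, and the function names are hypothetical.

```python
import re

ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return (start index, matched text) for every hit, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_sites("N-{P}-[ST]-{P}.", "MNGTANPSQ"))
```

A hit only indicates a possibility, as noted above. A four-residue pattern like this one matches many sequences by chance.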
Domains are defined as "independent globular folding units". A domain is a portion of a protein that can keep its shape if you remove it from the rest of the protein. It consists of at least 50 amino acids. - Domains are like the various components of a kitchen, such as the oven, the microwave, and the refrigerator. All together they constitute the complete kitchen, but they can also exist separately. You only need the microwave to make popcorn, which can be done outside the kitchen.
An average protein consists of 2 or 3 domains. Usually each domain plays a specific role in the function of the protein. It may interact with other proteins, bind an ion such as calcium or zinc, or contain an active site. It is common to have a catalytic domain associated with a binding domain and a regulatory domain. Imagine a toaster, where you have the grill [catalytic], the toast holder [binding], and the switch [regulation].
Domains are like independent functions that can be taken out of a program but still function
Researchers
A domain is a multi-sequence alignment similar to a puzzle
Using Domain collections
Scientists have been discovering and characterizing protein domains for more than 20 years
Manual collections are precise but small, because the researchers must document everything on their own
Automatic collections go out and find data in research documents, etc.
It is probable that only one of these servers will have the information to help you understand your protein
Types of Organisms: Prokaryotic, Eukaryotic, and Archaea
Protein Maturation
Deciphering a Swiss-Prot entry
Specialized protein databases: KEGG (the metabolic pathways database) or PDB (structure database)
2 ways to predict genetics
1. Genes to proteins or translation (genomics)
2. DNA
We must merge the two
From Gene to functional Protein
DNA > mRNA > proteins > upon maturing > transportation > destination
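The first two steps of this chain, DNA to mRNA to protein, can be sketched with the standard genetic code. The sketch assumes the input is the coding strand of an intron-free sequence, starts translating at the first AUG, and stops at the first stop codon. Maturation, transport and destination are not modeled. The function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in UCAG order for each codon position; "*" is a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: same letters with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")
    print(mrna, translate(mrna))
```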
Protein Maturation:
-removal of some fragments
-specific protein cleavage
-chemical modifications
-Phosphorylation (addition of a phosphate group, which can change the protein's shape and activity)
-addition of lipids (lipidation) or sugars (glycosylation)
-Proteins are often modified to make them active
-Modification can imply attaching a lipid or a sugar
-Use these resources to determine the details of the modification
Swiss-Prot Database – (British) entries describe all proteins that have known functions
TrEMBL contains the 4 million putative proteins found in GenBank
Swiss-Prot contains the subset of TrEMBL with a known function
It is redundant to create many databases using the same information
A Swiss-Prot entry: www.expasy.org/uniprot/P00533
General information (accession number), references, comments, cross-references, feature table, sequence
General Information: Entry Name, Primary Accession Number (PXXXX [P is for protein]), Last Modified, Protein name and synonyms, from/taxonomy fields (tells where protein came from), references section
Comments section lists all the known functions of the protein
Features Section localizes precisely every known function of your protein, each on its sequence
• TRANSMEM: Transmembrane domain (something that passes through the membrane)
• ACT_SITE: Active sites (where chemicals can bond)
• BINDING: Binding sites
• DISULFID: Bridge of cysteines (disulfide bond)
• EMBL: GenBank original DNA sequence
• PDB: Experimental structure of your protein
• DIP: Proteins interacting with your protein
• GlycoSuiteDB: Glycosylations
• MIM: List of genetic diseases involving your protein
• Ontologies: Function of your protein
• Profiles: Known protein domains in your protein
• ENSEMBL: Genomic location of your protein
By alternative splicing, the protein can have MANY functions
• To find out about the function of your protein, you will need to determine
– Where your protein works
– Metabolic pathway in which the protein is involved
– The protein's 3D structure
– Which protein family it belongs to
Where do proteins work?
Part of the metabolic pathway
Chain of production linking several different proteins
Modify metabolites by passing them from one enzyme to the next
On a KEGG pathway, each enzyme appears with its EC number
– KEGG is the most extensive database of metabolic pathways
– You can use it to compare species (Japan)
– The IUBMB assigns the EC numbers used to describe an enzyme activity (UK)
– An exhaustive list of all known metabolic pathways in E. coli and other bacteria
Some important Protein Families
– Kinases control everything in us; their deregulation is the cause of many cancers
– Immunoglobulins are key elements of our natural defenses
– This site is a key resource on restriction enzymes
Predicting protein function is a central goal in biology
• Protein databases help organize knowledge
• They provide the material for
– Developing new biological experiments
– Developing new prediction algorithms
– Extrapolating experimental data to unknown sequences
Chancellor's Award for Interdisciplinary achievement | |||
2008 | Investigating Alu Distribution by Family Across Human Chromosome Sequences | John Lannon Informatics | Sridhar Ramachandran |