Bioinformatics Tools from Bioinformatics Analysis of Macromolecules class


COMMON ERRORS

  • Make sure you are not putting a nucleotide sequence into a protein query, and vice versa.
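A quick sanity check before pasting a sequence into a tool can catch this mistake. The sketch below is only a rough heuristic, not part of any of the tools in these notes: the function name and the 90% cutoff are assumptions chosen for illustration.

```python
def guess_sequence_type(sequence: str) -> str:
    """Guess whether a pasted sequence is 'nucleotide' or 'protein'."""
    # Ignore FASTA header lines (they start with ">").
    lines = [ln for ln in sequence.splitlines() if not ln.startswith(">")]
    # Keep letters only, so spaces and line numbers do not affect the count.
    letters = [ch for ch in "".join(lines).upper() if ch.isalpha()]
    if not letters:
        raise ValueError("no sequence letters found")
    # Assumption: at least 90% A/C/G/T/U/N means nucleotide; otherwise protein.
    nucleotide_like = sum(ch in "ACGTUN" for ch in letters)
    return "nucleotide" if nucleotide_like / len(letters) >= 0.9 else "protein"


print(guess_sequence_type(">demo\nATGGCCATTGTAATGGGCCGC"))
print(guess_sequence_type("MKTAYIAKQRQISFVKSHFSRQ"))
```

A very short peptide made up mostly of A, C, G, and T would fool this check, so treat the answer as a hint only.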



GENSCAN


http://genes.mit.edu/GENSCAN.html



Intr - internal exon


Term - terminal exon


Prom - promoter


Sngl - single-exon gene





FGENESH, Animal Version - Ab initio gene prediction (SOFTBERRY)


http://www.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind






GeneID


GeneFinder




Notes


Chimeric Sequences


t(9;22)(q34;q11): a translocation between chromosomes 9 and 22, with breakpoints at q34 on chromosome 9 and q11 on chromosome 22.




Comparison of Fgenesh (softberry) and Genscan.


(MUST BE IN FASTA FORMAT)


NG_011759 Both returned 15 exons total, with 11 on one strand and 4 on the other.


EU445484 Genscan - 7 exons. Softberry - 6 exons.


AF083883 Both returned a result of 3 total exons.


Y16787 GenScan returned a result of 6 exons, while Fgenesh returned a result of 8 exons. In reality there are 7 exons total. Fgenesh (Softberry) had results that were superior to GenScan.



Notes from prof and students

Use both in the lab practical. BLAST putative exons for quality control. GenScan usually had the exons in the right location; Fgenesh was generally wrong in this regard. Fgenesh was more sensitive in one out of the four cases.




Bacterial Genomes


When accessing a particular region at NCBI, search for the ID number of the gene, then click on “change region shown”.



fgenesB

  • don’t forget to select the given organism


http://www.softberry.com/berry.phtml?topic=fgenesb&group=programs&subgroup=gfindb


Notes: Tu/Op = transcription units/operons




Glimmer


http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi




GeneMark (preferred method)

  • Note, GeneMark not only creates a prediction, but gives alternative predictions and their likelihood.


http://www.ncbi.nlm.nih.gov/genomes/MICROBES/genemark.cgi



MRSA sequence NC_002952: GeneMark accurately predicted two of the three genes, which was closer than either of the other two programs. The class agreed that GeneMark is the superior program overall.


NC_012578 GeneMark and Softberry both analyzed the sequence and predicted accurately; however, Softberry requires the selection of a particular organism -_-


NC_002505 - GeneMark lined up the numbers perfectly! Softberry was close.




Secondary Structure Prediction (for proteins)


alpha helix

beta strand

coils, turns, disordered regions


http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml


https://predictprotein.org/


http://cib.cf.ocha.ac.jp/bitool/MIX/


http://www.ncbi.nlm.nih.gov/Structure/mmdb/mmdbsrv.cgi?uid=6043


With regard to CD4, all three structure prediction programs on the Japanese website turned out to be utterly useless. All three predicted an abundance of alpha helices, yet the crystallized structure contained none. I feel like I am spinning chicken bones to try and predict the harvest with these useless programs. (Note: when I used these programs during my practice practical for a general bioinformatics class, they turned out to be more useful.)


predictprotein.org is far superior to the three protein prediction methods on the Japanese website.



Use Cn3D to view the 3D protein structure. Don’t forget Style - Coloring Shortcuts - Secondary Structure



Protein Location Prediction Tools


PSORT II - Protein Sorting

http://psort.hgc.jp/form2.html




PROTCOMP


Protein Compartment

http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=proloc


Between Protcomp and PSORT II, Protcomp accurately predicted the location of CD4 (plasma membrane), but PSORT II did not.


Protcomp accurately predicted the location of all 3 isoforms of CD4. It said that all three were located in the plasma membrane.





Accession Number   Protein Location        PSORTII        Protcomp
AAK64604           mitochondrial protein   Accurate       Accurate
CAA47024           cytoplasmic protein     Accurate       Accurate
AAA59599           plasma membrane         Not Accurate   Accurate
NP_001081025       nuclear protein         Accurate       Not Accurate
NP_000509          cytoplasmic protein     Accurate       Accurate
AAA61140           extracellular protein   Not Accurate   Accurate
NP_001019820       endoplasmic reticulum   Accurate       Accurate
NP_006735          extracellular protein   Accurate       Accurate



Note: Protcomp was inaccurate for NP_000509 at first. That query used the version of the sequence from the bottom of the protein database record, which includes numbered lines and other formatting. When the FASTA-format sequence was copied and pasted into the query instead, the prediction came back accurate.
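If only the numbered version of a sequence is at hand, it can be cleaned up into FASTA before being pasted into a tool. This is a minimal sketch: the function name, the header text, and the demo residues are made up for illustration, and it assumes that only the numbered sequence lines are pasted in (any other words, such as a section title, would be mistaken for residues).

```python
import textwrap


def numbered_to_fasta(raw: str, header: str) -> str:
    """Turn numbered, space-separated sequence lines into FASTA text."""
    # Keep letters only: this drops line numbers, spaces, and a trailing "//".
    residues = "".join(ch for ch in raw if ch.isalpha()).upper()
    if not residues:
        raise ValueError("no sequence letters found")
    # Wrap at 60 characters per line, a common FASTA line width.
    return f">{header}\n" + "\n".join(textwrap.wrap(residues, 60))


raw = """        1 acdefghikl mnpqrstvwy
       21 acdefghikl
//"""
print(numbered_to_fasta(raw, "demo_protein"))
```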

Conclusion for Protein Location Prediction


Both programs have similar degrees of accuracy. Use both programs to predict protein locations and compare the predictions.
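The tally behind this conclusion can be checked with a few lines of Python. The values below are copied from the comparison table above; the variable and function names are just for illustration.

```python
# (PSORTII accurate?, Protcomp accurate?) for each accession in the table.
RESULTS = {
    "AAK64604": (True, True),
    "CAA47024": (True, True),
    "AAA59599": (False, True),
    "NP_001081025": (True, False),
    "NP_000509": (True, True),
    "AAA61140": (False, True),
    "NP_001019820": (True, True),
    "NP_006735": (True, True),
}


def tally(results):
    """Return hits per program and the accessions where the two disagree."""
    psort_hits = sum(psort for psort, _ in results.values())
    protcomp_hits = sum(protcomp for _, protcomp in results.values())
    disagreements = [acc for acc, (p, c) in results.items() if p != c]
    return psort_hits, protcomp_hits, disagreements


psort_hits, protcomp_hits, disagreements = tally(RESULTS)
print(f"PSORTII:  {psort_hits}/{len(RESULTS)}")
print(f"Protcomp: {protcomp_hits}/{len(RESULTS)}")
print("Disagreements:", ", ".join(disagreements))
```

The accessions where the two programs disagree are the ones worth a closer look.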



Discrepancy During Lab Practical


Make note of each program's prediction and interpret the results rationally. One prediction may need to be discarded, or there may be a close tie between two predictions. Also remember that protein locations are not static: proteins move from one location to another within the cell and can often be found in multiple locations.



In Professional Setting


Follow the steps described above for a discrepancy during the lab practical. Additionally, it may be a good idea to search professional databases (such as NCBI) and infer the location of the protein from similar proteins.




Transmembrane Proteins Prediction


(When comparing to NCBI, use Ctrl+F to search for “transmembrane region”.)


Sosui - (may require Firefox.)

On Windows, go to the Control Panel. You may need to allow the Sosui website and update Java.


On a Mac, click the Security tab in Firefox, edit the site list, and add Sosui.


http://harrier.nagahama-i-bio.ac.jp/sosui/sosui_submit.html




TM Predict


http://www.ch.embnet.org/software/TMPRED_form.html




TMHMM2


http://cbs.dtu.dk/services/TMHMM/



Helical Wheel program - A great complement to the other three programs. A section of a protein sequence thought to be helical can be pasted into this program for analysis. (NOTE: This program assumes a priori that any sequence copied and pasted into the query box is a helical region.)


http://rzlab.ucr.edu/scripts/wheel/wheel.cgi
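The idea behind a helical wheel can be sketched in a few lines. In an ideal alpha helix each residue is rotated about 100 degrees from the previous one, so residues can be placed around a circle and checked for a hydrophobic face. This sketch is not the code behind the website above: the function names are made up, the Kyte-Doolittle hydropathy scale is my choice of scale, and the hydrophobic moment is added only as one common way to summarize how one-sided the helix is.

```python
import math

# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def wheel_angles(segment: str) -> list[int]:
    """Angle in degrees of each residue on the wheel (100 degrees per residue)."""
    return [(i * 100) % 360 for i in range(len(segment))]


def hydrophobic_moment(segment: str) -> float:
    """Mean hydrophobic moment, assuming the whole segment is an ideal helix."""
    # Non-standard letters raise KeyError, which is a useful warning here.
    x = sum(KYTE_DOOLITTLE[aa] * math.cos(math.radians(100 * i))
            for i, aa in enumerate(segment.upper()))
    y = sum(KYTE_DOOLITTLE[aa] * math.sin(math.radians(100 * i))
            for i, aa in enumerate(segment.upper()))
    return math.hypot(x, y) / len(segment)


segment = "LKALLKKL"  # hypothetical segment, for illustration only
for residue, angle in zip(segment, wheel_angles(segment)):
    print(residue, angle)
print(round(hydrophobic_moment(segment), 2))
```

Like the website, this assumes the segment really is helical; a large moment suggests the hydrophobic residues cluster on one side of the helix.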


Main site for the Helical Wheel lab


http://rzlab.ucr.edu/





Protein                       Accession      Sosui                     TMPredict                      TMHMM2
Human glycophorin A           NP_002090
Vitamin K epoxide reductase   ADN49753
Bovine rhodopsin              NP_001014890   Prediction somewhat off   Prediction was very accurate   Nearly identical to NCBI flat file
Gorilla ABC-transporter       AAA91199
CFTR protein                  NP_000483

Notes (bovine rhodopsin): TMPredict seemed to be the most clear tool, and the most accurate, with TMHMM2 being a close second.








Class verdict - all three programs were pretty accurate and performed with minimal error.






Protein Structural Prediction


Predicts the signal peptide that directs a protein to the ER, and the associated cleavage site:


SignalP

http://www.cbs.dtu.dk/services/SignalP/


SignalP explanation

http://www.cbs.dtu.dk/services/SignalP-4.1/output.php



Disulfide Bonds (Cysteine - Cysteine bonds) (See also the Prosite Tool under Protein Signatures)


CYS_REC

  • Used to predict the presence of disulfide bonds in a protein.

http://www.softberry.com/berry.phtml?topic=cys_rec&group=programs&subgroup=propt


DiANNA (not used in BAM class) - A tool for predicting disulfide bonds.


http://clavius.bc.edu/~clotelab/DiANNA/




Predict coiled-coils


COILS

http://www.ch.embnet.org/software/COILS_form.html


Marcoil

http://toolkit.tuebingen.mpg.de/marcoil


COILS/ PCOILS (not used in BAM class)

  • Predicts coiled coils and compares the prediction to a database of known sequences.

http://toolkit.tuebingen.mpg.de/pcoils



Repetitive Elements Search

Nucleotide BLAST against the dbALU database. Finds Alu repetitive elements in your nucleotide query sequence.


http://blast.st-va.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome



Leucine Zipper


2zip Server

http://2zip.molgen.mpg.de/
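The simplest leucine zipper signature is a leucine at every seventh position, four times in a row (L-x(6)-L-x(6)-L-x(6)-L). The sketch below only looks for that spacing with a regular expression. It is not the 2ZIP method, which as I understand it does more than a pattern match, and the function name and demo sequence are made up for illustration.

```python
import re

# Four leucines, each separated by exactly six residues of any kind.
# The lookahead lets overlapping matches be reported as well.
LEUCINE_REPEAT = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_repeats(protein: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched segment) for each hit."""
    sequence = protein.upper()
    return [(m.start() + 1, m.group(1)) for m in LEUCINE_REPEAT.finditer(sequence)]


demo = "MK" + "LAAAAAA" * 3 + "L" + "GG"  # hypothetical sequence
print(find_leucine_repeats(demo))
```

This pattern alone produces many false positives, so a hit is only a reason to look closer with a dedicated tool.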


Protein Signatures


ScanProsite Tool - By default, the tool excludes motifs with a high probability of occurrence (this is a good thing).

http://prosite.expasy.org/scanprosite/


Fingerprint Scan - A tool that scans for motif fingerprints.

http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/fingerPRINTScan/FPScan_fam.cgi



Nucleotide Signatures


Search for Human Promoters (nucleic acid analysis) - A threshold of 0.8 is used by default to avoid false positives (this is a good thing).

http://linux1.softberry.com/berry.phtml?topic=fprom&group=programs&subgroup=promoter


CpG - Searches for CpG islands.

http://www.softberry.com/berry.phtml?topic=cpgfinder&group=programs&subgroup=promoter
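What makes a region look like a CpG island can be shown with a small calculation: GC content and the ratio of observed to expected CpG dinucleotides. The cutoffs below (at least 200 bp, GC content above 50%, observed/expected above 0.6) are the commonly quoted textbook criteria and are assumptions here; the Softberry tool may use different settings. The function names are made up, and the check is applied to a whole sequence rather than scanning along it with a window.

```python
def cpg_stats(dna: str) -> tuple[int, float, float]:
    """Return (length, GC fraction, observed/expected CpG ratio)."""
    # Keep only unambiguous bases so headers or numbers do not skew counts.
    seq = "".join(ch for ch in dna.upper() if ch in "ACGT")
    n = len(seq)
    if n == 0:
        raise ValueError("no A/C/G/T bases found")
    c_count, g_count = seq.count("C"), seq.count("G")
    cpg_count = seq.count("CG")
    gc_fraction = (c_count + g_count) / n
    # Expected CpG count is (C count * G count) / length.
    if c_count and g_count:
        obs_exp = cpg_count * n / (c_count * g_count)
    else:
        obs_exp = 0.0
    return n, gc_fraction, obs_exp


def looks_like_cpg_island(dna: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the assumed length, GC, and observed/expected cutoffs."""
    n, gc_fraction, obs_exp = cpg_stats(dna)
    return n >= min_len and gc_fraction > min_gc and obs_exp > min_ratio


print(cpg_stats("CG" * 100))
print(looks_like_cpg_island("AT" * 100))
```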



Searching with Putative Chimeric Proteins

BLASTP against the RefSeq database


https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome




Miscellaneous Bioinformatics Links


http://cbs.dtu.dk/services/


http://cbs.dtu.dk/biolinks/




3D Visualization and Prediction


CN3D


Download http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3dwin.shtml



CN3D Exercise

Classify each molecule as α, β, α/β, α+β, or a transmembrane protein.



Molecule                              Inferred Class   Actual Class
1rnb, G-specific endonuclease
1cd8, CD8 molecule
3hhb, hemoglobin
1kzu, light-harvesting complex
1a50, tryptophan synthase β subunit
















Phylogenetics Analysis

MEGA - MEGA is a program that I have found to be exceptionally useful for phylogenetics analysis. When I took Evolutionary Biology in 2013, MEGA was the main tool that I used for my phylogenetics project. My project involved comparing 5 different genes between marsupials and placentals, and paying special attention to counterparts in each group.

The program included multiple different algorithms and made the analysis very easy. For that reason, I am very proud to recommend this program to others.

NOTE!!! Be sure to be very careful that everything is aligned properly. When I first made a phylogenetic tree, everything was aligned improperly, and I got bizarre results. My professor, Dr. B, was a very good professor and took the time to show me where I went wrong, and helped me to fix it. If you get bizarre results, chances are, the alignment is off.
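A basic sanity check on an alignment can be run before building a tree. The sketch below is not a MEGA feature: the function name, the 50% gap cutoff, and the demo sequences are assumptions for illustration. It only checks that every aligned sequence has the same length and flags sequences that are mostly gaps, two common signs that something went wrong.

```python
def check_alignment(aligned: dict[str, str],
                    max_gap_fraction: float = 0.5) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    lengths = {name: len(seq) for name, seq in aligned.items()}
    # Every row of a proper alignment has the same number of columns.
    if len(set(lengths.values())) > 1:
        problems.append(f"sequences differ in length: {lengths}")
    for name, seq in aligned.items():
        gap_fraction = seq.count("-") / len(seq) if seq else 1.0
        # Assumption: more than half gaps suggests a badly placed sequence.
        if gap_fraction > max_gap_fraction:
            problems.append(f"{name}: {gap_fraction:.0%} gaps")
    return problems


demo = {"marsupial_gene": "ATG-CCGA", "placental_gene": "ATGACCGA"}  # hypothetical
print(check_alignment(demo))
```

Passing this check does not prove the alignment is right, so it is still worth looking over the aligned columns by eye.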