Bioinformatics Concept Objective
• To understand and appreciate the enormous potential of bioinformatics and genomics in
the contemporary life sciences.
• How do we determine which primary source literature is relevant to our study? How do
we understand the reading?
• Develop an awareness of the breadth of bioinformatics resources and applications,
including non-sequence-based biological information.
• Develop a basic understanding of the theoretical foundation and underlying assumptions
of the programs, and their relative strengths/limitations.
Bioinformatics is the application of computers to the life sciences
and in particular to genomics. This makes it possible to study
biology at the genome-wide level. The aim of this lab is to focus
on the major applications, including data storage, retrieval, and
analysis of biological information, rather than on the engineering
of new bioinformatics applications.
“Rocket science is for kids,
Bioinformatics is for scientists”
“While you make PASTA,
I will sequence your gene in FASTA”
Bioinformatics is a scientific discipline that has emerged recently in response to
accelerating demand for an expedient, flexible, and intelligent means of storing, managing and
querying large and complex biological data sets. It is an interdisciplinary science with strong
links between the life sciences and mathematics, statistics, and computer science, and it is
regarded by a wide range of interest groups (governments, universities, and industry) as a crucial
area of development. The ultimate goal of bioinformatics is to allow scientific exploration and
exploitation of previously uncharted interdisciplinary territories. Stated differently, the aim is to
enable the discovery of new biological insights and to create a global perspective, from which
unifying principles in biology can be discerned (Mount 2001).
Bioinformatics is one of the fastest growing interdisciplinary sciences of the late 20th and
early 21st century. For many students, bioinformatics is such a new discipline that they will not
necessarily have the required background knowledge. It is therefore necessary to build bridges
for students with diverse backgrounds. This lab gives a basic introduction to what bioinformatics
involves and provides some examples of how the approaches of student-centered teaching and
active learning techniques can be employed to enhance the learning experience.
To understand the major concepts of bioinformatics, two conceptual frameworks are
used: similarity, which enables analysis of predictions about structure/function; and dissimilarity,
which allows inference of evolutionary history based on distance. These attributes are used as
examples of core modules (the bridges), because all analyses need to be undertaken in
appropriate biological context but involve multiple disciplines, including mathematics, statistics,
computer science, and information technology, that need to be integrated into biology.
In the beginning of the genomics era, bioinformatics was mainly concerned with the
creation and maintenance of databases to store digitized biological information, such as
nucleotide and amino acid sequences. Development of these types of databases involved not only
design issues, but also the development of complex interfaces whereby researchers could both
access existing data, and submit new or revised data (e.g. to the NCBI, http://www.ncbi.nlm.nih.gov/). More recently, emphasis has shifted towards the questions of how to
analyze large data sets, particularly those stored in different formats in different databases.
Ultimately, however, integration is needed (e.g. Chicurel 2002) in order to form a comprehensive
picture of normal cellular and sub-cellular activities, so that researchers may study how these
activities are globally regulated. The actual process of analyzing and interpreting digitized
biological data is often referred to as computational biology. It is commonly recognized that subdisciplines within bioinformatics and computational biology include: (i) the development and
implementation of tools that enable efficient access, management and use of various types of
digitized biological information; and (ii) the development of new mathematical theorems,
statistical methods and algorithms to infer relationships among members of large data sets, locate
genes within nucleotide sequence, and predict protein structure and/or function.
Annotating – The process of identifying the protein coding sequences and other biological
features within genomic DNA sequences and adding such information to the sequence.
Assembly – Aligning and merging shorter fragments of a much longer DNA sequence in order to
reconstruct the original sequence. Assembly is usually needed to generate a significant portion of
a genomic DNA sequence, because chromatogram-based (Sanger) sequencing only reads fragments of
about 600–1000 base pairs with high fidelity.
Base call – Reading a DNA sequencing chromatogram and assigning a base to each peak.
Bioinformatics – Research, development, or application of computational tools and approaches
for expanding the use of biological, medical, behavioral or health data, including those to
acquire, store, organize, archive, analyze, or visualize such data (NIH working definition).
BLAST – Basic Local Alignment Search Tool – a suite of computer programs that are used to
compare DNA and protein sequences to those in sequence databases to search for similarities.
Chromatogram – A visual representation of the signal peaks detected by a sequencing
instrument. The chromatogram contains information on the signal intensity as well as the peak
Consensus sequence – A sequence that has been constructed from the comparison of multiple
sequences. The result represents the best guess of what the base calls (or amino acids in the case
of protein alignments) should be at each location.
Contig – A sequence that has been constructed by comparing and merging the information from
sets of overlapping DNA segments.
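As a rough illustration of how overlapping reads are merged into a contig, the sketch below joins two reads at their longest exact suffix/prefix overlap. The function name `merge_reads`, the `min_overlap` parameter, and the example reads are invented for illustration; real assembly programs also handle sequencing errors, reverse-complement reads, and quality scores.

```python
def merge_reads(left, right, min_overlap=3):
    """Merge two reads if a suffix of `left` exactly matches a prefix of `right`.

    Returns the merged sequence, or None when no overlap of at least
    `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first so the best match wins.
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# Two made-up reads that share the six bases "TACGTT".
read_a = "ATGGCGTACGTT"
read_b = "TACGTTAGGCTA"
print(merge_reads(read_a, read_b))
```

The merged string plays the role of a (very small) contig: each base in the overlap is kept once.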
Depth of coverage – Multiple reads of the same sequence. Two methods for obtaining multiple
reads are: 1) using different primers to sequence the same clone of a gene, or 2) sequencing
unique clones of the same gene.
Discrepancies – Differences in base calls between two or more different sequences of the same
clone or between different clones of the same gene.
DNA sequencing – Determining the exact order of nucleotides in a DNA molecule.
Exon – Eukaryotic gene segment that is transcribed to RNA, retained after RNA processing, and
will be (with other exons) part of the mRNA that is translated to protein. Exon can refer to either
the DNA sequence or the RNA transcript. Exons are separated in DNA and in the primary RNA
transcript by introns. Exons are also known as the protein coding sequences of genes and introns
as the noncoding regions.
FASTA format – A format used for submitting sequence data (bases or amino acids) to
alignment programs. Each record begins with a single description line that starts with the greater than (>)
symbol, with no space between the symbol and the first word, and ends with a line break. The sequence
follows in single letter codes, without spaces or blank lines within the sequence.
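To make the FASTA layout concrete, here is a minimal sketch of a reader that splits FASTA text into description/sequence pairs. The function name `parse_fasta` and the two example records are made up for illustration and are not real database entries.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (description, sequence) pairs."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:]
            chunks = []
        elif description is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Invented records, used only to show the layout.
example = """>toy1 first made-up record
ACDEFGHIKL
MNPQRSTVWY
>toy2 second made-up record
ACDEFGHIKL
"""

for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```

Sequence lines that are wrapped over several lines are joined back into one string per record.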
Finishing – A process in which researchers examine the contigs to look for misassemblies or
regions that require additional coverage.
GenBank – The sequence database maintained by NIH. As of February 2008, GenBank
contained 85,759,586,764 bases in 82,853,685 sequence records.
Genome – The total genetic material of an organism.
Genomic DNA (gDNA) – All of the chromosomal DNA found in a cell or organism.
Homologous – Genes that are similar because they share a common ancestor.
Homology (of DNA or proteins) – Regions of protein or DNA that have a high level of
sequence similarity due to shared ancestry. However sequence similarity does not necessarily
indicate homology, especially if the similar sequences are short.
Indel – A sequence discrepancy due to either an inserted or a deleted base.
Paralogous – Homologous genes found within the same genome, typically arising from gene duplication.
Orthologous – Homologous genes found in different species, typically arising from a speciation event.
Quality score (or value) – A numerical value indicating the confidence level for base calls. A
higher quality value means higher confidence that the base is correct. A lower quality value
suggests that the base call has a lower chance of being reliable and thus accepted.
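The glossary does not say which scale the quality values use. As an assumption, the sketch below uses the widely used Phred convention, in which a quality value Q corresponds to an estimated error probability P through Q = -10 log10(P). The function names are invented for illustration.

```python
import math


def error_probability(quality):
    """Estimated probability that a base call is wrong, assuming the Phred scale."""
    return 10 ** (-quality / 10)


def phred_quality(p_error):
    """Phred-scale quality value for a given error probability."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30):
    print(q, error_probability(q))
```

On this scale every additional 10 points means a tenfold lower chance that the base call is wrong.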
Query – In terms of Geneious and relational databases, this is a program written in SQL that is
used to extract information from a sequence database.
Query sequence – The input sequence (or other type of search term) with which all of the entries
in a database are to be compared (http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html).
Read – Sequences of bases that contain information about the parent chromatogram. As long as
the base sequence is linked to the chromatogram it can be considered a read.
Reference Sequence Database – The Reference Sequence (RefSeq) collection provides a
comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic
DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and
Relational database – A database consisting of multiple tables of information, based on a model
of the data and the relationships between different types of data. (For example, a DNA sequence
that is related to and linked to a chromatogram and also to information about the sample).
Sequence – The ordered list of bases that make up a DNA strand. When linked with a
chromatogram this would be considered a read.
SQL (structured query language) – Programming language used to extract relationships
between different data sets in a relational database. For Geneious, a program written in SQL that
does this is called a query.
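The idea of a query against a relational database can be sketched with Python's built-in `sqlite3` module. The two tables, their columns, and the rows below are hypothetical and are not the schema used by Geneious; they only show how SQL links a sequence to information about its sample.

```python
import sqlite3


def build_example_db():
    """Create a small in-memory database with two related tables (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE samples (sample_id INTEGER PRIMARY KEY, organism TEXT);
        CREATE TABLE reads (
            read_id INTEGER PRIMARY KEY,
            sample_id INTEGER REFERENCES samples(sample_id),
            bases TEXT
        );
        INSERT INTO samples VALUES (1, 'Arabidopsis thaliana'), (2, 'Homo sapiens');
        INSERT INTO reads VALUES (1, 1, 'ATGGCG'), (2, 1, 'TTAGGC'), (3, 2, 'GGCATA');
    """)
    return conn


def reads_for_organism(conn, organism):
    """Run a SQL query that joins reads to the sample they came from."""
    query = """
        SELECT reads.read_id, reads.bases
        FROM reads
        JOIN samples ON reads.sample_id = samples.sample_id
        WHERE samples.organism = ?
        ORDER BY reads.read_id
    """
    return conn.execute(query, (organism,)).fetchall()


conn = build_example_db()
print(reads_for_organism(conn, "Arabidopsis thaliana"))
```

The `JOIN` is what makes the database "relational": the reads table stores only a sample number, and the query follows that number to the matching row in the samples table.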
Subject sequence – A sequence found by BLAST to have similarity to a sequence entered by the
user (the query sequence).
Bioinformatics ACTIVITY PAGES
Main Activities: Bioinformatics
Bioinformatics Exercise I:
1. Using the following website, what are five different kinds of databases available:
2. Using the website: http://www.ncbi.nlm.nih.gov
a. Select gene in the search bar
b. Type in BRCA1 (a human gene in which mutations are associated with breast
cancer) and hit search
c. Select the first option – BRCA1 in humans
3. What is the description of the BRCA1 gene?
4. What is the location of the gene?
5. How many exons are present in the gene?
6. Select nucleotide in the search bar
a. Type in BRCA1 and hit search
b. Select the eighth option – BRCA1 in humans
7. How long is the polyA signal? (Click Regulatory for the PolyA signal sequence, the
signal will be highlighted)
8. What is the total size of the gene?
Exercise II: Primer Designing Activity
1. Following Exercise I, copy the sequence given to you by the NCBI, and go back to the
Bioinformatics Software and Tools homepage.
a. From the heading “BI Tools” choose “Primer Designing” and choose the website
b. Paste sequence into window available on Primer3.
c. Click button labeled “Pick Primers”, from this new page copy the summary of the
primers and paste it below:
i. Left Primer:
ii. Right Primer:
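Once you have a left and a right primer, a few of their basic properties can be checked by hand or with a short script. The sketch below computes the reverse complement, the GC content, and a rough melting temperature using the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C). This rule is an assumption chosen for simplicity; it is only a crude estimate for short primers and is not the calculation Primer3 performs. The function names and the example primer are invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_percent(seq):
    """Percentage of bases that are G or C."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature in degrees C (Wallace rule, short primers only)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


primer = "ATGCATGCATGCATGCATGC"  # made-up 20-base primer
print(reverse_complement(primer), gc_percent(primer), wallace_tm(primer))
```

The right primer reported by a primer design tool is normally written as the reverse complement of the template strand, which is why `reverse_complement` is useful when locating it in your sequence.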
Bioinformatics Exercise III:
• For this exercise you will be using The Arabidopsis Information Resource (TAIR).
“The Arabidopsis Information Resource (TAIR) maintains a database of genetic and
molecular biology data for the model higher plant Arabidopsis thaliana”.
• Locate the search bar, type in the gene you will be using, NPR1, make sure Gene is
selected, and then search.
• Choose the hyperlink underneath the heading Gene Model. (Choose the first link)
• From this page answer the following questions:
1. Describe the NPR1 Gene.
2. Where is the NPR1 gene located?
3. Scroll down to sequence and select the full length cDNA button
a. Copy the sequence
b. Go back to the bioinformatics website
c. Select BI tools in the heading,
d. Select restriction analysis
e. Choose restriction mapper website
f. Choose restriction enzyme -> virtual restrict
g. Paste sequence
h. Perform single restriction digest
i. Paste the results here
a. A single restriction analysis of your choice
b. Double restriction analysis of your choice
c. Triple restriction analysis of your choice.
5. Perform a virtual digest with all the enzymes and report the three enzymes that produce the
most restriction sites.
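A virtual digest amounts to searching the sequence for each enzyme's recognition site. The sketch below does this for three common enzymes (EcoRI, BamHI, HindIII) and ranks them by number of sites. The function names and the test sequence are made up; the positions reported are where each recognition site starts, not the exact cut position, and a web tool will typically screen many more enzymes.

```python
# Recognition sites of three commonly used enzymes (all palindromic,
# so scanning one strand is enough).
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(sequence, site):
    """Return 1-based start positions of every occurrence of `site`."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(site, start + 1)
    return positions


def rank_enzymes(sequence, enzymes=ENZYMES):
    """Return (enzyme, number_of_sites) pairs, most sites first."""
    counts = {name: len(find_sites(sequence, site)) for name, site in enzymes.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


toy_sequence = "AAGAATTCTTGGATCCAAGAATTC"  # invented sequence
print(rank_enzymes(toy_sequence))
```

An enzyme with n sites cuts a linear molecule into n + 1 fragments, which is what you would see as bands on a gel.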
Bioinformatics Exercise IV: Sequence Alignment
1. Go to http://www.proteinstructures.com/Sequence/Sequence/sequence-alignment.html
2. What is sequence alignment?
3. Go to http://www.arabidopsis.org
4. Type WRKY70 into the search bar
5. Open the first link – WRKY70
6. What is the function of the WRKY70 gene in Arabidopsis?
7. Scroll down to “sequence” and click the full length genomic option
8. Open a new tab, go to http://www.arabidopsis.org, search for PR1
9. What is the function of the PR1 gene in Arabidopsis?
10. Scroll down to sequence and click the full length genomic option
11. Open a new tab, go to
12. Copy and paste the WRKY70 sequence into the first open box on the EMBOSS matcher page
13. Copy and paste the PR1 sequence in the second open box on the EMBOSS matcher page
14. Click submit
15. What is the percent similarity between the two genes?
16. Copy and paste matching sequences found at the bottom of the page
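The percentage reported by an alignment program can be checked by hand from the aligned sequences. The sketch below counts identical positions in two already-aligned sequences and divides by the alignment length, with gaps shown as "-". The function name and the example alignment are invented; alignment tools may also report a separate "similarity" figure based on a scoring matrix, which this sketch does not compute.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns where both sequences have the same residue."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and the same length")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# Made-up alignment: 4 identical columns out of 6.
print(round(percent_identity("ATG-CA", "ATGGCT"), 1))
```

Because a gap column never counts as a match, inserting gaps to line up other regions lowers the percentage unless it creates more matches elsewhere.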
Exercise V – Open Reading Frames
1. Go to http://ghr.nlm.nih.gov/glossary=openreadingframe
2. What is an open reading frame?
3. Open a new tab and go to http://www.ncbi.nlm.nih.gov/gorf/
4. Copy and paste the PR1 sequence into the open box on the ORF finder page
5. Click OrfFind button
6. How many open reading frames are in this sequence?
7. What is the length of each open reading frame?
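To see what an ORF search does, the sketch below scans the three forward reading frames for stretches that begin with ATG and end at the first in-frame stop codon. It is a simplified illustration with invented names: it ignores the reverse strand, which ORF-finding tools normally scan as well, and it counts the stop codon in the reported length.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (frame, start, end, length) for ORFs on the forward strand.

    Positions are 1-based; length is in bases and includes the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if length // 3 >= min_codons:
                    orfs.append((frame + 1, start + 1, i + 3, length))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("ATGAAATGA"))  # made-up sequence: ATG AAA TGA
```

An ATG that is never followed by an in-frame stop codon is not reported, because the reading frame is not closed within the sequence.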
Exercise VI – Proteomics
In this exercise you will compare amino acid sequences of protein from different organisms to
study their evolutionary relatedness. Determine the evolutionary relatedness of species through
comparisons of amino acid sequences of α-hemoglobin.
A bat looks much like a rodent until it flies, at which point it looks much like a bird. So what are
bats more closely related to: birds or mammals? You will be able to discover for yourself the
answer to this question by exploring the protein databases available to you. Since hemoglobin
amino acid sequences have been studied extensively in a wide range of species, these proteins
make a good candidate for comparing evolutionary relatedness. There are more sequences
available for the alpha chain than there are for the beta chain, but you will be able to use either in
1. Decide whether you want to do your work with α-hemoglobin or β-hemoglobin. You may want to
collaborate with a partner and do companion searches, one searching with α-hemoglobin and the other with β-hemoglobin.
If this is the case, you will want to search for the same species of mammals, bats, and
birds, and at the end of the exercise, you can compare your results with each other to
determine whether your different proteins showed the same evolutionary relationships.
2. Go to the web site: uniprot.org
3. Click Swiss-Prot
4. In “Enter search key word” type “alpha hemoglobin” and click “submit”. The results of
this search will come up on your screen. How many protein sequences were reported to
you from this query?
5. Go back to the “Enter search key word” and type “bat alpha hemoglobin” and click
“submit”. When you get the results of this search, how many sequences of alpha
hemoglobin did you get for bat species?
6. *NOTE: Check the species names and common names for each of the α-hemoglobins that
came in this sequence report to make sure that they are, in fact, bat sequences. Sometimes
a search won’t recognize the difference, for example, between a “bat” and some other
word, such as “wombat”!
7. Select a bat α-hemoglobin sequence to save to a flash drive by clicking on the color-highlighted and underlined accession code for that protein sequence. An accession code is
how protein sequences are identified and archived in databases. In the case of α-
hemoglobin sequences, this accession code will start with the letters “HBA”. The
symbols for all α-hemoglobins will begin with “HBA”.
8. The page that opens will contain information about the sequence such as the taxonomy of
the organism that it came from. Near the end of the page you will see a title called
Sequence with the protein sequence written with single-letter designations of the amino
acids. Click the “FASTA” download link. This is the best way to save sequence
information on your flash drive, because it is a sequence format that most sequence analysis
programs can understand. Click the link and this will bring up a page with the sequence
9. Copy the amino acid sequence to a flash drive. To do this,
a. Highlight the amino acid sequence,
b. Copy it, and paste it into a Word file.
c. Save it to your flash drive.
*NOTE: It is OK to copy your FASTA-formatted sequence into a Word document and save it
on the desktop if you don’t have a flash drive.
>sp|Q7M2Y4|HBA_CHAMP Hemoglobin subunit alpha OS=Chalinolobus morio GN=HBA
>sp|P11757|HBA_MYOVE Hemoglobin subunit alpha OS=Myotis velifer GN=HBA PE=1
10. Return to the web page with the list of bat alpha hemoglobin sequences (1 “back” click
on the web pages will get you there), identify another sequence for a bat α-hemoglobin
and repeat the process of highlighting the FASTA formatted amino acid sequence to your
Word file. Save all your FASTA formatted α-hemoglobin sequences together in one file
on your flash drive.
11. When you have saved two α-hemoglobin sequences from two bat species, repeat steps
3-9 to get 2 sequences from bird species and 2 sequences from mammalian species. It
doesn’t matter which species you choose, as long as 2 are birds and 2 are mammals. You
may want to choose species that you predict are related to bats.
IMPORTANT: Be aware that if you limit your search for bird α-hemoglobin sequences
with the keyword “bird”, the search will only locate protein entries where the name “bird”
appears. If an entry was archived under another description such as “hawk”, “eagle”, or
“penguin”, you will only find it by searching under that name.
12. When you have saved six α-hemoglobin sequences to your flash drive (two from bats,
two from birds, and two from mammals), go to http://clustalw.genome.ad.jp.
CLUSTALW is a computer program that you can use to search for sequence similarities
between many sequences at a time and display regions of alignment.
13. Copy your entire file of sequences saved on your flash drive into the textbox and click
“Submit”. Note that the sequence descriptions preceded by the “>” mark will be copied
in with the protein sequences. This will not be a problem with your search. Without
changing any of the default settings on your search, click on the blue colored “Execute
Multiple Alignment” bar.
14. The page that will come up next will show the alignment of amino acid sequences for the
6 proteins that you have retrieved from the SWISSPROT database, using the single-letter
designations for amino acids. An asterisk will appear along the bottom row of amino acid
alignment at positions where there is an amino acid that is found in all 6 proteins. These
amino acids are said to be “highly conserved”, since they haven’t changed since these
species diverged from a common ancestor.
a. How many of the amino acids are found to be the same in all of the 6 α-
hemoglobin sequences in your alignment?
b. What percentage of all the α-hemoglobin amino acids are conserved in all 6
proteins? (You will have to count the number of conserved amino acids by hand.)
c. Examine the regions of conserved amino acid sequences. Are there any specific
regions of the α-hemoglobin sequences that are especially conserved? Is one end
of the molecule more conserved than the other? Describe your observations.
d. Do you see any amino acids that appear more frequently in conserved regions of
the protein than in the nonconserved regions? If so, which amino acids are they?
(Go to the amino acid table at the end of this Lab Exercise to decode the single-letter
designation for amino acids.)
e. If you did find amino acids that were more frequently conserved in your
alignment report, were they ones with side groups that were nonpolar, polar, or
15. At the top of your CLUSTALW report, you will find the exact percentages of amino acids
in the sequence alignment that are identical when comparing only two sequences at a
time. For example, if your report says “Sequences (1:2) Aligned. Score: 87.2”, this means
that when the first two sequences saved in your file were aligned, 87.2% of the amino
acids were identical in both sequences. Transfer these percentages into a table format, in
which the species whose sequences you have aligned are headers for both the columns
and the rows. Your table should look similar to this:
Table 1: Percent identity in amino acid alignment for α-hemoglobins
Notice that you need not fill out both halves of this table since the information is redundant.
From this table, can you see whether the α-hemoglobin sequences are more similar for bats and
birds, compared with bats and mammals? What does this suggest about the evolutionary
relatedness of these species? Which species diverged from each other the most recently and have
the most recent common ancestor? Which species diverged from each other the most long ago
and have the most ancient common ancestor? From the information in this table you should be able to
predict whether bats are more closely related to birds or to mammals.
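The counting in steps 14 and 15 can also be sketched in a few lines of code. The example below counts fully conserved columns (the positions CLUSTALW marks with an asterisk) and fills a pairwise percent identity table from a multiple alignment. The function names and the three short aligned sequences are invented, and columns containing a gap are skipped when comparing a pair, which is an assumption; the denominator CLUSTALW uses for its scores may differ.

```python
from itertools import combinations


def conserved_columns(alignment):
    """Count columns where every sequence has the same residue (no gaps)."""
    seqs = list(alignment.values())
    return sum(
        1 for column in zip(*seqs)
        if column[0] != "-" and all(res == column[0] for res in column)
    )


def pairwise_identity(alignment):
    """Return {(name_a, name_b): percent identity} for every pair of sequences."""
    table = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        # Only compare columns where neither sequence has a gap.
        compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
        same = sum(1 for a, b in compared if a == b)
        table[(name_a, name_b)] = 100.0 * same / len(compared)
    return table


# Invented toy alignment, not real hemoglobin data.
alignment = {
    "a": "MKTAY-LV",
    "b": "MKSAYQLV",
    "c": "MRTAY-LI",
}

print(conserved_columns(alignment))
for pair, identity in pairwise_identity(alignment).items():
    print(pair, round(identity, 1))
```

Only one half of the table is produced, for the same reason given above: the identity of sequence 1 against 2 is the same as 2 against 1.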
16. A phylogenetic tree can present the relatedness of species from sequence similarity data
such as your Table 1. These trees link species that are more closely related in “branches”,
and the length of the branches represents their evolutionary distance. You can draw a phylogenetic
tree from your amino acid alignment report by pairing the species that have the most
sequence similarity on short branches. Species that have less
sequence similarity will branch from each other farther apart on the tree.
The CLUSTALW page on which your report appears will automatically draw a
phylogenetic tree for you. At the bottom of the page, click the drop-down arrow to
select a tree option. Any option will do. Take a screenshot of this tree and save it for your report. Does
this tree agree with your analysis above of the “Percent identity in amino acid alignment for
α-hemoglobins” table? Explain.
17. One way to test the validity of the phylogenetic tree that you drew for bats, birds, and
mammals is to compare it with trees constructed from sequences of other proteins.
18. Repeat the comparisons that you made (steps 1-16 above) with other species such as the following:
a. Compare whales to mammals and fish.
b. Compare reptiles to birds and mammals
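The hand-drawn tree in step 16 follows a simple rule: repeatedly join the two most similar groups. The sketch below applies that rule with average-linkage clustering (UPGMA) to a table of distances, taking distance as 100 minus percent identity. Both the choice of UPGMA and that distance definition are assumptions made for illustration and are not necessarily what the CLUSTALW page does; the species labels and identity values are invented.

```python
def upgma(names, distance):
    """Cluster `names` by average linkage (UPGMA).

    `distance` maps frozenset({a, b}) to the distance between leaves a and b.
    Returns the tree topology as nested tuples.
    """
    # Each cluster is (subtree, list of leaf names it contains).
    clusters = [(name, [name]) for name in names]

    def average(members_a, members_b):
        total = sum(distance[frozenset((a, b))] for a in members_a for b in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters.pop(j)  # remove j first (j > i) so index i stays valid
        clusters.pop(i)
        clusters.append(merged)
    return clusters[0][0]


# Invented percent identities between four hypothetical species.
identity = {
    ("sp1", "sp2"): 90, ("sp1", "sp3"): 60, ("sp1", "sp4"): 62,
    ("sp2", "sp3"): 61, ("sp2", "sp4"): 63, ("sp3", "sp4"): 85,
}
distance = {frozenset(pair): 100 - value for pair, value in identity.items()}

print(upgma(["sp1", "sp2", "sp3", "sp4"], distance))
```

The nested tuples show only which species are grouped together; branch lengths are not computed in this sketch.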
Exercise VII – Bringing it all together with a case study
Pretend that you are an OBGYN at Penn State Hershey Medical Center. A patient named Mary
comes to see you. Mary is worried because her mother died of breast cancer at age 40. Mary
wants to know if she should have a double mastectomy (breast removal) in order to prevent
breast cancer. You tell Mary that she should have genetic tests done before having a double
mastectomy.
1. Go to http://www.ncbi.nlm.nih.gov/, search first for BRCA1 and then for BRCA2
2. Briefly describe the BRCA1 and BRCA2 genes. How can these genes contribute to breast
cancer?
You decide to have a breast biopsy done on Mary. A small tissue sample is taken from Mary and
tested. You take the tissue sample and go down to the lab. You decide to run a PCR and a gel
electrophoresis separately on the sample.
3. How can PCR be used to determine if Mary has breast cancer?
4. How can gel electrophoresis be used to determine if Mary has breast cancer?
After you run a PCR and a gel, you determine that Mary has a mutated form of BRCA1. Now
you are very interested in the BRCA1 gene. You decide to run a sequence alignment of the
normal BRCA1 gene against the mutated form of the BRCA1 gene. You find that the mutated BRCA1
gene has a series of tandem repeats that make the gene nonfunctional.
5. Go to http://ghr.nlm.nih.gov/glossary=repeatsequences
6. What is a tandem repeat?
7. Why would a tandem repeat make a gene nonfunctional?
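If you want to see what a tandem repeat looks like in sequence data, the sketch below searches a DNA string for a short unit repeated back-to-back several times. The function name, its parameters, and the example sequence are hypothetical; the sequence is not taken from BRCA1.

```python
import re


def find_tandem_repeats(sequence, unit_length=3, min_copies=3):
    """Return (start, unit, copies) for each run of a unit repeated in tandem.

    `start` is the 1-based position where the run begins.
    """
    sequence = sequence.upper()
    # A unit of `unit_length` bases followed by at least (min_copies - 1) exact copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_length, min_copies - 1))
    return [
        (m.start() + 1, m.group(1), len(m.group(0)) // unit_length)
        for m in pattern.finditer(sequence)
    ]


print(find_tandem_repeats("TTCAGCAGCAGCAGTT"))  # made-up sequence with CAG x 4
```

Changing `unit_length` lets you look for repeats of other sizes, for example 2 for dinucleotide repeats.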