I. INTRODUCTION

When scientists discover a novel gene or protein, their ability to understand the structure, function, and evolutionary lineage associated with that protein is greatly assisted by the Basic Local Alignment Search Tool (BLAST). First published in 1990 by Altschul et al., BLAST is a computer algorithm that looks for the degree of similarity between an input DNA or protein sequence (the query) and the numerous sequences stored in a database or set of databases. Instead of trying to align the whole query sequence with an entire sequence from the database (a “global alignment”), BLAST looks for matches between portions of the query and portions of a database sequence to build a “local alignment.” (This is analogous to searching the Internet using Google. If one were to type a very long sentence into Google, one would get fewer results than if a few short words were typed in.) This is incredibly advantageous to the study of proteins, as proteins with similar functions generally share similar domains while remaining divergent in sequence when considered as a whole. This module explores the practical basics needed to use this invaluable tool for bioinformatics.
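To make the idea of a local alignment concrete, here is a minimal Python sketch that scores the best-matching portions of two sequences. It uses the classic Smith-Waterman dynamic-programming method, not BLAST's own faster heuristic, and the match, mismatch, and gap values are illustrative assumptions rather than BLAST defaults.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman).

    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A cell never drops below 0, so an alignment may start anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Two sequences that differ overall but share the stretch "GATTACA".
score = local_alignment_score("AAAGATTACATTT", "CCGATTACAGG")
```

Because each cell is floored at zero, the unrelated ends of the two sequences do not drag the score down, which is the property that distinguishes a local alignment from a global one.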

II. ORFS AND READING FRAMES

Before discussing different types of BLAST searches, let’s take stock of how the nature of the genetic code impacts BLAST searching. A gene is composed of codons, sets of three consecutive nucleotides. Each codon encodes one amino acid. A sequence of codons makes up what is called a reading frame. Double-stranded DNA can be read in six possible ways, and each way is unique. Remember, there are two strands, each of which can be read in the 5′ → 3′ direction. On each of these strands, there are three possible places to start reading: the 1st, 2nd, or 3rd nucleotide. Since there are three possibilities on each strand, there is a total of six possible reading frames. For a given gene, only one of these six reading frames will encode a functional polypeptide. This correct open reading frame (ORF) will generally be many codons long and must contain only one stop codon, located at the end of the frame. The other reading frames will typically have multiple stop codons scattered throughout the frame; these are known as “closed reading frames.”
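The six reading frames described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard genetic code and a sequence containing only A, C, G, and T; the function names are made up for this example, and stop codons are shown as "*".

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translations."""
    dna = dna.upper()
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
```

In a longer sequence, the closed frames are the ones whose translations are peppered with "*" characters.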

To define a gene within a DNA sequence, scientists look for the open reading frame (ORF) beginning with a start codon (usually ATG in E. coli, coding for methionine) and ending with one of three stop codons (TGA, TAG, TAA). Finding an open reading frame within a nucleotide sequence and the amino acid sequence encoded by it are often prerequisites for the most common BLAST searches (BLASTp, see below) as well as for locating the promoter and other extragenic sequences outside the ORF important for transcription.
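As a rough sketch of how an ORF search might work, the Python below scans all six reading frames for a stretch that begins with ATG and runs to the first in-frame stop codon. The function name, the `min_codons` cutoff, and the coordinate convention are assumptions made for this example; real gene finders use additional evidence such as promoter and ribosome-binding signals.

```python
START_CODON = "ATG"
STOP_CODONS = {"TGA", "TAG", "TAA"}


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=2):
    """List ORFs as (strand, start, end, sequence) tuples.

    start and end are 0-based, end-exclusive positions on the strand
    that was scanned. min_codons counts the stop codon and is a
    hypothetical cutoff chosen for this example.
    """
    dna = dna.upper()
    orfs = []
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i                      # first ATG opens the frame
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((label, start, i + 3, strand[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs


orfs = find_orfs("CCATGGCCTAAGG")
```

Raising `min_codons` is a simple way to discard the short ORFs that occur by chance in any sequence.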

III. TYPES OF BLAST COMMONLY USED IN BIOLOGY

BLASTn: A nucleotide query is aligned with a nucleotide database. The advantage of this approach is that there are many genome databases available to search; more than 800 organisms have been sequenced since 1995!

A big problem with BLASTn searches stems from the fact that the genetic code is degenerate: most amino acids are encoded by several different codons. The sequences of all genes “drift” during evolution, partly as a result of i) silent (synonymous) mutations, which change a codon without altering the amino acid encoded by that codon, and ii) conservative substitutions, which result in coding for an amino acid that is chemically very similar to the original amino acid, such as leucine for isoleucine. This drift can result in two nucleic acids that are very different in sequence, yet produce identical or highly similar proteins. Unless you’re comparing nucleotide sequences between two organisms that are very closely related, it is hard to identify a related gene in another organism using BLASTn.
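The effect of degeneracy is easy to see with a small Python sketch. The two short DNA sequences below are invented for illustration; they use different codons for the same amino acids, so they agree at fewer than half of their nucleotide positions while encoding the same peptide. The standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def percent_identity(a, b):
    """Percentage of positions that match between equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical sequences: same amino acids, different codon choices.
gene_a = "ATGCTGAGCCGTGGC"
gene_b = "ATGTTATCAAGAGGT"

dna_identity = percent_identity(gene_a, gene_b)
protein_identity = percent_identity(translate(gene_a), translate(gene_b))
```

Comparing `dna_identity` with `protein_identity` shows why a protein-level search can recover a relationship that a nucleotide-level search would miss.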

BLASTp: An amino acid query is aligned with sequences in a protein database. This is the most common BLAST search performed, as it is easier to find similarities between amino acid sequences than between nucleotide sequences subject to drift. One limitation of BLASTp searches is that there are far fewer sequenced proteins available to search than sequenced genomes. Furthermore, if your query sequence is part of a common protein domain found in many proteins of distinct functions, establishing evolutionary lineages using that domain can be very difficult.

BLASTx: A nucleotide query is translated by the computer in all six reading frames, and all six resulting amino acid sequences are aligned with a protein database. This can be very useful if you’re not sure which reading frame in a DNA sequence encodes a gene, and translating the nucleotide query can circumvent gaps in aligning two nucleotide sequences due to sequencing errors. If the query is too large, however, you may retrieve protein search results (also called BLAST “hits”) in multiple reading frames.

tBLASTn: A protein query is aligned with sequences in a DNA database that have been translated in all six reading frames. Essentially the inverse of BLASTx, this approach is great for finding exons (protein coding sequences) within a nucleotide database using the protein query as a guide.

tBLASTx: A nucleotide sequence is translated in all six reading frames and then aligned with a nucleotide database that has also been translated in all six frames. While this is no longer computationally intensive, aligning six queries with all six reading frames of database sequences usually retrieves many hits that have to be carefully evaluated. Finding a significant result may not be helpful if the database in question is poorly annotated (i.e. little is known about the organism and few proteins have been identified).
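The search types in this section differ only in the kind of query, the kind of database, and whether either one is translated first. The Python sketch below summarizes them as a small lookup table using the standard program names; the dictionary layout and the `choose_programs` helper are hypothetical and are not part of any BLAST software.

```python
# Each entry: (query type, database type, what gets translated in six frames)
BLAST_PROGRAMS = {
    "BLASTn":  ("nucleotide", "nucleotide", "nothing"),
    "BLASTp":  ("protein",    "protein",    "nothing"),
    "BLASTx":  ("nucleotide", "protein",    "query"),
    "tBLASTn": ("protein",    "nucleotide", "database"),
    "tBLASTx": ("nucleotide", "nucleotide", "query and database"),
}


def choose_programs(query_type, database_type):
    """Return the program names that accept this query/database pairing."""
    return [
        name
        for name, (query, database, _translated) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    ]


# A nucleotide query against a nucleotide database has two options.
options = choose_programs("nucleotide", "nucleotide")
```

Only the nucleotide-versus-nucleotide pairing offers a choice, between a direct comparison and a comparison of translated sequences.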

IV. PERFORMING A BLAST SEARCH IN ECOCYC

The EcoCyc BLAST program can be directly accessed on EcoCyc using the Search pull-down menu at the top left of any EcoCyc page.

Paste a nucleotide or protein query sequence in FASTA format into the search window.

Select:

  • query type as either a nucleotide or protein query
  • database search as nucleotide or protein database
  • program (button can perform any of the six BLAST searches listed above)

FASTA: each nucleotide or amino acid sequence is preceded by a header line that begins with a greater-than (>) symbol.
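As an illustration of the FASTA layout, the Python sketch below splits FASTA text into header and sequence pairs. The function names and the example record are made up, and the nucleotide-or-protein guess is a rough heuristic of our own, not something EcoCyc is known to do.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                              # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []  # start a new record
        elif header is not None:
            chunks.append(line)                   # sequence may span lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def guess_query_type(sequence):
    """Heuristic assumption: only A, C, G, T, U, or N means nucleotide."""
    if sequence and set(sequence.upper()) <= set("ACGTUN"):
        return "nucleotide"
    return "protein"


# Hypothetical two-record input.
example = ">seq1 demo\nATGC\nGGTA\n>seq2\nMKV\n"
records = parse_fasta(example)
```

Each header line starts a new record, and any sequence lines that follow it are joined into one string.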

Options:

  • Change the organism for which the search is performed in the corner of the page under the Quick Search field (“Searching Escherichia coli K-12 substr. MG1655” is the default).
  • Advanced users can adjust the sensitivity of the BLAST search by changing the “Expectation Value Threshold”; you’ll only need to use the standard of 10 for the Exercises provided in this module.

Paste in your sequence and hit “Search”; you’ll see a report much like this one for a BLASTp search.

V. EVALUATING A BLAST SEARCH

The bit-score, or score, is a direct measure of similarity between the query and a search result retrieved by the BLAST algorithm. The fewer gaps and mismatches between matching regions of the query and BLAST hit, the higher the score and the stronger the alignment. In protein alignments, substitutions of amino acids that are chemically similar to, but not identical with, those in the query do not lower the score as much, as such substitutions are common between related proteins in different organisms.

To have confidence in a search result as good evidence for homology with the query, BLAST programs evaluate each result for statistical significance using the expected value, or E-value. An E-value is generated for each result by calculating the number of alignments with a similar or better score that would be expected to occur by chance in a database of that size. The lower the E-value, the more statistically significant the alignment is; typically 10⁻⁵ or lower is considered significant. A low E-value (close to or equal to zero) implies that there is a low probability that the hit between query and database is due to a random match.
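The relationship between score and E-value can be sketched with the commonly cited approximation E = m × n × 2^(−S′), where S′ is the bit-score, m is the query length, and n is the total database length. The Python below is a simplified illustration: real BLAST programs apply length corrections that are ignored here, and the hit names, scores, and lengths are invented.

```python
def expected_value(bit_score, query_length, database_length):
    """Approximate E-value: search space size times 2 to the minus bit-score.

    Simplifying assumption: raw lengths are used instead of the
    corrected "effective" lengths that BLAST programs compute.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def significant_hits(hits, query_length, database_length, threshold=1e-5):
    """Keep (name, E-value) pairs at or below the threshold, best first."""
    kept = []
    for name, bit_score in hits:
        e_value = expected_value(bit_score, query_length, database_length)
        if e_value <= threshold:
            kept.append((name, e_value))
    return sorted(kept, key=lambda item: item[1])


# Hypothetical hits: (name, bit-score). All values are made up.
hits = [("hit_A", 250.0), ("hit_B", 35.0), ("hit_C", 18.0)]
kept = significant_hits(hits, query_length=300, database_length=1_500_000)
```

Note that the same bit-score gives a larger E-value in a larger database, which is why E-values from searches of different databases are not directly comparable.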

One of the strongest applications for BLAST searches is identifying homologous or similar nucleotide or protein sequences, particularly if the query sequence is novel. With a high score and low E-value, one can place confidence in assigning function(s) to the novel query sequence. The term homolog can refer to either a protein or the gene that encodes it, as long as it has a sequence similarity to some other known sequence. There are distinct terms for the two kinds of homologs you’ll frequently encounter in biology: i) Orthologs, which are homologous sequences found in different organisms, and ii) Paralogs, homologous sequences found within the same organism as a result of a gene duplication event. Paralogs frequently have distinct functions within the organism whereas orthologs tend to have the same function in different organisms. While discovering orthologs can inform you about functions that are conserved evolutionarily between species, it’s often exciting for scientists to elucidate different roles played by paralogs within an organism, in many cases to conduct similar functions under differing growth conditions. Paralogous genes that mutate to the extent that they become nonfunctional are considered pseudogenes; be mindful of these in evaluating BLAST searches!

VI. BLAST SEARCH TUTORIAL (ANSWERS UNDERLINED)