I. INTRODUCTION

EcoCyc is a bioinformatics database providing comprehensive information on the model bacterium E. coli K-12, from all genes, proteins, and other molecular components to bacterial metabolism and the network of regulatory mechanisms that control gene expression and protein function. The entire genome is available for viewing, and each known and hypothetical gene is annotated based on information compiled from over 16,000 publications on this organism. EcoCyc is updated frequently and has been integrated with similar genomics and proteomics databases from a wide variety of other bacterial species. The extensive linking between organism databases facilitates comparative studies of these organisms. Another key feature is the ability to visualize experimental data on global gene expression (genomics) or global metabolic response (metabolomics) by superimposing (also known as “painting”) that data onto the full E. coli genome, which is well suited to scientists analyzing hundreds to thousands of individual results. EcoCyc is one of over 1000 biological databases that together form a collection known as BioCyc (http://BioCyc.org), and is intensively managed by SRI International, based in Menlo Park, CA.

II. ECOCYC PAGES

One of EcoCyc’s strengths lies in the fact that a massive amount of information is organized into clear categories: genes are found on “Gene pages”, compounds are found on “Compound pages”, etc. For most categories, we provide a sample page illustrating key features, following the “Guided Tour” of EcoCyc provided at http://ecocyc.org/samples.shtml. For real-time demonstrations of how to use these pages, the video webinars located at http://ecocyc.org/webinar.shtml are extremely helpful.

1. Gene product/RNA/Polypeptide/Protein pages: Some of the most detailed pages in EcoCyc are for the products of genes, whether they are proteins or non-translated RNAs (the page “type” is indicated in bold at the top of the page). The name of the gene encoding the product is always given at the top of the page, as are buttons linking to the gene’s DNA sequence (no introns) and the amino acid sequence of the encoded protein. The gene’s exact location on the chromosome in nucleotides is stated further down the page, and next to it is a link to “Genome Browser” where you can view the gene’s local context in the bacterial genome (i.e. you can view what other genes/sequences are near it on the chromosome).

For every protein that has been experimentally studied, a mini-review summary at the top of the page provides essential details on the protein’s function, structure, and regulation at all levels. Each protein page also has a useful “Genetic Regulation Schematic”, which gives information concerning regulation at each level of gene expression. Clicking on the “?” icons next to the schematics brings up information on how to interpret them.

In cases where the protein binds a cofactor (and is not an enzyme by itself), or when the protein is one subunit of a functional complex, the page for the subunit alone is titled “Polypeptide:”, and the page for the complex is titled “Protein:”; one example of the latter is the TyrR repressor complex (see Sample Page #1.ppt). Pages describing enzymes have titles of “Enzyme:” rather than “Protein:”, and contain additional information about the metabolic pathways the enzyme takes part in and the known effector molecules and cofactors that regulate the enzyme’s activity. An example is the TyrB aminotransferase (see Sample Page #2.ppt).

2. Transcription Unit pages: Transcription units include a gene (or multiple genes within an operon) and all the regulatory DNA binding sites that control that gene’s transcription, including promoters, terminators, and binding sites for activator and repressor proteins. In addition to the location and sequences of these regulatory sites, information about how strongly they control transcription is often provided in the summaries. An example with the gene tyrB is provided here (see Sample Page #3.ppt).

3. Reaction pages: These pages show individual reactions of three different kinds: i. metabolic reactions, with links to the anabolic/catabolic pathways in which they occur, ii. substrate transport reactions, and iii. allosteric effector binding to proteins, both enzymes and DNA-binding proteins (regardless of whether the effector activates or inhibits the protein). For each kind of reaction, associated enzymes are provided, and reactions are shown either as linear equations or graphically, which is very useful for visualizing substrate transport and electron transport at the cytoplasmic membrane. One example of a metabolic reaction page details the transamination catalyzed by TyrB used in leucine biosynthesis (see Sample Page #4.ppt); an example of a substrate transport reaction page describes the transport of L-leucine (see Sample Page #5.ppt).

4. Pathway pages: A pathway page depicts an entire metabolic pathway from the initial substrate to the final product, as well as branch points that cross with other metabolic pathways. You can adjust the amount of detail shown in the pathway (click the “More Detail” or “Less Detail” buttons directly under the title of the page) to display the structures of each intermediate as well as the enzymes catalyzing each reaction in the pathway. This is useful when you want to follow how the carbon skeleton of a molecule changes throughout a metabolic pathway. Post-translational regulation of enzymes (feedback inhibition, covalent modification, etc.) is also shown, along with the locations of genes encoding each enzyme on the bacterial genome. Take a look at this sample pathway page for leucine biosynthesis (see Sample Page #6.ppt).

5. Transporter pages: Transport systems are mechanisms crucial for acquiring nutrients and macromolecule precursors from the external environment, so each known E. coli protein transport system has its own page. Features include a brief summary of the transporter’s function (e.g. the transport reaction carried out) and subunit composition (most transporters are made up of several protein subunits). This example page describes the ABC transporter that imports leucine into E. coli (see Sample Page #7.ppt).

6. Compound pages: A compound page gives structural information and lists any pathways in which the compound appears, both as a reactant and as a product. These pages can be useful in determining whether the substance acts as an enzyme cofactor or effector molecule. The example we present here is the page for L-leucine (see Sample Page #8.ppt).

7. Some general features of EcoCyc pages: Passing the mouse over most items in any EcoCyc page brings up a preview of further information that can be examined if the object is clicked on. For example, go to the Polypeptide: LeuC page (http://BioCyc.org/ECOLI/NEW-IMAGE?type=ENZYME&object=LEUC-MONOMER), and find the “Gene-Reaction Schematic” diagram. Holding the cursor over the purple box labeled “leuD” brings up a text box with three bolded terms: Gene, giving various names of the gene according to nomenclature determined by databases or consensus; Location, listing the exact nucleotide positions between which the gene lies in the organism’s genome; and Product, displaying a short summary of what that gene encodes. Clicking on the box leads to the Polypeptide: LeuD page for more detail about this protein.
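The LeuC link above appears to follow a simple pattern: an organism identifier in the path, plus "type" and "object" query parameters. Below is a minimal Python sketch of assembling such a link. The function name object_page_url is hypothetical, and only the LeuC URL itself comes from this guide, so treat the pattern as an assumption that may not hold for every page type or site release.

```python
from urllib.parse import urlencode

BASE = "http://BioCyc.org"  # site root as written in this guide


def object_page_url(organism, page_type, object_id):
    """Build a page link in the same shape as the LeuC example.

    Hypothetical helper: the URL pattern is inferred from one example only.
    """
    query = urlencode({"type": page_type, "object": object_id})
    return f"{BASE}/{organism}/NEW-IMAGE?{query}"


# Reproduces the LeuC link quoted in the text.
print(object_page_url("ECOLI", "ENZYME", "LEUC-MONOMER"))
```

Building the query string with urlencode rather than by hand keeps the link valid if an identifier ever contains characters that need escaping.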

The end of each EcoCyc page provides a list of references to journal articles dealing directly with the subject of the EcoCyc page. Article references are always listed by first author and the year in which the article was published (e.g. Fultz81 = a publication by P.N. Fultz et al. in 1981), followed by the link to the paper. Reading article abstracts will often give key insights needed to answer questions posed in the various exercises attached to Microbe Scholar modules.
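The reference label convention described above (first author followed by a two-digit year) can be split apart mechanically. The sketch below assumes every label is a run of letters followed by exactly two digits, as in Fultz81; the function name is hypothetical, and real labels may contain variations that this pattern does not cover.

```python
import re

# Assumed label shape: author surname (letters only) + two-digit year.
LABEL_PATTERN = re.compile(r"^([A-Za-z]+)(\d{2})$")


def parse_reference_label(label):
    """Return (first_author, two_digit_year) for a label such as 'Fultz81'."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"label does not fit the assumed pattern: {label!r}")
    return match.group(1), match.group(2)


author, year = parse_reference_label("Fultz81")
print(author, year)
```

The label carries no century, so the year is kept as the two-digit string rather than being converted to a full year.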

Setting up an online account with EcoCyc is easy and offers multiple benefits, including the ability to customize how EcoCyc pages are viewed, getting email updates on new features within the latest EcoCyc release, and saving your settings on the Omics and Comparative Analysis tools for ease of use. You can create a new account here: http://biocyc.org/preferences.html?status=new.

III. SEARCHING WITHIN ECOCYC

1. Simple Search Features (in the upper right hand corner of every page)

A. Quick Search: This feature is likely the one you’ll use the most often to locate EcoCyc database objects/pages. Just type in a whole or partial name of any gene, protein, pathway, metabolite, etc. in the field provided and hit “Quick Search.” If the search term is within the title of only one type of EcoCyc page, that one page will be retrieved (often just taking you to that page automatically). If the search term is found in multiple pages, a list of these pages will be retrieved, organized by page type.

If you want to find a page using multiple terms, simply type them all in the same field (e.g. peptidoglycan biosynthesis); note that quotation marks don’t make any difference. Putting a comma between these terms (e.g. peptidoglycan, biosynthesis) first retrieves a list of all pages containing the first term, then a list of all pages containing the second, and so on.

You can further specify Quick Searches in two ways:

  • Enter a term followed by search:exact to only retrieve a page with that exact title. For example, entering trpA would normally retrieve the associated “Gene/Protein” page and “Transcription Unit” pages. Entering trpA search:exact will only take you to the Protein page for trpA.

  • The qualifier type:<page type> will only retrieve pages of a certain type with the term you enter in the title. For example, searching for ATP will retrieve many pages for genes called atp, ATP-using enzymes, ATP-generating pathways, etc. But searching ATP type:compound will just pull up Compound pages with ATP in the title. The <page type> qualifiers you can use are gene, protein, enzyme, rna, go-terms, compound, reaction, operon, pathway, and organism.
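To make the Quick Search syntax concrete, the following Python sketch breaks a query into its parts in the way the text describes: commas separate independent searches, search:exact requests an exact title match, and type: restricts the page type. It is only an illustration of the syntax, not EcoCyc's own parser; the function name and the returned dictionary layout are hypothetical.

```python
# Page-type qualifiers listed in this guide.
PAGE_TYPES = {
    "gene", "protein", "enzyme", "rna", "go-terms",
    "compound", "reaction", "operon", "pathway", "organism",
}


def parse_quick_search(query):
    """Split a Quick Search string into one dictionary per comma-separated search."""
    searches = []
    for part in query.split(","):
        terms, exact, page_type = [], False, None
        for token in part.split():
            if token == "search:exact":
                exact = True
            elif token.startswith("type:"):
                value = token[len("type:"):]
                if value not in PAGE_TYPES:
                    raise ValueError(f"unknown page type: {value!r}")
                page_type = value
            else:
                terms.append(token)  # ordinary search term
        if terms:
            searches.append({"terms": terms, "exact": exact, "type": page_type})
    return searches


print(parse_quick_search("trpA search:exact"))
print(parse_quick_search("ATP type:compound"))
print(parse_quick_search("peptidoglycan, biosynthesis"))
```

The last call yields two separate searches, mirroring the comma behavior described earlier in this section.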

B. Gene Search: This button right next to the Quick Search is just like doing type:gene for your term of interest, when you want to find a specific gene.

2. “change organism database”: EcoCyc is part of the BioCyc database collection, which has approximately 1000 organism databases. Each database’s content is organized in precisely the same way as in EcoCyc, i.e. using protein pages, compound pages, etc. To change from one organism database to another, click on “change organism database” directly below the Quick Search button or follow the link at the bottom of the Search pulldown menu. The list of databases is organized alphabetically. Keep in mind that if you’re looking at a protein page for a particular organism, changing the database will not take you to the corresponding ortholog’s page; use the Compare Orthologs button for this feature. Remember, names of genes/proteins/compounds in E. coli will most likely differ in the newly selected organism.

3. Object Searches within the Search Menu

The following advanced search functions enable you to locate specific pages (database objects) based on one or multiple criteria, which you enable or disable by clicking (expanding) pulldown menus. Unless a pulldown menu for a given criterion is expanded, that criterion won’t be included in the resulting search (even if it was previously selected – only selected criteria of an expanded menu are used in the resulting search). Making a complex search based on a combination of attributes is a powerful way to sift through the tremendous amount of information within EcoCyc.

1. Compound Search: At its most basic, you can perform a Compound search by entering the full or partial name of the substance you’re interested in. Two further criteria you may find useful are “Search/Filter by molecular weight” and “by chemical formula.” In the first criterion, you can establish limits on the molecular weight (in Daltons) of compounds you wish to look up; for example, entering 15 in the left field and 15000 in the right field will retrieve all compounds between 15 and 15,000 Da in weight. To search by chemical formula, type in the chemical symbol for an element of interest (e.g. O for oxygen), and enter a number after the symbol to find compounds containing exactly that number of atoms of that element. For example, O14 will retrieve all compounds with exactly 14 oxygen atoms, regardless of what else is in that substance.
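The two Compound Search criteria can be pictured as simple filters. The Python sketch below applies a molecular weight window and an exact element count to a tiny hand-made table; the table, the function names, and the listed weights (neutral forms, rounded) are illustrative assumptions and are not taken from EcoCyc.

```python
import re

# Illustrative data only: name -> (formula, approximate molecular weight in Da).
COMPOUNDS = {
    "L-leucine": ("C6H13NO2", 131.17),
    "ATP": ("C10H16N5O13P3", 507.18),
    "water": ("H2O", 18.02),
}


def element_counts(formula):
    """Count atoms per element in a simple formula such as 'C6H13NO2'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def search_compounds(compounds, min_weight=None, max_weight=None,
                     element=None, count=None):
    """Return names passing the weight window and the exact element count."""
    hits = []
    for name, (formula, weight) in compounds.items():
        if min_weight is not None and weight < min_weight:
            continue
        if max_weight is not None and weight > max_weight:
            continue
        if element is not None and element_counts(formula).get(element, 0) != count:
            continue
        hits.append(name)
    return hits


print(search_compounds(COMPOUNDS, min_weight=15, max_weight=150))
print(search_compounds(COMPOUNDS, element="O", count=2))
```

The element test uses equality, matching the "exactly that number of atoms" behavior described for formula searches.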

2. Genes/Proteins/RNA: This type of object search has the greatest number of criteria available, as the most detailed pages within EcoCyc describe proteins and the genes that encode them. You can search for a gene and/or protein/RNA by full or partial name, as well as by a designated sequence length (more than x nucleotides/amino acids and fewer than y nucleotides/amino acids). The “replicon and/or gene map position” criteria enable you to restrict your search to genes in a given area of the bacterial genome, selected by entering the map position in base pairs (this is similar to selecting the display level of the Genome Browser tool discussed below). Searching by molecular weight enables you to screen for genes encoding a gene product with a specific predicted size in kDa. The isoelectric point (pI), or the pH at which a protein has no net electric charge, is also useful as a search criterion, as enzymes often function maximally within a narrow range of pH values.
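The range criteria above combine with the rule stated at the start of this section: only criteria whose pulldown menu is expanded take part in the search. The Python sketch below models that rule with made-up records; the class, field names, and values are all hypothetical and serve only to show how collapsed criteria drop out.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    field: str             # record key to test, e.g. "length_aa"
    low: float             # value must be more than this
    high: float            # value must be fewer than this
    expanded: bool = True  # collapsed menus are ignored


def matches(record, criteria):
    """True if the record passes every criterion whose menu is expanded."""
    return all(
        c.low < record[c.field] < c.high
        for c in criteria
        if c.expanded
    )


# Invented records for illustration; not real E. coli data.
records = [
    {"name": "geneA", "length_aa": 300, "pI": 5.2},
    {"name": "geneB", "length_aa": 120, "pI": 9.1},
]

criteria = [
    Criterion("length_aa", 100, 200),
    Criterion("pI", 4.0, 6.0, expanded=False),  # filled in but collapsed
]

print([r["name"] for r in records if matches(r, criteria)])
```

Because the pI criterion is collapsed, only the length window is applied; expanding it would exclude geneB as well.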

As detailed in the Regulation module, much protein control is implemented after the protein has been synthesized through transcription and translation, using molecules called effectors that can either activate or inhibit the protein’s activity. Accordingly, you can search for specific proteins that bind desired effectors, e.g. all enzymes that are inhibited by L-tryptophan. Another available criterion is the evidence code, which lets you separate genes or gene products predicted by a computer from those supported by direct experiment. “Search/Filter by cell component” is a particularly helpful criterion, enabling you to limit your search by the predicted location of a protein of interest (cell envelope, membrane, periplasmic space, or the cytoplasm, aka “super component”).

Special consideration should be given to searching by Gene Ontology (GO) or MultiFun terms, as you’ll be encountering them frequently within EcoCyc. When scientists enter or annotate information about a certain protein, they often assign specific GO terms for that protein. The three major categories of GO terms describe: i. the protein’s known molecular functions, ii. the component of the cell it’s a part of, and iii. the general biological process it contributes to. In the Gene Ontology Search, these categories form three large parent classes of EcoCyc objects. Each parent class contains many different subclasses that get more and more specific. Many gene product pages lacking Summary sections only have clues to the protein’s function through annotated GO terms, so note them carefully. MultiFun terms are an older, very similar form of annotation featuring some overlapping parent classes and some unique ones. Finally, you can search using the title or author of an article, as many publications predominantly discuss one specific gene or protein.

3. Reactions: Here you can search for all Reaction pages featuring a particular enzyme by either entering in the enzyme’s name or EC number. You can also search for all reactions involving a particular compound or set of compounds, whether they appear as reactants or products. The Ontology filter allows you to look for specific reactions within hierarchies of reactions classified by the type of substrate processed, the type of conversion performed, etc.

4. Pathways: In addition to looking up a specific pathway or set of pathways by full or partial name, you can search for pathways containing a designated number of reaction steps (more than x steps and fewer than y steps), as well as by specific substrates that can feed into the pathway (useful if you want to find cross-talk between pathways utilizing different substrates). As many publications discuss or elaborate on only one specific pathway, you can also search by an author or article title.

5. Advanced: For more programming-savvy students, this option lets you construct queries using Boolean logic to examine multiple types of database objects, enabling a global analysis of E. coli.

6. Ontologies: One alternative to searching EcoCyc for specific terms of interest is to take advantage of the numerous ways EcoCyc content has been organized as hierarchies of objects, also known as ontologies. Ontologies can be useful when you want to find a whole set of database pages connected by a defined characteristic; all the known transcriptional repressors in E. coli, for example. In addition to ontologies arranged using GO terms and MultiFun terms, additional ontologies include organized lists of metabolic pathways, compounds, and reactions (specified by EC Number).
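The idea of retrieving every page under one class of a hierarchy can be sketched as a short recursive walk. In the Python example below, the class names, the tree shape, and the member "activator-X" are invented for illustration; only TyrR being a repressor comes from this guide.

```python
# Hypothetical class hierarchy: parent class -> list of subclasses.
SUBCLASSES = {
    "transcriptional regulators": ["repressors", "activators"],
    "repressors": [],
    "activators": [],
}

# Hypothetical direct members of each class.
MEMBERS = {
    "repressors": ["TyrR"],
    "activators": ["activator-X"],  # placeholder name, not a real protein
}


def members_under(cls):
    """Collect members of a class and, recursively, of all its subclasses."""
    found = list(MEMBERS.get(cls, []))
    for child in SUBCLASSES.get(cls, []):
        found.extend(members_under(child))
    return found


print(members_under("repressors"))
print(members_under("transcriptional regulators"))
```

Asking for a parent class returns everything beneath it, which is why browsing an ontology can surface a whole set of related pages at once.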

IV. ECOCYC TOOLS (PULL-DOWN MENU ACCESSIBLE FROM EVERY PAGE, INCLUDING HOME)

A. Genome browser: The entire E. coli genome is organized by nucleotide position within the single chromosome, and annotated genes can be examined at each of four levels:

  • Genome, showing the entire set of E. coli genes unlabeled but roughly organized into transcription units (multiple genes in an operon all have the same color);
  • Operons, viewing a ~100 kb region where color-coded operons are labeled by gene names;
  • Genes, viewing a ~13 kb region with annotated and hypothetical genes along with predicted promoters;
  • Sites, viewing a 3-4 kb region of the genome with every predicted promoter, transcriptional terminator, and extragenic site labeled.

A powerful variant of this tool called the Comparative Genome Browser allows you to visualize gene orthologs from species you select within the particular regions of bacterial chromosomes those genes occupy. For example, this sample page shows you the gene tyrB compared across 6 species displayed at the “Genes” level (see Sample Page #9.ppt). Comparing and contrasting other genes surrounding an ortholog of interest can give you a broader picture of how chromosomes have evolved between different species carrying related genes.

B. Cellular/Genome/Regulatory Overview: Overview pages are designed to give you a “big-picture” look at E. coli’s metabolic pathways, the bacterial genome, or regulation of gene transcription.

  • The Cellular Overview zoomed furthest out (the initial view) shows you a rectangle representing the bacterial cell, with pathways and reactions displayed as lines between circles, squares, triangles, and other shapes. These reactions are roughly grouped according to cellular location, e.g. substrate transport reactions and their associated proteins are located between the two lines that form the boundaries of the rectangle and signify the outer and cytoplasmic membranes. A set of related pathways is contained within the same light grey box; for example, all the reactions involved in the biosynthesis of various lipids. You can zoom in on the overview to find the particular reaction/protein of interest, although it’s best to use the pull-down “Cellular Overview” menu and select the “Highlight Pathway”, “Highlight Gene”, “Highlight Reaction”, etc. buttons to search for your object of interest by name.

  • The Genome Overview is the same as the Genome level of the Genome browser mentioned above.
  • The Regulatory Overview focuses on transcriptional regulation, with sets of genes organized into three concentric rings. The innermost ring depicts genes encoding global or master regulators responsible for controlling the expression of many operons (sometimes hundreds!). The middle ring displays genes encoding local regulatory proteins (known to control only one operon), and the outermost ring shows the many non-regulatory genes whose regulators are known or predicted. As in the Cellular Overview, placement is only approximate: genes in the outermost ring are placed close to the regulators controlling their expression in the inner rings.

C. Omics Viewer: This tool allows high-throughput experimental data, whether from a single experiment or a time-course of experiments, to be uploaded and visualized by overlaying that data onto the Cellular, Genome, or Regulatory Overview. For example, if you wanted to visualize data from a microarray experiment where E. coli is grown under varying conditions, you can overlay that data onto the Genome Overview displaying every gene in E. coli, where each gene icon is assigned a color according to its particular level of expression. Similarly, comprehensive data on the production of compounds in the cell (metabolites) can be visualized by overlaying that data onto the Cellular Overview, where compound icons are color-coded in relation to their concentration. In this manner, you can get a complete picture of how your experiments impact gene expression and/or metabolic pathways in the cell in one diagram.
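The idea of assigning each icon a color according to its measured value can be sketched in a few lines of Python. The blue-to-red scale, the clamping range of -2 to 2, the function name, and the expression values below are all assumptions made for illustration; they are not the color scheme or data format that the Omics Viewer itself uses.

```python
def expression_color(value, low=-2.0, high=2.0):
    """Map a value onto a blue (low) to red (high) RGB color.

    Values outside [low, high] are clamped to the nearest end of the scale.
    """
    clamped = max(low, min(high, value))
    fraction = (clamped - low) / (high - low)  # 0.0 at low, 1.0 at high
    red = round(255 * fraction)
    blue = round(255 * (1 - fraction))
    return (red, 0, blue)


# Invented measurements for three genes named elsewhere in this guide.
measurements = {"tyrB": -2.0, "leuC": 2.0, "leuD": 5.0}

painted = {gene: expression_color(v) for gene, v in measurements.items()}
for gene, color in painted.items():
    print(gene, color)
```

Clamping keeps a few extreme measurements from compressing the rest of the color scale, which matters when hundreds to thousands of values share one diagram.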

D. Comparative Analysis: This tool connects information from the central E. coli K-12 genome database with 30 other E. coli genomes and almost 1000 other bacterial genomes. By checking options for “comparative-analysis tables,” you can view profiles comparing known biochemical reactions, pathways, compounds, proteins, transporters, or transcription units across as many organisms of interest as you select in the central portion of the page. This is helpful when you want to understand the degree to which molecular components and pathways are conserved between organisms; gaps in conservation may suggest where further research is needed to find analogous components, or they may suggest alternative pathways used by an organism to achieve the same metabolic goal.
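A comparative-analysis table is, at heart, a presence/absence grid. The Python sketch below builds such a grid for pathways from per-organism sets and then lists the gaps for one organism. The organism labels and their pathway lists are invented placeholders, and the function names are hypothetical.

```python
# Invented placeholder data: organism -> set of pathways recorded for it.
PATHWAYS = {
    "organism A": {"leucine biosynthesis", "peptidoglycan biosynthesis"},
    "organism B": {"leucine biosynthesis"},
}


def comparison_table(pathways_by_organism):
    """Return {pathway: {organism: present?}} across all organisms."""
    all_pathways = sorted(set().union(*pathways_by_organism.values()))
    return {
        pathway: {
            organism: pathway in present
            for organism, present in pathways_by_organism.items()
        }
        for pathway in all_pathways
    }


def gaps(table, organism):
    """List pathways found in at least one organism but not in this one."""
    return [pathway for pathway, row in table.items() if not row[organism]]


table = comparison_table(PATHWAYS)
print(gaps(table, "organism B"))
```

A gap in this grid is only a starting point: as noted above, it may reflect a missing annotation or a genuinely different route to the same metabolic goal.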

E. Reports: These four tools enable you to quantify what’s known about a given organism. In “Summary statistics,” you can find the total number of genes, proteins, metabolic pathways, etc. that have been identified in a bacterial species. “History of updates” lists which features have been incorporated in the database with each release. “Pathway evidence” and “Pathway holes” list, respectively, the metabolic pathways believed to exist in an organism and the metabolic reactions for which no associated enzyme has been identified.
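The notion of a pathway hole, a reaction with no associated enzyme, reduces to a simple check. The Python sketch below uses an invented pathway with placeholder reaction and enzyme names; it illustrates the definition given above and is not how the report itself is computed.

```python
# Invented placeholder pathway: reaction -> enzymes known to catalyze it.
PATHWAY = {
    "reaction 1": ["enzyme A"],
    "reaction 2": [],  # no enzyme identified: a pathway hole
    "reaction 3": ["enzyme B", "enzyme C"],
}


def pathway_holes(reaction_enzymes):
    """Return the reactions that have no associated enzyme."""
    return [
        reaction
        for reaction, enzymes in reaction_enzymes.items()
        if not enzymes
    ]


print(pathway_holes(PATHWAY))
```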