VAMPS FAQ

Frequently Asked Questions


Data Request

Raw sequence data is not available from the VAMPS website. To request the original raw data files for projects that are currently on VAMPS, please note the following options:

  • For Illumina projects: If the project was sequenced at the MBL (Bay Paul Center) in Woods Hole, please send an email to Hilary (morrison@mbl.edu) to request the raw FASTQ files. Be sure to specify the project name, as it appears on VAMPS, in your email subject line.
  • For older public projects (e.g., 454 and Ion Torrent), the original data files are probably available from the NCBI SRA site.
  • If the project is a user-uploaded project then you will need to request the data from the project owner. The project owner's email address can be located on the project profile page.
  • If you are unsure then send a note to vamps@mbl.edu detailing your request.


    Tag Generation/Sequence Processing

    In October of 2018, we switched from the Invitrogen Platinum Taq DNA Polymerase High Fidelity (Cat. No. 11304102) to the Invitrogen Platinum SuperFi DNA Polymerase (Cat. #12351-050). The High Fidelity polymerase offers 6X the fidelity of Taq, whereas the SuperFi polymerase offers 100X. The modified PCR master mix recipe is as follows:

    Illumina Amplicon Generation

    Fusion PCR recipe (125 uL):
    • 91.25 uL of water
    • 25 uL of 5X SuperFi Buffer
    • 2.5 uL of 10 mM dNTP Mix
    • 1.25 uL of Platinum SuperFi Polymerase
    • 4 uL of 10 uM Fusion Primer Mix
    • 1 uL of template*
    *More template can be used in place of water if sample concentration is low (< 1 ng).
    Prior to addition of template, 25 uL of master mix is removed to serve as a negative control. After addition of template, master mix is divided into 3 reactions of 33 uL each.
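
    If you need to scale this mix for several templates, the hedged Python sketch below (not an official MBL worksheet) multiplies the per-template volumes above by the number of templates plus an arbitrary 5% pipetting overage; template is left out of the bulk mix and added per reaction, as in the protocol above.

      # Scale the 125 uL SuperFi fusion-PCR master mix for several templates.
      # Template is excluded from the bulk mix and added per reaction
      # (25 uL of mix is removed first as the no-template negative control).

      RECIPE_PER_TEMPLATE_UL = {
          "water": 91.25,
          "5X SuperFi Buffer": 25.0,
          "10 mM dNTP Mix": 2.5,
          "Platinum SuperFi Polymerase": 1.25,
          "10 uM Fusion Primer Mix": 4.0,
      }

      def scaled_master_mix(n_templates, overage=1.05):
          """Return bulk component volumes (uL) for n_templates plus a pipetting overage."""
          factor = n_templates * overage
          return {name: round(vol * factor, 2) for name, vol in RECIPE_PER_TEMPLATE_UL.items()}

      if __name__ == "__main__":
          for name, vol in scaled_master_mix(8).items():
              print(f"{name:30s}{vol:10.2f} uL")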

    v4v5 Program:
    • 94°C for 3 min
    • 30 cycles of:
      • 94°C for 30 sec
      • 57°C for 45 sec
      • 72°C for 1 min
    • 72°C for 2 min
    • Hold at 4°C
    Supplies:
    Invitrogen Platinum SuperFi DNA Polymerase (Cat. #12351-050)
    Thermo Scientific 10 mM dNTP Mixes (Cat. # FERR0192)

    In 2013 we switched to using Illumina platforms for 16S sequencing. As with the 454 strategy, we use fusion primers composed of the Illumina adaptors, multiplexing identifiers, and domain-specific primers. The thermocycling conditions and reaction mixtures for Illumina amplicon PCR differ from those used for 454 sequencing.

    For Archaeal V6:

    • 94C for 2 min
    • 30 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
    • 72C for 2 min
    • Hold at 4C

    For Bacterial and Archaeal V4V5:

    • 94C for 2 min
    • 30 cycles of (94C for 30sec, 57C for 45sec, 72C for 1min)
    • 72C for 2 min
    • Hold at 4C

    For Bacterial V6, the domain-specific primers are used first for 25 cycles; the products are then cleaned and used in a second, 5-cycle PCR with the fusion primers.

    For Bacterial V6:

    • 94C for 2 min
    • 25 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
    • 72C for 2 min
    • Hold at 4C
    • Visualize and quantitate the amplicons in the BioAnalyzer or Caliper.
    • Purify the amplicons using Qiagen MinElute columns.
    • Use product in second PCR:
      • 94C for 2 min
      • 5 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
      • 72C for 2 min
      • Hold at 4C

    After we produce amplicons, we can clean and/or size-select for the target products using Agencourt AMPure XP beads. Then we quantitate products with an Invitrogen Picogreen assay, pool at desired concentrations (e.g. equimolar), and quantitate the final pool with qPCR.
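
    As an illustration of the equimolar pooling arithmetic, the hedged Python sketch below converts PicoGreen mass concentrations and amplicon lengths to molarity (assuming ~660 g/mol per bp of double-stranded DNA) and computes the volume of each product to add to the pool; the sample names, concentrations, and target amount are hypothetical.

      # Compute volumes for an equimolar amplicon pool from PicoGreen
      # concentrations (ng/uL) and amplicon lengths (bp).

      def nanomolar(conc_ng_per_ul: float, length_bp: int) -> float:
          """Convert a dsDNA mass concentration to molarity in nM (== fmol/uL)."""
          return conc_ng_per_ul * 1e6 / (660.0 * length_bp)

      def equimolar_volumes(samples: dict, target_fmol: float = 50.0) -> dict:
          """Volume (uL) of each amplicon needed to contribute target_fmol to the pool."""
          return {name: round(target_fmol / nanomolar(conc, length), 2)
                  for name, (conc, length) in samples.items()}

      if __name__ == "__main__":
          # hypothetical example values: (ng/uL, amplicon length in bp)
          samples = {"sample_A": (12.0, 450), "sample_B": (4.5, 450), "sample_C": (20.0, 380)}
          print(equimolar_volumes(samples))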


    Conserved sequences that flank the hypervariable V6-V4 region of rRNAs serve as primer sites to generate PCR amplicons. Each PCR reaction produces products that can be informatically identified using a unique "key" incorporated between the 454 Life Sciences primer A or B and the 5' flanking rRNA primer. The use of a 5-bp key allows for the synthesis of as many as 81 oligonucleotides that differ by at least two sites. Our multiplexing strategy allows the concurrent collection of 10,000-50,000 tags from each of 8-40 samples in a single nine-hour sequencing run without the use of partitioning gaskets that reduce the number of sequencing wells on the PicoTiterPlate™. Amplicons can be pooled before the emPCR step and each pool is run on a large region of the plate.
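
    The "differ by at least two sites" requirement is easy to check programmatically; the hedged Python sketch below verifies the minimum pairwise Hamming distance of a key set (the key sequences shown are made-up placeholders, not the MBL keys).

      # Verify that a set of 5-bp multiplexing keys differ from one another
      # at two or more positions, as described above.

      from itertools import combinations

      def hamming(a: str, b: str) -> int:
          """Number of positions at which two equal-length keys differ."""
          return sum(x != y for x, y in zip(a, b))

      def min_pairwise_distance(keys) -> int:
          return min(hamming(a, b) for a, b in combinations(keys, 2))

      if __name__ == "__main__":
          keys = ["ACACG", "ACGTC", "AGCAG", "ATAGT"]   # made-up example keys
          assert all(len(k) == 5 for k in keys)
          print("minimum pairwise distance:", min_pairwise_distance(keys))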

    454 Amplicon PCR (Christina Holmes/Ekaterina Andreishcheva) for four reactions:

    • 96 ul water
    • 13.4 ul 10X Platinum buffer
    • 9.4 ul 50 mM MgSO4
    • 2.6 ul 10 mM Pure Peak dNTPs
    • 4 ul 10uM Fusion Primer A
    • 4 ul 10uM Fusion Primer B
    • 2.6 ul 2.5 U/ul Platinum HiFi Pol
    • [2 ul template (~5-25 ng)*]
    • 33 ul total volume/reaction

    *If template stock is dilute or otherwise resistant to amplification, more template can be added in place of water.

    The 5 reactions are the three replicates of the environmental template, the positive control, and the negative control. Template (a plasmid pool for the positive control; water for the negative control) is added as the final step.

    Program:

    • 94C for 2 min
    • 30 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
    • 72C for 2 min
    • Hold at 4C
    • Visualize and quantitate the amplicons in the BioAnalyzer or Caliper.
    • Purify the amplicons using Agencourt AMPure XP beads as described in 454 Sequencing Technical Bulletin 2011-007, and resuspend in 100 uL of Buffer EB.
    • Store purified amplicons at -20C.

    Supplies:

    • DNA 1000 Kit, Agilent, 5067-1504, 25 chips. $372
    • Agencourt AMPure XP, 60 mL, A63881. $1050
    • PurePeak DNA Polymerization Mix 10 mM 1 ml, Pierce/ThermoFisher, NU606001, $115
    • Platinum HiFi Taq polymerase plus buffer and MgSO4, Invitrogen.

    Sequencing Pipeline

    Our sequencing pipeline is public; the details and links are available on GitHub.

    The steps, in order, are: Demultiplexing, Merging, Uniqueing, Chimera Checking, and Taxonomy Assignment.

    We use Meren's scripts to perform some of these steps. You can find them here on his GitHub site: https://github.com/merenlab/illumina-utils/tree/master/scripts

    1. Demultiplexing

    2. Merging (with Quality Control)

    3. Uniqueing

    4. Chimera Checking
      • We use vsearch for chimera checking, combining the results from both the reference-based and de novo runs (one way to combine them is sketched after step 5 below).
      • [sample ref:] vsearch -uchime_ref ACAGTG_ACTGC_1_MERGED-MAX-MISMATCH-3.unique.chg -db rRNA16S.gold.fasta -uchimeout ACAGTG_ACTGC_1_MERGED-MAX-MISMATCH-3.unique.chimeras.db -chimeras ACAGTG_ACTGC_1_MERGED-MAX-MISMATCH-3.unique.chimeras.db.chimeric.fa -strand plus -notrunclabels
      • [sample denovo:] vsearch -uchime_denovo TTAGGC_NNNNCGACG_1_MERGED-MAX-MISMATCH-3.unique.chg -uchimeout TTAGGC_NNNNCGACG_1_MERGED-MAX-MISMATCH-3.unique.chimeras.txt -chimeras TTAGGC_NNNNCGACG_1_MERGED-MAX-MISMATCH-3.unique.chimeras.txt.chimeric.fa -notrunclabels

    5. Taxonomy Assignment (using GAST)
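
    The pipeline combines the reference and de novo chimera results; the exact combination rule is not spelled out here, so the hedged Python sketch below assumes a read flagged by either run is treated as chimeric and removed. The file names are shortened placeholders for the example outputs in step 4.

      # Hypothetical helper: treat a read as chimeric if either vsearch run
      # flagged it, then write the remaining (non-chimeric) reads to a new file.

      def fasta_ids(path):
          """Collect read IDs (text after '>' up to the first whitespace)."""
          with open(path) as handle:
              return {line[1:].split()[0] for line in handle if line.startswith(">")}

      def write_non_chimeric(unique_fa, chimeric_ids, out_fa):
          """Copy records from unique_fa to out_fa, skipping flagged IDs."""
          keep = False
          with open(unique_fa) as src, open(out_fa, "w") as dst:
              for line in src:
                  if line.startswith(">"):
                      keep = line[1:].split()[0] not in chimeric_ids
                  if keep:
                      dst.write(line)

      if __name__ == "__main__":
          flagged = (fasta_ids("sample.unique.chimeras.db.chimeric.fa")
                     | fasta_ids("sample.unique.chimeras.txt.chimeric.fa"))
          write_non_chimeric("sample.unique.fa", flagged, "sample.unique.nonchimeric.fa")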

    Quality Control

    • There are several steps in the quality control process. Some of them are embedded in other pipeline steps (see Sequencing Pipeline above).
    • During demultiplexing we check that the barcodes and indexes match those in the provided metadata; reads that do not match are discarded.
    • Read 1 and read 2 are merged into FASTA files. At this point we also discard low-quality reads, and primers and barcodes are stripped out.
    • Meren's utilities "iu-merge-pairs" and "iu-filter-merged-reads" are used for part of this step (see Sequencing Pipeline above).
      • Filters used during merging (a sequence is discarded if it fails any of these tests; a minimal sketch of the logic follows this list):
        1. Is the prefix correct for read_1 and read_2?
        2. Is the "P" value acceptable? (The P value is the ratio of the number of mismatches to the length of the overlap; merged sequences can be discarded based on this ratio.)
        3. Does the read have fewer than the maximum number of mismatches?
        4. Does the read contain any 'N's?
        5. Is the quality better than Q30? (Phred quality score: a Q score of 30 (Q30) corresponds to an incorrect base call probability of 1 in 1000.)
        6. Is there more than the minimum expected overlap?

      • Default filter values:
        • "P" value: 0.300000
        • Maximum number of mismatches in the overlapped region: None
        • Minimum overlap size: 15
        • Minimum Q-score for mismatches: 15
        • Q30 enforced?: True
                        
    • Chimera checking using vsearch, both denovo and reference based.
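
    Here is a hedged Python illustration of the merge-time filter logic above. The production pipeline relies on Meren's illumina-utils, whose exact definitions of the prefix and Q30 checks (items 1 and 5) should be taken from that package, so those two are omitted here; the thresholds are the default values listed above.

      def passes_merge_filters(merged_seq, overlap_len, n_mismatches, mismatch_qscores,
                               max_p=0.30, min_overlap=15, max_mismatches=None,
                               min_mismatch_q=15):
          """Apply the default-value filters (items 2, 3, 4, 6) to one merged read."""
          if overlap_len < min_overlap:
              return False                        # item 6: minimum expected overlap
          if max_mismatches is not None and n_mismatches > max_mismatches:
              return False                        # item 3: maximum number of mismatches
          if n_mismatches / overlap_len > max_p:
              return False                        # item 2: "P" value (mismatch ratio)
          if "N" in merged_seq.upper():
              return False                        # item 4: ambiguous base calls
          if any(q < min_mismatch_q for q in mismatch_qscores):
              return False                        # minimum Q-score at mismatch positions
          return True

      # Example: 2 mismatches over a 60 nt overlap (ratio 0.033) passes the defaults.
      print(passes_merge_filters("ACGT" * 60, overlap_len=60, n_mismatches=2,
                                 mismatch_qscores=[34, 31]))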

    What is GAST?

    GAST stands for Global Alignment for Sequence Taxonomy.

    It uses a reference database of SSU sequences to determine the taxonomy of hypervariable region tags. The specifics are described in the citation below.

    Citation:
    Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML (2008) 
    Exploring Microbial Diversity and Taxonomy Using SSU rRNA Hypervariable Tag Sequencing. 
    PLoS Genet 4(11): e1000255. https://doi.org/10.1371/journal.pgen.1000255        
              

    Exporting (Downloading) Data

    There is no raw sequence data available on VAMPS.

    The primary way to export data from VAMPS is to use the 'Download Data' page, which is accessed from the main 'Sample Selection' page. After you've selected the datasets you want to include in your download, select the blue 'Download Data' button.
    You can download data from VAMPS in various formats. Listed on the download page are the formats (with descriptions) available to you.

    • Fasta (3 formats available)
    • Counts Matrix (2 formats available)
    • Metadata (2 formats available)
    • Biom (Modified JSON format)

    There are other downloading opportunities available for data and images on the Visualization pages.

    Importing Data

    Description of Import Options

    To upload data to VAMPS, use the upload page.

    Fasta files and count matrix files can be uploaded to VAMPS as long as they conform to the correct format which is described on the upload page for each type.

    Multiple-Dataset fasta file

    Metadata file (for projects already in VAMPS)

    This is nominally a CSV (comma-separated values) file, but the data are actually separated by <TAB> characters to make the file more human readable. These files must conform to the QIIME mapping file format. Since the uploaded data are already processed (trimmed), the 'BarcodeSequence' and 'LinkerPrimerSequence' fields can be left empty, but the header names must be present. The #SampleID field must be present and there can be no duplicate sample names. The Description field must also be present and must be the last field. A minimal example is shown below. There is a handy 'validate_mapping_file.py' script available in QIIME to assist you with this file.
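
    A minimal, hypothetical example (sample names and values are placeholders; the columns must be separated by real tab characters, shown here as spacing for readability):

      #SampleID    BarcodeSequence    LinkerPrimerSequence    collection_date    Description
      sample_01                                               2003-03-25         surface water, station 1
      sample_02                                               2003-04-02         surface water, station 2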


    Clustering and Diversity

    How are SLP clusters created?

    A combination of ESPRIT, SLP, and mothur computes taxonomy-independent clusters (Operational Taxonomic Units, OTUs) using the total collection of available V6 sequences in VAMPS. The sequences were binned into separate datasets for the Archaeal and Eukaryal domains, and into Bacterial phylum- or Proteobacterial class-level datasets. For each bin, the unique.seqs command in mothur eliminated duplicate sequences while retaining the observed frequency of each unique read. The kmerdist module of ESPRIT (with default values) identified all sequence pairs within each bin predicted to be at least 90% similar. The needledist module of ESPRIT then generated a sparse matrix of pairwise distances by performing a Needleman-Wunsch alignment on those sequence pairs and calculating distances with quickdist.

    SLP uses the pairwise distances to perform a modified single-linkage preclustering at 2% to reduce noise in the sequence data. SLP first orders the sequences by rank abundance and then steps through the ordered sequences, assigning them to clusters. The most abundant sequence defines the first cluster. Each subsequent sequence is tested against the growing list of clusters using the single-linkage algorithm: if the sequence has a pairwise distance of less than 0.02 (equivalent to a single difference in the V6 region) to any sequence already in a cluster, it is added to that cluster and not tested against subsequent clusters; if it is not within a distance of 0.02 of any read in any existing cluster, it establishes a new cluster. Once all sequences have been assigned, sequences in low-abundance clusters (< 10 tags) are tested against the larger clusters and added to them if possible. For each precluster, SLP passes the most frequent sequence and the total tag count of the precluster to average-linkage clustering in mothur. The taxonomy of each cluster is assigned by a two-thirds majority of the taxonomies of its members, and CatchAll estimates richness. A minimal sketch of the preclustering step follows.
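
    The hedged Python sketch below illustrates the preclustering logic only (it is not the published SLP code); it assumes a precomputed sparse pairwise-distance map, as produced by the needledist step, and omits the low-abundance re-test and the subsequent mothur average-linkage clustering.

      def slp_precluster(seq_counts, pair_dist, threshold=0.02):
          """
          seq_counts: {sequence_id: observed abundance}
          pair_dist:  {(id_a, id_b): distance} for pairs predicted >= 90% similar
          Returns a list of clusters (lists of sequence ids), seeded in rank-abundance order.
          """
          def dist(a, b):
              return pair_dist.get((a, b), pair_dist.get((b, a), 1.0))

          ordered = sorted(seq_counts, key=seq_counts.get, reverse=True)
          clusters = []
          for seq in ordered:
              for cluster in clusters:
                  # single linkage: join the first cluster with any member within
                  # the distance threshold, then stop looking
                  if any(dist(seq, member) <= threshold for member in cluster):
                      cluster.append(seq)
                      break
              else:
                  clusters.append([seq])     # no close cluster found: start a new one
          return clusters

      if __name__ == "__main__":
          counts = {"u1": 120, "u2": 45, "u3": 3}
          dists = {("u1", "u2"): 0.01, ("u1", "u3"): 0.08}
          print(slp_precluster(counts, dists))   # -> [['u1', 'u2'], ['u3']]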

    Citation:
      Huse, S.M., D. Mark Welch, H.G Morrison, and M.L. Sogin. (2010) 
      Ironing out the wrinkles in the rare biosphere. Environmental Microbiology early view.
    

    Phyloseq and ggplot2

    What is Phyloseq?

    Phyloseq is an R library for microbiome data (https://joey711.github.io/phyloseq/).
    VAMPS uses R (https://www.r-project.org/) and Python (https://www.python.org/) scripts to produce some of the visualizations.

    To use the phyloseq library with your VAMPS data, download the three Phyloseq files from the 'Display Choices' page, then import them directly into R (or an R script) as shown below to create a phyloseq object:

      # Input files downloaded from the VAMPS 'Display Choices' page
      biom_file <- 'phyloseq.biom'
      tax_file  <- 'phyloseq_taxonomy.txt'
      map_file  <- 'phyloseq_metadata.txt'

      library(phyloseq)
      library(ggplot2)

      # Read the taxonomy table, the BIOM counts, and the sample metadata
      TAX <- as.matrix(read.table(tax_file, header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
      OTU <- import_biom(biom_file)
      MAP <- import_qiime_sample_data(map_file)

      # Wrap the components and combine them into a single phyloseq object
      TAX <- tax_table(TAX)
      OTU <- otu_table(OTU)
      physeq <- phyloseq(OTU, TAX, MAP)

    See the Phyloseq website for more help and examples.


    What is ggplot2?

    ggplot2 is an R library for producing high-quality graphics from data such as VAMPS downloads.
    See https://ggplot2.tidyverse.org/ for more information.

    Reference Databases

    How are the reference databases created?

    We create Ref16S, a reference database of aligned full-length sequences based on all available sequences in SILVA exported using the ARB software. New updates to both SILVA and RDP are incorporated as they become available.

    • We flag as low-quality and delete all sequences with a sequence quality score <= 50, an alignment score <= 50 or a pintail (chimera) score <= 40.
    • We flag as redundant and delete all exact copies of the full-length sequence.
    • We classify all bacterial and archaeal sequences directly with the Ribosomal Database Project (RDP) Classifier. We use only RDP classifications with a bootstrap value of >=80%. If the bootstrap value is <80%, the taxonomic assignment is moved to a higher classification level until an 80% or better bootstrap value is achieved. For example, if the genus assignment had a bootstrap value of 70% but the family had a value of 85%, that sequence would be assigned only as far as family and not to genus. The RDP Classifier does not classify sequences below the genus level.
    • We incorporate other taxonomy sources, such as Entrez Genome accession numbers or researcher knowledge of specific entries, as they become available. These "other sources" are used preferentially over RDP for bacteria and archaea. RDP does not classify eukaryotes; for eukaryote taxonomies, we use the EMBL taxonomy from the SILVA database where we do not have other sources.
    • We create hypervariable region specific databases (RefV6, RefV3, RefV9, etc.).
      • For each hypervariable region we calculate the Ref16S alignment coordinates.
      • We then excise from the Ref16S aligned sequences the section corresponding to the hypervariable region.
      • The gaps are removed from the aligned sequences to create a set of unaligned sequences. Any hypervariable sequences that contain an 'N' are deleted, as are any shorter than 50 nt and any full-length sequences that were not sequenced all the way through the specific hypervariable region. Two reference IDs are assigned to each reference hypervariable region: a ref16s_id (previously alt_local_gi) links the hypervariable sequence to its source full-length sequence, and a second ID, e.g., refv6_id (previously known as local_gi), identifies all entries having exactly the same hypervariable-region sequence. The taxonomy is carried directly from the full-length source. (A short sketch of the excision and filtering step follows this list.)
      • Unique reference hypervariable-region sequences are exported to a BLAST-searchable database for use in assigning taxonomy to pyrosequencing reads through GAST.
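
    As a rough illustration of the excision and filtering step above, here is a hedged Python sketch; the coordinates, gap characters, and example sequence are placeholders rather than real Ref16S data.

      # Cut one hypervariable region out of a gapped reference alignment row and
      # apply the 'N' and length filters described above.

      def excise_region(aligned_seq, start, end, min_len=50):
          """Return the unaligned region sequence, or None if it fails the filters."""
          region = aligned_seq[start:end]                       # alignment coordinates
          unaligned = region.replace("-", "").replace(".", "").upper()
          if "N" in unaligned:
              return None                                       # ambiguous base: delete
          if len(unaligned) < min_len:
              return None                                       # shorter than the minimum: delete
          return unaligned

      # toy example: a 6-column "region" with a shorter min_len for demonstration
      print(excise_region("AC--GT-TACGN--ACGT", start=2, end=8, min_len=3))   # 'GTT'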



    Metadata

    Required metadata fields:

    • collection_date: example 2003-03-25
    • geo_loc_name: name of country or Longhurst zone
    • dna_region
    • domain
    • env_biome; env_feature; env_material: EnvO Ontology terms (see the EnvO Ontology Browser). Biome: description of the site; Feature: description of the feature in the biome where the sample was obtained; Material: description of the material
    • env_package
    • target_gene: enter '16s' or '18s' for the section of the Small Subunit rRNA
    • latitude: geographical origin of the sample in decimal degrees (WGS84 system), not DMS
    • longitude: geographical origin of the sample in decimal degrees (WGS84 system), not DMS
    • sequencing_platform: '454', 'illumina', 'ion-torrent', 'sanger' or 'unknown'
    • adapter_sequence: Illumina specific
    • illumina_index: Illumina specific
    • primer_suite: MBL specific
    • run: MBL specific


    File and Name Formats

    Project Name

    No spaces allowed!

    Dataset Name

    No spaces allowed!

    FASTA File

    Metadata File

    There is one metadata file format allowed for import (except TaxBySeq; see below):
    • QIIME style mapping file: This file MUST have a header named '#SampleID', 'sample_name', 'dataset' or 'dataset_name' as the first column name.
      • The Header line is required
      • The columns must be tab delimited (unless commas are required by the specific page)
      • There must be one row for each dataset

    TaxBySeq File and Metadata from Old (legacy) VAMPS

    • See VAMPS Exports
    • Download the TaxBySeq and Metadata files to your hard drive.
    • Import them as a pair either directly (still compressed) or uncompressed.

    JSON Configuration File

    • Required to be valid JSON (try https://jsonlint.com)
    • Requires "source":"VAMPS*"
    • Requires "id_name_hash.ids": [ ] -- a valid list of dataset ids that you have permission to view.
    • Optional:
      1. normalization  (Valid values: "none", "maximum", "frequency")
      2. selected_distance  (Valid values: "morisita-horn", "jaccard", "kulczynski", "canberra" or "bray-curtis")
      3. tax_depth  (Valid values: one of "domain", "phylum", "klass", "order", "family", "genus", "species", "strain")
      4. include_nas  (Valid values: "yes" or "no")
      5. domains  (Valid values: any or all of "Archaea","Bacteria","Eukarya","Organelle","Unknown")
      6. min_range  (Valid values: integers 0-99)
      7. max_range  (Valid values: integers 1-100)
    • Image -- (optional) To render a table, chart or other visual element automatically on upload try:
      • "image":"dheatmap" (added at same level as "source")
        possible values: "dheatmap", "piecharts", "barcharts", "counts_matrix", "metadata_table", "fheatmap", "dendrogram01", "dendrogram03", "pcoa", "pcoa3d", "geospatial" or "adiversity"
      SAMPLE:             
      {
        "source":"VAMPS",
        "post_items":
        { 
          "normalization":"maximum",
          "selected_distance":"morisita_horn",
          "tax_depth":"phylum",
          "domains":["Archaea","Bacteria","Eukarya","Organelle","Unknown"],
          "include_nas":"yes",
          "min_range":0,
          "max_range":100
        },
          "id_name_hash":
          {
            "ids":["49","50","51","52"]
          }
      }
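
      Before uploading, you may want to sanity-check your configuration file locally. The hedged Python sketch below is based only on the requirements listed above; it assumes the asterisk in "VAMPS*" means any value beginning with "VAMPS", and the file name is a placeholder.

      # Local sanity check for a VAMPS JSON configuration file (illustrative only;
      # VAMPS itself may validate differently).
      import json

      VALID_NORMALIZATION = {"none", "maximum", "frequency"}

      def check_config(path):
          with open(path) as handle:
              cfg = json.load(handle)                  # raises if the file is not valid JSON
          problems = []
          if not str(cfg.get("source", "")).startswith("VAMPS"):
              problems.append('"source" should begin with "VAMPS"')
          ids = cfg.get("id_name_hash", {}).get("ids")
          if not isinstance(ids, list) or not ids:
              problems.append('"id_name_hash.ids" should be a non-empty list of dataset ids')
          post = cfg.get("post_items", {})
          if "normalization" in post and post["normalization"] not in VALID_NORMALIZATION:
              problems.append("unknown normalization value")
          if "min_range" in post and not (0 <= post["min_range"] <= 99):
              problems.append("min_range should be an integer 0-99")
          if "max_range" in post and not (1 <= post["max_range"] <= 100):
              problems.append("max_range should be an integer 1-100")
          return problems or ["looks OK"]

      if __name__ == "__main__":
          print(check_config("vamps_config.json"))     # placeholder file name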
                    


    Definitions

    Project

    The project name refers to the overall study or research project to which the data belong. The project ties multiple samples and sequencing runs together.

    Dataset

    The dataset name refers to a set of sequences within the project that are from one sampling location or individual at a particular date and time. The dataset combines sequences sampled or amplified together. Sequence and taxonomic data are uploaded on a dataset by dataset basis. Multiple datasets may be combined together or compared separately when using the Community Visualization tools.

    FASTA Files

    When you upload a file it will be filtered for valid file format and data. If valid, the file will be uploaded into a temporary table of VAMPS data that will be available immediately for viewing.

    FASTA definition line (or defline)

    The FASTA file defline follows the NCBI FASTA format. Each read starts with a '>', and the read ID is between the '>' and the first '|' (a 'pipe' symbol); the ID cannot contain any special characters other than dash '-' or underscore '_' and must be less than 32 characters. If there is any other information on the definition line, it must come after the first '|'. The whole definition line is separated from the sequence data by a return or linefeed.
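
    As a quick check, the hedged Python snippet below encodes the read-ID rules above (letters, digits, dash, and underscore only; fewer than 32 characters; optional information only after the first '|'); the example deflines are hypothetical.

      import re

      # '>' then an ID of 1-31 allowed characters, followed by '|' or end of line
      DEFLINE_ID = re.compile(r"^>([A-Za-z0-9_-]{1,31})(\||$)")

      def valid_defline(line: str) -> bool:
          """True if the defline's read ID obeys the rules described above."""
          return bool(DEFLINE_ID.match(line.strip()))

      print(valid_defline(">read_0001-siteA|additional info about the read"))   # True
      print(valid_defline(">bad id with spaces|info"))                          # False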