VAMPS FAQ

Frequently Asked Questions


Data Request

Raw sequence data is not available from the VAMPS website. To request the original raw data files for projects that are currently on VAMPS, please note the following options:

  • For Illumina projects: If the project was sequenced at the MBL (Bay Paul Center) in Woods Hole, please send an email to Hilary (morrison@mbl.edu) to request the raw FASTQ files. Be sure to specify the project name, as it appears on VAMPS, in your email subject line.
  • For older public projects (e.g., 454 and Ion Torrent), the original data files are probably available from the NCBI SRA site.
  • If the project is a user-uploaded project then you will need to request the data from the project owner. The project owner's email address can be located on the project profile page.
  • If you are unsure then send a note to vamps@mbl.edu detailing your request.


    Tag Generation/Sequence Processing

    In October of 2018, we switched from the Invitrogen Platinum Taq DNA Polymerase High Fidelity (Cat. No. 11304102) to the Invitrogen Platinum SuperFi DNA Polymerase (Cat. #12351-050). The High Fidelity polymerase offers 6X the fidelity of Taq, whereas the SuperFi polymerase offers 100X. The modified PCR master mix recipe is as follows:

    Illumina Amplicon Generation

    Fusion PCR recipe (125 uL):
    • 91.25 uL of water
    • 25 uL of 5X SuperFi Buffer
    • 2.5 uL of 10 mM dNTP Mix
    • 1.25 uL of Platinum SuperFi Polymerase
    • 4 uL of 10 uM Fusion Primer Mix
    • 1 uL of template*
    *More template can be used in place of water if sample concentration is low (< 1 ng).
    Prior to addition of template, 25 uL of master mix is removed to serve as a negative control. After addition of template, master mix is divided into 3 reactions of 33 uL each.
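
    If you need to scale this mix for several templates, the hedged Python sketch below (not an official MBL worksheet) multiplies the per-template volumes above by the number of templates plus an arbitrary 5% pipetting overage; template is left out of the bulk mix and added per reaction, as in the protocol above.

      # Scale the 125 uL SuperFi fusion-PCR master mix for several templates.
      # Template is excluded from the bulk mix and added per reaction
      # (25 uL of mix is removed first as the no-template negative control).

      RECIPE_PER_TEMPLATE_UL = {
          "water": 91.25,
          "5X SuperFi Buffer": 25.0,
          "10 mM dNTP Mix": 2.5,
          "Platinum SuperFi Polymerase": 1.25,
          "10 uM Fusion Primer Mix": 4.0,
      }

      def scaled_master_mix(n_templates, overage=1.05):
          """Return bulk component volumes (uL) for n_templates plus a pipetting overage."""
          factor = n_templates * overage
          return {name: round(vol * factor, 2) for name, vol in RECIPE_PER_TEMPLATE_UL.items()}

      if __name__ == "__main__":
          for name, vol in scaled_master_mix(8).items():
              print(f"{name:30s}{vol:10.2f} uL")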

    v4v5 Program:
    • 94°C for 3 min
    • 30 cycles of:
      • 94°C for 30 sec
      • 57°C for 45 sec
      • 72°C for 1 min
    • 72°C for 2 min
    • Hold at 4°C
    Supplies:
    Invitrogen Platinum SuperFi DNA Polymerase (Cat. #12351-050)
    Thermo Scientific 10 mM dNTP Mixes (Cat. # FERR0192)

    In 2013 we switched to using Illumina platforms for 16S sequencing. As with the 454 strategy, we use fusion primers composed of the Illumina adaptors, multiplexing identifiers, and domain-specific primers. The thermocycling conditions and reaction mixtures for Illumina amplicon PCR differ from those used for 454 sequencing.

    For Archaeal V6:

    • 94C for 2 min
    • 30 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
    • 72C for 2 min
    • Hold at 4C

    For Bacterial and Archaeal V4V5:

    • 94C for 2 min
    • 30 cycles of (94C for 30sec, 57C for 45sec, 72C for 1min)
    • 72C for 2 min
    • Hold at 4C

    For Bacterial V6, the domain-specific primers are used first for 25 cycles; the products are then cleaned and used in a second, 5-cycle PCR with the fusion primers.

    For Bacterial V6:

    • 94C for 2 min
    • 25 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
    • 72C for 2 min
    • Hold at 4C
    • Visualize and quantitate the amplicons in the BioAnalyzer or Caliper.
    • Purify the amplicons using Qiagen MinElute columns.
    • Use product in second PCR:
      • 94C for 2 min
      • 5 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
      • 72C for 2 min
      • Hold at 4C

    After we produce amplicons, we can clean and/or size-select for the target products using Agencourt AMPure XP beads. Then we quantitate products with an Invitrogen Picogreen assay, pool at desired concentrations (e.g. equimolar), and quantitate the final pool with qPCR.
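
    As an illustration of the equimolar pooling arithmetic, the hedged Python sketch below converts PicoGreen mass concentrations and amplicon lengths to molarity (assuming ~660 g/mol per bp of double-stranded DNA) and computes the volume of each product to add to the pool; the sample names, concentrations, and target amount are hypothetical.

      # Compute volumes for an equimolar amplicon pool from PicoGreen
      # concentrations (ng/uL) and amplicon lengths (bp).

      def nanomolar(conc_ng_per_ul: float, length_bp: int) -> float:
          """Convert a dsDNA mass concentration to molarity in nM (== fmol/uL)."""
          return conc_ng_per_ul * 1e6 / (660.0 * length_bp)

      def equimolar_volumes(samples: dict, target_fmol: float = 50.0) -> dict:
          """Volume (uL) of each amplicon needed to contribute target_fmol to the pool."""
          return {name: round(target_fmol / nanomolar(conc, length), 2)
                  for name, (conc, length) in samples.items()}

      if __name__ == "__main__":
          # hypothetical example values: (ng/uL, amplicon length in bp)
          samples = {"sample_A": (12.0, 450), "sample_B": (4.5, 450), "sample_C": (20.0, 380)}
          print(equimolar_volumes(samples))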


    Conserved sequences that flank the hypervariable V6-V4 region of rRNAs serve as primer sites to generate PCR amplicons. Each PCR reaction produces products that can be informatically identified using a unique "key" incorporated between the 454 Life Sciences primer A or B and the 5' flanking rRNA primer. The use of a 5-bp key allows for the synthesis of as many as 81 oligonucleotides that differ by at least two sites. Our multiplexing strategy allows the concurrent collection of 10,000-50,000 tags from each of 8-40 samples in a single nine-hour sequencing run without the use of partitioning gaskets that reduce the number of sequencing wells on the PicoTiterPlate™. Amplicons can be pooled before the emPCR step and each pool is run on a large region of the plate.
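
    The "differ by at least two sites" requirement is easy to check programmatically; the hedged Python sketch below verifies the minimum pairwise Hamming distance of a key set (the key sequences shown are made-up placeholders, not the MBL keys).

      # Verify that a set of 5-bp multiplexing keys differ from one another
      # at two or more positions, as described above.

      from itertools import combinations

      def hamming(a: str, b: str) -> int:
          """Number of positions at which two equal-length keys differ."""
          return sum(x != y for x, y in zip(a, b))

      def min_pairwise_distance(keys) -> int:
          return min(hamming(a, b) for a, b in combinations(keys, 2))

      if __name__ == "__main__":
          keys = ["ACACG", "ACGTC", "AGCAG", "ATAGT"]   # made-up example keys
          assert all(len(k) == 5 for k in keys)
          print("minimum pairwise distance:", min_pairwise_distance(keys))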

    454 Amplicon PCR (Christina Holmes/Ekaterina Andreishcheva) for four reactions:

    • 96 ul water
    • 13.4 ul 10X Platinum buffer
    • 9.4 ul 50 mM MgSO4
    • 2.6 ul 10 mM Pure Peak dNTPs
    • 4 ul 10uM Fusion Primer A
    • 4 ul 10uM Fusion Primer B
    • 2.6 ul 2.5 U/ul Platinum HiFi Pol
    • [2 ul template (~5-25 ng)*]
    • 33 ul total volume/reaction

    *If template stock is dilute or otherwise resistant to amplification, more template can be added in place of water.

    The 5 reactions are the three replicates of the environmental template, the positive control, and the negative control. Template (a plasmid pool for the positive control; water for the negative control) is added as the final step.

    Program:

    • 94C for 2 min
    • 30 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
    • 72C for 2 min
    • Hold at 4C
    • Visualize and quantitate the amplicons in the BioAnalyzer or Caliper.
    • Purify the amplicons using Agencourt AMPure XP beads as described in 454 Sequencing Technical Bulletin 2011-007, and resuspend in 100 uL of Buffer EB.
    • Store purified amplicons at -20C.

    Supplies:

    • DNA 1000 Kit, Agilent, 5067-1504, 25 chips. $372
    • Agencourt AMPure XP, 60 mL, A63881. $1050
    • PurePeak DNA Polymerization Mix 10 mM 1 ml, Pierce/ThermoFisher, NU606001, $115
    • Platinum HiFi Taq polymerase plus buffer and MgSO4, Invitrogen.

    Sequencing Pipeline

    Our sequencing pipeline is public; the details and links are available on GitHub.

    The steps, in order, are: Demultiplexing, Merging, Uniqueing, Chimera Checking, and Taxonomy Assignment.

    We use Meren's scripts to perform some of these steps. You can find them here on his GitHub site: https://github.com/merenlab/illumina-utils/tree/master/scripts

    1. Demultiplexing

    2. Merging (with Quality Control)

    3. Uniqueing

    4. Chimera Checking
      • We use vsearch for chimera checking, combining the results from both the reference-based and de novo runs (one way to combine them is sketched after step 5 below).
      • [sample ref:] vsearch -uchime_ref ACAGTG_ACTGC_1_MERGED-MAX-MISMATCH-3.unique.chg -db rRNA16S.gold.fasta -uchimeout ACAGTG_ACTGC_1_MERGED-MAX-MISMATCH-3.unique.chimeras.db -chimeras ACAGTG_ACTGC_1_MERGED-MAX-MISMATCH-3.unique.chimeras.db.chimeric.fa -strand plus -notrunclabels
      • [sample denovo:] vsearch -uchime_denovo TTAGGC_NNNNCGACG_1_MERGED-MAX-MISMATCH-3.unique.chg -uchimeout TTAGGC_NNNNCGACG_1_MERGED-MAX-MISMATCH-3.unique.chimeras.txt -chimeras TTAGGC_NNNNCGACG_1_MERGED-MAX-MISMATCH-3.unique.chimeras.txt.chimeric.fa -notrunclabels

    5. Taxonomy Assignment (using GAST)
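
    The pipeline combines the reference and de novo chimera results; the exact combination rule is not spelled out here, so the hedged Python sketch below assumes a read flagged by either run is treated as chimeric and removed. The file names are shortened placeholders for the example outputs in step 4.

      # Hypothetical helper: treat a read as chimeric if either vsearch run
      # flagged it, then write the remaining (non-chimeric) reads to a new file.

      def fasta_ids(path):
          """Collect read IDs (text after '>' up to the first whitespace)."""
          with open(path) as handle:
              return {line[1:].split()[0] for line in handle if line.startswith(">")}

      def write_non_chimeric(unique_fa, chimeric_ids, out_fa):
          """Copy records from unique_fa to out_fa, skipping flagged IDs."""
          keep = False
          with open(unique_fa) as src, open(out_fa, "w") as dst:
              for line in src:
                  if line.startswith(">"):
                      keep = line[1:].split()[0] not in chimeric_ids
                  if keep:
                      dst.write(line)

      if __name__ == "__main__":
          flagged = (fasta_ids("sample.unique.chimeras.db.chimeric.fa")
                     | fasta_ids("sample.unique.chimeras.txt.chimeric.fa"))
          write_non_chimeric("sample.unique.fa", flagged, "sample.unique.nonchimeric.fa")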

    Quality Control

    • There are several steps in the quality control process. Some of them are embedded in other pipeline steps (see Sequencing Pipeline above).
    • During demultiplexing we check that the barcodes and indexes match those in the provided metadata; reads that do not match are discarded.
    • Read 1 and read 2 are merged into FASTA files. At this point we also discard low-quality reads, and primers and barcodes are stripped out.
    • Meren's utilities "iu-merge-pairs" and "iu-filter-merged-reads" are used for part of this step (see Sequencing Pipeline above).
      • Filters used during merging (a sequence is discarded if it fails any of these tests; a minimal sketch of the logic follows this list):
        1. Is the prefix correct for read_1 and read_2?
        2. Is the "P" value acceptable? (The P value is the ratio of the number of mismatches to the length of the overlap; merged sequences can be discarded based on this ratio.)
        3. Does the read have fewer than the maximum number of mismatches?
        4. Does the read contain any 'N's?
        5. Is the quality better than Q30? (Phred quality score: a Q score of 30 (Q30) corresponds to an incorrect base call probability of 1 in 1000.)
        6. Is there more than the minimum expected overlap?

      • Default filter values:
        • "P" value: 0.300000
        • Maximum number of mismatches in the overlapped region: None
        • Minimum overlap size: 15
        • Minimum Q-score for mismatches: 15
        • Q30 enforced?: True
                        
    • Chimera checking using vsearch, both denovo and reference based.
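
    Here is a hedged Python illustration of the merge-time filter logic above. The production pipeline relies on Meren's illumina-utils, whose exact definitions of the prefix and Q30 checks (items 1 and 5) should be taken from that package, so those two are omitted here; the thresholds are the default values listed above.

      def passes_merge_filters(merged_seq, overlap_len, n_mismatches, mismatch_qscores,
                               max_p=0.30, min_overlap=15, max_mismatches=None,
                               min_mismatch_q=15):
          """Apply the default-value filters (items 2, 3, 4, 6) to one merged read."""
          if overlap_len < min_overlap:
              return False                        # item 6: minimum expected overlap
          if max_mismatches is not None and n_mismatches > max_mismatches:
              return False                        # item 3: maximum number of mismatches
          if n_mismatches / overlap_len > max_p:
              return False                        # item 2: "P" value (mismatch ratio)
          if "N" in merged_seq.upper():
              return False                        # item 4: ambiguous base calls
          if any(q < min_mismatch_q for q in mismatch_qscores):
              return False                        # minimum Q-score at mismatch positions
          return True

      # Example: 2 mismatches over a 60 nt overlap (ratio 0.033) passes the defaults.
      print(passes_merge_filters("ACGT" * 60, overlap_len=60, n_mismatches=2,
                                 mismatch_qscores=[34, 31]))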

    What is GAST?

    GAST stands for Global Alignment for Sequence Taxonomy.

    It uses a reference database of SSU sequences to determine the taxonomy of hypervariable region tags. The specifics are described in the citation below.

    Citation:
    Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML (2008) 
    Exploring Microbial Diversity and Taxonomy Using SSU rRNA Hypervariable Tag Sequencing. 
    PLoS Genet 4(11): e1000255. https://doi.org/10.1371/journal.pgen.1000255        
              

    Exporting (Downloading) Data

    There is no raw sequence data available on VAMPS.

    The primary way to export data from VAMPS is to use the 'Download Data' page, which is accessed from the main 'Sample Selection' page. After you've selected the datasets you want to include in your download, select the blue 'Download Data' button.
    You can download data from VAMPS in various formats. Listed on the download page are the formats (with descriptions) available to you.

    • Fasta (3 formats available)
    • Counts Matrix (2 formats available)
    • Metadata (2 formats available)
    • Biom (Modified JSON format)

    There are other downloading opportunities available for data and images on the Visualization pages.

    Importing Data

    Description of Import Options

    To upload data to VAMPS, use the upload page.

    Fasta files and count matrix files can be uploaded to VAMPS as long as they conform to the correct format which is described on the upload page for each type.

    Multiple-Dataset fasta file

    Metadata file (for projects already in VAMPS)

    This is nominally a CSV (comma-separated values) file, but the data are actually separated by <TAB> characters to make the file more human readable. These files must conform to the QIIME mapping file format. Since the uploaded data are already processed (trimmed), the 'BarcodeSequence' and 'LinkerPrimerSequence' fields can be left empty, but the header names must be present. The #SampleID field must be present and there can be no duplicate sample names. The Description field must also be present and must be the last field. A minimal example is shown below. There is a handy 'validate_mapping_file.py' script available in QIIME to assist you with this file.
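
    A minimal, hypothetical example (sample names and values are placeholders; the columns must be separated by real tab characters, shown here as spacing for readability):

      #SampleID    BarcodeSequence    LinkerPrimerSequence    collection_date    Description
      sample_01                                               2003-03-25         surface water, station 1
      sample_02                                               2003-04-02         surface water, station 2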


    Clustering and Diversity

    How are SLP clusters created?

    A combination of ESPRIT, SLP, and mothur computes taxonomy-independent clusters (Operational Taxonomic Units, OTUs) using the total collection of available V6 sequences in VAMPS. The sequences were binned into separate datasets for the Archaeal and Eukaryal domains, and into Bacterial phylum- or Proteobacterial class-level datasets. For each bin, the unique.seqs command in mothur eliminated duplicate sequences while retaining the observed frequency of each unique read. The kmerdist module of ESPRIT (with default values) identified all sequence pairs within each bin predicted to be at least 90% similar. The needledist module of ESPRIT then generated a sparse matrix of pairwise distances by performing a Needleman-Wunsch alignment on those sequence pairs and calculating distances with quickdist.

    SLP uses the pairwise distances to perform a modified single-linkage preclustering at 2% to reduce noise in the sequence data. SLP first orders the sequences by rank abundance and then steps through the ordered sequences, assigning them to clusters. The most abundant sequence defines the first cluster. Each subsequent sequence is tested against the growing list of clusters using the single-linkage algorithm: if the sequence has a pairwise distance of less than 0.02 (equivalent to a single difference in the V6 region) to any sequence already in a cluster, it is added to that cluster and not tested against subsequent clusters; if it is not within a distance of 0.02 of any read in any existing cluster, it establishes a new cluster. Once all sequences have been assigned, sequences in low-abundance clusters (< 10 tags) are tested against the larger clusters and added to them if possible. For each precluster, SLP passes the most frequent sequence and the total tag count of the precluster to average-linkage clustering in mothur. The taxonomy of each cluster is assigned by a two-thirds majority of the taxonomies of its members, and CatchAll estimates richness. A minimal sketch of the preclustering step follows.
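
    The hedged Python sketch below illustrates the preclustering logic only (it is not the published SLP code); it assumes a precomputed sparse pairwise-distance map, as produced by the needledist step, and omits the low-abundance re-test and the subsequent mothur average-linkage clustering.

      def slp_precluster(seq_counts, pair_dist, threshold=0.02):
          """
          seq_counts: {sequence_id: observed abundance}
          pair_dist:  {(id_a, id_b): distance} for pairs predicted >= 90% similar
          Returns a list of clusters (lists of sequence ids), seeded in rank-abundance order.
          """
          def dist(a, b):
              return pair_dist.get((a, b), pair_dist.get((b, a), 1.0))

          ordered = sorted(seq_counts, key=seq_counts.get, reverse=True)
          clusters = []
          for seq in ordered:
              for cluster in clusters:
                  # single linkage: join the first cluster with any member within
                  # the distance threshold, then stop looking
                  if any(dist(seq, member) <= threshold for member in cluster):
                      cluster.append(seq)
                      break
              else:
                  clusters.append([seq])     # no close cluster found: start a new one
          return clusters

      if __name__ == "__main__":
          counts = {"u1": 120, "u2": 45, "u3": 3}
          dists = {("u1", "u2"): 0.01, ("u1", "u3"): 0.08}
          print(slp_precluster(counts, dists))   # -> [['u1', 'u2'], ['u3']]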

    Citation:
      Huse, S.M., D. Mark Welch, H.G Morrison, and M.L. Sogin. (2010) 
      Ironing out the wrinkles in the rare biosphere. Environmental Microbiology early view.
    

    Phyloseq and ggplot2

    What is Phyloseq?

    Phyloseq is an R library for microbiome data (https://joey711.github.io/phyloseq/).
    VAMPS uses R (https://www.r-project.org/) and Python (https://www.python.org/) scripts to produce some of the visualizations.

    To use the phyloseq library with your VAMPS data, download the three Phyloseq files from the 'Display Choices' page, then import them directly into R (or an R script) as shown below to create a phyloseq object:

      # Input files downloaded from the VAMPS 'Display Choices' page
      biom_file <- 'phyloseq.biom'
      tax_file  <- 'phyloseq_taxonomy.txt'
      map_file  <- 'phyloseq_metadata.txt'

      library(phyloseq)
      library(ggplot2)

      # Read the taxonomy table, the BIOM counts, and the sample metadata
      TAX <- as.matrix(read.table(tax_file, header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
      OTU <- import_biom(biom_file)
      MAP <- import_qiime_sample_data(map_file)

      # Wrap the components and combine them into a single phyloseq object
      TAX <- tax_table(TAX)
      OTU <- otu_table(OTU)
      physeq <- phyloseq(OTU, TAX, MAP)

    See the Phyloseq website for more help and examples.


    What is ggplot2?

    ggplot2 is an R library for producing high-quality graphics from data such as VAMPS downloads.
    See https://ggplot2.tidyverse.org/ for more information.

    Reference Databases

    How are the reference databases created?

    We create Ref16S, a reference database of aligned full-length sequences based on all available sequences in SILVA exported using the ARB software. New updates to both SILVA and RDP are incorporated as they become available.

    • We flag as low-quality and delete all sequences with a sequence quality score <= 50, an alignment score <= 50 or a pintail (chimera) score <= 40.
    • We flag as redundant and delete all exact copies of the full-length sequence.
    • We classify all bacterial and archaeal sequences directly with the Ribosomal Database Project (RDP) Classifier. We use only RDP classifications with a bootstrap value of >=80%. If the bootstrap value is <80%, the taxonomic assignment is moved to a higher classification level until an 80% or better bootstrap value is achieved. For example, if the genus assignment had a bootstrap value of 70% but the family had a value of 85%, that sequence would be assigned only as far as family and not to genus. The RDP Classifier does not classify sequences below the genus level.
    • We incorporate other taxonomy sources, such as Entrez Genome accession numbers or researcher knowledge of specific entries, as they become available. These "other sources" are used preferentially over RDP for bacteria and archaea. RDP does not classify eukaryotes; for eukaryote taxonomies, we use the EMBL taxonomy from the SILVA database where we do not have other sources.
    • We create hypervariable region specific databases (RefV6, RefV3, RefV9, etc.).
      • For each hypervariable region we calculate the Ref16S alignment coordinates.
      • We then excise from the Ref16S aligned sequences the section corresponding to the hypervariable region.
      • The gaps are removed from the aligned sequences to create a set of unaligned sequences. Any hypervariable sequences that contain an 'N' are deleted, as are any shorter than 50 nt and any full-length sequences that were not sequenced all the way through the specific hypervariable region. Two reference IDs are assigned to each reference hypervariable region: a ref16s_id (previously alt_local_gi) links the hypervariable sequence to its source full-length sequence, and a second ID, e.g., refv6_id (previously known as local_gi), identifies all entries having exactly the same hypervariable-region sequence. The taxonomy is carried directly from the full-length source. (A short sketch of the excision and filtering step follows this list.)
      • Unique reference hypervariable-region sequences are exported to a BLAST-searchable database for use in assigning taxonomy to pyrosequencing reads through GAST.
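
    As a rough illustration of the excision and filtering step above, here is a hedged Python sketch; the coordinates, gap characters, and example sequence are placeholders rather than real Ref16S data.

      # Cut one hypervariable region out of a gapped reference alignment row and
      # apply the 'N' and length filters described above.

      def excise_region(aligned_seq, start, end, min_len=50):
          """Return the unaligned region sequence, or None if it fails the filters."""
          region = aligned_seq[start:end]                       # alignment coordinates
          unaligned = region.replace("-", "").replace(".", "").upper()
          if "N" in unaligned:
              return None                                       # ambiguous base: delete
          if len(unaligned) < min_len:
              return None                                       # shorter than the minimum: delete
          return unaligned

      # toy example: a 6-column "region" with a shorter min_len for demonstration
      print(excise_region("AC--GT-TACGN--ACGT", start=2, end=8, min_len=3))   # 'GTT'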



    Metadata

    Required metadata fields:

    • collection_date: example 2003-03-25
    • geo_loc_name: name of country or Longhurst zone
    • dna_region
    • domain
    • env_biome; env_feature; env_material: EnvO Ontology terms (see the EnvO Ontology Browser). Biome: description of the site; Feature: description of the feature in the biome where the sample was obtained; Material: description of the material
    • env_package
    • target_gene: enter '16s' or '18s' for the section of the Small Subunit rRNA
    • latitude: geographical origin of the sample in decimal degrees (WGS84 system), not DMS
    • longitude: geographical origin of the sample in decimal degrees (WGS84 system), not DMS
    • sequencing_platform: '454', 'illumina', 'ion-torrent', 'sanger' or 'unknown'
    • adapter_sequence: Illumina specific
    • illumina_index: Illumina specific
    • primer_suite: MBL specific
    • run: MBL specific


    File and Name Formats

    Project Name

    No spaces allowed!

    Dataset Name

    No spaces allowed!

    FASTA File

    Metadata File

    There is one metadata file format allowed for import (except TaxBySeq; see below):
    • QIIME style mapping file: This file MUST have a header named '#SampleID', 'sample_name', 'dataset' or 'dataset_name' as the first column name.
      • The Header line is required
      • The columns must be tab delimited (unless commas are required by the specific page)
      • There must be one row for each dataset

    TaxBySeq File and Metadata from Old (legacy) VAMPS

    • See VAMPS Exports
    • Download the TaxBySeq and Metadata files to your hard drive.
    • Import them as a pair either directly (still compressed) or uncompressed.

    JSON Configuration File

    • Required to be valid JSON (try https://jsonlint.com)
    • Requires "source":"VAMPS*"
    • Requires "id_name_hash.ids": [ ] -- a valid list of dataset ids that you have permission to view.
    • Optional:
      1. normalization  (Valid values: "none", "maximum", "frequency")
      2. selected_distance  (Valid values: "morisita-horn", "jaccard", "kulczynski", "canberra" or "bray-curtis")
      3. tax_depth  (Valid values: one of "domain", "phylum", "klass", "order", "family", "genus", "species", "strain")
      4. include_nas  (Valid values: "yes" or "no")
      5. domains  (Valid values: any or all of "Archaea","Bacteria","Eukarya","Organelle","Unknown")
      6. min_range  (Valid values: integers 0-99)
      7. max_range  (Valid values: integers 1-100)
    • Image -- (optional) To render a table, chart or other visual element automatically on upload try:
      • "image":"dheatmap" (added at same level as "source")
        possible values: "dheatmap", "piecharts", "barcharts", "counts_matrix", "metadata_table", "fheatmap", "dendrogram01", "dendrogram03", "pcoa", "pcoa3d", "geospatial" or "adiversity"
      SAMPLE:             
      {
        "source":"VAMPS",
        "post_items":
        { 
          "normalization":"maximum",
          "selected_distance":"morisita_horn",
          "tax_depth":"phylum",
          "domains":["Archaea","Bacteria","Eukarya","Organelle","Unknown"],
          "include_nas":"yes",
          "min_range":0,
          "max_range":100
        },
          "id_name_hash":
          {
            "ids":["49","50","51","52"]
          }
      }
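
      Before uploading, you may want to sanity-check your configuration file locally. The hedged Python sketch below is based only on the requirements listed above; it assumes the asterisk in "VAMPS*" means any value beginning with "VAMPS", and the file name is a placeholder.

      # Local sanity check for a VAMPS JSON configuration file (illustrative only;
      # VAMPS itself may validate differently).
      import json

      VALID_NORMALIZATION = {"none", "maximum", "frequency"}

      def check_config(path):
          with open(path) as handle:
              cfg = json.load(handle)                  # raises if the file is not valid JSON
          problems = []
          if not str(cfg.get("source", "")).startswith("VAMPS"):
              problems.append('"source" should begin with "VAMPS"')
          ids = cfg.get("id_name_hash", {}).get("ids")
          if not isinstance(ids, list) or not ids:
              problems.append('"id_name_hash.ids" should be a non-empty list of dataset ids')
          post = cfg.get("post_items", {})
          if "normalization" in post and post["normalization"] not in VALID_NORMALIZATION:
              problems.append("unknown normalization value")
          if "min_range" in post and not (0 <= post["min_range"] <= 99):
              problems.append("min_range should be an integer 0-99")
          if "max_range" in post and not (1 <= post["max_range"] <= 100):
              problems.append("max_range should be an integer 1-100")
          return problems or ["looks OK"]

      if __name__ == "__main__":
          print(check_config("vamps_config.json"))     # placeholder file name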
                    


    Definitions

    Project

    The project name refers to the overall study or research project to which the data belong. The project ties multiple samples and sequencing runs together.

    Dataset

    The dataset name refers to a set of sequences within the project that are from one sampling location or individual at a particular date and time. The dataset combines sequences sampled or amplified together. Sequence and taxonomic data are uploaded on a dataset by dataset basis. Multiple datasets may be combined together or compared separately when using the Community Visualization tools.

    FASTA Files

    When you upload a file it will be filtered for valid file format and data. If valid, the file will be uploaded into a temporary table of VAMPS data that will be available immediately for viewing.

    FASTA definition line (or defline)

    The FASTA file defline follows the NCBI FASTA format. Each read starts with a '>', and the read ID is between the '>' and the first '|' (a 'pipe' symbol); the ID cannot contain any special characters other than dash '-' or underscore '_' and must be less than 32 characters. If there is any other information on the definition line, it must come after the first '|'. The whole definition line is separated from the sequence data by a return or linefeed.
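
    As a quick check, the hedged Python snippet below encodes the read-ID rules above (letters, digits, dash, and underscore only; fewer than 32 characters; optional information only after the first '|'); the example deflines are hypothetical.

      import re

      # '>' then an ID of 1-31 allowed characters, followed by '|' or end of line
      DEFLINE_ID = re.compile(r"^>([A-Za-z0-9_-]{1,31})(\||$)")

      def valid_defline(line: str) -> bool:
          """True if the defline's read ID obeys the rules described above."""
          return bool(DEFLINE_ID.match(line.strip()))

      print(valid_defline(">read_0001-siteA|additional info about the read"))   # True
      print(valid_defline(">bad id with spaces|info"))                          # False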