VAMPS FAQ

Frequently Asked Questions


Clustering and Diversity

How are SLP clusters created?

A combination of ESPRIT, SLP and mothur computes taxonomy-independent clusters (Operational Taxonomic Units - OTUs)
using the total collection of available V6 sequences in VAMPS. The sequences were binned into separate datasets for the
Archaeal or Eukaryal domains, and into Bacterial phylum- or Proteobacterial class-level datasets. For each bin, the unique.seqs
function in mothur eliminated duplicate sequences but retained information about observed frequencies for each unique read.
The kmerdist module of ESPRIT (with default values) identified all sequence pairs within each bin that are predicted to be at
least 90% similar. The needledist module in ESPRIT generated a sparse matrix of pairwise distances by performing a Needleman-Wunsch
alignment on the sequence pairs and calculating pairwise distances using quickdist. The SLP algorithm uses the pairwise distances
to perform a modified single-linkage preclustering at 2% to reduce noise in the sequence data. Initially, SLP orders sequences
according to their rank abundance and then steps through the ordered sequences assigning them to clusters. The most abundant
sequence defines the first cluster. Each subsequent sequence is tested against the growing list of clusters using the
single-linkage algorithm. If the sequence has a pairwise distance less than 0.02 (equivalent to a single difference in the V6 region)
to any of the sequences already in the cluster, the new sequence will be added to the cluster and not tested against subsequent clusters.
If the sequence is not within a distance of 0.02 from any read in any of the existing clusters, it will establish a new cluster.
Once all sequences have been assigned to clusters, sequences in the low abundance clusters (< 10 tags) are tested against
the larger clusters and added to those clusters if possible. For each precluster, SLP uses the sequence with the highest
frequency and the count of all tags in the precluster for average linkage clustering by mothur. Taxonomy for each cluster
is assigned by a two-thirds majority of the taxonomy of its member sequences; CATCHALL estimates the richness.
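
The preclustering steps described above can be sketched in Python. This is a minimal illustration and not the SLP source code: the function names, the dictionary-based sparse distance matrix, and the tie-breaking by sequence id are assumptions made for the example. Pairs missing from the sparse matrix are treated as too distant to link.

```python
WIDTH = 0.02           # preclustering distance threshold quoted in the text
MIN_CLUSTER_TAGS = 10  # clusters with fewer tags are re-tested against larger ones


def slp_precluster(counts, dist):
    """counts: {seq_id: tag count}; dist: {frozenset({a, b}): pairwise distance}."""

    def linked(seq, cluster):
        # Single linkage: one close member is enough to join the cluster.
        return any(dist.get(frozenset((seq, m)), 1.0) < WIDTH for m in cluster)

    def place(seq, clusters):
        # Add seq to the first cluster it links to; report whether it was placed.
        for cluster in clusters:
            if linked(seq, cluster):
                cluster.append(seq)
                return True
        return False

    def tags(cluster):
        return sum(counts[s] for s in cluster)

    # Step through sequences in rank-abundance order (ties broken by id).
    clusters = []
    for seq in sorted(counts, key=lambda s: (-counts[s], s)):
        if not place(seq, clusters):
            clusters.append([seq])

    # Re-test members of low-abundance clusters against the larger clusters.
    large = [c for c in clusters if tags(c) >= MIN_CLUSTER_TAGS]
    small = [c for c in clusters if tags(c) < MIN_CLUSTER_TAGS]
    rest = []
    for cluster in small:
        unplaced = []
        for seq in cluster:
            if not place(seq, large):
                unplaced.append(seq)
        if unplaced:
            rest.append(unplaced)
    return large + rest


def summarize(clusters, counts):
    """Most abundant sequence and total tag count for each precluster."""
    return [(max(c, key=lambda s: counts[s]), sum(counts[s] for s in c))
            for c in clusters]
```

In this sketch a rare sequence that initially founds its own cluster can still be absorbed in the second pass if a later member of a large cluster lies within the threshold.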

Citation:
  Huse, S.M., D. Mark Welch, H.G. Morrison, and M.L. Sogin. (2010)
  Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environmental Microbiology early view.

Sequence Processing



Exporting Taxonomic Counts



Exporting Data



Importing Data

Description of Import Options

There are currently two fasta file formats that are accepted for uploading sequences to VAMPS. Both of these methods require processed data. Raw data is not accepted yet but we intend to have the MBL trimming pipeline included in the future.

Single-Dataset Fasta File

The defline should include the uniqueId after the '>' symbol. If there is anything else on the line it must be separated from the uniqueId by a space or pipe '|':

>uniqueId
or
>uniqueId|any_other_information
As in any other FASTA-style file, the sequence is on the line immediately following the defline.
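
As a rough illustration of the single-dataset format, the Python sketch below reads defline/sequence pairs and keeps only the uniqueId. The function name `parse_single_dataset_fasta` is hypothetical, and the sketch assumes each sequence occupies exactly one line, as described above.

```python
import re


def parse_single_dataset_fasta(text):
    """Return a list of (unique_id, sequence) pairs from FASTA text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected one sequence line after every defline")
    records = []
    for defline, sequence in zip(lines[0::2], lines[1::2]):
        if not defline.startswith(">"):
            raise ValueError(f"defline must start with '>': {defline}")
        # Anything after the first space or pipe is extra information.
        unique_id = re.split(r"[ |]", defline[1:], maxsplit=1)[0]
        records.append((unique_id, sequence))
    return records
```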

Multiple-Dataset fasta file

Metadata file

This is nominally a csv (comma separated values) file, but the fields are actually separated by tabs to make it more human readable. The file must conform to the QIIME mapping file format. Because the uploaded data are already processed (trimmed), the 'BarcodeSequence' and 'LinkerPrimerSequence' fields can be left empty, but the header names have to be present. The #SampleID field must be present and there can be no duplicate sample names. The Description field must also be present and must be the last field. There is a handy 'validate_mapping_file.py' script available in QIIME to assist you with this file.
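
The requirements listed above can be checked before upload with a few lines of Python. This is only a sketch of those specific rules and is not a substitute for QIIME's own validation script; the function name `check_mapping_file` and the wording of the messages are made up for the example.

```python
import csv
import io


def check_mapping_file(text):
    """Return a list of problems found in a tab-separated mapping file."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    # These header names must exist even when their columns are left empty.
    for name in ("#SampleID", "BarcodeSequence", "LinkerPrimerSequence"):
        if name not in header:
            problems.append(f"missing header: {name}")
    if "Description" not in header:
        problems.append("missing header: Description")
    elif header[-1] != "Description":
        problems.append("Description must be the last field")
    # Sample names must be unique.
    if "#SampleID" in header:
        col = header.index("#SampleID")
        seen = set()
        for row in data:
            sample = row[col] if col < len(row) else ""
            if sample in seen:
                problems.append(f"duplicate sample name: {sample}")
            seen.add(sample)
    return problems
```

An empty list means none of the rules checked here were violated.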

TaxBySeq File

This is a file that has to be downloaded from the old VAMPS (legacy) program on its export_data page. On that page you can choose your data by whole project or mix together datasets from different projects. Select the 'TaxBySeq File' option and make sure you are selecting raw counts (Not Normalized). When the data are ready, you can download the compressed file from the data retrieval page. To get the data into New VAMPS, the TaxBySeq file should be on your hard drive and not compressed. From the New VAMPS page 'import_choices', choose the TaxBySeq radio button. You can then choose to upload all the data under a new single project name of your choice or to keep the original VAMPS names (the data will separate into multiple projects if they were originally selected that way).

How to convert data from the old vamps (for administrators)

If you have access to the old (legacy) vamps or vampsdev database, you can run these two mysql commands to create separate sequence and metadata files for import into the new vamps database. Alter the commands below to include the desired project name and, if the project is a user-uploaded project, change 'vamps_sequences' to 'vamps_sequences_pipe'. Also, if you are using vampsdev, change 'vampsdb' to 'vampsdev'.

  • mysql -B -h vampsdb vamps -e "SELECT * FROM vamps_metadata where project='projectName';" |sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > metadata.csv
  • mysql -B -h vampsdb vamps -e "SELECT * FROM vamps_sequences where project='projectName';" |sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > sequences.csv
There is a script named 'convert_old_vamps_project.py' in the public/scripts directory that will convert the csv files and install the data into the new vamps database.


Phyloseq

What is Phyloseq?

Phyloseq is an R library for microbiome data: (https://joey711.github.io/phyloseq/).
VAMPS uses R (https://www.r-project.org/) and Python (https://www.python.org/) scripts to produce some of the visualizations.

To use the Phyloseq library with your VAMPS data, download the three Phyloseq files from the 'Display Choices' page.
Then import them directly into R (or an R script) as shown below to create a Phyloseq object:

  • biom_file <- 'phyloseq.biom'
  • tax_file <- 'phyloseq_taxonomy.txt'
  • map_file <- 'phyloseq_metadata.txt'
  • library(phyloseq)
  • library(ggplot2)
  • TAX<-as.matrix(read.table(tax_file, header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
  • OTU <- import_biom(biom_file)
  • MAP <- import_qiime_sample_data(map_file)
  • TAX <- tax_table(TAX)
  • OTU <- otu_table(OTU)
  • physeq <- phyloseq(OTU,TAX,MAP)

See the Phyloseq website for more help and examples.


Reference Databases

How are the reference databases created?

We create Ref16S, a reference database of aligned full-length sequences based on all available sequences in SILVA exported using the ARB software. New updates to both SILVA and RDP are incorporated as they become available.

  • We flag as low-quality and delete all sequences with a sequence quality score <= 50, an alignment score <= 50 or a pintail (chimera) score <= 40.
  • We flag as redundant and delete all exact copies of the full-length sequence.
  • We classify all bacterial and archaeal sequences directly with the Ribosomal Database Project (RDP) Classifier. We use only RDP classifications with a bootstrap value of >=80%. If the bootstrap value is <80%, the taxonomic assignment is moved to a higher classification level until an 80% or better bootstrap value is achieved. For example, if the genus assignment has a bootstrap value of 70% but the family has a value of 85%, that sequence is assigned only as far as family and not to genus. The RDP Classifier does not classify sequences below the genus level.
  • We incorporate other taxonomy sources, such as Entrez Genome accession numbers or researcher knowledge of specific entries, as they become available. These "other sources" are used preferentially over the RDP for bacteria and archaea. RDP does not classify eukaryotes. For eukaryota taxonomies, we use the EMBL taxonomy from the SILVA database where we do not have other sources.
  • We create hypervariable region specific databases (RefV6, RefV3, RefV9, etc.).
    • For each hypervariable region we calculate the Ref16S alignment coordinates.
    • We then excise from the Ref16S aligned sequences the section corresponding to the hypervariable region.
    • The gaps are removed from the aligned sequences to create a set of unaligned sequences. Any hypervariable sequences that contain an 'N' are deleted. Any hypervariable sequences shorter than 50 nt are deleted. Any full-length sequences that were not sequenced all the way through the specific hypervariable region are deleted. Two reference IDs are assigned to each reference hypervariable region. A ref16s_id (previously alt_local_gi) links the hypervariable sequence with its source full-length sequence, and a second, e.g., refv6_id (previously known as local_gi) is used to identify all entries having the exact same sequence of the hypervariable region. The taxonomy is carried directly from the full-length source.
    • Unique reference hypervariable region sequences are exported to a blastable database for use in the assignment of taxonomy to pyrosequencing reads through GAST.
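
The 80% bootstrap rule from the classification step above can be illustrated with a short Python sketch. The input layout (a list of rank, name, bootstrap tuples ordered from domain downward) and the function name are assumptions for the example and do not reflect the actual RDP Classifier output format; the taxon names are purely illustrative.

```python
MIN_BOOTSTRAP = 80  # percent, as stated in the text


def trim_by_bootstrap(assignment, min_bootstrap=MIN_BOOTSTRAP):
    """Drop the deepest ranks until the last remaining rank meets the cutoff.

    assignment: list of (rank, name, bootstrap_percent), domain first.
    """
    trimmed = list(assignment)
    while trimmed and trimmed[-1][2] < min_bootstrap:
        trimmed.pop()  # move the assignment up one classification level
    return trimmed


example = [
    ("domain", "Bacteria", 100),
    ("phylum", "Proteobacteria", 100),
    ("class", "Gammaproteobacteria", 98),
    ("order", "Enterobacteriales", 95),
    ("family", "Enterobacteriaceae", 85),
    ("genus", "Escherichia", 70),
]
kept = trim_by_bootstrap(example)
```

With a genus bootstrap of 70% and a family bootstrap of 85%, the sequence in this example is assigned only as far as family.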


Sample Submission Process



Tag Generation

Conserved sequences that flank the hypervariable V6-V4 region of rRNAs serve as primer sites to generate PCR amplicons. Each PCR reaction produces products that can be informatically identified using a unique "key" incorporated between the 454 Life Sciences primer A or B and the 5' flanking rRNA primer. The use of a 5-bp key allows for the synthesis of as many as 81 oligonucleotides that differ by at least two sites. Our multiplexing strategy allows the concurrent collection of 10,000-50,000 tags from each of 8-40 samples in a single nine-hour sequencing run without use of partitioning gaskets that reduce the number of sequencing wells on the PicoTiterPlate™. Amplicons can be pooled before the emPCR step and each pool is run on a large region of the plate.

454 Amplicon PCR (Christina Holmes/Ekaterina Andreishcheva) for four reactions:

  • 96 ul water
  • 13.4 ul 10X Platinum buffer
  • 9.4 ul 50 mM MgSO4
  • 2.6 ul 10 mM Pure Peak dNTPs
  • 4 ul 10uM Fusion Primer A
  • 4 ul 10uM Fusion Primer B
  • 2.6 ul 2.5 U/ul Platinum HiFi Pol
  • [2 ul template (~5-25 ng)*]
  • 33 ul total volume/reaction

*If template stock is dilute or otherwise resistant to amplification, more template can be added in place of water.
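
As a quick arithmetic check, the listed master mix components (template excluded) sum to 132 ul, which is 33 ul for each of four reactions. The Python sketch below reproduces that sum and scales the recipe to another number of reactions. The helper name `scale_mix` is hypothetical, and because the recipe does not spell out how the 2 ul of template fits into the per-reaction total, the sketch leaves template out.

```python
# Volumes in ul for four reactions, copied from the list above (template excluded).
MASTER_MIX_UL = {
    "water": 96.0,
    "10X Platinum buffer": 13.4,
    "50 mM MgSO4": 9.4,
    "10 mM Pure Peak dNTPs": 2.6,
    "10 uM Fusion Primer A": 4.0,
    "10 uM Fusion Primer B": 4.0,
    "2.5 U/ul Platinum HiFi Pol": 2.6,
}
REACTIONS = 4

total_ul = round(sum(MASTER_MIX_UL.values()), 1)
per_reaction_ul = round(total_ul / REACTIONS, 1)


def scale_mix(n_reactions):
    """Scale every component linearly to a different number of reactions."""
    factor = n_reactions / REACTIONS
    return {name: round(ul * factor, 2) for name, ul in MASTER_MIX_UL.items()}


print(total_ul, per_reaction_ul)
print(scale_mix(8))
```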

The 5 reactions are the three replicates of the environmental template, positive control, and negative control. Template (plasmid pool for positive control; water for negative control) is added as final step.

Program:

  • 94C for 2 min
  • 30 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
  • 72C for 2 min
  • Hold at 4C
  • Visualize and quantitate the amplicons in the BioAnalyzer or Caliper.
  • Purify the amplicons using Agencourt AMPure XP beads as described in 454 Sequencing Technical Bulletin 2011-007, and resuspend in 100 uL of Buffer EB.
  • Store purified amplicons at -20C.

Supplies:

  • DNA 1000 Kit, Agilent, 5067-1504, 25 chips. $372
  • Agencourt AMPure XP, 60 mL, A63881. $1050
  • PurePeak DNA Polymerization Mix 10 mM 1 ml, Pierce/ThermoFisher, NU606001, $115
  • Platinum HiFi Taq polymerase plus buffer and MgSO4, Invitrogen.

In 2013 we switched to using Illumina platforms for 16S sequencing. Similar to the strategy for 454, we use fusion primers composed of the Illumina adaptors, multiplexing identifiers, and domain-specific primers. Thermocycling and reaction mixtures are different from 454 sequencing for Illumina amplicon PCR.

For Archaeal V6:

  • 94C for 2 min
  • 30 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
  • 72C for 2 min
  • Hold at 4C

For Bacterial and Archaeal V4V5:

  • 94C for 2 min
  • 30 cycles of (94C for 30sec, 57C for 45sec, 72C for 1min)
  • 72C for 2 min
  • Hold at 4C

Bacterial V6 uses first the domain-specific primers for 25 cycles, then the products are cleaned and used in a second 5-cycle PCR with fusion primers.

For Bacterial V6:

  • 94C for 2 min
  • 25 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
  • 72C for 2 min
  • Hold at 4C
  • Visualize and quantitate the amplicons in the BioAnalyzer or Caliper.
  • Purify the amplicons using Qiagen MinElute columns.
  • Use product in second PCR:
    • 94C for 2 min
    • 5 cycles of (94C for 30sec, 60C for 45sec, 72C for 1min)
    • 72C for 2 min
    • Hold at 4C

After we produce amplicons, we can clean and/or size-select for the target products using Agencourt AMPure XP beads. Then we quantitate products with an Invitrogen Picogreen assay, pool at desired concentrations (e.g. equimolar), and quantitate the final pool with qPCR.


Metadata

Required metadata fields:

  • collection_date: Example: 2003-03-25
  • geo_loc_name: Name of country or Longhurst zone
  • dna_region
  • domain
  • env_biome; env_feature; env_material (see the Envo Ontology Browser):
    • Biome: Description of the site
    • Feature: Description of feature in the biome where sample was obtained
    • Material: Description of material
  • env_package
  • target_gene: Enter '16s' or '18s' for the section of the Small Subunit rRNA
  • latitude: Geographical origin of sample in decimal degrees (WGS84 system), not DMS
  • longitude: Geographical origin of sample in decimal degrees (WGS84 system), not DMS
  • sequencing_platform: '454', 'illumina', 'ion-torrent', 'sanger' or 'unknown'
  • adapter_sequence: Illumina specific
  • illumina_index: Illumina specific
  • primer_suite: MBL specific
  • run: MBL specific


File and Name Formats

Project Name

No spaces allowed!

Dataset Name

No spaces allowed!

FASTA File

Metadata File

There is one metadata file format allowed for import (except for TaxBySeq; see below):
  • QIIME style mapping file: This file MUST have a header named '#SampleID', 'sample_name' or 'dataset_name' as the first column name.
    • The Header line is required
    • The columns must be tab delimited
    • There must be one row for each dataset

TaxBySeq File and Metadata from Old (legacy) VAMPS

  • See VAMPS Exports
  • Download the TaxBySeq and Metadata files to your hard drive.
  • Import them as a pair either directly (still compressed) or uncompressed.

JSON Configuration File

  • Required to be valid JSON (try https://jsonlint.com)
  • Requires "source":"VAMPS*"
  • Requires id_name_hash.ids: [ ] --A valid list of dataset ids that you have permissions to view.
  • Optional:
    1. normalization  (Valid values: "none", "maximum", "frequency")
    2. selected_distance  (Valid values: "morisita-horn", "jaccard", "kulczynski", "canberra" or "bray-curtis")
    3. tax_depth  (Valid values: one of "domain", "phylum", "klass", "order", "family", "genus", "species", "strain")
    4. include_nas  (Valid values: "yes" or "no")
    5. domains  (Valid values: any or all of "Archaea","Bacteria","Eukarya","Organelle","Unknown")
    6. min_range  (Valid values: integers 0-99)
    7. max_range  (Valid values: integers 1-100)
  • Image -- (optional) To render a table, chart or other visual element automatically on upload try:
    • "image":"dheatmap" (added at same level as "source")
      possible values: "dheatmap", "piecharts", "barcharts", "counts_matrix", "metadata_table", "fheatmap", "dendrogram01", "dendrogram03", "pcoa", "pcoa3d", "geospatial" or "adiversity"
    SAMPLE:
    {
      "source":"VAMPS",
      "post_items":
      {
        "normalization":"maximum",
        "selected_distance":"morisita_horn",
        "tax_depth":"phylum",
        "domains":["Archaea","Bacteria","Eukarya","Organelle","Unknown"],
        "include_nas":"yes",
        "min_range":0,
        "max_range":100
      },
      "id_name_hash":
      {
        "ids":["49","50","51","52"]
      }
    }
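
A configuration file can be checked against the rules above before upload. The Python sketch below is one possible reading of those rules and is not the check VAMPS itself performs. It assumes that "VAMPS*" means the source value starts with "VAMPS" and that the optional settings live under "post_items" as in the sample. It does not check selected_distance, because the list of valid values and the sample spell the value differently (hyphen versus underscore), and it cannot check whether you have permission to view the listed dataset ids.

```python
import json

CHOICES = {
    "normalization": ("none", "maximum", "frequency"),
    "tax_depth": ("domain", "phylum", "klass", "order",
                  "family", "genus", "species", "strain"),
    "include_nas": ("yes", "no"),
}
DOMAINS = {"Archaea", "Bacteria", "Eukarya", "Organelle", "Unknown"}
RANGES = {"min_range": (0, 99), "max_range": (1, 100)}


def check_config(text):
    """Return a list of problems found in a JSON configuration string."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(cfg, dict):
        return ["top level must be a JSON object"]
    problems = []
    if not str(cfg.get("source", "")).startswith("VAMPS"):
        problems.append("source must start with VAMPS")
    id_hash = cfg.get("id_name_hash")
    ids = id_hash.get("ids") if isinstance(id_hash, dict) else None
    if not isinstance(ids, list):
        problems.append("id_name_hash.ids must be a list")
    opts = cfg.get("post_items", {})
    if not isinstance(opts, dict):
        return problems + ["post_items must be a JSON object"]
    for key, allowed in CHOICES.items():
        if key in opts and opts[key] not in allowed:
            problems.append(f"invalid {key}: {opts[key]}")
    if "domains" in opts and not set(opts["domains"]) <= DOMAINS:
        problems.append("invalid domains")
    for key, (low, high) in RANGES.items():
        value = opts.get(key)
        if key in opts and not (isinstance(value, int) and low <= value <= high):
            problems.append(f"{key} must be an integer from {low} to {high}")
    return problems
```

An empty list means the file passed every check in this sketch.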
    							


Definitions

Project

The project name refers to the overall study or research project to which the data belong. The project ties multiple samples and sequencing runs together.

Dataset

The dataset name refers to a set of sequences within the project that are from one sampling location or individual at a particular date and time. The dataset combines sequences sampled or amplified together. Sequence and taxonomic data are uploaded on a dataset by dataset basis. Multiple datasets may be combined together or compared separately when using the Community Visualization tools.

FASTA Files

When you upload a file it will be filtered for valid file format and data. If valid, the file will be uploaded into a temporary table of VAMPS data that will be available immediately for viewing.

FASTA definition line (or defline)

The FASTA file defline follows NCBI FASTA format. Each read starts with a ‘>’, and the read ID is the text between the ‘>’ and the first ‘|’ (a ‘pipe’ symbol). The read ID cannot contain any special characters other than dash ‘-’ or underscore ‘_’ and must be less than 32 characters long. Any other information on the definition line must come after the first ‘|’. The whole definition line is separated from the sequence data by a return or linefeed.
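
The defline rules above translate into a small check, sketched here in Python. The function name is hypothetical, and the sketch assumes that "no special characters" means only ASCII letters, digits, dash and underscore are allowed.

```python
import re

# 1 to 31 characters: letters, digits, dash or underscore (an assumed reading).
READ_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,31}")


def read_id_from_defline(defline):
    """Return the read ID from a defline, or raise ValueError if it is invalid."""
    if not defline.startswith(">"):
        raise ValueError("defline must start with '>'")
    # The read ID is everything between '>' and the first pipe.
    read_id = defline[1:].split("|", 1)[0]
    if not READ_ID_PATTERN.fullmatch(read_id):
        raise ValueError(f"invalid read ID: {read_id!r}")
    return read_id
```

For example, `read_id_from_defline(">read_01|any_other_information")` returns `"read_01"`, while a read ID containing a space or 32 or more characters raises `ValueError`.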

