Reference Database Files

RefSSU, the primary reference database of near full-length reference sequences, is derived from the SILVA rRNA database project (version 119). Low-quality sequences (Pintail score < 40, sequence quality < 50, or alignment quality < 50) are flagged as deleted. Taxonomic sources include Entrez genomes, the RDP, SILVA, EMBL, and hand-curation by collaborators. Each sequence is assigned a RefSSU_ID, formed from its accession number and the start and stop locations of the sequence on the rRNA gene.
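The quality rule above can be sketched in a few lines of Python. This is an illustration only, not the pipeline's actual code: the thresholds come from the text, while the `SilvaRecord` class, its field names, and the function name are hypothetical.

```python
from dataclasses import dataclass

# Thresholds as stated in the text above.
MIN_PINTAIL = 40
MIN_SEQUENCE_QUALITY = 50
MIN_ALIGNMENT_QUALITY = 50


@dataclass
class SilvaRecord:
    # Hypothetical record layout; the field names are illustrative only.
    accession: str
    pintail: float
    sequence_quality: float
    alignment_quality: float


def is_flagged_deleted(rec: SilvaRecord) -> bool:
    """Return True if any one of the three scores is below its threshold."""
    return (
        rec.pintail < MIN_PINTAIL
        or rec.sequence_quality < MIN_SEQUENCE_QUALITY
        or rec.alignment_quality < MIN_ALIGNMENT_QUALITY
    )
```

A record that sits exactly on a threshold (for example a Pintail score of 40) is kept, since the rule uses strict "less than" comparisons.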

Individual reference files for specific hypervariable regions (RefHVR_v3, RefHVR_v6, RefHVR_v9) are created by excising in silico the appropriate section of the full-length sequences. Only sequences that cover the entire hypervariable region are included. A RefHVR_ID is assigned to each unique hypervariable region sequence. Each RefHVR_ID is prefixed with the hypervariable region it covers (e.g., v6_AF153). The suffix of the ID is simply a unique alphanumeric string that carries no additional information. If multiple RefSSU entries have the same hypervariable region sequence, they share the same RefHVR_ID for that region. The database files contain the information needed to determine the RefSSU source(s) for each RefHVR_ID.
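The ID scheme described above amounts to de-duplicating the excised region sequences and recording which RefSSU entries share each one. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the zero-padded counter used as the suffix is a stand-in, since the text only says the real suffix is a unique alphanumeric with no meaning.

```python
from collections import defaultdict
from itertools import count


def assign_refhvr_ids(region, hvr_by_refssu):
    """Give each distinct region sequence one ID and group its RefSSU sources.

    region        -- region name used as the ID prefix, e.g. "v6"
    hvr_by_refssu -- dict mapping RefSSU_ID -> excised region sequence
    """
    counter = count(1)
    id_for_sequence = {}
    sources = defaultdict(list)
    for refssu_id, seq in hvr_by_refssu.items():
        key = seq.upper()  # treat upper- and lower-case bases as identical
        if key not in id_for_sequence:
            # Hypothetical suffix: a zero-padded running number.
            id_for_sequence[key] = f"{region}_{next(counter):05d}"
        sources[id_for_sequence[key]].append(refssu_id)
    return id_for_sequence, dict(sources)
```

For example, with made-up inputs:

```python
ids, sources = assign_refhvr_ids(
    "v6", {"ssu_a": "ACGT", "ssu_b": "acgt", "ssu_c": "TTGA"}
)
# "ssu_a" and "ssu_b" share one RefHVR_ID; "ssu_c" gets its own.
```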

FASTA files include the reference ID, the assigned taxonomy, and the source of the taxonomy. The RefHVR FASTA files include both the RefHVR_ID and the source RefSSU_ID. Only high-quality sequences are included in the FASTA files.

RefSSU file:

  • RefSSU: full-length, unaligned FASTA, based on SILVA 119: refssu.fa.gz (125M).
    Aligned sequences are available directly from SILVA.
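Because the file is a gzip-compressed FASTA, it can be streamed without unpacking it first. The sketch below uses only the Python standard library; the function names are hypothetical, and the header line is treated as an opaque string because its exact field layout is not specified here.

```python
import gzip


def read_fasta(handle):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_records(path):
    """Count the sequences in a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in read_fasta(handle))
```

Assuming the download sits in the working directory, `count_records("refssu.fa.gz")` would report how many reference sequences it holds.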

RefHVR: SILVA sequences cut between the specified primers:

| SSU rRNA Region | Primers          | Unaligned FASTA    | GAST Format | Note |
|-----------------|------------------|--------------------|-------------|------|
| v3              | 338F - 533R      | refhvr_v3.fa.gz    | refv3.tgz   |      |
| v3v5            | 341F - 785Fa**   | refhvr_v3v5.fa.gz  | refv3v5.tgz |      |
| v4v6            | 565Fa** - 1064R  | refhvr_v4v6.fa.gz  | refv4v6.tgz | Assumes 3' - 5' sequencing. |
| v4v6a           | 685Fa** - 1048R  | refhvr_v4v6a.fa.gz |             | Assumes 3' - 5' sequencing. Primer locations optimized for Archaea. |
| v6              | 967F - 1064R     | refhvr_v6.fa.gz    | refv6.tgz   |      |
| v6a             | 958F - 1048R     | refhvr_v6a.fa.gz   | refv6a.tgz  | Primer locations optimized for Archaea. |
| v9              | 1380F - 1510R    | refhvr_v9.fa.gz    | refv9.tgz   | Primer locations optimized for Eukarya. |
| refssu          |                  |                    | refssu.tgz  | Full length database. |
** The full v3v5 and v4v6 regions are longer than current 454 technology can read, so we trim sequences to an "anchor" location within the read (~400-480 nt). Because these anchor sequences are not primers, their bases are included in the reference files.
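The difference between a primer boundary (bases excluded) and an anchor boundary (bases kept) can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the method used to build these files: it matches flanking sequences exactly, whereas real primers are often degenerate, and all function and parameter names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def excise_region(seq, left, right, keep_left=False, keep_right=False):
    """Return the part of seq between two flanking sites, or None.

    left, right -- flank sequences as they appear on the strand of seq
                   (pass a reverse primer through reverse_complement first)
    keep_left / keep_right -- set True when that flank is an anchor rather
                   than a primer, so its bases stay in the output
    """
    seq, left, right = seq.upper(), left.upper(), right.upper()
    start = seq.find(left)
    if start == -1:
        return None  # sequence does not cover the left boundary
    end = seq.find(right, start + len(left))
    if end == -1:
        return None  # sequence does not cover the right boundary
    begin = start if keep_left else start + len(left)
    stop = end + len(right) if keep_right else end
    return seq[begin:stop]
```

Returning `None` when either flank is missing mirrors the rule that only sequences covering the entire region are included.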

Public Data Files