To demonstrate the accuracy of binning with ClaMS, we binned one real and one simulated metagenome. The real metagenome, the Phrap-assembled phosphorus removal sludge metagenome (SLU) sampled from a laboratory-scale bioreactor (IMG/M, taxon OID: 2000000000 [6]), is 56.6M bases long, has 60.45% GC, and contains 31,742 assembled contigs. The simulated metagenome, the assembled medium-complexity simulated dataset simMC from FAMeS [7], has 15,109 non-chimeric contigs that are 1,000 bases or longer and were therefore candidates for binning with ClaMS. We evaluated the results by cross-validating the binned contigs. For simMC, the correct bins of the contigs were already known; for SLU, best hits from Blast alignment were used to cross-validate the bins.
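The summary statistics quoted above (total length, GC percentage, and the number of contigs long enough to be binned) can be computed along the following lines. This is a minimal sketch, not part of ClaMS: the function names `gc_percent`, `pooled_gc_percent`, and `binning_candidates` are hypothetical, contigs are assumed to be held in memory as a name-to-sequence mapping, and ambiguous bases such as N are assumed to be excluded from the GC denominator.

```python
from typing import Dict


def gc_percent(seq: str) -> float:
    """Return the percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    # Sequences with no unambiguous bases are reported as 0.0 (an assumption).
    return 100.0 * gc / acgt if acgt else 0.0


def pooled_gc_percent(contigs: Dict[str, str]) -> float:
    """GC percentage over all contigs pooled, so longer contigs weigh more."""
    return gc_percent("".join(contigs.values()))


def binning_candidates(contigs: Dict[str, str], min_len: int = 1000) -> Dict[str, str]:
    """Keep only contigs that are at least min_len bases long."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_len}


if __name__ == "__main__":
    # Toy data for illustration only; not taken from SLU or simMC.
    toy = {"short": "ACGT" * 10, "long": "GGCCAATTGC" * 120}
    kept = binning_candidates(toy)
    print(sorted(kept), round(pooled_gc_percent(kept), 2))
```

Pooling the sequences before computing GC weights each contig by its length, which is the usual meaning of a single GC figure for a whole assembly.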
The phylogenetic distribution of genes in the SLU dataset, based on their best Blast hits in IMG/M [6] and the 16S rRNA tree in [8], showed that the dataset was dominated by Betaproteobacteria (127 species), Gammaproteobacteria (396 species), Bacteroidetes (81 species), and the genome of Candidatus A. phosphatis. Four training sets were used to bin SLU: the longest contig belonging to Candidatus A. phosphatis in the SLU dataset (subsequently removed from the set to be binned), betaproteobacterial isolate genomes, all gammaproteobacterial isolate genomes, and all genomes of Bacteroidetes. Scaffolds assigned to each bin were then cross-validated using their existing Blast-based class assignment in IMG/M. As part of the processing pipeline in IMG/M, the phylogenetic distribution for a metagenome is computed by aligning the genes on its scaffolds (using BLASTP) to the non-redundant database of sequences computed from isolate genomes stored in IMG. The results can be viewed as a phylogenetic distribution of genes in the metagenome, with scaffolds assigned to bins at various taxonomic levels based on the alignments of the genes they carry. The cross-validation results are outlined in Figure 1. Approximately 91% of the scaffolds in the Candidatus A. phosphatis bin have best BLAST matches to Betaproteobacteria, as do 77% of the scaffolds in the Betaproteobacteria bin. Similarly, 90% of the scaffolds in the Bacteroidetes bin have BLAST matches to Bacteroidetes, while the scaffolds in the Gammaproteobacteria bin are distributed between Betaproteobacteria (59%) and Gammaproteobacteria (25%). The latter misclassification could be attributed to the fact that the Gammaproteobacteria in the SLU dataset are dominated by Xanthomonadales, whose scaffolds have a high GC content (64-67%) that is closer to that of Betaproteobacteria (62%) than to that of Gammaproteobacteria (48%). Moreover, the taxonomic position of Xanthomonadales is not well defined [9].
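The per-bin percentages reported above amount to a cross-tabulation of bin assignments against Blast-based taxon labels. The sketch below shows one way to compute such a breakdown. It is an illustration under assumptions, not the IMG/M or ClaMS implementation: the function name `best_hit_breakdown` and both input mappings are hypothetical, and scaffolds without a best-hit label are assumed to be left out of the denominator.

```python
from collections import Counter, defaultdict
from typing import Dict


def best_hit_breakdown(
    bin_of: Dict[str, str], best_hit_of: Dict[str, str]
) -> Dict[str, Dict[str, float]]:
    """For each bin, return the percentage of its scaffolds per best-hit taxon.

    bin_of      maps scaffold ID -> bin assigned by the binning tool.
    best_hit_of maps scaffold ID -> taxon of the scaffold's best Blast hit.
    Scaffolds missing from best_hit_of are skipped (an assumption).
    """
    counts = defaultdict(Counter)
    for scaffold, bin_name in bin_of.items():
        taxon = best_hit_of.get(scaffold)
        if taxon is not None:
            counts[bin_name][taxon] += 1
    return {
        bin_name: {
            taxon: 100.0 * n / sum(taxa.values()) for taxon, n in taxa.items()
        }
        for bin_name, taxa in counts.items()
    }


if __name__ == "__main__":
    # Toy assignments for illustration only; not the SLU results.
    bins = {"s1": "Gamma", "s2": "Gamma", "s3": "Gamma", "s4": "Gamma", "s5": "Beta"}
    hits = {"s1": "Beta", "s2": "Beta", "s3": "Beta", "s4": "Gamma", "s5": "Beta"}
    print(best_hit_breakdown(bins, hits))
```

A bin whose largest share goes to a taxon other than the one it was trained on, as with the Gammaproteobacteria bin above, shows up directly in this table.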
This example illustrates the dangers of relying on isolate genome sequences as a training set, especially when relatively large taxonomic groups, such as phyla or classes, are considered. Binning can often produce more accurate results if longer contigs of known origin, taken from the sequence set to be binned, are used as training sets.
The Phrap-assembled simulated acid mine drainage dataset (simMC) from FAMeS was binned in an unsupervised manner at various phylogenetic levels. The dataset was constructed from reads collected from genomes classified into 79 genera, 60 families, 42 orders, 17 classes, and 9 phyla in the bacterial and archaeal domains. Whole-genome sequences of organisms belonging to a taxonomic unit were used to train the bin for that unit. For example, all Alphaproteobacteria species (except those used in the simulated dataset) were used to train the Alphaproteobacteria bin. All contigs longer than 1,000 bases were binned using ClaMS. Figure 2 (Left) illustrates the sensitivity and specificity of the unsupervised binning process at various phylogenetic levels when the two best bins for a contig are both considered as candidates for the correct match. For example, at the genus level, 79 bins (one for each genus) were used to bin the assembled contigs, and the bin for each genus was trained on genomic sequences from all isolate genomes belonging to that genus. Negatives were determined by counting the sequences that could not be binned at the given cut-offs for distance and contig length. Sensitivity was computed as the percentage of sequences with an existing bin that were binned correctly (the ratio of true positives to the sum of true positives and false negatives), while specificity was computed as the ratio of true negatives to the sum of true negatives and false positives. Unsupervised binning of a metagenomic dataset yields relatively accurate results at the genus, family, and domain levels, but not at the order, class, and phylum levels, where the dispersion in the properties of the signature is much greater and binning accuracy is much lower.
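The sensitivity and specificity definitions above can be made concrete with a short sketch. The text does not spell out every counting rule, so the following is one possible reading, stated as an assumption: a contig whose true taxon has a bin is a true positive if that bin is among its two best bins and a false negative otherwise; a contig whose true taxon has no bin is a true negative if it is left unbinned and a false positive if it is binned anyway. The function name `sensitivity_specificity` and its inputs are hypothetical and are not taken from ClaMS.

```python
from typing import Dict, List, Optional, Set, Tuple


def sensitivity_specificity(
    truth: Dict[str, str],
    ranked_bins: Dict[str, List[str]],
    available_bins: Set[str],
    top_n: int = 2,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (sensitivity, specificity) under the counting rules assumed above.

    truth          maps contig ID -> its true taxon.
    ranked_bins    maps contig ID -> bins ordered best first; an empty or
                   missing list means the contig could not be binned.
    available_bins is the set of taxa for which a bin was trained.
    """
    tp = fn = tn = fp = 0
    for contig, true_bin in truth.items():
        candidates = ranked_bins.get(contig, [])[:top_n]
        if true_bin in available_bins:
            # A bin exists: correct if the true bin is among the top_n bins.
            if true_bin in candidates:
                tp += 1
            else:
                fn += 1
        elif candidates:
            # No bin exists for this contig, but it was binned anyway.
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy data for illustration only; not the simMC results.
    truth = {"c1": "A", "c2": "A", "c3": "B", "c4": "Z", "c5": "Z"}
    ranked = {"c1": ["A", "B"], "c2": ["B", "A"], "c3": ["A"], "c4": [], "c5": ["A"]}
    print(sensitivity_specificity(truth, ranked, {"A", "B"}))
```

Setting `top_n=1` gives the stricter variant in which only the single best bin counts as a match.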
For metagenomic datasets whose dominant constituent populations are known, supervised binning with training on contigs from the same dataset is the best course of action. This is illustrated by the specificity vs. sensitivity plots in Figure 2 (Right), where all contigs longer than 1,000 bases in the simMC dataset were binned using training sets specified by the user. A total of 9 genera, 8 families/orders, and 6 classes were selected, and each bin was trained using contigs from the same metagenome. A combination of the two binning approaches, in which the user specifies a training set of isolate genomes instead of selecting training sequences from the same metagenome, produces better results than unsupervised binning but is less accurate than supervised binning with training contigs from the same metagenome (Figure 3).
ClaMS can run in a command-line mode, which makes it convenient to include in processing pipelines and large-scale batch-processing jobs. Screenshots of the ClaMS user interface and a demonstration of its usage, including visualization of results, are available at http://clams.jgi-psf.org. The user-friendly interface, built-in taxonomy browser, bundled genomic signatures, and fast computations make ClaMS an ideal desktop supervised binning application for biologists.