Towards long-read metagenomics: complete assembly of three novel genomes from bacteria dependent on a diazotrophic cyanobacterium in a freshwater lake co-culture
Standards in Genomic Sciences volume 12, Article number: 9 (2017)
Here we report three complete bacterial genome assemblies from a PacBio shotgun metagenome of a co-culture from Upper Klamath Lake, OR. Genome annotations and culture conditions indicate these bacteria are dependent on carbon and nitrogen fixation from the cyanobacterium Aphanizomenon flos-aquae, whose genome was assembled to draft-quality. Due to their taxonomic novelty relative to previously sequenced bacteria, we have temporarily designated these bacteria as incertae sedis Hyphomonadaceae strain UKL13-1 (3,501,508 bp and 56.12% GC), incertae sedis Betaproteobacterium strain UKL13-2 (3,387,087 bp and 54.98% GC), and incertae sedis Bacteroidetes strain UKL13-3 (3,236,529 bp and 37.33% GC). Each genome consists of a single circular chromosome with no identified plasmids. When compared with binned Illumina assemblies of the same three genomes, there was ~7% discrepancy in total genome length. Gaps where Illumina assemblies broke were often due to repetitive elements. Within these missing sequences were essential genes and genes associated with a variety of functional categories. Annotated gene content reveals that both Proteobacteria are aerobic anoxygenic phototrophs, with Betaproteobacterium UKL13-2 potentially capable of phototrophic oxidation of sulfur compounds. Both proteobacterial genomes contain transporters suggesting they are scavenging fixed nitrogen from A. flos-aquae in the form of ammonium. Bacteroidetes UKL13-3 has few completely annotated biosynthetic pathways, and has a comparatively higher proportion of unannotated genes. The genomes were detected in only a few other freshwater metagenomes, suggesting that these bacteria are not ubiquitous in freshwater systems. Our results indicate that long-read sequencing is a viable method for sequencing dominant members from low-diversity microbial communities, and should be considered for environmental metagenomics when conditions meet these requirements.
Metagenomic sequencing is the process of sampling DNA sequences from multiple genomes in a community of organisms, and has been applied to many environmental samples to assess both functional diversity and species richness of microbial communities [1, 2]. Recently, there has been a progression in metagenomic approaches associated with advances in sequencing technologies. Next-generation sequencing methods  such as 454 and Illumina HiSeq/MiSeq have greatly reduced sequencing costs per base relative to Sanger sequencing due to increased throughput, which has facilitated shotgun metagenomics (randomly sequencing all DNA in a sample). This has provided several advantages over amplicon sequencing. For example, focus has shifted from assigning taxa using single genes to using multiple genes and sequence composition instead [2, 4]. It has also permitted functional characterization of individual representatives or whole microbial communities [5, 6].
However, there are technical hurdles associated with short-read sequencing. Specifically, assembling short reads (50-300 bp) into contiguous sequences (contigs) rarely leads to complete genome assemblies due to repetitive genomic elements such as 16S rRNA genes and IS elements  that are 1 kb or greater in length. There are two consequences as a result. First, closing draft genomes by primer walking requires considerable manual effort and time. Second, if closure is not possible, contigs must be clustered and binned using methods like differential coverage , co-abundance [5, 9, 10], or gene/nucleotide composition . While useful, these methods are often not comprehensive and become even more difficult to implement when used in a metagenomic context, where multiple genomes (sometimes from closely related organisms) must be delineated .
Single-molecule real time sequencing technologies, such as PacBio and Oxford Nanopore, are part of the third-generation sequencing wave . These sequencers produce average read lengths in the 5–50 kb range, with ~50% of reads longer than 14 kb , which exceeds the size of repetitive elements in the average bacterial genome. Although more error-prone, these longer reads have proven advantageous for assembling closed genomes if sequencing depth is high enough to allow error correction . To date, long-read sequencing has rarely been used for metagenomics for several reasons: 1) the amount of sequence data returned is a fraction of an Illumina run (up to 750 Gb/flow cell of Illumina HiSeq 3000 vs. up to 1 Gb/SMRT cell of PacBio Sequel based on company specifications), 2) the sequencing cost per base pair is higher, and 3) PacBio does not rely upon DNA amplification, so high concentrations of raw DNA are required. Due to these limitations, long-read metagenomics has so far been limited to whole-16S amplicon sequencing  and to improving binning from fragmented (short-read) assemblies .
Here, we have generated a PacBio shotgun metagenome from a non-axenic cyanobacterium culture established in summer 2013 originating from Upper Klamath Lake, OR. In this freshwater lake, the N2-fixing filamentous cyanobacterium Aphanizomenon flos-aquae blooms annually. These blooms are harvested and sold as nutritional supplements. Little is known about the co-occurring microbial community in this lake, whose composition could be influenced by the presence of A. flos-aquae as the dominant primary producer [17, 18]. By applying a selective growth medium lacking nitrogen, our goal was to sequence and assemble complete genomes from a relatively simple community, in turn assessing the possibility for using PacBio shotgun sequencing for environmental metagenomics. We closed three novel bacterial genomes, which provide insight into putative metabolic dependencies of these bacteria on A. flos-aquae in the co-culture. However, we were unable to close the A. flos-aquae genome, which is in draft quality and will be discussed elsewhere.
Classification and features
The taxonomic placement of each genome was assessed three ways (Additional file 1: Table S1). We used the SILVA SSU Ref NR database (accessed on March 9, 2016) to search for significant 16S rDNA matches . Also, we generated 16S phylogenetic trees for each genome, using the SINA aligner  and FastTree , with all classified Alphaproteobacteria , Betaproteobacteria , and Bacteroidetes representatives in SILVA, shown with their nearest groups in Fig. 1a-c. For the second taxonomic placement method, we used PhyloPythiaS+ , which searches for genomes with similar k-mer composition. The third method, Phylosift , is a pipeline that aligns 40 marker genes to generate a weighted probability score for specific taxonomic assignments. Due to lack of similarity with previously-classified bacterial representatives, these approaches were unable to assign these genomes to genus or species levels. Phylogenetic analysis of 16S rRNA genes placed each of the novel bacteria between established clades within or between families, with 0.11–0.13 average substitutions per site to the nearest neighbor's 16S gene (Fig. 1a-c). Phylogenomic analyses (PhylopythiaS+ and Phylosift) were also unable to find close relatives, resulting in the genomes being placed at higher taxonomic levels than genus (Additional file 1: Table S1). We have therefore used this information to designate these organisms as incertae sedis (of uncertain placement). While Candidatus designations are often assigned to unplaced taxa, the International Code of Nomenclature for Prokaryotes requires the Candidatus usage to be accompanied by phenotypic information , which we did not have available. We assigned the temporary strain names Hyphomonadaceae UKL13-1, Betaproteobacterium UKL13-2, and Bacteroidetes UKL13-3 until further representative sequences become available to guide the naming of new genera as appropriate. Minimum Information about the Genome Sequences is summarized in Table 1.
Although we initiated and maintained this mixed-community culture for 1 year, the culture was lost and we did not obtain physiological information regarding these organisms. Sustaining long-term A. flos-aquae cultures is often difficult, and it is common for cultures to die. In the absence of phenotypic information, we discuss insights from the genome annotations for the three novel bacteria.
Genome sequencing information
Genome project history
Cultures were initiated from UKL, where annual A. flos-aquae blooms constitute a serious ecological disturbance but are also harvested and sold as nutritional supplements. The genome sequences were deposited to DDBJ/EMBL/GenBank under the accessions CP012156, CP012157, and CP012155 for the Hyphomonadaceae UKL13-1, Betaproteobacterium UKL13-2, and Bacteroidetes UKL13-3 genomes, respectively. Project information is summarized in Table 2.
Growth conditions and genomic DNA preparation
One Aphanizomenon flos-aquae colony from a depth-integrated water sample from the UKL Link Dam site collected on 6 August 2013 was transferred to Bold 3 N0 medium  without NaNO3 (or any other form of N). This medium consisted of 0.17 mM CaCl2, 0.3 mM MgSO4, 0.43 mM K2HPO4, 1.29 mM KH2PO4, 0.43 mM NaCl, P IV trace metals, and 0.1 μM vitamin B12 at pH 8.0. The culture was maintained under cool white fluorescent light (20 μE m−2 s−1) with a light/dark cycle of 16 h/8 h at 24 °C. Three separate DNA extractions were performed from this culture (Table 3). A sample taken in November 2013 was collected on a 1.2 μm GF/C filter (Whatman), and DNA was extracted for Illumina sequencing using a DNA extraction kit (GeneRite DNA-EZ RWOC1). A similarly collected sample (Nov 2013) was extracted using phenol-chloroform  and pooled with phenol-chloroform extracted DNA from an unfiltered sample of the culture collected during March 2015 (to balance the proportion of sequencing capacity associated with cyanobacteria and heterotrophic bacteria). This pooled sample was quantified with the Q32850 Quant-iT dsDNA BR Assay Kit. Approximately eight micrograms of DNA were submitted for PacBio sequencing.
Genome sequencing and assembly
The November 2013 sample was processed using a Nextera XT kit and sequenced using the Illumina HiSeq 2000 at the Oregon State University Center for Genome Research and Biocomputing to generate 17,617,259 paired-end reads (101 bp). The pooled (11/2013 & 3/2015) sample was processed for PacBio sequencing by the Molecular Biology and Genomics Core at Washington State University. Eight SMRT cells of PacBio RS sequencing generated 348,623 reads with an average length of 7,737 bp. PacBio sequences were assembled using HGAP  with three different parameter sets to optimize for assembly of different genomes (Additional file 2: Table S2). Initially, only the Bacteroidetes genome assembled from two SMRT cells (167,289 PacBio reads), at a seed read length cutoff of 12.8 kb. The less abundant Hyphomonadaceae UKL13-1 and Betaproteobacterium UKL13-2 genomes required all eight SMRT cells to close (348,623 reads). While the Betaproteobacterium genome closed with a seed read-length cutoff of 13.6 kb, the Hyphomonadaceae genome only assembled completely when this cutoff was lowered to 6 kb, likely since it had the lowest coverage of the three genomes. A lower cutoff directs more reads towards use in assembling, thereby improving chances of completing low-coverage assemblies . However, this also reduces the number of aligned reads used for error correction, which can in turn affect assembly quality. These tradeoffs should be considered before performing assemblies, but it is notable that we would not have completed the Hyphomonadaceae UKL13-1 genome without lowering this cutoff. The Hyphomonadaceae , Betaproteobacterium, and Bacteroidetes genomes were of finished quality (Table 4), with each having average Phred scores (ASCII base 33) of 75.9, 76.0, and 81.9, respectively. Although HGAP assembly involves chimera detection, we additionally evaluated the possibility of chimeric assembly of each genome by mapping the Illumina reads to the completed genomes using REAPR , which breaks incorrect assemblies by assessing the paired-read coverage distribution at each base; no chimeras were identified. We were unable to complete other genomes in the culture, including the draft-quality A. flos-aquae genome assembly (Table 4).
The Illumina-sequenced culture was assembled using the IDBA-Hybrid  software. We binned Illumina-assembled contigs from the three completed genomes by differential coverage of reads from both PacBio and Illumina samples. That is, Illumina and PacBio reads were separately mapped to each assembly using BWA-MEM  and BLASR , respectively. Contigs were then binned using the mmgenome R package  (Table 5).
All genomes were annotated with NCBI's PGAP  and PROKKA . Counts of features (Genes, CDS, pseudogenes, rRNAs, tRNAs, ncRNAs, and CRISPR arrays) come from PGAP annotations. Amino acid sequences were assigned to COG categories by searching against the COG protein database  using RAPSEARCH , taking only the top hits above an E-value of 1E-30. Amino acid sequences from each genome were also annotated using the KEGG database  with the GhostKOALA  pipeline and the “genus_prokaryotes” database on September 3, 2015.
Each genome assembled into one closed contig, whose properties and statistics are shown in Table 6. The Hyphomonadaceae UKL13-1 genome consists of a single circular chromosome 3,501,508 bp long and a GC content of 56.12%. The genome contains a total of 3255 predicted genes, including 2934 predicted protein-coding sequences, 277 pseudogenes, and 44 RNA genes (40 tRNAs, one copy of the 16S-23S-5S rRNA operon, and 1 ncRNA) (Fig. 2). The Betaproteobacterium UKL13-2 genome consists of a single circular chromosome 3,387,087 bp long and a GC content of 54.98%. The genome contains a total of 3087 predicted genes, including 2772 predicted protein-coding sequences, 265 pseudogenes, and 50 RNA genes (43 tRNAs, two copies of the 16S-23S-5S rRNA operon, and 1 ncRNA) (Fig. 3). The Bacteroidetes UKL13-3 genome consists of a single circular chromosome 3,236,529 bp long and a GC content of 37.33%. The genome contains a total of 2850 predicted genes, including 2598 protein-coding sequences, 211 pseudogenes, and 41 RNA genes (35 tRNAs and two copies of the 16S-23S-5S rRNA operon) (Fig. 4). The distribution of genes into COG functional categories is summarized in Table 7.
Insights from the genome sequence
PacBio metagenome and comparison to Illumina metagenome
The bacterial community associated with the Aphanizomenon flos-aquae culture was subjected to metagenomic analysis with sequencing on eight PacBio SMRT cells, resulting in three completed novel bacterial genomes: Hyphomonadaceae UKL13-1, Betaproteobacterium UKL13-2, and Bacteroidetes UKL13-3 (Table 4). There were insufficient reads to close the genome of A. flos-aquae, although 67 contigs could be clustered to represent an estimated 97% of the genome (Table 4). The A. flos-aquae genome was sequenced with lower coverage than the three completed genomes, and additional sequencing would be needed for genome completion. Contigs from partial genomes of two additional bacteria were also clustered: a novel Flavobacterium (63% estimated genome completeness) and a novel Brevundimonas bacterium (17% estimated genome completeness) (Table 4), which were taxonomically placed via PhylopythiaS+. The Flavobacterium genome contained 16S rDNA genes with 98% similarity to Flavobacterium aquatile DSM 1132, but no 16S gene was identified in the Brevundimonas contigs. Our results indicate the presence of at least six separate bacterial taxa in this non-axenic culture.
A parallel Illumina HiSeq 2000 metagenome allowed comparison of PacBio-only and Illumina-only assemblies. When assembled with Illumina reads, the three predominant genomes separated into bins containing ~100 or more contigs. The Betaproteobacterium genome bin contained more contigs than the Hyphomonadaceae and Bacteroidetes genomes, although it was sequenced at the highest Illumina depth of the three (63x coverage vs. 23x and 58x coverage, respectively) (Table 5). There was a ~200 kb discrepancy between Illumina bin length and completed genome length for each of the three genomes. The total binned contig lengths for the Bacteroidetes and Betaproteobacterium were each shorter than the completed genomes, while the Hyphomonadaceae total binned contig length was longer (Table 5). The additional sequences in the Hyphomonadaceae bin were primarily contigs shorter than 10 kb that were not part of the PacBio-assembled Hyphomonadaceae genome. The bin quality control program CheckM  overestimated genome completeness or underestimated contamination when compared with the finished genome size. For example, CheckM estimated that the Hyphomonadaceae UKL13-1 bin contained ~2% contamination, while comparing the bin length with the completed genome length suggests ~6% contamination (Table 5). These discrepancies indicate that genome binning has a tendency to exclude important sequences or include extraneous sequences, and reveals the difficulty of assessing binned genome completeness and contamination without a reference. Incomplete binning is common for draft genomes, particularly from metagenomic assemblies .
We also assessed the extent to which genome repeats affected Illumina assemblies. Repeats in each genome were identified by using BLASTN to align each genome with itself, with a minimum E-value cutoff of 1E-30. Both intragenome BLASTN hits and missing Illumina coverage were then visualized with a circular genome plot (Additional file 3: Figure S1a-c). Breaks in Illumina assemblies commonly co-localized with intragenomic repeats in each genome. In particular, the Betaproteobacterium UKL13-2 genome is enriched for repeat sequences relative to the other two genomes and contains larger regions unassembled by Illumina reads, factors that possibly contributed to the greater genome fragmentation (Table 5).
We then analyzed gene functions in sequences missing from Illumina bins to assess the extent to which critical gene content was missing (Fig. 5). Most annotated genes in these regions were assigned to the mobilome category (X, esp. transposases), although genes from most other COG categories were also represented. Annotations within these regions included essential genes such as tRNAs, rRNA operons, translation-associated genes, and nucleotide metabolism genes, in addition to a variety of enzymes and transporters (Table 8; Additional file 4: Tables S3, Additional file 5: Table S4 and Additional file 6: Table S5). The presence of duplicated essential genes (DNA ligase, EF-Tu) resulted in both copies being absent from the Betaproteobacterium genome (Table 8); the presence of multiple rDNA sequences commonly produces breaks in short-read assemblies . In such cases, rDNA sequences confined to small contigs lose their linkage to other genes. This makes assigning 16S sequences to draft genomes difficult when multiple organisms are present in the same sample, and can make it difficult to link 16S amplicon information to binned genomes from shotgun metagenomes. Also, the functional variety of non-mobilome-associated missing genes within these assembly breaks suggests they hold informative sequences regarding physiology or lifestyle.
Novel completed genomes
To functionally characterize the three novel genomes, we searched all protein-coding sequences against the COG database using Rapsearch 2.16 and a 1E-30 E-value cutoff. We then repeated this for all bacterial genomes in GenBank (collected on November 3, 2015) and compared these to our novel genomes to assess enrichment of protein-coding sequences associated with each COG functional category. These are shown as a percentage of all protein-coding sequences from each respective genome (Additional file 7: Figure S2). Our results indicate that the Hyphomonadaceae UKL13-1 genome contains more lipid metabolism genes than most bacteria (at 5.01% of predicted coding sequences vs. a mean of 2.96%), while the Bacteroidetes UKL13-3 genome contains more cell wall/envelope/membrane biogenesis genes (7.39%, vs. a mean of 4.61%).
We then searched the KEGG database to identify complete and partial pathways in each genome. Identification of additional genes was aided by using Mauve whole- or partial-genome alignments  to reference genomes ( Cytophaga hutchinsonii , Roseobacter denitrificans , Rubrivivax gelatinosus , and Rhodobacter capsulatus ) and between Hyphomonadaceae UKL13-1 and Betaproteobacterium UKL13-2. The Hyphomonadaceae UKL13-1 and Betaproteobacterium UKL13-2 genomes contain anoxygenic photosynthesis and reaction center genes, as well as genes for bacteriochlorophyll and carotenoid synthesis (Additional file 8: Table S6). Phylogenies of their 16S genes reveal they do not cluster near groups containing phototrophic bacteria (Fig. 1a, b). Neither genome contains RuBisCO genes, consistent with these bacteria being aerobic anoxygenic phototrophs. These are a class of heterotrophs that use phototrophy to drive ATP and NAD(P)H production, but are unable to fix net carbon through photosynthesis [41, 42]. For Betaproteobacterium UKL13-2, the presence of genes for thiosulfate or sulfite oxidation (soxABCDXYZ) suggests that these sulfur compounds can serve as electron donors for ATP synthesis , perhaps in addition to organic compounds (Fig. 6). Both A. flos-aquae and Betaproteobacterium UKL13-2 appear to be capable of assimilatory sulfate reduction of MgSO4 (provided as the only S source in the growth medium) (Additional file 8: Table S6), which is often used as the pathway for amino acid synthesis. Photolithotrophic oxidation of reduced S compounds obtained from A. flos-aquae by the Betaproteobacterium could be energetically advantageous. Since neither genes for oxidation of reduced sulfur or nitrogen compounds are evident in the Hyphomonadaceae genome, organic compounds likely serve as electron donors in this bacterium [41, 42] (Fig. 6). In contrast with the proteobacterial genomes, Bacteroidetes UKL 13-3 contains no autotrophy genes, consistent with the typical lifestyle of bacteria from this phylum as feeding on cellular detritus . However, fewer genes were annotated from Bacteroidetes UKL13-3, and fewer completed KEGG pathway modules were identified than for the Hyphomonadaceae or Betaproteobacterium genomes (38 vs. 72 and 80, respectively). This could be due to protein-coding sequences carrying distant homology to those currently deposited in KEGG, limiting the ability to identify metabolic genes and pathways.
The A. flos-aquae genome was the only identified source of nitrogen fixing genes in the culture. Since the growth medium was nitrogen-deplete, all other bacteria in the community likely depend on reduced N provided by the cyanobacterium. It has been shown that A. flos-aquae from the Baltic Sea fixes N2 and releases it as NH4 +, which is then taken up by surrounding heterotrophic or phototrophic bacteria [44, 45]. Both proteobacterial genomes contain the ammonia transporter gene amtB, which would allow uptake of NH4 + released by A. flos-aquae (Fig. 6). No ammonia channel transport genes were annotated in the Bacteroidetes UKL13-3 genome. The proteobacterial genomes both contain chemotaxis and flagellar genes, and the Betaproteobacterium genome also contains type IV pilus genes for twitching motility (Additional file 8: Table S6; Fig. 6). Motility may be necessary for these organisms to stay associated with and obtain benefits from A. flos-aquae, similar to other host-associated bacteria .
We searched the novel genomes for the presence of other transporters to inform of the needs for survival and growth. Both proteobacterial genomes contain transporters for alkanesulfonate, iron(III), phosphate, and phosphonate. The Hyphomonadaceae genome also contains a transporter for putrescine, while the Betaproteobacterium genome contains complete transporter modules for tungstate, molybdate, glutamate/aspartate, and branched-chain amino acids. Few, and only broadly functional, transporter modules were identified in the Bacteroidetes genome. All three genomes appear to carry complete genetic pathways for nucleotide biosynthesis, as well as genes for synthesis of all 20 amino acids, indicating these organisms are self-sufficient in this regard. Because the Flavobacterium and Brevundimonas genomes were so incomplete (Table 4), their gene content is not reported here.
We were unable to identify any plasmids in the assemblies. The distribution of all plasmids in GenBank shows that the majority are found in Proteobacteria (~47%), although most of these are associated with Gammaproteobacteria (~63%), rather than Alphaproteobacteria (~22%) or Betaproteobacteria (~8.7%) . Plasmids from Bacteroidetes were much rarer at ~1.6%. It may then be unsurprising that these bacteria lack plasmids.
Freshwater bacteria associated with cyanobacterial blooms
Bacteria from the Alphaproteobacteria , Betaproteobacteria , and Bacteroidetes are common in freshwater systems , are known to be commonly associated with cyanobacterial blooms, and can directly influence the growth of cyanobacteria in culture . We therefore propose that the three newly sequenced genomes represent a bacterial community that is dependent on Aphanizomenon flos-aquae (Fig. 6). Some Alphaproteobacteria have been identified in such cyanobacterial-associated communities . For example, Alphaproteobacteria 16S sequences have been detected in association with the nitrogen-fixing cyanobacterium Gloeotrichia echinulata . Interestingly, 16S rDNA from Hyphomonadaceae UKL13-1 shared significant identity (Additional file 1: Table S1) with one of these sequences (A0904), suggesting that bacteria related to Hyphomonadaceae UKL13-1 are associated with various bloom-forming cyanobacteria. However, the extent to which such co-occurrences reflect physiological interdependencies remains to be explored. Betaproteobacteria have been seen physically associated with cyanobacteria [18, 49]. The predicted chemotaxis and flagellar and twitching motility genes (Additional file 8: Table S6) would assist both Hyphomonadaceae UKL13-1 and Betaproteobacterium UKL13-2 to remain associated with A. flos-aquae colonies and obtain the benefits of ammonium and organic nutrients released by the cyanobacterium (Fig. 6). We have detected no genes by which these photoheterotrophic bacteria could obviously benefit A. flos-aquae.
Bacteria from the Bacteroidetes phylum are commonly identified in, and sometimes dominate, freshwater lake systems . They are also frequently found in particle-associated communities and commonly degrade extracellular polysaccharide matrices that are grazed via gliding motility . Bacteroidetes UKL13-3 possesses annotated gliding motility genes, which may indicate physical association with the originally isolated A. flos-aquae colony. Extracellular mucilage, as well as a range of nutrients (reduced C, N and S compounds) released by A. flos-aquae, may support the growth of Bacteroidetes UKL13-3, whose genome seems to lack many functionally annotated pathways. Bacteroidetes UKL13-3 has the only annotated extracellular peroxidase gene in the three genomes, which could protect against reactive oxygen species generated by photosynthesis in A. flos-aquae (Fig. 6), which itself lacks annotated peroxidase genes. This may indicate a mutual benefit for both bacteria, and conform to the Black Queen Hypothesis described for interactions between the unicellular cyanobacterium Prochlorococcus with other interacting bacteria . On the other hand, large populations of Bacteroidetes bacteria have been observed following cyanobacterial bloom decline  due to subsequently favorable conditions for copiotrophs  and cell turnover of A. flos-aquae and other cells may provide organic material for Bacteroidetes UKL13-3 growth in co-culture, with similar benefits from lysed cells for the two Proteobacteria .
Search for the novel genomes in freshwater metagenomes
We searched for the occurrence of the three novel bacteria in 62 freshwater lake metagenomes from 8 sampling sites across the United States, including Oregon, Washington state, California, Texas, and Kansas (BioProject accessions: PRJNA312985, PRJNA282166, PRJNA312830, PRJNA312986, and PRJNA294203, respectively). To do so, we mapped reads from these metagenomes to the references with BWA-MEM with default parameters (~0.067% mismatch rate) and calculated average genome coverage. Matches were found in two samples. A metagenome from Copco Reservoir, CA, on the Klamath River downstream of UKL on September 19, 2007 contained ~86x average read coverage depth of the Hyphomonadaceae UKL13-1 genome and ~151x coverage depth of the Bacteroidetes UKL13-3 genome from 398,356,734 Illumina read pairs. Additionally, a metagenome from Cranberry Lake, WA on August 11, 2014 contained the Betaproteobacterium UKL13-2 genome at ~99x average coverage depth from 13,955,857 Illumina read pairs. We also searched in 50 additional freshwater lake metagenomes in the IMG, MG-RAST, and SRA databases. The only detection found was the Betaproteobacterium UKL13-2 genome at ~19x average coverage depth in a metagenome consisting of 319,415,720 Illumina read pairs labeled “vibrio metagenome HEM-04” from a freshwater lake (BioProject accession: PRJNA64039). This initial analysis shows that the three novel bacteria are found elsewhere in freshwater habitats, although they do not appear to be ubiquitous or widely abundant.
Taxonomic placement and naming of genomes from uncharacterized bacteria
Currently, there is a lack of guidance and standardization for assigning taxonomic nomenclature to genomic sequences lacking phenotypic information. Until now, most of these sequences have been amplicons or draft assemblies from shotgun metagenomes, in which the genomes are usually incomplete and there are likely to be some contaminating contigs. We observed both of these defects in the draft genomes assembled from the short-read Illumina sequencing that paralleled the assembly of complete genomes from the PacBio long-read sequencing (Tables 4 and 5). Critical genes—including ribosomal RNA operons—were missing from the draft genomes (Table 8), making it perhaps premature to assign new taxonomic designations to such bacteria. Our work demonstrates that as long-read sequencing depth increases, so will the likelihood of assembling complete microbial genomes from uncultured samples of high relevance to extant microbial communities. Especially for circular chromosomes, the assembly of such complete genomes would seem to carry a low risk of artifactual assembly of the genome from a mixed-DNA sample in a metagenomic study, particularly when no PCR amplifications are used in the sequencing protocol, as in the present case. It would seem beneficial to the advancement of microbial ecology to develop new guidelines by which currently unclassifiable complete genomes derived from metagenomic data can be taxonomically placed with new genus and species names as appropriate.
Here, we have shown that completing multiple genome assemblies is possible from a simple microbial community using PacBio sequencing, a feat that is nearly impossible with short-read shotgun sequencing alone. There are several advantages to this approach. Completing genome assemblies from a shotgun metagenome avoids genome gaps and excludes contaminant sequences, which are significant issues with binned draft genomes. Absent sequences can contain functionally relevant information, such as gene clusters encoding secondary metabolites  or antibiotic resistance genes near mobile elements . Here we observed that key essential genes (Table 8) were missing from each short-read assembly. Also, short-read assemblers can compress small repeats, potentially removing important functional information . In addition to providing more complete genomic information, long-read sequencing of communities such as mixed cultures or environmental samples creates possibilities for new experimental designs. For example, complete genomes from novel organisms sequenced from the environment can be used as new references for culture-free resequencing efforts, such as to explore gene linkage patterns among alleles in a population. Further, long-read sequencers often detect DNA modifications, such as methylation, allowing capture of epigenetic information from environmental sequencing runs.
Although PacBio sequencing is low-throughput compared with short-read sequencers, our results suggest that the current state of this technology allows genome sequencing from communities with relatively low diversity, such as those in extreme environments  or when dominated by one or a few organisms . Platform improvement, such as the recently released PacBio Sequel instrument, is expected to make long-read sequencing increasingly desirable for shotgun metagenomics in the future.
Here, we have sequenced three novel genomes that may be associated with A. flos-aquae as part of the cyanobacterial phycosphere (Fig. 6). Based on gene annotations and growth medium, both Proteobacteria are motile aerobic anoxygenic phototrophs that utilize fixed nitrogen and carbon provided by A. flos-aquae. Bacteroidetes UKL13-3 is a heterotroph that likely has similar nutritional requirements, and may exist in a mutual relationship with A. flos-aquae through provision of an extracellular peroxidase. In future work, it will be interesting to explore the possible existence and nature of dependencies between these novel bacteria and A. flos-aquae colonies in blooms in Upper Klamath Lake and elsewhere.
Aerobic anoxygenic phototroph
Clusters of orthologous genes
Dissolved organic matter
Kyoto encyclopedia of genes and genomes
Prokaryotic genome annotation pipeline
Single molecule real-time
Upper Klamath Lake
Gilbert JA, Dupont CL. Microbial Metagenomics: Beyond the Genome. Annu Rev Mar Sci. 2011;3:347–71.
Escobar-Zepeda A, de León AV-P, Sanchez-Flores A. The road to metagenomics: from microbiology to DNA sequencing technologies and bioinformatics. Front Genet. 2015;6:348.
Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008;9:387–402.
Gregor ID J, Schirmer M, Quince C, McHardy AC. PhyloPythiaS+: A self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. arXiv. 2014;1406:7123.
Sharon I, Morowitz MJ, Thomas BC, Costello EK, Relman DA, Banfield JF. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 2013;23(1):111–20.
Evans PN, Parks DH, Chadwick GL, Robbins SJ, Orphan VJ, Golding SD, Tyson GW. Methane metabolism in the archaeal phylum Bathyarchaeota revealed by genome-centric metagenomics. Science. 2015;350(6259):434–8.
Koren S, Phillippy AM. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol. 2015;23:110–20.
Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, Nielsen PH. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat Biotechnol. 2013;31(6):533–8.
Imelfort M, Parks D, Woodcroft BJ, Dennis P, Hugenholtz P, Tyson GW. GroopM: an automated tool for the recovery of population genomes from related metagenomes. Peer J. 2014;2:e603.
Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. Peer J. 2015;3:e1165.
Cleary B, Brito IL, Huang K, Gevers D, Shea T, Young S, Alm EJ. Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning. Nat Biotechnol. 2015;33:1053–60.
Hess M, Sczyrba A, Egan R, Kim T-W, Chokhawala H, Schroth G, Luo S, Clark DS, Chen F, Zhang T. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331(6016):463–7.
Lee H, Gurtowski J, Yoo S, Marcus S, McCombie WR, Schatz M. Error correction and assembly complexity of single molecule sequencing reads. BioRxiv. 2014:doi: 10.1101/006395
Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, Wang Z, Rasko DA, McCombie WR, Jarvis ED. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012;30(7):693–700.
Fichot EB, Norman RS. Microbial phylogenetic profiling with the Pacific Biosciences sequencing platform. Microbiome. 2013;1(1):1.
Frank JA, Pan Y, Tooming-Klunderud A, Eijsink VG, McHardy AC, Nederbragt AJ, Pope PB. Improved metagenome assemblies and taxonomic binning using long-read circular consensus sequence data. Sci Rep. 2016;6:25373.
Bagatini IL, Eiler A, Bertilsson S, Klaveness D, Tessarolli LP, Vieira AAH. Host-specificity and dynamics in bacterial communities associated with bloom-forming freshwater phytoplankton. PLoS ONE. 2014;9:e85950.
Louati I, Pascault N, Debroas D, Bernard C, Humbert JF, Leloup J. Structural diversity of bacterial communities associated with bloom-forming freshwater cyanobacteria differs according to the cyanobacterial genus. PLoS ONE. 2015;10(11):e0140614.
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013;41(D1):D590–6.
Pruesse E, Peplies J, Glöckner FO. SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics. 2012;28(14):1823–9.
Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5(3):e9490.
Darling AE, Jospin G, Lowe E, Matsen IV FA, Bik HM, Eisen JA. PhyloSift: phylogenetic analysis of genomes and metagenomes. Peer J. 2014;2:e243.
Parker CT, Tindall BJ, Garrity GM. International code of nomenclature of prokaryotes. Int J Syst Evol Microbiol. 2015;doi: 10.1099/ijsem.0.000778
Culture Collection of Algae at the University of Texas at Austin. B3N media recipe. [https://utex.org/products/bold-3n-medium]. Accessed 12 Jan 2017.
Sambrook J, Russell DW. Purification of nucleic acids by extraction with phenol: chloroform. Cold Spring Harbor Protocols 2006, 2006(1):doi: 10.1101/pdb.prot4455.
Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, Clum A, Copeland A, Huddleston J, Eichler EE, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. 2013;10(6):563–9.
Forde BM, Zakour NLB, Stanton-Cook M, Phan M-D, Totsika M, Peters KM, Chan KG, Schembri MA, Upton M, Beatson SA. The complete genome sequence of Escherichia coli EC958: a high quality reference sequence for the globally disseminated multidrug resistant E. coli O25b: H4-ST131 clone. PLoS ONE. 2014;9(8):e104400.
Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 2013;14(5):1.
Peng Y, Leung HC, Yiu SM, Chin FY. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:13033997 2013.
Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012;13(1):238.
NCBI Prokaryotic Genome Annotation Pipeline Release Notes, Version 3.3 [https://www.ncbi.nlm.nih.gov/genome/annotation_prok/release_notes/]. Accessed 12 Jan 2017.
Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30(14):2068–9.
Galperin MY, Makarova KS, Wolf YI, Koonin EV. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 2014:doi: 10.1093/nar/gku1223.
Zhao Y, Tang H, Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics. 2012;28(1):125–6.
Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. KEGG as a reference resource for gene and protein annotation. Nucleic acids research 2015:doi: 10.1093/nar/gkv1070.
Kanehisa M, Sato Y, Morishima K. BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J Mol Biol. 2016;428(4):726–31.
Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25(7):1043–55.
Koren S, Harhay GP, Smith TP, Bono JL, Harhay DM, Mcvey SD, Radune D, Bergman NH, Phillippy AM. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 2013;14(9):1.
Darling AE, Mau B, Perna NT. ProgressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE. 2010;5(6):e11147.
Moran MA, Miller WL. Resourceful heterotrophs make the most of light in the coastal ocean. Nat Rev Microbiol. 2007;5(10):792–800.
Yurkov V, Csotonyi JT. New light on aerobic anoxygenic phototrophs. In: Advances in Photosynthesis and Respiration. Dordrecht: Springer; 2009. p. 31-55. ISBN: 978-1-4020-8814-8.
Newton RJ, Jones SE, Eiler A, McMahon KD, Bertilsson S. A guide to the natural history of freshwater lake bacteria. Microbiol Mol Biol Rev. 2011;75(1):14–49.
Ploug H, Musat N, Adam B, Moraru CL, Lavik G, Vagner T, Bergman B, Kuypers MM. Carbon and nitrogen fluxes associated with the cyanobacterium Aphanizomenon sp. in the Baltic Sea. ISME J. 2010;4(9):1215–23.
Adam B, Klawonn I, Sveden JB, Bergkvist J, Nahar N, Walve J, Littmann S, Whitehouse MJ, Lavik G, Kuypers MM, et al. N-fixation, ammonium release and N-transfer to the microbial and classical food web within a plankton community. ISME J. 2016;10:450–9.
Lertsethtakarn P, Ottemann KM, Hendrixson DR. Motility and chemotaxis in Campylobacter and Helicobacter. Annu Rev Microbiol. 2011;65:389–410.
Shintani M, Sanchez ZK, Kimbara K. Genomics of microbial plasmids: classification and identification based on replication and transfer systems and host taxonomy. Front Microbiol. 2015;6.
Berg KA, Lyra C, Sivonen K, Paulin L, Suomalainen S, Tuomi P, Rapala J. High diversity of cultivable heterotrophic bacteria in association with cyanobacterial water blooms. ISME J. 2009;3(3):314–25.
Eiler A, Olsson JA, Bertilsson S. Diurnal variations in the auto‐and heterotrophic activity of cyanobacterial phycospheres (Gloeotrichia echinulata) and the identity of attached bacteria. Freshw Biol. 2006;51(2):298–311.
Pernthaler J, Zöllner E, Warnecke F, Jürgens K. Bloom of filamentous bacteria in a mesotrophic lake: identity and potential controlling mechanism. Appl Environ Microbiol. 2004;70(10):6272–81.
Lemarchand C, Jardillier L, Carrias J-F, Richardot M, Debroas D, Sime-Ngando T, Amblard C. Community composition and activity of prokaryotes associated to detrital particles in two contrasting lake ecosystems. FEMS Microbiol Ecol. 2006;57(3):442–51.
Morris JJ, Lenski RE, Zinser ER. The Black Queen Hypothesis: evolution of dependencies through adaptive gene loss. mBio 2012, 3(2):doi: 10.1128/mBio.00036-12.
Eiler A, Bertilsson S. Flavobacteria blooms in four eutrophic lakes: linking population dynamics of freshwater bacterioplankton to resource availability. Appl Environ Microbiol. 2007;73(11):3511–8.
Zeder M, Peter S, Shabarova T, Pernthaler J. A small population of planktonic Flavobacteria with disproportionally high growth during the spring phytoplankton bloom in a prealpine lake. Environ Microbiol. 2009;11(10):2676–86.
Harrison J, Studholme DJ. Recently published Streptomyces genome sequences. Microb Biotechnol. 2014;7(5):373–80.
Zowawi HM, Forde BM, Alfaresi M, Alzarouni A, Farahat Y, Chong T-M, Yin W-F, Chan K-G, Li J, Schembri MA. Stepwise evolution of pandrug-resistance in Klebsiella pneumoniae. Sci Rep. 2015;5:15082.
Brown NM, Mueller RS, Shepardson JW, Landry ZC, Morre JT, Maier CS, Hardy FJ, Dreher TW. Structural and functional analysis of the finished genome of the recently isolated toxic Anabaena sp. WA102. BMC Genomics. 2016;17(1):457.
Méndez-García C, Peláez AI, Mesa V, Sánchez J, Golyshina OV, Ferrer M. Microbial diversity and metabolic networks in acid mine drainage habitats. Front Microbiol. 2015;6:475.
Lin K-H, Liao B-Y, Chang H-W, Huang S-W, Chang T-Y, Yang C-Y, Wang Y-B, Lin Y-TK WY-W, Tang S-L. Metabolic characteristics of dominant microbes and key rare species from an acidic hot spring in Taiwan revealed by metagenomics. BMC Genomics. 2015;16(1):1.
Abraham W-R, Rohde M. The Family Hyphomonadaceae. In: The Prokaryotes. Heidelberg: Springer; 2014. p. 283-99. ISBN: 978-3-642-30193-3.
Prosser JI, Head IM, Stein LY. The family Nitrosomonadaceae. In: The Prokaryotes. Heidelberg: Springer; 2014. p. 901-18. ISBN: 978-3-642-30193-3.
Oren A. The Family Rhodocyclaceae. In: The Prokaryotes. Heidelberg: Springer; 2014. p. 975-98. ISBN: 978-3-642-30193-3.
Field D, Garrity G, Gray T, Morrison N, Selengut J, Sterk P, Tatusova T, Thomson N, Allen MJ, Angiuoli SV, et al. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol. 2008;26(5):541–7.
Woese CR, Kandler O, Wheelis ML. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci. 1990;87(12):4576–9.
Garrity GM, Bell JA, Lilburn T. Phylum XIV. Proteobacteria phyl. In: Garrity GM, Brenner DJ, Krieg NR, Staley JT, editors. Bergey’s manual of systematic bacteriology, second edition, volume 2, part b. New York: Springer; 2005. p. 1.
Krieg NR, Ludwig W, Euzéby J, Whitman WB: Phylum XIV. Bacteroidetes phyl. nov. In: Bergey’s Manual® of Systematic Bacteriology. New York: Springer; 2010. p. 25-469. ISBN 0387950427.
Garrity GM, Bell JA, Lilburn T. Class I. Alphaproteobacteria class. nov. In: Bergey’s Manual® of Systematic Bacteriology. New York: Springer; 2005. p. 1-574. ISBN 0387950427.
Garrity GM, Bell JA, Lilburn T: Class II. Betaproteobacteria class. nov. In: Bergey’s manual® of systematic bacteriology. New York: Springer; 2005. p. 575-922. ISBN 0387950427.
Garrity GM, Bell JA, Lilburn T. Order III. Rhodobacterales ord. nov. In: Bergey’s manual® of systematic bacteriology. Edited by Brenner DJ, Krieg NR, Staley JT, Garrity GM, vol. 2, part C: New York: Springer; 2005. p. 161. ISBN 0387950427.
Lee K-B, Liu C-T, Anzai Y, Kim H, Aono T, Oyaizu H. The hierarchical system of the ‘Alphaproteobacteria’: description of Hyphomonadaceae fam. nov., Xanthobacteraceae fam. nov. and Erythrobacteraceae fam. nov. Int J Syst Evol Microbiol. 2005;55(5):1907–19.
Morris BE, Henneberger R, Huber H, Moissl-Eichinger C. Microbial syntrophy: interaction for the common good. FEMS Microbiol Rev. 2013;37(3):384–406.
We thank Sara Eldridge at the USGS for providing samples from UKL used for culture inoculum, Kim Halsey for discussions and Mark Wildung and Derek Pouchnik from the WSU Molecular Biology and Genomics Core for DNA preparation and PacBio sequencing in addition to providing HGAP assemblies. We also thank the NL Tartar Research Fellowship, the Oregon State University Agricultural Experiment Station, and the Mabel E. Pernot Trust for providing funding.
Conceived the project: TWD, TGO, NMB, CBD. Provided strains: TGO. Annotated the genomes: CBD. Analyzed and interpreted results: CBD, TWD, NMB. Wrote the manuscript: CBD, TWD, TGO. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Taxonomic placement of each novel genome by 16S similarity, composition (PhyloPythiaS+), and multiple marker gene similarities (Phylosift). (CSV 312 bytes)
Assembly parameters for genome assemblies from PacBio reads. Minimum read length cutoff is lowest read-length used for assembly, with remaining reads used for error correction. (CSV 195 bytes)
Hyphomonadaceae UKL13-1 genome repeats and Illumina breaks. Blue lines signify intragenomic repeats (based on BLASTN with a minimum E-value cutoff of 1E-30), and red bars mark sequences missing from Illumina assemblies. b. Betaproteobacterium UKL13-2 genome repeats and Illumina breaks. c. Bacteroidetes bacterium UKL13-3 genome repeats and Illumina breaks. (ZIP 4776 kb)
Annotated genes in Hyphomonadaceae UKL13-1 Illumina breaks. Genes called and annotated with PROKKA. (XLSX 13 kb)
Annotated genes in Betaproteobacterium UKL13-2 Illumina breaks. Genes called and annotated with PROKKA. (XLSX 19 kb)
Annotated genes in Bacteroidetes UKL13-3 Illumina breaks. Genes called and annotated with PROKKA. (XLSX 16 kb)
Percentage of protein-coding sequences from all bacterial genomes assigned to COG categories. Novel genomes are highlighted. (PDF 73 kb)
Predicted significant genes identified in the novel genomes. Identified by PGAP annotation, PROKKA annotation, or whole-genome alignment to reference genomes with Mauve. Locus names correspond with PGAP gene calls. (XLSX 36 kb)
About this article
Cite this article
Driscoll, C.B., Otten, T.G., Brown, N.M. et al. Towards long-read metagenomics: complete assembly of three novel genomes from bacteria dependent on a diazotrophic cyanobacterium in a freshwater lake co-culture. Stand in Genomic Sci 12, 9 (2017). https://doi.org/10.1186/s40793-017-0224-8