Sample types and settings
We sampled three distinct potential AMR reservoirs, namely: (i) pooled pig caecal contents from 10 pigs from a breeder farm in Yorkshire and the Humber (denoted as “pig caeca”); (ii) river sediment 100 m upstream of a sewage treatment works (STW) at Cholsey STW, Cholsey, Oxfordshire (“upstream sediment”); and (iii) treated sewage effluent emitted from Cholsey STW (“effluent”). Cholsey STW is a plant that serves a population equivalent of ~ 21,000 with a consented flow of 3200 m3/day; processes include primary settlement tanks, followed by biological disc filters and humus tanks, and subsequently disc filtration. These sample types were chosen to represent a spectrum of predicted diversity of microbial communities (i.e. high to low: effluent, pig caeca, upstream sediment).
The pooled pig caeca had been collected as part of a separate study surveying the presence of AMR genes in E. coli in pigs from 56 farms across the UK [28]. In brief, caecal contents were sampled from 10 randomly selected healthy finishing pigs from each of the farms at 12 different abattoirs (March 2014–October 2015), and suspended in 22.5 mL of PBS (processing within 24 h of collection). Aliquots of 100 μL were frozen at − 80 °C. This study used an aliquot of pooled pig caeca selected randomly from this collection.
For effluent and upstream sediment samples, sterile Whirl-pack™ bags were attached to extendable sampling arms and placed into flow at the relevant site. Samples in the bags were stirred with sterile spoons, and 5 mLs added to a sterile 50 mL centrifuge tube. This process was repeated five times to create a composite sample of approximately 25 mL. Samples were stored in a cool box at 4 °C for transport and processed within 24 h.
Metagenomic DNA extractions and Thermus spike-in
Metagenomic extractions on all samples were performed using the MoBio PowerSoil® DNA Isolation Kit (Qiagen, Venlo, Netherlands), as per the manufacturer’s protocol, and including a beadbeating step of two 40 s cycles at 6 m/s in lysing matrix E. 12.5 ng of naked Thermus thermophilus DNA (reference strain HB27, Collection number ATCC BAA-163, ordered from DSMZ, Germany) was added to each sample in the PowerBead tube at the start of the experiment, prior to the addition of Solution C1 of the DNA Isolation Kit. The rationale for this was to enable subsequent normalisation to the number of T. thermophilus genomes sequenced to adjust for varying amounts of sample input, and extraction bias [29] (see ‘Normalisation of gene counts’, below).
Metagenomic sequencing
Pooled libraries of all DNA extracts were sequenced across four lanes of an Illumina HiSeq 4000 platform, generating a median of 102,787,432,150 bp paired-end reads (30.8 Gb) of data per extract. For the samples extracted in replicate, we therefore had a median of 202,579,676 paired-end reads (60.7 Gb) of data available for evaluation and sub-sampling analyses (Additional file 3: Table S1). To confirm replicability of our extraction method on the same sample, duplicate extractions of all three samples were performed. To test replicability of sequencing, pooled libraries derived from extracts were each sequenced across four sequencing lanes. The sequences were pooled into each sample resulting in 202,579,676, 215,047,930 and 198,865,221 reads for the effluent, pig caeca and upstream sediment respectively. The effluent and pig caeca samples were both randomly subsampled down to 200 million reads per sample for downstream analysis.
Analysis of both AMR gene profiles and taxonomic profiles for the same extract pooled across multiple sequencing lanes (HiSeq) were highly reproducible, with little evidence of differences across lanes, although there was a significant difference between replicates of AMR gene profiles from pooled pig caeca (p = 0.03), and replicates of taxonomic profiles for upstream sediment (p = 0.03) (Additional file 6: Table S4).
Sequencing depth subsampling and quality filtering
In order to simulate the effect of sequencing at different depths, each set of pooled reads from the three samples was repeatedly subsampled (n = 10) using VSEARCH (fastx_subsampling, [30]) into the following set of depth intervals: 1 M, 2 M, 4 M, 6 M, 7 M, 8 M, 9 M, 10 M, 20 M, 40 M, 60 M, 80 M, 100 M, 120 M, 140 M, 160 M and 180 M. Low-quality portions of all reads were trimmed using TrimGalore (v.0.4.4_dev, [31]). Specifically, we used a length cut-off of 75 bp and average Phred score ≥ 25, and the first 13 bp of Illumina standard adapters (AGATCGGAAGAGC) for adapter trimming.
Taxonomic profiling
For profiling the abundance of bacterial species, the reads were classified with Kraken (v.1.1, default settings [16];) and Centrifuge (v.1.0.4, default settings [15];), which were chosen based on recency and reported frequency of use in the literature. RefSeq sequences (v.91 [32];) at a “Complete genome” assembly level for bacteria (11,443 strains), archaea (275 strains), viral (7,855 strains) and human were downloaded from the NCBI repositories and used to build two sets of indexed databases for both Kraken and Centrifuge using respective scripts provided by each classifier. An ‘in silico 16S’ marker-gene based classification was performed by extracting 16S rRNA genes from the reads using METAXA2 [4] followed by taxonomic assignment with the naïve Bayesian RDP classifier (v2.10 [33];) with a minimum confidence of 0.5 against the GreenGenes database (v.13.5 [34];).
To validate the taxonomic profiling component of our pipeline, we analyzed ten previously simulated gut metagenomes (GI tract data from “2nd CAMI Toy Human Microbiome Project Dataset”, https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_Gastrointestinal_tract) produced for benchmarking as part of CAMI [21]. Comparing to the ground truth of the simulated composition, using either Centrifuge or Kraken recovered the major features of the taxonomic composition (Additional file 1: Figure S1a) with high correlation between simulated and inferred species abundances (Additional file 1: Figure S1b), although there were apparent discrepancies between methods which we did not investigate further.
AMR gene profiling
The quality filtered reads were mapped with bbmapskimmer.sh (BBMap suite [35];) with default settings against sequences from the Comprehensive Antibiotic Resistance Database (CARD, v.3.0.0, [10]) and the genome sequence of T. thermophilus which was spiked into the samples. At the time of writing, CARD contained 2439 AMR sequences. As CARD is primarily designed for genomic data, each sequence has an associated ‘model’ of detection i.e. criteria determining matches to the CARD reference sequences for any given query sequence. The chief distinction is between genes that have a “protein homolog” model, where detection is assessed using a BLASTP cut-off to find functional homologs (n = 2238; e.g. NDM-1 beta-lactamase), and those with a “non protein homolog” model, where detection is assessed using other methods including the locations of specific SNPs (n = 247; e.g. M. tuberculosis gyrA conferring resistance to fluoroquinolones). Although we use a mapping-based approach from shotgun metagenomic reads, we have included this information in ResPipe. For simplicity, we designate “protein homolog” model genes and “non protein homolog” model genes under the broad headings “resistance by presence” and “resistance by variation”, respectively (where “variation” can encompass SNPs, knockout, or overexpression). The BAM files generated by the mapping were processed by a custom script to generate a count table where only alignments with a strict 100% sequence identity (without allowing any deletions or insertions) to CARD sequences were counted. Where a read mapped to more than one AMR gene family or an AMR allelic variant (i.e. could not be designated into any one AMR gene family or AMR allelic variant) it was counted as “multiple families” or “multiple alleles” respectively. For each AMR allelic variant, we calculated “lateral coverage”, defined as the proportion of the gene covered by at least a single base of mapped reads. Where reads mapped to multiple families or alleles, lateral coverage could not be calculated.
Rarefaction curves
For fitting the relationship between sequencing depth per sample d and the richness r of AMR gene families or allelic variants, we used the species accumulation model defined by Clench [36]: \( r(d)=\frac{a\times d}{1+b\times d} \). This model may be flawed, but is only used here to give a rough estimate of the sequencing depth required to achieve a proportion of q (e.g. 95%) of the total richness, which is then \( {d}_q=\frac{q}{b\times \left(1-q\right)} \).
Normalisation of gene counts
Assuming random sequencing, longer genes are more likely to be represented in reads. In order to alleviate this gene length bias, the resulting table was adjusted by multiplying each count by the average length of mapped reads followed by dividing by the length of the AMR allelic variant to which the reads were mapped. Where there were multiple alleles, average length was used. In order to adjust for varying amounts of sample input and extraction bias, the table was further normalised to the number of reads that mapped to T. thermophilus using an adopted protocol from Satinsky et al. [29]. We added 12.5 ng of Thermus thermophilus to each sample. This corresponds to adding 6,025,538 copies of the T. thermophilus genome. The size of the T. thermophilus genome is 1,921,946 bases, so the number of bases of T. thermophilus added is \( {N}_{TT}^{\mathrm{added}} \) = 6,025,538 × 1,921,946. To obtain the number of bases of T. thermophilus recovered by sequencing (\( {N}_{TT}^{\mathrm{recovered}} \)), we take the number of reads assigned to T. thermophilus and multiply it by the insert size (300 bp). The read count Ng for a particular subject g (e.g. a gene family or allelic variant) can then be normalised as:
$$ {\overset{\sim }{N}}_g={N}_g\times \left({N}_{TT}^{\mathrm{added}}\div {N}_{TT}^{\mathrm{recovered}}\right) $$
These normalisation protocols are intended to produce a pseudo-absolute gene copy number of each AMR gene family and AMR allelic variant, while recognising that this remains an estimated of the actual copy number of genes present in any given sample.
Isolate culture and DNA extraction
For effluent samples, the effluent filter was mixed with 20 mL of nutrient broth and shaken for 10 mins at 120 rpm. 100 μL of neat sample, and 10− 1 and 10− 2 dilutions (in nutrient broth) were plated onto a CHROMagar Orientation agar supplemented with a 10 μg cefpodoxime disc placed on one half of the agar plate. For pig caeca and upstream sediment samples, aliquots of 100 μL of sample at neat, 10− 1, 10− 2, and 10− 3-fold dilutions were plated onto a CHROMagar Orientation agar supplemented supplemented with a 10 μg cefpodoxime disc placed on one half of the agar plate. Serial dilutions were plated to enable morphological identification and isolation of individual colonies. All plates were incubated at 37 °C for 18 h. We used cefpodoxime resistance as a surrogate marker for the selective culture of multi-drug-resistant Enterobacteriaceae [37, 38].
Up to four individual colonies from each sample with a typical appearance for E. coli, Klebsiella spp., Enterobacter spp. or Citrobacter spp., and from either within or external to the cefpdoxime zone, were subcultured on MacConkey agar with or without cefpodoxime discs, respectively. Following sub-culture, species was confirmed by MALDI-ToF (Bruker), and stored in nutrient broth + 10% glycerol at − 80 °C prior to repeat sub-culture for DNA extraction.
DNA was extracted from pure sub-cultures using the Qiagen Genomic tip/100G (Qiagen, Venlo, Netherlands), according to the manufacturer’s instructions. Extracts from seven isolates (four from effluent, two from pig caeca, and one from upstream sediment) were selected for combination long-read (Pacific Biosciences) and short-read sequencing, based on sufficient DNA yield (with a requirement at the time of the study for ~ 5 μg DNA for library preparation), and appropriate fragment size distributions (assessed using TapeStation 4200, Agilent, Santa Clara, USA). These isolates were identified using MALDI-ToF as Citrobacter freundii (two isolates), Enterobacter kobei/cloacae (three isolates), and E. coli (two isolates) (Table 1).
Isolate sequencing
Aliquots of the same DNA extract were sequenced by two methods: short-read (Illumina), and long-read (Pacific BioSciences). For Illumina sequencing, extracts were sequenced on the HiSeq 4000 platform. Libraries were constructed using the NEBNext Ultra DNA Sample Prep Master Mix Kit (NEB), with minor modifications and a custom automated protocol on a Biomek FX (Beckman). Sequenced reads were 150 bp paired-end, with a median of 1,355,833 reads per isolate (range: 1.06–1.66 million) after read correction with SPAdes (Additional file 4: Table S2), corresponding to a chromosomal coverage per isolate of ~30X with a insert size of 300 bp.
To generate long-read data from the same DNA extract for any given isolate, we used single molecule real-time sequencing using the PacBio RSII. Briefly, DNA library preparation was performed according to the manufacturer’s instructions (P5-C3 sequencing enzyme and chemistry, respectively see Supplementary Material of Sheppard et al. [39]). After read correction and trimming, there were a median of 14,189 reads per isolate (range: 12,162-17,523) with a median read length of 13,146 bp (range: 10,106-14,991) (Additional file 4: Table S2).
Hybrid assembly for isolates
We assembled genomes for isolates using a version of a pipeline we had previously developed and validated against multiple Enterobacteriaceae genomes including two reference strains (De Maio, Shaw et al. 2019). In brief, we corrected Illumina reads with SPAdes (v3.10.1) and corrected and trimmed PacBio reads with Canu (v1.5), then performed hybrid assembly using Unicycler (v0.4.0) with Pilon (v1.22) without correction, with a minimum component size of 500 and a minimum dead end size of 500. Out of 35 total contigs across seven isolates, 28 were circularised (78%), including two chromosomes and 24 plasmids. Normalised depths of plasmids ranged from 0.6–102.6x relative to chromosomal depth, and lengths between 2.2–162.9 kb (Additional file 5: Table S3). The majority of plasmids were found in effluent isolates (24/29). We checked MALDI-ToF species identification with mlst (v2.15.1 [40];) and found agreement (Additional file 4: Table S2).
Mapping of metagenomic sequences onto isolates
To investigate the feasibility of accurately identifiying genetic structures (chromosomes and plasmids) in the metagenomic reads in relation to the impact of sequencing depth, we used the assembled chromosomes and plasmids derived from the cultured and sequenced isolates as reference genomes (in silico genomic “probes”) to which the metagenomic short reads were mapped. We used the same mapping protocol used for the aforementioned AMR gene profiling and lateral coverage was calculated for each chromosome/plasmid at any given sequencing depth.
Implementation into a Nextflow pipeline
The entire workflow (both taxonomic and AMR gene profiling) has been implemented into a Nextflow [41] pipeline complying with POSIX standards, written in Python: ResPipe (https://gitlab.com/hsgweon/ResPipe). All analyses were performed on a compute cluster hosted by the NERC Centre for Ecology and Hydrology, Wallingford, UK, with 50 compute nodes, each with a total of 1 TB of RAM.
Statistical analyses
We assessed differences in taxonomic and AMR gene profiles between replicates and sequencing lanes by calculating Bray-Curtis dissimilarities, which quantify compositional differences based on relative abundances. These were then used to perform permutational multivariate analysis of variance tests (PERMANOVA) using the vegan package (v.2.4–1 [42];). A t-test from R base package [43] was performed to assess the differences in richness between subsampled groups of consecutive sequencing depths. Figures were produced using ggplot2 [44].