The impact of sequencing depth on the inferred taxonomic composition and AMR gene content of metagenomic samples

Background Shotgun metagenomics is increasingly used to characterise microbial communities, particularly for the investigation of antimicrobial resistance (AMR) in different animal and environmental contexts. There are many different approaches for inferring the taxonomic composition and AMR gene content of complex community samples from shotgun metagenomic data, but there has been little work establishing the optimum sequencing depth, data processing and analysis methods for these samples. In this study we used shotgun metagenomics and sequencing of cultured isolates from the same samples to address these issues. We sampled three potential environmental AMR gene reservoirs (pig caeca, river sediment, effluent) and sequenced samples with shotgun metagenomics at high depth (~ 200 million reads per sample). Alongside this, we cultured single-colony isolates of Enterobacteriaceae from the same samples and used hybrid sequencing (short- and long-reads) to create high-quality assemblies for comparison to the metagenomic data. To automate data processing, we developed an open-source software pipeline, ‘ResPipe’. Results Taxonomic profiling was much more stable to sequencing depth than AMR gene content. 1 million reads per sample was sufficient to achieve < 1% dissimilarity to the full taxonomic composition. However, at least 80 million reads per sample were required to recover the full richness of different AMR gene families present in the sample, and additional allelic diversity of AMR genes was still being discovered in effluent at 200 million reads per sample. Normalising the number of reads mapping to AMR genes using gene length and an exogenous spike of Thermus thermophilus DNA substantially changed the estimated gene abundance distributions. While the majority of genomic content from cultured isolates from effluent was recoverable using shotgun metagenomics, this was not the case for pig caeca or river sediment. Conclusions Sequencing depth and profiling method can critically affect the profiling of polymicrobial animal and environmental samples with shotgun metagenomics. Both sequencing of cultured isolates and shotgun metagenomics can recover substantial diversity that is not identified using the other methods. Particular consideration is required when inferring AMR gene content or presence by mapping metagenomic reads to a database. ResPipe, the open-source software pipeline we have developed, is freely available (https://gitlab.com/hsgweon/ResPipe).


BACKGROUND
Antimicrobial resistance (AMR) is a significant global health threat (1,2) and understanding the evolution, emergence and transmission of AMR genes requires a 'One Health' approach considering human, animal and environmental reservoirs (3). Methods for profiling species and AMR gene content in samples from these niches can be broadly categorised as either culture-dependent or culture-independent. Culture-dependent methods have the advantage of isolating individual strains for detailed analysis, but hugely underestimate species and AMR gene diversity. Culture-independent methods typically involve shotgun metagenomics, in which the DNA of a complete microbial community in a sample is extracted and sequenced, and the sequencing reads are used to estimate AMR gene and/or species distributions. The advantage of shotgun metagenomics is its relative lack of bias, but it tends to be less sensitive than targeted, culture-based or molecular approaches identifying specific drug-resistant isolates or AMR genes of interest (4)(5)(6).
Problems in characterising the epidemiology of AMR are exemplified by the Enterobacteriaceae family of bacteria. This family contains over 80 genera, and includes many common human and animal pathogens, such as Escherichia coli, that can also asymptomatically colonise human and animal gastrointestinal tracts, and are also found in environmental reservoirs (7).
The genetic diversity of some Enterobacteriaceae species is remarkable: in E.
coli, it has been estimated that only ~10% of the 18,000 orthologous gene families found in the pangenome are present in all strains (8). AMR in Enterobacteriaceae is mediated by >70 resistance gene families, and >2,000 known resistance gene variants have been catalogued (9,10). In addition to mutational resistance, AMR genes are also commonly shared both within and between species on mobile genetic elements such as insertion sequences, transposons and plasmids. Individuals have been shown to harbour multiple diverse AMR gene variants, strains and species of Enterobacteriaceae in their gastrointestinal tract (11,12), highlighting that single-colony subcultures do not recover the true AMR reservoir even within a small subsection of a microbial community.
Attempting to near-completely classify AMR gene and species diversity by any culture-based approach for raw faeces, effluent, and river sediment is therefore unlikely to be feasible; hence, the use of shotgun metagenomics to achieve this aim. However, the replicability of metagenomic surveys and the sequencing depth (reads per sample) required to analyse these sample types has not yet been explored in detail (13,14).
Motivated by the need to analyze large numbers of these samples in the REHAB study (http://modmedmicro.nsms.ox.ac.uk/rehab/), here we carried out a pilot study (Figure 1) to investigate: (i) the replicability of sequencing outputs using common DNA extraction and sequencing methods; and the impact of (ii) widely used taxonomic and AMR gene profiling approaches; (iii) sequencing depth on taxonomic and AMR gene profiles; and (iv) sequencing depth on the recoverability of genetic contents from isolates identified in the same samples using culture-based approaches.

Impact of sequencing depth on AMR profiles
Metagenomic sequencing produced approximately 200 million metagenomic 150bp paired-end reads per sample i.e. over 56 gigabases per sample (Supplementary Table 1), of which <0.05% of reads mapped with 100% identity to a known AMR-related sequence (see next section). The number of reads mapping to AMR gene families was largest in pig caeca (88,816 reads) and effluent (77,044 reads). Upstream sediment did not have enough AMRrelated reads for further analysis (49 reads).
The effluent sample had the highest total richness of both AMR gene families and AMR allelic variants ( Figure 2). Sequencing depth significantly affected the ability to evaluate richness of AMR gene families in effluent and pig caeca, which represent highly diverse microbial environments. The number of AMR gene families observed in effluent and pig caeca stabilized (see Methods: 'Rarefaction curves') at a sequencing depth of ~80 million reads per sample (depth required to achieve 95% of estimated total richness, !.!" : 72-127 million reads per sample). For AMR allelic variants in effluent, the richness did not appear to have plateaued even at a sequencing depth of 200 million reads per sample, suggesting the full allelic diversity was not captured ( !.!" : 193 million reads per sample).

Specific mapping to AMR genes and allelic variants
We exploited the hierarchical structure of the Comprehensive Antimicrobial Resistance Database (CARD) to assign reads to their respective AMR gene families and AMR allelic variants using a specific read mapping strategy i.e. to count only reads which mapped to a unique region of an allele or a gene family. In order to place a lower bound on the AMR diversity present, we adopted a stringent approach which counted only alignments with 100% sequence identity to CARD sequences. The resulting AMR gene family profiles differed significantly between the samples (Figure 3). The most abundant AMR gene families in effluent and pig caeca were "23S rRNA with mutations conferring resistance to macrolide" and "tetracycline-resistant ribosomal protection protein", respectively. There were 10,631 and 733 reads assigned to a "multiple gene family" category in the effluent and pig caeca, respectively. These represent reads that were mapped across multiple AMR gene families and therefore could not be uniquely assigned to any single family.
Reads that mapped to one specific AMR gene family but onto multiple allelic variants (i.e. could not be assigned to one specific allele) were classified as "multiple alleles". There was evidence of high allelic diversity, including among clinically relevant AMR gene families. For example, 47.7% of the reads mapped to the "OXA beta-lactamase" family could not be assigned to a specific allele (4,466 out of 9,357 reads; third-most abundant gene family by reads). Similarly, the most abundant gene family by reads in pig caeca was "tetracycline-resistant ribosomal protection protein", and 35.8% of the reads that mapped within this family could not be assigned to a specific allele (18,228 out of the 50,886 reads).

Impact of normalisation strategies on AMR allelic variant abundances
Normalising by gene length (see Methods: 'Normalisation of gene counts') had a profound effect on the distributions and the ranking order of AMR allelic variants in general ( Figure 4). Further normalisation by T. thermophilus reads did not affect the per sample distributions of AMR allelic variants, but it allowed more accurate comparison between samples by estimating absolute abundance of any given variant in the sample. The number of reads that mapped to T. thermophilus were similar between three samples, and this meant that the changes were small (i.e. a slight relative increase in the effluent compared to the pig caeca sample). While most of the alleles had lateral coverages between 90% and 100% in effluent and pig caeca samples ( Figure 3, right panels), "Moraxella catarrhalis 23S rRNA with mutation conferring resistance to macrolide antibiotics" had lateral coverage of 29% despite being one of the most abundant alleles in the effluent.

Impact of different assignment methods on taxonomic composition
While Centrifuge overall classified more reads than Kraken, both methods showed a similar trend of effluent having a greater proportion of reads classified as bacterial compared to upstream sediment, which had more than pig caeca (Figure 5a). Apart from Centrifuge classifying noticeably more Eukaryota and Viruses (0.7% and 0.05% respectively) than Kraken (0.09% and 0.01% respectively), a large proportion of reads from both methods were unclassified (70.0% and 83.3% for Centrifuge and Kraken respectively). The proportions of recoverable bacterial 16S rRNA fragments were low for all samples (0.16%, 0.23% and 0.04% for effluent, pig caeca and upstream sediment samples respectively), highlighting that shotgun metagenomics is an extremely inefficient method for obtaining 16S rRNA gene sequences.
The bacteria phylum-level classication (Figure 5b) showed structural differences among all three classification methods. The overall community structure and composition were more similar between Kraken and Centrifuge than the 'in silico 16S' approach (see Methods: 'Taxonomic profiling'). This was particularly apparent in the upstream sediment, where using 'in silico 16S' produced distinctively different community profiles from the other methods.
Kraken and Centrifuge classified between 377,675 to over 4 million reads as Enterobacteriaceae. Again, overall composition was similar between these two methods but showed some granularity in structure for pig caeca e.g. the relative abundances of Escherichia were 34.3% and 50.9%, and for Klebsiella 10.6% and 4.9%, for Centrifuge and Kraken respectively.

Impact of sequencing depth on genus-level richess and taxonomic profiles
Kraken and Centrifuge taxonomic profiles were highly stable to sequencing depth within samples. Comparing different sequencing depths within samples using Bray-Curtis dissimilarity showed that the relative taxonomic composition was highly robust to sequencing depth, with 1 million reads per sample already sufficient for <1% dissimilarity to the composition inferred from 200 million reads per sample (Supplementary Figure 1). This was true at both the genus and species level, even though all classification methods are known to have less precision and sensitivity at the species level (15,16). Intriguingly, the genus-level richness rapidly reached a plateau for all samples at ~1 million reads per sample (Figure 6a and Figure 6b), suggesting a database artifact (see 'Discussion').

Recovery of known genomic structures from cultured isolates using metagenomes
In order to assess how well shotgun metagenomics could recapitulate culturedependent diversity, we cultured seven Enterobacteriaeceae isolates (four from effluent, two from pig caeca, one from upstream sediment;   concluding that as few as 0.5 million reads per sample could recover broadscale taxonomic changes and species profiles at >0.05% relative abundance (14). In line with this, we find that 1 million reads per sample is already sufficient to accurately obtain taxonomic composition (at <1% dissimilarity to the 'true' composition at 200 million reads). However, even 200 million reads per sample is not enough to obtain the complete diversity of AMR genes in effluent. This is potentially concerning because environmental metagenomics studies often use sequencing depths of as little as ~10 million reads per sample (~3.6Gb). For pig caeca samples, 80 million reads per sample appears to be adequate for sampling all AMR gene families represented in CARD, but still not adequate for exhausting AMR allelic variants. Notably, we adopted the stringent criterion of a perfect (i.e. 100%) match to assign any given read to a reference AMR sequence. This strategy obviously reduces the risk of false positives, while increasing false negatives. Therefore, our results represent a conservative lower bound on the AMR diversity present in the samples we analysed.
An additional challenge of metagenomics analysis in the context of AMR is choosing a consistent strategy for 'counting' AMR genes, whether in terms of their presence or relative abundance, from mapped reads. It remains unclear what the best approach is for this problem. One option is to count all the reads which map to a reference gene; however, this means that reads are potentially counted multiple times when the reference gene shares homology with other genes in the database, or that counts may be underestimated if reads are randomly assigned to best reference matches. In addition, reads which map to a wildtype, non-resistant sequence may also be inadvertently and inappropriately counted. Another option is to use only reads which map to regions of a gene that are unique and not shared with other genes in the database (e.g. as in ShortBRED (20)). This is a more conservative approach, but may be inherently biased against closely-related genes in the database.
For example, CARD contains 14 sequences for bla NDM genes, which differ at less than 2% of their positions, so each gene individually has very few specific regions. Exploiting knowledge of the often complex genetic variation within AMR gene families is necessary to avoid erroneous conclusions regarding presence/absence. Inferred abundances of particular AMR genes are likely frequently contingent not only on mapping and counting strategies, but also on the particular genetic features of the AMR genes catalogued in the chosen reference database. Interpreting and comparing results across studies utilising different methods therefore becomes difficult.
Once the type of count data to be considered (in terms of number of reads mapping to a gene) has been chosen, a normalisation strategy is required to compare across genes and samples. We found that normalising by gene length changed the inferred abundance distributions of AMR genes across all the sample types studied, again with important implications for those studies that have not undertaken this kind of normalisation. We have also outlined a protocol to obtain a pseudo-absolute gene copy number of specific regions of AMR genes by normalising by both gene length and an exogenous spike of T.
thermophilus. While we do not claim that this accurately reflects the true abundance of individual genes, we believe it is useful for comparisons across samples within a study. In our study we took great care to ensure standardised DNA extraction and had small batches of samples; probably as a result, we obtained similar proportions of sequences of T. thermophilus for all samples (range: 0.067-0.082%), but this may not always be the case.
Appropriate normalisation using exogenous DNA spikes to account for some of the extraction biases could have potentially dramatic effects on results and their interpretation.
As well as examining normalised abundances, the lateral coverage of a gene is also an important metric to decide whether a certain allele is likely present in the sample. In effluent, the most abundant gene by specific read count was "Moraxella catarrhalis 23S rRNA with mutation conferring resistance to macrolide antibiotics". However, the gene only had 29% lateral coverage, and this result should therefore be interpreted cautiously. In fact, the high specific read count is probably because CARD only includes one Moraxella rRNA gene with an AMR mutation compared to twenty Escherichia rRNA genes; the lateral coverage suggests that the AMR allele is not in fact present. This underlines the importance of considering multiple metrics simultaneously.
Both taxonomic and AMR gene profiling outputs are clearly dependent on the species and AMR databases used as references. It should be additionally noted that for AMR gene profiling, some genes are variants of a wildtype which may differ by as little as a single SNP. Because short-read metagenomics typically surveys ≤150 bp fragments, even specific read counts can in fact plausibly be wildtypes rather than particular resistance variants. This can be overcome by adopting our stringent approach which requires an exact match (i.e. at 100%) to call a given variant in the database; although obviously this increases the rate of false negatives, we have shown that this strategy appears successful given adequate sequencing depth.
We found a reasonable consistency between taxonomic classification methods, but there were differences between Kraken and Centrifuge, and undoubtedly there would have been differences with other methods, had we tested them. This is a previously recognised issue (e.g. as in (21)) and has no single solution; methods are optimised for different purposes and perform differently depending on the combination of sample type, sequencing method, and reference database used. As the field changes so rapidly and newer methods become available, we strongly recommend that researchers with shotgun metagenomic data review excellent benchmarking efforts such as CAMI (21) and LEMMI (22) and assess the tools using a particular quantitative metric rather than making a (perhaps arbitrary) choice for their analysis. Investigating the robustness of conclusions to choice of method is also a recommended step (23,24).
Remarkably, there were no 'unique genera' at high sequencing depth: reads assigned to all genera were present in all three sample types at high depth.
We believe this is an artifact due to the limited number of genomes available in the species database used for the assignment methods. The RefSeq database contains complete genomes for 11,443 strains, but these represent only 1,065 genera. Our samples almost exhausted the entire genus space: the number of genera that were classified by Centrifuge was 1,036, and this number was the same for the effluent, pig caeca and upstream sediment samples, i.e. all three samples had the same number of total unique genera observed at 200 million reads depth. This was the same with Kraken, which classified 1,035 genera in total and there was no difference in richness between the three samples. This highlights the importance of using diversity measures which take into account the relative abundance of taxa rather than just their presence or absence.
We also found that a large number of reads (>50%) were unclassified by either Kraken or Centrifuge. The absence of organisms such as fungi from our reference database could have played a role in this, but other studies of effluent have also found that between 42% and 68% of short metagenomic reads cannot be assigned to any reference sequence (25)(26)(27). Our focus was on using the best available tools to assess the bacterial composition of samples; understanding what this unassigned microbial 'dark matter' represents was beyond the scope of this study, but would be valuable future work.
Our analyses confirm that using culture-based methods offered complementary and additional information to shotgun metagenomics. By mapping metagenomic reads back to high-quality hybrid assemblies obtained via culture, we found that the majority of isolates from effluent were recoverable by metagenomics at sequencing depths. However, the majority of isolates from pig caeca were not recovered. These results exemplify the need for exploring both methods in analysing AMR genes and microbial communities, as both show different perspectives on the AMR profiles and strains present in a given sample.

CONCLUSIONS
In summary, we have used a combination of deep metagenomic sequencing, hybrid assembly of cultured isolates, and taxonomic and AMR gene profiling methods to perform a detailed exploration of methodological approaches to characterise animal and environmental metagenomic samples. Sequencing depth critically affects the inferred AMR gene content and taxonomic diversity of complex, polymicrobial samples, and even 200 million reads per sample was insufficient to capture total AMR allelic diversity in effluent. Choice of taxonomic profiler can result in significant differences in inferred species composition.
The open-source software pipeline we have developed is freely available as 'ResPipe'. As well as packaging existing tools, ResPipe provides detailed information on various metrics that are useful for assessing AMR gene abundances, including: a novel normalisation technique for read counts, specific mapping counts, and lateral coverage, all of which can provide different but important insights. There is undoubtedly vast diversity present in microbial communities. Establishing best practices and pipelines for analysing this diversity with shotgun metagenomics is crucial to appropriately assess AMR in environmental, animal and human faecal samples.

Sample types and settings
We sampled three distinct potential AMR reservoirs, namely: (i) pooled pig caecal contents from 10 pigs from a breeder farm in Yorkshire and the Humber (denoted as "pig caeca"); (ii) river sediment 100m upstream of a sewage treatment works (STW) at Cholsey STW, Cholsey, Oxfordshire ("upstream sediment"); and (iii) treated sewage effluent emitted from Cholsey STW ("effluent"). Cholsey STW is a plant that serves a population equivalent of ~21,000 with a consented flow of 3200 m 3 /day; processes include primary settlement tanks, followed by biological disc filters and humus tanks, and subsequently disc filtration. These sample types were chosen to represent a spectrum of predicted diversity of microbial communities (i.e. high to low: effluent, pig caeca, upstream sediment).
The pooled pig caeca had been collected as part of a separate study surveying the presence of AMR genes in E. coli in pigs from 56 farms across the UK (28). In brief, caecal contents were sampled from 10 randomly  Table 4).

Sequencing depth subsampling and quality filtering
In order to simulate the effect of sequencing at different depths, each set of   Figure 2b), although there were apparent discrepancies between methods which we did not investigate further.

AMR gene profiling
The quality filtered reads were mapped with bbmapskimmer.sh (BBMap suite; Although we use a mapping-based approach from shotgun metagenomic reads, we have included this information in ResPipe. For simplicity, we designate "protein homolog" model genes and "non protein homolog" model genes under the broad headings "resistance by presence" and "resistance by variation", respectively (where "variation" can encompass SNPs, knockout, or overexpression). The BAM files generated by the mapping were processed by a custom script to generate a count table where only alignments with a strict 100% sequence identity (without allowing any deletions or insertions) to CARD sequences were counted. Where a read mapped to more than one AMR gene family or an AMR allelic variant (i.e. could not be designated into any one AMR gene family or AMR allelic variant) it was counted as "multiple families" or "multiple alleles" respectively. For each AMR allelic variant, we calculated "lateral coverage", defined as the proportion of the gene covered by at least a single base of mapped reads. Where reads mapped to multiple families or alleles, lateral coverage could not be calculated.

Rarefaction curves
For fitting the relationship between sequencing depth per sample and the richness of AMR gene families or allelic variants, we used the species accumulation model defined by Clench (40): = !×! !!!×! . This model may be flawed, but is only used here to give a rough estimate of the sequencing depth required to achieve a proportion of (e.g. 95%) of the total richness, which is

Normalisation of gene counts
Assuming random sequencing, longer genes are more likely to be represented in reads. In order to alleviate this gene length bias, the resulting These normalisation protocols are intended to produce a pseudo-absolute gene copy number of each AMR gene family and AMR allelic variant, while recognising that this remains an estimated of the actual copy number of genes present in any given sample.   Table 2).

Hybrid assembly for isolates
We assembled genomes for isolates using a version of a pipeline we had  Table 3). The majority of plasmids were found in effluent isolates (24/29). We checked MALDI-ToF species identification with mlst (v2.15.1; (32)) and found agreement (Supplementary Table 2).

Mapping of metagenomic sequences onto isolates
To investigate the feasibility of accurately identifiying genetic structures (chromosomes and plasmids) in the metagenomic reads in relation to the impact of sequencing depth, we used the assembled chromosomes and plasmids derived from the cultured and sequenced isolates as reference genomes (in silico genomic "probes") to which the metagenomic short reads were mapped. We used the same mapping protocol used for the aforementioned functional profiling and lateral coverage was calculated for each chromosome/plasmid at any given sequencing depth.

Implementation into a Nextflow pipeline
The entire workflow (both taxonomic and AMR gene profiling) has been implemented into a Nextflow (41) pipeline complying with POSIX standards, written in Python: ResPipe (https://gitlab.com/hsgweon/ResPipe). All analyses were performed on a compute cluster hosted by the NERC Centre for Ecology and Hydrology, Wallingford, UK, with 50 compute nodes, each with a total of 1TB of RAM.

Statistical analyses
We assessed differences in taxonomic and AMR gene profiles between replicates and sequencing lanes by calculating Bray-Curtis dissimilarities, which quantify compositional differences based on relative abundances.
These were then used to perform permutational multivariate analysis of variance tests (PERMANOVA) using the vegan package (v.2.4-1; (42)). A ttest from R base package (43) was performed to assess the differences in richness between subsampled groups of consecutive sequencing depths.

Conflict of interests.
The authors declare they have no conflict of interest. Figure 1. Schematic overview of the study. For each sample, we used both a metagenomics and culture-based approach. We developed a software pipeline ('ResPipe') for the metagenomic data. For more details on each step of the workflow, see Methods.