Table 1 Datasets used in this study

From: Accurate prediction of metagenome-assembled genome completeness by MAGISTA, a random forest model built on alignment-free intra-bin statistics

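| Dataset | Nameᵃ | Complexityᵇ | Input materialᶜ | Sequencing output | Read sourceᵈ | Assembly toolᵉ | Binning method | Binning parametersᶠ |
|---|---|---|---|---|---|---|---|---|
| Training | HC227_Cc | 227 | gDNA, evenly | 2 × 150 bp PE, total: 60 Gb | ERS5705986 | SPAdes | CONCOCT | comp |
| Training | HC227_Ccc | 227 | gDNA, evenly | 2 × 150 bp PE, total: 60 Gb | ERS5705986 | SPAdes | CONCOCT | comp + cov |
| Training | HC227_Xcc | 227 | gDNA, evenly | 2 × 150 bp PE, total: 60 Gb | ERS5705986 | SPAdes | MaxBin | comp + cov |
| Training | HC227_Mc | 227 | gDNA, evenly | 2 × 150 bp PE, total: 60 Gb | ERS5705986 | SPAdes | MetaBAT2 | comp |
| Training | HC227_Mcc | 227 | gDNA, evenly | 2 × 150 bp PE, total: 60 Gb | ERS5705986 | SPAdes | MetaBAT2 | comp + cov |
| Test | BMock12_Mc | 12 | gDNA, unevenly | 2 × 150 bp PE, total: 64 Gb | SRR8073716 | SPAdes | MetaBAT2 | comp |
| Test | BMock12_Mcc | 12 | gDNA, unevenly | 2 × 150 bp PE, total: 64 Gb | SRR8073716 | SPAdes | MetaBAT2 | comp + cov |
| Test | Rinke_Mc | 54 | gDNA, evenly | 2 × 150 bp PE, total: 13 Gb | Rinke et al. [31]ᵇ | SPAdes | MetaBAT2 | comp |
| Test | Rinke_Mcc | 54 | gDNA, evenly | 2 × 150 bp PE, total: 13 Gb | Rinke et al. [31]ᵇ | SPAdes | MetaBAT2 | comp + cov |
| Test | MBARC-26_Mc | 26 | gDNA, unevenly | 2 × 150 bp PE, total: 51.9 Gb | SRR3656745 | SPAdes | MetaBAT2 | comp |
| Test | MBARC-26_Mcc | 26 | gDNA, unevenly | 2 × 150 bp PE, total: 51.9 Gb | SRR3656745 | SPAdes | MetaBAT2 | comp + cov |
| Test | ZymoCS_Mc | 10 | gDNA, evenly | 2 × 150 bp PE, total: 3 Gb | ERR2984773 | SPAdes | MetaBAT2 | comp |
| Test | ZymoCS_Mcc | 10 | gDNA, evenly | 2 × 150 bp PE, total: 3 Gb | ERR2984773 | SPAdes | MetaBAT2 | comp + cov |
| Test | Quince_Mc | 210 | Simulated reads, unevenly | 2 × 150 bp PE, total: 180 Gb | Quince et al. [33]ᵇ | MEGAHIT | MetaBAT2 | comp |
| Test | Quince_Mcc | 210 | Simulated reads, unevenly | 2 × 150 bp PE, total: 180 Gb | Quince et al. [33]ᵇ | MEGAHIT | MetaBAT2 | comp + cov |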

  1. ᵃ The letter code after the underscore refers to the binning method (upper case) and the binning parameters (lower case)
  2. ᵇ Number of strains in the mock
  3. ᶜ gDNA: genomic DNA; (un)evenly specifies the distribution of the individual inputs
  4. ᵈ SRR: Sequence Read Archive accession number; ERR: European Nucleotide Archive accession number
  5. ᵉ SPAdes version 3.14. For MEGAHIT, assemblies were provided with the publication
  6. ᶠ comp, composition; cov, coverage
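
The naming scheme described in footnote a can be decoded mechanically. The sketch below is illustrative only and is not part of MAGISTA: the function `decode_name` and the two lookup dictionaries are hypothetical names, and the letter-to-tool mapping (C, X, M) is inferred from the rows of Table 1 rather than stated explicitly in the footnotes.

```python
import re

# Mapping inferred from the rows of Table 1 (an assumption, not an official key).
BINNERS = {"C": "CONCOCT", "X": "MaxBin", "M": "MetaBAT2"}
PARAMS = {"c": "comp", "cc": "comp + cov"}


def decode_name(name):
    """Split a dataset name such as 'HC227_Mcc' into (source, binner, parameters)."""
    # Source name, underscore, one upper-case letter, then 'c' or 'cc'.
    match = re.fullmatch(r"(.+)_([A-Z])(c{1,2})", name)
    if match is None:
        raise ValueError(f"unrecognised dataset name: {name!r}")
    source, binner_code, param_code = match.groups()
    if binner_code not in BINNERS:
        raise ValueError(f"unknown binning method code: {binner_code!r}")
    return source, BINNERS[binner_code], PARAMS[param_code]


if __name__ == "__main__":
    for example in ("HC227_Cc", "HC227_Xcc", "MBARC-26_Mcc"):
        print(example, "->", decode_name(example))
```

Names whose suffix does not follow this pattern raise a `ValueError` instead of being guessed.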