Complete genome sequence of Thermotoga sp. strain RQ7

Thermotoga sp. strain RQ7 is a member of the family Thermotogaceae in the order Thermotogales. It is a Gram-negative, hyperthermophilic, and strictly anaerobic bacterium. It grows on diverse simple and complex carbohydrates and can use protons as the final electron acceptor. Its complete genome is composed of a chromosome of 1,851,618 bp and a plasmid of 846 bp. The chromosome contains 1906 putative genes, including 1853 protein-coding genes and 53 RNA genes. The genetic features pertaining to various lateral gene transfer mechanisms are analyzed. The genome carries a complete set of putative competence genes and 8 CRISPR loci, and it lacks a well-conserved Type II R-M system.


Background
Thermotoga species are a group of thermophilic or hyperthermophilic bacteria that can ferment a wide range of carbohydrates and produce hydrogen gas as one of the major final products [1,2]. Their hydrogen yield from glucose can reach the theoretical maximum of 4 mol of H2 per mole of glucose [2,3], which makes them ideal candidates for biofuel production. Because their enzymes are thermostable by nature, they also hold great promise in the biocatalyst sector. 16S rRNA gene sequence analyses place Thermotoga at a deep branch in the tree of life, and genomic studies reveal extensive horizontal gene transfer events between Thermotogales and other groups, particularly Archaea and Firmicutes [4]. Controversy over the phylogenetic significance of Thermotoga has triggered a prolonged debate on concepts such as species and biogeography [5].
We have been interested in the genetics of Thermotoga over the years and have developed the earliest set of tools to genetically modify these bacteria [6-8]. Strain RQ7 plays an essential role in these studies. This strain possesses the smallest known plasmid, pRQ7 (846 bp) [9], which is absent from most Thermotoga strains and serves as the base vector for all Thermotoga-E. coli shuttle vectors developed so far. T. sp. strain RQ7 is also the first Thermotoga strain in which natural competence was discovered [7]. To gain insights into the genetic and genomic features of the strain and to facilitate the continuing effort to develop genetic tools for Thermotoga, we set out to sequence the whole genome of T. sp. strain RQ7.

Organism information
Classification and features
T. sp. strain RQ7 was isolated from marine sediments of Ribeira Quente, Azores [1]. The strain is a member of the genus Thermotoga, the family Thermotogaceae, and the order Thermotogales (Table 1). Based on 16S rRNA gene sequences, the closest relative of T. sp. strain RQ7 is T. neapolitana DSM 4359, and these two strains cluster with T. maritima MSB8 and T. sp. strain RQ2 (Fig. 1). The results are in agreement with previous reports [10].
Like its close relatives T. neapolitana DSM 4359 and T. maritima MSB8, T. sp. strain RQ7 is a strict anaerobe, growing best around 80°C, utilizing both simple and complex sugars, and producing hydrogen gas. These bacteria grow in both rich and defined media, are free living and non-pathogenic to humans, animals, or plants. Cells are rod-shaped, about 0.5 to 2 μm in length and 0.4 to 0.5 μm in diameter (Fig. 2). The most distinctive feature of Thermotoga cells is the "toga" structure that balloons out from both ends of the rod [1,11], an extension of their outer membrane [12].

Table 1 (classification rows)
Phylum: Thermotogae, TAS [38,39]
Class: Thermotogae, TAS [39,40]
Order: Thermotogales, TAS [39,41]
Family: Thermotogaceae, TAS [39,42]
Genus: ...
Evidence codes: IDA, Inferred from Direct Assay; TAS, Traceable Author Statement (i.e., a direct report exists in the literature); NAS, Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence); IGC, Inferred from Genomic Content (i.e., average nucleotide identity, syntenic regions). These evidence codes are from the Gene Ontology project [49].

Fig. 1 ... sp. strain RQ7 relative to other species within the order Thermotogales. Only species with complete genome sequences are included. The tree was built with 16S rRNA gene sequences, using the Neighbor-Joining method with MEGA7 [50]. Fervidobacterium nodosum serves as the outgroup.

Genome project history
The project started in June 2011, and the genome was sequenced by BGI Americas (Cambridge, MA) using Illumina technology. A total of 400 Mb of clean data were generated, covering the genome more than 200-fold. The assembled scaffold covered 97.7% of the chromosome. PCR and Sanger sequencing were later used for gap filling. The assembly was finalized in February 2014, and the complete sequence was submitted to GenBank in April 2014. The sequence was annotated with the NCBI Prokaryotic Genome Annotation Pipeline [13] and the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4) [14]. The project information is summarized in Table 2.
Growth conditions and genomic DNA preparation
T. sp. strain RQ7 was kindly provided by Drs. Harald Huber and Robert Huber at the University of Regensburg, Germany. It was cultivated in SVO medium [15] at 77°C, and its genomic DNA was extracted with a standard phenol extraction method [16]. Briefly, cells from 250 ml of overnight culture were collected by centrifugation and resuspended in 10 ml of STE solution (10 mM Tris-HCl, 1 mM EDTA, 100 mM NaCl, pH 8.0). SDS and proteinase K were added to final concentrations of 1% (w/v) and 20 μg/ml, respectively. The mixture was incubated at 50°C for 6 h, followed by the addition of an equal volume of phenol/chloroform/isoamyl alcohol (25:24:1, v/v/v). After gentle mixing, the mixture was centrifuged at 12,000 g at 4°C for 15 min. The upper aqueous layer was transferred to a clean tube and mixed with 1/10 volume of 3 M sodium acetate (pH 5.5) and 2 volumes of ice-cold 95% (v/v) ethanol. The DNA was spooled out with a glass rod, washed with 70% (v/v) ethanol, air dried, dissolved in 2 ml of TE buffer (10 mM Tris-HCl, 1 mM EDTA, pH 8.0) containing 20 μg/ml RNase A, and stored at −20°C.

Genome sequencing and assembly
The genome of T. sp. strain RQ7 was mainly sequenced by BGI Americas using the Illumina HiSeq 2000 sequencing platform. Three paired-end libraries, with insert sizes of 500, 2000, and 5000 bp, were constructed. The raw data were filtered by a quality control step, yielding 400 Mb of clean data, which corresponds to a coverage of more than 200-fold. The reads were assembled with SOAPdenovo [17] and polished with SOAPaligner [18]. This resulted in a single scaffold of 1,822,593 bp that covered 97.7% of the genome and contained 28 gaps. The gap-filling efforts included the integration of this scaffold with contigs generated by the CLC Genomics Workbench [19] and a small amount of public sequences in GenBank. GapFish [20] was then used to resolve a dozen ambiguous regions. Finally, PCR and primer walking were performed to close the remaining gaps, resulting in a final assembly of 1,851,618 bp. The entire assembly process integrated wet lab methods with in silico approaches, and the programs used included public software (SOAPdenovo and SOAPaligner [17,18]), a commercial product (CLC Genomics Workbench [19]), and an in-house program, GapFish [20]. Details of the assembly process are described in our previous report [20].
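The coverage figure quoted above follows from dividing the amount of clean sequence data by the genome length. The minimal sketch below reproduces that arithmetic with the two numbers given in the text; the function name `fold_coverage` is introduced here for illustration only and is not part of any tool used in the project.

```python
def fold_coverage(total_bases: int, genome_size: int) -> float:
    """Average sequencing depth: sequenced bases divided by genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


CLEAN_DATA_BP = 400_000_000    # 400 Mb of clean data (from the text)
FINAL_ASSEMBLY_BP = 1_851_618  # final chromosome assembly (from the text)

depth = fold_coverage(CLEAN_DATA_BP, FINAL_ASSEMBLY_BP)
print(f"Estimated coverage: {depth:.0f}-fold")  # about 216-fold
```

The result, roughly 216-fold, is consistent with the "more than 200-fold" stated in the text.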

Genome annotation
The genome was independently annotated by two pipelines, the NCBI Prokaryotic Genome Annotation Pipeline (PGAAP) [13] and the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4) [14]. Both pipelines combine a gene-calling algorithm with a similarity-based gene detection approach, although the algorithms and databases they use differ. For example, PGAAP uses GeneMarkS+ for de novo gene prediction, while MGAP uses Prodigal. Consequently, the two pipelines produced slightly different annotation results. The analyses in this report took the results from both pipelines into consideration and were assisted by manual curation.

Genome properties
The genome of T. sp. strain RQ7 is composed of a circular chromosome of 1,851,618 bp with a GC content of 47.05% and a single mini-plasmid of 846 bp with a GC content of 39.95% (Fig. 3; Table 3). The plasmid pRQ7 has been characterized [9] and sequenced [6,21] before. According to the MGAP annotation, the chromosome carries 1906 putative genes, of which 1853 are protein-coding genes and 53 are RNA genes (Table 4). Among all the genes that are assigned to a COG category (Table 5), a significant portion (~12%, 191 genes) is devoted to carbohydrate utilization, which is typical of Thermotoga strains and accords with their versatile use of carbon and energy sources.

Insights from the genome sequence
The chromosomal sequence of T. sp. strain RQ7 was compared to those of T. maritima MSB8, T. neapolitana DSM 4359, and T. sp. strain RQ2, with emphasis on the genetic elements that have the greatest impact on genetic engineering attempts, such as natural competence genes, CRISPRs, and R-M systems.
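The GC content values reported under Genome properties are simple base-composition ratios. The sketch below shows one way such a value could be computed from a sequence string. It is an illustration only: the function name `gc_percent` and the toy sequence are assumptions made here, and ambiguous bases such as N are ignored by choice rather than by any convention stated in the text.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / total


# Toy sequence (made up for illustration): 9 G/C out of 15 unambiguous bases.
toy = "ATGCGCGTTAACCGGN"
print(round(gc_percent(toy), 2))  # 60.0
```

Applied to the full chromosome and plasmid sequences, the same ratio would be expected to give the percentages listed in Table 3.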

Full genome comparison
The alignment of the complete genomic sequences of the four Thermotoga strains (Fig. 4) revealed high levels of synteny among their genomes, particularly within two pairs: T. sp. strain RQ7 with T. neapolitana DSM 4359, and T. sp. strain RQ2 with T. maritima MSB8. This is in agreement with their placements in the phylogenetic tree (Fig. 1). The average nucleotide identity between T. sp. strain RQ7 and the type strain T. neapolitana DSM 4359 is 98.49%, which is higher than the conventional cutoff of 95% for species delineation [22]. Therefore, T. sp. strain RQ7 should be considered a strain of T. neapolitana, just as T. sp. strain RQ2 is considered a strain of T. maritima [23].
A detailed comparison of T. sp. strain RQ7 and T. neapolitana DSM 4359 found 100 genes belonging only to the former and 120 genes belonging only to the latter. Some of these genes became unique because their counterparts in the other genome have mutated into pseudogenes. However, many of the unique genes seem to have been acquired via recent lateral gene transfer events. The putative functions of these genes are mainly associated with the transport and utilization of carbohydrates and nucleotides. The most notable gene clusters include TRQ7_01555-01655 (nucleotide metabolism), TRQ7_02675-02725 (carbohydrate metabolism), TRQ7_03440-03490 (arabinose metabolism), CTN_0026-0038 (synthesis of antibiotics), CTN_0236-0245 (carbohydrate metabolism), CTN_0355-0373 (ribose metabolism), CTN_1540-1554 (carbohydrate metabolism), and CTN_1602-1627 (ribose metabolism). Follow-up functional genomics studies are needed to validate the predictions on these gene functions and metabolic pathways.
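The species-delineation argument in this section compares an average nucleotide identity (ANI) value with the 95% cutoff [22]. The sketch below illustrates that comparison and a simplified, length-weighted mean identity over aligned fragments. It is not the method used to obtain the 98.49% value: real ANI tools apply their own fragmenting and filtering rules, and the names `mean_identity` and `same_species` and the toy alignment values are hypothetical.

```python
from typing import Iterable, Tuple

SPECIES_CUTOFF = 95.0  # conventional ANI cutoff cited in the text [22]


def mean_identity(hits: Iterable[Tuple[float, int]]) -> float:
    """Length-weighted mean percent identity over (identity, aligned_length) pairs."""
    hits = list(hits)
    total_len = sum(length for _, length in hits)
    if total_len == 0:
        raise ValueError("no aligned bases")
    return sum(ident * length for ident, length in hits) / total_len


def same_species(ani: float, cutoff: float = SPECIES_CUTOFF) -> bool:
    """True when the ANI value meets or exceeds the species cutoff."""
    return ani >= cutoff


# ANI between RQ7 and T. neapolitana DSM 4359 as reported in the text.
print(same_species(98.49))  # True

# Made-up fragment alignments, for illustration only.
toy_hits = [(99.0, 1000), (97.0, 1000)]
print(mean_identity(toy_hits))  # 98.0
```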

Natural competence
Thermotoga species are known to undergo lateral gene transfer events. One of the ways this could happen is via natural transformation. Natural competence has been established in T. sp. strain RQ7 [7] and T. sp. strain RQ2 [8]. Using experimentally characterized competence genes as references, we identified the genes that might play a role in natural competence in Thermotoga (Table 6). These genes are widespread among bacterial genomes, and none of them are clustered into operons. This might imply a primitive form of natural competence that is shared by most, if not all, bacteria. Perhaps most free-living bacteria are naturally competent to some degree at some point in their life cycle. The challenge is to identify the conditions under which natural competence develops.

CRISPRs
CRISPRs provide prokaryotes with a form of adaptive immunity against invading phages and plasmids in a sequence-specific manner [24,25]. The system uses non-coding CRISPR RNA and a set of CRISPR-associated proteins to target invading nucleic acids, including both DNA and RNA. CRISPRs have been reported to prevent natural transformation [26,27]. They have been noted before in Thermotoga and are credited with large-scale chromosomal recombination events in these species [28,29]. NCBI's PGAAP pipeline identified 6 CRISPR loci in T. sp. strain RQ7, whereas JGI-IMG's MGAP pipeline and a manual analysis using CRISPRFinder [30] recognized a total of 8 loci (Table 7). Among these eight CRISPR loci, #1 and #3 are the ones not considered by PGAAP. Two clusters of cas genes are also found. The cas6-cas2 cassette is sandwiched between loci #3 [...] (Table 7). Although analysis with CRISPRFinder revealed the same number of CRISPR loci in the four close relatives, i.e., T. sp. strain RQ7, T. neapolitana DSM 4359, T. maritima MSB8, and T. sp. strain RQ2, the total number of spacers they carry varies considerably: 95, 60, 106, and 129 spacers are found, respectively. T. maritima MSB8 and T. sp. strain RQ2 also harbor RNA-targeting cmr genes in addition to DNA-targeting cas genes [31]. These differences may affect the efficiency of lateral gene transfer events among the strains.

R-M systems
[...] [32,33]. The nuclease R.TneDI cleaves at the center of the recognition site (CG↓CG), and the methylase M.TneDI modifies one of the cytosines. The TneDI system has been found in many members of the Thermotogaceae family, including T. maritima MSB8 and T. sp. strain RQ2 [32].
However, it is absent from T. sp. strain RQ7, although the neighborhood is still highly conserved (Fig. 6). To exclude the possibility of an assembly error, primers spanning the region in question were designed, and the PCR results confirmed the deletion (Fig. 7). The absence of the TneDI system makes the DNA of T. sp. strain RQ7 susceptible to R.TneDI, and in vitro treatment with M.TneDI provides complete protection to its genomic DNA (Fig. 8).
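The cleavage pattern described for R.TneDI (recognition site CGCG, cut at the center, CG↓CG) can be illustrated with a small in silico digest. The sketch below is a simplified model under stated assumptions: the sequence is treated as linear, methylation protection is not modeled, and the function names `cut_positions` and `digest_linear` and the toy sequence are introduced here for illustration. Because CGCG is palindromic, scanning one strand is sufficient.

```python
import re
from typing import List

SITE = "CGCG"   # recognition sequence of R.TneDI (from the text)
CUT_OFFSET = 2  # CG^CG: blunt cut between the 2nd and 3rd base of the site


def cut_positions(seq: str) -> List[int]:
    """0-based indices at which the sequence is cut, including overlapping sites."""
    seq = seq.upper()
    # A lookahead match finds overlapping sites, e.g. the two sites in CGCGCG.
    return [m.start() + CUT_OFFSET for m in re.finditer(f"(?={SITE})", seq)]


def digest_linear(seq: str) -> List[str]:
    """Fragments produced by cutting a linear sequence at every site."""
    bounds = [0] + cut_positions(seq) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


# Toy sequence (made up for illustration) with two CGCG sites.
toy = "AAACGCGTTTTCGCGAA"
print(cut_positions(toy))  # [5, 13]
print(digest_linear(toy))  # ['AAACG', 'CGTTTTCG', 'CGAA']
```

In this model, DNA protected by M.TneDI would simply yield no cuts, which corresponds to the complete protection reported in the text.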
M.TneDI has been predicted to be an m4C methylase based on sequence analysis [32]. It has also been noted that m4C methylation is more common than m5C methylation in thermophiles, probably due to a reduced risk of deamination [34]. The hypothesis that M.TneDI is an m4C methylase is further supported by the observation that the genomic DNA of TneDI-bearing species is still susceptible to BstUI (Fig. 9), which is an isoschizomer of R.TneDI and is known to be blocked by m5C methylation [35].

Conclusions
The genome of T. sp. strain RQ7 shares large regions of synteny with those of its close relatives, namely, T. neapolitana DSM 4359, T. maritima MSB8, and T. sp. strain RQ2. They all have a complete set of putative competence genes, although natural transformation has yet to be established in T. neapolitana DSM 4359 and T. maritima MSB8. The same number of CRISPR loci is found in all four genomes, even though the number of spacers varies. The most noticeable difference among the strains is the absence of the TneDI R-M system in T. sp. strain RQ7, which partially explains why this strain is more amenable to genetic modification than the others. In general, this work sheds light on the genetic features of T. sp. strain RQ7 and should promote genetic and genomic studies of Thermotoga spp.