Genome overview of eight Candida boidinii strains isolated from human activities and wild environments


 Candida boidinii is an Ascomycota yeast with important biotechnological applications. In this paper we present the genome sequencing and annotation of eight strains of this species isolated from human activities and wild environments. The produced assemblies revealed several strain specific features in terms of genomic GC content (ranging from 30.9 to 32.7%), genome size (comprised between 18,791,129 and 19,169,086 bp) and total number of protein coding genes (ranging from 5819 to 5998), with putative assignation to their general KOG functional categories. The obtained data underlined the presence of two different groups for this species. The results reported herein provide new insights into the plasticity of the genome of this yeast species and represent a starting point for further studies in view of its biotechnological applications.



Introduction
Candida boidinii is a yeast belonging to Ascomycota phylum of the Kingdom Fungi, class Saccharomycetes, order Saccharomycetales, phylogenetically related to the Ogataea clade. This yeast species was first identified in Spain from a wash of tree bark by Ramirez [1], albeit the ecology of this microorganism is widespread and it has been isolated from diverse substrates related to human activity (wine fermentations, olive manufacturing, tepache, etc.) and natural environments (soil, seawater, sap fluxes of many sugar rich tree species, etc.) [2].
C. boidinii is a yeast species with a clear biotechnological potential. Indeed, this xylose-consuming and methylotrophic yeast proved to be suitable for the study of genes related with methanol degradation [3][4][5]. Moreover, this species is involved in olive processing, where it exhibits different multifunctional features such as lipase activity [6], biofilm formation on fruit epidermis [7,8] and co-aggregation with LAB species such as Lactobacillus pentosus [9,10].
Intraspecific biodiversity appears to be a distinctive feature of the C. boidinii species. Indeed, Lee and Komagata [11] compared the electrophoretic profiles of enzymes expressed in diverse strains of this species, revealing the presence of two distinct groups. Lin et al. [12] studied 19 C. boidinii strains isolated from diverse sources and also identified two divergent clusters both in terms of molecular (DNA base composition, electrophoretic karyotype, RFLP of RNA genes) and chemical (cellular fatty acid composition and ubiquinone system) features. The authors even highlighted a distinctive chromosomal banding pattern for each strain. Finally, statistics reported by the CBS-KNAW Fungal Biodiversity Centre show an average similarity between C. boidinii strains of 97.61% for 26S rDNA sequences (n = 38), and 98.06% for ITS sequences (n = 25) (http://www.cbs.knaw.nl/Collections/).
The biotechnological potential of C. boidinii, together with its underlined biodiversity, urge to obtain more information on the genome of this Ascomycota yeast. In facts, at the time of writing, the genome sequences of only two C. boidinii strains were available, namely GF002 (isolated from sugarcane bagasse, Bioproject PRJNA299882, [13]), and JCM9604 (isolated from tanning fluid, Bioproject PRJDB3623). In order to fill this lack of information, we hereafter report the genomic sequence and annotation of eight additional C. boidinii strains that were isolated from both human activities and wild environments.

Classification and features
After previous studies on the ability of diverse yeast species to co-aggregate with diverse Lactobacillus pentosus strains [9] isolated from table olive fermentations, we selected eight strains of C. boidinii featuring different origins and degrees of co-aggregation. Strains UNISS-Cb18 and UNISS-Cb60 were obtained from the UNISS microbial collection (Università degli Studi di Sassari, Italy), TOMC-Y13 and TOMC-Y47 belong to the Table  Olive Microorganisms Collection (Instituto de la Grasa-CSIC, Seville, Spain), DBVPG6799, DBVPG7578, and DBVPG8035 were obtained from the Industrial Yeast Collection (Università degli Studi di Perugia, Italy), and strain NDK27A1 was obtained from the Yeast Collection of the Dipartimento di Agraria (Università degli Studi di Naples, Italy). Tables 1, 2, 3, 4, 5, 6, 7 and 8 summarizes the classification, origin and main features of the studied organisms, whereas Fig. 1 shows, as an example, the morphology of one of the analysed strains (e.g. UNISS-Cb60) by scanning electron microscopy. Figure 2 shows the phylogenetic position of the selected C. boidinii isolates with respect to other yeast species, confirming its closely relationship with the Ogataea clade. The result presented here is originated by the alignment of the 18S rRNA sequences (Fig. 2); C. albicans (strain MUCL29800) 18S rRNA gene (accession id X53497.1), was used as a query to retrieve the homologues sequences within the other species assemblies (low coverage alignment prevented the inclusion of the published C. boidinii strain in the analysis). The observed phylogenetic closeness of the C. boidinii to the Ogataea clade was confirmed by the alignment of the D1/D2 domain of 26S rRNA gene (Additional file 1: Figure S1). Figure 3 shows the genotyping of these strains by RAPD-PCR analysis with M13 primers. All the strains were clearly grouped into different clusters for a cut-off value of 84.6% (the lowest reproducibility value was obtained between replicates for strain DBVPG6799).
The specific ability of the eight C. boidinii strains to form biofilm alone or in combination with three LAB strains isolated from table olives (L. pentosus TOMC-LAB2, Lactobacillus plantarum TOMC-LAB9, and Pediococcus pentosaceus TOMC-P56) was quantified by crystal violet staining. Briefly, 96-well microtiter plates were inoculated with 100 μL of overnight culture of each C. boidinii strain, alone or in combination with 100 μL of the mentioned LAB. After 48 h incubation at 28°C, liquid was removed from wells and washed twice with sterile saline solution (0.9%). Subsequently, a crystal violet solution (0.8% w/v) was added to each well. Plates were incubated at room temperature for 30 min and then washed twice with sterile distilled water. Finally, an ethanol-acetone mixture (80:20, v/v) was added in order to extract crystal violet bound to biofilm. After 30 min incubation at room temperature, the OD at 595 nm was determined with a spectrophotometer model Spectrostar Nano (BMG Labtech, Ortemberg Germany). , not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [46] Multifactorial ANOVA was used to compare OD values obtained for the different strains. Results are shown in Fig. 4. As clearly deduced, different ability to form biofilms was exhibited among strains. In mono-culture, the lowest value was obtained for strain NDK27A1 (OD 0.5), which was statistically different compared to the strain with the highest value (TOMC-Y13, OD 1.3). Moreover, for many of the strains, biofilm production was statistically higher in mixed culture in presence of the L. pentosus species, which was especially evident for strains UNISS-Cb18, UNISS-Cb60, and NDK27A1. This fact did not occur for the other LAB species. Only for strain NDK27A1, the presence of L. plantarum also produced a considerable increase in its ability to form biofilm.

Genome sequencing information
Genome project history Formation of mixed biofilms between yeasts and LAB on the surface of olives during the fermentation process is a widely observed phenomenon [8]. This phenotype is determined by the expression of multiple genes of both the bacteria and the yeast. In this regard, C. boidinii has been described as a yeast with high ability to

MIGS-4.4 Altitude
Not determined a Evidence codes -IDA Inferred from Direct Assay, TAS Traceable Author Statement (i.e., a direct report exists in the literature), NAS Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [46] form mixed biofilms [10] and, for this reason, several strains were sequenced aiming to investigate in further studies the genetic bases of the observed peculiar behaviour. The genome project was deposited under the accession number PRJNA359406. Tables 9 and 10 shows a summary of this genome project, which encompassed for a total of eight microorganisms.

Growth conditions and genomic DNA preparation
DNA extraction of the C. boidinii strains was performed according to Borelli et al. [13] with slight modifications.      Table S1 in Additional file 2 for more details) were quality-filtered using prinseq-lite program [14] applying the following parameters: min_length: 50, trim_qual_right: 30, trim_qual_type: mean, trim_qual_window: 20). Then, R1 and R2 from Illumina sequencing where joined using fastq-join from ea-tools suite (https://expressionanalysis.github.io/ea-utils/) applying the following default parameters: maximum percent difference: 8, minimum overlap: 6. The resulting datasets were used to assemble all the C. boidinii strains' genomes by using the software SPAdes [15]. Scaffolds that proved to be shorter than 500 bp were removed from the final assembly.

Genome annotation
The obtained genomes were annotated using the tool Augustus [16] that was trained with transcripts from Candida tropicalis. Such a species was chosen among others (e.g. Candida albicans and Candida guilliermondii from the built-in Augustus training sets and Candida glabrata from an ad hoc training set derived from the gene models available at the NCBI genome database) based on the number of predicted genes showing high homology (blastp search, e-value < 0.0001, Additional file 3: Table S2) with a dataset of proteins annotated in several yeasts species (e.g. C. dublinensis, C. albicans, C. glabrata, C. guilliermondii, C. lusitaniae, C. orthopsilosis, C. parapsilosis, C. tropicalis, D. hansenii, D. kurascia, L. elongisporus, P. tannophilus, P. membranifaciens). Reliability of prediction was confirmed by a remarkable concordance of the predicted exonic ranges among different training sets (e.g. 98% of the exons predicted using C. tropicalis as the training set proved to be consistent with exons predicted with C. glabrata as training set). Transfer RNA and ribosomal RNA were predicted by using the software tRNAscan [17] and RNAmmer [18] respectively. The tool Blast2GO [19] was used to assign a putative function to the predicted transcripts either in terms of molecular function, cellular component or biological process. The presence of Pfam domains [20] was investigated by the use of the Batch Web CD-Search Tool from NCBI [21], whereas KOG functional categorization was achieved  using the WebMGA web server [22]. Finally, CRISPRFinder [23], SignalP 4.1 server [24] and TMHMM server [25] were used to investigate the presence of CRISPR repeats, signal peptides and transmembrane domains, respectively, within the predicted genes. RepeatModeler [26] was used to investigate the presence of transposable elements in the eight investigated C. boidinii species; the retrieved sequences were merged with the Repbase fungi transposable elements dataset [27] and the resulting library was used to perform a full analysis of the C. boidinii strains repetitive regions by using the RepeatMasker tool [28].

Insights from the genome sequence
Sequencing data were used to compare the reported strains to the published genome of C. boidinii (strain GF002) [13]. The reads of each experiment were aligned to the reference genome by using the software bwa [29] with default parameters (edit distance = 4%). The obtained results highlighted the presence of two distinct groups. Indeed, while UNISS-Cb18, UNISS-Cb60, DBVPG6799 and NDK27A1 (hereafter referred to as group A) proved to share only 9% with the reference DNA sequence (with such a percentage increasing to around 50% when the most permissive aligner bwa mem was used), the remaining strains (TOMC-Y13, TOMC-Y47, DBVPG7578, and DBVPG8035, hereafter referred to as group B) proved to cover around 97% with the GF002 genome. Notably, these two groups also significantly differ in their GC content (p < 0.0001) and genome length (p < 0.001). Although the phylogenetic tree (Fig. 2) and the high level of D1D2 26S ribosomal sequence conservation within as well as between the two groups (Additional file 5: Table S4) show a clear strong C. albicans (strain MUCL29800) 18S rRNA (accession id X53497.1) was used as a query to retrieve the homologues sequences in the other presented species. Sequences were aligned using MUSCLE [37], and the phylogenetic tree was determined using the neighbourjoining algorithm with the Kimura 2-parameter distance model in MEGA (version 7) [38]. A gamma distribution (shape parameter = 1) was used for rate variation among sites. The optimal tree with the sum of branch lengths = 0.1734 is shown, and nodes that appeared in more than 50% of replicate trees in the bootstrap test (1000 replicates) are marked with their bootstrap support values phylogenetic relationship among the presented strains, the observed genetic diversity is not surprising. A marked GC content variability and the identification of two distinct groups (based on the chemo-variability derived from the electrophoretic patterns of several enzymes) was previously reported for this species [12].

Extended insights
The emergence of two apparently distinct groups for the reported C. boidinii strains was further investigated by analysing their genetic diversity in terms of both nucleotide divergence and chromosomal structural variability. In this regard, we first computed the frequency of all possible k-mers (DNA substrings of a specific size k = 25) that are included in each of the assembled genomes by using the pipeline FFP (v. 3.19, [30]). Such an approach has been used to investigate the signature of genetic similarity by directly comparing several genomes even in the absence of a well characterized model organism. The obtained frequencies were used to compute a distance matrix (Fig. 5a) that clearly confirmed the strong similarity between strains belonging to the same group. We speculate that the observed compositional diversity can be due to different factors such as the strength of the mutational pressure [31], the effect of selection [32] or the incidence of the GC biased gene conversion [33]. In this regard, the occurrence of complex structural rearrangements can not be excluded either. For this reason, we used the OrthoMCL pipeline (with default parameters, [34]) to find the orthologues genes of the presented strains and studied their collinearity by using the tool MCscanX [35]. A low sinteny level generally underlie the occurrence of complex structural variation events such as genomic rearrangements or horizontal gene transfer [36]. The analysis involved a total of 47,184 genes and revealed that 88.2% of these were in a collinear group: however a large variability emerged when the collinear group were analysed for each pairs of species (Fig. 5b). The lowest number of collinear genes arose when strains belonging to different groups were compared. Notably, a very high number of genes proved to be collinear when analysing strains belonging to group A with such a trend being less marked for strains within group B and with strain TOMC-Y13 featuring, in general, the smallest values. As reported in Table 14, the sinteny  analysis revealed several parameters discriminating the two groups such the number of dispersed genes (e.g. transcripts that are not collinear with any of the orthologues genes, A < B, p < 0.01), the occurrence of tandem duplications (A < B, p < 0.001) and the number of proximal genes (e.g. transcripts that are duplicated within the analysed species at a distance comprised between 2 and 20 genes, A > B, p < 0.001). The analysis of repetitive regions further confirmed such a discrimination (Additional file 4: Table S3) with group A featuring a higher number of LINE (p < 0.05), LTR (p < 0.001) but a lower number of simple repeats (p < 0.0001) and low complexity sequences (p < 0.0001). Taken together these results suggest an evident impact of complex structural variations in shaping the genome of the C. boidini with such a phenomenon conferring specific genomic structure to strains with diverse evolutionary histories.

Conclusions
In this study, we have sequenced and characterized the genome of eight C. boidinii strains isolated from diverse origins and featuring peculiar co-aggregation behaviour. The analysed species featured a high variability in terms of nucleotide compositional patterns and genomic structure, possibily reflecting their specific evolutionary history. This result underline the need to deeply investigate the phylogenesis of the C. boidinii species by comparing       the reported genomes to those of related species in terms of orthologues protein evolution or transcripts collinearity. The occurrence of both the strain specific duplicated genes and the singletons (e.g. genes with no orthologues in other strains) will need to be further investigated in order to study their involvement in the highlighted morphological differences. We strongly believe that generated data will boost future studies aiming the exploration of both the biotechnological potential and the genome plasticity of this Ascomycota yeast.