Solving the Problem: Genome Annotation Standards before the Data Deluge

Klimke, William; O’Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana

doi:10.4056/sigs.2084864

Environmental Microbiome

Table 3. Selected annotation report examples¹

		Chromosome		Feature counts					Calculated values
Bioproject ID²	Organism name (No. of replicons)	Length (Mbp)	GC (%)	No. of proteins	No. of RNAs	No. of amino acids with tRNA⁵	No. of hypothetical proteins³	Coding density⁴	Avg. protein length (aa)	Min. protein length (aa)	Short proteins (%)⁶	Standard start codons (%)⁷
225	Escherichia coli str. K-12 substr. MG1655 (1)	4.640	50.79	4,144	175	22	21	0.89	316	14	20.32	90.54
76	Bacillus subtilis subsp. subtilis str. 168 (1)	4.216	43.51	4,177	178	20	221	0.99	294	20	26.48	77.76
17977	Candidatus Carsonella ruddii PV (1)	0.160	16.56	182	31	20	44	1.14	274	37	32.42	96.15
32135	Candidatus Hodgkinia cicadicola Dsem (1)	0.144	58.39	169	18	12*	37	1.18	257	38	33.73	27.22
46847	Streptomyces bingchenggensis BCW-1 (1)	11.937	70.75	10,022	84	21	3,606	0.84	342	24	19.86	60.69
19943	Rickettsia rickettsii str. Iowa (1)	1.268	32.45	1,384	37	19*	607	1.09	232	17	47.76	73.55
81	Clostridium tetani E88 (1)	2.799	28.75	2,373	72	20	247	0.85	336	101	12.09	68.27
12634	Anaeromyxobacter dehalogenans 2CP-C (1)	5.013	74.91	4,346	58	21	965	0.87	349	38	15.85	69.21
49535	Propionibacterium freudenreichii subsp. shermanii CIRM-BIA1 (1)	2.616	67.27	2,375	51	20	721	0.91	317	2	21.14	70.57
43535	Lactobacillus salivarius CECT 5713 (1)	1.828	32.94	1,350	120	21	86	0.74	352	95	2.22	80.00
105	Haloarcula marismortui ATCC 43049 (2)	3.420	61.93	3,412	59	20	1	1.00	285	30	27.02	100.00
13128	Photobacterium profundum SS9 (2)	6.323	41.71	5,413	209	21	2,490	0.86	316	35	21.97	73.88
28711	Haliangium ochraceum DSM 14365 (1)	9.446	69.48	6,719	55	20	1,827	0.71	411	32	13.37	79.67
244	Nostoc sp. PCC 7120 (1)	6.414	41.35	5,368	64	20	0	0.84	326	17	25.58	82.41
19857	Vibrio harveyi ATCC BAA-1116 (2)	5.969	45.44	5,944	159	20	5,944*	1.00	286	24	30.43	84.84
28111	Sorangium cellulosum ‘So ce 56’ (1)	13.034	71.38	9,375	319	0*	4,170	0.72	401	30	13.08	73.33
344	Rhizobium leguminosarum bv. viciae 3841 (1)	5.057	61.09	4,700	0*	0*	247	0.93	309	40	19.57	80.83
31271	Mycobacterium leprae Br4923 (1)	3.268	57.80	1,604	47	20	143	0.49	335	33	21.01	54.30
29335	Neisseria gonorrhoeae NCCP11945 (1)	2.232	52.37	2,662	67	20	324	1.19	240	32	41.81	71.22

1. Selected genomes and categories for INSDC genomes are shown. The first two rows are for the model organisms E. coli and B. subtilis. The other genomes were selected as the minimum (bolded) or maximum (bolded and underlined) in the categories shown. Those marked with an asterisk fall below the minimal standards described in this publication.
2. INSDC Bioproject ID for each genome [57].
3. Number of proteins annotated as ‘hypothetical protein’.
4. Number of proteins per Kbp ((total number of proteins/genome length (bp)) * 1000).
5. Number of amino acids for which at least one tRNA is annotated in the genome (excluding predicted or annotated pseudo tRNAs).
6. Percent of short proteins (number of proteins shorter than 150 amino acids/total number of proteins * 100).
7. Percent of standard starts for proteins (number of standard starts (ATG)/total starts * 100).
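
The calculated values defined in notes 4, 6 and 7 can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' pipeline: the `Protein` record and the function names are hypothetical, and it assumes each annotated protein is available with its length in amino acids and its start codon. Because the table reports genome length rounded to 0.001 Mbp, values recomputed from it are approximate.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    """Hypothetical record for one annotated protein."""
    length_aa: int    # protein length in amino acids
    start_codon: str  # e.g. "ATG", "GTG", "TTG"


def coding_density(n_proteins: int, genome_length_bp: int) -> float:
    """Note 4: number of proteins per Kbp of genome."""
    return n_proteins / genome_length_bp * 1000


def percent_short(proteins: list[Protein], cutoff_aa: int = 150) -> float:
    """Note 6: percent of proteins shorter than cutoff_aa amino acids."""
    short = sum(1 for p in proteins if p.length_aa < cutoff_aa)
    return short / len(proteins) * 100


def percent_standard_start(proteins: list[Protein], standard: str = "ATG") -> float:
    """Note 7: percent of proteins whose start codon is the standard ATG."""
    hits = sum(1 for p in proteins if p.start_codon.upper() == standard)
    return hits / len(proteins) * 100


# Coding density recomputed from the E. coli row (4,144 proteins, 4.640 Mbp).
ecoli_density = round(coding_density(4144, 4_640_000), 2)

# Toy protein set, invented purely to exercise the two percentage functions.
toy = [
    Protein(100, "ATG"),
    Protein(200, "GTG"),
    Protein(149, "ATG"),
    Protein(150, "TTG"),
]
toy_short = percent_short(toy)
toy_standard = percent_standard_start(toy)
```

For the E. coli row this gives a coding density that rounds to the tabulated 0.89; the toy set is illustrative only and does not correspond to any genome in the table.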


ISSN: 2524-6372
