Solving the Problem: Genome Annotation Standards before the Data Deluge

Klimke, William; O’Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana

doi:10.4056/sigs.2084864

Environmental Microbiome

Table 4. Pseudogene annotation strategies and outcomes

Case	Situation	Flag¹	How to Annotate	Consequence²	In BLAST³
1	Pseudogene	“/pseudo”	pseudogene	no translation; product name is in note, associated feature (CDS, tRNA, rRNA, etc.) will be annotated	No
2	Potential pseudogene	N/A	normal gene annotated, potential pseudogene status in note	no CDS feature, not documented as a pseudogene, not trackable as protein vs. RNA-coding	No
3a	Frameshifted gene and sequence IS correct	“/pseudo”	combine intervals into a single gene with /pseudo	no translation; product name is in note	No
3b	Frameshifted gene and sequence MAY be correct	N/A	keep both and add a note to each CDS	two separate coding regions and two protein translations	Yes (Both)
3c*	Frameshifted gene and there are sequence ERRORS	“/exception="annotated by transcript or proteomic data"” AND (“/experiment” OR “/inference”)	experimental evidence showing that the translation is correct and/or an inference pointing to the accession number with the correct translation	protein sequence imported; translation does not match nucleotide sequence	Yes
3d	Frameshifted gene and there are sequence ERRORS	“/artificial_location”	locations altered for ‘correct’ location	all protein deflines prefaced with “LOW-QUALITY PROTEIN:”	Yes
4	Region of similarity	N/A	misc_feature denoting location of region of similarity	no gene, no locus_tag, not systematically enumerated	No
5	Potential unresolvable problems	N/A	note explaining the issue	no change in annotation	Yes
6⁴	Split/interrupted gene in the case of an insertion (e.g., transposon insertion)	N/A	either a single interval or a split interval; annotation depends on the consequence of the insertion	no standards for split genes, locations do not match regions of similarity	No

1. Qualifier to be used on feature.
2. Downstream consequence of annotation decision, including impacts on presentation of the record.
3. Whether a protein sequence is encoded and will be present in protein and BLAST databases. Note that BLAST databases can differentiate proteins only through defline changes; i.e., cases 3b, 3c, and 5 present undifferentiated protein deflines in BLAST databases, whereas case 3d has an altered protein defline.
4. Insertions can result in complicated cases such as gene fusion events. These annotation results should be due to real insertions, not simply regions of the genome that exhibit weak similarity to a part of a protein sequence.
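
The table and footnote 3 can be read as a small lookup: each case maps to a qualifier (or none) and to whether a protein reaches the BLAST databases, with only case 3d changing the defline. The following is a minimal sketch of that reading, not part of the standard itself. The names `Strategy`, `TABLE_4`, and `blast_defline`, and the product name in the usage note, are hypothetical and chosen for illustration; the qualifier strings and the "LOW-QUALITY PROTEIN:" prefix are taken from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    """One row of Table 4 (hypothetical structure, for illustration only)."""
    situation: str
    qualifier: Optional[str]  # None where the table lists N/A
    in_blast: bool            # the "In BLAST" column


# Rows transcribed from Table 4; keys are the case labels.
TABLE_4 = {
    "1": Strategy("pseudogene", "/pseudo", False),
    "2": Strategy("potential pseudogene", None, False),
    "3a": Strategy("frameshifted gene, sequence is correct", "/pseudo", False),
    "3b": Strategy("frameshifted gene, sequence may be correct", None, True),
    "3c": Strategy(
        "frameshifted gene, sequence errors",
        '/exception="annotated by transcript or proteomic data"',
        True,
    ),
    "3d": Strategy("frameshifted gene, sequence errors", "/artificial_location", True),
    "4": Strategy("region of similarity", None, False),
    "5": Strategy("potential unresolvable problems", None, True),
    "6": Strategy("split/interrupted gene (insertion)", None, False),
}

LOW_QUALITY_PREFIX = "LOW-QUALITY PROTEIN: "


def blast_defline(case: str, product: str) -> Optional[str]:
    """Return the protein defline as the table implies it would appear in
    BLAST databases, or None when the case yields no protein there.

    Case 3b produces two proteins; this sketch returns the shared,
    unaltered defline once.
    """
    strategy = TABLE_4[case]
    if not strategy.in_blast:
        return None
    if case == "3d":
        # Per the table, only case 3d alters the protein defline.
        return LOW_QUALITY_PREFIX + product
    return product
```

For example, with a hypothetical product name, `blast_defline("3d", "DNA gyrase subunit A")` carries the low-quality prefix, while cases 3b, 3c, and 5 return the name unchanged, which is why footnote 3 describes them as undifferentiated in BLAST.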

ISSN: 2524-6372