A Comprehensive Analysis of Rice Genome and Proteome Annotation through Full-Length Transcript-Based Proteogenomics

Rice (Oryza sativa) is a key crop species, and understanding its genome and proteome is vital for crop improvement. Inaccurate genome annotation has hindered functional studies, which motivated the use of single-molecule long-read RNA sequencing (lrRNA_seq) for proteogenomic analysis. This study examined the complexity of the rice transcriptome and found a dense network of natural antisense transcripts (NATs), fusion genes, and intergenic transcripts. Over 900,000 transcript isoforms were identified, indicating that the genomic arrangement and coding potential of rice are far more intricate than previously known.

The study found that 60% of the loci in the rice transcriptome are associated with NATs, which suggests that NATs play a role in the control of gene expression. Fusion and intergenic transcripts added further transcriptional diversity. At the protein level, over 190,000 unique peptides were identified, expanding the known diversity of the rice proteome. This complexity underscores the need for technologies like lrRNA_seq to uncover hidden layers of the rice genome’s organization and coding abilities.
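To make the NAT statistic concrete, the sketch below shows one simple way to flag candidate NAT pairs: two loci on the same chromosome, on opposite strands, whose coordinates overlap. It is an illustration only, not the study's pipeline; the `Locus` record, the function names, and the toy coordinates are all hypothetical.

```python
from collections import namedtuple

# Hypothetical locus record; coordinates are 1-based and inclusive.
Locus = namedtuple("Locus", "name chrom start end strand")


def find_nat_pairs(loci):
    """Return (name, name) pairs of loci that overlap on opposite strands."""
    by_chrom = {}
    for locus in loci:
        by_chrom.setdefault(locus.chrom, []).append(locus)

    pairs = []
    for chrom_loci in by_chrom.values():
        chrom_loci.sort(key=lambda loc: loc.start)
        for i, a in enumerate(chrom_loci):
            for b in chrom_loci[i + 1:]:
                if b.start > a.end:
                    break  # sorted by start, so no later locus can overlap a
                if a.strand != b.strand:
                    pairs.append((a.name, b.name))
    return pairs


def nat_fraction(loci):
    """Fraction of loci that take part in at least one candidate NAT pair."""
    involved = {name for pair in find_nat_pairs(loci) for name in pair}
    return len(involved) / len(loci) if loci else 0.0


# Toy data (made up for illustration, not taken from the study).
example_loci = [
    Locus("locusA", "chr1", 100, 500, "+"),
    Locus("locusB", "chr1", 400, 900, "-"),    # antisense overlap with locusA
    Locus("locusC", "chr1", 1000, 1500, "+"),
    Locus("locusD", "chr1", 1200, 1300, "+"),  # overlaps locusC, same strand
    Locus("locusE", "chr2", 100, 500, "-"),
]

print(find_nat_pairs(example_loci))  # [('locusA', 'locusB')]
print(nat_fraction(example_loci))    # 0.4
```

A real analysis would work from the annotated transcript models and might also require a minimum overlap length, which this sketch omits.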

Comparative analyses showed that lrRNA_seq outperformed short-read RNA sequencing (srRNA_seq) in identifying longer transcripts, fusion genes, and intergenic transcripts. The study revealed the prevalence of intricate posttranscriptional events such as alternative splicing, alternative transcription start sites, and alternative polyadenylation, enriching the transcriptome’s coding potential. Notably, lrRNA_seq identified a diverse array of splicing site combinations, suggesting a higher level of variability in eukaryotic splicing than previously thought.
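As a rough illustration of what tallying "splicing site combinations" can look like, the sketch below counts donor-acceptor dinucleotide pairs (for example the canonical GT-AG) at the ends of a set of introns. It assumes forward-strand introns given as 0-based, half-open coordinates on a single sequence; the function name and the toy sequence are hypothetical and do not come from the study.

```python
from collections import Counter


def splice_site_combinations(genome_seq, introns):
    """Count donor-acceptor dinucleotide pairs for forward-strand introns.

    introns: iterable of (start, end) pairs, 0-based and half-open,
    so the intron sequence is genome_seq[start:end].
    """
    counts = Counter()
    for start, end in introns:
        if end - start < 4:
            continue  # too short to hold separate donor and acceptor sites
        donor = genome_seq[start:start + 2].upper()
        acceptor = genome_seq[end - 2:end].upper()
        counts[f"{donor}-{acceptor}"] += 1
    return counts


# Toy sequence built from exon and intron pieces (made up for illustration).
toy_seq = "ACG" + "GTAAAAAG" + "TTT" + "GCCCCCAG" + "AA" + "GTTTTTAG" + "C"
toy_introns = [(3, 11), (14, 22), (24, 32)]

print(splice_site_combinations(toy_seq, toy_introns))
# Counter({'GT-AG': 2, 'GC-AG': 1})
```

Introns on the reverse strand would need their sequence reverse-complemented before the dinucleotides are read, which is left out here for brevity.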

Proteogenomic analysis provided direct evidence of the rice genome’s coding potential, identifying over 9,700 proteoforms/protein groups from a customized three-frame translated database. The study found that nearly 6% of the posttranscriptional events could be translated into peptides, shedding light on the translation potential of the rice genome. These findings underscore the importance of integrating genomic, transcriptomic, and proteomic data to unravel the intricate mechanisms governing the rice genome’s coding capacity.
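The idea behind a three-frame translated database can be sketched as follows: each transcript is translated in its three forward reading frames, and the resulting protein sequences become candidate entries for peptide matching. This is a minimal sketch, assuming the transcript strand is already known and the standard genetic code applies; the function name is hypothetical, and the tools the study actually used to build its database are not described here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def three_frame_translate(transcript):
    """Translate a transcript in its three forward reading frames.

    Returns a list of three protein strings; '*' marks a stop codon and
    'X' marks a codon containing an ambiguous base such as N.
    """
    seq = transcript.upper().replace("U", "T")
    frames = []
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        frames.append("".join(CODON_TABLE.get(codon, "X") for codon in codons))
    return frames


print(three_frame_translate("ATGGCCTAA"))  # ['MA*', 'WP', 'GL']
```

In practice each frame would usually be split at stop codons and filtered by a minimum length before being written to the search database; those steps are omitted to keep the example short.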

Key Takeaways:
1. Full-length transcript-based proteogenomics reveals the intricate genomic arrangement, transcriptome diversity, and coding potential of the rice genome.
2. Natural antisense transcripts (NATs), fusion genes, and intergenic transcripts contribute significantly to the transcriptional complexity of the rice genome.
3. lrRNA_seq outperforms short-read RNA sequencing (srRNA_seq) in identifying longer transcripts, fusion genes, and intergenic transcripts, and it reveals posttranscriptional events that enrich our understanding of the rice genome.
4. Proteogenomic analysis uncovers over 9,700 proteoforms/protein groups, confirming the diverse coding potential of the rice genome and highlighting the need for integrated multi-omics approaches in genomics research.

Tags: regulatory, rice, mass spectrometry