Event
Ph.D. Research Proposal Exam: Zahra Zare Jousheghani
Thursday, January 23, 2025
1:00 p.m.
IRB-4105
Maria Hoo
301 405 3681
mch@umd.edu
ANNOUNCEMENT: Ph.D. Research Proposal Exam
Name: Zahra Zare Jousheghani
Committee:
Prof. Robert Patro, Chair
Prof. Dinesh Manocha
Prof. Cunxi Yu
Date/time: Thursday, January 23, 2025 at 1 pm
Location: IRB-4105
Title: Enhanced probabilistic modeling leads to improved accuracy in bulk & single-cell RNA-seq transcriptome quantification
Abstract:
Advancements in long-read sequencing technologies have transformed transcriptomics by enabling the sequencing of full-length transcripts, providing unprecedented insights into gene expression and isoform diversity. However, accurate transcript quantification in both bulk and single-cell long-read RNA-seq remains a significant challenge due to technical limitations, sequencing errors, and biases. This proposal focuses on developing enhanced algorithmic and probabilistic modeling techniques to address these challenges and improve the accuracy of transcript quantification in both bulk and single-cell long-read RNA sequencing data. The work discussed herein is divided into two major chapters, each tackling unique aspects of long-read quantification and proposing novel solutions.
The first chapter addresses the challenges of transcript quantification in bulk RNA-seq datasets generated by long-read sequencing technologies, which provide a detailed view of transcript structures and isoform diversity by aggregating data from a population of cells. Despite their potential, current quantification methods are hindered by sequencing errors, mapping ambiguities, and limitations in probabilistic models, particularly for transcript assignment. To overcome these issues, we propose a novel probabilistic framework implemented in a software tool called oarfish, which integrates read alignment scores and coverage profiles to improve quantification accuracy, sensitivity to low-abundance isoforms, and robustness against sequencing errors. Evaluations on both simulated and experimental PacBio and ONT datasets demonstrate its effectiveness, while proposed enhancements, such as dynamic coverage updates, factorized likelihood models, and genome-to-transcriptome alignment, would pave the way for even broader applications and improved computational efficiency.
The second chapter focuses on single-cell long-read RNA-seq, which enables the study of cellular heterogeneity and dynamic biological processes at single-cell resolution. While long-read scRNA-seq offers advantages such as isoform-level resolution and splicing variation analysis, it faces challenges from sparse data, technical noise, and errors in cell barcodes (CBs) and unique molecular identifiers (UMIs). Building on the bulk RNA-seq framework, we propose adapting the probabilistic model for single-cell datasets by incorporating advanced error correction for CBs and UMIs and refining UMI deduplication methods. While the current focus is on PacBio data, future work will extend compatibility to ONT datasets and integrate innovations from Chapter 1, including dynamic coverage updates, factorized likelihood models, and genome-to-transcriptome alignment, into single-cell quantification.