Ph.D. Research Proposal Exam: Ahmed Adel Attia

Monday, November 25, 2024
2:00 p.m.
AVW 1146 (ISR)
Contact: Maria Hoo, 301-405-3681, mch@umd.edu

ANNOUNCEMENT: Ph.D. Research Proposal Exam

 

Name: Ahmed Adel Attia

Committee:

Professor Carol Espy-Wilson (Chair)

Professor Shihab Shamma

Professor Sanghamitra Dutta

Date/time: Monday, November 25, 2024 at 2:00 p.m.

Location: AVW 1146 (ISR)

Title: Advancing Speech Recognition for Educational and Low-Resource Domains

Abstract:

Automatic Speech Recognition (ASR) technology has significantly advanced in recent years, largely driven by innovations in deep learning and transformer-based models. Yet, current ASR systems fall short in educational and low-resource contexts, particularly when dealing with children’s speech in real-world classroom settings. Addressing these challenges is essential, given ASR's potential to support inclusive, adaptive learning by providing real-time transcription for students who are hard of hearing or English Language Learners (ELLs) and by offering educators valuable insights into classroom dynamics. Children’s speech and classroom environments pose unique complexities—such as high variability in speech patterns, disfluencies, overlapping conversations, and background noise—that often cause ASR models to perform poorly, underscoring the need for specialized approaches in educational ASR.

Despite ASR's success in broader applications, children's speech remains notably underrepresented in research and training data. Acoustically and linguistically distinct from adult speech, children's voices show significant variability, which standard ASR models, typically trained on adult datasets, struggle to handle. Public datasets for children's speech, such as My Science Tutor (MyST) and CSLU-Kids, are limited in size and diversity, further constraining ASR development. Privacy concerns compound these limitations, making large-scale labeled datasets difficult to access, a barrier that especially hinders ASR research tailored to young speakers.

This work approaches the problem through the adaptation and fine-tuning of transformer-based models, with models like Whisper and Wav2vec2.0 evaluated in child-specific, low-resource settings. The MyST dataset is curated and filtered to maximize utility, employing preprocessing techniques and error analysis to mitigate the challenges posed by low-quality transcriptions. The fine-tuning of Whisper models aims to capture the unique characteristics of children’s speech, pushing beyond limitations in existing research and enhancing model robustness to the disfluencies and atypical sentence structures frequently present in children’s spoken language.
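As a concrete illustration of this fine-tuning setup, the sketch below shows a single gradient step on a curated child-speech utterance using the Hugging Face Transformers Whisper interface. The checkpoint, learning rate, and the `myst_example` record are illustrative assumptions, not the proposal's actual configuration.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def prepare(example):
    # 16 kHz log-Mel features for the encoder, token ids as decoder targets
    audio = example["audio"]
    features = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
    ).input_features
    labels = processor.tokenizer(example["text"], return_tensors="pt").input_ids
    return features, labels

# One gradient step on a single curated utterance; full training would wrap
# this in Seq2SeqTrainer (or an equivalent loop) over the filtered MyST split.
features, labels = prepare(myst_example)  # hypothetical curated MyST record
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(input_features=features, labels=labels).loss
loss.backward()
optimizer.step()
```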

The research also addresses ASR performance in classroom environments, where overlapping speech and background noise present additional challenges. Leveraging data from the National Center for Teacher Effectiveness (NCTE) and the M-Powering Teachers (MPT) projects, this work explores model adaptation techniques for multi-speaker, noisy conditions. Continued Pretraining (CPT) for domain adaptation helps refine ASR models for real-time classroom transcription, tackling challenges posed by overlapping speakers and far-field audio sources. This approach not only advances ASR for educational settings but also narrows the gap between theoretical advancements and practical applications in real-world classrooms.
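The sketch below shows what one continued-pretraining step on unlabeled classroom audio can look like using wav2vec 2.0's self-supervised contrastive objective, following the Hugging Face `Wav2Vec2ForPreTraining` interface. The `classroom_waveform` input and masking hyperparameters are assumptions for illustration, not the project's settings.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices, _sample_negative_indices)

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")

# `classroom_waveform`: a 16 kHz mono excerpt of unlabeled classroom audio
inputs = extractor(classroom_waveform, sampling_rate=16000, return_tensors="pt")
batch_size, raw_len = inputs.input_values.shape
seq_len = model._get_feat_extract_output_lengths(raw_len).item()

# Mask spans of latent frames and sample distractors for the contrastive loss
mask = _compute_mask_indices((batch_size, seq_len), mask_prob=0.065, mask_length=10)
negatives = _sample_negative_indices(
    (batch_size, seq_len), model.config.num_negatives, mask_time_indices=mask)

outputs = model(
    inputs.input_values,
    mask_time_indices=torch.tensor(mask, dtype=torch.bool),
    sampled_negative_indices=torch.tensor(negatives, dtype=torch.long),
)
outputs.loss.backward()  # one self-supervised step on in-domain audio
```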

To address data scarcity, classroom noise simulations are developed using game engines to produce realistic, controlled acoustic environments. This synthetic data serves as augmented training input, helping ASR models learn to isolate and transcribe target speech amidst noise. Additionally, a weakly supervised fine-tuning strategy is introduced: starting with weak transcriptions as pre-training, the ASR model learns general patterns before being fine-tuned on high-quality, hand-curated transcriptions. Proof-of-concept studies on these techniques demonstrate promising results, bolstering ASR performance in low-resource, noisy settings.
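As a minimal sketch of the augmentation step, the function below mixes a simulated classroom noise track into clean speech at a chosen signal-to-noise ratio, a standard recipe for turning rendered noise into training data. Array names are placeholders, and the game-engine rendering that produces the noise track is outside this snippet.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it."""
    noise = np.resize(noise, clean.shape)           # loop/trim noise to length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12       # avoid divide-by-zero
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g., simulate a noisy classroom at 5 dB SNR for each training utterance
noisy = mix_at_snr(clean_utterance, simulated_classroom_noise, snr_db=5.0)
```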

Further work includes investigating novel pre-training techniques inspired by speech enhancement and perceptual insights into noise effects on ASR systems. One avenue uses clean speech to provide contrastive targets for self-supervised speech representation learning, guiding the model toward noise-invariant representations. A second examines speech-inversion-based self-supervised pre-training, where articulatory parameters are extracted and used to initialize the latent space of self-supervised speech models. Additionally, a perceptual study will explore the impact of noise on representations learned during self-supervised pre-training, offering deeper insights for adjusting training strategies to build noise-robust ASR models from the start. Together, these pre-training strategies aim not only to improve ASR performance in challenging environments but also to deepen the foundational understanding of ASR in varied acoustic conditions, informing more resilient model designs.
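The sketch below illustrates the first idea in generic form: frame-level representations of a noisy utterance are trained to match those of its clean counterpart via an InfoNCE-style contrastive loss. Here `encoder` stands in for any self-supervised speech encoder returning (batch, frames, dim) features; the loss is an illustrative instance of the idea, not the proposal's exact objective.

```python
import torch
import torch.nn.functional as F

def noise_invariant_infonce(encoder, noisy, clean, temperature=0.1):
    # Embed both views; clean features serve as fixed contrastive targets
    z_noisy = F.normalize(encoder(noisy), dim=-1)      # (b, t, d)
    with torch.no_grad():
        z_clean = F.normalize(encoder(clean), dim=-1)  # (b, t, d)
    b, t, _ = z_noisy.shape
    # Similarity of every noisy frame to every clean frame in the utterance
    logits = torch.einsum("btd,bsd->bts", z_noisy, z_clean) / temperature
    # Each noisy frame should match the clean frame at the same time step
    labels = torch.arange(t, device=logits.device).repeat(b)
    return F.cross_entropy(logits.reshape(b * t, t), labels)
```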

 

Audience: Faculty 
