Event
Ph.D. Dissertation Defense: Ahmed Adel Attia
Tuesday, March 3, 2026
9:00 a.m.
AVW 1146
Committee:
Professor Carol Espy-Wilson (Chair/Advisor)
Professor Shihab Shamma
Professor Sanghamitra Dutta
Professor Dinesh Manocha
Title: Advancing Speech Recognition for Low-Resource Domains: A Case Study in Educational and Classroom Speech
Abstract:
This dissertation examines data-centric and training-centric approaches for enhancing ASR under adverse low-resource conditions, using classroom speech as a motivating and representative application domain. We study the adaptation of transformer-based ASR models, including Whisper and wav2vec 2.0, to children's speech and classroom environments. We begin by curating and analyzing publicly and otherwise available children's and classroom speech datasets, introducing filtering, preprocessing, and error-analysis techniques that improve their utility for model training and evaluation.
The majority of research on low-resource speech models concerns low-resource languages, where audio quality may vary but the main challenge is adapting to a new linguistic structure. Classroom speech is unique in that the main challenge is the adverse acoustic environment, combined with a distinctive linguistic register that is still within the English language. To this end, we first propose an adaptation technique that adapts the acoustic model through Continued Pre-training (CPT) on unlabeled in-domain audio. On real classroom datasets, this approach demonstrates significantly improved robustness in multi-speaker, noisy conditions.
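Schematically, CPT resumes the model's self-supervised objective on unlabeled in-domain audio before any supervised fine-tuning. The toy sketch below stands in for a wav2vec 2.0-style masked objective with a trivial previous-frame predictor; all names and the loss itself are illustrative assumptions, not the dissertation's actual code.

```python
import random

def masked_reconstruction_loss(frames, mask_prob=0.5, seed=0):
    # Mask random frames and score how well a trivial "predictor"
    # (copying the previous frame) reconstructs them -- a stand-in for
    # the real contrastive/masked objective used in CPT.
    rng = random.Random(seed)
    losses = []
    for i in range(1, len(frames)):
        if rng.random() < mask_prob:
            losses.append((frames[i] - frames[i - 1]) ** 2)
    return sum(losses) / max(len(losses), 1)

# Unlabeled in-domain audio frames (illustrative values).
unlabeled_classroom_audio = [0.0, 0.1, 0.3, 0.2, 0.4]
loss = masked_reconstruction_loss(unlabeled_classroom_audio)
```

In practice, this stage needs no transcripts at all, which is what makes in-domain classroom recordings usable despite lacking labels.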
We follow this work by introducing techniques to leverage weak, inaccurate, and imprecise transcriptions. This approach stems from a practical need to increase the utility of available data and to reduce the demand for costly, accurate verbatim transcriptions. We propose Weakly Supervised Pre-training (WSP), a training paradigm that uses large amounts of weak transcriptions as an intermediate supervision step prior to supervised fine-tuning. We validate this approach with both synthetically corrupted transcripts and real weak transcripts from classroom datasets; in both settings, WSP yields significant improvements, increasing the utility of limited gold-standard transcripts.
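The WSP schedule can be sketched as a two-stage curriculum: first an intermediate supervision pass on the large weak set, then fine-tuning on the small gold set. The sketch below is schematic; the `train` stub, dataset names, and learning rates are illustrative assumptions, not the dissertation's implementation.

```python
def train(model, dataset, lr):
    # Placeholder for one supervised training pass (e.g. a CTC or
    # seq2seq loss against the given transcripts).
    model["updates"].append((dataset["name"], lr))
    return model

def wsp_then_finetune(model, weak_set, gold_set):
    # Stage 1: intermediate supervision on large, noisy transcripts.
    model = train(model, weak_set, lr=1e-4)
    # Stage 2: supervised fine-tuning on limited gold transcripts,
    # typically at a lower learning rate.
    model = train(model, gold_set, lr=1e-5)
    return model

model = {"updates": []}
weak = {"name": "weak_classroom_transcripts"}  # e.g. notes, captions
gold = {"name": "gold_verbatim_transcripts"}   # small, high cost
model = wsp_then_finetune(model, weak, gold)
```

The key design point is the ordering: the weak data shapes the acoustic model first, so the scarce gold transcripts are spent on refinement rather than on learning the domain from scratch.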
A parallel approach to WSP, also aimed at improving the model's acoustic modeling capacity and increasing the utility of limited datasets, is data simulation. Recent speech models have used generative AI to produce synthetic speech from text corpora. This approach, while effective, is expensive and bottlenecked by the naturalness and variability of Text-To-Speech (TTS) models. We instead propose a simulation-based approach that uses available natural speech and simulates the acoustic environment itself. Its novelty lies in the use of game engines to recreate acoustic environments, allowing fast, cheap, and scalable data simulation. Using these virtual environments, we simulate babble noise and capture Room Impulse Responses (RIRs) in simulated classrooms. Pairing these with semantically matched adult and child educational speech from different datasets, we create RealClass, the first and largest shareable classroom speech corpus. Benchmarks show that it outperforms off-the-shelf speech datasets on classroom test data and works best when paired with real classroom speech, increasing the utility of the limited data available.
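A minimal sketch of this kind of acoustic simulation: convolve clean speech with an RIR, then mix in babble noise at a target SNR. In the dissertation the RIRs come from the game-engine environments; here the RIR, signals, and SNR are toy stand-ins, assumed for illustration.

```python
import numpy as np

def simulate_classroom(clean, rir, babble, snr_db=10.0):
    # Reverberate: convolve the clean speech with the room impulse
    # response, trimmed back to the original length.
    rev = np.convolve(clean, rir)[: len(clean)]
    # Scale the babble noise to the requested SNR relative to the
    # reverberant speech, then mix.
    speech_pow = np.mean(rev ** 2)
    noise_pow = np.mean(babble[: len(rev)] ** 2)
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return rev + scale * babble[: len(rev)]

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)       # 1 s of "speech" at 16 kHz
rir = np.exp(-np.arange(800) / 200.0)    # toy exponentially decaying RIR
babble = rng.standard_normal(16000)      # toy babble noise
noisy = simulate_classroom(clean, rir, babble, snr_db=5.0)
```

Because only the environment is simulated, the speech itself stays natural, sidestepping the TTS naturalness bottleneck noted above.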
Our final contribution to improving the acoustic modeling of ASR models concerns the reintegration of speech articulatory parameters into ASR. Specifically, we add an auxiliary acoustic-to-articulatory Speech Inversion (SI) task, which predicts articulatory parameters from the raw speech signal, to the ASR model. This approach significantly improves ASR performance under low-resource conditions.
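One common way to realize such an auxiliary task is a weighted multi-task loss: the ASR loss plus a regression loss on the predicted articulatory trajectories. The sketch below assumes a mean-squared-error SI term and a weighting `lam`; both are illustrative choices, not necessarily those of the dissertation.

```python
def multitask_loss(asr_loss, si_pred, si_target, lam=0.3):
    # SI term: mean-squared error between predicted and reference
    # articulatory parameters (e.g. vocal-tract variables).
    mse = sum((p - t) ** 2 for p, t in zip(si_pred, si_target)) / len(si_pred)
    # Total objective: ASR loss plus the weighted auxiliary SI loss.
    return asr_loss + lam * mse

total = multitask_loss(asr_loss=2.0,
                       si_pred=[0.1, 0.4, 0.2],
                       si_target=[0.0, 0.5, 0.2],
                       lam=0.3)
```

The auxiliary gradient regularizes the shared encoder toward articulatorily meaningful representations, which is where the low-resource gains come from.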
Finally, we compare our approaches to emerging technologies, namely Large Language Model (LLM)-driven ASR models. Our benchmarks highlight the potential of prompt-based contextual biasing as a cheap and effective adaptation technique. We also show that the contributions of this dissertation, applied to a smaller acoustic ASR model without LLM components, outperform LLM-based models even with contextual biasing and supervised fine-tuning on in-domain classroom data. This finding underscores that, despite the superior linguistic modeling capacity of LLM-based approaches, improved acoustic modeling is more effective in acoustically adverse environments. Collectively, these contributions advance the understanding and practical performance of ASR systems under data scarcity, offering generalizable methodologies for low-resource speech recognition beyond educational domains.
