Ph.D. Dissertation Defense: Ahmed Adel Attia

Tuesday, March 3, 2026
9:00 a.m.
AVW 1146

Name: Ahmed Adel Attia

Committee:
Professor Carol Espy-Wilson (Chair/Advisor)
Professor Shihab Shamma
Professor Sanghamitra Dutta
Professor Dinesh Manocha
Professor Ramani Duraiswami (Dean's Representative)

Date/time: Tuesday, March 3, 2026, 9:00 a.m.

Location: AVW 1146

Title: 
Advancing Speech Recognition for Low-Resource Domains: A Case Study in Educational and Classroom Speech

Abstract: 
 
Automatic Speech Recognition (ASR) technology has advanced significantly in recent years, largely driven by innovations in Artificial Intelligence (AI) and transformer-based models. Like all AI models, ASR models are heavily data-dependent, causing them to underperform in many low-resource settings. Many critical applications remain data-scarce for various reasons, including annotation costs, privacy concerns, and challenging acoustic conditions. Classrooms and other educational settings exemplify such domains: they hold large potential for useful speech AI applications that remains largely unrealized due to adverse acoustic environments and data scarcity. ASR has the potential to support inclusive, adaptive learning by providing real-time transcription for students who are hard of hearing or English Language Learners (ELLs) and by offering educators valuable insights into classroom dynamics. At the same time, classrooms represent a particularly challenging low-resource English domain, with unique complexities including high variability in speech patterns, disfluencies, overlapping conversations, and background noise. These conditions often cause ASR models to perform poorly, underscoring the need for specialized approaches in educational ASR.


This dissertation examines data-centric and training-centric approaches for enhancing ASR under adverse low-resource conditions, using classroom speech as a motivating and representative application domain. We study the adaptation of transformer-based ASR models, including Whisper and wav2vec 2.0, to children's speech and classroom environments. We begin by curating and analyzing public and otherwise available children's and classroom speech datasets, introducing filtering, preprocessing, and error analysis techniques that improve their utility for model training and evaluation.

Most research on low-resource speech models concerns low-resource languages, where audio quality may vary but the main challenge is adapting to a new linguistic structure. Classroom speech is unique in that the main challenge is the adverse acoustic environment, paired with a distinct linguistic style that is still within the English language. To this end, we first propose an adaptation technique that adapts the acoustic model through Continued Pre-training (CPT) on unlabeled in-domain audio. On real classroom datasets, this approach demonstrates significantly improved robustness in multi-speaker, noisy conditions.
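As an illustration of the CPT recipe (not the dissertation's actual code), the minimal sketch below continues wav2vec 2.0's self-supervised contrastive objective on unlabeled in-domain audio using HuggingFace's pre-training utilities; the checkpoint name, masking hyperparameters, and learning rate are assumptions. Because only unlabeled audio is required, untranscribed classroom recordings can be used before any supervised fine-tuning.

    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining
    from transformers.models.wav2vec2.modeling_wav2vec2 import (
        _compute_mask_indices,
        _sample_negative_indices,
    )

    model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model.train()

    def cpt_step(waveforms):
        """One CPT update on a batch of equal-length 16 kHz waveforms."""
        inputs = extractor(waveforms, sampling_rate=16_000, return_tensors="pt")
        batch = inputs.input_values.shape[0]
        seq_len = int(model._get_feat_extract_output_lengths(
            inputs.input_values.shape[-1]))
        # Mask a subset of latent frames and sample distractors for the
        # contrastive objective, as in the original self-supervised setup.
        mask = _compute_mask_indices((batch, seq_len),
                                     mask_prob=0.5, mask_length=10)
        negatives = _sample_negative_indices((batch, seq_len),
                                             num_negatives=100,
                                             mask_time_indices=mask)
        out = model(inputs.input_values,
                    mask_time_indices=torch.from_numpy(mask),
                    sampled_negative_indices=torch.from_numpy(negatives))
        out.loss.backward()  # contrastive + codebook-diversity loss
        optimizer.step()
        optimizer.zero_grad()
        return out.loss.item()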

We follow this work by introducing techniques to leverage weak, inaccurate, and imprecise transcriptions. This approach stems from a practical need to increase the utility of available data and to reduce the demand for costly, accurate verbatim transcription. We propose Weakly Supervised Pre-training (WSP), a training paradigm that uses large amounts of weak transcriptions as an intermediate supervision step prior to supervised fine-tuning. We validate this approach with both synthetically corrupted transcripts and real weak transcripts from classroom datasets; under both tests, WSP yields significant improvements, increasing the utility of limited gold-standard transcripts.
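Conceptually, WSP is a two-stage schedule: supervised training on plentiful weak transcripts first, then fine-tuning on the small gold-standard set. The sketch below expresses this with HuggingFace's Seq2SeqTrainer and a Whisper-style model; the dataset objects, epoch counts, and learning rates are illustrative placeholders, not the dissertation's actual configuration.

    from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                              WhisperForConditionalGeneration)

    def wsp_then_finetune(model_name, weak_ds, gold_ds, collator):
        model = WhisperForConditionalGeneration.from_pretrained(model_name)
        # Stage 1: intermediate supervision on large amounts of weak,
        # imprecise transcripts.
        stage1 = Seq2SeqTrainingArguments(output_dir="wsp_stage1",
                                          learning_rate=1e-5,
                                          num_train_epochs=1)
        Seq2SeqTrainer(model=model, args=stage1, train_dataset=weak_ds,
                       data_collator=collator).train()
        # Stage 2: conventional fine-tuning on the small gold-standard
        # verbatim set, starting from the WSP checkpoint.
        stage2 = Seq2SeqTrainingArguments(output_dir="wsp_stage2",
                                          learning_rate=5e-6,
                                          num_train_epochs=3)
        Seq2SeqTrainer(model=model, args=stage2, train_dataset=gold_ds,
                       data_collator=collator).train()
        return model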

A parallel approach to WSP, likewise aimed at improving the model's acoustic modeling capacity and increasing the utility of limited datasets, is data simulation. Recent speech models have used generative AI to produce speech from text corpora. This approach, while effective, is expensive and bottlenecked by the naturalness and variability of Text-To-Speech (TTS) models. We instead propose a simulation-based approach that uses available natural speech and simulates the acoustic environment itself. Its novelty lies in using game engines to recreate acoustic environments, allowing fast, cheap, and scalable data simulation. Using these virtual environments, we simulate babble noise and capture Room Impulse Responses (RIRs) in simulated classrooms. Pairing these with semantically matched adult and child educational speech from different datasets, we create RealClass, the first and largest shareable classroom speech corpus. Benchmarks show that training on RealClass outperforms training on off-the-shelf speech datasets when evaluated on classroom test data, and that it works best when paired with real classroom speech, increasing the utility of the limited data available.
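The core simulation step reduces to convolving clean speech with a captured RIR and mixing in babble at a target signal-to-noise ratio. The sketch below shows this step with torchaudio; the file names and the 10 dB SNR are assumptions, and in the dissertation the RIRs and babble come from game-engine-simulated classrooms rather than pre-recorded files.

    import torch
    import torchaudio
    import torchaudio.functional as F

    speech, sr = torchaudio.load("clean_utterance.wav")  # dry source speech
    rir, _ = torchaudio.load("classroom_rir.wav")        # simulated classroom RIR
    babble, _ = torchaudio.load("babble_noise.wav")      # simulated babble track

    rir = rir / rir.norm(p=2)                  # normalize RIR energy
    reverberant = F.fftconvolve(speech, rir)   # impose the room acoustics
    # Assumes the babble track is at least as long as the reverberant speech.
    babble = babble[:, : reverberant.shape[-1]]
    noisy = F.add_noise(reverberant, babble, snr=torch.tensor([10.0]))  # 10 dB
    torchaudio.save("simulated_classroom.wav", noisy, sr)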

Our final contribution to improving the acoustic modeling of ASR models concerns the reintegration of speech articulatory parameters into ASR. Specifically, we augment the ASR model with an auxiliary acoustic-to-articulatory Speech Inversion (SI) task that predicts articulatory parameters from the raw speech signal. This approach significantly improves the performance of ASR models under low-resource conditions.
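A common way to realize such an auxiliary task is multi-task learning: a shared acoustic encoder feeds both an ASR head and a regression head that predicts the articulatory parameters, trained with a weighted sum of the two losses. The sketch below is a minimal PyTorch version; the hidden size, number of articulatory parameters, CTC objective, and loss weight are illustrative assumptions, not the dissertation's actual architecture.

    import torch
    import torch.nn as nn

    class MultiTaskASR(nn.Module):
        """Shared encoder with a CTC head for ASR and an SI regression head."""
        def __init__(self, encoder, hidden_dim=768, vocab_size=32, n_artic=12):
            super().__init__()
            self.encoder = encoder  # any module mapping input to (B, T, hidden)
            self.ctc_head = nn.Linear(hidden_dim, vocab_size)  # ASR logits
            self.si_head = nn.Linear(hidden_dim, n_artic)      # articulatory params

        def forward(self, feats):
            h = self.encoder(feats)  # (batch, frames, hidden)
            return self.ctc_head(h).log_softmax(-1), self.si_head(h)

    def joint_loss(log_probs, si_pred, targets, in_lens, tgt_lens,
                   si_true, alpha=0.3):
        # ASR objective plus the auxiliary SI regression term; alpha trades
        # the two off and is a hypothetical value here.
        ctc = nn.functional.ctc_loss(log_probs.transpose(0, 1), targets,
                                     in_lens, tgt_lens)
        si = nn.functional.mse_loss(si_pred, si_true)
        return ctc + alpha * si

    # Illustrative usage with a trivial frame-wise encoder standing in for
    # a pre-trained acoustic encoder:
    model = MultiTaskASR(encoder=nn.Linear(80, 768))
    log_probs, si_pred = model(torch.randn(2, 120, 80))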

Finally, we compare our approaches to emerging technologies, namely Large Language Model (LLM)-driven ASR models. Our benchmarks highlight the strong potential of prompt-based contextual biasing as a cheap and effective adaptation technique. We also show that the contributions of this dissertation, applied to a smaller acoustic ASR model without LLM components, outperform LLM-based models even with contextual biasing and supervised fine-tuning on in-domain classroom data. This finding underscores that, despite the superior linguistic modeling capacity of LLM-based approaches, improved acoustic modeling is more effective in acoustically adverse environments. Collectively, these contributions advance the understanding and practical performance of ASR systems under data scarcity, offering generalizable methodologies for low-resource speech recognition beyond educational domains.
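Prompt-based contextual biasing amounts to placing in-domain vocabulary in the decoder's prompt so that decoding is nudged toward those terms. The sketch below illustrates the idea with Whisper's prompt mechanism as a stand-in for the LLM-based systems we benchmark; the checkpoint and the lesson vocabulary are assumptions.

    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    # Bias decoding toward in-domain classroom terms via the prompt.
    prompt_ids = processor.get_prompt_ids(
        "Classroom lesson on fractions: numerator, denominator, equivalent.",
        return_tensors="pt")

    def transcribe(audio_16k):
        """Transcribe a 1-D 16 kHz audio array with the biasing prompt."""
        feats = processor(audio_16k, sampling_rate=16_000,
                          return_tensors="pt").input_features
        ids = model.generate(feats, prompt_ids=prompt_ids)
        return processor.batch_decode(ids, skip_special_tokens=True)[0]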

Audience: Graduate, Faculty
