Event
Ph.D. Dissertation Defense: Anton Ratnarajah
Friday, August 23, 2024
1:00 p.m.-3:00 p.m.
AVW 1146 (ISR)
Maria Hoo
301-405-3681
mch@umd.edu
ANNOUNCEMENT: Ph.D. Dissertation Defense
Name: Anton Ratnarajah
Committee:
Professor Dinesh Manocha (Chair)
Professor Carol Espy-Wilson
Professor Ramani Duraiswami
Professor Sanghamitra Dutta
Professor Nikhil Chopra (Dean's Representative)
Date/Time: Friday, August 23, 2024, 1:00 p.m. to 3:00 p.m.
Location: AVW 1146 (ISR)
Abstract:
Sound propagation is the process by which sound energy emitted by a speaker travels through the air as sound waves. The room impulse response (RIR) characterizes this process and depends on the positions of the source and listener, the room's geometry, and its materials. Physics-based acoustic simulators have been used for decades to generate high-quality RIRs for specific acoustic environments. However, existing acoustic simulators have limitations: for example, they require a 3D representation of the environment and detailed knowledge of its materials.
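For context, reverberant speech is commonly modeled as the convolution of the dry (anechoic) source signal with the RIR. The following is a minimal sketch of that relationship, not code from the dissertation; the sample rate and both signals are illustrative placeholders.

    # Minimal sketch (not from the dissertation): reverberant speech as the
    # convolution of a dry source signal with a room impulse response.
    import numpy as np
    from scipy.signal import fftconvolve

    fs = 16000  # assumed sample rate in Hz

    # Placeholder dry speech: one second of random noise standing in for audio.
    dry_speech = np.random.randn(fs)

    # Placeholder RIR: exponentially decaying noise, a crude stand-in for a
    # measured or simulated impulse response.
    t = np.arange(int(0.5 * fs)) / fs
    rir = np.random.randn(t.size) * np.exp(-t / 0.15)
    rir /= np.max(np.abs(rir))

    # Reverberant speech = dry speech convolved with the RIR.
    reverberant = fftconvolve(dry_speech, rir)[: dry_speech.size]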
To address these limitations, we propose three solutions. First, we introduce a learning-based RIR generator that is two orders of magnitude faster than an interactive ray-tracing simulator. Our approach can be trained to be controlled directly by both statistical and traditional input parameters, and it can generate monaural and binaural RIRs for both reconstructed and synthetic 3D scenes. Our generated RIRs outperform those from interactive ray-tracing simulators in speech-processing applications, improving Automatic Speech Recognition (ASR), speech enhancement, and speech separation by 2.5%, 12%, and 48%, respectively.
Second, we propose estimating RIRs from reverberant speech signals and visual cues when no 3D representation of the environment is available. By estimating RIRs from reverberant speech, we can augment training data to match the test conditions, improving the word error rate of the ASR system. Our estimated RIRs yield a 6.9% improvement over previous learning-based RIR estimators on real-world far-field ASR tasks. We also show that RIR estimation enables efficient compression in multi-channel audio codecs, reducing the bandwidth of AudioDec by 52% for binaural speech. Our audio-visual RIR estimator supports tasks such as visual acoustic matching, novel-view acoustic synthesis, and voice dubbing, as validated through perceptual evaluation.
Finally, we introduce IR-GAN, which generates new high-quality RIRs from real RIRs. IR-GAN parametrically controls acoustic parameters learned from real RIRs to synthesize RIRs that imitate different acoustic environments, outperforming ray-tracing simulators on the Kaldi far-field ASR benchmark by 8.95%.
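One widely used acoustic parameter of this kind is the reverberation time T60. As illustration only (this sketch is not taken from the dissertation), a standard way to estimate T60 from an RIR is Schroeder backward integration; the input RIR and sample rate below are assumed placeholders.

    # Illustrative sketch: estimating reverberation time (T60) from an RIR
    # via Schroeder backward integration. Not from the dissertation.
    import numpy as np

    def estimate_t60(rir: np.ndarray, fs: int) -> float:
        """Fit the -5 dB to -25 dB span of the Schroeder energy decay
        curve and extrapolate the fitted slope to -60 dB (T20 method)."""
        # Schroeder curve: backward-integrated energy, normalized, in dB.
        edc = np.cumsum(rir[::-1] ** 2)[::-1]
        edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)

        # First indices where the decay curve crosses -5 dB and -25 dB.
        i5 = np.argmax(edc_db <= -5.0)
        i25 = np.argmax(edc_db <= -25.0)

        # Linear fit (in dB) over the -5..-25 dB span, then extrapolate.
        t = np.arange(i5, i25) / fs
        slope, _ = np.polyfit(t, edc_db[i5:i25], 1)  # dB per second
        return -60.0 / slope

Applied to a synthetic exponentially decaying RIR like the placeholder above, this returns approximately the decay time implied by its envelope.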