Ph.D. Dissertation Defense: Yi-Ting Shen
Wednesday, April 8, 2026
3:45 p.m.
AVW 1146
ANNOUNCEMENT: Ph.D. Dissertation Defense
Name: Yi-Ting Shen
Committee:
Professor Shuvra S. Bhattacharyya (Chair/Advisor)
Professor Joseph JaJa
Professor Dinesh Manocha
Dr. Heesung Kwon
Professor Abhinav Shrivastava (Dean's Representative)
Date/time: Wednesday, April 8, 2026, 3:45-5:45 p.m.
Location: AVW 1146
Title: Tackling Data Scarcity in Human-Centric Vision: From UAV-Based Detection to Multimodal Pose Understanding
Abstract:
Human-centric vision tasks, such as human detection and pose understanding, are fundamentally constrained by limited data availability and the high cost of annotation, posing significant challenges for data-driven learning approaches. These challenges are further amplified in UAV-based and multimodal scenarios. In UAV-based settings, large variations in viewpoint, scale, and imaging conditions demand substantially more data to learn robust and generalizable representations. In multimodal settings, annotation becomes more complex, requiring semantically consistent alignment across modalities (e.g., vision and language), which further limits scalability. This dissertation addresses these challenges through advances in dataset design, synthetic data utilization, and scalable multimodal annotation, supported by novel methods, benchmarks, and systematic analyses.
For UAV-based scenarios, we focus on aerial-view human detection. We first introduce Archangel, a hybrid benchmark that integrates real and synthetic imagery with metadata on camera position and human pose, enabling fine-grained evaluation under diverse conditions. Our analysis reveals that naively incorporating synthetic data into training is often ineffective due to the domain gap between synthetic and real data. To address this, we propose Progressive Transformation Learning (PTL), which incrementally selects and transforms synthetic data to better align with the real domain based on measured domain discrepancies before incorporating it into training, leading to improved detection performance in data-scarce settings. We further develop SynPoseDiv, a synthetic pose diversification framework that combines a diffusion-based pose generator with a pose-guided image translation model, enhancing synthetic data diversity and improving performance.
For multimodal scenarios, we address the challenge of scalable annotation for composed pose retrieval (CPR). We propose AutoComPose, a framework based on multimodal large language models (LLMs) that enables automatic pose-transition annotation. To evaluate its effectiveness, we construct new CPR benchmarks, demonstrating improved retrieval performance while substantially reducing reliance on manual annotation.
In summary, this dissertation demonstrates that data scarcity in human-centric vision can be effectively mitigated through improved dataset curation, progressive sim-to-real transformation of synthetic data, enhanced pose diversity, and automated multimodal annotation. These contributions advance scalable and data-efficient learning in complex visual environments.
