Ph.D. Dissertation Defense: Yi-Ting Shen
Wednesday, April 8, 2026
3:45 p.m.
AVW 1146
ANNOUNCEMENT: Ph.D. Dissertation Defense
Name: Yi-Ting Shen
Committee:
Professor Shuvra S. Bhattacharyya (Chair/Advisor)
Professor Joseph JaJa
Professor Dinesh Manocha
Dr. Heesung Kwon
Professor Abhinav Shrivastava (Dean's Representative)
Date/time: Wednesday, April 8, 2026, 3:45-5:45 p.m.
Location: AVW 1146
Title: Tackling Data Scarcity in Human-Centric Vision: From UAV-Based Detection to Multimodal Pose Understanding
Abstract:
Human-centric vision tasks, such as human detection and pose understanding, are fundamentally constrained by limited data availability and the high cost of annotation, posing significant challenges for data-driven learning approaches. These challenges are further amplified in UAV-based and multimodal scenarios. In UAV-based settings, large variations in viewpoint, scale, and imaging conditions demand substantially more data to learn robust and generalizable representations. In multimodal settings, annotation becomes more complex, requiring semantically consistent alignment across modalities (e.g., vision and language), which further limits scalability. This dissertation addresses these challenges through advances in dataset design, synthetic data utilization, and scalable multimodal annotation, supported by novel methods, benchmarks, and systematic analyses.
For UAV-based scenarios, we focus on aerial-view human detection. We first introduce Archangel, a hybrid benchmark that integrates real and synthetic imagery with metadata on camera position and human pose, enabling fine-grained evaluation under diverse conditions. Our analysis reveals that naively incorporating synthetic data into training is often ineffective due to the domain gap between synthetic and real data. To address this, we propose Progressive Transformation Learning (PTL), which incrementally selects and transforms synthetic data to better align with the real domain based on measured domain discrepancies before incorporating it into training, leading to improved detection performance in data-scarce settings. We further develop SynPoseDiv, a synthetic pose diversification framework that combines a diffusion-based pose generator with a pose-guided image translation model, enhancing synthetic data diversity and improving performance.
For multimodal scenarios, we address the challenge of scalable annotation for composed pose retrieval (CPR). We propose AutoComPose, a framework based on multimodal large language models (LLMs) that enables automatic pose-transition annotation. To evaluate its effectiveness, we construct new CPR benchmarks, demonstrating improved retrieval performance while substantially reducing reliance on manual annotation.
In summary, this dissertation demonstrates that data scarcity in human-centric vision can be effectively mitigated through improved dataset curation, progressive sim-to-real transformation of synthetic data, enhanced pose diversity, and automated multimodal annotation. These contributions advance scalable and data-efficient learning in complex visual environments.
