Ph.D. Research Proposal Exam: Xiyang Wu
Monday, August 11, 2025
11:00 a.m.
Souad Nejjar
301 405 8135
snejjar@umd.edu
ANNOUNCEMENT: Ph.D. Research Proposal Exam
Name: Xiyang Wu
Committee:
Professor Dinesh Manocha (Chair)
Professor Pratap Tokekar
Professor Sanghamitra Dutta
Date/time: Monday, August 11, 2025 at 11:00 AM
Title: Bridging the Misalignment Between the Foundation Models and the Physical World
Abstract: In recent years, foundation models have achieved remarkable breakthroughs across a wide range of fields. However, their misalignment with the physical world, i.e., the divergence between their perception and comprehension of the physical world and reality itself, particularly in understanding human behavior, common sense, and physical laws, still limits their ability to perform advanced tasks in real-world settings. This research proposal aims to address these limitations through three major contributions: (1) Developing human-behavior- and intention-aware algorithms that help foundation models interpret human context and adapt their decision-making in collaborative tasks in navigation scenarios. (2) Designing data-efficient post-training algorithms to boost foundation models’ physical and commonsense reasoning by integrating spatial and motion understanding modules for real-world perception, including instruction tuning to enhance reasoning and cross-modal alignment. (3) Enhancing the reliability and trustworthiness of foundation models by identifying and analyzing vulnerabilities, such as hallucination and instruction misalignment, through multiple dedicated algorithms designed to trigger these vulnerabilities. The ultimate goal is to develop hallucination-robust, human-intention-aware, and physically plausible foundation models for both generative tasks (such as world models, which learn internal representations of the real world for embodied agents) and discriminative tasks (such as embodied foundation models for robot decision-making with deep reasoning capabilities and strong human intent comprehension in complex environments).