Ph.D. Defense: Jun Wang
Monday, November 28, 2022
Location: 3137 IRB
Join Zoom Meeting: https://umd.zoom.us/j/5065254722
Dial-in: 301 405 3681
ANNOUNCEMENT: Ph.D. Defense
NAME: Jun Wang
Professor Joseph F. JaJa (Chair)
Professor Larry S. Davis (Co-Chair)
Professor Min Wu
Professor Furong Huang
Professor Yang Tao (Dean's Representative)
Date/time: Monday, November 28, 2022, 9:30am-11:30am EST
Location: 3137 IRB
Join Zoom Meeting: https://umd.zoom.us/j/5065254722
Title: Deep Learning for Scene Perception and Understanding
The ability to accurately perceive objects and capture motion information from the environment is crucial in many real-world applications, including autonomous driving, augmented reality, and robotics. In this dissertation, we will give an overview of our recent work on scene perception and understanding.
Point cloud data are widely used in scene perception tasks. We propose three approaches that improve efficiency and accuracy from different perspectives. First, to address the varying density of 3D point clouds, we introduce InfoFocus, which improves 3D object detection accuracy with little overhead by forcing the network to attend to the most informative parts of the point cloud. Second, to narrow the gap between different feature representations, we introduce M3DETR, which models the point cloud with transformers that fuse multi-representation, multi-scale, and mutual-relation features. Third, to understand dynamic 3D environments and identify the motion of objects, we propose PointMotionNet, which handles 3D motion learning with a novel point-based spatiotemporal convolution operation.
Beyond accurately classifying and locating objects and predicting their behavior, we observe that scenes are often text-rich: scene text provides useful contextual information that can further aid perception. For example, to safely navigate complex traffic scenarios, an autonomous system needs to understand the rules of the road, such as spotting traffic signals or temporary road signs. We introduce TAG, which exploits underexplored scene text information and enhances the scene understanding of Text-VQA models by producing meaningful and accurate question-answer (QA) samples with a multimodal transformer. TAG has the potential to help identify challenging traffic situations that autonomous vehicles encounter on the road.