Ph.D. Dissertation Defense: Kota Hara

Monday, August 8, 2016
10:30 a.m.
Room 4424 AVW Bldg.
Maria Hoo
301 405 3681
mch@umd.edu

ANNOUNCEMENT: Ph.D. Dissertation Defense


NAME: Kota Hara

Advisory Committee:
Professor Rama Chellappa, Chair/Advisor
Professor Larry Davis
Professor Min Wu
Professor Uzi Vishkin
Professor Amitabh Varshney, Dean's representative

Date/Time: Monday, August 8, 2016 at 10:30 am


Place: Room 4424 A.V. Williams Building

Title: DEEP NEURAL NETWORKS AND REGRESSION MODELS FOR OBJECT DETECTION AND POSE ESTIMATION

Estimating the pose, orientation, and location of objects has been
a central problem in the computer vision community for decades. In
this dissertation, we propose new approaches to these important
problems using deep neural networks as well as tree-based regression
models.

For the first topic, we look at the human body pose estimation problem
and propose a novel regression-based approach. The goal of human body
pose estimation is to predict the locations of body joints, given an
image of a person. Because of the significant variations introduced by
pose, clothing, and body shape, it is extremely difficult to address
this task with a standard application of regression. We therefore
divide the whole-body pose estimation problem into a set of local pose
estimation problems, linked by a dependency graph that describes the
dependencies among body joints. For each local problem, we train a
boosted regression tree model and estimate the pose by progressively
applying the regressors along the paths of the dependency graph,
starting from the root node.
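
For illustration only, a minimal sketch of this progressive scheme, with a hypothetical feature extractor, joint names, and dependency graph, and with scikit-learn's boosted regression trees standing in for the dissertation's models, might look like:

```python
# Illustrative sketch (not the dissertation's code) of progressive pose
# estimation along a joint-dependency tree using boosted regression trees.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Hypothetical dependency graph: each joint is predicted from its parent.
PARENT = {"head": None, "neck": "head", "l_shoulder": "neck", "r_shoulder": "neck"}

def local_features(image, parent_xy):
    """Placeholder: describe the image region around the parent estimate."""
    return np.random.rand(64)  # stand-in for real image features

def train_local_regressors(train_images, train_joints):
    """Train one boosted regressor per non-root joint on features
    extracted around the ground-truth parent location."""
    models = {}
    for joint, parent in PARENT.items():
        if parent is None:
            continue
        X = [local_features(img, gt[parent]) for img, gt in zip(train_images, train_joints)]
        y = [gt[joint] for gt in train_joints]  # (x, y) target per sample
        models[joint] = MultiOutputRegressor(
            GradientBoostingRegressor(n_estimators=200, max_depth=3)
        ).fit(np.array(X), np.array(y))
    return models

def estimate_pose(image, root_xy, models):
    """Apply the local regressors progressively, root to leaves."""
    pose = {"head": root_xy}  # root joint location assumed given here
    for joint, parent in PARENT.items():
        if parent is None:
            continue
        feats = local_features(image, pose[parent]).reshape(1, -1)
        pose[joint] = tuple(models[joint].predict(feats)[0])
    return pose
```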

Next, we turn our attention to the role of pose information in the
object detection task. In particular, we focus on detecting the
fashion items a person is wearing or carrying. The locations of these
items are strongly correlated with the pose of the person. To address
this task, we first generate a set of candidate bounding boxes using
an object proposal algorithm. For each candidate bounding box, image
features are extracted by a deep convolutional neural network
pre-trained on a large image dataset, and detection scores are
generated by SVMs. We introduce a pose-dependent prior on the geometry
of the bounding boxes and combine it with the SVM scores. We
demonstrate that the proposed algorithm achieves a significant
improvement in detection performance.
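
As an illustrative sketch only, the fusion of SVM scores with a pose-dependent geometric prior could take a form like the following, where the Gaussian prior, its parameters, and the choice of reference joint are assumptions rather than the dissertation's exact formulation:

```python
# Illustrative sketch: rescore candidate boxes by adding the log of a
# pose-dependent Gaussian prior on the box center's offset from a body joint.
import numpy as np

def log_pose_prior(box, joint_xy, mean_offset, cov):
    """Log Gaussian prior on the box center's offset from a joint
    (e.g., a handbag box relative to the wrist)."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    d = np.array([cx - joint_xy[0], cy - joint_xy[1]]) - mean_offset
    inv = np.linalg.inv(cov)
    return -0.5 * d @ inv @ d - 0.5 * np.log(np.linalg.det(2 * np.pi * cov))

def rescore(candidates, svm_scores, joint_xy, mean_offset, cov, alpha=1.0):
    """Fuse raw SVM scores with the geometric prior; keep the best box."""
    fused = [s + alpha * log_pose_prior(b, joint_xy, mean_offset, cov)
             for b, s in zip(candidates, svm_scores)]
    best = int(np.argmax(fused))
    return candidates[best], fused[best]

# Toy usage: two candidate handbag boxes, one near the estimated wrist.
boxes = [(100, 200, 160, 280), (300, 50, 360, 130)]
scores = [0.8, 1.1]  # raw SVM scores
wrist = (120, 230)
print(rescore(boxes, scores, wrist, mean_offset=np.zeros(2), cov=40.0**2 * np.eye(2)))
```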

Our next work improves the traditional regression tree method and
demonstrates its effectiveness on pose and orientation estimation
tasks. Traditional regression tree training has three main issues:
1) node splitting is limited to binary splits, 2) the splitting
function is limited to thresholding a single dimension of the input
vector, and 3) the best splitting function is found by exhaustive
search. We propose a novel node splitting algorithm for regression
tree training that avoids these issues. The algorithm first applies
k-means clustering in the output space, then conducts multi-class
classification with an SVM, and finally determines the constant
estimate at each leaf node. We apply regression forests built from our
regression trees to head pose estimation, car orientation estimation,
and pedestrian orientation estimation, and demonstrate their
superiority over various standard regression methods.
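
A minimal sketch of this node-splitting idea, using scikit-learn's k-means and SVM on toy data (illustrative assumptions, not the dissertation's implementation), is:

```python
# Illustrative sketch: split a regression-tree node by clustering the outputs
# with k-means, training an SVM to route inputs to those clusters, and storing
# each child's mean output as its constant estimate.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def split_node(X, Y, n_children=3):
    """X: (n, d_in) inputs, Y: (n, d_out) outputs at this node."""
    labels = KMeans(n_clusters=n_children, n_init=10).fit_predict(Y)  # cluster in output space
    router = SVC(kernel="rbf", C=1.0).fit(X, labels)                  # multi-way splitting function
    children = {c: Y[labels == c].mean(axis=0) for c in range(n_children)}
    return router, children

def predict(x, router, children):
    """Route a sample with the SVM and return the child's constant estimate."""
    c = int(router.predict(x.reshape(1, -1))[0])
    return children[c]

# Toy usage: 2-D orientation targets (cosine and sine of an angle).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
theta = X[:, 0] * 2.0
Y = np.stack([np.cos(theta), np.sin(theta)], axis=1)
router, children = split_node(X, Y)
print(predict(X[0], router, children))
```

Unlike an axis-aligned binary split found by exhaustive search, this split can produce more than two children, uses a nonlinear decision boundary over the full input vector, and requires no search over candidate thresholds.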

Lastly, we address the object detection task by incorporating an
attention mechanism into the detection algorithm. Humans can allocate
multiple fixation points, each attending to a different location and
scale of the scene. Such a mechanism, however, is missing from current
state-of-the-art object detection methods. Inspired by the human
visual system, we propose a novel deep network architecture that
imitates this attention mechanism. To detect objects in an image, the
network adaptively places a sequence of glimpses at different
locations in the image. Evidence for the presence of an object and its
location is extracted from these glimpses and fused to estimate the
object class and bounding-box coordinates. Because ground-truth
annotations for the visual attention mechanism are unavailable, we
train the network with a reinforcement learning algorithm.
Experimental results on standard object detection benchmarks show that
the proposed network consistently outperforms baseline networks that
do not employ the attention mechanism.
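
The reinforcement learning component can be illustrated with a heavily simplified REINFORCE sketch, in which a Gaussian policy over glimpse locations is rewarded for landing near the object; the reward, policy form, and constants here are assumptions for illustration only, not the defended architecture:

```python
# Illustrative sketch: train a glimpse-placement policy with REINFORCE.
# Glimpse locations are sampled from a Gaussian policy, rewarded when they
# fall near the object (a proxy for detection quality), and the policy mean
# is updated with the score-function gradient.
import numpy as np

rng = np.random.default_rng(0)
target = np.array([0.7, 0.3])   # normalized object center (toy ground truth)
mu = np.array([0.5, 0.5])       # policy mean over glimpse locations
sigma, lr = 0.1, 0.05

for step in range(200):
    # Sample a short sequence of glimpses from the current policy.
    glimpses = rng.normal(mu, sigma, size=(3, 2))
    # Reward: higher when glimpses land near the object.
    rewards = np.exp(-np.sum((glimpses - target) ** 2, axis=1) / (2 * 0.05))
    baseline = rewards.mean()  # simple variance-reduction baseline
    # REINFORCE: grad of log N(g; mu, sigma^2) w.r.t. mu is (g - mu) / sigma^2
    grad = ((rewards - baseline)[:, None] * (glimpses - mu) / sigma**2).mean(axis=0)
    mu += lr * grad

print("learned glimpse mean:", mu, "target:", target)
```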


 
 

Audience: Graduate, Faculty
