Event
Ph.D. Dissertation Defense: Sri Venkata Anirudh Nanduri
Thursday, March 27, 2025
2:00 p.m.
IRB 4107
Maria Hoo
301 405 3681
mch@umd.edu
ANNOUNCEMENT: Ph.D. Dissertation Defense
Name: Sri Venkata Anirudh Nanduri
Committee:
Professor Rama Chellappa, Chair
Professor Min Wu
Professor Shuvra Bhattacharyya
Professor Abhinav Shrivastava
Dr. Richard W. Vorder Bruegge
Professor Ramani Duraiswami, Dean's Representative
Date/time: Thursday, March 27, 2025 at 2:00 p.m.
Location: IRB 4107
Title: Multi-Domain Biometric Recognition using Face and Body Embeddings
Abstract:
Although image- and video-based biometric recognition achieves excellent performance in the visible spectrum, even under unconstrained conditions with variations in pose, illumination, and resolution, recognition in more challenging domains, such as infrared, surveillance, or long-range imagery, remains difficult due to domain shifts and limited labeled data. In this dissertation, we study the problem of multi-domain biometric recognition using face and body embeddings on the IARPA JANUS Benchmark Multi-domain Face (IJB-MDF) dataset.
While systems based on deep neural networks have produced remarkable performance on many tasks, such as face/object detection and recognition, they also require large amounts of labeled training data. However, in many applications, collecting a relatively large labeled training dataset may not be feasible due to time and/or financial constraints, and training deep networks on such small datasets in the standard manner usually leads to serious over-fitting and poor generalization. We explore how a state-of-the-art deep learning pipeline for unconstrained visible-spectrum face identification and verification can be adapted to domains with scarce data/label availability using a semi-supervised learning approach. The rationale for system adaptation and the experiments are set in the following context: given a network pretrained on a large training dataset in the source domain, adapt it to generalize to a target domain using a relatively small labeled training dataset (typically a hundred to ten thousand times smaller) and an unlabeled training dataset. We present algorithms and results of extensive experiments with varying training dataset sizes, compositions, and model architectures, using the IJB-MDF dataset for training and evaluation, with the visible and short-wave infrared (SWIR) domains as the source and target domains, respectively.
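A common building block in semi-supervised adaptation of this kind is assigning pseudo-labels to the unlabeled target-domain data using the pretrained embedding network. The sketch below is only illustrative, assuming a simple nearest-class-centroid scheme with a cosine-similarity confidence threshold; the specific algorithm developed in the dissertation is not reproduced here, and the function name and threshold are hypothetical.

```python
import numpy as np

def pseudo_label(labeled_emb, labels, unlabeled_emb, threshold=0.5):
    """Assign each unlabeled embedding the label of its nearest labeled
    class centroid (by cosine similarity); reject low-confidence matches
    with the sentinel label -1."""
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    classes = np.unique(labels)
    # Class centroids in embedding space, L2-normalized so that dot
    # products with normalized queries are cosine similarities.
    centroids = l2_normalize(
        np.stack([labeled_emb[labels == c].mean(axis=0) for c in classes]))
    sims = l2_normalize(unlabeled_emb) @ centroids.T  # (n_unlabeled, n_classes)
    best = sims.argmax(axis=1)
    conf = sims.max(axis=1)
    pseudo = np.where(conf >= threshold, classes[best], -1)
    return pseudo, conf
```

In a full pipeline, the confidently pseudo-labeled samples would be merged with the small labeled target set for fine-tuning, with the threshold trading label noise against coverage.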
Next, we tackle several more challenging domains, including visible surveillance imagery, body-worn camera imagery, remote videos (captured at 300 m, 400 m, and 500 m), and short-wave infrared videos (captured at 15 m and 30 m). While significant research has been done in the fields of domain adaptation and domain generalization, in this dissertation we tackle scenarios in which these methods have limited applicability owing to the lack of training data from the target domains. We focus on the single-source (visible), multi-target face recognition task. We demonstrate that the template generation algorithm plays a crucial role, especially as the complexity of the target domain increases. We propose a template generation algorithm called Norm Pooling (and a variant known as Sparse Pooling) and show that it outperforms traditional average pooling across different domains and network architectures on the IJB-MDF dataset.
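The contrast with average pooling can be illustrated as follows. This is a plausible sketch only, assuming "Norm Pooling" weights each per-image embedding in proportion to its L2 norm (embedding norm often correlates with image quality), so low-norm, low-quality frames are down-weighted; the exact formulation is given in the dissertation, not here.

```python
import numpy as np

def average_pooling(embeddings):
    """Baseline template: uniform mean over a template's per-image embeddings."""
    return embeddings.mean(axis=0)

def norm_pooling(embeddings):
    """Illustrative norm-weighted template: each embedding contributes in
    proportion to its L2 norm, down-weighting low-norm (often low-quality)
    images. An interpretation for illustration, not the exact algorithm."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    weights = norms / norms.sum()
    return (weights * embeddings).sum(axis=0)
```

For a template of one strong (high-norm) and one weak embedding, the norm-weighted template lies closer to the strong embedding than the uniform average does.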
Biometric recognition becomes increasingly challenging as we move away from the visible spectrum to infrared imagery, where domain discrepancies significantly impact identification performance. We show that body embeddings outperform face embeddings for cross-spectral person identification in the medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. Due to the lack of multi-domain datasets, previous research on cross-spectral body identification, also known as Visible-Infrared Person Re-Identification (VI-ReID), has primarily focused on individual infrared bands, such as near-infrared (NIR) or LWIR, separately. We address the multi-domain body recognition problem using the IJB-MDF dataset, which enables matching of SWIR, MWIR, and LWIR images against RGB (VIS) images. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset and, through extensive experiments, provide valuable insights into the interrelation of infrared domains, the adaptability of VIS-pretrained models, the role of local semantic features in body embeddings, and effective training strategies for small datasets. Additionally, we show that fine-tuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art results on the LLCM dataset.
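The combined objective mentioned above can be sketched with the standard formulations of the two losses. This is a minimal single-sample illustration, assuming softmax cross-entropy and a margin-based triplet loss; the weighting, margin, and batch construction used in the dissertation are not specified here.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample."""
    shifted = logits - logits.max()  # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Margin-based triplet loss: pull the anchor toward the positive and
    push it at least `margin` farther from the negative."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def combined_loss(logits, label, anchor, positive, negative,
                  ce_weight=1.0, tri_weight=1.0):
    """Weighted sum of the two objectives (the weights here are placeholders)."""
    return (ce_weight * cross_entropy(logits, label)
            + tri_weight * triplet_loss(anchor, positive, negative))
```

The cross-entropy term supervises identity classification, while the triplet term directly shapes the embedding metric used at matching time, which is why the two are commonly combined in re-identification training.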