
Publications

Most of my publications explore human-centered, trustworthy video surveillance systems. The complete list is available on my Google Scholar profile.

2025
DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification

Yujie Yang*, Shuang Li*, Jun Ye, Neng Dong, Fan Li, Huafeng Li

ACM International Conference on Multimedia (ACM MM), 2025


Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval.

Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features.
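
To make the bidirectional multi-granularity idea concrete, here is a minimal PyTorch sketch of one gait-appearance interaction step using cross-attention; the module name, token shapes, and single-granularity design are illustrative assumptions, not the exact PBMGE architecture.

```python
# Minimal sketch: one bidirectional gait<->appearance interaction step.
# Names and shapes are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class BidirectionalEnhancement(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        # Gait tokens query appearance tokens, and vice versa.
        self.gait_from_app = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.app_from_gait = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_g = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, gait_tokens, app_tokens):
        # gait_tokens, app_tokens: (B, N, dim)
        g_enh, _ = self.gait_from_app(gait_tokens, app_tokens, app_tokens)
        a_enh, _ = self.app_from_gait(app_tokens, gait_tokens, gait_tokens)
        # Residual fusion keeps each stream's own evidence.
        return self.norm_g(gait_tokens + g_enh), self.norm_a(app_tokens + a_enh)

if __name__ == "__main__":
    gait = torch.randn(2, 49, 768)   # e.g. 7x7 spatial tokens per stream
    app = torch.randn(2, 49, 768)
    g, a = BidirectionalEnhancement()(gait, app)
    print(g.shape, a.shape)          # torch.Size([2, 49, 768]) twice
```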

Deepfake Detection Leveraging Self-Blended Artifacts Guided by Facial Embedding Discrepancy

Shanshan Han, Shuang Li, Shuodi Wang, Lin Yuan, Yan Zhang, Xinbo Gao

IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025


Current deepfake detection methods commonly use data augmentation and authenticity-content disentanglement to extract more generalized features for detection tasks. However, these methods rely exclusively on low-level spatial artifacts to distinguish real from fake images, which presents significant challenges in accurately capturing the rich forgery cues. Deepfakes create discrepancies between forged and original facial features within the face-recognition (FR) embedding space, which can serve as an additional cue for detection.

To better exploit the artifacts in deepfake images, we propose a novel detection method that enhances the detector's perception capability by incorporating not only the real and fake samples during training, but also the visual residual between real and fake images. Meanwhile, we integrate the discrepancy in facial embedding between the real and fake samples into the training procedure of artifact extraction, serving as a guidance signal backed by the strong prior knowledge of the pretrained face recognition model. A specialized distillation loss, along with additional cross-entropy losses, is designed to enhance detection capability. Experiments on multiple benchmarks demonstrate the superiority of the proposed approach in deepfake detection over existing methods.
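
As a rough illustration of how the facial embedding discrepancy can guide artifact learning, the sketch below combines a standard classification loss with a cosine-based distillation term; `detector.classify`, `detector.artifact_feature`, and `fr_encoder` are hypothetical interfaces, not the paper's actual code.

```python
# Minimal loss sketch, assuming a frozen face-recognition (FR) encoder and a
# detector that exposes an artifact feature; all names are hypothetical.
import torch
import torch.nn.functional as F

def detection_losses(detector, fr_encoder, real, fake, labels_real, labels_fake):
    # Classification on both real and fake samples.
    logits = detector.classify(torch.cat([real, fake]))
    ce = F.cross_entropy(logits, torch.cat([labels_real, labels_fake]))

    # FR embedding discrepancy between real and fake acts as the guidance signal.
    with torch.no_grad():
        guide = fr_encoder(fake) - fr_encoder(real)        # (B, D)
    artifact = detector.artifact_feature(fake - real)       # visual residual input

    # Distillation: align artifact features with the FR discrepancy direction.
    distill = 1.0 - F.cosine_similarity(artifact, guide, dim=-1).mean()
    return ce + distill
```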

Transferring and Refining Visual-Semantic Priors via Graph-Enhanced CLIP for 3D Hand Pose Estimation

Tingting Liu, Ji Gan, Jiaxu Leng, Shuang Li, Lei Chen, Xinbo Gao

IEEE Transactions on Multimedia (TMM), 2025


3D hand pose estimation is crucial for many human-computer interaction applications. However, existing deep neural networks (DNNs) for 3D hand pose estimation suffer from poor generalization capability due to data scarcity and a lack of domain-specific knowledge. In contrast, humans remain far better learners than DNNs, requiring fewer exemplars to grasp new concepts under the guidance of prior knowledge. Inspired by this, we propose a Graph-Enhanced CLIP to deliver visual-semantic priors to DNNs and refine the domain-specific knowledge for better 3D hand pose estimation.

Specifically, we first introduce a pre-trained CLIP to guide the hand estimation model in learning semantic-aware visual features, where text-free contrastive learning is proposed to effectively transfer high-level visual-semantic priors from pre-trained large multi-modal models. It is worth noting that our strategy is data-agnostic and avoids designing hand-crafted text prompts for various visual inputs. Secondly, we further introduce novel graph Transformers to refine the domain-specific knowledge by fully exploiting local adjacent relations of hand joints and capturing the global structure representation of hand poses. The introduced graph Transformers further refine the generalized CLIP features for the downstream task (i.e., hand pose estimation) with better performance. Experiments show that our proposed graph-enhanced CLIP achieves state-of-the-art performance on benchmark datasets, demonstrating its effectiveness for 3D hand pose estimation.
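
A minimal sketch of what text-free contrastive transfer could look like, assuming the pose model and a frozen CLIP image encoder each produce one feature vector per sample; the symmetric InfoNCE form is a generic choice, not necessarily the paper's exact objective.

```python
# Text-free contrastive transfer sketch: align the pose model's features with
# frozen CLIP image features of the same inputs (no text prompts). Illustrative only.
import torch
import torch.nn.functional as F

def text_free_contrastive(pose_feats, clip_feats, temperature=0.07):
    # pose_feats, clip_feats: (B, D), same batch order => diagonal is the positive pair.
    p = F.normalize(pose_feats, dim=-1)
    c = F.normalize(clip_feats, dim=-1)
    logits = p @ c.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # Symmetric InfoNCE, as in CLIP-style training.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```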

Video-Level Language-Driven Video-Based Visible-Infrared Person Re-Identification

Shuang Li, Jiaxu Leng, Changjian Kuang, Mingpi Tan, Xinbo Gao

IEEE Transactions on Information Forensics and Security (TIFS), 2025


Video-based Visible-Infrared Person Re-Identification (VVI-ReID) aims to match pedestrian sequences across modalities by extracting modality-invariant sequence-level features. As a high-level semantic representation, language provides a consistent description of pedestrian characteristics in both infrared and visible modalities. Leveraging the Contrastive Language-Image Pre-training (CLIP) model to generate video-level language prompts and guide the learning of modality-invariant sequence-level features is theoretically feasible. However, the challenge of generating and utilizing modality-shared video-level language prompts to address modality gaps remains a critical problem.

To address this problem, we propose a simple yet powerful framework, video-level language-driven VVI-ReID (VLD), which consists of two core modules: invariant-modality language prompting (IMLP) and spatial-temporal prompting (STP). IMLP employs a joint fine-tuning strategy for the visual encoder and the prompt learner to effectively generate modality-shared text prompts and align them with visual features from different modalities in CLIP's multimodal space, thereby mitigating modality differences. Additionally, STP models spatiotemporal information through two submodules, the spatial-temporal hub (STH) and spatial-temporal aggregation (STA), which further enhance IMLP by incorporating spatiotemporal information into text prompts. The STH aggregates and diffuses spatiotemporal information into the [CLS] token of each frame across the vision transformer (ViT) layers, whereas STA introduces a dedicated identity-level loss and specialized multi-head attention to ensure that the STH focuses on identity-relevant spatiotemporal feature aggregation. The VLD framework achieves state-of-the-art results on two VVI-ReID benchmarks. On the HITSZ-VCM dataset, it improves Rank-1 accuracy by 7.3% and mAP by 7.6% (infrared-to-visible), and Rank-1 accuracy by 10.4% and mAP by 9.3% (visible-to-infrared), while requiring only 2 hours of training, 2.39M additional parameters, and 0.12G FLOPs.
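
The sketch below illustrates the hub-style aggregation idea behind the STH: a learnable hub token gathers information from per-frame [CLS] tokens and diffuses it back. The single-layer design and shapes are assumptions for illustration only.

```python
# Rough sketch of a spatial-temporal hub: a learnable hub token aggregates
# information from per-frame [CLS] tokens and diffuses it back to each frame.
import torch
import torch.nn as nn

class SpatialTemporalHub(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.hub = nn.Parameter(torch.zeros(1, 1, dim))
        self.aggregate = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.diffuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cls_tokens):
        # cls_tokens: (B, T, dim) -- one [CLS] token per frame.
        hub = self.hub.expand(cls_tokens.size(0), -1, -1)
        hub, _ = self.aggregate(hub, cls_tokens, cls_tokens)   # gather across frames
        out, _ = self.diffuse(cls_tokens, hub, hub)            # spread back to frames
        return cls_tokens + out
```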

Shape-centered Representation Learning for Visible-Infrared Person Re-identification

Shuang Li, Jiaxu Leng, Ji Gan, Mengjingcheng Mo, Xinbo Gao

Pattern Recognition (PR), 2025


Visible-Infrared Person Re-Identification (VI-ReID) plays a critical role in all-day surveillance systems. However, existing methods primarily focus on learning appearance features while overlooking body shape features, which not only complement appearance features but also exhibit inherent robustness to modality variations. Despite their potential, effectively integrating shape and appearance features remains challenging. Appearance features are highly susceptible to modality variations and background noise, while shape features often suffer from inaccurate infrared shape estimation due to the limitations of auxiliary models.

To address these challenges, we propose the Shape-centered Representation Learning (ScRL) framework, which enhances VI-ReID performance by innovatively integrating shape and appearance features. Specifically, we introduce Infrared Shape Restoration (ISR) to restore inaccuracies in infrared body shape representations at the feature level by leveraging infrared appearance features. In addition, we propose Shape Feature Propagation (SFP), which enables the direct extraction of shape features from original images during inference with minimal computational complexity. Furthermore, we design Appearance Feature Enhancement (AFE), which utilizes shape features to emphasize shape-related appearance features while effectively suppressing identity-unrelated noise. Benefiting from the effective integration of shape and appearance features, ScRL demonstrates superior performance through extensive experiments.
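
As a rough sketch of shape-guided appearance enhancement, the module below lets a global shape descriptor attend over appearance tokens so shape-related appearance cues are emphasized; the fusion design is an assumption, not AFE's exact formulation.

```python
# Illustrative sketch of shape-guided appearance enhancement: the shape feature
# acts as a query over appearance tokens to highlight shape-related cues.
import torch
import torch.nn as nn

class AppearanceFeatureEnhancement(nn.Module):
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, shape_feat, app_tokens):
        # shape_feat: (B, dim) global shape descriptor; app_tokens: (B, N, dim).
        q = shape_feat.unsqueeze(1)                       # (B, 1, dim)
        enhanced, _ = self.attn(q, app_tokens, app_tokens)
        return self.norm(enhanced.squeeze(1) + shape_feat)
```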

Mutual Information-guided Domain Shared Feature Learning for Bearing Fault Diagnosis under Unknown Conditions

Kaixiong Xu, Shuang Li, Shuiqing Xu, Youqiang Hu, Yongfang Mao, Yi Chai

IEEE Transactions on Instrumentation and Measurement (IEEE TIM), 2025


With the support of unsupervised domain adaptation techniques, variable condition bearing fault diagnosis has achieved considerable progress. Nevertheless, the prerequisite of obtaining target data in advance limits the practical application of these diagnostic models in real-world scenarios. For bearing fault diagnosis under unknown conditions, domain generalization-based methods show great promise, and acquiring domain-invariant knowledge is crucial to enhance generalization capability.

To address this issue, this paper proposes a Mutual Information-guided Domain Shared Feature Learning algorithm (MI-DSFL). MI-DSFL designs a health-state (HS) diagnosis branch and a working-condition (WC) identification branch to directly extract HS-related and WC-related features, respectively. Through the interaction between the two branches and the mutual influence of the corresponding classifiers, domain-invariant HS-related features and WC-related features are ultimately decoupled.
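
A minimal sketch of the two-branch idea, with a simple cross-covariance penalty standing in for the mutual-information guidance (the paper's actual MI estimation is not reproduced here); branch widths and class counts are placeholders.

```python
# Two-branch sketch: one branch for health state (HS), one for working condition (WC),
# plus a decorrelation penalty as a stand-in for mutual-information minimization.
import torch
import torch.nn as nn

class TwoBranchDiagnosis(nn.Module):
    def __init__(self, in_dim=1024, feat_dim=128, n_faults=4, n_conditions=3):
        super().__init__()
        self.hs_branch = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.wc_branch = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.hs_head = nn.Linear(feat_dim, n_faults)
        self.wc_head = nn.Linear(feat_dim, n_conditions)

    def forward(self, x):
        hs, wc = self.hs_branch(x), self.wc_branch(x)
        return self.hs_head(hs), self.wc_head(wc), hs, wc

def decorrelation_penalty(hs, wc):
    # Penalize cross-covariance between the two feature sets (an MI surrogate).
    hs = hs - hs.mean(0, keepdim=True)
    wc = wc - wc.mean(0, keepdim=True)
    cov = hs.t() @ wc / (hs.size(0) - 1)
    return (cov ** 2).mean()
```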

Dual-Space Video Person Re-identification

Jiaxu Leng, Changjiang Kuang, Shuang Li, Ji Gan, Haosheng Chen, Xinbo Gao

International Journal of Computer Vision (IJCV), 2025


Video person re-identification (VReID) aims to recognize individuals across video sequences. Existing methods primarily use Euclidean space for representation learning but struggle to capture complex hierarchical structures, especially in scenarios with occlusions and background clutter. In contrast, hyperbolic space, with its negatively curved geometry, excels at preserving hierarchical relationships and enhancing discrimination between similar appearances. Inspired by these, we propose Dual-Space Video Person Re-Identification (DS-VReID) to utilize the strength of both Euclidean and hyperbolic geometries, capturing the visual features while also exploring the intrinsic hierarchical relations, thereby enhancing the discriminative capacity of the features. Specifically, we design the Dynamic Prompt Graph Construction (DPGC) module, which uses a pre-trained CLIP model with learnable dynamic prompts to construct 3D graphs that capture subtle changes and dynamic information in video sequences. Building upon this, we introduce the Hyperbolic Disentangled Aggregation (HDA) module, which addresses long-range dependency modeling by decoupling node distances and integrating adjacency matrices, capturing detailed spatial-temporal hierarchical relationships.
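
For intuition on the hyperbolic side of a dual-space design, the sketch below projects Euclidean features onto the Poincaré ball and measures geodesic distance there; this is textbook hyperbolic geometry, not DS-VReID's exact formulation.

```python
# Mapping Euclidean features onto the Poincare ball and computing geodesic distance.
import torch

def expmap0(x, c=1.0, eps=1e-5):
    # Exponential map at the origin of the Poincare ball with curvature -c.
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * x / (sqrt_c * norm)

def poincare_distance(u, v, eps=1e-5):
    # Closed-form geodesic distance on the unit Poincare ball (c = 1).
    sq = ((u - v) ** 2).sum(-1)
    den = (1 - (u ** 2).sum(-1)).clamp_min(eps) * (1 - (v ** 2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / den)

if __name__ == "__main__":
    feats = torch.randn(4, 256) * 0.1
    ball = expmap0(feats)
    print(poincare_distance(ball[:2], ball[2:]).shape)   # torch.Size([2])
```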

TriMatch: Triple Matching for Text-to-Image Person Re-Identification

Shuanglin Yan, Neng Dong, Shuang Li, Huafeng Li

IEEE Signal Processing Letters (SPL), 2025


Text-to-image person re-identification (TIReID) is a cross-modal retrieval task that aims to retrieve target person images based on a given text description. Existing methods primarily focus on mining the semantic associations across modalities, relying on the matching between heterogeneous features for retrieval. However, due to the inherent heterogeneous gaps between modalities, it is challenging to establish precise semantic associations, particularly in fine-grained correspondences, often leading to incorrect retrieval results. To address this issue, this letter proposes an innovative Triple Matching (TriMatch) framework that integrates cross-modal (image-text) matching and unimodal (image-image, text-text) matching for high-precision person retrieval. The framework introduces a generation task that performs cross-modal (image-to-text and text-to-image) feature generation and intra-modal feature alignment to achieve unimodal matching. By incorporating the generation task, TriMatch considers not only the semantic correlations between modalities but also the semantic consistency within single modalities, thereby effectively enhancing the accuracy of target person retrieval.
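
A loss-level sketch of the triple-matching idea: one cross-modal term plus two unimodal terms computed against generated features; `img2txt_gen` and `txt2img_gen` are assumed generator modules, and the InfoNCE form is an illustrative choice.

```python
# Loss-level sketch of triple matching: cross-modal (image-text) matching plus
# unimodal matching against generated features. Generators are assumed modules.
import torch
import torch.nn.functional as F

def info_nce(a, b, t=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / t
    tgt = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, tgt)

def triple_matching_loss(img_f, txt_f, img2txt_gen, txt2img_gen):
    cross = info_nce(img_f, txt_f)                    # image-text matching
    uni_txt = info_nce(img2txt_gen(img_f), txt_f)     # text-text matching
    uni_img = info_nce(txt2img_gen(txt_f), img_f)     # image-image matching
    return cross + uni_txt + uni_img
```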

See as You Desire: Scale-Adaptive Face Super-Resolution for Varying Low Resolutions

Ling Li, Yan Zhang, Lin Yuan, Shuang Li, Huafeng Li, Xinbo Gao

IEEE Internet of Things Journal (IEEE IOT), 2025


Face super-resolution (FSR) is critical for bolstering intelligent security in Internet of Things (IoT) systems. Recent deep learning-driven FSR algorithms have attained remarkable progress. However, they always require separate model training and optimization for each scaling factor or input resolution, leading to inefficiency and impracticality. To overcome these limitations, we propose SAFNet, an innovative framework tailored for scale-adaptive FSR with arbitrary input resolution. SAFNet integrates scale information into representation learning to enable adaptive feature extraction and introduces dual-embedding attention to boost adaptive feature reconstruction. It leverages facial self-similarity and spatial-frequency collaboration to achieve precise scale-aware SR representations. This is attained through three key modules: 1) the scale adaption guidance unit (SAGU); 2) the scale-aware nonlocal self-similarity (SNLS) module; and 3) the spatial-frequency interactive modulation (SFIM) module. SAGU imports scaling factors using frequency encoding, SNLS exploits self-similarity to enrich feature representations, and SFIM incorporates spatial and frequency information to predict target pixel values adaptively. Comprehensive evaluations across four benchmark datasets reveal that SAFNet outperforms the second-best state-of-the-art (SOTA) method by about 0.2 dB/0.007 in PSNR/SSIM (×4 on CelebA) while reducing computational complexity/time cost by 18.68%/42.64%.
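
As a small illustration of scale-aware conditioning, the function below encodes a continuous scaling factor with sinusoidal frequencies so it can be injected into feature extraction; the encoding width and base frequency are arbitrary choices, not SAGU's exact design.

```python
# Sketch of encoding a continuous scaling factor with sinusoidal frequencies so it
# can condition feature extraction for arbitrary upscaling ratios.
import math
import torch

def scale_frequency_encoding(scale, dim=64):
    # scale: (B,) tensor of upscaling factors, e.g. 2.0, 3.5, 4.0
    half = dim // 2
    freqs = torch.exp(torch.arange(half, dtype=torch.float32) * (-math.log(1e4) / half))
    angles = scale.unsqueeze(-1) * freqs                             # (B, dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1) # (B, dim)
```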

2024
Small Object Detection Method Based on Global Multi-Level Perception and Dynamic Region Aggregation

Zhiqin Zhu, Renzhong Zheng, Guanqiu Qi, Shuang Li, Yuanyuan Li, Xinbo Gao

IEEE Transactions on Circuits and Systems for Video Technology (IEEE TCSVT), 2024


In the field of object detection, detecting small objects is an important and challenging task. However, most existing methods tend to focus on designing complex network structures, lack attention to global representation, and ignore redundant noise and dense distribution of small objects in complex networks. To address the above problems, this paper proposes a small object detection method based on global multi-level perception and dynamic region aggregation. The method achieves accurate detection by dynamically aggregating effective features within a region while fully perceiving the features. This method mainly consists of two modules: global multi-level perception module and dynamic region aggregation module. In the global multi-level perception module, self-attention is used to perceive the global region, and its linear transformation is mapped through a convolutional network to increase the local details of global perception, thereby obtaining more refined global information. The dynamic region aggregation module, devised with a sparse strategy in mind, selectively interacts with relevant features. This design allows aggregation of key features of individual instances, effectively mitigating noise interference. Consequently, this approach addresses the challenges associated with densely distributed targets and enhances the model’s ability to discriminate on a fine-grained level.
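
A rough sketch of combining global self-attention with a convolutional branch that restores local detail, in the spirit of the global multi-level perception module; the exact placement of the convolution and the fusion by addition are assumptions.

```python
# Sketch: global self-attention over a feature map plus a depthwise conv branch
# that reintroduces local detail, fused by simple addition.
import torch
import torch.nn as nn

class GlobalLocalPerception(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, feat):
        # feat: (B, C, H, W) feature map from the detector backbone.
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)              # (B, HW, C)
        global_ctx, _ = self.attn(tokens, tokens, tokens)
        global_ctx = global_ctx.transpose(1, 2).reshape(b, c, h, w)
        return feat + global_ctx + self.local(feat)           # fuse global and local cues
```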

MGRL: Mutual-Guidance Representation Learning for Text-to-Image Person Retrieval

Tianle Lv, Shuang Li, Jiaxu Leng, Xinbo Gao

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024


Text-to-image person retrieval aims to recognize target pedestrians based on specified text. Existing methods mainly obtain image and text features separately through distinct feature extractors, subsequently embedding them into a unified feature space and calculating their similarity. Despite great success, current methods still suffer from the lack of information interaction between images and text. To address this issue, we propose Mutual-guidance Representation Learning (MGRL) for text-to-image person retrieval, which captures the key features for matching via text-image information interaction. Accordingly, our MGRL consists of two customized modules: iterative text-guided feature extraction (ITFE) and vision-assisted specific mask complement (VSMC). Specifically, ITFE is first designed to extract the matching information between the text and the image concerning the local feature attention of the target pedestrians by iterative text guidance. Then, to further ensure the image features extracted by ITFE contain the text description, VSMC is designed to utilize the extracted image features to help complete masked text where the mask is difficult to complete with only unmasked text information.
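
The sketch below illustrates the iterative text-guided extraction idea: text tokens repeatedly query image tokens and refine the guidance each round; the iteration count, pooling, and single attention layer are illustrative choices, not ITFE's exact design.

```python
# Sketch of iterative text-guided extraction: text features repeatedly query image
# tokens so matching-relevant regions are emphasized.
import torch
import torch.nn as nn

class IterativeTextGuidedExtraction(nn.Module):
    def __init__(self, dim=512, heads=8, iterations=3):
        super().__init__()
        self.iterations = iterations
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feat, image_tokens):
        # text_feat: (B, L, dim) token-level text features; image_tokens: (B, N, dim).
        query = text_feat
        for _ in range(self.iterations):
            attended, _ = self.cross_attn(query, image_tokens, image_tokens)
            query = self.norm(query + attended)               # refine the guidance each round
        return query.mean(dim=1)                              # pooled matching feature
```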

2023
Logical Relation Inference and Multiview Information Interaction for Domain Adaptation Person Re-Identification

Shuang Li, Fan Li, Jinxing Li, Huafeng Li, Bob Zhang, Dapeng Tao, Xinbo Gao

IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS), 2023


Domain adaptation person re-identification (Re-ID) is a challenging task, which aims to transfer the knowledge learned from the labeled source domain to the unlabeled target domain. Recently, some clustering-based domain adaptation Re-ID methods have achieved great success. However, these methods ignore the adverse influence of different camera styles on pseudo-label prediction. The reliability of the pseudo-label plays a key role in domain adaptation Re-ID, while the different camera styles bring great challenges for pseudo-label prediction. To this end, a novel method is proposed, which bridges the gap between different cameras and extracts more discriminative features from an image. Specifically, an intra-to-inter mechanism is introduced, in which samples from their own cameras are first grouped and then aligned at the class level across different cameras, followed by our logical relation inference (LRI). Thanks to these strategies, the logical relationship between simple classes and hard classes is justified, preventing sample loss caused by discarding hard samples. Furthermore, we also present a multiview information interaction (MvII) module that takes features of different images from the same pedestrian as patch tokens, obtaining the global consistency of a pedestrian that contributes to discriminative feature extraction. Unlike the existing clustering-based methods, our method employs a two-stage framework that generates reliable pseudo-labels from the views of the intra-camera and inter-camera, respectively, to differentiate the camera styles, subsequently increasing its robustness.
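
As a sketch of the multiview information interaction idea, the module below treats features of several images of the same pedestrian as tokens of one Transformer encoder and pools them into an identity-level representation; depth and width are illustrative.

```python
# Sketch of multiview information interaction: per-image features of one identity
# are treated as "patch tokens" of a small Transformer encoder.
import torch
import torch.nn as nn

class MultiviewInteraction(nn.Module):
    def __init__(self, dim=768, heads=8, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, view_feats):
        # view_feats: (B, V, dim) -- V images of the same identity as tokens.
        fused = self.encoder(view_feats)
        return fused.mean(dim=1)                          # identity-level representation
```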

2022
Cross-compatible embedding and semantic consistent feature construction for sketch re-identification

Yafei Zhang, Yongzeng Wang, Huafeng Li, Shuang Li

ACM International Conference on Multimedia (ACM MM), 2022


Sketch re-identification (Re-ID) refers to using sketches of pedestrians to retrieve their corresponding photos from surveillance videos. It can track pedestrians according to sketches drawn from eyewitness descriptions without querying pedestrian photos. Although the Sketch Re-ID concept has been proposed, the gap between the sketch and the photo still greatly hinders pedestrian identity matching. Based on the idea of transplantation without rejection, we propose a Cross-Compatible Embedding (CCE) approach to narrow this gap. A Semantic Consistent Feature Construction (SCFC) scheme is simultaneously presented to enhance feature discrimination. Under the guidance of identity consistency, the CCE performs cross-modal interchange at the local token level in the Transformer framework, enabling the model to extract modal-compatible features. The SCFC improves the representation ability of features by handling the inconsistency of information at the same location of the sketch and the corresponding pedestrian photo. The SCFC scheme divides the local tokens of pedestrian images from different modalities into different groups and assigns specific semantic information to each group to construct a semantically consistent global feature representation. Experiments on the public Sketch Re-ID dataset confirm the effectiveness of the proposed method and its superiority over existing methods.
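
A minimal sketch of cross-compatible token interchange: a fraction of local tokens is swapped between a paired sketch sequence and photo sequence of the same identity before encoding; the swap ratio and random token selection are assumptions, not CCE's exact rule.

```python
# Sketch of cross-modal token interchange between paired sketch/photo token sequences.
import torch

def interchange_tokens(sketch_tokens, photo_tokens, ratio=0.3):
    # sketch_tokens, photo_tokens: (B, N, D) local tokens of identity-paired images.
    b, n, _ = sketch_tokens.shape
    k = max(1, int(n * ratio))
    idx = torch.randperm(n, device=sketch_tokens.device)[:k]
    mixed_sketch, mixed_photo = sketch_tokens.clone(), photo_tokens.clone()
    # Swap the selected token positions across modalities.
    mixed_sketch[:, idx], mixed_photo[:, idx] = photo_tokens[:, idx], sketch_tokens[:, idx]
    return mixed_sketch, mixed_photo
```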

Body Part-Level Domain Alignment for Domain-Adaptive Person Re-Identification With Transformer Framework

Yiming Wang, Guanqiu Qi, Shuang Li, Yi Chai, Huafeng Li

IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2022


Although existing domain-adaptive person re-identification (re-ID) methods have achieved competitive performance, most of them rely heavily on the reliability of pseudo-label prediction, which seriously limits their applicability since noisy labels cannot be avoided. This paper designs a Transformer framework based on body part-level domain alignment to address these issues in domain-adaptive person re-ID. Different parts of the human body (such as the head, torso, and legs) have different structures and shapes, and therefore usually exhibit different characteristics. The proposed method makes full use of the dissimilarity between different human body parts. Specifically, the local features from the same body part are aggregated by the Transformer to obtain the corresponding class token, which is used as the global representation of this body part. Additionally, a Transformer layer-embedded adversarial learning strategy is designed. This strategy simultaneously achieves domain alignment and classification of the class token for each human body part in both target and source domains via an integrated discriminator, thereby realizing domain alignment at the human body part level. Compared with existing domain-level and identity-level alignment methods, the proposed method has a stronger fine-grained domain alignment capability. Therefore, the information loss or distortion that may occur in the feature alignment process can be effectively alleviated. The proposed method does not need to predict pseudo-labels of any target sample, so the negative impact of unreliable pseudo-labels on re-ID performance can be effectively avoided.
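
The sketch below illustrates part-level adversarial alignment: each body part's class token passes through a gradient-reversal layer into one integrated discriminator that classifies domain and part jointly; the discriminator head and its label space are simplified assumptions.

```python
# Sketch of part-level adversarial domain alignment with gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class PartDomainDiscriminator(nn.Module):
    def __init__(self, dim=768, n_parts=3):
        super().__init__()
        # One shared head classifies (source/target) x (part id) jointly.
        self.classifier = nn.Linear(dim, 2 * n_parts)

    def forward(self, part_tokens, lam=1.0):
        # part_tokens: (B, n_parts, dim) class tokens for head / torso / legs.
        x = GradReverse.apply(part_tokens, lam)
        return self.classifier(x)                         # (B, n_parts, 2*n_parts)
```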

Mutual prediction learning and mixed viewpoints for unsupervised-domain adaptation person re-identification on blockchain

Shuang Li, Fan Li, Kunpeng Wang, Guanqiu Qi, Huafeng Li

Simulation Modelling Practice and Theory, 2022


In addition to the domain shift between different datasets, the diversity of pedestrian appearance (physical appearance and postures) caused by different camera views also affects the performance of person re-identification (re-ID). Since existing methods tend to extract the shared information of the same pedestrian across multiple images, the above diversity issue has not been effectively alleviated. In addition, while making full use of pedestrian image data and realizing its value, there are also risks of privacy leakage and data loss. Therefore, this paper proposes mutual prediction learning (MPL) and mixed viewpoints for unsupervised domain adaptation (UDA) person re-ID on blockchain. This method enables the network to first perform MPL on multi-view polymorphic features and then acquire the reasoning ability to alleviate the ambiguity caused by morphological differences. In the process of MPL, the training samples are first divided into different groups, each containing two sets. Then the corresponding identity classifiers of every two sets are integrated and applied to the cross-prediction of polymorphic features. Finally, the joint distribution alignment of domain- and identity-level features is achieved. Furthermore, an adversarial mechanism of mixed viewpoints is proposed to improve the accuracy of identity matching. The domain-invariant salient features are extracted and fused with the polymorphic features obtained by global average pooling (GAP) after domain alignment.

Learning modal-invariant and temporal-memory for video-based visible-infrared person re-identification

Xinyu Lin, Jinxing Li, Zeyu Ma, Huafeng Li, Shuang Li, Kaixiong Xu, Guangming Lu, David Zhang

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022


Thanks to cross-modal retrieval techniques, visible-infrared (RGB-IR) person re-identification (Re-ID) is achieved by projecting the two modalities into a common space, enabling person Re-ID in 24-hour surveillance systems. However, with respect to the "probe-to-gallery" setting, almost all existing RGB-IR cross-modal person Re-ID methods focus on image-to-image matching, while video-to-video matching, which contains much richer spatial and temporal information, remains under-explored. In this paper, we primarily study video-based cross-modal person Re-ID. To achieve this task, a video-based RGB-IR dataset is constructed, in which 927 valid identities with 463,259 frames and 21,863 tracklets captured by 12 RGB/IR cameras are collected. Based on our constructed dataset, we show that as the number of frames in a tracklet increases, the performance improves accordingly, demonstrating the significance of video-to-video matching in RGB-IR person Re-ID. Additionally, a novel method is further proposed, which not only projects the two modalities into a modal-invariant subspace, but also extracts temporal memory for motion-invariant representations.
