Visible and Infrared Object Tracking Based on Multimodal Hierarchical Relationship Modeling
DOI:
https://doi.org/10.5566/ias.3124
Keywords:
RGBT tracking, Transformer, Multimodal, Feature fusion
Abstract
Visible (RGB) and thermal infrared (RGBT) object tracking has become a prominent research topic in computer vision. Nevertheless, most existing Transformer-based RGBT tracking methods mainly use the Transformer to enhance features extracted by convolutional neural networks, leaving its potential for representation learning insufficiently explored. Furthermore, most studies overlook the differing importance of each modality in multimodal tasks. In this paper, we address these two issues with a novel RGBT tracking framework based on multimodal hierarchical relationship modeling. By incorporating multiple Transformer encoders and applying self-attention, we progressively aggregate and fuse multimodal image features at successive stages of feature learning. During the multimodal interaction within the network, a dynamic component feature fusion module operating at the patch level assesses the relevance of visible information in each region of the tracking scene. Extensive experiments on the RGBT234, GTOT, and LasHeR benchmark datasets demonstrate the competitive accuracy, success rate, and tracking speed of the proposed approach.
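The patch-level dynamic weighting of the visible modality described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module name DynamicPatchFusion, the gating MLP, and the single encoder layer standing in for one fusion stage are all assumptions chosen only to show per-patch modality weighting before a Transformer encoder.

```python
# Hypothetical sketch: per-patch gating weights decide how much visible (RGB)
# information each region contributes before the fused tokens enter a
# Transformer encoder stage. Names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class DynamicPatchFusion(nn.Module):
    """Fuse RGB and thermal patch tokens with per-patch gating weights."""

    def __init__(self, dim: int):
        super().__init__()
        # Small MLP that scores each patch from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )
        # One encoder layer standing in for a single stage of the hierarchy.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )

    def forward(self, rgb_tokens: torch.Tensor, tir_tokens: torch.Tensor):
        # rgb_tokens, tir_tokens: (batch, num_patches, dim)
        w = self.gate(torch.cat([rgb_tokens, tir_tokens], dim=-1))  # (B, N, 1)
        fused = w * rgb_tokens + (1.0 - w) * tir_tokens
        return self.encoder(fused)


if __name__ == "__main__":
    rgb = torch.randn(2, 196, 256)   # e.g. 14x14 patches, 256-dim tokens
    tir = torch.randn(2, 196, 256)
    out = DynamicPatchFusion(256)(rgb, tir)
    print(out.shape)  # torch.Size([2, 196, 256])
```

In the hierarchical design sketched above, several such fusion stages could be stacked so that multimodal features are aggregated progressively, as the abstract describes.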
License
Copyright (c) 2024 Rui Yao, Jiazhu Qiu, Yong Zhou, Zhiwen Shao, Bing Liu, Jiaqi Zhao, Hancheng Zhu
This work is licensed under a Creative Commons Attribution 4.0 International License.