TFDepth: Self-Supervised Monocular Depth Estimation with Multi-Scale Selective Transformer Feature Fusion

Authors

  • Hongli Hu, School of Aeronautical Manufacturing Engineering, Nanchang Hangkong University
  • Jun Miao, School of Aeronautical Manufacturing Engineering, Nanchang Hangkong University; Key Laboratory of Lunar and Deep Space Exploration, CAS
  • Guanghui Zhu
  • Jie Yan
  • Jun Chu

DOI:

https://doi.org/10.5566/ias.2987

Keywords:

monocular depth estimation, multi-scale fusion, self-supervised learning, Transformer

Abstract

Existing self-supervised models for monocular depth estimation suffer from discontinuity, blurred edges, and unclear contours, particularly for small objects. We propose a self-supervised monocular depth estimation network with multi-scale selective Transformer feature fusion. To preserve more detailed features, we construct a multi-scale encoder and leverage the Transformer's self-attention mechanism to capture global contextual information, enabling better depth prediction for small objects. We also propose a multi-scale selective fusion module (MSSF), which makes full use of multi-scale feature information in the decoder and fuses it selectively, stage by stage, effectively suppressing noise while retaining local detail to produce depth maps with clear edges. Experimental evaluations on the KITTI dataset demonstrate that the proposed algorithm achieves an absolute relative error (Abs Rel) of 0.098 and an accuracy (δ) of 0.983. The results indicate that the proposed algorithm not only estimates depth values accurately but also predicts continuous depth maps with clear edges.
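
The abstract outlines the idea but not the implementation. As a rough illustration only, the sketch below shows one plausible reading of the selective-fusion step: two feature scales are aligned, per-branch channel weights are predicted with a softmax, and the branches are blended so that noisy channels can be suppressed while local detail is kept. Nothing here is taken from the paper itself; the module name `SelectiveFusion`, the SK-style channel gating, and all shapes and hyperparameters are assumptions for illustration.

```python
# Hypothetical sketch of an MSSF-style selective fusion block.
# The paper's actual MSSF design is not specified in the abstract; this
# follows a common "selective kernel" pattern and is an assumption only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveFusion(nn.Module):
    """Toy stand-in for an MSSF-style block: fuse a fine skip feature
    with a coarser decoder feature via learned channel-wise selection."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        # Global summary of the (pre-)fused features.
        self.squeeze = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # Predicts one weight vector per branch (fine vs. coarse).
        self.select = nn.Conv2d(hidden, channels * 2, kernel_size=1)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # Bring the coarse decoder feature up to the skip resolution.
        coarse = F.interpolate(coarse, size=fine.shape[-2:],
                               mode="bilinear", align_corners=False)
        summary = self.squeeze(fine + coarse)            # (B, hidden, 1, 1)
        w_fine, w_coarse = self.select(summary).chunk(2, dim=1)
        # Softmax across the two branches -> channel-wise selection, so
        # noisy channels in either branch can be down-weighted.
        w = torch.softmax(torch.stack([w_fine, w_coarse]), dim=0)
        return w[0] * fine + w[1] * coarse


# Usage: fuse a 1/4-scale skip feature with a 1/8-scale decoder feature.
mssf = SelectiveFusion(channels=64)
out = mssf(torch.randn(1, 64, 48, 160), torch.randn(1, 64, 24, 80))
print(out.shape)  # torch.Size([1, 64, 48, 160])
```

In a full decoder, a block like this would be applied stage by stage, fusing each upsampled decoder feature with the matching encoder skip connection, in line with the abstract's description of step-by-step selective fusion.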

Published

2024-05-27 — Updated on 2024-06-10

How to Cite

Hu, H., Miao, J., Zhu, G., Yan, J., & Chu, J. (2024). TFDepth: Self-Supervised Monocular Depth Estimation with Multi-Scale Selective Transformer Feature Fusion. Image Analysis and Stereology, 43(2), 139–149. https://doi.org/10.5566/ias.2987

Issue

Vol. 43 No. 2 (2024)

Section

Original Research Paper