An Improved EViT Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery

Rui Xu; Yihui Yang; Renzhong Mao; Yining Zhang; Yiteng Lin; Weiping Zhang

doi:10.5566/ias.3950

Authors

Rui Xu
Yihui Yang, School of Computing and Data Science, Fujian University of Technology
Renzhong Mao
Yining Zhang
Yiteng Lin
Weiping Zhang

Keywords:

Attention mechanism, CNN-Transformer fusion, high-resolution remote sensing imagery, Local-Global Feature Calibration, semantic segmentation, Spatial Perception Gating Mechanism

Abstract

To address blurred building boundaries, omitted small objects, and severe background interference in the semantic segmentation of high-resolution remote sensing imagery, this study proposes an improved method based on the Enhanced Vision Transformer Network (EViT). A Grouped Cross-Cascaded Multi-Head Self-Attention (GCC-MSA) module enhances feature diversity while maintaining linear complexity, and a Local-Global Feature Calibration (LGC) module fuses local details from the CNN with global context from the Transformer. Coordinate Attention (CoAt) replaces conventional channel attention to strengthen channel-spatial feature representation. In addition, Semantic-Guided Spatial Pyramid Pooling (SGSPP) and a GCC-MSA-guided Edge Perception (GEP) module reinforce multi-scale semantic perception and boundary extraction, while a Spatial Perception Gating Mechanism (SPGM) adaptively fuses the dual-branch features. On the WHU Aerial, Massachusetts, and GF-7 building datasets, the model achieves Intersection-over-Union (IoU) scores of 92.33%, 77.81%, and 78.29%, respectively, which are improvements of 0.57, 0.67, and 0.62 percentage points over the original EViT. The model demonstrates superior performance in small-building extraction, complex boundary segmentation, and background noise suppression, providing a robust solution for precise extraction of surface object information from high-resolution remote sensing imagery.
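As a reading aid for the reported metric, the following is a minimal sketch of how IoU can be computed for a binary building mask, together with the original-EViT scores implied by the stated gains. It is not the authors' evaluation code: the function name `binary_iou`, the toy masks, and the empty-mask convention are assumptions made here for illustration, and the implied baseline values are obtained only by subtracting the stated improvements from the stated scores. The paper may also accumulate IoU over a whole test set rather than per image.

```python
import numpy as np


def binary_iou(pred, target):
    """Return |pred AND target| / |pred OR target| for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


# Hypothetical 3x3 masks: 2 pixels overlap, 4 pixels are in the union.
pred = [[1, 1, 0], [0, 1, 0], [0, 0, 0]]
target = [[1, 1, 0], [0, 0, 0], [0, 0, 1]]
toy_iou = binary_iou(pred, target)

# (IoU in %, gain in percentage points) as stated in the abstract.
reported = {
    "WHU Aerial": (92.33, 0.57),
    "Massachusetts": (77.81, 0.67),
    "GF-7": (78.29, 0.62),
}
# Original-EViT IoU implied by subtraction; derived here, not quoted.
implied_baseline = {
    name: round(iou - gain, 2) for name, (iou, gain) in reported.items()
}
```

Under these assumptions the toy masks give an IoU of 0.5, and the implied original-EViT scores are 91.76%, 77.14%, and 77.67% on the three datasets.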

Author Biography

Rui Xu

Associate Professor, Fujian University of Technology
