Efficient Image Super-Resolution with Multi-Branch Mixer Transformer

Authors

Zhang L, Wan Y

DOI:

https://doi.org/10.5566/ias.3399

Keywords:

active token mixer, multi-branch token mixer, single image super-resolution, transformer

Abstract

Deep learning methods have driven significant advances in single image super-resolution (SISR), with Transformer-based models frequently outperforming their CNN-based counterparts. However, the self-attention mechanism in Transformers makes lightweight models harder to achieve than with CNN-based approaches. In this paper, we propose a lightweight Transformer model for SR termed the Multi-Branch Mixer Transformer (MBMT). The design of MBMT is motivated by two main observations: first, while self-attention excels at capturing long-range dependencies in features, it struggles to extract local features; second, the quadratic complexity of self-attention poses a significant obstacle to building lightweight models. To address these problems, we propose a Multi-Branch Token Mixer (MBTM) that extracts richer global and local information. Specifically, MBTM consists of three parts: shifted window attention, depthwise convolution, and an active token mixer. This multi-branch structure captures long-range dependencies and local features simultaneously, enabling excellent SR performance with only a few stacked modules. Experimental results demonstrate that MBMT achieves competitive performance against SOTA methods while maintaining model efficiency.
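The three-branch mixer described above can be sketched as a small NumPy toy: the channels of a feature map are split across a windowed self-attention branch, a depthwise 3x3 convolution branch, and a crude stand-in for the active token mixer (fixed spatial shifts in place of learned offsets), and the branch outputs are concatenated. This is an illustrative sketch under those assumptions, not the authors' implementation (function names, channel split, and offsets are hypothetical).

```python
import numpy as np

def window_attention(x, w=4):
    # x: (H, W, c). Single-head self-attention inside non-overlapping w x w windows.
    H, W, c = x.shape
    out = np.zeros_like(x)
    for i in range(0, H, w):
        for j in range(0, W, w):
            win = x[i:i + w, j:j + w].reshape(-1, c)      # (w*w, c) tokens
            attn = win @ win.T / np.sqrt(c)               # scaled similarity
            attn = np.exp(attn - attn.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)       # row-wise softmax
            out[i:i + w, j:j + w] = (attn @ win).reshape(w, w, c)
    return out

def depthwise_conv3x3(x, k):
    # x: (H, W, c), k: (3, 3, c) -- one 3x3 kernel per channel (depthwise).
    H, W, c = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += xp[di:di + H, dj:dj + W] * k[di, dj]   # broadcast over channels
    return out

def active_mix(x, offsets):
    # Stand-in for an active token mixer: each channel gathers a token
    # shifted by its own (dy, dx) offset (fixed here, learned in ActiveMLP).
    out = np.empty_like(x)
    for ch, (dy, dx) in enumerate(offsets):
        out[..., ch] = np.roll(x[..., ch], (dy, dx), axis=(0, 1))
    return out

def mbtm(x, k, offsets, w=4):
    # Split channels evenly across the three branches, mix, and concatenate.
    c = x.shape[-1] // 3
    a = window_attention(x[..., :c], w)
    b = depthwise_conv3x3(x[..., c:2 * c], k)
    d = active_mix(x[..., 2 * c:], offsets)
    return np.concatenate([a, b, d], axis=-1)
```

The key design point the sketch illustrates is that attention only has to run on a third of the channels, while the convolution and shift branches supply local detail at near-linear cost.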

References

Agustsson E, Timofte R (2017). Ntire 2017 challenge on single image super-resolution: Dataset and study. In: PROC CVPR IEEE.

Ahn N, Kang B, Sohn KA (2018). Fast, accurate, and lightweight super-resolution with cascading residual network. In: LECT NOTES COMPUT SC.

Bevilacqua M, Roumy A, Guillemot C, Morel A (2012). Low-complexity single image super-resolution based on nonnegative neighbor embedding. In: BMVC.

Cao J, Liang J, Zhang K, Li Y, Zhang Y, Wang W, Van Gool L (2022). Reference-based image super-resolution with deformable attention transformer. In: LECT NOTES COMPUT SC.

Chen Z, Zhang Y, Gu J, Kong L, Yang X, Yu F (2023). Dual aggregation transformer for image super-resolution. In: IEEE I CONF COMP VIS.

Choi H, Lee JS, Yang J (2022). N-gram in swin transformers for efficient lightweight image super-resolution. In: PROC CVPR IEEE.

Dong C, Loy CC, He K, Tang X (2016a). Image super-resolution using deep convolutional networks. IEEE T PATTERN ANAL 38:295–307.

Dong C, Loy CC, Tang X (2016b). Accelerating the super-resolution convolutional neural network. In: LECT NOTES COMPUT SC.

Dosovitskiy A, Beyer L, Houlsby N (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR.

Gao G, Wang Z, Li J, Li W, Yu Y, Zeng T (2022). Lightweight bimodal network for single-image super-resolution via symmetric cnn and recursive transformer. In: INT JOINT CONF ARTIF.

Gu A, Dao T (2024). Mamba: Linear-time sequence modeling with selective state spaces. In: COLM.

Gu J, Dong C (2021). Interpreting super-resolution networks with local attribution maps. In: PROC CVPR IEEE.

He K, Zhang X, Ren S, Sun J (2016). Deep residual learning for image recognition. In: PROC CVPR IEEE.

Hendrycks D, Gimpel K (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

Howard AG, Zhu M, Adam H (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

Huang JB, Singh A, Ahuja N (2015). Single image super-resolution from transformed self-exemplars. In: PROC CVPR IEEE.

Huang Z, Zhang Z, Lan C, Zha ZJ, Lu Y, Guo B (2023). Adaptive frequency filters as efficient global token mixers. In: IEEE I CONF COMP VIS.

Hui Z, Gao X, Yang Y, Wang X (2019). Lightweight image super-resolution with information multi-distillation network. In: ACM MM.

Hui Z, Wang X, Gao X (2018). Fast and accurate single image super-resolution via information distillation network. In: PROC CVPR IEEE.

Kim J, Lee JK, Lee KM (2016a). Accurate image super-resolution using very deep convolutional networks. In: PROC CVPR IEEE.

Kim J, Lee JK, Lee KM (2016b). Deeply-recursive convolutional network for image super-resolution. In: PROC CVPR IEEE.

Kong F, Li M, Liu S, Liu D, He J, Bai Y, Chen F, Fu L (2022). Residual local feature network for efficient super-resolution. In: PROC CVPR IEEE.

Lai WS, Huang JB, Ahuja N, Yang MH (2017). Deep laplacian pyramid networks for fast and accurate super-resolution. In: PROC CVPR IEEE.

Ledig C, Theis L, Huszar F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z (2016). Photo-realistic single image super-resolution using a generative adversarial network. In: PROC CVPR IEEE.

Li H, Cai D, Xu J, Watanabe T (2022a). Residual learning of neural text generation with n-gram language model. In: ACL.

Li Y, Zhang K, Timofte R, Van Gool L, et al. (2022b). Ntire 2022 challenge on efficient super-resolution: Methods and results. In: PROC CVPR IEEE.

Li Z, Liu Y, Chen X, Cai H, Gu J, Qiao Y, Dong C (2022c). Blueprint separable residual network for efficient image super-resolution. In: PROC CVPR IEEE.

Liang J, Cao J, Sun G, Zhang K, Van Gool L, Timofte R (2021). Swinir: Image restoration using swin transformer. In: IEEE I CONF COMP VIS.

Lim B, Son S, Kim H, Nah S, Lee KM (2017). Enhanced deep residual networks for single image super-resolution. In: PROC CVPR IEEE.

Liu J, Tang J, Wu G (2020). Residual feature distillation network for lightweight image super-resolution. In: LECT NOTES COMPUT SC.

Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In: IEEE I CONF COMP VIS.

Lu Z, Li J, Liu H, Huang C, Zhang L, Zeng T (2022). Transformer for single image super-resolution. In: PROC CVPR IEEE.

Martin D, Fowlkes C, Tal D, Malik J (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: IEEE I CONF COMP VIS, vol. 2.

Matsui Y, Ito K, Aramaki Y, Fujimoto A, Ogawa T, Yamasaki T, Aizawa K (2015). Sketch-based manga retrieval using manga109 dataset. MULTIMED TOOLS APPL 76:21811–38.

Ramachandran P, Zoph B, Le QV (2017). Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941.

Shi W, Caballero J, Huszar F, Totz J, Aitken AP, Bishop R, Rueckert D, Wang Z (2016). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: PROC CVPR IEEE.

Tai Y, Yang J, Liu X (2017a). Image super-resolution via deep recursive residual network. In: PROC CVPR IEEE.

Tai Y, Yang J, Liu X, Xu C (2017b). Memnet: A persistent memory network for image restoration. In: IEEE I CONF COMP VIS.

Tolstikhin IO, Houlsby N, Lucic M, Dosovitskiy A, et al. (2021). Mlp-mixer: An all-mlp architecture for vision. In: ADV NEUR IN, vol. 34.

Touvron H, Bojanowski P, Verbeek J, et al. (2022). Resmlp: Feedforward networks for image classification with data efficient training. IEEE T PATTERN ANAL 45:5314–21.

Vaswani A, Shazeer NM, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017). Attention is all you need. In: ADV NEUR IN.

Wang H, Chen X, Ni B, Liu Y, Liu J (2023). Omni aggregation networks for lightweight image super-resolution. In: PROC CVPR IEEE.

Wang X, Yu K, Wu S, Gu J, Liu Y, Dong C, Qiao Y, Loy CC (2018). Esrgan: Enhanced super-resolution generative adversarial networks. In: LECT NOTES COMPUT SC.

Wang Y, Liu Y, Zhao S, Li J, Zhang L (2024). Camixersr: Only details need more "attention". In: PROC CVPR IEEE.

Wang Z, Bovik A, Sheikh H, Simoncelli E (2004). Image quality assessment: from error visibility to structural similarity. IEEE T IMAGE PROCESS 13:600–12.

Wei G, Zhang Z, Lan C, Lu Y, Chen Z (2022). Activemlp: An mlp-like architecture with active token mixer. In: AAAI.

Yang F, Yang H, Fu J, Lu H, Guo B (2020). Learning texture transformer network for image super-resolution. In: PROC CVPR IEEE.

Yu W, Luo M, Zhou P, Si C, Zhou Y, Wang X, Feng J, Yan S (2022). Metaformer is actually what you need for vision. In: PROC CVPR IEEE.

Zeyde R, Elad M, Protter M (2010). On single image scale-up using sparse-representations. In: ICCS.

Zhang A, Ren W, Liu Y, Cao X (2023). Lightweight image super-resolution with superpixel token interaction. In: IEEE I CONF COMP VIS.

Zhang X, Zeng H, Guo S, Zhang L (2022). Efficient long-range attention network for image super-resolution. In: LECT NOTES COMPUT SC.

Zhang Y, Tian Y, Kong Y, Zhong B, Fu Y (2018). Residual dense network for image super-resolution. In: PROC CVPR IEEE.

Zhao H, Gallo O, Frosio I, Kautz J (2017). Loss functions for image restoration with neural networks. TCI 3:47–57.

Zhou Y, Li Z, Guo CL, Bai S, Cheng MM, Hou Q (2023). Srformer: Permuted self-attention for single image super-resolution. In: IEEE I CONF COMP VIS.

Published

2025-02-05

Data Availability Statement

The code and data used in this study are publicly available on GitHub at the following repository: https://github.com/zl11250422/MBMT.

Section

Original Research Paper

How to Cite

Zhang, L., & Wan, Y. (2025). Efficient Image Super-Resolution with Multi-Branch Mixer Transformer. Image Analysis and Stereology. https://doi.org/10.5566/ias.3399