Image experience prediction for historic districts using a CNN-transformer fusion model
DOI:
https://doi.org/10.5566/ias.3361

Keywords:
historic districts, sentiment analysis and evaluation system, convolutional neural network (CNN), transformer model

Abstract
This study addresses two key challenges in historic district planning and design: capturing the emotional value of streetscape images and integrating that value into the design process. We developed a deep learning-based sentiment analysis system that employs CNN and transformer models to analyze emotional tendencies and temporal states in images. Using a multi-view feature extraction framework that combines VGG and ResNet CNNs with the Swin Transformer, we constructed a novel feature matrix; an attention mechanism and a transfer learning strategy further improved the model's accuracy in label recognition and classification. Applying the system to Jiangnan historic districts, we demonstrated how understanding and applying emotional value can enhance a district's appeal: by identifying the emotional tendencies of streetscape images, designers can make better-informed decisions that foster positive visitor experiences. Our analysis of images from 12 Jiangnan historic districts showed that the system efficiently matches images against existing image libraries, providing valuable references and feedback. The results highlight the practical potential of deep learning for visual sentiment analysis, underscore the importance of emotional value in improving experiences in historic districts, and offer new insights and methodological support for planning and designing such areas.
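The pipeline described above — per-backbone feature extraction, attention-weighted fusion into a single representation, and similarity matching against a reference image library — can be sketched in outline. This is a minimal illustration only, not the authors' implementation: the VGG/ResNet/Swin embeddings are random stand-ins, the attention scores are fixed rather than learned, and every function name here is hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(features, scores):
    """Fuse per-backbone feature vectors with softmax attention weights."""
    weights = softmax(scores)             # (n_views,), sums to 1
    stacked = np.stack(features)          # (n_views, dim)
    return weights @ stacked, weights     # fused vector of shape (dim,)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 128
# Stand-ins for VGG / ResNet / Swin Transformer embeddings of one streetscape image
views = [rng.standard_normal(dim) for _ in range(3)]
fused, w = attention_fuse(views, scores=np.array([0.2, 0.5, 0.3]))

# Match the fused vector against a small reference "image library" (12 entries,
# mirroring the 12 districts in the study; contents here are synthetic)
library = rng.standard_normal((12, dim))
sims = [cosine_similarity(fused, ref) for ref in library]
best = int(np.argmax(sims))
```

In a real system the attention scores would be produced by a small learned network over the stacked features, and the library vectors would be precomputed embeddings of labeled streetscape images.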
Data Availability Statement
The datasets generated and analyzed during this study were collected and photographed by the authors. User evaluations were conducted with the participants' consent. These data are available from the corresponding author on reasonable request.
License
Copyright (c) 2025 Youping Teng, Weijia Wang

This work is licensed under a Creative Commons Attribution 4.0 International License.