Deep Multimodal Fusion of Visual and Auditory Features for Robust Material Recognition
DOI:
https://doi.org/10.15837/ijccc.2024.5.6457
Keywords:
material recognition, deep neural network, visual information, auditory information, feature fusion
Abstract
This paper presents a deep neural network that fuses visual and auditory data to improve material recognition performance. Traditional recognition techniques that rely on a single data modality face accuracy and robustness limitations, especially in complex real-world environments. To address these challenges, we develop a multimodal fusion-based model. The proposed approach first extracts features from input images and sounds separately, using CNNs and spectral analysis. A concatenation layer then integrates the visual and auditory features. Extensive experiments demonstrate superior material classification over unimodal methods, with 100% test accuracy across seven material types. The multimodal fusion model also demonstrates stronger resilience to noise and illumination variations. This research provides a valuable foundation for robust material perception in intelligent systems.
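The pipeline outlined in the abstract (separate feature extraction, then concatenation, then classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, feature sizes, band pooling, and the random linear classifier are all assumptions made for illustration. A random vector stands in for the CNN image embedding, and a naive DFT stands in for whatever spectral analysis the paper uses.

```python
import cmath
import math
import random

NUM_CLASSES = 7  # the abstract reports seven material types


def magnitude_spectrum(waveform):
    """Naive DFT magnitudes for non-negative frequencies (O(n^2), demo only)."""
    n = len(waveform)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(waveform)))
        for k in range(n // 2 + 1)
    ]


def spectral_features(waveform, n_bands=8):
    """Pool the spectrum into n_bands equal-width bands, then log-compress.

    Bins left over after the last full band are dropped (an assumed choice).
    """
    spectrum = magnitude_spectrum(waveform)
    width = len(spectrum) // n_bands
    return [
        math.log1p(sum(spectrum[i * width:(i + 1) * width]) / width)
        for i in range(n_bands)
    ]


def fuse(visual_feat, audio_feat):
    """Feature-level fusion by concatenation."""
    return list(visual_feat) + list(audio_feat)


def classify(fused, weights, bias):
    """Linear layer followed by a numerically stable softmax."""
    logits = [
        sum(f * w for f, w in zip(fused, row)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical stand-ins: a 16-dim "CNN embedding" and a 128-sample sound clip.
visual_feat = [rng.gauss(0.0, 1.0) for _ in range(16)]
waveform = [rng.gauss(0.0, 1.0) for _ in range(128)]

audio_feat = spectral_features(waveform)
fused = fuse(visual_feat, audio_feat)

# Untrained random weights: the output shows the data flow, not real predictions.
weights = [[rng.gauss(0.0, 0.01) for _ in fused] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES
probs = classify(fused, weights, bias)
```

Concatenation keeps each modality's features intact and leaves it to the classifier to weigh them, which is one plausible reason a fused model could degrade more gracefully when one input is noisy or poorly lit.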
License
Copyright (c) 2024 Yifei Shi, Huei Ruey Ong, Liang Zhou, Shuai Yang
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
ONLINE OPEN ACCESS: Access to the full text of each article and each issue is allowed for free under the Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
You are free to:
-Share: copy and redistribute the material in any medium or format;
-Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.