Deep Multimodal Fusion of Visual and Auditory Features for Robust Material Recognition

Yifei Shi; Huei Ruey Ong; Shuai Yang; Yuxin Fan

doi:10.15837/ijccc.2024.5.6457

Authors

Yifei Shi Geely University of China / DRB-HICOM University of Automotive, Malaysia
Huei Ruey Ong DRB-HICOM University of Automotive, Malaysia
Shuai Yang Geely University of China
Yuxin Fan Geely University of China

Keywords:

material recognition, deep neural network, visual information, auditory information, feature fusion

Abstract

This paper presents a deep neural network that fuses visual and auditory data to improve material recognition. Traditional recognition techniques that rely on a single data modality are limited in accuracy and robustness, especially in complex real-world environments. To address these limitations, we develop a multimodal fusion-based model. The proposed approach first extracts features from input images and sounds separately using CNNs and spectral analysis. A concatenation layer then integrates the visual and auditory features. Extensive experiments demonstrate superior material classification over unimodal methods, with 100% test accuracy across seven material types. The multimodal fusion model is also more resilient to noise and illumination variations. This research provides a foundation for robust material perception in intelligent systems.
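The pipeline described in the abstract (separate feature extraction, concatenation, classification) can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' architecture. It assumes the CNN branch handles images and the spectral branch handles sounds, replaces the CNN with a single convolution, ReLU, and global average pooling, replaces the spectral analysis with a band-averaged DFT magnitude, and uses random, untrained weights. All function names, input sizes, and parameters are hypothetical.

```python
import cmath
import math
import random

NUM_CLASSES = 7  # the abstract reports seven material types


def visual_features(image, kernels):
    """Stand-in for a CNN branch (assumption): one valid 2D convolution per
    kernel, ReLU, then global average pooling, giving one feature per kernel."""
    feats = []
    h, w = len(image), len(image[0])
    for k in kernels:
        kh, kw = len(k), len(k[0])
        total, count = 0.0, 0
        for i in range(h - kh + 1):
            for j in range(w - kw + 1):
                acc = sum(image[i + a][j + b] * k[a][b]
                          for a in range(kh) for b in range(kw))
                total += max(acc, 0.0)  # ReLU
                count += 1
        feats.append(total / count)
    return feats


def spectral_features(signal, n_bands):
    """Stand-in for spectral analysis (assumption): log of the mean DFT
    magnitude in each of n_bands equal-width frequency bands."""
    n = len(signal)
    half = n // 2
    mags = [abs(sum(signal[t] * cmath.exp(-2j * math.pi * f * t / n)
                    for t in range(n)))
            for f in range(half)]
    size = half // n_bands
    return [math.log1p(sum(mags[b * size:(b + 1) * size]) / size)
            for b in range(n_bands)]


def fuse(visual, audio):
    """Feature-level fusion by concatenation."""
    return list(visual) + list(audio)


def classify(fused, weights, biases):
    """Linear layer followed by a numerically stable softmax."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


# Toy inputs (hypothetical): an 8x8 grayscale image and a 64-sample tone.
rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]
signal = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(4)]

visual = visual_features(image, kernels)
audio = spectral_features(signal, n_bands=8)
fused = fuse(visual, audio)

# Untrained random classifier weights, for shape illustration only.
weights = [[rng.uniform(-0.1, 0.1) for _ in fused] for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES
probs = classify(fused, weights, biases)
```

Concatenation keeps the two branches independent until the classifier, so either feature extractor can be swapped without changing the other. Because the weights here are untrained, `probs` carries no meaning beyond having one entry per material class.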
