Research on YOLOv5s-Based Multimodal Assistive Gesture and Micro-Expression Recognition with Speech Synthesis
| dc.contributor.author | Xiaohua Li | |
| dc.contributor.author | Chaiyan Jettanasen | |
| dc.date.accessioned | 2026-05-08T19:25:44Z | |
| dc.date.issued | 2025-12-01 | |
| dc.description.abstract | Effective communication between deaf–mute and visually impaired individuals remains a challenge in the fields of human–computer interaction and accessibility technology. Current solutions mostly rely on single-modality recognition, which often leads to semantic ambiguity and loss of emotional information. To address these challenges, this study proposes a lightweight multimodal fusion framework that combines gestures and micro-expressions, processing them through a recognition network and a speech synthesis module. The core innovations of this research are as follows: (1) a lightweight improved YOLOv5s architecture that integrates residual modules and efficient downsampling modules, reducing model complexity and computational overhead while maintaining high accuracy; (2) a multimodal fusion method based on an attention mechanism, which adaptively and efficiently integrates complementary information from gestures and micro-expressions, significantly improving the semantic richness and accuracy of joint recognition; (3) an end-to-end real-time system that outputs the visual recognition results through a high-quality text-to-speech module, closing the loop from “visual signal” to “speech feedback”. We conducted evaluations on the publicly available hand gesture dataset HaGRID and a curated micro-expression image dataset. The results show that, for the joint gesture and micro-expression task, the proposed system achieves a joint recognition accuracy of 95.3%, a 4.5% improvement over the baseline model. Evaluated in a locally deployed environment, the system runs in real time at 22 FPS with a speech output latency below 0.8 s, and its mean opinion score (MOS) reaches 4.5, demonstrating the effectiveness of the proposed approach in breaking down communication barriers between deaf–mute and visually impaired populations. | |
| dc.identifier.doi | 10.3390/computation13120277 | |
| dc.identifier.uri | https://dspace.kmitl.ac.th/handle/123456789/20250 | |
| dc.publisher | Computation | |
| dc.subject | Hand Gesture Recognition Systems | |
| dc.subject | Speech and Audio Processing | |
| dc.subject | Human Pose and Action Recognition | |
| dc.title | Research on YOLOv5s-Based Multimodal Assistive Gesture and Micro-Expression Recognition with Speech Synthesis | |
| dc.type | Article |
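
The abstract above outlines three components: a lightweight YOLOv5s variant, an attention-based fusion of gesture and micro-expression features, and a text-to-speech output stage. The paper itself is not reproduced in this record, so the following is only a minimal PyTorch sketch of how an attention-based fusion of two modality embeddings might look; the feature dimension, head count, class count, and module names are all assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): attention-based fusion of
# gesture and micro-expression features, as described in the abstract.
# Feature dimensions, class count, and module names are assumptions.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Adaptively weight gesture and micro-expression feature vectors."""
    def __init__(self, dim: int = 256, n_classes: int = 20):
        super().__init__()
        # Self-attention over the two modality tokens lets each modality
        # attend to the other, producing adaptive fusion weights.
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, gesture: torch.Tensor, expression: torch.Tensor) -> torch.Tensor:
        # gesture, expression: (batch, dim) pooled features from the two branches.
        tokens = torch.stack([gesture, expression], dim=1)  # (batch, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)        # cross-modality attention
        fused = self.norm(fused + tokens)                   # residual connection
        return self.head(fused.flatten(start_dim=1))        # joint class logits

# Usage with random features standing in for the two recognition branches.
if __name__ == "__main__":
    g = torch.randn(1, 256)   # gesture embedding (e.g., from a YOLOv5s backbone)
    e = torch.randn(1, 256)   # micro-expression embedding
    logits = AttentionFusion()(g, e)
    print(logits.shape)       # torch.Size([1, 20])
```

Likewise, the “visual signal” to “speech feedback” loop could be closed with any offline TTS engine; the record does not name the paper's TTS module, so the sketch below uses pyttsx3 purely as a stand-in.

```python
# Hypothetical speech-feedback step: speak the fused prediction aloud.
import pyttsx3

def speak(label: str) -> None:
    engine = pyttsx3.init()   # offline TTS engine (stand-in for the paper's module)
    engine.say(label)         # queue the recognized label, e.g. "thumbs up, smiling"
    engine.runAndWait()       # block until playback finishes
```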