Research on YOLOv5s-Based Multimodal Assistive Gesture and Micro-Expression Recognition with Speech Synthesis
| dc.contributor.author | Xiaohua Li | |
| dc.contributor.author | Chaiyan Jettanasen | |
| dc.date.accessioned | 2026-05-08T19:25:44Z | |
| dc.date.issued | 2025-12-01 | |
| dc.description.abstract | Effective communication between deaf–mute and visually impaired individuals remains a challenge in the fields of human–computer interaction and accessibility technology. Current solutions mostly rely on single-modality recognition, which often leads to semantic ambiguity and loss of emotional information. To address these challenges, this study proposes a lightweight multimodal fusion framework that combines gestures and micro-expressions, processing them through a recognition network and a speech synthesis module. The core innovations of this research are as follows: (1) a lightweight improved YOLOv5s architecture that integrates residual modules and efficient downsampling modules, reducing model complexity and computational overhead while maintaining high accuracy; (2) a multimodal fusion method based on an attention mechanism, which adaptively and efficiently integrates complementary information from gestures and micro-expressions, significantly improving the semantic richness and accuracy of joint recognition; (3) an end-to-end real-time system that outputs the visual recognition results through a high-quality text-to-speech module, closing the loop from “visual signal” to “speech feedback”. We conducted evaluations on the publicly available hand gesture dataset HaGRID and a curated micro-expression image dataset. The results show that, for the joint gesture and micro-expression task, the proposed system achieves a joint recognition accuracy of 95.3%, a 4.5% improvement over the baseline model. Evaluated in a locally deployed environment, the system runs in real time at 22 FPS with a speech output latency below 0.8 s, and its mean opinion score (MOS) reaches 4.5, demonstrating the effectiveness of the proposed approach in breaking down communication barriers between deaf–mute and visually impaired populations. | |
| dc.identifier.doi | 10.3390/computation13120277 | |
| dc.identifier.uri | https://dspace.kmitl.ac.th/handle/123456789/20250 | |
| dc.publisher | Computation | |
| dc.subject | Hand Gesture Recognition Systems | |
| dc.subject | Speech and Audio Processing | |
| dc.subject | Human Pose and Action Recognition | |
| dc.title | Research on YOLOv5s-Based Multimodal Assistive Gesture and Micro-Expression Recognition with Speech Synthesis | |
| dc.type | Article |
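
The abstract above outlines three components: a lightweight YOLOv5s variant, an attention-based fusion of gesture and micro-expression features, and a text-to-speech output stage. The paper itself is not reproduced in this record, so the following is only a minimal PyTorch sketch of how an attention-based fusion of two modality embeddings might look; the feature dimension, head count, class count, and module names are all assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): attention-based fusion of
# gesture and micro-expression features, as described in the abstract.
# Feature dimensions, class count, and module names are assumptions.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Adaptively weight gesture and micro-expression feature vectors."""
    def __init__(self, dim: int = 256, n_classes: int = 20):
        super().__init__()
        # Self-attention over the two modality tokens lets each modality
        # attend to the other, producing adaptive fusion weights.
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, gesture: torch.Tensor, expression: torch.Tensor) -> torch.Tensor:
        # gesture, expression: (batch, dim) pooled features from the two branches.
        tokens = torch.stack([gesture, expression], dim=1)  # (batch, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)        # cross-modality attention
        fused = self.norm(fused + tokens)                   # residual connection
        return self.head(fused.flatten(start_dim=1))        # joint class logits

# Usage with random features standing in for the two recognition branches.
if __name__ == "__main__":
    g = torch.randn(1, 256)   # gesture embedding (e.g., from a YOLOv5s backbone)
    e = torch.randn(1, 256)   # micro-expression embedding
    logits = AttentionFusion()(g, e)
    print(logits.shape)       # torch.Size([1, 20])
```

Likewise, the “visual signal” to “speech feedback” loop could be closed with any offline TTS engine; the record does not name the paper's TTS module, so the sketch below uses pyttsx3 purely as a stand-in.

```python
# Hypothetical speech-feedback step: speak the fused prediction aloud.
import pyttsx3

def speak(label: str) -> None:
    engine = pyttsx3.init()   # offline TTS engine (stand-in for the paper's module)
    engine.say(label)         # queue the recognized label, e.g. "thumbs up, smiling"
    engine.runAndWait()       # block until playback finishes
```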