Vani-Adapt: A Zero-Shot Accent Trans-Adaptation Framework for Robust Indic Speech Recognition

Volume: 12 | Issue: 1 | Year 2026 | Subscription
International Journal of Image Processing and Pattern Recognition
Received Date: 2025-12-20
Acceptance Date: 2026-01-15
Published On: 2026-02-18
First Page: 17
Last Page: 22

By: Jyotirmoyee Mandal, Kunal Halder, and Kakali Das.

1 Student, Department of Computer Science and Engineering, Greater Kolkata College of Engineering and Management, Sonarpur, Kolkata, West Bengal, India
2 Student, Department of Computer Science and Engineering, Greater Kolkata College of Engineering and Management, Sonarpur, Kolkata, West Bengal, India
3 Assistant Professor, Department of Computer Science and Engineering, Greater Kolkata College of Engineering and Management, Sonarpur, Kolkata, West Bengal, India

Abstract

In multilingual, accent-rich countries such as India, speech-based human–computer interaction is important for extending digital services and improving accessibility. Despite recent advances in Automatic Speech Recognition (ASR), existing systems remain highly sensitive to regional accents and non-standard speech patterns, which prevents a seamless user experience. Traditional approaches rely on accent-specific fine-tuning, which requires large amounts of labeled data that are rarely available and is therefore impractical for real-world deployment. In this paper, we present Vani-Adapt, a zero-shot accent trans-adaptation framework that improves ASR robustness on previously unseen accents without retraining or accent-labeled data. The foundation of the proposed method is a Disentangled Phonetic–Prosodic Encoder (DPPE), which separates linguistic content from prosodic features such as intonation, rhythm, and stress. By projecting speech into an accent-invariant phonetic space and independently modifying the prosodic representations, Vani-Adapt enables structured accent normalization while preserving speaker identity and semantic content. A high-fidelity neural vocoder resynthesizes the modified speech, enabling smooth integration with existing ASR backends. In comparisons against strong baselines, including OpenAI Whisper, experiments show notable improvements, such as a 28% relative reduction in Word Error Rate (WER) on unseen accents. Subjective listening assessments further confirm improvements in naturalness and accessibility. The results indicate that Vani-Adapt provides a scalable, data-efficient, accent-agnostic speech recognition solution, making it especially suitable for inclusive conversational AI systems deployed in linguistically diverse settings.
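The abstract reports its headline result as a relative change in WER, so it may help to see how such a figure is typically computed. The sketch below is a minimal illustration, not the authors' evaluation code: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, the scoring assumes simple whitespace tokenization with no text normalization, and the WER values in the usage example are invented purely to show the arithmetic.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction by which the system lowers WER relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

    # Hypothetical WER values, NOT taken from the paper: a drop from
    # 25% to 18% absolute WER corresponds to a 28% relative reduction.
    print(relative_wer_reduction(0.25, 0.18))
```

Note that a relative reduction is measured against the baseline's own error rate, so a 28% relative reduction does not mean WER falls by 28 percentage points.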

Citation:

How to cite this article: Jyotirmoyee Mandal, Kunal Halder, and Kakali Das. Vani-Adapt: A Zero-Shot Accent Trans-Adaptation Framework for Robust Indic Speech Recognition. International Journal of Image Processing and Pattern Recognition. 2026; 12(1): 17-22p.

How to cite this URL: Jyotirmoyee Mandal, Kunal Halder, and Kakali Das. Vani-Adapt: A Zero-Shot Accent Trans-Adaptation Framework for Robust Indic Speech Recognition. International Journal of Image Processing and Pattern Recognition. 2026; 12(1): 17-22p. Available from: https://journalspub.com/publication/ijippr/article=26330
