Adapting ASR to the Speech of People with Down Syndrome (original title: Adaptación de ASR al habla de personas con síndrome de Down)

Fernández-García, David; Cardeñoso-Payo, Valentín; González-Ferreras, César; Escudero-Mancebo, David

Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 209-220

Type: Article

Abstract

The speech of people with intellectual disabilities (ID) poses major challenges for automatic speech recognition (ASR) systems, making it harder for a particularly vulnerable population to access information services. This work studies the difficulties ASR systems have in recognizing the speech of people with ID and shows how this limitation can be mitigated through model fine-tuning. The performance of Whisper-based ASR (v2 and v3) is measured on a reference corpus of typical speech and ID speech, confirming that the differences between the two are substantial and significant. With fine-tuning, performance for speakers with ID improves by at least 30 percentage points. Our results show that including the speech of people with ID in training corpora is essential to improving the effectiveness of ASR systems.

Data source: Dialnet