
dc.contributor.authorDel Río, Miguel
dc.contributor.authorMiller, Corey
dc.contributor.authorProfant, Ján
dc.contributor.authorDrexler-Fox, Jennifer
dc.contributor.authorMcNamara, Quinn
dc.contributor.authorBhandari, Nishchal
dc.contributor.authorDelworth, Natalie
dc.contributor.authorPirkin, Ilya
dc.contributor.authorJetté, Miguel
dc.contributor.authorChandra, Shipra
dc.contributor.authorHa, Peter
dc.contributor.authorWesterman, Ryan
dc.date.accessioned2024-01-03T09:40:18Z
dc.date.available2024-01-03T09:40:18Z
dc.date.issued2023-12-28
dc.identifier.issn1731-7533
dc.identifier.urihttp://hdl.handle.net/11089/49005
dc.description.abstractAutomatic Speech Recognition (ASR) systems generalize poorly on accented speech, creating bias issues for users and providers. The phonetic and linguistic variability of accents presents challenges for ASR systems in both data collection and modeling strategies. We present two promising approaches to accented speech recognition, custom vocabulary and multilingual modeling, and highlight key challenges in the space. Among these, the lack of a standard benchmark makes research and comparison difficult. We address this with a novel corpus of accented speech: Earnings-22, a 125-file, 119-hour corpus of English-language earnings calls gathered from global companies. We compare commercial models, showing variation in performance when country of origin is taken into consideration, and demonstrate targeted improvements using the methods we introduce.en
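A minimal sketch (not the authors' code) of the kind of per-country evaluation the abstract describes: it groups ASR hypotheses for a corpus such as Earnings-22 by speaker country of origin and computes a word error rate per group. The CSV layout, column names, file name, and the use of the third-party jiwer library are illustrative assumptions, not artifacts of the paper.

# Minimal sketch, assuming a metadata CSV with columns:
# file_id, country, reference_text, hypothesis_text.
from collections import defaultdict
import csv

import jiwer  # third-party WER implementation, assumed installed


def wer_by_country(metadata_csv: str) -> dict[str, float]:
    """Compute corpus-level WER per country of origin."""
    refs, hyps = defaultdict(list), defaultdict(list)
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            refs[row["country"]].append(row["reference_text"])
            hyps[row["country"]].append(row["hypothesis_text"])
    # jiwer aggregates errors over all utterances in each list, so each
    # country's score is a corpus-level WER, not an average of file-level WERs.
    return {c: jiwer.wer(refs[c], hyps[c]) for c in refs}


if __name__ == "__main__":
    for country, score in sorted(wer_by_country("earnings22_results.csv").items()):
        print(f"{country}: WER = {score:.3f}")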
dc.language.isoen
dc.publisherWydawnictwo Uniwersytetu Łódzkiegopl
dc.relation.ispartofseriesResearch in Language;3en
dc.rights.urihttps://creativecommons.org/licenses/by-nc-nd/4.0
dc.subjectaccentsen
dc.subjectdialectsen
dc.subjectspeech recognitionen
dc.subjectbiasen
dc.subjectmultilingualen
dc.titleAccents in Speech Recognition through the Lens of a World Englishes Evaluation Seten
dc.typeArticle
dc.page.number225-244
dc.contributor.authorAffiliationDel Río, Miguel - Rev.comen
dc.contributor.authorAffiliationMiller, Corey - Rev.comen
dc.contributor.authorAffiliationProfant, Ján - Rev.comen
dc.contributor.authorAffiliationDrexler-Fox, Jennifer - Rev.comen
dc.contributor.authorAffiliationMcNamara, Quinn - Rev.comen
dc.contributor.authorAffiliationBhandari, Nishchal - Rev.comen
dc.contributor.authorAffiliationDelworth, Natalie - Rev.comen
dc.contributor.authorAffiliationPirkin, Ilya - Rev.comen
dc.contributor.authorAffiliationJetté, Miguel - Rev.comen
dc.contributor.authorAffiliationChandra, Shipra - Walgreensen
dc.contributor.authorAffiliationHa, Peter - Northwestern Universityen
dc.contributor.authorAffiliationWesterman, Ryan - Zoomen
dc.referencesArdila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., & Weber, G. (2020). Common Voice: A massively-multilingual speech corpus. Proceedings of the 12th Language Resources and Evaluation Conference, pp. 4218-4222.en
dc.referencesArons, B. (1992). A Review of the Cocktail Party Effect. AVIOS.en
dc.referencesBaese-Berk, M. M., McLaughlin, D. J. & McGowan, K. B. (2020). Perception of non-native speech. Language and Linguistics Compass, pp. 1-20. https://doi.org/10.1111/lnc3.12375en
dc.referencesChang, X., Qian, Y., Yu, K. & Watanabe, S. (2019). End-To-End Monaural Multi-Speaker ASR System Without Pretraining. Proceedings of ICASSP. https://doi.org/10.1109/ICASSP.2019.8682822en
dc.referencesChiswick, B. R. and Miller, P. W. (2005). Linguistic distance: A quantitative measure of the distance between English and other languages. Journal of Multilingual and Multicultural Development, vol. 26, no. 1, pp. 1–11. https://doi.org/10.1080/14790710508668395en
dc.referencesDel Río, M., Delworth, N., Westerman, R., Huang, M., Bhandari, N., Palakapilly, J., McNamara, Q., Dong, J., Żelasko, P., and Jetté, M. (2021). “Earnings-21: A Practical Benchmark for ASR in the Wild,” in Proc. Interspeech 2021, pp. 3465–3469. https://doi.org/10.21437/Interspeech.2021-1915en
dc.referencesDrexler-Fox, J. & Delworth, N. (2022). Improving contextual recognition of rare words with an alternate spelling prediction model. Proceedings of Interspeech.en
dc.referencesGabler, P., Geiger, B. C., Schuppler, B. & Kern, R. (2023). Reconsidering Read and Spontaneous Speech: Causal Perspectives on the Generation of Training Data for Automatic Speech Recognition. Information, 14, 137. https://doi.org/10.3390/info14020137en
dc.referencesGandhi, S., Von Platen, P., & Rush, A. M. (2022). ESB: A Benchmark for Multi-Domain End-to-End Speech Recognition. arXiv preprint arXiv:2210.13352.en
dc.referencesGoldwater, S., Jurafsky, D., and Manning, C. D. (2010). “Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase speech recognition error rates,” Speech Communication, vol. 52, no. 3, pp. 181–200. https://doi.org/10.1016/j.specom.2009.10.001en
dc.referencesGood, P. I. (2004). Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer Series in Statistics. Springer-Verlag.en
dc.referencesHazirbas, C., Bitton, J., Dolhansky, B., Pan, J., Gordo, A. & Ferrer, C. C. (2021). Towards measuring fairness in AI: the Casual Conversations dataset. ArXiv.en
dc.referencesHazirbas, C., Bang, Y., Yu, T., Assar, P., Porgali, B., Albiero, V., Hermanek, S., Pan, J., McReynolds, E., Bogen, M., Fung, P. & Ferrer, C. C. (2022). Casual Conversations v2: Designing a large consent-driven dataset to measure algorithmic bias and robustness. https://doi.org/10.1109/TBIOM.2021.3132237en
dc.referencesHinsvark, A. J., Delworth, N., Del Río, M., McNamara, Q., Dong, J., Westerman, R., Huang, M., Palakapilly, J., Drexler, J., Pirkin, I., Bhandari, N. & Jetté, M. (2021). Accented Speech Recognition: A Survey. ArXiv.en
dc.referencesHolmes, J. (2013). An introduction to sociolinguistics. Routledge. https://doi.org/10.4324/9781315833057en
dc.referencesIncera, S., Shah, A. P., McLennan, C. T. & Wetzel, M. T. (2017). Sentence context influences the subjective perception of foreign accents. Acta Psychologica 172, pp. 71-76.en
dc.referencesJones, T. (2015). Toward a description of African American Vernacular English dialect regions using “Black Twitter”. American Speech, Vol. 90, No. 4. https://doi.org/10.1215/00031283-3442117en
dc.referencesKachru, B. (1992). The Other Tongue: English across cultures. University of Illinois Press.en
dc.referencesKang, Y. M. & Zhou, Y. (2020). Fast and robust unsupervised contextual biasing for speech recognition. ArXiv.en
dc.referencesKoenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D. & Goel, S. (2020). Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, vol. 117, no. 14, pp. 7684–7689. https://doi.org/10.1073/pnas.1915768117en
dc.referencesKosmala, L., and Crible, L. (2021). The dual status of filled pauses: Evidence from genre, proficiency and co-occurrence. Language and Speech, May 2021. [Online]. Available: https://halshs.archives-ouvertes.fr/halshs-03225622 https://doi.org/10.1177/00238309211010862en
dc.referencesLevi, S. V., Winters, S. J. & Pisoni, D. B. (2007). Speaker-independent factors affecting the perception of foreign accent in a second language. Journal of the Acoustical Society of America, 121(4), pp. 2327-2338. https://doi.org/10.1121/1.2537345en
dc.referencesLippi-Green, R. (2012). English with an Accent: Language, Ideology and Discrimination in the United States. Routledge. https://doi.org/10.4324/9780203348802en
dc.referencesMeyer, J., Rauchenstein, L., Eisenberg, J. D. & Howell, N. (2020). Artie bias corpus: An open dataset for detecting demographic bias in speech applications. Proceedings of the 12th Language Resources and Evaluation Conference, pp. 6462–6468.en
dc.referencesMiller, C., Tzoukermann, E., Doyon, J., & Mallard, E. (2021). Corpus creation and evaluation for speech-to-text and speech translation. Proceedings of Machine Translation Summit XVIII: Users and Providers Track, pp. 44–53.en
dc.referencesO’Neill, P. K., Lavrukhin, V., Majumdar, S., Noroozi, V., Zhang, Y., Kuchaiev, O., Balam, J., Dovzhenko, Y., Freyberg, K., Shulman, M. D., Ginsburg, B., Watanabe, S., and Kucsko, G. (2021). “SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition,” in Proc. Interspeech, pp. 1434–1438.en
dc.referencesPalanica, A., Thommandram, A., Lee, A., Li, M. & Fossat, Y. (2019). Do you understand the words that are comin' outta my mouth? Voice assistant comprehension of medication names. NPJ Digital Medicine, vol. 55, pp. 1-6. https://doi.org/10.1038/s41746-019-0133-xen
dc.referencesPharies, D. A. (2007). A Brief History of the Spanish Language. University of Chicago Press.en
dc.referencesPorgali, B., Albiero, V., Ryda, J., Ferrer, C. C. & Hazirbas, C. (2023). The Casual Conversations v2 Dataset. ArXiv.en
dc.referencesRadford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.en
dc.referencesRalli, A. (2020). Greek in Contact with Romance. In M. Loporcaro & F. Gardani (eds.) The Oxford Encyclopedia of Romance Linguistics. Oxford. https://doi.org/10.1093/acrefore/9780199384655.013.422en
dc.referencesReid, K. & Williams, E. T. (2023). Common Voice and accent choice: Data contributors self-describe their spoken accents in diverse ways. EasyChair. https://doi.org/10.1145/3617694.3623258en
dc.referencesTrinh, V. A., Ghahremani, P., King, B., Droppo, J., Stolcke, A. & Maas, R. (2022). Reducing geographic disparities in automatic speech recognition via elastic weight consolidation. Proceedings of Interspeech. https://doi.org/10.21437/Interspeech.2022-11063en
dc.referencesvan Rooy, B. (2020). English in Africa. In D. Schreier, M. Hundt & E. W. Schneider (eds.), The Cambridge Handbook of World Englishes, pp. 210-235. Cambridge University Press.en
dc.referencesWagner, E., Liao, Y.-F. & Wagner, S. (2021). Authenticated Spoken Texts for L2 Listening Tests. Language Assessment Quarterly 18:3, pp. 205-227. https://doi.org/10.1080/15434303.2020.1860057en
dc.referencesWells, J. C. (1982). Accents of English: Volume 3: Beyond the British Isles. Cambridge University Press. https://doi.org/10.1017/CBO9780511611766en
dc.referencesWrembel, M., Gut, U., Kopečková, R. & Balas, A. (2020). Cross-linguistic interactions in third language acquisition: Evidence from multi-feature analysis of speech perception. Languages 5:52, pp. 1-21. https://doi.org/10.3390/languages5040052en
dc.referencesYang, X., Audhkhasi, K., Rosenberg, A., Thomas, S., Ramabhadran, B., and Hasegawa-Johnson, M. (2018). “Joint modeling of accents and acoustics for multi-accent speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 1–5. https://doi.org/10.1109/ICASSP.2018.8462557en
dc.referencesXiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D. & Zweig, G. (2016). Achieving human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing. https://doi.org/10.1109/TASLP.2017.2756440en
dc.referencesZhou, L., Li, J., Sun, E. & Liu, S. (2022). A Configurable Multilingual Model is all you need to recognize all languages. Proceedings of ICASSP. https://doi.org/10.1109/ICASSP43922.2022.9747905en
dc.contributor.authorEmailDel Río, Miguel - miguel.delrio@rev.com
dc.contributor.authorEmailMiller, Corey - corey.miller@rev.com
dc.contributor.authorEmailProfant, Ján - ril@uni.lodz.pl
dc.contributor.authorEmailDrexler-Fox, Jennifer - ril@uni.lodz.pl
dc.contributor.authorEmailMcNamara, Quinn - ril@uni.lodz.pl
dc.contributor.authorEmailBhandari, Nishchal - ril@uni.lodz.pl
dc.contributor.authorEmailDelworth, Natalie - ril@uni.lodz.pl
dc.contributor.authorEmailPirkin, Ilya - ril@uni.lodz.pl
dc.contributor.authorEmailJetté, Miguel - ril@uni.lodz.pl
dc.contributor.authorEmailChandra, Shipra - ril@uni.lodz.pl
dc.contributor.authorEmailHa, Peter - ril@uni.lodz.pl
dc.contributor.authorEmailWesterman, Ryan - ril@uni.lodz.pl
dc.identifier.doi10.18778/1731-7533.21.3.02
dc.relation.volume21

