
dc.contributor.authorDel Río, Miguel
dc.contributor.authorMiller, Corey
dc.contributor.authorProfant, Ján
dc.contributor.authorDrexler-Fox, Jennifer
dc.contributor.authorMcNamara, Quinn
dc.contributor.authorBhandari, Nishchal
dc.contributor.authorDelworth, Natalie
dc.contributor.authorPirkin, Ilya
dc.contributor.authorJetté, Miguel
dc.contributor.authorChandra, Shipra
dc.contributor.authorHa, Peter
dc.contributor.authorWesterman, Ryan
dc.date.accessioned2024-01-03T09:40:18Z
dc.date.available2024-01-03T09:40:18Z
dc.date.issued2023-12-28
dc.identifier.issn1731-7533
dc.identifier.urihttp://hdl.handle.net/11089/49005
dc.description.abstractAutomatic Speech Recognition (ASR) systems generalize poorly on accented speech, creating bias issues for users and providers. The phonetic and linguistic variability of accents presents challenges for ASR systems in both data collection and modeling strategies. We present two promising approaches to accented speech recognition, custom vocabulary and multilingual modeling, and highlight key challenges in the space. Among these, the lack of a standard benchmark makes research and comparison difficult. We address this with a novel corpus of accented speech: Earnings-22, a 125-file, 119-hour corpus of English-language earnings calls gathered from global companies. We compare commercial models, showing variation in performance when country of origin is taken into consideration, and demonstrate targeted improvements using the methods we introduce.en
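A minimal sketch (not the authors' code) of the kind of per-country evaluation the abstract describes: it groups ASR hypotheses for a corpus such as Earnings-22 by speaker country of origin and computes a word error rate per group. The CSV layout, column names, file name, and the use of the third-party jiwer library are illustrative assumptions, not artifacts of the paper.

# Minimal sketch, assuming a metadata CSV with columns:
# file_id, country, reference_text, hypothesis_text.
from collections import defaultdict
import csv

import jiwer  # third-party WER implementation, assumed installed


def wer_by_country(metadata_csv: str) -> dict[str, float]:
    """Compute corpus-level WER per country of origin."""
    refs, hyps = defaultdict(list), defaultdict(list)
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            refs[row["country"]].append(row["reference_text"])
            hyps[row["country"]].append(row["hypothesis_text"])
    # jiwer aggregates errors over all utterances in each list, so each
    # country's score is a corpus-level WER, not an average of file-level WERs.
    return {c: jiwer.wer(refs[c], hyps[c]) for c in refs}


if __name__ == "__main__":
    for country, score in sorted(wer_by_country("earnings22_results.csv").items()):
        print(f"{country}: WER = {score:.3f}")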
dc.language.isoen
dc.publisherWydawnictwo Uniwersytetu Łódzkiegopl
dc.relation.ispartofseriesResearch in Language;3en
dc.rights.urihttps://creativecommons.org/licenses/by-nc-nd/4.0
dc.subjectaccentsen
dc.subjectdialectsen
dc.subjectspeech recognitionen
dc.subjectbiasen
dc.subjectmultilingualen
dc.titleAccents in Speech Recognition through the Lens of a World Englishes Evaluation Seten
dc.typeArticle
dc.page.number225-244
dc.contributor.authorAffiliationDel Río, Miguel - Rev.comen
dc.contributor.authorAffiliationMiller, Corey - Rev.comen
dc.contributor.authorAffiliationProfant, Ján - Rev.comen
dc.contributor.authorAffiliationDrexler-Fox, Jennifer - Rev.comen
dc.contributor.authorAffiliationMcNamara, Quinn - Rev.comen
dc.contributor.authorAffiliationBhandari, Nishchal - Rev.comen
dc.contributor.authorAffiliationDelworth, Natalie - Rev.comen
dc.contributor.authorAffiliationPirkin, Ilya - Rev.comen
dc.contributor.authorAffiliationJetté, Miguel - Rev.comen
dc.contributor.authorAffiliationChandra, Shipra - Walgreensen
dc.contributor.authorAffiliationHa, Peter - Northwestern Universityen
dc.contributor.authorAffiliationWesterman, Ryan - Zoomen
dc.referencesArdila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., & Weber, G. (2020). Common Voice: A massively-multilingual speech corpus. Proceedings of the 12th Language Resources and Evaluation Conference, pp. 4218-4222.en
dc.referencesArons, B. (1992). A Review of the Cocktail Party Effect. AVIOS.en
dc.referencesBaese-Berk, M. M., McLaughlin, D. J. & McGowan, K. B. (2020). Perception of non-native speech. Language and Linguistics Compass, pp. 1-20. https://doi.org/10.1111/lnc3.12375en
dc.referencesChang, X., Qian, Y., Yu, K. & Watanabe, S. (2019). End-To-End Monaural Multi-Speaker ASR System Without Pretraining. Proceedings of ICASSP. https://doi.org/10.1109/ICASSP.2019.8682822en
dc.referencesChiswick, B. R. and Miller, P. W. (2005). Linguistic distance: A quantitative measure of the distance between English and other languages. Journal of Multilingual and Multicultural Development, vol. 26, no. 1, pp. 1–11. https://doi.org/10.1080/14790710508668395en
dc.referencesDel Río, M., Delworth, N., Westerman, R., Huang, M., Bhandari, N., Palakapilly, J., McNamara, Q., Dong, J., Żelasko, P., and Jetté, M. (2021). “Earnings-21: A Practical Benchmark for ASR in the Wild,” in Proc. Interspeech 2021, pp. 3465–3469. https://doi.org/10.21437/Interspeech.2021-1915en
dc.referencesDrexler-Fox, J. & Delworth, N. (2022). Improving contextual recognition of rare words with an alternate spelling prediction model. Proceedings of Interspeech.en
dc.referencesGabler, P., Geiger, B. C., Schuppler, B. & Kern, R. (2023). Reconsidering Read and Spontaneous Speech: Causal Perspectives on the Generation of Training Data for Automatic Speech Recognition. Information, 14, 137. https://doi.org/10.3390/info14020137en
dc.referencesGandhi, S., Von Platen, P., & Rush, A. M. (2022). ESB: A Benchmark for Multi-Domain End-to-End Speech Recognition. arXiv preprint arXiv:2210.13352.en
dc.referencesGoldwater, S., Jurafsky, D., and Manning, C. D. (2010). “Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase speech recognition error rates,” Speech Communication, vol. 52, no. 3, pp. 181–200. https://doi.org/10.1016/j.specom.2009.10.001en
dc.referencesGood, P. I. (2004). Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer Series in Statistics. Springer-Verlag.en
dc.referencesHazirbas, C., Bitton, J., Dolhansky, B., Pan, J., Gordo, A. & Ferrer, C. C. (2021). Towards measuring fairness in AI: the Casual Conversations dataset. ArXiv.en
dc.referencesHazirbas, C., Bang, Y., Yu, T., Assar, P., Porgali, B., Albiero, V., Hermanek, S., Pan, J., McReynolds, E., Bogen, M., Fung, P. & Ferrer, C. C. (2022). Casual Conversations v2: Designing a large consent-driven dataset to measure algorithmic bias and robustness. https://doi.org/10.1109/TBIOM.2021.3132237en
dc.referencesHinsvark, A. J., Delworth, N., Del Río, M., McNamara, Q., Dong, J., Westerman, R., Huang, M., Palakapilly, J., Drexler, J., Pirkin, I., Bhandari, N. & Jetté, M. (2021). Accented Speech Recognition: A Survey. ArXiv.en
dc.referencesHolmes, J. (2013). An introduction to sociolinguistics. Routledge. https://doi.org/10.4324/9781315833057en
dc.referencesIncera, S., Shah, A. P., McLennan, C. T. & Wetzel, M. T. (2017). Sentence context influences the subjective perception of foreign accents. Acta Psychologica 172, pp. 71-76.en
dc.referencesJones, T. (2015). Toward a description of African American Vernacular English dialect regions using “Black Twitter”. American Speech, Vol. 90, No. 4. https://doi.org/10.1215/00031283-3442117en
dc.referencesKachru, B. (1992). The Other Tongue: English across cultures. University of Illinois Press.en
dc.referencesKang, Y. M. & Zhou, Y. (2020). Fast and robust unsupervised contextual biasing for speech recognition. ArXiv.en
dc.referencesKoenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D. & Goel, S. (2020). Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, vol. 117, no. 14, pp. 7684–7689. https://doi.org/10.1073/pnas.1915768117en
dc.referencesKosmala, L., and Crible, L. (2021). The dual status of filled pauses: Evidence from genre, proficiency and co-occurrence. Language and Speech, May 2021. [Online]. Available: https://halshs.archives-ouvertes.fr/halshs-03225622 https://doi.org/10.1177/00238309211010862en
dc.referencesLevi, S. V., Winters, S. J. & Pisoni, D. B. (2007). Speaker-independent factors affecting the perception of foreign accent in a second language. Journal of the Acoustical Society of America, 121(4), pp. 2327-2338. https://doi.org/10.1121/1.2537345en
dc.referencesLippi-Green, R. (2012). English with an Accent: Language, Ideology and Discrimination in the United States. Routledge. https://doi.org/10.4324/9780203348802en
dc.referencesMeyer, J., Rauchenstein, L., Eisenberg, J. D. & Howell, N. (2020). Artie bias corpus: An open dataset for detecting demographic bias in speech applications. Proceedings of the 12th Language Resources and Evaluation Conference, pp. 6462–6468.en
dc.referencesMiller, C., Tzoukermann, E., Doyon, J., & Mallard, E. (2021). Corpus creation and evaluation for speech-to-text and speech translation. Proceedings of Machine Translation Summit XVIII: Users and Providers Track, pp. 44–53.en
dc.referencesO’Neill, P. K., Lavrukhin, V., Majumdar, S., Noroozi, V., Zhang, Y., Kuchaiev, O., Balam, J., Dovzhenko, Y., Freyberg, K., Shulman, M. D., Ginsburg, B., Watanabe, S., and Kucsko, G. (2021). “SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition,” in Proc. Interspeech, pp. 1434–1438.en
dc.referencesPalanica, A., Thommandram, A., Lee, A., Li, M. & Fossat, Y. (2019). Do you understand the words that are comin' outta my mouth? Voice assistant comprehension of medication names. NPJ Digital Medicine, vol. 55, pp. 1-6. https://doi.org/10.1038/s41746-019-0133-xen
dc.referencesPharies, D. A. (2007). A Brief History of the Spanish Language. University of Chicago Press.en
dc.referencesPorgali, B., Albiero, V., Ryda, J., Ferrer, C. C. & Hazirbas, C. (2023). The Casual Conversations v2 Dataset. ArXiv.en
dc.referencesRadford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.en
dc.referencesRalli, A. (2020). Greek in Contact with Romance. In M. Loporcaro & F. Gardani (eds.) The Oxford Encyclopedia of Romance Linguistics. Oxford. https://doi.org/10.1093/acrefore/9780199384655.013.422en
dc.referencesReid, K. & Williams, E. T. (2023). Common Voice and accent choice: Data contributors self-describe their spoken accents in diverse ways. EasyChair. https://doi.org/10.1145/3617694.3623258en
dc.referencesTrinh, V. A., Ghahremani, P., King, B., Droppo, J., Stolcke, A. & Maas, R. (2022). Reducing geographic disparities in automatic speech recognition via elastic weight consolidation. Proceedings of Interspeech. https://doi.org/10.21437/Interspeech.2022-11063en
dc.referencesvan Rooy, B. (2020). English in Africa. In D. Schreier, M. Hundt & E. W. Schneider (eds.), The Cambridge Handbook of World Englishes, pp. 210-235. Cambridge University Press.en
dc.referencesWagner, E., Liao, Y.-F. & Wagner, S. (2021). Authenticated Spoken Texts for L2 Listening Tests. Language Assessment Quarterly 18:3, pp. 205-227. https://doi.org/10.1080/15434303.2020.1860057en
dc.referencesWells, J. C. (1982). Accents of English: Volume 3: Beyond the British Isles. Cambridge University Press. https://doi.org/10.1017/CBO9780511611766en
dc.referencesWrembel, M., Gut, U., Kopečková, R. & Balas, A. (2020). Cross-linguistic interactions in third language acquisition: Evidence from multi-feature analysis of speech perception. Languages 5:52, pp. 1-21. https://doi.org/10.3390/languages5040052en
dc.referencesYang, X., Audhkhasi, K., Rosenberg, A., Thomas, S., Ramabhadran, B., and Hasegawa-Johnson, M. (2018). “Joint modeling of accents and acoustics for multi-accent speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 1–5. https://doi.org/10.1109/ICASSP.2018.8462557en
dc.referencesXiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D. & Zweig, G. (2016). Achieving human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing. https://doi.org/10.1109/TASLP.2017.2756440en
dc.referencesZhou, L., Li, J., Sun, E. & Liu, S. (2022). A Configurable Multilingual Model is all you need to recognize all languages. Proceedings of ICASSP. https://doi.org/10.1109/ICASSP43922.2022.9747905en
dc.contributor.authorEmailDel Río, Miguel - miguel.delrio@rev.com
dc.contributor.authorEmailMiller, Corey - corey.miller@rev.com
dc.contributor.authorEmailProfant, Ján - ril@uni.lodz.pl
dc.contributor.authorEmailDrexler-Fox, Jennifer - ril@uni.lodz.pl
dc.contributor.authorEmailMcNamara, Quinn - ril@uni.lodz.pl
dc.contributor.authorEmailBhandari, Nishchal - ril@uni.lodz.pl
dc.contributor.authorEmailDelworth, Natalie - ril@uni.lodz.pl
dc.contributor.authorEmailPirkin, Ilya - ril@uni.lodz.pl
dc.contributor.authorEmailJetté, Miguel - ril@uni.lodz.pl
dc.contributor.authorEmailChandra, Shipra - ril@uni.lodz.pl
dc.contributor.authorEmailHa, Peter - ril@uni.lodz.pl
dc.contributor.authorEmailWesterman, Ryan - ril@uni.lodz.pl
dc.identifier.doi10.18778/1731-7533.21.3.02
dc.relation.volume21

