Mind the Gap: Language Data, Their Producers, and the Scientific Process (Crazy New Idea)

Author: Tobias Weber






Author Details

Tobias Weber
  • Ludwig-Maximilians-Universität München, Germany

Cite As

Tobias Weber. Mind the Gap: Language Data, Their Producers, and the Scientific Process (Crazy New Idea). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 6:1-6:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/OASIcs.LDK.2021.6

Abstract

This paper discusses the role of low-resource languages in NLP through the lens of different stakeholders. It argues that the current "consumerist approach" to language data reinforces a vicious circle which increases the technological exclusion of minority communities. Researchers' decisions directly affect these processes to the detriment of minorities and practitioners engaging in language work in these communities. In line with the conference topic, the paper concludes with strategies and prerequisites for creating a positive feedback loop in our research benefiting language work within the next decade.

Subject Classification

ACM Subject Classification
  • Computing methodologies → Language resources
  • Computing methodologies → Natural language processing
  • Social and professional topics → Cultural characteristics
  • Software and its engineering → Software creation and management
  • Applied computing → Language translation

Keywords
  • minority languages
  • data integration
  • sociology of technology
  • documentary linguistics
  • exclusion

