Comparison of Segmentable Units as Indicators of Two Texts Being Parallel (Short Paper)

Canosa, Afonso Xavier

doi:10.4230/OASIcs.SLATE.2018.16

File

OASIcs.SLATE.2018.16.pdf

Filesize: 0.58 MB
7 pages

Document Identifiers

DOI: 10.4230/OASIcs.SLATE.2018.16
URN: urn:nbn:de:0030-drops-92747

Author Details

Afonso Xavier Canosa

University of Santiago de Compostela, Galiza, Spain

Cite As

Afonso Xavier Canosa. Comparison of Segmentable Units as Indicators of Two Texts Being Parallel (Short Paper). In 7th Symposium on Languages, Applications and Technologies (SLATE 2018). Open Access Series in Informatics (OASIcs), Volume 62, pp. 16:1-16:7, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/OASIcs.SLATE.2018.16

Abstract

A bitext built from a Portuguese historical text, Fernão Mendes Pinto's Pilgrimage, and its English translation serves as a case study to describe the creation of a parallel corpus and to investigate which linguistic and textual units are the best indicators of alignability. Building the corpus involves preparing the transcriptions, annotation, segmentation and sentence alignment. Once the bitext is ready, the corpus is used to examine which units are most relevant for predicting that the two texts are parallel. The units compared range from the largest content units, chapters, down to sentences, word types, tokens and characters. Characters, despite being the unit with the least textual and linguistic significance, were found to be the best indicator that the two texts are alignable.
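As a rough illustration of what comparing segmentable units across two texts can look like, the following Python sketch counts sentences, tokens, word types and characters in each text and reports, per unit, the ratio of the smaller count to the larger one. It is not the procedure used in the paper: the function names `unit_counts` and `count_ratios`, the naive regular-expression segmentation, the ratio measure and the toy sentence pair are all assumptions made for this example.

```python
import re


def unit_counts(text):
    """Count segmentable units in a text using deliberately naive rules."""
    # Assumption: a sentence ends at one or more of . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a token is any run of word characters, case-folded.
    tokens = re.findall(r"\w+", text.lower())
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "characters": len(text),
    }


def count_ratios(source, target):
    """Return, per unit, smaller count / larger count (1.0 means equal counts)."""
    a, b = unit_counts(source), unit_counts(target)
    ratios = {}
    for unit in a:
        lo, hi = sorted((a[unit], b[unit]))
        ratios[unit] = lo / hi if hi else 1.0
    return ratios


# Invented toy pair for illustration only; not taken from the corpus.
pt = "O navio partiu de Lisboa. Chegou a Goa meses depois."
en = "The ship left Lisbon. It reached Goa months later."

for unit, ratio in count_ratios(pt, en).items():
    print(f"{unit}: {ratio:.2f}")
```

A ratio close to 1.0 for a given unit means the two texts contain similar numbers of that unit. Deciding which unit is the most reliable indicator would require comparing such figures over many aligned and non-aligned text pairs, which this sketch does not attempt.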

Subject Classification

ACM Subject Classification

Computing methodologies → Machine translation

Keywords

parallel corpora
text alignment
bitexts
