Comparison of Segmentable Units as Indicators of Two Texts Being Parallel (Short Paper)

Author Afonso Xavier Canosa




File

OASIcs.SLATE.2018.16.pdf
  • Filesize: 0.58 MB
  • 7 pages

Author Details

Afonso Xavier Canosa
  • University of Santiago de Compostela, Galiza, Spain

Cite As

Afonso Xavier Canosa. Comparison of Segmentable Units as Indicators of Two Texts Being Parallel (Short Paper). In 7th Symposium on Languages, Applications and Technologies (SLATE 2018). Open Access Series in Informatics (OASIcs), Volume 62, pp. 16:1-16:7, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/OASIcs.SLATE.2018.16

Abstract

A bitext produced from a Portuguese historical text, Fernão Mendes Pinto's Pilgrimage, and its English translation serves as a case study to describe the creation of a parallel corpus and to investigate which linguistic and textual units are the best indicators of alignability. Building the corpus involves preparing transcriptions, annotation, segmentation and sentence alignment. Once the bitext is ready, the corpus is used to examine which units are most relevant for predicting that both texts are parallel. Of the units compared, from the largest content units (chapters) down to sentences, word types, tokens and characters, the last, despite being the unit with the least textual and linguistic significance, was found to be the best indicator of both texts being alignable.
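The abstract does not say how the units were compared, so the following is only an illustrative sketch and not the paper's method. It assumes naive regular-expression tokenization and sentence splitting, and it compares two texts by the ratio of their unit counts. The function names `unit_counts` and `length_ratios` are hypothetical.

```python
import re


def unit_counts(text):
    """Count segmentable units in a text using naive, assumed rules."""
    # Tokens: runs of word characters, lowercased so types are case-insensitive.
    tokens = re.findall(r"\w+", text.lower())
    # Sentences: non-empty spans between runs of ., ! or ? (a rough heuristic).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "characters": len(text),
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "sentences": len(sentences),
    }


def length_ratios(source, target):
    """Return target/source count ratios for each unit present in the source."""
    src = unit_counts(source)
    tgt = unit_counts(target)
    return {unit: tgt[unit] / src[unit] for unit in src if src[unit]}
```

Under these assumptions, a unit whose ratio stays stable across aligned segments (chapters, for instance) would be a candidate indicator that the two texts are parallel. Chapter counts are left out because they depend on the markup of the transcription.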

Subject Classification

ACM Subject Classification
  • Computing methodologies → Machine translation
Keywords
  • parallel corpora
  • text alignment
  • bitexts
