An Automatic Partitioning of Gutenberg.org Texts

Authors: Davide Picca, Cyrille Gay-Crosier




File

OASIcs.LDK.2021.35.pdf
  • Filesize: 0.61 MB
  • 9 pages


Author Details

Davide Picca
  • University of Lausanne, Switzerland
Cyrille Gay-Crosier
  • University of Lausanne, Switzerland

Cite As

Davide Picca and Cyrille Gay-Crosier. An Automatic Partitioning of Gutenberg.org Texts. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 35:1-35:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/OASIcs.LDK.2021.35

Abstract

Over the last ten years, the automatic partitioning of texts has attracted the interest of the research community. Automatically identifying the parts of a text can make textual analysis faster and easier. We present exploratory work on identifying the parts of multi-part books. As a first step, we focus on Gutenberg.org, one of the projects that has received the largest public support in recent years. The purpose of this article is to present a preliminary system that automatically classifies parts of texts into 35 semantic categories. The system achieved an accuracy of more than 93% on the test set. We plan to extend this effort to other repositories in the future.
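To make the task concrete, the following minimal sketch shows what classifying parts of a book into categories and measuring test-set accuracy can look like. It is an illustration under stated assumptions, not the system described in the paper: the abstract does not name the classifier or list the 35 categories, so the multinomial Naive Bayes model, the three labels (`preface`, `table_of_contents`, `body`) and the toy training texts are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter(labels)      # label -> number of parts
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of parts whose predicted label matches the gold label."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Hypothetical toy data: (text of a book part, category label).
train = [
    ("Preface. The author wishes to thank the reader", "preface"),
    ("Preface to the second edition by the author", "preface"),
    ("Contents. Chapter I page 1 Chapter II page 20", "table_of_contents"),
    ("Table of contents. Part one page 3 Part two page 45", "table_of_contents"),
    ("It was a dark night and the wind howled over the moor", "body"),
    ("She walked down the road and said nothing to him", "body"),
]
test = [
    ("Preface by the author of this edition", "preface"),
    ("Contents. Part three page 60", "table_of_contents"),
    ("He said the night was dark", "body"),
]

model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
test_accuracy = accuracy(model, [t for t, _ in test], [y for _, y in test])
print(f"test accuracy: {test_accuracy:.2f}")
```

Accuracy here is simply the share of held-out parts that receive the correct label, which is the kind of figure the abstract reports for its test set. A real system would need many more categories, far more data and richer features than this toy example.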

Subject Classification

ACM Subject Classification
  • Computing methodologies
  • Computing methodologies → Language resources
Keywords
  • Digital Humanities
  • Machine Learning
  • Corpora

