An Automatic Partitioning of Gutenberg.org Texts

Picca, Davide; Gay-Crosier, Cyrille

doi:10.4230/OASIcs.LDK.2021.35

Subject Classification

ACM Subject Classification

Computing methodologies
Computing methodologies → Language resources

Keywords

Digital Humanities
Machine Learning
Corpora

Abstract

Over the last 10 years, the automatic partitioning of texts has attracted the interest of the community. Automatically identifying the parts of a text can make textual analysis faster and easier. We introduce here exploratory work on multi-part book identification. As a first attempt, we focus on Gutenberg.org, one of the projects that have received the largest public support in recent years. This article presents a preliminary system that automatically classifies parts of texts into 35 semantic categories. The system achieved an accuracy of more than 93% on the test set. We plan to extend this effort to other repositories in the future.

Cite As

Davide Picca and Cyrille Gay-Crosier. An Automatic Partitioning of Gutenberg.org Texts. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 35:1-35:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/OASIcs.LDK.2021.35

Author Details

Davide Picca

University of Lausanne, Switzerland

Cyrille Gay-Crosier

University of Lausanne, Switzerland
