Derzis: A Path Aware Linked Data Crawler

Authors André Fernandes dos Santos, José Paulo Leal



Author Details

André Fernandes dos Santos
  • CRACS & INESC Tec LA, Faculty of Sciences, University of Porto, Portugal
José Paulo Leal
  • CRACS & INESC Tec LA, Faculty of Sciences, University of Porto, Portugal

Cite As

André Fernandes dos Santos and José Paulo Leal. Derzis: A Path Aware Linked Data Crawler. In 10th Symposium on Languages, Applications and Technologies (SLATE 2021). Open Access Series in Informatics (OASIcs), Volume 94, pp. 2:1-2:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/OASIcs.SLATE.2021.2

Abstract

Consuming Semantic Web data presents several challenges, from the number of datasets it is composed of, to the (very) large size of some of those datasets and the uncertain availability of querying endpoints. According to its core principles, linked data can be accessed simply by dereferencing the IRIs of RDF resources. This is a lightweight alternative for both clients and servers when compared to dataset dumps or SPARQL endpoints. The linked data interface does not support complex querying, but using it recursively may suffice to gather information about RDF resources, or to extract the relevant sub-graph, which can then be processed and queried using other methods. We present Derzis, an open source semantic web crawler capable of traversing the linked data cloud starting from a set of seed resources. Derzis maintains information about the paths followed while crawling, which makes it possible to define property path-based restrictions on the crawling frontier.
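
The idea of recursive dereferencing with a path-restricted frontier can be illustrated with a minimal sketch. This is not Derzis' actual code or API: the in-memory `TOY_CLOUD`, the `dereference` and `crawl` functions, and the two limits used (maximum path length and maximum number of distinct predicates on a path) are all assumptions made for illustration, and a real crawler would issue HTTP requests and parse RDF instead of reading a dictionary.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Path = Tuple[str, ...]  # sequence of predicates followed from a seed

# Hypothetical stand-in for the linked data cloud: IRI -> triples
# that would be returned when that IRI is dereferenced.
TOY_CLOUD: Dict[str, List[Triple]] = {
    "ex:a": [("ex:a", "ex:knows", "ex:b"), ("ex:a", "ex:livesIn", "ex:x")],
    "ex:b": [("ex:b", "ex:knows", "ex:c")],
    "ex:c": [("ex:c", "ex:livesIn", "ex:y")],
    "ex:x": [("ex:x", "ex:partOf", "ex:z")],
}


def dereference(iri: str) -> List[Triple]:
    """Toy replacement for an HTTP GET on the IRI plus RDF parsing."""
    return TOY_CLOUD.get(iri, [])


def crawl(
    seeds: List[str], max_length: int, max_distinct_predicates: int
) -> Tuple[Set[Triple], List[str]]:
    """Breadth-first crawl that remembers the path leading to each resource.

    A resource joins the frontier only if the path reaching it satisfies
    both (assumed) restrictions.
    """
    collected: Set[Triple] = set()
    dereferenced: List[str] = []          # IRIs fetched, in order
    seen: Set[Tuple[str, Path]] = set()   # (IRI, path) states already expanded
    frontier = deque((seed, ()) for seed in seeds)

    while frontier:
        iri, path = frontier.popleft()
        if (iri, path) in seen:
            continue
        seen.add((iri, path))
        if iri not in dereferenced:
            dereferenced.append(iri)
        for s, p, o in dereference(iri):
            collected.add((s, p, o))
            new_path = path + (p,)
            # Path-based restriction on the crawling frontier.
            if (len(new_path) <= max_length
                    and len(set(new_path)) <= max_distinct_predicates):
                frontier.append((o, new_path))
    return collected, dereferenced


if __name__ == "__main__":
    triples, fetched = crawl(["ex:a"], max_length=2, max_distinct_predicates=1)
    print(fetched)
    print(len(triples))
```

With these limits, `ex:z` is never dereferenced because reaching it would require a path with two distinct predicates, and `ex:y` is skipped because its path would have length three. In the toy data every object is an IRI; a real crawler would also need to skip literals and blank nodes.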

Subject Classification

ACM Subject Classification
  • Information systems → Web crawling
  • Information systems → Structure and multilingual text search
Keywords
  • Semantic web
  • linked open data
  • RDF
  • crawler

