Relevance Feedback Search Based on Automatic Annotation and Classification of Texts

Authors: Rafael Leal, Joonas Kesäniemi, Mikko Koho, Eero Hyvönen



Author Details

Rafael Leal
  • HELDIG Centre for Digital Humanities, University of Helsinki, Finland
Joonas Kesäniemi
  • Semantic Computing Research Group (SeCo), Aalto University, Finland
Mikko Koho
  • HELDIG Centre for Digital Humanities, University of Helsinki, Finland
  • Semantic Computing Research Group (SeCo), Aalto University, Finland
Eero Hyvönen
  • Semantic Computing Research Group (SeCo), Aalto University, Finland
  • HELDIG Centre for Digital Humanities, University of Helsinki, Finland

Cite As

Rafael Leal, Joonas Kesäniemi, Mikko Koho, and Eero Hyvönen. Relevance Feedback Search Based on Automatic Annotation and Classification of Texts. In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 18:1-18:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/OASIcs.LDK.2021.18

Abstract

The idea behind Relevance Feedback Search (RFBS) is to build search queries as an iterative and interactive process in which they are gradually refined based on the results of the previous search round. This can be helpful in situations where the end user cannot easily formulate their information needs at the outset as a well-focused query, or more generally as a way to filter and focus search results. This paper concerns (1) a framework that integrates keyword extraction and unsupervised classification into the RFBS paradigm and (2) the application of this framework to the legal domain as a use case. We focus on the Natural Language Processing (NLP) methods underlying the framework and application, where an automatic annotation tool is used for extracting document keywords as ontology concepts, which are then transformed into word embeddings to form vectorial representations of the texts. An unsupervised classification system that employs similar techniques is also used in order to classify the documents into broad thematic classes. This classification functionality is evaluated using two different datasets. As the use case, we describe an application perspective in the semantic portal LawSampo - Finnish Legislation and Case Law on the Semantic Web. This online demonstrator uses a dataset of 82145 sections in 3725 statutes of Finnish legislation and another dataset that comprises 13470 court decisions.
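The pipeline described in the abstract (keywords mapped to embeddings, documents represented as vectors, unsupervised classification, and iterative query refinement) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-dimensional toy vectors, the document and class names, the mean-of-keyword-vectors document representation, the nearest-label classification rule, and the centroid-based query update (a common relevance feedback heuristic). None of these is taken from the paper's actual implementation.

```python
import math

# Toy 3-dimensional "embeddings"; all values are invented for illustration.
EMBEDDINGS = {
    "contract": [0.9, 0.1, 0.0],
    "lease":    [0.8, 0.2, 0.1],
    "tenant":   [0.7, 0.3, 0.0],
    "theft":    [0.1, 0.9, 0.1],
    "assault":  [0.0, 0.9, 0.2],
    "tax":      [0.1, 0.1, 0.9],
    "income":   [0.2, 0.0, 0.9],
}

def mean_vector(words):
    """Average the embeddings of the known words (assumed document representation)."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("none of the words has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(doc_vector, class_labels):
    """Pick the class whose label words are closest to the document vector.

    No training data is used: only the class label words themselves.
    """
    return max(
        class_labels,
        key=lambda name: cosine(doc_vector, mean_vector(class_labels[name])),
    )

def rank(query_vector, doc_vectors):
    """Return document ids ordered by decreasing similarity to the query."""
    return sorted(
        doc_vectors,
        key=lambda doc_id: cosine(query_vector, doc_vectors[doc_id]),
        reverse=True,
    )

def refine(query_vector, relevant_vectors, weight=0.5):
    """Move the query toward the centroid of documents marked relevant.

    `weight` is a hypothetical tuning parameter (0 keeps the query, 1 replaces it).
    """
    if not relevant_vectors:
        return list(query_vector)
    dim = len(query_vector)
    centroid = [
        sum(v[i] for v in relevant_vectors) / len(relevant_vectors)
        for i in range(dim)
    ]
    return [(1 - weight) * q + weight * c for q, c in zip(query_vector, centroid)]

# Hypothetical documents, each described by its extracted keywords.
DOC_KEYWORDS = {
    "doc_rental": ["lease", "tenant", "contract"],
    "doc_crime":  ["theft", "assault"],
    "doc_tax":    ["tax", "income"],
}
DOC_VECTORS = {doc_id: mean_vector(kw) for doc_id, kw in DOC_KEYWORDS.items()}

# Hypothetical thematic classes, each described only by label words.
CLASS_LABELS = {
    "civil law":    ["contract"],
    "criminal law": ["theft"],
    "tax law":      ["tax"],
}

if __name__ == "__main__":
    query = mean_vector(["contract"])
    print(rank(query, DOC_VECTORS))
    # One feedback round: the user marks doc_crime as relevant.
    query = refine(query, [DOC_VECTORS["doc_crime"]])
    print(rank(query, DOC_VECTORS))
    print({d: classify(v, CLASS_LABELS) for d, v in DOC_VECTORS.items()})
```

Each call to `refine` corresponds to one feedback round: the updated query vector is passed back to `rank`, so the result list shifts gradually toward the documents the user has marked as relevant.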

Subject Classification

ACM Subject Classification
  • Computing methodologies → Information extraction
  • Applied computing → Document searching
  • Information systems → Clustering and classification
Keywords
  • relevance feedback
  • keyword extraction
  • zero-shot text classification
  • word embeddings
  • LawSampo

