Encoder-Attention-Based Automatic Term Recognition (EA-ATR)

Authors: Sampritha H. Manjunath, John P. McCrae




File

OASIcs.LDK.2021.23.pdf
  • Filesize: 0.55 MB
  • 13 pages

Document Identifiers
  • DOI: 10.4230/OASIcs.LDK.2021.23

Author Details

Sampritha H. Manjunath
  • Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland
John P. McCrae
  • Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland

Acknowledgements

We would like to thank the reviewers for helpful comments and insightful feedback.

Cite As

Sampritha H. Manjunath and John P. McCrae. Encoder-Attention-Based Automatic Term Recognition (EA-ATR). In 3rd Conference on Language, Data and Knowledge (LDK 2021). Open Access Series in Informatics (OASIcs), Volume 93, pp. 23:1-23:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/OASIcs.LDK.2021.23

Abstract

Automatic Term Recognition (ATR) is the task of finding terminology in raw text. It involves mining candidate terms from the text, scoring them with measures such as frequency of occurrence, and ranking them by these scores. Current approaches often rely on statistics and regular expressions over part-of-speech tags to identify terms, but this is error-prone. We propose a deep learning technique to improve the identification of candidate term sequences, using Bidirectional Encoder Representations from Transformers (BERT) based embeddings to decide which sequences of words are terms. The model is trained on Wikipedia titles: we treat all Wikipedia titles as the positive set and random n-grams generated from raw text as a weak negative set. These positive and negative examples are used to train a classifier built on the Embed, Encode, Attend and Predict (EEAP) formulation, with BERT providing the embeddings. The model is then evaluated on domain-specific corpora: GENIA, with annotated biological terms, and Krapivin, scientific papers from the computer science domain.
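
As a concrete illustration of the approach the abstract describes, the following minimal PyTorch sketch assembles an EEAP-style binary term classifier over BERT embeddings. It is not the authors' implementation: the model name bert-base-cased, the BiLSTM encoder, the layer sizes, and the random_ngrams helper for weak negatives are all illustrative assumptions.

    import random
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    def random_ngrams(text, n_samples=5, max_n=4):
        """Weak negative examples: random n-grams drawn from raw text (illustrative)."""
        tokens = text.split()
        samples = []
        for _ in range(n_samples):
            n = random.randint(1, max_n)
            start = random.randint(0, max(0, len(tokens) - n))
            samples.append(" ".join(tokens[start:start + n]))
        return samples

    class EEAPTermClassifier(nn.Module):
        """Embed (BERT) -> Encode (BiLSTM) -> Attend (pooling) -> Predict (binary)."""
        def __init__(self, bert_name="bert-base-cased", hidden=256):
            super().__init__()
            self.bert = AutoModel.from_pretrained(bert_name)           # Embed
            self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                                batch_first=True, bidirectional=True)  # Encode
            self.attn = nn.Linear(2 * hidden, 1)                       # Attend
            self.out = nn.Linear(2 * hidden, 1)                        # Predict

        def forward(self, input_ids, attention_mask):
            emb = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state
            enc, _ = self.lstm(emb)
            scores = self.attn(enc).squeeze(-1)
            scores = scores.masked_fill(attention_mask == 0, float("-inf"))
            weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
            pooled = (weights * enc).sum(dim=1)   # attention-weighted phrase vector
            return self.out(pooled).squeeze(-1)   # logit: higher = more term-like

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = EEAPTermClassifier()
    positives = ["long short-term memory"]        # e.g. a Wikipedia title
    negatives = random_ngrams("the model is trained on raw text from the corpus")
    batch = tokenizer(positives + negatives, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(batch["input_ids"], batch["attention_mask"])
    # Training would minimise nn.BCEWithLogitsLoss with labels 1 for Wikipedia
    # titles and 0 for the random n-grams.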

Subject Classification

ACM Subject Classification
  • Information systems → Top-k retrieval in databases
  • Computing methodologies → Information extraction
  • Computing methodologies → Neural networks
Keywords
  • Automatic Term Recognition
  • Term Extraction
  • BERT
  • EEAP
  • Deep Learning for ATR

References

  1. Nikita Astrakhantsev. ATR4S: toolkit with state-of-the-art automatic terms recognition methods in Scala. Language Resources and Evaluation, 52(3):853-872, 2018.
  2. Nikita A. Astrakhantsev, Denis G. Fedorenko, and D. Yu. Turdakov. Methods for automatic term recognition in domain-specific text collections: A survey. Programming and Computer Software, 41(6):336-349, 2015.
  3. James R. Curran and Marc Moens. Improvements in automatic thesaurus extraction. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 59-66, 2002.
  4. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint, 2018. URL: http://arxiv.org/abs/1810.04805.
  5. Éric Gaussier. Flow network models for word alignment and terminology extraction from bilingual corpora. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, ACL '98/COLING '98, pages 444-450, USA, 1998. Association for Computational Linguistics. URL: https://doi.org/10.3115/980845.980921.
  6. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
  7. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
  8. Matthew Honnibal. Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models. Explosion blog, November 10, 2016.
  9. Kyo Kageura and Bin Umino. Methods of automatic term recognition: A review. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 3(2):259-289, 1996.
  10. Kush Khosla, Robbie Jones, and Nicholas Bowman. Featureless deep learning methods for automated key-term extraction, 2019.
  11. J.-D. Kim, Tomoko Ohta, Yuka Tateisi, and Jun'ichi Tsujii. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl. 1):i180-i182, 2003.
  12. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014. URL: http://arxiv.org/abs/1412.6980.
  13. Ioannis Korkontzelos, Ioannis P. Klapaftis, and Suresh Manandhar. Reviewing and evaluating automatic term recognition techniques. In Bengt Nordström and Aarne Ranta, editors, Advances in Natural Language Processing, pages 248-259, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
  14. Mikalai Krapivin, Aliaksandr Autaeu, and Maurizio Marchese. Large dataset for keyphrases extraction. Technical report, University of Trento, 2009.
  15. Yang Lingpeng, Ji Donghong, Zhou Guodong, and Nie Yu. Improving retrieval effectiveness by using key terms in top retrieved documents. In David E. Losada and Juan M. Fernández-Luna, editors, Advances in Information Retrieval, pages 169-184, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.
  16. Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. Learning natural language inference using bidirectional LSTM model and inner-attention. arXiv preprint, 2016. URL: http://arxiv.org/abs/1605.09090.
  17. Diana Maynard, Yaoyong Li, and Wim Peters. NLP techniques for term extraction and ontology population, 2008.
  18. Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. Deep learning based text classification: A comprehensive review. arXiv preprint, 2020. URL: http://arxiv.org/abs/2004.03705.
  19. Michael P. Oakes and Chris D. Paice. Term extraction for automatic abstracting. In D. Bourigault, C. Jacquemin, and M.-C. L'Homme, editors, Recent Advances in Computational Terminology, 2:353-370, 2001.
  20. Ayla Rigouts Terryn, Veronique Hoste, Patrick Drouin, and Els Lefever. TermEval 2020: Shared task on automatic term extraction using the annotated corpora for term extraction research (ACTER) dataset. In Proceedings of the 6th International Workshop on Computational Terminology, pages 85-94, Marseille, France, May 2020. European Language Resources Association. URL: https://www.aclweb.org/anthology/2020.computerm-1.12.
  21. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
  22. Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing. CoRR, abs/1708.02709, 2017. URL: http://arxiv.org/abs/1708.02709.
  23. Zijun Zhang. Improved Adam optimizer for deep neural networks. In 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), pages 1-2. IEEE, 2018.
  24. Ziqi Zhang, Jie Gao, and Fabio Ciravegna. JATE 2.0: Java automatic term extraction with Apache Solr. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2262-2269, Portorož, Slovenia, 2016. European Language Resources Association (ELRA). URL: https://www.aclweb.org/anthology/L16-1359.
  25. Ziqi Zhang, Johann Petrak, and Diana Maynard. Adapted TextRank for term extraction: A generic method of improving automatic term extraction algorithms. Procedia Computer Science, 137:102-108, January 2018. URL: https://doi.org/10.1016/j.procs.2018.09.010.