Mining Scientific Articles Powered by Machine Learning Techniques

Authors: Carlos A. S. J. Gulo, Thiago R. P. M. Rúbio, Shazia Tabassum, Simone G. D. Prado



Cite As

Carlos A. S. J. Gulo, Thiago R. P. M. Rúbio, Shazia Tabassum, and Simone G. D. Prado. Mining Scientific Articles Powered by Machine Learning Techniques. In 2015 Imperial College Computing Student Workshop (ICCSW 2015). Open Access Series in Informatics (OASIcs), Volume 49, pp. 21-28, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)


Literature review is one of the most important phases of research. Scientists must identify the gaps and challenges in a given area, and the scientific literature, as the accumulated body of knowledge, should provide enough information to do so. The problem is where to find the best and most important articles that establish the state of the art in that specific domain. A feasible literature review consists of locating, appraising, and synthesising the best empirical evidence in the pool of available publications, guided by one or more research questions. Nevertheless, searching electronic databases for interesting articles is not guaranteed to retrieve the most relevant content. Indeed, existing search engines try to recommend articles by looking only for occurrences of given keywords. In fact, the relevance of a paper should depend on many other factors, such as adequacy to the theme, the specific tools used, or even the test strategy, which makes automatic recommendation of articles a challenging problem. Our approach allows researchers to browse huge article collections and quickly find the publications of particular interest by using machine learning techniques. The proposed solution automatically classifies and prioritises scientific papers by relevance. Using previous samples manually classified by domain experts, we apply a Naive Bayes classifier to predict relevant articles from real-world journal repositories such as IEEE Xplore or the ACM Digital Library. Results suggest that our model can effectively recommend, classify, and rank the most relevant articles of a particular scientific field of interest. In our experiments, we achieved 98.22% accuracy in recommending articles that are present in an expert classification list, indicating a good prediction of relevance. The recommended papers are, at least, worth reading. We envisage expanding our model to accept user filters and other inputs to improve predictions.
  • Machine Learning
  • Text Categorisation
  • Text Classification
  • Ranking
  • Systematic Literature Review
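
The classify-and-rank idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain multinomial Naive Bayes with add-one smoothing, written in standard-library Python, and the class name `NaiveBayesRanker`, the labels `relevant`/`irrelevant`, and the toy article titles are all hypothetical and invented for illustration. It trains on a few expert-labelled samples and then orders unseen candidates by their estimated probability of being relevant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesRanker:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Hypothetical helper for illustration only; not taken from the paper.
    """

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)          # documents per class
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior of each class."""
        # Words never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for c in self.classes:
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.doc_counts[c] / self.total_docs)
            for t in tokens:
                score += math.log((self.word_counts[c][t] + 1) / denom)
            scores[c] = score
        return scores

    def relevance(self, text, positive="relevant"):
        """Return the posterior probability of the positive class."""
        scores = self.log_scores(text)
        top = max(scores.values())  # subtract the max for numerical stability
        weights = {c: math.exp(s - top) for c, s in scores.items()}
        return weights[positive] / sum(weights.values())

    def rank(self, texts, positive="relevant"):
        """Sort candidate texts from most to least likely relevant."""
        return sorted(texts, key=lambda t: self.relevance(t, positive),
                      reverse=True)


# Toy training set standing in for samples labelled by domain experts.
train_texts = [
    "naive bayes text classification of documents",
    "topic models for document categorisation",
    "ranking scientific articles with machine learning",
    "protein folding in yeast cells",
    "soil erosion in river basins",
    "bridge design under wind load",
]
train_labels = ["relevant"] * 3 + ["irrelevant"] * 3

model = NaiveBayesRanker().fit(train_texts, train_labels)

# Unseen candidates, e.g. titles returned by a keyword search.
candidates = [
    "wind load effects on river bridge",
    "machine learning for text classification",
]
ranked = model.rank(candidates)
for title in ranked:
    print(f"{model.relevance(title):.3f}  {title}")
```

Ranking by posterior probability rather than by the hard class label is one simple way to obtain a prioritised reading list from a binary classifier; a real pipeline would also need stop-word removal, stemming, and a held-out evaluation such as cross-validation.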



