Comparison of Platforms for Recommender Algorithm on Large Datasets

Authors Christina Diedhiou, Bryan Carpenter, Ramazan Esmeli


Author Details

Christina Diedhiou
  • School of Computing, University of Portsmouth, United Kingdom
Bryan Carpenter
  • School of Computing, University of Portsmouth, United Kingdom
Ramazan Esmeli
  • School of Computing, University of Portsmouth, United Kingdom

Cite As

Christina Diedhiou, Bryan Carpenter, and Ramazan Esmeli. Comparison of Platforms for Recommender Algorithm on Large Datasets. In 2018 Imperial College Computing Student Workshop (ICCSW 2018). Open Access Series in Informatics (OASIcs), Volume 66, pp. 4:1-4:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


Abstract

One of the challenges our society faces is the ever-increasing amount of data. Among existing platforms that address the system requirements, Hadoop is a framework widely used to store and analyze "big data". On the human side, recommendation systems are one of the aids to finding the things people really want. This paper evaluates highly scalable parallel algorithms for recommendation systems with application to very large data sets. A particular goal is to evaluate MPJ Express, an open source Java message passing library for parallel computing that has been integrated with Hadoop. As a demonstration, we use MPJ Express to implement collaborative filtering on various data sets using the ALSWR algorithm (Alternating-Least-Squares with Weighted-lambda-Regularization). We benchmark the performance and demonstrate parallel speedup on the Movielens and Yahoo Music data sets, comparing our results with two other frameworks: Mahout and Spark. Our results indicate that the MPJ Express implementation of ALSWR has very competitive performance and scalability in comparison with the two other frameworks.
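As a rough illustration of the ALSWR idea named in the abstract, the following single-process Python sketch alternates regularized least-squares solves for user and item factors, scaling the penalty by each row's rating count. It is not the authors' MPJ Express code: the function names (`solve`, `update_side`, `objective`, `als_wr`), the initialization (item mean rating plus small random values), and the default parameters are assumptions chosen for illustration.

```python
import random


def solve(a, b):
    """Solve the dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def update_side(ratings, fixed, k, lam):
    """Re-solve one side's factors while the other side (fixed) is held constant.

    ratings: {row_id: {col_id: rating}}, fixed: {col_id: [k floats]}.
    """
    updated = {}
    for row_id, rated in ratings.items():
        weight = lam * len(rated)  # weighted-lambda: scale by rating count
        a = [[weight if p == q else 0.0 for q in range(k)] for p in range(k)]
        b = [0.0] * k
        for col_id, r in rated.items():
            v = fixed[col_id]
            for p in range(k):
                b[p] += r * v[p]
                for q in range(k):
                    a[p][q] += v[p] * v[q]
        updated[row_id] = solve(a, b)
    return updated


def objective(by_user, users, items, lam):
    """Squared error on known ratings plus the weighted-lambda penalty."""
    err = 0.0
    item_counts = {}
    for u, rated in by_user.items():
        for i, r in rated.items():
            pred = sum(x * y for x, y in zip(users[u], items[i]))
            err += (r - pred) ** 2
            item_counts[i] = item_counts.get(i, 0) + 1
    reg = sum(len(by_user[u]) * sum(x * x for x in users[u]) for u in by_user)
    reg += sum(n * sum(x * x for x in items[i]) for i, n in item_counts.items())
    return err + lam * reg


def als_wr(by_user, k=2, lam=0.05, iterations=10, seed=0):
    """Run the sketch; return (user factors, item factors, objective per iteration)."""
    rng = random.Random(seed)
    by_item = {}
    for u, rated in by_user.items():
        for i, r in rated.items():
            by_item.setdefault(i, {})[u] = r
    # Assumed initialization: first component is the item's mean rating,
    # remaining components are small random values.
    items = {
        i: [sum(rs.values()) / len(rs)] + [rng.uniform(0.0, 0.1) for _ in range(k - 1)]
        for i, rs in by_item.items()
    }
    history = []
    for _ in range(iterations):
        users = update_side(by_user, items, k, lam)
        items = update_side(by_item, users, k, lam)
        history.append(objective(by_user, users, items, lam))
    return users, items, history
```

Within `update_side`, each row's solve depends only on the fixed factors of the opposite side, so the rows can be processed independently. That independence is what makes this kind of algorithm a natural candidate for distribution across processes, although how the work is actually partitioned in any given framework is not shown here.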

Subject Classification

ACM Subject Classification
  • Information systems → Database management system engines

Keywords
  • HPC
  • MPJ Express
  • Hadoop
  • Spark
  • Mahout



