Comparison of Platforms for Recommender Algorithm on Large Datasets

Diedhiou, Christina; Carpenter, Bryan; Esmeli, Ramazan

doi:10.4230/OASIcs.ICCSW.2018.4

Subject Classification

ACM Subject Classification

Information systems → Database management system engines

Keywords

HPC
MPJ Express
Hadoop
Spark
Mahout

Abstract

One of the challenges our society faces is the ever-increasing amount of data. Among existing platforms that address the resulting system requirements, Hadoop is a framework widely used to store and analyze "big data". On the human side, recommendation systems are one of the aids that help people find the things they really want. This paper evaluates highly scalable parallel algorithms for recommendation systems, with application to very large data sets. A particular goal is to evaluate MPJ Express, an open source Java message passing library for parallel computing that has been integrated with Hadoop. As a demonstration, we use MPJ Express to implement collaborative filtering on various data sets using the ALSWR algorithm (Alternating-Least-Squares with Weighted-lambda-Regularization). We benchmark the performance and demonstrate parallel speedup on the Movielens and Yahoo Music data sets, comparing our results with two other frameworks: Mahout and Spark. Our results indicate that the MPJ Express implementation of ALSWR has very competitive performance and scalability in comparison with the two other frameworks.
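
For readers unfamiliar with ALSWR, the following is a minimal sequential sketch of the general technique, not the MPJ Express implementation evaluated in the paper. It alternates between solving a small regularized least-squares problem for each user (with item factors held fixed) and for each item (with user factors held fixed), weighting the regularization term by the number of ratings each user or item has. The function names, the parameter defaults (`k`, `lam`, `sweeps`) and the toy ratings are all assumptions made for illustration.

```python
import random


def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def update_side(ratings_by_row, fixed, k, lam):
    """Recompute one factor vector per row while the other side's factors stay fixed."""
    updated = {}
    for row, rated in ratings_by_row.items():
        n = len(rated)  # number of ratings: the "weighted lambda" multiplier
        a = [[lam * n if p == q else 0.0 for q in range(k)] for p in range(k)]
        b = [0.0] * k
        for col, r in rated.items():
            v = fixed[col]
            for p in range(k):
                b[p] += r * v[p]
                for q in range(k):
                    a[p][q] += v[p] * v[q]
        updated[row] = solve(a, b)
    return updated


def objective(by_user, by_item, users, items, lam):
    """Squared error on known ratings plus count-weighted L2 regularization."""
    def dot(x, y):
        return sum(p * q for p, q in zip(x, y))
    err = sum((r - dot(users[u], items[i])) ** 2
              for u, rated in by_user.items() for i, r in rated.items())
    reg = sum(len(by_user[u]) * dot(v, v) for u, v in users.items())
    reg += sum(len(by_item[i]) * dot(v, v) for i, v in items.items())
    return err + lam * reg


def als_wr(ratings, k=2, lam=0.1, sweeps=10, seed=0):
    """Run alternating sweeps; return user factors, item factors, and objective per sweep."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, {})[i] = r
        by_item.setdefault(i, {})[u] = r
    items = {i: [rng.random() for _ in range(k)] for i in by_item}
    history = []
    for _ in range(sweeps):
        users = update_side(by_user, items, k, lam)
        items = update_side(by_item, users, k, lam)
        history.append(objective(by_user, by_item, users, items, lam))
    return users, items, history


# Made-up toy ratings as (user, item, rating) triples, for illustration only.
ratings = [("ann", "m1", 5), ("ann", "m2", 3), ("bob", "m1", 4),
           ("bob", "m3", 1), ("cat", "m2", 2), ("cat", "m3", 5)]
users, items, history = als_wr(ratings)
```

Because each user's (or item's) update depends only on the fixed factors of the other side, the loop inside `update_side` can in principle be split across processes, which is the property a parallel implementation would exploit. How the paper distributes the work is not shown here.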

Cite As

Christina Diedhiou, Bryan Carpenter, and Ramazan Esmeli. Comparison of Platforms for Recommender Algorithm on Large Datasets. In 2018 Imperial College Computing Student Workshop (ICCSW 2018). Open Access Series in Informatics (OASIcs), Volume 66, pp. 4:1-4:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/OASIcs.ICCSW.2018.4

Author Details

Christina Diedhiou

School of Computing, University of Portsmouth, United Kingdom

Bryan Carpenter

School of Computing, University of Portsmouth, United Kingdom

Ramazan Esmeli

School of Computing, University of Portsmouth, United Kingdom
