What Makes Spatial Data Big? A Discussion on How to Partition Spatial Data

Authors Alberto Belussi, Damiano Carra, Sara Migliorini, Mauro Negri, Giuseppe Pelagatti


Author Details

Alberto Belussi
  • Department of Computer Science, University of Verona, Italy
Damiano Carra
  • Department of Computer Science, University of Verona, Italy
Sara Migliorini
  • Department of Computer Science, University of Verona, Italy
Mauro Negri
  • Department of Electronics, Information and Bioengineering, Politecnico of Milan, Italy
Giuseppe Pelagatti
  • Department of Electronics, Information and Bioengineering, Politecnico of Milan, Italy

Cite As

Alberto Belussi, Damiano Carra, Sara Migliorini, Mauro Negri, and Giuseppe Pelagatti. What Makes Spatial Data Big? A Discussion on How to Partition Spatial Data. In 10th International Conference on Geographic Information Science (GIScience 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 114, pp. 2:1-2:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


The amount of available spatial data has increased significantly in recent years, to the point that traditional analysis tools can no longer manage it effectively. Many attempts have therefore been made to extend existing MapReduce tools, such as Hadoop or Spark, with spatial capabilities in terms of data types and algorithms. These extensions mainly rely on the partitioning techniques implemented for textual data, where size is measured as the number of occupied bytes. However, the size of spatial data is also described by other features, such as the number of vertices or the MBR (minimum bounding rectangle) size of the geometries, which greatly affect the performance of operations like the spatial join during data analysis. As a result, traditional partitioning techniques prevent the parallel execution provided by a MapReduce environment from being fully exploited. This paper analyses the problem extensively, taking the spatial join operation as a use case and carrying out both a theoretical and an experimental analysis of it. Moreover, it provides a solution based on a different partitioning technique, which splits complex or extensive geometries. Finally, the proposed solution is validated through experiments on synthetic and real datasets.
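The contrast between byte size and spatial size can be illustrated with a minimal sketch. It is not taken from the paper: the helper names (`wkt_bytes`, `cells_covered`, `describe`), the WKT-like text encoding, and the uniform grid with unit cells are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Geometry = List[Point]  # closed polygon ring given as a list of vertices


def wkt_bytes(geom: Geometry) -> int:
    """Size in bytes of a WKT-like text encoding (hypothetical serialization)."""
    text = "POLYGON((" + ", ".join(f"{x} {y}" for x, y in geom) + "))"
    return len(text.encode("utf-8"))


def mbr(geom: Geometry) -> Tuple[float, float, float, float]:
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in geom]
    ys = [y for _, y in geom]
    return min(xs), min(ys), max(xs), max(ys)


def cells_covered(geom: Geometry, cell_size: float = 1.0) -> int:
    """Number of cells of a uniform grid (assumed layout) overlapped by the MBR."""
    min_x, min_y, max_x, max_y = mbr(geom)
    cols = math.floor(max_x / cell_size) - math.floor(min_x / cell_size) + 1
    rows = math.floor(max_y / cell_size) - math.floor(min_y / cell_size) + 1
    return cols * rows


def describe(geom: Geometry) -> Dict[str, int]:
    """Collect three different notions of 'size' for one geometry."""
    return {
        "bytes": wkt_bytes(geom),
        "vertices": len(geom),
        "cells": cells_covered(geom),
    }


# Two squares with the same number of vertices and the same encoded length,
# but very different spatial extent.
small = [(0.1, 0.1), (0.2, 0.1), (0.2, 0.2), (0.1, 0.2), (0.1, 0.1)]
large = [(0.5, 0.5), (9.5, 0.5), (9.5, 9.5), (0.5, 9.5), (0.5, 0.5)]

small_stats = describe(small)
large_stats = describe(large)
print(small_stats)
print(large_stats)
```

In this toy setting both records occupy the same number of bytes and have the same number of vertices, yet the MBR of the second overlaps 100 grid cells against 1 for the first. A byte-based split would treat the two as equivalent, even though the extensive geometry is likely to take part in far more candidate pairs during a spatial join.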

Subject Classification

ACM Subject Classification
  • Information systems → Geographic information systems

Keywords
  • Spatial join
  • SpatialHadoop
  • MapReduce
  • partitioning
  • big data



