Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space

Author Yufei Tao



PDF
Thumbnail PDF

File

LIPIcs.ICDT.2018.20.pdf
  • Filesize: 0.49 MB
  • 19 pages

Document Identifiers

Author Details

Yufei Tao

Cite As Get BibTex

Yufei Tao. Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space. In 21st International Conference on Database Theory (ICDT 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 98, pp. 20:1-20:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.ICDT.2018.20

Abstract

In entity matching classification, we are given two sets R and S of objects where whether r and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) -> {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match. 

Past research is accustomed to running a learning algorithm directly on all the labeled (i.e., match or not) pairs in R times S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R times S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R|+|S|+|T|, thereby achieving a large performance gain in the (typical) scenario where |T|<<|R||S|. 
    
This paper provides evidence on the feasibility of the new direction, by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a linear multi-dimensional plane separating the matching and non-matching pairs. We actually do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.

Subject Classification

Keywords
  • Entity Matching
  • Linear Programming
  • Range Counting
  • Dominance Join
  • Massively Parallel Computation

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Foto N. Afrati, Manas R. Joglekar, Christopher Re, Semih Salihoglu, and Jeffrey D. Ullman. GYM: A multiround distributed join algorithm. In Proceedings of International Conference on Database Theory (ICDT), pages 4:1-4:18, 2017. Google Scholar
  2. Foto N. Afrati, Paraschos Koutris, Dan Suciu, and Jeffrey D. Ullman. Parallel skyline queries. Theory Comput. Syst., 57(4):1008-1037, 2015. Google Scholar
  3. Foto N. Afrati and Jeffrey D. Ullman. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(9):1282-1298, 2011. Google Scholar
  4. Pankaj K. Agarwal, Kyle Fox, Kamesh Munagala, and Abhinandan Nath. Parallel algorithms for constructing range and nearest-neighbor searching data structures. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 429-440, 2016. Google Scholar
  5. Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 273-284, 2013. Google Scholar
  6. Timothy M. Chan and Eric Y. Chen. Multi-pass geometric algorithms. Discrete & Computational Geometry, 37(1):79-102, 2007. Google Scholar
  7. Xu Chu, Ihab F. Ilyas, and Paraschos Koutris. Distributed data deduplication. Proceedings of the VLDB Endowment (PVLDB), 9(11):864-875, 2016. Google Scholar
  8. Kenneth L. Clarkson. Las vegas algorithms for linear and integer programming when the dimension is small. Journal of the ACM (JACM), 42(2):488-499, 1995. Google Scholar
  9. Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of ACM Symposium on Theory of Computing (STOC), pages 81-90, 2013. Google Scholar
  10. Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 137-150, 2004. Google Scholar
  11. Martin E. Dyer and Sandeep Sen. Fast and optimal parallel multidimensional search in PRAMs with applications to linear programming and related problems. SIAM Journal of Computing, 30(5):1443-1461, 2000. Google Scholar
  12. Herbert Edelsbrunner and Mark H. Overmars. On the equivalence of some rectangle problems. Information Processing Letters (IPL), 14(3):124-127, 1982. Google Scholar
  13. Vasilis Efthymiou, George Papadakis, George Papastefanatos, Kostas Stefanidis, and Themis Palpanas. Parallel meta-blocking for scaling entity resolution over big heterogeneous data. Inf. Syst., 65:137-157, 2017. Google Scholar
  14. Wenfei Fan, Jingbo Xu, Yinghui Wu, Wenyuan Yu, Jiaxin Jiang, Zeyu Zheng, Bohan Zhang, Yang Cao, and Chao Tian. Parallelizing sequential graph computations. In Proceedings of ACM Management of Data (SIGMOD), pages 495-510, 2017. Google Scholar
  15. Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F. Naughton, Narasimhan Rampalli, Jude W. Shavlik, and Xiaojin Zhu. Corleone: hands-off crowdsourcing for entity matching. In Proceedings of ACM Management of Data (SIGMOD), pages 601-612, 2014. Google Scholar
  16. Michael T. Goodrich. Communication-efficient parallel sorting. SIAM Journal of Computing, 29(2):416-432, 1999. Google Scholar
  17. Michael T. Goodrich and Edgar A. Ramos. Bounded-independence derandomization of geometric partitioning with applications to parallel fixed-dimensional linear programming. Discrete & Computational Geometry, 18(4):397-420, 1997. Google Scholar
  18. Xiao Hu, Paraschos Koutris, and Ke Yi. The relationships among coarse-grained parallel models. Technical report, HKUST, 2016. Google Scholar
  19. Xiao Hu, Yufei Tao, and Ke Yi. Output-optimal parallel algorithms for similarity joins. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 79-90, 2017. Google Scholar
  20. Bas Ketsman and Dan Suciu. A worst-case optimal multi-round algorithm for parallel computation of conjunctive queries. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 417-428, 2017. Google Scholar
  21. Lars Kolb, Andreas Thor, and Erhard Rahm. Load balancing for mapreduce-based entity resolution. In Proceedings of International Conference on Data Engineering (ICDE), pages 618-629, 2012. Google Scholar
  22. Hanna Köpcke and Erhard Rahm. Frameworks for entity matching: A comparison. Data Knowl. Eng., 69(2):197-210, 2010. Google Scholar
  23. Hanna Köpcke, Andreas Thor, and Erhard Rahm. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment (PVLDB), 3(1):484-493, 2010. Google Scholar
  24. Paraschos Koutris, Paul Beame, and Dan Suciu. Worst-case optimal algorithms for parallel query processing. In Proceedings of International Conference on Database Theory (ICDT), pages 8:1-8:18, 2016. Google Scholar
  25. Paraschos Koutris and Dan Suciu. Parallel evaluation of conjunctive queries. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 223-234, 2011. Google Scholar
  26. Ketan Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms. Prentice-Hall, 1994. Google Scholar
  27. Mert Saglam and Gábor Tardos. On the communication complexity of sparse set disjointness and exists-equal problems. In Proceedings of Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 678-687, 2013. Google Scholar
  28. Yufei Tao, Wenqing Lin, and Xiaokui Xiao. Minimal MapReduce algorithms. In Proceedings of ACM Management of Data (SIGMOD), pages 529-540, 2013. Google Scholar
  29. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI, pages 15-28, 2012. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail