On Finding the Jaccard Center

Authors Marc Bury, Chris Schwiegelshohn



PDF
Thumbnail PDF

File

LIPIcs.ICALP.2017.23.pdf
  • Filesize: 497 kB
  • 14 pages

Document Identifiers

Author Details

Marc Bury
Chris Schwiegelshohn

Cite As Get BibTex

Marc Bury and Chris Schwiegelshohn. On Finding the Jaccard Center. In 44th International Colloquium on Automata, Languages, and Programming (ICALP 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 80, pp. 23:1-23:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.ICALP.2017.23

Abstract

We initiate the study of finding the Jaccard center of a given collection N of sets. For two sets X,Y, the Jaccard index is defined as |X\cap Y|/|X\cup Y| and the corresponding distance is 1-|X\cap Y|/|X\cup Y|. The Jaccard center is a set C minimizing the maximum distance to any set of N.

We show that the problem is NP-hard to solve exactly, and that it admits a PTAS while no FPTAS can exist unless P = NP.
Furthermore, we show that the problem is fixed parameter tractable in the maximum Hamming norm between Jaccard center and any input set. Our algorithms are based on a compression technique similar in spirit to coresets for the Euclidean 1-center problem.

In addition, we also show that, contrary to the previously studied median problem by Chierichetti et al. (SODA 2010), the continuous version of the Jaccard center problem admits a simple polynomial time algorithm.

Subject Classification

Keywords
  • Clustering
  • 1-Center
  • Jaccard

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Geometric approximation via coresets. Combinatorial and computational geometry, 52:1-30, 2005. Google Scholar
  2. C. Bachmaier, F. J. Brandenburg, A. Gleißner, and A. Hofmeier. On the hardness of maximum rank aggregation problems. J. Discrete Algorithms, 31:2-13, 2015. Google Scholar
  3. M. Badoiu and K. L. Clarkson. Optimal core-sets for balls. Comput. Geom., 40(1):14-22, 2008. Google Scholar
  4. M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. In Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, pages 250-257, 2002. Google Scholar
  5. T. C. Biedl, F.-J. Brandenburg, and X. Deng. On the complexity of crossings in permutations. Discrete Mathematics, 309(7):1813-1823, 2009. Google Scholar
  6. A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997, SEQUENCES'97, pages 21-, Washington, DC, USA, 1997. IEEE Computer Society. Google Scholar
  7. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157-1166, 1997. Google Scholar
  8. L. Cai and D. W. Juedes. On the existence of subexponential parameterized algorithms. J. Comput. Syst. Sci., 67(4):789-807, 2003. Google Scholar
  9. F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 - July 1, 2009, pages 219-228, 2009. Google Scholar
  10. F. Chierichetti, R. Kumar, S. Pandey, and S. Vassilvitskii. Finding the Jaccard median. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 293-311, 2010. Google Scholar
  11. K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Trans. Algorithms, 6(4), 2010. Google Scholar
  12. E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng., 13(1):64-78, 2001. Google Scholar
  13. A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pages 271-280, 2007. Google Scholar
  14. C. de la Higuera and F. Casacuberta. Topology of strings: Median string is NP-complete. Theor. Comput. Sci., 230(1-2):39-48, 2000. Google Scholar
  15. M. R. Fellows, J. Gramm, and R. Niedermeier. On the parameterized intractability of CLOSEST substringsize and related problems. In STACS 2002, 19th Annual Symposium on Theoretical Aspects of Computer Science, Antibes - Juan les Pins, France, March 14-16, 2002, Proceedings, pages 262-273, 2002. Google Scholar
  16. M. Frances and A. Litman. On covering problems of codes. Theory Comput. Syst., 30(2):113-119, 1997. Google Scholar
  17. M. R. Garey and D. S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman &Co., New York, NY, USA, 1990. Google Scholar
  18. L. Gasieniec, J. Jansson, and A. Lingas. Efficient approximation algorithms for the Hamming center problem. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, 17-19 January 1999, Baltimore, Maryland., pages 905-906, 1999. Google Scholar
  19. T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci., 38:293-306, 1985. Google Scholar
  20. J. C. Gower and P. Legendre. Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1):5-48, 1986. Google Scholar
  21. J. Gramm, R. Niedermeier, and P. Rossmanith. Fixed-parameter algorithms for CLOSEST STRING and related problems. Algorithmica, 37(1):25-42, 2003. Google Scholar
  22. S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Inf. Syst., 25(5):345-366, 2000. Google Scholar
  23. D. S. Hochbaum and D. B. Shmoys. A unified approach to approximation algorithms for bottleneck problems. J. ACM, 33(3):533-550, 1986. Google Scholar
  24. R. Impagliazzo, R. Paturi, and F. Zane. Which problems have strongly exponential complexity? J. Comput. Syst. Sci., 63(4):512-530, 2001. Google Scholar
  25. P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241-272, 1901. Google Scholar
  26. P. Kumar, J. S. B. Mitchell, and E. A. Yildirim. Approximate minimum enclosing balls in high dimensions using core-sets. ACM Journal of Experimental Algorithmics, 8, 2003. Google Scholar
  27. J. K. Lanctôt, M. Li, B. Ma, S. Wang, and L. Zhang. Distinguishing string selection problems. Inf. Comput., 185(1):41-55, 2003. Google Scholar
  28. M. Li, B. Ma, and L. Wang. On the closest string and substring problems. J. ACM, 49(2):157-171, 2002. Google Scholar
  29. D. Marx. Closest substring problems with small distances. SIAM J. Comput., 38(4):1382-1410, 2008. Google Scholar
  30. N. Megiddo. Linear programming in linear time when the dimension is fixed. J. ACM, 31(1):114-127, 1984. Google Scholar
  31. M. Mitzenmacher and E. Upfal. Probability and computing - randomized algorithms and probabilistic analysis. Cambridge University Press, 2005. Google Scholar
  32. F. Nicolas and E. Rivals. Hardness results for the center and median string problems under the weighted and unweighted edit distances. J. Discrete Algorithms, 3(2-4):390-415, 2005. Google Scholar
  33. V. Y. Popov. Multiple genome rearrangement by swaps and by element duplications. Theor. Comput. Sci., 385(1-3):115-126, 2007. Google Scholar
  34. R. Real and J. M. Vargas. The probabilistic basis of jaccard’s index of similarity. Systematic biology, 45(3):380-385, 1996. Google Scholar
  35. H. Späth. The minisum location problem for the Jaccard metric. Operations-Research-Spektrum, 3(2):91-94, 1981. Google Scholar
  36. J. J. Sylvester. A Question in the Geometry of Situation. Quarterly Journal of Pure and Applied Mathematics, 1, 1857. Google Scholar
  37. G. A. Watson. An algorithm for the single facility location problem using the Jaccard metric. SIAM Journal on Scientific and Statistical Computing, 4(4):748-756, 1983. Google Scholar
  38. E. Welzl. Smallest enclosing disks (balls and ellipsoids), pages 359-370. Springer Berlin Heidelberg, Berlin, Heidelberg, 1991. Google Scholar
  39. P. Willett, J. M. Barnard, and G. M. Downs. Chemical similarity searching. Journal of Chemical Information and Computer Sciences, 38(6):983-996, 1998. Google Scholar
  40. E. A. Yildirim. Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimization, 19(3):1368-1391, 2008. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail