On the Hardness of Category Tree Construction

Authors Shay Gershtein, Uri Avron, Ido Guy, Tova Milo, Slava Novgorodov

Author Details

Shay Gershtein
  • Tel Aviv University, Israel
Uri Avron
  • Tel Aviv University, Israel
Ido Guy
  • Ben-Gurion University of the Negev, Beer Sheva, Israel
Tova Milo
  • Tel Aviv University, Israel
Slava Novgorodov
  • eBay Research, Netanya, Israel

Cite As

Shay Gershtein, Uri Avron, Ido Guy, Tova Milo, and Slava Novgorodov. On the Hardness of Category Tree Construction. In 25th International Conference on Database Theory (ICDT 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 220, pp. 4:1-4:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


Abstract

Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings, and we show that the aforementioned empirical approach is warranted: we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of maximum utility. In our model, the input is a set of n weighted item sets that the tree would ideally contain as categories. Rather than having to match the corresponding input set perfectly, each category is only required to exceed a given threshold for a given similarity function. The goal is to produce a tree that maximizes the total weight of the input sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to; this bound is the source of the problem's hardness, since initially each item may be contained in an arbitrary number of input sets. For this model, we prove inapproximability bounds of order Θ̃(√n) or Θ̃(n) for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where each category must be identical to the corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees. Finally, we generalize our results to DAG-based and non-hierarchical categorization.
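
The objective described in the abstract can be illustrated with a small sketch. This is not the paper's formal definition: it assumes Jaccard similarity as the similarity function, treats a similarity equal to the threshold as a match, lets one category match several input sets, and represents the tree only by the list of its categories' item sets. The names `jaccard` and `utility` are hypothetical, and the sketch does not enforce the tree structure or the per-item bound on category membership.

```python
from typing import FrozenSet, List, Tuple

ItemSet = FrozenSet[str]


def jaccard(a: ItemSet, b: ItemSet) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| (an assumed similarity function)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def utility(
    inputs: List[Tuple[ItemSet, float]],
    categories: List[ItemSet],
    threshold: float,
) -> float:
    """Total weight of input sets that have at least one matching category.

    A category matches an input set when their similarity reaches the
    threshold (treating equality as a match is an assumption of this sketch).
    """
    total = 0.0
    for item_set, weight in inputs:
        if any(jaccard(item_set, c) >= threshold for c in categories):
            total += weight
    return total


# Three weighted input sets and a candidate categorization with two categories.
inputs = [
    (frozenset({"a", "b", "c"}), 3.0),
    (frozenset({"c", "d"}), 2.0),
    (frozenset({"e"}), 1.0),
]
categories = [frozenset({"a", "b"}), frozenset({"e"})]

score = utility(inputs, categories, threshold=0.6)
exact_score = utility(inputs, categories, threshold=1.0)
```

With a threshold of 0.6, the category {a, b} covers the input set {a, b, c} (similarity 2/3), while a threshold of 1.0 corresponds to the exact-match special case, where only {e} is covered.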

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures and algorithms for data management
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Problems, reductions and completeness
Keywords
  • maximum independent set
  • approximation algorithms
  • approximation hardness bounds
  • taxonomy construction
  • category tree construction



