On the Hardness of Category Tree Construction

Gershtein, Shay; Avron, Uri; Guy, Ido; Milo, Tova; Novgorodov, Slava

doi:10.4230/LIPIcs.ICDT.2022.4

Abstract

Category trees, or taxonomies, are rooted trees where each node, called a category, corresponds to a set of related items. The construction of taxonomies has been studied in various domains, including e-commerce, document management, and question answering. Multiple algorithms for automating construction have been proposed, employing a variety of clustering approaches and crowdsourcing. However, no formal model to capture such categorization problems has been devised, and their complexity has not been studied. To address this, we propose in this work a combinatorial model that captures many practical settings and show that the aforementioned empirical approach has been warranted, as we prove strong inapproximability bounds for various problem variants and special cases when the goal is to produce a categorization of the maximum utility.
In our model, the input is a set of n weighted item sets that the tree would ideally contain as categories. A category need not match its corresponding input set perfectly; it suffices that their similarity, under a given similarity function, exceeds a given threshold. The goal is to produce a tree that maximizes the total weight of the input sets for which it contains a matching category. A key parameter is an upper bound on the number of categories an item may belong to. This bound is the source of the problem's hardness, since initially each item may be contained in an arbitrary number of input sets.
For this model, we prove inapproximability bounds, of order Θ̃(√n) or Θ̃(n), for various problem variants and special cases, loosely justifying the aforementioned heuristic approach. Our work includes reductions based on parameterized randomized constructions that highlight how various problem parameters and properties of the input may affect the hardness. Moreover, for the special case where each category must be identical to its corresponding input set, we devise an algorithm whose approximation guarantee depends solely on a more granular parameter, allowing improved worst-case guarantees. Finally, we also generalize our results to DAG-based and non-hierarchical categorization.
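To make the objective concrete, the following is a minimal Python sketch of how a candidate set of categories could be scored under the model described above. It is an illustration under stated assumptions, not the authors' formulation: Jaccard similarity stands in for the unspecified similarity function, a match is taken to mean similarity at or above the threshold (so that a threshold of 1.0 corresponds to the exact-match special case), the hierarchy among categories is not modeled, and the names `jaccard`, `max_membership`, and `utility` are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, List, Tuple

ItemSet = FrozenSet[Hashable]


def jaccard(a: ItemSet, b: ItemSet) -> float:
    """Jaccard similarity; an assumed stand-in for the model's similarity function."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def max_membership(categories: List[ItemSet]) -> int:
    """Largest number of categories that any single item belongs to."""
    counts: Dict[Hashable, int] = {}
    for category in categories:
        for item in category:
            counts[item] = counts.get(item, 0) + 1
    return max(counts.values(), default=0)


def utility(
    categories: List[ItemSet],
    weighted_sets: List[Tuple[ItemSet, float]],
    threshold: float,
    bound: int,
) -> float:
    """Total weight of the input sets that have a matching category.

    Raises ValueError if some item appears in more than `bound` categories.
    The tree structure over the categories is deliberately left out (assumption).
    """
    if max_membership(categories) > bound:
        raise ValueError("an item belongs to more categories than the bound allows")
    total = 0.0
    for items, weight in weighted_sets:
        if any(jaccard(items, category) >= threshold for category in categories):
            total += weight
    return total


# Three weighted input sets over the items a, b, c, d (made-up data).
inputs = [
    (frozenset("abc"), 3.0),
    (frozenset("cd"), 2.0),
    (frozenset("ad"), 1.0),
]
# Item d occurs in two input sets, but with bound 1 it may sit in only one category.
candidate = [frozenset("abc"), frozenset("d")]

strict_score = utility(candidate, inputs, threshold=0.6, bound=1)
loose_score = utility(candidate, inputs, threshold=0.5, bound=1)
```

With the stricter threshold only the first input set is matched, while the looser threshold lets the single-item category {d} count as a match for both remaining sets. This shows how the threshold and the membership bound interact in the objective.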

