Document

**Published in:** LIPIcs, Volume 290, 27th International Conference on Database Theory (ICDT 2024)

Data partitioning that maximizes or minimizes Shannon entropy is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partition algorithms can be accelerated if we have a data structure to find the entropy in different subsets of data when the algorithm needs to decide what block to construct. While it is generally known how to compute the entropy of a discrete distribution efficiently, we want to efficiently derive the entropy among the data items that lie in a specific area. We solve this problem in a typical setting when we deal with real data, where data items are geometric points and each requested area is a query (hyper)rectangle. More specifically, we consider a set P of n weighted and colored points in ℝ^d. The goal is to construct a low space data structure, such that given a query (hyper)rectangle R, it computes the entropy based on the colors of the points in P∩ R, in sublinear time. We show a conditional lower bound for this problem proving that we cannot hope for data structures with near-linear space and near-constant query time. Then, we propose exact data structures for d = 1 and d > 1 with o(n^{2d}) space and o(n) query time. We also provide a tune parameter t that the user can choose to bound the asymptotic space and query time of the new data structures. Next, we propose near linear space data structures for returning either an additive or a multiplicative approximation of the entropy. Finally, we show how we can use the new data structures to efficiently partition time series and histograms with respect to entropy.

Sanjay Krishnan and Stavros Sintos. Range Entropy Queries and Partitioning. In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 6:1-6:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{krishnan_et_al:LIPIcs.ICDT.2024.6, author = {Krishnan, Sanjay and Sintos, Stavros}, title = {{Range Entropy Queries and Partitioning}}, booktitle = {27th International Conference on Database Theory (ICDT 2024)}, pages = {6:1--6:21}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-312-6}, ISSN = {1868-8969}, year = {2024}, volume = {290}, editor = {Cormode, Graham and Shekelyan, Michael}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2024.6}, URN = {urn:nbn:de:0030-drops-197883}, doi = {10.4230/LIPIcs.ICDT.2024.6}, annote = {Keywords: Shannon entropy, range query, data structure, data partitioning} }

Document

**Published in:** LIPIcs, Volume 290, 27th International Conference on Database Theory (ICDT 2024)

We are given a set 𝒵 = {(R_1,s_1), …, (R_n,s_n)}, where each R_i is a range in ℝ^d, such as rectangle or ball, and s_i ∈ [0,1] denotes its selectivity. The goal is to compute a small-size discrete data distribution 𝒟 = {(q₁,w₁),…, (q_m,w_m)}, where q_j ∈ ℝ^d and w_j ∈ [0,1] for each 1 ≤ j ≤ m, and ∑_{1≤j≤m} w_j = 1, such that 𝒟 is the most consistent with 𝒵, i.e., err_p(𝒟,𝒵) = 1/n ∑_{i = 1}ⁿ |s_i - ∑_{j=1}^m w_j⋅1(q_j ∈ R_i)|^p is minimized. In a database setting, 𝒵 corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., fraction of tuples returned), and 𝒟 can be used as compact model for approximating the data distribution within the table without accessing the underlying contents.
In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is NP-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time O((n+δ^{-d}) δ^{-2} polylog n), a discrete distribution 𝒟̃ of size O(δ^{-2}), such that err_p(𝒟̃,𝒵) ≤ min_𝒟 err_p(𝒟,𝒵)+δ (for p = 1,2,∞) where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate the infeasibility of relative approximations as well as removal of the exponential dependency on the dimension for additive approximations. This suggests that significant improvements to our algorithm are unlikely.

Pankaj K. Agarwal, Rahul Raychaudhury, Stavros Sintos, and Jun Yang. Computing Data Distribution from Query Selectivities. In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 18:1-18:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{agarwal_et_al:LIPIcs.ICDT.2024.18, author = {Agarwal, Pankaj K. and Raychaudhury, Rahul and Sintos, Stavros and Yang, Jun}, title = {{Computing Data Distribution from Query Selectivities}}, booktitle = {27th International Conference on Database Theory (ICDT 2024)}, pages = {18:1--18:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-312-6}, ISSN = {1868-8969}, year = {2024}, volume = {290}, editor = {Cormode, Graham and Shekelyan, Michael}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2024.18}, URN = {urn:nbn:de:0030-drops-198007}, doi = {10.4230/LIPIcs.ICDT.2024.18}, annote = {Keywords: selectivity queries, discrete distributions, Multiplicative Weights Update, eps-approximation, learnable functions, depth problem, arrangement} }

Document

**Published in:** LIPIcs, Volume 290, 27th International Conference on Database Theory (ICDT 2024)

A witness is a sub-database that preserves the query results of the original database but of much smaller size. It has wide applications in query rewriting and debugging, query explanation, IoT analytics, multi-layer network routing, etc. In this paper, we study the smallest witness problem (SWP) for the class of conjunctive queries (CQs) without self-joins.
We first establish the dichotomy that SWP for a CQ can be computed in polynomial time if and only if it has head-cluster property, unless P = NP. We next turn to the approximated version by relaxing the size of a witness from being minimum. We surprisingly find that the head-domination property - that has been identified for the deletion propagation problem [Kimelfeld et al., 2012] - can also precisely capture the hardness of the approximated smallest witness problem. In polynomial time, SWP for any CQ with head-domination property can be approximated within a constant factor, while SWP for any CQ without such a property cannot be approximated within a logarithmic factor, unless P = NP.
We further explore efficient approximation algorithms for CQs without head-domination property: (1) we show a trivial algorithm which achieves a polynomially large approximation ratio for general CQs; (2) for any CQ with only one non-output attribute, such as star CQs, we show a greedy algorithm with a logarithmic approximation ratio; (3) for line CQs, which contain at least two non-output attributes, we relate SWP problem to the directed steiner forest problem, whose algorithms can be applied to line CQs directly. Meanwhile, we establish a much higher lower bound, exponentially larger than the logarithmic lower bound obtained above. It remains open to close the gap between the lower and upper bound of the approximated SWP for CQs without head-domination property.

Xiao Hu and Stavros Sintos. Finding Smallest Witnesses for Conjunctive Queries. In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 24:1-24:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{hu_et_al:LIPIcs.ICDT.2024.24, author = {Hu, Xiao and Sintos, Stavros}, title = {{Finding Smallest Witnesses for Conjunctive Queries}}, booktitle = {27th International Conference on Database Theory (ICDT 2024)}, pages = {24:1--24:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-312-6}, ISSN = {1868-8969}, year = {2024}, volume = {290}, editor = {Cormode, Graham and Shekelyan, Michael}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2024.24}, URN = {urn:nbn:de:0030-drops-198066}, doi = {10.4230/LIPIcs.ICDT.2024.24}, annote = {Keywords: conjunctive query, smallest witness, head-cluster, head-domination} }

Document

Track A: Algorithms, Complexity and Games

**Published in:** LIPIcs, Volume 198, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)

This paper considers enumerating answers to similarity-join queries under dynamic updates: Given two sets of n points A,B in ℝ^d, a metric ϕ(⋅), and a distance threshold r > 0, report all pairs of points (a, b) ∈ A × B with ϕ(a,b) ≤ r. Our goal is to store A,B into a dynamic data structure that, whenever asked, can enumerate all result pairs with worst-case delay guarantee, i.e., the time between enumerating two consecutive pairs is bounded. Furthermore, the data structure can be efficiently updated when a point is inserted into or deleted from A or B.
We propose several efficient data structures for answering similarity-join queries in low dimension. For exact enumeration of similarity join, we present near-linear-size data structures for 𝓁₁, 𝓁_∞ metrics with log^{O(1)} n update time and delay. We show that such a data structure is not feasible for the 𝓁₂ metric for d ≥ 4. For approximate enumeration of similarity join, where the distance threshold is a soft constraint, we obtain a unified linear-size data structure for 𝓁_p metric, with log^{O(1)} n delay and update time. In high dimensions, we present an efficient data structure with worst-case delay-guarantee using locality sensitive hashing (LSH).

Pankaj K. Agarwal, Xiao Hu, Stavros Sintos, and Jun Yang. Dynamic Enumeration of Similarity Joins. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 11:1-11:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{agarwal_et_al:LIPIcs.ICALP.2021.11, author = {Agarwal, Pankaj K. and Hu, Xiao and Sintos, Stavros and Yang, Jun}, title = {{Dynamic Enumeration of Similarity Joins}}, booktitle = {48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)}, pages = {11:1--11:19}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-195-5}, ISSN = {1868-8969}, year = {2021}, volume = {198}, editor = {Bansal, Nikhil and Merelli, Emanuela and Worrell, James}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2021.11}, URN = {urn:nbn:de:0030-drops-140803}, doi = {10.4230/LIPIcs.ICALP.2021.11}, annote = {Keywords: dynamic enumeration, similarity joins, worst-case delay guarantee} }

Document

APPROX

**Published in:** LIPIcs, Volume 145, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)

Given a set of points P and axis-aligned rectangles R in the plane, a point p in P is called exposed if it lies outside all rectangles in R. In the max-exposure problem, given an integer parameter k, we want to delete k rectangles from R so as to maximize the number of exposed points. We show that the problem is NP-hard and assuming plausible complexity conjectures is also hard to approximate even when rectangles in R are translates of two fixed rectangles. However, if R only consists of translates of a single rectangle, we present a polynomial-time approximation scheme. For general rectangle range space, we present a simple O(k) bicriteria approximation algorithm; that is by deleting O(k^2) rectangles, we can expose at least Omega(1/k) of the optimal number of points.

Neeraj Kumar, Stavros Sintos, and Subhash Suri. The Maximum Exposure Problem. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 145, pp. 19:1-19:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{kumar_et_al:LIPIcs.APPROX-RANDOM.2019.19, author = {Kumar, Neeraj and Sintos, Stavros and Suri, Subhash}, title = {{The Maximum Exposure Problem}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)}, pages = {19:1--19:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-125-2}, ISSN = {1868-8969}, year = {2019}, volume = {145}, editor = {Achlioptas, Dimitris and V\'{e}gh, L\'{a}szl\'{o} A.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2019.19}, URN = {urn:nbn:de:0030-drops-112344}, doi = {10.4230/LIPIcs.APPROX-RANDOM.2019.19}, annote = {Keywords: max-exposure, PTAS, densest k-subgraphs, geometric constraint removal, Network resilience} }

Document

**Published in:** LIPIcs, Volume 127, 22nd International Conference on Database Theory (ICDT 2019)

In multi-parameter decision making, data is usually modeled as a set of points whose dimension is the number of parameters, and the skyline or Pareto points represent the possible optimal solutions for various optimization problems. The structure and computation of such points have been well studied, particularly in the database community. As the skyline can be quite large in high dimensions, one often seeks a compact summary. In particular, for a given integer parameter k, a subset of k points is desired which best approximates the skyline under some measure. Various measures have been proposed, but they mostly treat the skyline as a discrete object. By viewing the skyline as a continuous geometric hull, we propose a new measure that evaluates the quality of a subset by the Hausdorff distance of its hull to the full hull. We argue that in many ways our measure more naturally captures what it means to approximate the skyline.
For our new geometric skyline approximation measure, we provide a plethora of results. Specifically, we provide (1) a near linear time exact algorithm in two dimensions, (2) APX-hardness results for dimensions three and higher, (3) approximation algorithms for related variants of our problem, and (4) a practical and efficient heuristic which uses our geometric insights into the problem, as well as various experimental results to show the efficacy of our approach.

Nirman Kumar, Benjamin Raichel, Stavros Sintos, and Gregory Van Buskirk. Approximating Distance Measures for the Skyline. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 10:1-10:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{kumar_et_al:LIPIcs.ICDT.2019.10, author = {Kumar, Nirman and Raichel, Benjamin and Sintos, Stavros and Van Buskirk, Gregory}, title = {{Approximating Distance Measures for the Skyline}}, booktitle = {22nd International Conference on Database Theory (ICDT 2019)}, pages = {10:1--10:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-101-6}, ISSN = {1868-8969}, year = {2019}, volume = {127}, editor = {Barcelo, Pablo and Calautti, Marco}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2019.10}, URN = {urn:nbn:de:0030-drops-103125}, doi = {10.4230/LIPIcs.ICDT.2019.10}, annote = {Keywords: Skyline, Pareto optimal, Approximation, Hardness, Multi-criteria decision making} }

Document

**Published in:** LIPIcs, Volume 101, 16th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2018)

We consider the problem of computing a Euclidean shortest path in the presence of removable obstacles in the plane. In particular, we have a collection of pairwise-disjoint polygonal obstacles, each of which may be removed at some cost c_i > 0. Given a cost budget C > 0, and a pair of points s, t, which obstacles should be removed to minimize the path length from s to t in the remaining workspace? We show that this problem is NP-hard even if the obstacles are vertical line segments. Our main result is a fully-polynomial time approximation scheme (FPTAS) for the case of convex polygons. Specifically, we compute an (1 + epsilon)-approximate shortest path in time O({nh}/{epsilon^2} log n log n/epsilon) with removal cost at most (1+epsilon)C, where h is the number of obstacles, n is the total number of obstacle vertices, and epsilon in (0, 1) is a user-specified parameter. Our approximation scheme also solves a shortest path problem for a stochastic model of obstacles, where each obstacle's presence is an independent event with a known probability. Finally, we also present a data structure that can answer s-t path queries in polylogarithmic time, for any pair of points s, t in the plane.

Pankaj K. Agarwal, Neeraj Kumar, Stavros Sintos, and Subhash Suri. Computing Shortest Paths in the Plane with Removable Obstacles. In 16th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 101, pp. 5:1-5:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{agarwal_et_al:LIPIcs.SWAT.2018.5, author = {Agarwal, Pankaj K. and Kumar, Neeraj and Sintos, Stavros and Suri, Subhash}, title = {{Computing Shortest Paths in the Plane with Removable Obstacles}}, booktitle = {16th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2018)}, pages = {5:1--5:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-068-2}, ISSN = {1868-8969}, year = {2018}, volume = {101}, editor = {Eppstein, David}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SWAT.2018.5}, URN = {urn:nbn:de:0030-drops-88312}, doi = {10.4230/LIPIcs.SWAT.2018.5}, annote = {Keywords: Euclidean shortest paths, Removable polygonal obstacles, Stochastic shortest paths, L\underline1 shortest paths} }

Document

**Published in:** LIPIcs, Volume 75, 16th International Symposium on Experimental Algorithms (SEA 2017)

A regret minimizing set Q is a small size representation of a much larger database P so that user queries executed on Q return answers whose scores are not much worse than those on the full dataset. In particular, a k-regret minimizing set has the property that the regret ratio between the score of the top-1 item in Q and the score of the top-k item in P is minimized, where the score of an item is the inner product of the item's attributes with a user's weight (preference) vector. The problem is challenging because we want to find a single representative set Q whose regret ratio is small with respect to all possible user weight vectors.
We show that k-regret minimization is NP-Complete for all dimensions d>=3, settling an open problem from Chester et al. [VLDB 2014]. Our main algorithmic contributions are two approximation algorithms, both with provable guarantees, one based on coresets and another based on hitting sets. We perform extensive experimental evaluation of our algorithms, using both real-world and synthetic data, and compare their performance against the solution proposed in [VLDB 14]. The results show that our algorithms are significantly faster and scalable to much larger sets than the greedy algorithm of Chester et al. for comparable quality answers.

Pankaj K. Agarwal, Nirman Kumar, Stavros Sintos, and Subhash Suri. Efficient Algorithms for k-Regret Minimizing Sets. In 16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, pp. 7:1-7:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{agarwal_et_al:LIPIcs.SEA.2017.7, author = {Agarwal, Pankaj K. and Kumar, Nirman and Sintos, Stavros and Suri, Subhash}, title = {{Efficient Algorithms for k-Regret Minimizing Sets}}, booktitle = {16th International Symposium on Experimental Algorithms (SEA 2017)}, pages = {7:1--7:23}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-036-1}, ISSN = {1868-8969}, year = {2017}, volume = {75}, editor = {Iliopoulos, Costas S. and Pissis, Solon P. and Puglisi, Simon J. and Raman, Rajeev}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2017.7}, URN = {urn:nbn:de:0030-drops-76321}, doi = {10.4230/LIPIcs.SEA.2017.7}, annote = {Keywords: regret minimizing sets, skyline, top-k query, coreset, hitting set} }

X

Feedback for Dagstuhl Publishing

Feedback submitted

Please try again later or send an E-mail