Search Results

Documents authored by Tao, Yufei


Document
Subgraph Enumeration in Optimal I/O Complexity

Authors: Shiyuan Deng and Yufei Tao

Published in: LIPIcs, Volume 290, 27th International Conference on Database Theory (ICDT 2024)


Abstract
Given a massive data graph G = (V, E) and a small pattern graph Q, the goal of subgraph enumeration is to list all the subgraphs of G isomorphic to Q. In the external memory (EM) model, it is well-known that every indivisible algorithm must perform Ω(|E|^ρ / (M^{ρ-1} B)) I/Os in the worst case, where M represents the number of words in (internal) memory, B denotes the number of words in a disk block, and ρ is the fractional edge covering number of Q. It has been a longstanding open problem to design an algorithm to match this lower bound. The state of the art is an algorithm in ICDT'23 that achieves an I/O complexity of O((|E|^ρ / (M^{ρ-1} B)) log_{M/B}(|E|/B)) with high probability. In this paper, we remove the log_{M/B}(|E|/B) factor, thereby settling the open problem when randomization is permitted.
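
To get a feel for the bounds quoted in the abstract above, the following Python sketch evaluates the expression |E|^ρ / (M^{ρ-1} B) and the log_{M/B}(|E|/B) factor carried by the ICDT'23 algorithm. The function names and the sample values of |E|, M, and B are illustrative assumptions rather than anything taken from the paper, and the constants hidden by the asymptotic notation are ignored.

```python
import math


def io_lower_bound(num_edges: float, rho: float, mem_words: float, block_words: float) -> float:
    """Evaluate |E|^rho / (M^(rho-1) * B), ignoring hidden constants."""
    return num_edges ** rho / (mem_words ** (rho - 1) * block_words)


def log_factor(num_edges: float, mem_words: float, block_words: float) -> float:
    """Evaluate log_{M/B}(|E|/B), the factor that the ICDT 2024 paper removes."""
    return math.log(num_edges / block_words, mem_words / block_words)


if __name__ == "__main__":
    # Hypothetical setting: triangle pattern (rho = 3/2), |E| = 10^6, M = 10^4, B = 10^2.
    E, M, B, rho = 1e6, 1e4, 1e2, 1.5
    base = io_lower_bound(E, rho, M, B)
    print(f"lower-bound expression:          {base:.3g} I/Os")
    print(f"with the ICDT'23 log factor:     {base * log_factor(E, M, B):.3g} I/Os")
```

With these sample values the log factor is a small constant, which suggests why removing it is mainly of theoretical significance: it closes the gap to the lower bound rather than changing the order of magnitude.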

Cite as

Shiyuan Deng and Yufei Tao. Subgraph Enumeration in Optimal I/O Complexity. In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 21:1-21:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


@InProceedings{deng_et_al:LIPIcs.ICDT.2024.21,
  author =	{Deng, Shiyuan and Tao, Yufei},
  title =	{{Subgraph Enumeration in Optimal I/O Complexity}},
  booktitle =	{27th International Conference on Database Theory (ICDT 2024)},
  pages =	{21:1--21:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-312-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{290},
  editor =	{Cormode, Graham and Shekelyan, Michael},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2024.21},
  URN =		{urn:nbn:de:0030-drops-198033},
  doi =		{10.4230/LIPIcs.ICDT.2024.21},
  annote =	{Keywords: Subgraph Enumeration, Conjunctive Queries, External Memory, Algorithms}
}
Document
Join Sampling Under Acyclic Degree Constraints and (Cyclic) Subgraph Sampling

Authors: Ru Wang and Yufei Tao

Published in: LIPIcs, Volume 290, 27th International Conference on Database Theory (ICDT 2024)


Abstract
Given a (natural) join with an acyclic set of degree constraints (the join itself does not need to be acyclic), we show how to draw a uniformly random sample from the join result in O(polymat/max{1, OUT}) expected time (assuming data complexity) after a preprocessing phase of O(IN) expected time, where IN, OUT, and polymat are the join’s input size, output size, and polymatroid bound, respectively. This compares favorably with the state of the art (Deng et al. and Kim et al., both in PODS'23), which states that, in the absence of degree constraints, a uniformly random sample can be drawn in Õ(AGM/max{1, OUT}) expected time after a preprocessing phase of Õ(IN) expected time, where AGM is the join’s AGM bound and Õ(.) hides a polylog(IN) factor. Our algorithm applies to every join supported by the solutions of Deng et al. and Kim et al. Furthermore, since the polymatroid bound is at most the AGM bound, our performance guarantees are never worse, but can be considerably better, than those of Deng et al. and Kim et al. We then utilize our techniques to tackle directed subgraph sampling, a problem that has extensive database applications and bears close relevance to joins. Let G = (V, E) be a directed data graph where each vertex has an out-degree at most λ, and let P be a directed pattern graph with a constant number of vertices. The objective is to uniformly sample an occurrence of P in G. The problem can be modeled as join sampling with input size IN = Θ(|E|) but, whenever P contains cycles, the converted join has cyclic degree constraints. We show that it is always possible to throw away certain degree constraints such that (i) the remaining constraints are acyclic and (ii) the new join has asymptotically the same polymatroid bound polymat as the old one. 
Combining this finding with our new join sampling solution yields an algorithm to sample from the original (cyclic) join (thereby yielding a uniformly random occurrence of P) in O(polymat/max{1, OUT}) expected time after O(|E|) expected-time preprocessing, where OUT is the number of occurrences.

Cite as

Ru Wang and Yufei Tao. Join Sampling Under Acyclic Degree Constraints and (Cyclic) Subgraph Sampling. In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 23:1-23:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


@InProceedings{wang_et_al:LIPIcs.ICDT.2024.23,
  author =	{Wang, Ru and Tao, Yufei},
  title =	{{Join Sampling Under Acyclic Degree Constraints and (Cyclic) Subgraph Sampling}},
  booktitle =	{27th International Conference on Database Theory (ICDT 2024)},
  pages =	{23:1--23:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-312-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{290},
  editor =	{Cormode, Graham and Shekelyan, Michael},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2024.23},
  URN =		{urn:nbn:de:0030-drops-198054},
  doi =		{10.4230/LIPIcs.ICDT.2024.23},
  annote =	{Keywords: Join Sampling, Subgraph Sampling, Degree Constraints, Polymatroid Bounds}
}
Document
Enumerating Subgraphs of Constant Sizes in External Memory

Authors: Shiyuan Deng, Francesco Silvestri, and Yufei Tao

Published in: LIPIcs, Volume 255, 26th International Conference on Database Theory (ICDT 2023)


Abstract
We present an indivisible I/O-efficient algorithm for subgraph enumeration, where the objective is to list all the subgraphs of a massive graph G := (V, E) that are isomorphic to a pattern graph Q having k = O(1) vertices. Our algorithm performs O((|E|^{k/2} / (M^{k/2-1} B)) log_{M/B}(|E|/B) + |E|^ρ / (M^{ρ-1} B)) I/Os with high probability, where ρ is the fractional edge covering number of Q (it always holds ρ ≥ k/2, regardless of Q), M is the number of words in (internal) memory, and B is the number of words in a disk block. Our solution is optimal in the class of indivisible algorithms for all pattern graphs with ρ > k/2. When ρ = k/2, our algorithm is still optimal as long as M/B ≥ (|E|/B)^ε for any constant ε > 0.
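
The remark ρ ≥ k/2 in the abstract above can be checked on small patterns. The Python sketch below computes the fractional edge covering number of a pattern graph by brute force. It relies on the standard fact that, for ordinary graphs (every edge has two endpoints), the fractional edge cover linear program has an optimal solution with weights in {0, 1/2, 1}; the function name and the vertex numbering 0..k-1 are assumptions made for illustration, and the exponential search is only sensible for constant-size patterns.

```python
from fractions import Fraction
from itertools import product


def fractional_edge_cover_number(num_vertices, edges):
    """Return the minimum total weight of a half-integral fractional edge cover.

    Each edge gets a weight in {0, 1/2, 1}; every vertex must receive total
    weight at least 1 from its incident edges. Returns None if some vertex
    cannot be covered (for example, an isolated vertex).
    """
    choices = (Fraction(0), Fraction(1, 2), Fraction(1))
    best = None
    for weights in product(choices, repeat=len(edges)):
        load = [Fraction(0)] * num_vertices
        for (u, v), w in zip(edges, weights):
            load[u] += w
            load[v] += w
        if all(x >= 1 for x in load):
            total = sum(weights, Fraction(0))
            if best is None or total < best:
                best = total
    return best


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]   # k = 3
    star = [(0, 1), (0, 2), (0, 3)]       # k = 4, centre is vertex 0
    print(fractional_edge_cover_number(3, triangle))
    print(fractional_edge_cover_number(4, star))
```

The triangle attains ρ = k/2 exactly, while the star has ρ > k/2, which corresponds to the two regimes the abstract distinguishes.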

Cite as

Shiyuan Deng, Francesco Silvestri, and Yufei Tao. Enumerating Subgraphs of Constant Sizes in External Memory. In 26th International Conference on Database Theory (ICDT 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 255, pp. 4:1-4:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


@InProceedings{deng_et_al:LIPIcs.ICDT.2023.4,
  author =	{Deng, Shiyuan and Silvestri, Francesco and Tao, Yufei},
  title =	{{Enumerating Subgraphs of Constant Sizes in External Memory}},
  booktitle =	{26th International Conference on Database Theory (ICDT 2023)},
  pages =	{4:1--4:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-270-9},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{255},
  editor =	{Geerts, Floris and Vandevoort, Brecht},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2023.4},
  URN =		{urn:nbn:de:0030-drops-177460},
  doi =		{10.4230/LIPIcs.ICDT.2023.4},
  annote =	{Keywords: Subgraph Enumeration, Conjunctive Queries, External Memory, Algorithms}
}
Document
Space-Query Tradeoffs in Range Subgraph Counting and Listing

Authors: Shiyuan Deng, Shangqi Lu, and Yufei Tao

Published in: LIPIcs, Volume 255, 26th International Conference on Database Theory (ICDT 2023)


Abstract
This paper initiates the study of range subgraph counting and range subgraph listing, both of which are motivated by the significant demands in practice to perform graph analytics on subgraphs pertinent to only selected, as opposed to all, vertices. In the first problem, there is an undirected graph G where each vertex carries a real-valued attribute. Given an interval q and a pattern Q, a query counts the number of occurrences of Q in the subgraph of G induced by the vertices whose attributes fall in q. The second problem has the same setup except that a query needs to enumerate (rather than count) those occurrences with a small delay. In both problems, our goal is to understand the tradeoff between space usage and query cost, or more specifically: (i) given a target on query efficiency, how much pre-computed information about G must we store? (ii) Or conversely, given a budget on space usage, what is the best query time we can hope for? We establish a suite of upper- and lower-bound results on such tradeoffs for various query patterns.

Cite as

Shiyuan Deng, Shangqi Lu, and Yufei Tao. Space-Query Tradeoffs in Range Subgraph Counting and Listing. In 26th International Conference on Database Theory (ICDT 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 255, pp. 6:1-6:25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


@InProceedings{deng_et_al:LIPIcs.ICDT.2023.6,
  author =	{Deng, Shiyuan and Lu, Shangqi and Tao, Yufei},
  title =	{{Space-Query Tradeoffs in Range Subgraph Counting and Listing}},
  booktitle =	{26th International Conference on Database Theory (ICDT 2023)},
  pages =	{6:1--6:25},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-270-9},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{255},
  editor =	{Geerts, Floris and Vandevoort, Brecht},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2023.6},
  URN =		{urn:nbn:de:0030-drops-177484},
  doi =		{10.4230/LIPIcs.ICDT.2023.6},
  annote =	{Keywords: Subgraph Pattern Counting, Subgraph Pattern Listing, Conjunctive Queries}
}
Document
Range Updates and Range Sum Queries on Multidimensional Points with Monoid Weights

Authors: Shangqi Lu and Yufei Tao

Published in: LIPIcs, Volume 248, 33rd International Symposium on Algorithms and Computation (ISAAC 2022)


Abstract
Let P be a set of n points in ℝ^d where each point p ∈ P carries a weight drawn from a commutative monoid (ℳ, +, 0). Given a d-rectangle r_upd (i.e., an orthogonal rectangle in ℝ^d) and a value Δ ∈ ℳ, a range update adds Δ to the weight of every point p ∈ P∩ r_upd; given a d-rectangle r_qry, a range sum query returns the total weight of the points in P ∩ r_qry. The goal is to store P in a structure to support updates and queries with attractive performance guarantees. We describe a structure of Õ(n) space that handles an update in Õ(T_upd) time and a query in Õ(T_qry) time for arbitrary functions T_upd(n) and T_qry(n) satisfying T_upd ⋅ T_qry = n. The result holds for any fixed dimensionality d ≥ 2. Our query-update tradeoff is tight up to a polylog factor subject to the OMv-conjecture.
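
To make the problem statement in the abstract above concrete, here is a naive Python reference implementation of range updates and range sum queries over a commutative monoid. It is not the structure from the paper: both operations simply scan all n points, so it only pins down the semantics. The class name, the representation of a d-rectangle as a list of closed intervals, and the choice to start every weight at the monoid identity are assumptions made for illustration.

```python
from typing import Callable, Generic, List, Sequence, Tuple, TypeVar

W = TypeVar("W")
Point = Tuple[float, ...]
Rect = Sequence[Tuple[float, float]]  # one closed interval [lo, hi] per dimension


def in_rect(p: Point, r: Rect) -> bool:
    """True if point p lies inside the axis-parallel rectangle r."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(p, r))


class NaiveRangeStructure(Generic[W]):
    """Linear-scan baseline; weights live in a commutative monoid (combine, identity)."""

    def __init__(self, points: Sequence[Point], combine: Callable[[W, W], W], identity: W):
        self.points: List[Point] = list(points)
        self.combine = combine
        self.identity = identity
        # Assumption: every point starts with the identity as its weight.
        self.weights: List[W] = [identity] * len(self.points)

    def range_update(self, r_upd: Rect, delta: W) -> None:
        """Combine delta into the weight of every point inside r_upd."""
        for i, p in enumerate(self.points):
            if in_rect(p, r_upd):
                self.weights[i] = self.combine(self.weights[i], delta)

    def range_sum(self, r_qry: Rect) -> W:
        """Combine the weights of all points inside r_qry."""
        total = self.identity
        for p, w in zip(self.points, self.weights):
            if in_rect(p, r_qry):
                total = self.combine(total, w)
        return total
```

Any commutative monoid can be plugged in, for instance integers under addition with identity 0, or floats under `max` with identity `float("-inf")`. The abstract's structure replaces these two linear scans with any pair of costs whose product is n, up to polylog factors.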

Cite as

Shangqi Lu and Yufei Tao. Range Updates and Range Sum Queries on Multidimensional Points with Monoid Weights. In 33rd International Symposium on Algorithms and Computation (ISAAC 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 248, pp. 57:1-57:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


@InProceedings{lu_et_al:LIPIcs.ISAAC.2022.57,
  author =	{Lu, Shangqi and Tao, Yufei},
  title =	{{Range Updates and Range Sum Queries on Multidimensional Points with Monoid Weights}},
  booktitle =	{33rd International Symposium on Algorithms and Computation (ISAAC 2022)},
  pages =	{57:1--57:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-258-7},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{248},
  editor =	{Bae, Sang Won and Park, Heejin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2022.57},
  URN =		{urn:nbn:de:0030-drops-173427},
  doi =		{10.4230/LIPIcs.ISAAC.2022.57},
  annote =	{Keywords: Range Updates, Range Sum Queries, Data Structures, Lower Bounds}
}
Document
Parallel Acyclic Joins with Canonical Edge Covers

Authors: Yufei Tao

Published in: LIPIcs, Volume 220, 25th International Conference on Database Theory (ICDT 2022)


Abstract
In PODS'21, Hu presented an algorithm in the massively parallel computation (MPC) model that processes any acyclic join with an asymptotically optimal load. In this paper, we present an alternative analysis of her algorithm. The novelty of our analysis is in the revelation of a new mathematical structure - which we name canonical edge cover - for acyclic hypergraphs. We prove non-trivial properties for canonical edge covers that offer us a graph-theoretic perspective about why Hu’s algorithm works.

Cite as

Yufei Tao. Parallel Acyclic Joins with Canonical Edge Covers. In 25th International Conference on Database Theory (ICDT 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 220, pp. 9:1-9:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


@InProceedings{tao:LIPIcs.ICDT.2022.9,
  author =	{Tao, Yufei},
  title =	{{Parallel Acyclic Joins with Canonical Edge Covers}},
  booktitle =	{25th International Conference on Database Theory (ICDT 2022)},
  pages =	{9:1--9:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-223-5},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{220},
  editor =	{Olteanu, Dan and Vortmeier, Nils},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2022.9},
  URN =		{urn:nbn:de:0030-drops-158838},
  doi =		{10.4230/LIPIcs.ICDT.2022.9},
  annote =	{Keywords: Joins, Conjunctive Queries, MPC Algorithms, Parallel Computing}
}
Document
Towards Optimal Dynamic Indexes for Approximate (and Exact) Triangle Counting

Authors: Shangqi Lu and Yufei Tao

Published in: LIPIcs, Volume 186, 24th International Conference on Database Theory (ICDT 2021)


Abstract
In ICDT'19, Kara, Ngo, Nikolic, Olteanu, and Zhang gave a structure which maintains the number T of triangles in an undirected graph G = (V, E) along with the edge insertions/deletions in G. Using O(m) space (m = |E|), their structure supports an update in O(√m log m) amortized time which is optimal (up to polylog factors) subject to the OMv-conjecture (Henzinger, Krinninger, Nanongkai, and Saranurak, STOC'15). Aiming to improve the update efficiency, we study:
- the optimal tradeoff between update time and approximation quality. We require a structure to provide the (ε, Γ)-guarantee: when queried, it should return an estimate t of T that has relative error at most ε if T ≥ Γ, or an absolute error at most ε ⋅ Γ, otherwise. We prove that, under any ε ≤ 0.49 and subject to the OMv-conjecture, no structure can guarantee O(m^{0.5-δ}/Γ) expected amortized update time and O(m^{2/3-δ}) query time simultaneously for any constant δ > 0; this is true for Γ = m^c of any constant c in [0, 1/2). We match the lower bound with a structure that ensures Õ((1/ε)³ ⋅ √m/Γ) amortized update time with high probability, and O(1) query time.
- (for exact counting) how to achieve arboricity-sensitive update time. For any 1 ≤ Γ ≤ √m, we describe a structure of O(min{α m + m log m, (m/Γ)²}) space that maintains T precisely, and supports an update in Õ(min{α + Γ, √m}) amortized time, where α is the largest arboricity of G in history (and does not need to be known). Our structure reconstructs the aforementioned ICDT'19 result up to polylog factors by setting Γ = √m, but achieves Õ(m^{0.5-δ}) update time as long as α = O(m^{0.5-δ}).
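
As a point of reference for the problem described in the abstract above, the Python sketch below maintains the exact triangle count T under edge insertions and deletions in the most straightforward way: on each update of an edge {u, v} it counts the common neighbours of u and v. This is a naive baseline whose update cost is proportional to the smaller of the two degrees, not the ICDT'19 structure and not the structures from this paper. The class name and the helper `satisfies_guarantee`, which spells out the (ε, Γ)-guarantee as a predicate, are assumptions made for illustration.

```python
from collections import defaultdict


class NaiveTriangleCounter:
    """Exact triangle count under edge insertions/deletions (simple undirected graph)."""

    def __init__(self):
        self.adj = defaultdict(set)
        self.triangles = 0

    def _common_neighbours(self, u, v):
        a, b = self.adj[u], self.adj[v]
        if len(a) > len(b):
            a, b = b, a  # iterate over the smaller adjacency set
        return sum(1 for w in a if w in b)

    def insert(self, u, v):
        if u == v or v in self.adj[u]:
            return  # ignore self-loops and duplicate edges
        self.triangles += self._common_neighbours(u, v)
        self.adj[u].add(v)
        self.adj[v].add(u)

    def delete(self, u, v):
        if v not in self.adj[u]:
            return  # edge is absent, nothing to do
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        self.triangles -= self._common_neighbours(u, v)


def satisfies_guarantee(estimate, true_count, eps, gamma):
    """(eps, gamma)-guarantee: relative error eps if T >= gamma, else absolute error eps * gamma."""
    if true_count >= gamma:
        return abs(estimate - true_count) <= eps * true_count
    return abs(estimate - true_count) <= eps * gamma
```

For example, inserting the six edges of a complete graph on four vertices yields a count of 4, and deleting any one edge afterwards leaves 2.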

Cite as

Shangqi Lu and Yufei Tao. Towards Optimal Dynamic Indexes for Approximate (and Exact) Triangle Counting. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 6:1-6:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


@InProceedings{lu_et_al:LIPIcs.ICDT.2021.6,
  author =	{Lu, Shangqi and Tao, Yufei},
  title =	{{Towards Optimal Dynamic Indexes for Approximate (and Exact) Triangle Counting}},
  booktitle =	{24th International Conference on Database Theory (ICDT 2021)},
  pages =	{6:1--6:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-179-5},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{186},
  editor =	{Yi, Ke and Wei, Zhewei},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2021.6},
  URN =		{urn:nbn:de:0030-drops-137146},
  doi =		{10.4230/LIPIcs.ICDT.2021.6},
  annote =	{Keywords: Triangle Counting, Data Structures, Lower Bounds, Graph Algorithms}
}
Document
A Simple Parallel Algorithm for Natural Joins on Binary Relations

Authors: Yufei Tao

Published in: LIPIcs, Volume 155, 23rd International Conference on Database Theory (ICDT 2020)


Abstract
In PODS'17, Ketsman and Suciu gave an algorithm in the MPC model for computing the result of any natural join where every input relation has two attributes. Achieving an optimal load O(m/p^{1/ρ}) - where m is the total size of the input relations, p the number of machines, and ρ the fractional edge covering number of the join - their algorithm requires 7 rounds to finish. This paper presents a simpler algorithm that ensures the same load with 3 rounds (in fact, the second round incurs only a load of O(p²) to transmit certain statistics to assist machine allocation in the last round). Our algorithm is made possible by a new theorem that provides fresh insight on the structure of the problem, and brings us closer to understanding the intrinsic reason why joins on binary relations can be settled with load O(m/p^{1/ρ}).

Cite as

Yufei Tao. A Simple Parallel Algorithm for Natural Joins on Binary Relations. In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, pp. 25:1-25:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


@InProceedings{tao:LIPIcs.ICDT.2020.25,
  author =	{Tao, Yufei},
  title =	{{A Simple Parallel Algorithm for Natural Joins on Binary Relations}},
  booktitle =	{23rd International Conference on Database Theory (ICDT 2020)},
  pages =	{25:1--25:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-139-9},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{155},
  editor =	{Lutz, Carsten and Jung, Jean Christoph},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2020.25},
  URN =		{urn:nbn:de:0030-drops-119495},
  doi =		{10.4230/LIPIcs.ICDT.2020.25},
  annote =	{Keywords: Natural Joins, Conjunctive Queries, MPC Algorithms, Parallel Computing}
}
Document
Distribution-Sensitive Bounds on Relative Approximations of Geometric Ranges

Authors: Yufei Tao and Yu Wang

Published in: LIPIcs, Volume 129, 35th International Symposium on Computational Geometry (SoCG 2019)


Abstract
A family R of ranges and a set X of points, all in ℝ^d, together define a range space (X, R|_X), where R|_X = {X ∩ h | h ∈ R}. We want to find a structure to estimate the quantity |X ∩ h|/|X| for any range h ∈ R with the (ρ, ε)-guarantee: (i) if |X ∩ h|/|X| > ρ, the estimate must have a relative error ε; (ii) otherwise, the estimate must have an absolute error ρε. The objective is to minimize the size of the structure. Currently, the dominant solution is to compute a relative (ρ, ε)-approximation, which is a subset of X with Õ(λ/(ρ ε^2)) points, where λ is the VC-dimension of (X, R|_X), and Õ hides polylog factors. This paper shows a more general bound sensitive to the content of X. We give a structure that stores O(log (1/ρ)) integers plus Õ(θ ⋅ (λ/ε^2)) points of X, where θ - called the disagreement coefficient - measures how much the ranges differ from each other in their intersections with X. The value of θ is between 1 and 1/ρ, such that our space bound is never worse than that of relative (ρ, ε)-approximations, but we improve the latter’s 1/ρ term whenever θ = o(1/(ρ log (1/ρ))). We also prove that, in the worst case, summaries with the (ρ, 1/2)-guarantee must consume Ω(θ) words even for d = 2 and λ ≤ 3. We then constrain R to be the set of halfspaces in ℝ^d for a constant d, and prove the existence of structures with o(1/(ρ ε^2)) size offering (ρ, ε)-guarantees, when X is generated from various stochastic distributions. This is the first formal justification on why the term 1/ρ is not compulsory for "realistic" inputs.

Cite as

Yufei Tao and Yu Wang. Distribution-Sensitive Bounds on Relative Approximations of Geometric Ranges. In 35th International Symposium on Computational Geometry (SoCG 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 129, pp. 57:1-57:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{tao_et_al:LIPIcs.SoCG.2019.57,
  author =	{Tao, Yufei and Wang, Yu},
  title =	{{Distribution-Sensitive Bounds on Relative Approximations of Geometric Ranges}},
  booktitle =	{35th International Symposium on Computational Geometry (SoCG 2019)},
  pages =	{57:1--57:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-104-7},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{129},
  editor =	{Barequet, Gill and Wang, Yusu},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SoCG.2019.57},
  URN =		{urn:nbn:de:0030-drops-104617},
  doi =		{10.4230/LIPIcs.SoCG.2019.57},
  annote =	{Keywords: Relative Approximation, Disagreement Coefficient, Data Summary}
}
Document
Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space

Authors: Yufei Tao

Published in: LIPIcs, Volume 98, 21st International Conference on Database Theory (ICDT 2018)


Abstract
In entity matching classification, we are given two sets R and S of objects where whether r and s form a match is known for each pair (r, s) ∈ R × S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) × D(S) → {0, 1} from a certain class satisfying the property that, for every (r, s) ∈ R × S, f(r, s) = 1 if and only if r and s are a match. Past research is accustomed to running a learning algorithm directly on all the labeled (i.e., match or not) pairs in R × S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R × S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R| + |S| + |T|, thereby achieving a large performance gain in the (typical) scenario where |T| ≪ |R||S|. This paper provides evidence on the feasibility of the new direction, by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a linear multi-dimensional plane separating the matching and non-matching pairs. We actually do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.

Cite as

Yufei Tao. Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space. In 21st International Conference on Database Theory (ICDT 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 98, pp. 20:1-20:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


@InProceedings{tao:LIPIcs.ICDT.2018.20,
  author =	{Tao, Yufei},
  title =	{{Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space}},
  booktitle =	{21st International Conference on Database Theory (ICDT 2018)},
  pages =	{20:1--20:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-063-7},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{98},
  editor =	{Kimelfeld, Benny and Amsterdamer, Yael},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2018.20},
  URN =		{urn:nbn:de:0030-drops-86057},
  doi =		{10.4230/LIPIcs.ICDT.2018.20},
  annote =	{Keywords: Entity Matching, Linear Programming, Range Counting, Dominance Join, Massively Parallel Computation}
}
Document
Invited Talk
Top-k Indexes Made Small and Sweet (Invited Talk)

Authors: Yufei Tao

Published in: LIPIcs, Volume 48, 19th International Conference on Database Theory (ICDT 2016)


Abstract
Top-k queries have become extremely popular in the database community. Such a query, which is issued on a set of elements each carrying a real-valued weight, returns the k elements with the highest weights among all the elements that satisfy a predicate. As usual, an index structure is necessary to answer a query substantially faster than accessing the whole input set. The existing research on top-k queries can be classified in two categories. The first one, which is system-oriented, aims to devise indexes that are simple to understand and easy to implement. These indexes, typically designed with heuristics, are reasonably fast in practical applications, but do not necessarily offer strong performance guarantees - in other words, they are small but not sweet. The other category, which is theory-oriented, aims to develop indexes that promise attractive bounds on the space consumption and query overhead (sometimes also update cost). These indexes, unfortunately, are often excessively sophisticated in the adopted techniques, and are rarely applied in practice - they are sweet but not small. This talk will discuss the progress of an on-going project that strives to take down the barrier between the two categories, by crafting a framework for acquiring simple top-k indexes with excellent performance guarantees - namely, small and sweet. This is achieved with reductions that produce top-k indexes automatically from the existing data structures for conventional reporting queries on unweighted elements (i.e., finding all elements satisfying a predicate), and/or the existing data structures on top-1 queries. Our reductions promise nearly no performance deterioration with respect to those existing structures, are general enough to be applicable to a huge variety of top-k problems, and work in both the external memory model and the RAM model.
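
The definition of a top-k query given in the abstract above can be written down as the index-free baseline that scans the whole input set, which is precisely the cost an index is meant to beat. The Python sketch below is such a baseline and has nothing to do with the reductions discussed in the talk; the function name and the representation of elements as (item, weight) pairs are assumptions made for illustration.

```python
import heapq
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def top_k_scan(
    elements: Iterable[Tuple[T, float]],
    predicate: Callable[[T], bool],
    k: int,
) -> List[Tuple[T, float]]:
    """Return the k highest-weight (item, weight) pairs whose item satisfies predicate.

    Scans every element once; heapq.nlargest keeps only k candidates at a time.
    """
    qualifying = (e for e in elements if predicate(e[0]))
    return heapq.nlargest(k, qualifying, key=lambda e: e[1])
```

A conventional reporting query corresponds to dropping `k` and returning every qualifying element, and a top-1 query to `k = 1`; these are the two kinds of existing structures the abstract says the reductions start from.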

Cite as

Yufei Tao. Top-k Indexes Made Small and Sweet (Invited Talk). In 19th International Conference on Database Theory (ICDT 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 48, p. 3:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)


@InProceedings{tao:LIPIcs.ICDT.2016.3,
  author =	{Tao, Yufei},
  title =	{{Top-k Indexes Made Small and Sweet}},
  booktitle =	{19th International Conference on Database Theory (ICDT 2016)},
  pages =	{3:1--3:1},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-002-6},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{48},
  editor =	{Martens, Wim and Zeume, Thomas},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2016.3},
  URN =		{urn:nbn:de:0030-drops-57725},
  doi =		{10.4230/LIPIcs.ICDT.2016.3},
  annote =	{Keywords: Data Structures, Top-k, External Memory, RAM, Reductions}
}
Document
On The I/O Complexity of Dynamic Distinct Counting

Authors: Xiaocheng Hu, Yufei Tao, Yi Yang, Shengyu Zhang, and Shuigeng Zhou

Published in: LIPIcs, Volume 31, 18th International Conference on Database Theory (ICDT 2015)


Abstract
In dynamic distinct counting, we want to maintain a multi-set S of integers under insertions to answer efficiently the query: how many distinct elements are there in S? In external memory, the problem admits two standard solutions. The first one maintains S in a hash structure, so that the distinct count can be incrementally updated after each insertion using O(1) expected I/Os. A query is answered for free. The second one stores S in a linked list, and thus supports an insertion in O(1/B) amortized I/Os. A query can be answered in O((N/B) log_{M/B}(N/B)) I/Os by sorting, where N = |S|, B is the block size, and M is the memory size. In this paper, we show that the above two naive solutions are already optimal within a polylog factor. Specifically, for any Las Vegas structure using N^{O(1)} blocks, if its expected amortized insertion cost is o(1/log B), then it must incur Ω(N/(B log B)) expected I/Os answering a query in the worst case, under the (realistic) condition that N is a polynomial of B. This means that the problem is repugnant to update buffering: the query cost jumps from 0 dramatically to almost linearity as soon as the insertion cost drops slightly below Ω(1).
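
The two standard solutions described in the abstract above can be mirrored in main memory to show where each one pays its cost. The Python sketch below is an in-memory analogue only: it does not model disk blocks or I/Os, and the class names are assumptions made for illustration. The first class does its work at insertion time, the second defers everything to the query.

```python
class HashDistinctCounter:
    """Analogue of the hash solution: pay on insert, answer queries for free."""

    def __init__(self):
        self._seen = set()
        self._distinct = 0

    def insert(self, x: int) -> None:
        if x not in self._seen:  # a hash lookup per insertion
            self._seen.add(x)
            self._distinct += 1

    def query(self) -> int:
        return self._distinct  # no work at query time


class LogDistinctCounter:
    """Analogue of the list solution: append cheaply, sort when queried."""

    def __init__(self):
        self._log = []

    def insert(self, x: int) -> None:
        self._log.append(x)  # buffered append, no lookup

    def query(self) -> int:
        ordered = sorted(self._log)
        # After sorting, an element is new exactly when it differs from its predecessor.
        return sum(1 for i, x in enumerate(ordered) if i == 0 or x != ordered[i - 1])
```

Both classes return the same answers; the lower bound in the abstract says that, in external memory, there is essentially no middle ground between these two cost profiles.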

Cite as

Xiaocheng Hu, Yufei Tao, Yi Yang, Shengyu Zhang, and Shuigeng Zhou. On The I/O Complexity of Dynamic Distinct Counting. In 18th International Conference on Database Theory (ICDT 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 31, pp. 265-276, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)


@InProceedings{hu_et_al:LIPIcs.ICDT.2015.265,
  author =	{Hu, Xiaocheng and Tao, Yufei and Yang, Yi and Zhang, Shengyu and Zhou, Shuigeng},
  title =	{{On The I/O Complexity of Dynamic Distinct Counting}},
  booktitle =	{18th International Conference on Database Theory (ICDT 2015)},
  pages =	{265--276},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-939897-79-8},
  ISSN =	{1868-8969},
  year =	{2015},
  volume =	{31},
  editor =	{Arenas, Marcelo and Ugarte, Mart{\'\i}n},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2015.265},
  URN =		{urn:nbn:de:0030-drops-49895},
  doi =		{10.4230/LIPIcs.ICDT.2015.265},
  annote =	{Keywords: distinct counting, lower bound, external memory}
}