Hu, Xiaocheng ;
Tao, Yufei ;
Yang, Yi ;
Zhang, Shengyu ;
Zhou, Shuigeng
On The I/O Complexity of Dynamic Distinct Counting
Abstract
In dynamic distinct counting, we want to maintain a multiset S of integers under insertions so as to efficiently answer the query: how many distinct elements are there in S? In external memory, the problem admits two standard solutions. The first maintains S in a hash structure, so that the distinct count can be incrementally updated after each insertion using O(1) expected I/Os; a query is then answered for free. The second stores S in a linked list, and thus supports an insertion in O(1/B) amortized I/Os; a query can be answered in O((N/B) log_{M/B} (N/B)) I/Os by sorting, where N = |S|, B is the block size, and M is the memory size.
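The contrast between the two standard solutions can be illustrated with a minimal in-memory sketch. It is not an external-memory implementation and does not model I/O costs; the class names `HashCounter` and `LogCounter` are hypothetical, and Python's built-in `set` and `list` stand in for the hash structure and the linked list.

```python
class HashCounter:
    """Eager solution: pay on every insertion, answer queries for free."""

    def __init__(self):
        self._seen = set()   # stands in for the external hash structure
        self._distinct = 0   # distinct count, maintained incrementally

    def insert(self, x):
        # One lookup per insertion (O(1) expected I/Os in the external setting).
        if x not in self._seen:
            self._seen.add(x)
            self._distinct += 1

    def query(self):
        return self._distinct


class LogCounter:
    """Lazy solution: append cheaply, sort and scan only when queried."""

    def __init__(self):
        self._log = []       # stands in for the append-only linked list

    def insert(self, x):
        # Appends can be batched B at a time, hence O(1/B) amortized I/Os.
        self._log.append(x)

    def query(self):
        # Sort, then count positions where the value changes.
        ordered = sorted(self._log)
        count = 0
        for i, value in enumerate(ordered):
            if i == 0 or value != ordered[i - 1]:
                count += 1
        return count
```

Both structures return the same answer on the same insertion sequence; they differ only in where the work is done, which is the trade-off the lower bound addresses.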
In this paper, we show that the above two naive solutions are already optimal within a polylog factor. Specifically, for any Las Vegas structure using N^{O(1)} blocks, if its expected amortized insertion cost is o(1/log B), then it must incur Omega(N/(B log B)) expected I/Os answering a query in the worst case, under the (realistic) condition that N is a polynomial of B. This means that the problem is repugnant to update buffering: the query cost jumps dramatically from 0 to almost linear as soon as the insertion cost drops slightly below Omega(1).
2015
distinct counting, lower bound, external memory 
18th International Conference on Database Theory (ICDT 2015)
