Abstract
In this paper we introduce a general framework that exponentially improves the space, the degree of independence, and the time needed by minwise based algorithms. The authors, in SODA 2011, we introduced an exponential time improvement for minwise based algorithms by defining and constructing an almost kminwise independent family of hash functions. Here we develop an alternative approach that achieves both exponential time and exponential space improvement. The new approach relaxes the need for approximately minwise hash functions, hence gets around the Omega(log(1/epsilon)) independence lower bound in [Patrascu 2010]. This is done by defining and constructing a dkminwise independent family of hash functions. Surprisingly, for most cases only 8wise independence is needed for the additional improvement. Moreover, as the degree of independence is a small constant, our function can be implemented efficiently.
Informally, under this definition, all subsets of size d of any fixed set X have an equal probability to have hash values among the minimal k values in X, where the probability is over the random choice of hash function from the family. This property measures the randomness of the family, as choosing a truly random function, obviously, satisfies the definition for d=k=X. We define and give an efficient time and space construction of approximately dkminwise independent family of hash functions for the case where d=2, as this is sufficient for the additional exponential improvement.
We discuss how this construction can be used to improve many minwise based algorithms. To our knowledge such definitions, for hash functions, were never studied and no construction was given before.
As an example we show how to apply it for similarity and rarity estimation over data streams. Other minwise based algorithms, can be adjusted in the same way.
BibTeX  Entry
@InProceedings{feigenblat_et_al:LIPIcs:2012:3849,
author = {Guy Feigenblat and Ely Porat and Ariel Shiftan},
title = {{Exponential Space Improvement for minwise Based Algorithms}},
booktitle = {IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2012) },
pages = {7085},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783939897477},
ISSN = {18688969},
year = {2012},
volume = {18},
editor = {Deepak D'Souza and Telikepalli Kavitha and Jaikumar Radhakrishnan},
publisher = {Schloss DagstuhlLeibnizZentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2012/3849},
URN = {urn:nbn:de:0030drops38495},
doi = {10.4230/LIPIcs.FSTTCS.2012.70},
annote = {Keywords: Streaming, MinWise, Hash Functions, Similarity, On line algorithms, Sublinear algorithms}
}
Keywords: 

Streaming, MinWise, Hash Functions, Similarity, On line algorithms, Sublinear algorithms 
Seminar: 

IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2012) 
Issue Date: 

2012 
Date of publication: 

10.12.2012 