Statistical and Probabilistic Methods in Algorithmic Data Analysis
Abstract
This report documents the program and the outcomes of Dagstuhl Seminar “Statistical and Probabilistic Methods in Algorithmic Data Analysis” (24391). Modern algorithms for data analysis require the use of advanced probabilistic methods to achieve the necessary scalability and accuracy guarantees. At the same time, modern tasks of knowledge discovery from data require the use of advanced statistics to handle challenges such as the test of multiple hypotheses or dependency structure of the data points, such as in time series or graphs. Probabilistic methods are also at the core of areas of theoretical computer science such as sub-linear algorithms and average-case analysis. The application of these methods requires careful balancing of theoretical and practical considerations, to obtain efficient algorithms for data analysis. The Dagstuhl Seminar focused on statistical and probabilistic methods to develop and analyze useful, scalable algorithms for knowledge discovery from large, rich datasets. Participants from different countries, at different stages of their careers, and from both industry and academia gave talks on the topics of the seminar, usually presenting their own research, either recently published or soon-to-be. There was ample time for socializing, networking, and starting or continuing collaborations, and new results are expected to be published thanks to these collaborations.
Keywords and phrases:
approximation algorithms, data science, online algorithms, randomized algorithms, statistical data analysisSeminar:
September 22–27, 2024 – https://www.dagstuhl.de/243912012 ACM Subject Classification:
Theory of computation Approximation algorithms analysis ; Information systems Data mining ; Theory of computation Graph algorithms analysis ; Theory of computation Streaming, sublinear and near linear time algorithmsCopyright and License:
1 Executive Summary
Aris Gionis (KTH Royal Institute of Technology, Stockholm, SE, argioni@kth.se)
Matteo Riondato (Amherst College, USA,
mriondato@amherst.edu)
Eli Upfal (Brown University, Providence, USA, eliezer_upfal@brown.edu)
License:
Creative Commons BY 4.0 International license © Aris Gionis, Matteo Riondato, and Eli Upfal
The development of efficient methods for the analysis of data, including graphs, time series, and transactional datasets, is a key challenge in computer science. The availability of larger, richer datasets gathered about all aspects of human life, society, and nature, and which take a variety of forms (e.g., graphs, time series, tabular data) has created the need for such methods, which would find applications in areas such as biology, finance, and computational social sciences. It has also created challenges that must be tackled in order to develop efficient methods, such as the following.
-
Trade-offs between data, computation, and information: As datasets with trillions of points become available, how much data do we really need to (approximately) answer thousands or more exploratory analysis questions on them? We need new algorithms that carefully balance the minimum amount of data needed to obtain the desired information at the required accuracy, with the computation needed to extract it.
-
Balancing theory and practice in data science: How can we satisfy the needs for theoretical and statistical guarantees on one side and practical efficiency on the other? How do we connect concepts from different disciplines to exploit their power in useful algorithms? Developments from all areas of computer science and statistics should be used to obtain data analysis methods that are efficient from both a computational and a statistical point of view.
-
Enabling efficient and statistically-sound scientific discoveries: The practice of science requires testing multiple hypotheses on the same, often large, data. For example, the Human Protein Reference Database graph has proteins and interactions between them. Scientists want to understand the significance of small connected subgraphs in this network, representing pathways in cancer cells. There are subgraphs of size 8, each corresponding to an hypothesis. We need computationally-efficient methods that scale to trillions of hypotheses, while controlling for different measures of statistical guarantee on false discoveries, e.g., the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR).
-
Enhancing responsible and unbiased data analysis: There are numerous techniques for analyzing and processing datasets. While these techniques have overall been effective in extracting patterns in datasets, the choice of an appropriate technique and parameters for analyzing a dataset is often more art than science. Different techniques, each implementing a variety of heuristics, may give different results when applied to the same data. Moreover, many of these techniques do not provide an interpretable mechanistic explanation for their performance, and most do not provide a rigorous measure of statistical significance or robustness of the analysis. A principled approach to data analysis, based on well founded mathematical and statistical concepts, enforces objectivity and unbiased results. Such an approach enhances the effectiveness of evidence-based medicine, policy, and social applications of data analysis.
During the Dagstuhl Seminar, the participants engaged in lively presentations of recent results on these topics, ranging from the theory of machine learning, to bias reduction, to pattern mining, to approximation and online algorithms for problems arising in very different settings and with many different applications. An Open Problem session gave the participants a chance to share interesting problems for the community. While the research presentations took the majority of the schedule, there were plenty of chances for the participants to network, socialize, and engage in new and ongoing collaborations, which is, in our opinion, the real advantage of attending a Dagstuhl Seminar in person. Especially positive where the interactions between participants at different stages of their career, from institutions in different countries, and from both academia and industry.
2 Table of Contents
3 Overview of Talks
3.1 Statistical and Probabilistic Methods in Algorithmic Data Analysis
Aris Anagnostopoulos (Sapienza University of Rome, IT)
License:
Creative Commons BY 4.0 International license © Aris Anagnostopoulos
Graph neural networks (GNNs) excel in learning from network-like data but often lack interpretability, making their application challenging in domains requiring transparent decision-making. We propose the Graph Kolmogorov–Arnold Network (GKAN), a novel GNN model leveraging spline-based activation functions on edges to enhance both accuracy and interpretability.
3.2 Local and Distributed Dynamics for Community Detection
Luca Becchetti (Sapienza University of Rome, IT)
License:
Creative Commons BY 4.0 International license © Luca Becchetti
Given an underlying graph, we consider distributed dynamics that globally perform detection of planted communities. In the local communication model, we consider the following dynamics: Initially, each node locally chooses a value in , uniformly at random and independently of other nodes. Then, in each consecutive round, every node updates its local value to the average of the values held by its neighbors, at the same time applying an elementary, local clustering rule that only depends on the current and the previous values held by the node.
We prove that the process resulting from this dynamics produces a clustering that exactly or approximately (depending on the graph) reflects the underlying cut in logarithmic time, under various graph models that exhibit a sparse balanced cut, including the stochastic block model. We also prove that a natural extension of this dynamics performs community detection on a regularized version of the stochastic block model with multiple communities.
We further discuss more recent extensions of this work to different models of communication (e.g., population protocols) and more to the case of more communities.
The talk is mainly based on the following papers:
-
Becchetti, Luca, et al. Find your place: Simple distributed algorithms for community detection. SIAM Journal on Computing 49.4 (2020): 821–864.
-
Becchetti, L., Clementi, A., Manurangsi, P., Natale, E., Pasquale, F., Raghavendra, P., and Trevisan, L. (2018). Average whenever you meet: opportunistic protocols for community detection. In 26th Annual European Symposium on Algorithms: ESA 2018.
-
Becchetti, Luca, Emilio Cruciani, Francesco Pasquale, and Sara Rizzo. Step-by-step community detection in volume-regular graphs. Theoretical Computer Science 847 (2020): 49–67.
3.3 A Fully-dynamic Approximation Algorithm for Maximum Weight b-Matchings in Graphs
Fabian Brandt-Tumescheit (HU Berlin, DE)
License:
Creative Commons BY 4.0 International license © Fabian Brandt-Tumescheit
Matching nodes in a graph is a well-studied algorithmic problem with many applications. The -matching problem is a generalization that allows to match a node with up to neighbors.
This allows more flexible connectivity patterns whenever vertices may have multiple associations. The algorithm b-suitor [Khan et al., SISC 2016] is able to compute a (1/2)-approximation of a maximum weight b-matching in time. Since real-world graphs often change over time, fast dynamic methods for b-matching optimization are desirable. In this work, we propose Dyn-b-suitor, a dynamic algorithm for the weighted b-matching problem. As a non-trivial extension to the dynamic Suitor algorithm for 1-matchings [Angriman et al., JEA 2022], our approach computes (1/2)-approximate b-matchings by identifying and updating affected vertices without static recomputation. Our proposed algorithm is fully-dynamic, i.e., it supports both edge insertions and deletions, and we prove that it computes the same solution as its static counterpart. In extensive experiments on real-world benchmark graphs and generated instances, our dynamic algorithm yields significant savings compared to the sequential b-suitor, e.g., for batch updates with 103 edges with an average acceleration factor of 103x. When comparing our sequential dynamic algorithm with the parallel (static) b-suitor on a 128-core machine, our dynamic algorithm is still 59x to 104x faster.
3.4 Learning Agent-Based Models from Data
Gianmarco De Francisci Morales (CENTAI – Torino, IT)
License:
Creative Commons BY 4.0 International license © Gianmarco De Francisci Morales
Agent-Based Models (ABMs) are used in several fields to study the evolution of complex systems according to micro-level assumptions. These models encode the causal mechanisms that drive the dynamic processes and have the advantage of being easy to interpret. However, they do not take advantage of the widespread availability of data, so their predictive power is often limited. In addition, there is no principled way to do parameter calibration and model selection. Finally, ABMs typically can not estimate agent-specific (or “micro”) state variables. In this talk, I showcase a line of research aimed at learning the latent parameters and micro-level variables of an ABM from data. The key idea is to cast the ABM into a probabilistic generative model. This transformation follows two general design principles: balance of stochasticity and data availability, and replacement of unobservable discrete choices with differentiable approximations. Given this new form of the model, we can derive a proper likelihood function and proceed to maximize the likelihood of the latent variables thanks to auto-differentiation and gradient-based optimization. I will show: (i) that a maximum likelihood approach for parameter estimation of opinion dynamics models outperforms the typical simulation-based approach, (ii) better forecasting on simulations of an ABM of the housing market, in which agents with different incomes bid higher prices to live in high-income neighborhoods, (iii) a real application to social media data of an opinion dynamics ABM, and (iv) a way to bypass the need to explicitly derive the likelihood, based on variational inference. I will conclude by reflecting on the importance of scientific models in a world of black-box, deep-learning models.
3.5 Efficient Private Algorithms (for Graphs, using Graphs).
Alessandro Epasto (Google – New York, US)
License:
Creative Commons BY 4.0 International license © Alessandro Epasto
Graphs are a fundamental data structure in machine learning applications lying at the core of countless industrial applications. While traditional graph algorithms have not considered the privacy of the users providing the data, recently private graph analysis has received significant attention. In this talk, I will cover recent research in differentially private (DP) algorithms in graphs. DP is a strong notion of privacy guarantees promising plausible deniability for user data. I will cover our work on designing efficient graph algorithms with a variety of applications, including on solving the seemingly non-graph problem of sanitizing a private dataset of arbitrary items to extract as many items as possible, while respeciting user privacy.
3.6 Probing the mammalian fossil record for patterns of competitive exclusion using computational randomization experiments
Esther Galbrun (University of Eastern Finland – Kuopio, FI)
License:
Creative Commons BY 4.0 International license © Esther Galbrun
Joint work of: Esther Galbrun, Jo S. Hermansen, Indrė Žliobaitė
Due to recent common ancestry, species belonging to the same genus are expected to be more similar with respect to their phenotype, and hence exhibit less niche divergence than species belonging to different genera. Consequently, congeneric species are expected to compete intensely for resources, and therefore to be segregated in space. Yet, despite the longstanding history of this hypothesis of congeneric competitive exclusion, empirical evidence in support of it is at best limited. Here, we analyze co-occurrence patterns of species that belong to the same genera in the mammalian fossil record kept in the NOW database, considering separately Europe during the Neogene, and North America during the Oligocene–Neogene. We assess co-occurrence patterns in comparison to baselines where competitive exclusion is obfuscated through randomization. We find that congeneric species occur together notably less than would be expected at random, with large herbivores being more segregated than large carnivorans and small mammals.
3.7 Minimizing hitting time between disparate groups with shortcut edges
Aristides Gionis (KTH Royal Institute of Technology – Stockholm, SE)
License:
Creative Commons BY 4.0 International license © Aristides Gionis
Structural bias or segregation of networks refers to situations where two or more disparate groups are present in the network, so that the groups are highly connected internally, but loosely connected to each other. Examples include polarized communities in social networks, antagonistic content in video-sharing or news-feed plat- forms, etc. In many cases it is of interest to increase the connectivity of disparate groups so as to, e.g., minimize social friction, or expose individuals to diverse viewpoints. A commonly-used mechanism for increasing the network connectivity is to add edge shortcuts between pairs of nodes. In many applications of interest, edge shortcuts typically translate to recommendations, e.g., what video to watch, or what news article to read next. The problem of reducing structural bias or segregation via edge shortcuts has recently been studied in the literature, and random walks have been an essential tool for modeling navigation and connectivity in the underlying networks. Existing methods, however, either do not offer approximation guarantees, or engineer the objective so that it satisfies certain desirable properties that simplify the optimization task. We address the problem of adding a given number of shortcut edges in the network so as to directly minimize the average hitting time and the maximum hitting time between two disparate groups. The objectives we study are more natural than objectives considered earlier in the literature (e.g., maximizing hitting-time reduction) and the optimization task is significantly more challenging. Our algorithm for minimizing average hitting time is a greedy bicriteria that relies on supermodularity. In contrast, maximum hitting time is not supermodular. Despite, we develop an approximation algorithm for that objective as well, by leveraging connections with average hitting time and the asymmetric k-center problem.
3.8 Optimally Improving Cooperative Learning in a Social Setting
Shahrzad Haddadan (Rutgers Business School – Piscataway, US)
License:
Creative Commons BY 4.0 International license © Shahrzad Haddadan
We consider a cooperative learning scenario where a collection of networked agents with individually owned classifiers dynamically update their predictions, for the same classification task, through communication or observations of each other’s predictions. Clearly if highly influential vertices use erroneous classifiers, there will be a negative effect on the accuracy of all the agents in the network. We ask the following question: how can we optimally fix the prediction of a few classifiers so as maximize the overall accuracy in the entire network. To this end we consider an aggregate and an egalitarian objective function. We show a polynomial time algorithm for optimizing the aggregate objective function, and show that optimizing the egalitarian objective function is NP-hard. Furthermore, we develop approximation algorithms for the egalitarian improvement. The performance of all of our algorithms are guaranteed by mathematical analysis and backed by experiments on synthetic and real data.
3.9 Learning RUMs from Small Slates
Ravi Kumar (Google – Mountain View, US)
License:
Creative Commons BY 4.0 International license © Ravi Kumar
Joint work of: Ravi Kumar, Flacio Chierichetti, Mirko Giacchini, Alessandro Panconesi, Andrew Tomkins
A Random Utility Model (RUM) is a classical model of user behavior defined by a utility distribution. A user, presented with a subset of items from , will select the item of the subset with highest utility, according to a utility vector drawn from the distribution. In practical settings, the subset is often of small size, as in the “ten blue links” of web search.
In this talk, we will consider a learning setting with complete information on user choices from subsets of size at most k. We show that is necessary and sufficient to predict the distribution of all user choices within an arbitrarily small constant error. Based on this upper bound, we obtain new algorithms for approximate RUM learning and variations thereof.
3.10 Consistency, Robustness and Randomness
Silvio Lattanzi (Google – Barcelona, ES)
License:
Creative Commons BY 4.0 International license © Silvio Lattanzi
Joint work of: Silvio Lattanzi, Sayan Bhattacharya, Martin Costa, Hendrik Fichtenberger, Naveen Garg, Ashkan Norouzi Fard, Niko Parotsidis, Ola Svensson, Sergei Vassilvitskii
In the talks we will discuss algorithms for the dynamic k-median clustering problem. In this problem, the input data points are dynamically changing through insertions and (possibly) deletions, and the goal is to maintain a good clustering with a small number of changes to the cluster centers. In particular, we will discuss algorithms in the insertion only and fully dynamic setting and we will highlight the main ideas behind them.
3.11 The Role of Transparency in Repeated First-Price Auctions with Unknown Valuations
Stefano Leonardi (Sapienza University of Rome, IT)
License:
Creative Commons BY 4.0 International license © Stefano Leonardi
We study the problem of regret minimization for a single bidder in a sequence of first-price auctions where the bidder discovers the item’s value only if the auction is won. Our main contribution is a complete characterization, up to logarithmic factors, of the minimax regret in terms of the auction’s transparency, which controls the amount of information on competing bids disclosed by the auctioneer at the end of each auction. Our results hold under different assumptions (stochastic, adversarial, and their smoothed variants) on the environment generating the bidder’s valuations and competing bids. These minimax rates reveal how the interplay between transparency and the nature of the environment affects how fast one can learn to bid optimally in first-price auctions.
3.12 Hyperbolic Communities: Modelling Communities Beyond Cliques
Pauli Miettinen (University of Eastern Finland – Kuopio, FI)
License:
Creative Commons BY 4.0 International license © Pauli Miettinen
Joint work of: Pauli Miettinen, Rainer Gemulla, Stephen Günnemann, Sibylee Hess, Sanjar Karaev, Saskia Metzler, Stefan Neumann
Communities are most commonly modelled as (quasi-) cliques, assuming that everyone in the community knows everyone else with equal probability. In this talk I argue that this model is too simple. A better model for real-world communities can be obtained by considering the community to have a clique as a core and a tail of members who know increasingly smaller portion of the core. This model generalises many earlier models, including cliques and the core/periphery model (Borgatti & Everett, 1999).
Fitting the model to real-world Q/A-communities shows that the relative sizes of the cores of the communities stays constant (at around 16% of the community size) over time and over fluctuations in the community size.
To find the communities, the model can be expressed as a particular kind of matrix factorization: each community is an outer product of two vectors of nonnegative numbers such that (a permutation of) the values in the vectors form an arithmetic progression and this rank-1 matrix is thresholded to 0/1 matrix. To express the total graph as a union of such thresholded rank-1 matrices, the matrix multiplication must be taken over the max-times (or subtropical) algebra.
3.13 Sublinear-Time Opinion Estimation
Stefan Neumann (Technische Universität Wien, AT)
License:
Creative Commons BY 4.0 International license © Stefan Neumann
Joint work of: Stefan Neumann, Yinhao Dong, Pan Peng
Online social networks are ubiquitous parts of modern societies and the discussions that take place in these networks impact people’s opinions on diverse topics, such as politics or vaccination. One of the most popular models to formally describe this opinion formation process is the Friedkin-Johnsen (FJ) model, which allows to define measures, such as the polarization and the disagreement of a network.
Recently, Xu, Bao and Zhang (WebConf’21) showed that all opinions and relevant measures in the FJ model can be approximated in near-linear time. However, their algorithm requires the entire network and the opinions of all nodes as input. Given the sheer size of online social networks and increasing data-access limitations, obtaining the entirety of this data might, however, be unrealistic in practice.
In this paper, we show that node opinions and all relevant measures, like polarization and disagreement, can be efficiently approximated in time that is sublinear in the size of the network. Particularly, our algorithms only require query-access to the network and do not have to preprocess the graph. Furthermore, we use a connection between FJ opinion dynamics and personalized PageRank, and show that in d-regular graphs, we can deterministically approximate each node’s opinion by only looking at a constant-size neighborhood, independently of the network size. We also experimentally validate that our estimation algorithms perform well in practice.
This paper appeared at the WebConf 2024 and is joint work with Yinhao Dong and Pan Peng.
3.14 Mining Temporal Networks – A Higher-Order Temporal H-Index for Evolving Networks
Lutz Oettershagen (KTH Royal Institute of Technology – Stockholm, SE)
License:
Creative Commons BY 4.0 International license © Lutz Oettershagen
Joint work of: Lutz Oettershagen, Nils M. Kriege, Petra Mutzel
The H-index of a node in a static network is the maximum value h such that at least h of its neighbors have a degree of at least h. Recently, a generalized version, the n-th order H-index, was introduced, allowing to relate degree centrality, H-index, and the k-core of a node. We extend the n-th order H-index to temporal networks and define corresponding temporal centrality measures and temporal core decompositions. Our n-th order temporal H-index respects the reachability in temporal networks leading to node rankings, which reflect the importance of nodes in spreading processes. We derive natural decompositions of temporal networks into subgraphs with strong temporal coherence. We analyze a recursive computation scheme and develop a highly scalable streaming algorithm. Our experimental evaluation demonstrates the efficiency of our algorithms and the conceptional validity of our approach. Specifically, we show that the n-th order temporal H-index is a strong heuristic for identifying super-spreaders in evolving social networks and detects temporally well-connected components.
3.15 Differentially Private Continual Counting
Rasmus Pagh (University of Copenhagen, DK)
License:
Creative Commons BY 4.0 International license © Rasmus Pagh
The problem of continually releasing the value of a counter while ensuring privacy of individual inputs has received a lot of attention recently, in part because of applications in machine learning. This talk gives an overview of recent results on this problem, focusing on efficiency in the streaming setting. Several open problems are presented.
3.16 Scalable and Accurate Fair k-Center Clustering in Doubling Metrics
Andrea Pietracaprina (University of Padova, IT) and Geppino Pucci (University of Padova, IT)
License:
Creative Commons BY 4.0 International license © Andrea Pietracaprina and Geppino Pucci
Joint work of: Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci, Francesco Visonà
The -center problem requires the selection of points (centers) out of a given metric pointset so to minimize the maximum distance any point of from the closest center. The talk examines a fair variant of the problem, known as fair center, where points are assigned to some category and each category may contribute a limited number of points to the center set. It presents the first sliding-windows streaming algorithm for fair center in general metric spaces. At any time , the algorithm is able to provide a solution for the current window whose quality is almost as good as the one guaranteed by the best, polynomial-time sequential algorithms run on the entire window, and exhibits space and time requirements independent of the window size. Experimental evidence of the practical viability of the algorithm is also given.
3.17 Beyond the Configuration Model: Descriptive Null Models for (Hyper)graphs
Giulia Preti (CENTAI – Torino, IT)
License:
Creative Commons BY 4.0 International license © Giulia Preti
Joint work of: Giulia Preti, Adriano Fazzone, Giovanni Petri, Gianmarco De Francisci Morales, Aris Gionis, Matteo Riondato
Graph null models provide a framework for generating random graph samples that preserve specific structural properties, enabling comparison with observed networks to assess the statistical significance of observed patterns. Many null models in the literature progressively enforce stronger constraints on the graphs in the ensemble, making sampling from these models increasingly challenging. Among the various techniques proposed for this task are Markov Chain Monte Carlo (MCMC) algorithms. These algorithms generate random samples by constructing a Markov chain over the space defined by the graphs in the ensemble and ensuring convergence to the target distribution.
In this talk, I will explore the null models we developed for three distinct types of graphs: bipartite graphs modeling transactional datasets, colored multi-graphs, and directed hypergraphs. Using practical examples such as purchase networks, endorsement networks, and co-sponsorship data, we will see how selecting an appropriate null model can yield different insights into structural patterns. Additionally, I will discuss specialized double-edge swap operations such as restricted swaps and parity swaps, which are designed to ensure uniform and efficient sampling from our graph ensembles.
3.18 Drift-adaptive Clustering of Streamed Data
Geppino Pucci (University of Padova, IT)
License:
Creative Commons BY 4.0 International license © Geppino Pucci
Joint work of: Matteo Ceccarello, Alessio Mazzetto, Andrea Pietracaprina, Geppino Pucci, Eli Upfal
We present a novel technique to let streaming clustering algorithms adapt to arbitrary drifts in the distribution of the data stream. The goal is to compute, at each step, a clustering that best represents the current distribution of the data. In the streaming setting, this is traditionally attempted through the sliding-window mechanism, where the analysis is performed on the most recent fixed-size segment of the data. The problems with this approach are twofold: (1) setting the correct window size is challenging; and (2) a fixed window size cannot effectively track changes in the distribution happening at variable speed. In this talk, we focus on a family of center-based clustering methods that includes the popular -median and -means problems, and we propose a new methodology that adjusts the window size based on the drift of the data. The challenge is that it is not possible to explicitly estimate the drift, as we may have only a single data point from each distribution. Nonetheless, we prove that our methodology can adjust the window size as effectively as an algorithm that has knowledge of the magnitude of the drift. Furthermore, our solution exhibits the same multiplicative error as the best-known approximation algorithm for the same clustering problem in the no-drift case. We also demonstrate the practical effectiveness of our methodology through experiments on real data.
3.19 Bias Reduction for Sum Estimation
Will Rosenbaum (University of Liverpool, GB)
License:
Creative Commons BY 4.0 International license © Will Rosenbaum
Joint work of: Will Rosenbaum, Talya Eden, Jakob Bæk Tejs Houen, Shyam Narayanan, Jakub Tětek
In classical statistics and distribution testing, it is often assumed that elements can be sampled exactly from some distribution , and that when an element is sampled, the probability of sampling is also known. In this setting, recent work in distribution testing has shown that many algorithms are robust in the sense that they still produce correct output if the elements are drawn from any distribution that is sufficiently close to . This phenomenon raises interesting questions: under what conditions is a “noisy” distribution sufficient, and what is the algorithmic cost of coping with this noise? In this paper, we investigate these questions for the problem of estimating the sum of a multiset of real values . This problem is well-studied in the statistical literature in the case , where the Hansen-Hurwitz estimator [Annals of Mathematical Statistics, 1943] is frequently used. We assume that for some (known) distribution , values are sampled from a distribution that is pointwise close to P. That is, there is a parameter such that for all ,
For every positive integer we define an estimator for whose bias is proportional to (where our reduces to the classical Hansen-Hurwitz estimator). As a special case, we show that if is pointwise -close to uniform and all , for any , we can estimate to within additive error using samples, where . We then show that this sample complexity is essentially optimal. Interestingly, our upper and lower bounds show that the sample complexity need not vary uniformly with the desired error parameter : for some values of , perturbations in its value have no asymptotic effect on the sample complexity, while for other values, any decrease in its value results in an asymptotically larger sample complexity.
3.20 Sampling-based algorithms for approximate subgraph counting
Ilie Sarpe (KTH Royal Institute of Technology – Stockholm, SE)
License:
Creative Commons BY 4.0 International license © Ilie Sarpe
Analyzing graphs by the means of the local properties of its nodes is a fundamental task in network analysis. In this work I will discuss recent results for estimating local, triangle-based, coefficients on large graphs. Our algorithm is based on sampling and it adapts to the complexity of the graph.
3.21 Fully-Dynamic Decision Trees
Mauro Sozio (Telecom Paris, FR)
License:
Creative Commons BY 4.0 International license © Mauro Sozio
We present fully-dynamic algorithms for efficiently maintaining epsilon-feasible decision trees, which are decision trees where the splitting features and values are near-optimal (up to an additive error of epsilon). We present deterministic algorithms with polylog amortized cost as well as polylog cost in the worst case. Our algorithms are near-optimal both in terms of memory requirement and running time (up to polylog factors), while maintaining optimal splits is not possible under the SETH conjecture. We then discuss how to extend our results so as to efficiently maintain random forests in a fully dynamic environment. Our extensive experimental evaluation against the state-of-the-art shows the effectiveness of our approach.
3.22 Information Geometry for Statistical Knowledge Discovery: Patterns, Tensors, and Mode Interactions
Mahito Sugiyama (National Institute of Informatics – Tokyo, JP)
License:
Creative Commons BY 4.0 International license © Mahito Sugiyama
Joint work of: Mahito Sugiyama, Hiroyuki Nakahara, Koji Tsuda, Simon Luo, Lamiae Azizi, Kazu Ghalamkari, Yoshinobu Kawahara
I will present a unified information geometric perspective on various approaches to statistical knowledge discovery in structured spaces. These approaches include pattern (itemset) mining, matrix and tensor decomposition, and mode interaction selection.
3.23 Finding favourite tuples on data streams with provably few comparisons
Nikolaj Tatti (University of Helsinki, FI)
License:
Creative Commons BY 4.0 International license © Nikolaj Tatti
We study the problem of identifying one or more high-utility tuples by adaptively receiving user input on a minimum number of pairwise comparisons. We devise a single-pass streaming algorithm, which processes each tuple in the stream at most once, while ensuring that the memory size and the number of requested comparisons are in the worst case logarithmic in , where is the number of all tuples.
3.24 A Theory of Weak-Supervision and Zero-Shot Learning
Eli Upfal (Brown University – Providence, US)
License:
Creative Commons BY 4.0 International license © Eli Upfal
Labeled training data is often scarce, unavailable, or can be very costly to obtain. To circumvent this problem, there is a growing interest in developing methods that can exploit sources of information other than labeled data, such as weak-supervision and zero-shot learning. While these techniques obtained impressive accuracy in practice, both for vision and language domains, they come with no theoretical characterization of their accuracy. In a sequence of recent works, we develop a rigorous mathematical framework for constructing and analyzing algorithms that combine multiple sources of related data to solve a new learning task. Our learning algorithms provably converge to models that have minimum empirical risk with respect to an adversarial choice over feasible labelings for a set of unlabeled data, where the feasibility of a labeling is computed through constraints defined by rigorously estimated statistics of the sources. Notably, these methods do not require the related sources to have the same labeling space as the multiclass classification task. We demonstrate the effectiveness of our approach with experimentations on various image classification tasks.
3.25 Efficient Discovery of Significant Patterns with Few-Shot Resampling
Fabio Vandin (University of Padova, IT)
License:
Creative Commons BY 4.0 International license © Fabio Vandin
We present FSR, an efficient algorithm to identify statistically significant patterns with rigorous guarantees on the probability of false discoveries. FSR builds on a novel general framework for mining significant patterns that captures some of the most commonly considered patterns, including itemsets, sequential patterns, and subgroups. FSR uses a small number of resampled datasets, obtained by assigning i.i.d. labels to each transaction, to rigorously bound the supremum deviation of a quality statistic measuring the significance of patterns. FSR builds on novel tight bounds on the supremum deviation that require to mine a small number of resampled datasets, while providing a high effectiveness in discovering significant patterns. This is joint work with Leonardo Pellegrina.
3.26 Self-Sufficient Itemsets: A set of statistical constraints for discarding patterns that are unlikely to be interesting
Geoffrey Webb (Monash University – Clayton, AU)
License:
Creative Commons BY 4.0 International license © Geoffrey Webb
Association Discovery is a powerful but misunderstood tool for data science. When done correctly, it can expose systematic interactions within data. The results are straightforward to interpret and often actionable. However, the standard framework for association discovery, the support/confidence framework is not fit for purpose. It often fails to uncover important interactions and usually hides those it does find with a flood of false and misleading purported associations. This talk explains the value of association discovery, exposes the limitations of the support/confidence framework, and presents self-sufficient itemsets, an alternative framework that avoids those limitations.
4 Participants
-
Florian Adriaens – University of Helsinki, FI
-
Aris Anagnostopoulos – Sapienza University of Rome, IT
-
Luca Becchetti – Sapienza University of Rome, IT
-
Fabian Brandt-Tumescheit – HU Berlin, DE
-
Gianmarco De Francisci Morales – CENTAI – Torino, IT
-
Alessandro Epasto – Google – New York, US
-
Esther Galbrun – University of Eastern Finland – Kuopio, FI
-
Aristides Gionis – KTH Royal Institute of Technology – Stockholm, SE
-
Shahrzad Haddadan – Rutgers Business School – Piscataway, US
-
Ravi Kumar – Google – Mountain View, US
-
Silvio Lattanzi – Google – Barcelona, ES
-
Stefano Leonardi – Sapienza University of Rome, IT
-
Sebastian Lüderssen – TU Wien, AT
-
Alessio Mazzetto – Brown University – Providence, US
-
Pauli Miettinen – University of Eastern Finland – Kuopio, FI
-
Stefan Neumann – Technische Universität Wien, AT
-
Lutz Oettershagen – KTH Royal Institute of Technology – Stockholm, SE
-
Rasmus Pagh – University of Copenhagen, DK
-
Andrea Pietracaprina – University of Padova, IT
-
Giulia Preti – CENTAI – Torino, IT
-
Geppino Pucci – University of Padova, IT
-
Matteo Riondato – Amherst College, US
-
Will Rosenbaum – University of Liverpool, GB
-
Larry Rudolph – Two Sigma Investments LP – New York, US
-
Ilie Sarpe – KTH Royal Institute of Technology – Stockholm, SE
-
Mauro Sozio – Telecom Paris, FR
-
Mahito Sugiyama – National Institute of Informatics – Tokyo, JP
-
Nikolaj Tatti – University of Helsinki, FI
-
Eli Upfal – Brown University – Providence, US
-
Fabio Vandin – University of Padova, IT
-
Sergei Vassilvitskii – Google – New York, US
-
Yllka Velaj – Universität Wien, AT
-
Geoffrey Webb – Monash University – Clayton, AU
-
Anthony Wirth – The University of Sydney, AU