
Sparsification of the Generalized Persistence Diagrams for Scalability Through Gradient Descent

Mathieu Carrière, Centre Inria d'Université Côte d'Azur, Sophia Antipolis, France; Seunghyun Kim, Department of Mathematical Sciences, KAIST, Daejeon, South Korea; Woojin Kim, Department of Mathematical Sciences, KAIST, Daejeon, South Korea
Abstract

The generalized persistence diagram (GPD) is a natural extension of the classical persistence barcode to the setting of multi-parameter persistence and beyond. The GPD is defined as an integer-valued function whose domain is the set of intervals in the indexing poset of a persistence module, and is known to be able to capture richer topological information than its single-parameter counterpart. However, computing the GPD is computationally prohibitive due to the sheer size of the interval set. Restricting the GPD to a subset of intervals provides a way to manage this complexity, compromising discriminating power to some extent. However, identifying and computing an effective restriction of the domain that minimizes the loss of discriminating power remains an open challenge.

In this work, we introduce a novel method for optimizing the domain of the GPD through gradient descent. To achieve this, we introduce a loss function tailored to optimize the selection of intervals, balancing computational efficiency and discriminative accuracy. The design of the loss function is based on the known erosion stability property of the GPD. We showcase the efficiency of our sparsification method for dataset classification in supervised machine learning. Experimental results demonstrate that our sparsification method significantly reduces the time required for computing the GPDs associated with several datasets, while maintaining classification accuracies comparable to those achieved using full GPDs. Our method thus opens the way for applying GPD-based methods at an unprecedented scale.

Keywords and phrases:
Multi-parameter persistent homology, Generalized persistence diagram, Generalized rank invariant, Non-convex optimization, Gradient descent
Funding:
Mathieu Carrière: Partially supported by ANR grant “TopModel”, ANR-23-CE23-0014, and supported by the French government, through the 3IA Côte d’Azur Investments in the project managed by the National Research Agency (ANR) with the reference number ANR-23-IACL-0001.
Woojin Kim: Partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government(MSIT) (RS-2025-00515946).
Copyright and License:
© Mathieu Carrière, Seunghyun Kim, and Woojin Kim; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Mathematics of computing → Algebraic topology; Theory of computation → Nonconvex optimization
Related Version:
Full Version: https://arxiv.org/abs/2412.05900 [11]
Supplementary Material:
Software  (Source Code): https://github.com/L-ebesgue/sparse_GPDs
  archived at Software Heritage: swh:1:dir:2147b0ae5a14e4c082811cb721798cb8974fb0b9
Acknowledgements:
This work stemmed from a conversation between M.C. and W.K. during the Dagstuhl Seminar on “Applied and Combinatorial Topology” (24092), held from February 26 to March 1, 2024. M.C. and W.K. thank the organizers of the Dagstuhl Seminar and appreciate the hospitality of Schloss Dagstuhl – Leibniz Center for Informatics. The authors are also grateful to the OPAL infrastructure from Université Côte d’Azur for providing resources and support.
Editors:
Oswin Aichholzer and Haitao Wang

1 Introduction

Persistent homology, a central tool in topological data analysis (TDA), enables the study of topological features in datasets through algebraic invariants. In the classical one-parameter setting, the persistence barcode (or equivalently, the persistence diagram) serves as a complete, discrete, and computationally tractable invariant of a persistence module. Persistent homology can be extended to multi-parameter persistent homology, which provides tools for capturing the topological features of datasets using multiple filtrations instead of just one. However, the transition to multi-parameter persistent homology introduces significant complexity into the algebraic structure of the associated persistence modules [3, 6, 9].

Nevertheless, the generalized persistence diagram (GPD) extends the notion of persistence diagram from the one-parameter to the multi-parameter setting in a natural way [20, 33]. Although the GPD has been extensively studied in terms of stability, discriminating power, computation, and generalizations (see, e.g., [1, 13, 16, 17, 20, 22, 23, 24]), the computational complexity of GPDs remains a major obstacle [19]. The primary challenge arises from the size of their domain: the domain of the complete GPD is either Int(ℝ^d), the set of all intervals in ℝ^d, or any appropriately chosen, finite subset of Int(ℝ^d); in the latter case, the domain still tends to be enormous if one wants to avoid sacrificing the discriminating power of the GPD [13]. Nevertheless, GPDs are flexible, in the sense that they remain well-defined over any finite subdomain ℐ ⊆ Int(ℝ^d), even if ℐ is small, since the GPD over ℐ is simply defined as the Möbius inversion of the generalized rank invariant (GRI) over ℐ (see Definition 1). This makes it possible to control the aforementioned complexity of GPD computation by picking a small, or sparse, subdomain ℐ. Moreover, the topological information loss due to using such sparse subdomains can be mitigated by looking for relevant intervals, i.e., intervals that are rich in topological content; indeed, even with a substantially small subdomain ℐ ⊆ Int(ℝ^d), the GPD over ℐ can still be finer than other traditional invariants of multi-parameter persistence modules, such as the rank invariant (RI) [9]; see [13] for details. However, how to design such subdomains in the “best” way, so as to reduce computational cost while maintaining the discriminating power of the GPD, is a question that has not been much explored so far.

Sparsification of the GPD via gradient descent.

Motivated by the computational challenges of the GPD and the flexibility of this invariant upon selecting subsets of intervals as its domain, we propose a gradient-descent-based method for automatically sparsifying GPDs computed from ℝ²-persistence modules. Namely, we consider the following scenario.

Suppose we aim at classifying instances of a given dataset based on their topological features, and we have already computed a set of corresponding persistence modules {M_i : ℝ² → vect_k}_{1≤i≤t} from the dataset, where each persistence module corresponds to an individual data point. We consider the set {dgm^ℐ_{M_i}}_{1≤i≤t} of GPDs of the M_i over a large set ℐ of intervals in ℝ² (cf. Definition 1). We refer to these as the full GPDs.

Let n ≥ 1 be the cardinality of ℐ, and let m ∈ ℕ = {1, 2, …} be a sparsification parameter, typically significantly smaller than n. Consider the set of m-subsets of Int(ℝ²), i.e., {𝒥 ⊆ Int(ℝ²) : |𝒥| = m}. In order to identify an m-subset 𝒥 of intervals in ℝ² over which the new GPDs are computed (and subsequently used to classify the persistence modules {M_i}_{1≤i≤t}), we proceed as follows. Firstly, we identify a loss function defined on the set of m-subsets of Int(ℝ²):

ℒ_{d, m, {M_i}_{1≤i≤t}}(𝒥) := Σ_{i=1}^{t} d(dgm^𝒥_{M_i}, dgm^ℐ_{M_i}),   (1)

where d is an appropriate dissimilarity function. Secondly, we search for a minimizer of the loss function. The goal of this search is to identify a subset 𝒥 of m intervals such that the sparse GPDs {dgm^𝒥_{M_i}}_{1≤i≤t}, i.e., the GPDs of the M_i over this sparse subset, best approximate their full counterparts overall. One natural way of searching for a (local) minimizer is through gradient descent, starting from a randomly chosen m-subset 𝒥_init of ℐ. To achieve this, the following requirements are either mandatory or highly desirable:

(I) (Distance) A suitable distance or dissimilarity function d, used in constructing the loss function above, ideally satisfying certain stability guarantees w.r.t. the interleaving distance between persistence modules [12, 25];

(II) (Vectorization) A representation of the loss function ℒ_{d, m, {M_i}_{1≤i≤t}} as a map defined on some subset 𝒟 of Euclidean space;

(III) (Convexity) Convexity of the subset 𝒟;

(IV) (Loss regularity) Lipschitz stability and differentiability of ℒ_{d, m, {M_i}_{1≤i≤t}}; and

(V) (Feasibility) Computational feasibility of ℒ_{d, m, {M_i}_{1≤i≤t}} for practical implementation.

Our contributions can be listed according to Items (I)-(V) mentioned above.

Summary of contributions.

  • A natural choice for d in Item (I) would be the erosion distance d_E [13, 33], a standard metric between GPDs. (Note that d_E is also referred to as a metric between generalized rank invariants (GRIs), e.g., in [13, 21]. Since the GPD and the GRI determine one another (cf. Remark 2 (i) and (iii)), d_E can thus also be viewed as a metric between GPDs, as in [33, 36].) This metric is known to be stable under perturbations of the input persistence modules w.r.t. the interleaving distance. However, the erosion distance requires the two GPDs to be defined on the same set of intervals (Definition 3), and that set must be closed under thickening, which implies the domain must be infinite. All of this makes it difficult to directly utilize d_E. Hence, we introduce the sparse erosion distance d̂_E between GPDs relative to (possibly) different interval sets, as an adaptation of d_E (Definition 4, Proposition 6 (ii), and Corollary 7). (Another possibility would be to use bottleneck and Wasserstein distances; however, we show in Remark 17 in the full version [11] that they fail to ensure continuity of the loss function.)

  • Regarding Item (V), we restrict our focus to intervals in ℝ² with a small number of minimal and maximal points (Remark 10). Then, if the sparse erosion distance d̂_E is computed between GPDs of the same persistence module M, we prove that d̂_E depends solely on the domains ℐ and 𝒥, and not on the input persistence module, i.e., d̂_E((dgm_M, ℐ), (dgm_M, 𝒥)) = d̂(ℐ, 𝒥) (Proposition 6 (iii)). This fact significantly enhances the tractability of gradient descent, as it allows us to avoid recomputing the GPDs at every iteration. Moreover, we derive a closed-form formula for the computation of d̂(ℐ, 𝒥) (Theorem 9 (i)).

    Figure 1: Any (2,1)-interval I of ℝ², as depicted above, is represented by 𝐯_I = (x, y, a, b, c, d) ∈ ℝ⁶. Any (1,1)-interval can also be represented by (x, y, a, b, c, d) ∈ ℝ⁶ with b = c = 0.
    Figure 2: Our pipeline for sparsifying GPDs in the context of time series classification. The sizes of the GPD points in ℝ⁶ are proportional to their multiplicities.
  • For achieving Items (II), (III), and (IV), we only consider intervals with at most two minimal points and one maximal point. This ensures the existence of natural embeddings of these intervals into the Euclidean space ℝ⁶, obtained by stacking the coordinates of the interval middle point (x, y) together with the lengths a, b, c, d ≥ 0 needed to define the interval boundaries (Figure 1). The main advantages of this vectorization method (w.r.t. the other natural ones) are that (1) it allows for a simple formulation of the loss function (Theorem 9 (ii)), and (2) its variables are all independent from each other, which makes the implementation of gradient descent easier (Remarks 10 and 11). (Alternatively, we could consider intervals with one minimal point and at most two maximal points. Either choice ensures that our restricted GPDs are not a weaker invariant than the rank invariant or the signed barcode [7]; see [13, Section 4].) Using this vectorization method, we also prove that 𝒟 not only forms a convex subset of the Euclidean space ℝ^{6m} (Proposition 13), but also ensures the Lipschitz stability and almost everywhere differentiability of the loss function ℒ_{d, m, {M_i}_{1≤i≤t}} (Theorem 14 and Proposition 16). In fact, we prove that the loss function can actually be realized as

    ℒ_{d̂_E, m} : 𝒟 (⊆ ℝ^{6m}) → ℝ, 𝕧_𝒥 ↦ t · d̂(ℐ, 𝒥),   (2)

    where 𝕧_𝒥 is our 6m-dimensional embedding of 𝒥. Since t (the size of the dataset) is a constant, the (local) minimizers of ℒ_{d̂_E, m} coincide with those of the map t^{−1} · ℒ_{d̂_E, m} : 𝕧_𝒥 ↦ d̂(ℐ, 𝒥). Hence, searching for a (local) minimizer of ℒ_{d̂_E, m} is essentially searching for the (locally) best m intervals that represent the domain ℐ of the full GPDs.

  • Finally, regarding Item (V) again, and in order to showcase the efficiency of our proposed sparsification method, we provide numerical experiments on topology-based time series classification with supervised machine learning (Section 5 and Figure 2), in which we show that the sparse GPDs {dgm^𝒥_{M_i}}_{1≤i≤t} can be computed much faster than the full GPDs, while maintaining similar or better classification performances with random forest models. Our code is fully available at https://github.com/L-ebesgue/sparse_GPDs.

Comparison with other works.

While many existing works utilize multi-parameter persistent homology to enhance the performance of machine learning models [10, 14, 27, 29, 32, 34, 35, 37], our work takes a different approach, as it focuses on methods to mitigate the computational overhead associated with multi-parameter persistence descriptors.

The works most closely related to ours are [29] and [37], which use the RI or GRI of multi-parameter persistence modules in a machine learning context. Firstly, [29] is based on an equivalent representation of the GPDs computed from rectangle intervals (these types of GPDs are often called signed barcodes) as signed measures [7], which allows comparing GPDs with optimal transport distances, as well as deploying known vectorization techniques intended for general measures. Secondly, the approach proposed in [37] involves vectorizing the GRIs of 2-parameter persistence modules by evaluating them on intervals with specific shapes called worms. Our goal is different: we rather aim at vectorizing and sparsifying the domains of the GPDs in order to achieve good performance scores. In fact, the sparsification process that we propose can be used complementarily to both [29] and [37], by first sparsifying the set of rectangles or worms before applying their vectorization methods. Note that differentiability properties of both of these approaches w.r.t. the multi-parameter filtrations were recently established [32, 34]; in contrast, our work deals with the differentiability of a GPD-based loss function w.r.t. the interval domains (while keeping the multi-parameter filtrations fixed).

Organization.

Section 2 reviews basic properties of the GPD and GRI. Section 3 introduces the sparse erosion distance, clarifies its relation to the erosion distance given in [13], and presents a closed-form formula for it that is specialized to, and useful in, our setting. Section 4 establishes the Lipschitz stability and differentiability of our loss function. Section 5 presents our numerical experiments. Finally, Section 6 discusses future research directions.

2 Preliminaries

In this article, P = (P, ≤) stands for a poset, regarded as the category whose objects are the elements of P, and in which, for any pair p, q ∈ P, there exists a unique morphism p → q if and only if p ≤ q. All vector spaces in this article are over a fixed field k. Let vect_k denote the category of finite-dimensional vector spaces and linear maps over k. A functor P → vect_k will be referred to as a (P-)persistence module. The direct sum of any two P-persistence modules is defined pointwise. A module M : P → vect_k is trivial if M(x) = 0 for all x ∈ P. If a nontrivial M is not isomorphic to a direct sum of any two nontrivial persistence modules, M is indecomposable. Every persistence module decomposes into a direct sum of indecomposable modules, uniquely determined up to isomorphism [2, 4].

An interval I of P is a subset I ⊆ P such that:

(i) I is nonempty.

(ii) If p, q ∈ I and p ≤ r ≤ q, then r ∈ I.

(iii) I is connected, i.e., for any p, q ∈ I, there is a sequence p = p_0, p_1, …, p_ℓ = q of elements of I with p_i and p_{i+1} comparable for 0 ≤ i ≤ ℓ − 1.

By Int(P) we denote the set of all intervals of P.

Given any I ∈ Int(P), the interval module k_I is the P-persistence module with

(k_I)(p) := k if p ∈ I, and 0 otherwise;   k_I(p ≤ q) := id_k if p, q ∈ I, and 0 otherwise.

Every interval module is indecomposable [5, Proposition 2.2]. A P-persistence module M is interval-decomposable if it is isomorphic to a direct sum ⊕_{j∈J} k_{I_j} of interval modules. In this case, the barcode of M is defined as the multiset barc(M) := {k_{I_j} : j ∈ J}.

For p ∈ P, let p↑ denote the set of points q ∈ P such that p ≤ q. Clearly, p↑ belongs to Int(P). A P-persistence module is finitely presentable if it is isomorphic to the cokernel of a morphism ⊕_{a∈A} k_{a↑} → ⊕_{b∈B} k_{b↑}, where A and B are finite multisets of elements of P.

When P is a connected poset, the (generalized) rank of M, denoted by rank(M), is defined as the rank of the canonical linear map from the limit of M to the colimit of M, which is a nonnegative integer [20]. This isomorphism invariant of P-persistence modules, which takes a single integer value, can be refined into an integer-valued function as follows. Let ℐ be any nonempty subset of Int(P). The generalized rank invariant (GRI) of M over ℐ is the map rk^ℐ_M : ℐ → ℤ_{≥0} given by I ↦ rank(M|_I), where M|_I is the restriction of M to I [20]. When ℐ = Int(P), we denote rk^ℐ_M simply as rk_M.

The generalized persistence diagram (GPD) of M over ℐ captures the changes of the GRI values as I varies over ℐ. Its formal definition follows.

Definition 1 ([13]).

The generalized persistence diagram (GPD) of M over ℐ is defined as the function dgm^ℐ_M : ℐ → ℤ that satisfies (the condition in Equation (3) is a generalization of the fundamental lemma of persistent homology [18]):

for all I ∈ ℐ,   rk^ℐ_M(I) = Σ_{J ∈ ℐ, J ⊇ I} dgm^ℐ_M(J).   (3)
 Remark 2 ([13, Sections 2 and 3] and [7]).

(i) If ℐ is finite, then dgm^ℐ_M exists.

(ii) If dgm^ℐ_M exists, then it is unique.

(iii) If dgm^ℐ_M exists, then dgm^ℐ_M and rk^ℐ_M determine one another.

(iv) (Monotonicity) rk^ℐ_M(I) ≥ rk^ℐ_M(J) for any pair I ⊆ J in ℐ.

(v) (The GPD generalizes the barcode) Let ℐ = Int(P). If M is interval decomposable, then dgm^ℐ_M exists. In this case, for any I ∈ Int(P), dgm^ℐ_M(I) coincides with the multiplicity of I in barc(M). Also, dgm^ℐ_M often exists even when M is not interval decomposable.

(vi) If M is a finitely presentable ℝ^d-persistence module, then the GPD over Int(ℝ^d) exists [13, Theorem C(iii)].

In the rest of the article, every ℝ^d-persistence module M is assumed to be finitely presentable; thus its GPD over Int(ℝ^d) exists, and is denoted by dgm_M.
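To make Equation (3) concrete, the following minimal Python sketch recovers the GPD from the GRI over a finite interval set by Möbius inversion, processing intervals from the largest to the smallest. The representation of intervals as frozensets of poset elements and all names are purely illustrative, not taken from the released code.

```python
def gpd_from_gri(intervals, rk):
    """Recover dgm^I_M from rk^I_M over a finite interval set (Equation (3)).

    `intervals` is a list of intervals of a finite poset, each given as a
    frozenset of poset elements; `rk` maps each interval to its generalized
    rank. Existence of the GPD is guaranteed since the set is finite
    (Remark 2 (i)).
    """
    # Process intervals from largest to smallest, so that dgm is already known
    # for every strict superset J of I when I is processed.
    dgm = {}
    for I in sorted(intervals, key=len, reverse=True):
        dgm[I] = rk[I] - sum(dgm[J] for J in dgm if I < J)  # I < J: strict containment
    return dgm

# Toy example on the poset {0 < 1 < 2}, for a module isomorphic to
# k_{{0,1}} (+) k_{{0,1,2}}; its GRI over the three intervals below is:
I1, I2, I3 = frozenset({0, 1}), frozenset({1, 2}), frozenset({0, 1, 2})
rk = {I1: 2, I2: 1, I3: 1}
print(gpd_from_gri([I1, I2, I3], rk))  # multiplicities: I1 -> 1, I2 -> 0, I3 -> 1
```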

3 Sparse erosion distance between GPDs relative to sampled intervals

We adapt the notion of erosion distance to define a distance between GPDs relative to (possibly different) sampled intervals. When comparing the same GPD with two different sampled intervals, this distance simplifies to a distance between sampled intervals. This is relevant in our case, as we compare the full and sparse GPDs of the same persistence module.

3.1 Sparse erosion distance

In this section, we review the definition of the erosion distance and adapt it to define the sparse erosion distance between GPDs relative to sampled intervals.

For ϵ ∈ ℝ, the vector ϵ·(1, …, 1) ∈ ℝ^d will be simply denoted by ϵ whenever there is no risk of confusion. For I ∈ Int(ℝ^d) and ϵ ≥ 0, we consider the ϵ-thickening I^ϵ := ∪_{p∈I} B_ϵ(p) of I, where B_ϵ(p) stands for the closed ϵ-ball around p w.r.t. the supremum distance, i.e., B_ϵ(p) = {q ∈ ℝ^d : p − ϵ ≤ q ≤ p + ϵ}. See Figure 3. A subset ℐ ⊆ Int(ℝ^d) is said to be closed under thickening if for all I ∈ ℐ and for all ϵ ≥ 0, the interval I^ϵ belongs to ℐ.

Figure 3: The ϵ-thickening I^ϵ of an interval I.
Definition 3 ([13, Definition 5.2]).

Let M and N be ℝ^d-persistence modules. Let ℐ be any subset of Int(ℝ^d) that is closed under thickening. The erosion distance between dgm^ℐ_M and dgm^ℐ_N (and, equivalently, between rk^ℐ_M and rk^ℐ_N, by Remark 2 (iii)) is

d_E(dgm^ℐ_M, dgm^ℐ_N) := inf{ ϵ > 0 : for all I ∈ ℐ, rk^ℐ_N(I^ϵ) ≤ rk^ℐ_M(I) and rk^ℐ_M(I^ϵ) ≤ rk^ℐ_N(I) }.

A correspondence between nonempty sets A and B is a subset R ⊆ A × B satisfying the following: (1) for each a ∈ A, there exists b ∈ B such that (a, b) ∈ R, and (2) for each b ∈ B, there exists a ∈ A such that (a, b) ∈ R. For ϵ ≥ 0, an ϵ-correspondence between nonempty ℐ, 𝒥 ⊆ Int(ℝ^d) is a correspondence R ⊆ ℐ × 𝒥 such that for all (I, J) ∈ R, J ⊆ I^ϵ and I ⊆ J^ϵ. Blending the ideas of the Hausdorff and erosion distances, we obtain our new distance between GPDs relative to sampled intervals:

Definition 4 (Sparse erosion distance between GPDs relative to sampled intervals).

For any M, N : ℝ^d → vect_k and any nonempty ℐ, 𝒥 ⊆ Int(ℝ^d), the sparse erosion distance between the pairs (dgm_M, ℐ) and (dgm_N, 𝒥) is

d̂_E((dgm_M, ℐ), (dgm_N, 𝒥)) := inf{ ϵ > 0 : there exists an ϵ-correspondence R ⊆ ℐ × 𝒥 s.t. for all (I, J) ∈ R and all δ ≥ 0, rk_N(J^{ϵ+δ}) ≤ rk_M(I^δ) and rk_M(I^{ϵ+δ}) ≤ rk_N(J^δ) }.

We remark that d̂_E((dgm_M, ℐ), (dgm_N, 𝒥)) captures not only (1) the algebraic difference between [M with respect to ℐ] and [N with respect to 𝒥], but also (2) the geometric difference between the domains ℐ and 𝒥. To see (1), consider, for example, the ℝ-persistence modules M = k_{[0,1]} and N = k_{[0,2]}, and let both ℐ and 𝒥 be the singleton set {[3,4]}. Then, one can see that d̂_E((dgm_M, ℐ), (dgm_N, 𝒥)) = 0 from the fact that rk^ℐ_M = rk^𝒥_N = 0 and the monotonicity of rk_M and rk_N. To see (2), let M and N be any isomorphic ℝ-persistence modules, and set ℐ := {[0, 1/2]} and 𝒥 := {[0, 1]}. Then, we obtain d̂_E((dgm_M, ℐ), (dgm_N, 𝒥)) = 1/2, solely due to the difference between ℐ and 𝒥.

Proposition 5.

d̂_E is an extended pseudometric. (See the full version [11] for the proof.)

For any nonempty ℐ, 𝒥 ⊆ Int(ℝ^d), let

d̂(ℐ, 𝒥) := inf{ ϵ > 0 : there exists an ϵ-correspondence between ℐ and 𝒥 }.   (4)

We clarify the relationship among d_E, d̂_E, and d̂:

Proposition 6.

Let M, N : ℝ^d → vect_k and let ℐ, 𝒥 ⊆ Int(ℝ^d) be nonempty. We have:

(i) d̂(ℐ, 𝒥) ≤ d̂_E((dgm_M, ℐ), (dgm_N, 𝒥)).

(ii) If ℐ = 𝒥 is closed under thickening, then

d̂_E((dgm_M, ℐ), (dgm_N, 𝒥)) ≤ d_E(dgm^ℐ_M, dgm^ℐ_N).   (5)

(iii) If M ≅ N or dgm_M = dgm_N, then

d̂_E((dgm_M, ℐ), (dgm_N, 𝒥)) = d̂(ℐ, 𝒥).   (6)

(See the full version [11] for the proof.)

We remark that the inequality given in Item (i) can be strict. For instance, let d = 1, ℐ = 𝒥 = Int(ℝ), M = 0, and N = k_{[0,1)}. Then, 0 = d̂(ℐ, 𝒥) < d̂_E((dgm_M, ℐ), (dgm_N, 𝒥)).

By Proposition 6 (ii) and the prior stability result of dE [13, Theorem H], we have:

Corollary 7.

For any M, N : ℝ^d → vect_k and for any ℐ ⊆ Int(ℝ^d), we have

d̂_E((dgm_M, ℐ), (dgm_N, ℐ)) ≤ d_I(M, N),

where the right-hand side is the interleaving distance between M and N.

3.2 Closed-form formula for the sparse erosion distance

In this section, we derive a closed-form formula for the right-hand side of Equation (6), i.e., for d̂(ℐ, 𝒥), when each interval in ℐ and 𝒥 has only finitely many minimal and maximal points (Theorem 9). This result is essential for studying our loss function in Section 4 and for optimizing the domains of the GPDs in Section 5.

For integers p, q ≥ 1, an interval I ∈ Int(ℝ²) is a (p, q)-interval if I has exactly p minimal points and exactly q maximal points. For any interval I ∈ Int(ℝ²), let min(I) (resp. max(I)) be the set of all minimal (resp. maximal) elements of I. For (x, y) ∈ ℝ², define

δ(x, y) := 1 if x ≤ y, and 0 if x > y.   (9)
Lemma 8.

For p_r, q_r, p_s, q_s ∈ ℕ, let I_r, J_s ∈ Int(ℝ²) be a (p_r, q_r)-interval and a (p_s, q_s)-interval, respectively. Then, for any ϵ ≥ 0, J_s ⊆ I_r^ϵ and I_r ⊆ J_s^ϵ if and only if

ϵ ≥ max( max_k a_k, max_l b_l, max_i a_i, max_j b_j ),   where

  • i, j, k, l range from 1 to p_r, q_r, p_s, q_s, respectively; a_k and b_l are defined as in (10) and (11), respectively; a_i and b_j are obtained by interchanging the roles of i and k in (10), and those of j and l in (11), respectively;

  • min(I_r) =: {(x_i^r, y_i^r) : 1 ≤ i ≤ p_r}, min(J_s) =: {(x_k^s, y_k^s) : 1 ≤ k ≤ p_s}, max(I_r) =: {(X_j^r, Y_j^r) : 1 ≤ j ≤ q_r}, max(J_s) =: {(X_l^s, Y_l^s) : 1 ≤ l ≤ q_s}.

a_k := min_i ( max( (1 − δ(x_i^r, x_k^s)) · |x_k^s − x_i^r|, (1 − δ(y_i^r, y_k^s)) · |y_k^s − y_i^r| ) )   (10)

b_l := min_j ( max( δ(X_j^r, X_l^s) · |X_l^s − X_j^r|, δ(Y_j^r, Y_l^s) · |Y_l^s − Y_j^r| ) )   (11)

(See the full version [11] for the proof.) Let ℐ := {I_r}_{r=1}^{n} and 𝒥 := {J_s}_{s=1}^{m} be sets of intervals of ℝ² with only finitely many minimal and maximal points. Consider the (n × m)-matrix (ϵ_rs), where ϵ_rs is the RHS of the inequality given in Lemma 8, i.e.,

ϵ_rs = max( max_k a_k, max_l b_l, max_i a_i, max_j b_j )   (12)

with i, j, k, l, a_k, b_l, a_i, b_j as defined in Lemma 8. Next, we show that d̂(ℐ, 𝒥), as defined in Equation (4), equals the largest of the smallest elements across all rows and columns of this matrix. In particular, the following variables and functions are useful for describing the closed-form formula for d̂(ℐ, 𝒥) when each of ℐ and 𝒥 consists solely of (1,1)- or (2,1)-intervals.

When I_r and J_s are (2,1)-intervals, as depicted in Figure 4, let

x_1 := x_2^r,  y_1 := y_1^r,  a := Y_1^r − y_1^r,  b := x_2^r − x_1^r,
c := y_1^r − y_2^r,  d := X_1^r − x_2^r,  x_2 := x_2^s,  y_2 := y_1^s,
e := Y_1^s − y_1^s,  f := x_2^s − x_1^s,  g := y_1^s − y_2^s,  h := X_1^s − x_2^s.

Figure 4: The parametrization of I_r and J_s.

When I_r (resp. J_s) is a (1,1)-interval, set x_1^r = x_2^r and y_1^r = y_2^r, and thus b = c = 0 (resp. x_1^s = x_2^s and y_1^s = y_2^s, and thus f = g = 0). Also, let

F(w_1, w_2, w_3, w_4) := max( δ(w_1, w_2) · |w_2 − w_1|, δ(w_3, w_4) · |w_4 − w_3| ),
G(m_1, m_2, m_3, m_4, m_5) := min( F(m_1, m_2, m_3, m_4), m_5 ),
H(o_1, o_2, o_3, o_4) := max( min(o_1, o_2), min(o_3, o_4) ),

where all input variables of F,G and H are real numbers.

Theorem 9.

Let ℐ := {I_r}_{r=1}^{n} and 𝒥 := {J_s}_{s=1}^{m} be sets of intervals of ℝ² with only finitely many minimal and maximal points.

(i) We have d̂(ℐ, 𝒥) = max( max_r (min_s ϵ_rs), max_s (min_r ϵ_rs) ), where ϵ_rs is as defined in Equation (12) and I_r (resp. J_s) is a (p_r, q_r)-interval (resp. (p_s, q_s)-interval).

(ii) If ℐ and 𝒥 consist solely of (1,1)- or (2,1)-intervals, then ϵ_rs is the maximum of (13)-(16):

H( F(x_2 − f, x_1 − b, y_2, y_1), F(x_2 − f, x_1, y_2, y_1 − c), F(x_2, x_1 − b, y_2 − g, y_1), F(x_2, x_1, y_2 − g, y_1 − c) )   (13)

F( x_1 + d, x_2 + h, y_1 + a, y_2 + e )   (14)

H( F(x_1 − b, x_2 − f, y_1, y_2), F(x_1 − b, x_2, y_1, y_2 − g), F(x_1, x_2 − f, y_1 − c, y_2), F(x_1, x_2, y_1 − c, y_2 − g) )   (15)

F( x_2 + h, x_1 + d, y_2 + e, y_1 + a )   (16)

(See the full version [11] for the proof.)
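The closed-form formula above can be transcribed directly into code. The following NumPy sketch (all names are illustrative; intervals are assumed to be given by their vectors 𝐯_I = (x, y, a, b, c, d) of Figure 1) computes ϵ_rs via Theorem 9 (ii) and d̂(ℐ, 𝒥) via Theorem 9 (i).

```python
import numpy as np

def delta(x, y):
    """Indicator of x <= y, cf. Equation (9)."""
    return 1.0 if x <= y else 0.0

def F(w1, w2, w3, w4):
    return max(delta(w1, w2) * abs(w2 - w1), delta(w3, w4) * abs(w4 - w3))

def H(o1, o2, o3, o4):
    return max(min(o1, o2), min(o3, o4))

def eps_rs(vI, vJ):
    """epsilon_rs of Theorem 9 (ii) for two (1,1)- or (2,1)-intervals.

    vI = (x1, y1, a, b, c, d) and vJ = (x2, y2, e, f, g, h) as in Figure 1;
    a (1,1)-interval has b = c = 0 (resp. f = g = 0).
    """
    x1, y1, a, b, c, d = vI
    x2, y2, e, f, g, h = vJ
    t13 = H(F(x2 - f, x1 - b, y2, y1), F(x2 - f, x1, y2, y1 - c),
            F(x2, x1 - b, y2 - g, y1), F(x2, x1, y2 - g, y1 - c))
    t14 = F(x1 + d, x2 + h, y1 + a, y2 + e)
    t15 = H(F(x1 - b, x2 - f, y1, y2), F(x1 - b, x2, y1, y2 - g),
            F(x1, x2 - f, y1 - c, y2), F(x1, x2, y1 - c, y2 - g))
    t16 = F(x2 + h, x1 + d, y2 + e, y1 + a)
    return max(t13, t14, t15, t16)

def d_hat(Is, Js):
    """d_hat(I, J) via Theorem 9 (i): largest of the smallest entries over rows and columns."""
    E = np.array([[eps_rs(vI, vJ) for vJ in Js] for vI in Is])
    return max(E.min(axis=1).max(), E.min(axis=0).max())
```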

 Remark 10.

In Theorem 9 (i), as p_r, q_r, p_s, and q_s increase, the complexity of computing ϵ_rs increases. This is because the ranges over which the maxima and minima on the RHS of Equation (12) are taken grow, thereby increasing the computational complexity of d̂(ℐ, 𝒥) as well.

4 Lipschitz stability and differentiability of the loss function

The goal of this section is to establish the Lipschitz continuity/stability and (almost everywhere) differentiability of our loss function (cf. Equation (2)). From Figure 1, recall that we denote the embedding of a (1,1)- or (2,1)-interval I of ℝ² into ℝ⁶ by 𝐯_I.

 Remark 11 (Independence of variables).

Although there are multiple ways to embed (p, q)-intervals of ℝ² into Euclidean space, for any p, q ≥ 1, some vectorization methods are more efficient than others in our context. For instance, suppose that we represent the (2,1)-interval I in Figure 1 by simply concatenating the coordinates of its minimal and maximal points, namely by the vector (x_1, y_1, x_2, y_2, X, Y) ∈ ℝ⁶, where x_1 = x − b, y_1 = y, x_2 = x, y_2 = y − c, X = x + d, Y = y + a. A difficulty with this embedding in the context of gradient descent is that the variables are not independent: one must have (x_1, y_1) ≤ (X, Y), (x_2, y_2) ≤ (X, Y), as well as x_1 ≤ x_2 and y_2 ≤ y_1. Ensuring such relations between variables is not trivial in non-convex optimization (without using refined techniques such as projected gradient descent). On the other hand, using our proposed embedding is more practical, as all variables are independent, and our only requirement is that a, b, c, d must be nonnegative, which can easily be imposed using, e.g., exponential or ReLU functions.
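For illustration, the following sketch (with hypothetical names) converts the minimal and maximal points of a (2,1)-interval into the vector 𝐯_I = (x, y, a, b, c, d) of Figure 1; the point (x, y) is the join of the two minimal points.

```python
import numpy as np

def embed_21_interval(min_pts, max_pt):
    """Embed a (2,1)-interval of R^2 into R^6 as v_I = (x, y, a, b, c, d) (Figure 1).

    min_pts contains the two minimal points (incomparable, or equal for a
    (1,1)-interval) and max_pt is the unique maximal point.
    """
    (xl, yl), (xr, yr) = sorted(min_pts)   # xl <= xr, hence yl >= yr for a valid interval
    x, y = xr, yl                          # join (least upper bound) of the minimal points
    b, c = xr - xl, yl - yr                # nonnegative offsets to the minimal points
    d, a = max_pt[0] - x, max_pt[1] - y    # nonnegative offsets to the maximal point
    return np.array([x, y, a, b, c, d])

# Minimal points (0, 2) and (1, 0), maximal point (3, 4)  ->  v_I = (1, 2, 2, 1, 2, 2).
print(embed_21_interval([(0.0, 2.0), (1.0, 0.0)], (3.0, 4.0)))
```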

Let ℐ := {I_r}_{r=1}^{n} and 𝒥 := {J_s}_{s=1}^{m} consist solely of (1,1)- or (2,1)-intervals of ℝ².

Lemma 12.

For I_r ∈ ℐ and J_s ∈ 𝒥, consider 𝐯_{I_r} = (x_1, y_1, a, b, c, d) and 𝐯_{J_s} = (x_2, y_2, e, f, g, h) defined as in Figure 1. If ‖𝐯_{I_r} − 𝐯_{J_s}‖_∞ ≤ ϵ, then ϵ_rs ≤ 2ϵ, where ϵ_rs is described as in Theorem 9 (ii). (See the full version [11] for the proof.)

We define 𝕧_ℐ as the concatenation of the 𝐯_{I_r} for all I_r ∈ ℐ, i.e., 𝕧_ℐ := (𝐯_{I_1} | ⋯ | 𝐯_{I_n}) ∈ ℝ^{6n}. Consider the collection of all ordered n-sets of (1,1)- or (2,1)-intervals of ℝ², and define the function V from this collection to ℝ^{6n} by ℐ ↦ 𝕧_ℐ.

Proposition 13.

The image of the function V is a convex subset of ℝ^{6n}.

Proof.

Let S be the collection of all (1,1)- and all (2,1)-intervals of ℝ². Consider the map from S to ℝ × ℝ × ℝ_{>0} × ℝ_{≥0} × ℝ_{≥0} × ℝ_{>0} depicted in Figure 1. Clearly, this map is a surjection. It follows that the image of S via the map I ↦ (x, y, a, b, c, d) depicted in Figure 1 is a convex subset of ℝ⁶. Now, observe that the image of the function V is equal to the n-fold Cartesian product of the image of S, which is a subset of ℝ^{6n}. The fact that any Cartesian product of convex sets is convex implies our claim.

Theorem 14 (Lipschitz stability of loss function).

Let 𝒦 be any finite set of (1,1)- or (2,1)-intervals, and let 𝒥_1, 𝒥_2 be any two n-sets of (1,1)- or (2,1)-intervals for some n ∈ ℕ. Then, we have |d̂(𝒦, 𝒥_1) − d̂(𝒦, 𝒥_2)| ≤ 2 · min_π ‖𝕧_{π(𝒥_1)} − 𝕧_{𝒥_2}‖_∞, where the minimum is taken over all permutations π of 𝒥_1.

Proof.

By the triangle inequality, we have |d̂(𝒦, 𝒥_1) − d̂(𝒦, 𝒥_2)| ≤ d̂(𝒥_1, 𝒥_2), and thus it suffices to show that d̂(𝒥_1, 𝒥_2) ≤ 2 · min_π ‖𝕧_{π(𝒥_1)} − 𝕧_{𝒥_2}‖_∞. Let π_0 be a permutation of 𝒥_1 that attains the minimum, and set ϵ := ‖𝕧_{π_0(𝒥_1)} − 𝕧_{𝒥_2}‖_∞, ℐ := π_0(𝒥_1), and 𝒥 := 𝒥_2. By Lemma 12, ϵ_rr ≤ 2ϵ (resp. ϵ_ss ≤ 2ϵ) for all r, s ∈ {1, …, n}, and hence min_s ϵ_rs ≤ 2ϵ (resp. min_r ϵ_rs ≤ 2ϵ) for each r (resp. s). Thus, we have max_r (min_s ϵ_rs) ≤ 2ϵ and max_s (min_r ϵ_rs) ≤ 2ϵ. By Theorem 9 (i), we are done.

 Remark 15.

The opposite direction of Theorem 14 does not hold, i.e., there does not exist c > 0 such that |d̂(𝒦, 𝒥_1) − d̂(𝒦, 𝒥_2)| ≥ c · min_π ‖𝕧_{π(𝒥_1)} − 𝕧_{𝒥_2}‖_∞. However, the non-existence of such a c > 0 is not an issue in our work, as what we require is the Lipschitz stability and almost everywhere differentiability of the loss function (which we establish in Theorem 14 and in the next proposition, respectively), in order to prevent erratic and oscillating gradient descent iterations.

Proposition 16.

The loss function given in Equation (2) is differentiable almost everywhere.

Proof.

The three maps ℝ² → ℝ given by (x, y) ↦ δ(x, y) · |y − x| (cf. Equation (9)), (x, y) ↦ max{x, y}, and (x, y) ↦ min{x, y} are finitely segmented piecewise linear. Therefore, the functions given in Equations (13)-(16) are all finitely segmented piecewise linear on Euclidean spaces, and thus so are the ϵ_rs given in Theorem 9 (ii). Hence, trivially, these functions are differentiable almost everywhere. Now, Proposition 6 (iii) implies our claim.

5 Numerical Experiments on Sparsification

In this section, we make use of the results proved above to provide a method for sparsifying GPDs. Indeed, computing a single GPD dgm^ℐ_M of a persistence module M coming from a simplicial complex S requires computing zigzag persistence modules along all intervals of ℐ, as described in [16], and then computing the corresponding Möbius inversion, which has time complexity O(n³), where n is the number of intervals in ℐ: the Möbius function value must be computed for all possible pairs of intervals, each computation requiring O(n) operations to iterate over the intervals. Thus, the complexity of computing a single GPD is O(n · N_s^{2.376} + n³), where N_s is the number of simplices [30]. Hence, computing all GPDs from a dataset of persistence modules becomes intractable when n is large.

Therefore, as described in the introduction, our goal in this section is to design a sparse subset 𝒥 of (2,1)-intervals of size m ≪ n that minimizes the loss function in Equation (2) with gradient descent, by treating every interval in 𝒥 as a parameter to optimize.

Time-series datasets.

The datasets we consider are taken from the UCR repository [15], and correspond to classification tasks that have already been studied with persistent homology before [29, Section 4.2]. More precisely, instances in these datasets take the form of labelled time series, which we pre-process with time-delay embedding in order to turn them into point clouds. Specifically, each labelled time series T = (f(t_1), …, f(t_n)) of length n is transformed into a point cloud X_T ⊆ ℝ³ of cardinality n − 2 with

X_T := { (f(t_1), f(t_2), f(t_3))^T, …, (f(t_{n−2}), f(t_{n−1}), f(t_n))^T }.

Then, we compute both the Vietoris-Rips filtration and the sublevel set filtration induced by a Gaussian kernel density estimator (with bandwidth σ = 0.1 · diam(X_T), where diam(X_T) := max_{x,y ∈ X_T} ‖x − y‖_2 is the diameter of the point cloud) using the PointCloud2FilteredComplex function of the multipers library [28] (a tutorial is available in [26]). Both filtrations are then normalized so that their ranges become equal to the unit interval [0, 1].
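For concreteness, here is a minimal NumPy sketch of the time-delay embedding step and of the bandwidth choice σ = 0.1 · diam(X_T); the function name is illustrative, and the subsequent filtration construction with multipers is omitted (see the tutorial [26]).

```python
import numpy as np

def time_delay_embedding(f, dim=3, delay=1):
    """Turn a time series (f(t_1), ..., f(t_n)) into a point cloud in R^dim.

    With dim=3 and delay=1, this yields the n-2 points
    (f(t_i), f(t_{i+1}), f(t_{i+2})) used above.
    """
    f = np.asarray(f)
    n_pts = len(f) - (dim - 1) * delay
    return np.stack([f[i * delay : i * delay + n_pts] for i in range(dim)], axis=1)

T = np.sin(np.linspace(0, 4 * np.pi, 100))        # a toy time series of length 100
X_T = time_delay_embedding(T)                     # point cloud of cardinality 98 in R^3
sigma = 0.1 * np.max(np.linalg.norm(X_T[:, None] - X_T[None, :], axis=-1))  # 0.1 * diam(X_T)
```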

Loss function.

In order to compute GPDs out of these 2-parameter filtrations and modules, one needs a subset of intervals. As explained in the introduction, we then minimize

ℒ_{d̂_E, m} : 𝕧_𝒥 ↦ d̂(ℐ, 𝒥),

where the full domain ℐ (resp. the sparse domain 𝒥) comprises n = 1,600 (resp. m = 400) (2,1)-intervals obtained from a grid in ℝ⁶ computed by evenly sampling 10 (resp. 5) values for x and y and 2 values for a, b, c, d within their corresponding filtration ranges. (We also tried random uniform sampling for the initial distributions of x, y, a, b, c, d, but this resulted in slower convergence and lower performances.) Note that while the (2,1)-intervals in 𝒥 are treated as parameters to optimize, the (2,1)-intervals in ℐ are fixed throughout the optimization process. Moreover, the formula that we provided in Theorem 9 (ii) can be readily implemented in any library that uses auto-differentiation, such as pytorch (a schematic sketch is given below). In particular, we run stochastic gradient descent with momentum 0.9 on this loss for 750 epochs, using learning rate η = 0.001 with exponential decay of factor 0.99 to achieve convergence, and obtain our sparse subset 𝒥. See Figure 5 for a visualization of the loss decrease. Note how the Lipschitz stability proved in Theorem 14 translates into a smooth decrease with small oscillations.

Figure 5: Loss decrease across gradient descent iterations. Note that the loss value stays on a plateau for the first 300 iterations; this is due to the fact that during these first iterations, the parameters in 𝒥 that are updated with gradient descent are not yet the ones achieving the maxima and minima in the closed-form formula provided in Theorem 9 (ii). Moreover, in this example, gradient descent converges to a local minimum of the loss that is only 0.01 lower than the initial value, partially because we are using the small learning rate 0.001.
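The following schematic PyTorch sketch illustrates how the closed form of Theorem 9 (ii) can be optimized with auto-differentiation. It is not the exact training script of the reported experiments: all names are illustrative, the grid initialization described above is replaced by random initialization, the gradient step is full-batch, and the nonnegativity of the interval lengths is imposed via ReLU as suggested in Remark 11.

```python
import torch

def F(w1, w2, w3, w4):
    # F of Theorem 9: delta(w1,w2)*|w2-w1| equals relu(w2-w1), so F is a max of two ReLUs.
    return torch.maximum((w2 - w1).clamp(min=0), (w4 - w3).clamp(min=0))

def H(o1, o2, o3, o4):
    return torch.maximum(torch.minimum(o1, o2), torch.minimum(o3, o4))

def eps_matrix(VI, VJ):
    """All pairwise epsilon_rs values (Theorem 9 (ii)) between the rows of VI and VJ."""
    x1, y1, a, b, c, d = [VI[:, k][:, None] for k in range(6)]   # shape (n, 1)
    x2, y2, e, f, g, h = [VJ[:, k][None, :] for k in range(6)]   # shape (1, m)
    t13 = H(F(x2 - f, x1 - b, y2, y1), F(x2 - f, x1, y2, y1 - c),
            F(x2, x1 - b, y2 - g, y1), F(x2, x1, y2 - g, y1 - c))
    t14 = F(x1 + d, x2 + h, y1 + a, y2 + e)
    t15 = H(F(x1 - b, x2 - f, y1, y2), F(x1 - b, x2, y1, y2 - g),
            F(x1, x2 - f, y1 - c, y2), F(x1, x2, y1 - c, y2 - g))
    t16 = F(x2 + h, x1 + d, y2 + e, y1 + a)
    return torch.stack([t13, t14, t15, t16]).amax(dim=0)          # shape (n, m)

def d_hat(VI, VJ):
    """d_hat(I, J) of Theorem 9 (i), as a differentiable torch scalar."""
    E = eps_matrix(VI, VJ)
    return torch.maximum(E.amin(dim=1).amax(), E.amin(dim=0).amax())

n, m = 1600, 400
VI = torch.rand(n, 6)                                # fixed full domain (a grid in practice)
params = torch.rand(m, 6, requires_grad=True)        # parameters encoding the sparse domain

opt = torch.optim.SGD([params], lr=1e-3, momentum=0.9)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)
for _ in range(750):
    VJ = torch.cat([params[:, :2], torch.relu(params[:, 2:])], dim=1)  # keep a,b,c,d >= 0
    loss = d_hat(VI, VJ)
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```

Since the loss depends only on ℐ and 𝒥 (Proposition 6 (iii)), no GPDs need to be computed inside this loop.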

Accuracy scores.

In order to quantify the running time improvements, as well as the information loss (if any), when switching from the full domain ℐ to the optimized sparse one 𝒥, we computed the full and sparse GPDs in homology dimensions 0 and 1 with the zigzag persistence diagram implementation in dionysus [31], and then we trained random forest classifiers on these GPDs to predict the time series labels. (Recall that optimizing the sparse domain can be performed without computing GPDs, so the optimization process described in the previous paragraph is done once and for all, before computing the GPDs.) In order to achieve this, we turned both the full and optimized GPDs into Euclidean vectors by first binning every GPD (seen as a point cloud in ℝ⁶) with a fixed 6-dimensional histogram, and then convolving this histogram with a 6-dimensional Gaussian kernel in order to smooth its values, with a procedure similar to the one described in [29, Section 3.2.1]. Both the histogram bins and the kernel bandwidths were found with 3-fold cross-validation on the training set (see the provided code for hyperparameter values). Then, random forest classifiers were trained on these smoothed histograms to predict labels; in Table 1 we report the accuracy scores of these classifiers for the initial (before optimization) sparse domain 𝒥_init, the optimized sparse domain 𝒥, and the full domain ℐ. Moreover, in Table 2, we report the running time needed to compute all GPDs using either 𝒥_init, 𝒥, or ℐ, as well as the improvement when switching from the full domain ℐ to the optimized sparse domain 𝒥. Note that our goal is not to improve on the state-of-the-art for time series classification, but rather to assess whether optimizing the loss ℒ_{d̂_E, m} based on our upper bound is indeed beneficial for improving topology-based models. See Figure 2 for a schematic overview of our full pipeline.
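As an illustration of this vectorization step, the following sketch bins a GPD, given as an array of points in ℝ⁶ with (possibly signed) multiplicities, into a Gaussian-smoothed 6-dimensional histogram; the bin count and bandwidth below are placeholders, whereas in practice they are chosen by cross-validation as described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def vectorize_gpd(points, multiplicities, bins=4, bandwidth=1.0, value_range=(0.0, 1.0)):
    """Bin a GPD, seen as a weighted point cloud in R^6, into a smoothed histogram.

    points: (k, 6) array of interval vectors v_I; multiplicities: (k,) values dgm(I).
    The histogram is convolved with a Gaussian kernel and flattened into a
    feature vector for the random forest classifier.
    """
    hist, _ = np.histogramdd(points, bins=bins, range=[value_range] * 6,
                             weights=multiplicities)
    return gaussian_filter(hist, sigma=bandwidth).ravel()

rng = np.random.default_rng(0)
pts = rng.random((50, 6))                              # toy GPD support in [0, 1]^6
mult = rng.integers(-2, 3, size=50).astype(float)      # toy multiplicities
features = vectorize_gpd(pts, mult)                    # feature vector of length 4**6
```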

Discussion on results.

As one can see from Table 1, after optimizing 𝒥 there is either a clear improvement or a comparable performance in accuracy scores. Indeed, by minimizing ℒ_{d̂_E, m}, one forces the sparse domain 𝒥 to be as close as possible to the full domain ℐ, and thus to retain as much topological information as possible. Hence, either the full GPDs are more efficient than the initial sparse GPDs, in which case the optimized GPDs perform much better than the initial sparse ones, or the full GPDs are less efficient than the initial sparse GPDs (recall that the accuracy score is only an indirect measure: while the full GPDs are always richer in topological information, we hypothesize that their scores might still be lower due to a large number of intervals on which the full GPDs vanish, a question that we leave for further study), in which case the optimized GPDs have comparable or slightly worse performances than the initial sparse ones, thanks to the small sizes of their domains (except for the PC and IPD datasets, for which the optimized GPDs still perform interestingly better). In all cases, we emphasize that optimized GPDs provide the best solution: they maintain scores at levels that are either comparable to or better than the best of the initial sparse and full GPDs, while not forcing users to choose interval domains by hand (as their domains are obtained automatically with gradient descent). Note in particular that whenever optimized GPDs do not achieve the best scores, their performances remain of the same order as those of their competitors (see, e.g., DPC or SC), whereas when optimized GPDs are the best, the performances of the other two competitors are sometimes far behind (see, e.g., GPA).

As for running times, we observe a slight increase over the running times associated with the initial sparse GPDs (except for the IPD dataset), which we hypothesize is due to the optimized intervals intersecting more with each other, and thus inducing more poset relations when computing Möbius inversions, and a strong improvement over the running times associated with the full GPDs, with ratios ranging between 5 and 15 times faster. Our optimized GPDs thus achieve the best of both worlds: they are strikingly fast to compute while keeping high accuracy scores. Our code was run on a 2x Xeon SP Gold 5115 @ 2.40GHz, and is fully available at https://github.com/L-ebesgue/sparse_GPDs.

Table 1: Accuracy scores (%) of random forest classifiers trained on several UCR datasets. Underline indicates best score between the initial and optimized sparse domains, while bold font indicates best score overall. See the full version [11] for the dataset full names.
C DPA DPC DPT PPA PPC PPT ECG IPD
Init. 0.750 0.705 0.746 0.561 0.790 0.718 0.693 0.790 0.677
Optim. 0.786 0.691 0.721 0.561 0.785 0.742 0.737 0.790 0.690
Full 0.857 0.669 0.743 0.554 0.780 0.729 0.707 0.740 0.651
MI P SL GP GPA GPM GPO PC SC
Init. 0.534 0.924 0.565 0.900 0.835 0.946 0.987 0.789 0.447
Optim. 0.588 0.886 0.546 0.893 0.959 0.959 0.956 0.800 0.487
Full 0.545 0.838 0.531 0.847 0.864 0.946 0.990 0.778 0.503
Table 2: Running times (seconds) needed for computing all GPDs on several UCR datasets. Underline indicates best running time between the initial and optimized sparse domains, while bold font indicates best running time overall. See the full version [11] for the dataset full names.
C DPA DPC DPT PPA PPC PPT ECG IPD
Init. 350 1076 1902 1178 1182 1660 1147 556 1262
Optim. 421 1172 2054 1265 1257 1749 1232 621 1328
Full 2304 11672 20163 13007 12745 18427 12844 5098 19860
Improv. 5.47x 9.95x 9.82x 10.28x 10.13x 10.53x 10.42x 8.20x 15.74x
MI P SL GP GPA GPM GPO PC SC
Init. 2989 792 3478 703 1576 1658 1573 1339 1240
Optim. 3399 909 3919 883 1955 2061 1925 1550 1324
Full 29074 6280 31285 5773 12913 13523 12849 10730 13089
Improv. 8.55x 6.91x 7.98x 6.54x 6.60x 6.56x 6.67x 6.92x 9.88x

6 Conclusion

Our sparsification method demonstrates the practicality of approximating full GPDs with sparse ones: while significantly reducing computational costs, our appropriate loss function also ensures that their discriminative power is not too compromised. We thus believe that our work paves the way for efficiently deploying multi-parameter topological data analysis to large-scale applications, that are currently out of reach for most multi-parameter topological invariants from the literature. In what follows, we outline several potential future directions.

Optimization for finer GPDs.

In our experiments, we optimized the GPDs over intervals with at most two minimal points and exactly one maximal point. Allowing more complex intervals, as well as dataset-dependent terms in the loss function, could improve performance, but finding suitable embeddings of such intervals into Euclidean space remains a challenge (cf. Remark 11) and would increase the computational cost (cf. Remark 10). It would also be interesting to quantify the discriminating power of GPDs with measures that are more direct than the score of a machine learning classifier (or at least to further study the dependencies between GPD sparsification and classifier scores), and to investigate heuristics (based on drops in the loss values) for deciding whether sparse GPDs are sufficiently good so that the optimization can be stopped.

Experimental validation for other GRI-based descriptors.

While we only focused on sparsifying GPDs in this work, it would be interesting to measure the extent to which our interval domain optimization adapts to other descriptors based on the GRI from the TDA literature; of particular interest are the GPDs/signed barcodes coming from rank exact decompositions, which have recently proved to be stable [8] (see also Section 8.2 in [7] which discusses the influence of the choice of the interval domains on the resulting invariants), as well as GRIL, which focuses on specific intervals called worms [37].

References

  • [1] Hideto Asashiba, Etienne Gauthier, and Enhao Liu. Interval replacements of persistence modules. arXiv preprint, 2024. arXiv:2403.08308.
  • [2] Gorô Azumaya. Corrections and supplementaries to my paper concerning Krull-Remak-Schmidt’s theorem. Nagoya Mathematical Journal, 1:117–124, 1950.
  • [3] Ulrich Bauer, Magnus Botnan, Steffen Oppermann, and Johan Steen. Cotorsion torsion triples and the representation theory of filtered hierarchical clustering. Advances in Mathematics, 369:107171, 2020.
  • [4] Magnus Botnan and William Crawley-Boevey. Decomposition of persistence modules. Proceedings of the American Mathematical Society, 148(11):4581–4596, 2020.
  • [5] Magnus Botnan and Michael Lesnick. Algebraic stability of zigzag persistence modules. Algebraic & Geometric topology, 18(6):3133–3204, 2018.
  • [6] Magnus Botnan and Michael Lesnick. An introduction to multiparameter persistence. In Representations of Algebras and Related Structures, pages 77–150. CoRR, 2023.
  • [7] Magnus Botnan, Steffen Oppermann, and Steve Oudot. Signed barcodes for multi-parameter persistence via rank decompositions and rank-exact resolutions. Foundations of Computational Mathematics, pages 1–60, 2024.
  • [8] Magnus Botnan, Steffen Oppermann, Steve Oudot, and Luis Scoccola. On the bottleneck stability of rank decompositions of multi-parameter persistence modules. Advances in Mathematics, 451:109780, 2024.
  • [9] Gunnar Carlsson and Afra Zomorodian. The theory of multidimensional persistence. Discrete & Computational Geometry, 42(1):71–93, 2009. doi:10.1007/S00454-009-9176-0.
  • [10] Mathieu Carrière and Andrew Blumberg. Multiparameter persistence image for topological machine learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pages 22432–22444. Curran Associates, Inc., 2020.
  • [11] Mathieu Carriere, Seunghyun Kim, and Woojin Kim. Sparsification of the generalized persistence diagrams for scalability through gradient descent. arXiv preprint arXiv:2412.05900, 2024. doi:10.48550/arXiv.2412.05900.
  • [12] Frédéric Chazal, David Cohen-Steiner, Marc Glisse, Leonidas J Guibas, and Steve Y Oudot. Proximity of persistence modules and their diagrams. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 237–246, 2009. doi:10.1145/1542362.1542407.
  • [13] Nate Clause, Woojin Kim, and Facundo Memoli. The generalized rank invariant: Möbius invertibility, discriminating power, and connection to other invariants. arXiv preprint, 2024. arXiv:2207.11591v5.
  • [14] René Corbet, Ulderico Fugacci, Michael Kerber, Claudia Landi, and Bei Wang. A kernel for multi-parameter persistent homology. Computers & Graphics: X, 2:100005, 2019. doi:10.1016/J.CAGX.2019.100005.
  • [15] Hoang-Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ratanamahatana, and Eamonn Keogh. The UCR time series archive. arXiv, 2018. arXiv:1810.07758.
  • [16] Tamal Dey, Woojin Kim, and Facundo Mémoli. Computing generalized rank invariant for 2-parameter persistence modules via zigzag persistence and its applications. In 38th International Symposium on Computational Geometry (SoCG 2022). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPIcs.SoCG.2022.34.
  • [17] Tamal Dey, Aman Timalsina, and Cheng Xin. Computing generalized ranks of persistence modules via unfolding to zigzag modules. arXiv preprint, 2024. doi:10.48550/arXiv.2403.08110.
  • [18] Herbert Edelsbrunner and John L Harer. Computational topology: an introduction. American Mathematical Society, 2008.
  • [19] Donghan Kim, Woojin Kim, and Wonjun Lee. Super-polynomial growth of the generalized persistence diagram, 2024. doi:10.48550/arXiv.2412.04889.
  • [20] Woojin Kim and Facundo Mémoli. Generalized persistence diagrams for persistence modules over posets. Journal of Applied and Computational Topology, 5(4):533–581, 2021. doi:10.1007/S41468-021-00075-1.
  • [21] Woojin Kim and Facundo Mémoli. Spatiotemporal persistent homology for dynamic metric spaces. Discrete & Computational Geometry, 66:831–875, 2021. doi:10.1007/S00454-019-00168-W.
  • [22] Woojin Kim and Facundo Mémoli. Persistence over posets. Notices of the American Mathematical Society, 70(08), 2023.
  • [23] Woojin Kim and Facundo Mémoli. Extracting persistent clusters in dynamic data via möbius inversion. Discrete & Computational Geometry, 71(4):1276–1342, 2024. doi:10.1007/S00454-023-00590-1.
  • [24] Woojin Kim and Samantha Moore. Bigraded Betti numbers and generalized persistence diagrams. Journal of Applied and Computational Topology, 2024. doi:10.1007/s41468-024-00180-x.
  • [25] Michael Lesnick. The theory of the interleaving distance on multidimensional persistence modules. Foundations of Computational Mathematics, 15(3):613–650, 2015. doi:10.1007/S10208-015-9255-Y.
  • [26] David Loiseaux. Time series classification. Accessed: 2025-03-24. URL: https://davidlapous.github.io/multipers/notebooks/time_series_classification.html.
  • [27] David Loiseaux, Mathieu Carrière, and Andrew Blumberg. A framework for fast and stable representations of multiparameter persistent homology decompositions. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates, Inc., 2023.
  • [28] David Loiseaux and Hannah Schreiber. multipers: Multiparameter persistence for machine learning. Journal of Open Source Software, 9(103):6773, 2024. doi:10.21105/JOSS.06773.
  • [29] David Loiseaux, Luis Scoccola, Mathieu Carrière, Magnus Botnan, and Steve Oudot. Stable vectorization of multiparameter persistent homology using signed barcodes as measures. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates, Inc., 2023.
  • [30] Nikola Milosavljević, Dmitriy Morozov, and Primoz Skraba. Zigzag persistent homology in matrix multiplication time. In Proceedings of the twenty-seventh Annual Symposium on Computational Geometry, pages 216–225, 2011. doi:10.1145/1998196.1998229.
  • [31] Dmitriy Morozov. Dionysus 2. Accessed: 2025-03-24. URL: https://mrzv.org/software/dionysus2/.
  • [32] Soham Mukherjee, Shreyas Samaga, Cheng Xin, Steve Oudot, and Tamal Dey. D-gril: End-to-end topological learning with 2-parameter persistence. arXiv, 2024. arXiv:2406.07100.
  • [33] Amit Patel. Generalized persistence diagrams. Journal of Applied and Computational Topology, 1(3):397–419, 2018. doi:10.1007/S41468-018-0012-6.
  • [34] Luis Scoccola, Siddharth Setlur, David Loiseaux, Mathieu Carrière, and Steve Oudot. Differentiability and optimization of multiparameter persistent homology. In 41st International Conference on Machine Learning (ICML 2024), volume 235, pages 43986–44011. PMLR, 2024.
  • [35] Oliver Vipond. Multiparameter persistence landscapes. Journal of Machine Learning Research, 21(61):1–38, 2020. URL: https://jmlr.org/papers/v21/19-054.html.
  • [36] Lu Xian, Henry Adams, Chad M Topaz, and Lori Ziegelmeier. Capturing dynamics of time-varying data via topology. Foundations of Data Science, 4(1):1–36, 2022.
  • [37] Cheng Xin, Soham Mukherjee, Shreyas Samaga, and Tamal Dey. GRIL: A 2-parameter persistence based vectorization for machine learning. In 2nd Annual Workshop on Topology, Algebra, and Geometry in Machine Learning. OpenReviews.net, 2023.