Verifying an Efficient Algorithm for Computing Bernoulli Numbers
Abstract
The Bernoulli numbers are a sequence of rational numbers that is ubiquitous in mathematics, but difficult to compute efficiently (compared to e.g. approximating $\pi$).
In 2008, Harvey gave the currently fastest known practical way of computing them: his algorithm computes $B_k \bmod p$ for a prime $p$ in time roughly linear in $p$. By doing this for many small primes in parallel and then combining the results with the Chinese Remainder Theorem, one recovers the value of $B_k$ as a rational number in $k^{2+o(1)}$ time. One advantage of this approach is that the expensive part of the algorithm is highly parallelisable and has very low memory requirements. This algorithm still holds the world record with its computation of $B_{10^8}$.
We give a verified efficient LLVM implementation of this algorithm. This was achieved by formalising the necessary mathematical background theory in Isabelle/HOL, proving an abstract version of the algorithm correct, and refining this abstract version down to LLVM using Lammich's Isabelle-LLVM framework, including many low-level optimisations. The performance of the resulting LLVM code is comparable to that of Harvey's original unverified and hand-optimised C++ implementation.
Keywords and phrases: Bernoulli numbers, LLVM, verification, Isabelle, Chinese remainder theorem, modular arithmetic, Montgomery arithmetic
2012 ACM Subject Classification: Theory of computation → Logic and verification
Supplementary Material: Other (Formal Proof Development): https://dx.doi.org/10.5281/zenodo.15749805
Acknowledgements: The computational results presented here have been achieved (in part) using the LEO HPC infrastructure of the University of Innsbruck.
Editors: Yannick Forster and Chantal Keller
Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
The Bernoulli numbers are a sequence of rational numbers that is ubiquitous in mathematics and has connections to, among other things: the closed-form expression for the power sums $1^m + 2^m + \cdots + n^m$ (Faulhaber's formula), regular primes and cyclotomic fields, the combinatorics of alternating permutations, the Riemann zeta function, the Euler–Maclaurin summation formula, and the Maclaurin series of the tangent and cosecant functions.
Algorithmically, they are quite difficult to compute efficiently: Bernoulli himself only computed them up to $B_{10}$ in the 17th century. Euler extended this up to $B_{30}$ in 1748, Adams up to $B_{124}$ in 1877 [1], and Lehmer up to $B_{196}$ in 1936 [22] – all by hand.
Interestingly, it is often claimed that the first non-trivial computer program ever written was Lovelace's Note G, published in 1843. This was an algorithm designed for Babbage's Analytical Engine (which was sadly never built) to compute Bernoulli numbers. (Accounts differ on whether it was Lovelace or Babbage himself who authored the program, but the prevailing view seems to be that it was Lovelace [9].)
The first actual use of a digital computer (that we are aware of) for computing Bernoulli numbers (and the first big step forward after Lehmer in 1936) was when Knuth and Buckholtz gave a table of the values up to $B_{836}$ in 1967 [16] (the numerator of $B_{836}$ already has 1421 decimal digits). The first large jump to higher indices happened in 1996, when Fee and Plouffe [10] computed the single value $B_{200000}$, with a few even larger values following in the subsequent years, using an entirely different method based on approximating the Riemann zeta function. Unlike previous methods, this method allowed computing a value $B_k$ without knowing the preceding values $B_0, \ldots, B_{k-2}$.
In 2010, Harvey [14] computed the value of $B_{10^8}$ using an entirely different method: he gave a fast algorithm to compute $B_k$ modulo a prime $p$. This can then be run in parallel for many different primes, followed by an invocation of the Chinese Remainder Theorem to reconstruct the value of $B_k$ as a rational number. While this approach is not asymptotically faster than the one based on the zeta function, it is much faster in practice, as Harvey demonstrated. To this day, Harvey's algorithm is still the state of the art for practical computation of a single Bernoulli number, and his record of $B_{10^8}$ still stands. His algorithm takes $k^{2+o(1)}$ time to compute $B_k$. (For comparison, $\pi$ has been approximated to trillions of digits and is much easier to compute: quasi-linear time in the number of digits with Chudnovsky's formula and binary splitting [26].)
In this paper, we verify Harvey’s algorithm in Isabelle/HOL. Using the Isabelle Refinement Framework [21] and its parallel LLVM back end [20], our verification spans from a definition of Bernoulli numbers down to efficient LLVM code with performance comparable to Harvey’s original algorithm. As a side product, we also verified several other interesting numerical algorithms and number-theoretic results, such as prime sieves and fast Chinese remaindering.
Our trusted code base consists of Isabelle, the Isabelle LLVM code generator, and the LLVM compiler. Only in the reconstruction phase do we additionally rely on correct integer operations in the GNU Multiple Precision Arithmetic Library [12].
2 Mathematical Background
The Bernoulli numbers are defined as the coefficients of the exponential generating function $\frac{t}{e^t - 1} = \sum_{n \ge 0} B_n \frac{t^n}{n!}$ or, following an alternative convention, $\frac{t\,e^t}{e^t - 1} = \sum_{n \ge 0} B_n^{+} \frac{t^n}{n!}$. The only difference between the two conventions is that the first one yields $B_1 = -\frac{1}{2}$, and the second one yields $B_1^{+} = +\frac{1}{2}$. We shall adopt the first convention. We use the Isabelle/HOL formalisation of Bernoulli numbers in the Archive of Formal Proofs (AFP) by Bulwahn and Eberl [4], which already contains many useful results on Bernoulli numbers.
Remark 1 (Notation).
We shall write $N_k$ and $D_k$ for the numerator and denominator of $B_k$, respectively.
Both in this presentation and in our Isabelle proofs, we use the notation $x \equiv y \pmod{n}$. If $x$ and $y$ are integers, the meaning is that $n \mid (x - y)$ or, equivalently, $x \bmod n = y \bmod n$. The corresponding notation in Isabelle/HOL is [x = y] (mod n).
We extend this definition to rational numbers in the natural way: we define $\frac{a}{b} \bmod n = a b^{-1} \bmod n$, where $b^{-1}$ denotes a modular inverse of $b$ modulo $n$. Note that this is only well-defined if the denominators are coprime to $n$. In Isabelle, we introduce corresponding notation for this infix operator and the associated congruence relation.
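Since modular inverses recur throughout the algorithm, the following minimal C++ sketch (our own illustration, not taken from the formalisation) shows how such an inverse, and with it the rational congruence above, can be computed with the extended Euclidean algorithm, which is also among the verified building blocks listed in Section 8:

```cpp
#include <cstdint>
#include <stdexcept>

// Extended Euclid: returns gcd(a, b) and Bezout coefficients x, y
// with a*x + b*y = gcd(a, b).
static int64_t ext_gcd(int64_t a, int64_t b, int64_t &x, int64_t &y) {
    if (b == 0) { x = 1; y = 0; return a; }
    int64_t x1, y1, g = ext_gcd(b, a % b, x1, y1);
    x = y1; y = x1 - (a / b) * y1;
    return g;
}

// Modular inverse of b modulo n; requires gcd(b, n) = 1.
int64_t mod_inverse(int64_t b, int64_t n) {
    int64_t x, y;
    if (ext_gcd(((b % n) + n) % n, n, x, y) != 1)
        throw std::invalid_argument("not invertible");
    return ((x % n) + n) % n;
}

// (a/b) mod n in the sense defined above, i.e. a * b^{-1} mod n.
// Assumes n is small enough that (n-1)^2 fits into int64_t.
int64_t rat_mod(int64_t a, int64_t b, int64_t n) {
    return (((a % n) + n) % n) * mod_inverse(b, n) % n;
}
```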
The notation $\mathrm{ord}_n(x)$ denotes the multiplicative order of the element $x$ in the ring $\mathbb{Z}/n\mathbb{Z}$, i.e. the smallest positive integer $t$ such that $x^t \equiv 1 \pmod{n}$.
Lastly, as is common in number theory, any sum or product with an index variable $p$ is meant to be interpreted with $p$ being restricted to the prime numbers.
The sequence of Bernoulli numbers shows an obvious pattern: for $k \ge 2$, we have $B_k = 0$ if $k$ is odd, $B_k > 0$ if $k \equiv 2 \pmod{4}$, and $B_k < 0$ if $k \equiv 0 \pmod{4}$. The vanishing of $B_k$ for odd $k$ is easily proved from the formal power series given above. The result about the sign of $B_k$ for even $k$ is obtained from a well-known connection between $B_k$ and the Riemann zeta function:
Theorem 2 (The connection between $B_k$ and $\zeta(k)$).
If $k \ge 2$ is even, we have:
$$B_k = \frac{(-1)^{k/2+1} \cdot 2 \cdot k!}{(2\pi)^k}\,\zeta(k)$$
Proof.
We apply the Residue Theorem to the contour integral of $z^{-k}/(e^z - 1)$ along the boundary of a rectangle of growing width and height centred around the origin. Letting the rectangle grow to infinity establishes our claim.
The next important ingredient gives us an easy way to compute the denominator of $B_k$:
Theorem 3 (Von Staudt–Clausen theorem).
If $k > 0$ is even, then $D_k = \prod_{(p-1) \mid k} p$.
We omit the proofs of this fact and of the following ones from this presentation for reasons of space. All results up to this point are part of the AFP entry by Bulwahn and Eberl [4].
The core of Harvey's algorithm consists of two closely related congruences involving Bernoulli numbers modulo a prime or a prime power:
Theorem 4 (Voronoi's congruence).
Let $k \ge 2$ be even and let $a$ and $n$ be coprime positive integers. Then:
$$(a^k - 1)\, N_k \equiv k\, a^{k-1}\, D_k \sum_{m=1}^{n-1} m^{k-1} \left\lfloor \frac{m a}{n} \right\rfloor \pmod{n}$$
Theorem 5 (Kummer's congruence).
Let $p$ be a prime number and $k$ and $k'$ be positive integers with $k \equiv k' \pmod{p-1}$, both even, and $(p-1) \nmid k$. Then $B_k / k \equiv B_{k'} / k' \pmod{p}$.
Proof.
The Isabelle proofs closely follow Cohen [5, Prop. 9.5.20, Cor. 9.5.25] and are relatively straightforward. These congruences were not previously available in the AFP and were formalised in the course of this project [7].
By setting $k' = k \bmod (p-1)$ in Kummer's congruence, we obtain the range-reduction congruence $B_k / k \equiv B_{k'} / k' \pmod{p}$. Thus we can w.l.o.g. assume that $k \le p - 3$.
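As a concrete instance of this range reduction (our own example, instantiating the theorem): for $p = 13$ and $k = 100$ we have $k' = 100 \bmod 12 = 4$, so Kummer's congruence yields
$$\frac{B_{100}}{100} \equiv \frac{B_4}{4} = -\frac{1}{120} \equiv -3^{-1} \equiv 4 \pmod{13},$$
reducing the index from $100$ to $4$.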
Voronoi’s congruence gives us a closed-form expression for computing . Harvey tweaks this congruence to obtain the following:
Theorem 6 (Voronoi’s congruence, Harvey’s first version).
Let be prime and and with . Then:
Here, denotes a modular inverse of modulo . Note that is either an integer or a half-integer, depending on whether is even or odd.
We will later pick a convenient value for the parameter in this theorem.
3 Computing $N_k$ Modulo a Prime
The main part of Harvey's algorithm is to gather 'modular information', i.e. to compute $B_k \bmod p$ for many different primes $p$. Note that since $p \mid D_k$ if $(p-1) \mid k$, the term $B_k \bmod p$ is actually undefined if $(p-1) \mid k$, which is why Harvey does not use such primes.
We, on the other hand, decided to instead compute $N_k \bmod p$, which is well-defined for all primes. Since $D_k$ is easy to compute via von Staudt–Clausen, it is easy to convert between $N_k \bmod p$ and $B_k \bmod p$ if the latter is well-defined.
Depending on $p$ and $k$, we use one of three different methods to compute $N_k \bmod p$:
- Case 1: If $(p-1) \nmid k$, we use a version of Theorem 6 to compute $B_k \bmod p$ and then multiply the result with $D_k$ to obtain $N_k \bmod p$. To avoid big-integer arithmetic, we do the latter part by simply multiplying with $q \bmod p$ for every prime $q$ with $(q-1) \mid k$. There are two sub-cases for how we compute $B_k \bmod p$:
  - Case 1.1: If $2^k \not\equiv 1 \pmod{p}$, we use a highly optimised version of Theorem 6, which we refer to as harvey_fastsum in our Isabelle formalisation.
  - Case 1.2: If $2^k \equiv 1 \pmod{p}$, we fall back to Harvey's Algorithm 1, which is a much less optimised version of Theorem 6. In our Isabelle formalisation, we refer to this as harvey_slowsum.
- Case 2: If $(p-1) \mid k$, we recall that $B_k \bmod p$ is not well-defined. However, it is easy to show that in this case we have $N_k \equiv -D_k / p \pmod{p}$, which is easy to compute.
It would in principle be possible to only use the primes that fall into Case 1.1 in order to keep the algorithm simpler and make it slightly faster. However, little is known about the distribution of these primes in practice, which makes it impossible to find an a-priori bound for how many primes we will need. Omitting the Case 2 primes, on the other hand, as Harvey does, is easy to do but brings no real benefit.
Since Case 1.1 is by far the most frequent and most optimised one, we dedicate the remainder of this section to it. For Case 1.2, we refer the reader to Harvey’s paper.
Harvey derives the congruence behind Case 1.1 from Theorem 6 by choosing $c = 2$. He then picks a generator $g$ of $\mathbb{Z}_p^*$ (also referred to as a primitive root modulo $p$) and, with some elementary algebra and clever reindexing, derives the following:
Theorem 7 (The harvey_fastsum congruence).
Let $p$ be prime and $k$ be even with $2 \le k \le p - 3$ and $2^k \not\equiv 1 \pmod{p}$, and let $g$ be a generator of $\mathbb{Z}_p^*$. Let $n = \mathrm{ord}_p(2)$; let $n' = n/2$ if $n$ is even and $n' = n$ otherwise; and let $m = (p-1)/(2n')$. Then:
$$B_k \equiv \frac{k}{1 - 2^{-k}} \sum_{i=0}^{m-1} g^{i(k-1)} \sum_{j=0}^{n'-1} 2^{j(k-1)} \left(\varepsilon_{i,j} - \frac{1}{2}\right) \pmod{p},$$
where $\varepsilon_{i,j} \in \{0, 1\}$ denotes the $(j+1)$-st digit of the binary expansion of $(g^i \bmod p)/p$.
Empirically, $m$ is small for most primes. (Its mean value over an initial range of odd primes is approximately 9.80, and only a small percentage of these primes have a large $m$.) Thus, just like Harvey, we focus on optimising the inner sum (the sum over $j$ above). We will process the inner sum in chunks, each consisting of several blocks of 8 bits; currently, we pick a chunk size of 64 bits for hardware-architecture reasons that will become clear later.
Concretely, generalise $g^i$ to an arbitrary $x$ with $0 < x < p$ and the bound $n'$ to an arbitrary $n$ in the inner sum in Theorem 7, and let $q = \lfloor n / 64 \rfloor$ and $r = n \bmod 64$. Then:
$$\sum_{j=0}^{n-1} 2^{j(k-1)} \left(\varepsilon_j - \frac{1}{2}\right) \;=\; \sum_{j=0}^{64q-1} 2^{j(k-1)} \left(\varepsilon_j - \frac{1}{2}\right) \;+\; \sum_{j=64q}^{64q+r-1} 2^{j(k-1)} \left(\varepsilon_j - \frac{1}{2}\right)$$
That is, we split the sum into a 'chunk-aligned' part and a remainder.
3.1 The Inner Sum in General
For presentation purposes, we start by describing the algorithm for the remainder, which works for an arbitrary number of summands. The algorithm for the chunk-aligned part uses the same implementation techniques with the addition of a tabulation optimisation, which we describe in Section 3.2.
3.1.1 Abstract Algorithm
We compute the sum for arbitrary $x$ and $n$ with a simple loop with two local variables $w$ and $s$ that hold the current values of $2^j x \bmod p$ and of the partial sum, respectively.
We have $\varepsilon_j = 1$ if $2 w_j \ge p$ and $\varepsilon_j = 0$ otherwise, and also $w_{j+1} = 2 w_j - \varepsilon_j p$. This leads us to a simple algorithm, phrased in the nondeterminism-error monad of the Isabelle Refinement Framework (IRF) [21]. It iterates over the values $j = 0, \ldots, n-1$, maintaining the state $(w, s)$ with the invariants sketched above (in particular $w = 2^j x \bmod p$ and $w < p$).
Given the ideas above, we can prove the algorithm correct.
In words: assuming that $0 < x < p$ and that $p$ is odd, our algorithm refines the program that just returns the desired sum.
If $A$ and $B$ are programs, the refinement relation $A \le B$ means that every possible result of program $A$ is also a possible result of program $B$, or that $B$ fails. Moreover, $A$ can only fail if $B$ fails. In the above lemma, the refined program is a deterministic program that does not fail; thus our algorithm does not fail and has either no results or the desired result. The case of having no results will only be excluded later, when we subsequently refine our algorithm to Isabelle-LLVM, which provably cannot have an empty set of results.
An assertion fails if the asserted predicate does not hold. In the context of stepwise refinement, assertions are a valuable tool to organise proof obligations: during the proof of the algorithm, we know that $w < p$ (it follows from the loop invariant). Making this knowledge explicit as an assertion allows us to use it later on, when we further refine the algorithm: since a failed algorithm is refined by any program, we can assume that the algorithm to be refined does not fail, and thus that the assertions hold. Also, as in unverified programming, assertions are an engineering tool, making flawed proof attempts fail early. For the sake of readability, we will often elide assertions from presented listings.
For this paper, we will elide most actual Isabelle proofs and rather describe their ideas. The Isabelle listings shown are slightly edited for presentation purposes. Lemma and function names coincide with the actual formalisation.
3.1.2 Introducing Montgomery Form
In the next step, we use Montgomery arithmetic (also known as REDC arithmetic) [23] for the loop state and the accumulator. We also insert annotations for type casting. The algorithm is shown in Fig. 1.
Here, $W$ is the bit width for numeric operations (in our case 32). The accumulator is maintained as a double-width word that will not overflow during the loop. Thus, we can defer the reduction modulo $p$ until after the loop.
Moreover, the code contains annotations for type casts, bit widths, and the types of operations. For example, the operation mul_uuu.op (2*W) a b multiplies its unsigned double-width ($2W$) arguments a and b, asserting that there is no overflow. While these annotations are not strictly necessary, they make the next refinement step simpler (cf. Sec. 3.1.3).
Some operations for the Montgomery form also require extra parameters apart from the bit width $W$. For example, the multiplication requires both the modulus $p$ and $p'$, its inverse modulo $2^W$. The inverse is pre-computed once and passed as an extra parameter to the function. We show that this function, harvey_restsum2, is a refinement of the above harvey_restsum1.
The first assumption of the refinement lemma states that ctxt is a Montgomery refinement context (the formalisation of Montgomery form uses these contexts, while our algorithms use explicit parameters; these are independent design choices, which slightly clash here) for bit width $W$, modulus $p$ and its inverse $p'$. The lemma then states that if the inputs are in bounds and xi is the Montgomery form of $x$, then harvey_restsum2 W p' p xi s n refines harvey_restsum1 p x s n wrt. the relation mont_rel_relaxed ctxt (n+1). This relation indicates that the left value modulo $p$ is the Montgomery form of the right value, and that the left value is below an explicit bound depending on $n+1$. The counter is used to keep track of an upper bound for the accumulator, while we keep adding values to it.
Note that we can easily prove the non-overflow assertions on this level. The assertion from the previous refinement level helps here, as it allows us to assume $w < p$.
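For readers unfamiliar with Montgomery form, the following self-contained C++ sketch (our own illustration with $W = 32$, not the verified code; the verified version additionally uses the relaxed form that keeps values only below $2p$ and omits the final conditional subtraction) shows the core REDC operation:

```cpp
#include <cstdint>

// Montgomery (REDC) arithmetic for W = 32, i.e. R = 2^32.
// Requirements: p odd and p < 2^31 (so no intermediate overflows 64 bits);
// pinv = -p^{-1} mod 2^32, precomputed once (e.g. via Hensel lifting).
struct Mont32 {
    uint32_t p;     // odd modulus
    uint32_t pinv;  // -p^{-1} mod 2^32

    // REDC: for T < p * 2^32, returns T * 2^{-32} mod p, fully reduced.
    uint32_t redc(uint64_t T) const {
        uint32_t m = (uint32_t)T * pinv;           // T + m*p == 0 (mod 2^32)
        uint64_t t = (T + (uint64_t)m * p) >> 32;  // exact division by 2^32
        return t >= p ? (uint32_t)(t - p) : (uint32_t)t;
    }
    // Multiplication of two values in Montgomery form.
    uint32_t mul(uint32_t a, uint32_t b) const {
        return redc((uint64_t)a * b);
    }
    // Conversion into Montgomery form: x -> x * 2^32 mod p.
    uint32_t to_mont(uint32_t x) const {
        return (uint32_t)(((uint64_t)x << 32) % p);
    }
    // Conversion out of Montgomery form: a -> a * 2^{-32} mod p.
    uint32_t from_mont(uint32_t a) const { return redc(a); }
};
```

Dropping the final conditional subtraction, as the relaxed form does, keeps the hot loop branch-free at the cost of a weaker bound on the representatives — the trade-off that the bound in mont_rel_relaxed makes explicit.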
3.1.3 Refinement to LLVM
The last step uses the sepref [18] tool to refine the algorithm to (an Isabelle model of) LLVM code.
The already-proved bounds annotations make this proof easy: after unfolding the definition and rewriting the for loop into a while loop (which sepref can process), we annotate the number literals with their respective bit widths and then invoke sepref. This generates LLVM code and a refinement lemma, stating that the word-sized parameters are implemented by unsigned 32-bit words (u32A), the counter parameter is implemented by an unsigned 64-bit word (u64A), and the result is returned as an unsigned 64-bit word.
3.2 Further Optimisation of the Inner Sum
For the aligned part of the summation, we apply another crucial optimisation that is also described by Harvey. Instead of iterating through the sum in single steps, we iterate in chunks of size 64. Note that $\varepsilon_j$ being $0$ or $1$ is equivalent to the $(j+1)$-st digit of the binary expansion of $x/p$ being $0$ or $1$, which allows us to determine the values of $\varepsilon_j$ for 64 consecutive values of $j$ in a single step by computing the corresponding bits of the binary expansion of $x/p$. We then split the chunk-aligned sum into chunks and, within each chunk, into blocks of 8 bits, and introduce a tabulation: each block's bit pattern is used to index into a table of partial sums, to whose entries the power corresponding to the current block position is added; after the loop, the value of the inner sum is recovered from the table entries.
Since $n'$ is typically much larger than $m$ and the table size, most of the computational work goes into computing the table, which our optimisations make cheap: in an outer loop, we obtain chunks of 64 bits of the binary expansion and incrementally maintain the current remainder. The inner loop only bit-shifts and masks these bits, indexes into the table, and adds to table entries. Additionally, the implementation makes similar optimisations to those we described in Sec. 3.1. In particular, we use the relaxed Montgomery form for the table entries to allow for a mostly branch-free implementation of the inner loop. (The computation of the binary expansion we describe below does use a branch to correct a possible 'off-by-one' error. However, this happens so rarely that, due to branch prediction, the cost of this branch is negligible.) Moreover, the size of the table (256 double-width machine words, i.e. 2 KiB in our current setting) is chosen such that it fits comfortably inside the L1d cache of a typical modern CPU.
To incrementally obtain the bits of the binary expansion, we maintain the value of $2^{64 i} x \bmod p$ in a variable modf, and observe that the next 64 bits (i.e. the next base-$2^{64}$ digit of $x/p$) are $\lfloor 2^{64} \cdot \mathrm{modf} / p \rfloor$, while the new value of modf is $2^{64} \cdot \mathrm{modf} \bmod p$. However, instead of performing an expensive division by $p$, we apply a further optimisation: we precompute a fixed-point approximation $v$ of the reciprocal of $p$. Multiplying with $v$ and shifting then yields the desired quotient up to a possible 'off-by-one' error; to be able to detect and correct this case with a single comparison, we limit the size of the operands. Using this, we can reduce the division and modulo operation to a multiplication operation. (In the listings, >> denotes the right-shift operation and AND the bitwise 'and'.) This shows how to obtain the next digit and maintain the value of modf. Note that $v$ is precomputed outside the loop. Figure 2 displays the algorithm for computing the table.
After some initialisation, the outer loop iterates over the chunks. In each iteration of the outer loop, the binary expansion of $x/p$ is advanced, and another 64 bits of digits are obtained in cbits. The inner loop iterates over the blocks, obtains the current block's bits, and updates the table. After the inner loop, the outer loop maintains the value curx of $2^{64 i} x \bmod p$.
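The following C++ sketch is our own reconstruction of this digit-extraction idea, not the verified code; the concrete reciprocal width ($\lfloor 2^{96}/p \rfloor$) is our choice of parameters. It emits the binary expansion of $x/p$ in 64-bit chunks with one multiplication and one (rarely taken) correction branch per chunk:

```cpp
#include <cstdint>

using u64  = uint64_t;
using u128 = unsigned __int128;

// Emits the binary digits of x/p, 64 at a time, maintaining
// modf = 2^{64*i} * x mod p. Requires 0 < x < p, p odd, p < 2^32.
// The division by p is replaced by a multiplication with the precomputed
// reciprocal v = floor(2^96 / p); the quotient estimate is too small by
// at most one, which a single comparison detects and corrects.
struct BinaryExpansion {
    u64  p;
    u64  modf;  // 2^{64*i} * x mod p
    u128 v;     // floor(2^96 / p), precomputed once

    BinaryExpansion(u64 p_, u64 x)
        : p(p_), modf(x % p_), v(((u128)1 << 96) / p_) {}

    // Next 64 bits of the expansion, i.e. floor(2^64 * modf / p);
    // also advances modf to 2^64 * modf mod p.
    u64 next_chunk() {
        u64  q   = (u64)(((u128)modf * v) >> 32);     // q or q - 1
        u128 rem = ((u128)modf << 64) - (u128)q * p;  // remainder for estimate
        if (rem >= p) { q += 1; rem -= p; }           // rare off-by-one fix
        modf = (u64)rem;
        return q;
    }
};
```

The correction branch fires only when the reciprocal estimate is one too small, which matches the branch-prediction argument above.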
3.3 Putting Things Together
Finally, we assemble the aligned inner sum, the computation of the remainder, and the fallback to the slower algorithm when $\mathrm{ord}_p(2) \mid k$ (i.e. when $2^k \equiv 1 \pmod{p}$) to obtain the algorithm harvey_select2 together with a correctness theorem.
That is, under the listed preconditions, harvey_select2 will return the Montgomery form of an integer $y$ with $y \equiv B_k \pmod{p}$.
Next, we address some of the preconditions. First, we compute the inverse of $p$ modulo $2^W$ via Hensel lifting. Next, as mentioned before, when $(p-1) \mid k$ we have $N_k \equiv -D_k/p \pmod{p}$, which allows us to easily determine $N_k \bmod p$ directly. Moreover, when $k$ is too large, we apply the previously mentioned range reduction based on Kummer's congruence. Finally, we multiply the result with $D_k \bmod p$ to obtain $N_k \bmod p$, and convert the result from Montgomery form to normal numbers.
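The Hensel-lifting step is small enough to show in full; this is our own C++ rendering of the standard Newton iteration (the formalised version computes the analogous value at width $W$):

```cpp
#include <cstdint>

// Inverse of an odd p modulo 2^32 via Hensel lifting / Newton iteration:
// each step x <- x * (2 - p*x) doubles the number of correct low bits.
uint32_t inv_mod_pow2(uint32_t p) {        // p must be odd
    uint32_t x = p;                        // p*p == 1 (mod 8): 3 bits correct
    for (int i = 0; i < 4; ++i)            // 3 -> 6 -> 12 -> 24 -> 48 bits
        x *= 2 - p * x;                    // wraps mod 2^32 automatically
    return x;                              // now p * x == 1 (mod 2^32)
}
// Depending on the REDC formulation, either x or its negation
// -x (mod 2^32) is the precomputed Montgomery parameter.
```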
The resulting algorithm is displayed in Fig. 3.
In addition to the parameters $k$, $p$, $g$, and $\mathrm{ord}_p(2)$, it also takes the prime factorisation of $D_k$. The function prod_list_exc2 computes $D_k / p \bmod p$ by multiplying all prime factors except $p$.
In words, the correctness theorem states: the algorithm returns $N_k \bmod p$ for any even natural number $k$ and any odd prime $p$ within the supported range.
4 Computing $B_k$ as a Rational Number
The full algorithm consists of a preprocessing phase to create a large array of independent 'tasks', each being represented by a triple $(p, g, o)$ of a prime, a generator, and the multiplicative order of 2; then the main part of computing $N_k \bmod p$ for each of these; and finally a modular reconstruction phase to determine $B_k$ as a rational number. Figure 4 shows a high-level view of the algorithm without the modular reconstruction phase.
First, we compute a rough a-priori upper bound $K$ for the largest prime that will be needed to gather enough modular information to reconstruct $B_k$. We also check that this bound is small enough to not cause overflows later, and otherwise return an error. Next, we use a simple sieving algorithm to compute all primes up to $K$ and a factor map that maps every composite number to a prime factor. Next, we compute the prime factors of $D_k$ using the von Staudt–Clausen theorem. We then use this information to determine a good upper bound for the largest needed prime and drop the unnecessary primes from the list. For each remaining prime $p$, we then compute $o = \mathrm{ord}_p(2)$ and find a generator $g$ of $\mathbb{Z}_p^*$. We refer to the list of triples $(p, g, o)$ as the 'task list' (or, in the Isabelle formalisation, the 'pgo list'). Finally, we compute the modular information $N_k \bmod p$ for each of the primes in parallel and return the result as well as the prime factorisation of $D_k$.
In the following sections, we will explain each of these steps as well as the modular reconstruction step in some detail.
4.1 A-priori bound
The function estimate_check_a_priori_bound k (cf. Fig. 4) returns a pair $(ok, K)$. If $ok$ is true, then $K$ is an upper bound for the largest prime number that will be needed to reconstruct $B_k$. If $ok$ is false, we cannot guarantee that our 32-bit arithmetic will not overflow, and we abort the algorithm. The first ingredient to compute $K$ is a bound on $|B_k|$:
Theorem 8 (Upper bound for $|B_k|$).
Let $k \ge 2$ be an even integer. Then
$$|B_k| \le \alpha\,\sqrt{k}\,(\beta k)^k$$
where $\alpha = \pi^2 e / 3$ and $\beta = 1/(2\pi e)$.
Proof.
We use Theorem 2 to express $B_k$ in terms of $\zeta(k)$. The monotonicity of $\zeta(x)$ for real $x > 1$ implies $\zeta(k) \le \zeta(2)$. Combining this with Stirling's inequality and noting that $\zeta(2) = \pi^2/6$ yields the result.
We can convert this into a bound on the numerator by combining it with the von Staudt–Clausen theorem: it implies an upper bound on $D_k$ and therefore on $|N_k| = |B_k| \cdot D_k$. Lastly, a weak (but non-asymptotic) version of the Prime Number Theorem states that the product of all primes up to $x$ is at least $e^{c x}$ for an explicit constant $c$ and all $x$ above an explicit bound [6]. This was also formalised specifically for this project. Combining all of these bounds, we obtain the following:
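To make the estimate concrete, one possible chain of inequalities (our own rendering; the formalised constants may be arranged differently) is:
$$|B_k| = \frac{2\,k!}{(2\pi)^k}\,\zeta(k) \;\le\; \frac{2\,\zeta(2)\,k!}{(2\pi)^k} \;=\; \frac{\pi^2}{3}\cdot\frac{k!}{(2\pi)^k} \;\le\; \frac{\pi^2 e}{3}\,\sqrt{k}\,\left(\frac{k}{2\pi e}\right)^{k},$$
using the Stirling-type bound $k! \le e\,\sqrt{k}\,(k/e)^k$, which holds for all $k \ge 1$.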
Theorem 9 (A-priori prime bound).
Let $k$ be an even integer. Then the modular information of the primes $p \le P(k)$ suffices to reconstruct $B_k$, where
$$P(k) = c_1 + c_2\,k + c_3\,k \ln k$$
for explicit constants $c_1$, $c_2$, $c_3$.
In our algorithm, we use a slightly weaker (but easier to evaluate) version of this upper bound. If it exceeds the largest value for which we can rule out overflows, we return $ok = \mathsf{false}$. We prove that the bound stays within this range for all $k$ up to an explicit limit. We leave it to future work to delay the check until after we have pruned the list of primes with the more precise bound.
Note that our bound is lower than Harvey's since, first, we can make use of all primes because we compute $N_k \bmod p$ instead of $B_k \bmod p$, and secondly, we use a somewhat sharper inequality for $|B_k|$.
4.2 Sieving
The prime sieving algorithm mk_factor_prime_cache K returns a tuple $(pc, fc)$, where $pc$ ("prime cache") is the list of primes $p \le K$, and $fc$ ("factor cache") encodes a mapping that maps any integer between $1$ and $K$ to its smallest prime factor (if it is composite) or to 0 (if it is prime or $1$). The sieving algorithm starts by mapping all numbers to $0$ and then proceeds in the usual fashion. When sieving is finished, the list of primes is extracted from the factor cache.
The advantage of the factor cache is that it allows us to not only quickly list prime numbers and determine whether a number is prime, but also to factor numbers efficiently. This speeds up the compute_harvey_pgos algorithm (cf. Sec. 4.4).
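A compact C++ sketch of such a smallest-prime-factor sieve (our own illustration of the idea behind mk_factor_prime_cache; names and representation are hypothetical):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// fc[n] = smallest prime factor of n if n is composite, else 0;
// pc = list of all primes up to K, extracted from fc afterwards.
std::pair<std::vector<uint32_t>, std::vector<uint32_t>>
factor_prime_cache(uint32_t K) {
    std::vector<uint32_t> fc(K + 1, 0);
    for (uint64_t p = 2; p * p <= K; ++p)
        if (fc[p] == 0)                              // p is prime
            for (uint64_t n = p * p; n <= K; n += p)
                if (fc[n] == 0) fc[n] = (uint32_t)p; // record smallest factor
    std::vector<uint32_t> pc;
    for (uint32_t n = 2; n <= K; ++n)
        if (fc[n] == 0) pc.push_back(n);             // 0 means prime
    return {pc, fc};
}
```

Factoring a number $n \le K$ then takes only repeated lookups: divide by fc[n] until the entry is 0, at which point the remaining value is prime.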
4.3 Computing the Denominator and Precise Bounding
We can now determine the prime factorisation of $D_k$ by simply taking all primes $p$ with $(p-1) \mid k$. The algorithm adjust_primes2 dfs k pc sums up $\log_2 p$ for all of these to get an approximation of $\log_2 D_k$ that easily fits in a machine integer. Together with our estimate for $|B_k|$, we can now compute a relatively precise estimate of $\log_2 |N_k|$.
We then choose a prefix of the sieved primes such that the product of the corresponding moduli exceeds $2\,|N_k|$. For this, we use a fixed-point approximation of $\log_2$. This avoids using floating-point numbers or arbitrary-precision integers and is still precise enough in practice. Note that we do not actually store the prime 2 in the sieve or pruned list of primes, but still consider it for the bound. Before reconstructing $N_k$ (cf. Sec. 4.6), we do add the congruence $N_k \equiv 1 \pmod{2}$ (which holds for any even $k$) to our modular information.
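As an illustration of the fixed-point idea (a minimal sketch under our own assumptions about the format; the formalisation's precision and rounding analysis are more careful), here is a $\log_2$ with 31 fractional bits using only integer arithmetic:

```cpp
#include <cstdint>

// Approximates log2(n) for 1 <= n < 2^32, returned scaled by 2^31
// (a slight underestimate due to truncation). The fractional bits come
// from the classic square-and-compare recurrence on a mantissa in [1, 2).
uint64_t log2_fixed(uint32_t n) {
    int e = 31 - __builtin_clz(n);        // integer part: floor(log2 n)
    uint64_t y = (uint64_t)n << (31 - e); // mantissa in [2^31, 2^32)
    uint64_t frac = 0;
    for (int i = 0; i < 31; ++i) {
        y = (y * y) >> 31;                // square the mantissa (fits in 64 bits)
        frac <<= 1;
        if (y >= (1ull << 32)) {          // mantissa >= 2: emit a 1, halve
            frac |= 1;
            y >>= 1;
        }
    }
    return ((uint64_t)e << 31) | frac;
}
```

Summing such values over the chosen primes and comparing against a fixed-point estimate of $\log_2(2\,|N_k|)$ decides when the prefix is long enough, without floating-point or big-integer arithmetic.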
4.4 Preparing the Task List
The algorithm compute_harvey_pgos pc fc takes the pruned prime list and the factor map and computes a list of task entries of the form $(p, g, o, \cdot)$. Here, $p$ is the prime itself, $g$ is a generator of $\mathbb{Z}_p^*$, $o$ is the multiplicative order of 2 in $\mathbb{Z}_p^*$, and the last component is a placeholder that will be filled in by Harvey's algorithm.
A number $g$ is a generator of $\mathbb{Z}_p^*$ iff it has maximal order, i.e. if $\mathrm{ord}_p(g) = p - 1$. We can determine whether $g$ is a generator by checking that $g^{(p-1)/q} \not\equiv 1 \pmod{p}$ for all prime factors $q$ of $p - 1$. We find the smallest generator by simply checking $g = 2, 3, \ldots$ and taking the first one that works. (Empirically, the smallest generator is almost always quite small: over an initial range of primes, its mean and maximum are 4.91 and 94. It is known to be $O(p^{1/4+\varepsilon})$ unconditionally, and the generalised Riemann hypothesis implies that it is $O(\log^6 p)$.)
As for computing the order: $\mathrm{ord}_p(2)$ is defined as the smallest positive integer $t$ such that $2^t \equiv 1 \pmod{p}$. Lagrange's theorem tells us that $\mathrm{ord}_p(2) \mid p - 1$. Therefore, every factor in the prime factorisation of $\mathrm{ord}_p(2)$ must also be a factor of $p - 1$, and we can compute the order by letting $t = p - 1$ and then, for each prime factor $q$ of $p - 1$, repeatedly replacing $t$ by $t/q$ as long as $q \mid t$ and $2^{t/q} \equiv 1 \pmod{p}$; once this has been done for all $q$, we have $t = \mathrm{ord}_p(2)$ as desired.
We are not aware of better algorithms for computing the order and finding a generator; however, even the worst-case running time of these is negligible compared to the main part of our algorithm.
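Both procedures are short; the following C++ sketch (ours; the verified versions work on the factor cache directly) implements them literally as described:

```cpp
#include <cstdint>
#include <vector>

using u64 = uint64_t;

static u64 pow_mod(u64 b, u64 e, u64 m) {  // OK for moduli below 2^32
    u64 r = 1 % m;
    for (b %= m; e > 0; e >>= 1) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
    }
    return r;
}

// ord_p(2): start at t = p - 1 and strip prime factors q of p - 1
// for as long as 2^(t/q) is still 1 (mod p).
// 'factors' holds the distinct prime factors of p - 1.
u64 order_of_two(u64 p, const std::vector<u64> &factors) {
    u64 t = p - 1;
    for (u64 q : factors)
        while (t % q == 0 && pow_mod(2, t / q, p) == 1)
            t /= q;
    return t;
}

// Smallest generator of Z_p^*: g generates iff g^((p-1)/q) != 1 (mod p)
// for every prime factor q of p - 1.
u64 smallest_generator(u64 p, const std::vector<u64> &factors) {
    for (u64 g = 2;; ++g) {
        bool ok = true;
        for (u64 q : factors)
            if (pow_mod(g, (p - 1) / q, p) == 1) { ok = false; break; }
        if (ok) return g;
    }
}
```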
4.5 Parallel Map
To compute the modular information in parallel, we instantiate a generic parallel map combinator with bernoulli_num_harvey_wrapper2 (cf. Sec. 3.3). We elide the boilerplate code for the instantiation and only display the resulting correctness lemma, together with its pre- and postconditions, in Fig. 5.
4.6 Postprocessing
Figure 6 displays our Isabelle formalisation of Harvey's full algorithm. After handling some special cases for $k = 0$, $k = 1$, and odd $k > 1$, it computes the modular information and a prime factorisation of the denominator. If the $ok$ flag is false, we abort the algorithm. Otherwise, the task tuples are converted to pairs of residue and modulus. We also add the pair $(1, 2)$, since $N_k$ is always odd. At this point, we know that for each pair $(r, m)$, we have $N_k \equiv r \pmod{m}$. Moreover, the moduli in these pairs are pairwise distinct primes, and their product is greater than $2\,|N_k|$. Chinese remaindering gives us a pair $(r, M)$ with $N_k \equiv r \pmod{M}$ and $0 \le r < M$. Thus, $N_k = r$ if $r < M/2$ and $N_k = r - M$ if $r > M/2$, and we know that $N_k > 0$ iff $k \equiv 2 \pmod{4}$. Finally, we compute the denominator from its prime factorisation.
4.6.1 Fast Chinese Remaindering
The Chinese Remainder Theorem (CRT) allows us to determine $x \bmod (m_1 \cdots m_r)$ from the values of $x \bmod m_i$ for pairwise coprime moduli $m_i$.
A naïve implementation of the CRT is too slow for numbers as big as ours. We therefore follow the approach called remainder trees outlined in the classic textbook by von zur Gathen and Gerhard [24]. The basic data structure is a binary tree where every leaf has a pair of modular datum and modulus attached to it, and every internal node has a modulus attached to it that is the product of the moduli of its children. Given this tree, we can compute our desired result with a simple recursive function that combines the congruences of the two subtrees using the two-modulus case of the CRT (see the sketch below).
This can also be parallelised easily, since different branches of the tree are independent.
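To illustrate the recursion, here is a toy C++ version (our own); unsigned __int128 stands in for the GMP integers of the real implementation, so the combined modulus must stay below $2^{128}$ (and each subtree's modulus below $2^{64}$):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using u64  = uint64_t;
using u128 = unsigned __int128;

struct Leaf { u64 r, m; };   // datum r modulo modulus m
struct RM   { u128 r, m; };  // combined congruence

static u128 inv_mod(u128 a, u128 m) {  // extended Euclid; gcd(a, m) = 1
    __int128 t = 0, nt = 1, rr = (__int128)m, nr = (__int128)(a % m);
    while (nr != 0) {
        __int128 q = rr / nr;
        t -= q * nt; std::swap(t, nt);
        rr -= q * nr; std::swap(rr, nr);
    }
    return (u128)(t < 0 ? t + (__int128)m : t);
}

// Two-modulus CRT step: find x with x == a.r (mod a.m), x == b.r (mod b.m).
static RM combine(RM a, RM b) {
    u128 diff = (b.r + b.m - a.r % b.m) % b.m;  // (b.r - a.r) mod b.m
    u128 x = a.r + a.m * ((diff * inv_mod(a.m % b.m, b.m)) % b.m);
    return {x, a.m * b.m};                      // x is already < a.m * b.m
}

// Remainder-tree recursion over leaves[lo..hi); the two recursive calls
// are independent and can be evaluated in parallel.
RM crt(const std::vector<Leaf> &leaves, size_t lo, size_t hi) {
    if (hi - lo == 1) return {leaves[lo].r, leaves[lo].m};
    size_t mid = lo + (hi - lo) / 2;
    return combine(crt(leaves, lo, mid), crt(leaves, mid, hi));
}
```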
While the abstract version of our algorithm uses a tree data structure, which is initially created from the list of input pairs, we later refine this to use the list of input pairs directly for the leaves, and another array for the inner nodes. This implementation is slightly more memory-efficient than using a data structure with explicit pointers. However, in practice, we expect memory usage to be dominated by the space for the large integers themselves.
4.6.2 Arbitrary Precision Integers
While all other parts of our algorithm use machine words, the CRT and the computation of $D_k$ require large integers. We take a pragmatic approach and import the GNU Multiple Precision Arithmetic Library (GMP) [12] into our formalisation. We declare an assertion mpzA for big integers as well as constants for the GMP library functions, and use Isabelle's specification mechanism to specify the Hoare triples for them. We then instruct Isabelle LLVM to translate calls to the specified functions to the corresponding symbols from the GMP library.
The specification mechanism ensures that we do not know more about the constants than what was specified. Any such knowledge is dangerous, as it could be used to prove behaviour that is not exhibited by the GMP library. At the same time, the specification mechanism requires us to prove that a model for the specification exists. This is an important sanity check, as it prevents us from accidentally specifying contradictory statements.
Note that a verified big integer library is orthogonal to this work: it could easily replace our use of GMP, with little effect on the rest of the formalisation.
5 Code Export and Final Correctness Theorem
Finally, we use sepref to refine bern_crt (cf. Sec. 4.6) to Isabelle LLVM and prove a corresponding refinement lemma.
Note that Isabelle LLVM uses non-negative signed integers in various places, which are related to natural numbers by the sn64A refinement assertion. To make the algorithm usable from C++, we define the wrapper function bern_crt_impl_wrapper, which returns void and uses pointer parameters to pass the results instead. Combining the refinement lemmas yields the final correctness theorem, a Hoare triple for this wrapper.
The preconditions of this Hoare triple are that okp, nump, and denomp point to valid memory, that ti and di are 64-bit non-negative signed integers, and that ki is a 32-bit unsigned integer with value k. Then, after calling bern_crt_impl_wrapper okp nump denomp ti di ki, the pointers okp, nump, and denomp will point to the values ok, numi, and denomi; numi and denomi will be valid GMP numbers representing the integers num and denom; if the a-priori bound check succeeds, ok will be non-zero; and if ok is non-zero, num and denom represent the Bernoulli number $B_k$. Note that we explicitly specified $\mathrm{num} = N_k$ and $\mathrm{denom} = D_k$ here, as this also guarantees that num and denom have no common divisors.
This correctness theorem only depends on the Isabelle LLVM model, the formalisation of separation logic, our specification of GMP, and the definition of Bernoulli numbers. All the intermediate refinement steps need not be trusted.
Finally, we use Isabelle LLVM to export the function to actual LLVM text and to generate a header file to interface with the generated code from C/C++.
We then implement a command line interface in C++, with which we run our experiments to generate actual Bernoulli numbers. For the more detailed timing measurements, we instrumented the LLVM code to invoke callbacks into our C++ program when certain phases of the algorithm are completed.
6 Benchmarks
Table 1 shows the performance data of the exported LLVM code, compiled with Clang 18.1.8 and running on a single “standard” node of the LEO5 cluster operated by the University of Innsbruck with two 32-core 2.60 GHz Intel Xeon Platinum 8358 CPUs with hyperthreading.
| Input | | | | | | | |
|---|---|---|---|---|---|---|---|
| time elapsed (s) | 1.12 | 1.41 | 5.75 | 44.6 | 432.3 | 4,215 | 45,458 |
| CPU time (s) | 12.56 | 61.45 | 430.11 | 4,318.58 | 46,960 | 475,160 | 5,330,806 |
| CPU factor | 12.1 | 44.0 | 74.9 | 96.9 | 108.6 | 112.7 | 117.3 |
| memory (MB) | 48.9 | 124.4 | 378.8 | 1,220 | 3,948 | 11,775 | 36,769 |
| sieving (%) | 0.4 | 1.3 | 1.1 | 0.7 | 0.3 | 0.1 | 0.0 |
| generator / order (%) | 6.5 | 18.9 | 15.6 | 7.3 | 2.7 | 0.9 | 0.3 |
| main phase (%) | 25.8 | 61.5 | 64.6 | 81.2 | 92.4 | 96.9 | 98.9 |
| reconstruction (%) | 30.5 | 8.4 | 8.2 | 4.5 | 1.9 | 0.7 | 0.3 |
| writing result (%) | 2.1 | 8.7 | 10.0 | 6.2 | 2.7 | 1.3 | 0.5 |
Among other things, the table shows the percentage of elapsed time spent in the various parts of the algorithm. It can be seen that, especially for larger inputs, the main phase (i.e. computing the modular information $N_k \bmod p$ for each prime $p$) dominates the running time, so that optimisation of the other parts will not result in a noticeable speed-up.
Figure 7 shows the relative performance of our verified LLVM implementation and Harvey’s unverified C++ implementation on different machines: a 12-core desktop computer, a 16-core server, and a cluster node with two 32-core server CPUs. For the desktop computer, we also show the single-core performance. Figure 8 gives a closer look at the single-core performance on the desktop computer, which is less noisy and thus allows a better comparison.
The data suggests that our implementation is significantly slower than Harvey's for small inputs, but competitive on the more interesting larger inputs. Interestingly, our implementation slightly outperforms Harvey's on the cluster for part of the measured range. This may be due to the different implementation of parallelisation.
Analysing where and why our implementation is slower than Harvey's would require thorough microbenchmarking. Also note that the results shown in this section are not a statistically rigorous performance analysis, but merely some experiments to establish that the performance of our implementation is roughly comparable to that of Harvey's.
7 Related Work
We are not aware of any other work on the verification of algorithms involving Bernoulli numbers. As far as proof assistants are concerned, we are only aware of two others that even have a definition of Bernoulli numbers, namely HOL Light and Lean. However, neither system’s library seems to have an algorithm for computing them other than using the recurrence that follows directly from their definition. We are not aware of any verification efforts for Bernoulli number algorithms outside proof assistants either.
As for unverified algorithms, there are various approaches apart from Harvey's 2010 algorithm (which we formalised in this paper) that are more efficient than the naïve approach:
- computing the numbers with the Akiyama–Tanigawa algorithm [2, 15]
- computing $\zeta(k)$ to sufficient precision and recovering $B_k$ via Theorem 2
- performing the division $t / (e^t - 1)$ as formal power series, either directly or via Kronecker substitution (cf. e.g. Brent and Harvey [3])
Harvey discusses some of these in some more detail in his paper. The first of them is already available in Isabelle [4], but it is quite inefficient for even moderate values of $k$. The last two have the same asymptotic running time as the algorithm we verify, namely $k^{2+o(1)}$, but for various reasons discussed in Harvey's paper, they seem to perform much worse in practice. Note however that the third algorithm computes not only $B_k$ but the entire sequence $B_0, \ldots, B_k$.
In 2012, Harvey also published another multimodular algorithm for computing $B_k$, which achieves subquadratic running time by computing $B_k$ modulo prime powers instead of simply primes. Asymptotically, this is a large improvement over all previously known algorithms; however, to date, there is still no implementation, and it is not clear how well it would perform for feasible input sizes in practice. In fact, it seems that the only fast algorithms that have actual implementations are the ones based on approximating $\zeta$ and Harvey's 2010 algorithm.
8 Conclusion
| Component | LOC |
|---|---|
| Voronoi/Kummer congruences [7] | 2,397 |
| Prime bounds [6] | 1,814 |
| Fixed-point approximation of $\log_2$ | 892 |
| Prime sieving, $\mathrm{ord}_p(2)$, generators | 2,765 |
| Fast Chinese remaindering | 3,801 |
| Montgomery arithmetic | 2,317 |
| Binary fraction expansion of rationals | 968 |
| Miscellaneous | 1,373 |
| GMP bindings for Isabelle-LLVM | 1,749 |
| Additions to sepref/Isabelle-LLVM | 4,365 |
| Main algorithm (abstract) | 1,569 |
| Main algorithm (refinement) | 8,568 |
| Total | 32,578 |
We fully verified a challenging state-of-the-art mathematical algorithm to compute Bernoulli numbers, achieving performance comparable to highly optimised hand-written C++ code. A large amount of machinery had to be developed to achieve this, including:
- Mathematical background: Voronoi's and Kummer's congruences [7], upper bounds for $|B_k|$, and concrete bounds for prime-counting functions [6]
- Algorithms, abstractly and verified down to LLVM: prime sieving, Chinese remaindering via remainder trees, computing $D_k$, fixed-point approximations of $\log_2$, efficient computation of the binary fraction expansion of a rational number, Hensel lifting, the extended Euclidean algorithm, Montgomery ('REDC') arithmetic, computing $\mathrm{ord}_p(2)$, finding generators in $\mathbb{Z}_p^*$
- Extending the Isabelle/LLVM framework: axiomatisation of GMP integers, destructive parallel maps over an array, low-level bit arithmetic
We expect many of these developments to be useful for other applications.
The total development comprises about 32 kLOC in Isabelle (see Table 2 for more details). The mathematical core of the development is relatively small (about 6 kLOC or 20 %), which matches our experience that proofs about algorithms tend to be bulkier than proofs about mathematics.
References
- [1] J. C. Adams. On the calculation of Bernoulli's numbers up to $B_{62}$ by means of Staudt's theorem. In Report of the Meeting of the British Association for the Advancement of Science, pages 8–15, 1877. URL: https://gallica.bnf.fr/ark:/12148/bpt6k78160g.
- [2] Shigeki Akiyama and Yoshio Tanigawa. Multiple zeta values at non-positive integers. The Ramanujan Journal, 5(4):327–351, 2001. doi:10.1023/A:1013981102941.
- [3] Richard P. Brent and David Harvey. Fast Computation of Bernoulli, Tangent and Secant Numbers, pages 127–142. Springer New York, 2013. doi:10.1007/978-1-4614-7621-4_8.
- [4] Lukas Bulwahn and Manuel Eberl. Bernoulli numbers. Archive of Formal Proofs, January 2017. Formal proof development. URL: https://isa-afp.org/entries/Bernoulli.html.
- [5] Henri Cohen. Number Theory: Volume II: Analytic and Modern Tools. Graduate Texts in Mathematics. Springer New York, 2007.
- [6] Manuel Eberl. Concrete bounds for Chebyshev’s prime counting functions. Archive of Formal Proofs, September 2024. Formal proof development. URL: https://isa-afp.org/entries/Chebyshev_Prime_Bounds.html.
- [7] Manuel Eberl. Kummer’s congruence. Archive of Formal Proofs, March 2024. Formal proof development. URL: https://isa-afp.org/entries/Kummer_Congruence.html.
- [8] Andres Erbsen, Jade Philipoom, Jason Gross, Robert Sloan, and Adam Chlipala. Simple high-level code for cryptographic arithmetic – with proofs, without compromises. In 2019 IEEE Symposium on Security and Privacy (SP), pages 1202–1219, 2019. doi:10.1109/SP.2019.00005.
- [9] Eugene Eric Kim and Betty Alexandra Toole. Ada and the first computer. Scientific American, 280(5):76–81, May 1999. doi:10.1038/scientificamerican0599-76.
- [10] Greg Fee and Simon Plouffe. An efficient algorithm for the computation of Bernoulli numbers, 2007. arXiv:math/0702300.
- [11] Mathias Fleury and Peter Lammich. A more pragmatic CDCL for IsaSAT and targetting LLVM (short paper). In Brigitte Pientka and Cesare Tinelli, editors, Automated Deduction - CADE 29 - 29th International Conference on Automated Deduction, Rome, Italy, July 1-4, 2023, Proceedings, volume 14132 of Lecture Notes in Computer Science, pages 207–219. Springer, 2023. doi:10.1007/978-3-031-38499-8_12.
- [12] The GNU multiple precision arithmetic library. URL: https://gmplib.org/.
- [13] Arnd Hartmanns, Bram Kohlen, and Peter Lammich. Efficient formally verified maximal end component decomposition for MDPs. In André Platzer, Kristin Yvonne Rozier, Matteo Pradella, and Matteo Rossi, editors, Formal Methods - 26th International Symposium, FM 2024, Milan, Italy, September 9-13, 2024, Proceedings, Part I, volume 14933 of Lecture Notes in Computer Science, pages 206–225. Springer, 2024. doi:10.1007/978-3-031-71162-6_11.
- [14] David Harvey. A multimodular algorithm for computing Bernoulli numbers. Math. Comput., 79(272):2361–2370, 2010. doi:10.1090/S0025-5718-2010-02367-1.
- [15] M. Kaneko. The Akiyama–Tanigawa algorithm for Bernoulli numbers. Journal of Integer Sequences, 3, 2000.
- [16] Donald E. Knuth and Thomas J. Buckholtz. Computation of tangent, Euler, and Bernoulli numbers. Mathematics of Computation, 21(100):663–688, 1967. doi:10.1090/s0025-5718-1967-0221735-9.
- [17] Bram Kohlen, Maximilian Schäffeler, Mohammad Abdulaziz, Arnd Hartmanns, and Peter Lammich. A formally verified IEEE 754 floating-point implementation of interval iteration for MDPs. CoRR, abs/2501.10127, 2025. doi:10.48550/arXiv.2501.10127.
- [18] Peter Lammich. Generating verified LLVM from Isabelle/HOL. In John Harrison, John O’Leary, and Andrew Tolmach, editors, 10th International Conference on Interactive Theorem Proving, ITP 2019, September 9-12, 2019, Portland, OR, USA, volume 141 of LIPIcs, pages 22:1–22:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ITP.2019.22.
- [19] Peter Lammich. Fast and verified UNSAT certificate checking. In Christoph Benzmüller, Marijn J. H. Heule, and Renate A. Schmidt, editors, Automated Reasoning - 12th International Joint Conference, IJCAR 2024, Nancy, France, July 3-6, 2024, Proceedings, Part I, volume 14739 of Lecture Notes in Computer Science, pages 439–457. Springer, 2024. doi:10.1007/978-3-031-63498-7_26.
- [20] Peter Lammich. Refinement of parallel algorithms down to LLVM: applied to practically efficient parallel sorting. J. Autom. Reason., 68(3):14, 2024. doi:10.1007/S10817-024-09701-W.
- [21] Peter Lammich and Thomas Tuerk. Applying data refinement for monadic programs to Hopcroft’s algorithm. In Proc. of ITP, volume 7406 of LNCS, pages 166–182. Springer, 2012. doi:10.1007/978-3-642-32347-8_12.
- [22] D. H. Lehmer. An extension of the table of Bernoulli numbers. Duke Mathematical Journal, 2(3), September 1936. doi:10.1215/s0012-7094-36-00238-7.
- [23] Peter L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, 1985. doi:10.1090/s0025-5718-1985-0777282-x.
- [24] Joachim von zur Gathen and Jürgen Gerhard. Modern Computer Algebra. Cambridge University Press, 3rd edition, 2013.
- [25] Christoph Walther. Formally verified Montgomery multiplication. In Hana Chockler and Georg Weissenbacher, editors, Computer Aided Verification - 30th International Conference, CAV 2018, Held as Part of the Federated Logic Conference, FloC 2018, Oxford, UK, July 14-17, 2018, Proceedings, Part II, volume 10982 of Lecture Notes in Computer Science, pages 505–522. Springer, 2018. doi:10.1007/978-3-319-96142-2_30.
- [26] Alexander Yee and Shigeru Kondo. 10 trillion digits of pi: A case study of summing hypergeometric series to high precision on multicore systems. Technical report, Siebel School of Computing and Data Science, 2011. URL: https://hdl.handle.net/2142/28348.
