Exact computation of sums and means

2015-05-21 at 9:22 pm 9 comments

A while ago, I came across a mention of the Python math.fsum function, which sums a set of floating-point values exactly, then rounds to the closest floating-point value. This seemed useful. In particular, I thought that if it’s fast enough it could be used instead of R’s rather primitive two-pass approach to trying to compute the sample mean more accurately (but still not exactly). My initial thought was to just implement the algorithm Python uses in pqR. But I soon discovered that there were newer (and faster) algorithms. And then I thought that I might be able to do even better…

The result is a new paper of mine on Fast exact summation using small and large superaccumulators (also available from arxiv.org).

A superaccumulator is a giant fixed-point number that can exactly represent any floating-point value (and then some, to allow for bigger numbers arising from doing sums). This concept has been used before to implement exact summation methods. But if done in software in the most obvious way, it would be pretty slow. In my paper, I introduce two new variations on this method. The “small” superaccumulator method uses a superaccumulator composed of 64-bit “chunks” that overlap by 32 bits, allowing carry propagation to be done infrequently. The “large” superaccumulator method has a separate chunk for every possible combination of the sign bit and exponent bits in a floating-point number (4096 chunks in all). It has higher overhead for initialization than the small superaccumulator method, but takes less time per term added, so it turns out to be faster when summing more than about 1000 terms.
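To make the chunk-dispatch idea concrete, here is a hedged sketch in C of how a term might be added to a large superaccumulator. The names (`large_acc`, `large_acc_add`) are mine, not the paper’s, and this is a simplification: the real method also counts additions per chunk so it can transfer a chunk onward before its spare bits overflow, and it handles zeros, subnormals, the implicit mantissa bit, Inf, and NaN.

```c
#include <stdint.h>
#include <string.h>

#define N_CHUNKS 4096   /* one chunk per sign+exponent combination (2^12) */

typedef struct {
    uint64_t chunk[N_CHUNKS];   /* fixed-point accumulators, one per exponent */
} large_acc;

/* Add one finite term: dispatch on the top 12 bits (sign + exponent) and
   add the 52-bit mantissa into that chunk.  The 12 unused high bits of
   each 64-bit chunk absorb carries, so carry propagation is infrequent. */
static void large_acc_add (large_acc *acc, double x)
{
    uint64_t bits;
    memcpy (&bits, &x, sizeof bits);               /* view the double as bits */
    unsigned index = bits >> 52;                   /* sign bit + 11 exponent bits */
    uint64_t mantissa = bits & ((UINT64_C(1) << 52) - 1);
    acc->chunk[index] += mantissa;                 /* carries land in spare bits */
}
```

Rounding the accumulator back to a double (not shown) is where the implicit leading mantissa bit and the chunk overlaps would be accounted for.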

Here is a graph of performance on a Dell Precision T7500 workstation, with a 64-bit Intel Xeon X5680 processor:


The horizontal axis is the number of terms summed, the vertical axis the time per term in nanoseconds, both on logarithmic scales. The time is obtained by repeating the same summation many times, so the terms summed will be in cache memory if it is large enough (vertical lines give sizes of the three cache levels).

The red lines are for the new small (solid) and large (dashed) superaccumulator methods. The blue lines are for the iFastSum (solid) and OnlineExact (dashed) methods of Zhu and Hayes (2010), which appear to be the fastest previous exact summation methods. The black lines are for the obvious (inexact) simple summation method (solid) and a simple out-of-order summation method (dashed), which adds terms with even and odd indexes separately, then adds these two partial sums together. Out-of-order summation provides more potential for instruction-level parallelism, but may not produce the same result as simple ordered summation, illustrating the reproducibility problems that come with trying to speed up non-exact summation.
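The two black-line methods can be sketched in a few lines of C (function names are mine). The even/odd split gives the processor two independent addition chains to overlap, but the different grouping of terms can change the rounded result:

```c
/* Obvious ordered summation: one serial chain of dependent additions. */
double sum_simple (const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Out-of-order summation: sum even- and odd-indexed terms separately,
   then combine.  The two chains can proceed in parallel at the
   instruction level, but the result may differ from sum_simple. */
double sum_out_of_order (const double *x, int n)
{
    double s_even = 0.0, s_odd = 0.0;
    for (int i = 0; i + 1 < n; i += 2) {
        s_even += x[i];
        s_odd  += x[i+1];
    }
    if (n & 1)
        s_even += x[n-1];   /* leftover term when n is odd */
    return s_even + s_odd;
}
```

For example, on the vector {1.0, 1e100, 1.0, -1e100} the ordered sum loses both 1.0 terms to rounding and returns 0, while the even/odd split keeps them and returns 2 — the reproducibility problem in miniature.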

One can see that my new superaccumulator methods are about twice as fast as the best previous methods, except for sums of less than 100 terms. For large sums (10000 or more terms), the large superaccumulator method is about 1.5 times slower than the obvious simple summation method, and about three times slower than out-of-order summation.

These results are all for serial implementations. One advantage of exact summation is that it can easily be parallelized without affecting the result, since the exact sum is the same for any summation order. I haven’t tried a parallel implementation yet, but it should be straightforward. For large summations, it should be possible to perform exact summation at the maximum rate possible given limited memory bandwidth, using only a few processor cores.
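The key step in any such parallel scheme is merging per-thread superaccumulators, which (apart from occasional carry propagation) is just chunkwise addition. A minimal sketch, with `NCHUNKS` and the type names standing in for the real ones — this is my illustration of the idea, not code from the paper:

```c
#include <stdint.h>

#define NCHUNKS 67   /* stand-in for the actual chunk count; an assumption */

typedef struct {
    int64_t chunk[NCHUNKS];   /* signed fixed-point chunks of a superaccumulator */
} acc_t;

/* Merge src into dst by adding corresponding chunks.  A real implementation
   would first propagate carries in either accumulator if it were close to
   overflowing.  Because chunkwise addition is associative and commutative,
   the merged result is the same however the terms were split across threads,
   which is what makes parallel exact summation reproducible. */
static void acc_merge (acc_t *dst, const acc_t *src)
{
    for (int i = 0; i < NCHUNKS; i++)
        dst->chunk[i] += src->chunk[i];
}
```

Each thread would sum its slice of the data into a private accumulator, and a final reduction with `acc_merge` would give exactly the serial result.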

For small sums (eg, 10 terms), the exact methods are about ten times slower than simple summation. I think it should be possible to reduce this inefficiency, using a method specialized to such small sums.

However, even without such an improvement, the new superaccumulator methods should be good enough for replacing R’s “mean” function with one that computes the exact sample mean, since for small vectors the overhead of calling “mean” will be greater than the overhead of exactly summing the vector. Summing all the data points exactly, then rounding to 64-bit floating point, and then dividing by the number of data points wouldn’t actually produce the exactly-rounded mean (due to the effect of two rounding operations). However, it should be straightforward to combine the final division with the rounding, to produce the exactly-correct rounding of the sample mean. This should also be faster than the current inexact two-pass method.

Modifying “sum” to have an “exact=TRUE” option also seems like a good idea. I plan to implement these modifications to both “sum” and “mean” in a future version of pqR, though perhaps not the next version, which may be devoted to other improvements.


Entry filed under: Computing, R Programming, Statistics - Computing.


9 Comments

  • 1. Anonymous  |  2015-05-22 at 4:05 pm

    The link to your paper is confused.

    • 2. Radford Neal  |  2015-05-22 at 4:12 pm

      Thanks! Should be fixed now. Fortunately, the arxiv link was OK.

  • 3. David Swofford  |  2019-08-16 at 2:37 pm

    I realize this post is more than 4 years old, but I was wondering if you ever wrote a routine for adding partial sums as suggested in the discussion with regard to parallelizing the summation.

    No problem either way, but it would save me some trouble figuring out how to code it if you’ve already done it.

  • 5. David Swofford  |  2019-08-17 at 3:06 pm

    Thanks for your reply! I was using an older version, so good to know about the more recent one.

    Just to be clear, I’m not asking about a full parallel implementation, just one function to add one xsum_small_accumulator to another. I’ve tried to write this myself, but so far haven’t been successful.

    • 6. Radford Neal  |  2019-08-17 at 3:39 pm

      I’ve been meaning to get back to this sometime. I’ll let you know if I’ve written such a function.

  • 7. David Swofford  |  2019-08-19 at 8:42 pm

    I was overcomplicating it. The following function serves my purposes (although it should probably deal with subnormal/Inf/NaN to be general):

    /* sacc <-- sacc + sacc_to_add */
    void xsum_small_add_acc (xsum_small_accumulator *restrict sacc,
                             xsum_small_accumulator *restrict sacc_to_add)
    {
        if (sacc->adds_until_propagate == 0)
            (void) xsum_carry_propagate(sacc);

        for (int i = 0; i < XSUM_SCHUNKS; i++)
            sacc->chunk[i] += sacc_to_add->chunk[i];
    }


  • 8. David Swofford  |  2019-08-19 at 8:52 pm

    Here’s a link to the code that was not formatted properly in the previous comment…


    • 9. Radford Neal  |  2019-08-19 at 9:06 pm

      Here’s the code properly formatted:

         /* sacc <-- sacc + sacc_to_add */
         void xsum_small_add_acc (xsum_small_accumulator *restrict sacc,
                                  xsum_small_accumulator *restrict sacc_to_add)
         {
             if (sacc->adds_until_propagate == 0)
                 (void) xsum_carry_propagate(sacc);
             for (int i = 0; i < XSUM_SCHUNKS; i++)
                 sacc->chunk[i] += sacc_to_add->chunk[i];
         }

      The problem is that angle brackets get interpreted as HTML, so they have to be written with HTML escapes.

