# Doubling the Performance of SPARKNaCl (again...)

## by Roderick Chapman (Protean Code Limited) –

# Introduction #

In my last blog entry here, we looked at improving the performance of the SPARKNaCl library, while preserving all the proof results.

At the end of that blog, I speculated a few ideas for further improvement that might bear a bit more fruit. This blog documents how three of those ideas turned out and the results that I got.

Don’t worry… this one is much shorter than the last…

## Starting point #

At the end of the last blog, we had the following performance results for a single NaCl “Sign” operation, running on a 32-bit RISC-V board. This time, I will also include result for worst-case stack usage (WCSU, generated by the GNATstack tool) and code size using the RM32IM (not compressed) ISA (generated by the riscv32-elf-size tool) for the entire SPARKNaCl library (so not including the effect of the linker removing unused sections). As before, execution times are in Millions of CPU cycles, while WCSU and code size are in bytes:

```
Level Time WCSU Size
-O0 181.74 6688 49162
-O1 69.96 4824 30492
-O2 57.82 5504 31180
```

## The big idea #

TweetNaCl stores “big” (256 bit) integers as an array of sixteen 64-bit signed integers. This type is called “GF” in the sources. A “Normal” GF has all the digits or “limbs” in the range 0 .. 65535. Signed 64-bit “limbs” are needed to store the intermediate results of “+”, “-” and “*” operations, which can overflow and/or become negative before being “normalized” modulo 2**255-19 ready for the next operation.

Two things were nagging at me about this scheme:

- Most of the time, a “Normal_GF” value occupies 128 bytes of space, but only really contains 32 bytes of useful information, since the most significant 6 bytes of each limb are always zero. This takes up lots of space on the stack, and copying/assignment of them is inefficient. See our earlier comments on functional programming style and “Return Slot Optimization” for how this kind of copying overhead can hit performance.
- I also realized that the “Carry and Reduce” (CAR) functions were all using 64-bit signed arithmetic all the time. This is really slow on a 32-bit machine like our RV32 target.

So…here’s the idea: what if we store a Normal_GF as an array of 32-bit values, and only “expand out” into 64-bit arithmetic for the “*” operation?

To be sure this will work, we’d have to know exact bounds on the types produced by the “*” operation at each phase of the normalization process. Good news: we already have that information embodied in the subtype declarations for those functions. For example, the first “CAR” function that follows “*” has the declaration:

```
function Product_To_Seminormal (X : in Product_GF) return Seminormal_GF
with Pure_Function,
Global => null;
```

so let’s look at those type declarations too. These fragments are taken from sparknacl.ads at the “Faster_Tweet” tag so they correspond to what we had at the end of the last blog.

```
-- "LM" = "Limb Modulus"
-- "LMM1" = "Limb Modulus Minus 1"
LM : constant := 65536;
LMM1 : constant := 65535;
-- "R2256" = "Remainder of 2**256 (modulo 2**255-19)"
R2256 : constant := 38;
-- "Maximum GF Limb Coefficient"
MGFLC : constant := (R2256 * 15) + 1;
-- "Maximum GF Limb Product"
MGFLP : constant := LMM1 * LMM1;
subtype Product_GF is GF
with Dynamic_Predicate =>
(for all I in Index_16 =>
Product_GF (I) >= 0 and
Product_GF (I) <=
(MGFLC - 37 * GF_Any_Limb (I)) * MGFLP);
----------------------------------------------------------------------
-- A "Seminormal GF" is the result of applying a single
-- normalization step to a Product_GF
--
-- Least Significant Limb ("LSL") of a Seminormal GF.
-- LSL is initially normalized to 0 .. 65535, but gets
-- R2256 * Carry added to it, where Carry is (Limb 15 / 65536)
-- The upper-bound on Limb 15 is given by substituting I = 14
-- into the Dynamic_Predicate above, so
-- (MGFLC - 37 * 14) * MGFLP = 53 * MGFLP
-- See the body of Product_To_Seminormal for the full
-- proof of this upper-bound
subtype Seminormal_GF_LSL is I64
range 0 .. (LMM1 + R2256 * ((53 * MGFLP) / LM));
-- Limbs 1 though 15 are in 0 .. 65535, but the
-- Least Significant Limb 0 is in GF_Seminormal_Product_LSL
subtype Seminormal_GF is GF
with Dynamic_Predicate =>
(Seminormal_GF (0) in Seminormal_GF_LSL and
(for all I in Index_16 range 1 .. 15 =>
Seminormal_GF (I) in GF_Normal_Limb));
```

The first important number to check is the worst-case upper bound on all the limbs of Product_GF. This occurs for limb 0, and has upper bound (MGFLC * MGFLP) which has a value a bit below 2**42, so we definitely can’t store one of those in 32 bits.

The second important number is Seminormal_GF_LSL’Last, which evaluates to a bit less than 2**27, so we *can* store a Seminormal_GF using 32-bit limbs. That also means that a Nearlynormal_GF (which is the type returned by the second “Seminormal_To_Nearlynormal” normalization step) can be represented with 32-bit limbs.

The upshot is that the “*” operation needs to “expand out” its parameters into 64-bit before it does all the sums, the *first* normalization (Product_To_Seminormal) has to be done in 64-bit arithmeric, but can return an array of 32-bit values, then the second and third normalization functions can *all* be done in 32-bit maths.

Does this make much of a difference? I did a quick-and-dirty 32-bit implementation of “CAR” for comparison. On RV32 the difference is startling - 133 Instructions for a CAR function in 32-bit arithmetic, versus 321 Instructions for the 64-bit variant, so that’s a respectable improvement and doesn’t take into account the benefit of saving memory, faster assignments and so on.

## Results (Git Tag: GF32) #

I made the changes suggested above, and used GNATprove to make sure I hadn’t made a mess of it. I didn’t test the code at all until the proof was complete and automated as before. Here are the new numbers:

```
Level Time WCSU Size
-O0 172.59 4416 47092
-O1 65.14 3595 27448
-O2 34.94 3536 27668
```

So we’ve saved about 2k of stack space and 2k of code size across the board. Performance shows a reasonably improvement at -O0 and -O1, and a massive jump at -O2 with an improvement of nearly 23 Million cycles.

Getting this far took about 5 hours at the keyboard, including code modifications, re-generating the proofs, a single (successful) run of the regression tests, and getting the performance numbers above.

# Going further… (Git Tag: GF16) #

So far so good. Those results gave me the confidence to take the next logical step.

The most compact representation possible for a “Normal_GF” is to store each limb (with range 0 .. 65535) as an *unsigned* 16-bit value, so the whole thing fits in 32 bytes with no wasted space at all.

Using that representation doesn’t change the maths in the operations at all - we still need 64-bit math for “*” and the first CAR, and 32-bit math for the second and third CAR and for “+” and “-“, but it will halve the cost of an assignment (again), reduce memory usage and (possibly) improve data cache hit rate, so probably worth a go…

Having done it once, this change was relatively simple - about 2 hours time - although some care was needed to make sure all the proofs were still OK.

Results as follows:

```
Level Time WCSU Size
-O0 167.48 3296 49684
-O1 32.60 2560 27460
-O2 27.43 2592 27868
```

So…code size is about the same, and we save another 1k of stack space across the board. Performance improves a bit at -O0, gets a massive boost at -O1, and another noticeable improvement at -O2.

Pleasantly enough, these numbers are also well ahead of the best results that we’ve seen for TweetNaCl.

I could transfer these changes to the TweetNaCl C sources I suppose, but I really don’t have the confidence to do such a subtle bunch of changes without GNATprove backing me up. Perhaps the maintainers of WireGuard will decide to pick up these ideas?

# What about -O3? (Git Tag: O3_No_Unroll) #

At the end of the last blog, I promised I’d have a look at at results with optimization level 3.

The main differences between -O2 and -O3 are that -O3 automatically inlines more subprogram calls, and automatically unrolls loops (whether you want it to or not…) These can be a double-edged sword - possibly faster through fewer branches, but possibly slower through increased code size and reduced hit rate for the instruction cache.

Using exactly the same code as we left off with, we can add a fourth row to the table:

```
Level Time WCSU Size
-O0 167.48 3296 49684
-O1 32.60 2560 27460
-O2 27.43 2592 27868
-O3 49.87 2624 48656
```

which looks rather bad.

Having manually unrolled the most time-critical loops already, one thing we could do is to use GNAT’s pragma Loop_Optimize (No_Unroll) to explicitly tell the compiler *not* to unroll all the other loops in the library. With those pragmas in place, we get:

```
Level Time WCSU Size
-O0 167.48 3296 49684
-O1 32.60 2560 27460
-O2 27.43 2592 27868
-O3 25.80 2736 34404
```

so a slight improvement in performance for a small increase in stack usage, and a few more kilobytes of code size over -O2 (owing to inlining no doubt).

# Feeling the Squeeze (Part 2…) (Git Tag: RV32IMC_A4) #

And finally… I promised I’d return to looking at the performance of SPARKNaCl using the RISV-V Compressed Instruction Set. Previously, we’d seen a significant performance improvement by switching from the compressed instructions (known as “RV32IMC”) to uncompressed (known as “RV32IM”) owing to a performance penalty in the E31 core for branching to a mis-aligned instruction.

Eric Botcazou at AdaCore suggested having a look at the compiler’s options to force instruction alignment to see if we could put things right.

Section 3.11 of the GCC manual says there are four options to control the alignment of function entry points, branch labels, jump labels and loop entry points. With the RV32IMC instruction set, these all default to “2” (so all instructions appear at an address which is an exact multiple of 2), giving maximum code density, but (with our E31 core) incurring a performance penalty when we branch to a 32-bit instruction that is not aligned on a 4-byte boundary.

To see how bad the penalty is, let’s look at performance, worst-case stack usage, and code size using RV32IMC and the default alignment, based on the code as above. For fun, I have also added a build using the option “-Os” which means “optimize for smallest code size” just to see what happens.

```
Level Time WCSU Size
-O0 170.53 3296 40430
-O1 32.51 2592 20230
-O2 62.72 2608 20972
-O3 25.73 2736 26000
-Os 30.54 2512 18280
```

So… we’re a bit slower at -O0, a bit faster at -O1 and -O3, but -O2 suffers some sort of terrible penalty. Worst-case stack usage is mostly unchanged, while code size comes down by about 7k across the board, with -Os winning with the whole library fitting in slightly over 18k.

Now we’ll see what happens when I set all four alignment options to “4”. The Makefile in perf/tsign.mk selects this by setting the SPARKNACL_TARGET_ARCH project variable to “rv32imc_a4”. With that, we hope to keep the code size of compressed instructions, but avoid the performance penalty of mis-aligned branches. The results are:

```
Level Time WCSU Size
-O0 174.47 3296 41360
-O1 32.54 2592 20908
-O2 41.10 2608 21672
-O3 25.69 2736 26828
-Os 30.52 2512 18616
```

In terms of performance, -O0 and -O1 are a bit slower. -O2 regains some credibility, while -O3 and -Os are a bit faster than before. Worst-case stack usage is unchanged, while code size goes up by about 800 bytes at all levels owing to the need to “pad” basic blocks with “no-op” instructions to regain instruction alignment.

# Conclusions #

In summary:

Fastest performance: 25.69 Million cycles for a “Sign” at -O3 and RV32IMC_A4

Smallest worst-case stack usage: 2512 bytes at -Os (and regardless of instruction set)

Smallest code size: 18280 bytes at -Os and RV32IMC

From the initial starting point we’ve seen the performance of a “Sign” operation come down from 93.86 to 27.43 million cycles at -O2 - an improvement of about 3.4 times, *without breaking the proof*. I think this last point cannot be overstated… years of programming (and PSP training) have taught me what a brilliant programmer I’m *not*. I just don’t know if I could make subtle changes without a tool like GNATprove backing me up (and saving me from acute embarrassment).

Of course, there’s more to come. Firstly, we could attempt to reduce the number of “CAR” calls from 3 to 2 following “*”. Benoit Viguier’s work (see his recent paper on the subject) shows that this is possible (and provable!) but relies on some very deep work with the Coq theorem prover that I can’t really get my head around at the moment…

Secondly, we’ll report more performance figures as and when GNAT adopts more advanced Return Slot Optimization for SPARK, as we mentioned last time.

Then there’s GNAT-LLVM, and … I guess it all depends on when lockdown ends and if I get any “proper” work to do… we’ll see…

Thanks for reading this far.

## Acknowledgements #

Thanks to Eric Botcazou at AdaCore for helping with the RV32IMC alignment issues. Claire Dross of SPARK Team helped me refactor the proofs following the switch to 16 bit limbs in a Normal GF.