This blog reports more recent efforts with SPARKNaCl - my formally verified SPARK version of the NaCl cryptographic library.
There are seven sections to this update:
- Performance improvement of “+” and “-” for the GF type using a form of loop fusion.
- Another round of performance improvement, via “operator narrowing” in the “*” operator for the GF type.
- A correction to remove a potential for a side-channel information leak.
- Another small performance improvement using loop unrolling to avoid a double-initialization.
- Porting the library to compile and run with the GNAT Community 2021 Edition.
- Proof of the library with the accompany 2021 release of the SPARK Community Edition.
- Performance analysis of the code on a bare-metal RISC-V target to see if the new compiler offers any noticeable gain in code generation or optimization.
So…let’s dive in with the first of those.
1. Loop fusion in “+” and “-” #
In the earlier blog entries (here and here), I’ve explained how SPARKNaCl does modular arithmetic on “large” (256 bit) integers. In particular, there is a type called GF representing one of those, plus various predicated subtypes of it. There are also implementations of “+”, “-” and “*” for GF, all in the body of the root SPARKNaCl package. These operators calculate intermediate results, then do a few passes of “carry and reduce” to normalize the digits to be in the range 0 .. 65535.
In the original version of my code, both the “+” and “-” operators require two passes of the “carry and reduce” function to fully normalize the result. This is simple, and the SPARK proof tools are able to verify that the result really is fully normalized after two calls.
On reflection, though, I realized that the first carry/reduce operation could actually be done at the same time as the “+” operation, in the same loop. To paraphrase a bit, instead of having a structure like this:
for I in Index_16 loop Do_X; end loop; for I in Index_16 loop Do_Y; end loop;
You can get the same effect with:
for I in Index_16 loop Do_X_And_Y_At_The_Same_Time; end loop;
Well…you get the idea. Compiler writers call this kind of optimization “Loop Fusion”, but there’s no way I would expect GNAT to do that automatically in this case, since the two loops in question appear in completely different functions.
Nevertheless, it can be done - I merged the main loop of “+” with the main loop of the first “CAR” function, and then did the same to “-” as well.
The resulting code can be seen at the Git tag “Loop_Fusion”. As before, I applied GNATProve to check that all proofs were OK before testing.
What difference does this make to performance?
When we left off at the end of the last blog entry, the numbers looked like this, with performance for a “Sign” operation at each major optimization level measured in millions of CPU cycles, plus Worst-Case Stack Usage (WCSU) and Code Size both measures in bytes, using the GNAT Community Edition 2020 compiler:
Level Time WCSU Size -O0 174.47 3296 41360 -O1 32.54 2592 20908 -O2 41.10 2608 21672 -O3 25.69 2736 26828 -Os 30.52 2512 18616
With loop fusion in place, we get:
Level Time WCSU Size -O0 161.08 3296 40784 -O1 32.00 2592 20732 -O2 53.53 2608 21476 -O3 25.13 2736 26636 -Os 29.66 2512 18416
Performance is a bit better across the board, except at -O2, which suffers some unexplained loss. WCSU is the same, while code size comes down by about 200 bytes owing to elimination of the SPARKNaCl.Car.Sum_To_Nearlynormal and Difference_To_Nearlynormal functions, which become redundant as a result of this change.
2. Operator narrowing in SPARKNaCl.”*” #
In the last blog entry, we went into some detail of how we could compress the representation of the “GF” type in SPARKNaCl, which stores a 256-bit unsigned integer. Doing so had a major impact on performance, owing to faster assignments of those objects and (possibly) better hit rate for the data cache.
In doing that, I realized that some of the internal mathematics on a GF object could be made faster by “narrowing” the multiplication operator - for example, doing a 32-bit multiplication instead of a 64-bit one. This made a huge difference on my 32-bit RISC-V target, where 32-bit mutiplies are one instruction, but a full-blown 64-bit multiply might be 5 or 6. I applied this trick to the “Carry and Reduce” (aka “CAR”) functions in the SPARKNaCl.Car package to great effect.
Upon further reflection, I realized I could apply the same trick to the main multiplication function for the GF type itself. This stores its intermediate result with 64-bit signed “digits”, allowing the individual digits to overflow their normal range before being carry-and-reduced modulo 2**255-19 as before.
When we left off, the inner loop opened with the following lines of code (This is at the Git tag “Loop_Fusion” in the GitHub repo):
declare T : GF64_PA; LT : GF64_Normal_Limb; begin T := (others => 0); for I in Index_16 loop LT := I64 (Left (I)); T (I) := T (I) + (LT * I64 (Right (0))); -- and so on for T (I + 1), T (I + 2) ...
The object T has to be an array of 64-bit signed integers to accomodate the largest possible values of the intermediate digits of the result, so the code is converting Right(0) to I64 and multiplying by LT (which is also a subtype of I64) to get an I64 result which is then accumulated into T(I).
But… hang on… we’ve already proven that all elements of Left and Right are in the range 0 .. 65535 (16 bits), so that multiply operator could actually be done with a single 32-bit unsigned multiply. The “+” operator still needs to be 64-bit, so we just need to introduce a new subtype and move around a few type conversions, thus:
declare subtype U32_Normal_Limb is U32 range 0 .. LMM1; T : GF64_PA; LT : U32_Normal_Limb; begin T := (others => 0); for I in Index_16 loop LT := U32_Normal_Limb (Left (I)); T (I) := T (I) + I64 (LT * U32_Normal_Limb (Right (0))); -- and so on...
This looks good, and the SPARK proof tools tell us that there is still no potential for arithmetic overflow here - another example of a “proof driven refactoring” in SPARK. With that change in place (at the “Narrow_Mul” tag in the GitHub repo), performance looks like this:
Level Time WCSU Size -O0 113.68 3168 39152 -O1 31.58 2592 20384 -O2 26.40 2608 21144 -O3 24.88 2736 26248 -Os 29.41 2528 18120
We get a big win at -O0, which is to be expected, about the same stack usage and a small reduction in code size across the board. The gains at -O1 and above are not as great as I would have expected, so let’s have a closer look at what’s going on. I had a look at the generated code for the “*” function both before and after this change with the compiler set to -O1 optimization level. Aside from the base address of the function changing, the generated code is identical in both cases.
So… it seems that at -O1 and higher, the compiler can automatically find that operator narrowing optimization for us, which explains why we see such a big improvement at -O0 but not at the higher optimization levels. It turns out that, in turn, is only possible because we previously “narrowed” the representation of GF to pack each digit into exactly 16 bits. That change had the knock-on effect of giving the compiler enough information to automatically narrow the “*” operator.
Such automatic operator narrowing uses information from types in Ada, but does not (yet) exploit the bounds of subtypes, since the compiler has to make worst-case assumptions about the possibility of invalid values with subtypes, so that explains why we didn’t see the automatic narrowing before when each digit of a GF was a subtype of a 64-bit integer base type.
3. A small side-channel leak gets fixed #
You might recall that the code in SPARKNaCl is supposed to inherit the “constant time” nature of the algorithms from TweetNaCl. This means the code has to exhibit two important properties:
- No branches or loop bounds depend on “sensitive” data (such as cryptographic keys).
- No expression used inside an array element lookup depends on sensitive data.
The second of these might seem odd, but imagine if you could “snoop” on a CPU from the outside (say, with a logic analyser or an oscilloscope), and record the pattern of bits that were appearing on the CPU’s address bus. If you had an array lookup that depended on a sensitive variable V, then the address in memory accessed by that expression would depend on V. So… if you can observe the address bus, then you could theoretically deduce the value of V while an algorithm is running. If V happened to be a sequence of bits representing a cryptographic key, then we’re in trouble, so that’s the motivation for this restriction.
In coding SPARKNaCl, I realized I had made a mistake in this regard. In the package SPARKNaCl.Utils, there are two functions CSwap16 and CSwap32 that do conditional, but constant-time swapping of GF values. In the C code, the “swap or don’t swap” parameter is a bool type which then gets treated as an int to form a constant mask value.
In SPARK, I couldn’t do that directly, since Boolean in Ada is a proper enumeration type, and not an Integer, so I turned that transformation into a couple of library-level constant lookup tables to convert from Boolean to the required Integer mask value. This is elegant in terms of typing and good style in Ada, but introduces an array lookup that depends on a potentially sensitive value, so breaks rule 2 above. Whoops!
So… I needed to fix it. The change is easy - just convert to Boolean value to an Integer using ‘Pos (to get 0 for False and 1 for True), then multiply by the required factor to get the right mask value. This is basically what the C code does all along…
You can see this fix at the “No_Lookups” tag in the GitHub repo. This might affect performance too, since we’ve removed a memory access. The numbers look like this:
Level Time WCSU Size -O0 113.68 3168 39368 -O1 31.58 2592 20448 -O2 26.40 2608 21204 -O3 24.88 2736 26364 -Os 29.41 2528 18144
so basically no observable change in runtime or WCSU, and a tiny increase in code size at all levels.
4. A bit more unrolling #
So far so good, but then I realized that another small improvement was possible at the expense of a bit of expanded code size.
The first assigment to T in the “*” function is an aggregate that initializes all elements of T to 0. This is required since the loop accumulates values into each digit as it goes along. But…the first iteration of the loop re-initializes the first 16 values of T, so we have a wasteful “double initialization” of those elements going on. Could that be improved?
The answer is “Yes” if we’re willing to increase the code size a bit with a manual unrolling of the first iteration of the loop. Imagine if you copied out the loop body once before the loop statement, then changed the loop to iterate over 1 .. 15 instead of 0 .. 15. What does that gain? Well… in that first iteration, we know that all references to T (I) on the right hand side of the assignment could be replaced by 0, and then eliminated completely because 0 + X is just X.
That would initialize the first 16 elements of T exactly once. The remaining elements still must be initialized to 0, but that can all be done with a single aggregate assignment. With that in place, the opening of the code the “*” looks like this at the “Unroll_First_Mul” Git Tag:
begin LT := U32_Normal_Limb (Left (0)); T := GF64_PA'(0 => I64 (LT * U32_Normal_Limb (Right (0))), 1 => I64 (LT * U32_Normal_Limb (Right (1))), 2 => I64 (LT * U32_Normal_Limb (Right (2))), 3 => I64 (LT * U32_Normal_Limb (Right (3))), 4 => I64 (LT * U32_Normal_Limb (Right (4))), 5 => I64 (LT * U32_Normal_Limb (Right (5))), 6 => I64 (LT * U32_Normal_Limb (Right (6))), 7 => I64 (LT * U32_Normal_Limb (Right (7))), 8 => I64 (LT * U32_Normal_Limb (Right (8))), 9 => I64 (LT * U32_Normal_Limb (Right (9))), 10 => I64 (LT * U32_Normal_Limb (Right (10))), 11 => I64 (LT * U32_Normal_Limb (Right (11))), 12 => I64 (LT * U32_Normal_Limb (Right (12))), 13 => I64 (LT * U32_Normal_Limb (Right (13))), 14 => I64 (LT * U32_Normal_Limb (Right (14))), 15 => I64 (LT * U32_Normal_Limb (Right (15))), others => 0); -- Iteration "0" is done, so only loop over 1 .. 15 now... for I in Index_16 range 1 .. 15 loop -- and so on as before...
With those changes in place (and still using GNAT Community 2020), perfomance looks like this:
Level Time WCSU Size -O0 109.20 3264 39908 -O1 30.21 2592 20688 -O2 24.52 2608 21428 -O3 23.58 2736 26540 -Os 27.99 2528 18428
So a reasonable gain at all optimization levels, with small changes in WCSU and code size.
5. Porting to GNAT Community 2021 #
Unsurprisingly, this wasn’t too hard. A few small changes were required, as can be seen at the “GNAT_CE_2021” Git tag. In short:
- The compiler now spots that a range check in an Assert was always “True” so issued a warning about it. Since we’re compiling in “all warning are errors” mode, this had to be removed.
- In debugging and test code, the compiler objects to the use of the ‘Img or ‘Image attributes on type Standard.Boolean if the No_Elaboration_Maps Restriction is enabled. These attributes were replaced by a simple hand-written expression function.
- The wording of the warning for an “unused assignment” issued by GNATProve has improved a bit, so the corresponsing pragma Warnings had to be updated to “match” the new wording.
- I had to also “pull” the latest release of the Ada_Drivers_Library from GitHub (here) which has a few modifications to make it compatible with the 2021 edition of the RISC-V cross compiler.
That was it. With those changes in place using GNAT Community 2021 on 64-bit Linux, the library compiled and passed all regression tests with flying colours.
6. Proof with SPARK Community 2021 #
I’m pleased to say that the proof with the Community 2021 edition of the SPARK tools was trouble-free. As before, the tools report all VCs proved automatically using a combination of CVC4, Z3 and Alt-Ergo. Also as before, no single prover is capable of proving the entire library by itself.
7. Performance with GNAT Community 2021 for 32-bit RISC-V #
The new compiler is based on GCC 10.3.1, so we might expect to see some difference in code generation and performance from the 2020 Edition, which was based on GCC 9.3.1.
With the new compiler, the numbers looks like this:
Level Time WCSU Size -O0 110.20 3136 39864 -O1 30.06 2496 20208 -O2 24.37 2464 20736 -O3 23.22 2576 24132 -Os 27.62 2368 18236
So, a bit slower at -O0, but faster at all other optimization levels. WCSU and Code Size improve across the board.
If you read my first blog about performance analysis of the code, you might remember it touched on the topic of “Return Slot Optimization” (RSO) in GCC - a trick the compiler can do to eliminate the need to return a function call result by copy. At the time, I noticed that GNAT was missing a few opportunities to apply RSO in the SPARKNaCl.Sign package. Eric Botcazou from AdaCore also had a look at this and proposed some changes to GNAT to improve the situation, especially being able to enable RSO for function calls which are always Pure in SPARK.
Well… it looks like these changes have made it into GNAT Community 2021. I compiled with SPARKNaCl.Sign package with both the 2020 and 2021 compilers and enabled the “-fdump-tree-all” in both cases. Comparison of the results shows an impressive improvement. With the 2020 compiler, 39 function calls are marked eligible for RSO. With the 2021 compiler, that number leaps to 49. In particular, all the function calls in SPARKNaCl.Sign.”+” have RSO enabled with the 2021 compiler, which probably accounts for the improvement that we’re seeing here.
For the really interested reader, the actual change to the GNAT sources to enable more RSO for SPARK can be seen here.
Further work #
There’s always more to do…a few things on the ToDo list:
- Prepare a new release of SPARKNaCl for the Alire package manager.
- Repeat all the regression and performance testing on RISC-V, but using GNAT-LLVM.
- Do more analysis of which proofs defy each of the main provers, and submit these as improvement requests to those teams. For instance, we could work out which proofs are proven by Z3 or Alt-Ergo but not by CVC4 and submit those to the CVC4 team for analysis and improvement.
- Write a C-compatible API for the library that exactly mimics the TweetNaCl C API, so the whole library could be used more easily as a drop-in replacement for TweetNaCl or LibSodium.
- Verification of the generated LLVM bytecode against the Cryptol specifications using Galois’ SAW tool.
Anyway… that’s all I can think of for now… please drop me a line on firstname.lastname@example.org if you have any other ideas or requests. Alternatively, just open an Issue on the GitHub repo…