AdaCore Blog

Performance analysis and tuning of SPARKNaCl

by Roderick Chapman (Protean Code Limited) Guest Author

Intro­duc­tion #

This is the sec­ond in a series of blogs about my SPARK­Na­Cl cryp­to­graph­ic library. The project’s GitHub repo pro­vides some back­ground infor­ma­tion about the library, while the first blog entry went into the details of trans­form­ing a tricky con­stant-time algo­rithm from C into SPARK and its proof.

If you’re not famil­iar with SPARK­Na­Cl, then it might be worth review­ing one or both of those before press­ing on here.

This blog goes into the details of both measuring and improving the runtime performance of SPARKNaCl on a real "bare metal" embedded target, and comparing results with those for the original "TweetNaCl" C implementation.

Warn­ing: this blog is long and rather tech­ni­cal, but please press on.

TL;DR ver­sion: for­mal is fast.

Moti­va­tion #

Having worked with SPARK for many years, I have often encountered a belief that "formal is slow". While "type safe" languages do have a reputation for not running as fast as their "type unsafe" cousins, is that really true for SPARK, which offers static type safety?

In some appli­ca­tions, both cor­rect­ness and per­for­mance are cer­tain­ly crit­i­cal, and a cryp­to­graph­ic library like NaCl is a fine exam­ple. Some sub­tle cor­rect­ness prop­er­ties are required, but per­for­mance is also fun­da­men­tal for usabil­i­ty and to min­i­mize pow­er con­sump­tion in small bat­tery-pow­ered devices.

So…can we eat our prover­bial cake and (still) have it? Hav­ing estab­lished an auto­mat­ed proof of type-safe­ty for SPARK­Na­Cl, how does it actu­al­ly stack up against the Tweet­Na­Cl C imple­men­ta­tion in terms of run­ning time?

SPARK actu­al­ly has some prop­er­ties in its favour. Notably:

  • SPARK offers sta­t­ic type safe­ty, so pro­grams can be com­piled and run with no run­time type check­ing at all.
  • There are no alias­ing and no func­tion side-effects in SPARK.
  • There are no undefined (aka "erroneous") behaviours in SPARK, and no dependence on unspecified behaviour.

These points sug­gest that SPARK should be high­ly amenable to opti­miza­tion with­in a compiler.

On the oth­er hand, SPARK­Na­Cl uses some lan­guage fea­tures that might come with some per­for­mance penal­ty. For exam­ple, SPARK­Na­Cl uses a func­tion­al pro­gram­ming style where pos­si­ble, and declares oper­a­tors (like “+” and “*”) for com­pos­ite types. This aids read­abil­i­ty of the code, but such func­tions are required to return their result by copy in Ada, so come with some per­for­mance penalty.

Exper­i­men­tal set­up #

To measure runtime properly, a host system like MacOS or Linux is not really suitable. For these experiments, I decided to go for a small 32-bit RISC-V board — a HiFive1 Revision B. This offers a "bare metal" programming environment, so should offer reasonably predictable and reproducible results, and is also supported by the RISC-V cross version of the GNAT Community 2020 Edition and the Ada Drivers Library.

I wrote some basic performance test programs. Rather than try to test all the functions of the library, I settled on a simpler test of the NaCl "Sign" API, which forms a cryptographic signature of an arbitrary message using the signer's private key, using the "Ed25519" algorithm. The "Sign" operation is computationally expensive, using some complex mathematical trickery involving the Curve25519 elliptic curve and the SHA512 hash algorithm, so it seemed a good "work out" to test the performance of the library.

The main test program is in SPARKNaCl/perf/tsign.adb in the GitHub repo. See the associated Makefile (tsign.mk) and project file (tsign.gpr) for how it all fits together. The test program first runs the SPARKNaCl Sign operation on a constant 256-byte message using a fixed secret key and reports the running time in CPU clock cycles. It then runs exactly the same test using the TweetNaCl version of the Sign operation, which is also compiled and linked into the same binary. Both the SPARK and C versions are (obviously) compiled with the same compiler (in this case GNAT Community 2020) and at the same optimization level, so the optimization passes and "back end" are the same for each test. Each test was re-run with the compiler set to optimize at levels -O0, -O1 and -O2.

In the fol­low­ing sec­tions, I’ll also give the name of the Tag in the GitHub repo that cor­re­sponds to the par­tic­u­lar snap­shot of the code that was used to get each set of results so you can see how the code changes and develops.

The Gold­en Rule #

In trying to improve the performance of SPARKNaCl, I set myself one golden rule: the proof of the entire library would remain 100% complete and automated. Any change to the code that "broke the proof" would be rejected.

Per­for­mance Base­line Test (Git Tag: Base­line) #

This test uses the code as it was in April 2020 when I wrote the last blog entry, but with no effort to improve performance. Similarly, the TweetNaCl C code used for comparison is essentially unmodified from the designer's original release.

The num­bers report­ed here are in Mil­lions of CPU cycles for a Sign oper­a­tion, so low­er is bet­ter.

Level | SPARK | C

These results actu­al­ly came as some­thing of a sur­prise: SPARK is very close to C at ‑O1, a lit­tle behind at ‑O2, and notice­ably bet­ter at ‑O0, with no effort spent on analy­sis or opti­miza­tion at all. How is SPARK win­ning so con­vinc­ing­ly at ‑O0? A lit­tle analy­sis turned up the fol­low­ing reasons:

Alter­na­tive Car­ry-and-Reduce code #

One of the most time-critical functions in NaCl is the "Carry-and-Reduce" (aka "CAR") function for large (256-bit) integers. These integers are stored as an array of 16 64-bit integers and, following an operation like "*", they need to be reduced modulo 2**255 - 19 and "normalized" so that all digits are in the range 0 .. 65535. With a few introductory type definitions and preprocessor macros, the CAR function in TweetNaCl looks like this:

#define FOR(i,n) for (i = 0;i < n;++i)
#define sv static void
typedef long long i64;
typedef i64 gf[16];

sv car25519(gf o)
{
  int i;
  i64 c;

  FOR(i,16) {
    o[i]+=(1LL<<16);
    c=o[i]>>16;
    o[(i+1)*(i<15)]+=c-1+37*(c-1)*(i==15);
    o[i]-=c<<16;
  }
}
As we’ve seen before, the cod­ing in Tweet­Na­Cl empha­sizes brevi­ty over read­abil­i­ty, so it’s pret­ty hard to see what’s going on. At ‑O0, lines 12 — 15 (the loop body) gen­er­ate 161 32-bit RISC‑V instruc­tions, includ­ing 10 32-bit mul­ti­plies, since 64-bit signed mul­ti­ply requires sev­er­al 32-bit mul­ti­ply instruc­tions on RV32IM.

Line 14 looks particularly troublesome: it has four multiplications, and is using the value of the loop counter variable i in Boolean tests which are then used as operands for multiplication. Remember that NaCl requires "constant time" programming, so we're not allowed to have a conditional statement inside the loop body. If we were, then we could isolate the "i == 15" case and replace line 14 by:

if (i == 15)
   o[0] += c-1+37*(c-1);
else
   o[i+1] += c-1;

which violates the "constant time" requirement, but is easier to read, and leads to another significant insight: if we "unroll" the final loop iteration (when i == 15), then the conditional logic inside the loop body can be removed entirely, and the inner loop (covering the remaining iterations) no longer requires a multiplication at all.
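Putting both ideas together (peeling the final iteration and dropping the multiplications), the carry loop can be sketched in C as follows. This is an illustrative rewrite, not TweetNaCl's actual code; it assumes two's-complement arithmetic and an arithmetic signed right shift, which C leaves implementation-defined:

```c
#include <stdint.h>

typedef int64_t i64;
typedef i64 gf[16];

/* Sketch only: carry-and-reduce with the final (i == 15) iteration
   peeled out of the loop, so the loop body needs no multiplications. */
static void car25519_sketch(gf o)
{
  int i;
  i64 c;
  for (i = 0; i < 15; i++) {
    c = o[i] >> 16;     /* carry = floor(o[i] / 2^16); assumes an
                           arithmetic shift for negative values */
    o[i + 1] += c;      /* propagate carry to the next digit */
    o[i] &= 0xffff;     /* keep the low 16 bits of this digit */
  }
  c = o[15] >> 16;      /* carry out of the top digit wraps around: */
  o[0] += 38 * c;       /* 2^256 is congruent to 38 mod 2^255 - 19  */
  o[15] &= 0xffff;
}
```

Note that `o[i] &= 0xffff` computes the same residue as `o[i] -= c << 16` without left-shifting a possibly negative value, which would be undefined behaviour in C.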

There are two further unfortunate features of this code. First, the loop body uses the shift operators ">>" and "<<" on signed integers in C, which cannot be reproduced directly in SPARK. Secondly, for negative values, the result of a signed shift right is implementation-defined in C, while the result of a signed shift left is undefined.

In look­ing at all these issues, and how this code might be best trans­lat­ed and ver­i­fied in SPARK, I came across the work of Benoit Vigu­ier, who had ver­i­fied this code (and much more besides) using the Coq the­o­rem prover. Benoit point­ed out that his effort had actu­al­ly ver­i­fied an alter­na­tive imple­men­ta­tion of CAR that appears in the Wire­Guard sources, writ­ten by Jason Donen­feld.

Jason's version of the algorithm removes the need for the left-shift inside the loop. The right-shift, though, is still needed. In SPARK, package Interfaces declares a Shift_Right_Arithmetic function for type Unsigned_64. Making that available for a signed 64-bit Integer requires two instantiations of Ada.Unchecked_Conversion and a suitable post-condition:

--  returns equivalent of X >> 16 in C, doing an arithmetic
--  shift right when X is negative, assuming 2's complement
--  representation
function ASR_16 (X : in I64) return I64
is (To_I64 (Shift_Right_Arithmetic (To_U64 (X), 16)))
  with Post => (if X >= 0 then ASR_16'Result = X / LM else
                               ASR_16'Result = ((X + 1) / LM) - 1);

Note that ASR_16 is declared as an expres­sion func­tion so its body can be avail­able for inlin­ing for both proof and code generation.
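The postcondition can be cross-checked numerically. Here is a small C sketch (the name asr_16 is mine) implementing the same floor division that the postcondition describes:

```c
#include <stdint.h>

/* Floor division by 2^16, written exactly as ASR_16's postcondition
   states it: X / LM for non-negative X, ((X + 1) / LM) - 1 otherwise.
   C's "/" truncates toward zero, hence the adjustment for negatives. */
static int64_t asr_16(int64_t x)
{
  return (x >= 0) ? x / 65536 : ((x + 1) / 65536) - 1;
}
```

On a two's-complement machine where `>>` on a signed value is an arithmetic shift (implementation-defined in C, but true of GCC), `asr_16(x)` equals `x >> 16`.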

The ini­tial CAR algo­rithm in SPARK­Na­Cl there­fore comes out as fol­lows. In this code sam­ple, I have omit­ted the sub­type pred­i­cates, loop invari­ants, and asser­tions for brevity:

function Car (X : in GF) return GF
is
   Carry : I64;
   R     : GF;
begin
   R := X;
   for I in Index_16 range 0 .. 14 loop
      Carry := ASR_16 (R (I));
      R (I + 1) := R (I + 1) + Carry;
      R (I) := R (I) mod LM;
   end loop;

   Carry := ASR_16 (R (15));
   R (0) := R (0) + R2256 * Carry;
   R (15) := R (15) mod LM;

   return R;
end Car;

This code also makes it a bit clearer how this works. Essentially, the "Carry" (anything above the first 16 bits) for digit I is added to digit I+1. Digit I is then reduced modulo LM (which is 65536). Finally, the carry out from digit 15 is computed. We can't add that to digit 16 (because there isn't one!), so instead it is added to digit 0 after being multiplied by the "magic" constant R2256. This means we end up with the right result, with the value reduced modulo 2**255 - 19.

How and why this real­ly works (and the proof of it) is enough mate­r­i­al for an entire blog entry, so we won’t go into it here.

From the point of view of per­for­mance, this inner loop body gen­er­ates just 59 instruc­tions with no mul­ti­plies, so a big win over the orig­i­nal C version.

Does this account for the big per­for­mance dif­fer­ence? In SPARK­Na­Cl, I have been able to prove that three appli­ca­tions of CAR are need­ed to ful­ly nor­mal­ize the results of a “*” oper­a­tion on a Normal_​GF. In Tweet­Na­Cl, CAR is only applied twice fol­low­ing “*”. In SPARK­Na­Cl, we also apply CAR (twice each) after “+” and “-“.

The inner loop of CAR is executed 16 times, and in SPARKNaCl, a simple experiment (and/or a bit of maths) shows that CAR is called 24564 times for a single "Sign". That means that the inner loop of CAR accounts for (16 * 59 * 24564) = 23.2 million instructions.

In TweetNaCl, CAR is called 10240 times for the same test, so the inner loop accounts for (16 * 161 * 10240) = 26.4 million instructions.

So… even at ‑O0, SPARK­Na­Cl is sav­ing about 3.2 Mil­lion instruc­tions for just that one loop body.

Func­tion returns #

I had expect­ed that the SPARK code would be sig­nif­i­cant­ly slow­er than the C owing to the use of a func­tion­al pro­gram­ming style in SPARK, where we have func­tions like “+”, “-“, “*”, and CAR that return a GF (which is 128 bytes) by copy. In writ­ing high-per­for­mance code, a gen­er­al­ly good rule-of-thumb is to avoid copy­ing data at all costs…

The TweetNaCl C code uses an explicit by-reference passing style for these functions — they are declared as "void" functions with three parameters, all of which are pointers to "gf" — so you'd expect this to be much faster.

So how come SPARK is still competitive? It turns out that there's more to this issue than meets the eye… we'll return to it later when we get to talk about "Return Slot Optimization"…

Inlin­ing and expres­sion func­tions #

Anoth­er per­for­mance win for SPARK comes from Ada 2012’s expres­sion func­tions. These are always eli­gi­ble for inlin­ing, even at ‑O0 with GNAT. Func­tions are nev­er inlined at ‑O0 in C, so this explains more per­for­mance gains for SPARK here.

Intrin­sic bit­wise func­tions #

In TweetNaCl, there is a function declared that performs a 64-bit "Rotate" operation. This is a full-blown function in C, and is never inlined at -O0, so comes with some performance penalty.

In SPARKNaCl, we're using the "Rotate_Right" function declared in package Interfaces, which has Convention Intrinsic applied to it. This means that the GNAT code generator just "knows" how to do this function directly — a kind of "super inlining" that is built into the code generator. On RV32IM, a call to Rotate_Right for a 64-bit unsigned value expands to just 6 instructions, so a significant performance gain for SPARK, especially in the SHA512 algorithm, which uses Rotate_Right in its inner loop.
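For comparison, the usual portable C idiom for a 64-bit right rotation looks like this (a sketch, not TweetNaCl's exact function); modern compilers typically recognise the pattern and emit a short shift/or sequence, or a single rotate instruction where one exists:

```c
#include <stdint.h>

/* Rotate x right by n bits; masking n with 63 avoids the undefined
   behaviour of shifting a 64-bit value by 64 when n == 0. */
static uint64_t rotr64(uint64_t x, unsigned n)
{
  n &= 63;
  return (x >> n) | (x << ((64 - n) & 63));
}
```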

So… that explains why SPARK is faster than C at -O0. At higher levels, C catches back up through inlining and optimization of the critical function bodies, and the effect of calling CAR far fewer times.

Hav­ing estab­lished a base­line for per­for­mance for both SPARK and C, we can now move on to see how we can improve the SPARK code. At the end of this blog, we’ll come back to com­par­ing SPARK­Na­Cl with Tweet­Na­Cl to see how they stack up.

Get­ting Relaxed about Ini­tial­iza­tion (Git Tag: Relax_​Init) #

If you've used SPARK, you'll know it is particularly pedantic about making sure that all objects are properly (and wholly) initialized before they are referenced. In earlier releases of SPARK, the "flow analyzer" treats each variable (of any type) as a whole object, so things like arrays are either completely initialized or not at all. This approach is safe, but can lead to slightly inefficient coding, where arrays end up being initialized twice to avoid a false alarm from the flow analyzer.

SPARK Com­mu­ni­ty 2020 added sup­port for Relaxed Ini­tial­iza­tion of vari­ables which changes that sit­u­a­tion. This facil­i­ty moves the effort of ver­i­fy­ing cor­rect ini­tial­iza­tion from the flow ana­lyz­er to the proof engine. This allows us to prove (for exam­ple) that an array is prop­er­ly ini­tial­ized, even if it is done so ele­ment-by-ele­ment in a loop.

The lan­guage now has a new aspect and a new attribute to make this happen:

  • The "Relaxed_Initialization" aspect is applied to a variable. This says "Verify initialization using the proof engine, not the flow analyzer."
  • The 'Initialized attribute returns Boolean given the name of any object or component of an object. It returns True if that object or component has been initialized.

Once again, we return to the CAR algorithm, which serves as a good example of how this works. The assignment "R := X;" is a whole-array assignment. This definitely initializes R, but these objects are 128 bytes, so it comes with something of a runtime penalty. The inner loop actually initializes all the elements of R again, so perhaps we can refactor this code using Relaxed Initialization to avoid the first array assignment. The trick is to "unroll" the first loop iteration this time, in order to initialize R (0) and R (1) before we go into the loop. Again, I have removed the uninteresting loop invariants, but it looks basically like this:

   Carry : I64;
   R     : GF with Relaxed_Initialization;
begin
   Carry := ASR_16 (X (0));
   R (0) := X (0) mod LM;
   R (1) := X (1) + Carry;

   pragma Assert
     (R (0)'Initialized and R (1)'Initialized);

   for I in Index_16 range 1 .. 14 loop
      Carry := ASR_16 (R (I));
      R (I) := R (I) mod LM;
      R (I + 1) := X (I + 1) + Carry;
      pragma Loop_Invariant
        (for all K in Index_16 range 0 .. I => R (K)'Initialized);
      pragma Loop_Invariant (R (I + 1)'Initialized);
   end loop;

   pragma Assert (R'Initialized);

   -- as before...

The loop invari­ants say that, after I loop iter­a­tions, ele­ments 0 through I+1 of R are ini­tial­ized. Once we exit the loop (when I = 14), we know that ele­ments 0 through 15 are ini­tial­ized, but that’s the whole array, so we can Assert that R’Initialized just after the loop, and all is well.

We can use the same trick in all the "Car" functions in SPARKNaCl.Car and also in the "+", "*" and "-" operators for GF that are in the body of the SPARKNaCl package.

Let's compare the performance of the SPARK code at the "Baseline" tag with the new "Relax_Init" tag:


so sav­ing a few mil­lion cycles across the board.

Partial Redundancy Elimination (Git Tag: PRE_Scalarmult) #

We'll now turn to a potential improvement in the SPARKNaCl.Sign.Scalarmult code. This function forms the core of the Sign operation.

The spec­i­fi­ca­tion of this func­tion is:

function Scalarmult (Q : in GF_Vector_4;
                     S : in Bytes_32) return GF_Vector_4
  with Global => null;

where GF_​Vector_​4 is an array of 4 GF val­ues. The body (again, sim­pli­fy­ing a bit for brevi­ty) looks like this:

function Scalarmult (Q : in GF_Vector_4;
                     S : in Bytes_32) return GF_Vector_4
is
   CB     : Byte;
   Swap   : Boolean;
   LP, LQ : GF_Vector_4;
begin
   LP := (0 => GF_0,
          1 => GF_1,
          2 => GF_1,
          3 => GF_0);
   LQ := Q;

   for I in reverse U32 range 0 .. 255 loop
      CB   := S (I32 (Shift_Right (I, 3)));
      Swap := Boolean'Val (Shift_Right (CB, Natural (I and 7)) mod 2);

      CSwap (LP, LQ, Swap);
      --  Note user-defined "+" for GF_Vector_4 called here
      LQ := LQ + LP;
      LP := LP + LP;
      CSwap (LP, LQ, Swap);
   end loop;

   return LP;
end Scalarmult;

The inner loop runs 256 times, and the "+" operator for GF_Vector_4 is expensive in terms of execution time, so this looks like a good target for further efforts.

The inner loop is essentially iterating over all 256 bits of the 32-byte parameter S. The "Current Byte" of interest (CB) is picked from S by dividing the loop counter I by 8 (using a shift). The exact bit is then extracted from CB by shifting "I and 7" places and doing a "mod 2" to get the value of that one bit. The CSwap procedure is a constant-time conditional swap of GF_Vector_4 values.
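The indexing arithmetic is easy to check in isolation. A hypothetical C helper (the name scalar_bit is mine) doing the same byte and bit selection:

```c
#include <stdint.h>

/* Extract bit i (0 .. 255) of a 32-byte scalar: byte i/8, bit i mod 8,
   exactly as the shifts in the Scalarmult loop compute it. */
static unsigned scalar_bit(const uint8_t s[32], unsigned i)
{
  return (s[i >> 3] >> (i & 7)) & 1;
}
```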

In TweetNaCl, the code is structured in this way to save space (so the code can fit in fewer tweets), but it's actually far easier to understand if it's written as two nested loops — an outer loop that iterates over the 32 bytes of S and an inner loop that iterates over the 8 bits of CB. It then looks like this:

--  For each byte of S, starting at the MSB
for I in reverse Index_32 loop
   --  For each bit, starting with bit 7 (the MSB)
   for J in reverse Natural range 0 .. 7 loop
      CB := S (I);
      Swap := Boolean'Val (Shift_Right (CB, J) mod 2);
      CSwap (LP, LQ, Swap);
      LQ := LQ + LP;
      LP := LP + LP;
      CSwap (LP, LQ, Swap);
   end loop;
end loop;

This structure is arguably easier to read and eliminates a Shift_Right operation, but it also reveals a small opportunity for optimization. The assignment to CB depends on the outer loop counter I, but not on the inner loop counter J, so this assignment can be moved "out" of the inner loop, thus:

--  For each byte of S, starting at the MSB
for I in reverse Index_32 loop
   CB := S (I);
   --  For each bit of CB, starting with bit 7 (the MSB)
   for J in reverse Natural range 0 .. 7 loop
      Swap := Boolean'Val (Shift_Right (CB, J) mod 2);
      CSwap (LP, LQ, Swap);
      LQ := LQ + LP;
      LP := LP + LP;
      CSwap (LP, LQ, Swap);
   end loop;
end loop;

Compilers already have a pass that does this kind of thing — it's called "Partial Redundancy Elimination" — but that's only enabled at -O1 and above, so we can gain a little here at -O0 by doing it by hand.

With that in place, the performance figures look like this:


so a small improvement.

[Un]Rolling with it… (Git Tag: Unroll_​Multiply) #

You may wonder why I've not reported figures for the -O3 switch. Good question…

The main effect of -O3 is that it enables automatic loop unrolling. The conditions under which a loop does or doesn't get unrolled aren't entirely clear, and the results (I did try it) are also unpredictable. In particular, unrolling a loop can make the code faster (owing to fewer branches, and not having to maintain loop counter variables), but it also makes the code far bigger, and therefore potentially slower through a reduced instruction cache hit rate and having to read instructions from memory (or from FLASH) more often.

But… there is one loop in SPARKNaCl that I knew might benefit from a "manual" (i.e. actually modifying the source code) loop unroll — the inner loop of the "*" operator for the GF type. The loop body is short and it's only executed 16 times, so unrolling it shouldn't result in a large increase in code size.

The algorithm is a classic "ladder" multiplication. Remember that each GF represents a large integer, where each "digit" N represents a coefficient multiplied by 2**(16N), so the least significant digit represents "units", the next digit is multiplied by 2**16, the next by 2**32 and so on.

To multiply Left by Right, we loop over all the pairs of digits, multiply them together, and accumulate the coefficients into an intermediate result T which has 31 digits. The digits of this intermediate result can be really big (way beyond 2**16), so that's why TweetNaCl stores them in 64 bits.
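In C terms, the accumulation scheme looks like this (a sketch with hypothetical names; SPARKNaCl's actual types and invariants are omitted):

```c
#include <stdint.h>

/* "Textbook" ladder multiplication of two 16-digit numbers in base
   2^16: digit products with the same weight i+j accumulate into t[i+j],
   so the 31-digit intermediate result needs 64-bit headroom. */
static void gf_mul_sketch(int64_t t[31],
                          const int64_t left[16],
                          const int64_t right[16])
{
  int i, j;
  for (i = 0; i < 31; i++) t[i] = 0;
  for (i = 0; i < 16; i++)
    for (j = 0; j < 16; j++)
      t[i + j] += left[i] * right[j];
}
```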

The orig­i­nal code looks like this, again omit­ting the loop invariant:

function "*" (Left, Right : in Normal_GF) return Normal_GF
is
   T  : GF_PA; -- 31 digits
begin
   T := (others => 0);
   --  "Textbook" ladder multiplication
   for I in Index_16 loop
      for J in Index_16 loop
         T (I + J) := T (I + J) + (Left (I) * Right (J));
      end loop;
   end loop;

The inner J loop can be unrolled by hand so it would start with:

T (I)      := T (I)      + (Left (I) * Right (0));
T (I + 1)  := T (I + 1)  + (Left (I) * Right (1));
T (I + 2)  := T (I + 2)  + (Left (I) * Right (2));
--  and so on...

but we notice that the expression "Left (I)" is common to each statement, so we can do a little manual "common subexpression elimination" there by introducing a new local variable LT:

LT         := Left (I);
T (I)      := T (I)      + (LT * Right (0));
T (I + 1)  := T (I + 1)  + (LT * Right (1));
T (I + 2)  := T (I + 2)  + (LT * Right (2));
--  and so on...

Unrolling the inner loop like this also has one side-ben­e­fit: it com­plete­ly removes the need for a com­plex loop invari­ant in the inner loop. It took a while to get it right, so see­ing it dis­ap­pear so eas­i­ly was a pyrrhic vic­to­ry of sorts.

With that done, the per­for­mance fig­ures look like this:


so a big improve­ment at all opti­miza­tion levels.

To Slice or not to Slice? (Git Tag: No_​Slices) #

The NaCl "Sign" operation uses the SHA512 hash algorithm three times to form a single signature, so I thought it would be another good place to look for performance improvement.

The inner loop of SHA512 repeatedly stores a group of 8 bytes into an Unsigned_64 value, but does so in big-endian format, so the first byte becomes the most significant 8 bits of the result. To do that, SPARKNaCl declares a simple conversion function, called DL64. The original version, translated from the TweetNaCl original, looks like this:

function DL64 (X : in Bytes_8) return U64
is
   U : U64 := 0;
begin
   for I in X'Range loop
      U := Shift_Left (U, 8) or U64 (X (I));
   end loop;
   return U;
end DL64;

which looks simple enough. Note that Bytes_8 is a constrained array of Byte (with index values 0 .. 7) and that the Shift_Left and "or" operators here are both operating on 64-bit unsigned values.
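The C side of the comparison uses the same shift-and-or loop. A close sketch of TweetNaCl's equivalent function (reproduced from memory, so treat it as illustrative rather than a verbatim quote):

```c
#include <stdint.h>

/* Load 8 bytes as a big-endian 64-bit value, one byte per iteration. */
static uint64_t dl64_sketch(const uint8_t x[8])
{
  uint64_t u = 0;
  int i;
  for (i = 0; i < 8; i++)
    u = (u << 8) | x[i];
  return u;
}
```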

In the main loop of SHA512, DL64 gets called many times, converting each group of 8 bytes from a 128-byte block. The "natural" way to write this in SPARK is to use array "slices" to pick out the right group of bytes, and then convert each slice to Bytes_8 to fix the bounds to those expected by DL64. This is all done in one big aggregate assignment that looks like this:

W := (0  => DL64 (Bytes_8 (M (CB + 0 .. CB + 7))),
      1  => DL64 (Bytes_8 (M (CB + 8 .. CB + 15))),
      2  => DL64 (Bytes_8 (M (CB + 16 .. CB + 23))),
      3  => DL64 (Bytes_8 (M (CB + 24 .. CB + 31))),
      -- and so on...   
      15 => DL64 (Bytes_8 (M (CB + 120 .. CB + 127))));

This looks ele­gant, but all those slices are prob­a­bly a bit slow com­pared to the C where it’s all done with a triv­ial bit of point­er arithmetic.

I decid­ed to re-write this to see if I could avoid the slices alto­geth­er. The first step is to re-declare DL64 so it takes two para­me­ters — a gen­er­al uncon­strained array of bytes X, and an off­set val­ue I where it should start pick­ing bytes out of that array. It also needs a suit­able pre­con­di­tion that states that there must be at least 8 bytes in X start­ing at posi­tion I, so the new spec­i­fi­ca­tion looks like this:

function DL64 (X : in Byte_Seq;
               I : in N32) return U64
  with Global => null,
       Pre => X'Length >= 8 and then
              I >= X'First and then
              I <= X'Last - 7;

I also real­ized that I could re-write the body to avoid loop­ing, and to use 32-bit Shifts and log­i­cal oper­a­tors most of the time, since that will be more effi­cient on our RV32 CPU. It now looks like this:

function DL64 (X : in Byte_Seq;
               I : in N32) return U64
is
   LSW, MSW : U32;
begin
   --  Doing this in two 32-bit groups avoids the need
   --  for 64-bit shifts on 32-bit machines.
   MSW := Shift_Left (U32 (X (I)),     24) or
     Shift_Left (U32 (X (I + 1)), 16) or
     Shift_Left (U32 (X (I + 2)), 8) or
     U32 (X (I + 3));
   LSW := Shift_Left (U32 (X (I + 4)), 24) or
     Shift_Left (U32 (X (I + 5)), 16) or
     Shift_Left (U32 (X (I + 6)), 8) or
     U32 (X (I + 7));

   --  Just one 64-bit shift and an "or" is now required
   return Shift_Left (U64 (MSW), 32) or U64 (LSW);
end DL64;
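The same split-load trick can be expressed in C, which makes the saving explicit: the halves are built with 32-bit shifts and ors, and only one 64-bit shift remains. This is an illustrative mirror of the Ada body above, not code from either library:

```c
#include <stdint.h>

/* Big-endian load of 8 bytes starting at offset i, built as two 32-bit
   halves so that a 32-bit target only ever performs one 64-bit shift. */
static uint64_t dl64_split(const uint8_t *x, int i)
{
  uint32_t msw = ((uint32_t)x[i]     << 24) | ((uint32_t)x[i + 1] << 16) |
                 ((uint32_t)x[i + 2] << 8)  |  (uint32_t)x[i + 3];
  uint32_t lsw = ((uint32_t)x[i + 4] << 24) | ((uint32_t)x[i + 5] << 16) |
                 ((uint32_t)x[i + 6] << 8)  |  (uint32_t)x[i + 7];
  return ((uint64_t)msw << 32) | lsw;
}
```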

With that in place, the call­ing code is much sim­pler and avoids the need for array slices and conversions:

W := (0  => DL64 (M, CB),
      1  => DL64 (M, CB + 8),
      2  => DL64 (M, CB + 16),
      --  and so on...
      15 => DL64 (M, CB + 120));

With that in place, the per­for­mance fig­ures look like this:


so a bit slow­er at ‑O0 and ‑O1, but notice­ably faster at ‑O2.

Com­pressed Instruc­tions on RV32 (Git Tag: RV32IM_​Not_​C) #

The RISC-V ISA defines several optional extensions that a particular CPU may choose to implement. The options all have a one-letter identifier, so "M" for Multiplication and Division instructions, "F" for floating-point, and "C" for Compression.

The C option defines 16-bit alter­na­tive encod­ings for the most com­mon­ly used 32-bit instruc­tions. The map­ping is one-to-one in both direc­tions, so com­pressed instruc­tions are decod­ed with a sim­ple table look-up in the CPU, or some sim­ple logic.

Using the compressed instructions is easy, and we've been using them so far in all these tests, since the default for the compiler is to set the target architecture to "RV32IMC". There is no change to the compiler's output, but the assembler generates the compressed opcodes as appropriate. There's also a bit of work done by the linker to make sure that jump and branch instructions all end up going to the right place.

Performance-wise, the compressed instructions can be a mixed blessing. They certainly reduce code size, so you'd expect instruction cache hit rate to improve. On the other hand, there must be some overhead in the extra instruction decoding logic, and then there's the issue of instruction alignment.

In RV32IM (no C!) all instructions are 32 bits, so always appear "aligned" on a 4-byte boundary. On RV32IMC, this has to change to allow instructions to be aligned on 2-byte boundaries. The performance impact of this change is very difficult to predict, and obviously depends on the very particular details of the CPU. The E31 Core (the RISC-V core at the heart of the HiFive1 Rev B board) manual states:

"Fetches are always word-aligned and there is a one-cycle penalty for branching to a 32-bit instruction that is not word-aligned." (section 3.1.5)

so there's a performance hit if we branch to a 32-bit instruction that isn't 4-byte aligned.

So… that got me thinking… what if we tell the compiler (well… the assembler) not to generate the compressed instructions? What effect would that have on code size and performance? This can be done by setting "-march=rv32im" for the compiler — this switch gets passed on to the assembler and linker and they take care of all the rest.

With no com­pressed instruc­tions, our per­for­mance fig­ures look like this:


so a bit worse at -O0 and -O1 and a bit better at -O2. As for code size, a quick check with the "riscv32-elf-size" tool tells us that the binary at -O2 is 3648 bytes bigger without the compressed instructions, so not too bad unless, like me, you started programming on a ZX81 with 1K of RAM.

Return Slot Opti­miza­tion (Git Tag: RSO) #

We now return to a vexing question: how come SPARK was so fast to begin with, especially at -O0? I had expected that the SPARK code would be far slower than TweetNaCl to begin with, owing to its functional programming style, and the use of user-defined operators like "+", "-" and "*" for the types GF and GF_Vector_4. These functions return large blocks of data (128 bytes for a GF) by copy, so I had expected that this would be a big performance hit.

To find out what was going on, I reached out to Eric Botcazou at AdaCore… it turns out it's all down to something called "Return Slot Optimization" (RSO) in GNAT. This section explains RSO, why it's so important, and how we've managed to exploit it in SPARKNaCl.

First, con­sid­er the canon­i­cal” seman­tics of an assign­ment state­ment that calls a func­tion F with 2 argu­ments — some­thing like:

A := F (B, C);

Let's assume that A, B, and C are all of some type T, which is "big" — like a constrained array of data. In the worst-case scenario, two temporary variables and two assignments are required.

  1. In the body of F, the return val­ue will be con­struct­ed in a tem­po­rary vari­able, then copied back to the caller.
  2. In the call­ing code, the returned val­ue will be assigned into a com­pil­er-gen­er­at­ed tem­po­rary vari­able, before being whol­ly assigned onto A.

Now.. there must be oppor­tu­ni­ties to opti­mize that to get rid of the extra assign­ments — this is RSO. The trou­ble is that some con­di­tions mean that RSO can­not always be safe­ly applied. In par­tic­u­lar, any of the fol­low­ing can defeat RSO:

  1. Alias­ing between any of the para­me­ters (B and C) and/​or the assign­ment tar­get (A in this case).
  2. Side-effects caused by call­ing F.
  3. An unhandled exception raised by F.

A fur­ther com­pli­ca­tion is the fact that GCC com­piles many lan­guages, so it doesn’t know if the code has come from Ada, C, C++, For­tran or what­ev­er else, so it has to make worst-case assump­tions about these issues.

But… hang on a minute… we're compiling SPARK here, where there's no aliasing, no function side-effects, and (for SPARKNaCl) we've proven that there are no exceptions. The trouble is, the compiler doesn't know that… yet…

There are three places that RSO can be spot­ted and enabled:

  1. In the generation of the "Gimple" intermediate code. This is always done, even at -O0, but can only spot very simple cases.
  2. An optimization pass called "retslot" that is enabled at -O1 and above.
  3. A later optimization pass called "nrv" that tries again, also enabled at -O1 and above.

To under­stand and see the effect of RSO, let’s look at the func­tion SPARKNaCl.Sign.”+” which oper­ates on the GF_​Vector_​4 type. The SPARK code looks like this:

function "+" (Left, Right : in GF_Vector_4) return GF_Vector_4
is
   A, B, C, D, E, F, G, H : Normal_GF;
   function Double (X : in Normal_GF) return Normal_GF
     is (X + X)
     with Global => null;
begin
   A := (Left (1) - Left (0)) * (Right (1) - Right (0));
   B := (Left (0) + Left (1)) * (Right (0) + Right (1));
   C := (Left (3) * Right (3)) * GF_D2;
   D := Double (Left (2) * Right (2));

   E := D - C;
   F := D + C;
   G := B - A;
   H := B + A;

   return GF_Vector_4'(0 => G * E,
                       1 => H * F,
                       2 => F * E,
                       3 => G * H);
end "+";

This body makes 18 calls to the oper­a­tors for GF defined in the body of SPARK­Na­Cl, so this should be ripe for RSO.

To see what’s going on in the Gim­ple code before and after each opti­miza­tion pass, we can run the com­pil­er with the “-fdump-tree-all” flag.

Using GNAT Community 2020 Edition, in the unoptimized Gimple code for "+", only 10 of those 18 calls have RSO enabled. For example, the Gimple code for the final 4 assignments and the return statement looks like this:

e = sparknacl."-" (&d, &c);
f = sparknacl."+" (&d, &c);
g = sparknacl."-" (&b, &a);
h = sparknacl."+" (&b, &a);

R.6 = sparknacl."*" (&g, &e); [return slot optimization]
<retval>[0] = R.6;

R.7 = sparknacl."*" (&h, &f); [return slot optimization]
<retval>[1] = R.7;

R.8 = sparknacl."*" (&f, &e); [return slot optimization]
<retval>[2] = R.8;

R.9 = sparknacl."*" (&g, &h); [return slot optimization]
<retval>[3] = R.9;

return <retval>;

So… the first four calls are missing out on RSO. The final four calls get RSO in the called operator, but the result is still assigned to temporaries R.6 through R.9 before being copied to "retval" and returned to the caller. Looking at the Gimple code that results from the "retslot" and "nrv" passes, there's no further change at all.

Look­ing even deep­er at the gen­er­at­ed code for “+” at ‑O2, we find it needs 1197 instruc­tions to exe­cute, but 1072 of those are involved in copy­ing the results of calls to those 18 oper­a­tors, which does seem like an unfor­tu­nate overhead.

These calls could be subject to RSO because of the intrinsic properties of SPARK, but the compiler doesn't know that. We need some way of giving a "hint" to the compiler that all these calls should be OK for RSO.

In order to see how this can work, I added the "Pure_Function" aspect to all the functions in the call-tree of SPARKNaCl.Sign, and Eric modified GNAT to spot this so that RSO could be enabled more widely. This has been implemented in a wavefront of GNAT Pro 22.0w. With that in place, the unoptimized Gimple for the same fragment of "+" looks like this:

e = sparknacl."-" (&d, &c); [return slot optimization]
f = sparknacl."+" (&d, &c); [return slot optimization]
g = sparknacl."-" (&b, &a); [return slot optimization]
h = sparknacl."+" (&b, &a); [return slot optimization]
<retval>[0] = sparknacl."*" (&g, &e); [return slot optimization]
<retval>[1] = sparknacl."*" (&h, &f); [return slot optimization]
<retval>[2] = sparknacl."*" (&f, &e); [return slot optimization]
<retval>[3] = sparknacl."*" (&g, &h); [return slot optimization]
return <retval>;

which is a bit more like it! We see all 8 calls here getting RSO, the results of the final 4 calls are assigned directly onto "retval", and the extra temporary variables have disappeared. In fact, all 18 calls in "+" now have RSO enabled.

Looking at the generated code for "+" at -O2 with the new compiler is revealing: it's a single straight-line sequence of just 94 instructions, with no branches, so something of an improvement from 1197 instructions before.

Anoth­er pleas­ant side-effect is that RSO removes the need for many com­pil­er-gen­er­at­ed tem­po­rary vari­ables. With Com­mu­ni­ty 2020, “+” requires 1712 bytes of space on the stack, while with the new com­pil­er that comes down to 1056 bytes, so reduc­ing worst-case stack usage a fair amount.

In terms of performance, the gain here is not so great, for two reasons:

  1. The “+” oper­a­tor on GF_​Vector_​4 is called 512 times for a sin­gle Sign” oper­a­tion, so we’d expect some­thing like (1197 — 94)*512 = 564736 instruc­tions improve­ment at ‑O2, or about 0.5 Mil­lion cycles (assum­ing the proces­sor is aver­ag­ing rough­ly 1 instruc­tion per cycle or thereabouts…)
  2. The (far more time-crit­i­cal) “+”, “*”, “-” and CAR” func­tions on GF are get­ting RSO applied with the GNAT Com­mu­ni­ty 2020 Edi­tion, so we’re not expect­ing to see much improve­ment there.

With the new com­pil­er, the per­for­mance fig­ures look like this:

[Table: instruction counts by optimization level for the Baseline, Relax_Init, PRE_Scalarmult, Unroll_Multiply, No_Slices, RV32IM_Not_C, and RSO (GNAT 22.0w) builds]

which is in line with our expectations.

Back to Tweet­Na­Cl and lev­el­ling the play­ing field (Git Tag: Faster_​Tweet) #

We’ll now return to com­par­ing the per­for­mance of SPARK­Na­Cl with TweetNaCl.

It would be tempting to just compare our original TweetNaCl results to our final and best SPARK results for our "Sign" test, returning to using the GNAT Community 2020 edition:

[Table: instruction counts by optimization level for TweetNaCl versus SPARKNaCl at the RSO tag]

Based on that, we could declare vic­to­ry and ride off into the sun­set, but it’s not real­ly a fair com­par­i­son any more, is it?

We’ve spent con­sid­er­able effort in apply­ing some trans­for­ma­tions to SPARK­Na­Cl which (if we’re will­ing to give up the fits in 100 tweets” require­ment) could also be applied to the Tweet­Na­Cl sources. In par­tic­u­lar, the fol­low­ing changes could be applied to TweetNaCl:

  1. Adop­tion of Donenfeld’s alter­na­tive CAR algorithm.
  2. Loop redun­dan­cy elim­i­na­tion in Sign.Scalarmult.
  3. Unrolling the inner loop in “*” for GF.

Our changes to eliminate array slices, relaxed initialization, and return slot optimization have no real equivalent in C, so can't be applied. We'll also continue to use the same compiler and the same compiler options, so the switch to the uncompressed ISA is also picked up when testing TweetNaCl.

With those changes in place (see the Git tag "Faster_Tweet" for the revised perf/tweetnacl.c sources), the results look like this:

[Table: instruction counts by optimization level for the revised TweetNaCl and SPARKNaCl]

so C is winning again at all levels. Most of this improvement in the C code comes down to Donenfeld's revised CAR code: it is not only faster than the original, but is also called only 10240 times for a single "Sign" operation, as opposed to 24564 times in SPARKNaCl. That alone could account for TweetNaCl executing about 13 Million fewer instructions.

The good news is that, for both lan­guages, we’ve man­aged a dra­mat­ic improve­ment from the orig­i­nal Base­line tag.

Hang on Har­ry, I’ve got a good idea… #

In doing and writing about this work, I've already come up with a few more ideas for how I could improve the SPARK code. In no particular order:

  • Re-intro­duce the Com­pressed RISC‑V instruc­tion set, but use the ‑falign-func­tions and ‑falign-labels switch­es to force 4‑byte align­ment of all jump tar­gets. That might get us close to the orig­i­nal code den­si­ty with the com­pressed instruc­tions, but with­out pay­ing the penal­ty for mis-aligned branch­es. Thanks to Eric once again for point­ing out this possibility.
  • SPARK is main­ly los­ing per­for­mance owing to all those extra calls to CAR. A long-term goal would be to fall into line with Tweet­Na­Cl, by not nor­mal­iz­ing the results of “+” and “-“, and only need­ing to apply CAR twice after “*”. The proof that this works will be, um, challenging…
  • I realize that most of the time, storing a Normal_GF with 64-bit digits is a complete waste of space, but also a huge performance penalty on RV32, where all the 64-bit arithmetic requires multiple instructions. We could store a Normal_GF using 32-bit digits, which are fine to store the intermediate results of "+" and "-". We'd only have to "expand out" to 64-bit digits for the intermediate results of "*". This would also save lots of memory, and make every assignment of a Normal_GF half the size! My initial analysis indicates that only the first CAR following a "*" needs to be done in 64-bit arithmetic. Of the 24564 calls to CAR that we currently see in a Sign operation, that means 19448 of them could be replaced by a 32-bit version. I wrote a quick-and-dirty 32-bit version of CAR that comes out at 133 instructions on RV32IM at -O2, versus 321 instructions for the 64-bit variant. This would save something like 3.5 Million instructions in a Sign operation, so worth a go…

If I get round to imple­ment­ing any of these, then I’ll write anoth­er (and hope­ful­ly short­er) blog entry.

Con­clu­sions and Fur­ther Work #

This has all been some­thing of a marathon. My excuse is that I need­ed some­thing fun to do dur­ing lock­down. I should prob­a­bly get out more… except I’m not allowed to…

Well done if you’ve got this far… in conclusion:

  • There real­ly doesn’t seem to be a con­flict between for­mal proof and get­ting high-per­for­mance out of time-crit­i­cal code like this. At all stages of this process, the proof of type-safe­ty and oth­er prop­er­ties for SPARK­Na­Cl has been main­tained. Indeed, the con­ti­nu­ity of the proof is real­ly impor­tant in giv­ing me con­fi­dence that my per­for­mance-improv­ing trans­for­ma­tions haven’t done some­thing bad to the actu­al cor­rect­ness of the code.
  • I was largely ignorant of RSO before this, but it turns out to be really important. It means we can still adopt an elegant and functional programming style in SPARK, but without sacrificing performance owing to returning composite types from functions.

In terms of fur­ther work:

  • The ideas above pro­vide a few more areas where SPARK could gain some performance.
  • This work has con­cen­trat­ed on the Sign” top-lev­el oper­a­tion, but there are oth­ers that have been unused (and there­fore untouched) in this work, notably the Salsa20 stream cipher, and the top-lev­el Cryp­to­Box” oper­a­tion. Per­for­mance analy­sis and tun­ing of those could be done next.
  • Do some more exper­i­ments with ‑O3, but use GNAT’s prag­ma Loop_​Optimize (No_​Unroll) to con­trol loop unrolling precisely.
  • Try this all over again with a 64-bit RISC‑V target.
  • Try it all again with GNAT-LLVM. This would be inter­est­ing just to see how much faster or slow­er LLVM’s code runs, but there are also a host of for­mal ver­i­fi­ca­tion tools for LLVM IR that we could use to check, for exam­ple, the cor­rect­ness of opti­miza­tion pass­es using tools like Galois Inc’s SAW.
  • More work is need­ed on RSO and how we can teach the com­pil­er to take advan­tage of it for ver­i­fied SPARK code. Watch out for more progress on this in a future GNAT release.

Acknowl­edge­ments #

Many thanks to Eric Bot­ca­zou of Ada­Core for his sup­port in this work. In par­tic­u­lar, Eric explained the ins-and-outs of Return Slot Opti­miza­tion to me and imple­ment­ed the improve­ments to RSO in GNAT 22.0w that are report­ed above. Thanks to Cyrille Comar, Yan­nick Moy and the rest of SPARK team at Ada­Core for their sup­port and com­ments on ear­li­er drafts of this blog.


About Roderick Chapman


Rod is Director of Protean Code Limited. He specializes in the design, implementation and verification of high-integrity software systems. For many years, Rod led the SPARK design team at Praxis and Altran in the UK, and continues to work closely with the SPARK team at AdaCore.