Today I will write the first article in a short series about the development of an SMTLIB processing tool in SPARK. Instead of focusing on features, I intend to focus on the how I have proved absence of run-time errors in the name table and lexer. I had two objectives: show absence of run-time errors, and do not write useless defensive code. Today's blog will be about the name table, a data structure found in many compilers that can map strings to a unique integer and back. The next blog post will talk about the lexical analyzer.
Strings are unconstrained arrays and as such indefinite types, so keeping them around without using pointers in an Ada program is often a bit inconvenient. In particular you will not be able to use any of the definite containers, nor can we create a formal hash table with strings as the key. Thus the name table will create a unique integer for each string, but unlike a simple hash this process should be trivially reversible.
The name table provides a new private type
Name_Id (which is just a
Natural), and the following two subprograms:
procedure Lookup (S : String; N : out Name_Id) with Global => (In_Out => Name_Table), Pre => Invariant, Post => Invariant; -- Obtain name_id for the given string. function To_String (N : Name_Id) return String with Global => Name_Table, Pre => Invariant; -- The original string.
You will notice at this point the
Invariant. We establish this in the elaboration of the package, and from then onward we preserve it. Maintaining such an invariant is a common pattern if the proof obligations are non-trivial. The invariant is a just ghost function:
function Invariant return Boolean with Ghost, Global => Name_Table; -- Internal invariant for the name table.
The central idea for proof in the name table is that you can only ever add to it, and once you have added a string to it it can never be removed. In other words, any
Name_Id you have will always stay valid.
The basic concept of the name table is to have a unbounded array (implemented using Ada.Containers.Formal_Vector) containing characters, such as (f, o, o, b, a, r, ...). We then have an array that tells us how to produce strings from
Name_Id objects. Such a
Name_Entry is a small record:
type Name_Entry is record Table_Index : Char_Table_Index; Length : Positive; end record;
And the name table (again, an instance of
Formal_Vector) is an array of these, for example:
1 => (Table_Index => 1, Length => 3), 2 => (Table_Index => 4, Length => 3), 3 => (Table_Index => 1, Length => 6), ...
In this example
Name_Id 1 to 3 represent the string 'foo', 'bar' and 'foobar' respectively.
It should be fairly obvious how to create this table, but its not terribly efficient. We either have a lot of duplicate entries (and then we lose the ability to test if two strings are equal by only comparing their
Name_Id), or we need to search the entire table for duplicates. The solution of course is to add a hash table on top of this, so we extend the
Name_Entry to also be a linked list for a hash bucket.
Hash_Table_Size : constant := 256; subtype Hash_Table_Index_T is Hash_Type range 0 .. (Hash_Table_Size - 1); type Hash_Table_T is array (Hash_Table_Index_T) of Name_Id; Hash_Table : Hash_Table_T := (others => 0); type Name_Entry is record Table_Index : Char_Table_Index; Length : Positive; Next_Hash : Name_Id; end record;
The process for searching a string is now: first check the
Hash_Table which will point to the first
Name_Entry with a matching hash. If this is not a match, we follow
Next_Hash until we get to the end of this linked list.
So far we have not proven anything, so what *are* the proof obligations? The main proof obligation comes from
To_String where we wish to look into the table and then take
Length characters from character array starting at
Table_Index. We could write defensive code, but we can also tackle this with proof. This is the implementation of
function To_String (N : Name_Id) return String is -- The only names are the ones produced by Lookup. pragma Assume (N <= Last_Index (Entry_Table)); begin if N = 0 then return ""; end if; declare E : constant Name_Entry := Element (Entry_Table, N); L : constant Positive := E.Length; begin return S : String (1 .. L) do for I in Positive range 1 .. L loop S (I) := Element (Char_Table, E.Table_Index + Char_Table_Index (I - 1)); pragma Loop_Invariant (Invariant); end loop; end return; end; end To_String;
The assumption is something we establish in two places: a) we cannot ever delete entries (this cannot be expressed yet in SPARK annotations), and when we add a new entry to the
Entry_Table the Name_Id will be equal to
Last_Index (Entry_Table) (this part is proven). Taken together, this justifies the assumption.
We special case the empty string, but then we just look into the table and pull out the elements of the character array. However we need to show that
E.Table_Index + Char_Table_Index (I - 1) is in bounds of the container.
So, how do we do this? Essentially, we'd like to just know that any given element that we pull out of the table magically has this property. The property itself is easy enough to express:
function Valid_Entry (E : Name_Entry) return Boolean is (E.Next_Hash <= Last_Index (Entry_Table) and Char_Table_Index (E.Length - 1) <= Last_Index (Char_Table) - E.Table_Index) with Ghost, Pre => Valid_Tables;
This predicate expresses two things: the linked list pointer will point to something inside the table (i.e. it's safe to follow the link) and that the range of characters described is within the range of the character table.
We can now easily write a predicate that expresses this property over the entire table using a quantified expression:
function Valid_Name_Table return Boolean is (for all I in Name_Id range First_Index (Entry_Table) .. Last_Index (Entry_Table) => Valid_Entry (Element (Entry_Table, I))) with Ghost, Pre => Valid_Tables;
Valid_Tables just encodes the size limit on the two tables and is:
function Valid_Tables return Boolean is (Last_Index (Entry_Table) <= Name_Id'Last and Last_Index (Char_Table) <= Char_Table_Index'Last) with Ghost;
We also need to say something about the hash table. This invariant records that any entry in the hash table will point to something sensible in the entry table (or is 0):
function Valid_Hashes return Boolean is (for all H in Hash_Table_Index_T => Hash_Table (H) <= Last_Index (Entry_Table)) with Ghost;
We finally wrap this up in the single invariant:
function Invariant return Boolean is ((Valid_Tables and then Valid_Name_Table) and Valid_Hashes);
Lookup always maintains this invariant, and it is a precondition to
To_String, this is all we need to show that we will never access anything not yet allocated in the character table.
For completeness sake here is the code for Lookup (it's more complex and I'll leave it as an exercise to the reader to work out how it maintains the invariant....
procedure Lookup (S : String; N : out Name_Id) is Ptr : Name_Id := 0; H : constant Hash_Table_Index_T := Ada.Strings.Hash (S) mod Hash_Table_Size; begin -- Special case the empty string. if S'Length = 0 then N := 0; return; end if; -- Look at the first element of the hash bucket. N := Hash_Table (H); if N in Valid_Name_Id then -- It exists... Ptr := N; loop -- If it matches the string, then we stop here. if To_String (Ptr) = S then N := Ptr; return; end if; pragma Loop_Invariant (Ptr in Valid_Name_Id and Ptr <= Last_Index (Entry_Table) and Invariant); -- Otherwise, we follow the linked list until we hit the end. exit when Element (Entry_Table, Ptr).Next_Hash = 0; Ptr := Element (Entry_Table, Ptr).Next_Hash; end loop; end if; -- We have not found the string, so at this point we need to add it -- to the table. Merge (S, N); -- And then fix up the hash table. In the first case we need to -- update the last link in the hash bucket; in the second case we -- update the top-level hash table as its the first item in the -- bucket. if Ptr in Valid_Name_Id then Replace_Element (Entry_Table, Ptr, Element (Entry_Table, Ptr)'Update (Next_Hash => N)); else Hash_Table (H) := N; end if; end Lookup;
It really helped the proof effort to separate out the code that actually adds a new entry to the table. The code for
Merge (which also maintains the invariant) is shown below. Note there are two more assumptions here: that we'll never run out of memory. We can avoid doing this if we have a special Error token we can return, but here we just assume that we'll not run out of space. Also observe the postcondition for
Merge it maintains the invariant and records that the
Name_Id generated will be in the bounds of the entry table.
procedure Merge (S : String; N : out Name_Id) with Global => (In_Out => (Char_Table, Entry_Table), Proof_In => Hash_Table), Pre => Invariant, Post => Invariant and Length (Entry_Table) = Length (Entry_Table)'Old + 1 and Length (Char_Table) >= Length (Char_Table)'Old and N = Last_Index (Entry_Table) is -- If this is not true, then we're out of memory on the string or -- entry table... pragma Assume (Last_Index (Char_Table) < Char_Table_Index'Last - S'Length); pragma Assume (Last_Index (Entry_Table) < Name_Id'Last); begin -- Add the string to the character table. for I in Positive range 1 .. S'Length loop Append (Char_Table, S (I + (S'First - 1))); pragma Loop_Invariant (Invariant and Length (Char_Table) = Length (Char_Table)'Loop_Entry + Char_Tables.Capacity_Range (I)); end loop; -- Construct a Name_Entry and add it to the table. declare E : constant Name_Entry := (Table_Index => Last_Index (Char_Table) - (S'Length - 1), Length => S'Length, Next_Hash => 0); begin Append (Entry_Table, E); end; N := Last_Index (Entry_Table); end Merge;
Looking at the body of
To_String you will notice that there is actually not that much annotation, despite the complicated nature of the name table. The main trick to do this was to keep the invariant as an expression function.
Also note that you get 100% automatic proof with just CVC4.
In the next blog post I will write about the lexer, focusing on the termination proof.