SPARKSMT - An SMTLIB Processing Tool Written in SPARK - Part I
by Florian Schanda –
Dear Diary,
Today I will write the first article in a short series about the development of an SMTLIB processing tool in SPARK. Instead of focusing on features, I intend to focus on the how I have proved absence of run-time errors in the name table and lexer. I had two objectives: show absence of run-time errors, and do not write useless defensive code. Today's blog will be about the name table, a data structure found in many compilers that can map strings to a unique integer and back. The next blog post will talk about the lexical analyzer.
The interface
Strings are unconstrained arrays and as such indefinite types, so keeping them around without using pointers in an Ada program is often a bit inconvenient. In particular you will not be able to use any of the definite containers, nor can we create a formal hash table with strings as the key. Thus the name table will create a unique integer for each string, but unlike a simple hash this process should be trivially reversible.
The name table provides a new private type Name_Id
(which is just a Natural
), and the following two subprograms:
procedure Lookup (S : String;
N : out Name_Id)
with Global => (In_Out => Name_Table),
Pre => Invariant,
Post => Invariant;
-- Obtain name_id for the given string.
function To_String (N : Name_Id) return String
with Global => Name_Table,
Pre => Invariant;
-- The original string.
You will notice at this point the Invariant
. We establish this in the elaboration of the package, and from then onward we preserve it. Maintaining such an invariant is a common pattern if the proof obligations are non-trivial. The invariant is a just ghost function:
function Invariant return Boolean
with Ghost,
Global => Name_Table;
-- Internal invariant for the name table.
The central idea for proof in the name table is that you can only ever add to it, and once you have added a string to it it can never be removed. In other words, any Name_Id
you have will always stay valid.
The internals
The basic concept of the name table is to have a unbounded array (implemented using Ada.Containers.Formal_Vector) containing characters, such as (f, o, o, b, a, r, ...). We then have an array that tells us how to produce strings from Name_Id
objects. Such a Name_Entry
is a small record:
type Name_Entry is record
Table_Index : Char_Table_Index;
Length : Positive;
end record;
And the name table (again, an instance of Formal_Vector
) is an array of these, for example:
1 => (Table_Index => 1,
Length => 3),
2 => (Table_Index => 4,
Length => 3),
3 => (Table_Index => 1,
Length => 6),
...
In this example Name_Id
1 to 3 represent the string 'foo', 'bar' and 'foobar' respectively.
It should be fairly obvious how to create this table, but its not terribly efficient. We either have a lot of duplicate entries (and then we lose the ability to test if two strings are equal by only comparing their Name_Id
), or we need to search the entire table for duplicates. The solution of course is to add a hash table on top of this, so we extend the Name_Entry
to also be a linked list for a hash bucket.
Hash_Table_Size : constant := 256;
subtype Hash_Table_Index_T is Hash_Type range 0 .. (Hash_Table_Size - 1);
type Hash_Table_T is array (Hash_Table_Index_T) of Name_Id;
Hash_Table : Hash_Table_T := (others => 0);
type Name_Entry is record
Table_Index : Char_Table_Index;
Length : Positive;
Next_Hash : Name_Id;
end record;
The process for searching a string is now: first check the Hash_Table
which will point to the first Name_Entry
with a matching hash. If this is not a match, we follow Next_Hash
until we get to the end of this linked list.
So far we have not proven anything, so what *are* the proof obligations? The main proof obligation comes from To_String
where we wish to look into the table and then take Length
characters from character array starting at Table_Index
. We could write defensive code, but we can also tackle this with proof. This is the implementation of To_String
:
function To_String (N : Name_Id) return String
is
-- The only names are the ones produced by Lookup.
pragma Assume (N <= Last_Index (Entry_Table));
begin
if N = 0 then
return "";
end if;
declare
E : constant Name_Entry := Element (Entry_Table, N);
L : constant Positive := E.Length;
begin
return S : String (1 .. L) do
for I in Positive range 1 .. L loop
S (I) := Element (Char_Table,
E.Table_Index + Char_Table_Index (I - 1));
pragma Loop_Invariant (Invariant);
end loop;
end return;
end;
end To_String;
The assumption is something we establish in two places: a) we cannot ever delete entries (this cannot be expressed yet in SPARK annotations), and when we add a new entry to the Entry_Table
the Name_Id will be equal to Last_Index (Entry_Table)
(this part is proven). Taken together, this justifies the assumption.
We special case the empty string, but then we just look into the table and pull out the elements of the character array. However we need to show that E.Table_Index + Char_Table_Index (I - 1)
is in bounds of the container.
So, how do we do this? Essentially, we'd like to just know that any given element that we pull out of the table magically has this property. The property itself is easy enough to express:
function Valid_Entry (E : Name_Entry) return Boolean is
(E.Next_Hash <= Last_Index (Entry_Table) and
Char_Table_Index (E.Length - 1) <= Last_Index (Char_Table) - E.Table_Index)
with Ghost,
Pre => Valid_Tables;
This predicate expresses two things: the linked list pointer will point to something inside the table (i.e. it's safe to follow the link) and that the range of characters described is within the range of the character table.
We can now easily write a predicate that expresses this property over the entire table using a quantified expression:
function Valid_Name_Table return Boolean is
(for all I in Name_Id
range First_Index (Entry_Table) .. Last_Index (Entry_Table) =>
Valid_Entry (Element (Entry_Table, I)))
with Ghost,
Pre => Valid_Tables;
The precondition Valid_Tables
just encodes the size limit on the two tables and is:
function Valid_Tables return Boolean is
(Last_Index (Entry_Table) <= Name_Id'Last and
Last_Index (Char_Table) <= Char_Table_Index'Last)
with Ghost;
We also need to say something about the hash table. This invariant records that any entry in the hash table will point to something sensible in the entry table (or is 0):
function Valid_Hashes return Boolean is
(for all H in Hash_Table_Index_T =>
Hash_Table (H) <= Last_Index (Entry_Table))
with Ghost;
We finally wrap this up in the single invariant:
function Invariant return Boolean is
((Valid_Tables and then Valid_Name_Table) and
Valid_Hashes);
Since Lookup
always maintains this invariant, and it is a precondition to To_String
, this is all we need to show that we will never access anything not yet allocated in the character table.
For completeness sake here is the code for Lookup (it's more complex and I'll leave it as an exercise to the reader to work out how it maintains the invariant....
procedure Lookup (S : String;
N : out Name_Id)
is
Ptr : Name_Id := 0;
H : constant Hash_Table_Index_T :=
Ada.Strings.Hash (S) mod Hash_Table_Size;
begin
-- Special case the empty string.
if S'Length = 0 then
N := 0;
return;
end if;
-- Look at the first element of the hash bucket.
N := Hash_Table (H);
if N in Valid_Name_Id then
-- It exists...
Ptr := N;
loop
-- If it matches the string, then we stop here.
if To_String (Ptr) = S then
N := Ptr;
return;
end if;
pragma Loop_Invariant
(Ptr in Valid_Name_Id and
Ptr <= Last_Index (Entry_Table) and
Invariant);
-- Otherwise, we follow the linked list until we hit the end.
exit when Element (Entry_Table, Ptr).Next_Hash = 0;
Ptr := Element (Entry_Table, Ptr).Next_Hash;
end loop;
end if;
-- We have not found the string, so at this point we need to add it
-- to the table.
Merge (S, N);
-- And then fix up the hash table. In the first case we need to
-- update the last link in the hash bucket; in the second case we
-- update the top-level hash table as its the first item in the
-- bucket.
if Ptr in Valid_Name_Id then
Replace_Element (Entry_Table,
Ptr,
Element (Entry_Table, Ptr)'Update (Next_Hash => N));
else
Hash_Table (H) := N;
end if;
end Lookup;
It really helped the proof effort to separate out the code that actually adds a new entry to the table. The code for Merge
(which also maintains the invariant) is shown below. Note there are two more assumptions here: that we'll never run out of memory. We can avoid doing this if we have a special Error token we can return, but here we just assume that we'll not run out of space. Also observe the postcondition for Merge
it maintains the invariant and records that the Name_Id
generated will be in the bounds of the entry table.
procedure Merge (S : String;
N : out Name_Id)
with Global => (In_Out => (Char_Table, Entry_Table),
Proof_In => Hash_Table),
Pre => Invariant,
Post => Invariant and
Length (Entry_Table) = Length (Entry_Table)'Old + 1 and
Length (Char_Table) >= Length (Char_Table)'Old and
N = Last_Index (Entry_Table)
is
-- If this is not true, then we're out of memory on the string or
-- entry table...
pragma Assume (Last_Index (Char_Table)
< Char_Table_Index'Last - S'Length);
pragma Assume (Last_Index (Entry_Table) < Name_Id'Last);
begin
-- Add the string to the character table.
for I in Positive range 1 .. S'Length loop
Append (Char_Table, S (I + (S'First - 1)));
pragma Loop_Invariant
(Invariant and
Length (Char_Table) =
Length (Char_Table)'Loop_Entry + Char_Tables.Capacity_Range (I));
end loop;
-- Construct a Name_Entry and add it to the table.
declare
E : constant Name_Entry :=
(Table_Index => Last_Index (Char_Table) - (S'Length - 1),
Length => S'Length,
Next_Hash => 0);
begin
Append (Entry_Table, E);
end;
N := Last_Index (Entry_Table);
end Merge;
Closing thoughts
Looking at the body of Lookup
and To_String
you will notice that there is actually not that much annotation, despite the complicated nature of the name table. The main trick to do this was to keep the invariant as an expression function.
Also note that you get 100% automatic proof with just CVC4.
In the next blog post I will write about the lexer, focusing on the termination proof.