# Proving a simple program doing I/O ... with SPARK

## by Joffrey Huguet , Johannes Kanig – Jul 09, 2019

The functionality of many security-critical programs is directly related to Input/Output (I/O). This includes command-line utilities such as gzip, which might process untrusted data downloaded from the internet, but also any servers that are directly connected to the internet, such as webservers, DNS servers and so on. However, I/O has received little attention from the formal methods community, and SPARK also does not help the programmer much when it comes to I/O. SPARK has been used to debug functions from the standard library Ada.Text_IO, but this approach lacked support for error handling and didn't allow going up to the application level.

As an example, take a look at the current specification of Ada.Text_IO.Put, which only recently has been annotated with some SPARK contracts:

procedure Put
(File : File_Type;
Item : String)
with
Pre    => Is_Open (File) and then Mode (File) /= In_File,
Global => (In_Out => File_System);

(We have suppressed the postcondition of this function, which talks about line and page length, a functionality of Ada.Text_IO which is not relevant to this blog post.)

We can see that Put has a very light contract that protects against some errors, such as calling it on a file that has not been opened for writing, but not against the other many possible errors related to writing, such as a full disk or file permissions. If such an error occurs, Put raises an exception, but SPARK does not allow catching exceptions. So in the above form, even a proved SPARK program that uses Put may terminate with an unhandled exception.

Moreover, Put does not specify what exactly is written. For example, one cannot prove in SPARK that two calls to Put with the same File argument write the concatenation of the two Item arguments to the file.

This second problem can probably be solved using a stronger contract (though it would be difficult to do so in Ada.Text_IO, whose interface must respect the Ada standard), but together with the first point it means that, even if our program is annotated with a suitable contract and proved, we can only say something informal like “if the program doesn’t crash because of unexpected errors, it respects its contract”.

In this blogpost, we propose to solve these issues as follows. We replace Put by a procedure Write which reports errors via its output Has_Written:

procedure Write
(Fd          : int;
Buf         : Init_String;
Num_Bytes   : Size_T;
Has_Written : out ssize_t);

We now can annotate Write with a suitable postcondition that explains exactly what has been written to the file descriptor, and that a negative value of Has_Written signals an error.

It is no accident that the new procedure looks like the POSIX system call write. In fact we decided to base this experiment on the POSIX API that consists of open, close, read and write, and simply added very thin SPARK wrappers around those system calls. The advantage of this approach is that system calls never crash (unless there are bugs in the kernel, of course) - they simply flag errors via their return value, or in our case, the out parameter Has_Written. This means that, assuming our program contains a Boolean variable Error that is updated accordingly after invoking the system calls, we can now formally write down the contract of our program in the form:

if not Error then ...

And if one thinks about it, this is really the best any program dealing with I/O can hope for, because some errors cannot be predicted or avoided by the programmer, such as trying to write to a full disk or trying to open a file that does not exist.

We could further refine the above condition to distinguish the behavior of the program depending on the error that was encountered, but we did not do this in the work described here.

The other advantage of using a POSIX-like API is that porting existing programs from C to this API would be simpler.

We validated this approach by writing a SPARK clone of the “cat” utility. We were able to prove that cat copies the content of the files in its argument to stdout, unless errors were encountered. The project is available following this link.

# How to represent the file system?

The main interest of the library is to be able to write properties about content, which means that we have to represent this content through something (global variables, state abstraction…). We decided to use maps, that link a file descriptor to an unbounded string representing the content of the corresponding file. Formal_Hashed_Maps were convenient to use because they come with many functions that compare two different maps (e.g Keys_Included_Except, or Elements_Equal_Except).

In Ada, even Unbounded_String (from Ada.Strings.Unbounded) are somehow bounded: the maximum length of Unbounded_String is Natural’Last, a length that could be exceeded by certain files. We had to create another library, that relies on Ada.Strings.Unbounded, but would accept appending two unbounded strings whose lengths are maximal. The choice we made was to drop any character that would make the length overflow, e.g. the append function has the following contract:

function "&" (L, R : My_Unbounded_String) return My_Unbounded_String with
Global         => null,
Contract_Cases =>
(Length (L) = Natural'Last               =>
To_String ("&"'Result) = To_String (L),
--  When L already has maximal length

Length (L) < Natural'Last
and then
Length (R) <= Natural'Last - Length (L) =>
To_String ("&"'Result) = To_String (L) & To_String (R),
--  When R can be appended entirely

others                                  =>
To_String ("&"'Result)
= To_String (L) & To_String (R) (1 .. Natural'Last - Length (L)));
--  When R cannot be appended entirely

Also, what you can see here is that all properties of unbounded strings are expressed with the conversion to String type: this allows to use the existing axiomatization of strings (through arrays) instead of redefining one.

Another design choice was made here: what’s the content of a file we opened in read mode? In stdio.ads, we considered that the content of this file is strictly what we read from it. Because there is no way to know the content of the file, we decided to implement cat as a procedure that would “write whatever it reads”. Any given file could be modified while reading, or, for the case of stdin, parent process may also read part of the data.

# The I/O library

As said before, the library consists of thin bindings to system calls. Interfacing scalar types (int, size_t, ssize_t) was easy. We also wanted to model more precisely instantiation of buffers. Indeed, when calling Read procedure, the buffer might not be initialized entirely after the call; it is also possible to call Write on a partially initialized String, to copy the first n values that are initialized. The current version of SPARK requires that arrays (and thus Strings) are fully initialized. More details are available in the SPARK User's Guide. A new proposed evolution of the language (see here) is already prototyped in SPARK and allows manipulating safely partially initialized data. In the prototype, we declare a new type, Init_Char, that will have the Init_By_Proof annotation.

subtype Init_Char is Character;
pragma Annotate(GNATprove, Init_By_Proof, Init_Char);

type Init_String is array (Positive range <>) of Init_Char;

With this type declaration, it is now possible to use the attribute ‘Valid_Scalars on Init_Char variables or slices of Init_String variables in Ghost code to specify which scalars have been initialized

The next part was writing the bindings to C functions. An example, for Read, is the following:

function C_Read (Fd : int; Buf : System.Address; Size : size_t; Offset : off_t) return ssize_t;

procedure Read (Fd : int; Buf : out Init_String; Has_Read : out ssize_t) is
begin
end Read;

Other than hiding the use of addresses from SPARK, this part was not very difficult.

The final part was adding the contracts to our procedures.

Firstly, there are no preconditions. System calls may return errors, but they will accept any parameter in input. The only precondition we have in the library is a precondition on Write, that states that the characters that we want to write in the file are initialized.

Secondly, every postcondition is a case expression, where we give properties for each possible return value, e.g:

procedure Open (File : char_array; Flags : int; Fd : out int) with
Global => (In_Out => (FD_Table, Errors.Error_State, Contents)),
Post   =>
(case Fd is
when -1                      => Contents'Old = Contents,
when 0 .. int (OPEN_MAX - 1) =>
Length (Contents'Old) + 1 = Length (Contents)
and then
Contains (Contents, Fd)
and then
Length (Element (Contents, Fd)) = 0
and then
not Contains (Contents'Old, Fd)
and then
Model (Contents'Old) <= Model (Contents)
and then
M.Keys_Included_Except (Model (Contents),
Model (Contents'Old),
Fd),
when others                  => False);

The return value of Open will be either -1, which corresponds to an error, or a natural value (in my case, OPEN_MAX is equal to 1023, 1024 being the maximum number of files that can be open at the same time on my machine). If an error occured, the Contents map is the same as before. If an appropriate file descriptor is returned, the postcondition states that the new Contents map has the same elements as before, plus a new empty unbounded string associated with the file descriptor. With regard to the functional model, these contracts are complete.

# The Cat program

The cat program is split in two different parts: the main program that opens/closes file(s) in its argument, and calls Copy_To_Stdout. This procedure will read from the input file and write to stdout what it read.

Since errors in I/O can happen, we propagate them using a status flag from nested subprograms to the main program and handle them there. This point is the first difference with Ada.Text_IO.

The other difference is the presence of postconditions about data, for example in postcondition of Copy_To_Stdout:

procedure Copy_To_Stdout (Input : int; Err : out Int) with
Post =>
(if Err = 0 then Element (Contents, Stdout)
= Element (Contents’Old, Stdout) & Element (Contents, Input));

This postcondition is the only functional contract we have for cat, this is why it is so important. It states that if no error occurred, the content of stdout is equal to its value before calling Copy_To_Stdout appended to the content we read from the input.

If we wanted to write a more precise contract, the main difficulty would be to handle the cases where an error occured. For example, if we call cat on three different files, and one error occurs when copying the second file to stdout, we have no contract about the content of Stdout, and everything becomes unprovable. Adding contracts for these cases would require work on slices and sliding and to do even more case-splitting, which would add more difficulty for the provers.

Type definitions and helper subprograms to define the library take about 200 lines of code. The library itself has around 100 lines of contracts. The cat program has 100 lines of implementation and 1200 lines of Ghost code in order to prove everything. Around 1000 verification conditions for the entire project (I/O library + cat + lemmas) are discharged by the solvers to prove everything with auto-active proof.

# Conclusion

Cat looks like a simple program, and it is. But being able to prove correctness of cat shows that we are able to reason about contents and copying of data (here, between file descriptors), something which is necessary and largely identical for more complex applications like network drivers or servers that listen on sockets. We will be looking at extending our approach to code manipulating (network) sockets.

Posted in #Formal Verification    #SPARK