User friendly strings API
In a previous post, we described the design of a new strings package, with improved performance compared to the standard Ada unbounded strings implementation. That post focused on various programming techniques used to make that package as fast as possible.
This post is a followup, which describes the design of the user API for the strings package - showing various solutions that can be used in Ada to make user-friendly data structures.
One of the features added in Ada 2005 is a prefix dot notation when calling primitive operations of tagged types. For instance, if you have the following declarations, you can call the subprograms Slice in one of two ways:
declare type XString is tagged private; procedure Slice (Self : in out XString; Low, High : Integer); -- a primitive operation S : XString; begin S.Slice (1, 2); -- the two calls do the same thing, here using a prefix notation Slice (S, 1, 2); end;
This is a very minor change. But in practice, people tend to use the prefix notation because it is more "natural" for people who have read or written code in other programming languages.
In fact, it is so popular that there is some demand to extend this prefix notation, in some future version of the language, to types other than tagged types. Using a tagged type has a cost, since it makes variables slightly bigger (they now have a hidden access field).
In our case, though, a XString was already a tagged type since it is also controlled, so there is no additional cost.
The standard Ada strings are quite convenient to use (at least once you understand that you should use declare blocks since you need to know their size in advance). For instance, one can access characters with expressions like:
A : constant Character := S (2); -- assuming 2 is a valid index B : constant Character := S (S'First + 1); -- better, of course C : constant Character := S (S'Last);
The first line of the above example hides one of the difficulties for newcomers: strings can have any index range, so using just "2" is likely to be wrong here. Instead, the second line should be used. The GNATCOLL strings avoid this pitfall by always indexing from 1. As was explained in the first blog post, this is both needed for the code (so that internally we can reallocate strings as needed without changing the indexes manipulated by the user), and more intuitive for a lot of users.
The Ada unbounded string has a similar approach, and all strings are indexed starting at 1. But you can't use the same code as above, and instead you need to write the more cumbersome:
S := To_Unbounded_String (...); A : constant Character := Element (S, 2); -- second character of the string, always B : constant Character := Ada.Strings.Unbounded.Element (S, 2); -- when not using use-clauses
GNATCOLL.Strings takes advantage of some new aspects introduced in Ada 2012 to provide custom indexing functions. So the spec looks like:
type XString is tagged private with Constant_Indexing => Get; function Get (Self : XString; Index : Positive) return Character; S : XString := ...; A : constant Character := S (2); -- always the second character
After adding this simple Constant_Indexing aspect, we are now back to the simple syntax we were using for standard String, using parenthesis to point to a specific character. But here we also know that "2" is always the second character, so we do not need to use the 'First attribute.
There is a similar aspect, named Variable_Indexing which can be used to modify a specific character of the string, as we would do with a standard string. So we can write:
type XString is tagged private with Variable_Indexing => Reference; type Character_Reference (Char : not null access Char_Type) is limited null record with Implicit_Dereference => Char; function Reference (Self : aliased in out XString; Index : Positive) return Character_Reference; S : XString := ...; S (2) := 'A'; S (2).Char := 'A'; -- same as above, but not taking advantage of Implicit_Derefence
Although this is simple to use for users of the library, the implementation is actually much more complex than for Constant_Indexing.
First of all, we need to introduce a new type, called a Reference Type (in our example this is the Character_Reference), which basically is a type with an access discriminant and the Implicit_Dereference aspect. This type acts as a safer replacement for an access type (for instance it can't be copied quite as easily). Through this type, we have an access to a specific character, and therefore users can replace that character easily.
But the real difficulty is exactly because of this access. Imaging that we assign the string to another one. At that point, they share the internal buffer when using the copy-on-write mode described in the first post. But if we then modify a character we modify it in both strings, although from the user's point of view these are two different instances ! Here are three examples of code that would fail:
begin S2 := S; S (2) := 'A'; -- Now both S2 and S have 'A' as their second character end; declare R : Character_Reference := S.Reference (2); begin S2 := S; -- now sharing the internal buffer R := 'A'; -- Both S2 and S have 'A' as their second character end; declare R : Character_Reference := S.Reference (2); begin S.Append ("ABCDEF"); R := 'A'; -- might point to deallocated memory end;
Of course, we want our code to be safe, so the implementation needs to prevent the above errors. As soon we take a Reference to a character we must ensure the internal buffer is no longer shared. This fixes the first example above: Since the call to S (2) no longer shares the buffer, we are indeed only modifying S and not S2.
The second example looks similar, but in fact the sharing is done after we have taken the reference. So in fact when we take a Reference we also need to make the internal buffer unshareable. This is done by setting a special value for the refcounting, so that the assignment always make a copy of S's internal buffer, rather than share it.
There is a hidden cost to using Variable_Indexing, since we are no longer able to share the buffer as soon as a reference was taken. This is unfortunately unavoidable, and is one of those examples where convenience of use goes counter to performance...
The third example is much more complex. Here, we have a reference to a character in the string, but when we append data to the string we are likely to reallocate memory. The system could be allocating new memory, copying the data, and then freeing the old one. And our existing reference would point to the memory area that we have just freed !
I have not found a solution here (and C++ implementations seem to have similar limitations). We cannot simply prevent reallocation (i.e. changing the size of the string), since the user might simply have taken a reference, changed the character, and dropped the reference. At that point, reallocating would be safe. For now, we rely on documentation and on the fact that it is some extra work to preserve a reference as we did in the example, it is much more natural to save the index, and then take another reference when needed, as in:
declare Saved : constant Integer := 2; begin S.Append ("ABCDEF"); S (Saved) := 'A'; -- safe end;
Finally, we want to make it easy to traverse a string and read all its characters (and possibly modify them, of course). With the standard strings, one would do:
declare S : String (1 .. 10) := ...; begin for Idx in S'Range loop S (Idx) := 'A'; end loop; for C of S loop Put (C); end loop end; declare S : Unbounded_String := ....; begin for Idx in 1 .. Length (S) loop Replace_Element (S, Idx, 'A'); Put (Element (S, Idx)); end loop; end;
Obviously, the version with unbounded strings is a lot more verbose and less user-friendly. It is possible, since Ada 2012, to provide custom iteration for our own strings, via some predefined aspects. As the gem shows, this is a complex operation...
GNAT provides a much simpler aspect that fulfills all practical needs, namely the Iterable aspect. We use it as such:
type XString is tagged private with Iterable => (First => First, Next => Next, Has_Element => Has_Element, Get => Get); function First (Self : XString) return Positive is (1); function Next (Self : XString; Index : Positive) return Positive is (Index + 1); function Has_Element (Self : XString; Index : Positive) return Boolean is (Index <= Self.Length); function Get (Self : Xstring; Idx : Positive) return Character; -- same one we used for indexing
These functions are especially simple here because we know the range of indexes. So basically they are declared as expression functions and made inline. This means that the loop can be very fast in practice. For consistency with the way the "for...of" loop is used with standard Ada containers, we chose to have Get return the character itself, rather than its index. And now we can use:
S : XString := ...; for C of S loop Put (C); end loop;
As explained, the for..of loop returns the character itself, which means we can't change the string directly. So we need to provide a second iterator that will return the indexes. We do this via a new small type that takes care of everything, as in:
type Index_Range is record Low, High : Natural; end record with Iterable => (First => First, Next => Next, Has_Element => Has_Element, Element => Element); function First (Self : Index_Range) return Positive is (Self.Low); function Next (Self : Index_Range; Index : Positive) return Positive is (Index + 1); function Has_Element (Self : Index_Range; Index : Positive) return Boolean is (Index <= Self.High); function Element (Self : Index_Range; Index : Positive) return Positive is (Index); function Iterate (Self : XString) return Index_Range is ((Low => 1, High => Self.Length)); S : XString := ...; for Idx in S.Iterate loop S (Idx) := 'A'; end loop;
The type Index_Range is very similar to the previous use of the Iterable aspect. We just need to introduce one extra primitive operation for the string, which is used to initiate the iteration.
In fact, we could make Iterate and Index_Range more complex, so that for instance we can iterate on every other character, or on every 'A' in the string, or any kind of iteration scheme. This would of course require a more complex Index_Range, but opens up nice possibilities.
With standard strings, one can get a slice by using notation like:
declare S : String (1 .. 10) := ...; begin Put (S (2 .. 5)); S (2 .. 5) := 'ABCD'; end;
Unfortunately, there is no possible equivalent for GNATCOLL strings, because the range notation ("..") cannot be redefined. We have to use code like:
declare S : XString := ...; begin Put (S.Slice (2, 5)); S.Replace (2, 5, 'ABCD'); end;
It would be nice if Ada had a new aspect to let us map the ".." notation to a function, for instance something like:
type XString is tagged private with Range_Indexing => Slice, -- NOT Ada 2012 Variable_Range_Indexing => Replace; -- NOT Ada 2012 function Slice (Self : XString; Low, High : Integer) return XString; -- Low and High must be of same type, but that type could be anything -- The return value could also be anything procedure Replace (S : in out XString; Low, High : Integer; Replace_With : String); -- Likewise, Low and High must be of same type, but this could be anything. -- and Replace_With could be any type
Perhaps in some future version of the language?
Ada 2012 provides quite a number of new aspects that can be applied to types, and make user API more convenient to use. Some of them are complex to use, and GNAT sometimes provides a simpler version.