New strings package in GNATCOLL
by Emmanuel Briot –
New strings package in GNATCOLL
GNATCOLL has recently acquired two new packages, namely GNATCOLL.Strings and GNATCOLL.Strings_Impl. The latter is a generic package, one instance of which is provided as GNATCOLL.Strings.
But why a new strings package? Ada already has quite a lot of ways to represent and manipulate strings, is a new one needed?
This new package is an attempt at finding a middle ground between the standard String type (which is efficient but inflexible) and unbounded strings (which are flexible, but could be more efficient).
GNATCOLL.Strings therefore provides strings (named XString, as in extended-strings) that can grow as needed (up to Natural'Last, like standard strings), yet are faster than unbounded strings. They also come with an extended API, which includes all primitive operations from unbounded strings, in addition to some subprograms inspired from GNATCOLL.Utils and the Python and C++ programming languages.
Small string optimization
GNATCOLL.Strings uses a number of tricks to improve on the efficiency. The most important one is to limit the number of memory allocations. For this, we use a trick similar to what all C++ implementations do nowadays, namely the small string optimization.
The idea is that when a string is short, we can avoid all memory allocations altogether, while still keeping the string type itself small. We therefore use an Unchecked_Union, where a string can be viewed in two ways:
Small string
[f][s][ characters of the string 23 bytes ]
f = 1 bit for a flag, set to 0 for a small string
s = 7 bits for the size of the string (i.e. number of significant
characters in the array)
Big string
[f][c ][size ][data ][first ][pad ]
f = 1 bit for a flag, set to 1 for a big string
c = 31 bits for half the capacity. This is the size of the buffer
pointed to by data, and which contains the actual characters of
the string.
size = 32 bits for the size of the string, i.e. the number of
significant characters in the buffer.
data = a pointer (32 or 64 bits depending on architecture)
first = 32 bits, see the handling of substrings below
pad = 32 bits on a 64 bits system, 0 otherwise.
This is because of alignment issues.
So in the same amount of memory (24 bytes), we can either store a small string of 23 characters or less with no memory allocations, or a big string that requires allocation. In a typical application, most strings are smaller than 23 bytes, so we are saving very significant time here.
This representation has to work on both 32 bits systems and 64 bits systems, so we have careful representation clauses to take this into account. It also needs to work on both big-endian and little-endian systems. Thanks to Ada's representation clauses, this one in fact relatively easy to achieve (well, okay, after trying a few different approaches to emulate what's done in C++, and that did not work elegantly). In fact, emulating via bit-shift operations ended up with code that was less efficient than letting the compiler do it automatically because of our representation clauses.
Character types
Applications should be able to handle the whole set of unicode characters. In Ada, these are represented as the Wide_Character type, rather than Character, and stored on 2 bytes rather than 1. Of course, for a lot of applications it would be wasting memory to always store 2 bytes per character, so we want to give flexibility to users here.
So the package GNATCOLL.Strings_Impl is a generic. It has several formal parameters, among which:
* Character_Type is the type used to represent each character. Typically, it will be Character, Wide_Character, or even possibly Wide_Wide_Character. It could really be any scalar type, so for instance we could use this package to represent DNA with its 4-valued nucleobases.
* Character_String is an array of these characters, as would be represented in Ada. It will typically be a String or a Wide_String. This type is used to make this package work with the rest of the Ada world.
Note about unicode: we could also always use a Character, and use UTF-8 encoding internally. But this makes all operations (from taking the length to moving the next character) slower, and more fragile. We must make sure not to cut a string in the middle of a multi-byte sequence. Instead, we manipulate a string of code points (in terms of unicode). A similar choice is made in Ada (String vs Wide_String), Python and C++.
Configuring the size of small strings
The above is what is done for most C++ implementations nowadays. The maximum 23 characters we mentioned for a small string depends in fact on several criteria, which impact the actual maximum size of a small string:
* on 32 bits system, the size of the big string is 16 bytes, so the maximum size of a small string is 15 bytes.
* on 64 bits system, the size of the big string is 24 bytes, so the maximum size of a small string is 23 bytes.
* If using a Character as the character type, the above are the actual number of characters in the string. But if you are using a Wide_Character, this is double the maximum length of the string, so a small string is either 7 characters or 11 characters long.
This is often a reasonable number, and given that applications mostly use small strings, we are already saving a lot of allocations. However, in some cases we know that the typical length of strings in a particular context is different. For instance, GNATCOLL.Traces builds messages to output in the log file. Such messages will typically be at most 100 characters, although they can of course be much larger sometimes.
We have added one more formal parameter to GNATCOLL.Strings_Impl to control the maximum size of small strings. If for instance we decide that a "small" string is anywhere from 1 to 100 characters long (i.e. we do not want to allocate memory for those strings), it can be done via this parameter. Of course, in such cases the size of the string itself becomes much larger.
In this example it would be 101 bytes long, rather than the 24 bytes. Although we are saving on memory allocations, we are also spending more time copying data when the string is passed around, so you'll need to measure the performance here.
The maximum size for the small string is 127 bytes however, because this size and the 1-bit flag need to fit in 1 bytes in the representation clauses we showed above. We tried to make this more configurable, but this makes things significantly more complex between little-endian and big-endian systems, and having large "small" strings would not make much sense in terms of performance anyway.
Typical C++ implementations do not make this small size configurable.
Task safety
Just like unbounded strings, the strings in this package are not thread safe. This means that you cannot access the same string (read or write) from two different threads without somehow protecting the access via a protected type, locks,...
In practice, sharing strings would rarely be done, so if the package itself was doing its own locking we would end up with very bad performance in all cases, for a few cases where it might prove useful.
As we'll discuss below, it is possible to use two different strings that actually share the same internal buffer, from two different threads. Since this is an implementation detail, this package takes care of guaranteeing the integrity of the shared data in such a case.
Copy on write
There is one more formal parameter, to configure whether this package should use copy-on-write or not. When copy on write is enabled, you can have multiple strings that internally share the same buffer of characters. This means that assigning a string to another one becomes a reasonably fast operation (copy a pointer and increment a refcount). Whenever the string is modified, a copy of the buffer is done so that other copies of the same string are not impacted.
But in fact, there is one drawback with this scheme: we need reference counting to know when we can free the shared data, or when we need to make a copy of it. This reference counting must be thread safe, since users might be using two different strings from two different threads, but they share data internally.
Thus the reference counting is done via atomic operations, which have some impact on performance. Since multiple threads try to access the same memory addresses, this is also a source of contention in multi-threaded applications.
For this reason, the current C++ standard prevents the use of copy-on-write for strings.
In our cases, we chose to make this configurable in the generic, so that users can decide whether to pay the cost of the atomic operations, but save on the number of memory allocations and copy of the characters. Sometimes it is better to share the data, but sometimes it is better to systematically copy it. Again, actual measurements of the performance are needed for your specific application.
Growth strategy
When the current size of the string becomes bigger than the available allocated memory (for instance because you are appending characters), this package needs to reallocate memory. There are plenty of strategies here, from allocating only the exact amount of memory needed (which saves on memory usage, but is very bad in terms of performance), to doubling the current size of the string until we have enough space, as currently done in the GNAT unbounded strings implementation.
The latter approach would therefore allocate space for two characters, then for 4, then 8 and so on.
This package has a slightly different strategy. Remember that we only start allocating memory past the size of small strings, so we will for instance first allocate 24 bytes. When more memory is needed, we multiply this size by 1.5, which some researchers have found to be a good comprise between waste of memory and number of allocations. For very large strings, we always allocate multiples of the memory page size (4096 bytes), since this is what the system will make available anyway. So we will basically allocate the following: 24, 36, 54, 82, 122,...
An additional constraint is that we only ever allocate even number of bytes. This is called the capacity of the string. In the layout of the big string, as shown above, we store half that capacity, which saves one bit that we use for the flag.
Growing memory
This package does not use the Ada new operator. Instead, we use functions from System.Memory directly, in part so that we can use the realloc system call. This is much more efficient when we need to grow the internal buffer, since in most cases we won't have to copy the characters at all. Saving on those additional copies has a significant impact, as we'll see in the performance measurements below.
Substrings
One other optimization performed by this package (which is not done for unbounded strings or various C++ implementations) is to optimize substrings when also using copy-on-write.
We simply store the index within the shared buffer of the first character of the string , instead of always starting at one.
From the user's point of view, this is an implementation detail. Strings are always indexed from 1, and internally we convert to an actual position in the buffer. This means that if we need to reallocate the buffer, for instance when the string is modified, we transparently change the index of the first character, but the indexes the user was using are still valid.
This results in very significant savings, as shown below in the timings for Trim for instance. Also, we can do an operation like splitting a string very efficiently.
For instance, the following code doesn't allocate any memory, beside setting the initial value of the string. It parses a file containing some "key=value" lines, with optional spaces, and possibly empty lines:
declare
S, Key, Value : XString;
L : XString_Array (1 .. 2);
Last : Natural;
begin
S.Set (".......");
-- Get each line
for Line of S.Split (ASCII.LF) loop
-- Split into at most two substrings
Line.Split ('=', Into => L, Last => Last);
if Last = 2 then
Key := L (1);
Key.Trim; -- Removing leading and trailing spaces
Value := L (2);
Value.Trim;
end if;
end loop;
end;
Conclusion
We use various tricks to improve the performance over unbounded strings. From not allocating any memory when a string is 24 bytes (or a configurable limit) or less, to growing the string as needed via a careful growth strategy, to sharing internal data until we need to modify it, these various techniques combine to provide a fast and flexible implementation.
Here are some timing experiments done on a laptop (multiple operating systems lead to similar results). The exact timings are irrelevant, so the results are given in percentage of what the unbounded string takes to performance similar operations.
We configured the GNATCOLL.Strings package in different ways, either with or without copy-on-write, and with a small string size from the default 0..23 bytes or a larger 0..127 bytes.
Setting a small string multiple times (e.g. Set_Unbounded_String)
unbounded_string = 100 %
xstring-23 without copy on write = 11 %
xstring-23 with copy on write = 12 %
xstring-127 with copy on write = 17 %
Here we see that not doing any memory allocation makes XString
much faster than unbounded string. Most of the time is spent
copying characters around, via memcpy.
Setting a large string multiple times (e.g. Set_Unbounded_String)
unbounded_string = 100 %
xstring-23 without copy on write = 41 %
xstring-23 with copy on write = 50 %
xstring-127 with copy on write = 32 %
Here, XString apparently proves better are reusing already
allocated memory, although the timings are similar when creating
new strings instead. Most of the difference is probably related to the
use of realloc instead of alloc.
Assigning small strings (e.g. S2 := S1)
unbounded_string = 100 %
xstring-23 without copy on write = 31 %
xstring-23 with copy on write = 27 %
xstring-127 with copy on write = 57 %
Assigning large strings (e.g. S2 := S1)
unbounded_string = 100 %
xstring-23 without copy on write = 299 %
xstring-23 with copy on write = 63 %
xstring-127 with copy on write = 60 %
When not using copy-on-write (which unbounded strings do), we need
to reallocate memory, which shows on the second line.
Appending to large string (e.g. Append (S, "...."))
unbounded_string = 100 %
xstring-23 without copy on write = 39 %
xstring-23 with copy on write = 48 %
same, with tasking = 142 %
xstring-127 with copy on write = 49 %
When we use tasking, XStrings use atomic operations for the reference
counter, which slows things down. They become slower than unbounded
strings, because the latter in fact have a bug when using two
different strings from two different threads, and they share data
(they try to save on an atomic operation... this bug is being worked
on).
Removing leading and trailing spaces (e.g. Trim (S, Ada.Strings.Both))
unbounded_string = 100 %
xstring-23 without copy on write = 50 %
xstring-23 with copy on write = 16 %
xstring-127 with copy on write = 18 %
Here we see the benefits of the substrings optimization, which shares
data for the substrings.