VSS (short for Virtual String Subsystem) is a library that provides advanced string and text processing capabilities. It offers a convenient and robust API for working with Unicode text, regardless of its internal representation. Last time we gave an overview of the library. In this article we introduce cursors, which are used to iterate over, retrieve, and modify text data.
In VSS, individual characters of a string cannot be retrieved by integer index. This is deliberate: it rules out a common source of inefficient code. Commonly used text encodings represent a single character as a variable-length sequence of code units, so extracting the character at a given index requires a linear scan of the string data.
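To make the cost concrete, here is a minimal sketch in plain Ada that does not use VSS at all. It locates a character by index in a UTF-8 encoded String. The helper Offset_Of is hypothetical and written only for this illustration; it assumes well-formed UTF-8 input.

```ada
with Ada.Text_IO;

procedure UTF8_Scan_Demo is

   --  UTF-8 encoding of a four-character text: 'a' (1 byte), U+00E9
   --  (2 bytes), U+20AC (3 bytes) and 'z' (1 byte).
   Data : constant String :=
     'a'
     & Character'Val (16#C3#) & Character'Val (16#A9#)
     & Character'Val (16#E2#) & Character'Val (16#82#)
     & Character'Val (16#AC#)
     & 'z';

   --  Hypothetical helper, not part of VSS: position in Item of the first
   --  byte of character number Index (counted from 1), or 0 if Item has
   --  fewer characters.
   function Offset_Of (Item : String; Index : Positive) return Natural is
      Count : Natural := 0;
   begin
      for J in Item'Range loop
         --  Bytes 16#80# .. 16#BF# continue a character; any other byte
         --  starts a new one.
         if Character'Pos (Item (J)) not in 16#80# .. 16#BF# then
            Count := Count + 1;

            if Count = Index then
               return J;
            end if;
         end if;
      end loop;

      return 0;
   end Offset_Of;

begin
   for Index in 1 .. 4 loop
      Ada.Text_IO.Put_Line
        ("character" & Integer'Image (Index)
         & " starts at byte" & Integer'Image (Offset_Of (Data, Index)));
   end loop;
end UTF8_Scan_Demo;
```

The four characters start at bytes 1, 2, 4 and 7. Every lookup walks the data from the beginning, so a loop that fetches characters one by one through such a function does quadratic work. A cursor avoids this by remembering where it is.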
A cursor is used to point to the position of a text element, which can be either a single character or a text segment. A cursor can also point to one of two special positions: before the first element of the text, or after the last element of the text.
Each cursor is attached to a string object, which guarantees the following properties:
The cursor cannot be used to perform operations on another string object.
The cursor keeps track of its position when the attached string object is modified: it either moves so that it still points to the same element, or it invalidates itself. Each cursor type documents how it behaves when the attached string object is modified.
When a string object is destroyed, all cursors associated with it are invalidated.
There are two basic kinds of cursors: iterators and markers. Iterators not only point to a text element, but can also be moved to the next or previous text element. Markers only save the position of a text element for later use.
All cursors provide the Element function, which returns the text element the cursor points to. The return type depends on the type of the cursor: Virtual_Character for character cursors and Virtual_String for text segment cursors. The Has_Element function checks whether the cursor points to a text element at all.
VSS provides iterators for characters, grapheme clusters, words and lines. All iterators are declared as limited private types, so one iterator object cannot be assigned to another. All iterators share a common interface, but each has its own implementation-specific properties.
Iterators can be initialized in two ways:
Immediately when declaring an object using subprograms of Virtual_String type
Before the use of an iterator object with its subprograms
An iterator can be initialized to point to one of the following positions:
before the first text element
on the first text element
at the position pointed by a cursor
on the last text element
after the last text element
The example below declares iterator objects of different types and initializes them in their declarations. This approach is useful when an iterator object is used only within a declare block.
declare
   S  : VSS.Strings.Virtual_String := "some text";
   CI : VSS.Strings.Character_Iterators.Character_Iterator :=
     S.Before_First_Character;
   GI : VSS.Strings.Grapheme_Cluster_Iterators
     .Grapheme_Cluster_Iterator := S.At_First_Grapheme_Cluster;
   WI : VSS.Strings.Word_Iterators.Word_Iterator := S.At_Last_Word;
   LI : VSS.Strings.Line_Iterators.Line_Iterator := S.After_Last_Line;

begin
   null;  --  Use the iterators here.
end;
The second way is useful when an iterator object is declared elsewhere, for example as a record component, or when an existing iterator object is reused.
type All_Iterators is record
   S  : VSS.Strings.Virtual_String;
   CI : VSS.Strings.Character_Iterators.Character_Iterator;
   GI : VSS.Strings.Grapheme_Cluster_Iterators
     .Grapheme_Cluster_Iterator;
   WI : VSS.Strings.Word_Iterators.Word_Iterator;
   LI : VSS.Strings.Line_Iterators.Line_Iterator;
end record;

procedure Initialize (Data : in out All_Iterators) is
begin
   Data.S := "some text";
   Data.CI.Set_Before_First (Data.S);
   Data.GI.Set_At_First (Data.S);
   Data.WI.Set_At_Last (Data.S);
   Data.LI.Set_After_Last (Data.S);
end Initialize;
The Forward and Backward functions move the iterator along the text. Each returns True if the iterator then points to a text element, and False if it has moved to the position before the first or after the last element. As with any cursor, Element returns the text element the iterator points to, and Has_Element checks whether it points to one.
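As a small illustration of these functions, the sketch below walks a string from its first grapheme cluster to its last. It uses only the subprograms named in this article (At_First_Grapheme_Cluster, Has_Element, Element and Forward) and needs the VSS library to build. The procedure name and the sample text are arbitrary.

```ada
with Ada.Wide_Wide_Text_IO;

with VSS.Strings.Conversions;
with VSS.Strings.Grapheme_Cluster_Iterators;

procedure Forward_Demo is
   Text     : VSS.Strings.Virtual_String := "abc";
   Iterator : VSS.Strings.Grapheme_Cluster_Iterators
     .Grapheme_Cluster_Iterator := Text.At_First_Grapheme_Cluster;

begin
   --  The iterator starts on the first element, so that element is
   --  processed before the first call to Forward.
   if Iterator.Has_Element then
      loop
         Ada.Wide_Wide_Text_IO.Put_Line
           (VSS.Strings.Conversions.To_Wide_Wide_String
              (Iterator.Element));

         --  Forward returns False once the iterator has moved past the
         --  last element.
         exit when not Iterator.Forward;
      end loop;
   end if;
end Forward_Demo;
```

The Has_Element check guards against an empty string. An iterator that starts before the first element does not need it, because a plain "while Iterator.Forward loop" then handles the empty case as well.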
Now it is time to solve the problem from the previous article and print the text "F⃗=ma⃗" in reverse order of characters.
with Ada.Wide_Wide_Text_IO;

with VSS.Strings.Conversions;
with VSS.Strings.Grapheme_Cluster_Iterators;

procedure Example_1 is
   Text     : VSS.Strings.Virtual_String := "F⃗=ma⃗";
   Iterator : VSS.Strings.Grapheme_Cluster_Iterators
     .Grapheme_Cluster_Iterator := Text.After_Last_Grapheme_Cluster;

begin
   --  Print the original text.
   Ada.Wide_Wide_Text_IO.Put_Line
     (VSS.Strings.Conversions.To_Wide_Wide_String (Text));

   --  Start after the last grapheme cluster and step backward, printing
   --  each cluster as a whole.
   while Iterator.Backward loop
      Ada.Wide_Wide_Text_IO.Put
        (VSS.Strings.Conversions.To_Wide_Wide_String (Iterator.Element));
   end loop;

   Ada.Wide_Wide_Text_IO.New_Line;
end Example_1;
The solution uses Grapheme_Cluster_Iterator, which iterates over the sequences of characters that together form what the reader perceives as a single grapheme. Using Character_Iterator in the example may give the same result for simple text, but in more complex cases it gives an incorrect result.
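The following sketch shows what goes wrong when the same text is reversed one character at a time. It is plain Ada without VSS: each Wide_Wide_Character stands for one element that a character-level iteration would visit. It assumes that the arrows in the example text are U+20D7 (COMBINING RIGHT ARROW ABOVE).

```ada
with Ada.Wide_Wide_Text_IO;

procedure Naive_Reverse is
   --  U+20D7 COMBINING RIGHT ARROW ABOVE attaches to the preceding letter.
   Arrow : constant Wide_Wide_Character :=
     Wide_Wide_Character'Val (16#20D7#);

   --  The text from the example, spelled out code point by code point.
   Text : constant Wide_Wide_String := 'F' & Arrow & "=ma" & Arrow;

begin
   --  Reverse one code point at a time.
   for J in reverse Text'Range loop
      Ada.Wide_Wide_Text_IO.Put (Text (J));
   end loop;

   Ada.Wide_Wide_Text_IO.New_Line;
end Naive_Reverse;
```

The code points come out in the order U+20D7, a, m, =, U+20D7, F. The first arrow has no letter before it, and the second one now follows "=" instead of "F", so neither vector is rendered as intended. How the output is encoded depends on compiler settings; with GNAT, for example, the -gnatW8 switch selects UTF-8.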
Markers are the other kind of cursor: they save the position of a text element in the text. Like iterators, markers can point to either a single character or a text segment. Unlike iterators, markers cannot be moved over the text elements. However, they are declared as non-limited private types and can be used in assignment statements.
Let’s find the longest substring of the text that starts at the first occurrence of one character and ends at the last occurrence of another.
with Ada.Wide_Wide_Text_IO;

with VSS.Characters;
with VSS.Strings.Conversions;
with VSS.Strings.Character_Iterators;
with VSS.Strings.Markers;

procedure Example_2 is
   use type VSS.Characters.Virtual_Character;

   Text : VSS.Strings.Virtual_String := "ABCDEFCDEF";
   C1   : VSS.Characters.Virtual_Character := 'C';
   C2   : VSS.Characters.Virtual_Character := 'F';
   M1   : VSS.Strings.Markers.Character_Marker;
   M2   : VSS.Strings.Markers.Character_Marker;
   J    : VSS.Strings.Character_Iterators.Character_Iterator :=
     Text.Before_First_Character;

begin
   --  Find the first occurrence of C1 and remember its position.
   while J.Forward loop
      if J.Element = C1 then
         M1 := J.Marker;

         exit;
      end if;
   end loop;

   --  Continue from there and remember the last occurrence of C2.
   while J.Forward loop
      if J.Element = C2 then
         M2 := J.Marker;
      end if;
   end loop;

   Ada.Wide_Wide_Text_IO.Put_Line
     (VSS.Strings.Conversions.To_Wide_Wide_String (Text.Slice (M1, M2)));
end Example_2;
The example uses two markers to save positions of the character iterator, and then uses the Slice subprogram of Virtual_String to obtain the slice of the text between them.
Position of the cursor
Every cursor carries position information relative to the start of the text. A character cursor provides the index of the character it points to, and a text segment cursor provides the indices of the first and last characters of the segment. In addition, a cursor provides the offsets of the start of the first character and of the end of the last character, measured in code units of the UTF-8 and UTF-16 encodings. This helps considerably when interacting with systems that measure text length in code units of one of these encodings.
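To see why a character index and the two kinds of offsets differ, here is a sketch that computes them by hand using only the standard Ada.Strings.UTF_Encoding packages. It does not use the VSS cursor API, whose position subprograms are not named in this article. Offsets are counted from zero here, which is a choice made for this illustration only.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Offsets_Demo is
   package Encoding renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   Text : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#61#),     --  'a': 1 byte, 1 unit
      Wide_Wide_Character'Val (16#E9#),     --  U+00E9: 2 bytes, 1 unit
      Wide_Wide_Character'Val (16#1F600#),  --  U+1F600: 4 bytes, 2 units
      Wide_Wide_Character'Val (16#20AC#));  --  U+20AC: 3 bytes, 1 unit

begin
   for J in Text'Range loop
      declare
         --  Everything that precedes character J.
         Prefix : constant Wide_Wide_String := Text (Text'First .. J - 1);

         --  The expected type selects the UTF-8 or UTF-16 overload.
         UTF8   : constant String := Encoding.Encode (Prefix);
         UTF16  : constant Wide_String := Encoding.Encode (Prefix);

      begin
         Ada.Text_IO.Put_Line
           ("character" & Integer'Image (J)
            & ": UTF-8 offset" & Integer'Image (UTF8'Length)
            & ", UTF-16 offset" & Integer'Image (UTF16'Length));
      end;
   end loop;
end Offsets_Demo;
```

For the fourth character the program reports a UTF-8 offset of 7 and a UTF-16 offset of 4, although only three characters precede it. A system that addresses text in UTF-16 code units would therefore need the value 4 rather than the character index, and a cursor can supply it without re-encoding the text.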
This blog post gave an overview of cursors (iterators and markers) and their use in text processing. Cursors are not limited to the uses described here: many Virtual_String operations take a cursor wherever the position of a text element is needed. One example is the Slice function from the last example, which returns the part of the text between two cursors. Operations of the Virtual_String type will be discussed in the next blog post.