AdaCore Blog

VSS: Cursors, Iterators and Markers

by Vadim Godunko

The VSS (Virtual String Subsystem) library is designed to provide advanced string and text processing capabilities. It offers a convenient and robust API that allows developers to work with Unicode text, regardless of its internal representation. The previous article gave an overview of the library; this article introduces cursors, which are used to iterate over, retrieve, and modify text data.

Cursors

In VSS, individual characters of a string cannot be retrieved using integer indices. This is a deliberate choice that eliminates the possibility of writing inefficient code: commonly used text encodings use variable-length sequences of code units to represent a single character, so a linear scan of the string data is needed to extract the character at a specified index.
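The variable-length point can be illustrated without VSS at all. The sketch below uses only the standard Ada.Strings.UTF_Encoding packages; the procedure name and the sample characters are chosen for illustration. It encodes four characters to UTF-8 one at a time and reports how many code units each one needs (one, two, three and four respectively), which is why the byte offset of the N-th character cannot be computed without scanning.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure Code_Unit_Demo is

   package UTF renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Characters are given by code point so that the example does not
   --  depend on the encoding of the source file.

   Samples : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#0061#),    --  'a'
      Wide_Wide_Character'Val (16#00E9#),    --  e with acute accent
      Wide_Wide_Character'Val (16#20AC#),    --  euro sign
      Wide_Wide_Character'Val (16#1F600#));  --  an emoji

begin
   for C of Samples loop
      declare
         --  Encode a one-character string to UTF-8 (no BOM by default).
         Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
           UTF.Encode (Wide_Wide_String'(1 => C));

      begin
         Ada.Text_IO.Put_Line
           ("code point" & Integer'Image (Wide_Wide_Character'Pos (C))
            & " takes" & Integer'Image (Encoded'Length)
            & " UTF-8 code unit(s)");
      end;
   end loop;
end Code_Unit_Demo;
```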

To point to the position of a text element (which can be either a single character or a text segment), the cursor mechanism is used. A cursor can also point to one of two special positions: before the first element of the text, or after the last element of the text.

Each cursor is attached to the string object, which guarantees additional properties:

  • The cursor cannot be used to perform operations on another string object.

  • The cursor keeps track of its position when the attached string object is modified, either by adjusting its position so that it still points to the same element, or by invalidating itself. Each type of cursor documents its behavior when the attached string object is modified.

  • When a string object is destroyed, all cursors associated with it are invalidated.

There are two basic kinds of cursors: iterators and markers. Iterators not only point to a text element, but can also be moved to the next or previous text element. Markers only save the position of a text element for later use.

All cursors provide the Element function to obtain the text element pointed to by the cursor. The type of the return value depends on the type of the cursor: Virtual_Character for character cursors, or Virtual_String for text segment cursors. The Has_Element function checks whether the cursor points to a text element.
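As a rough sketch of the two return types, the fragment below uses a character iterator, whose Element is a Virtual_Character, and a grapheme cluster iterator, whose Element is a Virtual_String. The procedure name, the sample text and the space-counting task are invented for illustration; the iterator subprograms are the ones that appear in the examples later in this article. Like those examples, it assumes a compiler mode in which a string literal can initialize a Virtual_String.

```ada
with Ada.Wide_Wide_Text_IO;
with VSS.Characters;
with VSS.Strings.Character_Iterators;
with VSS.Strings.Conversions;
with VSS.Strings.Grapheme_Cluster_Iterators;

procedure Element_Sketch is

   use type VSS.Characters.Virtual_Character;

   Text   : VSS.Strings.Virtual_String := "some text";
   CI     : VSS.Strings.Character_Iterators.Character_Iterator :=
     Text.Before_First_Character;
   GI     : VSS.Strings.Grapheme_Cluster_Iterators
              .Grapheme_Cluster_Iterator :=
     Text.At_First_Grapheme_Cluster;
   Spaces : Natural := 0;

begin
   --  A cursor at the position before the first character does not
   --  point to any text element.

   if not CI.Has_Element then
      Ada.Wide_Wide_Text_IO.Put_Line ("CI has no element yet");
   end if;

   --  Element of a character cursor is a Virtual_Character, so it can
   --  be compared with a character literal.

   while CI.Forward loop
      if CI.Element = ' ' then
         Spaces := Spaces + 1;
      end if;
   end loop;

   Ada.Wide_Wide_Text_IO.Put_Line
     ("spaces:" & Natural'Wide_Wide_Image (Spaces));

   --  Element of a text segment cursor is a Virtual_String.

   if GI.Has_Element then
      Ada.Wide_Wide_Text_IO.Put_Line
        (VSS.Strings.Conversions.To_Wide_Wide_String (GI.Element));
   end if;
end Element_Sketch;
```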

Iterators

VSS provides iterators for characters, grapheme clusters, words and lines. All iterators are declared as limited private types, so one iterator object cannot be assigned to another. All iterators provide a common interface, but each iterator has its own implementation properties.

Iterators can be initialized in two ways:

  • At the point of declaration, using subprograms of the Virtual_String type

  • Later, before the iterator object is used, using the iterator's own subprograms

An iterator can be initialized to point to one of the following positions:

  • before the first text element

  • on the first text element

  • at the position pointed by a cursor

  • on the last text element

  • after the last text element

Below is an example that declares iterator objects of different types and initializes them at the point of declaration. This scheme is useful when an iterator object is used only within a declare block.

declare
   S  : VSS.Strings.Virtual_String := "some text";
   CI : VSS.Strings.Character_Iterators.Character_Iterator :=
     S.Before_First_Character;
   GI : VSS.Strings.Grapheme_Cluster_Iterators
          .Grapheme_Cluster_Iterator :=
     S.At_First_Grapheme_Cluster;
   WI : VSS.Strings.Word_Iterators.Word_Iterator :=
     S.At_Last_Word;
   LI : VSS.Strings.Line_Iterators.Line_Iterator :=
     S.After_Last_Line;

begin
   null;  --  use the iterators here
end;

Another way to initialize iterator objects is useful when iterator objects are declared outside of the declarative region, or when an iterator object is reused.

type All_Iterators is record
   S  : VSS.Strings.Virtual_String;
   CI : VSS.Strings.Character_Iterators.Character_Iterator;
   GI : VSS.Strings.Grapheme_Cluster_Iterators
          .Grapheme_Cluster_Iterator;
   WI : VSS.Strings.Word_Iterators.Word_Iterator;
   LI : VSS.Strings.Line_Iterators.Line_Iterator;
end record;

procedure Initialize (Data : in out All_Iterators) is
begin
   Data.S := "some text";
   Data.CI.Set_Before_First (Data.S);
   Data.GI.Set_At_First (Data.S);
   Data.WI.Set_At_Last (Data.S);
   Data.LI.Set_After_Last (Data.S);
end Initialize;

The Forward and Backward functions move the iterator along the text. They return True if the iterator then points to a text element, and False if it points to the position before or after the text. The Element function returns the text element pointed to by the iterator, and the Has_Element function checks whether the iterator points to a text element.
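The following sketch walks a short text in both directions with one iterator. The procedure name and the sample text are invented for illustration. The iterator starts on the first grapheme cluster, so the first loop tests Has_Element before reading and leaves the loop when Forward returns False; at that point the iterator is after the last element, and the usual "while Backward" idiom walks back to the start.

```ada
with Ada.Wide_Wide_Text_IO;
with VSS.Strings.Conversions;
with VSS.Strings.Grapheme_Cluster_Iterators;

procedure Walk_Sketch is
   Text : VSS.Strings.Virtual_String := "abc";
   J    : VSS.Strings.Grapheme_Cluster_Iterators
            .Grapheme_Cluster_Iterator :=
     Text.At_First_Grapheme_Cluster;

begin
   --  Forward pass: the iterator already points to the first element,
   --  so the element is read before the iterator is moved.

   if J.Has_Element then
      loop
         Ada.Wide_Wide_Text_IO.Put_Line
           (VSS.Strings.Conversions.To_Wide_Wide_String (J.Element));

         exit when not J.Forward;
      end loop;
   end if;

   --  Backward pass: the iterator is now after the last element, so
   --  each successful call of Backward yields the previous element.

   while J.Backward loop
      Ada.Wide_Wide_Text_IO.Put_Line
        (VSS.Strings.Conversions.To_Wide_Wide_String (J.Element));
   end loop;
end Walk_Sketch;
```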


Example

Now it is time to solve the problem from the previous article and print the text "F⃗=ma⃗" in reverse order of characters.

with Ada.Wide_Wide_Text_IO;
with VSS.Strings.Conversions;
with VSS.Strings.Grapheme_Cluster_Iterators;

procedure Example_1 is
   Text     : VSS.Strings.Virtual_String := "F⃗=ma⃗";
   Iterator : VSS.Strings.Grapheme_Cluster_Iterators
                .Grapheme_Cluster_Iterator :=
     Text.After_Last_Grapheme_Cluster;

begin
   Ada.Wide_Wide_Text_IO.Put_Line
     (VSS.Strings.Conversions.To_Wide_Wide_String (Text));

   while Iterator.Backward loop
      Ada.Wide_Wide_Text_IO.Put
        (VSS.Strings.Conversions.To_Wide_Wide_String
           (Iterator.Element));
   end loop;

   Ada.Wide_Wide_Text_IO.New_Line;
end Example_1;

The solution uses Grapheme_Cluster_Iterator, which iterates over sequences of characters that together form what the reader perceives as a single grapheme. Using Character_Iterator in this example may give a similar result for simple text, but it gives an incorrect result in more complex cases, for instance when combining characters become detached from their base characters.
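To make the difference between the two iterators visible, the sketch below counts the elements each one sees in the same text. The procedure name is invented for illustration, and the source file is assumed to be compiled with a UTF-8 source encoding, as in Example_1. Because each arrow is a separate combining character attached to the preceding letter, the character count should come out larger than the grapheme cluster count (six versus four for this text).

```ada
with Ada.Text_IO;
with VSS.Strings.Character_Iterators;
with VSS.Strings.Grapheme_Cluster_Iterators;

procedure Count_Sketch is
   Text       : VSS.Strings.Virtual_String := "F⃗=ma⃗";
   CI         : VSS.Strings.Character_Iterators.Character_Iterator :=
     Text.Before_First_Character;
   GI         : VSS.Strings.Grapheme_Cluster_Iterators
                  .Grapheme_Cluster_Iterator :=
     Text.After_Last_Grapheme_Cluster;
   Characters : Natural := 0;
   Clusters   : Natural := 0;

begin
   --  Count characters (code points) from the start of the text.

   while CI.Forward loop
      Characters := Characters + 1;
   end loop;

   --  Count grapheme clusters from the end of the text; the direction
   --  does not change the count.

   while GI.Backward loop
      Clusters := Clusters + 1;
   end loop;

   Ada.Text_IO.Put_Line ("characters:" & Natural'Image (Characters));
   Ada.Text_IO.Put_Line ("grapheme clusters:" & Natural'Image (Clusters));
end Count_Sketch;
```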

Markers

Markers are another type of cursor; they save the position of a text element in the text. Like iterators, markers can point to either a single character or a text segment. Unlike iterators, markers cannot move over the text elements. However, they are declared as non-limited private types and can be used in assignment statements.

Example

Let’s find the longest substring of the text that starts at one given character and ends at another.

with Ada.Wide_Wide_Text_IO;
with VSS.Characters;

with VSS.Strings.Conversions;
with VSS.Strings.Character_Iterators;
with VSS.Strings.Markers;

procedure Example_2 is

   use type VSS.Characters.Virtual_Character;

   Text : VSS.Strings.Virtual_String := "ABCDEFCDEF";

   C1  : VSS.Characters.Virtual_Character := 'C';
   C2  : VSS.Characters.Virtual_Character := 'F';

   M1  : VSS.Strings.Markers.Character_Marker; 
   M2  : VSS.Strings.Markers.Character_Marker;

   J    : VSS.Strings.Character_Iterators.Character_Iterator :=
     Text.Before_First_Character;

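begin
   --  Find the first occurrence of C1 and remember its position.

   while J.Forward loop
      if J.Element = C1 then
         M1 := J.Marker;

         exit;
      end if;
   end loop;

   --  Continue to the end of the text; M2 is overwritten on every
   --  match, so it ends up at the last occurrence of C2.

   while J.Forward loop
      if J.Element = C2 then
         M2 := J.Marker;
      end if;
   end loop;

   Ada.Wide_Wide_Text_IO.Put_Line
     (VSS.Strings.Conversions.To_Wide_Wide_String
        (Text.Slice (M1, M2)));
end Example_2;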

The example uses two markers to save positions of the character iterator, and uses the Slice subprogram of Virtual_String to obtain the slice of the text between them.

Position of the cursor

Every cursor carries position information relative to the start of the text. A character cursor provides the index of the character it points to, and a text segment cursor provides the indices of its first and last characters. In addition, the offsets of the start of the first character and of the end of the last character can be obtained in code units of the UTF-8 and UTF-16 encodings. This helps a lot when interacting with systems that measure the length of text data as a number of code units of the corresponding encoding.
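A minimal sketch of querying this information is shown below. The function names First_Character_Index, First_UTF8_Offset and First_UTF16_Offset are assumptions about the cursor interface declared in VSS.Strings.Cursors and are not taken from this article, so check them against the package specification of the VSS version in use. The procedure name and the sample text are also invented for illustration, and the source file is assumed to use a UTF-8 source encoding.

```ada
with Ada.Text_IO;
with VSS.Strings.Character_Iterators;

procedure Position_Sketch is
   Text : VSS.Strings.Virtual_String := "aé€";
   J    : VSS.Strings.Character_Iterators.Character_Iterator :=
     Text.Before_First_Character;

begin
   while J.Forward loop
      --  Assumed API: the three position queries below are believed to
      --  be provided by every cursor; they return integer types, which
      --  are converted to Integer here only for printing.

      Ada.Text_IO.Put_Line
        ("index:" & Integer'Image (Integer (J.First_Character_Index))
         & "  UTF-8 offset:"
         & Integer'Image (Integer (J.First_UTF8_Offset))
         & "  UTF-16 offset:"
         & Integer'Image (Integer (J.First_UTF16_Offset)));
   end loop;
end Position_Sketch;
```

For text with non-ASCII characters, the UTF-8 offset advances faster than the character index, which is exactly the mismatch that has to be bridged when talking to systems that count code units.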

Conclusions

This blog post gives an overview of cursors (iterators and markers) and their use in text processing. The use of cursors is not limited to what is described here: they are used in many operations of the Virtual_String type where the position of a text element in the text has to be provided. One example is the Slice function used in the last example to obtain the slice of the text between two cursors. Operations of the Virtual_String type will be discussed in the next blog post.
