AdaCore Blog

Introduction to VSS library

by Maxim Reznik

Introduction to the VSS library

The VSS (Virtual String Subsystem) library is designed to provide advanced string and text processing capabilities. It offers a convenient and robust API that allows developers to work with Unicode text, regardless of its internal representation. In this article, we introduce the library, explain its purpose, and highlight its usefulness for developers working in this area.

What is the rationale behind creating another library for string processing?

Although Ada offers several standard string types, and the Ada community has developed several string libraries, each one has its own drawbacks or limitations.

The String, Wide_String, and Wide_Wide_String types are indefinite types (i.e., the bounds must be fixed when an object of one of these types is created), which can be inconvenient when storing string values in a variable or a container. The Unbounded_String, Unbounded_Wide_String, and Unbounded_Wide_Wide_String types are definite, but the set of operations they provide is limited, and dot notation is not available for them. A similar limitation applies to the bounded versions of these types.
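
A minimal sketch of this inconvenience, using only the standard library (the procedure and variable names are purely illustrative):

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Indefinite_Demo is
   --  An indefinite type takes its bounds from the initial value,
   --  so Fixed is a String (1 .. 5) for its whole lifetime.
   Fixed : String := "Hello";

   --  A definite type can hold values of any length, but it needs
   --  explicit conversions and prefix-style subprogram calls.
   Flexible : Unbounded_String := To_Unbounded_String ("Hello");
begin
   --  Fixed := "Hello, world";  --  would raise Constraint_Error
   Fixed := "World";  --  only values of length 5 are accepted

   Append (Flexible, ", world");  --  Flexible.Append (...) is not allowed
   Ada.Text_IO.Put_Line (Fixed);
   Ada.Text_IO.Put_Line (To_String (Flexible));
end Indefinite_Demo;
```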

Furthermore, each type is restricted to a specific character set, which necessitates a character set conversion when reading, writing, or interacting with external sources. The String, Bounded_String, and Unbounded_String types support only Latin-1, while the wide types use 2 or 4 bytes per character, even for ASCII.

The most commonly used encoding, UTF-8, is not natively supported by any of these types. The UTF8_String type attempts to fill this gap, but it breaks the user’s expectation that each element is a character and places the burden and complexity of working with the encoding on the user.
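
To make the mismatch between elements and characters concrete, here is a small sketch that relies on the standard Ada.Strings.UTF_Encoding packages, where the corresponding subtype is spelled UTF_8_String. The sample text (Greek alpha, minus sign, Greek beta) and all names are chosen for illustration only:

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure UTF8_Length_Demo is
   --  Three characters: U+03B1, U+2212, U+03B2, written by code point
   --  so that the example does not depend on the source file encoding.
   Text : constant Wide_Wide_String :=
     Wide_Wide_Character'Val (16#03B1#)
     & Wide_Wide_Character'Val (16#2212#)
     & Wide_Wide_Character'Val (16#03B2#);

   --  The UTF-8 form is an ordinary String whose elements are code
   --  units (bytes), not characters.
   Encoded : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Ada.Strings.UTF_Encoding.Wide_Wide_Strings.Encode (Text);
begin
   --  3 characters, but 2 + 3 + 2 = 7 elements in the encoded string.
   Ada.Text_IO.Put_Line ("Characters:    " & Integer'Image (Text'Length));
   Ada.Text_IO.Put_Line ("UTF-8 elements:" & Integer'Image (Encoded'Length));
end UTF8_Length_Demo;
```

Indexing or slicing Encoded by position can therefore cut a character in half, which is exactly the burden that falls on the user.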

In present-day systems, text is not merely a sequence of characters but rather consists of grapheme clusters, words, and lines, as defined by the Unicode standard. As a result, tasks such as comparing and sorting strings (collation) and case conversion cannot be performed solely at the level of individual characters. For example, To_Upper ("ß") = "SS": a single character becomes two. The standard library does not provide support for this.
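
As a rough illustration of that limitation, the sketch below applies the standard Ada.Characters.Handling.To_Upper to a one-character string holding "ß". The procedure name is made up for this example:

```ada
with Ada.Characters.Handling;
with Ada.Text_IO;

procedure Sharp_S_Demo is
   --  LATIN SMALL LETTER SHARP S (U+00DF), given by position to avoid
   --  depending on the source file encoding.
   Sharp_S : constant String := (1 => Character'Val (16#DF#));
   Upper   : constant String := Ada.Characters.Handling.To_Upper (Sharp_S);
begin
   --  To_Upper maps character by character, so the result always has
   --  the same length as the input and can never be "SS".
   Ada.Text_IO.Put_Line ("Input length:" & Integer'Image (Sharp_S'Length));
   Ada.Text_IO.Put_Line ("Upper length:" & Integer'Image (Upper'Length));
   Ada.Text_IO.Put_Line ("Unchanged: " & Boolean'Image (Upper = Sharp_S));
end Sharp_S_Demo;
```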

To overcome these issues, we have developed a new library, VSS. This library:

  • provides a definite type to represent a Unicode character string with a convenient set of operations. A dedicated string vector type with an efficient implementation is also provided.
  • provides an encoding-agnostic API that allows efficient implementations tailored to the platform or application. Currently, the internal representation is always UTF-8. However, other implementations are in progress. Once they are available, the internal representation will depend on the source of the text data and/or the default encoding selected for the specific platform.
  • offers a comprehensive range of string and string vector operations, comparable to those found in other programming languages.
  • takes advantage of more modern language features and technologies, offering improved performance, memory usage, and other benefits.

Getting Started

The library can be found on GitHub and is distributed under the Apache 2.0 license. It can be built using an Ada 2022 compliant compiler. Additionally, it is possible to use Alire to build the library:

$ alr get vss

The VSS library is divided into multiple projects:

  • vss_text.gpr: the base string library with
    • Unicode string, string vector, and byte vector types
    • input/output text streams to read/write files, memory, and stdin/stdout
    • iterators for characters, grapheme clusters, words, and lines
    • encoders and decoders for several of the most popular text encodings
  • vss_regexp.gpr: a regular expression engine
  • vss_json.gpr: a JSON streaming API that allows for efficient parsing and composing of JSON content on the fly
  • vss_xml.gpr: an XML streaming API implemented over the XMLAda or Matreshka libraries
  • vss_xml_templates.gpr: an XML template engine inspired by Zope Page Templates

How about giving the VSS string library a try?

First steps with VSS

We start by creating a sample Alire crate and adding VSS as a dependency:

$ alr init --bin vss_test
$ cd vss_test
$ alr with vss

Then we modify vss_test.adb so that it contains the following code:

pragma Wide_Character_Encoding (UTF8);

with VSS.Strings;
with VSS.Strings.Conversions;
with Ada.Wide_Wide_Text_IO;

procedure Vss_Test is
   Text : VSS.Strings.Virtual_String := "𝛼−𝛽";
begin
   Ada.Wide_Wide_Text_IO.Put_Line
    (VSS.Strings.Conversions.To_Wide_Wide_String (Text));
end Vss_Test;

The first line tells GNAT that the source file is encoded in UTF-8. Then we import the VSS library units and the Ada.Wide_Wide_Text_IO package. The initialization of the Text variable leverages the Ada 2022 syntax for user-defined literals, which hides a call to VSS.Strings.To_Virtual_String for the string literal. Converting back to a Wide_Wide_String, by contrast, requires an explicit call.

To build and execute this code, just run:

$ alr run

After creating the Virtual_String object Text, we can:

  • check whether it is empty: Text.Is_Empty
  • find its length in characters: Text.Character_Length
  • find its hash: Text.Hash
  • check whether it starts (or ends) with another string: Text.Starts_With ("𝛼")
  • do a case conversion: Text.To_Uppercase
  • etc.
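
Put together, these queries might be used as in the following sketch. The procedure name and the sample text are invented for illustration, and the sketch assumes that Ends_With mirrors Starts_With and that Character_Length returns the VSS.Strings.Character_Count type; both are worth checking against the installed VSS version.

```ada
pragma Wide_Character_Encoding (UTF8);

with Ada.Wide_Wide_Text_IO; use Ada.Wide_Wide_Text_IO;
with VSS.Strings;
with VSS.Strings.Conversions;

procedure Query_Demo is
   use VSS.Strings.Conversions;

   Text : constant VSS.Strings.Virtual_String := "straße";
begin
   Put_Line ("Empty:  " & Boolean'Wide_Wide_Image (Text.Is_Empty));

   --  Assumption: Character_Length returns VSS.Strings.Character_Count.
   Put_Line
     ("Length: "
      & VSS.Strings.Character_Count'Wide_Wide_Image (Text.Character_Length));

   Put_Line
     ("Starts: " & Boolean'Wide_Wide_Image (Text.Starts_With ("str")));

   --  Assumption: Ends_With has the same profile as Starts_With.
   Put_Line
     ("Ends:   " & Boolean'Wide_Wide_Image (Text.Ends_With ("ße")));

   --  To_Uppercase returns a new string; Text itself is not modified.
   Put_Line ("Upper:  " & To_Wide_Wide_String (Text.To_Uppercase));
end Query_Demo;
```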

We can modify Text by:

  • appending a string or character: Text.Append ('.');
  • prepending a string or character: Text.Prepend (">>>");
  • erasing its contents: Text.Clear;
  • etc.
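
A short sketch of these modifications in sequence, printing the string after each step. The procedure Modify_Demo and its local helper Show are hypothetical names introduced only for this example:

```ada
pragma Wide_Character_Encoding (UTF8);

with Ada.Wide_Wide_Text_IO;
with VSS.Strings;
with VSS.Strings.Conversions;

procedure Modify_Demo is
   Text : VSS.Strings.Virtual_String := "𝛼−𝛽";

   --  Local helper (illustrative): print the current value of Text.
   procedure Show is
   begin
      Ada.Wide_Wide_Text_IO.Put_Line
        (VSS.Strings.Conversions.To_Wide_Wide_String (Text));
   end Show;
begin
   Text.Append ('.');     --  add a single character at the end
   Show;

   Text.Prepend (">>>");  --  add a string at the beginning
   Show;

   Text.Clear;            --  Text is empty from here on
   Ada.Wide_Wide_Text_IO.Put_Line
     ("Empty: " & Boolean'Wide_Wide_Image (Text.Is_Empty));
end Modify_Demo;
```

Unlike the query functions, these are procedures that change the object in place, so Text has to be declared as a variable rather than a constant.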

We can split Text into a string vector (defined in VSS.String_Vectors):

declare
   List : VSS.String_Vectors.Virtual_String_Vector := Text.Split ('−');
begin
   for Item of List loop
      Ada.Wide_Wide_Text_IO.Put_Line
        (VSS.Strings.Conversions.To_Wide_Wide_String (Item));
   end loop;
end;

The dedicated function Text.Split_Lines splits the text into a string vector using a specified line separator. Conversely, the vector type offers the Join and Join_Lines functions for the reverse operations.
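
The round trip could look roughly like the sketch below. It is written under assumptions: that Split_Lines can be called with default arguments, and that Join_Lines accepts a line terminator such as VSS.Strings.LF. These parameter profiles are not shown in this article and should be verified against the installed VSS version. All other names are illustrative.

```ada
with Ada.Wide_Wide_Text_IO;
with VSS.String_Vectors;
with VSS.Strings;
with VSS.Strings.Conversions;

procedure Lines_Demo is
   LF : constant Wide_Wide_Character := Wide_Wide_Character'Val (10);

   --  Two lines of text separated by a line feed.
   Source : constant Wide_Wide_String := "first" & LF & "second";
   Text   : constant VSS.Strings.Virtual_String :=
     VSS.Strings.To_Virtual_String (Source);

   --  Assumption: Split_Lines has defaults for its other parameters.
   Lines : constant VSS.String_Vectors.Virtual_String_Vector :=
     Text.Split_Lines;

   --  Assumption: Join_Lines takes the terminator to put after each line.
   Joined : constant VSS.Strings.Virtual_String :=
     Lines.Join_Lines (VSS.Strings.LF);
begin
   for Item of Lines loop
      Ada.Wide_Wide_Text_IO.Put_Line
        (VSS.Strings.Conversions.To_Wide_Wide_String (Item));
   end loop;

   Ada.Wide_Wide_Text_IO.Put
     (VSS.Strings.Conversions.To_Wide_Wide_String (Joined));
end Lines_Demo;
```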

Conclusion

The VSS library provides advanced string and text processing capabilities. It offers an API that allows developers to work with Unicode text, regardless of its internal representation. The library provides additional functionality beyond Ada’s standard string types and other string libraries developed by the community. It provides a definite type to represent a Unicode character string, with a comprehensive range of operations comparable to those found in other programming languages. Additionally, it is encoding-agnostic, allowing efficient implementations tailored to the platform or application. The library is divided into multiple projects, with each project catering to a specific need.

The VSS library is distributed under the Apache 2.0 license, and it can be built using an Ada 2022 compliant compiler. With its efficient implementation, modern language features and technologies, and support for tasks such as comparing and sorting strings, the VSS library is a useful tool for developers working with strings and text processing.

In subsequent articles, we will explore more advanced concepts such as cursors, streams, encoders/decoders, and so on. Stay tuned!

Exercise for the reader: Can you write a function to reverse the string "F⃗=ma⃗" (F U+20D7 = m a U+20D7) in preparation for the next article?

About Maxim Reznik

Maxim Reznik is a Software Engineer and Consultant at AdaCore. At AdaCore, he works on GNAT Studio, the Ada VS Code extension, and the Language Server Protocol implementation.