Introduction to the VSS library
The VSS (Virtual String Subsystem) library provides advanced string and text processing capabilities. It offers a convenient and robust API that lets developers work with Unicode text regardless of its internal representation. In this article, we introduce the library, explain its purpose, and show why it is useful for developers who process strings and text in Ada.
What is the rationale behind creating another library for string processing?
Although Ada offers several standard string types, and there are several string libraries developed by the Ada community, each one has its own drawbacks or limitations.
The String, Wide_String, and Wide_Wide_String types are indefinite types (i.e., it is necessary to constrain their bounds when creating an object of these types), which can be inconvenient when storing string values in a variable or a container. The Unbounded_String, Unbounded_Wide_String, and Unbounded_Wide_Wide_String types are definite, but the set of operations they provide is limited, and dot notation is not available for them. A similar limitation applies to the bounded versions of these types.
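The difference between the indefinite and unbounded standard types can be illustrated with a minimal sketch that uses only the standard library. The procedure and variable names are made up for this illustration.

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Standard_Strings_Demo is
   --  Indefinite type: the bounds (1 .. 5) are fixed by the initial value.
   Fixed : String := "Hello";

   --  Definite type: the object can later hold a value of any length.
   Flexible : Unbounded_String := To_Unbounded_String ("Hello");
begin
   Fixed := "HELLO";  --  same length, so the assignment is accepted
   --  Fixed := "Hello, world";  --  would raise Constraint_Error

   Flexible := To_Unbounded_String ("Hello, world");  --  any length is fine

   --  Operations are called in prefix form only: Unbounded_String is not
   --  a tagged type, so Flexible.Length is not legal in standard Ada.
   Ada.Text_IO.Put_Line (Natural'Image (Length (Flexible)));
   Ada.Text_IO.Put_Line (Fixed & " / " & To_String (Flexible));
end Standard_Strings_Demo;
```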
Furthermore, each type is restricted to a specific character set, necessitating the conversion of the character set when reading, writing, or interacting with external sources. String, Bounded_String, and Unbounded_String types only support Latin‑1, while wide types use 2 or 4 bytes per character, even for ASCII.
The most commonly used encoding, UTF‑8, is not natively supported by any of these types. The UTF8_String type attempts to fill this gap, but it breaks the user’s expectation that each element is a character and places the burden and complexity of working with the encoding on the user.
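To make the mismatch between elements and characters concrete, here is a small sketch based on the standard Ada.Strings.UTF_Encoding packages. It assumes that the UTF8_String type mentioned above corresponds to the standard UTF_8_String subtype. The sample text uses the plain Greek letters U+03B1 and U+03B2 around a minus sign (U+2212).

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;

procedure UTF8_Length_Demo is
   use Ada.Strings.UTF_Encoding;

   --  Three characters: alpha, minus sign, beta.
   Text : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (16#03B1#),
      Wide_Wide_Character'Val (16#2212#),
      Wide_Wide_Character'Val (16#03B2#));

   --  The same text encoded as UTF-8: one element per code unit (byte).
   Encoded : constant UTF_8_String := Wide_Wide_Strings.Encode (Text);
begin
   --  Prints "Characters: 3"
   Ada.Text_IO.Put_Line ("Characters:" & Natural'Image (Text'Length));

   --  Prints "UTF-8 elements: 7" (2 + 3 + 2 bytes)
   Ada.Text_IO.Put_Line
     ("UTF-8 elements:" & Natural'Image (Encoded'Length));
end UTF8_Length_Demo;
```

Indexing or slicing Encoded therefore works on bytes, and it is up to the caller to avoid cutting a character in the middle.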
In present-day systems, text is not merely a sequence of characters but rather consists of grapheme clusters, words, and lines, as defined by the Unicode standard. As a result, tasks such as comparing and sorting strings (collation) and case conversion cannot be performed solely at the level of individual characters. For example, To_Upper ("ß") = "SS". The standard library does not provide support for this.
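The limitation can be observed with the standard Ada.Characters.Handling package, which maps characters one by one. The following sketch (procedure name chosen for illustration) shows that the result can never be longer than the input, so "ß" is returned unchanged.

```ada
with Ada.Characters.Handling;
with Ada.Characters.Latin_1;
with Ada.Text_IO;

procedure Sharp_S_Demo is
   --  A one-character Latin-1 string containing the German sharp s.
   Sharp_S : constant String :=
     (1 => Ada.Characters.Latin_1.LC_German_Sharp_S);

   --  Character-by-character mapping: the length cannot change.
   Upper : constant String := Ada.Characters.Handling.To_Upper (Sharp_S);
begin
   --  Prints "Length after To_Upper: 1"
   Ada.Text_IO.Put_Line
     ("Length after To_Upper:" & Natural'Image (Upper'Length));

   --  Prints "Unchanged: TRUE", because the sharp s has no upper case
   --  form in Latin-1.
   Ada.Text_IO.Put_Line ("Unchanged: " & Boolean'Image (Upper = Sharp_S));
end Sharp_S_Demo;
```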
To overcome these issues, we have developed a new library, VSS. This library:
- provides a definite type to represent a Unicode character string with a convenient set of operations. A dedicated string vector type with an efficient implementation is also provided.
- provides an encoding-agnostic API that allows efficient implementations tailored to the platform or application. Currently, the internal representation is always UTF‑8, but other implementations are in progress. Once they are available, the internal representation will depend on the source of the text data and/or the default encoding selected for the specific platform.
- offers a comprehensive range of string and string vector operations, comparable to those found in other programming languages.
- takes advantage of more modern language features and technologies, offering improved performance, memory usage, and other benefits.
Getting Started
The library can be found on GitHub and is distributed under the Apache 2.0 license. It can be built using an Ada 2022 compliant compiler. Alternatively, it can be obtained and built with Alire:
$ alr get vss
The VSS library is divided into multiple projects:
vss_text.gpr: the base string library, with
- Unicode string, string vector, byte vector types
- input/output text streams to read/write files, memory and stdin/stdout
- iterators for characters, grapheme clusters, words and lines
- encoders and decoders for several of the most popular text encodings
vss_regexp.gpr: a regular expression engine
vss_json.gpr: a JSON streaming API that allows for efficient parsing and composing of JSON content on the fly
vss_xml.gpr: an XML streaming API implemented over the XMLAda or Matreshka libraries
vss_xml_templates.gpr: an XML template engine inspired by Zope Page Templates
How about giving the VSS string library a try?
First steps with VSS
We start by creating a sample Alire crate and adding VSS as a dependency:
$ alr init --bin vss_test
$ cd vss_test
$ alr with vss
Then we modify vss_test.adb with the following code:
pragma Wide_Character_Encoding (UTF8);

with VSS.Strings;
with VSS.Strings.Conversions;
with Ada.Wide_Wide_Text_IO;

procedure Vss_Test is
   Text : VSS.Strings.Virtual_String := "𝛼−𝛽";
begin
   Ada.Wide_Wide_Text_IO.Put_Line
     (VSS.Strings.Conversions.To_Wide_Wide_String (Text));
end Vss_Test;
The first line tells GNAT that the source file is encoded in UTF‑8. The with clauses then bring in the VSS library units and the Wide_Wide_Text_IO package. The initialization of the Text variable relies on the Ada 2022 syntax for user-defined literals, which hides a call to VSS.Strings.To_Virtual_String for the string literal. Converting back to a standard string, on the other hand, requires an explicit call, here To_Wide_Wide_String.
To build and execute this code just run:
$ alr run
After creating the Virtual_String object, “Text”, we can then:
- check whether it is empty
- get its length in characters
- compute its hash
- check whether it starts (or ends) with another string
- perform a case conversion
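As a rough sketch, some of these queries could look as follows. The operation names Is_Empty, Character_Length, Starts_With and Ends_With, as well as the Character_Count type, are assumptions about the Virtual_String API and should be checked against the installed VSS release. Hashing and case conversion are left out of the sketch because their exact spelling may differ between releases.

```ada
pragma Wide_Character_Encoding (UTF8);

with Ada.Wide_Wide_Text_IO; use Ada.Wide_Wide_Text_IO;
with VSS.Strings;

procedure Vss_Queries is
   Text : constant VSS.Strings.Virtual_String := "𝛼−𝛽";
begin
   --  Emptiness test (assumed name: Is_Empty).
   Put_Line (Boolean'Wide_Wide_Image (Text.Is_Empty));

   --  Length in characters, not in bytes (assumed name: Character_Length).
   Put_Line
     (VSS.Strings.Character_Count'Wide_Wide_Image (Text.Character_Length));

   --  Prefix and suffix tests (assumed names: Starts_With, Ends_With).
   Put_Line (Boolean'Wide_Wide_Image (Text.Starts_With ("𝛼")));
   Put_Line (Boolean'Wide_Wide_Image (Text.Ends_With ("𝛽")));
end Vss_Queries;
```

Because Virtual_String is a tagged type, every operation can be written in dot notation on the object itself.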
We can also modify Text by:
- appending a string or character
- prepending a string or character
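A minimal sketch of these modifications is shown below. It assumes that Append and Prepend are procedures on Virtual_String, each overloaded for a string and for a single character. The procedure name Vss_Modify is made up for this example.

```ada
pragma Wide_Character_Encoding (UTF8);

with Ada.Wide_Wide_Text_IO;
with VSS.Strings;
with VSS.Strings.Conversions;

procedure Vss_Modify is
   Text : VSS.Strings.Virtual_String := "𝛼−𝛽";
begin
   Text.Append ('!');     --  append a single character
   Text.Append ("−𝛾");    --  append a string
   Text.Prepend ("𝛿−");   --  prepend a string
   Text.Prepend ('(');    --  prepend a single character

   Ada.Wide_Wide_Text_IO.Put_Line
     (VSS.Strings.Conversions.To_Wide_Wide_String (Text));
end Vss_Modify;
```

A character literal selects the character overload and a string literal selects the string overload, so no explicit conversion is needed at the call site.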
We can split Text into a string vector (defined in VSS.String_Vectors):
declare
   List : VSS.String_Vectors.Virtual_String_Vector := Text.Split ('−');
begin
   for Item of List loop
      Ada.Wide_Wide_Text_IO.Put_Line
        (VSS.Strings.Conversions.To_Wide_Wide_String (Item));
   end loop;
end;
A dedicated function, Text.Split_Lines, splits the text into a string vector using a specified line separator. Conversely, the vector type offers the Join_Lines function for the opposite operation.
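The round trip between a multi-line string and a string vector might be sketched as follows. This is an illustration under assumptions: Split_Lines is called with its default parameters, Join_Lines is assumed to take the line terminator to insert, and VSS.Strings.LF is assumed to be the name of the line-feed terminator. Check these against the library specification before relying on them.

```ada
with Ada.Wide_Wide_Text_IO;
with VSS.String_Vectors;
with VSS.Strings;
with VSS.Strings.Conversions;

procedure Vss_Lines is
   Line_Feed : constant Wide_Wide_Character := Wide_Wide_Character'Val (10);

   --  Two lines separated by a line feed.
   Text : constant VSS.Strings.Virtual_String :=
     VSS.Strings.To_Virtual_String ("first" & Line_Feed & "second");

   --  Split at line terminators (default parameters assumed).
   Lines : constant VSS.String_Vectors.Virtual_String_Vector :=
     Text.Split_Lines;

   --  Join the lines again (terminator parameter assumed).
   Joined : constant VSS.Strings.Virtual_String :=
     Lines.Join_Lines (VSS.Strings.LF);
begin
   for Line of Lines loop
      Ada.Wide_Wide_Text_IO.Put_Line
        (VSS.Strings.Conversions.To_Wide_Wide_String (Line));
   end loop;

   Ada.Wide_Wide_Text_IO.Put
     (VSS.Strings.Conversions.To_Wide_Wide_String (Joined));
end Vss_Lines;
```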
The VSS library provides advanced string and text processing capabilities. It offers an API that allows developers to work with Unicode text, regardless of its internal representation. The library provides additional functionality beyond Ada’s standard string types and other string libraries developed by the community. It provides a definite type to represent a Unicode character string, with a comprehensive range of operations comparable to those found in other programming languages. Additionally, it is encoding-agnostic, allowing efficient implementations tailored to the platform or application. The library is divided into multiple projects, with each project catering to a specific need.
The VSS library is distributed under the Apache 2.0 license, and it can be built using an Ada 2022 compliant compiler. With its efficient implementation, modern language features and technologies, and support for tasks such as comparing and sorting strings, the VSS library is a useful tool for developers working with strings and text processing.
In subsequent articles, we will explore more advanced concepts such as cursors, streams, encoders/decoders, and so on. Stay tuned!
Exercise for the reader: Can you write a function to reverse the string "F⃗ = ma⃗" (F U+20D7 = m a U+20D7) in preparation for the next article?