# C library bindings: GCC plugins to the rescue

## by Pierre-Marie de Rodat – Jun 13, 2016

I recently started working on an Ada binding for the excellent libuv C library. This library provides a convenient API to perform asynchronous I/O under an event loop, which is a popular way to develop server stacks. A central part of this API is its enumeration type for error codes: most functions use it. Hence, one of the first things I had to do was to bind the enumeration type for error codes. Believe it or not: this is harder than it first seems!

The enumeration type itself is defined in uv.h and makes heavy use of C macros. Besides, the representation of each enumerator can be platform dependent. Because of this, I would like to get the type binding automatically generated. But only the enumeration type: I want to bind the rest manually to expose an Ada API that is as idiomatic as possible.

# Don’t we already have a tool for this?

The first option that came to my mind was to use the famous -fdump-ada-spec GCC flag:

subtype uv_errno_t is unsigned;
UV_E2BIG : constant uv_errno_t := -7;
UV_EACCES : constant uv_errno_t := -13;
UV_EADDRINUSE : constant uv_errno_t := -98;
--  [...]

That’s a good start but I wanted something better. First, uv_errno_t in C really is an enumeration: each value is independent, there is no bit-field game, so what I want is a genuine Ada enumerated type. Moreover, the output from -fdump-ada-spec is raw and meant to be used as a starting point to be refined. For instance, I would have to rework the casing of identifiers and remove the UV_ prefix.

Besides, the output contains the type definition… but also bindings for the rest of the API (other types, subprograms, …), so I would have to filter out the rest. Overall, given that this binding will have to be generated for each platform, using -fdump-ada-spec does not look convenient.
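
To make the goal concrete, here is a minimal C++ sketch of the massaging I have in mind: strip a prefix, fix the casing, and emit a genuine Ada enumeration type with its representation clause. The helper names (ada_name, ada_enum_spec) and the type name used in the example are made up for illustration; nothing here comes from libuv or GCC. Enumerators are sorted by value because Ada wants representation values to increase along with literal positions.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

/* Hypothetical pair type: a C enumerator name and its value.  */
typedef std::pair<std::string, long> enumerator;

/* Hypothetical helper: turn a C enumerator name such as "UV_EADDRINUSE"
   into an Ada-style identifier ("Eaddrinuse") by dropping PREFIX when
   present and applying Mixed_Case to each underscore-separated word.  */
std::string
ada_name (const std::string &c_name, const std::string &prefix)
{
  std::string name = c_name;
  if (name.compare (0, prefix.size (), prefix) == 0)
    name.erase (0, prefix.size ());

  bool start_of_word = true;
  for (char &c : name)
    {
      const unsigned char uc = static_cast<unsigned char> (c);
      c = static_cast<char> (start_of_word ? std::toupper (uc)
                                           : std::tolower (uc));
      start_of_word = (c == '_');
    }
  return name;
}

/* Hypothetical helper: build the text of an Ada enumeration type and of
   its representation clause.  ENUMERATORS is taken by value so that it
   can be sorted locally.  */
std::string
ada_enum_spec (const std::string &type_name,
               std::vector<enumerator> enumerators,
               const std::string &prefix)
{
  /* Order literals by increasing representation value.  */
  std::stable_sort (enumerators.begin (), enumerators.end (),
                    [] (const enumerator &a, const enumerator &b)
                    { return a.second < b.second; });

  std::string literals, codes;
  for (const enumerator &e : enumerators)
    {
      const std::string name = ada_name (e.first, prefix);
      if (!literals.empty ())
        {
          literals += ", ";
          codes += ", ";
        }
      literals += name;
      codes += name + " => " + std::to_string (e.second);
    }

  return "type " + type_name + " is (" + literals + ");\n"
         + "for " + type_name + " use (" + codes + ");\n";
}
```

Given UV_E2BIG = -7 and UV_EACCES = -13, the "UV_" prefix and the type name Errno, this yields "type Errno is (Eacces, E2big);" followed by "for Errno use (Eacces => -13, E2big => -7);". A real generator would also have to deal with enumerators that share a value, which an Ada representation clause cannot express.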

# Cool kids handle C with Clang…

My second thought was to use Clang. Libclang’s Python API makes it possible to visit the whole parse tree, which sounds good. I would write a script to find the enumeration type definition, then iterate over its children so that I could list the pairs of names and representation values.

Unfortunately, this API does not expose key data such as the value of an integer literal. This is a deal breaker for my binding generator: no integer literal means no representation value.

And now, what? Building a dummy compilation unit with debug information and extracting enumeration data out of the generated DWARF looks cumbersome. No please… don’t tell me I'll have to parse preprocessed C code using regular expressions! Then I remembered something I had heard about some time ago, a technology called… GCC plugins.

Sure, I could probably have done this with Clang’s C++ API, but people working with Ada are much more likely to have GCC available rather than a Clang plugin development setup. Also, I was looking for an opportunity to check how the GCC plugins worked. ☺

# … but graybeards use GCC plugins

Since GCC 4.5, it is possible for everyone to extend the compiler without re-building GCC, for instance using your favorite GNU/Linux distribution’s pre-packaged compiler. The GCC wiki lists several plugins such as the famous DragonEgg: an LLVM backend for GCC. Note that the material here targets GCC 6: the set of plugin events has changed a bit since plugins were introduced, so details may differ with other versions.

So how does that work? Each plugin is a mere shared library (.so file on GNU/Linux) that is expected to expose a couple of specific symbols: a license “manifest” and an entry point function. The most straightforward way to write plugins is to use the same programming language as GCC: C++. This way, access to all internals is going to look like GCC’s own code. Note that plugins exist to expose internals in other languages: see for instance MELT or Python.

Loading a plugin is easy: just add a -fplugin=/path/to/plugin.so argument to your gcc/g++ command line: see the documentation for more details.

Now, remember the compiler pipeline architectures that you learnt at school:

The GCC plugin API makes it possible to run code at various points in the pipeline: before and after individual function parsing, after type specifier parsing and before each optimization pass, etc. At plugin initialization, a special function is executed and it registers which function should be executed, and when in the pipeline (see API documentation).
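
If the registration mechanism sounds abstract, the following self-contained C++ toy may help. It is only a model of the idea (an initialization function registers callbacks, the compiler fires events later), not GCC's implementation: toy_compiler, toy_event and toy_plugin_init are names invented for this sketch, and only the event names echo the real ones.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

/* Toy stand-ins for plugin events.  */
enum toy_event { TOY_FINISH_TYPE, TOY_FINISH_DECL };

/* In this toy, event data is just the name of the parsed entity.  */
typedef std::function<void (const std::string &)> toy_callback;

/* Minimal model of a compiler that lets plugins hook into its pipeline.  */
class toy_compiler
{
public:
  /* Remember that CALLBACK must run each time EVENT is fired.  */
  void
  register_callback (toy_event event, toy_callback callback)
  {
    callbacks_[event].push_back (callback);
  }

  /* Called by the "pipeline" when it reaches the point matching EVENT:
     run all callbacks registered for it, in registration order.  */
  void
  fire (toy_event event, const std::string &data) const
  {
    auto it = callbacks_.find (event);
    if (it == callbacks_.end ())
      return;
    for (const toy_callback &callback : it->second)
      callback (data);
  }

private:
  std::map<toy_event, std::vector<toy_callback>> callbacks_;
};

/* Toy plugin entry point: it does no work itself, it only tells the
   compiler what to call later, and for which event.  */
void
toy_plugin_init (toy_compiler &compiler, std::vector<std::string> &log)
{
  compiler.register_callback (TOY_FINISH_TYPE,
                              [&log] (const std::string &name)
                              { log.push_back ("type " + name); });
}
```

After toy_plugin_init has run, firing TOY_FINISH_DECL does nothing since no callback was registered for it, while firing TOY_FINISH_TYPE with "simple_enum" appends "type simple_enum" to the log.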

But what can plugins actually do? I would say: a lot… but not everything. Plugins interact directly with GCC’s internals, so they can inspect and mutate intermediate representations whenever they are run. However, they cannot change how existing passes work, so for instance plugins will not be able to hook into debug information generation.

# Let’s bind

Getting back to our enumeration business: it happens that all we need to do is to process enumeration type declarations right after they have been parsed. Fantastic, let’s create a new GCC plugin. Create the following Makefile:

# Makefile
HOST_GCC=g++
TARGET_GCC=gcc
PLUGIN_SOURCE_FILES= bind-enums.cc
GCCPLUGINS_DIR:= $(shell $(TARGET_GCC) -print-file-name=plugin)
CXXFLAGS+= -I$(GCCPLUGINS_DIR)/include -fPIC -fno-rtti -O2 -Wall -Wextra

bind-enums.so: $(PLUGIN_SOURCE_FILES)
	$(HOST_GCC) -shared $(CXXFLAGS) $^ -o $@

Then create the following plugin skeleton:

/* bind-enums.cc */
#include <stdio.h>

/* All plugin sources should start by including "gcc-plugin.h".  */
#include "gcc-plugin.h"
/* This lets us inspect the GENERIC intermediate representation.  */
#include "tree.h"

/* All plugins must export this symbol so that they can be linked with
   GCC license-wise.  */
int plugin_is_GPL_compatible;

/* Most interesting part so far: this is the plugin entry point.  */
int
plugin_init (struct plugin_name_args *plugin_info,
             struct plugin_gcc_version *version)
{
  (void) version;

  /* Give GCC a proper name and version number for this plugin.  */
  const char *plugin_name = plugin_info->base_name;
  struct plugin_info pi = { "0.1", "Enum binder plugin" };
  register_callback (plugin_name, PLUGIN_INFO, NULL, &pi);

  /* Check everything is fine displaying a familiar message.  */
  printf ("Hello, world!\n");

  return 0;
}

Hopefully, the code and its comments are self-explanatory. The next steps are to build this plugin and actually run it:

# To run in your favorite shell
$ make
[…]
$ gcc -c -fplugin=$PWD/bind-enums.so random-source.c
Hello, world!

So far, so good. Now let’s do something useful with this plugin: handle the PLUGIN_FINISH_TYPE event to process enumeration types. There are two things to do:

• create a function that will get executed every time this event fires;
• register this function as a callback for this event.

/* To be added before "return" in plugin_init.  */
register_callback (plugin_name, PLUGIN_FINISH_TYPE,
                   &handle_finish_type, NULL);

/* To be added before plugin_init.  */

/* Given an enumeration type (ENUMERAL_TYPE node) and a name for it
   (IDENTIFIER_NODE), describe its enumerators on the standard output.  */
static void
dump_enum_type (tree enum_type, tree enum_name)
{
  printf ("Found enum %s:\n", IDENTIFIER_POINTER (enum_name));

  /* Process all enumerators.  These are encoded as a linked list of
     TREE_LIST nodes starting from TYPE_VALUES and following TREE_CHAIN
     links.  */
  for (tree v = TYPE_VALUES (enum_type); v != NULL; v = TREE_CHAIN (v))
    {
      /* Get this enumerator's value (TREE_VALUE).  Give up if it's not a
         small integer.  */
      char buffer[128] = "\"<big integer>\"";
      if (tree_fits_shwi_p (TREE_VALUE (v)))
        {
          long l = tree_to_shwi (TREE_VALUE (v));
          snprintf (buffer, 128, "%li", l);
        }
      printf (" %s = %s\n", IDENTIFIER_POINTER (TREE_PURPOSE (v)), buffer);
    }
}

/* Thanks to register_callback, GCC will call the following for each parsed
   type specifier, providing the corresponding GENERIC node as the
   "gcc_data" argument.  */
static void
handle_finish_type (void *gcc_data, void *user_data)
{
  (void) user_data;
  tree t = (tree) gcc_data;

  /* Skip everything that is not a named enumeration type.  */
  if (TREE_CODE (t) != ENUMERAL_TYPE || TYPE_NAME (t) == NULL)
    return;

  dump_enum_type (t, TYPE_NAME (t));
}

These new functions finally feature GCC internals handling. As you might have guessed, tree is the type name GCC uses to designate a GENERIC node.
The set of kinds for nodes is defined in tree.def (TREE_CODE (t) returns the node kind) while tree attribute getters and setters are defined in tree.h. You can find a friendlier and longer introduction to GENERIC in the GCC Internals Manual.

By the way: how do we know what GCC passes as the “gcc_data” argument? Well it’s not documented… or more precisely, it’s documented in the source code!

Rebuild the plugin and then run it on a simple example:

$ make
[…]
$ cat <<EOF > test.h
> enum simple_enum { A, B };
> enum complex_enum { C = 1, D = -3};
> typedef enum { E, H } typedef_enum;
> EOF
$ gcc -c -fplugin=$PWD/bind-enums.so test.h
Found enum simple_enum:
A = 0
B = 1
Found enum complex_enum:
C = 1
D = -3

That’s good! But wait: the input example contains 3 types whereas the plugin mentions only the first two, where’s the mistake? This is actually expected: the first two enumerations have names (simple_enum and complex_enum) while the last one is anonymous. It’s actually the typedef that wraps it that has a name (typedef_enum). The PLUGIN_FINISH_TYPE event is fired for the anonymous enum type, but as it has no name, the guard skips it (see the “Skip everything that is not a named enumeration type” comment in the code above). We need names to produce bindings, so let’s process typedef nodes.

/* To be added before "return" in plugin_init.  */
register_callback (plugin_name, PLUGIN_FINISH_DECL,
                   &handle_finish_decl, NULL);

/* To be added before plugin_init.  */

/* Like handle_finish_type, but called instead for each parsed
   declaration.  */
static void
handle_finish_decl (void *gcc_data, void *user_data)
{
  (void) user_data;
  tree t = (tree) gcc_data;
  tree type = TREE_TYPE (t);

  /* Skip everything that is not a typedef for an enumeration type.  */
  if (TREE_CODE (t) != TYPE_DECL || TREE_CODE (type) != ENUMERAL_TYPE)
    return;

  dump_enum_type (type, DECL_NAME (t));
}

The PLUGIN_FINISH_DECL event is triggered for all parsed declarations: functions, arguments, variables, type definitions and so on. We want to process only typedefs (TYPE_DECL) that wrap (TREE_TYPE) enumeration types, hence the above guard.

Rebuild the plugin and run it once again:

$ make
[…]
$ gcc -c -fplugin=$PWD/bind-enums.so test.h
Found enum simple_enum:
A = 0
B = 1
Found enum complex_enum:
C = 1
D = -3
Found enum typedef_enum:
E = 0
H = 1

Fine, it looks like we covered all cases. What remains to be done now is to tune the plugin so that its output is easy to parse, for instance using the JSON format, and then to write a simple script that turns this JSON data into the expected Ada spec for our enumeration types. This is easy enough, but it goes beyond the scope of what is already a long enough post, don’t you think?
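
Just to give a rough idea of the first step, here is a minimal, standalone C++ sketch of what a JSON-producing helper could look like. It is deliberately independent from GCC: it works on plain name/value pairs, and the function names (json_string, enum_to_json) as well as the JSON layout are arbitrary choices for this illustration, not what the final plugin does. The escaping only covers quotes and backslashes, which is more than C identifiers need.

```cpp
#include <string>
#include <utility>
#include <vector>

/* Hypothetical helper: quote TEXT as a JSON string.  Only double quotes
   and backslashes are escaped; control characters are not handled.  */
std::string
json_string (const std::string &text)
{
  std::string out = "\"";
  for (char c : text)
    {
      if (c == '"' || c == '\\')
        out += '\\';
      out += c;
    }
  out += '"';
  return out;
}

/* Hypothetical helper: describe one enumeration type as a JSON object
   holding its name and the list of its enumerators.  */
std::string
enum_to_json (const std::string &name,
              const std::vector<std::pair<std::string, long>> &enumerators)
{
  std::string out = "{\"name\": " + json_string (name)
                    + ", \"enumerators\": [";
  bool first = true;
  for (const std::pair<std::string, long> &e : enumerators)
    {
      if (!first)
        out += ", ";
      first = false;
      out += "{\"name\": " + json_string (e.first)
             + ", \"value\": " + std::to_string (e.second) + "}";
    }
  out += "]}";
  return out;
}
```

In the plugin, dump_enum_type would be the natural place to collect the pairs before printing such an object, one per line for instance.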

# Help! My plugin went mad!

So you have started to write your own plugins but something is wrong: unexpected results, GCC internal error, crash, … How can we investigate such issues?

Of course you can do as hardcore programmers do and sow debug prints all over your code. However, if you are more like me and prefer using a debugger or tools such as Valgrind, there is a way. First edit your Makefile so that it builds with debug flags: replace the occurrence of -O2 with -O0 -g3. This stands for: no optimization and debug information including macros. Then rebuild the plugin.

Running GDB over the gcc/g++ command is not the next step as it’s just a driver that will spawn subprocesses which perform the actual compilation. Instead, run your usual gcc/g++ command with additional -v -save-temps flags. This will print a lot, including the command lines for the various subprocesses involved:

$ gcc -c -fplugin=$PWD/bind-enums.so test.h -v -save-temps |& grep fplugin
COLLECT_GCC_OPTIONS='-c' '-fplugin=/tmp/bind-enums/bind-enums.so' '-v' '-save-temps' '-mtune=generic' '-march=x86-64'
/usr/lib/gcc/x86_64-pc-linux-gnu/6.1.1/cc1 -E -quiet -v -iplugindir=/usr/lib/gcc/x86_64-pc-linux-gnu/6.1.1/plugin test.h -mtune=generic -march=x86-64 -fplugin=/tmp/bind-enums/bind-enums.so -fpch-preprocess -o test.i
COLLECT_GCC_OPTIONS='-c' '-fplugin=/tmp/bind-enums/bind-enums.so' '-v' '-save-temps' '-mtune=generic' '-march=x86-64'
/usr/lib/gcc/x86_64-pc-linux-gnu/6.1.1/cc1 -fpreprocessed test.i -iplugindir=/usr/lib/gcc/x86_64-pc-linux-gnu/6.1.1/plugin -quiet -dumpbase test.h -mtune=generic -march=x86-64 -auxbase test -version -fplugin=/tmp/bind-enums/bind-enums.so -o test.s --output-pch=test.h.gch

The above output contains four lines: let’s forget about line 1 and line 3; line 2 is the preprocessor invocation (note the -E flag), while line 4 performs the actual C compilation. As it's the latter that actually parses the input as C code, it's the one that triggers the plugin events. And it's the one to run under GDB:

# Run GDB in quiet mode so that we are not flooded in verbose output.
$ gdb -q --args /usr/lib/gcc/x86_64-pc-linux-gnu/6.1.1/cc1 -fpreprocessed test.i -iplugindir=/usr/lib/gcc/x86_64-pc-linux-gnu/6.1.1/plugin -quiet -dumpbase test.h -mtune=generic -march=x86-64 -auxbase test -version -fplugin=/tmp/bind-enums/bind-enums.so -o test.s --output-pch=test.h.gch
Reading symbols from /usr/lib/gcc/x86_64-pc-linux-gnu/6.1.1/cc1...(no debugging symbols found)...done.

# Put a breakpoint in your plugin. The plugin is a dynamically loaded shared
# library, so it's expected that GDB cannot find the plugin yet.
(gdb) b plugin_init
Function "plugin_init" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (plugin_init) pending.

# Run the compiler: thanks to the above breakpoint, we will get in plugin_init.
(gdb) run
Starting program: /usr/lib/gcc/x86_64-pc-linux-gnu/6.1.1/cc1 -fpreprocessed test.i -iplugindir=/usr/lib/gcc/x86_64-pc-linux-gnu/6.1.1/plugin -quiet -dumpbase test.h -mtune=generic -march=x86-64 -auxbase test -version -fplugin=/tmp/bind-enums/bind-enums.so -o test.s --output-pch=test.h.gch

Breakpoint 1, plugin_init (plugin_info=0x1de5d30, version=0x1c7fdc0) at bind-enums.cc:63
63        const char *plugin_name = plugin_info->base_name;
(gdb) # Victory!
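
Picking the right line out of the -v output by eye gets old quickly. The selection rule used above (take the cc1 invocation that does not carry -E) can be sketched in a few lines of standalone C++. This is only a sketch under simplifying assumptions: the function names are made up, words are split on whitespace without any shell quoting, and cc1plus is accepted as well since that is the name of the C++ compiler proper.

```cpp
#include <sstream>
#include <string>
#include <vector>

/* Split LINE into whitespace-separated words.  Shell quoting is ignored,
   which is an assumption of this sketch.  */
std::vector<std::string>
split_words (const std::string &line)
{
  std::istringstream in (line);
  std::vector<std::string> words;
  for (std::string word; in >> word;)
    words.push_back (word);
  return words;
}

/* Return whether PROGRAM (a path or a bare name) designates the C or C++
   compiler proper.  */
bool
is_compiler_proper (const std::string &program)
{
  const std::string::size_type slash = program.rfind ('/');
  const std::string base
    = slash == std::string::npos ? program : program.substr (slash + 1);
  return base == "cc1" || base == "cc1plus";
}

/* Among LINES (the output of the driver run with -v), return the first
   compiler invocation that is not a preprocessing-only one (no -E flag).
   Return an empty string if there is none.  */
std::string
find_compile_command (const std::vector<std::string> &lines)
{
  for (const std::string &line : lines)
    {
      const std::vector<std::string> words = split_words (line);
      if (words.empty () || !is_compiler_proper (words[0]))
        continue;

      bool preprocess_only = false;
      for (const std::string &word : words)
        if (word == "-E")
          preprocess_only = true;

      if (!preprocess_only)
        return line;
    }
  return "";
}
```

Fed with the four lines shown earlier, this returns the last one, which is the command to hand over to gdb --args.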

Debugging can lead you out of your plugin, into GCC’s own code. If this happens, you will need to build your own GCC to include debug information. This is a complex task for which, unfortunately, I know of no documentation.

# Final words

As this small example (final files attached) demonstrates, GCC plugins can be quite useful. This time, we just hooked into some kind of parse tree but it’s possible to deal with all intermediate representations in the compilation pipeline: go and check out what the plugins listed on GCC’s wiki can do!