
A new math library for Ada: Neo.SIMD
by Justin Squirek
Introduction
When you’re building a game engine, it’s practically guaranteed you’ll need a dedicated linear algebra library at some point. Physics calculations, camera movement, enemy AI - all of these features rely on the same core mathematical operations. If you’re coding in C++, Rust, or Python, you have plenty of well-known libraries to choose from, like Intel’s MKL, PyTorch, Boost, or more specialized solutions such as the Terathon Math Library. But if you’re working in Ada, the landscape is still largely fresh ground. That’s why, as part of my AdaDoom3 project, I created Neo.SIMD - a new library I hope will help fill that gap.
Wait, so what exactly is SIMD?
SIMD, or “Single Instruction, Multiple Data”, instructions are a way for processors to perform a single operation on multiple values at once. Given the nature of linear algebra, these instructions become all the more important once matrices and vectors start getting thrown around, since performing operations on them tends to involve highly parallel data and algorithms.
Utilizing these special instructions can dramatically reduce the number of required cycles for such use cases, and, while compilers will automatically generate some SIMD instructions behind the scenes when optimizations are enabled, we want to push things a step further.
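For instance, a plain component-wise loop like the sketch below (illustrative, not taken from Neo.SIMD) is exactly the kind of code GNAT can auto-vectorize into packed SSE/AVX additions when built with flags such as -O3 -march=native:

type Float_Array is array (1 .. 1_024) of Float;

-- A candidate for compiler auto-vectorization: one independent addition per element
procedure Add (A, B : Float_Array; C : out Float_Array) is
begin
   for I in Float_Array'Range loop
      C (I) := A (I) + B (I);
   end loop;
end Add;

Hand-written SIMD is about taking control in the cases where the compiler’s output isn’t good enough.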
The goals of Neo.SIMD
The goals of the library are simple:
Identify key linear algebra functions to implement that can take advantage of hand-written assembly and are relevant to general-purpose calculations and graphics programming.
Take advantage of Ada’s ability to interface directly with assembly to write simple, modular, inline assembly functions (see the short example after this list).
Leverage any known open-source implementations so that we can be sure that we are working with the latest algorithms.
Support x86-64 (AVX, SSE, etc.) as well as ARM (NEON).
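To illustrate the second goal, here is a minimal sketch (not taken from Neo.SIMD) of how GNAT’s System.Machine_Code interface lets Ada wrap a single SSE instruction:

with System.Machine_Code; use System.Machine_Code;

-- Illustrative only: wraps the SSE reciprocal-square-root instruction
function Approximate_Inverse_Sqrt (X : Float) return Float is
   Result : Float;
begin
   Asm (Template => "rsqrtss %1, %0",
        Outputs  => Float'Asm_Output ("=x", Result), -- "x": any SSE register
        Inputs   => Float'Asm_Input  ("x", X));
   return Result;
end Approximate_Inverse_Sqrt;

The "x" constraint asks the compiler for an SSE register, so a one-instruction wrapper like this needs no explicit register bookkeeping.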
Identifying key functions
I began by studying the code of Eric Lengyel’s wonderful Terathon Math Library. Eric is a mathematician with a long history in the games industry who has written several books on subjects such as game physics and game mathematics, so the Terathon Math Library should provide perfect insight into which functions need to be implemented for my purposes.
Using this library as a model, the needed functionality can be boiled down into several categories (a hypothetical spec sketch follows the list):
Minimum/Maximum or “clamping” of vectors
Arithmetic (addition, subtraction, division, etc.) of vectors
Rounding
Trigonometry (cos, sin, tan, etc.)
Transformation via a “Transform matrix”
Cross products of vectors
Determinant calculations for matrices
Multiplication of matrices
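To make these categories concrete, here is a slice of what such a spec could look like - a hypothetical sketch only, with illustrative names and profiles rather than the actual Neo.SIMD API:

package Neo_SIMD_Sketch is -- Hypothetical name, for illustration only
   type Vector_4D is record
      X, Y, Z, W : Float;
   end record;

   function Clamp (V, Low, High : Vector_4D) return Vector_4D; -- Min/max
   function "+"   (Left, Right  : Vector_4D) return Vector_4D; -- Arithmetic
   function Floor (V : Vector_4D) return Vector_4D;            -- Rounding
   function Cos   (V : Vector_4D) return Vector_4D;            -- Trigonometry
   function Cross (Left, Right : Vector_4D) return Vector_4D;  -- Cross product (W ignored)
end Neo_SIMD_Sketch;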
Implementation
Generally, most SIMD math libraries use what are called “intrinsics” - loosely wrapped assembly instructions that can be easily integrated with high-level code. However, I decided against this method for a couple of reasons: partly because I would have to import each of these intrinsic functions before using them in Ada, but also for what I perceive as the improved readability, tunability, and understandability of plain assembly.
So, without further ado, let’s dive in and look at a couple of examples that illustrate the difference between the two approaches.
Below is a set of functions (minus some smallish boilerplate ones) from Terathon that accomplish 4D matrix multiplication. Although on the surface this looks like high-level C++, there are, in fact, intrinsic assembly calls present (see the functions starting with “_mm”). In many ways, this can be seen as more readable, since it integrates directly with C++ and has a friendly appearance. However, there is a cost - most notably, a lack of fine-grained control over individual registers and the ordering of individual instructions.
inline vec_float VecTransformPoint3D(const vec_float& c1, const vec_float& c2,
                                     const vec_float& c3, const vec_float& c4,
                                     const vec_float& p)
{
    vec_float result = _mm_mul_ps(c1, VecSmearX(p));
    result = _mm_add_ps(result, _mm_mul_ps(c2, VecSmearY(p)));
    result = _mm_add_ps(result, _mm_mul_ps(c3, VecSmearZ(p)));
    return (_mm_add_ps(result, c4));
}

inline vec_float VecTransformVector3D(const vec_float& c1, const vec_float& c2,
                                      const vec_float& c3, const vec_float& v)
{
    vec_float result = _mm_mul_ps(c1, VecSmearX(v));
    result = _mm_add_ps(result, _mm_mul_ps(c2, VecSmearY(v)));
    return (_mm_add_ps(result, _mm_mul_ps(c3, VecSmearZ(v))));
}

Matrix4D Terathon::operator *(const Matrix4D& m1, const Transform3D& m2)
{
    Matrix4D result;

    vec_float a = VecLoad(&m1(0,0));
    vec_float b = VecLoad(&m1(0,1));
    vec_float c = VecLoad(&m1(0,2));

    VecStore(VecTransformVector3D(a, b, c, VecLoad(&m2(0,0))), &result(0,0));
    VecStore(VecTransformVector3D(a, b, c, VecLoad(&m2(0,1))), &result(0,1));
    VecStore(VecTransformVector3D(a, b, c, VecLoad(&m2(0,2))), &result(0,2));
    VecStore(VecTransformPoint3D(a, b, c, VecLoad(&m1(0,3)), VecLoad(&m2(0,3))), &result(0,3));

    return (result);
}
Here is the implementation of 4D matrix multiplication in Neo.SIMD. Although it may appear longer or more cumbersome, it achieves a certain simplicity. First, this function has no dependencies - it is fully self-contained (all it requires is the definition of Matrix_4D). It also allows very fine control over the ordering of instructions and the use of registers - which I view as the primary reason for writing assembly over high-level solutions in the first place. For example, using prefetch operations and organizing instructions to facilitate pipelining can help save those precious cycles.
Finally, assembly blocks satisfy the WYSIWYG criterion - “what you see is what you get” - while the use of intrinsics can result in wildly different object code depending on optimization flags and compilers. Although it is impossible to say for sure, in all of the compiler configurations I tested, the assembly resulting from the Terathon implementation of this function ended up being 30%-60% longer in terms of instruction count.
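For context, the byte offsets and the Ptr and E names used in the assembly below assume declarations roughly along the following lines (a sketch only; the actual Neo.SIMD declarations may differ):

with System;              use System;
with System.Machine_Code; use System.Machine_Code; -- Provides Asm and the Asm_Input attribute

package Neo_SIMD_Types_Sketch is -- Hypothetical name, for illustration only
   subtype Ptr is Address; -- Assumed: the type handed to the Asm inputs

   E : constant String := ASCII.LF & ASCII.HT; -- Assumed: separates instructions in the template

   type Vector_4D is record
      X, Y, Z, W : Float; -- Assuming 32-bit Float: each row occupies 16 bytes
   end record with Size => 128;

   type Matrix_4D is record
      X, Y, Z, W : Vector_4D; -- Rows at byte offsets 0, 16, 32, and 48
   end record with Size => 512, Alignment => 16; -- Aligned for the aligned moves below
end Neo_SIMD_Types_Sketch;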
function "*" (Left, Right : Matrix_4D) return Matrix_4D is
   Result : aliased Matrix_4D;
begin
   Asm (Clobber => "xmm0, xmm1, xmm2, xmm3, xmm8, xmm10, xmm11, memory",
        Inputs => (Ptr'Asm_Input ("r", Left'Address),
                   Ptr'Asm_Input ("r", Right'Address),
                   Ptr'Asm_Input ("r", Result'Address)),
        Template =>
          -- Prefetch Right rows
          " prefetcht0   (%1)                   " & E & -- Prefetch Right.X
          " prefetcht0 16(%1)                   " & E & -- Prefetch Right.Y
          " prefetcht0 32(%1)                   " & E & -- Prefetch Right.Z
          " prefetcht0 48(%1)                   " & E & -- Prefetch Right.W
          -- Initialize accumulators, one per result row
          " vxorps %%xmm2,  %%xmm2,  %%xmm2     " & E & -- xmm2  ← (0.0, ...) Accumulator for X
          " vxorps %%xmm3,  %%xmm3,  %%xmm3     " & E & -- xmm3  ← (0.0, ...) Accumulator for Y
          " vxorps %%xmm10, %%xmm10, %%xmm10    " & E & -- xmm10 ← (0.0, ...) Accumulator for Z
          " vxorps %%xmm11, %%xmm11, %%xmm11    " & E & -- xmm11 ← (0.0, ...) Accumulator for W
          -- Multiply with Right.X
          " vbroadcastss   (%0), %%xmm0         " & E & -- xmm0 ← (Left.XX, ...)
          " vbroadcastss 16(%0), %%xmm1         " & E & -- xmm1 ← (Left.YX, ...)
          " vmovaps (%1), %%xmm8                " & E & -- xmm8 ← Right.X
          " vfmadd231ps %%xmm8, %%xmm0, %%xmm2  " & E & -- xmm2  ← xmm2  + (xmm0 * xmm8)
          " vfmadd231ps %%xmm8, %%xmm1, %%xmm3  " & E & -- xmm3  ← xmm3  + (xmm1 * xmm8)
          " vbroadcastss 32(%0), %%xmm0         " & E & -- xmm0 ← (Left.ZX, ...)
          " vbroadcastss 48(%0), %%xmm1         " & E & -- xmm1 ← (Left.WX, ...)
          " vfmadd231ps %%xmm8, %%xmm0, %%xmm10 " & E & -- xmm10 ← xmm10 + (xmm0 * xmm8)
          " vfmadd231ps %%xmm8, %%xmm1, %%xmm11 " & E & -- xmm11 ← xmm11 + (xmm1 * xmm8)
          -- Multiply with Right.Y
          " vbroadcastss  4(%0), %%xmm0         " & E & -- xmm0 ← (Left.XY, ...)
          " vbroadcastss 20(%0), %%xmm1         " & E & -- xmm1 ← (Left.YY, ...)
          " vmovups 16(%1), %%xmm8              " & E & -- xmm8 ← Right.Y
          " vfmadd231ps %%xmm8, %%xmm0, %%xmm2  " & E & -- xmm2  ← xmm2  + (xmm0 * xmm8)
          " vfmadd231ps %%xmm8, %%xmm1, %%xmm3  " & E & -- xmm3  ← xmm3  + (xmm1 * xmm8)
          " vbroadcastss 36(%0), %%xmm0         " & E & -- xmm0 ← (Left.ZY, ...)
          " vbroadcastss 52(%0), %%xmm1         " & E & -- xmm1 ← (Left.WY, ...)
          " vfmadd231ps %%xmm8, %%xmm0, %%xmm10 " & E & -- xmm10 ← xmm10 + (xmm0 * xmm8)
          " vfmadd231ps %%xmm8, %%xmm1, %%xmm11 " & E & -- xmm11 ← xmm11 + (xmm1 * xmm8)
          -- Multiply with Right.Z
          " vbroadcastss  8(%0), %%xmm0         " & E & -- xmm0 ← (Left.XZ, ...)
          " vbroadcastss 24(%0), %%xmm1         " & E & -- xmm1 ← (Left.YZ, ...)
          " vmovups 32(%1), %%xmm8              " & E & -- xmm8 ← Right.Z
          " vfmadd231ps %%xmm8, %%xmm0, %%xmm2  " & E & -- xmm2  ← xmm2  + (xmm0 * xmm8)
          " vfmadd231ps %%xmm8, %%xmm1, %%xmm3  " & E & -- xmm3  ← xmm3  + (xmm1 * xmm8)
          " vbroadcastss 40(%0), %%xmm0         " & E & -- xmm0 ← (Left.ZZ, ...)
          " vbroadcastss 56(%0), %%xmm1         " & E & -- xmm1 ← (Left.WZ, ...)
          " vfmadd231ps %%xmm8, %%xmm0, %%xmm10 " & E & -- xmm10 ← xmm10 + (xmm0 * xmm8)
          " vfmadd231ps %%xmm8, %%xmm1, %%xmm11 " & E & -- xmm11 ← xmm11 + (xmm1 * xmm8)
          -- Multiply with Right.W
          " vbroadcastss 12(%0), %%xmm0         " & E & -- xmm0 ← (Left.XW, ...)
          " vbroadcastss 28(%0), %%xmm1         " & E & -- xmm1 ← (Left.YW, ...)
          " vmovups 48(%1), %%xmm8              " & E & -- xmm8 ← Right.W
          " vfmadd231ps %%xmm8, %%xmm0, %%xmm2  " & E & -- xmm2  ← xmm2  + (xmm0 * xmm8)
          " vfmadd231ps %%xmm8, %%xmm1, %%xmm3  " & E & -- xmm3  ← xmm3  + (xmm1 * xmm8)
          " vbroadcastss 44(%0), %%xmm0         " & E & -- xmm0 ← (Left.ZW, ...)
          " vbroadcastss 60(%0), %%xmm1         " & E & -- xmm1 ← (Left.WW, ...)
          " vfmadd231ps %%xmm8, %%xmm0, %%xmm10 " & E & -- xmm10 ← xmm10 + (xmm0 * xmm8)
          " vfmadd231ps %%xmm8, %%xmm1, %%xmm11 " & E & -- xmm11 ← xmm11 + (xmm1 * xmm8)
          -- Store the result rows
          " vmovaps %%xmm2,    (%2)             " & E & -- Result.X ← xmm2
          " vmovups %%xmm3,  16(%2)             " & E & -- Result.Y ← xmm3
          " vmovups %%xmm10, 32(%2)             " & E & -- Result.Z ← xmm10
          " vmovups %%xmm11, 48(%2)             ");    -- Result.W ← xmm11
   return Result;
end "*";
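From the caller’s perspective, none of this is visible - the operator reads like any other Ada multiplication. A trivial (hypothetical) usage:

procedure Demo (Model, View : Matrix_4D) is
   Combined : constant Matrix_4D := Model * View; -- Invokes the assembly body above
begin
   null; -- Use Combined here
end Demo;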
Benchmarks
Now, for the fun part - let’s check out some unscientific benchmarks! I created three test datasets for comparing the functions within each math library that generally involve the largest number of instructions. Each dataset consists of 500k operations and is generated by a Python script that outputs a CSV table, ensuring the exact same operations are performed in each implementation’s test. All tests were run in sequence on my (now aging) Alienware 13 R3 with an Intel i7 and 32 GB of RAM. To give some extra perspective, I added two other implementations: a “naive” Ada approach, which uses normal operations and no SIMD, and PyTorch - although I did not include PyTorch in the graphs, since, as you’ll see, it skewed the scales too much for obvious reasons.
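The timing side of the harness boils down to a loop like this sketch (names are illustrative and the CSV loading is omitted; a real harness would heap-allocate arrays this large rather than declaring them on the stack):

with Ada.Real_Time; use Ada.Real_Time;
with Ada.Text_IO;   use Ada.Text_IO;

procedure Bench_Multiply is
   type Matrix_Array is array (1 .. 500_000) of Matrix_4D;
   Lefts, Rights : Matrix_Array;            -- Operands preloaded from the generated CSV
   Sink          : Matrix_4D with Volatile; -- Keeps the multiplies from being optimized away
   Start         : Time;
begin
   -- ... load Lefts and Rights from the CSV here ...
   Start := Clock;
   for I in Matrix_Array'Range loop
      Sink := Lefts (I) * Rights (I);
   end loop;
   Put_Line (Duration'Image (To_Duration (Clock - Start)) & "s");
end Bench_Multiply;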
Build settings
For Neo.SIMD/Ada Naive: -O3 -march=native -ffast-math
For Terathon: -O3 -march=native -ffast-math
Matrix 4D Multiplication
PyTorch: 0.939628s
Ada (Naive): 0.0132585s
Terathon Math Library: 0.01238s
Neo.SIMD: 0.0093s

Matrix 3D Multiplication
PyTorch: 0.917344s
Ada (Naive): 0.0065655s
Terathon Math Library: 0.006195s
Neo.SIMD: 0.0065303s

Matrix 4D Determinant
PyTorch: 3.67728s
Ada (Naive): 0.0086014s
Terathon Math Library: 0.004519s
Neo.SIMD: 0.0062651s

Analysis of results
Although this is nowhere near a comprehensive test of the different implementations, since we are only looking at a small subset of libraries and functions, several things stand out from the results - most notably, that the naive approach (without special assembly instructions) still performs fantastically on many tasks. Of course, this will vary from compiler to compiler and with compilation options, but it nonetheless demonstrates the power of modern compilers and tools such as GNAT.
The second thing of note is that the approach taken by Neo.SIMD is viable - it does indeed outperform its reference library, with a roughly 25% speedup in at least one of the core functions above.
Conclusion - what's next?
Hand-written assembly has certainly fallen out of favor in the current age - after all, it is not easy to beat the compiler! However, considering the immense number of calculations necessary for maintaining 3D environments or performing AI transformations, I don’t think fine-tuned SIMD assembly is going away anytime soon.
As for next steps, I plan to continue benchmarking the rest of the functions, expanding the library, and, perhaps, using it to move forward with developing some physics for AdaDoom3 🙂
You can view the current state of the project on my GitHub here: https://github.com/AdaDoom3/Ne...