Building Safety by Design: CHERI in Critical Systems Development
by Paul Butcher
A question often debated around high-integrity software (software responsible for critical systems that must meet a very high level of assurance) is this: once we have argued, with verification evidence, that the risk of our system failing is as low as reasonably practicable, should we deploy the system with dynamic constraint checking enabled or disabled? In this context, dynamic constraint checking includes software runtime checks (assertions that verify the correct behavior of the system) and hardware checks provided by some microprocessors, for example, the memory protection provided by a Memory Management Unit (MMU). The answer is not straightforward and depends on the specifics of the system, including the assurance level required and the regulatory guidelines being conformed to.

Historically, especially within industries such as aerospace, the trend has been to make guarantees about the behavior of the system so that it can be deployed without dynamic constraint checking: if we are confident that a fault cannot occur, there is no benefit in guarding against it. However, the landscape is evolving; the demand for more complex and interconnected systems has never been higher, and cyber security concerns are at the forefront of high-integrity systems development. The problem is not new, is well understood, and is nicely summarized by Edsger W. Dijkstra in "The Humble Programmer" (1972):
"Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence."
The problem is exacerbated in modern software development. While technologies such as formal proof, other forms of static analysis, and software-driven memory sanitizers have helped build more convincing verification evidence, memory-unsafe programming languages like C are still widely used, making convincing verification arguments and guarantees of the absence of runtime errors hard to achieve. To a much lesser extent, the same is true of memory-safe programming languages: both Ada and Rust have language constructs that bypass their protections (Rust calls this "unsafe" code; Ada refers to it as "Unchecked Programming"), and there are justifiable times in low-level programming when these operations need to be used.
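The "checking on or off" deployment debate from the opening paragraph has a very concrete form in C: the standard `assert` macro is a dynamic constraint check that the build system can compile out. The sketch below is illustrative (the `read_element` function and the buffer size are invented for this example):

```c
#include <assert.h>

/* A software runtime check guarding an array access.  Compiling with
 * -DNDEBUG removes the assert, i.e. the system is deployed with this
 * dynamic constraint check switched off. */
static int buffer[8];

int read_element(int index)
{
    assert(index >= 0 && index < 8);   /* dynamic constraint check */
    return buffer[index];
}
```

With the check compiled in, an out-of-range index halts the program at the faulting call; with `-DNDEBUG`, the same call silently reads out of bounds. That silent failure mode is exactly what the verification evidence must rule out before checks are removed.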
Security by Default
During the Embedded Real-Time Systems conference in Toulouse, France, in 2024, AdaCore presented a paper titled "Security by Default - CHERI ISA Extensions Coupled with a Security-Enhanced Ada Runtime." The paper, which has since been published in the conference journal <ERTS 2024 Tu.4.c.1>, discusses security claims around Capability Hardware Enhanced RISC Instructions (CHERI) microprocessor extensions and the direct role the hardware can play in achieving security airworthiness. For aerospace, security airworthiness is the discipline of protecting air vehicles from unauthorized electronic interaction, focusing on attack vectors that could directly impact the vehicle's safe operability.
Through our research and development of pure-capability CHERI-compliant Ada runtimes, we argue that CHERI microprocessors are well suited to safety-critical software deployment. Furthermore, our toolchains offer a unique perspective on fault tolerance: in a first-of-its-kind approach, CHERI hardware-detected capability faults are translated into software exceptions that can be caught and handled. Within high-integrity software development, exception handling is often considered unsuitable, and many of the arguments against its use are valid. For example, exception handlers are typically implemented with jumps within the control flow, making the system non-deterministic. However, this argument is weak: the alternative of not identifying the failure still results in non-deterministic behavior (the difference is that with exception handling, at least you know a fault has occurred). The correct usage of exception handling therefore provides an effective last line of defense against triggered anomalies and unwanted behavior within a high-integrity system.
Meeting Airworthiness Standards
Airworthiness criteria standards like DO-178C/ED-12C, titled "Software Considerations in Airborne Systems and Equipment Certification," define Level A software as the highest level of criticality. Requirements for developing Level A software are assigned when software failures or malfunctions could result in catastrophic consequences, such as the loss of the aircraft or the loss of life. More recent guidelines, such as DO-326A/ED-202A, titled "Airworthiness Security Process," ensure the safety hazard analysis includes unauthorized electronic interaction (cyber attacks).
One of the underlying pillars of safety-critical systems, and one where CHERI hardware plays a primary role, is determinism. A deterministic system always produces the same output for a given input under identical conditions; its initial state and inputs fully define its behavior. Deterministic software ensures that events occur in a predictable sequence without randomness or undefined behavior, and one of the biggest threats to deterministic behavior is software bugs. Software vulnerabilities are typically (although not exclusively) bugs within the executing code space that can be manipulated for malicious intent, and more often than not these bugs equate to unsafe memory accesses.

A classic example is a buffer overflow, which can result in unpredictable (non-deterministic) but hard-to-detect behavior. CHERI provides runtime hardware constraint checks that protect against overflows through fine-grained compartmentalization of objects, which not only guards against unsafe memory accesses but also brings deterministic behavior to these kinds of software bugs. By removing the bug's impact, CHERI removes the vulnerability: while the bug could still be triggered for malicious intent, CHERI nullifies the result by detecting the fault before the unsafe memory instruction is executed.

A counterargument arises when the attack intends to perform a denial of service or, in the case of a safety-critical system, to remove the availability of a safety measure by triggering a hardware-detected capability fault when recovery is known to be infeasible. This is where exception handling plays a critical role: by providing engineers with software-level notifications of hardware-detected attacks, security solutions that offer vulnerability logging and fault tolerance through attack recovery can be designed into the system early in the lifecycle, allowing strong security and safety arguments that still satisfy the requirements around determinism.
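The buffer-overflow example can be made concrete. On a conventional processor, the unchecked store in the sketch below would silently corrupt whatever sits after the buffer; on a CHERI target, the pointer is a capability carrying the object's bounds, and the hardware traps the store before it executes. The checked variant expresses the equivalent guarantee in software (the `packet` layout and `checked_write` helper are invented for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* On CHERI hardware, p->buf[i] with i out of range traps before the
 * store executes, because the capability encodes the object's bounds.
 * On a conventional target, the same store silently overwrites the
 * adjacent 'flag' field -- unpredictable, hard-to-detect corruption. */
struct packet { char buf[8]; int flag; };

/* Software equivalent of the bounds check CHERI performs in hardware:
 * refuse the write instead of executing it out of bounds. */
bool checked_write(struct packet *p, size_t i, char c)
{
    if (i >= sizeof p->buf)
        return false;          /* deterministic failure, no corruption */
    p->buf[i] = c;
    return true;
}
```

Either way, the outcome of triggering the bug is deterministic: a defined failure indication rather than arbitrary memory corruption.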
Our paper discusses examples of how cyber-attack detection, resilience, recovery, and new failure mode strategies like "fail secure but degraded" can be designed into systems. These include using the CHERI compartmentalization model to isolate software components so that failed subsystems (which have detected capability faults) can be placed into a dormant state or a recovery mode initiated. CHERI capability fault exception handlers also enable dynamic vulnerability logging; when the fault is detected, the state of the system and the triggering inputs can be stored for offline analysis, reproduction of the failure, and subsequent software fixes and patching.
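The "fail secure but degraded" strategy and the vulnerability logging described above can be sketched as a small supervisor. The component states, retry threshold, and log format here are illustrative assumptions, not taken from the paper or from GNAT Pro for CHERI:

```c
#include <stdio.h>

/* Illustrative states for a "fail secure but degraded" design:
 * a faulting component first attempts recovery, and after repeated
 * capability faults it is isolated in a dormant state. */
enum comp_state { RUNNING, RECOVERING, DORMANT };

struct component {
    enum comp_state state;
    int             fault_count;
    char            last_fault[64];  /* stored for offline analysis */
};

/* Invoked from the capability-fault handler: record the triggering
 * input for later reproduction, then recover or isolate.  The
 * three-fault threshold is an arbitrary example policy. */
void on_capability_fault(struct component *c, const char *trigger)
{
    snprintf(c->last_fault, sizeof c->last_fault, "%s", trigger);
    c->fault_count++;
    c->state = (c->fault_count < 3) ? RECOVERING : DORMANT;
}
```

A first fault places the component into recovery; persistent faults, which may indicate a sustained attack, place it into the dormant state while the rest of the system continues degraded, with `last_fault` preserved for offline analysis and patching.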
The paper can be read on our tech papers website. Please contact us if you'd like to learn more about our GNAT Pro for CHERI solutions or our other safety-, security-, and mission-critical systems development tooling.