Reliability, Availability, and Serviceability
Overview
Reliability, Availability, and Serviceability (RAS) aims to increase the robustness of a system by detecting hardware errors, recording them and correcting them where possible. Arm’s RAS extension provides this robustness to both the processor and system architectures.
RAS techniques help reduce unplanned outages by:
Detecting and correcting transient errors before they lead to application or system failure.
Identifying and replacing failing components.
Predicting failures in advance, enabling proactive maintenance during planned downtime.
In system software, there are two primary models for error handling:
Firmware-First Handling (FFH)
RAS events are initially reported to firmware. Firmware is responsible for reading the RAS error record provided by the hardware, converting the information into an error report, and notifying the operating system of the error. The operating system then takes the recovery actions.
Kernel-First Handling (KFH)
The operating system kernel is responsible for directly handling RAS events and managing recovery actions.
This reference design currently implements only FFH.
This implementation targets RAS support specifically for the Cortex-A720AE CPU core, including minimal firmware level support in Trusted Firmware-A (TF-A) and the Safety Island.
The operating system normally decides how to handle hardware errors, but the current implementation does not support OS-level error handling. At the firmware level, TF-A only reads and logs the error, while the Safety Island reads the error and signals the Safety Status Unit (SSU). In the future, TF-A will notify the OS about the error to enable full RAS handling.
Error types
For a RAS error, the following types can be recorded:
Corrected Error
The error was detected and corrected. It no longer affects the node’s state and has not been silently propagated. The node continues to operate normally.
Deferred Error
The error was detected but not corrected and has been deferred. It has not been silently propagated and may remain latent in the system.
Uncorrected Error
The error was detected but neither corrected nor deferred. It remains latent in the system. Uncorrected errors can be further classified as:
Unrecoverable: The error has not been silently propagated.
Uncontainable: The error may have been silently propagated. If isolation is not possible, a system shutdown is required to prevent catastrophic failure.
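The taxonomy above can be sketched as a small decoder that maps a raw status word to one of the listed error types. This is an illustrative sketch only: the bit positions are assumptions loosely modelled on the ERR<n>STATUS register of the Arm RAS architecture and must be checked against the specification, and the names `ErrorType` and `classify` are hypothetical.

```python
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    CORRECTED = "corrected"
    DEFERRED = "deferred"
    UNRECOVERABLE = "uncorrected (unrecoverable)"
    UNCONTAINABLE = "uncorrected (uncontainable)"


# Assumed status-word layout (verify against the specification before use).
STATUS_VALID = 1 << 30        # the record holds a valid error
STATUS_UE = 1 << 29           # an uncorrected error was recorded
STATUS_CE_MASK = 0b11 << 24   # non-zero if a corrected error was recorded
STATUS_DE = 1 << 23           # a deferred error was recorded
STATUS_UET_SHIFT = 20         # 2-bit uncorrected error type field
UET_UNRECOVERABLE = 0b01      # assumed encoding for "unrecoverable"


def classify(status: int) -> Optional[ErrorType]:
    """Return the most severe error type in a status word, or None if empty."""
    if not status & STATUS_VALID:
        return None
    if status & STATUS_UE:
        uet = (status >> STATUS_UET_SHIFT) & 0b11
        if uet == UET_UNRECOVERABLE:
            return ErrorType.UNRECOVERABLE
        # Any other encoding is treated as the worst case in this sketch.
        return ErrorType.UNCONTAINABLE
    if status & STATUS_DE:
        return ErrorType.DEFERRED
    if status & STATUS_CE_MASK:
        return ErrorType.CORRECTED
    return None
```

Checking the uncorrected bit first means that a record holding both a corrected and an uncorrected error is reported at the higher severity.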
The following diagram illustrates the taxonomy of RAS error types.
Error processing
The reference design supports handling two types of RAS interrupts:
Fault Handling Interrupt (FHI)
Error Recovery Interrupt (ERI)
These interrupts are routed to both the Primary Compute and the Safety Island. Each side uses different handling logic.
Safety Island error processing
When an error occurs, the Safety Island receives an error interrupt. The handler reads the error record, assesses the impact, and performs the appropriate action:
If the error is corrected, log diagnostic messages only.
If the error is uncorrected but containable, signal the SSU as a non-critical error.
If the error is uncorrected and uncontainable, signal the SSU as a critical error.
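The three rules above amount to a small decision function. The sketch below is a minimal model, not the Safety Island firmware: `Action` and `select_action` are hypothetical names, and deferred errors are left out because the list above does not say how they are treated.

```python
from enum import Enum


class Action(Enum):
    LOG_ONLY = "log diagnostic messages only"
    SSU_NON_CRITICAL = "signal the SSU as a non-critical error"
    SSU_CRITICAL = "signal the SSU as a critical error"


def select_action(corrected: bool, containable: bool) -> Action:
    """Pick the handler action for one decoded error record.

    `containable` is only consulted when the error is uncorrected.
    """
    if corrected:
        return Action.LOG_ONLY
    if containable:
        return Action.SSU_NON_CRITICAL
    return Action.SSU_CRITICAL
```

Keeping the decision separate from the code that logs or signals makes the policy easy to unit test without hardware.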
Primary Compute error processing
When an error occurs, the TF-A RAS handler (running at EL3) receives the interrupt. The handler reads the error record, logs the details, and clears the relevant registers.
In a standard FFH process, TF-A would generate an error report and notify the operating system. However, this notification is not yet supported in the current implementation, so TF-A takes no further recovery actions.
Both the Safety Island and TF-A access the error record. They coordinate through MHUv3 doorbell signals to avoid race conditions when clearing it.
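The read, log, and clear sequence of the TF-A handler can be illustrated with a toy model. This sketch assumes the status bits behave as write-one-to-clear (W1C), which should be confirmed against the register description. `ErrorRecord` and `handle_interrupt` are hypothetical names, and the MHUv3 doorbell coordination with the Safety Island is not modelled.

```python
from typing import Callable


class ErrorRecord:
    """Toy status register whose bits are assumed to be write-one-to-clear."""

    def __init__(self) -> None:
        self.status = 0

    def record(self, bits: int) -> None:
        # Stands in for hardware latching new error bits.
        self.status |= bits

    def read_status(self) -> int:
        return self.status

    def write_status(self, value: int) -> None:
        # W1C: every 1 written clears that bit; 0s leave bits unchanged.
        self.status &= ~value


def handle_interrupt(record: ErrorRecord, log: Callable[[str], None]) -> int:
    """Read the record, log it, then clear only the bits that were observed."""
    snapshot = record.read_status()
    if snapshot == 0:
        return 0  # nothing recorded, nothing to clear
    log(f"RAS error status: {snapshot:#010x}")
    record.write_status(snapshot)
    return snapshot
```

Writing back the snapshot, rather than an all-ones mask, means an error latched between the read and the write stays set and is handled on the next interrupt.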
Primary Compute CPU Core RAS
The RAS extension implemented in the Cortex-A720AE cores includes cache protection. It protects against RAM bitcell errors that could result in incorrect data being stored or read.
The Cortex-A720AE RAS includes the following features:
Cache protection with Single Error Detect (SED) parity on the functional RAMs that only contain clean data. This includes the L1 instruction cache tag, L1 instruction cache data, and the Memory Management Unit (MMU) RAMs.
Cache protection with Single Error Correct, Double Error Detect (SECDED) Error Correcting Code (ECC) on the functional RAMs that contain dirty data. This includes the L1 data cache tag, L1 data cache data, L2 cache tag, L2 cache data, and the L2 Transaction Queue (TQ) RAMs.
The core can continue operation in the presence of a single-bit RAM error.
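The difference between SED parity and SECDED ECC can be shown with a small worked example. The sketch below uses a textbook extended Hamming (8,4) code on 4-bit values. This is an assumption chosen for brevity: the source does not state the code or word width that the Cortex-A720AE RAMs use, and all function names are hypothetical.

```python
from typing import Optional, Tuple


def with_parity(data: int) -> int:
    """SED: append one even-parity bit as the least significant bit."""
    return (data << 1) | (bin(data).count("1") & 1)


def parity_ok(word: int) -> bool:
    """SED check: an odd number of set bits reveals an error but not where."""
    return bin(word).count("1") % 2 == 0


def secded_encode(nibble: int) -> int:
    """Encode 4 data bits as an 8-bit extended Hamming codeword."""
    code = [0] * 8  # index 0: overall parity, indices 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1            # makes the total parity even
    return sum(bit << i for i, bit in enumerate(code))


def secded_decode(word: int) -> Tuple[Optional[int], str]:
    """Return (data, status); status is clean, corrected, or uncorrectable."""
    code = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions; 0 for a valid codeword
    overall_odd = sum(code) & 1
    if overall_odd:
        # Odd parity: assume a single flipped bit. A zero syndrome points
        # at index 0, the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Even parity with a non-zero syndrome: two bits flipped.
        return None, "uncorrectable"
    else:
        status = "clean"
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return data, status
```

A single flipped bit decodes to the original value with status "corrected", while two flipped bits are reported as "uncorrectable". Parity alone can only report that some odd number of bits changed.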
Error simulation
Error injection uses the error detection and reporting registers to simulate errors so that error-handling mechanisms can be tested. The Cortex-A720AE core supports injection of the following error types:
Injection of Corrected Errors:
A Corrected Error (CE) is generated for a single-bit ECC error on an L1 data cache access.
Injection of Deferred Errors:
A Deferred Error (DE) is generated for a double-bit ECC error on eviction of a cache line from the L1 cache to the L2 cache, or as a result of a snoop on the L1 cache.
Injection of Uncontainable Errors:
An Uncontainable Error (UC) is generated for a double-bit ECC error on the L1 and L2 tag RAMs following an eviction.