Transient Fault Protection (TFP) for Application Processor
Introduction
The Cortex®‑A720AE core used in the Application Processor supports the Transient Fault Protection (TFP) feature, which enhances reliability by adding logic that checks the integrity of flip-flops in the functional (non-debug) logic. The mechanism is designed to detect single transient faults affecting a group of functional flip-flops. It can significantly boost the transient fault detection capability of the core in safety-critical applications, and can be a key component in achieving Single Point Fault Metric (SPFM) (transient) and Safe Failure Fraction (SFF) (transient) goals at the core level.
Transient Fault Detection Mechanism
When TFP is enabled in hardware, additional logic is instantiated to calculate the parity for each group of functional flops that share a common clock, reset, and enable term.
The parity information is stored in an additional flip-flop, called the parity-flop.
The output of this flop is checked against the parity of the data stored in the associated group of functional-flops; a difference between the two indicates that a fault has occurred in the functional-flops or in the parity-flop itself.
The error signals from each group of parity logic are combined by functional unit using a logical OR reduction and routed to the RAS registers for reporting and error signaling.
Fig. 40 Transient Fault Protection Mechanism
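The mechanism described above can be modelled in software to make the data flow concrete. The following is a minimal sketch, not the hardware implementation: the `parity_group` structure, the 64-bit group width, and all function names are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one parity group: up to 64 functional
 * flops plus one parity-flop. Real group sizes are defined by the RTL. */
struct parity_group {
    uint64_t flops;       /* state of the functional flops */
    bool     parity_flop; /* parity captured when the flops were written */
};

/* XOR-reduce all 64 bits: true when an odd number of bits are set. */
static bool parity64(uint64_t v)
{
    v ^= v >> 32;
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (v & 1u) != 0;
}

/* Model a clocked write: the data and its parity are captured together. */
static void group_write(struct parity_group *g, uint64_t data)
{
    g->flops = data;
    g->parity_flop = parity64(data);
}

/* Checker: error when the stored parity disagrees with the parity
 * recomputed from the functional flops. */
static bool group_error(const struct parity_group *g)
{
    return parity64(g->flops) != g->parity_flop;
}

/* Per-unit error signal: logical OR reduction over the unit's groups. */
static bool unit_error(const struct parity_group *groups, size_t n)
{
    bool err = false;
    for (size_t i = 0; i < n; i++)
        err = err || group_error(&groups[i]);
    return err;
}
```

In this model, inverting one bit of `flops` or inverting `parity_flop` after a `group_write()` makes `group_error()` return true, which matches the statement that the fault may be in either the functional-flops or the parity-flop.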
Fault Detection Constraints
The flop parity mechanism is capable of detecting a single transient fault within a parity group.
A fault that causes an even number of bit-flips cannot be detected by the transient fault protection logic.
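The constraint follows directly from how parity works: each bit-flip toggles the recomputed parity, so two flips cancel out. The sketch below illustrates this under the assumption that the parity-flop itself is unaffected; `flips_detected` and the 64-bit width are hypothetical names and sizes used only for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* XOR-reduce all 64 bits: true when an odd number of bits are set. */
static bool parity64(uint64_t v)
{
    v ^= v >> 32;
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (v & 1u) != 0;
}

/* Returns true if a parity check would flag the bits in flip_mask being
 * inverted in data. The stored parity is assumed to be intact. */
static bool flips_detected(uint64_t data, uint64_t flip_mask)
{
    bool stored   = parity64(data);             /* captured at write time */
    bool observed = parity64(data ^ flip_mask); /* recomputed after fault */
    return stored != observed;
}
```

With this model, a mask with one bit set (for example `0x1`) is detected, while a mask with two bits set (for example `0x3`) is not.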
Fault Reaction
Errors detected by the transient fault protection logic cannot be contained, and the logic does not include any specific features for hardware recovery.
The errors detected by the flop parity mechanism are reported in the ERXSTATUS_EL1 register. The detected errors are reported as Uncorrected Errors of type Uncontainable:
| Register Bit | Value | Description |
|---|---|---|
|  | 1’b1 | Uncorrected Error |
|  | 2’b00 | Uncontainable Type |
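A handler could test for this combination as sketched below. The field positions are an assumption: they follow the generic Arm RAS `ERR<n>STATUS` layout (UE at bit 29, UET at bits [21:20]) and are not stated in this document, so they should be confirmed against the core's Technical Reference Manual. The macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED field positions (generic Arm RAS ERR<n>STATUS layout);
 * confirm against the core TRM before use. */
#define STATUS_UE_BIT      29u  /* Uncorrected Error */
#define STATUS_UET_SHIFT   20u  /* Uncorrected Error Type, 2 bits */
#define STATUS_UET_MASK    0x3u
#define UET_UNCONTAINABLE  0x0u /* 2'b00 per the table above */

/* True when the status value reports an Uncorrected Error of type
 * Uncontainable, which is how TFP errors are reported. */
static bool status_is_uncontainable_ue(uint64_t status)
{
    bool ue = ((status >> STATUS_UE_BIT) & 1u) != 0;
    uint64_t uet = (status >> STATUS_UET_SHIFT) & STATUS_UET_MASK;
    return ue && (uet == UET_UNCONTAINABLE);
}
```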
Additional diagnostic information is provided by the IERR field within the ERXSTATUS_EL1 register. The IERR code indicates which TFP chunk (or functional unit) detected the parity error. The IERR codes for the Cortex®‑A720AE core are as follows:
| IERR Code | Affected Protection Unit |
|---|---|
| 0b00100 | Data side (Dside) |
| 0b00101 | Vector Unit (VX) |
| 0b00110 | Memory Management Unit (MMU) |
| 0b00111 | Level 2 Cache |
| 0b01000 | GIC CPU Interface (INTC) |
| 0b01001 | Debug Trace |
| 0b01010 | Instruction side (Iside) |
| 0b01011 | Decode |
| 0b01100 | Rename |
| 0b01101 | Commit |
| 0b01110 | Issue |
| 0b01111 | Iexecute |
| 0b10000 | Axis Bridge |
Note
This field is valid only when ERXSTATUS_EL1.V is 0b1
and ERXSTATUS_EL1.SERR is 0x1A. In all other cases,
the field is reported as UNKNOWN.
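The validity rule and the IERR table can be combined into a single lookup, sketched below. The IERR codes and the SERR value 0x1A come from this section; the bit positions of V, IERR, and SERR are assumptions based on the generic Arm RAS `ERR<n>STATUS` layout and should be checked against the core TRM. All identifiers are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMED field positions (generic Arm RAS ERR<n>STATUS layout). */
#define STATUS_V_BIT       30u   /* record valid */
#define STATUS_IERR_SHIFT  8u
#define STATUS_IERR_MASK   0xFFu
#define STATUS_SERR_MASK   0xFFu
#define SERR_TFP           0x1Au /* SERR value from the note above */

#define IERR_TFP_FIRST     0x04u /* 0b00100, Data side */
#define IERR_TFP_LAST      0x10u /* 0b10000, Axis Bridge */

/* Indexed by (IERR - IERR_TFP_FIRST); names taken from the IERR table. */
static const char *const tfp_units[] = {
    "Data side (Dside)", "Vector Unit (VX)", "Memory Management Unit (MMU)",
    "Level 2 Cache", "GIC CPU Interface (INTC)", "Debug Trace",
    "Instruction side (Iside)", "Decode", "Rename", "Commit", "Issue",
    "Iexecute", "Axis Bridge",
};

/* Returns the protection unit name, or NULL when IERR is not valid for
 * TFP (V clear, SERR != 0x1A, or a code outside the table). */
static const char *tfp_unit_from_status(uint64_t status)
{
    if (((status >> STATUS_V_BIT) & 1u) == 0)
        return NULL;
    if ((status & STATUS_SERR_MASK) != SERR_TFP)
        return NULL;

    uint64_t ierr = (status >> STATUS_IERR_SHIFT) & STATUS_IERR_MASK;
    if (ierr < IERR_TFP_FIRST || ierr > IERR_TFP_LAST)
        return NULL;
    return tfp_units[ierr - IERR_TFP_FIRST];
}
```

Returning NULL for the invalid cases keeps callers from interpreting a field that the note defines as UNKNOWN.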
Implementation in Software
The software implementation of the TFP feature comprises the following elements:
Enabling TFP
To enable detection and reporting of errors via the transient fault protection mechanism, software sets the following fields in RAS registers:
| Register | Bit | Description |
|---|---|---|
|  | 0 | ED: Enable error detection and reporting globally |
|  | 33 | TFPEN: Enable TFP error detection and reporting |
It is recommended to enable TFP error reporting in Mixed-Configuration Hybrid mode, which is typically employed per the Aspen specifications, where all cores operate in Hybrid split mode.
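Setting the two enable fields amounts to a read-modify-write of bits 0 and 33. The sketch below operates on plain 64-bit values because the register names are not shown in the table above. It does not assume that both fields live in the same register, and the system register read and write accessors are left to the caller. The macro and function names are hypothetical.

```c
#include <stdint.h>

#define RAS_ED_BIT     0u   /* ED: global error detection and reporting */
#define RAS_TFPEN_BIT  33u  /* TFPEN: TFP error detection and reporting */

/* Return reg with ED set; all other bits are preserved. */
static uint64_t ras_enable_ed(uint64_t reg)
{
    return reg | (UINT64_C(1) << RAS_ED_BIT);
}

/* Return reg with TFPEN set; all other bits are preserved.
 * UINT64_C(1) matters here: shifting a 32-bit constant by 33 would be
 * undefined behaviour in C. */
static uint64_t ras_enable_tfpen(uint64_t reg)
{
    return reg | (UINT64_C(1) << RAS_TFPEN_BIT);
}
```

A caller would read the relevant register, pass the value through the matching helper, and write the result back.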
Error Handling
When a transient fault is detected by the flop parity mechanism:
The RAS error record is updated.
A fault handling interrupt (FHI) is raised.
In the TF-A RAS error handler, the ERXSTATUS_EL1 register is examined to confirm a transient fault. The corresponding error information (IERR) indicates the source of the fault, which is reported in a debug print, for example:
WARNING: CPU RAS: TFP Error Detected : AXIS_BRIDGE
Similar processing is implemented in scp-firmware running on SI-CL0, as described in Safety Island error processing, where a diagnostic message is logged, for example:
AP detected TFP Error : AXIS_BRIDGE
Validation
The TFP enablement is validated in the Primary Compute CPUs RAS tests.