Transient Fault Protection (TFP) for Application Processor

Introduction

The Cortex®‑A720AE core used in the Application Processor supports the Transient Fault Protection (TFP) feature, which enhances reliability by adding logic that checks the integrity of flip-flops in the functional (non-debug) logic. The mechanism is designed to detect a single transient fault affecting a group of functional flip-flops. It can significantly boost the transient fault detection capability of the core in safety-critical applications and can be a key component in achieving Single Point Fault Metric (SPFM) (transient) and Safe Failure Fraction (SFF) (transient) goals at the core level.

Transient Fault Detection Mechanism

  • When TFP is enabled in hardware, additional logic is instantiated to calculate the parity of each group of functional flops that share a common clock, reset, and enable term.

  • The parity information is stored in an additional flip-flop, called the parity-flop.

  • The output of this flop is checked against the parity of the data stored in the associated group of functional flops. A difference between the two indicates that a fault has occurred in the functional flops or in the parity-flop itself.

  • The error signals from the parity groups are combined per functional unit using a logical OR reduction and routed to the RAS registers for reporting and error signaling.
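The steps above can be illustrated with a small software model. This is only a sketch: the 32-bit group width, the structure layout, and all function names are assumptions made for illustration and do not describe the actual hardware implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one parity group (up to 32 functional flops). */
struct parity_group {
    uint32_t flops;       /* state held by the functional flops */
    bool     parity_flop; /* parity bit held by the parity-flop */
};

/* XOR-reduce all bits: true when an odd number of bits are set. */
static bool parity32(uint32_t v)
{
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (v & 1u) != 0;
}

/* Clocked update: the data and its parity are captured together. */
static void group_write(struct parity_group *g, uint32_t data)
{
    assert(g);
    g->flops = data;
    g->parity_flop = parity32(data);
}

/* Checker: error when the stored parity disagrees with the recomputed one. */
static bool group_error(const struct parity_group *g)
{
    assert(g);
    return parity32(g->flops) != g->parity_flop;
}

/* OR-reduce the error signals of all parity groups in one functional unit. */
static bool unit_error(const struct parity_group *groups, int count)
{
    assert(groups && count > 0);
    bool err = false;
    for (int i = 0; i < count; i++)
        err = err || group_error(&groups[i]);
    return err;
}
```

In this model, `unit_error` returns false after every group has been written with `group_write`. It returns true once a single bit of any group's `flops` is inverted, or once a `parity_flop` alone is inverted, which mirrors the statement that a fault in either the functional flops or the parity-flop is flagged.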

Fig. 40 Transient Fault Protection Mechanism


Fault Detection Constraints

  • The flop parity mechanism is capable of detecting a single transient fault within a parity group.

  • A fault that causes an even number of bit-flips within a parity group cannot be detected by the transient fault protection logic, because the parity of the group is unchanged.
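The constraint follows from the arithmetic of a single parity bit. The sketch below, with hypothetical helper names and an assumed 32-bit group, applies a flip mask to the stored data and compares parities; the internal assertion records the property that detection depends only on whether the number of flipped bits is odd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Count the set bits of v (here: the number of upset flops in a mask). */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Parity of v: true when an odd number of bits are set. */
static bool parity32(uint32_t v)
{
    return (popcount32(v) & 1u) != 0;
}

/*
 * Returns true when a single-parity-bit check would flag the upset
 * described by flip_mask (each set bit inverts one functional flop).
 * The parity-flop is assumed to be unaffected in this scenario.
 */
static bool fault_detected(uint32_t data, uint32_t flip_mask)
{
    bool stored   = parity32(data);             /* value in the parity-flop */
    bool observed = parity32(data ^ flip_mask); /* parity after the upset   */
    bool detected = (stored != observed);

    /* Detection depends only on whether the number of flips is odd. */
    assert(detected == ((popcount32(flip_mask) & 1u) != 0));
    return detected;
}
```

For instance, `fault_detected(0xDEADBEEF, 0x10)` (one flip) is true, while `fault_detected(0xDEADBEEF, 0x11)` (two flips) is false. A plain parity check would also flag three flips, but the documented guarantee covers only a single fault per group.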

Fault Reaction

  • Errors detected by the transient fault protection logic cannot be contained, and the core provides no specific hardware recovery features for them.

  • The errors detected by the flop parity mechanism are reported in the ERXSTATUS_EL1 register.

  • The detected errors are reported as Uncorrected Errors of type Uncontainable:

Table 2 Error Reporting Fields

Register Bit         Value    Description
ERXSTATUS_EL1.UE     1'b1     Uncorrected Error
ERXSTATUS_EL1.UET    2'b00    Uncontainable Type

  • Additional diagnostic information is provided by the IERR field of the ERXSTATUS_EL1 register. The IERR code indicates which TFP chunk (functional unit) detected the parity error. The IERR codes for the Cortex®‑A720AE core are as follows:

Table 3 IERR Codes for TFP Protection Units

IERR Code    Affected Protection Unit
0b00100      Data side (Dside)
0b00101      Vector Unit (VX)
0b00110      Memory Management Unit (MMU)
0b00111      Level 2 Cache
0b01000      GIC CPU Interface (INTC)
0b01001      Debug Trace
0b01010      Instruction side (Iside)
0b01011      Decode
0b01100      Rename
0b01101      Commit
0b01110      Issue
0b01111      Iexecute
0b10000      Axis Bridge

Note

This field is valid only when ERXSTATUS_EL1.V is 0b1 and ERXSTATUS_EL1.SERR is 0x1A. In all other cases, the field is reported as UNKNOWN.
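The reporting fields and the validity rule in the Note can be combined into a single decode step. The following C11 sketch is not the TF-A or scp-firmware code. The bit positions of V, UE, UET, IERR, and SERR are assumed from the generic Arm RAS ERR<n>STATUS layout and should be checked against the core's Technical Reference Manual. Apart from AXIS_BRIDGE, which appears in the log examples, the unit name strings are hypothetical labels derived from Table 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field positions (generic Arm RAS ERR<n>STATUS layout). */
#define ERXSTATUS_V_SHIFT     30u
#define ERXSTATUS_UE_SHIFT    29u
#define ERXSTATUS_UET_SHIFT   20u
#define ERXSTATUS_UET_MASK    0x3u
#define ERXSTATUS_IERR_SHIFT  8u
#define ERXSTATUS_IERR_MASK   0xFFu
#define ERXSTATUS_SERR_MASK   0xFFu

#define SERR_IERR_VALID  0x1Au /* SERR value for which IERR is valid (Note) */
#define UET_UNCONTAINABLE 0x0u /* Table 2: UET = 2'b00 */

/* Table 3, indexed by IERR code. Unlisted codes stay NULL. */
static const char *const tfp_unit_names[] = {
    [0x04] = "DSIDE",  [0x05] = "VX",          [0x06] = "MMU",
    [0x07] = "L2",     [0x08] = "INTC",        [0x09] = "DEBUG_TRACE",
    [0x0A] = "ISIDE",  [0x0B] = "DECODE",      [0x0C] = "RENAME",
    [0x0D] = "COMMIT", [0x0E] = "ISSUE",       [0x0F] = "IEXECUTE",
    [0x10] = "AXIS_BRIDGE",
};
#define TFP_UNIT_COUNT (sizeof(tfp_unit_names) / sizeof(tfp_unit_names[0]))

static_assert(TFP_UNIT_COUNT == 0x11, "table must end at IERR 0b10000");

/*
 * Returns the name of the unit that reported a TFP error, or NULL when the
 * status value does not describe a valid, uncontainable TFP error.
 */
static const char *tfp_fault_unit(uint64_t status)
{
    bool     valid = ((status >> ERXSTATUS_V_SHIFT) & 1u) != 0;
    bool     ue    = ((status >> ERXSTATUS_UE_SHIFT) & 1u) != 0;
    unsigned uet   = (unsigned)((status >> ERXSTATUS_UET_SHIFT) & ERXSTATUS_UET_MASK);
    unsigned ierr  = (unsigned)((status >> ERXSTATUS_IERR_SHIFT) & ERXSTATUS_IERR_MASK);
    unsigned serr  = (unsigned)(status & ERXSTATUS_SERR_MASK);

    /* IERR is UNKNOWN unless V is set and SERR is 0x1A. */
    if (!valid || serr != SERR_IERR_VALID)
        return NULL;
    /* TFP errors are reported as Uncorrected, Uncontainable. */
    if (!ue || uet != UET_UNCONTAINABLE)
        return NULL;
    if (ierr >= TFP_UNIT_COUNT)
        return NULL;
    return tfp_unit_names[ierr]; /* NULL for codes not listed in Table 3 */
}
```

Under these assumptions, a status value of 0x6000101A (V = 1, UE = 1, UET = 0b00, IERR = 0b10000, SERR = 0x1A) decodes to the Axis Bridge entry, while the same value with V cleared yields NULL.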

Implementation in Software

The software implementation of the TFP feature comprises the following elements:

Enabling TFP

To enable detection and reporting of errors via the transient fault protection mechanism, software sets the following fields in RAS registers:

Table 4 TFP Control Registers

Register       Bit    Description
ERXCTLR_EL1    0      ED: Enable error detection and reporting globally
ERXCTLR_EL1    33     TFPEN: Enable TFP error detection and reporting

Enabling TFP error reporting is recommended in the Mixed-Configuration Hybrid mode, which is typically employed as per the Aspen specifications and in which all cores operate in Hybrid split mode.
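As a minimal sketch of the enable step, the helpers below compute the ERXCTLR_EL1 value with the two bits from Table 4 set. The macro and function names are hypothetical. The register access itself is omitted so that the example stays host-runnable: on the target, privileged software would select the core's error record (architecturally through ERRSELR_EL1) and then read, modify, and write ERXCTLR_EL1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from Table 4. */
#define ERXCTLR_ED_BIT     (UINT64_C(1) << 0)  /* global error detection/reporting */
#define ERXCTLR_TFPEN_BIT  (UINT64_C(1) << 33) /* TFP error detection/reporting    */

/* TFPEN lies above bit 31, so the mask must be built with 64-bit arithmetic. */
static_assert(ERXCTLR_TFPEN_BIT > UINT32_MAX, "TFPEN needs a 64-bit mask");

/* Returns the control value with ED and TFPEN set, other bits preserved. */
static uint64_t tfp_enable_ctlr(uint64_t ctlr)
{
    return ctlr | ERXCTLR_ED_BIT | ERXCTLR_TFPEN_BIT;
}

/* True only when both ED and TFPEN are set in the given control value. */
static bool tfp_is_enabled(uint64_t ctlr)
{
    const uint64_t required = ERXCTLR_ED_BIT | ERXCTLR_TFPEN_BIT;
    return (ctlr & required) == required;
}
```

Starting from a control value of 0, `tfp_enable_ctlr` yields 0x200000001. Building the TFPEN mask from a plain `int` constant (`1 << 33`) would be undefined behavior in C, which is why the sketch uses `UINT64_C`.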

Error Handling

When a transient fault is detected by the flop parity mechanism:

  • The RAS error record is updated.

  • A fault handling interrupt (FHI) is raised.

  • In the TF-A RAS error handler, the ERXSTATUS_EL1 register is examined to confirm a transient fault. The corresponding error information (IERR) identifies the source of the fault, which is reported in a debug print, for example:

WARNING: CPU RAS: TFP Error Detected : AXIS_BRIDGE

  • Similar processing is implemented in the scp-firmware running on SI-CL0, as described in Safety Island error processing, where a diagnostic message is logged, for example:

AP detected TFP Error : AXIS_BRIDGE

Validation

The TFP enablement is validated in the Primary Compute CPUs RAS tests.