Fault Management
Introduction
The Fault Management subsystem for the Safety Island provides a mechanism for capturing, reporting and collating faults from supported hardware in safety-critical designs.
The subsystem interfaces with the following types of devices:
A fault device, which reports faults from its safety mechanisms. It may also report faults originating from other fault devices to support the creation of a fault device tree.
A safety state device, which manages a state in reaction to reported faults.
Supporting driver implementations are provided for the following Arm hardware designs:
A Device Fault Management Unit (Device FMU): a fault device attached to a GIC-720AE interrupt controller.
A System Fault Management Unit (System FMU): a fault device which collates faults from upstream FMUs.
A Safety Status Unit (SSU): a safety state device which manages a state machine in response to faults in a safety-critical system.
Faults
A unique fault (i.e. generated by a specific safety mechanism and reported by a fault device implementation) is represented by the subsystem and driver interfaces as a device-specific 32-bit integer along with a handle to the originating device.
A fault may be critical or non-critical and this affects how it is processed by the subsystem.
Fault Device Trees
The subsystem is configured with a list of “root” fault devices - those located at the root of a fault device tree. Root fault devices are typically collators of faults from multiple upstream fault devices (possibly recursively) and may also directly affect the state of a connected safety state device.
The diagram below shows an illustrative fault device tree. (For the simpler Kronos topology, see Kronos Deployment below.)
Safety States
The SSU state machine has 4 safety states:

TEST: Self-test
SAFE: Safe operation
ERRN: Non-critical fault detected
ERRC: Critical fault detected
Control signals from software:
compl_ok: Diagnostics complete or non-critical fault cleared
nce_ok: Non-critical fault diagnosed
ce_not_ok: Critical fault diagnosed
nce_not_ok: Non-correctable non-critical fault
Control signals connected in hardware to the root fault device:

nc_error: Non-critical error
c_error: Critical error
reset
TEST is the initial state on boot. The software is responsible for transitioning to SAFE after the successful completion of a self-test routine. ERRC represents a critical system failure, which can only be recovered by resetting the system. A non-critical fault causes a transition to ERRN, which can either be recovered back to SAFE or promoted to ERRC by the software.
The diagram below shows all the possible transitions between these states using these signals.
Finite State Machine (FSM) States and Transitions:
From reset, the FSM defaults to the TEST state. It remains in this state until software has completed any power-up tests. If the software-controlled tests pass, a write can be issued to move the FSM to the SAFE state. If the tests fail, a write can be issued to move the FSM to the ERRN state, indicating that an error has occurred that may be resolvable. After further diagnosis, the software can issue a write depending on whether the error was resolved, moving the FSM to SAFE if it was resolved or to ERRC if it was not. When in the SAFE state, the FSM can only be moved by one of:

a reset, moving it back to TEST
a non-critical error interrupt, moving it to ERRN
a critical error interrupt, moving it to ERRC

If a critical and a non-critical error occur at the same time, the critical error takes precedence and the FSM moves to ERRC.
Design
The implementation and functionality of the Fault Management subsystem for the Safety Island are grounded in the Zephyr real-time operating system (RTOS) environment.
Drivers
Driver interfaces are provided for fault devices and safety state devices. Specific driver implementations with devicetree bindings are provided for the Arm FMU and Arm SSU.
The public driver interfaces are described under components/safety_island/zephyr/src/include/zephyr/drivers/fault_mgmt
The drivers are instantiated in the devicetree using bindings under components/safety_island/zephyr/src/dts/bindings/fault_mgmt
Fault Management Unit
The FMU driver is an implementation of a fault device. Inside the driver, one of two driver implementations is selected at runtime to handle differences between the GIC-720AE and the System FMU programmers’ views.
It is expected that interrupts are only defined for root FMUs. If the root FMU is a System FMU, it will collate faults from multiple upstream sources. The driver in this case will inspect the status of other FMUs in the tree when a fault occurs to determine the exact origin and cause of the fault.
The FMU driver allows a single callback to be registered, through which incoming faults are reported.
Safety Status Unit
The SSU driver is an implementation of a safety state device. It implements the safety state device interface which allows its state to be read and controlled.
Subsystem
The Fault Management subsystem manages two fault-handling threads (one for critical faults and another for non-critical faults), which listen for queued faults from any configured root fault device and forward them to all configured fault handlers.
Multiple fault handlers can be statically registered (using the FAULT_MGMT_HANDLER_DEFINE macro), each of which is called once per root fault device on initialization, then once per reported fault. Handlers are registered with a unique priority that determines the order in which they are called.
Certain subsystem features are themselves implemented as handlers. It is expected that in order to implement a Fault Management policy for a safety-critical system design, one or more additional custom fault handlers would be required to perform tasks such as:
Configuring the criticality and enabled state of fault device safety mechanisms.
Performing a self-test routine before notifying the safety state device that the system is safe for operation.
Reacting to non-critical faults and deciding whether to perform a corrective action to reset the safety state or promote to a critical fault. This decision may be based on the provided fault count storage.
The subsystem has configuration options to manage the stack space, priority and queue size of both threads, which should be tuned and validated according to deployment requirements. Specifically, more complex custom handlers may require more stack space as they are called on the subsystem threads.
The public interface for the subsystem and its components is described under components/safety_island/zephyr/src/include/zephyr/subsys/fault_mgmt
Safety component
The safety component contains additional interfaces to facilitate reading and updating a system’s safety state.
If enabled, this component requires (and validates at boot) that all root fault devices have an attached safety state device.
Storage component
The storage component manages historical counts per safety mechanism per fault device.
Two storage backends are provided:
Trusted Firmware-M PSA Protected Storage Interfaces, with an in-memory cache populated at boot.
A non-persistent in-memory implementation, using only Zephyr's sys_hash_map.
For the PSA backend, there are configuration options to manage the storage key and the maximum record count, which should be tuned and validated depending on the number of distinct faults and devices and/or other system constraints.
Kronos Deployment
The Kronos FVP models:
An SSU in the Safety Island.
A System FMU in the Safety Island, attached to the SSU.
An FMU attached to the GIC-720AE in the Primary Compute, attached to the System FMU.
The Kronos Fault Management application (components/safety_island/zephyr/src/apps/fault_mgmt) provides Kconfig and devicetree overlays for a sample deployment using these devices on Safety Island Cluster 1. The functionality can be evaluated using the Zephyr shell on this cluster. Additionally, this application serves as the basis for the automated validation (see Integration Tests Validating the Fault Management Subsystem).
For fault count storage, the application uses the PSA Protected Storage implementation provided by TF-M. CONFIG_MAX_PSA_PROTECTED_STORAGE_SIZE is configured according to TF-M storage constraints.
Validation
The Kronos Reference Design contains integration tests covering the overall FMU and SSU integration, described at Integration Tests Validating the Fault Management Subsystem.
Shell Reference
The subsystem provides an optional shell command (enabled using CONFIG_FAULT_MGMT_SHELL) which exposes the subsystem API interactively for evaluation and validation purposes. Its sub-commands are described below.
fault tree - Print a description of the fault device tree (including any safety state devices) to the console. The device names printed here can be used in the other commands below.

fault inject DEVICE FAULT_ID - Inject a specific FAULT_ID into DEVICE. The resultant fault will be logged on the console.

fault set_enabled DEVICE FAULT_ID ENABLED - Enable or disable a specific FAULT_ID on a DEVICE. Set ENABLED to 1 to enable or 0 to disable.

fault set_critical DEVICE FAULT_ID CRITICAL - Configure a specific FAULT_ID on a DEVICE as critical or non-critical. Set CRITICAL to 1 to set as critical or 0 to set as non-critical.

The FAULT_ID above refers to a 32-bit integer whose valid values are device-specific (e.g. 0x100 represents an APB access error for a System FMU but a GICD Clock Error for a GIC-720AE FMU) and opaque to the driver itself.

The following are only available if CONFIG_FAULT_MGMT_SAFETY is enabled:

fault safety_status DEVICE - Print the current status of safety state DEVICE to the console.

fault safety_control DEVICE SIGNAL - Send SIGNAL to safety state DEVICE.

The following are only available if CONFIG_FAULT_MGMT_STORAGE is enabled:

fault list [THRESHOLD] - List all reported fault counts. The optional THRESHOLD filters out faults below a certain count.

fault summary - Show a more detailed summary of the fault counts, including a list of the most reported faults.

fault count - Print the total count of reported faults.

fault clear - Reset all fault counts back to zero.
The test suite at yocto/meta-kronos/lib/oeqa/runtime/cases/test_10_fault_mgmt.py demonstrates usage of these sub-commands.
Safety Considerations
The Fault Management subsystem has the following features to mitigate the risks of unexpected runtime behavior causing a denial of service:
Iterative methods that take a fixed amount of stack space based on CONFIG_FAULT_MGMT_MAX_TREE_DEPTH are used to traverse fault device trees.

Invalid combinations of configuration values (e.g. a root FMU without IRQ numbers) are detected at compile time where possible.
The subsystem functionality is composed of independent handlers which can be disabled if not required.
Note that there are conditions where the subsystem will panic and the application running on the Safety Island cluster will stop processing further faults (non-exhaustive):
Faults arrive more quickly than they are handled over a long enough period for a queue to fill up.
A fault arrives at a System FMU from an unknown Device FMU.
The number of stored fault records exceeds the amount of available storage.
An unexpected error code is returned when attempting to write a fault count to the storage.