..
 # SPDX-FileCopyrightText: Copyright 2023-2024 Arm Limited and/or its
 # affiliates
 #
 # SPDX-License-Identifier: MIT

.. _design_applications_fault_mgmt:

################
Fault Management
################

************
Introduction
************

The Fault Management subsystem for the Safety Island provides a mechanism for
capturing, reporting and collating faults from supported hardware in
safety-critical designs.

The subsystem interfaces with the following types of devices:

* A fault device, which reports faults from its safety mechanisms. It may
  also report faults originating from other fault devices to support the
  creation of a fault device tree.
* A safety state device, which manages a state in reaction to reported
  faults.

Supporting driver implementations are provided for the following Arm hardware
designs:

* A Device Fault Management Unit (Device FMU): a fault device attached to a
  GIC-720AE interrupt controller.
* A System Fault Management Unit (System FMU): a fault device which collates
  faults from upstream FMUs.
* A Safety Status Unit (SSU): a safety state device which manages a state
  machine in response to faults in a safety-critical system.

Faults
======

A unique fault (i.e. one generated by a specific safety mechanism and
reported by a fault device implementation) is represented by the subsystem
and driver interfaces as a device-specific 32-bit integer along with a handle
to the originating device. A fault may be critical or non-critical, and this
affects how it is processed by the subsystem.

Fault Device Trees
==================

The subsystem is configured with a list of "root" fault devices - those
located at the root of a fault device tree. Root fault devices are typically
collators of faults from multiple upstream fault devices (possibly
recursively) and may also directly affect the state of a connected safety
state device.

The diagram below shows an illustrative fault device tree. (For the simpler
Kronos topology, see :ref:`design_applications_fault_mgmt_kronos_deployment`
below.)

.. image:: ../images/sample_fault_device_tree.*
   :align: center
   :alt: A Sample Fault Device Tree

Safety States
=============

The SSU state machine has 4 safety states:

* ``TEST``: Self-test
* ``SAFE``: Safe operation
* ``ERRN``: Non-critical fault detected
* ``ERRC``: Critical fault detected

Control signals from software:

* ``compl_ok``: Diagnostics complete or non-critical fault cleared
* ``nce_ok``: Non-critical fault diagnosed
* ``ce_not_ok``: Critical fault diagnosed
* ``nce_not_ok``: Non-correctable non-critical fault

Control signals connected in hardware to the root fault device:

* ``nc_error``: Non-critical error
* ``c_error``: Critical error
* ``reset``

``TEST`` is the initial state on boot. The software is responsible for
transitioning to ``SAFE`` after the successful completion of a self-test
routine. ``ERRC`` represents a critical system failure, from which the system
can only recover by resetting. A non-critical fault causes a transition to
``ERRN``, which can either be recovered back to ``SAFE`` or promoted to
``ERRC`` by the software.

The diagram below shows all the possible transitions between these states
using these signals.

.. image:: ../images/ssu_states.*
   :align: center
   :alt: SSU States

Finite State Machine (FSM) states and transitions:

* From reset, the FSM defaults to the ``TEST`` state.
* The FSM stays in this state until the software has completed any power-up
  tests. If the software-controlled tests pass, a write can be issued to move
  the FSM to the ``SAFE`` state (see the sketch after this list).
* If the tests fail, a write can be issued to move the FSM to the ``ERRN``
  state, indicating that an error has occurred that may be resolvable.
* After further tests, the software can issue a write depending on whether
  the error was determined to have been resolved, moving the FSM to ``SAFE``
  if it was resolved or to ``ERRC`` if it was not.
* When in the ``SAFE`` state, the FSM can only be moved by one of the
  following:

  - a reset, moving it back to ``TEST``
  - a non-critical error interrupt, moving it to ``ERRN``
  - a critical error interrupt, moving it to ``ERRC``
  - if a critical and a non-critical error occur at the same time, the
    critical error takes precedence and the FSM shall move to ``ERRC``
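As an illustration of the ``TEST`` to ``SAFE`` transition, the following is a
minimal sketch of a boot-time self-test routine. The call
``fault_safety_state_control()``, the signal macro and the
``run_power_up_tests()`` helper are hypothetical stand-ins introduced here
for illustration only; the real safety state device interface is defined by
the driver headers referenced in the :ref:`design
<design_applications_fault_mgmt_design>` section below.

.. code-block:: c

   /*
    * Illustrative sketch only: fault_safety_state_control() and the signal
    * macro below are hypothetical stand-ins for the safety state device
    * interface, whose real names are defined by the driver headers.
    */
   #include <errno.h>
   #include <stdbool.h>
   #include <stdint.h>
   #include <zephyr/device.h>

   /* Hypothetical call: send a control signal to a safety state device. */
   int fault_safety_state_control(const struct device *dev, uint32_t signal);
   #define FAULT_SAFETY_SIGNAL_COMPL_OK 0u /* Hypothetical signal ID. */

   extern bool run_power_up_tests(void); /* Deployment-specific self-test. */

   int boot_self_test(const struct device *ssu)
   {
       if (!device_is_ready(ssu)) {
           return -ENODEV;
       }

       /* The SSU is in TEST after reset; run the power-up tests first. */
       if (!run_power_up_tests()) {
           /*
            * On failure, a different control signal would move the FSM to
            * ERRN instead (see the transition list above); the exact
            * signal choice is deployment-specific, so only the pass path
            * is shown here.
            */
           return -EIO;
       }

       /* Tests passed: signal completion to move the FSM TEST -> SAFE. */
       return fault_safety_state_control(ssu, FAULT_SAFETY_SIGNAL_COMPL_OK);
   }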
.. _design_applications_fault_mgmt_design:

******
Design
******

The implementation and functionality of the Fault Management subsystem for
the Safety Island are based on the Zephyr real-time operating system (RTOS).

Drivers
=======

Driver interfaces are provided for fault devices and safety state devices.
Specific driver implementations with devicetree bindings are provided for the
Arm FMU and Arm SSU.

The public driver interfaces are described under
:repo:`components/safety_island/zephyr/src/include/zephyr/drivers/fault_mgmt`

The drivers are instantiated in the devicetree using bindings under
:repo:`components/safety_island/zephyr/src/dts/bindings/fault_mgmt`

Fault Management Unit
---------------------

The FMU driver is an implementation of a fault device. Inside the driver, one
of two driver implementations is selected at runtime to handle differences
between the GIC-720AE and the System FMU programmers' views.

It is expected that interrupts are only defined for root FMUs. If the root
FMU is a System FMU, it collates faults from multiple upstream sources. In
this case, when a fault occurs, the driver inspects the status of the other
FMUs in the tree to determine the exact origin and cause of the fault.

The FMU driver allows a single callback to be registered, through which
incoming faults are reported.

Safety Status Unit
------------------

The SSU driver is an implementation of a safety state device. It implements
the safety state device interface, which allows its state to be read and
controlled.

Subsystem
=========

The Fault Management subsystem manages two fault-handling threads (one for
critical faults and another for non-critical faults), which listen for queued
faults from any configured root fault device and forward them to all
configured fault handlers.

Multiple fault handlers can be statically registered (using the
``FAULT_MGMT_HANDLER_DEFINE`` macro), each of which is called once per root
fault device on initialization, then once per reported fault. Handlers are
registered with a unique priority that determines the order in which they are
called. Certain subsystem features are themselves implemented as handlers.

It is expected that, in order to implement a Fault Management policy for a
safety-critical system design, one or more additional custom fault handlers
would be required to perform tasks such as the following (a handler sketch is
shown after this list):

* Configuring the criticality and enabled state of fault device safety
  mechanisms.
* Performing a self-test routine before notifying the safety state device
  that the system is safe for operation.
* Reacting to non-critical faults and deciding whether to perform a
  corrective action to reset the safety state or promote to a critical fault.
  This decision may be based on the provided fault count storage.
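The following is a minimal sketch of such a custom handler. Only the
``FAULT_MGMT_HANDLER_DEFINE`` macro name comes from this document; the
handler callback prototypes, the ``fault_info`` structure and the macro's
argument list are assumptions made for illustration. Refer to the subsystem
headers referenced below for the actual interface.

.. code-block:: c

   /*
    * Illustrative sketch only: the callback prototypes, fault_info fields
    * and FAULT_MGMT_HANDLER_DEFINE arguments are assumptions; only the
    * macro name itself comes from the subsystem documentation.
    */
   #include <stdbool.h>
   #include <stdint.h>
   #include <zephyr/device.h>
   #include <zephyr/logging/log.h>

   LOG_MODULE_REGISTER(policy_handler);

   struct fault_info {                /* Hypothetical reported-fault record. */
       const struct device *dev;      /* Originating fault device. */
       uint32_t fault_id;             /* Device-specific 32-bit fault ID. */
       bool critical;                 /* Criticality of the fault. */
   };

   /* Called once per root fault device on initialization. */
   static void policy_init(const struct device *root)
   {
       LOG_INF("root fault device: %s", root->name);
   }

   /*
    * Called once per reported fault, on a subsystem thread: keep stack
    * usage low, as the thread stack sizes are fixed by Kconfig.
    */
   static void policy_fault(const struct fault_info *fault)
   {
       if (!fault->critical) {
           /*
            * A real policy might consult the fault count storage here and
            * either clear the safety state or promote the fault.
            */
           LOG_WRN("non-critical fault 0x%x from %s",
                   fault->fault_id, fault->dev->name);
       }
   }

   /* Hypothetical arguments: name, unique priority, the two callbacks. */
   FAULT_MGMT_HANDLER_DEFINE(policy, 10, policy_init, policy_fault);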
The subsystem has configuration options to manage the stack space, priority
and queue size of both threads, which should be tuned and validated according
to deployment requirements. In particular, more complex custom handlers may
require more stack space, as they are called on the subsystem threads.

The public interface for the subsystem and its components is described under
:repo:`components/safety_island/zephyr/src/include/zephyr/subsys/fault_mgmt`

Safety component
----------------

The safety component contains additional interfaces to facilitate reading and
updating a system's safety state. If enabled, this component requires (and
validates at boot) that all root fault devices have an attached safety state
device.

Storage component
-----------------

The storage component manages historical fault counts per safety mechanism
per fault device. Two storage backends are provided:

* `Trusted Firmware-M PSA Protected Storage Interfaces`_, with an in-memory
  cache populated at boot.
* A non-persistent in-memory implementation, using only Zephyr's
  ``sys_hash_map``.

For the PSA backend, there are configuration options to manage the storage
key and the maximum record count, which should be tuned and validated
depending on the number of distinct faults and devices and/or other system
constraints.

.. _design_applications_fault_mgmt_kronos_deployment:

*****************
Kronos Deployment
*****************

The Kronos FVP models:

* An SSU in the Safety Island.
* A System FMU in the Safety Island, attached to the SSU.
* An FMU attached to the GIC-720AE in the Primary Compute, attached to the
  System FMU.
* An FMU attached to the GIC-720AE in the Safety Island, attached to the
  System FMU.

.. image:: ../images/kronos_fault_device_tree.*
   :align: center
   :alt: Kronos Fault Device Tree

The Kronos Fault Management application
(:repo:`components/safety_island/zephyr/src/apps/fault_mgmt`) provides
Kconfig and devicetree overlays for a sample deployment using these devices
on Safety Island Cluster 1. The functionality can be evaluated using the
Zephyr shell on this cluster. Additionally, this application serves as the
basis for the automated validation (see :ref:`validation_fault_management`).

For fault count storage, the application uses the PSA Protected Storage
implementation provided by TF-M. ``CONFIG_MAX_PSA_PROTECTED_STORAGE_SIZE`` is
configured according to TF-M storage constraints.

Validation
==========

The Kronos Reference Design contains integration tests for the overall FMU
and SSU integration, described at :ref:`validation_fault_management`.

.. _design_applications_fault_mgmt_shell_reference:

***************
Shell Reference
***************

The subsystem provides an optional shell command (enabled using
``CONFIG_FAULT_MGMT_SHELL``) which exposes the subsystem API interactively
for evaluation and validation purposes. Its sub-commands are described below,
followed by an illustrative session.

* ``fault tree`` - Print a description of the fault device tree (including
  any safety state devices) to the console. The device names printed here
  can be used in the other commands below.
* ``fault inject DEVICE FAULT_ID`` - Inject a specific ``FAULT_ID`` into
  ``DEVICE``. The resultant fault will be logged on the console.
* ``fault set_enabled DEVICE FAULT_ID ENABLED`` - Enable or disable a
  specific ``FAULT_ID`` on a ``DEVICE``. Set ``ENABLED`` to ``1`` to enable
  or ``0`` to disable.
* ``fault set_critical DEVICE FAULT_ID CRITICAL`` - Configure a specific
  ``FAULT_ID`` on a ``DEVICE`` as critical or non-critical. Set ``CRITICAL``
  to ``1`` to set as critical or ``0`` to set as non-critical.

The ``FAULT_ID`` above refers to a 32-bit integer whose valid values are
device-specific (e.g. ``0x100`` represents an *APB access error* for a System
FMU but a *GICD Clock Error* for a GIC-720AE FMU) and opaque to the driver
itself.

The following are only available if ``CONFIG_FAULT_MGMT_SAFETY`` is enabled:

* ``fault safety_status DEVICE`` - Print the current status of safety state
  ``DEVICE`` to the console.
* ``fault safety_control DEVICE SIGNAL`` - Send ``SIGNAL`` to safety state
  ``DEVICE``.

The following are only available if ``CONFIG_FAULT_MGMT_STORAGE`` is enabled:

* ``fault list [THRESHOLD]`` - List all reported fault counts. The optional
  ``THRESHOLD`` filters out faults below a certain count.
* ``fault summary`` - Show a more detailed summary of the fault counts,
  including a list of the most reported faults.
* ``fault count`` - Print the total count of reported faults.
* ``fault clear`` - Reset all fault counts back to zero.
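The session below illustrates how these sub-commands fit together. It is a
sketch only: the device name ``fmu_si`` is a hypothetical example (the real
names are printed by ``fault tree``), the fault ID is arbitrary and command
output is omitted.

.. code-block:: console

   uart:~$ fault tree
   uart:~$ fault set_critical fmu_si 0x100 0
   uart:~$ fault inject fmu_si 0x100
   uart:~$ fault list
   uart:~$ fault clear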
The test suite at
:repo:`yocto/meta-arm-auto-solutions/lib/oeqa/runtime/cases/test_10_fault_mgmt.py`
demonstrates usage of these sub-commands.

.. _design_applications_fault_mgmt_limitations:

*********************
Safety Considerations
*********************

The Fault Management subsystem has the following features to mitigate the
risk of unexpected runtime behavior causing a denial of service:

* Iterative methods that take a fixed amount of stack space based on
  ``CONFIG_FAULT_MGMT_MAX_TREE_DEPTH`` are used to traverse fault device
  trees.
* Invalid combinations of configuration values (e.g. a root FMU without IRQ
  numbers) are detected at compile time where possible.
* The subsystem functionality is composed of independent handlers which can
  be disabled if not required (see the configuration sketch at the end of
  this section).

Note that there are conditions under which the subsystem will panic and the
application running on the Safety Island cluster will stop processing further
faults (non-exhaustive):

* Faults arrive more quickly than they are handled, over a long enough period
  for a queue to fill up.
* A fault arrives at a System FMU from an unknown Device FMU.
* The number of stored fault records exceeds the amount of available storage.
* An unexpected error code is returned when attempting to write a fault count
  to the storage.
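As a reference point for the considerations above, the following is a minimal
sketch of a Kconfig fragment for a deployment of the subsystem. The option
values are illustrative assumptions, not recommendations, and the thread
stack, priority and queue size options mentioned earlier also need tuning but
are not shown here.

.. code-block:: cfg

   # Illustrative values only: tune and validate per deployment.
   CONFIG_FAULT_MGMT_SHELL=y          # Interactive shell, for evaluation builds.
   CONFIG_FAULT_MGMT_SAFETY=y         # Safety component (requires attached SSUs).
   CONFIG_FAULT_MGMT_STORAGE=y        # Fault count storage component.
   CONFIG_FAULT_MGMT_MAX_TREE_DEPTH=4 # Bounds stack use of tree traversal.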