..
 # SPDX-FileCopyrightText: Copyright 2023-2024 Arm Limited and/or its
 # affiliates
 #
 # SPDX-License-Identifier: MIT

.. _design_applications_fault_mgmt:

################
Fault Management
################

************
Introduction
************

The Fault Management subsystem for the Safety Island provides a mechanism for
capturing, reporting and collating faults from supported hardware in
safety-critical designs.

The subsystem interfaces with the following types of devices:

* A fault device, which reports faults from its safety mechanisms. It may
  also report faults originating from other fault devices to support the
  creation of a fault device tree.
* A safety state device, which manages a state in reaction to reported
  faults.

Supporting driver implementations are provided for the following Arm hardware
designs:

* A Device Fault Management Unit (Device FMU): a fault device attached to a
  GIC-720AE interrupt controller.
* A System Fault Management Unit (System FMU): a fault device which collates
  faults from upstream FMUs.
* A Safety Status Unit (SSU): a safety state device which manages a state
  machine in response to faults in a safety-critical system.

Faults
======

A unique fault (i.e. one generated by a specific safety mechanism and
reported by a fault device implementation) is represented by the subsystem
and driver interfaces as a device-specific 32-bit integer along with a handle
to the originating device. A fault may be critical or non-critical, and this
affects how it is processed by the subsystem.

Fault Device Trees
==================

The subsystem is configured with a list of "root" fault devices - those
located at the root of a fault device tree. Root fault devices are typically
collators of faults from multiple upstream fault devices (possibly
recursively) and may also directly affect the state of a connected safety
state device.

The diagram below shows an illustrative fault device tree. (For the simpler
Kronos topology, see :ref:`design_applications_fault_mgmt_kronos_deployment`
below.)

.. image:: ../images/sample_fault_device_tree.*
   :align: center
   :alt: A Sample Fault Device Tree

Safety States
=============

The SSU state machine has 4 safety states:

* ``TEST``: Self-test
* ``SAFE``: Safe operation
* ``ERRN``: Non-critical fault detected
* ``ERRC``: Critical fault detected

Control signals from software:

* ``compl_ok``: Diagnostics complete or non-critical fault cleared
* ``nce_ok``: Non-critical fault diagnosed
* ``ce_not_ok``: Critical fault diagnosed
* ``nce_not_ok``: Non-correctable non-critical fault

Control signals connected in hardware to the root fault device:

* ``nc_error``: Non-critical error
* ``c_error``: Critical error
* ``reset``

``TEST`` is the initial state on boot. The software is responsible for
transitioning to ``SAFE`` after the successful completion of a self-test
routine. ``ERRC`` represents a critical system failure, from which the system
can only recover by resetting. A non-critical fault causes a transition to
``ERRN``, which can either be recovered back to ``SAFE`` or promoted to
``ERRC`` by the software.

The diagram below shows all the possible transitions between these states
using these signals.

.. image:: ../images/ssu_states.*
   :align: center
   :alt: SSU States

Finite State Machine (FSM) states and transitions:

* From reset, the FSM defaults to the ``TEST`` state.
* The FSM stays in this state until the software has completed any power-up
  tests. If the software-controlled tests pass, a write can be issued to move
  the FSM to the ``SAFE`` state (see the sketch after this list).
* If the tests fail, a write can be issued to move the FSM to the ``ERRN``
  state, indicating that an error has occurred that may be resolvable.
* After further tests, the software can issue a write depending on whether
  the error was determined to have been resolved, moving the FSM to ``SAFE``
  if it was resolved or to ``ERRC`` if it was not.
* When in the ``SAFE`` state, the FSM can only be moved by one of the
  following:

  - a reset, moving it back to ``TEST``
  - a non-critical error interrupt, moving it to ``ERRN``
  - a critical error interrupt, moving it to ``ERRC``
  - if a critical and a non-critical error occur at the same time, the
    critical error takes precedence and the FSM shall move to ``ERRC``
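As an illustration of the ``TEST`` to ``SAFE`` transition, the following is a
minimal sketch of a boot-time self-test routine. The call
``fault_safety_state_control()``, the signal macro and the
``run_power_up_tests()`` helper are hypothetical stand-ins introduced here
for illustration only; the real safety state device interface is defined by
the driver headers referenced in the :ref:`design
<design_applications_fault_mgmt_design>` section below.

.. code-block:: c

   /*
    * Illustrative sketch only: fault_safety_state_control() and the signal
    * macro below are hypothetical stand-ins for the safety state device
    * interface, whose real names are defined by the driver headers.
    */
   #include <errno.h>
   #include <stdbool.h>
   #include <stdint.h>
   #include <zephyr/device.h>

   /* Hypothetical call: send a control signal to a safety state device. */
   int fault_safety_state_control(const struct device *dev, uint32_t signal);
   #define FAULT_SAFETY_SIGNAL_COMPL_OK 0u /* Hypothetical signal ID. */

   extern bool run_power_up_tests(void); /* Deployment-specific self-test. */

   int boot_self_test(const struct device *ssu)
   {
       if (!device_is_ready(ssu)) {
           return -ENODEV;
       }

       /* The SSU is in TEST after reset; run the power-up tests first. */
       if (!run_power_up_tests()) {
           /*
            * On failure, a different control signal would move the FSM to
            * ERRN instead (see the transition list above); the exact
            * signal choice is deployment-specific, so only the pass path
            * is shown here.
            */
           return -EIO;
       }

       /* Tests passed: signal completion to move the FSM TEST -> SAFE. */
       return fault_safety_state_control(ssu, FAULT_SAFETY_SIGNAL_COMPL_OK);
   }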
.. _design_applications_fault_mgmt_design:

******
Design
******

The implementation and functionality of the Fault Management subsystem for
the Safety Island are based on the Zephyr real-time operating system (RTOS).

Drivers
=======

Driver interfaces are provided for fault devices and safety state devices.
Specific driver implementations with devicetree bindings are provided for the
Arm FMU and Arm SSU.

The public driver interfaces are described under
:repo:`components/safety_island/zephyr/src/include/zephyr/drivers/fault_mgmt`

The drivers are instantiated in the devicetree using bindings under
:repo:`components/safety_island/zephyr/src/dts/bindings/fault_mgmt`

Fault Management Unit
---------------------

The FMU driver is an implementation of a fault device. Inside the driver, one
of two driver implementations is selected at runtime to handle differences
between the GIC-720AE and the System FMU programmers' views.

It is expected that interrupts are only defined for root FMUs. If the root
FMU is a System FMU, it collates faults from multiple upstream sources. In
this case, when a fault occurs, the driver inspects the status of the other
FMUs in the tree to determine the exact origin and cause of the fault.

The FMU driver allows a single callback to be registered, through which
incoming faults are reported.

Safety Status Unit
------------------

The SSU driver is an implementation of a safety state device. It implements
the safety state device interface, which allows its state to be read and
controlled.

Subsystem
=========

The Fault Management subsystem manages two fault-handling threads (one for
critical faults and another for non-critical faults), which listen for queued
faults from any configured root fault device and forward them to all
configured fault handlers.

Multiple fault handlers can be statically registered (using the
``FAULT_MGMT_HANDLER_DEFINE`` macro), each of which is called once per root
fault device on initialization, then once per reported fault. Handlers are
registered with a unique priority that determines the order in which they are
called. Certain subsystem features are themselves implemented as handlers.

It is expected that, in order to implement a Fault Management policy for a
safety-critical system design, one or more additional custom fault handlers
would be required to perform tasks such as the following (a handler sketch is
shown after this list):

* Configuring the criticality and enabled state of fault device safety
  mechanisms.
* Performing a self-test routine before notifying the safety state device
  that the system is safe for operation.
* Reacting to non-critical faults and deciding whether to perform a
  corrective action to reset the safety state or promote to a critical fault.
  This decision may be based on the provided fault count storage.
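The following is a minimal sketch of such a custom handler. Only the
``FAULT_MGMT_HANDLER_DEFINE`` macro name comes from this document; the
handler callback prototypes, the ``fault_info`` structure and the macro's
argument list are assumptions made for illustration. Refer to the subsystem
headers referenced below for the actual interface.

.. code-block:: c

   /*
    * Illustrative sketch only: the callback prototypes, fault_info fields
    * and FAULT_MGMT_HANDLER_DEFINE arguments are assumptions; only the
    * macro name itself comes from the subsystem documentation.
    */
   #include <stdbool.h>
   #include <stdint.h>
   #include <zephyr/device.h>
   #include <zephyr/logging/log.h>

   LOG_MODULE_REGISTER(policy_handler);

   struct fault_info {                /* Hypothetical reported-fault record. */
       const struct device *dev;      /* Originating fault device. */
       uint32_t fault_id;             /* Device-specific 32-bit fault ID. */
       bool critical;                 /* Criticality of the fault. */
   };

   /* Called once per root fault device on initialization. */
   static void policy_init(const struct device *root)
   {
       LOG_INF("root fault device: %s", root->name);
   }

   /*
    * Called once per reported fault, on a subsystem thread: keep stack
    * usage low, as the thread stack sizes are fixed by Kconfig.
    */
   static void policy_fault(const struct fault_info *fault)
   {
       if (!fault->critical) {
           /*
            * A real policy might consult the fault count storage here and
            * either clear the safety state or promote the fault.
            */
           LOG_WRN("non-critical fault 0x%x from %s",
                   fault->fault_id, fault->dev->name);
       }
   }

   /* Hypothetical arguments: name, unique priority, the two callbacks. */
   FAULT_MGMT_HANDLER_DEFINE(policy, 10, policy_init, policy_fault);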
The subsystem has configuration options to manage the stack space, priority
and queue size of both threads, which should be tuned and validated according
to deployment requirements. In particular, more complex custom handlers may
require more stack space, as they are called on the subsystem threads.

The public interface for the subsystem and its components is described under
:repo:`components/safety_island/zephyr/src/include/zephyr/subsys/fault_mgmt`

Safety component
----------------

The safety component contains additional interfaces to facilitate reading and
updating a system's safety state. If enabled, this component requires (and
validates at boot) that all root fault devices have an attached safety state
device.

Storage component
-----------------

The storage component manages historical fault counts per safety mechanism
per fault device. Two storage backends are provided:

* `Trusted Firmware-M PSA Protected Storage Interfaces`_, with an in-memory
  cache populated at boot.
* A non-persistent in-memory implementation, using only Zephyr's
  ``sys_hash_map``.

For the PSA backend, there are configuration options to manage the storage
key and the maximum record count, which should be tuned and validated
depending on the number of distinct faults and devices and/or other system
constraints.

.. _design_applications_fault_mgmt_kronos_deployment:

*****************
Kronos Deployment
*****************

The Kronos FVP models:

* An SSU in the Safety Island.
* A System FMU in the Safety Island, attached to the SSU.
* An FMU attached to the GIC-720AE in the Primary Compute, attached to the
  System FMU.
* An FMU attached to the GIC-720AE in the Safety Island, attached to the
  System FMU.

.. image:: ../images/kronos_fault_device_tree.*
   :align: center
   :alt: Kronos Fault Device Tree

The Kronos Fault Management application
(:repo:`components/safety_island/zephyr/src/apps/fault_mgmt`) provides
Kconfig and devicetree overlays for a sample deployment using these devices
on Safety Island Cluster 1. The functionality can be evaluated using the
Zephyr shell on this cluster. Additionally, this application serves as the
basis for the automated validation (see :ref:`validation_fault_management`).

For fault count storage, the application uses the PSA Protected Storage
implementation provided by TF-M. ``CONFIG_MAX_PSA_PROTECTED_STORAGE_SIZE`` is
configured according to TF-M storage constraints.

Validation
==========

The Kronos Reference Design contains integration tests for the overall FMU
and SSU integration, described at :ref:`validation_fault_management`.

.. _design_applications_fault_mgmt_shell_reference:

***************
Shell Reference
***************

The subsystem provides an optional shell command (enabled using
``CONFIG_FAULT_MGMT_SHELL``) which exposes the subsystem API interactively
for evaluation and validation purposes. Its sub-commands are described below,
followed by an illustrative session.

* ``fault tree`` - Print a description of the fault device tree (including
  any safety state devices) to the console. The device names printed here
  can be used in the other commands below.
* ``fault inject DEVICE FAULT_ID`` - Inject a specific ``FAULT_ID`` into
  ``DEVICE``. The resultant fault will be logged on the console.
* ``fault set_enabled DEVICE FAULT_ID ENABLED`` - Enable or disable a
  specific ``FAULT_ID`` on a ``DEVICE``. Set ``ENABLED`` to ``1`` to enable
  or ``0`` to disable.
* ``fault set_critical DEVICE FAULT_ID CRITICAL`` - Configure a specific
  ``FAULT_ID`` on a ``DEVICE`` as critical or non-critical. Set ``CRITICAL``
  to ``1`` to set as critical or ``0`` to set as non-critical.

The ``FAULT_ID`` above refers to a 32-bit integer whose valid values are
device-specific (e.g. ``0x100`` represents an *APB access error* for a System
FMU but a *GICD Clock Error* for a GIC-720AE FMU) and opaque to the driver
itself.

The following are only available if ``CONFIG_FAULT_MGMT_SAFETY`` is enabled:

* ``fault safety_status DEVICE`` - Print the current status of safety state
  ``DEVICE`` to the console.
* ``fault safety_control DEVICE SIGNAL`` - Send ``SIGNAL`` to safety state
  ``DEVICE``.

The following are only available if ``CONFIG_FAULT_MGMT_STORAGE`` is enabled:

* ``fault list [THRESHOLD]`` - List all reported fault counts. The optional
  ``THRESHOLD`` filters out faults below a certain count.
* ``fault summary`` - Show a more detailed summary of the fault counts,
  including a list of the most reported faults.
* ``fault count`` - Print the total count of reported faults.
* ``fault clear`` - Reset all fault counts back to zero.
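The session below illustrates how these sub-commands fit together. It is a
sketch only: the device name ``fmu_si`` is a hypothetical example (the real
names are printed by ``fault tree``), the fault ID is arbitrary and command
output is omitted.

.. code-block:: console

   uart:~$ fault tree
   uart:~$ fault set_critical fmu_si 0x100 0
   uart:~$ fault inject fmu_si 0x100
   uart:~$ fault list
   uart:~$ fault clear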
The test suite at
:repo:`yocto/meta-arm-auto-solutions/lib/oeqa/runtime/cases/test_10_fault_mgmt.py`
demonstrates usage of these sub-commands.

.. _design_applications_fault_mgmt_limitations:

*********************
Safety Considerations
*********************

The Fault Management subsystem has the following features to mitigate the
risk of unexpected runtime behavior causing a denial of service:

* Iterative methods that take a fixed amount of stack space based on
  ``CONFIG_FAULT_MGMT_MAX_TREE_DEPTH`` are used to traverse fault device
  trees.
* Invalid combinations of configuration values (e.g. a root FMU without IRQ
  numbers) are detected at compile time where possible.
* The subsystem functionality is composed of independent handlers which can
  be disabled if not required (see the configuration sketch at the end of
  this section).

Note that there are conditions under which the subsystem will panic and the
application running on the Safety Island cluster will stop processing further
faults (non-exhaustive):

* Faults arrive more quickly than they are handled, over a long enough period
  for a queue to fill up.
* A fault arrives at a System FMU from an unknown Device FMU.
* The number of stored fault records exceeds the amount of available storage.
* An unexpected error code is returned when attempting to write a fault count
  to the storage.
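As a reference point for the considerations above, the following is a minimal
sketch of a Kconfig fragment for a deployment of the subsystem. The option
values are illustrative assumptions, not recommendations, and the thread
stack, priority and queue size options mentioned earlier also need tuning but
are not shown here.

.. code-block:: cfg

   # Illustrative values only: tune and validate per deployment.
   CONFIG_FAULT_MGMT_SHELL=y          # Interactive shell, for evaluation builds.
   CONFIG_FAULT_MGMT_SAFETY=y         # Safety component (requires attached SSUs).
   CONFIG_FAULT_MGMT_STORAGE=y        # Fault count storage component.
   CONFIG_FAULT_MGMT_MAX_TREE_DEPTH=4 # Bounds stack use of tree traversal.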