..
 # SPDX-FileCopyrightText: <text>Copyright 2025 Arm Limited and/or its
 # affiliates <open-source-office@arm.com></text>
 #
 # SPDX-License-Identifier: MIT

.. _aspen_fault_management_unit:

###########################
Fault Management Unit (FMU)
###########################

************
Introduction
************

The Fault Management Unit (FMU) is responsible for collecting both internal
fault signals and faults from upstream devices. It consolidates these signals
into standardized critical (C) and non-critical (NC) outputs to enable
system-level monitoring and response.

Safety Diagnostics Monitoring is a critical subsystem that facilitates the
detection, reporting, and escalation of hardware faults in safety-critical
designs. It relies on the FMU to perform fault aggregation and signaling.

The subsystem interfaces with the following device types:

* System FMUs, which propagate faults from their own safety mechanisms or from
  other FMUs in a hierarchical topology. This is the FMU type described in the
  Safety Island Design Specification.
* GIC FMUs, which monitor faults from the Generic Interrupt Controller (GIC)
  and provide fault signals to the Safety Diagnostics Monitoring subsystem from
  their own safety mechanisms.
* MHU FMUs, which trigger interrupts for faults in other safety elements, such
  as critical and non-critical errors on the RSE/PC <-> Cluster<n> MHUs.
* The Safety Status Unit (SSU), which transitions between operational states
  in response to fault conditions.

This documentation explains the FMU module’s role in the Safety Diagnostics
Monitoring subsystem and how it integrates into the SCP-firmware.

================
Key Capabilities
================

* Aggregation of faults from originating and upstream devices.
* Differentiation and signaling of critical (C) and non-critical (NC) faults.
* Hardware threshold-based fault escalation.
* Fault injection APIs for validation.
* Event-driven notification via the SCP-firmware event loop.

********************
Design and Framework
********************

The FMU module in SCP-firmware provides support for handling faults reported
by FMUs. It is responsible for configuring interrupt handlers, parsing fault
status from FMU registers, and propagating fault notifications through the
firmware event system. It supports multiple FMU types (currently the System
FMU, GIC FMU and MHU FMU) through an internal implementation API, and it
implements the logic for fault detection, escalation, and reporting.

The FMU module is implemented under the automotive-specific directory
hierarchy within ``product/automotive-rd/module/fmu/``. It follows
the standard SCP-firmware module structure.

Each FMU is declared as a firmware element:

.. code-block:: c

   enum si0_fmu_idx {
       SI0_FMU_ROOT,
       SI0_FMU1,
       SI0_FMU2,
       SI0_FMU3,
       SI0_FMU4,
       SI0_GIC_FMU,
       SI0_MHU_FMU,
       SI0_FMU_COUNT
   };

An example configuration for these elements:

.. code-block:: c

   static const struct fwk_element fmu_devices[SI0_FMU_COUNT + 1] = {
       [SI0_FMU_ROOT] = {
           .name = "fmu0",
           .data = &((struct mod_fmu_dev_config) {
               .base = SI0_FMU0_BASE,
               .parent = MOD_FMU_PARENT_NONE,
           }),
       },
       [SI0_FMU1] = {
           .name = "fmu1",
           .data = &((struct mod_fmu_dev_config) {
               .base = SI0_FMU1_BASE,
               .parent = SI0_FMU_ROOT,
               .parent_cr_index = 0,
               .parent_ncr_index = 1,
           }),
       },
       ...
       [SI0_GIC_FMU] = {
           .name = "gic_fmu",
           .data = &((struct mod_fmu_dev_config) {
               .base = SI0_GIC_FMU_BASE,
               .parent = SI0_FMU4,
               .parent_cr_index = 204,
               .parent_ncr_index = 203,
           }),
       },
       [SI0_MHU_FMU] = {
           .name = "mhu_fmu",
           .data = &((struct mod_fmu_dev_config) {
               .base = SI0_MHU_RSE_CL0_FMU_BASE,
               .parent = SI0_FMU4,
               .parent_cr_index = 0,
               .parent_ncr_index = 1,
           }),
       },
       [SI0_FMU_COUNT] = {0},
   };
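
The ``.parent`` fields above describe a tree rooted at ``SI0_FMU_ROOT``. As a
minimal sketch of how such a hierarchy can be traversed, the standalone code
below models only the parent links and counts the hops from an FMU to the root.
The ``fmu_link`` structure, the ``PARENT_NONE`` value and ``fmu_depth()`` are
hypothetical names used for illustration and are not part of the module. The
parent links for FMU2 to FMU4 are assumed to point at the root, since those
entries are elided in the example above.

.. code-block:: c

   #include <stddef.h>

   /* Hypothetical stand-ins for the module's configuration types. */
   #define PARENT_NONE (-1)

   enum fmu_idx { FMU_ROOT, FMU1, FMU2, FMU3, FMU4, GIC_FMU, MHU_FMU, FMU_COUNT };

   struct fmu_link {
       int parent; /* Index of the parent FMU, or PARENT_NONE for the root */
   };

   static const struct fmu_link fmu_links[FMU_COUNT] = {
       [FMU_ROOT] = { .parent = PARENT_NONE },
       [FMU1] = { .parent = FMU_ROOT },
       [FMU2] = { .parent = FMU_ROOT }, /* Assumed, elided in the example */
       [FMU3] = { .parent = FMU_ROOT }, /* Assumed, elided in the example */
       [FMU4] = { .parent = FMU_ROOT }, /* Assumed, elided in the example */
       [GIC_FMU] = { .parent = FMU4 },
       [MHU_FMU] = { .parent = FMU4 },
   };

   /*
    * Return the number of hops from idx to the root FMU, or -1 if idx is
    * out of range or the parent links do not lead to a root.
    */
   int fmu_depth(int idx)
   {
       int depth = 0;

       if (idx < 0 || idx >= FMU_COUNT) {
           return -1;
       }

       while (fmu_links[idx].parent != PARENT_NONE) {
           idx = fmu_links[idx].parent;
           if (idx < 0 || idx >= FMU_COUNT || ++depth > FMU_COUNT) {
               return -1; /* Broken link or cycle */
           }
       }

       return depth;
   }

With these links, ``fmu_depth(GIC_FMU)`` is 2, because the GIC FMU reports to
System FMU 4, which in turn reports to the root.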

==================
FMU Register Table
==================

.. list-table::
   :widths: 25 55 10 10
   :header-rows: 1

   * - Register
     - Description
     - System FMU
     - GIC FMU / MHU FMU
   * - ``ERR<n>FR``
     - Indicates available control signals for a fault record.
     - yes
     - yes
   * - ``ERR<n>CTRL``
     - Enables/disables detection and interrupts.
     - yes
     - yes
   * - ``ERR<n>STATUS``
     - Fault valid status and internal error indicator.
     - yes
     - yes
   * - ``ERRIMPDEF<n>``
     - Configures injection, upgrade threshold, etc. (System FMU only).
     - yes
     - no
   * - ``SMEN``
     - Safety mechanism enable register (GIC FMU and MHU FMU).
     - no
     - yes
   * - ``SMERR``
     - Injects a fault into the FMU (GIC FMU and MHU FMU).
     - no
     - yes
   * - ``SMCR``
     - Control the criticality of faults (GIC FMU and MHU FMU).
     - no
     - yes
   * - ``SYS_KEY``
     - Unlocks protected registers.
     - yes
     - yes
   * - ``ERRUPDATE``
     - Reports back on updates to other registers.
     - no
     - yes
   * - (``ERRGSR_L<i>``, ``ERRGSR_H<i>``) or ``ERRGSR``
     - Global status for error records.
     - yes
     - yes
   * - ``ERRPIDRx``, ``ERRCIDRx``
     - Identifies FMU component type.
     - yes
     - yes

.. note::

   In the register names above, ``<n>`` refers to a specific fault record index,
   while ``<i>`` denotes the index into grouped status registers (e.g., global
   status for multiple fault records).
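
To illustrate the difference between ``<n>`` and ``<i>``, the sketch below maps
a fault record index onto a grouped status register and bit position. The
layout is an assumption made purely for illustration: 64 records per group
``<i>``, with ``ERRGSR_L<i>`` holding the lower 32 records and ``ERRGSR_H<i>``
the upper 32. The actual layout must be taken from the FMU specification, and
the names ``errgsr_pos``, ``errgsr_locate()`` and ``errgsr_record_active()``
are hypothetical.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* ASSUMPTION: 64 fault records per group, split into two 32-bit words. */
   #define RECORDS_PER_GROUP 64u
   #define RECORDS_PER_WORD 32u

   struct errgsr_pos {
       unsigned int group; /* Group index <i> */
       bool high;          /* true: ERRGSR_H<i>, false: ERRGSR_L<i> */
       unsigned int bit;   /* Bit position within the 32-bit register */
   };

   /* Translate fault record index <n> into a group, word and bit. */
   struct errgsr_pos errgsr_locate(unsigned int record)
   {
       struct errgsr_pos pos;
       unsigned int offset = record % RECORDS_PER_GROUP;

       pos.group = record / RECORDS_PER_GROUP;
       pos.high = offset >= RECORDS_PER_WORD;
       pos.bit = offset % RECORDS_PER_WORD;

       return pos;
   }

   /*
    * Check whether a record is flagged in snapshots of the status words.
    * gsr_l[i] and gsr_h[i] hold the values read from ERRGSR_L<i> and
    * ERRGSR_H<i>; both arrays must cover the group of the given record.
    */
   bool errgsr_record_active(
       const uint32_t *gsr_l,
       const uint32_t *gsr_h,
       unsigned int record)
   {
       struct errgsr_pos pos = errgsr_locate(record);
       uint32_t word = pos.high ? gsr_h[pos.group] : gsr_l[pos.group];

       return ((word >> pos.bit) & 1u) != 0u;
   }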

=====================
RD-Aspen FMU Topology
=====================

RD-Aspen integrates five System FMUs alongside GIC and MHU FMUs. Each FMU is
responsible for monitoring faults from a specific subset of system components
or domains:

- **System FMU 0 (Root FMU)**:
  Acts as the central aggregator. It collects fault outputs from all the leaf
  FMUs and propagates consolidated Critical and Non-Critical fault signals
  to the SSU.

- **System FMU 1**:
  Monitors faults reported via the CSSFAULT input signals from components
  outside the Safety Island, such as those in the Compute Subsystem (CSS).

- **System FMU 2**:
  Serves as a continuation of FMU 1, monitoring additional external faults
  received via CSSFAULT pins. Together, FMU 1 and FMU 2 provide full coverage
  of externally sourced fault records.

- **System FMU 3**:
  Handles faults originating from processor cluster elements within the Safety
  Island.

- **System FMU 4**:
  Monitors faults from all other internal Safety Island components. This
  includes memory, interconnects, peripheral units, the interrupt controller,
  Message Handling Units (MHUs), Address Translation Units (ATUs), clock/reset
  and infrastructure elements.

- **GIC FMU**: Monitors faults from the Generic Interrupt Controller (GIC) in
  the Safety Island. It provides critical fault signals to the Safety
  Diagnostics Monitoring subsystem from its own safety mechanisms.

- **MHU FMU**: Monitors critical and non-critical errors from the
  RSE/PC <-> Cluster<n> MHUs within the Safety Island. It provides
  interrupts for both critical and non-critical error conditions.

Each FMU is configured as an individual firmware element and collectively
ensures complete fault coverage of the RD-Aspen platform for safety diagnostics.

.. figure:: ../images/fmu_tree.*
   :align: center
   :alt: RD-Aspen FMU Topology

   RD-Aspen FMU Topology

|

==================
Module API Summary
==================

The FMU driver provides a set of APIs for fault injection, status querying,
threshold management, and fault escalation control. The table below summarizes
the available functions and their intended usage.

.. list-table::
   :widths: 25 45 10 20
   :header-rows: 1

   * - API
     - Description
     - System FMU
     - GIC FMU/
       MHU FMU
   * - inject()
     - Inject fault manually for testing.
     - yes
     - yes
   * - get_enabled()
     - Check if a fault record is enabled (System FMU only).
     - yes
     - no
   * - set_enabled()
     - Enable or disable a fault record.
     - yes
     - yes
   * - get_count(), set_count()
     - Read or update fault occurrence count (System FMU only).
     - yes
     - no
   * - get_threshold(), set_threshold()
     - Configure hardware escalation threshold (System FMU only).
     - yes
     - no
   * - get_upgrade_enabled(), set_upgrade_enabled()
     - Enable automatic promotion to critical (System FMU only).
     - yes
     - no
   * - set_critical()
     - Set fault criticality (GIC FMU and MHU FMU).
     - no
     - yes

======================
Escalation and Logging
======================

The driver logs fault events when they are received, including details such
as the fault type (critical or non-critical), node index, and safety
mechanism index. This information can be observed on the UART console during
runtime and is useful for debugging and validation.

Example log output:

.. code-block:: text

    [FMU] Critical fault received: Device: 0x0, Node 0x01, SM 0x10
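
A line of this shape could be produced by a formatting helper such as the one
sketched below. The helper name, its parameters and the ``Non-critical``
wording are assumptions. Only the critical-fault line shown above is taken
from the driver output.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdio.h>

   /*
    * Hypothetical helper: format a fault description into buf.
    * Returns the value of snprintf(), i.e. the length the full message
    * requires (excluding the terminating NUL).
    */
   int fmu_format_fault(
       char *buf,
       size_t len,
       bool critical,
       unsigned int device,
       unsigned int node,
       unsigned int sm)
   {
       return snprintf(
           buf,
           len,
           "[FMU] %s fault received: Device: 0x%x, Node 0x%02x, SM 0x%02x",
           critical ? "Critical" : "Non-critical",
           device,
           node,
           sm);
   }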

The meaning of the node index is specific to the FMU implementation type:

* For System FMUs: the node index corresponds to the incoming FMU input index.
* For GIC FMUs: the node index corresponds to the block type, which is one of
  GIC distributor (GICD), wake request, shared peripheral interrupt (SPI)
  collator, GIC cluster interface (GCI), interrupt translation service (ITS)
  or FMU.
* For MHU FMUs: the node index corresponds to the block type. There are three
  block types: Sender, Receiver and FMU. The MHU FMU itself uses the FMU block
  type identifier.

For the System FMU, threshold-based escalation is configured using the
``set_threshold()`` and ``set_upgrade_enabled()`` APIs. When the number of fault
occurrences exceeds the defined threshold and escalation is enabled, the FMU
automatically promotes the fault to critical.
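
The escalation rule can be modelled in software as follows. This is a sketch
of the behavior described above, not of the hardware implementation: the
``fault_record`` structure and ``fault_record_report_nc()`` are hypothetical,
and "exceeds" is interpreted as a strict comparison (count greater than
threshold).

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical software model of one System FMU fault record. */
   struct fault_record {
       uint32_t count;       /* Fault occurrences seen so far */
       uint32_t threshold;   /* Value configured via set_threshold() */
       bool upgrade_enabled; /* Value configured via set_upgrade_enabled() */
   };

   /*
    * Record one non-critical fault occurrence.
    * Returns true if the fault is promoted to critical, i.e. escalation is
    * enabled and the occurrence count now exceeds the threshold.
    */
   bool fault_record_report_nc(struct fault_record *rec)
   {
       rec->count++;

       return rec->upgrade_enabled && (rec->count > rec->threshold);
   }

With a threshold of 2 and escalation enabled, the first two occurrences stay
non-critical in this model and the third is promoted to critical.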


===================
Fault Handling Flow
===================

When a fault interrupt is received by the root FMU, the fault handling
process begins. The interrupt service routine (ISR) triggers an iterative
walk through the FMU hierarchy to identify and process all active fault
records.

The driver inspects each FMU by reading the ``ERRGSR`` registers to detect
active faults. Once an active fault record is found, it
is acknowledged by writing to the corresponding register.

If a fault is detected on a non-root FMU, the walk continues toward the
leaves of the FMU topology until no further upstream faults are found. For
each confirmed fault, the driver collects fault metadata, including the FMU
device index, node index, and safety mechanism ID.

A ``mod_fmu_fault_notification_params`` structure is populated with these
details, and a fault notification is raised using the
``fwk_notification_notify()`` API. Fault events are also logged via
``FWK_LOG_INFO`` to aid debugging and diagnostics.

Once fault processing completes, the root FMU asserts its critical or
non-critical fault lines. These are received by the SSU,
which evaluates the system's safety state and signals status changes
to the External Safety Management (ESM).
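
The walk described above can be sketched with a small software model in which
each FMU has a bitmask of active inputs and a table recording which inputs are
wired to child FMUs. All names and sizes below are hypothetical, and the model
is deliberately simplified: register accesses are replaced by plain memory
operations, and only faults on inputs that are not driven by another FMU are
reported.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   #define FMU_COUNT 4      /* Hypothetical number of FMUs in the model */
   #define INPUTS_PER_FMU 8 /* Hypothetical number of inputs per FMU */
   #define NO_CHILD (-1)

   struct fmu_model {
       uint8_t active;            /* Bit n set: input n has an active fault */
       int child[INPUTS_PER_FMU]; /* FMU wired to input n, or NO_CHILD */
   };

   struct fault {
       int device; /* Index of the FMU that recorded the fault */
       int node;   /* Input index on that FMU */
   };

   /* Reset the model: no active faults and no child FMUs connected. */
   void fmu_model_init(struct fmu_model *fmus)
   {
       for (int dev = 0; dev < FMU_COUNT; dev++) {
           fmus[dev].active = 0;
           for (int node = 0; node < INPUTS_PER_FMU; node++) {
               fmus[dev].child[node] = NO_CHILD;
           }
       }
   }

   /*
    * Iteratively walk the hierarchy from root, acknowledging every active
    * input. Inputs driven by a child FMU cause that child to be visited;
    * all other active inputs are stored in out. Returns the number of
    * faults stored (at most out_len).
    */
   size_t fmu_walk(
       struct fmu_model *fmus,
       int root,
       struct fault *out,
       size_t out_len)
   {
       int stack[2 * FMU_COUNT];
       size_t depth = 0;
       size_t found = 0;

       stack[depth++] = root;

       while (depth > 0) {
           int dev = stack[--depth];

           for (int node = 0; node < INPUTS_PER_FMU; node++) {
               uint8_t mask = (uint8_t)(1u << node);
               int child = fmus[dev].child[node];

               if ((fmus[dev].active & mask) == 0) {
                   continue;
               }

               /* Acknowledge the record (a register write in a driver) */
               fmus[dev].active &= (uint8_t)~mask;

               if (child != NO_CHILD) {
                   if (depth < (sizeof(stack) / sizeof(stack[0]))) {
                       stack[depth++] = child;
                   }
               } else if (found < out_len) {
                   out[found].device = dev;
                   out[found].node = node;
                   found++;
               }
           }
       }

       return found;
   }

An explicit stack keeps the walk iterative, which bounds stack usage in
interrupt context regardless of the depth of the FMU tree.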

=============
Notifications
=============

The FMU driver raises notifications for each fault that occurs by placing an
event into the SCP-firmware's shared event queue. This queue processes events
serially, ensuring a consistent and predictable execution order.

These notifications are intended primarily for diagnostic and logging
purposes. As such, they are not used in safety-critical fault responses but
can provide useful runtime insights for non-critical fault monitoring and
debugging.

.. note::
   The SCP-firmware event queue has a finite size. If faults arrive faster
   than their events can be processed, the excess notifications are silently
   dropped. To avoid losing fault notifications, the event queue size should
   be configured appropriately.
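
The drop behavior can be illustrated with a small fixed-capacity queue. This
is a standalone model and not the SCP-firmware framework's queue
implementation: the types, the capacity and the ``dropped`` counter are
hypothetical. A counter of this kind is one possible way to make lost
notifications visible during validation.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>

   #define QUEUE_CAPACITY 4u /* Hypothetical queue size for the model */

   struct fault_event {
       unsigned int device;
       unsigned int node;
       unsigned int sm;
   };

   struct fault_queue {
       struct fault_event slots[QUEUE_CAPACITY];
       size_t head;          /* Index of the oldest queued event */
       size_t count;         /* Number of queued events */
       unsigned int dropped; /* Events rejected because the queue was full */
   };

   /* Queue an event; returns false and counts a drop if the queue is full. */
   bool fault_queue_put(struct fault_queue *q, struct fault_event ev)
   {
       if (q->count == QUEUE_CAPACITY) {
           q->dropped++;
           return false;
       }

       q->slots[(q->head + q->count) % QUEUE_CAPACITY] = ev;
       q->count++;

       return true;
   }

   /* Remove the oldest event; returns false if the queue is empty. */
   bool fault_queue_get(struct fault_queue *q, struct fault_event *ev)
   {
       if (q->count == 0) {
           return false;
       }

       *ev = q->slots[q->head];
       q->head = (q->head + 1) % QUEUE_CAPACITY;
       q->count--;

       return true;
   }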

======================
Testing and Validation
======================

**Unit Testing**: Executed on the host using the Unity framework.
Refer to :link_subs:`rd-aspen:scp-unit-test` for more information on this framework.

**Integration Testing**: This will use the SCP-firmware debugger CLI.
Refer to :ref:`rd-aspen_user_guide_reproduce` for more information.

**OEQA Automation**: This will use the SCP-firmware debugger CLI.
Refer to :ref:`rd-aspen_validation` for more information.