..
 # SPDX-FileCopyrightText: <text>Copyright 2025 Arm Limited and/or its
 # affiliates <open-source-office@arm.com></text>
 #
 # SPDX-License-Identifier: MIT

.. _aspen_fault_management_unit:

###########################
Fault Management Unit (FMU)
###########################

************
Introduction
************

The Fault Management Unit (FMU) is responsible for collecting both internal 
fault signals and faults from upstream devices. It consolidates these signals
into standardized critical (C) and non-critical (NC) outputs to enable
system-level monitoring and response. The FMU described in the Safety Island
Design Specification is referred to here as the System FMU.

Safety Diagnostics Monitoring is a critical subsystem that facilitates the
detection, reporting, and escalation of hardware faults in safety-critical
designs. It relies on the FMU to perform fault aggregation and signaling.

The subsystem interfaces with the following device types:

* FMUs, which propagate faults from their own safety mechanisms or from other
  FMUs in a hierarchical topology.
* Safety Status Unit (SSU), which transitions between operational states
  in response to fault conditions.

This document explains the FMU module's role in the Safety Diagnostics
Monitoring subsystem and how it integrates into SCP-firmware.

================
Key Capabilities
================

* Aggregation of faults from originating and upstream devices.
* Differentiation and signaling of critical (C) and non-critical (NC) faults.
* Hardware threshold-based fault escalation.
* Fault injection APIs for validation.
* Event-driven notification via the SCP-firmware event loop.

********************
Design and Framework
********************

The FMU module in SCP-firmware provides support for handling faults reported
by System FMUs. It is responsible for configuring interrupt handlers, parsing
fault status from FMU registers, and propagating fault notifications through
the firmware event system. The driver is currently tailored to work with
System FMUs and implements logic for fault detection, escalation, and
reporting.

The FMU module is implemented under the automotive-specific directory
hierarchy at ``product/automotive-rd/module/fmu/`` and follows the standard
SCP-firmware module structure.

Each FMU is declared as a firmware element:

.. code-block:: c

   enum si0_fmu_idx {
       SI0_FMU_ROOT,
       SI0_FMU1,
       SI0_FMU2,
       SI0_FMU3,
       SI0_FMU4,
       SI0_FMU_COUNT
   };

An example configuration:

.. code-block:: c

   static const struct fwk_element fmu_devices[SI0_FMU_COUNT + 1] = {
       [SI0_FMU_ROOT] = {
           .name = "fmu0",
           .data = &((struct mod_fmu_dev_config) {
               .base = SI0_FMU0_BASE,
               .parent = MOD_FMU_PARENT_NONE,
           }),
       },
       [SI0_FMU1] = {
           .name = "fmu1",
           .data = &((struct mod_fmu_dev_config) {
               .base = SI0_FMU1_BASE,
               .parent = SI0_FMU_ROOT,
               .parent_cr_index = 0,
               .parent_ncr_index = 1,
           }),
       },
       ...
       [SI0_FMU_COUNT] = {0},
   };
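
Because each element names its parent by index, a malformed table (no root,
several roots, or a dangling parent) would break the hierarchy. The following
is a minimal sketch of a consistency check over such a table. It is not taken
from the module sources: ``fmu_link_sketch``, ``SKETCH_PARENT_NONE`` (standing
in for ``MOD_FMU_PARENT_NONE``) and ``fmu_table_is_valid()`` are hypothetical
names used only for illustration.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical stand-in for MOD_FMU_PARENT_NONE. */
   #define SKETCH_PARENT_NONE UINT32_MAX

   /* Hypothetical, trimmed-down view of one FMU element's configuration. */
   struct fmu_link_sketch {
       uint32_t parent; /* Element index of the parent FMU, or none for root */
   };

   /*
    * Return true when the table has exactly one root and every other entry
    * names an in-range parent that is not the entry itself.
    */
   static bool fmu_table_is_valid(
       const struct fmu_link_sketch *links,
       size_t count)
   {
       size_t roots = 0;

       assert(links != NULL);

       for (size_t i = 0; i < count; i++) {
           if (links[i].parent == SKETCH_PARENT_NONE) {
               roots++;
               continue;
           }
           if ((links[i].parent >= count) || (links[i].parent == i)) {
               return false;
           }
       }

       return roots == 1;
   }

A check of this kind could run once during module initialization so that a
configuration mistake is reported before any fault is processed.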

==================
FMU Register Table
==================

.. list-table::
   :widths: 25 75
   :header-rows: 1

   * - Register
     - Description
   * - ``ERR<n>FR``
     - Indicates available control signals for a fault record.
   * - ``ERR<n>CTRL``
     - Enables/disables detection and interrupts.
   * - ``ERR<n>STATUS``
     - Fault valid status and internal error indicator.
   * - ``ERRIMPDEF<n>``
     - Configures injection, upgrade threshold, etc.
   * - ``SYS_KEY``
     - Unlocks protected registers.
   * - ``ERRGSR_L<i>``, ``ERRGSR_H<i>``
     - Global status for error records.
   * - ``ERRPIDRx``, ``ERRCIDRx``
     - Identifies FMU component type.

.. note::

   In the register names above, ``<n>`` refers to a specific fault record index,
   while ``<i>`` denotes the index into grouped status registers (e.g., global
   status for multiple fault records).

=====================
RD-Aspen FMU Topology
=====================

The RD-Aspen integrates five System FMUs. Each FMU is responsible for
monitoring faults from a specific subset of system components or domains:

- **System FMU 0 (Root FMU)**:
  Acts as the central aggregator. It collects fault outputs from all the leaf
  FMUs and propagates consolidated Critical and Non-Critical fault signals
  to the SSU.

- **System FMU 1**:
  Monitors faults reported via the CSSFAULT input signals from components
  outside the Safety Island, such as those in the Compute Subsystem (CSS).

- **System FMU 2**:
  Serves as a continuation of FMU 1, monitoring additional external faults
  received via CSSFAULT pins. Together, FMU 1 and FMU 2 provide full coverage
  of externally sourced fault records.

- **System FMU 3**:
  Handles faults originating from processor cluster elements within the Safety
  Island.

- **System FMU 4**:
  Monitors faults from all other internal Safety Island components. This
  includes memory, interconnects, peripheral units, the interrupt controller,
  Message Handling Units (MHUs), Address Translation Units (ATUs), clock/reset
  and infrastructure elements.

Each FMU is configured as an individual firmware element and collectively
ensures complete fault coverage of the RD-Aspen platform for safety diagnostics.

.. image:: ../images/fmu_tree.*
   :align: center
   :alt: RD-Aspen FMU Topology

==================
Module API Summary
==================

The FMU driver provides a set of APIs for fault injection, status querying,
threshold management, and fault escalation control. The table below summarizes
the available functions and their intended usage.

.. list-table::
   :widths: 30 50
   :header-rows: 1

   * - API
     - Description
   * - ``inject()``
     - Manually inject a fault for testing.
   * - ``get_enabled()``
     - Check whether a fault record is enabled.
   * - ``set_enabled()``
     - Enable or disable a fault record.
   * - ``get_count()``, ``set_count()``
     - Read or update the fault occurrence count.
   * - ``get_threshold()``, ``set_threshold()``
     - Read or configure the hardware escalation threshold.
   * - ``get_upgrade_enabled()``, ``set_upgrade_enabled()``
     - Query or control automatic promotion to critical.

======================
Escalation and Logging
======================

The driver logs fault events when they are received, including details such
as the fault type (critical or non-critical), node index, and safety
mechanism index. This information can be observed on the UART console during
runtime and is useful for debugging and validation.

Example log output:

.. code-block:: text

    [FMU] Critical fault received: Device: 0x0, Node 0x01, SM 0x10

Threshold-based escalation is configured using the ``set_threshold()`` and
``set_upgrade_enabled()`` APIs. When the number of fault occurrences exceeds
the defined threshold and escalation is enabled, the FMU automatically promotes
the fault to critical.
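
The escalation rule can be pictured with a small software model. This is only
a sketch of the behaviour described above: the real comparison is performed by
the FMU hardware, and ``fault_record_sketch`` and
``record_non_critical_fault()`` are hypothetical names that do not exist in the
driver.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical software model of one fault record's escalation state. */
   struct fault_record_sketch {
       uint32_t count;       /* Number of fault occurrences seen so far */
       uint32_t threshold;   /* Escalation threshold */
       bool upgrade_enabled; /* Whether promotion to critical is enabled */
   };

   /*
    * Account for one non-critical occurrence. Returns true when the count
    * exceeds the threshold and promotion is enabled, that is, when the fault
    * would be reported as critical.
    */
   static bool record_non_critical_fault(struct fault_record_sketch *rec)
   {
       assert(rec != NULL);

       /* Saturate rather than wrap around. */
       if (rec->count < UINT32_MAX) {
           rec->count++;
       }

       return rec->upgrade_enabled && (rec->count > rec->threshold);
   }

With a threshold of 2 and promotion enabled, the first two occurrences stay
non-critical in this model and the third is promoted.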


===================
Fault Handling Flow
===================

When a fault interrupt is received by the root FMU, the fault handling
process begins. The interrupt service routine (ISR) triggers an iterative
walk through the FMU hierarchy to identify and process all active fault
records.

The driver inspects each FMU by reading the ``ERRGSR_L`` and ``ERRGSR_H``
registers to detect active faults. Once an active fault record is found, it
is acknowledged by writing to the corresponding ``ERRIMPDEF`` register with
the interrupt clear (``IC``) bit set.
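
As an illustration of the scan, the sketch below decodes one pair of status
words into a list of active record indices. It assumes that each word is 32
bits wide, that bit ``n`` of the low word corresponds to fault record ``n``,
and that the high word covers records 32 to 63. These assumptions and the
helper name ``collect_active_records()`` are hypothetical. The register values
are passed in as plain integers, so no hardware access is shown.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdint.h>

   /*
    * Write the indices of all set bits in the combined 64-bit status value
    * to 'out', lowest index first. At most 'out_len' entries are written.
    * Returns the number of entries written.
    */
   static size_t collect_active_records(
       uint32_t gsr_l,
       uint32_t gsr_h,
       uint8_t *out,
       size_t out_len)
   {
       uint64_t pending = ((uint64_t)gsr_h << 32) | gsr_l;
       size_t written = 0;

       assert(out != NULL);

       for (uint8_t bit = 0; (bit < 64) && (written < out_len); bit++) {
           if ((pending & (UINT64_C(1) << bit)) != 0) {
               out[written++] = bit;
           }
       }

       return written;
   }

For example, a low word of ``0x5`` and a high word of ``0x1`` would yield the
record indices 0, 2 and 32 under these assumptions.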

If a fault is detected on a non-root FMU, the walk continues toward the
leaves of the FMU topology until no further upstream faults are found. For
each confirmed fault, the driver collects fault metadata, including the FMU
device index, node index, and safety mechanism ID.
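
The walk can be sketched with an explicit worklist instead of recursion. The
model below is a simplification under stated assumptions, not the driver's
implementation: each FMU is reduced to a 32-bit mask of active records,
``parent_cr_index`` and ``parent_ncr_index`` are assumed to name the parent
records fed by a child's critical and non-critical outputs, and the
acknowledge step and the decoding of node and safety mechanism IDs are
omitted. All type and function names are hypothetical.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   #define SKETCH_MAX_FMUS 8
   #define SKETCH_PARENT_NONE UINT32_MAX /* Stand-in for MOD_FMU_PARENT_NONE */

   struct fmu_node_sketch {
       uint32_t parent;           /* Index of the parent FMU */
       uint32_t parent_cr_index;  /* Parent record fed by the critical output */
       uint32_t parent_ncr_index; /* Parent record fed by the non-critical output */
       uint32_t pending;          /* Bit n set: fault record n is active */
   };

   struct fault_sketch {
       uint32_t device; /* FMU on which the fault was confirmed */
       uint32_t record; /* Active fault record on that FMU */
   };

   /* Return the child FMU feeding 'record' of 'parent', or -1 if none does. */
   static int child_for_record(
       const struct fmu_node_sketch *fmus,
       size_t count,
       uint32_t parent,
       uint32_t record)
   {
       for (size_t i = 0; i < count; i++) {
           if ((fmus[i].parent == parent) &&
               ((fmus[i].parent_cr_index == record) ||
                (fmus[i].parent_ncr_index == record))) {
               return (int)i;
           }
       }
       return -1;
   }

   /* Walk from 'root' toward the leaves; return the number of faults stored. */
   static size_t walk_faults(
       const struct fmu_node_sketch *fmus,
       size_t count,
       uint32_t root,
       struct fault_sketch *out,
       size_t out_len)
   {
       uint32_t worklist[SKETCH_MAX_FMUS];
       bool queued[SKETCH_MAX_FMUS] = { false };
       size_t depth = 0;
       size_t found = 0;

       assert((fmus != NULL) && (out != NULL));
       assert((count <= SKETCH_MAX_FMUS) && (root < count));

       worklist[depth++] = root;
       queued[root] = true;

       while (depth > 0) {
           uint32_t dev = worklist[--depth];

           for (uint32_t rec = 0; rec < 32; rec++) {
               if ((fmus[dev].pending & (UINT32_C(1) << rec)) == 0) {
                   continue;
               }

               int child = child_for_record(fmus, count, dev, rec);
               if (child < 0) {
                   /* No upstream FMU behind this record: a confirmed fault. */
                   if (found < out_len) {
                       out[found].device = dev;
                       out[found].record = rec;
                       found++;
                   }
               } else if (!queued[child]) {
                   /* The record mirrors an upstream FMU: inspect it once. */
                   queued[child] = true;
                   worklist[depth++] = (uint32_t)child;
               }
           }
       }

       return found;
   }

Tracking which FMUs are already queued keeps the worklist bounded even when
both the critical and non-critical records of the same child are active.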

A ``mod_fmu_fault_notification_params`` structure is populated with these
details, and a fault notification is raised using the
``fwk_notification_notify()`` API. Fault events are also logged via
``FWK_LOG_INFO`` to aid debugging and diagnostics.

Once fault processing completes, the root FMU asserts its critical or
non-critical fault lines. These are received by the SSU,
which evaluates the system's safety state and signals status changes
to the External Safety Management (ESM).

=============
Notifications
=============

The FMU driver raises notifications for each fault that occurs by placing an
event into the SCP-firmware's shared event queue. This queue processes events
serially, ensuring a consistent and predictable execution order.

These notifications are intended primarily for diagnostic and logging
purposes. As such, they are not used in safety-critical fault responses but
can provide useful runtime insights for non-critical fault monitoring and
debugging.

.. note::
   The SCP-firmware event queue has a finite capacity. If faults arrive
   faster than their notifications can be processed, the excess notifications
   are silently dropped. To avoid losing fault notifications, configure the
   event queue size appropriately.
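
The effect of a bounded queue can be illustrated with a small ring buffer.
This sketch is not the framework's queue implementation:
``event_queue_sketch``, ``queue_put()``, ``queue_get()`` and the capacity of
four entries are hypothetical. Unlike a silent drop, the model counts rejected
events so that the loss is visible.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   #define SKETCH_QUEUE_CAPACITY 4 /* Hypothetical capacity */

   struct event_queue_sketch {
       uint32_t slots[SKETCH_QUEUE_CAPACITY];
       size_t head;    /* Index of the oldest queued event */
       size_t len;     /* Number of queued events */
       size_t dropped; /* Events rejected because the queue was full */
   };

   /* Append an event; returns false and counts a drop when the queue is full. */
   static bool queue_put(struct event_queue_sketch *q, uint32_t event)
   {
       assert(q != NULL);

       if (q->len == SKETCH_QUEUE_CAPACITY) {
           q->dropped++;
           return false;
       }

       q->slots[(q->head + q->len) % SKETCH_QUEUE_CAPACITY] = event;
       q->len++;
       return true;
   }

   /* Remove the oldest event; returns false when the queue is empty. */
   static bool queue_get(struct event_queue_sketch *q, uint32_t *event)
   {
       assert((q != NULL) && (event != NULL));

       if (q->len == 0) {
           return false;
       }

       *event = q->slots[q->head];
       q->head = (q->head + 1) % SKETCH_QUEUE_CAPACITY;
       q->len--;
       return true;
   }

If six events are queued before any is consumed, this model keeps the first
four in arrival order and records two drops.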

======================
Testing and Validation
======================

**Unit Testing**: Executed on the host using the Unity framework.
Refer to :link_subs:`rd-aspen:scp-unit-test` for more information on this framework.

**Integration Testing**: This will use the SCP-firmware debugger CLI.
Refer to :ref:`rd-aspen_user_guide_reproduce` for more information.

**OEQA Automation**: This will use the SCP-firmware debugger CLI.
Refer to :ref:`rd-aspen_validation` for more information.