Fault Management Unit (FMU)

Introduction

The Fault Management Unit (FMU) is responsible for collecting both internal fault signals and faults from upstream devices. It consolidates these signals into standardized critical (C) and non-critical (NC) outputs to enable system-level monitoring and response.

Safety Diagnostics Monitoring is a critical subsystem that facilitates the detection, reporting, and escalation of hardware faults in safety-critical designs. It relies on the FMU to perform fault aggregation and signaling.

The subsystem interfaces with the following device types:

  • System FMUs, which propagate faults from their own safety mechanisms or from other FMUs in a hierarchical topology. This is the FMU described in the Safety Island Design Specification.

  • GIC FMUs, which monitor faults from the Generic Interrupt Controller (GIC) and provide fault signals from their own safety mechanisms to the Safety Diagnostics Monitoring subsystem.

  • MHU FMUs, which trigger interrupts for faults in other Safety elements, such as RSE/PC <-> Cluster<n> MHU critical and non-critical errors.

  • Safety Status Unit (SSU), which transitions between operational states in response to fault conditions.

This documentation explains the FMU module’s role in the Safety Diagnostics Monitoring subsystem and how it integrates into SCP-firmware.

Key Capabilities

  • Aggregation of faults from originating and upstream devices.

  • Differentiation and signaling of critical (C) and non-critical (NC) faults.

  • Hardware threshold-based fault escalation.

  • Fault injection APIs for validation.

  • Event-driven notification via the SCP-firmware event loop.

Design and Framework

The FMU module in SCP-firmware provides support for handling faults reported by FMUs. It is responsible for configuring interrupt handlers, parsing fault status from FMU registers, and propagating fault notifications through the firmware event system. It has internal support for multiple FMU types (currently the System FMU, GIC FMU and MHU FMUs) using an internal implementation API. It implements logic for fault detection, escalation, and reporting.

The FMU module is implemented under the automotive-specific directory hierarchy within product/automotive-rd/module/fmu/. It follows the standard SCP-firmware module structure.

Each FMU is declared as a firmware element:

enum si0_fmu_idx {
    SI0_FMU_ROOT,
    SI0_FMU1,
    SI0_FMU2,
    SI0_FMU3,
    SI0_FMU4,
    SI0_GIC_FMU,
    SI0_MHU_FMU,
    SI0_FMU_COUNT
};

An example configuration:

static const struct fwk_element fmu_devices[SI0_FMU_COUNT + 1] = {
    [SI0_FMU_ROOT] = {
        .name = "fmu0",
        .data = &((struct mod_fmu_dev_config) {
            .base = SI0_FMU0_BASE,
            .parent = MOD_FMU_PARENT_NONE,
        }),
    },
    [SI0_FMU1] = {
        .name = "fmu1",
        .data = &((struct mod_fmu_dev_config) {
            .base = SI0_FMU1_BASE,
            .parent = SI0_FMU_ROOT,
            .parent_cr_index = 0,
            .parent_ncr_index = 1,
        }),
    },
    ...
    [SI0_GIC_FMU] = {
        .name = "gic_fmu",
        .data = &((struct mod_fmu_dev_config) {
            .base = SI0_GIC_FMU_BASE,
            .parent = SI0_FMU4,
            .parent_cr_index = 204,
            .parent_ncr_index = 203,
        }),
    },
    [SI0_MHU_FMU] = {
        .name = "mhu_fmu",
        .data = &((struct mod_fmu_dev_config) {
            .base = SI0_MHU_RSE_CL0_FMU_BASE,
            .parent = SI0_FMU4,
            .parent_cr_index = 0,
            .parent_ncr_index = 1,
        }),
    },
    [SI0_FMU_COUNT] = {0},
};
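
A table like this can be sanity-checked before use. The sketch below is illustrative only: struct sketch_fmu_cfg is a hypothetical, reduced stand-in for mod_fmu_dev_config, SKETCH_PARENT_NONE is a placeholder for MOD_FMU_PARENT_NONE (whose real value is not shown here), and the reading of parent_cr_index and parent_ncr_index as two distinct parent inputs is an assumption drawn from the field names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical placeholder for MOD_FMU_PARENT_NONE. */
#define SKETCH_PARENT_NONE 0xFFFFFFFFu

/* Hypothetical, reduced view of one device configuration entry. */
struct sketch_fmu_cfg {
    unsigned int parent;           /* Element index of the parent FMU */
    unsigned int parent_cr_index;  /* Assumed: parent input for critical faults */
    unsigned int parent_ncr_index; /* Assumed: parent input for non-critical faults */
};

/*
 * Return true if the table has exactly one root, every parent index is in
 * range, no FMU is its own parent, and each child uses two distinct parent
 * inputs.
 */
bool sketch_fmu_cfg_is_valid(const struct sketch_fmu_cfg *cfg, size_t count)
{
    size_t roots = 0;

    assert(cfg != NULL);

    for (size_t i = 0; i < count; i++) {
        if (cfg[i].parent == SKETCH_PARENT_NONE) {
            roots++;
            continue;
        }
        if (cfg[i].parent >= count || cfg[i].parent == i) {
            return false;
        }
        if (cfg[i].parent_cr_index == cfg[i].parent_ncr_index) {
            return false;
        }
    }

    return roots == 1;
}
```

A check of this kind would typically run once at initialization, so that a miswired hierarchy is caught before any fault interrupt is handled.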

FMU Register Table

Register                              Description                                                       System FMU  GIC FMU / MHU FMU
------------------------------------  ----------------------------------------------------------------  ----------  -----------------
ERR<n>FR                              Indicates available control signals for a fault record.           yes         yes
ERR<n>CTRL                            Enables/disables detection and interrupts.                        yes         yes
ERR<n>STATUS                          Fault valid status and internal error indicator.                  yes         yes
ERRIMPDEF<n>                          Configures injection, upgrade threshold, etc. (System FMU only).  yes         no
SMEN                                  Safety mechanism enable register (GIC FMU and MHU FMU).           no          yes
SMERR                                 Injects a fault into the FMU (GIC FMU and MHU FMU).               no          yes
SMCR                                  Controls the criticality of faults (GIC FMU and MHU FMU).         no          yes
SYS_KEY                               Unlocks protected registers.                                      yes         yes
ERRUPDATE                             Reports back on updates to other registers.                       no          yes
(ERRGSR_L<i>, ERRGSR_H<i>) or ERRGSR  Global status for error records.                                  yes         yes
ERRPIDRx, ERRCIDRx                    Identifies FMU component type.                                    yes         yes

Note

In the register names above, <n> refers to a specific fault record index, while <i> denotes the index into grouped status registers (e.g., global status for multiple fault records).
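
To illustrate how a grouped status register can be searched for an active fault record, here is a minimal sketch. It assumes that each (ERRGSR_L<i>, ERRGSR_H<i>) pair forms a 64-bit bitmap in which bit k reports fault record 64 * i + k; that layout is an assumption made for illustration and should be checked against the FMU register specification. The function name and the plain arrays standing in for register reads are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed number of fault records covered by one (low, high) register pair. */
#define SKETCH_RECORDS_PER_GROUP 64

/*
 * Return the lowest fault record index whose status bit is set, or -1 if no
 * bit is set. gsr_l[i] and gsr_h[i] hold the values read from the low and
 * high status words of group i.
 */
int sketch_first_active_record(
    const uint32_t *gsr_l,
    const uint32_t *gsr_h,
    size_t groups)
{
    assert(gsr_l != NULL && gsr_h != NULL);

    for (size_t i = 0; i < groups; i++) {
        uint64_t bits = ((uint64_t)gsr_h[i] << 32) | gsr_l[i];

        for (int bit = 0; bit < SKETCH_RECORDS_PER_GROUP; bit++) {
            if ((bits >> bit) & 1u) {
                return (int)(i * SKETCH_RECORDS_PER_GROUP) + bit;
            }
        }
    }

    return -1;
}
```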

RD-Aspen FMU Topology

The RD-Aspen integrates five System FMUs, together with a GIC FMU and an MHU FMU. Each FMU is responsible for monitoring faults from a specific subset of system components or domains:

  • System FMU 0 (Root FMU): Acts as the central aggregator. It collects fault outputs from all the leaf FMUs and propagates consolidated Critical and Non-Critical fault signals to the SSU.

  • System FMU 1: Monitors faults reported via the CSSFAULT input signals from components outside the Safety Island, such as those in the Compute Subsystem (CSS).

  • System FMU 2: Serves as a continuation of FMU 1, monitoring additional external faults received via CSSFAULT pins. Together, FMU 1 and FMU 2 provide full coverage of externally sourced fault records.

  • System FMU 3: Handles faults originating from processor cluster elements within the Safety Island.

  • System FMU 4: Monitors faults from all other internal Safety Island components. This includes memory, interconnects, peripheral units, the interrupt controller, Message Handling Units (MHUs), Address Translation Units (ATUs), clock/reset and infrastructure elements.

  • GIC FMU: Monitors faults from the Generic Interrupt Controller (GIC) in the Safety Island. It provides critical fault signals to the Safety Diagnostics Monitoring subsystem from its own safety mechanisms.

  • MHU FMU: Monitors faults from RSE/PC <-> Cluster<n> MHU critical / non critical errors within the Safety Island. It provides interrupts for both critical and non-critical error conditions.

Each FMU is configured as an individual firmware element and collectively ensures complete fault coverage of the RD-Aspen platform for safety diagnostics.
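
The hierarchy can be modelled as a parent table, which makes it easy to see how many hops a fault takes to reach the root. In the sketch below, the enum repeats the element indices shown earlier; the parents of SI0_FMU1, SI0_GIC_FMU and SI0_MHU_FMU follow the example configuration, while the parents of SI0_FMU2 to SI0_FMU4 are assumed to be the root, based on the topology description. SKETCH_NO_PARENT, the table and the function are hypothetical names.

```c
#include <assert.h>

enum si0_fmu_idx {
    SI0_FMU_ROOT,
    SI0_FMU1,
    SI0_FMU2,
    SI0_FMU3,
    SI0_FMU4,
    SI0_GIC_FMU,
    SI0_MHU_FMU,
    SI0_FMU_COUNT
};

/* Hypothetical marker for "no parent" in this sketch. */
#define SKETCH_NO_PARENT (-1)

/* Parent of each FMU; entries for SI0_FMU2 to SI0_FMU4 are assumptions. */
static const int sketch_parent[SI0_FMU_COUNT] = {
    [SI0_FMU_ROOT] = SKETCH_NO_PARENT,
    [SI0_FMU1] = SI0_FMU_ROOT,
    [SI0_FMU2] = SI0_FMU_ROOT,
    [SI0_FMU3] = SI0_FMU_ROOT,
    [SI0_FMU4] = SI0_FMU_ROOT,
    [SI0_GIC_FMU] = SI0_FMU4,
    [SI0_MHU_FMU] = SI0_FMU4,
};

/* Return the number of hops from the given FMU up to the root FMU. */
int sketch_fmu_depth(int idx)
{
    int depth = 0;

    assert(idx >= 0 && idx < SI0_FMU_COUNT);

    while (sketch_parent[idx] != SKETCH_NO_PARENT) {
        idx = sketch_parent[idx];
        depth++;
    }

    return depth;
}
```

Under these assumptions a GIC FMU fault passes through System FMU 4 before it reaches the root, so it sits two hops away.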

Fig. 36 RD-Aspen FMU Topology


Module API Summary

The FMU driver provides a set of APIs for fault injection, status querying, threshold management, and fault escalation control. The table below summarizes the available functions and their intended usage.

API                                           Description                                                 System FMU  GIC FMU / MHU FMU
--------------------------------------------  ----------------------------------------------------------  ----------  -----------------
inject()                                      Inject a fault manually for testing.                        yes         yes
get_enabled()                                 Check if a fault record is enabled (System FMU only).       yes         no
set_enabled()                                 Enable or disable a fault record.                           yes         yes
get_count(), set_count()                      Read or update fault occurrence count (System FMU only).    yes         no
get_threshold(), set_threshold()              Configure hardware escalation threshold (System FMU only).  yes         no
get_upgrade_enabled(), set_upgrade_enabled()  Enable automatic promotion to critical (System FMU only).   yes         no
set_critical()                                Set fault criticality (GIC FMU and MHU FMU).                no          yes
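
The support matrix above can be expressed as a lookup table, which is one way a caller might reject an unsupported operation before touching hardware. This is a sketch, not the module's implementation: all identifiers prefixed with sketch_ or SKETCH_ are hypothetical, and each get/set pair from the table is folded into a single entry.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_fmu_type {
    SKETCH_FMU_TYPE_SYSTEM,
    SKETCH_FMU_TYPE_GIC,
    SKETCH_FMU_TYPE_MHU,
    SKETCH_FMU_TYPE_COUNT
};

enum sketch_fmu_op {
    SKETCH_OP_INJECT,
    SKETCH_OP_GET_ENABLED,
    SKETCH_OP_SET_ENABLED,
    SKETCH_OP_COUNT_ACCESS,     /* get_count(), set_count() */
    SKETCH_OP_THRESHOLD_ACCESS, /* get_threshold(), set_threshold() */
    SKETCH_OP_UPGRADE_ACCESS,   /* get/set_upgrade_enabled() */
    SKETCH_OP_SET_CRITICAL,
    SKETCH_OP_NUM
};

/* Columns: System FMU, GIC FMU, MHU FMU (transcribed from the table). */
static const bool sketch_op_table[SKETCH_OP_NUM][SKETCH_FMU_TYPE_COUNT] = {
    [SKETCH_OP_INJECT] = { true, true, true },
    [SKETCH_OP_GET_ENABLED] = { true, false, false },
    [SKETCH_OP_SET_ENABLED] = { true, true, true },
    [SKETCH_OP_COUNT_ACCESS] = { true, false, false },
    [SKETCH_OP_THRESHOLD_ACCESS] = { true, false, false },
    [SKETCH_OP_UPGRADE_ACCESS] = { true, false, false },
    [SKETCH_OP_SET_CRITICAL] = { false, true, true },
};

/* Return true if the operation is listed as supported for the FMU type. */
bool sketch_fmu_op_supported(enum sketch_fmu_type type, enum sketch_fmu_op op)
{
    assert(type < SKETCH_FMU_TYPE_COUNT && op < SKETCH_OP_NUM);

    return sketch_op_table[op][type];
}
```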

Escalation and Logging

The driver logs fault events when they are received, including details such as the fault type (critical or non-critical), node index, and safety mechanism index. This information can be observed on the UART console during runtime and is useful for debugging and validation.

Example log output:

[FMU] Critical fault received: Device: 0x0, Node 0x01, SM 0x10

The meaning of the node index is specific to the FMU implementation type:

  • For System FMUs: The node index corresponds to the incoming FMU input index.

  • For GIC FMUs: The node index corresponds to the block type: GIC distributor (GICD), wake request, shared peripheral interrupt (SPI) Collator, GIC cluster interface (GCI), interrupt translation service (ITS) or FMU.

  • For MHU FMUs: The node index corresponds to the block type. There are three types of blocks: Sender block, Receiver block and FMU block. MHU FMU uses FMU block type identifier as its block type.
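
A line with the same layout as the example log output could be built by a formatter along these lines. The function name, its parameters and the "Non-critical" label are assumptions made for this sketch; only the layout of the critical-fault line is taken from the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a fault message into buf. Returns the value of snprintf(), that is
 * the length the full message needs, excluding the terminating null.
 */
int sketch_format_fault(
    char *buf,
    size_t size,
    bool critical,
    unsigned int device,
    unsigned int node,
    unsigned int sm)
{
    assert(buf != NULL && size > 0);

    return snprintf(
        buf,
        size,
        "[FMU] %s fault received: Device: 0x%x, Node 0x%02x, SM 0x%02x",
        critical ? "Critical" : "Non-critical",
        device,
        node,
        sm);
}
```

Calling it with critical set to true, device 0, node 1 and safety mechanism 0x10 reproduces the example line.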

For the System FMU, threshold-based escalation is configured using the set_threshold() and set_upgrade_enabled() APIs. When the number of fault occurrences exceeds the defined threshold and escalation is enabled, the FMU automatically promotes the fault to critical.

Fault Handling Flow

When a fault interrupt is received by the root FMU, the fault handling process begins. The interrupt service routine (ISR) triggers an iterative walk through the FMU hierarchy to identify and process all active fault records.

The driver inspects each FMU by reading the ERRGSR registers to detect active faults. Once an active fault record is found, it is acknowledged by writing to the corresponding register.

If a fault is detected on a non-root FMU, the walk continues toward the leaves of the FMU topology until no further upstream faults are found. For each confirmed fault, the driver collects fault metadata, including the FMU device index, node index, and safety mechanism ID.

A mod_fmu_fault_notification_params structure is populated with these details, and a fault notification is raised using the fwk_notification_notify() API. Fault events are also logged via FWK_LOG_INFO to aid debugging and diagnostics.

Once fault processing completes, the root FMU asserts its critical or non-critical fault lines. These are received by the SSU, which evaluates the system’s safety state and signals status changes to the External Safety Management (ESM).
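
The iterative walk described in this section can be sketched with an explicit stack over a simplified in-memory model. Everything here is hypothetical: the pending bitmask stands in for the status registers, clearing a bit stands in for the acknowledging register write, and child_at stands in for the wiring that the parent index fields of the configuration describe. The real driver raises a notification per fault instead of filling an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FMU_COUNT 3
#define SKETCH_RECORDS 8
#define SKETCH_NO_CHILD (-1)

struct sketch_fmu {
    uint32_t pending;             /* Bit n set: record n holds an active fault */
    int child_at[SKETCH_RECORDS]; /* FMU wired to record n, or SKETCH_NO_CHILD */
};

struct sketch_fault {
    int device; /* FMU index where the fault originated */
    int node;   /* Record index within that FMU */
};

/* Walk from the root toward the leaves; return the number of faults stored. */
size_t sketch_walk(
    struct sketch_fmu *fmus,
    int root,
    struct sketch_fault *out,
    size_t out_max)
{
    int stack[SKETCH_FMU_COUNT];
    bool queued[SKETCH_FMU_COUNT] = { false };
    size_t depth = 0;
    size_t found = 0;

    assert(fmus != NULL && out != NULL);
    assert(root >= 0 && root < SKETCH_FMU_COUNT);

    stack[depth++] = root;
    queued[root] = true;

    while (depth > 0) {
        int dev = stack[--depth];

        for (int node = 0; node < SKETCH_RECORDS; node++) {
            uint32_t mask = UINT32_C(1) << node;
            int child = fmus[dev].child_at[node];

            if ((fmus[dev].pending & mask) == 0) {
                continue;
            }
            fmus[dev].pending &= ~mask; /* Acknowledge the record */

            if (child != SKETCH_NO_CHILD) {
                /* The fault came from a child FMU: visit it once. */
                if (!queued[child]) {
                    queued[child] = true;
                    stack[depth++] = child;
                }
            } else if (found < out_max) {
                out[found++] = (struct sketch_fault){ dev, node };
            }
        }
    }

    return found;
}
```

Because each FMU is queued at most once, the stack never needs more entries than there are FMUs, which keeps the walk bounded inside an interrupt context.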

Notifications

The FMU driver raises notifications for each fault that occurs by placing an event into the SCP-firmware’s shared event queue. This queue processes events serially, ensuring a consistent and predictable execution order.

These notifications are intended primarily for diagnostic and logging purposes. As such, they are not used in safety-critical fault responses but can provide useful runtime insights for non-critical fault monitoring and debugging.

Note

The SCP-firmware event queue is finite. If faults arrive faster than their notifications can be processed, the excess notifications are silently dropped. To avoid losing fault notifications, configure the event queue size appropriately.
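
The effect of a bounded queue can be shown with a small ring buffer. This is not the SCP-firmware event queue implementation: the capacity, the type and the function names are hypothetical, and the dropped counter is added here only to make the loss visible in the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity, chosen small to make overflow easy to trigger. */
#define SKETCH_QUEUE_CAPACITY 4

struct sketch_queue {
    int items[SKETCH_QUEUE_CAPACITY];
    size_t head;    /* Index of the oldest queued event */
    size_t count;   /* Number of queued events */
    size_t dropped; /* Events rejected because the queue was full */
};

/* Queue an event; return false and count a drop if the queue is full. */
bool sketch_queue_put(struct sketch_queue *q, int event)
{
    assert(q != NULL);

    if (q->count == SKETCH_QUEUE_CAPACITY) {
        q->dropped++;
        return false;
    }

    q->items[(q->head + q->count) % SKETCH_QUEUE_CAPACITY] = event;
    q->count++;

    return true;
}

/* Remove the oldest event; return false if the queue is empty. */
bool sketch_queue_get(struct sketch_queue *q, int *event)
{
    assert(q != NULL && event != NULL);

    if (q->count == 0) {
        return false;
    }

    *event = q->items[q->head];
    q->head = (q->head + 1) % SKETCH_QUEUE_CAPACITY;
    q->count--;

    return true;
}
```

If five faults arrive before any is processed, the fifth is lost, which is why the queue should be sized for the worst expected burst.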

Testing and Validation

Unit Testing: Executed on the host using the Unity framework. Refer to System Control Processor (SCP) Unit Test for more information on this framework.

Integration Testing: This will use the SCP-firmware debugger CLI. Refer to Reproduce for more information on this framework.

OEQA Automation: This will use the SCP-firmware debugger CLI. Refer to Validation for more information on this framework.