Fault Management Unit (FMU)

Introduction

The Fault Management Unit (FMU) is responsible for collecting both internal fault signals and faults from upstream devices. It consolidates these signals into standardized critical (C) and non-critical (NC) outputs to enable system-level monitoring and response. The FMU described in the Safety Island Design Specification is referred to here as the System FMU.

Safety Diagnostics Monitoring is a critical subsystem that facilitates the detection, reporting, and escalation of hardware faults in safety-critical designs. It relies on the FMU to perform fault aggregation and signaling.

The subsystem interfaces with the following device types:

FMUs, which propagate faults from their own safety mechanisms or from other FMUs in a hierarchical topology.
Safety Status Unit (SSU), which transitions between operational states in response to fault conditions.

This document explains the FMU module’s role in the Safety Diagnostics Monitoring subsystem and how it integrates into SCP-firmware.

Key Capabilities

Aggregation of faults from originating and upstream devices.
Differentiation and signaling of critical (C) and non-critical (NC) faults.
Hardware threshold-based fault escalation.
Fault injection APIs for validation.
Event-driven notification via the SCP-firmware event loop.

Design and Framework

The FMU module in SCP-firmware provides support for handling faults reported by System FMUs. It is responsible for configuring interrupt handlers, parsing fault status from FMU registers, and propagating fault notifications through the firmware event system. The driver is currently tailored to work with System FMUs and implements logic for fault detection, escalation, and reporting.

The FMU module lives in the automotive-specific directory hierarchy, under product/automotive-rd/module/fmu/, and follows the standard SCP-firmware module structure.

Each FMU is declared as a firmware element:

enum si0_fmu_idx {
    SI0_FMU_ROOT,
    SI0_FMU1,
    SI0_FMU2,
    SI0_FMU3,
    SI0_FMU4,
    SI0_FMU_COUNT
};

An example element table configuration:

static const struct fwk_element fmu_devices[SI0_FMU_COUNT + 1] = {
    /* Root FMU: has no parent. */
    [SI0_FMU_ROOT] = {
        .name = "fmu0",
        .data = &((struct mod_fmu_dev_config) {
            .base = SI0_FMU0_BASE,
            .parent = MOD_FMU_PARENT_NONE,
        }),
    },
    /* Child FMU: reports to the root FMU. */
    [SI0_FMU1] = {
        .name = "fmu1",
        .data = &((struct mod_fmu_dev_config) {
            .base = SI0_FMU1_BASE,
            .parent = SI0_FMU_ROOT,
            .parent_cr_index = 0,
            .parent_ncr_index = 1,
        }),
    },
    ...
    /* Empty element terminating the table. */
    [SI0_FMU_COUNT] = {0},
};
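
The parent links in the configuration above describe a tree of FMUs. The following is a minimal, self-contained sketch of how a parent's fault record index could be mapped back to the child FMU that feeds it. It assumes that parent_cr_index and parent_ncr_index name the parent fault records driven by the child's critical and non-critical outputs; that reading is an interpretation of the field names, not something this document confirms. The ex_ prefixed names, the simplified configuration structure, and the EX_PARENT_NONE value are hypothetical and are not part of the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for MOD_FMU_PARENT_NONE; the real value may differ. */
#define EX_PARENT_NONE UINT32_MAX

/* Simplified, hypothetical mirror of the per-device configuration fields. */
struct ex_fmu_cfg {
    uint32_t parent;           /* Index of the parent FMU, or EX_PARENT_NONE */
    uint32_t parent_cr_index;  /* Assumed: parent record fed by the C output */
    uint32_t parent_ncr_index; /* Assumed: parent record fed by the NC output */
};

/*
 * Return the index of the child whose C or NC output feeds fault record
 * `record` of FMU `parent`, or -1 if no configured child feeds that record.
 */
int ex_find_child(
    const struct ex_fmu_cfg *cfg,
    size_t count,
    uint32_t parent,
    uint32_t record)
{
    assert(cfg != NULL);

    for (size_t i = 0; i < count; i++) {
        if (cfg[i].parent != parent) {
            continue;
        }
        if ((cfg[i].parent_cr_index == record) ||
            (cfg[i].parent_ncr_index == record)) {
            return (int)i;
        }
    }

    return -1;
}

/* Example topology: FMU 0 is the root; FMU 1 and FMU 2 are its children. */
const struct ex_fmu_cfg ex_topology[] = {
    { .parent = EX_PARENT_NONE },
    { .parent = 0, .parent_cr_index = 0, .parent_ncr_index = 1 },
    { .parent = 0, .parent_cr_index = 2, .parent_ncr_index = 3 },
};
```

With this example topology, looking up record 1 of FMU 0 yields child 1, while a record that no child feeds yields -1, which the caller can treat as a fault from a local safety mechanism.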

FMU Register Table

Register	Description
`ERR<n>FR`	Indicates available control signals for a fault record.
`ERR<n>CTRL`	Enables/disables detection and interrupts.
`ERR<n>STATUS`	Fault valid status and internal error indicator.
`ERRIMPDEF<n>`	Configures injection, upgrade threshold, etc.
`SYS_KEY`	Unlocks protected registers.
`ERRGSR_L<i>`, `ERRGSR_H<i>`	Global status for error records.
`ERRPIDRx`, `ERRCIDRx`	Identifies FMU component type.

Note

In the register names above, <n> refers to a specific fault record index, while <i> denotes the index into grouped status registers (e.g., global status for multiple fault records).

RD-Aspen FMU Topology

RD-Aspen integrates five System FMUs. Each FMU monitors faults from a specific subset of system components or domains:

System FMU 0 (Root FMU): Acts as the central aggregator. It collects fault outputs from all the leaf FMUs and propagates consolidated Critical and Non-Critical fault signals to the SSU.
System FMU 1: Monitors faults reported via the CSSFAULT input signals from components outside the Safety Island, such as those in the Compute Subsystem (CSS).
System FMU 2: Serves as a continuation of FMU 1, monitoring additional external faults received via CSSFAULT pins. Together, FMU 1 and FMU 2 provide full coverage of externally sourced fault records.
System FMU 3: Handles faults originating from processor cluster elements within the Safety Island.
System FMU 4: Monitors faults from all other internal Safety Island components. This includes memory, interconnects, peripheral units, the interrupt controller, Message Handling Units (MHUs), Address Translation Units (ATUs), clock/reset and infrastructure elements.

Each FMU is configured as an individual firmware element. Together, the five FMUs provide complete fault coverage of the RD-Aspen platform for safety diagnostics.

Module API Summary

The FMU driver provides a set of APIs for fault injection, status querying, threshold management, and fault escalation control. The table below summarizes the available functions and their intended usage.

API	Description
inject()	Manually inject a fault for testing.
get_enabled()	Check whether a fault record is enabled.
set_enabled()	Enable or disable a fault record.
get_count(), set_count()	Read or update the fault occurrence count.
get_threshold(), set_threshold()	Read or configure the hardware escalation threshold.
get_upgrade_enabled(), set_upgrade_enabled()	Query or control automatic promotion of a fault to critical.

Escalation and Logging

The driver logs fault events when they are received, including details such as the fault type (critical or non-critical), node index, and safety mechanism index. This information can be observed on the UART console during runtime and is useful for debugging and validation.

Example log output:

[FMU] Critical fault received: Device: 0x0, Node 0x01, SM 0x10
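
As an illustration of how such a line could be assembled, the sketch below formats the three metadata fields with snprintf(). The format string is inferred from the single sample line above, and the "Non-critical" wording is an assumption; the driver's actual log call and message text may differ. The function name ex_format_fault is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Format a fault log line into `buf`. Returns the value of snprintf(), that
 * is, the length the full line needs, excluding the terminating NUL.
 */
int ex_format_fault(
    char *buf,
    size_t len,
    bool critical,
    unsigned int device,
    unsigned int node,
    unsigned int sm)
{
    assert((buf != NULL) && (len > 0));

    return snprintf(
        buf,
        len,
        "[FMU] %s fault received: Device: 0x%x, Node 0x%02x, SM 0x%02x",
        critical ? "Critical" : "Non-critical", /* NC wording is assumed */
        device,
        node,
        sm);
}
```

Calling ex_format_fault(buf, sizeof(buf), true, 0x0, 0x01, 0x10) with a sufficiently large buffer reproduces the sample line shown above.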

Threshold-based escalation is configured using the set_threshold() and set_upgrade_enabled() APIs. When the number of fault occurrences exceeds the defined threshold and escalation is enabled, the FMU automatically promotes the fault to critical.

Fault Handling Flow

When a fault interrupt is received by the root FMU, the fault handling process begins. The interrupt service routine (ISR) triggers an iterative walk through the FMU hierarchy to identify and process all active fault records.

The driver inspects each FMU by reading the ERRGSR_L and ERRGSR_H registers to detect active faults. Once an active fault record is found, it is acknowledged by writing to the corresponding ERRIMPDEF register with the interrupt clear (IC) bit set.
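
The scan for active records can be sketched as a bit walk over the combined status value. The sketch assumes one status bit per fault record, with ERRGSR_L holding records 0 to 31 of a group and ERRGSR_H holding records 32 to 63; this layout is an assumption and should be checked against the FMU register specification. The function names are hypothetical, and register access is replaced by plain values so the example runs on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine the low and high halves of one status group (assumed layout). */
static inline uint64_t ex_gsr_combine(uint32_t gsr_l, uint32_t gsr_h)
{
    return ((uint64_t)gsr_h << 32) | gsr_l;
}

/*
 * Store the indices of the set bits in `status` into `out`, lowest index
 * first, and return how many were stored. At most `max` indices are stored.
 */
size_t ex_active_records(uint64_t status, unsigned int *out, size_t max)
{
    size_t n = 0;

    assert((out != NULL) || (max == 0));

    for (unsigned int bit = 0; (bit < 64) && (n < max); bit++) {
        if ((status & (UINT64_C(1) << bit)) != 0) {
            out[n++] = bit;
        }
    }

    return n;
}
```

For example, a low word of 0x5 and a high word of 0x1 yield the record indices 0, 2, and 32. Each index found this way would then be acknowledged through the corresponding ERRIMPDEF register as described above.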

If a fault is detected on a non-root FMU, the walk continues toward the leaves of the FMU topology until no further upstream faults are found. For each confirmed fault, the driver collects fault metadata, including the FMU device index, node index, and safety mechanism ID.
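
The iterative walk can be sketched with an explicit stack instead of recursion. The model below is a host-side illustration under stated assumptions: each FMU is reduced to a bitmap of pending fault records, each record is either fed by a child FMU or by a local safety mechanism, and acknowledging a record is modelled as clearing its bit. All names are hypothetical and none of the driver's real data structures are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_MAX_FMUS 8
#define EX_RECORDS 8
#define EX_STACK_DEPTH (2 * EX_MAX_FMUS)
#define EX_LOCAL (-1) /* Record is fed by a local safety mechanism */

struct ex_fmu {
    uint8_t pending;          /* One bit per active fault record */
    int child_of[EX_RECORDS]; /* Child FMU feeding each record, or EX_LOCAL */
};

struct ex_fault {
    int fmu;    /* FMU on which the fault originated */
    int record; /* Fault record index on that FMU */
};

/* Reset an FMU model: nothing pending, every record fed locally. */
void ex_fmu_init(struct ex_fmu *fmu)
{
    assert(fmu != NULL);
    fmu->pending = 0;
    for (int rec = 0; rec < EX_RECORDS; rec++) {
        fmu->child_of[rec] = EX_LOCAL;
    }
}

/*
 * Walk the hierarchy from `root`, acknowledging every pending record.
 * Records fed by a child cause a descent into that child; records fed by a
 * local safety mechanism are reported in `out`. Returns the number reported.
 */
size_t ex_walk(struct ex_fmu *fmus, int root, struct ex_fault *out, size_t max)
{
    int stack[EX_STACK_DEPTH];
    size_t depth = 0;
    size_t found = 0;

    assert((fmus != NULL) && (out != NULL));
    assert((root >= 0) && (root < EX_MAX_FMUS));

    stack[depth++] = root;
    while (depth > 0) {
        int cur = stack[--depth];

        for (int rec = 0; rec < EX_RECORDS; rec++) {
            if ((fmus[cur].pending & (1u << rec)) == 0) {
                continue;
            }
            fmus[cur].pending &= (uint8_t)~(1u << rec); /* Acknowledge */

            int child = fmus[cur].child_of[rec];
            if (child != EX_LOCAL) {
                if (depth < EX_STACK_DEPTH) {
                    stack[depth++] = child; /* Visit the child later */
                }
            } else if (found < max) {
                out[found++] = (struct ex_fault){ cur, rec };
            }
        }
    }

    return found;
}
```

An explicit stack keeps the stack usage bounded and visible, which is a common preference in interrupt-driven firmware; whether the driver uses this exact approach is not stated here.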

A mod_fmu_fault_notification_params structure is populated with these details, and a fault notification is raised using the fwk_notification_notify() API. Fault events are also logged via FWK_LOG_INFO to aid debugging and diagnostics.

Once fault processing completes, the root FMU asserts its critical or non-critical fault lines. These are received by the SSU, which evaluates the system’s safety state and signals status changes to the External Safety Management (ESM).

Notifications

The FMU driver raises notifications for each fault that occurs by placing an event into the SCP-firmware’s shared event queue. This queue processes events serially, ensuring a consistent and predictable execution order.

These notifications are intended primarily for diagnostic and logging purposes. As such, they are not used in safety-critical fault responses but can provide useful runtime insights for non-critical fault monitoring and debugging.

Note

The SCP-firmware event queue is finite. If faults arrive faster than their notifications can be processed, the excess notifications are silently dropped. To avoid losing fault notifications, size the event queue appropriately.
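
The effect of a bounded queue can be illustrated with a small ring buffer. This is a generic model, not the SCP-firmware event queue: the capacity, the names, and the drop counter are all hypothetical. The counter is included only to show how drops could be made observable in a test harness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_QUEUE_CAPACITY 4

/* Fixed-capacity FIFO that rejects new events when full. */
struct ex_queue {
    uint32_t slots[EX_QUEUE_CAPACITY];
    size_t head;      /* Index of the oldest event */
    size_t count;     /* Number of queued events */
    uint32_t dropped; /* Events rejected because the queue was full */
};

/* Queue an event. Returns false, and counts a drop, if the queue is full. */
bool ex_queue_put(struct ex_queue *q, uint32_t event)
{
    assert(q != NULL);

    if (q->count == EX_QUEUE_CAPACITY) {
        q->dropped++;
        return false;
    }

    q->slots[(q->head + q->count) % EX_QUEUE_CAPACITY] = event;
    q->count++;
    return true;
}

/* Remove the oldest event. Returns false if the queue is empty. */
bool ex_queue_get(struct ex_queue *q, uint32_t *event)
{
    assert((q != NULL) && (event != NULL));

    if (q->count == 0) {
        return false;
    }

    *event = q->slots[q->head];
    q->head = (q->head + 1) % EX_QUEUE_CAPACITY;
    q->count--;
    return true;
}
```

With a capacity of four, a burst of six events leaves four queued and two dropped. Draining one event frees a slot for the next arrival, which is why the queue must be sized for the worst expected burst rather than for the average rate.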

Testing and Validation

Unit Testing: Executed on the host using the Unity framework. Refer to System Control Processor (SCP) Unit Test for more information on this framework.

Integration Testing: This will use the SCP-firmware debugger CLI. Refer to Reproduce for more information on this framework.

OEQA Automation: This will use the SCP-firmware debugger CLI. Refer to Validation for more information on this framework.