Critical Application Monitoring Demo
Introduction
Critical applications often follow a pattern where the workloads are split into multiple periodic tasks chained together to produce a feature pipeline. Detecting application execution faults in such safety-critical systems is one of the pillars of a system’s reliability strategy. The Critical Application Monitoring (CAM) project implements a solution for monitoring such critical applications using a monitoring service that runs on a higher safety level system. The main goal of CAM is to ensure that a certain piece of code running in a critical application executes periodically at a specific frequency. When this timing is violated, the critical application is deemed to be malfunctioning. The issues that CAM can detect fall broadly into two classes:
Temporal issues: Events arriving outside the expected frequency.
Logical issues: Events arriving out of order.
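The two classes of issues above can be illustrated with a minimal sketch. It is not the cam-service implementation: the names `EventSpec` and `StreamMonitor`, the microsecond units, and the tolerance field are assumptions made for illustration only. The sketch checks each event on arrival; a real monitor would also need a timeout to catch events that never arrive.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EventSpec:
    """Hypothetical timing requirement for one event in a stream."""
    event_id: int
    period_us: int      # expected interval between consecutive arrivals
    tolerance_us: int   # allowed deviation from the expected interval


class StreamMonitor:
    """Checks arrival order (logical) and arrival interval (temporal)."""

    def __init__(self, specs: List[EventSpec]) -> None:
        self.specs = specs
        self.next_index = 0                  # index of the event expected next
        self.last_seen: Dict[int, int] = {}  # event_id -> last arrival time

    def on_event(self, event_id: int, timestamp_us: int) -> Optional[str]:
        """Return None if the event is valid, else 'logical' or 'temporal'."""
        expected = self.specs[self.next_index]
        if event_id != expected.event_id:
            return "logical"                 # event arrived out of order
        self.next_index = (self.next_index + 1) % len(self.specs)

        previous = self.last_seen.get(event_id)
        self.last_seen[event_id] = timestamp_us
        if previous is not None:
            interval = timestamp_us - previous
            if abs(interval - expected.period_us) > expected.tolerance_us:
                return "temporal"            # outside the expected frequency
        return None
```

For example, with two events that are each expected every 100 ms, an event that arrives 120 ms after its previous occurrence is reported as a temporal issue, and an event that arrives before its predecessor in the chain is reported as a logical issue.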
The CAM project is integrated into the Kronos Reference Software Stack to demonstrate the feasibility of monitoring Primary Compute applications from the Safety Island. Refer to Critical Application Monitoring Documentation for more information on the CAM project and its implementation details.
Critical Application Monitoring on Kronos
The Critical Application Monitoring demo can be run on both Baremetal and Virtualization Architectures.
The following diagram shows the architecture of the demo in the Baremetal Architecture:
CAM consists of the following major components:
Stream configuration file: Configuration file containing the number of stream events and their timing characteristics according to the requirements of the critical application.
Stream deployment data: Binary representation of the stream configuration that needs to be deployed to the Safety Island.
cam-tool: A Python-based tool used to generate and deploy stream deployment data by analyzing the stream configuration file.
cam-service: CAM monitoring agent that monitors event streams sent by critical applications and runs from higher safety cores in the Safety Island. cam-service uses the stream deployment data to validate event streams produced by critical applications.
libcam: CAM library that offers a simple, thread-safe API that can be used by critical applications to integrate the CAM project. The API enables the applications to register with cam-service and generate event streams to be sent to cam-service.
cam-app-example: An example application that uses the libcam API to integrate the CAM framework. It also supports error injection into the stream events to trigger fault detection by cam-service.
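The relationship between a stream configuration and its binary stream deployment data can be sketched as a simple serialization round trip. The layout below (an event count followed by fixed-size records) and the function names `pack_stream` and `unpack_stream` are hypothetical; the actual formats used by cam-tool and cam-service are described in the CAM documentation and are not reproduced here.

```python
import struct
from typing import List, Tuple

# Hypothetical layout, little-endian:
#   header: event count (uint32)
#   record: event_id (uint32), period_us (uint32), tolerance_us (uint32)
HEADER = struct.Struct("<I")
EVENT = struct.Struct("<III")

StreamEvent = Tuple[int, int, int]


def pack_stream(events: List[StreamEvent]) -> bytes:
    """Serialize a list of (event_id, period_us, tolerance_us) tuples."""
    blob = HEADER.pack(len(events))
    for event in events:
        blob += EVENT.pack(*event)
    return blob


def unpack_stream(blob: bytes) -> List[StreamEvent]:
    """Recover the event list from the binary representation."""
    (count,) = HEADER.unpack_from(blob, 0)
    return [
        EVENT.unpack_from(blob, HEADER.size + index * EVENT.size)
        for index in range(count)
    ]
```

A fixed-size binary record of this kind is easy to parse on a constrained target, which is one plausible reason to deploy a binary form rather than the configuration file itself.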
The Primary Compute components are deployed on the baremetal Linux root filesystem in the Baremetal Architecture build and on the DomU1 and DomU2 Linux root filesystems in the Virtualization Architecture.
In the Kronos Reference Software Stack, cam-service is deployed on the Safety Island Cluster 1 in order to provide applications on the Primary Compute with monitoring services at a high safety level.
The following are platform requirements to support the cam-service deployment on the Safety Island:
Communication between the Safety Island and the Primary Compute for event streams.
Synchronized clocks on the Safety Island and the Primary Compute for temporal check.
Storage and a file system on the Safety Island for stream data deployment.
Virtualization Architecture
The following diagram shows the architecture of the demo in the Virtualization Architecture:
In this deployment, two different instances of cam-app-example run on DomU1 and DomU2. Each application is monitored by cam-service concurrently via separate data deployment and event streams.
Communication Interfaces
BSD sockets (over TCP) are used to send event messages from cam-app-example to cam-service via the Heterogeneous Inter-Processor Communication (HIPC) feature.
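Sending a fixed-size event message over a TCP socket can be sketched as follows. The message layout (`stream_id`, `event_id`, `timestamp_ns`) and the helper names are assumptions for illustration and do not describe the actual CAM wire format or the libcam API. Because TCP is a byte stream, the receiver loops until a whole message has been read.

```python
import socket
import struct
from typing import Tuple

# Hypothetical message layout in network byte order:
#   stream_id (uint32), event_id (uint32), timestamp_ns (uint64)
EVENT_MSG = struct.Struct("!IIQ")


def send_event(sock: socket.socket, stream_id: int, event_id: int,
               timestamp_ns: int) -> None:
    """Send one event message; sendall retries until every byte is written."""
    sock.sendall(EVENT_MSG.pack(stream_id, event_id, timestamp_ns))


def recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes, since recv may return fewer than requested."""
    data = bytearray()
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        data.extend(chunk)
    return bytes(data)


def recv_event(sock: socket.socket) -> Tuple[int, int, int]:
    """Receive one event message as (stream_id, event_id, timestamp_ns)."""
    return EVENT_MSG.unpack(recv_exact(sock, EVENT_MSG.size))
```

On the sending side, a connected socket would typically be obtained with `socket.create_connection((host, port))`, where the host and port of the monitoring service are deployment-specific.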
Time Synchronization
Real-time clocks on the Primary Compute and the Safety Island are synchronized via the gPTP protocol.
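The reason the temporal check depends on synchronized clocks can be shown with a small sketch. The function `arrived_in_time` and its parameters are hypothetical and only model the arithmetic: a timestamp taken on one clock is compared against a reading of another clock, so any offset between the two shifts the measured latency.

```python
def arrived_in_time(sent_ns: int, received_ns: int, max_latency_ns: int,
                    clock_offset_ns: int = 0) -> bool:
    """Check that an event arrived within max_latency_ns of being sent.

    sent_ns is read from the sender clock and received_ns from the receiver
    clock. clock_offset_ns is the receiver clock minus the sender clock; it is
    0 when both clocks are synchronized.
    """
    latency_ns = (received_ns - clock_offset_ns) - sent_ns
    return 0 <= latency_ns <= max_latency_ns


# True latency is 500 ns and the limit is 1000 ns.
synchronized = arrived_in_time(1_000, 1_500, 1_000)

# The receiver clock runs 2000 ns ahead and the offset is not accounted for:
# the measured latency becomes 2500 ns and the check fails spuriously.
unsynchronized = arrived_in_time(1_000, 3_500, 1_000)
```

With synchronized clocks the offset is effectively zero, so the measured latency reflects the real transit time rather than the difference between the two clocks.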
Zephyr File System
Zephyr supports the FAT file system and can mount it to a RAM disk. Refer to Zephyr file system.
Note
Because the RAM disk is volatile, the CAM stream data must be redeployed from the Primary Compute to the Safety Island Cluster 1 via cam-tool on every system boot.
Validation
For the CAM demo validations, refer to Integration Tests Validating the Critical Application Monitoring Demo.