The Fabric layer provides clustering, system health, software management, configuration management, upgrade capabilities and alerting. It is responsible for keeping services running and managing resources such as disks, containers and the network. It tracks and reacts to changes in the environment, such as detected failures, and raises alerts related to system health. The Fabric layer has the following components:
The node agent is a lightweight agent written in Java that runs natively on all ECS nodes. Its main duties include managing and controlling host resources (Docker containers, disks, the firewall, the network) and monitoring system processes. Examples of management include formatting and mounting disks, opening required ports, ensuring all processes are running, and determining public and private network interfaces. It publishes an ordered event stream that tells a lifecycle manager what is happening on the system. A Fabric CLI can be used to diagnose issues and inspect overall system state.
The lifecycle manager runs on a subset of three or five nodes and manages the lifecycle of applications running on nodes. Each lifecycle manager is responsible for tracking several nodes. Its main goal is to manage the entire lifecycle of the ECS application from boot to deployment, including failure detection, recovery, notification, and migration. It monitors the node agent event streams and directs the agents to handle each situation. When a node is down or its state is inconsistent, the lifecycle manager restores the system to a known good state. If a lifecycle manager instance is down, another one takes its place.
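The behavior described above can be illustrated with a minimal sketch. It is not the ECS implementation: the names NodeEvent, LifecycleManager, and reassign, the sequence-number scheme, and the round-robin handover are all assumptions made for illustration. The sketch shows a manager consuming ordered node agent events, emitting a recovery action when the observed state departs from the desired one, and handing a failed manager's nodes to the surviving instances.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeEvent:
    """Hypothetical event shape; not the real ECS event format."""
    seq: int      # position in the node agent's ordered stream
    node: str
    state: str    # e.g. "started", "stopped", "degraded"


@dataclass
class LifecycleManager:
    """Hypothetical manager that tracks several nodes."""
    name: str
    desired: dict = field(default_factory=dict)    # node -> desired state
    observed: dict = field(default_factory=dict)   # node -> last seen state
    last_seq: dict = field(default_factory=dict)   # node -> last applied seq

    def handle(self, event):
        """Apply one event and return the recovery actions it triggers."""
        if event.seq <= self.last_seq.get(event.node, -1):
            return []  # stale or duplicate event: ordering must be preserved
        self.last_seq[event.node] = event.seq
        self.observed[event.node] = event.state
        want = self.desired.get(event.node)
        if want is not None and want != event.state:
            return [f"restore {event.node} to {want}"]
        return []


def reassign(assignments, failed, survivors):
    """Hand the nodes tracked by a failed manager to the survivors (round-robin)."""
    if not survivors:
        raise ValueError("no surviving lifecycle manager")
    orphaned = sorted(n for n, m in assignments.items() if m == failed)
    result = dict(assignments)
    for i, node in enumerate(orphaned):
        result[node] = survivors[i % len(survivors)]
    return result
```

In this sketch a manager with a desired state of "started" for node1 returns one "restore" action when it sees a "degraded" event for that node, and ignores the same event if it is delivered a second time.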
The registry contains the ECS Docker images used during installation, upgrade, and node replacement. A Docker container called fabric-registry runs on one node within the ECS rack and holds the repository of ECS Docker images and information required for installations and upgrades. Although the registry is available on only one node at a time, all Docker images are cached locally on every node, so any node can serve the registry.
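Because every node holds a local copy of the images, moving the registry amounts to choosing another healthy node. The sketch below shows one way such a choice could be made; the function name pick_registry_host and the "first healthy node in a fixed order" rule are assumptions for illustration, not the selection logic ECS actually uses.

```python
def pick_registry_host(nodes, healthy, current=None):
    """Return the node that should run the registry container.

    nodes:   all rack nodes in a fixed, agreed order
    healthy: set of nodes currently considered healthy
    current: node serving the registry now, if any
    """
    if current in healthy:
        return current              # no need to move a working registry
    for node in nodes:
        if node in healthy:
            return node             # any node can serve: images are cached locally
    raise RuntimeError("no healthy node available to serve the registry")
```

Keeping the current host whenever it is healthy avoids moving the container unnecessarily, which matches the statement that the registry runs on one node at a time.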
The event library is used within the Fabric layer to expose the lifecycle and node agent event streams. Events generated by the system are persisted to shared memory and disk to provide historical information about the state and health of the ECS system. Because the streams are ordered, the system can be restored to a specific state by replaying the stored events in sequence. Examples include node events such as started, stopped, or degraded.
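Restoring state by replay can be sketched as follows. This is a simplified model, not the ECS event library: the Event fields, the integer sequence numbers, and the replay function are assumptions for illustration. The point is that applying the same ordered events up to a chosen position always reproduces the same state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical stored event; the real ECS record format is not shown here."""
    seq: int     # position in the ordered stream
    node: str
    kind: str    # "started", "stopped", or "degraded"


def replay(events, up_to_seq=None):
    """Rebuild node states by applying events in sequence order.

    If up_to_seq is given, stop after that position to recover
    the state as it was at that point in history.
    """
    state = {}
    for event in sorted(events, key=lambda e: e.seq):
        if up_to_seq is not None and event.seq > up_to_seq:
            break
        state[event.node] = event.kind   # the latest event for a node wins
    return state


log = [
    Event(1, "node1", "started"),
    Event(2, "node2", "started"),
    Event(3, "node1", "degraded"),
    Event(4, "node1", "stopped"),
]

print(replay(log))               # {'node1': 'stopped', 'node2': 'started'}
print(replay(log, up_to_seq=3))  # {'node1': 'degraded', 'node2': 'started'}
```

Replaying the full log gives the current state, while stopping at an earlier sequence number gives the historical state at that point.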
The hardware manager is integrated with the Fabric Agent to support industry-standard hardware. Its main purpose is to provide hardware-specific status and event information, and provisioning of the hardware layer, to higher-level services within ECS.