The primary methods for monitoring ECS are:
- ECS portal—The dashboard on the ECS portal provides the first view into system health. From the dashboard, you can further explore major issues by using the other monitoring panes provided by the ECS portal. Dashboard items that indicate the need for further investigation include:
  - Nodes or disks marked with a red X or a yellow caution symbol. If you see either, go to the Nodes and Process health pane to determine which disk or node is not working and investigate further.
  - Critical alerts. Examine the Events pane to see critical alerts and determine whether action is required.
  - Capacity. The Metering pane indicates which namespaces or buckets are using capacity. View Capacity Utilization to determine whether more disks need to be added.
  - Performance data. Use the historical view to determine whether workload performance is as expected.
  - Geo monitoring. Check failover progress to validate that failover is proceeding as expected.
- Audit logs—Audit logs record changes in the system configuration. Review the audit logs for unauthorized modifications such as owner or ACL changes, quota changes, or creation and deletion of buckets and users.
- Event notifications—Types of event notifications in ECS include:
  - SNMP: Sends information about the status and statistics of network-managed devices to SNMP network management clients.
  - Syslog: Provides a method for centralized storage and retrieval of system log messages.
- ECS service logs—ECS service logs are available on each node and are accessible over SSH by the system administrator user. These are output logs collected for each of the services running on a node—authsvc, blobsvc, eventsvc, and so on. These logs are intended for ECS experts and engineering staff who need to probe and diagnose possible issues in more depth. The service logs are in /opt/emc/caspian/fabric/agent/services/object/main/log.
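As a rough illustration of a first pass over the service logs described above, the following Python sketch counts the lines in each log file that contain a severity keyword. It is not an ECS tool. The function name `count_matching_lines`, the `*.log` file pattern, and the `ERROR` and `FATAL` keywords are assumptions made for this example; the actual log file names and line format may differ.

```python
from collections import Counter
from pathlib import Path


def count_matching_lines(log_dir, keywords=("ERROR", "FATAL")):
    """Return {file name: number of lines containing any keyword}.

    Assumptions (not taken from ECS documentation): log files end in
    ".log" and severity appears as a plain-text keyword on each line.
    Files with no matching lines are left out of the result.
    """
    counts = Counter()
    for path in sorted(Path(log_dir).glob("*.log")):
        # errors="replace" keeps the scan going if a file has odd bytes.
        with path.open(errors="replace") as handle:
            for line in handle:
                if any(word in line for word in keywords):
                    counts[path.name] += 1
    return dict(counts)
```

On a node, the function could be pointed at the service log directory named above, for example `count_matching_lines("/opt/emc/caspian/fabric/agent/services/object/main/log")`, to see which services are logging the most errors before reading the logs in detail.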
Recommended practices when monitoring ECS are:
- Look for uneven CPU, memory, and network bandwidth usage across nodes.
- Become familiar with the performance of the system and the metrics that are expected over time, so that you can start an investigation when rates fall outside the normal range.
- Do not let ECS get too full. When expanding, account for the time that rebalancing takes.
- Look for a higher-than-normal number of failed requests and determine the root cause.
- Regularly check events and audit logs.
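Two of the checks above, spotting uneven resource usage between nodes and noticing when a rate leaves its normal range, can be sketched in a few lines of Python. This is a minimal illustration and not part of ECS. The function names, the 25% tolerance, and the three-standard-deviation rule are assumptions chosen for the example, and the metric values would have to be collected separately, for instance from the ECS portal.

```python
from statistics import mean, median, stdev


def uneven_nodes(usage_by_node, tolerance=0.25):
    """Return the nodes whose value differs from the median across all
    nodes by more than `tolerance`, expressed as a fraction of the median.
    """
    mid = median(usage_by_node.values())
    if mid == 0:
        # Relative deviation is undefined; report any non-idle node.
        return sorted(n for n, v in usage_by_node.items() if v != 0)
    return sorted(
        node
        for node, value in usage_by_node.items()
        if abs(value - mid) / mid > tolerance
    )


def outside_baseline(history, current, k=3.0):
    """Return True if `current` lies more than `k` sample standard
    deviations away from the mean of the historical samples.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return abs(current - mean(history)) > k * stdev(history)


# Hypothetical CPU utilization (percent) per node.
cpu = {"node1": 40, "node2": 42, "node3": 41, "node4": 85}
print(uneven_nodes(cpu))  # ['node4']

# Hypothetical failed requests per minute over recent intervals.
failed = [10, 12, 11, 9, 13]
print(outside_baseline(failed, 30))  # True
print(outside_baseline(failed, 12))  # False
```

The median is used for the node comparison so that a single overloaded node does not shift the reference value. The thresholds should be tuned to the behavior observed on the system over time.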