イ_MODULE
可観測性およびログ

インシデント管理

コンピューティングリソースの可用性を回復し、構造化された対応プロトコルを通じて運用安定性を維持するために、生産上のインシデントを管理する。

High
SRE (Site Reliability Engineering)
A technician wearing headphones monitors system performance on multiple screens and a tablet.

Priority

High

Execution Context

This function enables SREs to rapidly identify, triage, and resolve critical production incidents affecting compute resources. By integrating real-time logging with automated incident response workflows, the system ensures minimal downtime during outages. The process involves detecting anomalies, escalating severity levels, and executing remediation scripts while maintaining full audit trails for compliance.

The system ingests aggregated logs from compute nodes to detect patterns indicative of service degradation or failure.

Automated triggers initiate incident creation upon threshold breaches, assigning an SRE based on severity and resource type.

Real-time dashboards visualize impact scope while coordinating remediation actions across distributed compute clusters.

Operating Checklist

Detect anomaly in compute metrics via log correlation engine

Create incident ticket with severity tag and initial impact assessment

Assign SRE responder and activate communication channels

Execute root cause analysis and apply targeted remediation actions

Integration Surfaces

Log Aggregation Service

Collects and normalizes high-volume telemetry data from all compute instances for immediate analysis.

Incident Command Center

Centralized hub where SREs view live metrics, communicate updates, and execute coordinated recovery plans.

Automated Remediation Engine

Executes predefined scripts to scale resources or restart services based on incident classification.

FAQ

Bring インシデント管理 Into Your Operating Model

Connect this capability to the rest of your workflow and design the right implementation path with the team.