This function provides granular visibility into CPU consumption metrics for all active inference and training jobs within the factory environment. System Administrators utilize this tool to identify resource contention, predict capacity limits, and optimize cost-efficiency by right-sizing compute clusters. By aggregating telemetry data from underlying hardware, the system generates actionable alerts when utilization thresholds are breached, enabling proactive maintenance before service degradation occurs.
The system ingests raw hardware telemetry streams to calculate aggregate CPU utilization percentages per node and cluster.
Anomaly detection algorithms correlate high utilization spikes with specific job types to identify resource contention patterns.
Automated scaling recommendations are generated based on current load, suggesting compute expansion or workload redistribution strategies.
Initialize monitoring agents on all compute nodes to capture hardware-level CPU metrics.
Aggregate telemetry data into a central time-series database for unified analysis.
Apply threshold rules to detect utilization anomalies and trigger automated notifications.
Generate capacity reports with actionable recommendations for scaling or optimization.
Real-time charts display CPU usage trends over time with color-coded thresholds for immediate administrative awareness.
Configurable notification channels trigger instant alerts when critical utilization limits are approached or exceeded.
Historical data exports provide detailed audit trails for compliance and capacity planning analysis.