Infrastructure Monitoring Engineer

Build observability stacks for cloud infrastructure using Prometheus, Grafana, CloudWatch, and more. Expert help with alerting, dashboards, log aggregation, and SLI/SLO design.

Infrastructure Monitoring Engineer is an AI assistant for DevOps engineers, SREs, and platform teams who need to build or improve observability for their cloud infrastructure. Knowing that your infrastructure is healthy, and knowing the moment it stops being healthy, is foundational to running reliable systems. This assistant helps you design monitoring stacks that separate the signal from the noise.

The assistant covers the full observability stack: metrics collection with Prometheus, CloudWatch, Azure Monitor, or GCP Cloud Monitoring; log aggregation with the ELK stack, Loki, or cloud-native logging services; distributed tracing integration; and unified dashboarding with Grafana. It helps you define meaningful infrastructure metrics (CPU steal, disk I/O saturation, network packet loss, memory pressure) and design dashboards that communicate system health clearly to both engineers and management.

Alerting design is a primary focus. The assistant helps you write alerting rules that fire on user-visible symptoms rather than underlying causes, configure alert routing with PagerDuty or Opsgenie, and implement multi-window, multi-burn-rate SLO alerting to reduce alert fatigue while still catching real reliability degradation. It also guides SLI and SLO definition for infrastructure components, helping you move from reactive monitoring to proactive reliability management.
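As a rough illustration of the multi-window, multi-burn-rate idea, the sketch below checks one alerting rule in Python. The names (BurnRateRule, burn_rate, should_alert) are hypothetical and not part of the assistant's output; the 14.4 threshold with 1h and 5m windows follows the commonly cited paging rule for a 30-day SLO, and the error ratios are assumed to come from your metrics backend.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BurnRateRule:
    """One multi-window rule: both windows must exceed the threshold."""
    long_window: str    # e.g. "1h"
    short_window: str   # e.g. "5m"
    threshold: float    # burn-rate multiple that triggers the alert


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors occur."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_alert(rule: BurnRateRule,
                 error_ratios: Mapping[str, float],
                 slo_target: float) -> bool:
    """Fire only if the long AND short windows both burn too fast.

    The long window shows the burn is significant; the short window
    shows it is still happening, so the alert resets quickly on recovery.
    """
    long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
    short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
    return long_burn > rule.threshold and short_burn > rule.threshold


# Assumed paging rule for a 30-day SLO window (illustrative values).
PAGE_FAST = BurnRateRule(long_window="1h", short_window="5m", threshold=14.4)

# Hypothetical observed error ratios per window for a 99.9% SLO.
ongoing = {"1h": 0.02, "5m": 0.03}      # still burning fast
recovered = {"1h": 0.02, "5m": 0.001}   # spike has already ended
print(should_alert(PAGE_FAST, ongoing, 0.999))
print(should_alert(PAGE_FAST, recovered, 0.999))
```

In a real stack the same condition would usually be written as a Prometheus alerting rule; the Python form is only meant to make the two-window logic explicit.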

Ideal users include platform engineers setting up monitoring from scratch, SREs refining alerting to reduce noise, and infrastructure leads who need to demonstrate reliability metrics to stakeholders. Expect outputs such as PromQL query examples, Grafana dashboard JSON structures, alerting rule YAML files, and SLO calculation templates.
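An SLO calculation template of the kind mentioned above might look something like the following sketch. The function names and the 30-day default window are assumptions chosen for illustration, not a format the assistant is confirmed to produce.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of full unavailability for a time-based SLO."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget left for an event-based SLO.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and negative values mean the SLO has been violated.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("no error budget: check slo_target and total_events")
    return 1.0 - bad_events / allowed_bad


# A 99.9% availability target over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))

# Hypothetical month: 1,000,000 requests with 250 failures against a 99.9% SLO.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

Numbers like these are the sort of figure an infrastructure lead could report to stakeholders: how much budget the target permits and how much of it has been consumed so far.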
