
Observability & Monitoring

DevOpsGenie deploys a complete observability stack as part of the platform installation. This guide explains how to use, configure, and extend it.

The Observability Stack

| Signal | Collection | Storage | Visualization |
| --- | --- | --- | --- |
| Metrics | Prometheus, node-exporter, kube-state-metrics | Prometheus (TSDB) / Thanos | Grafana |
| Logs | Promtail / Fluent Bit | Loki | Grafana Explore |
| Traces | OpenTelemetry SDK + Collector | Tempo | Grafana Tempo |
| Alerts | Alertmanager | — | PagerDuty / Slack |

Pre-Built Dashboards

The following Grafana dashboards are included out of the box:

| Dashboard | Description |
| --- | --- |
| EKS Cluster Overview | Node health, capacity, resource utilization |
| Kubernetes Workloads | Deployment, ReplicaSet, and pod metrics |
| Karpenter Node Lifecycle | Node provisioning, termination, bin-packing efficiency |
| Service SLO | Error rate, latency p50/p95/p99, availability |
| ALB Request Metrics | Request rate, error rate, latency by target group |
| Loki Log Explorer | Full-text log search with label filters |
| Node Exporter | OS-level CPU, memory, disk, and network |
| AWS Costs | Namespace and team-level cost attribution via Kubecost |

Configuring SLOs

Service Level Objectives (SLOs) are defined as Prometheus recording rules. From these rules, DevOpsGenie generates SLO dashboards and burn-rate alerts automatically.
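The error-budget figures quoted in rule comments (for example, "43.8 min/month" for a 99.9% target) follow from simple arithmetic. A quick sanity check — the `errorBudgetMinutes` helper below is hypothetical, for illustration only:

```typescript
// Error budget: the fraction of the window in which the service is
// allowed to be failing before the SLO is violated.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const minutesInWindow = windowDays * 24 * 60;
  return (1 - sloTarget) * minutesInWindow;
}

// 99.9% over an average month (~30.44 days) leaves ~43.8 minutes.
console.log(errorBudgetMinutes(0.999, 30.44).toFixed(1)); // ≈ 43.8
```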

Define an SLO

kubernetes/monitoring/slos/payments-api.yaml

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slos
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: payments-api.slo.rules
      interval: 30s
      rules:
        # Availability SLO: 99.9% (43.8 min/month error budget)
        - record: job:http_requests_errors:rate5m
          expr: |
            sum(rate(http_requests_total{job="payments-api", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="payments-api"}[5m]))

        # Latency SLO: 95% of requests < 300ms
        - record: job:http_request_duration_p95:rate5m
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{job="payments-api"}[5m])) by (le)
            )

    - name: payments-api.slo.alerts
      rules:
        - alert: PaymentsAPIHighErrorRate
          expr: job:http_requests_errors:rate5m > 0.01
          for: 5m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "payments-api error rate above 1%"
            description: "Current error rate: {{ $value | humanizePercentage }}. SLO target: 0.1%."
            runbook_url: "https://docs.devopsgenie.io/runbooks/payments-api-errors"

        - alert: PaymentsAPIHighLatency
          expr: job:http_request_duration_p95:rate5m > 0.3
          for: 10m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "payments-api p95 latency above 300ms"
            description: "Current p95 latency: {{ $value | humanizeDuration }}. SLO target: 300ms."
```
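The burn-rate alerts DevOpsGenie generates compare the current error rate against a multiple of the error budget. A hand-written, single-window sketch of the same idea, reusing the recording rule above (the alert name and the 14.4x factor are illustrative; production setups typically pair a long and a short window):

```yaml
- alert: PaymentsAPIErrorBudgetFastBurn
  # 14.4x the 0.1% budget: at this burn rate, a 30-day error
  # budget is exhausted in roughly two days.
  expr: job:http_requests_errors:rate5m > (14.4 * 0.001)
  for: 5m
  labels:
    severity: critical
    team: payments
```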

Alertmanager Configuration

kubernetes/monitoring/alertmanager-config.yaml

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: platform-alerts
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname', 'cluster', 'namespace']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    receiver: 'slack-default'
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: 'pagerduty-critical'
        repeatInterval: 1h

      - matchers:
          - name: team
            value: payments
        receiver: 'slack-payments'

  receivers:
    - name: 'slack-default'
      slackConfigs:
        - apiURL:
            name: alertmanager-secrets
            key: slack-webhook-url
          channel: '#platform-alerts'
          sendResolved: true
          title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
          text: |
            {{ range .Alerts }}
            *Namespace:* {{ .Labels.namespace }}
            *Summary:* {{ .Annotations.summary }}
            *Runbook:* {{ .Annotations.runbook_url }}
            {{ end }}

    # Referenced by the team=payments route above; every receiver named
    # in a route must be defined, or the config is rejected.
    - name: 'slack-payments'
      slackConfigs:
        - apiURL:
            name: alertmanager-secrets
            key: slack-webhook-url
          channel: '#payments-alerts'
          sendResolved: true

    - name: 'pagerduty-critical'
      pagerdutyConfigs:
        - routingKey:
            name: alertmanager-secrets
            key: pagerduty-routing-key
          description: '{{ .CommonAnnotations.summary }}'
```

Distributed Tracing

Instrument your application with the OpenTelemetry SDK:

src/instrumentation.ts

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';

// Load this file before any application code (e.g. via
// `node --require ./instrumentation.js`) so auto-instrumentation
// can patch modules before they are imported.
const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'payments-api',
    [SEMRESATTRS_SERVICE_VERSION]: process.env.APP_VERSION || 'unknown',
  }),
  // Export spans to the in-cluster OpenTelemetry Collector over gRPC.
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector.monitoring.svc.cluster.local:4317',
  }),
});

sdk.start();
```
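Rather than hard-coding the collector URL, the SDK can also be configured through the standard `OTEL_*` environment variables (which it reads when the corresponding options are omitted). The variable names come from the OpenTelemetry specification; the Deployment snippet below is illustrative:

```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector.monitoring.svc.cluster.local:4317
  - name: OTEL_SERVICE_NAME
    value: payments-api
```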

Log Querying with LogQL

Loki uses LogQL for log queries in Grafana:

```logql
# All error logs for the payments-api (time range comes from the Grafana picker)
{namespace="team-payments", app="payments-api"} |= "error" | json | level="error"

# p95 latency per app from structured logs, skipping unparseable lines
quantile_over_time(0.95,
  {namespace="team-payments"} | json | __error__="" | unwrap duration_ms [5m]
) by (app)

# Rate of 5xx responses from nginx access logs
sum(rate({namespace="ingress-nginx"} | pattern `<_> "<method> <_> <_>" <status> <_>` | status >= 500 [1m])) by (status)
```

Long-Term Storage with Thanos

For clusters that need to retain metrics beyond Prometheus's default 15-day local retention, DevOpsGenie uses Thanos to ship blocks to S3 for long-term storage:

kubernetes/monitoring/thanos-objstore.yaml

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
  namespace: monitoring
type: Opaque
stringData:
  objstore.yaml: |
    type: S3
    config:
      bucket: my-platform-thanos-metrics
      endpoint: s3.us-east-1.amazonaws.com  # required by the Thanos S3 client
      region: us-east-1
      sse_config:
        type: SSE-S3
```
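Retention for long-term data is controlled on the Thanos Compactor, not in the object-store secret. A sketch of the relevant container flags (the durations are illustrative values, not DevOpsGenie defaults):

```yaml
# Thanos Compactor container args
args:
  - --retention.resolution-raw=30d   # keep full-resolution blocks 30 days
  - --retention.resolution-5m=90d    # keep 5m-downsampled blocks 90 days
  - --retention.resolution-1h=1y     # keep 1h-downsampled blocks 1 year
```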

Next Steps