
Observability & Monitoring

DevOpsGenie deploys a complete observability stack as part of the platform installation. This guide explains how to use, configure, and extend it.

The Observability Stack

| Signal | Collection | Storage | Visualization |
| --- | --- | --- | --- |
| Metrics | Prometheus, node-exporter, kube-state-metrics | Prometheus (TSDB) / Thanos | Grafana |
| Logs | Promtail / Fluent Bit | Loki | Grafana Explore |
| Traces | OpenTelemetry SDK + Collector | Tempo | Grafana Tempo |
| Alerts | Alertmanager | — | PagerDuty / Slack |

Pre-Built Dashboards

The following Grafana dashboards are included out of the box:

| Dashboard | Description |
| --- | --- |
| EKS Cluster Overview | Node health, capacity, resource utilization |
| Kubernetes Workloads | Deployment, ReplicaSet, and pod metrics |
| Karpenter Node Lifecycle | Node provisioning, termination, bin-packing efficiency |
| Service SLO | Error rate, latency p50/p95/p99, availability |
| ALB Request Metrics | Request rate, error rate, latency by target group |
| Loki Log Explorer | Full-text log search with label filters |
| Node Exporter | OS-level CPU, memory, disk, and network |
| AWS Costs | Namespace and team-level cost attribution via Kubecost |

Configuring SLOs

Service Level Objectives (SLOs) are defined as Prometheus recording rules. From these rules, DevOpsGenie generates SLO dashboards and burn-rate alerts automatically.
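The error-budget figures quoted in rule comments (for example, "43.8 min/month" for a 99.9% target) follow from simple arithmetic. A quick sanity check — the `errorBudgetMinutes` helper below is hypothetical, for illustration only:

```typescript
// Error budget: the fraction of the window in which the service is
// allowed to be failing before the SLO is violated.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const minutesInWindow = windowDays * 24 * 60;
  return (1 - sloTarget) * minutesInWindow;
}

// 99.9% over an average month (~30.44 days) leaves ~43.8 minutes.
console.log(errorBudgetMinutes(0.999, 30.44).toFixed(1)); // ≈ 43.8
```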

Define an SLO

kubernetes/monitoring/slos/payments-api.yaml

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slos
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: payments-api.slo.rules
      interval: 30s
      rules:
        # Availability SLO: 99.9% (43.8 min/month error budget)
        - record: job:http_requests_errors:rate5m
          expr: |
            sum(rate(http_requests_total{job="payments-api", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="payments-api"}[5m]))

        # Latency SLO: 95% of requests < 300ms
        - record: job:http_request_duration_p95:rate5m
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{job="payments-api"}[5m])) by (le)
            )

    - name: payments-api.slo.alerts
      rules:
        - alert: PaymentsAPIHighErrorRate
          expr: job:http_requests_errors:rate5m > 0.01
          for: 5m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "payments-api error rate above 1%"
            description: "Current error rate: {{ $value | humanizePercentage }}. SLO target: 0.1%."
            runbook_url: "https://docs.devopsgenie.io/runbooks/payments-api-errors"

        - alert: PaymentsAPIHighLatency
          expr: job:http_request_duration_p95:rate5m > 0.3
          for: 10m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "payments-api p95 latency above 300ms"
            description: "Current p95 latency: {{ $value | humanizeDuration }}. SLO target: 300ms."
```
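The burn-rate alerts DevOpsGenie generates compare the current error rate against a multiple of the error budget. A hand-written, single-window sketch of the same idea, reusing the recording rule above (the alert name and the 14.4x factor are illustrative; production setups typically pair a long and a short window):

```yaml
- alert: PaymentsAPIErrorBudgetFastBurn
  # 14.4x the 0.1% budget: at this burn rate, a 30-day error
  # budget is exhausted in roughly two days.
  expr: job:http_requests_errors:rate5m > (14.4 * 0.001)
  for: 5m
  labels:
    severity: critical
    team: payments
```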

Alertmanager Configuration

kubernetes/monitoring/alertmanager-config.yaml

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: platform-alerts
  namespace: monitoring
spec:
  route:
    groupBy: ['alertname', 'cluster', 'namespace']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h
    receiver: 'slack-default'
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: 'pagerduty-critical'
        repeatInterval: 1h

      - matchers:
          - name: team
            value: payments
        receiver: 'slack-payments'

  receivers:
    - name: 'slack-default'
      slackConfigs:
        - apiURL:
            name: alertmanager-secrets
            key: slack-webhook-url
          channel: '#platform-alerts'
          sendResolved: true
          title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
          text: |
            {{ range .Alerts }}
            *Namespace:* {{ .Labels.namespace }}
            *Summary:* {{ .Annotations.summary }}
            *Runbook:* {{ .Annotations.runbook_url }}
            {{ end }}

    # Referenced by the team=payments route above; every receiver named
    # in a route must be defined, or the config is rejected.
    - name: 'slack-payments'
      slackConfigs:
        - apiURL:
            name: alertmanager-secrets
            key: slack-webhook-url
          channel: '#payments-alerts'
          sendResolved: true

    - name: 'pagerduty-critical'
      pagerdutyConfigs:
        - routingKey:
            name: alertmanager-secrets
            key: pagerduty-routing-key
          description: '{{ .CommonAnnotations.summary }}'
```

Distributed Tracing

Instrument your application with the OpenTelemetry SDK:

src/instrumentation.ts

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';

// Load this file before any application code (e.g. via
// `node --require ./instrumentation.js`) so auto-instrumentation
// can patch modules before they are imported.
const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'payments-api',
    [SEMRESATTRS_SERVICE_VERSION]: process.env.APP_VERSION || 'unknown',
  }),
  // Export spans to the in-cluster OpenTelemetry Collector over gRPC.
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector.monitoring.svc.cluster.local:4317',
  }),
});

sdk.start();
```
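Rather than hard-coding the collector URL, the SDK can also be configured through the standard `OTEL_*` environment variables (which it reads when the corresponding options are omitted). The variable names come from the OpenTelemetry specification; the Deployment snippet below is illustrative:

```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector.monitoring.svc.cluster.local:4317
  - name: OTEL_SERVICE_NAME
    value: payments-api
```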

Log Querying with LogQL

Loki uses LogQL for log queries in Grafana:

```logql
# All error logs for the payments-api (time range comes from the Grafana picker)
{namespace="team-payments", app="payments-api"} |= "error" | json | level="error"

# p95 latency per app from structured logs, skipping unparseable lines
quantile_over_time(0.95,
  {namespace="team-payments"} | json | __error__="" | unwrap duration_ms [5m]
) by (app)

# Rate of 5xx responses from nginx access logs
sum(rate({namespace="ingress-nginx"} | pattern `<_> "<method> <_> <_>" <status> <_>` | status >= 500 [1m])) by (status)
```

Long-Term Storage with Thanos

For clusters that need to retain metrics beyond Prometheus's default 15-day local retention, DevOpsGenie uses Thanos to ship blocks to S3 for long-term storage:

kubernetes/monitoring/thanos-objstore.yaml

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
  namespace: monitoring
type: Opaque
stringData:
  objstore.yaml: |
    type: S3
    config:
      bucket: my-platform-thanos-metrics
      endpoint: s3.us-east-1.amazonaws.com  # required by the Thanos S3 client
      region: us-east-1
      sse_config:
        type: SSE-S3
```
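Retention for long-term data is controlled on the Thanos Compactor, not in the object-store secret. A sketch of the relevant container flags (the durations are illustrative values, not DevOpsGenie defaults):

```yaml
# Thanos Compactor container args
args:
  - --retention.resolution-raw=30d   # keep full-resolution blocks 30 days
  - --retention.resolution-5m=90d    # keep 5m-downsampled blocks 90 days
  - --retention.resolution-1h=1y     # keep 1h-downsampled blocks 1 year
```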

Next Steps