Monitoring with Prometheus & Grafana

Monitoring provides visibility into system health, performance, and reliability. It is essential for detecting problems before they affect users.

Monitoring Fundamentals

┌─────────────────────────────────────────────────────────────────┐
│                    Monitoring Stack                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐       │
│   │   Targets   │───>│  Prometheus │───>│   Grafana   │       │
│   │ (Apps/Infra)│    │ (Metrics DB)│    │(Dashboards) │       │
│   └─────────────┘    └──────┬──────┘    └─────────────┘       │
│                             │                                    │
│                             ▼                                    │
│                      ┌─────────────┐                            │
│                      │    Alert    │                            │
│                      │   Manager   │                            │
│                      └──────┬──────┘                            │
│                             │                                    │
│                             ▼                                    │
│                      ┌─────────────┐                            │
│                      │  PagerDuty  │                            │
│                      │   Slack     │                            │
│                      │   Email     │                            │
│                      └─────────────┘                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Key Metric Types

Type	Description	Example
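Counter	Cumulative value that only increases (or resets to zero on restart)	Total requests, errors
Gauge	Value that can go up or down	Temperature, memory usage
Histogram	Counts observations in configurable buckets; quantiles are computed at query time	Request duration
Summary	Like a histogram, but quantiles are precomputed by the client	Request latency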
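
Each type is usually queried differently. The rule group below is a minimal sketch of one typical query per type, written in the same rule-file format used later in this section. The group name and recorded series names are illustrative, and `rpc_duration_seconds` is a hypothetical summary metric that does not appear elsewhere in this lesson.

```yaml
groups:
  - name: metric_type_examples   # illustrative group name
    rules:
      # Counter: look at how much it grew, not at the raw cumulative value
      - record: job:http_requests:increase1h
        expr: sum by (job) (increase(http_requests_total[1h]))

      # Gauge: the current value is meaningful as-is
      - record: instance:node_memory_available:ratio
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

      # Histogram: average duration from the _sum and _count series
      - record: job:http_request_duration_seconds:avg5m
        expr: |
          sum by (job) (rate(http_request_duration_seconds_sum[5m]))
          /
          sum by (job) (rate(http_request_duration_seconds_count[5m]))

      # Summary: the quantile is already a label on the series.
      # Precomputed quantiles cannot be averaged across instances,
      # so this takes the worst case instead (hypothetical metric).
      - record: job:rpc_duration_seconds:p99_max
        expr: max by (job) (rpc_duration_seconds{quantile="0.99"})
```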

Prometheus

Prometheus is an open-source monitoring and alerting toolkit that scrapes metrics over HTTP and stores them as time series.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Prometheus                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Retrieval ──> TSDB ──> HTTP API ──> UI / Grafana / Alerts   │
│      │            │                                           │
│      │            └─> WAL (Write Ahead Log)                    │
│      └─> Pull metrics from exporters                           │
│                                                                  │
│   Service Discovery: Kubernetes, Consul, EC2, file, etc.        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Installation

# prometheus-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: production
        replica: prometheus-0  # Static placeholder; Go templates such as {{.ExternalURL}} are not expanded in prometheus.yml
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']
    
    rule_files:
      - /etc/prometheus/rules/*.yml
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
      
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']
      
      - job_name: 'application'
        static_configs:
          - targets: ['app:8080']
        metrics_path: /actuator/prometheus
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest  # Pin a specific version tag in production
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--storage.tsdb.retention.time=15d'
            - '--web.console.libraries=/usr/share/prometheus/console_libraries'
            - '--web.console.templates=/usr/share/prometheus/consoles'
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: storage
              mountPath: /prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: storage
          persistentVolumeClaim:
            claimName: prometheus-pvc
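
The Deployment above refers to several objects it does not define: a claim named `prometheus-pvc`, a DNS name that other components can use to reach Prometheus, and API permissions for the `kubernetes-pods` discovery job. The manifests below are a sketch of those supporting objects. They assume everything runs in the `default` namespace, and the storage size is a placeholder.

```yaml
# Identity for the Prometheus pod (assumed namespace: default)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
# Read-only access needed by kubernetes_sd_configs
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: default   # assumption; match the namespace you deploy into
---
# Stable DNS name "prometheus" on port 9090
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  selector:
    app: prometheus
  ports:
    - name: web
      port: 9090
      targetPort: 9090
---
# Claim referenced by the Deployment's "storage" volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi   # placeholder; size depends on retention and series count
```

For the RBAC objects to take effect, the Deployment's pod spec would also need `serviceAccountName: prometheus`.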

PromQL - Query Language

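# Basic queries
up  # 1 if the last scrape of a target succeeded, 0 otherwise
node_cpu_seconds_total  # Cumulative CPU seconds per core and mode (a counter, so wrap it in rate())
rate(http_requests_total[5m])  # Per-second request rate averaged over the last 5 minutes

# Aggregations
sum by (job) (up)  # Number of healthy targets per job
avg by (instance) (node_memory_MemAvailable_bytes)  # Available memory per instance

# Range vectors
http_requests_total{job="api"}[1h]  # Raw samples from the last hour; must be passed to a function before graphing
increase(http_requests_total[1h])  # Total increase over the last hour

# Functions
rate(http_requests_total[5m])  # Average per-second rate; suited to alerts and slow-moving counters
irate(http_requests_total[5m])  # Instant rate from the last two samples; suited to volatile, zoomed-in graphs
increase(http_requests_total[1h])  # Equivalent to rate() multiplied by the window length in seconds
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))  # 95th-percentile request duration

# Recording rules (YAML entries in a rule file, under a group's rules: list)
- record: job:http_requests_total:rate5m
  expr: sum by (job) (rate(http_requests_total[5m]))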
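
A recording rule only takes effect once it sits inside a rule group in a file matched by `rule_files`. The sketch below wraps the rule above in a complete file. The file name, the group name, and the second rule are illustrative additions.

```yaml
# /etc/prometheus/rules/recording.yml (hypothetical file name)
groups:
  - name: http_recording_rules   # illustrative group name
    interval: 30s                # optional; defaults to global evaluation_interval
    rules:
      # Precompute the per-job request rate so dashboards query one cheap series
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Rules in a group are evaluated in order, so a later rule
      # can build on the series recorded by an earlier one
      - record: job:http_requests_total:rate5m_avg1h
        expr: avg_over_time(job:http_requests_total:rate5m[1h])
```

Running `promtool check rules` against the file validates its syntax before Prometheus loads it.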

Alerting Rules

# prometheus-rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 5 minutes."
      
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current: {{ $value }}%)"
      
      - alert: HighCPUUsage
        expr: |
          100 - (
            avg by (instance) (
              rate(node_cpu_seconds_total{mode="idle"}[5m])
            ) * 100
          ) > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current: {{ $value }}%)"
      
      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"} /
            node_filesystem_size_bytes{mountpoint="/"}
          ) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk has less than 10% space remaining"
      
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) /
            sum by (job) (rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is above 5% (current: {{ $value }}%)"
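
Alerting rules can be unit-tested offline with `promtool test rules`. The file below is a sketch of such a test for the `NodeDown` alert. It assumes the rules above are saved as `prometheus-rules.yml` in the same directory; the test file name and the `node1:9100` instance are made up for the example.

```yaml
# prometheus-rules-test.yml (hypothetical file name)
rule_files:
  - prometheus-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic input: the target reports up == 0 at every minute from 0m to 10m
    input_series:
      - series: 'up{job="node-exporter", instance="node1:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # After 6 minutes the 5m "for" clause has elapsed, so the alert should fire
      - eval_time: 6m
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node-exporter
              instance: node1:9100
            exp_annotations:
              summary: "Node node1:9100 is down"
              description: "Node node1:9100 has been down for more than 5 minutes."
```

Run it with `promtool test rules prometheus-rules-test.yml`.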

Alertmanager

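Alertmanager receives alerts from Prometheus, then deduplicates, groups, and routes them to receivers such as email, Slack, and PagerDuty. It also handles silencing and inhibition.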

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app_password'
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      continue: true  # Keep evaluating the sibling routes below after this one matches
    - matchers:
        - severity="warning"
      receiver: 'slack-warnings'

inhibit_rules:
  # Suppress a warning while a critical alert with the same alertname and instance is firing
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']

receivers:
  - name: 'default'
    email_configs:
      - to: 'oncall@example.com'
        headers:
          Subject: '{{ template "email.default.subject" . }}'
        html: '{{ template "email.default.html" . }}'
  
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'your-routing-key'  # Integration key for PagerDuty Events API v2
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
  
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
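
The Prometheus configuration earlier sends alerts to `alertmanager:9093`, so Alertmanager has to be reachable under that name. The manifests below sketch one way to run it in the same style as the other Deployments in this lesson. They assume `alertmanager.yml` is stored in a Secret named `alertmanager-config` (a hypothetical name), since the file contains SMTP and Slack credentials.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:latest  # pin a specific version tag in production
          args:
            - '--config.file=/etc/alertmanager/alertmanager.yml'
            - '--storage.path=/alertmanager'
          ports:
            - containerPort: 9093
          volumeMounts:
            - name: config
              mountPath: /etc/alertmanager
      volumes:
        - name: config
          secret:
            secretName: alertmanager-config  # hypothetical Secret holding alertmanager.yml
---
# Makes Alertmanager reachable as alertmanager:9093
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
spec:
  selector:
    app: alertmanager
  ports:
    - port: 9093
      targetPort: 9093
```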

Grafana

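Grafana is an open-source visualization and analytics platform that queries data sources such as Prometheus and renders the results as dashboards.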

Installation

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: admin-password
            # Optional: pie charts are a built-in panel type since Grafana 8
            - name: GF_INSTALL_PLUGINS
              value: grafana-piechart-panel
          volumeMounts:
            - name: storage
              mountPath: /var/lib/grafana
            - name: datasources
              mountPath: /etc/grafana/provisioning/datasources
            - name: dashboards
              mountPath: /etc/grafana/provisioning/dashboards
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: datasources
          configMap:
            name: grafana-datasources
        - name: dashboards
          configMap:
            name: grafana-dashboards
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
data:
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false
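
The Deployment also mounts a ConfigMap named `grafana-dashboards` that is not defined above. A minimal sketch of it follows, assuming dashboards are provisioned from files. The provider name and refresh interval are illustrative choices.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
data:
  # Dashboard JSON exported from Grafana would be added as further keys
  # next to this one, for example node-overview.json (hypothetical name).
  dashboards.yml: |
    apiVersion: 1
    providers:
      - name: default            # illustrative provider name
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        updateIntervalSeconds: 30
        options:
          # Same directory the Deployment mounts this ConfigMap into
          path: /etc/grafana/provisioning/dashboards
```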

Key Dashboard Metrics

Category	Metrics
Infrastructure	CPU, Memory, Disk, Network, Load
Application	Request rate, Latency, Error rate, Saturation
Business	Active users, Transactions, Revenue

USE Method (Resources)

Utilization: Percentage of time the resource is busy
Saturation: Amount of queued work the resource cannot yet service
Errors: Error count
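
As a rough illustration, the three USE signals could be recorded for a node with node-exporter metrics as follows. The group and series names are illustrative, and CPU and network are only one possible choice of resource.

```yaml
groups:
  - name: use_method_node   # illustrative group name
    rules:
      # Utilization: fraction of CPU time spent non-idle, per instance
      - record: instance:node_cpu_utilization:ratio_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Saturation: 1-minute load average per CPU core
      # (values above 1 suggest work is queuing)
      - record: instance:node_load1_per_cpu:ratio
        expr: |
          sum by (instance) (node_load1)
          /
          count by (instance) (node_cpu_seconds_total{mode="idle"})

      # Errors: network receive errors per second
      - record: instance:node_network_receive_errs:rate5m
        expr: sum by (instance) (rate(node_network_receive_errs_total[5m]))
```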

RED Method (Microservices)

Rate: Requests per second
Errors: Failed requests
Duration: Time per request
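
The RED signals map directly onto the HTTP metrics used earlier in this lesson. The rule group below is a sketch under the assumption that services expose `http_requests_total` with a `status` label and an `http_request_duration_seconds` histogram. The recorded series names are illustrative.

```yaml
groups:
  - name: red_method_http   # illustrative group name
    rules:
      # Rate: requests per second, per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Errors: fraction of requests that returned a 5xx status
      - record: job:http_requests_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Duration: 95th-percentile request time, per job
      - record: job:http_request_duration_seconds:p95_5m
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```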

Next Steps

Now let's explore centralized logging with the ELK stack.