Monitoring with Prometheus & Grafana

Learn monitoring fundamentals with Prometheus for metrics collection and Grafana for visualization. Set up alerts and dashboards.

By TechCoder Team
Last updated: 2026-06-02

In a Nutshell

This hands-on tutorial focuses on the practical implementation of monitoring with Prometheus and Grafana: collecting metrics, building dashboards, and setting up alerts.

Monitoring provides visibility into system health, performance, and reliability, which is essential for catching problems before users notice them.

Monitoring Fundamentals

┌─────────────────────────────────────────────────────────────────┐
│                        Monitoring Stack                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│   │   Targets   │───>│ Prometheus  │───>│   Grafana   │         │
│   │(Apps/Infra) │    │(Metrics DB) │    │(Dashboards) │         │
│   └─────────────┘    └──────┬──────┘    └─────────────┘         │
│                             │                                   │
│                             ▼                                   │
│                      ┌─────────────┐                            │
│                      │Alertmanager │                            │
│                      └──────┬──────┘                            │
│                             │                                   │
│                             ▼                                   │
│                      ┌─────────────┐                            │
│                      │  PagerDuty  │                            │
│                      │    Slack    │                            │
│                      │    Email    │                            │
│                      └─────────────┘                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

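Key Metric Types

| Type      | Description                                           | Example                   |
|-----------|-------------------------------------------------------|---------------------------|
| Counter   | Cumulative value that only increases                  | Total requests, errors    |
| Gauge     | Value that can go up or down                          | Temperature, memory usage |
| Histogram | Observations counted in configurable buckets          | Request duration          |
| Summary   | Like a histogram, but exposes configurable quantiles  | Request latency           |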
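
To make the difference between the metric types concrete, here is a minimal sketch in plain Python. The class names mirror the metric types, but this is an illustrative model written for this tutorial, not the API of any Prometheus client library.

```python
from bisect import bisect_left


class Counter:
    """Cumulative value that can only increase."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that may go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations into cumulative buckets, plus a running sum and count."""

    def __init__(self, upper_bounds):
        # An implicit +Inf bucket catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Buckets are cumulative: every bucket whose bound is >= value is incremented.
        first = bisect_left(self.upper_bounds, value)
        for i in range(first, len(self.bucket_counts)):
            self.bucket_counts[i] += 1


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
print(latency.bucket_counts)  # [1, 3, 3, 4]
```

Because the buckets are cumulative, the last (+Inf) bucket always equals the total observation count.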

Prometheus

Prometheus is an open-source monitoring and alerting toolkit. It pulls (scrapes) metrics from targets over HTTP and stores them as time series.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                           Prometheus                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Retrieval ──> TSDB ──> HTTP API ──> UI / Grafana / Alerts     │
│      │            │                                             │
│      │            └─> WAL (Write-Ahead Log)                     │
│      └─> Pull metrics from exporters                            │
│                                                                 │
│   Service Discovery: Kubernetes, Consul, EC2, file, etc.        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
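
The "pull" step in the diagram means Prometheus fetches a plain-text page of metrics from each target. The sketch below renders one metric family in that text exposition format. The function names and the sample metric are illustrative assumptions, and in practice a client library would generate this output for you.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines inside a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family as exposition text.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    [({"method": "get", "code": "200"}, 1027)],
))
```

Each scrape reads the current values, so the target only needs to keep running totals, not history.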

Installation

# prometheus-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: production
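        # external_labels are static strings, not Go templates.
        # 'prometheus-0' is a placeholder replica name.
        replica: 'prometheus-0'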
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']
    
    rule_files:
      - /etc/prometheus/rules/*.yml
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
      
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']
      
      - job_name: 'application'
        static_configs:
          - targets: ['app:8080']
        metrics_path: /actuator/prometheus
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--storage.tsdb.retention.time=15d'
            - '--web.console.libraries=/usr/share/prometheus/console_libraries'
            - '--web.console.templates=/usr/share/prometheus/consoles'
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: storage
              mountPath: /prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: storage
          persistentVolumeClaim:
            claimName: prometheus-pvc
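
The `__address__` relabel rule in the `kubernetes-pods` job is the least obvious part of this config: it swaps the pod's port for the one named in the `prometheus.io/port` annotation. The following Python sketch models a `replace` action under simplifying assumptions (one rule, Python's `re` instead of RE2). The function name and the sample IP address are hypothetical.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Simplified model of a relabel rule with action: replace."""
    # Source label values are joined with the separator before matching.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: labels are left untouched
    # Translate $1-style references into Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


pod = {
    "__address__": "10.0.0.5:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
}
result = relabel_replace(
    pod,
    ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",
    "__address__",
)
print(result["__address__"])  # 10.0.0.5:9102
```

If the annotation is missing, the joined string ends in `;` with no digits, the regex does not match, and the original address is kept.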

PromQL - Query Language

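# Basic queries
up                              # 1 if the last scrape of a target succeeded, 0 otherwise
node_cpu_seconds_total          # Cumulative CPU seconds per core and mode (a counter, not a usage percentage)
rate(http_requests_total[5m])   # Per-second request rate, averaged over the last 5 minutes

# Aggregations
sum by (job) (up)                                    # Number of healthy targets per job
avg by (instance) (node_memory_MemAvailable_bytes)   # Average available memory per instance

# Range vector selectors
http_requests_total{job="api"}[1h]   # Raw samples from the last hour (input for functions such as rate)
increase(http_requests_total[1h])    # Total increase over the last hour

# Functions
rate(http_requests_total[5m])    # Average per-second rate; preferred for alerts and recording rules
irate(http_requests_total[5m])   # Instant rate from the last two samples; for graphing fast-moving counters
# 95th-percentile latency, aggregated across series by bucket boundary (le)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Recording rules (YAML entries in a rule file, not queries typed into the UI)
record: job:http_requests_total:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))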
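
To show what `rate()` and `histogram_quantile()` compute, here is a simplified Python model of both. It is a sketch for intuition only: the real `rate()` also extrapolates to the edges of the window, which is skipped here, and the function names and sample data are made up for the example.

```python
def simple_rate(samples):
    """Per-second increase over a window of (timestamp, value) counter samples."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole new value counts.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Assumes observations are spread evenly inside each bucket and that the
    last bucket has upper bound +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if upper_bound == float("inf") or in_bucket == 0:
                return lower_bound  # cannot interpolate here
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


print(simple_rate([(0, 100), (60, 160), (120, 220)]))  # 1.0
```

The interpolation explains why bucket boundaries matter: a quantile estimate can never be more precise than the bucket it falls into.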

Alerting Rules

# prometheus-rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 5 minutes."
      
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current: {{ $value }}%)"
      
      - alert: HighCPUUsage
        expr: |
          100 - (
            avg by (instance) (
              rate(node_cpu_seconds_total{mode="idle"}[5m])
            ) * 100
          ) > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current: {{ $value }}%)"
      
      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"} /
            node_filesystem_size_bytes{mountpoint="/"}
          ) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk has less than 10% space remaining"
      
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) /
            sum by (job) (rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is above 5% (current: {{ $value }}%)"
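
The `for:` clause in each rule delays firing until the condition has held continuously for that long. A minimal sketch of the resulting state machine, assuming evenly spaced rule evaluations (the function name is hypothetical):

```python
def alert_states(condition_results, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results holds one boolean per evaluation: did the expression
    return a result at that time?
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_results):
        now = i * interval_s
        if not is_true:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Condition true for six evaluations, 60 s apart, with for: 5m
print(alert_states([True] * 6, 60, 300))
# ['pending', 'pending', 'pending', 'pending', 'pending', 'firing']
```

A single false evaluation resets the timer, so a flapping condition stays pending and never pages anyone.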

Alertmanager

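Alertmanager handles alerts sent by Prometheus: it deduplicates and groups them, then routes them to receivers such as email, Slack, or PagerDuty.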

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app_password'
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
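    # 'matchers' replaces the deprecated 'match' / 'match_re' fields
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      continue: true
    - matchers:
        - severity="warning"
      receiver: 'slack-warnings'

# Suppress a warning when a critical alert with the same name and instance is firing
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']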

receivers:
  - name: 'default'
    email_configs:
      - to: 'oncall@example.com'
        # The subject is set through 'headers'; the body through 'html' or 'text'
        headers:
          Subject: '{{ template "email.default.subject" . }}'
        html: '{{ template "email.default.html" . }}'
  
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-service-key'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
  
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
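
The routing tree decides which receiver gets each alert. The sketch below models that walk in Python under simplifying assumptions: equality matching only, and every route names its own receiver. `TREE` mirrors the routes configured above, and the function names are illustrative.

```python
def matches(route, labels):
    """A route matches when every label in its 'match' map equals the alert's label."""
    return all(labels.get(key) == value for key, value in route.get("match", {}).items())


def resolve_receivers(route, labels):
    """Walk the routing tree and return the receivers an alert reaches."""
    receivers = []
    for child in route.get("routes", []):
        if matches(child, labels):
            receivers.extend(resolve_receivers(child, labels))
            if not child.get("continue", False):
                break  # first match wins unless 'continue: true'
    # If no child matched, the alert is handled by the current node.
    return receivers or [route["receiver"]]


TREE = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}

print(resolve_receivers(TREE, {"alertname": "NodeDown", "severity": "critical"}))
# ['pagerduty-critical']
print(resolve_receivers(TREE, {"alertname": "Custom", "severity": "info"}))
# ['default']
```

Note that `continue: true` on the critical route only has a visible effect if a later sibling route also matches critical alerts.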

Grafana

Grafana is a visualization and analytics platform. It queries data sources such as Prometheus and renders the results as dashboards.

Installation

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: admin-password
            - name: GF_INSTALL_PLUGINS
              value: grafana-piechart-panel
          volumeMounts:
            - name: storage
              mountPath: /var/lib/grafana
            - name: datasources
              mountPath: /etc/grafana/provisioning/datasources
            - name: dashboards
              mountPath: /etc/grafana/provisioning/dashboards
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: datasources
          configMap:
            name: grafana-datasources
        - name: dashboards
          configMap:
            name: grafana-dashboards
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
data:
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false
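
With this data source in place, Grafana sends PromQL queries to the Prometheus HTTP API at the configured URL. The sketch below shows what an instant query against `/api/v1/query` looks like and how the JSON response is shaped. It builds the URL and parses a sample response without making a network call; the helper names and sample values are assumptions for illustration.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql, time=None):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    params = {"query": promql}
    if time is not None:
        params["time"] = time
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def parse_vector(response_body):
    """Extract (labels, value) pairs from an instant-vector response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]


print(instant_query_url("http://prometheus:9090", "up"))
# http://prometheus:9090/api/v1/query?query=up

sample = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"__name__":"up","job":"prometheus"},"value":[1700000000,"1"]}]}}'
)
print(parse_vector(sample))
# [({'__name__': 'up', 'job': 'prometheus'}, 1.0)]
```

Sample values arrive as strings in the JSON, which is why the parser converts them with `float()`.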

Key Dashboard Metrics

| Category       | Metrics                                      |
|----------------|----------------------------------------------|
| Infrastructure | CPU, Memory, Disk, Network, Load             |
| Application    | Request rate, Latency, Error rate, Saturation |
| Business       | Active users, Transactions, Revenue          |

USE Method

  • Utilization: Percent time busy
  • Saturation: Queue length / extra work waiting
  • Errors: Error count
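
As a worked example of the Utilization part, the following sketch derives CPU utilization from two readings of an idle-seconds counter, the same idea as the `HighCPUUsage` alert expression. The function name and the numbers are illustrative assumptions.

```python
def cpu_utilization_percent(idle_start, idle_end, window_s, cores=1):
    """Percent of time the CPU was busy, from two readings of an idle-seconds counter."""
    # Fraction of the available CPU time (window * cores) that was spent idle.
    idle_fraction = (idle_end - idle_start) / (window_s * cores)
    return 100.0 * (1.0 - idle_fraction)


# 75 idle seconds accumulated during a 300-second window on one core
print(cpu_utilization_percent(1000.0, 1075.0, 300))  # 75.0
```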

RED Method (Microservices)

  • Rate: Requests per second
  • Errors: Failed requests
  • Duration: Time per request
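
The three RED signals can be computed from a plain list of request records. This is a minimal sketch assuming each record is a `(status_code, duration_seconds)` pair and that 5xx responses count as errors. In a real setup these values would come from PromQL queries over counters and histograms rather than raw request logs.

```python
def red_summary(requests, window_s):
    """Compute Rate, Errors, and Duration for requests seen during a window."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_duration": None}
    errors = sum(1 for status, _ in requests if status >= 500)
    durations = sorted(duration for _, duration in requests)
    # Nearest-rank 95th percentile, using integer arithmetic for ceil(0.95 * total).
    rank = (95 * total + 99) // 100
    return {
        "rate": total / window_s,            # requests per second
        "error_ratio": errors / total,       # share of failed requests
        "p95_duration": durations[rank - 1],
    }


# 20 requests in a 10-second window, one of which failed
sample = [(200, i / 100) for i in range(1, 20)] + [(500, 20 / 100)]
print(red_summary(sample, 10))
# {'rate': 2.0, 'error_ratio': 0.05, 'p95_duration': 0.19}
```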

Quiz

What is a counter metric type in Prometheus?

  • A value that can go up and down
  • A cumulative metric that only increases
  • A fixed value that never changes
  • A metric with multiple values

Next Steps

Now let's explore centralized logging with the ELK stack.