Monitoring with Prometheus & Grafana
Learn monitoring fundamentals with Prometheus for metrics collection and Grafana for visualization. Set up alerts and dashboards.
By TechCoder Team. Last updated: 2026-06-02
In a Nutshell
This hands-on tutorial focuses on practical implementation: deploying Prometheus and Grafana on Kubernetes, querying metrics with PromQL, and routing alerts through Alertmanager.
Monitoring provides visibility into system health, performance, and reliability, which is essential for catching problems before users do.
Monitoring Fundamentals
┌─────────────────────────────────────────────────────────────────┐
│                        Monitoring Stack                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│   │   Targets   │───>│ Prometheus  │───>│   Grafana   │         │
│   │(Apps/Infra) │    │(Metrics DB) │    │(Dashboards) │         │
│   └─────────────┘    └──────┬──────┘    └─────────────┘         │
│                             │                                   │
│                             ▼                                   │
│                      ┌─────────────┐                            │
│                      │    Alert    │                            │
│                      │   Manager   │                            │
│                      └──────┬──────┘                            │
│                             │                                   │
│                             ▼                                   │
│                      ┌─────────────┐                            │
│                      │  PagerDuty  │                            │
│                      │    Slack    │                            │
│                      │    Email    │                            │
│                      └─────────────┘                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Key Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Cumulative value that only increases (and resets to zero on restart) | Total requests, total errors |
| Gauge | Value that can go up or down | Temperature, memory usage |
| Histogram | Counts observations in configurable buckets; quantiles are estimated at query time | Request duration |
| Summary | Like a histogram, but computes configurable quantiles on the client | Request latency |
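To make the histogram row concrete, here is a minimal Python sketch of cumulative buckets and a quantile estimate by linear interpolation, in the spirit of what `histogram_quantile` does at query time. The `Histogram` class and its method names are hypothetical and are not the `prometheus_client` API; the sketch also assumes at least one finite bucket bound.

```python
import bisect


class Histogram:
    """Toy histogram with cumulative 'le' buckets (illustrative only)."""

    def __init__(self, upper_bounds):
        # Prometheus histograms always end with a +Inf bucket.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total_sum = 0.0

    def observe(self, value):
        self.total_sum += value
        # Buckets are cumulative: every bucket whose bound >= value is incremented.
        first = bisect.bisect_left(self.upper_bounds, value)
        for i in range(first, len(self.upper_bounds)):
            self.bucket_counts[i] += 1

    def quantile(self, q):
        """Estimate the q-quantile by interpolating inside the matching bucket."""
        total = self.bucket_counts[-1]
        if total == 0:
            return float("nan")
        rank = q * total
        idx = next(i for i, c in enumerate(self.bucket_counts) if c >= rank)
        if self.upper_bounds[idx] == float("inf"):
            # Falls in the +Inf bucket: report the highest finite bound.
            return self.upper_bounds[idx - 1]
        lower_bound = self.upper_bounds[idx - 1] if idx > 0 else 0.0
        lower_count = self.bucket_counts[idx - 1] if idx > 0 else 0
        in_bucket = self.bucket_counts[idx] - lower_count
        if in_bucket == 0:
            return lower_bound
        width = self.upper_bounds[idx] - lower_bound
        return lower_bound + width * (rank - lower_count) / in_bucket


h = Histogram([0.1, 0.5, 1.0])
for duration in [0.05, 0.2, 0.3, 0.4, 0.8, 2.0]:
    h.observe(duration)
print(h.bucket_counts)   # cumulative counts per bucket
print(h.quantile(0.5))   # estimated median request duration
```

Because the estimate is interpolated within a bucket, its accuracy depends on how well the bucket bounds match the real latency distribution.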
Prometheus
Prometheus is an open-source monitoring and alerting toolkit that pulls metrics from targets over HTTP and stores them as time series.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                           Prometheus                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Retrieval ──> TSDB ──> HTTP API ──> UI / Grafana / Alerts      │
│      │           │                                              │
│      │           └─> WAL (Write Ahead Log)                      │
│      └─> Pull metrics from exporters                            │
│                                                                 │
│  Service Discovery: Kubernetes, Consul, EC2, file, etc.         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
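The "pull metrics from exporters" step means each target serves plain text that Prometheus fetches on every scrape. As a rough illustration, the Python sketch below renders one metric family in the text exposition format. The function name `render_exposition` is hypothetical, and the sketch skips label-value escaping, so treat it as a simplified picture rather than a complete encoder.

```python
def render_exposition(name, help_text, metric_type, samples):
    """Render one metric family as Prometheus text exposition lines.

    samples is a list of (labels_dict, value) pairs.
    Simplification: label values are not escaped.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


body = render_exposition(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "get", "status": "200"}, 1027)],
)
print(body)
```

In practice a client library produces this output for you; the point here is only that a scrape is an ordinary HTTP GET returning lines like these.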
Installation
# prometheus-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: production
        # Placeholder value. external_labels are plain strings, not Go
        # templates, so '{{.ExternalURL}}' would be stored literally.
        replica: prometheus-0
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']
    rule_files:
      - /etc/prometheus/rules/*.yml
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # Use the prometheus.io/path annotation as the metrics path, if set
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          # Rewrite the scrape address to use the prometheus.io/port annotation
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']
      - job_name: 'application'
        static_configs:
          - targets: ['app:8080']
        metrics_path: /actuator/prometheus
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          # Consider pinning a specific version tag instead of latest in production
          image: prom/prometheus:latest
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--storage.tsdb.retention.time=15d'
            - '--web.console.libraries=/usr/share/prometheus/console_libraries'
            - '--web.console.templates=/usr/share/prometheus/consoles'
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: storage
              mountPath: /prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: storage
          persistentVolumeClaim:
            claimName: prometheus-pvc
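The last relabel rule in the `kubernetes-pods` job is the least obvious part of this configuration. The Python sketch below mimics what it does: the source label values are joined with `;`, the regex must match the whole joined string, and on a match the target label is rewritten from the capture groups. The function name `relabel_address` is hypothetical, and Python's `re` module stands in for the RE2 engine Prometheus uses, which behaves the same for this pattern.

```python
import re

# Same pattern as the relabel rule: host, optional existing port, ';', annotated port.
ADDRESS_REGEX = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address, port_annotation):
    """Return the new __address__ after the 'replace' relabel action."""
    # Prometheus joins source_labels with ';' (a missing label is an empty string).
    joined = f"{address};{port_annotation}"
    # Relabel regexes are fully anchored, hence fullmatch.
    match = ADDRESS_REGEX.fullmatch(joined)
    if match is None:
        # No match: 'replace' leaves the target label untouched.
        return address
    return match.expand(r"\1:\2")  # equivalent of replacement: $1:$2


print(relabel_address("10.0.0.5:8080", "9102"))  # port swapped for the annotated one
print(relabel_address("10.0.0.5", "9102"))       # port appended
print(relabel_address("10.0.0.5:8080", ""))      # no annotation: unchanged
```

The practical effect is that a pod can advertise its metrics port through the `prometheus.io/port` annotation, regardless of which container port service discovery picked up.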
PromQL - Query Language
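# Basic queries
up                                  # 1 if the last scrape of a target succeeded, 0 otherwise
node_cpu_seconds_total              # Cumulative CPU seconds per core and mode (a counter, not a usage percentage)
rate(http_requests_total[5m])       # Per-second request rate, averaged over 5 minutes
# Aggregations
sum by (job) (up)                                   # Number of healthy targets per job
avg by (instance) (node_memory_MemAvailable_bytes)  # Available memory per instance
# Range vector selectors
http_requests_total{job="api"}[1h]  # Raw samples from the last hour; pass to a function such as rate()
increase(http_requests_total[1h])   # Total increase over the last hour
# Functions
rate(http_requests_total[5m])       # Average per-second rate over the window; preferred for alerts
irate(http_requests_total[5m])      # Instantaneous rate from the last two samples; for volatile graphs
increase(http_requests_total[1h])   # rate() multiplied by the window length
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))  # Estimated 95th percentile latency
# Recording rules (YAML entries under groups[].rules in a rule file, not PromQL queries)
record: job:http_requests_total:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))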
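To see why `rate()` and `increase()` are safe to use on counters that restart, here is a simplified Python sketch of the calculation. It treats any drop in value as a counter reset and keeps counting from zero. The function names mirror PromQL for readability but are hypothetical helpers; unlike the real functions, this sketch does not extrapolate to the edges of the query window.

```python
def increase(samples):
    """Total counter increase across samples, tolerating counter resets.

    samples is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    if len(samples) < 2:
        return None  # a range needs at least two samples
    total = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (e.g. process restart): it started again from 0.
            total += value
        else:
            total += value - previous
        previous = value
    return total


def rate(samples):
    """Average per-second increase between the first and last sample."""
    inc = increase(samples)
    if inc is None:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return inc / elapsed if elapsed > 0 else None


# A counter scraped every 15s that restarts between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(increase(scrapes))
print(rate(scrapes))
```

This is also why raw counter values are rarely graphed directly: the per-second rate is the quantity that stays meaningful across restarts.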
Alerting Rules
# prometheus-rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been down for more than 5 minutes."
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current: {{ $value }}%)"
      - alert: HighCPUUsage
        # rate() rather than irate(): irate() only looks at the last two samples,
        # which makes it too spiky for alert conditions
        expr: |
          100 - (
            avg by (instance) (
              rate(node_cpu_seconds_total{mode="idle"}[5m])
            ) * 100
          ) > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current: {{ $value }}%)"
      - alert: DiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/"} /
            node_filesystem_size_bytes{mountpoint="/"}
          ) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk has less than 10% space remaining"
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) /
            sum by (job) (rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is above 5% (current: {{ $value }}%)"
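Each rule above has a `for:` clause, which delays firing until the expression has been true continuously for that long. The Python sketch below models that behaviour as a small state machine evaluated once per `evaluation_interval`. The function `alert_states` and its parameters are hypothetical names for illustration; the real rule manager tracks state per label set and has extra features this sketch leaves out.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, one per evaluation (True = expr returned a result).
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Evaluate every 60s with for: 2m; the condition clears once and then returns.
timeline = alert_states([True, True, True, False, True], eval_interval_s=60, for_s=120)
print(timeline)
```

Only alerts in the firing state are sent to Alertmanager, so a longer `for:` trades detection speed for fewer notifications about brief spikes.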
Alertmanager
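Alertmanager handles alerts sent by Prometheus: it deduplicates and groups them, applies inhibition rules, and routes notifications to receivers such as email, PagerDuty, or Slack.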
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app_password'  # placeholder; keep real credentials out of version control
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
receivers:
  - name: 'default'
    email_configs:
      - to: 'oncall@example.com'
        # email_configs has no subject/body fields: the subject is set
        # through headers and the body through html (or text)
        headers:
          Subject: '{{ template "email.default.subject" . }}'
        html: '{{ template "email.default.html" . }}'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-service-key'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
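The routing tree is easier to reason about once you trace an alert through it. The Python sketch below walks the `route` section of this configuration: child routes are checked in order, the first match wins unless `continue: true` is set, and an alert that matches no child falls back to the parent's receiver. The `resolve_receivers` helper and the dictionary layout are assumptions made for illustration; the sketch handles only equality `match` entries and assumes every child names its own receiver.

```python
def resolve_receivers(route, labels):
    """Return the receiver names an alert with these labels is routed to."""
    receivers = []
    for child in route.get("routes", []):
        matches = all(labels.get(k) == v for k, v in child.get("match", {}).items())
        if matches:
            receivers.extend(resolve_receivers(child, labels))
            if not child.get("continue", False):
                break  # first match wins unless continue is set
    # No child route matched: this node's own receiver handles the alert.
    return receivers or [route["receiver"]]


# The route tree from alertmanager.yml, expressed as plain dictionaries.
ROUTE = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}

print(resolve_receivers(ROUTE, {"alertname": "NodeDown", "severity": "critical"}))
print(resolve_receivers(ROUTE, {"alertname": "HighCPUUsage", "severity": "warning"}))
print(resolve_receivers(ROUTE, {"alertname": "Other", "severity": "info"}))
```

Note that `continue: true` on the critical route only matters if a later sibling route could also match critical alerts; with the two routes shown, critical alerts still reach PagerDuty only.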
Grafana
Grafana is an open-source visualization and analytics platform. Here it uses Prometheus as its data source for dashboards.
Installation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest
          ports:
            - containerPort: 3000
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: admin-password
            - name: GF_INSTALL_PLUGINS
              value: grafana-piechart-panel
          volumeMounts:
            - name: storage
              mountPath: /var/lib/grafana
            - name: datasources
              mountPath: /etc/grafana/provisioning/datasources
            - name: dashboards
              mountPath: /etc/grafana/provisioning/dashboards
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: grafana-pvc
        - name: datasources
          configMap:
            name: grafana-datasources
        - name: dashboards
          configMap:
            name: grafana-dashboards
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
data:
  datasources.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true
        editable: false
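With `access: proxy`, the Grafana server itself calls the Prometheus HTTP API at the configured `url`. As a rough sketch of that exchange, the Python below builds an instant-query URL for the `/api/v1/query` endpoint and extracts values from a response payload. The helper names are hypothetical, the sample payload is made up for illustration, and no network call is performed.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, time_s=None):
    """Build the URL for a Prometheus instant query."""
    params = {"query": promql}
    if time_s is not None:
        params["time"] = str(time_s)
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def vector_values(payload):
    """Map each series' job label to its float value from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError("query failed")
    results = payload["data"]["result"]
    # Each sample is [timestamp, "value-as-string"].
    return {r["metric"].get("job", ""): float(r["value"][1]) for r in results}


url = instant_query_url("http://prometheus:9090", "sum by (job) (up)")
print(url)

# Hand-written example payload in the shape the API returns for a vector result.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "node-exporter"}, "value": [1700000000, "3"]},
            {"metric": {"job": "application"}, "value": [1700000000, "1"]},
        ],
    },
}
print(vector_values(sample_payload))
```

The same URL can be opened with any HTTP client to check that the data source address is reachable from inside the cluster.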
Key Dashboard Metrics
| Category | Metrics |
|---|---|
| Infrastructure | CPU, Memory, Disk, Network, Load |
| Application | Request rate, Latency, Error rate, Saturation |
| Business | Active users, Transactions, Revenue |
USE Method (Resources)
- Utilization: Percentage of time the resource is busy
- Saturation: Queue length, i.e. extra work waiting
- Errors: Count of error events
RED Method (Microservices)
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Time taken per request
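To show what the three RED signals look like as numbers, here is a small Python sketch that derives them from a batch of request records. The `Request` type and `red_summary` function are hypothetical, and counting only 5xx responses as errors is an assumption chosen to match the `status=~"5.."` selector used in the alert rules.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    status: int


def red_summary(requests, window_s):
    """Compute Rate, Errors, and Duration for requests observed in a time window."""
    count = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = failure
    total_duration = sum(r.duration_s for r in requests)
    return {
        "rate_per_s": count / window_s,
        "error_ratio": errors / count if count else 0.0,
        "avg_duration_s": total_duration / count if count else 0.0,
        "max_duration_s": max((r.duration_s for r in requests), default=0.0),
    }


observed = [
    Request(0.1, 200),
    Request(0.2, 200),
    Request(0.3, 500),
    Request(0.4, 200),
]
summary = red_summary(observed, window_s=2)
print(summary)
```

In a real setup these values come from PromQL over counters and histograms rather than from raw request lists, but the definitions are the same.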
Next Steps
Now let's explore centralized logging with the ELK stack.