Unlocking Observability: A Comprehensive Guide to Setting Up Prometheus & Grafana
Improve Monitoring by Installing Prometheus and Grafana

When you hear the word "monitoring," what comes to your mind? Before I dove deep into it, I thought it was just about checking if servers were up or down - like a simple yes or no status. While that basic idea isn't wrong, there's so much more beneath the surface. Monitoring is really about creating an intricate web of metrics and alerts that determines not just what's running, but how different parts of a system are performing together.
Pre-requisites
To fully benefit from this article, readers should have the following prerequisites:
A Linux-based cloud server (Ubuntu was used in this implementation, but any Linux distribution should work)
Root or sudo access to the server
Basic knowledge of the Linux command line
Basic knowledge of YAML for configuration files
GitHub Actions configured for your repositories
Firewall configured to allow necessary ports:
SSH (port 22)
Grafana (port 3000)
Prometheus (port 9090)
Node Exporter (port 9100)
Blackbox Exporter (port 9115)
AlertManager (port 9093)
Sufficient server resources:
Minimum 2GB RAM
At least 10GB of available disk space
2+ CPU cores recommended for better performance
Network connectivity to monitor endpoints (for Blackbox Exporter)
A Slack workspace and permissions to create webhooks (for AlertManager integration)
Basic understanding of metrics and monitoring concepts
DNS records configured if you'll be accessing dashboards through domain names
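On Ubuntu, the ports listed above could be opened with ufw. The sketch below only prints the commands so they can be reviewed before anything changes. It assumes ufw is the firewall in use, and the variable and function names are illustrative. In practice you may prefer to expose only SSH and Grafana publicly and restrict the exporter ports to trusted addresses.

```shell
#!/bin/sh
# Ports from the prerequisites list: SSH, Grafana, Prometheus,
# Node Exporter, Blackbox Exporter, AlertManager.
MONITORING_PORTS="22 3000 9090 9100 9115 9093"

# Print one "ufw allow" command per port (dry run; nothing is changed).
print_firewall_rules() {
  for port in $MONITORING_PORTS; do
    echo "sudo ufw allow ${port}/tcp"
  done
}

print_firewall_rules
```

Once the list looks right, the printed commands can be run one by one.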
Part 1: Setting Up the Foundation
Let's start by creating the necessary directory structure for our monitoring stack:
sudo mkdir -p /opt/reconcile/prometheus
sudo mkdir -p /opt/reconcile/node_exporter
sudo mkdir -p /etc/apt/keyrings/
sudo mkdir -p /opt/reconcile/blackbox_exporter
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo mkdir -p /opt/dora-exporter /home/reconxi/pushgateway-1.11.0.linux-amd64
# Only directories are created here. The unit files (*.service) and config files (*.yml) are regular files and are created in Parts 2 and 3; running mkdir on those paths would turn them into directories and break the services.
Installing Core Components
Prometheus: The Heart of Our Metrics Collection
wget https://github.com/prometheus/prometheus/releases/download/v3.2.1/prometheus-3.2.1.linux-amd64.tar.gz
tar xvfz prometheus-3.2.1.linux-amd64.tar.gz
sudo mv prometheus-3.2.1.linux-amd64 /opt/reconcile/prometheus/
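Before unpacking a downloaded archive, it can be worth checking it against the checksums published with the release. The following is a minimal sketch, assuming the release page offers a sha256sums.txt file in the usual "digest, two spaces, file name" format and that sha256sum is installed. The function name verify_sha256 is hypothetical.

```shell
#!/bin/sh
# verify_sha256 ARCHIVE SUMS_FILE
# Succeeds only if ARCHIVE's SHA-256 digest matches its entry in SUMS_FILE,
# where each line of SUMS_FILE has the form "<hex digest>  <file name>".
verify_sha256() {
  archive="$1"
  sums_file="$2"
  name=$(basename "$archive")
  # Look up the expected digest for this file name.
  expected=$(awk -v f="$name" '$2 == f { print $1 }' "$sums_file")
  if [ -z "$expected" ]; then
    echo "no checksum listed for $name" >&2
    return 1
  fi
  # Compute the actual digest and compare.
  actual=$(sha256sum "$archive" | awk '{ print $1 }')
  [ "$expected" = "$actual" ]
}
```

A call such as verify_sha256 prometheus-3.2.1.linux-amd64.tar.gz sha256sums.txt would then gate the tar step. The same check applies to the other archives downloaded in this part.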
AlertManager: Our Notification Hub
# Set the AlertManager release to install (0.27.0 is only an example; use the release you need)
VERSION="0.27.0"
# Download AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v${VERSION}/alertmanager-${VERSION}.linux-amd64.tar.gz
# Extract the archive
tar xvf alertmanager-${VERSION}.linux-amd64.tar.gz
# Move binaries to appropriate location
sudo mv alertmanager-${VERSION}.linux-amd64/alertmanager /usr/local/bin/
sudo mv alertmanager-${VERSION}.linux-amd64/amtool /usr/local/bin/
Node Exporter: System-Level Insights
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.2.linux-amd64.tar.gz
sudo mkdir -p /opt/reconcile/node_exporter
sudo mv node_exporter-1.8.2.linux-amd64/* /opt/reconcile/node_exporter/
Blackbox Exporter: External Health Checks
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
tar xvfz blackbox_exporter-0.24.0.linux-amd64.tar.gz
sudo mkdir -p /opt/reconcile/blackbox_exporter
sudo mv blackbox_exporter-0.24.0.linux-amd64/* /opt/reconcile/blackbox_exporter/
Grafana: Visualizing Our Metrics
sudo apt-get install -y apt-transport-https software-properties-common wget
# Import the GPG key:
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
Add repositories:
# For stable releases
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com/ stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
# For beta releases (optional)
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com/ beta main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
Install Grafana:
sudo apt-get update
sudo apt-get install -y grafana
Part 2: Configuring Services
Creating Systemd Service Files
For Prometheus, create /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
WorkingDirectory=/opt/reconcile/prometheus/prometheus-3.2.1.linux-amd64
ExecStart=/opt/reconcile/prometheus/prometheus-3.2.1.linux-amd64/prometheus --config.file=prometheus.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
For AlertManager, create /etc/systemd/system/alertmanager.service:
[Unit]
Description=Prometheus AlertManager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager/
[Install]
WantedBy=multi-user.target
For Node Exporter, create /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/opt/reconcile/node_exporter/node_exporter
[Install]
WantedBy=multi-user.target
For Blackbox Exporter, create /etc/systemd/system/blackbox_exporter.service:
[Unit]
Description=Blackbox Exporter
After=network.target
[Service]
User=blackbox_exporter
Group=blackbox_exporter
Type=simple
ExecStart=/opt/reconcile/blackbox_exporter/blackbox_exporter --config.file=/opt/reconcile/blackbox_exporter/blackbox.yml
[Install]
WantedBy=multi-user.target
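Three of the unit files above run their service as a dedicated user (alertmanager, node_exporter, blackbox_exporter), and systemd will refuse to start a unit whose user does not exist. The steps so far do not create those accounts, so here is one possible sketch. It assumes useradd is available, as on Ubuntu, and both function names are hypothetical.

```shell
#!/bin/sh
# unit_user UNIT_FILE
# Print the value of the User= directive in a systemd unit file (empty if none).
unit_user() {
  sed -n 's/^User=//p' "$1"
}

# ensure_unit_user UNIT_FILE
# Create the unit's service account as a system user with no home directory
# and no login shell, unless it already exists. Needs sudo; run it by hand.
ensure_unit_user() {
  user=$(unit_user "$1")
  [ -n "$user" ] || return 0
  id -u "$user" >/dev/null 2>&1 ||
    sudo useradd --system --no-create-home --shell /usr/sbin/nologin "$user"
}
```

Calling ensure_unit_user on each of the three unit files would create any missing account. Afterwards, confirm that a matching group exists (for example with getent group alertmanager) and give the alertmanager user ownership of /etc/alertmanager and /var/lib/alertmanager so it can read its config and write its state.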
Part 3: Configuration Files
Prometheus Configuration
Create /opt/reconcile/prometheus/prometheus-3.2.1.linux-amd64/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
# Load rules once and periodically evaluate them
rule_files:
  - "rules/node_exporter_alerts.yml"
  - "rules/blackbox_alerts.yml"
  - "rules/dora_alerts.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9100']
  # PM2 metrics from host (for NextJS apps)
  # - job_name: 'pm2'
  #   scrape_interval: 10s
  #   static_configs:
  #     - targets: ['localhost:9209']
  # Blackbox exporter for HTTP/HTTPS uptime and SSL monitoring
  - job_name: 'blackbox_http'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response
    static_configs:
      - targets:
          - https://reconxi.com
          - https://dev.reconxi.com
          # Add more URLs as needed
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # Blackbox exporter's address
  # Blackbox exporter for SSL monitoring
  - job_name: 'blackbox_ssl'
    metrics_path: /probe
    params:
      module: [tls]  # Use the TLS probe
    static_configs:
      - targets:
          - reconxi.com:443
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
  - job_name: 'dora-metrics'
    static_configs:
      - targets: ['localhost:8888']
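When configuration is copied from a web page, straight quotes are sometimes converted into typographic ones. YAML treats those as ordinary characters rather than string delimiters, so a target can silently end up with quote characters in its address. The helper below is a small sketch for spotting them before starting Prometheus; its name is hypothetical.

```shell
#!/bin/sh
# find_smart_quotes FILE
# Print (with line numbers) every line containing a typographic quote.
# Exit status is 0 if any were found, 1 if the file is clean.
find_smart_quotes() {
  grep -n -e '‘' -e '’' -e '“' -e '”' "$1"
}
```

If the file is clean, the promtool binary that ships in the Prometheus archive can validate the rest of the syntax with promtool check config prometheus.yml.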
Blackbox Exporter Configuration
Create /opt/reconcile/blackbox_exporter/blackbox.yml (the path passed to --config.file in the unit file above). This file defines the probe modules that the blackbox_http and blackbox_ssl scrape jobs refer to by name. The scrape jobs themselves belong in prometheus.yml, not here:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      # With no valid_status_codes set, any 2xx response counts as success
  tls:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true  # Complete a TLS handshake so certificate metrics are exposed
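The relabel_configs in the blackbox scrape jobs can look opaque. In effect, the listed target becomes a target query parameter and the request itself is sent to the exporter on port 9115. The sketch below builds the equivalent URL by hand so a probe can be tried with curl; the function name probe_url is hypothetical, and unlike Prometheus it does not URL-encode the target.

```shell
#!/bin/sh
# probe_url TARGET MODULE [EXPORTER]
# Build the URL that corresponds to a relabeled scrape: the original target
# becomes the "target" query parameter and the request goes to the exporter.
probe_url() {
  target="$1"
  module="$2"
  exporter="${3:-localhost:9115}"
  echo "http://${exporter}/probe?target=${target}&module=${module}"
}

probe_url https://reconxi.com http_2xx
```

Fetching that URL with curl and looking for the probe_success metric in the response is a quick way to confirm that a module and target work before Prometheus scrapes them.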
AlertManager Configuration
Create /etc/alertmanager/alertmanager.yml:
global:
  resolve_timeout: 1m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'  # Replaced in Part 5; never publish a real webhook URL
route:
  group_by: ['alertname', 'instance', 'job']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warning'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - send_resolved: true
        channel: '#devops-alerts'
        title: '{{ if eq .Status "firing" }}:red_circle: ALERT{{ else }}:large_green_circle: RESOLVED{{ end }}: {{ .CommonLabels.alertname }}'
        text: |
          {{ if eq .Status "firing" }}*SYSTEM ALERT*{{ else }}*SYSTEM RECOVERED*{{ end }}
          {{ range .Alerts }}
          *{{ if eq .Status "firing" }}{{ .Annotations.summary }}{{ else }}{{ .Annotations.resolved_summary }}{{ end }}*
          {{ if eq .Status "firing" }}{{ .Annotations.description }}{{ else }}{{ .Annotations.resolved_description }}{{ end }}
          *:alarm_clock: Incident Details:*
          • Started: {{ .StartsAt }}
          • Status: {{ .Status | toUpper }}
          *:mag: Technical Information:*
          • System: {{ .Labels.instance }}
          • Job: {{ .Labels.job }}
          {{ if eq .Status "firing" }}
          • Severity: {{ .Labels.severity }}
          *:busts_in_silhouette: Impact Assessment:*
          • Users affected: {{ if eq .Labels.job "blackbox_http" }}Website visitors{{ else }}Service users{{ end }}
          *:link: Diagnostic Links:*
          • <https://grafana.website.com|View Dashboard>
          • <https://logs.website.com|View Logs>
          *:busts_in_silhouette: Team to Notify:* @team-reconxi-devops
          {{ end }}
          {{ end }}
        icon_emoji: '{{ if eq .Status "firing" }}:red_circle:{{ else }}:green_circle:{{ end }}'
  - name: 'slack-critical'
    slack_configs:
      - send_resolved: true
        channel: '#devops-alerts'
        title: '{{ if eq .Status "firing" }}:red_circle: CRITICAL{{ else }}:large_green_circle: RESOLVED{{ end }}: {{ .CommonLabels.alertname }}'
        text: |
          {{ if eq .Status "firing" }}*CRITICAL SYSTEM ALERT*{{ else }}*SYSTEM RECOVERED*{{ end }}
          {{ range .Alerts }}
          *{{ if eq .Status "firing" }}{{ .Annotations.summary }}{{ else }}{{ .Annotations.resolved_summary }}{{ end }}*
          {{ if eq .Status "firing" }}{{ .Annotations.description }}{{ else }}{{ .Annotations.resolved_description }}{{ end }}
          *:alarm_clock: Incident Details:*
          • Started: {{ .StartsAt }}
          • Status: {{ .Status | toUpper }}
          *:mag: Technical Information:*
          • System: {{ .Labels.instance }}
          • Job: {{ .Labels.job }}
          {{ if eq .Status "firing" }}
          {{ if eq .Labels.job "blackbox_http" }}
          • Error: Connection failed
          • HTTP Status: No response
          {{ end }}
          *:busts_in_silhouette: Impact Assessment:*
          • Severity: Critical
          • User Impact: {{ if eq .Labels.job "blackbox_http" }}All website users affected{{ else }}Service degradation{{ end }}
          *Related Systems:*
          • API Gateway: {{ if eq .Labels.job "blackbox_http" }}Operational{{ end }}
          • Database: Operational
          *:link: Diagnostic Links:*
          • <https://grafana.reconxi.com|View Dashboard>
          *:memo: Actions:*
          {{ if eq .Labels.job "blackbox_http" }}Check load balancer, verify instances, review logs before restarting services.{{ else }}Check for runaway processes, recent deployments, or traffic spikes.{{ end }}
          *:rotating_light: Attention:* <@U08ASEFTPFW> <@U08AQBRT1EG> <@U08BD1J3C5N> <@U08BFNMP18U> <@U08AX7P0G9K> <@U08B1TXE9LZ> <@U08BDQA218Q> <@U08AR58UXMJ> <@U08B5JG3GP3> <@U08A7RXNHFZ> <@U08AM349U2Z>
          {{ end }}
          {{ end }}
        icon_emoji: '{{ if eq .Status "firing" }}:fire:{{ else }}:white_check_mark:{{ end }}'
        link_names: true
  - name: 'slack-warning'
    slack_configs:
      - send_resolved: true
        channel: '#devops-alerts'
        title: '{{ if eq .Status "firing" }}:warning: WARNING{{ else }}:large_green_circle: RESOLVED{{ end }}: {{ .CommonLabels.alertname }}'
        text: |
          {{ if eq .Status "firing" }}*WARNING ALERT*{{ else }}*WARNING RESOLVED*{{ end }}
          {{ range .Alerts }}
          *{{ if eq .Status "firing" }}{{ .Annotations.summary }}{{ else }}{{ .Annotations.resolved_summary }}{{ end }}*
          {{ if eq .Status "firing" }}{{ .Annotations.description }}{{ else }}{{ .Annotations.resolved_description }}{{ end }}
          *:alarm_clock: Incident Details:*
          • Started: {{ .StartsAt }}
          • Status: {{ .Status | toUpper }}
          *:mag: Technical Information:*
          • System: {{ .Labels.instance }}
          • Job: {{ .Labels.job }}
          {{ if eq .Status "firing" }}
          {{ if eq .Labels.alertname "SlowResponseTime" }}
          • Response Time: {{ if eq .Labels.job "blackbox_http" }}Slow{{ end }}
          {{ end }}
          {{ if eq .Labels.alertname "SSLCertExpiringSoon" }}
          • Certificate Expires: Soon
          {{ end }}
          {{ if eq .Labels.alertname "HighCPULoad" }}
          • CPU Load: High
          {{ end }}
          {{ if eq .Labels.alertname "HighMemoryLoad" }}
          • Memory Use: High
          {{ end }}
          {{ if eq .Labels.alertname "HighDiskUsage" }}
          • Disk Usage: High
          {{ end }}
          *:busts_in_silhouette: Impact Assessment:*
          • Severity: Warning
          • User Impact: Potential performance degradation
          *:link: Diagnostic Links:*
          • <https://grafana.reconxi.com|View Dashboard>
          *:bulb: Recommended Actions:*
          {{ if eq .Labels.alertname "SlowResponseTime" }}Check database queries or high backend resource usage.{{ else if eq .Labels.alertname "SSLCertExpiringSoon" }}Renew SSL certificate before expiration.{{ else if eq .Labels.alertname "HighCPULoad" }}Identify CPU-intensive processes and optimize.{{ else if eq .Labels.alertname "HighMemoryLoad" }}Check for memory leaks or increase available memory.{{ else if eq .Labels.alertname "HighDiskUsage" }}Clean up disk space or expand storage.{{ end }}
          *:rotating_light: Attention:* <@U08ASEFTPFW> <@U08AQBRT1EG> <@U08BD1J3C5N> <@U08BFNMP18U> <@U08AX7P0G9K> <@U08B1TXE9LZ> <@U08BDQA218Q> <@U08AR58UXMJ> <@U08B5JG3GP3> <@U08A7RXNHFZ> <@U08AM349U2Z>
          {{ end }}
          {{ end }}
        icon_emoji: '{{ if eq .Status "firing" }}:warning:{{ else }}:white_check_mark:{{ end }}'
        link_names: true
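The Slack templates branch on alert names such as HighCPULoad and read annotations such as summary and resolved_summary. Those come from the Prometheus rule files listed under rule_files, which this article does not show. As a hedged sketch of what one such rule might look like, the function below writes a single-alert rule file. The 80% threshold, the 5m duration, the group name, and all annotation wording are illustrative assumptions, not the values used in our setup.

```shell
#!/bin/sh
# write_node_rules FILE
# Write an example rule file containing a single HighCPULoad alert.
# Threshold, duration and annotation text are illustrative assumptions.
write_node_rules() {
  cat > "$1" <<'EOF'
groups:
  - name: node_exporter_alerts
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for 5 minutes."
          resolved_summary: "CPU load back to normal on {{ $labels.instance }}"
          resolved_description: "CPU usage has dropped below 80%."
EOF
}
```

The severity label is what the route tree matches on, so this alert would be delivered by the slack-warning receiver. Written to rules/node_exporter_alerts.yml next to prometheus.yml, the file can be validated with promtool check rules before Prometheus is restarted.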
Part 4: Starting Services and Verifying Setup
Reload systemd and start all services:
sudo systemctl daemon-reload
sudo systemctl enable prometheus alertmanager node_exporter blackbox_exporter grafana-server
sudo systemctl start prometheus alertmanager node_exporter blackbox_exporter grafana-server
Verify that all services are running properly:
sudo systemctl status prometheus alertmanager node_exporter blackbox_exporter grafana-server
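systemctl only reports that a process is running, so it can help to confirm that each component also answers over HTTP. The following is a rough sketch, assuming curl is installed and the default ports from the prerequisites are in use. The function names are hypothetical.

```shell
#!/bin/sh
# Print "name url" pairs for each component's HTTP endpoint.
# Ports follow the prerequisites list; paths are the components' standard
# health or metrics endpoints.
health_endpoints() {
  cat <<'EOF'
prometheus http://localhost:9090/-/healthy
alertmanager http://localhost:9093/-/healthy
grafana http://localhost:3000/api/health
node_exporter http://localhost:9100/metrics
blackbox_exporter http://localhost:9115/metrics
EOF
}

# Request every endpoint and report OK or FAIL per component.
check_endpoints() {
  health_endpoints | while read -r name url; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name"
    fi
  done
}
```

Running check_endpoints on the server prints one line per component. A FAIL next to a service that systemctl shows as active usually points at a port or bind-address problem.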
Part 5: Setting Up Slack Integration for Alerts
Create a Slack App:
Go to Slack API Apps
Click "Create New App"
Choose "From scratch"
Enter an App Name (e.g. ReconXi Monitoring System)
Select your Slack Workspace (e.g. HNG12)
Click "Create App"
Enable Incoming Webhooks:
In the newly created app, go to "Incoming Webhooks"
Toggle "Activate Incoming Webhooks" to ON
Click "Add New Webhook to Workspace"
Select the Slack Channel for alerts (e.g. #devops-alerts)
Click "Allow" to grant permission
Copy the generated Webhook URL
Update AlertManager Configuration:
Edit /etc/alertmanager/alertmanager.yml
Replace 'YOUR_SLACK_WEBHOOK_URL' with the copied URL
Restart AlertManager:
sudo systemctl restart alertmanager
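Before waiting for a real alert, the webhook itself can be exercised with a plain HTTP POST. Below is a minimal sketch, assuming curl is installed. The function names are hypothetical, and the JSON escaping covers only backslashes and double quotes.

```shell
#!/bin/sh
# slack_payload MESSAGE
# Build the JSON body for an incoming-webhook request, escaping backslashes
# and double quotes (newlines and other control characters are not handled).
slack_payload() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}\n' "$escaped"
}

# send_test_message WEBHOOK_URL MESSAGE
# Post a test message; Slack replies with "ok" when the webhook is valid.
send_test_message() {
  curl -fsS -X POST -H 'Content-type: application/json' \
    --data "$(slack_payload "$2")" "$1"
}
```

Calling send_test_message with the copied webhook URL and a short message should make the text appear in the chosen channel. Treat the URL like a password: anyone who has it can post to that channel.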
Part 6: Setting Up Grafana Dashboards
Access your Grafana instance at http://<your-server-ip>:3000
Log in with the default credentials (admin/admin) and set a new password
Add Prometheus as a data source:
Go to Configuration > Data Sources
Add a new data source
Select Prometheus
Set the URL to http://localhost:9090
Click "Save & Test"
Import pre-built dashboards:
Click the "+" icon in the sidebar
Select "Import"
Enter dashboard ID:
Node Exporter: 1860
Blackbox Exporter: 7587
Select Prometheus as the data source
Click "Import"
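The data source can also be declared in a file instead of through the UI, which keeps it under version control. This sketch writes a Grafana provisioning file; the data source name "Prometheus" and the function name are arbitrary choices, and the target directory mentioned afterwards assumes the default layout of the apt package.

```shell
#!/bin/sh
# write_prometheus_datasource FILE
# Write a Grafana data source provisioning file that points at the local
# Prometheus. The data source name "Prometheus" is an arbitrary choice.
write_prometheus_datasource() {
  cat > "$1" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}
```

Copied into /etc/grafana/provisioning/datasources/ and followed by a restart of grafana-server, the file should make the data source appear without any clicks.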
Dora Metrics Implementation can be accessed here
Why This Monitoring Stack?
You might wonder why we chose this particular combination of tools. Here's our reasoning:
Open Source Power: Prometheus and Grafana are industry-standard open-source tools with massive community support and extensive documentation.
Scalability: This stack scales from monitoring a single server to hundreds of microservices without significant architectural changes.
Flexibility: The configuration-as-code approach allows us to version control our monitoring setup and deploy it consistently across environments.
Rich Ecosystem: The wide variety of exporters lets us monitor virtually any system or application without custom development.
Proactive Alerts: Instead of reacting to outages, we can address issues before they impact users by setting appropriate alert thresholds.
Common Challenges and Solutions
Alert Fatigue: Careful tuning of thresholds is essential. Start conservative and adjust based on real-world patterns.
Resource Consumption: Prometheus itself needs resources. For larger deployments, consider federation or remote storage.
High Cardinality: Be cautious with labels that have many possible values as they can cause performance issues.
Dashboard Organization: Implement folder structures and naming conventions from the start for better maintainability.
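For the high-cardinality point in particular, Prometheus can report which metrics carry the most series. The following is a sketch that runs such a query through the HTTP API, assuming curl is installed and Prometheus listens on localhost:9090. The variable and function names are hypothetical.

```shell
#!/bin/sh
# PromQL: the ten metric names with the most active series.
CARDINALITY_QUERY='topk(10, count by (__name__) ({__name__=~".+"}))'

# top_series [PROMETHEUS_URL]
# Run the query through the Prometheus HTTP API and print the raw JSON.
top_series() {
  curl -fsS --get "${1:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=${CARDINALITY_QUERY}"
}
```

This query touches every series, so on a large instance it is better run occasionally than put on a dashboard that refreshes every few seconds.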
Below are screenshots of our Grafana Dashboards, Alerting Setup, and Slack Notifications





Conclusion
Setting up Prometheus and Grafana has transformed how we understand our systems. We've moved from reactive troubleshooting to proactive monitoring with real-time visibility into every aspect of our infrastructure.
The combination of detailed metrics, beautiful visualizations, and actionable alerts gives us confidence that we'll know about issues before our users do - and that's the true power of modern monitoring.
Just as monitoring has evolved from simple checks to comprehensive observability, our approach to infrastructure management continues to evolve, enabling us to build more reliable, performant systems that delight our users.



