Unlocking Observability: A Comprehensive Guide to Setting Up Prometheus & Grafana

Improve Monitoring by Installing Prometheus and Grafana

When you hear the word "monitoring," what comes to mind? Before I dove deep into it, I thought it was just about checking whether servers were up or down, a simple yes-or-no status. That basic idea isn't wrong, but there's much more beneath the surface. Monitoring is really about building a web of metrics and alerts that shows not just what's running, but how the different parts of a system perform together.

Pre-requisites

To follow along with this article, you will need:

  • A Linux-based cloud server (Ubuntu was used in this implementation, but any Linux distribution should work)

  • Root or sudo access to the server

  • Basic knowledge of the Linux command line

  • Basic knowledge of YAML for configuration files

  • GitHub Actions configured for your repositories

  • Firewall configured to allow necessary ports:

    • SSH (port 22)

    • Grafana (port 3000)

    • Prometheus (port 9090)

    • Node Exporter (port 9100)

    • Blackbox Exporter (port 9115)

    • AlertManager (port 9093)

  • Sufficient server resources:

    • Minimum 2GB RAM

    • At least 10GB of available disk space

    • 2+ CPU cores recommended for better performance

  • Network connectivity to monitor endpoints (for Blackbox Exporter)

  • A Slack workspace and permissions to create webhooks (for AlertManager integration)

  • Basic understanding of metrics and monitoring concepts

  • DNS records configured if you'll be accessing dashboards through domain names
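The ports listed above could be opened with ufw, the firewall front end that ships with Ubuntu. This is a minimal sketch, assuming ufw is the firewall in use; `MONITORING_PORTS` and `ufw_rules` are illustrative names, not part of the original setup. It only prints the commands so they can be reviewed before anything changes.

```shell
#!/usr/bin/env bash
# Ports from the prerequisites list: SSH, Grafana, Prometheus,
# Node Exporter, Blackbox Exporter, AlertManager.
MONITORING_PORTS="22 3000 9090 9100 9115 9093"

# Print one "ufw allow" command per port instead of running it.
ufw_rules() {
  for port in $MONITORING_PORTS; do
    echo "ufw allow ${port}/tcp"
  done
}

ufw_rules
```

Once the list looks right, the printed commands can be applied with `ufw_rules | sudo sh`.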

Part 1: Setting Up the Foundation

Let's start by creating the directories our monitoring stack needs. Only directories are created here; the systemd unit files and YAML configuration files are written as regular files in Parts 2 and 3, so they must not be created with mkdir:

sudo mkdir -p /opt/reconcile/prometheus
sudo mkdir -p /opt/reconcile/node_exporter
sudo mkdir -p /etc/apt/keyrings/
sudo mkdir -p /opt/reconcile/blackbox_exporter
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo mkdir -p /opt/dora-exporter /home/reconxi/pushgateway-1.11.0.linux-amd64

Installing Core Components

Prometheus: The Heart of Our Metrics Collection

wget https://github.com/prometheus/prometheus/releases/download/v3.2.1/prometheus-3.2.1.linux-amd64.tar.gz
tar xvfz prometheus-3.2.1.linux-amd64.tar.gz
sudo mv prometheus-3.2.1.linux-amd64 /opt/reconcile/prometheus/
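Prometheus releases publish a sha256sums.txt file next to the tarballs, so the download can be checked before it is unpacked. The following is a sketch of that check, assuming `sha256sum` from coreutils is available; `verify_sha256` is an illustrative helper, not part of the original setup.

```shell
#!/usr/bin/env bash
# verify_sha256 FILE EXPECTED_HASH
# Succeeds only if FILE's SHA-256 digest equals EXPECTED_HASH.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "$file" | awk '{print $1}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}

# Usage sketch: take the expected hash from the release's sha256sums.txt.
# expected="$(grep 'prometheus-3.2.1.linux-amd64.tar.gz' sha256sums.txt | awk '{print $1}')"
# verify_sha256 prometheus-3.2.1.linux-amd64.tar.gz "$expected"
```

The same helper works for the AlertManager and exporter tarballs, which come from the same release process.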

AlertManager: Our Notification Hub

# Set the AlertManager release to install (example value; check the releases page for the current one)
VERSION="0.27.0"

# Download AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v${VERSION}/alertmanager-${VERSION}.linux-amd64.tar.gz

# Extract the archive
tar xvf alertmanager-${VERSION}.linux-amd64.tar.gz

# Move binaries to appropriate location
sudo mv alertmanager-${VERSION}.linux-amd64/alertmanager /usr/local/bin/
sudo mv alertmanager-${VERSION}.linux-amd64/amtool /usr/local/bin/

Node Exporter: System-Level Insights

wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xvfz node_exporter-1.8.2.linux-amd64.tar.gz
sudo mkdir -p /opt/reconcile/node_exporter
sudo mv node_exporter-1.8.2.linux-amd64/* /opt/reconcile/node_exporter/

Blackbox Exporter: External Health Checks

wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
tar xvfz blackbox_exporter-0.24.0.linux-amd64.tar.gz
sudo mkdir -p /opt/reconcile/blackbox_exporter
sudo mv blackbox_exporter-0.24.0.linux-amd64/* /opt/reconcile/blackbox_exporter/

Grafana: Visualizing Our Metrics

sudo apt-get install -y apt-transport-https software-properties-common wget

# Import the GPG key
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

# Add the repository for stable releases
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com/ stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

# Optional: also add the repository for beta releases
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com/ beta main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

# Update the package index and install Grafana
sudo apt-get update
sudo apt-get install -y grafana

Part 2: Configuring Services

Creating Systemd Service Files

For Prometheus, create /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/reconcile/prometheus/prometheus-3.2.1.linux-amd64
ExecStart=/opt/reconcile/prometheus/prometheus-3.2.1.linux-amd64/prometheus --config.file=prometheus.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

For AlertManager, create /etc/systemd/system/alertmanager.service:

[Unit]
Description=Prometheus AlertManager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager/

[Install]
WantedBy=multi-user.target

For Node Exporter, create /etc/systemd/system/node_exporter.service:

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/opt/reconcile/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target

For Blackbox Exporter, create /etc/systemd/system/blackbox_exporter.service:

[Unit]
Description=Blackbox Exporter
After=network.target

[Service]
User=blackbox_exporter
Group=blackbox_exporter
Type=simple
ExecStart=/opt/reconcile/blackbox_exporter/blackbox_exporter --config.file=/opt/reconcile/blackbox_exporter/blackbox.yml

[Install]
WantedBy=multi-user.target
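The AlertManager, Node Exporter, and Blackbox Exporter units above run as dedicated users, and those accounts have to exist before the services can start. The sketch below shows one way to create them, assuming the standard `useradd` on Ubuntu and that it creates a matching group for each user. `SERVICE_USERS`, `DRY_RUN`, `run`, and `create_service_users` are illustrative names; with the default `DRY_RUN=1` the script only prints what it would do.

```shell
#!/usr/bin/env bash
# Service accounts named in the User=/Group= lines of the unit files.
SERVICE_USERS="alertmanager node_exporter blackbox_exporter"

# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 runs them with sudo.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    sudo "$@"
  fi
}

create_service_users() {
  for user in $SERVICE_USERS; do
    # System account with no home directory and no login shell.
    run useradd --system --no-create-home --shell /usr/sbin/nologin "$user"
  done
  # AlertManager needs to read its config and write to its storage path.
  run chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
}

create_service_users
```

Running it again with `DRY_RUN=0` applies the changes.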

Part 3: Configuration Files

Prometheus Configuration

Create /opt/reconcile/prometheus/prometheus-3.2.1.linux-amd64/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
      - targets: ['localhost:9093']
# Load rules once and periodically evaluate them
rule_files:
  - "rules/node_exporter_alerts.yml"
  - "rules/blackbox_alerts.yml"
  - "rules/dora_alerts.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9100']
  # PM2 metrics from host (for NextJS apps)
  # - job_name: 'pm2'
  #   scrape_interval: 10s
  #   static_configs:
  #     - targets: ['localhost:9209']
  # Blackbox exporter for HTTP/HTTPS uptime and SSL monitoring
  - job_name: 'blackbox_http'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response
    static_configs:
      - targets:
          - https://reconxi.com
          - https://dev.reconxi.com
        # Add more URLs as needed
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # Blackbox exporter's address
  # Blackbox exporter for SSL monitoring
  - job_name: 'blackbox_ssl'
    metrics_path: /probe
    params:
      module: [tls]  # Use the TLS probe
    static_configs:
      - targets:
          - reconxi.com:443
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
  - job_name: 'dora-metrics'
    static_configs:
      - targets: ['localhost:8888']
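The rule_files entries above point at files this article does not show. As a sketch of what rules/node_exporter_alerts.yml might contain, the script below writes a single rule using the HighCPULoad alert name and the summary, description, resolved_summary, and resolved_description annotations that the AlertManager templates later in this article read. The 80% threshold, the 5-minute window, and the annotation wording are assumptions, and `RULES_DIR` is an illustrative variable that defaults to a local directory.

```shell
#!/usr/bin/env bash
RULES_DIR="${RULES_DIR:-./rules}"
mkdir -p "$RULES_DIR"

# The quoted 'EOF' keeps the shell from expanding $labels and $value.
cat > "$RULES_DIR/node_exporter_alerts.yml" <<'EOF'
groups:
  - name: node_exporter
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for 5 minutes (current: {{ $value }})."
          resolved_summary: "CPU load back to normal on {{ $labels.instance }}"
          resolved_description: "CPU usage has dropped below 80%."
EOF

# promtool ships in the Prometheus tarball; validate the file if it is on PATH.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_DIR/node_exporter_alerts.yml"
else
  echo "promtool not found; skipping validation"
fi
```

Because prometheus.yml uses relative rule paths, the rules directory would sit next to prometheus.yml in the Prometheus working directory.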

Blackbox Exporter Configuration

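The Blackbox Exporter service above loads its configuration from /opt/reconcile/blackbox_exporter/blackbox.yml, so create the file at that path. That file defines the probe modules (such as http_2xx) under a top-level modules: key. The jobs below are the matching Prometheus scrape jobs, which belong under scrape_configs in prometheus.yml rather than in blackbox.yml. Each module name referenced in params (http_2xx, http_ssl) must match a module defined in blackbox.yml: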

  - job_name: "blackbox_http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://dev.reconxi.com
          - https://reconxi.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
  - job_name: "blackbox_ssl"
    metrics_path: /probe
    params:
      module: [http_ssl]
    static_configs:
      - targets:
          - https://dev.reconxi.com
          - https://reconxi.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
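The scrape jobs in this article call Blackbox Exporter modules by name, and the exporter expects those modules to be defined under a top-level modules: key in its own config file. Below is a minimal sketch of such a file, assuming the stock http and tcp probers and covering the http_2xx and tls modules named in prometheus.yml. The 5s timeouts are assumptions, and `BLACKBOX_DIR` and `probe_url` are illustrative names.

```shell
#!/usr/bin/env bash
BLACKBOX_DIR="${BLACKBOX_DIR:-./blackbox_exporter}"
mkdir -p "$BLACKBOX_DIR"

cat > "$BLACKBOX_DIR/blackbox.yml" <<'EOF'
modules:
  # HTTP probe; with no valid_status_codes set, any 2xx response passes.
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
  # TCP probe that completes a TLS handshake, which exposes certificate metrics.
  tls:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
EOF

# Build the probe URL that Prometheus ends up requesting after relabeling:
# the original target becomes ?target=..., and the address becomes the exporter.
probe_url() {
  local target="$1" module="$2"
  echo "http://localhost:9115/probe?target=${target}&module=${module}"
}

probe_url "https://reconxi.com" "http_2xx"
```

With the exporter running, `curl -s "$(probe_url https://reconxi.com http_2xx)" | grep probe_success` is a quick way to see the same result Prometheus scrapes.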

AlertManager Configuration

Create /etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 1m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'
route:
  group_by: ['alertname', 'instance', 'job']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warning'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - send_resolved: true
        channel: '#devops-alerts'
        title: '{{ if eq .Status "firing" }}:red_circle: ALERT{{ else }}:large_green_circle: RESOLVED{{ end }}: {{ .CommonLabels.alertname }}'
        text: |
          {{ if eq .Status "firing" }}*SYSTEM ALERT*{{ else }}*SYSTEM RECOVERED*{{ end }}
          {{ range .Alerts }}
          *{{ if eq .Status "firing" }}{{ .Annotations.summary }}{{ else }}{{ .Annotations.resolved_summary }}{{ end }}*
          {{ if eq .Status "firing" }}{{ .Annotations.description }}{{ else }}{{ .Annotations.resolved_description }}{{ end }}
          *:alarm_clock: Incident Details:*
          • Started: {{ .StartsAt }}
          • Status: {{ .Status | toUpper }}
          *:mag: Technical Information:*
          • System: {{ .Labels.instance }}
          • Job: {{ .Labels.job }}
          {{ if eq .Status "firing" }}
          • Severity: {{ .Labels.severity }}
          *:busts_in_silhouette: Impact Assessment:*
          • Users affected: {{ if eq .Labels.job "blackbox_http" }}Website visitors{{ else }}Service users{{ end }}
          *:link: Diagnostic Links:*
          • <https://grafana.website.com|View Dashboard>
          • <https://logs.website.com|View Logs>
          *:busts_in_silhouette: Team to Notify:* @team-reconxi-devops
          {{ end }}
          {{ end }}
        icon_emoji: '{{ if eq .Status "firing" }}:red_circle:{{ else }}:green_circle:{{ end }}'
  - name: 'slack-critical'
    slack_configs:
      - send_resolved: true
        channel: '#devops-alerts'
        title: '{{ if eq .Status "firing" }}:red_circle: CRITICAL{{ else }}:large_green_circle: RESOLVED{{ end }}: {{ .CommonLabels.alertname }}'
        text: |
          {{ if eq .Status "firing" }}*CRITICAL SYSTEM ALERT*{{ else }}*SYSTEM RECOVERED*{{ end }}
          {{ range .Alerts }}
          *{{ if eq .Status "firing" }}{{ .Annotations.summary }}{{ else }}{{ .Annotations.resolved_summary }}{{ end }}*
          {{ if eq .Status "firing" }}{{ .Annotations.description }}{{ else }}{{ .Annotations.resolved_description }}{{ end }}
          *:alarm_clock: Incident Details:*
          • Started: {{ .StartsAt }}
          • Status: {{ .Status | toUpper }}
          *:mag: Technical Information:*
          • System: {{ .Labels.instance }}
          • Job: {{ .Labels.job }}
          {{ if eq .Status "firing" }}
          {{ if eq .Labels.job "blackbox_http" }}
          • Error: Connection failed
          • HTTP Status: No response
          {{ end }}
          *:busts_in_silhouette: Impact Assessment:*
          • Severity: Critical
          • User Impact: {{ if eq .Labels.job "blackbox_http" }}All website users affected{{ else }}Service degradation{{ end }}
          *Related Systems:*
          • API Gateway: {{ if eq .Labels.job "blackbox_http" }}Operational{{ end }}
          • Database: Operational
          *:link: Diagnostic Links:*
          • <https://grafana.reconxi.com|View Dashboard>
          *:memo: Actions:*
          {{ if eq .Labels.job "blackbox_http" }}Check load balancer, verify instances, review logs before restarting services.{{ else }}Check for runaway processes, recent deployments, or traffic spikes.{{ end }}
          *:rotating_light: Attention:* <@U08ASEFTPFW> <@U08AQBRT1EG> <@U08BD1J3C5N> <@U08BFNMP18U> <@U08AX7P0G9K> <@U08B1TXE9LZ> <@U08BDQA218Q> <@U08AR58UXMJ> <@U08B5JG3GP3> <@U08A7RXNHFZ> <@U08AM349U2Z>
          {{ end }}
          {{ end }}
        icon_emoji: '{{ if eq .Status "firing" }}:fire:{{ else }}:white_check_mark:{{ end }}'
        link_names: true
  - name: 'slack-warning'
    slack_configs:
      - send_resolved: true
        channel: '#devops-alerts'
        title: '{{ if eq .Status "firing" }}:warning: WARNING{{ else }}:large_green_circle: RESOLVED{{ end }}: {{ .CommonLabels.alertname }}'
        text: |
          {{ if eq .Status "firing" }}*WARNING ALERT*{{ else }}*WARNING RESOLVED*{{ end }}
          {{ range .Alerts }}
          *{{ if eq .Status "firing" }}{{ .Annotations.summary }}{{ else }}{{ .Annotations.resolved_summary }}{{ end }}*
          {{ if eq .Status "firing" }}{{ .Annotations.description }}{{ else }}{{ .Annotations.resolved_description }}{{ end }}
          *:alarm_clock: Incident Details:*
          • Started: {{ .StartsAt }}
          • Status: {{ .Status | toUpper }}
          *:mag: Technical Information:*
          • System: {{ .Labels.instance }}
          • Job: {{ .Labels.job }}
          {{ if eq .Status "firing" }}
          {{ if eq .Labels.alertname "SlowResponseTime" }}
          • Response Time: {{ if eq .Labels.job "blackbox_http" }}Slow{{ end }}
          {{ end }}
          {{ if eq .Labels.alertname "SSLCertExpiringSoon" }}
          • Certificate Expires: Soon
          {{ end }}
          {{ if eq .Labels.alertname "HighCPULoad" }}
          • CPU Load: High
          {{ end }}
          {{ if eq .Labels.alertname "HighMemoryLoad" }}
          • Memory Use: High
          {{ end }}
          {{ if eq .Labels.alertname "HighDiskUsage" }}
          • Disk Usage: High
          {{ end }}
          *:busts_in_silhouette: Impact Assessment:*
          • Severity: Warning
          • User Impact: Potential performance degradation
          *:link: Diagnostic Links:*
          • <https://grafana.reconxi.com|View Dashboard>
          *:bulb: Recommended Actions:*
          {{ if eq .Labels.alertname "SlowResponseTime" }}Check database queries or high backend resource usage.{{ else if eq .Labels.alertname "SSLCertExpiringSoon" }}Renew SSL certificate before expiration.{{ else if eq .Labels.alertname "HighCPULoad" }}Identify CPU-intensive processes and optimize.{{ else if eq .Labels.alertname "HighMemoryLoad" }}Check for memory leaks or increase available memory.{{ else if eq .Labels.alertname "HighDiskUsage" }}Clean up disk space or expand storage.{{ end }}
          *:rotating_light: Attention:* <@U08ASEFTPFW> <@U08AQBRT1EG> <@U08BD1J3C5N> <@U08BFNMP18U> <@U08AX7P0G9K> <@U08B1TXE9LZ> <@U08BDQA218Q> <@U08AR58UXMJ> <@U08B5JG3GP3> <@U08A7RXNHFZ> <@U08AM349U2Z>
          {{ end }}
          {{ end }}
        icon_emoji: '{{ if eq .Status "firing" }}:warning:{{ else }}:white_check_mark:{{ end }}'
        link_names: true
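Configuration copied from a web page often picks up typographic (curly) quotes, which neither YAML nor the Go templates in the file accept as string delimiters. The sketch below checks for them and then runs `amtool check-config`, which is installed alongside the alertmanager binary. `has_curly_quotes` and `check_alertmanager_config` are illustrative helper names, not part of the original setup.

```shell
#!/usr/bin/env bash
# has_curly_quotes FILE
# Returns 0 (and prints the offending lines) if FILE contains typographic quotes.
has_curly_quotes() {
  grep -nE '‘|’|“|”' "$1"
}

check_alertmanager_config() {
  local file="$1"
  if has_curly_quotes "$file"; then
    echo "Replace the curly quotes listed above with plain quotes in $file" >&2
    return 1
  fi
  # Validate syntax and templates if amtool is on PATH.
  if command -v amtool >/dev/null 2>&1; then
    amtool check-config "$file"
  else
    echo "amtool not found; skipping check-config"
  fi
}

# Example (path from this article):
# check_alertmanager_config /etc/alertmanager/alertmanager.yml
```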

Part 4: Starting Services and Verifying Setup

Reload systemd and start all services:

sudo systemctl daemon-reload
sudo systemctl enable prometheus alertmanager node_exporter blackbox_exporter grafana-server
sudo systemctl start prometheus alertmanager node_exporter blackbox_exporter grafana-server

Verify that all services are running properly:

sudo systemctl status prometheus alertmanager node_exporter blackbox_exporter grafana-server
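Beyond systemctl, each component can be probed over HTTP. The sketch below assumes everything runs on localhost with the ports from the prerequisites list; it uses the /-/healthy endpoints of Prometheus and AlertManager, Grafana's /api/health, and the /metrics pages of the two exporters. `ENDPOINTS` and `check_endpoints` are illustrative names.

```shell
#!/usr/bin/env bash
# name|url pairs, one per line.
ENDPOINTS="
prometheus|http://localhost:9090/-/healthy
alertmanager|http://localhost:9093/-/healthy
node_exporter|http://localhost:9100/metrics
blackbox_exporter|http://localhost:9115/metrics
grafana|http://localhost:3000/api/health
"

# Print UP or DOWN for each endpoint; a failed probe never aborts the loop.
check_endpoints() {
  echo "$ENDPOINTS" | while IFS='|' read -r name url; do
    [ -z "$name" ] && continue
    if curl -fsS --max-time 3 -o /dev/null "$url" </dev/null 2>/dev/null; then
      echo "UP   $name"
    else
      echo "DOWN $name"
    fi
  done
}

check_endpoints
```

Any line reporting DOWN is a cue to look at `journalctl -u <service>` for that unit.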

Part 5: Setting Up Slack Integration for Alerts

  1. Create a Slack App:

    • Go to Slack API Apps

    • Click "Create New App"

    • Choose "From scratch"

    • Enter an App Name (e.g. ReconXi Monitoring System)

    • Select your Slack Workspace (e.g. HNG12)

    • Click "Create App"

  2. Enable Incoming Webhooks:

    • In the newly created app, go to "Incoming Webhooks"

    • Toggle "Activate Incoming Webhooks" to ON

    • Click "Add New Webhook to Workspace"

    • Select the Slack Channel for alerts (e.g. #devops-alerts)

    • Click "Allow" to grant permission

    • Copy the generated Webhook URL

  3. Update AlertManager Configuration:

    • Edit /etc/alertmanager/alertmanager.yml

    • Set slack_api_url under global to the copied URL

    • Restart AlertManager: sudo systemctl restart alertmanager
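Before relying on AlertManager to deliver alerts, the webhook itself can be tested with a plain HTTP POST. This is a sketch, assuming the copied URL is exported as `SLACK_WEBHOOK_URL`; that variable and the `slack_payload` and `send_test_message` helpers are illustrative names. Nothing is sent unless the variable is set.

```shell
#!/usr/bin/env bash
# Build the JSON body for a Slack incoming-webhook message.
# Note: this simple helper does not escape quotes inside the message.
slack_payload() {
  printf '{"text":"%s"}' "$1"
}

# Send only when SLACK_WEBHOOK_URL is set, so the script is safe to run as-is.
send_test_message() {
  if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
    echo "SLACK_WEBHOOK_URL is not set; nothing sent"
    return 0
  fi
  curl -fsS -X POST -H 'Content-type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL"
}

send_test_message "Test message from the monitoring server"
```

If the message shows up in the channel, any later delivery problem lies in the AlertManager configuration rather than in the Slack app.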

Part 6: Setting Up Grafana Dashboards

  1. Access your Grafana instance at http://<your-server-ip>:3000

  2. Log in with the default credentials (admin/admin) and set a new password

  3. Add Prometheus as a data source:

    • Go to Connections > Data sources (Configuration > Data Sources in older Grafana releases)

    • Add a new data source

    • Select Prometheus

    • Set the URL to http://localhost:9090

    • Click "Save & Test"

  4. Import pre-built dashboards:

    • Click the "+" icon in the sidebar

    • Select "Import"

    • Enter dashboard ID:

      • Node Exporter: 1860

      • Blackbox Exporter: 7587

    • Select Prometheus as the data source

    • Click "Import"
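The data source step can also be done as a file instead of through the UI, using Grafana's provisioning mechanism. The sketch below writes a data source definition equivalent to the manual steps above. On a packaged install the file would go in /etc/grafana/provisioning/datasources, which needs sudo; `PROVISIONING_DIR` is an illustrative variable that defaults to a local directory so the output can be inspected first.

```shell
#!/usr/bin/env bash
PROVISIONING_DIR="${PROVISIONING_DIR:-./provisioning/datasources}"
mkdir -p "$PROVISIONING_DIR"

# Same settings as the UI steps: a Prometheus source at localhost:9090.
cat > "$PROVISIONING_DIR/prometheus.yaml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF

echo "Wrote $PROVISIONING_DIR/prometheus.yaml"
```

After copying the file into place, `sudo systemctl restart grafana-server` makes Grafana pick it up.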

Dora Metrics Implementation can be accessed here

Why This Monitoring Stack?

You might wonder why we chose this particular combination of tools. Here's our reasoning:

  • Open Source Power: Prometheus and Grafana are industry-standard open-source tools with massive community support and extensive documentation.

  • Scalability: This stack scales from monitoring a single server to hundreds of microservices without significant architectural changes.

  • Flexibility: The configuration-as-code approach allows us to version control our monitoring setup and deploy it consistently across environments.

  • Rich Ecosystem: The wide variety of exporters lets us monitor virtually any system or application without custom development.

  • Proactive Alerts: Instead of reacting to outages, we can address issues before they impact users by setting appropriate alert thresholds.

Common Challenges and Solutions

  • Alert Fatigue: Careful tuning of thresholds is essential. Start conservative and adjust based on real-world patterns.

  • Resource Consumption: Prometheus itself needs resources. For larger deployments, consider federation or remote storage.

  • High Cardinality: Be cautious with labels that have many possible values as they can cause performance issues.

  • Dashboard Organization: Implement folder structures and naming conventions from the start for better maintainability.
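For the high-cardinality point above, one way to see where the series are coming from is to ask Prometheus which metric names carry the most time series. This sketch uses the standard HTTP query API and assumes Prometheus is reachable at localhost:9090; `PROMETHEUS_URL`, `TOP_SERIES_QUERY`, and `top_series` are illustrative names.

```shell
#!/usr/bin/env bash
PROMETHEUS_URL="${PROMETHEUS_URL:-http://localhost:9090}"

# PromQL: the ten metric names with the most time series.
TOP_SERIES_QUERY='topk(10, count by (__name__)({__name__=~".+"}))'

top_series() {
  curl -fsS --max-time 10 -G "$PROMETHEUS_URL/api/v1/query" \
    --data-urlencode "query=$TOP_SERIES_QUERY"
}

# Prints the JSON response, or a note if Prometheus is unreachable.
top_series 2>/dev/null || echo "Could not reach Prometheus at $PROMETHEUS_URL"
```

The same query can be pasted into the Prometheus expression browser. On a large instance it is itself expensive, so it is better run occasionally than put on a dashboard.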

Below are screenshots of our Grafana Dashboards, Alerting Setup, and Slack Notifications

Conclusion

Setting up Prometheus and Grafana has transformed how we understand our systems. We've moved from reactive troubleshooting to proactive monitoring with real-time visibility into every aspect of our infrastructure.

The combination of detailed metrics, beautiful visualizations, and actionable alerts gives us confidence that we'll know about issues before our users do - and that's the true power of modern monitoring.

Just as monitoring has evolved from simple checks to comprehensive observability, our approach to infrastructure management continues to evolve, enabling us to build more reliable, performant systems that delight our users.