📊 Prometheus

Open-source systems monitoring and alerting toolkit

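Prometheus collects metrics with a pull model: it scrapes each service's HTTP metrics endpoint on a fixed interval. It ships with the PromQL query language and pairs with Alertmanager to route alerts to multiple channels, and it is the de facto monitoring standard in the Kubernetes ecosystem.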

Installation

$ # Download a release (check GitHub Releases for the current version number)
PROM_VER="2.52.0"
wget https://github.com/prometheus/prometheus/releases/download/v${PROM_VER}/prometheus-${PROM_VER}.linux-amd64.tar.gz
tar xvf prometheus-${PROM_VER}.linux-amd64.tar.gz
sudo mv prometheus-${PROM_VER}.linux-amd64 /opt/prometheus

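# Create a system user and the config, rules, and data directories
sudo useradd -r -s /usr/sbin/nologin prometheus
sudo mkdir -p /etc/prometheus/rules /var/lib/prometheus
# Start from the sample config shipped in the tarball so the service can start
sudo cp /opt/prometheus/prometheus.yml /etc/prometheus/prometheus.yml
sudo chown prometheus:prometheus /var/lib/prometheus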
$ sudo vim /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --storage.tsdb.retention.time=30d \
    --web.listen-address=0.0.0.0:9090
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
$ sudo systemctl enable --now prometheus
# Then open http://<server-IP>:9090 in a browser
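
The same check can be scripted before opening the browser. This is a minimal sketch, assuming the listen address from the unit file above and Prometheus's management endpoints `/-/healthy` and `/-/ready`; the function name `prometheus_ready` is hypothetical.

```shell
# Sketch: succeed only when Prometheus reports both healthy and ready.
# prometheus_ready is a hypothetical helper name; the base URL defaults to
# the listen address used in the unit file above.
prometheus_ready() {
    base="${1:-http://localhost:9090}"
    # -f makes curl exit non-zero on HTTP errors; -sS hides the progress
    # meter but still prints error messages.
    curl -fsS "$base/-/healthy" >/dev/null &&
        curl -fsS "$base/-/ready" >/dev/null
}

# Usage (commented out so sourcing this snippet has no side effects):
# prometheus_ready && echo "Prometheus is up"
```

If the function fails right after `systemctl enable --now`, `journalctl -u prometheus` is the usual place to look for the startup error.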

Configuration (/etc/prometheus/prometheus.yml)

global:
  scrape_interval: 15s      # scrape metrics every 15 seconds
  evaluation_interval: 15s  # evaluate alerting rules every 15 seconds

# Alerting rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager instances that receive alerts
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Scrape targets
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'localhost:9100'       # this host
          - '192.168.1.11:9100'   # another server
        labels:
          env: production
$ sudo systemctl reload prometheus
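
A broken configuration file makes the reload fail or be rejected, so it can help to validate it first. The sketch below assumes `promtool` sits next to the `prometheus` binary in /opt/prometheus (it ships in the release tarball). The variable names `PROMTOOL`, `CONFIG`, and `RELOAD` and the function `check_and_reload` are hypothetical.

```shell
# Sketch: validate the configuration with promtool and reload only on success.
# PROMTOOL, CONFIG and RELOAD are overridable assumptions; RELOAD defaults to
# the reload command shown above.
PROMTOOL="${PROMTOOL:-/opt/prometheus/promtool}"
CONFIG="${CONFIG:-/etc/prometheus/prometheus.yml}"
RELOAD="${RELOAD:-sudo systemctl reload prometheus}"

check_and_reload() {
    if ! "$PROMTOOL" check config "$CONFIG"; then
        echo "configuration invalid; not reloading" >&2
        return 1
    fi
    # Intentionally unquoted so the command string splits into words.
    $RELOAD
}

# Usage:
# check_and_reload
```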

Node Exporter (host metrics collection)

$ NODE_VER="1.8.1"
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_VER}/node_exporter-${NODE_VER}.linux-amd64.tar.gz
tar xvf node_exporter-${NODE_VER}.linux-amd64.tar.gz
sudo mv node_exporter-${NODE_VER}.linux-amd64/node_exporter /usr/local/bin/

sudo useradd -r -s /usr/sbin/nologin node_exporter
$ sudo vim /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
$ sudo systemctl enable --now node_exporter
# Verify the metrics endpoint: curl http://localhost:9100/metrics
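
The endpoint returns plain text with one sample per line, so a single value can be pulled out with standard tools. This is a rough sketch that handles only metrics without labels; `metric_value` is a hypothetical helper name, and `node_load1` is used as an example metric.

```shell
# Sketch: print the value of an unlabelled metric from Prometheus text
# exposition format read on stdin. metric_value is a hypothetical helper name.
metric_value() {
    # Sample lines look like "name value". HELP/TYPE lines start with "#",
    # so their first field never equals a metric name.
    awk -v name="$1" '$1 == name { print $2 }'
}

# Usage:
# curl -s http://localhost:9100/metrics | metric_value node_load1
```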

PromQL query examples

Query | Description
up | Whether each target is up (1 = up, 0 = down)
100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) | CPU usage (%)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 | Available memory (%)
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) | Disk usage of the root filesystem (%)
irate(node_network_receive_bytes_total[5m]) | Network receive rate (bytes/s)
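
These expressions can be typed into the web UI, or sent to the HTTP API from a shell. The following is a minimal sketch using the instant-query endpoint `/api/v1/query`; the helper `promql_query` and the `PROM_URL` variable are hypothetical names, and the URL assumes the listen address configured earlier.

```shell
# Sketch: run an instant PromQL query through the HTTP API.
# promql_query is a hypothetical helper; PROM_URL is an assumed variable that
# defaults to the listen address configured earlier.
PROM_URL="${PROM_URL:-http://localhost:9090}"

promql_query() {
    # -G sends the data as URL query parameters; --data-urlencode escapes
    # braces, quotes and spaces in the expression.
    curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
# promql_query 'up'
# promql_query 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100'
```

The response is JSON; the matching series and their current values are under `data.result`.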

Alerting rule examples (/etc/prometheus/rules/alerts.yml)

groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage ({{ $labels.instance }})"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: 100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space ({{ $labels.instance }})"
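
The `for:` field is what keeps a short spike from paging anyone: the expression has to stay true for the whole duration before the alert moves from pending to firing. The sketch below is a simplified model of that behaviour, not Prometheus code. The helper `alert_states` is hypothetical; it takes a threshold, a hold time, and an evaluation interval (both in seconds), and reads one sample per evaluation on stdin.

```shell
# Sketch: a simplified model of how `for:` delays an alert.
# Reads one sample value per evaluation from stdin and prints the alert state
# after each evaluation. alert_states is a hypothetical helper.
alert_states() {
    # $1 = threshold, $2 = `for` duration (s), $3 = evaluation interval (s)
    awk -v threshold="$1" -v hold="$2" -v step="$3" '
        BEGIN { active_since = -1 }
        {
            now = (NR - 1) * step
            if ($1 + 0 > threshold) {
                # Remember when the condition first became true.
                if (active_since < 0) active_since = now
                state = (now - active_since >= hold) ? "firing" : "pending"
            } else {
                # Condition cleared: the hold timer starts over.
                active_since = -1
                state = "inactive"
            }
            print state
        }'
}

# Usage: threshold 80, for 30 s, evaluated every 15 s
# printf '90\n90\n90\n70\n90\n' | alert_states 80 30 15
```

With the settings in this document (`evaluation_interval: 15s`, `for: 5m`), HighCPUUsage would stay pending until CPU usage has been above 80% at every evaluation for five minutes.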