📊 Prometheus
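Open-source systems monitoring and alerting toolkit
Prometheus collects metrics from services using a pull model, ships with the PromQL query language, and works with Alertmanager to route alerts to multiple channels. It is the de facto monitoring standard in the Kubernetes ecosystem.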
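Installation
# Download a release (set PROM_VER to the latest version number from GitHub Releases)
$ PROM_VER="2.52.0"
$ wget "https://github.com/prometheus/prometheus/releases/download/v${PROM_VER}/prometheus-${PROM_VER}.linux-amd64.tar.gz"
$ tar xvf "prometheus-${PROM_VER}.linux-amd64.tar.gz"
$ sudo mv "prometheus-${PROM_VER}.linux-amd64" /opt/prometheus
# Create a system user plus the config and data directories
$ sudo useradd -r -s /usr/sbin/nologin prometheus
$ sudo mkdir -p /etc/prometheus/rules /var/lib/prometheus
# Start from the sample config shipped in the tarball so the service has a file to load
$ sudo cp /opt/prometheus/prometheus.yml /etc/prometheus/prometheus.yml
$ sudo chown prometheus:prometheus /var/lib/prometheus

$ sudo vim /etc/systemd/system/prometheus.service
[Unit]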
Description=Prometheus
After=network.target
[Service]
User=prometheus
ExecStart=/opt/prometheus/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=30d \
--web.listen-address=0.0.0.0:9090
Restart=on-failure
[Install]
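WantedBy=multi-user.target

# Pick up the new unit file, then start Prometheus now and at every boot
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now prometheus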
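Once the service is enabled, it can help to confirm that Prometheus is actually serving before moving on. The sketch below polls the `/-/ready` management endpoint, which returns HTTP 200 once the server is ready to serve traffic. The function name `prom_wait_ready`, the `PROM_URL` variable, and the retry defaults are illustrative assumptions, not part of the setup above.

```shell
# Hypothetical helper: wait until Prometheus reports ready.
# PROM_URL and the retry defaults are assumptions for this sketch.
PROM_URL="${PROM_URL:-http://localhost:9090}"

prom_wait_ready() {
  tries="${1:-10}"   # maximum number of attempts
  delay="${2:-2}"    # seconds to sleep between attempts
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP error responses
    if curl -fsS -o /dev/null "${PROM_URL}/-/ready"; then
      echo "ready after ${attempt} attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "not ready after ${tries} attempts" >&2
  return 1
}
```

One possible use is `prom_wait_ready 15 2 || sudo journalctl -u prometheus -n 50`, which prints the most recent service log lines if the server never becomes ready.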
# Open http://<server-ip>:9090 in a browser

Configuration (/etc/prometheus/prometheus.yml)
global:
  scrape_interval: 15s      # scrape metrics every 15 seconds
  evaluation_interval: 15s  # evaluate alerting rules every 15 seconds

# Alerting rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager instances that receive alerts
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Scrape targets
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'localhost:9100'      # this host
          - '192.168.1.11:9100'   # another server
        labels:
          env: production

# The unit file above defines no ExecReload, so restart to apply config changes
$ sudo systemctl restart prometheus

Node Exporter (host metrics collection)
$ NODE_VER="1.8.1"
$ wget "https://github.com/prometheus/node_exporter/releases/download/v${NODE_VER}/node_exporter-${NODE_VER}.linux-amd64.tar.gz"
$ tar xvf "node_exporter-${NODE_VER}.linux-amd64.tar.gz"
$ sudo mv "node_exporter-${NODE_VER}.linux-amd64/node_exporter" /usr/local/bin/
$ sudo useradd -r -s /usr/sbin/nologin node_exporter

$ sudo vim /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now node_exporter
# Verify the metrics endpoint
$ curl http://localhost:9100/metrics

PromQL query examples
| Query | Description |
|---|---|
| `up` | Whether each target is reachable (1 = up) |
| `100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | CPU usage (%) |
| `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100` | Available memory (%) |
| `100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)` | Disk usage of the root filesystem (%) |
| `irate(node_network_receive_bytes_total[5m])` | Network receive rate (bytes/s) |
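These expressions can be typed into the web UI, and they can also be run from a shell through the Prometheus HTTP API (`/api/v1/query`). A minimal sketch, assuming the server listens on localhost:9090 as configured earlier; the function name `prom_query` and the `PROM_URL` variable are hypothetical.

```shell
# Hypothetical helper: run an instant PromQL query through the HTTP API.
# PROM_URL is an assumption; adjust it to wherever Prometheus listens.
PROM_URL="${PROM_URL:-http://localhost:9090}"

prom_query() {
  # -G sends the data as URL query parameters; --data-urlencode escapes
  # the braces, quotes, and spaces that PromQL expressions contain.
  curl -fsS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}
```

For example, `prom_query 'up'` or `prom_query 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100'`. The response is JSON with a `status` field and the matching series under `data.result`.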

Alerting rule examples (/etc/prometheus/rules/alerts.yml)
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage ({{ $labels.instance }})"
          description: "CPU usage is above 80%; current value: {{ $value }}%"

      - alert: DiskSpaceLow
        expr: 100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space ({{ $labels.instance }})"
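An invalid config or rule file can keep Prometheus from starting, so it is worth validating both files before restarting the service. The release tarball unpacked to /opt/prometheus includes a `promtool` binary next to `prometheus`; the sketch below wraps its `check config` and `check rules` subcommands. The function name `prom_validate` and the `PROMTOOL` variable are assumptions, and the default paths are the ones used in this guide.

```shell
# Hypothetical helper: validate the config and rule files with promtool.
# PROMTOOL is an assumption; point it at wherever promtool is installed.
PROMTOOL="${PROMTOOL:-/opt/prometheus/promtool}"

prom_validate() {
  config="${1:-/etc/prometheus/prometheus.yml}"
  rules="${2:-/etc/prometheus/rules/alerts.yml}"
  # Stop at the first failing check so the caller sees a non-zero status
  "$PROMTOOL" check config "$config" || return 1
  "$PROMTOOL" check rules "$rules" || return 1
  echo "config and rules OK"
}
```

A typical use would be `prom_validate && sudo systemctl restart prometheus`, so the service is only restarted when both checks pass.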