Prometheus: Monitoring System Resources and Alerting on Failures
Environment:
| Host | OS | Deployed components |
| 192.168.220.130 | CentOS 7.6 | node_exporter-1.2.0.linux-amd64 |
| 192.168.220.131 | CentOS 7.6 | node_exporter-1.2.0.linux-amd64 |
| 192.168.220.129 | CentOS 7.6 | prometheus-2.28.1.linux-amd64, alertmanager-0.21.0, grafana-8.0.6-1.x86_64.rpm, node_exporter-1.2.0.linux-amd64 |
Install node_exporter
Download and extract the package. Download URL:
https://github.com/prometheus/node_exporter/releases/download/v1.2.0/node_exporter-1.2.0.linux-amd64.tar.gz
tar -xzvf node_exporter-1.2.0.linux-amd64.tar.gz -C /opt
ln -s /opt/node_exporter-1.2.0.linux-amd64 /opt/node_exporter
vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/opt/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
chown prometheus:prometheus /usr/lib/systemd/system/node_exporter.service
chown prometheus:prometheus /opt/node_exporte
Enable start on boot
systemctl daemon-reload
systemctl enable node_exporter.service
systemctl start node_exporter.service
systemctl status node_exporter.service
Open http://192.168.220.129:9100/metrics in a browser. The metrics page shows the exporter's current values, and the data is refreshed by polling this endpoint.
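The same check can be done from a terminal instead of a browser. The helper below is a small sketch, not part of the original setup: metric_lines is a hypothetical function name, and it only filters the plain-text metrics format for one metric name.

```shell
# metric_lines NAME
# Reads Prometheus text-format metrics on stdin and prints the sample lines
# whose metric name is exactly NAME ("# HELP" / "# TYPE" lines are skipped).
metric_lines() {
    awk -v name="$1" '
        /^#/ { next }
        {
            metric = $1
            sub(/[{].*/, "", metric)   # strip any {label="..."} suffix
            if (metric == name) print
        }'
}

# Example against a live exporter (needs network access, so left commented):
#   curl -s http://192.168.220.129:9100/metrics | metric_lines node_load1
```

If the exporter is running, the commented command should print a single node_load1 line; empty output usually means the service is not reachable on port 9100.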
Edit prometheus.yml
Add the node_exporter targets to the prometheus.yml configuration:
vim /opt/prometheus/prometheus.yml
  - job_name: 'Linux'
    file_sd_configs:
      - files: ['/opt/prometheus/rules/test_cluster.yml']
        refresh_interval: 5s
vim /opt/prometheus/rules/test_cluster.yml
- targets: ['192.168.220.129:9100']
  labels:
    name: Linux-test1
- targets: ['192.168.220.130:9100']
  labels:
    name: Linux-test2
- targets: ['192.168.220.131:9100']
  labels:
    name: Linux-test3
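Before Prometheus reads this file, it can be useful to confirm that every listed target actually answers. The sketch below pulls the host:port pairs out of the targets file; list_targets is a hypothetical helper name, and the pattern assumes targets are written on "targets:" lines as in the file above.

```shell
# list_targets FILE
# Prints each host:port that appears on a "targets:" line of a file_sd file.
list_targets() {
    grep 'targets:' "$1" | grep -oE '[0-9A-Za-z._-]+:[0-9]+'
}

# Example loop (needs network access, so left commented):
#   for t in $(list_targets /opt/prometheus/rules/test_cluster.yml); do
#       curl -s -o /dev/null -w "$t -> HTTP %{http_code}\n" "http://$t/metrics"
#   done
```

A target that reports anything other than HTTP 200 in the loop would most likely also show as down on the Prometheus targets page.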
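Restart the Prometheus service
systemctl restart prometheus.service
or hot-reload the configuration (this endpoint only works if Prometheus was started with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload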
Import a Grafana dashboard template
Download the template:
https://grafana.com/grafana/dashboards/11074
Then import the dashboard in Grafana.
Install Alertmanager
Download the package and configure it
Download URL:
https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar -xzvf alertmanager-0.21.0.linux-amd64.tar.gz -C /opt
ln -s /opt/alertmanager-0.21.0.linux-amd64 /opt/alertmanager
vim /opt/alertmanager/conf/alertmanager.yml
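global:
  smtp_smarthost: smtp.exmail.xxx.com:465   # SMTP server of the sender's mailbox
  smtp_auth_username: xxxx@xxx.com          # sender's mailbox account
  smtp_from: xxx@xxx.com                    # sender's address
  smtp_auth_password: xxxxxx                # sender's password (mailbox authorization code)
  resolve_timeout: 5m
  smtp_require_tls: false
route:
  # group_by: ['alertname']   # labels used to group alerts
  group_wait: 10s             # how long to wait before sending the first notification for a new group
  group_interval: 10s         # how long to wait before notifying about new alerts added to a group
  repeat_interval: 1m         # how long to wait before re-sending a notification (for email, how often it repeats)
  receiver: 'email'
receivers:
  - name: email
    email_configs:
      - send_resolved: true
        to: xxx@xxx.com       # recipient's address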
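A YAML indentation slip in this file is easy to miss, so it may be worth validating it with amtool, which ships in the same Alertmanager tarball. The wrapper below is only a sketch: check_am_config and the AMTOOL variable are hypothetical names, and the default path assumes the /opt/alertmanager symlink created earlier.

```shell
# check_am_config FILE
# Runs "amtool check-config" on FILE. AMTOOL may be set to override the
# binary location; the default path is an assumption about this install.
check_am_config() {
    amtool_bin="${AMTOOL:-/opt/alertmanager/amtool}"
    if [ ! -x "$amtool_bin" ]; then
        echo "amtool not found at $amtool_bin" >&2
        return 127
    fi
    "$amtool_bin" check-config "$1"
}

# Example (not run here):
#   check_am_config /opt/alertmanager/conf/alertmanager.yml
```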
Set up Alertmanager as a systemd service and enable it at boot
vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/conf/alertmanager.yml --storage.path=/opt/alertmanager/data
Restart=on-failure
[Install]
WantedBy=multi-user.target
Enable start on boot
systemctl daemon-reload
systemctl enable alertmanager.service
systemctl start alertmanager.service
systemctl status alertmanager.service
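Besides systemctl status, Alertmanager exposes a /-/healthy endpoint that can be polled after the service starts. The sketch below assumes the default listen port 9093, which the unit file above does not override; wait_healthy is a hypothetical helper name.

```shell
# wait_healthy URL [TRIES]
# Polls URL about once per second until it answers with a successful HTTP
# status, or gives up after TRIES attempts (default 10).
wait_healthy() {
    url="$1"
    tries="${2:-10}"
    while [ "$tries" -gt 0 ]; do
        if curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1
}

# Example (not run here):
#   wait_healthy http://localhost:9093/-/healthy 10 && echo "alertmanager is up"
```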
Prometheus configuration
In the prometheus directory, create the alert rule file system_rules.yml and add some custom alert rules.
groups:
- name: Host
  rules:
  - alert: HostStatusAlert
    expr: up == 0
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{$labels.instance}}: server is down"
      description: "{{$labels.instance}}: server has been unreachable for more than 1 minute"
  - alert: CPUAlert
    expr: 100 * (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by(instance)) > 90
    for: 1m
    labels:
      severity: middle
    annotations:
      summary: "{{$labels.instance}}: High CPU Usage Detected"
      description: "{{$labels.instance}}: CPU usage is {{$value}}, above 90%"
  - alert: MemoryAlert
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{$labels.instance}}: High Memory Usage Detected"
      description: "{{$labels.instance}}: Memory Usage is {{ $value }}, above 85%"
  - alert: DiskAlert
    expr: 100 * (node_filesystem_size_bytes{fstype=~"xfs|ext4"} - node_filesystem_avail_bytes) / node_filesystem_size_bytes > 90
    for: 1m
    labels:
      severity: middle
    annotations:
      summary: "{{$labels.instance}}: High Disk Usage Detected"
      description: "{{$labels.instance}}, mountpoint {{$labels.mountpoint}}: Disk Usage is {{ $value }}, above 90%"
  - alert: IOAlert
    expr: 100-(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) < 60
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{$labels.instance}}: disk IO utilization is too high"
      description: "{{$labels.instance}}: disk IO idle percentage is below 60%, i.e. utilization is above 40% (current idle percentage: {{$value}})"
  - alert: NetworkAlert
    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{$labels.instance}}: inbound network bandwidth is too high"
      description: "{{$labels.instance}}: inbound traffic has exceeded 10240000 bytes/s (about 82 Mbit/s). Current expression value: {{$value}}"
  - alert: TCPSessionAlert
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      severity: high
    annotations:
      summary: "{{$labels.instance}}: too many established TCP connections"
      description: "{{$labels.instance}}: more than 1000 established TCP connections (current: {{$value}})"
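To get a feel for the thresholds, the memory rule's arithmetic can be reproduced by hand. The sketch below applies the same formula, (total - available) / total * 100, to two plain numbers; mem_used_pct is a hypothetical helper and the sample figures are made up for illustration.

```shell
# mem_used_pct TOTAL AVAILABLE
# Same arithmetic as the memory rule: (total - available) / total * 100.
mem_used_pct() {
    awk -v total="$1" -v avail="$2" \
        'BEGIN { printf "%.1f\n", (total - avail) / total * 100 }'
}

# 8 GB total with 1 GB available is 87.5% used, which is above the 85% threshold.
mem_used_pct 8000000000 1000000000   # prints 87.5

# On a Linux host the live figures can be taken from /proc/meminfo (the kB
# units cancel out), left commented here:
#   mem_used_pct "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" \
#                "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)"
```

The CPU and disk rules follow the same pattern: compute a used percentage from two series and compare it with a fixed threshold.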
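In the prometheus directory, edit the Prometheus configuration file and add the monitoring and alerting settings to prometheus.yml:
[Figure: prometheus.yml with the alerting settings added]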
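A minimal sketch of what such a fragment typically looks like. It assumes Alertmanager listens on its default port 9093 on 192.168.220.129 and that the rule file was saved as /opt/prometheus/system_rules.yml; both values are assumptions and should be adjusted to the actual setup.

```yaml
# Where Prometheus sends firing alerts (address and port are assumptions)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.220.129:9093']

# Alert rule files to load (path is an assumption)
rule_files:
  - '/opt/prometheus/system_rules.yml'
```

These two top-level keys sit alongside the existing scrape_configs section rather than inside it.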
Restart Prometheus to load the configuration:
systemctl restart prometheus.service
Verify by opening http://192.168.220.129:9090/alerts
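A syntax error in the rule file or in prometheus.yml can keep Prometheus from coming back up, so it may help to run promtool (shipped in the Prometheus tarball) before the restart. The wrapper below is a sketch: check_prom_files and the PROMTOOL variable are hypothetical names, and the default binary path and the example file paths are assumptions about this install.

```shell
# check_prom_files RULE_FILE CONFIG_FILE
# Validates the rule file first, then the main configuration.
check_prom_files() {
    promtool_bin="${PROMTOOL:-/opt/prometheus/promtool}"
    if [ ! -x "$promtool_bin" ]; then
        echo "promtool not found at $promtool_bin" >&2
        return 127
    fi
    "$promtool_bin" check rules "$1" && "$promtool_bin" check config "$2"
}

# Example (not run here):
#   check_prom_files /opt/prometheus/system_rules.yml /opt/prometheus/prometheus.yml \
#       && systemctl restart prometheus.service
```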
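Verify the email alert
Log in to the Prometheus web page and check the alert information.
Enter Prometheus_IP:9090 in a browser to see the state of each alert rule.
Then check the mailbox for the alert email.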
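If no real alert is firing, the mail path can still be exercised by posting a hand-made alert to Alertmanager's v2 API. This is a sketch under assumptions: the helper names, the alert name and the "manual-test" label value are made up, and the base URL assumes Alertmanager's default port 9093.

```shell
# test_alert_json NAME
# Prints a one-element JSON array describing a test alert called NAME.
test_alert_json() {
    printf '[{"labels":{"alertname":"%s","instance":"manual-test"},"annotations":{"summary":"manual test alert"}}]\n' "$1"
}

# send_test_alert BASE_URL NAME
# Posts the test alert to BASE_URL/api/v2/alerts.
send_test_alert() {
    test_alert_json "$2" |
        curl -fsS -H 'Content-Type: application/json' -d @- "$1/api/v2/alerts"
}

# Example (not run here):
#   send_test_alert http://192.168.220.129:9093 ManualTest
```

With the group_wait of 10s configured earlier, the notification email would be expected shortly after the alert is posted, provided the SMTP settings are valid.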