Alerting on JVM problems with Prometheus + Alertmanager
This post shows how to use Prometheus + Alertmanager to raise alerts for certain JVM conditions.
The scripts mentioned in this post can be downloaded here.
Overview
Tools used:
- Docker; this post uses Docker heavily to start the various applications.
- Prometheus, which scrapes and stores the metrics and provides queries over them; here we focus on its alerting features.
- Grafana, for data visualization (not the focus of this post; it is only there so the reader can see the abnormal metrics at a glance).
- Alertmanager, which delivers alert notifications to the relevant people.
- JMX exporter, which exposes JVM-related metrics from JMX.
- Tomcat, used to simulate a Java application.
The alerts we want to configure, roughly:
- heap usage above 50%, 80% and 90% of the maximum
- instance down for more than 30 seconds, 1 minute and 5 minutes
- old GC time above 30%, 50% and 80% of wall-clock time over the last 5 minutes
The overall alerting flow is as follows: the JMX exporter exposes the JVM metrics over HTTP, Prometheus scrapes them and evaluates the alerting rules, and firing alerts are pushed to Alertmanager, which groups and deduplicates them and notifies the relevant people by email.
Step 1: Start a few Java applications
1) Create a directory named prom-jvm-demo.
2) Download the JMX exporter into this directory.
3) Create a file simple-config.yml with the following content:
---
blacklistObjectNames: ["*:*"]

4) Run the following commands to start three Tomcats, remembering to replace <path-to-prom-jvm-demo> with the correct path (-Xmx and -Xms are deliberately set very low here so that the alert conditions will trigger):
docker run -d \
  --name tomcat-1 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6060:6060 \
  -p 8080:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-2 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6061:6060 \
  -p 8081:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-3 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6062:6060 \
  -p 8082:8080 \
  tomcat:8.5-alpine

5) Visit http://localhost:8080|8081|8082 to check that the Tomcats started successfully.
6) Visit the corresponding http://localhost:6060|6061|6062 to see the metrics exposed by the JMX exporter (a quick command-line check follows the note below).
Note: the simple-config.yml provided here exposes only JVM information; for more elaborate configurations, see the JMX exporter documentation.
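A quick command-line spot check of the same endpoints (a sketch that assumes curl is installed on the host; the port numbers are the ones mapped in the docker run commands above):

# Each Tomcat should answer over HTTP, and each JMX exporter endpoint should
# expose the jvm_* metrics that the alerting rules below rely on.
for port in 8080 8081 8082; do
  curl -s -o /dev/null -w "tomcat on :$port -> HTTP %{http_code}\n" "http://localhost:$port/"
done
for port in 6060 6061 6062; do
  echo "jvm_* metric lines on :$port -> $(curl -s "http://localhost:$port/" | grep -c '^jvm_')"
done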
Step 2: Start Prometheus
1) In the prom-jvm-demo directory created earlier, create a file prom-jmx.yml with the following content (replace <host-ip> with the IP address of the host running the Tomcats):
scrape_configs:
  - job_name: 'java'
    static_configs:
      - targets:
          - '<host-ip>:6060'
          - '<host-ip>:6061'
          - '<host-ip>:6062'

# Alertmanager address
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '<host-ip>:9093'

# Files containing the alerting rules
rule_files:
  - '/prometheus-config/prom-alert-rules.yml'

2) Create a file prom-alert-rules.yml; this file contains the alert-triggering rules:
# severity levels, from most to least severe: red, orange, yellow, blue
groups:
  - name: jvm-alerting
    rules:

      # instance down for more than 30 seconds
      - alert: instance-down
        expr: up == 0
        for: 30s
        labels:
          severity: yellow
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

      # instance down for more than 1 minute
      - alert: instance-down
        expr: up == 0
        for: 1m
        labels:
          severity: orange
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      # instance down for more than 5 minutes
      - alert: instance-down
        expr: up == 0
        for: 5m
        labels:
          severity: blue
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

      # heap usage above 50%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 50
        for: 1m
        labels:
          severity: yellow
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 50%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 50%] for more than 1 minute. current usage ({{ $value }}%)"

      # heap usage above 80%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 80
        for: 1m
        labels:
          severity: orange
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 80%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 80%] for more than 1 minute. current usage ({{ $value }}%)"

      # heap usage above 90%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 90
        for: 1m
        labels:
          severity: red
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 90%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 90%] for more than 1 minute. current usage ({{ $value }}%)"

      # Old GC took more than 30% of the time in the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.3
        for: 5m
        labels:
          severity: yellow
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 30% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 30% running time] for more than 5 minutes. current seconds: {{ $value }}"

      # Old GC took more than 50% of the time in the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.5
        for: 5m
        labels:
          severity: orange
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 50% running time] for more than 5 minutes. current seconds: {{ $value }}"

      # Old GC took more than 80% of the time in the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.8
        for: 5m
        labels:
          severity: red
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 80% running time] for more than 5 minutes. current seconds: {{ $value }}"

3) Start Prometheus:
docker run -d \
  --name=prometheus \
  -p 9090:9090 \
  -v <path-to-prom-jvm-demo>:/prometheus-config \
  prom/prometheus --config.file=/prometheus-config/prom-jmx.yml

4) Visit http://localhost:9090/alerts and you should see the alerting rules configured above:
If you do not see all three instances, wait a moment and refresh.
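If the rules never show up, the rule file most likely has a syntax error. One way to check it, sketched here under the assumption that the promtool binary is shipped inside the prom/prometheus image (alternatively, run a locally installed promtool against the same file):

# Validate the alerting rule file before (re)starting Prometheus.
docker run --rm \
  -v <path-to-prom-jvm-demo>:/prometheus-config \
  --entrypoint promtool \
  prom/prometheus check rules /prometheus-config/prom-alert-rules.yml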
Step 3: Configure Grafana
See the earlier post 使用Prometheus+Grafana監控JVM (Monitoring the JVM with Prometheus + Grafana).
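Grafana is only there for visual confirmation; the same heap-usage ratio that the alerting rules evaluate can also be queried directly from the Prometheus HTTP API. A minimal sketch, assuming Prometheus runs on localhost:9090 as above and that jq is installed (jq is only for readability and can be dropped):

# Ask Prometheus for the current heap usage percentage of every scraped instance.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100' \
  | jq '.data.result[] | {instance: .metric.instance, heap_used_percent: .value[1]}'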
Step 4: Start Alertmanager
1) Create a file alertmanager-config.yml:
global:
  smtp_smarthost: '<smtp.host:ip>'
  smtp_from: '<from>'
  smtp_auth_username: '<username>'
  smtp_auth_password: '<password>'

# The directory from which notification templates are read.
templates:
  - '/alertmanager-config/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'instance']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This ensures that multiple alerts for the same group that start firing
  # shortly after one another are batched together in the first notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

  # A default receiver
  receiver: "user-a"

# Inhibition rules allow muting a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
  - source_match:
      severity: 'red'
    target_match_re:
      severity: ^(blue|yellow|orange)$
    # Apply inhibition if the alertname and instance are the same.
    equal: ['alertname', 'instance']
  - source_match:
      severity: 'orange'
    target_match_re:
      severity: ^(blue|yellow)$
    # Apply inhibition if the alertname and instance are the same.
    equal: ['alertname', 'instance']
  - source_match:
      severity: 'yellow'
    target_match_re:
      severity: ^(blue)$
    # Apply inhibition if the alertname and instance are the same.
    equal: ['alertname', 'instance']

receivers:
  - name: 'user-a'
    email_configs:
      - to: '<user-a@domain.com>'

Change the smtp_* settings and the email address of user-a at the bottom to your own.
Note: almost no mail provider in China supports TLS, and Alertmanager currently does not support SSL, so please use Gmail or another TLS-capable mailbox to send the alert emails; see this issue.
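Alertmanager will refuse to start if the configuration file is invalid, so it can be worth validating it first. A sketch using the amtool utility that is normally bundled in the prom/alertmanager image (if your image version lacks amtool, skip this and rely on docker logs later):

# Validate alertmanager-config.yml before starting Alertmanager.
docker run --rm \
  -v <path-to-prom-jvm-demo>:/alertmanager-config \
  --entrypoint amtool \
  prom/alertmanager check-config /alertmanager-config/alertmanager-config.yml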
2) Create a file alert-template.tmpl; this is the email body template:
{{ define "email.default.html" }}
<h2>Summary</h2>
<p>{{ .CommonAnnotations.summary }}</p>

<h2>Description</h2>
<p>{{ .CommonAnnotations.description }}</p>
{{ end }}

3) Start Alertmanager with the following command:
docker run -d \
  --name=alertmanager \
  -v <path-to-prom-jvm-demo>:/alertmanager-config \
  -p 9093:9093 \
  prom/alertmanager --config.file=/alertmanager-config/alertmanager-config.yml

4) Visit http://localhost:9093 and check whether the alerts sent by Prometheus have arrived (if you do not see any, wait a moment and refresh):
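The same information is available over Alertmanager's HTTP API, which is handy on a remote host without a browser. The exact path depends on the Alertmanager version (older releases expose v1 of the API, newer ones v2), so treat this as a sketch:

# List the alerts Alertmanager currently holds (use /api/v2/alerts on newer versions).
curl -s http://localhost:9093/api/v1/alerts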
Step 5: Wait for the email
Wait a while (up to 5 minutes) and check whether the alert email arrives. If it does not, check that the configuration is correct, or run docker logs alertmanager to inspect Alertmanager's log; in most cases the cause is a misconfigured mailbox.
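Two commands that usually narrow the problem down, assuming the container names used earlier in this post (the Prometheus alerts endpoint shown here exists on recent 2.x versions):

# Look for SMTP or notification errors in the Alertmanager log.
docker logs alertmanager 2>&1 | grep -iE 'error|smtp' | tail -n 20

# Confirm that Prometheus itself is firing the alerts (state should be "firing").
curl -s http://localhost:9090/api/v1/alerts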
Summary
That completes the setup: the JMX exporter exposes the JVM metrics, Prometheus evaluates the alerting rules against them, and Alertmanager turns firing alerts into emails covering heap usage, instance downtime and excessive old GC time.