Prometheus + Consul service discovery + Alertmanager configuration
The prometheus.yml configuration is as follows:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.11.62.26:9093

rule_files:
  - "alertmanager_rules.yml"
  - "zookeeper_rules.yml"

# Global configurations
global:
  scrape_interval: 10s
  scrape_timeout: 10s
  evaluation_interval: 10s

remote_write:
  - url: "http://localhost:8086/api/v1/prom/write?db=prometheus"
remote_read:
  - url: "http://localhost:8086/api/v1/prom/read?db=prometheus"

scrape_configs:
  - job_name: 'MicroService'
    consul_sd_configs:
      - server: '10.11.62.13:8500'
        services: []
    relabel_configs:
      # Default metrics path for every discovered service
      - source_labels: [__meta_consul_tags]
        regex: '(.*)'
        replacement: '/actuator/prometheus'
        target_label: __metrics_path__
      # Services tagged with contextPath=... get the context path prepended
      - source_labels: [__meta_consul_tags]
        regex: (.*contextPath=(.*[^(,)]).*)
        replacement: '${2}/actuator/prometheus'
        target_label: __metrics_path__
      # Consul itself is scraped through an exporter on port 9107 at /metrics
      - source_labels: [__meta_consul_service,__meta_consul_address]
        regex: (consul.*);(.*)
        replacement: $2:9107
        target_label: __address__
      - source_labels: [__meta_consul_service]
        regex: (consul.*)
        replacement: '/metrics'
        target_label: __metrics_path__
      # Copy the Consul metadata into labels that survive relabeling
      - source_labels: [__meta_consul_service]
        regex: '(.+)'
        replacement: ${1}
        target_label: meta_consul_service
      - source_labels: [__meta_consul_service_address]
        regex: '(.+)'
        replacement: ${1}
        target_label: meta_consul_service_address
      - source_labels: [__meta_consul_service_id]
        regex: '(.+)'
        replacement: ${1}
        target_label: meta_consul_service_id
      - source_labels: [__meta_consul_service_port]
        regex: '(.+)'
        replacement: ${1}
        target_label: meta_consul_service_port
      # Use the host part of host:port as the instance label; the regex is
      # fully anchored, so it needs a capture group and must cover the port
      - source_labels: [__address__]
        regex: '([^:]+)(?::\d+)?'
        replacement: ${1}
        target_label: instance

#  - job_name: 'kong'
#    metrics_path: '/metrics'
#    # metrics_path defaults to '/metrics'
#    # scheme defaults to 'http'.
#    static_configs:
#      - targets: ['10.11.62.27:8001','10.11.62.28:8001','10.11.62.4:8001','10.11.62.5:8001']
#  - job_name: 'zk'
#    metrics_path: '/metrics'
#    # metrics_path defaults to '/metrics'
#    # scheme defaults to 'http'.
#    static_configs:
#      - targets: ['10.11.62.9:9141','10.11.62.4:9141','10.11.62.5:9141']
#  - job_name: 'kafka'
#    metrics_path: '/metrics'
#    # metrics_path defaults to '/metrics'
#    # scheme defaults to 'http'.
#    static_configs:
#      - targets: ['10.11.62.32:9308']
#  - job_name: node_exporter
#    static_configs:
#      - targets: ['10.11.62.32:9100']

  - job_name: 'sysmonitor'
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - ./conf.d/*.json
  - job_name: 'zk'
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - ./conf.d/kafka/*.json
  - job_name: 'kong'
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - ./conf.d/kong/*.json

The conf.d directory holds the target files loaded by the file_sd_configs entries above.
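The two __metrics_path__ rules of the MicroService job can be checked offline. The following is a minimal sketch, not Prometheus code: it assumes the default replace action, assumes that relabel regexes are anchored at both ends, and uses hypothetical tag strings (Consul service discovery joins the tags with "," and wraps the result in the same separator). The helper names apply_replace and metrics_path_for are made up for this illustration.

```python
import re


def apply_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Simplified model of one relabel step with the default `replace` action."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # relabel regexes are anchored at both ends
    if match is None:
        return dict(labels)  # no match: the label set is left unchanged
    # Rewrite Go-style ${N} references as Python's \g<N> before expanding
    template = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


def metrics_path_for(tags):
    """Apply the two __metrics_path__ rules of the MicroService job in order."""
    labels = {"__meta_consul_tags": tags}
    labels = apply_replace(labels, ["__meta_consul_tags"], r"(.*)",
                           "/actuator/prometheus", "__metrics_path__")
    labels = apply_replace(labels, ["__meta_consul_tags"],
                           r"(.*contextPath=(.*[^(,)]).*)",
                           "${2}/actuator/prometheus", "__metrics_path__")
    return labels["__metrics_path__"]


# Hypothetical tag strings, used only for illustration
print(metrics_path_for(",secure=false,"))
print(metrics_path_for(",secure=false,contextPath=/order,"))
```

Under this model the pattern only yields a clean path when contextPath is the last tag, because the greedy `.*` inside the second capture group also consumes any tags that follow it.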
The files under the kafka directory have the following content:
[
{
"targets": ["10.11.62.4:9141"],
"labels": {
"instance": "10.11.62.4",
"env": "product",
"name":"kfzx-gw-1"
}
},
{
"targets": ["10.11.62.5:9141"],
"labels": {
"instance": "10.11.62.5",
"env": "product",
"name":"kfzx-gw-2"
}
},
{
"targets": ["10.11.62.9:9141"],
"labels": {
"instance": "10.11.62.9",
"env": "product",
"name":"kfzx-cache-2"
}
}
]
The zookeeper target files contain similar entries.
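Target files of this shape can also be generated rather than written by hand. The following is a minimal sketch under that assumption; build_file_sd is a hypothetical helper, and the host names, addresses, port and env value are simply copied from the file shown above.

```python
import json


def build_file_sd(hosts, port, env):
    """Build a file_sd target list; `hosts` maps a display name to an IP address."""
    return [
        {
            "targets": [f"{ip}:{port}"],
            "labels": {"instance": ip, "env": env, "name": name},
        }
        for name, ip in hosts.items()
    ]


hosts = {
    "kfzx-gw-1": "10.11.62.4",
    "kfzx-gw-2": "10.11.62.5",
    "kfzx-cache-2": "10.11.62.9",
}
document = json.dumps(build_file_sd(hosts, 9141, "product"), indent=2)
print(document)
```

Writing `document` to a .json file under conf.d would let the matching file_sd_configs job pick it up at its next refresh_interval.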
The content of alertmanager_rules.yml is as follows:
groups:
- name: Server system monitoring
  rules:
  - alert: "服務器系統監控"  # "server system monitoring"; the Alertmanager inhibit rule matches on this name
    expr: up{job!="MicroService"} == 0
    for: 1s  # how long the condition must hold before the alert fires
    labels:
      course: "instancemonitor"  # custom label
      hostname: "{{ $labels.name }}"
      inhibit_instance: "{{ $labels.instance }}"
    annotations:
      summary: "Instance {{ $labels.instance }} down"  # custom summary
      severity: "critical"
      description: "{{ $labels.instance }} of job {{ $labels.job }} is down."  # custom description
- name: Service status monitoring
  rules:
  - alert: "微服務狀態監控"  # "microservice status monitoring"; the Alertmanager inhibit rule matches on this name
    expr: up{job="MicroService"} == 0
    for: 1s  # how long the condition must hold before the alert fires
    labels:
      course: "{{ $labels.meta_consul_service }}"  # custom label
      hostname: "{{ $labels.meta_consul_service_id }}"
      inhibit_instance: "{{ $labels.meta_consul_service_address }}"
    annotations:
      summary: "Microservice {{ $labels.meta_consul_service }} is abnormal"  # custom summary
      severity: "critical"
      description: "The instance of microservice {{ $labels.meta_consul_service }} on {{ $labels.meta_consul_service_address }}, port {{ $labels.meta_consul_service_port }}, is abnormal."  # custom description
- name: Microservice PS Old Gen monitoring
  rules:
  - alert: "微服務PS Old Gen監控"  # "microservice PS Old Gen monitoring"
    expr: (jvm_memory_used_bytes{area="heap",id="PS Old Gen"})*100/(jvm_memory_max_bytes{area="heap",id="PS Old Gen"}) > 95
    for: 5m  # the usage must stay above the threshold for 5 minutes
    labels:
      course: "{{ $labels.meta_consul_service }}"  # custom label
      hostname: "{{ $labels.meta_consul_service_id }}"
    annotations:
      summary: "Microservice {{ $labels.meta_consul_service }} PS Old Gen above 95%"  # custom summary
      severity: "critical"
      description: "PS Old Gen of the instance of microservice {{ $labels.meta_consul_service }} on {{ $labels.meta_consul_service_address }}, port {{ $labels.meta_consul_service_port }}, has stayed above 95%; current value {{ $value }}%"  # custom description
- name: Server resource monitoring
  rules:
  - alert: CPUUsage
    # name is part of the grouping so that {{ $labels.name }} below is not empty
    expr: 100 - ((avg by (instance,job,env,name)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > 75
    for: 1s  # how long the condition must hold before the alert fires
    labels:
      course: "serverMonitor"  # custom label
      hostname: "{{ $labels.name }}"
    annotations:
      summary: "Server {{ $labels.instance }} CPU usage above 75%"  # custom summary
      severity: "normal"
      description: "CPU usage of server {{ $labels.instance }} over the last 30 seconds is above 75%; current value {{ $value }}%"
  - alert: FilesystemUsage
    expr: (1 - (node_filesystem_free_bytes{fstype!="tmpfs"}) / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 > 75
    for: 1s  # how long the condition must hold before the alert fires
    labels:
      course: "serverMonitor"  # custom label
      hostname: "{{ $labels.name }}"
    annotations:
      summary: "Server {{ $labels.instance }} disk usage above 75%"  # custom summary
      severity: "normal"
      description: "Disk usage of server {{ $labels.instance }} is above 75%; current value {{ $value }}%"
  - alert: MemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 90
    for: 20s  # the usage must stay above the threshold for 20 seconds
    labels:
      course: "serverMonitor"  # custom label
      hostname: "{{ $labels.name }}"
    annotations:
      summary: "Server {{ $labels.instance }} memory usage above 90%"  # custom summary
      severity: "normal"
      description: "Memory usage of server {{ $labels.instance }} is above 90%; current value {{ $value }}%"
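The `for` values in these rules interact with the evaluation_interval of 10s set in prometheus.yml. The following is a rough sketch of that interaction, assuming the expression stays true, rules are evaluated exactly on schedule, and an alert moves from pending to firing at the first evaluation where it has been active for at least the `for` duration. The function name seconds_until_firing is hypothetical, and delays added by scraping or by Alertmanager grouping are ignored.

```python
import math


def seconds_until_firing(for_seconds, evaluation_interval):
    """Seconds from the first evaluation that sees the condition to the one that fires."""
    if for_seconds <= 0:
        return 0  # without a hold duration the alert fires on the same evaluation
    # The alert can only change state on an evaluation tick, so round up to the next tick
    return math.ceil(for_seconds / evaluation_interval) * evaluation_interval


for hold in (1, 20, 300):
    print(f"for: {hold}s fires after {seconds_until_firing(hold, 10)}s")
```

Under these assumptions a rule with `for: 1s` does not fire after one second but at the next evaluation, about 10 seconds after the condition is first seen.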
Add an inhibit rule to the Alertmanager configuration file:
inhibit_rules:
- source_match:
    alertname: '服務器系統監控'
  target_match:
    alertname: '微服務狀態監控'
  # inhibit_instance is the label that both alert rules set to the host address
  equal: ['inhibit_instance']
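The effect of such a rule can be sketched in a few lines. This is a simplified model rather than Alertmanager's implementation: it assumes a target alert is muted when some firing alert matches source_match, the target matches target_match, and every label named in equal has the same value on both, with a missing label treated like an empty value. The function is_inhibited and the sample alerts are hypothetical; the sketch assumes the equal list names the inhibit_instance label that both alert rules set.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` would be muted by `rule` under this simplified model."""
    if any(target.get(key) != value for key, value in rule["target_match"].items()):
        return False
    for source in firing_alerts:
        if any(source.get(key) != value for key, value in rule["source_match"].items()):
            continue
        # A label missing from an alert is compared as an empty string
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"alertname": "服務器系統監控"},
    "target_match": {"alertname": "微服務狀態監控"},
    "equal": ["inhibit_instance"],
}
server_down = {"alertname": "服務器系統監控", "inhibit_instance": "10.11.62.4"}
service_same_host = {"alertname": "微服務狀態監控", "inhibit_instance": "10.11.62.4"}
service_other_host = {"alertname": "微服務狀態監控", "inhibit_instance": "10.11.62.5"}

print(is_inhibited(service_same_host, [server_down], rule))
print(is_inhibited(service_other_host, [server_down], rule))
```

One consequence of the empty-value comparison is that an equal list naming a label that neither alert carries would mute every microservice alert whenever any server is down, so the label name in equal has to match the one set in the alert rules.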
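A major pitfall to remember: if relabel_configs creates a label that alert rules or Alertmanager templates will reference, such as {{ $labels.meta_consul_service_address }}, the target_label name must never begin with an underscore. Prometheus removes every label whose name starts with a double underscore (__) once target relabeling finishes, so such a label never reaches the alert.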
- source_labels: [__meta_consul_service_address]
  regex: '(.+)'
  replacement: ${1}
  target_label: meta_consul_service_address
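The reason for this rule can be shown with a small model. The sketch below assumes only the documented behaviour that labels whose names start with `__` are removed after target relabeling; finalize_target_labels is a hypothetical helper and the discovered address and port are made-up sample values.

```python
def finalize_target_labels(labels):
    """Drop labels whose names start with a double underscore, as happens after relabeling."""
    return {name: value for name, value in labels.items() if not name.startswith("__")}


# Hypothetical labels as they might look during relabeling
discovered = {
    "__address__": "10.11.62.4:8080",
    "__meta_consul_service_address": "10.11.62.4",
    "job": "MicroService",
}

# The relabel rule copies the metadata into a label without leading underscores
relabeled = dict(discovered)
relabeled["meta_consul_service_address"] = discovered["__meta_consul_service_address"]

final = finalize_target_labels(relabeled)
print(sorted(final))
```

Only job and meta_consul_service_address remain in `final`, which is why {{ $labels.meta_consul_service_address }} resolves in an alert rule while the `__meta_consul_service_address` form would not.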
The notification template is configured like any ordinary template. For example, the WeChat template is as follows:
{{ define "wechat.default.message" }}
{{ range .Alerts }}
========start==========
Alert source: prometheus_alert
Severity: {{ .Annotations.severity }}
Alert type: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Subject: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
========end==========
{{ end }}
{{ end }}
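The string "2006-01-02 15:04:05" passed to Format in the template is a Go time layout, written in terms of Go's reference time, rather than a literal date. As a rough illustration, the sketch below shows what is assumed to be the equivalent strftime pattern in Python; format_starts_at is a hypothetical helper and the sample timestamp is arbitrary.

```python
from datetime import datetime

GO_LAYOUT = "2006-01-02 15:04:05"  # layout string used in the template
PY_FORMAT = "%Y-%m-%d %H:%M:%S"    # assumed strftime equivalent


def format_starts_at(timestamp):
    """Render a timestamp the way the template's layout string would."""
    return timestamp.strftime(PY_FORMAT)


# Formatting Go's reference time reproduces the layout string itself
print(format_starts_at(datetime(2006, 1, 2, 15, 4, 5)))
print(format_starts_at(datetime(2024, 5, 6, 7, 8, 9)))
```

Changing the order or separators in the layout string changes the rendered timestamp accordingly, as long as the reference values (2006, 01, 02, 15, 04, 05) are kept for each field.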