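Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration, such as email, PagerDuty, OpsGenie, or, via the webhook receiver, many other mechanisms. It also takes care of silencing and inhibiting alerts.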
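There are several ways to install Alertmanager.
Precompiled binaries for released versions are available in the download section on prometheus.io. Using the latest production release binary is the recommended way of installing Alertmanager.
Docker images are available on Quay.io or Docker Hub.
You can launch an Alertmanager container to try it out with: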
$ docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager
Alertmanager will now be reachable at http://localhost:9093/.
You can install it with go get:
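To confirm that the container is actually serving requests, a small probe can help. The sketch below assumes Alertmanager's `/-/healthy` endpoint and the port mapping from the command above; the helper names `health_url` and `is_healthy` are made up for illustration.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Return the health-check URL for an Alertmanager base URL."""
    return base_url.rstrip("/") + "/-/healthy"


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False
```

Calling `is_healthy("http://localhost:9093")` shortly after `docker run` should return `True` once the container has started; any connection failure is reported as `False` rather than raised.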
$ GO15VENDOREXPERIMENT=1 go get github.com/prometheus/alertmanager/cmd/...
# cd $GOPATH/src/github.com/prometheus/alertmanager
$ alertmanager --config.file=<your_file>
Or clone the repository and build manually:
$ mkdir -p $GOPATH/src/github.com/prometheus
$ cd $GOPATH/src/github.com/prometheus
$ git clone https://github.com/prometheus/alertmanager.git
$ cd alertmanager
$ make build
$ ./alertmanager --config.file=<your_file>
You can also build just one of the binaries in this repository by passing its name to the build target:
$ make build BINARIES=amtool
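This is an example configuration that should cover the most relevant aspects of the new YAML configuration format. Full documentation of the configuration is available in the Alertmanager configuration documentation on prometheus.io.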
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: '[email protected]'
# The root route on which each incoming alert enters.
route:
  # The root route must not have any matchers as it is the entry point for
  # all alerts. It needs to have a receiver configured so alerts that do not
  # match any of the sub-routes are sent to someone.
  receiver: 'team-X-mails'
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  #
  # To aggregate by all possible labels use '...' as the sole label name.
  # This effectively disables aggregation entirely, passing through all
  # alerts as-is. This is unlikely to be what you want, unless you have
  # a very low alert volume or your upstream notification system performs
  # its own grouping. Example: group_by: [...]
  group_by: ['alertname', 'cluster']
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This ensures that multiple alerts for the same group that start
  # firing shortly after one another are batched together on the first
  # notification.
  group_wait: 30s
  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m
  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend it.
  repeat_interval: 3h
  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.
  # The child route trees.
  routes:
    # This route performs a regular expression match on alert labels to
    # catch alerts that are related to a list of services.
    - matchers:
        - service=~"^(foo1|foo2|baz)$"
      receiver: team-X-mails
      # The service has a sub-route for critical alerts, any alerts
      # that do not match, i.e. severity != critical, fall-back to the
      # parent node and are sent to 'team-X-mails'
      routes:
        - matchers:
            - severity="critical"
          receiver: team-X-pager
    - matchers:
        - service="files"
      receiver: team-Y-mails
      routes:
        - matchers:
            - severity="critical"
          receiver: team-Y-pager
    # This route handles all alerts coming from a database service. If there's
    # no team to handle it, it defaults to the DB team.
    - matchers:
        - service="database"
      receiver: team-DB-pager
      # Also group alerts by affected database.
      group_by: [alertname, cluster, database]
      routes:
        - matchers:
            - owner="team-X"
          receiver: team-X-pager
        - matchers:
            - owner="team-Y"
          receiver: team-Y-pager
# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    # Apply inhibition if the alertname is the same.
    # CAUTION:
    #   If all label names listed in `equal` are missing
    #   from both the source and target alerts,
    #   the inhibition rule will apply!
    equal: ['alertname']
receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: '[email protected], [email protected]'
  - name: 'team-X-pager'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - routing_key: <team-X-key>
  - name: 'team-Y-mails'
    email_configs:
      - to: '[email protected]'
  - name: 'team-Y-pager'
    pagerduty_configs:
      - routing_key: <team-Y-key>
  - name: 'team-DB-pager'
    pagerduty_configs:
      - routing_key: <team-DB-key>
The current Alertmanager API is version 2. This API is fully generated via the OpenAPI project and Go Swagger, with the exception of the HTTP handlers themselves. The API specification can be found in api/v2/openapi.yaml. An HTML-rendered version of the specification is also available. Clients can be generated easily with any OpenAPI generator for all major languages.
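With the default configuration, endpoints are accessed under a /api/v1 or /api/v2 prefix. The v2 /status endpoint is therefore /api/v2/status. If --web.route-prefix is set, API routes are prefixed with it as well, so --web.route-prefix=/alertmanager/ results in /alertmanager/api/v2/status.
API v2 is still under heavy development and is therefore subject to change.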
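The prefix rules above can be sketched as a small helper. This is only an illustration of how the path is assembled; the function name `api_url` and its parameters are hypothetical and not part of Alertmanager or any generated client.

```python
def api_url(base_url: str, endpoint: str, route_prefix: str = "", version: str = "v2") -> str:
    """Join base URL, optional route prefix, API version, and endpoint."""
    parts = [base_url.rstrip("/")]
    prefix = route_prefix.strip("/")
    if prefix:
        # Mirrors --web.route-prefix: the prefix sits in front of /api/...
        parts.append(prefix)
    parts += ["api", version, endpoint.strip("/")]
    return "/".join(parts)


print(api_url("http://localhost:9093", "/status"))
print(api_url("http://localhost:9093", "/status", route_prefix="/alertmanager/"))
```

Stripping slashes on every component keeps the result free of doubled separators regardless of how the prefix is written.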
amtool is a CLI tool for interacting with the Alertmanager API. It is bundled with all releases of Alertmanager.
Alternatively, you can install it with:
$ go install github.com/prometheus/alertmanager/cmd/amtool@latest
View all currently firing alerts:
$ amtool alert
Alertname        Starts At                Summary
Test_Alert       2017-08-02 18:30:18 UTC  This is a testing alert!
Test_Alert       2017-08-02 18:30:18 UTC  This is a testing alert!
Check_Foo_Fails  2017-08-02 18:30:18 UTC  This is a testing alert!
Check_Foo_Fails  2017-08-02 18:30:18 UTC  This is a testing alert!
View all currently firing alerts with extended output:
$ amtool -o extended alert
Labels Annotations Starts At Ends At Generator URL
alertname="Test_Alert" instance="node0" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
alertname="Test_Alert" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
alertname="Check_Foo_Fails" instance="node0" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
alertname="Check_Foo_Fails" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
In addition to viewing alerts, you can filter them with the rich query syntax provided by Alertmanager:
$ amtool -o extended alert query alertname="Test_Alert"
Labels Annotations Starts At Ends At Generator URL
alertname="Test_Alert" instance="node0" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
alertname="Test_Alert" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
$ amtool -o extended alert query instance=~".+1"
Labels Annotations Starts At Ends At Generator URL
alertname="Test_Alert" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
alertname="Check_Foo_Fails" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
$ amtool -o extended alert query alertname=~"Test.*" instance=~".+1"
Labels Annotations Starts At Ends At Generator URL
alertname="Test_Alert" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
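The matcher semantics used in these queries can be approximated in a few lines. The sketch below is a simplified model, not amtool's implementation: it assumes regex matchers are fully anchored and that a missing label behaves like an empty string, and it uses Python's `re` module rather than the RE2 syntax Alertmanager uses. The names `matches`, `query`, and `ALERTS` are illustrative.

```python
import re
from typing import Dict, List, Tuple

Matcher = Tuple[str, str, str]  # (label name, operator, value)

# Label sets taken from the extended output shown above.
ALERTS: List[Dict[str, str]] = [
    {"alertname": "Test_Alert", "instance": "node0"},
    {"alertname": "Test_Alert", "instance": "node1"},
    {"alertname": "Check_Foo_Fails", "instance": "node0"},
    {"alertname": "Check_Foo_Fails", "instance": "node1"},
]


def matches(labels: Dict[str, str], matcher: Matcher) -> bool:
    """Evaluate one matcher against a label set."""
    name, op, value = matcher
    actual = labels.get(name, "")  # missing label is treated as empty
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")


def query(alerts: List[Dict[str, str]], matchers: List[Matcher]) -> List[Dict[str, str]]:
    """Keep only the alerts that satisfy every matcher."""
    return [a for a in alerts if all(matches(a, m) for m in matchers)]
```

For example, `query(ALERTS, [("alertname", "=~", "Test.*"), ("instance", "=~", ".+1")])` keeps only the `Test_Alert` on `node1`, in line with the last amtool query above.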
Silence an alert:
$ amtool silence add alertname=Test_Alert
b3ede22e-ca14-4aa0-932c-ca2f3445f926
$ amtool silence add alertname="Test_Alert" instance=~".+0"
e48cb58a-0b17-49ba-b734-3585139b1d25
View silences:
$ amtool silence query
ID                                    Matchers              Ends At                  Created By  Comment
b3ede22e-ca14-4aa0-932c-ca2f3445f926  alertname=Test_Alert  2017-08-02 19:54:50 UTC  kellel
$ amtool silence query instance=~".+0"
ID                                    Matchers                            Ends At                  Created By  Comment
e48cb58a-0b17-49ba-b734-3585139b1d25  alertname=Test_Alert instance=~.+0  2017-08-02 22:41:39 UTC  kellel
Expire a silence:
$ amtool silence expire b3ede22e-ca14-4aa0-932c-ca2f3445f926
Expire all silences matching a query:
$ amtool silence query instance=~".+0"
ID                                    Matchers                            Ends At                  Created By  Comment
e48cb58a-0b17-49ba-b734-3585139b1d25  alertname=Test_Alert instance=~.+0  2017-08-02 22:41:39 UTC  kellel
$ amtool silence expire $(amtool silence query -q instance=~".+0")
$ amtool silence query instance=~".+0"
Expire all silences:
$ amtool silence expire $(amtool silence query -q)
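For scripts that talk to the API directly instead of going through amtool, a silence is a JSON document. The sketch below builds such a document; the field names (`matchers`, `isRegex`, `isEqual`, `startsAt`, `endsAt`, `createdBy`, `comment`) reflect my understanding of the v2 specification and should be checked against api/v2/openapi.yaml. The helper name `silence_payload` and the one-hour default duration are assumptions for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def silence_payload(matchers, created_by, comment, duration=timedelta(hours=1), now=None):
    """Build a JSON-serializable silence; matchers are (name, value, is_regex)."""
    start = now or datetime.now(timezone.utc)
    end = start + duration
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": is_regex, "isEqual": True}
            for (name, value, is_regex) in matchers
        ],
        "startsAt": start.isoformat(),  # timezone-aware, so RFC 3339 compatible
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


payload = silence_payload(
    [("alertname", "Test_Alert", False), ("instance", ".+0", True)],
    created_by="kellel",
    comment="planned maintenance",
)
print(json.dumps(payload, indent=2))
```

The resulting body would be sent to the silences endpoint under the v2 prefix; sending it is left out here so the example stays free of network calls.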
Try out how a template works. Suppose you have this in your configuration file:
templates:
- '/foo/bar/*.tmpl'
You can then test how a template would render with this command:
amtool template render --template.glob='/foo/bar/*.tmpl' --template.text='{{ template "slack.default.markdown.v1" . }}'
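amtool allows a configuration file to specify some options for convenience. The default configuration file paths are $HOME/.config/amtool/config.yml or /etc/amtool/config.yml.
An example configuration file might look like the following: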
# Define the path that `amtool` can find your `alertmanager` instance
alertmanager.url: "http://localhost:9093"
# Override the default author. (unset defaults to your username)
author: [email protected]
# Force amtool to give you an error if you don't include a comment on a silence
comment_required: true
# Set a default output format. (unset defaults to simple)
output: extended
# Set a default receiver
receiver: team-X-pager
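amtool allows you to visualize the routes of your configuration as a text tree view. You can also use it to test the routing by passing it the label set of an alert; it prints all receivers the alert would match, in order and separated by commas. (If you use --verify.receivers, amtool returns exit code 1 on a mismatch.)
Example of usage: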
# View routing tree of remote Alertmanager
$ amtool config routes --alertmanager.url=http://localhost:9093
# Test if alert matches expected receiver
$ amtool config routes test --config.file=doc/examples/simple.yml --tree --verify.receivers=team-X-pager service=database owner=team-X
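The way an alert descends the routing tree can be illustrated with a deliberately reduced model. The sketch below handles equality matchers only and lets the first matching child win; regex matchers, the `continue` option, and grouping are left out, so it is not a substitute for `amtool config routes test`. The `Route` class and `resolve` function are hypothetical names, and the tree mirrors the files and database branches of the example configuration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    matchers: Dict[str, str] = field(default_factory=dict)  # equality matchers only
    routes: List["Route"] = field(default_factory=list)


def resolve(route: Route, labels: Dict[str, str]) -> str:
    """Return the receiver of the deepest matching route (first match wins)."""
    for child in route.routes:
        if all(labels.get(k) == v for k, v in child.matchers.items()):
            return resolve(child, labels)
    # No child matched, so the alert stays with the current node.
    return route.receiver


ROOT = Route("team-X-mails", routes=[
    Route("team-Y-mails", {"service": "files"}, [
        Route("team-Y-pager", {"severity": "critical"}),
    ]),
    Route("team-DB-pager", {"service": "database"}, [
        Route("team-X-pager", {"owner": "team-X"}),
        Route("team-Y-pager", {"owner": "team-Y"}),
    ]),
])

print(resolve(ROOT, {"service": "database", "owner": "team-X"}))
```

An alert with `service=database` and no `owner` label falls back to `team-DB-pager`, and an alert that matches no child at all lands on the root receiver `team-X-mails`.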
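Alertmanager's high availability mode is in production use at many companies and is enabled by default.
Important: Both UDP and TCP are needed in Alertmanager 0.15 and higher for the cluster to work.
- If you are using a firewall, make sure to allow the clustering port for both protocols.
- If you are running in a container, make sure to expose the clustering port for both protocols.
To create a highly available Alertmanager cluster, the instances need to be configured to communicate with each other. This is done with the --cluster.* flags.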
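- --cluster.listen-address string: cluster listen address (default "0.0.0.0:9094"; an empty string disables HA mode)
- --cluster.advertise-address string: cluster advertise address
- --cluster.peer value: initial peers (repeat the flag for each additional peer)
- --cluster.peer-timeout value: peer timeout period (default "15s")
- --cluster.gossip-interval value: cluster message propagation speed (default "200ms")
- --cluster.pushpull-interval value: lower values increase convergence speed at the expense of bandwidth (default "1m0s")
- --cluster.settle-timeout value: maximum time to wait for cluster connections to settle before evaluating notifications
- --cluster.tcp-timeout value: timeout for TCP connections, reads, and writes (default "10s")
- --cluster.probe-timeout value: time to wait for an ack before marking a node unhealthy (default "500ms")
- --cluster.probe-interval value: interval between random node probes (default "1s")
- --cluster.reconnect-interval value: interval between attempts to reconnect to lost peers (default "10s")
- --cluster.reconnect-timeout value: length of time to keep attempting to reconnect to a lost peer (default "6h0m0s")
- --cluster.label value: an optional string to include in each packet and stream. It uniquely identifies the cluster and prevents cross-communication issues when sending gossip messages (default "")
The port chosen in the cluster.listen-address flag is the port that needs to be specified in the cluster.peer flag of the other peers.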
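The cluster.advertise-address flag is required if the instance does not have an IP address that is part of RFC 6890 with a default route.
To start a cluster of three peers on your local machine, use goreman and the Procfile within this repository.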
goreman start
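The relationship between cluster.listen-address and cluster.peer can be made concrete with a small generator. The sketch below only assembles the two flags described above for each peer; the function name `peer_flags` and the three local addresses are hypothetical, and it does not reproduce the repository's Procfile. Each local instance would also need its own web listen address and storage location, which are not shown.

```python
from typing import List


def peer_flags(index: int, cluster_addrs: List[str]) -> List[str]:
    """Build the cluster flags for the peer at `index`."""
    # Each peer listens on its own address ...
    flags = [f"--cluster.listen-address={cluster_addrs[index]}"]
    # ... and names every other peer's listen address as an initial peer.
    flags += [
        f"--cluster.peer={addr}"
        for i, addr in enumerate(cluster_addrs)
        if i != index
    ]
    return flags


# Hypothetical local addresses for a three-peer cluster.
ADDRS = ["127.0.0.1:9094", "127.0.0.1:9095", "127.0.0.1:9096"]

for i in range(len(ADDRS)):
    print("alertmanager", " ".join(peer_flags(i, ADDRS)))
```

Every peer lists the others' listen ports in its cluster.peer flags, which is the pairing the flag description calls for.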
To point your Prometheus 1.4 (or later) instance to multiple Alertmanagers, configure them in your prometheus.yml configuration file, for example:
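alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager1:9093
            - alertmanager2:9093
            - alertmanager3:9093
Important: Do not load balance traffic between Prometheus and its Alertmanagers. Instead, point Prometheus to a list of all Alertmanagers. The Alertmanager implementation expects all alerts to be sent to all Alertmanagers to ensure high availability.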
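If running Alertmanager in high availability mode is not desired, setting --cluster.listen-address= prevents Alertmanager from listening for incoming peer requests.
Check the Prometheus contributing page.
To contribute to the user interface, refer to ui/app/CONTRIBUTING.md.
Apache License 2.0, see LICENSE.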