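The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration, such as email, PagerDuty, OpsGenie, or (thanks to the webhook receiver) many other mechanisms. It also takes care of silencing and inhibition of alerts.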
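There are various ways of installing Alertmanager.
Precompiled binaries for released versions are available in the download section on prometheus.io. Using the latest production release binary is the recommended way of installing Alertmanager.
Docker images are available on Quay.io or Docker Hub.
You can launch an Alertmanager container for trying it out with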
$ docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager
Alertmanager will now be reachable at http://localhost:9093/.
You can either go get it:
$ GO15VENDOREXPERIMENT=1 go get github.com/prometheus/alertmanager/cmd/...
# cd $GOPATH/src/github.com/prometheus/alertmanager
$ alertmanager --config.file=<your_file>
Or clone the repository and build manually:
$ mkdir -p $GOPATH/src/github.com/prometheus
$ cd $GOPATH/src/github.com/prometheus
$ git clone https://github.com/prometheus/alertmanager.git
$ cd alertmanager
$ make build
$ ./alertmanager --config.file=<your_file>
You can also build just one of the binaries in this repository by passing its name to the build target:
$ make build BINARIES=amtool
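This is an example configuration that should cover the most relevant aspects of the new YAML configuration format. The full documentation of the configuration is available on prometheus.io.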
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: '[email protected]'

# The root route on which each incoming alert enters.
route:
  # The root route must not have any matchers as it is the entry point for
  # all alerts. It needs to have a receiver configured so alerts that do not
  # match any of the sub-routes are sent to someone.
  receiver: 'team-X-mails'

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  #
  # To aggregate by all possible labels use '...' as the sole label name.
  # This effectively disables aggregation entirely, passing through all
  # alerts as-is. This is unlikely to be what you want, unless you have
  # a very low alert volume or your upstream notification system performs
  # its own grouping. Example: group_by: [...]
  group_by: ['alertname', 'cluster']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This ensures that multiple alerts for the same group that start
  # firing shortly after one another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend it.
  repeat_interval: 3h

  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.

  # The child route trees.
  routes:
  # This route performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
  - matchers:
    - service=~"^(foo1|foo2|baz)$"
    receiver: team-X-mails
    # The service has a sub-route for critical alerts. Any alerts
    # that do not match, i.e. severity != critical, fall back to the
    # parent node and are sent to 'team-X-mails'.
    routes:
    - matchers:
      - severity="critical"
      receiver: team-X-pager

  - matchers:
    - service="files"
    receiver: team-Y-mails
    routes:
    - matchers:
      - severity="critical"
      receiver: team-Y-pager

  # This route handles all alerts coming from a database service. If there's
  # no team to handle it, it defaults to the DB team.
  - matchers:
    - service="database"
    receiver: team-DB-pager
    # Also group alerts by affected database.
    group_by: [alertname, cluster, database]
    routes:
    - matchers:
      - owner="team-X"
      receiver: team-X-pager
    - matchers:
      - owner="team-Y"
      receiver: team-Y-pager

# Inhibition rules allow muting a set of alerts while another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_matchers:
  - severity="critical"
  target_matchers:
  - severity="warning"
  # Apply inhibition if the alertname is the same.
  # CAUTION:
  #   If all label names listed in `equal` are missing
  #   from both the source and target alerts,
  #   the inhibition rule will apply!
  equal: ['alertname']

receivers:
- name: 'team-X-mails'
  email_configs:
  - to: '[email protected], [email protected]'

- name: 'team-X-pager'
  email_configs:
  - to: '[email protected]'
  pagerduty_configs:
  - routing_key: <team-X-key>

- name: 'team-Y-mails'
  email_configs:
  - to: '[email protected]'

- name: 'team-Y-pager'
  pagerduty_configs:
  - routing_key: <team-Y-key>

- name: 'team-DB-pager'
  pagerduty_configs:
  - routing_key: <team-DB-key>

The current Alertmanager API is version 2. This API is fully generated via the OpenAPI project and Go Swagger, with the exception of the HTTP handlers themselves. The API specification can be found in api/v2/openapi.yaml. An HTML rendered version is also available online. Clients can be easily generated via any OpenAPI generator for all major languages.
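With the default configuration, endpoints are accessed under a /api/v1 or /api/v2 prefix. The v2 /status endpoint would be /api/v2/status. If --web.route-prefix is set, then API routes are prefixed with that as well, so --web.route-prefix=/alertmanager/ would relate to /alertmanager/api/v2/status.
API v2 is still under heavy development and thereby subject to change.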
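To illustrate how the route prefix composes with the API path, the following Python sketch builds a v2 endpoint URL and, optionally, fetches it. The helper names `api_v2_url` and `fetch_status` and the example host are assumptions made for this illustration; they are not part of Alertmanager.

```python
import json
import urllib.request


def api_v2_url(base: str, endpoint: str, route_prefix: str = "") -> str:
    """Join a base address, an optional --web.route-prefix value, and a v2 endpoint."""
    parts = [base.rstrip("/")]
    prefix = route_prefix.strip("/")
    if prefix:
        parts.append(prefix)
    parts.append("api/v2")
    parts.append(endpoint.strip("/"))
    return "/".join(parts)


def fetch_status(base: str, route_prefix: str = "", timeout: float = 5.0) -> dict:
    """GET the v2 status endpoint and decode the JSON body (needs a running instance)."""
    url = api_v2_url(base, "status", route_prefix)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints URLs; call fetch_status() yourself against a live instance.
    print(api_v2_url("http://localhost:9093", "status"))
    print(api_v2_url("http://localhost:9093", "status", "/alertmanager/"))
```

Keeping URL construction separate from the HTTP call makes the prefix handling easy to check without a running server.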
amtool is a CLI tool for interacting with the Alertmanager API. It is bundled with all releases of Alertmanager.
Alternatively, you can install it with:
$ go install github.com/prometheus/alertmanager/cmd/amtool@latest
View all currently firing alerts:
$ amtool alert
Alertname        Starts At                Summary
Test_Alert       2017-08-02 18:30:18 UTC  This is a testing alert!
Test_Alert       2017-08-02 18:30:18 UTC  This is a testing alert!
Check_Foo_Fails  2017-08-02 18:30:18 UTC  This is a testing alert!
Check_Foo_Fails  2017-08-02 18:30:18 UTC  This is a testing alert!
View all currently firing alerts with extended output:
$ amtool -o extended alert
Labels Annotations Starts At Ends At Generator URL
alertname="Test_Alert" instance="node0" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
alertname="Test_Alert" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
alertname="Check_Foo_Fails" instance="node0" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
alertname="Check_Foo_Fails" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
In addition to viewing alerts, you can use the rich query syntax provided by Alertmanager:
$ amtool -o extended alert query alertname="Test_Alert"
Labels Annotations Starts At Ends At Generator URL
alertname="Test_Alert" instance="node0" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
alertname="Test_Alert" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
$ amtool -o extended alert query instance=~".+1"
Labels Annotations Starts At Ends At Generator URL
alertname="Test_Alert" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
alertname="Check_Foo_Fails" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
$ amtool -o extended alert query alertname=~"Test.*" instance=~".+1"
Labels Annotations Starts At Ends At Generator URL
alertname="Test_Alert" instance="node1" link="https://example.com" summary="This is a testing alert!" 2017-08-02 18:31:24 UTC 0001-01-01 00:00:00 UTC http://my.testing.script.local
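The queries above combine several matchers, and an alert is listed only if it satisfies all of them. The Python sketch below mimics that filtering on the four example alerts. It is an illustration only: the helper names are invented, regular expressions are evaluated with Python's `re` module rather than Go's RE2, regex matchers are assumed to be anchored to the whole label value, and a missing label is treated as an empty string.

```python
import re

# The four alerts from the example output, reduced to their labels.
ALERTS = [
    {"alertname": "Test_Alert", "instance": "node0"},
    {"alertname": "Test_Alert", "instance": "node1"},
    {"alertname": "Check_Foo_Fails", "instance": "node0"},
    {"alertname": "Check_Foo_Fails", "instance": "node1"},
]

# name, operator, optionally quoted value
MATCHER_RE = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)\s*(=~|!~|!=|=)\s*"?(.*?)"?\s*$')


def parse_matcher(text):
    """Split a matcher such as instance=~".+1" into (name, operator, value)."""
    m = MATCHER_RE.match(text)
    if m is None:
        raise ValueError(f"bad matcher: {text!r}")
    return m.groups()


def matches(labels, matcher):
    """Evaluate one matcher against a label set."""
    name, op, value = matcher
    actual = labels.get(name, "")  # assumption: missing label behaves like ""
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    hit = re.fullmatch(value, actual) is not None
    return hit if op == "=~" else not hit


def query(alerts, *matcher_texts):
    """Return the alerts that satisfy every matcher (logical AND)."""
    matchers = [parse_matcher(t) for t in matcher_texts]
    return [a for a in alerts if all(matches(a, m) for m in matchers)]


if __name__ == "__main__":
    for alert in query(ALERTS, 'alertname=~"Test.*"', 'instance=~".+1"'):
        print(alert)
```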
Silence an alert:
$ amtool silence add alertname=Test_Alert
b3ede22e-ca14-4aa0-932c-ca2f3445f926
$ amtool silence add alertname="Test_Alert" instance=~".+0"
e48cb58a-0b17-49ba-b734-3585139b1d25
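The same silences can be created without amtool by posting to the v2 API. The Python sketch below builds a request body for the silences endpoint and shows how it might be sent. The helper names, the one-hour duration, and the comment text are assumptions chosen for illustration; check the field names against the OpenAPI specification of your Alertmanager version before relying on them.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def silence_payload(matchers, created_by, comment,
                    duration=timedelta(hours=1), now=None):
    """Build a silence body; matchers is a list of (name, value, is_regex)."""
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": is_regex, "isEqual": True}
            for (name, value, is_regex) in matchers
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + duration).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(base_url, payload, timeout=5.0):
    """POST the payload to /api/v2/silences (needs a running instance)."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/silences",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    body = silence_payload(
        [("alertname", "Test_Alert", False), ("instance", ".+0", True)],
        created_by="kellel",
        comment="example silence",  # hypothetical comment text
    )
    print(json.dumps(body, indent=2))
```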
View silences:
$ amtool silence query
ID                                    Matchers              Ends At                  Created By  Comment
b3ede22e-ca14-4aa0-932c-ca2f3445f926  alertname=Test_Alert  2017-08-02 19:54:50 UTC  kellel
$ amtool silence query instance=~".+0"
ID                                    Matchers                            Ends At                  Created By  Comment
e48cb58a-0b17-49ba-b734-3585139b1d25  alertname=Test_Alert instance=~.+0  2017-08-02 22:41:39 UTC  kellel
Expire a silence:
$ amtool silence expire b3ede22e-ca14-4aa0-932c-ca2f3445f926
Expire all silences matching a query:
$ amtool silence query instance=~".+0"
ID                                    Matchers                            Ends At                  Created By  Comment
e48cb58a-0b17-49ba-b734-3585139b1d25  alertname=Test_Alert instance=~.+0  2017-08-02 22:41:39 UTC  kellel
$ amtool silence expire $(amtool silence query -q instance=~".+0")
$ amtool silence query instance=~".+0"
Expire all silences:
$ amtool silence expire $(amtool silence query -q)
Try out how a template works. Let's say you have this in your configuration file:
templates:
- '/foo/bar/*.tmpl'
Then you can test how a template would render by using this command:
amtool template render --template.glob='/foo/bar/*.tmpl' --template.text='{{ template "slack.default.markdown.v1" . }}'
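amtool allows a configuration file to specify some options for convenience. The default configuration file paths are $HOME/.config/amtool/config.yml or /etc/amtool/config.yml.
An example configuration file might look like the following: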
# Define the path that `amtool` can find your `alertmanager` instance
alertmanager.url: "http://localhost:9093"
# Override the default author. (unset defaults to your username)
author: [email protected]
# Force amtool to give you an error if you don't include a comment on a silence
comment_required: true
# Set a default output format. (unset defaults to simple)
output: extended
# Set a default receiver
receiver: team-X-pager
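amtool allows you to visualize the routes of your configuration as a text tree view. You can also use it to test the routing by passing it the label set of an alert; it then prints out all receivers the alert would match, in order and separated by commas. (If you use --verify.receivers, amtool returns error code 1 on mismatch.)
Example of usage: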
# View routing tree of remote Alertmanager
$ amtool config routes --alertmanager.url=http://localhost:9093
# Test if alert matches expected receiver
$ amtool config routes test --config.file=doc/examples/simple.yml --tree --verify.receivers=team-X-pager service=database owner=team-X
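The kind of lookup that a routing test performs can be sketched in a few lines of Python. The tree below mirrors the routes of the example configuration shown earlier; the function names and the tuple representation of matchers are invented for this illustration. The sketch assumes a depth-first walk in which the first matching child wins unless it sets `continue`, and in which a node with no matching child handles the alert itself. It ignores grouping and timing, and it requires every node to name its receiver explicitly.

```python
import re

# Routing tree mirroring the example configuration (receivers only).
ROOT = {
    "receiver": "team-X-mails",
    "routes": [
        {
            "matchers": [("service", "=~", "^(foo1|foo2|baz)$")],
            "receiver": "team-X-mails",
            "routes": [
                {"matchers": [("severity", "=", "critical")], "receiver": "team-X-pager"},
            ],
        },
        {
            "matchers": [("service", "=", "files")],
            "receiver": "team-Y-mails",
            "routes": [
                {"matchers": [("severity", "=", "critical")], "receiver": "team-Y-pager"},
            ],
        },
        {
            "matchers": [("service", "=", "database")],
            "receiver": "team-DB-pager",
            "routes": [
                {"matchers": [("owner", "=", "team-X")], "receiver": "team-X-pager"},
                {"matchers": [("owner", "=", "team-Y")], "receiver": "team-Y-pager"},
            ],
        },
    ],
}


def matcher_ok(labels, matcher):
    """Evaluate a single (name, operator, value) matcher; only = and =~ here."""
    name, op, value = matcher
    actual = labels.get(name, "")
    if op == "=":
        return actual == value
    if op == "=~":
        return re.fullmatch(value, actual) is not None
    raise ValueError(f"unsupported operator: {op}")


def match_route(route, labels):
    """Return the receivers that would be notified for this label set."""
    if not all(matcher_ok(labels, m) for m in route.get("matchers", [])):
        return []
    receivers = []
    for child in route.get("routes", []):
        found = match_route(child, labels)
        receivers.extend(found)
        # Stop at the first matching child unless it asks to continue.
        if found and not child.get("continue", False):
            break
    # No child matched: this node handles the alert itself.
    return receivers or [route["receiver"]]


if __name__ == "__main__":
    print(",".join(match_route(ROOT, {"service": "database", "owner": "team-X"})))
```

With the labels service=database and owner=team-X, as in the command above, the sketch resolves to team-X-pager.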
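Alertmanager's high availability is in production use at many companies and is enabled by default.
Important: Both UDP and TCP are needed in Alertmanager 0.15 and higher for the cluster to work.
- If you are using a firewall, make sure to whitelist the clustering port for both protocols.
- If you are running in a container, make sure to expose the clustering port for both protocols.
To create a highly available cluster of Alertmanagers, the instances need to be configured to communicate with each other. This is configured using the --cluster.* flags.
- --cluster.listen-address string: cluster listen address (default "0.0.0.0:9094"; empty string disables HA mode)
- --cluster.advertise-address string: cluster advertise address
- --cluster.peer value: initial peers (repeat flag for each additional peer)
- --cluster.peer-timeout value: peer timeout period (default "15s")
- --cluster.gossip-interval value: cluster message propagation speed (default "200ms")
- --cluster.pushpull-interval value: lower values will increase convergence speeds at expense of bandwidth (default "1m0s")
- --cluster.settle-timeout value: maximum time to wait for cluster connections to settle before evaluating notifications
- --cluster.tcp-timeout value: timeout value for TCP connections, reads and writes (default "10s")
- --cluster.probe-timeout value: time to wait for ack before marking node unhealthy (default "500ms")
- --cluster.probe-interval value: interval between random node probes (default "1s")
- --cluster.reconnect-interval value: interval between attempts to reconnect to lost peers (default "10s")
- --cluster.reconnect-timeout value: length of time to attempt to reconnect to a lost peer (default "6h0m0s")
- --cluster.label value: an optional string to include on each packet and stream. It uniquely identifies the cluster and prevents cross-communication issues when sending gossip messages (default "")
The port chosen in the cluster.listen-address flag is the port that needs to be specified in the cluster.peer flag of the other peers.
The cluster.advertise-address flag is required if the instance doesn't have an IP address that is part of RFC 6890 with a default route.
To start a cluster of three peers on your local machine, use goreman and the Procfile within this repository.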
goreman start
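To make the relationship between the listen address and the peer flag concrete, the Python sketch below generates command lines for a small local cluster. It is not the contents of the repository's Procfile: the function name, the ports, the storage paths, and the configuration file name are assumptions chosen for illustration. Each instance gets its own web and cluster port, and every instance after the first is pointed at the first one's cluster address.

```python
def local_cluster_commands(n=3, config_file="alertmanager.yml",
                           web_port=9093, cluster_port=8001):
    """Return one argument list per Alertmanager instance of a local cluster."""
    first_peer = f"127.0.0.1:{cluster_port}"
    commands = []
    for i in range(n):
        args = [
            "./alertmanager",
            f"--config.file={config_file}",
            f"--storage.path=data/a{i + 1}",  # hypothetical per-instance data dir
            f"--web.listen-address=:{web_port + i}",
            f"--cluster.listen-address=127.0.0.1:{cluster_port + i}",
        ]
        if i > 0:
            # The peer port must equal the first instance's cluster listen port.
            args.append(f"--cluster.peer={first_peer}")
        commands.append(args)
    return commands


if __name__ == "__main__":
    for number, args in enumerate(local_cluster_commands(), start=1):
        print(f"a{number}: " + " ".join(args))
```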
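To point your Prometheus 1.4 (or later) instance to multiple Alertmanagers, configure them in your prometheus.yml configuration file, for example:
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager1:9093
      - alertmanager2:9093
      - alertmanager3:9093
Important: Do not load balance traffic between Prometheus and its Alertmanagers, but instead point Prometheus to a list of all Alertmanagers. The Alertmanager implementation expects all alerts to be sent to all Alertmanagers to ensure high availability.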
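A custom client that sends alerts itself should follow the same rule and post every alert to every instance. The Python sketch below prepares one POST request per Alertmanager for the v2 alerts endpoint. The function names are invented for this illustration, the host names are the ones from the example above, and the payload carries only labels and annotations; consult the OpenAPI specification for the full alert schema.

```python
import json
import urllib.request

ALERTMANAGERS = [
    "http://alertmanager1:9093",
    "http://alertmanager2:9093",
    "http://alertmanager3:9093",
]


def build_alert(labels, annotations=None):
    """Minimal alert object: a label set plus optional annotations."""
    return {"labels": dict(labels), "annotations": dict(annotations or {})}


def fan_out_requests(alertmanagers, alerts):
    """Prepare one identical POST request per Alertmanager (no load balancing)."""
    body = json.dumps(alerts).encode("utf-8")
    return [
        urllib.request.Request(
            base.rstrip("/") + "/api/v2/alerts",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        for base in alertmanagers
    ]


def send_to_all(alertmanagers, alerts, timeout=5.0):
    """Send to every instance; one failure must not stop the others."""
    results = {}
    for req in fan_out_requests(alertmanagers, alerts):
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                results[req.full_url] = resp.status
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            results[req.full_url] = exc
    return results


if __name__ == "__main__":
    alerts = [build_alert({"alertname": "Test_Alert", "instance": "node0"})]
    for req in fan_out_requests(ALERTMANAGERS, alerts):
        print(req.get_method(), req.full_url)
```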
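If running Alertmanager in high availability mode is not desired, setting --cluster.listen-address= prevents Alertmanager from listening to incoming peer requests.
Check the Prometheus contributing page.
To contribute to the user interface, refer to ui/app/CONTRIBUTING.md.
Apache License 2.0, see LICENSE.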