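Guide to 30+ Common Prometheus Errors
Author: 尹正杰 (Yin Zhengjie)
Copyright notice: this is an original work. Reproduction is not permitted; violations will be pursued legally.
Contents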
- Q1: "open prometheus.yml: no such file or directory"
- Q2: dial tcp 10.0.0.44:9100: connect: no route to host
- Q3: grafana-enterprise depends on musl; however: Package musl is not installed.
- Q4: failed to retrieve cluster info from ES
- Q5: failed to verify certificate: x509: cannot validate certificate for 10.0.0.93 because it doesn't contain any IP SANs
- Q6: HTTP Request failed with code 401
- Q7: dial tcp 10.0.0.91:9092: connect: connection refused
- Q8: listen tcp :9100: bind: address already in use
- Q9: line 34: field file_sd_config not found in type config.ScrapeConfig
- Q10: line 35: cannot unmarshal !!map into []*file.SDConfig
- Q11: Error reading file
- Q12: Unknown service ID "prometheus-node43". Ensure that the service ID is passed, not the service name.
- Q13: WARNING: file "/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/*.yaml" for file_sd in scrape job "yinzhengjie-file-sd-yaml-node_exporter" does not exist
- Q14: dial tcp 127.0.0.1:3306: connect: connection refused
- Q15: no configuration found
- Q16: Error 1045 (28000): Access denied for user 'exporter11'@'10.0.0.42' (using password: YES)
- Q17: ERRO[0005] Couldn't connect to redis instance (redis://localhost:6379)
- Q18: Panel plugin not found: natel-discrete-panel
- Q19: json.Unmarshal failed invalid character '#' looking for beginning of value
- Q20: dial tcp 10.0.0.42:9115: connect: connection refused
- Q21: amtool: error: Failed to read from standard input
- Q22: FAILED: "yinzhengjie-linux95-rules.yml" does not point to an existing file
- Q23: *email.loginAuth auth: 535 Login Fail. Please enter your authorization code to login. More information in https://service.mail.qq.com/detail/0/53
- Q24: VictoriaMetrics Enterprise license is required
- Q25: etcdserver: no leader
- Q26: error reading server preface: EOF
- Q27: snapshot must be requested to one selected node, not multiple
- Q28: data-dir "/var/lib/etcd/" not empty or could not be read
- Q29: failed to verify certificate: x509: certificate signed by unknown authority
Q1: "open prometheus.yml: no such file or directory"
Error message
[root@prometheus-server31 prometheus-2.53.3.linux-amd64]# ./prometheus
ts=2025-01-10T01:44:41.515Z caller=main.go:537 level=error msg="Error loading config (--config.file=prometheus.yml)" file=/root/prometheus-2.53.3.linux-amd64/prometheus.yml err="open prometheus.yml: no such file or directory"
[root@prometheus-server31 prometheus-2.53.3.linux-amd64]#
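Cause
Prometheus cannot find its configuration file prometheus.yml. With no --config.file flag it looks for prometheus.yml in the current working directory, here /root/prometheus-2.53.3.linux-amd64/.
Solution
Pass the path to the configuration file with --config.file.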
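As a minimal sketch of the fix, the helper below checks that the configuration file exists before printing the start command. The function name prometheus_start_cmd is hypothetical and not part of Prometheus; only the --config.file flag comes from the log above.

```shell
# Hypothetical helper (not part of Prometheus): print the start command for a
# given config file, or fail early if the file does not exist.
prometheus_start_cmd() {
  config="$1"
  if [ ! -f "$config" ]; then
    echo "config file not found: $config" >&2
    return 1
  fi
  printf './prometheus --config.file=%s\n' "$config"
}

# Example, run from the Prometheus directory with your own path:
# prometheus_start_cmd /root/prometheus-2.53.3.linux-amd64/prometheus.yml
```

Using an absolute path avoids depending on the directory from which Prometheus is started.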
Q2: dial tcp 10.0.0.44:9100: connect: no route to host
Error message
Get "http://10.0.0.44:9100/metrics": dial tcp 10.0.0.44:9100: connect: no route to host
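Cause
The Prometheus server cannot reach the scrape target. The cause is usually either the network (target host down, missing route, or a firewall rejecting the connection) or a wrong address in the configuration file.
Solution
- Check the configuration file, for example whether the target IP address is mistyped;
- Check connectivity to the target node (host up, route, firewall);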
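The connectivity check can be sketched as a small TCP probe run from the Prometheus server. The helper name port_open is hypothetical; it assumes bash (for the /dev/tcp pseudo-device) and the coreutils timeout command are available.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be opened
# within 3 seconds. Uses bash's /dev/tcp pseudo-device and coreutils timeout.
port_open() {
  host="$1"
  port="$2"
  timeout 3 bash -c ': >"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Target taken from the error message above.
if port_open 10.0.0.44 9100; then
  echo "10.0.0.44:9100 is reachable"
else
  echo "10.0.0.44:9100 is NOT reachable: check the target IP, routing and firewall"
fi
```

If the probe fails from the Prometheus server but succeeds on the target itself, the problem is most likely between the two hosts rather than in node_exporter.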
Q3: grafana-enterprise depends on musl; however: Package musl is not installed.
Error message
dpkg: dependency problems prevent configuration of grafana-enterprise:
grafana-enterprise depends on musl; however:
Package musl is not installed.
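Cause
The dependency package musl is missing, so dpkg cannot configure grafana-enterprise.
Solution
Install the musl package with apt, for example "apt -y install musl".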
Q4: failed to retrieve cluster info from ES
Error message
level=error ts=2025-01-10T09:14:06.608775966Z caller=clusterinfo.go:188 msg="failed to retrieve cluster info from ES" err="Get \"http://localhost:9200/\": EOF"
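Cause
The exporter failed to connect to the ES cluster: the request to http://localhost:9200/ ended with EOF.
Solution
Check that the user authentication settings for the ES cluster are correct. It is also worth confirming the URL scheme, since an EOF on an http:// address may mean the cluster only serves HTTPS.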
Q5: failed to verify certificate: x509: cannot validate certificate for 10.0.0.93 because it doesn't contain any IP SANs
Error message
level=error ts=2025-01-10T09:16:13.647741728Z caller=clusterinfo.go:267 msg="failed to get cluster info" err="Get \"https://10.0.0.93:9200/\": tls: failed to verify certificate: x509: cannot validate certificate for 10.0.0.93 because it doesn't contain any IP SANs"
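Cause
Certificate verification failed: the exporter connects by IP address (10.0.0.93), but the server certificate contains no IP SANs, so it cannot be validated for that address.
Solution
Skip certificate verification (with elasticsearch_exporter this should be the "--es.ssl-skip-verify" flag). A stricter alternative is to reissue the certificate with the IP address in its SANs.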
Q6: HTTP Request failed with code 401
Error message
level=error ts=2025-01-10T09:17:37.016462637Z caller=clusterinfo.go:188 msg="failed to retrieve cluster info from ES" err="HTTP Request failed with code 401"
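Cause
HTTP 401 means that user authentication failed.
Solution
Check that the username and password used to connect to the ES cluster are correct.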
Q7: dial tcp 10.0.0.91:9092: connect: connection refused
Error message
F0110 17:30:19.276988 2099 kafka_exporter.go:924] Error Init Kafka Client: kafka: client has run out of available brokers to talk to: dial tcp 10.0.0.91:9092: connect: connection refused
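Cause
kafka_exporter failed to connect to the Kafka cluster.
Solution
Check that the Kafka cluster is running and that a broker is listening on 10.0.0.91:9092.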
Q8: listen tcp :9100: bind: address already in use
Error message
ts=2025-01-10T09:42:58.329Z caller=node_exporter.go:224 level=error err="listen tcp :9100: bind: address already in use"
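Cause
The port is already in use when the service starts.
Solution
- Change the listening port (for node_exporter, for example "--web.listen-address=:9101");
- Stop the old service that holds the port (a command such as "ss -lntp | grep 9100" shows which process that is);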
Q9: line 34: field file_sd_config not found in type config.ScrapeConfig
Error message
[root@prometheus-server31 ~]# !curl
curl -X POST 10.0.0.31:9090/-/reload
failed to reload config: couldn't load configuration (--config.file="/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml"): parsing YAML file /yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml: yaml: unmarshal errors:
line 34: field file_sd_config not found in type config.ScrapeConfig
[root@prometheus-server31 ~]#
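Cause
Line 34 of the Prometheus configuration file is wrong: file_sd_config is not a valid field of a scrape config.
Solution
Check the configuration syntax. The key is missing a trailing "s"; the correct key is "file_sd_configs".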
Q10: line 35: cannot unmarshal !!map into []*file.SDConfig
Error message
[root@prometheus-server31 ~]# curl -X POST 10.0.0.31:9090/-/reload
failed to reload config: couldn't load configuration (--config.file="/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml"): parsing YAML file /yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml: yaml: unmarshal errors:
line 35: cannot unmarshal !!map into []*file.SDConfig
[root@prometheus-server31 ~]#
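Cause
The Prometheus configuration file is wrong at line 35: the value is written as a mapping, but a list ("[]*file.SDConfig") is expected there.
Solution
Check the configuration and make the value of file_sd_configs a list, that is, each entry under it must start with "- ".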
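For reference, here is a minimal sketch of a scrape job whose file_sd_configs value has the expected list shape. It writes the fragment to a scratch file only for illustration; the job name and glob are borrowed from elsewhere in this guide, and the rest of a real prometheus.yml is omitted.

```shell
# Write a minimal scrape job to a scratch file to show the expected shape.
# The scratch file is illustrative only; a real prometheus.yml has more sections.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
scrape_configs:
  - job_name: "yinzhengjie-file-sd-yaml-node_exporter"
    file_sd_configs:          # plural key, and its value is a list
      - files:                # the leading "- " makes this a list item
          - /yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/*.yaml
EOF
cat "$cfg"

# After editing the real file, validate it before reloading:
# ./promtool check config prometheus.yml
```

Dropping the "- " in front of "files:" turns the value into a mapping, which is the kind of mistake the "cannot unmarshal !!map" message reports.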
Q11: Error reading file
Error message
[root@prometheus-server31 ~]# tail -100f /yinzhengjie/logs/prometheus/prometheus-server.log
...
{"caller":"file.go:342","component":"discovery manager scrape","config":"yinzhengjie-file-sd-json-node_exporter","discovery":"file","err":"invalid character ']' looking for beginning of value","level":"error","msg":"Error reading file","path":"/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/linux95-hosts.json","ts":"2025-01-13T01:54:36.637Z"}
...
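Cause
Prometheus failed to read "/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/linux95-hosts.json": the file is not valid JSON ("invalid character ']' looking for beginning of value").
Solution
Fix the JSON syntax of the file. This particular message usually points to a stray comma just before a closing "]".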
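One way to catch such mistakes before Prometheus does is to validate the service discovery file locally. The sketch below assumes python3 is installed (json.tool is part of its standard library); the helper name validate_json and the sample target are hypothetical.

```shell
# Hypothetical helper: succeed only if the file parses as JSON.
# Assumes python3 is installed; json.tool ships with its standard library.
validate_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1
}

# Reproduce the failure with scratch files: the trailing comma before "]"
# is one common way to end up with an unparsable targets file.
bad="$(mktemp)"
good="$(mktemp)"
printf '[{"targets": ["10.0.0.41:9100"]},]\n' > "$bad"
printf '[{"targets": ["10.0.0.41:9100"]}]\n'  > "$good"

validate_json "$bad"  || echo "bad file: invalid JSON"
validate_json "$good" && echo "good file: valid JSON"
```

Running the same check against sd/linux95-hosts.json after each edit gives quicker feedback than waiting for the error to show up in the Prometheus log.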
Q12: Unknown service ID "prometheus-node43". Ensure that the service ID is passed, not the service name.
Error message
[root@node-exporter41 ~]# curl -X PUT http://10.0.0.41:8500/v1/agent/service/deregister/prometheus-node43
Unknown service ID "prometheus-node43". Ensure that the service ID is passed, not the service name.
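Cause
The service was registered on one node, but the deregister request was sent to a different node.
Solution
Register and deregister on the same node. This is a pitfall that remains to be resolved.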
Q13: WARNING: file "/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/*.yaml" for file_sd in scrape job "yinzhengjie-file-sd-yaml-node_exporter" does not exist
Error message
[root@prometheus-server32 prometheus-2.53.3.linux-amd64]# ./promtool check config prometheus.yml
Checking prometheus.yml
WARNING: file "/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/*.yaml" for file_sd in scrape job "yinzhengjie-file-sd-yaml-node_exporter" does not exist
SUCCESS: prometheus.yml is valid prometheus config file syntax
[root@prometheus-server32 prometheus-2.53.3.linux-amd64]#
Cause
This is only a warning: no file matches the service discovery path yet. It can be left as is.
Solution
Create the file as the message suggests.
Q14: dial tcp 127.0.0.1:3306: connect: connection refused
Error message
time=2025-01-13T15:21:45.510+08:00 level=ERROR source=exporter.go:131 msg="Error opening connection to database" err="dial tcp 127.0.0.1:3306: connect: connection refused"
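Cause
mysqld_exporter failed to connect to the database.
Solution
Check the database settings. The exporter is connecting to 127.0.0.1:3306, so either confirm that MySQL is listening there or configure the correct host and port.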
Q15: no configuration found
Error message
time=2025-01-13T15:24:29.911+08:00 level=INFO source=mysqld_exporter.go:244 msg="Error parsing host config" file=.my.cnf err="no configuration found"
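Cause
mysqld_exporter cannot find its configuration file.
Solution
Specify the configuration file explicitly; by default it is loaded from a relative path (.my.cnf).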
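A minimal sketch of such a configuration file follows. Every value in it is a placeholder chosen for illustration, and the file is written to a scratch location; the "--config.my-cnf" flag in the trailing comment is the mysqld_exporter option for pointing at the file.

```shell
# Write an illustrative client config; every value below is a placeholder.
mycnf="$(mktemp)"
cat > "$mycnf" <<'EOF'
[client]
host=127.0.0.1
port=3306
user=exporter
password=CHANGE_ME
EOF
chmod 600 "$mycnf"   # the file holds a password, so keep it private

# Then point the exporter at it explicitly instead of relying on ./.my.cnf:
# ./mysqld_exporter --config.my-cnf="$mycnf"
```

Passing an absolute path means the exporter no longer depends on the directory it is started from.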
Q16: Error 1045 (28000): Access denied for user 'exporter11'@'10.0.0.42' (using password: YES)
Error message
time=2025-01-13T15:26:29.952+08:00 level=ERROR source=exporter.go:131 msg="Error opening connection to database" err="Error 1045 (28000): Access denied for user 'exporter11'@'10.0.0.42' (using password: YES)"
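Cause
Database authentication failed.
Solution
- Check that the username and password in the configuration file are correct;
- Check that the user has been granted access on the database;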
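The grant step might look like the sketch below. The user and client host are taken from the error message; the password is a placeholder, and the privilege list follows what the mysqld_exporter documentation recommends, so adjust it to your own policy.

```shell
# Write the statements to a scratch file for review before applying them.
# The password is a placeholder; user and host come from the error message.
sql="$(mktemp)"
cat > "$sql" <<'EOF'
CREATE USER IF NOT EXISTS 'exporter11'@'10.0.0.42' IDENTIFIED BY 'CHANGE_ME';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter11'@'10.0.0.42';
EOF
cat "$sql"

# Apply on the MySQL server as an administrative user, for example:
# mysql -uroot -p < "$sql"
```

The host part of the account must match the address the exporter connects from, which is why the error message names 10.0.0.42.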
Q17: ERRO[0005] Couldn't connect to redis instance (redis://localhost:6379)
Error message
ERRO[0005] Couldn't connect to redis instance (redis://localhost:6379)
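Cause
Failed to connect to Redis.
Solution
Check the exporter's parameters and confirm that they point to the correct Redis instance; here it is trying redis://localhost:6379.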
Q18: Panel plugin not found: natel-discrete-panel
Error message
Panel plugin not found: natel-discrete-panel
Cause
Grafana is missing the plugin.
Solution
Install the plugin with the grafana-cli tool, for example "grafana-cli plugins install natel-discrete-panel", and then restart Grafana.
Q19: json.Unmarshal failed invalid character '#' looking for beginning of value
Error message
2025/01/13 17:28:21 json.Unmarshal failed invalid character '#' looking for beginning of value
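Cause
JSON deserialization failed; in other words, the input could not be parsed as JSON.
Solution
Check that nginx is configured to write its output in JSON format.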
Q20: dial tcp 10.0.0.42:9115: connect: connection refused
Error message
Get "http://10.0.0.42:9115/probe?class=linux95&module=http_2xx&school=yinzhengjie&target=https%3A%2F%2Fwww.yinzhengjie.com%2F": dial tcp 10.0.0.42:9115: connect: connection refused
Cause
Failed to connect to blackbox_exporter.
Solution
- Check that blackbox_exporter is running;
- Check the network;
Q21: amtool: error: Failed to read from standard input
Error message
[root@prometheus-server32 alertmanager-0.27.0.linux-amd64]# ./amtool check-config
amtool: error: Failed to read from standard input
[root@prometheus-server32 alertmanager-0.27.0.linux-amd64]#
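Cause
No configuration file was passed to amtool, so it tried to read the configuration from standard input.
Solution
As the help output shows, the configuration file path must be given, for example "./amtool check-config alertmanager.yml".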
Q22: FAILED: "yinzhengjie-linux95-rules.yml" does not point to an existing file
Error message
[root@prometheus-server31 prometheus-2.53.3.linux-amd64]# ./promtool check config prometheus.yml
Checking prometheus.yml
FAILED: "yinzhengjie-linux95-rules.yml" does not point to an existing file
[root@prometheus-server31 prometheus-2.53.3.linux-amd64]#
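Cause
Prometheus cannot find the rule file referenced in the configuration.
Solution
Create the rule file, or correct its path under rule_files.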
Q23: *email.loginAuth auth: 535 Login Fail. Please enter your authorization code to login. More information in https://service.mail.qq.com/detail/0/53
Error message
Nov 14 19:26:22 prometheus-server33 alertmanager[18115]: ts=2024-11-14T11:26:22.723Z caller=notify.go:848 level=warn component=dispatcher receiver=email integration=email[0] aggrGroup="{}:{alertname=\"yinzhengjie_mysqld_exporter-alert\"}" msg="Notify attempt failed, will retry later" attempts=1 err="*email.loginAuth auth: 535 Login Fail. Please enter your authorization code to login. More information in https://service.mail.qq.com/detail/0/53"
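Cause
Authentication with the mail authorization code failed; check whether the authorization code has expired.
Solution
Generate a new authorization code.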
Q24: VictoriaMetrics Enterprise license is required
Error message
[root@prometheus-server33 ~]# journalctl -u victoria-metrics.service -f
...
Nov 14 12:03:28 prometheus-server33 victoria-metrics-prod[16999]: 2024-11-14T04:03:28.576Z error VictoriaMetrics/lib/license/copyrights.go:33 VictoriaMetrics Enterprise license is required. Please obtain it at https://victoriametrics.com/products/enterprise/trial/ and pass it via either -license or -licenseFile command-line flags. See https://docs.victoriametrics.com/enterprise/
Cause
From the 97 LTS release onward, an Enterprise license is required.
Solution
Use the 93 LTS release instead, since 97 LTS appears to require an enterprise license.
Q25: etcdserver: no leader
Error message
[root@node-exporter42 ~]# etcdctl --endpoints="10.0.0.41:2379,10.0.0.42:2379,10.0.0.43:2379" --cacert=/yinzhengjie/certs/etcd/etcd-ca.pem --cert=/yinzhengjie/certs/etcd/etcd-server.pem --key=/yinzhengjie/certs/etcd/etcd-server-key.pem endpoint status --write-out=table
...
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |        ERRORS         |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| 10.0.0.42:2379 | 18f972748ec1bd96 |  3.5.17 |   20 kB |     false |      false |         3 |         11 |                 11 | etcdserver: no leader |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
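Cause
The etcd cluster has no leader.
Solution
Check that the cluster is healthy. An etcd cluster needs more than half of its members alive to elect a leader and work normally.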
Q26: error reading server preface: EOF
Error message
{"level":"warn","ts":"2025-01-15T16:12:28.282245+0800","logger":"etcd-client","caller":"v3@v3.5.17/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000aa000/10.0.0.41:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"error reading server preface: EOF\""}
Failed to get the status of endpoint 10.0.0.42:2379 (context deadline exceeded)
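Cause
The connection to etcd failed because no certificate information was specified.
Solution
Specify the etcd certificate files when connecting (--cacert, --cert and --key).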
Q27: snapshot must be requested to one selected node, not multiple
Error message
[root@node-exporter42 ~]# etcdctl snapshot save /tmp/yinzhengjie-etcd-`date +%F`.backup
Error: snapshot must be requested to one selected node, not multiple [10.0.0.41:2379 10.0.0.42:2379 10.0.0.43:2379]
[root@node-exporter42 ~]#
Cause
An etcd snapshot must be taken from one specific node; multiple endpoints cannot be specified at once.
Solution
Connect to a single etcd node.
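A small sketch of how to pick a single endpoint from a comma-separated list is shown below. The helper name first_endpoint is hypothetical; the endpoint list and certificate paths in the trailing comment are the ones used elsewhere in this guide.

```shell
# Hypothetical helper: keep only the first entry of a comma-separated list.
first_endpoint() {
  printf '%s\n' "${1%%,*}"
}

ENDPOINTS="10.0.0.41:2379,10.0.0.42:2379,10.0.0.43:2379"
first_endpoint "$ENDPOINTS"   # prints 10.0.0.41:2379

# Use the single endpoint for the snapshot, for example:
# etcdctl --endpoints="$(first_endpoint "$ENDPOINTS")" \
#   --cacert=/yinzhengjie/certs/etcd/etcd-ca.pem \
#   --cert=/yinzhengjie/certs/etcd/etcd-server.pem \
#   --key=/yinzhengjie/certs/etcd/etcd-server-key.pem \
#   snapshot save /tmp/yinzhengjie-etcd-`date +%F`.backup
```

Passing --endpoints explicitly overrides any multi-node list that etcdctl would otherwise pick up, for instance from the ETCDCTL_ENDPOINTS environment variable.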
Q28: data-dir "/var/lib/etcd/" not empty or could not be read
Error message
[root@prometheus-server32 ~]# etcdctl snapshot restore yinzhengjie-etcd-2025-01-15.backup --data-dir=/var/lib/etcd/
Deprecated: Use `etcdutl snapshot restore` instead.
Error: data-dir "/var/lib/etcd/" not empty or could not be read
[root@prometheus-server32 ~]#
Cause
The target data directory must be empty when restoring etcd data.
Solution
Point --data-dir at a directory that does not exist yet.
Q29: failed to verify certificate: x509: certificate signed by unknown authority
Error message
Get "https://10.0.0.41:2379/metrics": tls: failed to verify certificate: x509: certificate signed by unknown authority
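Cause
Prometheus must present certificate files when connecting to the etcd cluster; without them it does not trust the CA that signed the etcd certificate.
Solution
Add a "tls_config" section to the scrape job.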
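A sketch of what such a scrape job could look like is given below. The job name "etcd" is hypothetical, and the certificate paths are the ones used with etcdctl earlier in this guide; it assumes those files have been copied to the Prometheus server. The fragment is written to a scratch file purely for illustration.

```shell
# Write an illustrative scrape job with TLS settings to a scratch file.
# The job name is hypothetical; the certificate files must exist on the
# Prometheus server at the given paths.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
scrape_configs:
  - job_name: "etcd"
    scheme: https
    tls_config:
      ca_file: /yinzhengjie/certs/etcd/etcd-ca.pem
      cert_file: /yinzhengjie/certs/etcd/etcd-server.pem
      key_file: /yinzhengjie/certs/etcd/etcd-server-key.pem
    static_configs:
      - targets:
          - 10.0.0.41:2379
          - 10.0.0.42:2379
          - 10.0.0.43:2379
EOF
cat "$cfg"
```

Here ca_file lets Prometheus verify the etcd server certificate, while cert_file and key_file are only needed when etcd requires client certificate authentication.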
This article is from 博客园 (cnblogs), by 尹正杰. When reposting, please cite the original link: https://www.cnblogs.com/yinzhengjie/p/13946740.html