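A Guide to 30+ Common Prometheus Errors

                                              Author: 尹正杰

Copyright notice: this is an original work. Reproduction is not permitted, and violations will be pursued legally.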


Q1: "open prometheus.yml: no such file or directory"

Error message

[root@prometheus-server31 prometheus-2.53.3.linux-amd64]# ./prometheus 
ts=2025-01-10T01:44:41.515Z caller=main.go:537 level=error msg="Error loading config (--config.file=prometheus.yml)" file=/root/prometheus-2.53.3.linux-amd64/prometheus.yml err="open prometheus.yml: no such file or directory"
[root@prometheus-server31 prometheus-2.53.3.linux-amd64]# 

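Cause

	Prometheus cannot find its configuration file, prometheus.yml. Without --config.file it looks for prometheus.yml in the current working directory.

Solution

	Pass the path to the configuration file with --config.file.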
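
As a minimal sketch of this fix, the snippet below checks that a configuration file exists before handing it to --config.file. The helper name prometheus_start_cmd is an assumption made for illustration, and the example path is simply the one shown in the log.

```shell
# Hypothetical helper: print the start command only if the config file exists.
prometheus_start_cmd() {
    cfg="$1"
    if [ ! -f "$cfg" ]; then
        echo "config file not found: $cfg" >&2
        return 1
    fi
    # An absolute path makes the working directory irrelevant.
    echo "./prometheus --config.file=$cfg"
}

# Example (path assumed from the log above):
# prometheus_start_cmd /root/prometheus-2.53.3.linux-amd64/prometheus.yml
```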

Q2: dial tcp 10.0.0.44:9100: connect: no route to host

Error message

Get "http://10.0.0.44:9100/metrics": dial tcp 10.0.0.44:9100: connect: no route to host

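Cause

	The Prometheus server cannot reach the scrape target. The usual suspects are the network (host down, routing, or a firewall) or a wrong address in the configuration file.

Solution

	- Check that the configuration is correct, for example that the IP address is not mistyped.
	- Check connectivity to the target node.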
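
One way to test connectivity from the Prometheus server is a plain TCP probe. The sketch below assumes bash and the coreutils timeout command are available; probe_tcp is a hypothetical helper name.

```shell
# Hypothetical helper: return 0 if a TCP connection to host:port succeeds
# within 3 seconds. Uses bash's /dev/tcp pseudo-device, so bash is invoked
# explicitly.
probe_tcp() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example with the target from the error above:
# probe_tcp 10.0.0.44 9100 && echo reachable || echo unreachable
```

The same probe applies to the other "connection refused" entries in this guide, with the host and port swapped in.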

Q3: grafana-enterprise depends on musl; however: Package musl is not installed.

Error message

dpkg: dependency problems prevent configuration of grafana-enterprise:
 grafana-enterprise depends on musl; however:
  Package musl is not installed.

Cause

	The dependency package musl is not installed.

Solution

	Install the musl package with apt.

Q4: failed to retrieve cluster info from ES

Error message

level=error ts=2025-01-10T09:14:06.608775966Z caller=clusterinfo.go:188 msg="failed to retrieve cluster info from ES" err="Get \"http://localhost:9200/\": EOF"

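Cause

	The exporter failed to connect to the ES cluster.

Solution

	Check that the credentials used to authenticate against the ES cluster are correct. An EOF on an http:// URL may also mean that the cluster only serves HTTPS, so check the scheme in the exporter's URL as well.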

Q5: failed to verify certificate: x509: cannot validate certificate for 10.0.0.93 because it doesn't contain any IP SANs

Error message

level=error ts=2025-01-10T09:16:13.647741728Z caller=clusterinfo.go:267 msg="failed to get cluster info" err="Get \"https://10.0.0.93:9200/\": tls: failed to verify certificate: x509: cannot validate certificate for 10.0.0.93 because it doesn't contain any IP SANs"

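Cause

	Certificate verification failed: the exporter connects by IP address (10.0.0.93), but the server certificate contains no IP SAN (Subject Alternative Name) entries.

Solution

	Skip certificate verification. A stricter alternative is to reissue the certificate with an IP SAN for 10.0.0.93.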
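
To confirm whether a certificate really lacks the IP entry, its text dump can be inspected. The sketch below is only an illustration: san_has_ip is a hypothetical helper that reads the output of openssl x509 -text on stdin, and the certificate path in the example is an assumption.

```shell
# Hypothetical helper: read `openssl x509 -text` output on stdin and
# succeed only if the certificate lists the given IP as a SAN.
san_has_ip() {
    ip_re=$(printf '%s' "$1" | sed 's/\./\\./g')   # escape dots for the regex
    grep -Eq "IP Address:${ip_re}(,|[[:space:]]|\$)"
}

# Example (certificate path is an assumption):
# openssl x509 -in /path/to/es-cert.pem -noout -text | san_has_ip 10.0.0.93
```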

Q6: HTTP Request failed with code 401

Error message

level=error ts=2025-01-10T09:17:37.016462637Z caller=clusterinfo.go:188 msg="failed to retrieve cluster info from ES" err="HTTP Request failed with code 401"

Cause

	HTTP 401 means authentication failed.

Solution

	Check that the user name and password used to connect to the ES cluster are correct.
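
elasticsearch_exporter accepts credentials embedded in its --es.uri value. The sketch below builds such a URI; es_uri is a hypothetical helper, and the user name and password in the example are placeholders.

```shell
# Hypothetical helper: build an --es.uri value with embedded credentials.
# Characters such as @ or : in the password would need percent-encoding first.
es_uri() {
    # usage: es_uri SCHEME USER PASSWORD HOST PORT
    printf '%s://%s:%s@%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example (user name and password are placeholders):
# ./elasticsearch_exporter --es.uri="$(es_uri https elastic 'CHANGE_ME' 10.0.0.93 9200)"
```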

Q7: dial tcp 10.0.0.91:9092: connect: connection refused

Error message

F0110 17:30:19.276988    2099 kafka_exporter.go:924] Error Init Kafka Client: kafka: client has run out of available brokers to talk to: dial tcp 10.0.0.91:9092: connect: connection refused

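Cause

	kafka_exporter failed to connect to the Kafka cluster.

Solution

	Check that the Kafka cluster is up and that a broker is listening on 10.0.0.91:9092.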

Q8: listen tcp :9100: bind: address already in use

Error message

ts=2025-01-10T09:42:58.329Z caller=node_exporter.go:224 level=error err="listen tcp :9100: bind: address already in use"

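Cause

	The port is already in use when the service starts.

Solution

	- Start the service on a different port.
	- Stop the old service that is holding the port.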
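
Both options start with finding out what holds the port. The sketch below filters the output of ss for a given port; listeners_on_port is a hypothetical helper, and port 9101 in the example is an arbitrary replacement port.

```shell
# Hypothetical helper: filter `ss -ltnp` output (read from stdin) down to
# the sockets listening on the given port. Column 4 is the local address.
listeners_on_port() {
    awk -v port="$1" '$4 ~ (":" port "$")'
}

# Example: see which process holds port 9100, then either stop it or
# start node_exporter on another port:
# ss -ltnp | listeners_on_port 9100
# ./node_exporter --web.listen-address=:9101
```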

Q9: line 34: field file_sd_config not found in type config.ScrapeConfig

Error message

[root@prometheus-server31 ~]# !curl
curl -X POST 10.0.0.31:9090/-/reload
failed to reload config: couldn't load configuration (--config.file="/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml"): parsing YAML file /yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml: yaml: unmarshal errors:
  line 34: field file_sd_config not found in type config.ScrapeConfig
[root@prometheus-server31 ~]# 

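Cause

	Line 34 of the Prometheus configuration file is wrong: file_sd_config is not a valid field of a scrape config.

Solution

	Check the configuration syntax. The key is missing its trailing "s"; the correct name is "file_sd_configs".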

Q10: line 35: cannot unmarshal !!map into []*file.SDConfig

Error message

[root@prometheus-server31 ~]# curl -X POST 10.0.0.31:9090/-/reload
failed to reload config: couldn't load configuration (--config.file="/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml"): parsing YAML file /yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/prometheus.yml: yaml: unmarshal errors:
  line 35: cannot unmarshal !!map into []*file.SDConfig
[root@prometheus-server31 ~]# 

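Cause

	The Prometheus configuration is wrong at line 35: a mapping was supplied where a list is expected. The value of file_sd_configs must be a list of entries ([]*file.SDConfig).

Solution

	Check the configuration around line 35 and write the value of file_sd_configs as a list, with each entry introduced by "-".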
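
For reference, the shape that avoids both this error and the misspelled key from Q9 looks roughly like the sketch below. The job name and the file glob are placeholders, not the author's actual configuration.

```shell
# Write a minimal scrape config to a temporary file
# (job name and glob are placeholders).
example_cfg=$(mktemp)
cat > "$example_cfg" <<'EOF'
scrape_configs:
  - job_name: "example-file-sd"
    file_sd_configs:          # plural key; its value is a list
      - files:                # each list item starts with "-"
          - "sd/*.json"
EOF

# If promtool is available, it can validate the result:
# promtool check config "$example_cfg"
```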

Q11: Error reading file

Error message

[root@prometheus-server31 ~]# tail -100f  /yinzhengjie/logs/prometheus/prometheus-server.log 
...
{"caller":"file.go:342","component":"discovery manager scrape","config":"yinzhengjie-file-sd-json-node_exporter","discovery":"file","err":"invalid character ']' looking for beginning of value","level":"error","msg":"Error reading file","path":"/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/linux95-hosts.json","ts":"2025-01-13T01:54:36.637Z"}
...

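Cause

	Prometheus failed to parse the service discovery file "/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/linux95-hosts.json": it is not valid JSON.

Solution

	Fix the JSON syntax in that file. The message "invalid character ']' looking for beginning of value" typically points to a trailing comma before a closing "]".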
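
A trailing comma is a common way to produce this message. The sketch below is a rough heuristic for spotting one in a file_sd JSON file; has_trailing_comma is a hypothetical helper, and a real JSON parser remains the more reliable check.

```shell
# Heuristic sketch: flag a comma that is immediately followed by ] or }
# once whitespace is removed. A real JSON parser (for example `jq . FILE`)
# is more reliable; this check can misfire on such sequences inside strings.
has_trailing_comma() {
    tr -d ' \t\r\n' < "$1" | grep -q ',[]}]'
}

# Example with the file named in the log above:
# has_trailing_comma /yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/linux95-hosts.json \
#     && echo "trailing comma found"
```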

Q12: Unknown service ID "prometheus-node43". Ensure that the service ID is passed, not the service name.

Error message

[root@node-exporter41 ~]# curl -X PUT http://10.0.0.41:8500/v1/agent/service/deregister/prometheus-node43
Unknown service ID "prometheus-node43". Ensure that the service ID is passed, not the service name.

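Cause

	The deregister request was sent to a different node from the one on which the service was registered.

Solution

	Register and deregister on the same node. This is a pitfall; a better fix is still to be found.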
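
When it is unclear which agent holds the registration, each agent's /v1/agent/services endpoint can be queried for the service ID. The sketch below assumes three agents; has_service_id is a hypothetical helper, and the addresses 10.0.0.42 and 10.0.0.43 are assumptions that do not appear in the error above.

```shell
# Hypothetical helper: succeed if the JSON from /v1/agent/services (on stdin)
# contains a service with the given ID.
has_service_id() {
    grep -q "\"ID\": *\"$1\""
}

# Example: ask each agent in turn and deregister on the one that owns the ID.
# for agent in 10.0.0.41 10.0.0.42 10.0.0.43; do
#     if curl -s "http://$agent:8500/v1/agent/services" | has_service_id prometheus-node43; then
#         curl -X PUT "http://$agent:8500/v1/agent/service/deregister/prometheus-node43"
#     fi
# done
```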

Q13: WARNING: file "/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/*.yaml" for file_sd in scrape job "yinzhengjie-file-sd-yaml-node_exporter" does not exist

Error message

[root@prometheus-server32 prometheus-2.53.3.linux-amd64]# ./promtool check config prometheus.yml
Checking prometheus.yml
  WARNING: file "/yinzhengjie/softwares/prometheus-2.53.3.linux-amd64/sd/*.yaml" for file_sd in scrape job "yinzhengjie-file-sd-yaml-node_exporter" does not exist
 SUCCESS: prometheus.yml is valid prometheus config file syntax

[root@prometheus-server32 prometheus-2.53.3.linux-amd64]# 

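Cause

	This is only a warning: no file currently matches the service discovery path. It can be ignored.

Solution

	Optionally, create a matching file as the message suggests.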

Q14: dial tcp 127.0.0.1:3306: connect: connection refused

Error message

time=2025-01-13T15:21:45.510+08:00 level=ERROR source=exporter.go:131 msg="Error opening connection to database" err="dial tcp 127.0.0.1:3306: connect: connection refused"

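Cause

	mysqld_exporter failed to connect to the database.

Solution

	Check the database connection settings, and that MySQL is actually listening on the address the exporter uses.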

Q15: no configuration found

Error message

time=2025-01-13T15:24:29.911+08:00 level=INFO source=mysqld_exporter.go:244 msg="Error parsing host config" file=.my.cnf err="no configuration found"

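Cause

	mysqld_exporter cannot find its configuration file.

Solution

	Specify the configuration file explicitly. By default, .my.cnf is loaded from a relative path.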
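
A minimal sketch of such a file and how it might be passed to the exporter follows. write_exporter_cnf is a hypothetical helper, and the file path, user name, password, and host in the example are placeholders.

```shell
# Hypothetical helper: write a [client] section for mysqld_exporter.
write_exporter_cnf() {
    # usage: write_exporter_cnf FILE USER PASSWORD HOST
    cat > "$1" <<EOF
[client]
user=$2
password=$3
host=$4
EOF
    chmod 600 "$1"   # the file holds a password
}

# Example (file path and credentials are placeholders):
# write_exporter_cnf /etc/mysqld_exporter/.my.cnf exporter 'CHANGE_ME' 10.0.0.41
# ./mysqld_exporter --config.my-cnf=/etc/mysqld_exporter/.my.cnf
```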

Q16: Error 1045 (28000): Access denied for user 'exporter11'@'10.0.0.42' (using password: YES)

Error message

time=2025-01-13T15:26:29.952+08:00 level=ERROR source=exporter.go:131 msg="Error opening connection to database" err="Error 1045 (28000): Access denied for user 'exporter11'@'10.0.0.42' (using password: YES)"

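Cause

	Authentication against the database failed.

Solution

	- Check that the user name and password in the configuration file are correct.
	- Check that the database has granted this user access from the exporter's host.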

Q17: ERRO[0005] Couldn't connect to redis instance (redis://localhost:6379)

Error message

ERRO[0005] Couldn't connect to redis instance (redis://localhost:6379) 

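Cause

	The exporter failed to connect to the Redis database.

Solution

	Review the exporter's parameters and make sure they point to the correct Redis instance.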

Q18: Panel plugin not found: natel-discrete-panel

Error message

	Panel plugin not found: natel-discrete-panel

Cause

	Grafana is missing the required plugin.

Solution

	Install the plugin with the Grafana CLI, for example "grafana-cli plugins install natel-discrete-panel".

Q19: json.Unmarshal failed invalid character '#' looking for beginning of value

Error message

2025/01/13 17:28:21 json.Unmarshal failed invalid character '#' looking for beginning of value

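Cause

	JSON deserialization failed; in other words, the input could not be parsed as JSON.

Solution

	Check that nginx is configured to emit its output in JSON format.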

Q20: dial tcp 10.0.0.42:9115: connect: connection refused

Error message

Get "http://10.0.0.42:9115/probe?class=linux95&module=http_2xx&school=yinzhengjie&target=https%3A%2F%2Fwww.yinzhengjie.com%2F": dial tcp 10.0.0.42:9115: connect: connection refused

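Cause

	Prometheus failed to connect to blackbox_exporter.

Solution

	- Check that blackbox_exporter is running.
	- Check the network between Prometheus and 10.0.0.42:9115.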

Q21: amtool: error: Failed to read from standard input

Error message

[root@prometheus-server32 alertmanager-0.27.0.linux-amd64]# ./amtool check-config 
amtool: error: Failed to read from standard input
[root@prometheus-server32 alertmanager-0.27.0.linux-amd64]#

Cause

	amtool was run without a configuration file.

Solution

	Pass the path to the configuration file, as the help output indicates.

Q22: FAILED: "yinzhengjie-linux95-rules.yml" does not point to an existing file

Error message

[root@prometheus-server31 prometheus-2.53.3.linux-amd64]# ./promtool check config  prometheus.yml
Checking prometheus.yml
  FAILED: "yinzhengjie-linux95-rules.yml" does not point to an existing file

[root@prometheus-server31 prometheus-2.53.3.linux-amd64]# 

Cause

	Prometheus cannot find the rule file.

Solution

	Create the rule file at the path named in the configuration.

Q23: *email.loginAuth auth: 535 Login Fail. Please enter your authorization code to login. More information in https://service.mail.qq.com/detail/0/53

Error message

Nov 14 19:26:22 prometheus-server33 alertmanager[18115]: ts=2024-11-14T11:26:22.723Z caller=notify.go:848 level=warn component=dispatcher receiver=email integration=email[0] aggrGroup="{}:{alertname=\"yinzhengjie_mysqld_exporter-alert\"}" msg="Notify attempt failed, will retry later" attempts=1 err="*email.loginAuth auth: 535 Login Fail. Please enter your authorization code to login. More information in https://service.mail.qq.com/detail/0/53"

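Cause

	Authentication with the mailbox authorization code failed. Check whether the authorization code has expired.

Solution

	Generate a new authorization code.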

Q24: VictoriaMetrics Enterprise license is required

Error message

[root@prometheus-server33 ~]# journalctl -u victoria-metrics.service  -f
...
Nov 14 12:03:28 prometheus-server33 victoria-metrics-prod[16999]: 2024-11-14T04:03:28.576Z        error        VictoriaMetrics/lib/license/copyrights.go:33        VictoriaMetrics Enterprise license is required. Please obtain it at https://victoriametrics.com/products/enterprise/trial/ and pass it via either -license or -licenseFile command-line flags. See https://docs.victoriametrics.com/enterprise/

Cause

	From the 97 LTS line onward, an Enterprise license is required.

Solution

	Prefer the 93 LTS release, because 97 LTS appears to require an enterprise license.

Q25: etcdserver: no leader

Error message

[root@node-exporter42 ~]# etcdctl --endpoints="10.0.0.41:2379,10.0.0.42:2379,10.0.0.43:2379" --cacert=/yinzhengjie/certs/etcd/etcd-ca.pem --cert=/yinzhengjie/certs/etcd/etcd-server.pem --key=/yinzhengjie/certs/etcd/etcd-server-key.pem  endpoint status --write-out=table
...
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |        ERRORS         |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| 10.0.0.42:2379 | 18f972748ec1bd96 |  3.5.17 |   20 kB |     false |      false |         3 |         11 |                 11 | etcdserver: no leader |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+

Cause

	The etcd cluster has no leader.

Solution

	Check that the cluster is healthy. An etcd cluster needs more than half of its members alive to work.

Q26: error reading server preface: EOF

Error message

{"level":"warn","ts":"2025-01-15T16:12:28.282245+0800","logger":"etcd-client","caller":"v3@v3.5.17/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000aa000/10.0.0.41:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"error reading server preface: EOF\""}
Failed to get the status of endpoint 10.0.0.42:2379 (context deadline exceeded)

Cause

	The connection to etcd failed because no certificate options were given.

Solution

	Specify the etcd certificate files when connecting.
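
Besides the command-line flags, etcdctl also reads the certificate settings from environment variables. The sketch below reuses the endpoint and certificate paths that appear in Q25; adjust them if your layout differs.

```shell
# etcdctl reads these environment variables in place of the matching flags
# (--endpoints, --cacert, --cert, --key). Values are the ones used in Q25.
export ETCDCTL_ENDPOINTS="10.0.0.41:2379"
export ETCDCTL_CACERT=/yinzhengjie/certs/etcd/etcd-ca.pem
export ETCDCTL_CERT=/yinzhengjie/certs/etcd/etcd-server.pem
export ETCDCTL_KEY=/yinzhengjie/certs/etcd/etcd-server-key.pem

# With the variables set, the status query no longer needs the flags:
# etcdctl endpoint status --write-out=table
```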

Q27: snapshot must be requested to one selected node, not multiple

Error message

[root@node-exporter42 ~]# etcdctl snapshot save /tmp/yinzhengjie-etcd-`date +%F`.backup
Error: snapshot must be requested to one selected node, not multiple [10.0.0.41:2379 10.0.0.42:2379 10.0.0.43:2379]
[root@node-exporter42 ~]# 

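Cause

	An etcd snapshot must be requested from one specific node; several endpoints cannot be given at once.

Solution

	Connect to a single etcd node when taking the snapshot.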
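
If the endpoint list is kept as one comma-separated string, the first entry can be picked out for the snapshot. This is only a sketch: first_endpoint is a hypothetical helper, and the snapshot file name is a placeholder.

```shell
# Hypothetical helper: take the first entry of a comma-separated endpoint list.
first_endpoint() {
    printf '%s\n' "${1%%,*}"
}

# Example: snapshot from one member only (the file name is a placeholder).
# etcdctl --endpoints="$(first_endpoint "10.0.0.41:2379,10.0.0.42:2379,10.0.0.43:2379")" \
#     snapshot save /tmp/etcd-snapshot.backup
```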

Q28: data-dir "/var/lib/etcd/" not empty or could not be read

Error message

[root@prometheus-server32 ~]# etcdctl snapshot restore yinzhengjie-etcd-2025-01-15.backup --data-dir=/var/lib/etcd/
Deprecated: Use `etcdutl snapshot restore` instead.

Error: data-dir "/var/lib/etcd/" not empty or could not be read
[root@prometheus-server32 ~]# 

Cause

	The target data directory must be empty when restoring etcd data.

Solution

	Point --data-dir at a directory that does not exist yet.
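
A small guard can catch a non-empty target before the restore runs. In the sketch below, restore_dir_ok is a hypothetical helper and /var/lib/etcd-restore is a placeholder directory; etcdutl is used because the output above marks etcdctl snapshot restore as deprecated.

```shell
# Hypothetical helper: succeed if the directory is missing or contains nothing.
restore_dir_ok() {
    [ ! -e "$1" ] && return 0
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Example (target directory is a placeholder):
# restore_dir_ok /var/lib/etcd-restore && \
#     etcdutl snapshot restore yinzhengjie-etcd-2025-01-15.backup --data-dir=/var/lib/etcd-restore
```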

Q29: failed to verify certificate: x509: certificate signed by unknown authority

Error message

Get "https://10.0.0.41:2379/metrics": tls: failed to verify certificate: x509: certificate signed by unknown authority

Cause

	Prometheus must present certificate files when it scrapes the etcd cluster.

Solution

	Add a "tls_config" section with the certificates to the scrape job.
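
A scrape job with tls_config might look like the sketch below. The job name is a placeholder, and the certificate paths are assumed to be the same ones passed to etcdctl in Q25.

```shell
# Write an example scrape job to a temporary file
# (job name is a placeholder; cert paths are taken from Q25).
etcd_job=$(mktemp)
cat > "$etcd_job" <<'EOF'
scrape_configs:
  - job_name: "etcd"
    scheme: https
    tls_config:
      ca_file: /yinzhengjie/certs/etcd/etcd-ca.pem
      cert_file: /yinzhengjie/certs/etcd/etcd-server.pem
      key_file: /yinzhengjie/certs/etcd/etcd-server-key.pem
    static_configs:
      - targets: ["10.0.0.41:2379"]
EOF

# If promtool is available, it can validate the result:
# promtool check config "$etcd_job"
```
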
Posted 2020-11-09 05:27 by 尹正杰