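Analysis of a system crash caused by GitLab Prometheus data files filling the disk
Symptoms
Services on the server were failing and commands stopped working. Running df -h showed one filesystem at 100% usage (/dev/mapper/centos-root in the output below).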
[root@localhost ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 12K 16G 1% /dev/shm
tmpfs 16G 67M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/mapper/centos-root 50G 50G 20K 100% /
/dev/sda1 1014M 157M 858M 16% /boot
/dev/mapper/centos-home 957G 336M 957G 1% /home
tmpfs 3.2G 0 3.2G 0% /run/user/0
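The root partition was allocated only 50G, so it can easily reach 100% usage. The quick and crude fix is to enlarge the root partition, but it is still worth finding out which processes or files on this partition filled the disk.
Analysis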
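The df check above can be scripted so that a nearly full filesystem is noticed before it reaches 100%. The following is a minimal sketch, not part of the original investigation: the function name full_filesystems and the 90% threshold are assumptions chosen for illustration, and mount points containing spaces are not handled.

```shell
# Print "mountpoint usage%" for every filesystem at or above a threshold.
# Reads `df -P` style output on stdin (one line per filesystem, header first),
# so it can also be run against saved output.
# full_filesystems is a hypothetical helper name; 90 is an arbitrary default.
full_filesystems() {
  local threshold="${1:-90}"
  awk -v t="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # strip the percent sign from the Capacity column
      if (use + 0 >= t) print $6, $5
    }'
}

# Typical use:
#   df -P | full_filesystems 90
```

Reading from stdin rather than calling df inside the function keeps the filter usable on output captured from another host.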
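- Find the subdirectories or files that use the most space under the mount point of the affected device.
[root@localhost /]# du -m --max-depth=1 ./
125 ./boot
1 ./dev
301 ./home
0 ./proc
67 ./run
0 ./sys
55 ./etc
3223 ./root
35466 ./var
1 ./tmp
3877 ./usr
0 ./media
0 ./mnt
6320 ./opt
0 ./srv
4 ./public
49435 ./
The /var subdirectory uses the most space (about 35G), even though in this environment nothing under it is expected to produce large files. Descending into it and repeating the same command at each level eventually leads to a directory that holds a large number of data files.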
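Picking the largest entry out of du output by eye gets tedious when the command is repeated level by level. As a rough sketch, the output can be ranked instead. The helper name rank_du is an assumption for illustration; the -x flag in the usage comment is a GNU du option that keeps the scan on one filesystem, which matters here because /home is a separate mount.

```shell
# Rank `du -m --max-depth=1 ./` output by size, largest first.
# The summary line for the directory itself ("./" or ".") is skipped.
# rank_du is a hypothetical helper name.
rank_du() {
  local n="${1:-5}"
  awk '$2 != "./" && $2 != "."' | sort -k1,1nr | head -n "$n"
}

# Typical use, run from the directory being inspected:
#   du -xm --max-depth=1 ./ 2>/dev/null | rank_du 5
```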
[root@localhost data]# du -m --max-depth=1 ./
24806 ./wal
24806 ./
[root@localhost data]# cd wal
[root@localhost wal]# ll
total 25397168
-rw------- 1 gitlab-prometheus gitlab-prometheus 123011072 Sep 7 01:00 00000782
-rw------- 1 gitlab-prometheus gitlab-prometheus 65196032 Sep 7 04:00 00000783
-rw------- 1 gitlab-prometheus gitlab-prometheus 61538304 Sep 7 05:00 00000784
-rw------- 1 gitlab-prometheus gitlab-prometheus 123011072 Sep 7 07:00 00000785
-rw------- 1 gitlab-prometheus gitlab-prometheus 123011072 Sep 7 09:00 00000786
-rw------- 1 gitlab-prometheus gitlab-prometheus 123011072 Sep 7 11:00 00000787
.......
-rw------- 1 gitlab-prometheus gitlab-prometheus 123699200 Sep 20 23:00 00000953
-rw------- 1 gitlab-prometheus gitlab-prometheus 123699200 Sep 21 01:00 00000954
-rw------- 1 gitlab-prometheus gitlab-prometheus 64924126 Sep 21 04:00 00000955
-rw------- 1 gitlab-prometheus gitlab-prometheus 61898752 Sep 21 05:00 00000956
-rw------- 1 gitlab-prometheus gitlab-prometheus 123699200 Sep 21 07:00 00000957
-rw------- 1 gitlab-prometheus gitlab-prometheus 123699200 Sep 21 09:00 00000958
-rw------- 1 gitlab-prometheus gitlab-prometheus 133890048 Sep 22 15:57 00000968
-rw------- 1 gitlab-prometheus gitlab-prometheus 1048576 Sep 22 15:58 00000969
-rw------- 1 gitlab-prometheus gitlab-prometheus 1048576 Sep 22 15:59 00000970
-rw------- 1 gitlab-prometheus gitlab-prometheus 1081344 Sep 22 16:00 00000971
-rw------- 1 gitlab-prometheus gitlab-prometheus 1146880 Sep 22 16:01 00000972
-rw------- 1 gitlab-prometheus gitlab-prometheus 1081344 Sep 22 16:02 00000973
-rw------- 1 gitlab-prometheus gitlab-prometheus 1048576 Sep 22 16:03 00000974
-rw------- 1 gitlab-prometheus gitlab-prometheus 1048576 Sep 22 16:04 00000975
-rw------- 1 gitlab-prometheus gitlab-prometheus 1048576 Sep 22 16:05 00000976
-rw------- 1 gitlab-prometheus gitlab-prometheus 55672832 Sep 22 17:00 00000977
-rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 22 19:00 00000978
-rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 22 21:00 00000979
-rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 22 23:00 00000980
-rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 23 01:00 00000981
-rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 23 03:00 00000982
-rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 23 05:00 00000983
......
-rw------- 1 gitlab-prometheus gitlab-prometheus 134217728 Sep 28 05:37 00001042
-rw------- 1 gitlab-prometheus gitlab-prometheus 66461696 Sep 28 08:00 00001043
drwx------ 2 gitlab-prometheus gitlab-prometheus 22 Sep 7 05:00 checkpoint.000781
- Identify which process creates these files. The directory that holds them (/var/opt/gitlab/prometheus/data/wal) belongs to the Prometheus service bundled with the GitLab installation, so the files are presumably written periodically by that service's process. Check the related processes with ps.
[root@localhost wal]# ps -ef | grep prometheus
root 2495 1 0 Aug17 ? 00:02:32 runsvdir -P /opt/gitlab/service log: ab/sidekiq: out of disk space svlogd: pausing: unable to write to current: /var/log/gitlab/prometheus: out of disk space svlogd: pausing: unable to write to current: /var/log/gitlab/alertmanager: out of disk space svlogd: pausing: unable to create new current: /var/log/gitlab/logrotate: out of disk space svlogd: pausing: unable to create new current: /var/log/gitlab/gitaly: out of disk space
root 2508 2495 0 Aug17 ? 00:00:00 runsv prometheus
git 2521 2502 0 Aug17 ? 00:25:13 /opt/gitlab/embedded/bin/gitlab-workhorse -listenNetwork unix -listenUmask 0 -listenAddr /var/opt/gitlab/gitlab-workhorse/socket -authBackend http://localhost:8080 -authSocket /var/opt/gitlab/gitlab-rails/sockets/gitlab.socket -documentRoot /opt/gitlab/embedded/service/gitlab-rails/public -pprofListenAddr -prometheusListenAddr localhost:9229 -secretPath /opt/gitlab/embedded/service/gitlab-rails/.gitlab_workhorse_secret -logFormat json -config config.toml
root 2522 2508 0 Aug17 ? 00:00:25 svlogd -tt /var/log/gitlab/prometheus
gitlab-+ 2528 2508 1 Aug17 ? 13:32:09 /opt/gitlab/embedded/bin/prometheus --web.listen-address=localhost:9090 --storage.tsdb.path=/var/opt/gitlab/prometheus/data --config.file=/var/opt/gitlab/prometheus/prometheus.yml
root 6588 5915 0 09:44 pts/0 00:00:00 grep --color=auto prometheus
- Check the status of the GitLab Prometheus service. Prometheus is running normally, which rules out a service failure as the cause.
[root@localhost ~]# gitlab-ctl status
run: alertmanager: (pid 2532) 3632221s; run: log: (pid 2516) 3632221s
run: gitaly: (pid 2541) 3632221s; run: log: (pid 2540) 3632221s
run: gitlab-exporter: (pid 2526) 3632221s; run: log: (pid 2519) 3632221s
run: gitlab-workhorse: (pid 2521) 3632221s; run: log: (pid 2513) 3632221s
run: grafana: (pid 2537) 3632221s; run: log: (pid 2530) 3632221s
run: logrotate: (pid 22545) 14193s; run: log: (pid 2525) 3632221s
run: nginx: (pid 2531) 3632221s; run: log: (pid 2524) 3632221s
run: node-exporter: (pid 2536) 3632221s; run: log: (pid 2533) 3632221s
run: postgres-exporter: (pid 2535) 3632221s; run: log: (pid 2527) 3632221s
run: postgresql: (pid 2518) 3632221s; run: log: (pid 2512) 3632221s
run: prometheus: (pid 2528) 3632221s; run: log: (pid 2522) 3632221s
run: puma: (pid 2517) 3632221s; run: log: (pid 2514) 3632221s
run: redis: (pid 2539) 3632222s; run: log: (pid 2538) 3632222s
run: redis-exporter: (pid 2529) 3632222s; run: log: (pid 2523) 3632222s
run: sidekiq: (pid 2520) 3632222s; run: log: (pid 2515) 3632222s
- Inspect the GitLab Prometheus data storage settings: open the configuration file with vi /etc/gitlab/gitlab.rb and find the Prometheus section.
################################################################################
## Prometheus
##! Docs: https://docs.gitlab.com/ee/administration/monitoring/prometheus/
################################################################################
###! **To enable only Monitoring service in this machine, uncomment
###! the line below.**
###! Docs: https://docs.gitlab.com/ee/administration/high_availability
# monitoring_role['enable'] = true
# prometheus['enable'] = true
# prometheus['monitor_kubernetes'] = true
# prometheus['username'] = 'gitlab-prometheus'
# prometheus['group'] = 'gitlab-prometheus'
# prometheus['uid'] = nil
# prometheus['gid'] = nil
# prometheus['shell'] = '/bin/sh'
# prometheus['home'] = '/var/opt/gitlab/prometheus'
# prometheus['log_directory'] = '/var/log/gitlab/prometheus'
# prometheus['rules_files'] = ['/var/opt/gitlab/prometheus/rules/*.rules']
# prometheus['scrape_interval'] = 15
# prometheus['scrape_timeout'] = 15
# prometheus['env_directory'] = '/opt/gitlab/etc/prometheus/env'
# prometheus['env'] = {
#   'SSL_CERT_DIR' => "/opt/gitlab/embedded/ssl/certs/"
# }
#
### Custom scrape configs
#
# Prometheus can scrape additional jobs via scrape_configs. The default automatically
# includes all of the exporters supported by the omnibus config.
#
# See: https://prometheus.io/docs/operating/configuration/#<scrape_config>
#
# Example:
#
# prometheus['scrape_configs'] = [
#   {
#     'job_name': 'example',
#     'static_configs' => [
#       'targets' => ['hostname:port'],
#     ],
#   },
# ]
#
### Custom alertmanager config
#
# To configure external alertmanagers, create an alertmanager config.
#
# See: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config
#
# prometheus['alertmanagers'] = [
#   {
#     'static_configs' => [
#       {
#         'targets' => [
#           'hostname:port'
#         ]
#       }
#     ]
#   }
# ]
#
### Custom Prometheus flags
#
# prometheus['flags'] = {
#   'storage.tsdb.path' => "/var/opt/gitlab/prometheus/data",
#   'storage.tsdb.retention.time' => "15d",
#   'config.file' => "/var/opt/gitlab/prometheus/prometheus.yml"
# }
##! Advanced settings. Should be changed only if absolutely needed.
# prometheus['listen_address'] = 'localhost:9090'
All of these options are commented out with #, so the default values apply. Three options are relevant here:
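- Disable Prometheus. It is only a tool that collects system monitoring data, so turning it off does not affect GitLab itself: prometheus_monitoring['enable'] = false
- Change the data path. Uncomment the option and point it at a location with more disk space: 'storage.tsdb.path' => "/var/bigdatapath"
- Shorten the data retention time. With a smaller number of days, Prometheus automatically deletes expired data: 'storage.tsdb.retention.time' => "5d"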
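Before or after changing any of these options, it can help to confirm what the running Prometheus process was actually started with. The sketch below pulls a flag value out of a command line such as the one in the ps output. The helper name flag_value is an assumption for illustration, and the 15d fallback simply mirrors the default shown in the commented-out gitlab.rb section.

```shell
# Print the value of a --name=value flag from a command line read on stdin,
# or a fallback when the flag is absent.
# flag_value is a hypothetical helper name. Dots in the flag name are treated
# as regex wildcards by sed, which is harmless for this purpose.
flag_value() {
  local name="$1" fallback="$2" value
  value="$(tr ' ' '\n' | sed -n "s/^--${name}=//p" | head -n 1)"
  printf '%s\n' "${value:-$fallback}"
}

# Typical use (assumes the process name is "prometheus"):
#   ps -o args= -C prometheus | flag_value storage.tsdb.retention.time 15d
```

If the retention flag does not appear on the command line, the process is running with its default retention.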
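Note that although any of these three changes would work around the problem in this case, a healthy GitLab installation used for comparison runs with the same default configuration. The root cause therefore still needs to be found, and none of the three workarounds is applied for now.
If you do change these settings, save the file and restart GitLab for the change to take effect: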
gitlab-ctl stop
gitlab-ctl reconfigure
gitlab-ctl start
- Before going further, review how Prometheus stores data and how that data is backed up and restored.
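- With that background, we know that Prometheus first writes incoming data to files in the wal (write-ahead log) directory, and that once these files reach a certain size or age, their contents are persisted to disk under the data directory. In this environment the wal directory holds a large number of files whose contents were never persisted. At some points in time new files also stop appearing at the expected two-hour interval (described in the material above), and file sizes fluctuate widely, with some files of only about 1M. This suggests that writes were already failing: the disk could no longer provide room for normally sized files, which points to insufficient storage space.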
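The observations above (many files, large total size, some unusually small files) can be summarised from a directory listing rather than read off by eye. This is a minimal sketch: the helper name wal_summary and the 2 MiB cutoff for a "small" file are assumptions for illustration, not values taken from Prometheus.

```shell
# Summarise `ls -l` output for the wal directory, read on stdin:
# number of numbered files, their total size in MiB, and how many are
# smaller than a cutoff in bytes.
# wal_summary is a hypothetical helper name; 2097152 (2 MiB) is an arbitrary cutoff.
wal_summary() {
  local small="${1:-2097152}"
  awk -v small="$small" '
    # regular files whose name is purely numeric; skips "total" and checkpoint dirs
    $1 ~ /^-/ && $NF ~ /^[0-9]+$/ {
      count++
      bytes += $5
      if ($5 < small) tiny++
    }
    END { printf "segments=%d total_mib=%d small=%d\n", count, bytes / 1048576, tiny }'
}

# Typical use:
#   ls -l /var/opt/gitlab/prometheus/data/wal | wal_summary
```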
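- The GitLab Prometheus log (/var/log/gitlab/prometheus/current) confirms this: the operations that persist data report many "no space left on device" errors. The files in the wal directory keep accumulating but cannot be persisted to the data directory, so disk usage climbs until the disk is full. This establishes the root cause.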
2020-09-27_16:13:13.83242 level=error ts=2020-09-27T16:13:13.832Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
2020-09-27_16:14:13.88201 level=error ts=2020-09-27T16:14:13.881Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
2020-09-27_16:15:13.93866 level=error ts=2020-09-27T16:15:13.938Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
2020-09-27_16:16:13.99965 level=error ts=2020-09-27T16:16:13.999Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
2020-09-27_16:17:14.05193 level=error ts=2020-09-27T16:17:14.051Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
2020-09-27_16:18:14.11222 level=error ts=2020-09-27T16:18:14.112Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
2020-09-27_16:19:14.16503 level=error ts=2020-09-27T16:19:14.164Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
2020-09-27_16:20:14.21062 level=error ts=2020-09-27T16:20:14.210Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
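To see when these errors started, the log can be reduced to a per-day count. The sketch below assumes each line begins with a date in the YYYY-MM-DD form shown in the excerpt; the helper name enospc_per_day is hypothetical.

```shell
# Count "no space left on device" log lines per day.
# Assumes the first field starts with a YYYY-MM-DD date, as in the excerpt.
# enospc_per_day is a hypothetical helper name.
enospc_per_day() {
  awk '/no space left on device/ { day[substr($1, 1, 10)]++ }
       END { for (d in day) print d, day[d] }' | sort
}

# Typical use:
#   enospc_per_day < /var/log/gitlab/prometheus/current
```

The first day with a non-zero count gives a rough indication of when persisting data began to fail.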
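Recovery
Delete the older files in the wal directory, then restart the GitLab services. Files are written normally again, and two hours later a persisted block, 01EK9H9XRQYRPJ6V5A1WFZVSHC, is generated automatically in the data directory (/var/opt/gitlab/prometheus/data, which keeps 15 days of data by default).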
[root@localhost gitlab]# cd /var/opt/gitlab/prometheus/data/
[root@localhost data]# ll
total 20
drwx------ 3 gitlab-prometheus gitlab-prometheus 68 Sep 28 13:00 01EK9H9XRQYRPJ6V5A1WFZVSHC
-rw------- 1 gitlab-prometheus gitlab-prometheus 0 Jul 7 18:44 lock
-rw------- 1 gitlab-prometheus gitlab-prometheus 20001 Sep 28 14:41 queries.active
drwx------ 3 gitlab-prometheus gitlab-prometheus 79 Sep 28 13:00 wal
[root@localhost data]# pwd
/var/opt/gitlab/prometheus/data
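Before deleting anything from the wal directory during a cleanup like the one described above, it is safer to list the candidates first. The following is a dry-run sketch only: the helper name old_wal_segments and the 7-day default are assumptions for illustration. Keep in mind that files in the wal directory hold data that has not yet been persisted, so removing them discards that data.

```shell
# List (do not delete) numbered files in a wal directory that were last
# modified more than N days ago. Checkpoint directories are not matched.
# old_wal_segments is a hypothetical helper name; 7 days is an arbitrary default.
old_wal_segments() {
  local wal_dir="$1" days="${2:-7}"
  find "$wal_dir" -maxdepth 1 -type f -name '[0-9]*' -mtime +"$days" | sort
}

# Typical use (review the list before removing anything):
#   old_wal_segments /var/opt/gitlab/prometheus/data/wal 7
```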
