Analysis of a system crash caused by GitLab Prometheus data files filling the disk

Symptoms

Services on the server are misbehaving and commands fail. df -h shows one filesystem at 100% usage (/dev/mapper/centos-root in the output below):

[root@localhost ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                  16G     0   16G   0% /dev
tmpfs                     16G   12K   16G   1% /dev/shm
tmpfs                     16G   67M   16G   1% /run
tmpfs                     16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/centos-root   50G   50G   20K 100% /
/dev/sda1               1014M  157M  858M  16% /boot
/dev/mapper/centos-home  957G  336M  957G   1% /home
tmpfs                    3.2G     0  3.2G   0% /run/user/0

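The root partition was allocated only 50 GB, so it can reach 100% usage easily, and the quick-and-dirty fix would be to enlarge it. It is still worth finding out which processes or files on this partition filled the disk.

Analysis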
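A check like the df inspection above can be scripted so that a filling partition is noticed before it reaches 100%. The following is only a minimal sketch: the function name over_threshold and the 90% default are assumptions, not part of the original incident.

```shell
#!/bin/sh
# Print mount points whose usage is at or above a threshold.
# Reads `df -P` output on stdin; $1 is the threshold in percent (default 90).
# Assumption: mount points contain no spaces.
over_threshold() {
    awk -v t="${1:-90}" 'NR > 1 && $5 + 0 >= t { print $6, $5 }'
}
```

It could be run as `df -P | over_threshold 90`, for example from cron. The -P flag keeps each filesystem on a single line so the column positions stay stable.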

  1. Find the subdirectories or files that take up the most space under the mount point of the affected device.

    [root@localhost /]# du -m --max-depth=1 ./
    125     ./boot
    1       ./dev
    301     ./home
    0       ./proc
    67      ./run
    0       ./sys
    55      ./etc
    3223    ./root
    35466   ./var
    1       ./tmp
    3877    ./usr
    0       ./media
    0       ./mnt
    6320    ./opt
    0       ./srv
    4       ./public
    49435   ./
    

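    The /var subdirectory stands out at about 35 GB (35466 MB), although in this environment nothing under /var is expected to produce large files. Repeating the same command inside each subdirectory eventually leads to a directory that holds a large number of data files.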

    [root@localhost data]# du -m --max-depth=1 ./
    24806   ./wal
    24806   ./
    [root@localhost data]# cd wal
    [root@localhost wal]# ll
    total 25397168
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123011072 Sep  7 01:00 00000782
    -rw------- 1 gitlab-prometheus gitlab-prometheus  65196032 Sep  7 04:00 00000783
    -rw------- 1 gitlab-prometheus gitlab-prometheus  61538304 Sep  7 05:00 00000784
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123011072 Sep  7 07:00 00000785
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123011072 Sep  7 09:00 00000786
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123011072 Sep  7 11:00 00000787
    .......
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123699200 Sep 20 23:00 00000953
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123699200 Sep 21 01:00 00000954
    -rw------- 1 gitlab-prometheus gitlab-prometheus  64924126 Sep 21 04:00 00000955
    -rw------- 1 gitlab-prometheus gitlab-prometheus  61898752 Sep 21 05:00 00000956
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123699200 Sep 21 07:00 00000957
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123699200 Sep 21 09:00 00000958
    -rw------- 1 gitlab-prometheus gitlab-prometheus 133890048 Sep 22 15:57 00000968
    -rw------- 1 gitlab-prometheus gitlab-prometheus   1048576 Sep 22 15:58 00000969
    -rw------- 1 gitlab-prometheus gitlab-prometheus   1048576 Sep 22 15:59 00000970
    -rw------- 1 gitlab-prometheus gitlab-prometheus   1081344 Sep 22 16:00 00000971
    -rw------- 1 gitlab-prometheus gitlab-prometheus   1146880 Sep 22 16:01 00000972
    -rw------- 1 gitlab-prometheus gitlab-prometheus   1081344 Sep 22 16:02 00000973
    -rw------- 1 gitlab-prometheus gitlab-prometheus   1048576 Sep 22 16:03 00000974
    -rw------- 1 gitlab-prometheus gitlab-prometheus   1048576 Sep 22 16:04 00000975
    -rw------- 1 gitlab-prometheus gitlab-prometheus   1048576 Sep 22 16:05 00000976
    -rw------- 1 gitlab-prometheus gitlab-prometheus  55672832 Sep 22 17:00 00000977
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 22 19:00 00000978
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 22 21:00 00000979
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 22 23:00 00000980
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 23 01:00 00000981
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 23 03:00 00000982
    -rw------- 1 gitlab-prometheus gitlab-prometheus 123731968 Sep 23 05:00 00000983
    ......
    -rw------- 1 gitlab-prometheus gitlab-prometheus 134217728 Sep 28 05:37 00001042
    -rw------- 1 gitlab-prometheus gitlab-prometheus  66461696 Sep 28 08:00 00001043
    drwx------ 2 gitlab-prometheus gitlab-prometheus        22 Sep  7 05:00 checkpoint.000781
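The drill-down in step 1 is easier to read when the du output is sorted. The sketch below is one possible helper; the name top_dirs and the default of five lines are assumptions. It also passes -x, because the df output shows /home on its own filesystem, so its contents do not count against the root partition.

```shell
#!/bin/sh
# List the largest entries directly under a directory, biggest first.
# $1: directory to inspect; $2: number of lines to show (default 5).
# -x keeps du on one filesystem, so separately mounted directories are skipped.
top_dirs() {
    du -xm --max-depth=1 "$1" 2>/dev/null | sort -rn | head -n "${2:-5}"
}
```

Calling `top_dirs /`, then `top_dirs /var`, and so on follows the same path as the manual steps. The directory's own total normally sorts to the top, followed by its largest children.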
    
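  2. Determine which process produces these files. Their directory (/var/opt/gitlab/prometheus/data/wal) belongs to the Prometheus service bundled with GitLab, so the files are presumably written periodically by that service's process. Check the related processes with ps.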

    [root@localhost wal]# ps -ef | grep prometheus
    root      2495     1  0 Aug17 ?        00:02:32 runsvdir -P /opt/gitlab/service log: ab/sidekiq: out of disk space svlogd: pausing: unable to write to current: /var/log/gitlab/prometheus: out of disk space svlogd: pausing: unable to write to current: /var/log/gitlab/alertmanager: out of disk space svlogd: pausing: unable to create new current: /var/log/gitlab/logrotate: out of disk space svlogd: pausing: unable to create new current: /var/log/gitlab/gitaly: out of disk space
    root      2508  2495  0 Aug17 ?        00:00:00 runsv prometheus
    git       2521  2502  0 Aug17 ?        00:25:13 /opt/gitlab/embedded/bin/gitlab-workhorse -listenNetwork unix -listenUmask 0 -listenAddr /var/opt/gitlab/gitlab-workhorse/socket -authBackend http://localhost:8080 -authSocket /var/opt/gitlab/gitlab-rails/sockets/gitlab.socket -documentRoot /opt/gitlab/embedded/service/gitlab-rails/public -pprofListenAddr  -prometheusListenAddr localhost:9229 -secretPath /opt/gitlab/embedded/service/gitlab-rails/.gitlab_workhorse_secret -logFormat json -config config.toml
    root      2522  2508  0 Aug17 ?        00:00:25 svlogd -tt /var/log/gitlab/prometheus
    gitlab-+  2528  2508  1 Aug17 ?        13:32:09 /opt/gitlab/embedded/bin/prometheus --web.listen-address=localhost:9090 --storage.tsdb.path=/var/opt/gitlab/prometheus/data --config.file=/var/opt/gitlab/prometheus/prometheus.yml
    root      6588  5915  0 09:44 pts/0    00:00:00 grep --color=auto prometheus
    
  3. Check the status of GitLab's Prometheus service. Prometheus is running normally, which rules out a stopped or crashed service as the cause.

    [root@localhost ~]# gitlab-ctl status
    run: alertmanager: (pid 2532) 3632221s; run: log: (pid 2516) 3632221s
    run: gitaly: (pid 2541) 3632221s; run: log: (pid 2540) 3632221s
    run: gitlab-exporter: (pid 2526) 3632221s; run: log: (pid 2519) 3632221s
    run: gitlab-workhorse: (pid 2521) 3632221s; run: log: (pid 2513) 3632221s
    run: grafana: (pid 2537) 3632221s; run: log: (pid 2530) 3632221s
    run: logrotate: (pid 22545) 14193s; run: log: (pid 2525) 3632221s
    run: nginx: (pid 2531) 3632221s; run: log: (pid 2524) 3632221s
    run: node-exporter: (pid 2536) 3632221s; run: log: (pid 2533) 3632221s
    run: postgres-exporter: (pid 2535) 3632221s; run: log: (pid 2527) 3632221s
    run: postgresql: (pid 2518) 3632221s; run: log: (pid 2512) 3632221s
    run: prometheus: (pid 2528) 3632221s; run: log: (pid 2522) 3632221s
    run: puma: (pid 2517) 3632221s; run: log: (pid 2514) 3632221s
    run: redis: (pid 2539) 3632222s; run: log: (pid 2538) 3632222s
    run: redis-exporter: (pid 2529) 3632222s; run: log: (pid 2523) 3632222s
    run: sidekiq: (pid 2520) 3632222s; run: log: (pid 2515) 3632222s
    
  4. Review GitLab's Prometheus data storage settings: open the configuration file (vi /etc/gitlab/gitlab.rb) and find the Prometheus section.

    ################################################################################
    ## Prometheus
    ##! Docs: https://docs.gitlab.com/ee/administration/monitoring/prometheus/
    ################################################################################
    
    ###! **To enable only Monitoring service in this machine, uncomment
    ###!   the line below.**
    ###! Docs: https://docs.gitlab.com/ee/administration/high_availability
    # monitoring_role['enable'] = true
    
    # prometheus['enable'] = true
    # prometheus['monitor_kubernetes'] = true
    # prometheus['username'] = 'gitlab-prometheus'
    # prometheus['group'] = 'gitlab-prometheus'
    # prometheus['uid'] = nil
    # prometheus['gid'] = nil
    # prometheus['shell'] = '/bin/sh'
    # prometheus['home'] = '/var/opt/gitlab/prometheus'
    # prometheus['log_directory'] = '/var/log/gitlab/prometheus'
    # prometheus['rules_files'] = ['/var/opt/gitlab/prometheus/rules/*.rules']
    # prometheus['scrape_interval'] = 15
    # prometheus['scrape_timeout'] = 15
    # prometheus['env_directory'] = '/opt/gitlab/etc/prometheus/env'
    # prometheus['env'] = {
    #   'SSL_CERT_DIR' => "/opt/gitlab/embedded/ssl/certs/"
    # }
    #
    ### Custom scrape configs
    #
    # Prometheus can scrape additional jobs via scrape_configs.  The default automatically
    # includes all of the exporters supported by the omnibus config.
    #
    # See: https://prometheus.io/docs/operating/configuration/#<scrape_config>
    #
    # Example:
    #
    # prometheus['scrape_configs'] = [
    #   {
    #     'job_name': 'example',
    #     'static_configs' => [
    #       'targets' => ['hostname:port'],
    #     ],
    #   },
    # ]
    #
    ### Custom alertmanager config
    #
    # To configure external alertmanagers, create an alertmanager config.
    #
    # See: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config
    #
    # prometheus['alertmanagers'] = [
    #   {
    #     'static_configs' => [
    #       {
    #         'targets' => [
    #           'hostname:port'
    #         ]
    #       }
    #     ]
    #   }
    # ]
    #
    ### Custom Prometheus flags
    #
    # prometheus['flags'] = {
    #   'storage.tsdb.path' => "/var/opt/gitlab/prometheus/data",
    #   'storage.tsdb.retention.time' => "15d",
    #   'config.file' => "/var/opt/gitlab/prometheus/prometheus.yml"
    # }
    
    ##! Advanced settings. Should be changed only if absolutely needed.
    # prometheus['listen_address'] = 'localhost:9090'
    

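    All related options are commented out with #, so the defaults apply. Three settings are relevant:

    • Disable Prometheus. It is only a tool for collecting system monitoring data, so GitLab itself is not affected: prometheus_monitoring['enable'] = false
    • Change the data path: uncomment the setting and point it at a location with more disk space: 'storage.tsdb.path' => "/var/bigdatapath"
    • Shorten the retention time. Prometheus automatically deletes data older than the configured period: 'storage.tsdb.retention.time' => "5d"

    Note that although any of these three changes would work around the problem in this scenario, a healthy GitLab installation used for comparison runs with the same default configuration. The root cause therefore needs further investigation, and none of the three workarounds is applied for now.

    If you do change any of these parameters, save the configuration and restart GitLab: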

    gitlab-ctl stop
    gitlab-ctl reconfigure
    gitlab-ctl start
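After a reconfigure, it can be useful to confirm which flags the running Prometheus process actually received. The following is a hedged sketch: the helper name flag_value is an assumption, and it only handles flags written as --name=value, which is the form visible in the ps output of step 2.

```shell
#!/bin/sh
# Extract the value of a --name=value flag from a command line.
# $1: flag name without the leading dashes; $2: the full command line.
# Prints nothing when the flag is absent, in which case the program's
# built-in default applies.
flag_value() {
    printf '%s\n' "$2" | tr ' ' '\n' |
        awk -v p="--$1=" 'index($0, p) == 1 { print substr($0, length(p) + 1) }'
}
```

For example, `flag_value storage.tsdb.retention.time "$(ps -o args= -C prometheus)"` prints the retention value if one is set. For the command line shown in step 2 it prints nothing, which is consistent with the default retention of 15 days being in effect.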
    
  5. Before going further, review how Prometheus stores its data and how that data is persisted and restored.

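  6. With that background, we know that Prometheus first appends incoming data to write-ahead log (WAL) segment files under the wal directory. A segment is closed once it reaches a size limit, and roughly every two hours the accumulated data is persisted as a block in the data directory, after which the older segments can be removed. In this environment the wal directory holds a large number of segment files whose data was never persisted. At some points the files also stop appearing at the two-hour rhythm described above, and their sizes fluctuate widely, with some segments only about 1 MB. This indicates that writing was already failing: the disk could no longer hold normal-sized files, which points to insufficient storage space.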

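  7. The GitLab Prometheus log (/var/log/gitlab/prometheus/current) confirms this: the operation that persists the data fails repeatedly with "no space left on device". Segment files keep accumulating while their data cannot be persisted, so disk usage climbs until the partition is full. This confirms the root cause.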

    2020-09-27_16:13:13.83242 level=error ts=2020-09-27T16:13:13.832Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
    2020-09-27_16:14:13.88201 level=error ts=2020-09-27T16:14:13.881Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
    2020-09-27_16:15:13.93866 level=error ts=2020-09-27T16:15:13.938Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
    2020-09-27_16:16:13.99965 level=error ts=2020-09-27T16:16:13.999Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
    2020-09-27_16:17:14.05193 level=error ts=2020-09-27T16:17:14.051Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
    2020-09-27_16:18:14.11222 level=error ts=2020-09-27T16:18:14.112Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
    2020-09-27_16:19:14.16503 level=error ts=2020-09-27T16:19:14.164Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
    2020-09-27_16:20:14.21062 level=error ts=2020-09-27T16:20:14.210Z caller=db.go:617 component=tsdb msg="compaction failed" err="persist head block: write compaction: write chunks: no space left on device"
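The two observations from steps 6 and 7, a growing wal directory and repeated "no space left on device" errors, can be checked with a short script. This is a sketch only; the function names wal_backlog and enospc_count are assumptions.

```shell
#!/bin/sh
# Summarise the WAL backlog in a Prometheus data directory.
# $1: data directory (the one that contains the wal subdirectory).
wal_backlog() {
    # Count regular files only; checkpoint.* entries are directories.
    segments=$(find "$1/wal" -maxdepth 1 -type f | wc -l)
    size_mb=$(du -sm "$1/wal" | cut -f1)
    echo "segments=$((segments)) size_mb=$size_mb"
}

# Count log lines that report a full disk.
# $1: log file to scan.
enospc_count() {
    grep -c 'no space left on device' "$1"
}
```

With the paths from this incident, the calls would be `wal_backlog /var/opt/gitlab/prometheus/data` and `enospc_count /var/log/gitlab/prometheus/current`. A segment count that keeps rising together with a non-zero error count matches the pattern described above.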
    

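Recovery

Delete the older segment files from the wal directory, then restart the GitLab services. Files are written normally again, and two hours later a persisted block, 01EK9H9XRQYRPJ6V5A1WFZVSHC, is generated automatically in the data directory (/var/opt/gitlab/prometheus/data, where 15 days of data are retained by default).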

[root@localhost gitlab]# cd /var/opt/gitlab/prometheus/data/
[root@localhost data]# ll
total 20
drwx------ 3 gitlab-prometheus gitlab-prometheus    68 Sep 28 13:00 01EK9H9XRQYRPJ6V5A1WFZVSHC
-rw------- 1 gitlab-prometheus gitlab-prometheus     0 Jul  7 18:44 lock
-rw------- 1 gitlab-prometheus gitlab-prometheus 20001 Sep 28 14:41 queries.active
drwx------ 3 gitlab-prometheus gitlab-prometheus    79 Sep 28 13:00 wal
[root@localhost data]# pwd
/var/opt/gitlab/prometheus/data
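Before deleting anything by hand, it helps to list the candidate files first. The sketch below is a dry run that deletes nothing. The source does not say how "older" was decided, so the function name old_segments and the 7-day default are purely hypothetical.

```shell
#!/bin/sh
# Dry run: list WAL segment files last modified more than N days ago.
# Nothing is deleted. $1: wal directory; $2: age in days (default 7).
old_segments() {
    find "$1" -maxdepth 1 -type f -mtime +"${2:-7}" | sort
}
```

For example, `old_segments /var/opt/gitlab/prometheus/data/wal 7` prints the files that a cleanup with that threshold would touch, so the list can be reviewed before any rm is run.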
posted @ 2020-10-11 23:08  一介草民李八千