Recently we found that several customer devices stopped working after losing power and rebooting.

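Examining the returned units showed the visible symptom: network connections failed because DNS names could not be resolved. Digging further, /etc/resolv.conf was empty, so applications had no DNS server to query. Other devices behind the same router obtained their DNS server information normally. The dhcpc log then showed that the client had received the DNS information but failed to write it to /etc/resolv.conf, which is why applications failed when they tried to read the DNS server.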
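
The symptom described above (a resolv.conf with no usable entries) is easy to test for directly. The following is a minimal sketch, not the tooling used during the investigation; the function names are made up for illustration, and the parser is deliberately lenient about where comments may appear.

```python
def nameservers(resolv_conf_text):
    """Return the addresses found on `nameserver` lines of a resolv.conf body."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop anything after a '#' or ';' comment marker (lenient on purpose).
        for marker in ("#", ";"):
            line = line.split(marker, 1)[0]
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def read_nameservers(path="/etc/resolv.conf"):
    """Read `path`; a missing file is treated the same as an empty one."""
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            return nameservers(f.read())
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    found = read_nameservers()
    print("DNS servers:", found if found else "none (nothing for the resolver to query)")
```

An empty list here corresponds to the state seen on the returned units: the DHCP client had the information, but the file that applications read was empty.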

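Why did the write fail? Running df on the root file system showed that it was completely full. The output below shows that the entire 16 MB of root file system space (the overlay mounted on /) is in use: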

/ # df
Filesystem           1K-blocks      Used Available Use% Mounted on
overlay                  15863     15863         0 100% /
none                       492         0       492   0% /dev
run                     454124        56    454068   0% /run
shm                     454124         0    454124   0% /dev/shm
ubi2:exa_data_          780736     76516    699384  10% /config
ubi2:exa_data_          780736     76516    699384  10% /log
ubi2:exa_data_          780736     76516    699384  10% /tokens
tmpfs                      512         0       512   0% /dev/snd
tmpfs                      512         0       512   0% /dev/input/event0
tmpfs                      512         0       512   0% /dev/hbi
tmpfs                   454124      1652    452472   0% /tmp/ubus.sock
tmpfs                   454124      1652    452472   0% /run/dbus/system_bus_socket
cgroup_root              10240         0     10240   0% /sys/fs/cgroup
/ # lsof |grep deleted
318     /sbin/rc        /run/openrc/exclusive/bootmisc (deleted)
318     /sbin/rc        /run/openrc/exclusive/networking (deleted)
318     /sbin/rc        /run/openrc/exclusive/syslog (deleted)
318     /sbin/rc        /run/openrc/exclusive/avs-server (deleted)
/ # exit
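
The same check can be automated by scanning df output for file systems that are full. This is a small sketch that assumes the column layout shown in the transcript above (Use% is the second-to-last column and the mount point is the last); the function name and the threshold parameter are illustrative.

```python
def full_filesystems(df_output, threshold=100):
    """Return (mount_point, use_percent) for rows at or above `threshold` percent."""
    full = []
    for line in df_output.splitlines():
        fields = line.split()
        # Data rows look like: name, 1K-blocks, Used, Available, Use%, mount point.
        # The header and shell prompt lines do not have a "NN%" field there.
        if len(fields) < 6 or not fields[-2].endswith("%"):
            continue
        try:
            percent = int(fields[-2].rstrip("%"))
        except ValueError:
            continue  # e.g. "-" where df could not compute a percentage
        if percent >= threshold:
            full.append((fields[-1], percent))
    return full


if __name__ == "__main__":
    sample = """Filesystem           1K-blocks      Used Available Use% Mounted on
overlay                  15863     15863         0 100% /
none                       492         0       492   0% /dev
ubi2:exa_data_          780736     76516    699384  10% /config
"""
    print(full_filesystems(sample))
```

Run against the rows copied from the transcript, only the overlay mounted on / is reported.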

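However, when we ran du in the directory where that file system is mounted, the files themselves did not take up nearly that much space. Deleted-but-still-open files do not explain the gap either: the lsof output above lists such files only under /run, which is a separate file system. The output below shows upperdir using only about 7 MB: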

[router] /overlay # du -d 1
2       ./workdir
7043    ./upperdir
12      ./lost+found
7058    .
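
To put a number on the mismatch, the two views can be compared: what the file system reports as used (the df view) against what the files reachable from the directory tree occupy (the du view). The sketch below is one way to do that with the standard library; the function names are hypothetical, and it assumes st_blocks is counted in 512-byte units, as on Linux. Unlike `du -x`, it does not stop at mount boundaries.

```python
import os


def tree_usage_kib(root):
    """Allocated size of everything under `root` in KiB, roughly like `du -s`."""
    seen = set()  # (device, inode) pairs, so hard links are counted once
    total_512 = os.lstat(root).st_blocks
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished while walking
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total_512 += st.st_blocks
    return total_512 * 512 // 1024


def fs_used_kib(path):
    """Used space of the file system containing `path` in KiB (the df view)."""
    vfs = os.statvfs(path)
    return (vfs.f_blocks - vfs.f_bfree) * vfs.f_frsize // 1024


def unexplained_kib(fs_used, tree_used):
    """Space the file system counts as used that no reachable file accounts for."""
    return max(fs_used - tree_used, 0)


if __name__ == "__main__":
    # Figures from the transcripts: df reports 15863 KiB used, du reports 7058 KiB.
    print(unexplained_kib(15863, 7058), "KiB unaccounted for")
```

With the figures from the transcripts, 15863 - 7058 = 8805 KiB, so more than half of the 16 MB is counted as used without any file accounting for it.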

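So where did the rest of the space go? du adds up the files it can reach through the directory tree, while df reports the file system's own block accounting, so the next step was to check the image behind the file system. It turned out that the image was corrupted, which is why the numbers reported by df were wrong. In the run below we answered no to every prompt, so the check changed nothing: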

~$ fsck.ext4 rootfs_overlay.img
e2fsck 1.44.1 (24-Mar-2018)
rootfs_overlay.img contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 2378, i_blocks is 2, should be 0.  Fix<y>? no
Deleted inode 2381 has zero dtime.  Fix<y>? no
Deleted inode 2386 has zero dtime.  Fix<y>? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  +2612 -(8525--8526) -(8708--8720) -8725 -8731 -(8769--8772) -9235 -(9239--9248) -10319 -(10380--10396) -(10781--10785) -10923 -(12353--12356) -(12649--12656) -(13513--13764)
Fix<y>? no
Free blocks count wrong for group #1 (7063, counted=7062).
Fix<y>? no
Free blocks count wrong (8361, counted=8356).
Fix<y>? no
Inode bitmap differences:  -2381 -2386
Fix<y>? no

rootfs_overlay.img: ********** WARNING: Filesystem still has errors **********

rootfs_overlay.img: 351/4096 files (2.3% non-contiguous), 8023/16384 blocks
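
For scripted checks it is more convenient to look at e2fsck's exit status than to read the prompts. The exit status is a bit mask documented in the e2fsck man page; the sketch below decodes it and wraps a forced, read-only check (`-f -n`). The helper names are illustrative, and the image path in the usage part is simply the one from the transcript.

```python
import subprocess

# Exit status bits as documented in e2fsck(8).
E2FSCK_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "canceled by user request",
    128: "shared-library error",
}


def decode_e2fsck_status(code):
    """Translate an e2fsck exit status into a list of human-readable conditions."""
    if code == 0:
        return ["no errors"]
    return [text for bit, text in E2FSCK_BITS.items() if code & bit]


def check_read_only(image):
    """Run a forced check that modifies nothing and return the exit status.

    -f forces a check even if the file system is marked clean;
    -n opens it read-only and answers "no" to every question.
    """
    return subprocess.run(["e2fsck", "-f", "-n", image]).returncode


if __name__ == "__main__":
    status = check_read_only("rootfs_overlay.img")
    print(status, decode_e2fsck_status(status))
```

A read-only run over an image in the state shown above would be expected to report the "errors left uncorrected" bit, since nothing was fixed.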

We then repaired the image with fsck and mounted the file system again, after which the system worked normally.

~$ fsck.ext4 rootfs_overlay.img
e2fsck 1.44.1 (24-Mar-2018)
rootfs_overlay.img contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 2378, i_blocks is 2, should be 0.  Fix<y>? yes
Deleted inode 2381 has zero dtime.  Fix<y>? yes
Deleted inode 2386 has zero dtime.  Fix<y>? yes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  +2612 -(8525--8526) -(8708--8720) -8725 -8731 -(8769--8772) -9235 -(9239--9248) -10319 -(10380--10396) -(10781--10785) -10923 -(12353--12356) -(12649--12656) -(13513--13764)
Fix<y>? yes
Free blocks count wrong for group #0 (1294, counted=1293).
Fix<y>? yes
Free blocks count wrong for group #1 (7063, counted=7382).
Fix<y>? yes
Free blocks count wrong (8361, counted=8675).
Fix<y>? yes
Inode bitmap differences:  -2381 -2386
Fix<y>? yes
Free inodes count wrong for group #1 (1711, counted=1713).
Fix<y>? yes
Free inodes count wrong (3745, counted=3747).
Fix ('a' enables 'yes' to all) <y>? yes to all

rootfs_overlay.img: ***** FILE SYSTEM WAS MODIFIED *****
rootfs_overlay.img: 349/4096 files (2.3% non-contiguous), 7709/16384 blocks
~$ fsck.ext4 rootfs_overlay.img
e2fsck 1.44.1 (24-Mar-2018)
rootfs_overlay.img: clean, 349/4096 files, 7709/16384 blocks
~$
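
The summary lines printed before and after the repair show how much was reclaimed. The sketch below parses the `files` and `blocks` counters from such a line and computes the difference; the function name is made up, and the regular expression only assumes the two summary formats visible in the transcripts.

```python
import re

_SUMMARY = re.compile(r"(\d+)/(\d+) files.*?(\d+)/(\d+) blocks")


def parse_summary(line):
    """Extract used/total inode and block counts from an e2fsck summary line."""
    m = _SUMMARY.search(line)
    if m is None:
        raise ValueError("not an e2fsck summary line: %r" % line)
    files_used, files_total, blocks_used, blocks_total = map(int, m.groups())
    return {
        "files_used": files_used,
        "files_total": files_total,
        "blocks_used": blocks_used,
        "blocks_total": blocks_total,
    }


if __name__ == "__main__":
    before = parse_summary(
        "rootfs_overlay.img: 351/4096 files (2.3% non-contiguous), 8023/16384 blocks"
    )
    after = parse_summary("rootfs_overlay.img: clean, 349/4096 files, 7709/16384 blocks")
    print("blocks released:", before["blocks_used"] - after["blocks_used"])
    print("inodes released:", before["files_used"] - after["files_used"])
```

From the two lines above, 8023 - 7709 = 314 blocks and 351 - 349 = 2 inodes were released, which matches the corrected free block count (8361 to 8675) and free inode count (3745 to 3747) in the repair log. Assuming 1 KiB blocks, which 16384 blocks on a roughly 16 MB image suggests, that is 314 KiB.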

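File system corruption is a serious risk for an embedded system: the device can easily end up bricked and come back for repair. Fortunately, ext4 is a journaling file system, so it can be recovered with the help of its journal and fsck. Before the device mounts the file system at boot, it should therefore always check it with fsck, and any error that is found must be repaired immediately.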
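
The check-before-mount rule can be sketched as a small boot-time helper. This is only an illustration under stated assumptions, not the fix that shipped on the devices: the device and mount point names are hypothetical, the helper first tries e2fsck's automatic "preen" mode (`-p`), falls back to a forced run that answers yes to everything (`-f -y`) if errors remain, and refuses to mount if the file system still cannot be brought to a clean state. The command runner is injectable so the logic can be exercised without touching a real device.

```python
import subprocess

UNCORRECTED = 4             # e2fsck exit bit: errors left uncorrected
FATAL = 8 | 16 | 32 | 128   # operational, usage, canceled, shared-library errors


def check_then_mount(device, mountpoint, run=subprocess.call):
    """Check `device` with e2fsck, repair if needed, and mount it only if clean.

    `run` takes an argument list and returns the command's exit status.
    Returns True if the file system ended up mounted.
    """
    # First pass: safe automatic repairs only.
    status = run(["e2fsck", "-p", device])
    if (status & UNCORRECTED) and not (status & FATAL):
        # Second pass: force a full check and accept every proposed fix.
        status = run(["e2fsck", "-f", "-y", device])
    if status & (UNCORRECTED | FATAL):
        return False  # still damaged or e2fsck itself failed: do not mount
    return run(["mount", "-t", "ext4", device, mountpoint]) == 0


if __name__ == "__main__":
    # Hypothetical names; substitute the real overlay partition and mount point.
    ok = check_then_mount("/dev/hypothetical_overlay", "/overlay")
    print("mounted" if ok else "not mounted")
```

Accepting every fix with `-y` trades possible data loss for a bootable device, which is usually the right trade for a small overlay partition, but that choice is a design assumption rather than something the investigation above settles.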