The Linux Virtual File System (VFS)

1. Why have a virtual file system?

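Files can be stored in many ways. There are different file systems, such as ext3, NFS, and XFS, and the data may sit on different storage media, such as SSDs and HDDs. If every service that touches a file had to handle each file system separately, it would need one implementation per file system. To reduce that complexity, Linux puts a layer of abstraction between the caller and the concrete file systems. This layer offers a common interface for file operations and file system operations, hides the differences between file systems, and keeps the caller unaware of which file system sits underneath.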

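Through the VFS, Linux provides generic system calls that work across different file systems and media, which greatly simplifies access to them. In addition, new file systems and new kinds of storage media can be loaded into Linux dynamically, without recompiling the kernel.

"Everything is a file" is one of the basic philosophies of Linux. Not only regular files but also directories, character devices, block devices, sockets, and so on can be treated as files. The foundation of this behavior is the Linux virtual file system.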

2. How the virtual file system works

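The VFS can connect all kinds of file systems because it abstracts a common file system model and defines the conceptual interfaces that every file system is expected to support. A new file system only has to implement these interfaces and register itself with the Linux kernel; after that it can be mounted and used.

For example, writing a file on Linux: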

int ret = write(fd, buf, len);

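This invokes the write() system call. Briefly, the process is:

First, the generic VFS system call handler sys_write() is invoked.
Next, sys_write() uses fd to find the write function supplied by the file system that holds the file, say op_write().
Finally, op_write() is called to actually write the data to the file.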
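The three steps above can be imitated in user space. The following is only a toy model, not kernel code: model_file, model_file_ops, model_open(), model_sys_write(), and ramfs_like_write() are names made up for this sketch, and a small in-memory buffer stands in for the storage.

```c
#include <stddef.h>
#include <string.h>

struct model_file;

/* Per-filesystem operations table, playing the role of file_operations. */
struct model_file_ops {
    long (*write)(struct model_file *f, const char *buf, size_t len);
};

/* An open file; the data buffer stands in for the on-disk contents. */
struct model_file {
    const struct model_file_ops *f_op;
    char data[64];
    size_t pos;
};

/* The file-system-specific write, standing in for op_write(). */
static long ramfs_like_write(struct model_file *f, const char *buf, size_t len)
{
    size_t room = sizeof(f->data) - f->pos;

    if (len > room)
        len = room;                 /* short write when the buffer is full */
    memcpy(f->data + f->pos, buf, len);
    f->pos += len;
    return (long)len;
}

static const struct model_file_ops ramfs_like_ops = { .write = ramfs_like_write };

#define MODEL_MAX_FD 8
static struct model_file *model_fd_table[MODEL_MAX_FD];

/* "Open" a file on the toy file system and return its descriptor. */
int model_open(struct model_file *f)
{
    for (int fd = 0; fd < MODEL_MAX_FD; fd++) {
        if (!model_fd_table[fd]) {
            f->f_op = &ramfs_like_ops;
            f->pos = 0;
            model_fd_table[fd] = f;
            return fd;
        }
    }
    return -1;
}

/* Generic entry point: fd -> file object -> f_op->write. */
long model_sys_write(int fd, const char *buf, size_t len)
{
    if (fd < 0 || fd >= MODEL_MAX_FD || !model_fd_table[fd])
        return -1;                  /* bad descriptor */

    struct model_file *f = model_fd_table[fd];
    if (!f->f_op || !f->f_op->write)
        return -1;                  /* this file system has no write method */
    return f->f_op->write(f, buf, len);
}
```

The generic function never knows which file system it is talking to; it only follows the function pointer stored in the file object. Supporting another file system in this model would mean supplying another model_file_ops table.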

Figure: how a write() call passes from user space through sys_write() to the file system's write function.

3. Components of the virtual file system

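To implement the VFS, Linux follows an object-oriented design and abstracts four main object types:

Superblock object: represents a mounted file system.
Inode object: represents a specific file.
Dentry object: represents a directory entry, one component of a file path.
File object: represents a file opened by a process.

Each object carries a set of operation methods for acting on the underlying file system.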

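Note

Linux treats a directory as a file object, just another kind of file, which contains one or more directory entries. A directory entry is a separately abstracted object that mainly holds a file name and an inode number. Directories can be nested level by level to form a file path, and every component of such a path is a directory entry.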

Superblock

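The superblock holds the parameters of a mounted file system, including the block size, the largest file size the file system can handle, the file system type, and the underlying storage device. In the overall structure diagram, the superblock has a files member that points to all open files, but the structure listed below contains no such field. That member used to be consulted during umount to make sure every file had been closed. I do not know how newer kernels handle this check and will add it once I find out. Superblocks are managed mainly by the mount logic, which will be analyzed in detail when we get to the mount module. The main job of the superblock is to manage inodes.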

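The superblock stores the file system's metadata, which covers its basic properties, for example:

Inode information
Mount flags
The operations table s_op
Mount permissions
File system type, size, and number of blocks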
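Some of this superblock-level metadata can be observed from user space. The sketch below uses the POSIX statvfs() call, which on Linux is answered by the mounted file system itself; the fs_summary structure and the fs_summarize() function are names invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/statvfs.h>

/* A few file-system-wide facts about the file system that holds a path. */
struct fs_summary {
    unsigned long block_size;          /* f_bsize: file system block size */
    unsigned long long total_blocks;   /* f_blocks: size in f_frsize units */
    unsigned long long total_inodes;   /* f_files: total number of inodes */
    unsigned long max_name_len;        /* f_namemax: longest file name */
    int read_only;                     /* non-zero if mounted read-only */
};

/* Returns 0 on success, -1 if statvfs() fails (errno is set). */
int fs_summarize(const char *path, struct fs_summary *out)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
        return -1;
    out->block_size = sv.f_bsize;
    out->total_blocks = sv.f_blocks;
    out->total_inodes = sv.f_files;
    out->max_name_len = sv.f_namemax;
    out->read_only = (sv.f_flag & ST_RDONLY) != 0;
    return 0;
}
```

Calling fs_summarize("/", &s) fills s with the values for the root file system. The same call works unchanged on ext4, XFS, or tmpfs, which is exactly the uniformity the VFS is meant to provide.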

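Among these, the operations table s_op is especially important for every file system. It points to the superblock's table of operation functions, which contains the implementations of methods such as:

Allocating an inode
Destroying an inode
Reading and writing an inode
Syncing the file system

Here is the code of super_block: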

struct super_block {
    struct list_head    s_list;     /* Keep this first */
    dev_t           s_dev;      /* search index; _not_ kdev_t */
    unsigned char       s_blocksize_bits; // log2(block size in bytes)
    unsigned long       s_blocksize; // block size in bytes
    loff_t          s_maxbytes; /* Max file size */
    struct file_system_type *s_type; // file system type
    const struct super_operations   *s_op; // superblock operations
    const struct dquot_operations   *dq_op;
    const struct quotactl_ops   *s_qcop;
    const struct export_operations *s_export_op;
    unsigned long       s_flags;
    unsigned long       s_iflags;   /* internal SB_I_* flags */
    unsigned long       s_magic;
    struct dentry       *s_root; // root dentry; every path lookup on this file system starts here
    struct rw_semaphore s_umount;
    int         s_count;
    atomic_t        s_active;
#ifdef CONFIG_SECURITY
    void                    *s_security;
#endif
    const struct xattr_handler **s_xattr;
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
    const struct fscrypt_operations *s_cop;
#endif
    struct hlist_bl_head    s_roots;    /* alternate root dentries for NFS */
    struct list_head    s_mounts;   /* list of mounts; _not_ for fs use */
    struct block_device *s_bdev;
    struct backing_dev_info *s_bdi;
    struct mtd_info     *s_mtd;
    struct hlist_node   s_instances;
    unsigned int        s_quota_types;  /* Bitmask of supported quota types */
    struct quota_info   s_dquot;    /* Diskquota specific options */
 
    struct sb_writers   s_writers;
 
    /*
     * Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
     * s_fsnotify_marks together for cache efficiency. They are frequently
     * accessed and rarely modified.
     */
    void            *s_fs_info; /* Filesystem private info */
 
    /* Granularity of c/m/atime in ns (cannot be worse than a second) */
    u32         s_time_gran;
#ifdef CONFIG_FSNOTIFY
    __u32           s_fsnotify_mask;
    struct fsnotify_mark_connector __rcu    *s_fsnotify_marks;
#endif
 
    char            s_id[32];   /* Informational name */
    uuid_t          s_uuid;     /* UUID */
 
    unsigned int        s_max_links;
    fmode_t         s_mode;
 
    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct mutex s_vfs_rename_mutex;    /* Kludge */
 
    /*
     * Filesystem subtype.  If non-empty the filesystem type field
     * in /proc/mounts will be "type.subtype"
     */
    char *s_subtype;
 
    const struct dentry_operations *s_d_op; /* default d_op for dentries */
 
    /*
     * Saved pool identifier for cleancache (-1 means none)
     */
    int cleancache_poolid;
 
    struct shrinker s_shrink;   /* per-sb shrinker handle */
 
    /* Number of inodes with nlink == 0 but still referenced */
    atomic_long_t s_remove_count;
 
    /* Pending fsnotify inode refs */
    atomic_long_t s_fsnotify_inode_refs;
 
    /* Being remounted read-only */
    int s_readonly_remount;
 
    /* AIO completions deferred from interrupt context */
    struct workqueue_struct *s_dio_done_wq;
    struct hlist_head s_pins;
 
    /*
     * Owning user namespace and default context in which to
     * interpret filesystem uids, gids, quotas, device nodes,
     * xattrs and security labels.
     */
    struct user_namespace *s_user_ns;
 
    /*
     * The list_lru structure is essentially just a pointer to a table
     * of per-node lru lists, each of which has its own spinlock.
     * There is no need to put them into separate cachelines.
     */
    struct list_lru     s_dentry_lru; // dentry cache
    struct list_lru     s_inode_lru; // inode cache
    struct rcu_head     rcu;
    struct work_struct  destroy_work;
 
    struct mutex        s_sync_lock;    /* sync serialisation lock */
 
    /*
     * Indicates how deep in a filesystem stack this SB is
     */
    int s_stack_depth;
 
    /* s_inode_list_lock protects s_inodes */
    spinlock_t      s_inode_list_lock ____cacheline_aligned_in_smp;
    struct list_head    s_inodes;   /* all inodes */
 
    spinlock_t      s_inode_wblist_lock;
    struct list_head    s_inodes_wb;    /* writeback inodes */
} __randomize_layout;
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb); // allocate an inode on this sb
    void (*destroy_inode)(struct inode *); // free an inode of this sb
    void (*dirty_inode) (struct inode *, int flags); // mark the inode dirty
    int (*write_inode) (struct inode *, struct writeback_control *wbc); // write the inode back
    int (*drop_inode) (struct inode *); // like delete, but only once the inode's reference count has reached 0
    void (*evict_inode) (struct inode *);
    void (*put_super) (struct super_block *);  // release the sb on unmount
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*freeze_super) (struct super_block *);
    int (*freeze_fs) (struct super_block *);
    int (*thaw_super) (struct super_block *);
    int (*unfreeze_fs) (struct super_block *);
    int (*statfs) (struct dentry *, struct kstatfs *); // query file system statistics
    int (*remount_fs) (struct super_block *, int *, char *); // remount
    void (*umount_begin) (struct super_block *); // mainly used by NFS
    // reporting hooks
    int (*show_options)(struct seq_file *, struct dentry *);
    int (*show_devname)(struct seq_file *, struct dentry *);
    int (*show_path)(struct seq_file *, struct dentry *);
    int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
    struct dquot **(*get_dquots)(struct inode *);
#endif
    int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
    long (*nr_cached_objects)(struct super_block *,
                  struct shrink_control *);
    long (*free_cached_objects)(struct super_block *,
                    struct shrink_control *);
};

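When the VFS needs to operate on a superblock, it first looks up the method in the superblock's operations table s_op and then calls it. For example, in older kernels whose super_operations still had a write_super method (the listing above no longer has it), a file system wrote out its own superblock like this:

sb->s_op->write_super(sb);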

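Creating a file system essentially means writing superblock information to a specific location on the storage medium. When a file system is unmounted, the VFS calls into it to release the superblock.

Linux supports many different file systems. The file_system_type structure describes the capabilities and behavior of each kind, including:

Name, type, and so on
The list of superblock objects
And more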

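Registering a new file system with the kernel means filling in a file_system_type object and adding it to the kernel's list of known file system types. An instance of the file system is attached to the directory tree only later, when it is mounted.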
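Registration can be pictured as keeping a linked list of type descriptors keyed by name. The following user-space sketch is only loosely modeled on that idea: model_fs_type and the three model_* functions are hypothetical names, and the real kernel additionally takes locks and handles module references, which are left out here.

```c
#include <stddef.h>
#include <string.h>

/* Toy counterpart of file_system_type: a name plus a list link. */
struct model_fs_type {
    const char *name;
    struct model_fs_type *next;
};

static struct model_fs_type *model_file_systems;   /* head of the list */

/* Return the link that points at the named type, or the final NULL link. */
static struct model_fs_type **model_fs_find(const char *name)
{
    struct model_fs_type **p;

    for (p = &model_file_systems; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Returns 0 on success, -1 if a type with this name is already registered. */
int model_register_filesystem(struct model_fs_type *fs)
{
    struct model_fs_type **p = model_fs_find(fs->name);

    if (*p)
        return -1;
    fs->next = NULL;
    *p = fs;                        /* append at the tail of the list */
    return 0;
}

/* Returns 0 on success, -1 if fs was not on the list. */
int model_unregister_filesystem(struct model_fs_type *fs)
{
    struct model_fs_type **p;

    for (p = &model_file_systems; *p; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -1;
}

/* Look a type up by name, as a mount request would; NULL if unknown. */
struct model_fs_type *model_get_fs_type(const char *name)
{
    return *model_fs_find(name);
}
```

In this model a mount request for "demofs" succeeds only if model_get_fs_type("demofs") finds a registered descriptor, which mirrors why a file system has to be registered before it can be mounted.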

Inode

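The inode object contains all the information the Linux kernel needs to operate on a file or directory. This information is described by the inode structure, defined in <linux/fs.h>, which mainly contains:

Superblock-related information
Directory-related information
File size, access times, and permissions
Reference count

For more on inodes, see my other post: Linux文件系统之INode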

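One inode represents one file in the file system, and the in-memory inode is created only when the file is accessed. Like the superblock, the inode object offers many operations for the VFS to use, including:

create(): create a new inode (create a new file)
link(): create a hard link
symlink(): create a symbolic link
mkdir(): create a new directory

Every ordinary file operation has a matching operation in the inode.

Source of the inode structure: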
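The effect of create() and link() is visible from user space: after a hard link is made, two names share one inode and the link count goes up. Here is a minimal sketch using standard POSIX calls; hard_link_count_demo() is a name made up for this example, and it assumes /tmp is writable and supports hard links.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a file and a hard link to it inside a fresh temporary directory.
 * Returns the link count seen through the second name (2 on success),
 * or -1 on any error. Everything is removed again before returning.
 */
long hard_link_count_demo(void)
{
    char dir[] = "/tmp/vfs-link-XXXXXX";
    char a[64], b[64];
    struct stat sa, sb;
    long result = -1;

    if (!mkdtemp(dir))
        return -1;
    snprintf(a, sizeof a, "%s/a", dir);
    snprintf(b, sizeof b, "%s/b", dir);

    /* Creating the file asks the file system for a new inode. */
    int fd = open(a, O_CREAT | O_WRONLY | O_EXCL, 0600);
    if (fd >= 0) {
        close(fd);
        /* link() adds a second name for the same inode. */
        if (link(a, b) == 0 &&
            stat(a, &sa) == 0 && stat(b, &sb) == 0 &&
            sa.st_ino == sb.st_ino)
            result = (long)sb.st_nlink;
        unlink(b);
        unlink(a);
    }
    rmdir(dir);
    return result;
}
```

Both names report the same st_ino, which is the inode number, and st_nlink corresponds to the i_nlink field of the inode structure.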

struct inode {
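	/* node in the global inode hash table */
	struct hlist_node	i_hash;
	/* depending on its state, the inode sits on one of several lists (inode_unused/inode_in_use/super_block->dirty) */
	struct list_head	i_list;
	/* node in the super_block->s_inodes list */
	struct list_head	i_sb_list;
	/* list of dentries for this inode; several dentries may refer to the same file */
	struct list_head	i_dentry;
	/* inode number */
	unsigned long		i_ino;
	/* reference count: how many users currently hold this inode */
	atomic_t		i_count;
	/* number of hard links to the inode */
	unsigned int		i_nlink;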
	unsigned int		i_nlink;
	uid_t			i_uid;
	gid_t			i_gid;
	/* device number, when the inode represents a device file */
	dev_t			i_rdev;
	u64			i_version;
	/* file size in bytes */
	loff_t			i_size;
#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t		i_size_seqcount;
#endif
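	/* time of last access */
	struct timespec		i_atime;
	/* time the file data was last modified */
	struct timespec		i_mtime;
	/* time the inode itself was last changed */
	struct timespec		i_ctime;
	/* file size in blocks */
	blkcnt_t		i_blocks;
	unsigned int		i_blkbits;
	unsigned short          i_bytes;
	/* file mode: the low 12 bits are the access permissions, with the same meaning as the chmod argument; the remaining bits are the file type (regular file, directory, socket, device file, ...) */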
	umode_t			i_mode;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	struct mutex		i_mutex;
	struct rw_semaphore	i_alloc_sem;
	/* inode operations */
	const struct inode_operations	*i_op;
	/* file operations */
	const struct file_operations	*i_fop;
	/* super_block the inode belongs to */
	struct super_block	*i_sb;
	struct file_lock	*i_flock;
	/* address space mapping of the inode */
	struct address_space	*i_mapping;
	struct address_space	i_data;
#ifdef CONFIG_QUOTA
	struct dquot		*i_dquot[MAXQUOTAS];
#endif
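	struct list_head	i_devices; /* for a device file inode: the list node that links it to its device */
	union {
		struct pipe_inode_info	*i_pipe;
		struct block_device	*i_bdev; /* for a block device inode: points to the device instance */
		struct cdev		*i_cdev; /* for a character device inode: points to the device instance */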
	};
 
	__u32			i_generation;
 
#ifdef CONFIG_FSNOTIFY
	__u32			i_fsnotify_mask; /* all events this inode cares about */
	struct hlist_head	i_fsnotify_mark_entries; /* fsnotify mark entries */
#endif
 
#ifdef CONFIG_INOTIFY
	struct list_head	inotify_watches; /* watches on this inode */
	struct mutex		inotify_mutex;	/* protects the watches list */
#endif
 
	unsigned long		i_state;
	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
	unsigned int		i_flags; /* inode flags, e.g. S_NOATIME */
 
	atomic_t		i_writecount;
#ifdef CONFIG_SECURITY
	void			*i_security;
#endif
#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif
	void			*i_private; /* fs or device private pointer */
};

Dentry

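As mentioned earlier, the VFS treats directories as files. In /usr/bin/vim, for example, usr, bin, and vim are all files: vim is a regular file, while usr and bin are directory files, and each of them is identified by an inode object.

Because the VFS performs directory-related operations all the time, such as changing into a directory or looking up a path name, it introduces the concept of a dentry (directory entry) to make them efficient. Every component of a path, whether a directory or a regular file, is a dentry object: /, usr, bin, and vim each correspond to one. Dentry objects have no corresponding on-disk structure; the VFS builds them one by one while it walks the path.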
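To make the idea of "one dentry per path component" concrete, here is a small user-space sketch that splits a path the way the text describes. The function walk_components() and its callback signature are invented for this example; real path lookup in the kernel also handles mount points, symbolic links, and "." and "..", none of which is modeled here.

```c
#include <stddef.h>
#include <string.h>

/*
 * Call visit() once per path component, in lookup order.
 * An absolute path yields "/" as its first component.
 * visit may be NULL; the return value is the number of components.
 */
size_t walk_components(const char *path,
                       void (*visit)(const char *name, size_t len, void *ctx),
                       void *ctx)
{
    size_t count = 0;
    const char *p = path;

    if (*p == '/') {                 /* the root directory comes first */
        if (visit)
            visit("/", 1, ctx);
        count++;
    }
    while (*p) {
        while (*p == '/')            /* skip one or more separators */
            p++;
        if (!*p)
            break;
        size_t len = strcspn(p, "/");   /* length of this component */
        if (visit)
            visit(p, len, ctx);
        count++;
        p += len;
    }
    return count;
}
```

For /usr/bin/vim the callback sees "/", "usr", "bin", and "vim" in that order, so the function returns 4, one for each dentry named in the paragraph above.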

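A dentry is represented by the dentry structure, which mainly contains:

The address of the parent dentry object
The list of child dentries
The inode object associated with the entry
A pointer to the dentry operations
And so on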

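A dentry is in one of three states:

In use: the dentry points to a valid inode and has one or more users, so it cannot be discarded.
Unused: the dentry also corresponds to a valid inode, but the VFS is not currently using it, and it is kept in the cache. It can be discarded if memory needs to be reclaimed.
Negative: the dentry has no valid inode, because the inode was deleted or the path was wrong, but the dentry itself is kept.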

Resolving the directory structure of a whole file system into dentries is laborious, so to save the VFS that cost the kernel caches dentries.
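The three states and the cache can be mimicked with a very small table. This is a deliberately simplified model under my own assumptions: a fixed array instead of the kernel's hash table and LRU lists, an inode number of 0 standing for "no inode", and made-up names (model_dentry, model_dcache_add(), model_dget(), model_dput(), model_dcache_shrink()).

```c
#include <stddef.h>
#include <string.h>

enum model_dentry_state {
    MODEL_UNCACHED,    /* never looked up, or already reclaimed */
    MODEL_IN_USE,      /* valid inode, at least one user */
    MODEL_UNUSED,      /* valid inode, no users, kept in the cache */
    MODEL_NEGATIVE     /* no inode behind the name */
};

struct model_dentry {
    char name[32];
    unsigned long ino;   /* 0 stands for "no inode" in this model */
    int refcount;        /* current users of the entry */
    int valid;           /* is this slot occupied? */
};

#define MODEL_DCACHE_SLOTS 16
static struct model_dentry model_dcache[MODEL_DCACHE_SLOTS];

static struct model_dentry *dcache_find(const char *name)
{
    for (size_t i = 0; i < MODEL_DCACHE_SLOTS; i++)
        if (model_dcache[i].valid && strcmp(model_dcache[i].name, name) == 0)
            return &model_dcache[i];
    return NULL;
}

/* Cache a lookup result; ino == 0 records that the name does not exist. */
int model_dcache_add(const char *name, unsigned long ino)
{
    if (strlen(name) >= sizeof model_dcache[0].name || dcache_find(name))
        return -1;
    for (size_t i = 0; i < MODEL_DCACHE_SLOTS; i++) {
        struct model_dentry *d = &model_dcache[i];
        if (!d->valid) {
            strcpy(d->name, name);
            d->ino = ino;
            d->refcount = 0;
            d->valid = 1;
            return 0;
        }
    }
    return -1;                       /* cache is full */
}

/* Take a reference on a positive entry. */
int model_dget(const char *name)
{
    struct model_dentry *d = dcache_find(name);

    if (!d || d->ino == 0)
        return -1;
    d->refcount++;
    return 0;
}

/* Drop a reference taken with model_dget(). */
int model_dput(const char *name)
{
    struct model_dentry *d = dcache_find(name);

    if (!d || d->refcount == 0)
        return -1;
    d->refcount--;
    return 0;
}

enum model_dentry_state model_dcache_state(const char *name)
{
    struct model_dentry *d = dcache_find(name);

    if (!d)
        return MODEL_UNCACHED;
    if (d->ino == 0)
        return MODEL_NEGATIVE;
    return d->refcount > 0 ? MODEL_IN_USE : MODEL_UNUSED;
}

/* Memory pressure: drop every entry that has no users. */
size_t model_dcache_shrink(void)
{
    size_t dropped = 0;

    for (size_t i = 0; i < MODEL_DCACHE_SLOTS; i++) {
        if (model_dcache[i].valid && model_dcache[i].refcount == 0) {
            model_dcache[i].valid = 0;
            dropped++;
        }
    }
    return dropped;
}
```

In this model a shrink pass reclaims unused and negative entries but leaves in-use entries alone, which matches the rule that an in-use dentry cannot be discarded.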

File

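A file object is the in-memory instance of a file opened by a process. A Linux user program opens a file with the open() system call and closes it with close(). Because several processes can open and operate on the same file at once, one file may have several file objects in memory, while the corresponding inode and dentry are unique.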
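One way to see that each open() produces its own file object is to look at file offsets, which are stored per file object. The sketch below opens the same temporary file twice and also duplicates the first descriptor with dup(), which shares the existing file object instead of creating a new one. The function name offset_demo() is made up for this example, and it assumes /tmp is writable.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * out[0]: offset of the first descriptor after reading 5 bytes
 * out[1]: offset of a second, independent open() of the same file
 * out[2]: offset seen through a dup() of the first descriptor
 * Returns 0 on success, -1 on error.
 */
int offset_demo(long out[3])
{
    char path[] = "/tmp/vfs-file-XXXXXX";
    char buf[5];
    int rc = -1;

    int fd1 = mkstemp(path);             /* creates and opens the file */
    if (fd1 < 0)
        return -1;
    int fd2 = open(path, O_RDONLY);      /* second file object, same inode */
    int fd3 = dup(fd1);                  /* same file object as fd1 */

    if (fd2 >= 0 && fd3 >= 0 &&
        write(fd1, "hello world", 11) == 11 &&
        lseek(fd1, 0, SEEK_SET) == 0 &&
        read(fd1, buf, sizeof buf) == 5) {
        out[0] = (long)lseek(fd1, 0, SEEK_CUR);
        out[1] = (long)lseek(fd2, 0, SEEK_CUR);
        out[2] = (long)lseek(fd3, 0, SEEK_CUR);
        rc = 0;
    }
    if (fd3 >= 0)
        close(fd3);
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return rc;
}
```

The first and third offsets move together because fd1 and fd3 refer to one file object, while the second open() keeps its own offset at 0 even though all three descriptors reach the same inode.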

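A file object is represented by the file structure, which mainly contains:

File operation methods
Reference count of the file object
File offset
Read/write flags given when the file was opened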

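Like a dentry, a file object has no actual on-disk data; it comes into existence in memory only when a process opens a file.

Each process has its own set of open files, represented by the files_struct structure, which the files field of the process descriptor points to. It mainly contains:

fdt
fd_array[NR_OPEN_DEFAULT]
Reference count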

The fd_array array holds pointers to the open file objects. If the number of open file objects exceeds NR_OPEN_DEFAULT, the kernel allocates a new array and points fdt at it.
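The switch from the embedded array to a larger one can be sketched in user space as follows. This is a simplified model, not the kernel's implementation: the model_* names are invented, MODEL_NR_OPEN_DEFAULT is set to 4 only to keep the example small, and locking and RCU are ignored.

```c
#include <stdlib.h>
#include <string.h>

#define MODEL_NR_OPEN_DEFAULT 4    /* tiny on purpose; not the kernel's value */

struct model_fdtable {
    unsigned int max_fds;
    void **fd;                     /* current array of open-file slots */
};

struct model_files {
    struct model_fdtable *fdt;                /* table currently in use */
    struct model_fdtable fdtab;               /* embedded initial table */
    void *fd_array[MODEL_NR_OPEN_DEFAULT];    /* embedded initial slots */
};

void model_files_init(struct model_files *fs)
{
    memset(fs, 0, sizeof *fs);
    fs->fdtab.max_fds = MODEL_NR_OPEN_DEFAULT;
    fs->fdtab.fd = fs->fd_array;
    fs->fdt = &fs->fdtab;
}

/* Allocate a table twice as large, copy the slots, and switch fdt to it. */
static int model_expand(struct model_files *fs)
{
    struct model_fdtable *old = fs->fdt;
    struct model_fdtable *nt = malloc(sizeof *nt);

    if (!nt)
        return -1;
    nt->max_fds = old->max_fds * 2;
    nt->fd = calloc(nt->max_fds, sizeof *nt->fd);
    if (!nt->fd) {
        free(nt);
        return -1;
    }
    memcpy(nt->fd, old->fd, old->max_fds * sizeof *old->fd);
    fs->fdt = nt;
    if (old != &fs->fdtab) {       /* the embedded table is never freed */
        free(old->fd);
        free(old);
    }
    return 0;
}

/* Store file in the lowest free slot, growing the table when it is full. */
int model_fd_install(struct model_files *fs, void *file)
{
    for (;;) {
        struct model_fdtable *t = fs->fdt;

        for (unsigned int i = 0; i < t->max_fds; i++) {
            if (!t->fd[i]) {
                t->fd[i] = file;
                return (int)i;
            }
        }
        if (model_expand(fs) != 0)
            return -1;
    }
}

void model_files_destroy(struct model_files *fs)
{
    if (fs->fdt != &fs->fdtab) {
        free(fs->fdt->fd);
        free(fs->fdt);
    }
    fs->fdt = &fs->fdtab;
}
```

As long as a process stays within the default number of descriptors, no extra allocation is needed because fdt points at storage embedded in the structure itself. Only the descriptor that does not fit triggers the allocation of a larger table.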

In addition, the kernel maintains a file table for all open files, including:

File status flags
File offset

4. The virtual file system in practice

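By now we have a basic picture of the VFS architecture, but a deeper understanding is still hazy. So let us use some pseudocode of our own to operate on a file and describe how the Linux kernel reads and writes files. We take writing a file as the example and walk through the whole flow.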

Flow

Goal: starting from nothing, write hello world into the file /testmount/testdir/testfile1.txt.

The basic sequence of system calls is: 1. mkdir 2. creat 3. open 4. write
The calls made for mkdir are:

rootInode = sb->s_root->d_inode;
testDirDentry = dentry("testdir");
testDirInode = rootInode->i_op->mkdir(rootInode, testDirDentry, 0777);

The calls made for creat are:

testFileDentry = dentry("testfile1.txt");
testFileInode = testDirInode->i_op->create(testDirInode, testFileDentry, 0777);

The calls made for the open system call are:

testFileInode->i_fop->open(testFileInode, testfile);

The calls made for the write system call are:

testfile->f_op->write(testfile, "hello world", len, &pos);   /* pos starts at 0 */
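Seen from user space, the same four steps are ordinary POSIX calls. The sketch below performs them under a caller-supplied base directory instead of /testmount, so it can run without mounting anything. write_hello() and write_hello_in_tmp() are names made up for this example, and the temporary directory under /tmp is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create <base>/testdir/testfile1.txt and write "hello world" into it.
 * Returns the number of bytes written, or -1 on error.
 */
long write_hello(const char *base)
{
    static const char msg[] = "hello world";
    char dir[256], file[300];

    if (snprintf(dir, sizeof dir, "%s/testdir", base) >= (int)sizeof dir)
        return -1;
    snprintf(file, sizeof file, "%s/testfile1.txt", dir);

    if (mkdir(dir, 0777) != 0)                  /* 1. mkdir */
        return -1;
    int fd = creat(file, 0666);                 /* 2. creat */
    if (fd < 0)
        return -1;
    close(fd);
    fd = open(file, O_WRONLY);                  /* 3. open */
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, msg, strlen(msg));    /* 4. write */
    close(fd);
    return (long)n;
}

/* Run write_hello() in a throwaway directory and clean up afterwards. */
long write_hello_in_tmp(void)
{
    char base[] = "/tmp/vfs-demo-XXXXXX";
    char path[128];

    if (!mkdtemp(base))
        return -1;
    long n = write_hello(base);

    snprintf(path, sizeof path, "%s/testdir/testfile1.txt", base);
    unlink(path);
    snprintf(path, sizeof path, "%s/testdir", base);
    rmdir(path);
    rmdir(base);
    return n;
}
```

creat() already returns an open descriptor, so the separate open() is not strictly needed; it is kept only to follow the four steps listed above one by one.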

Detailed flow

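1. Suppose we have a block device /dev/sda and format it as an ext2 file system. How a block device is formatted is covered in the device management chapter.
2. We mount the disk on the /testmount directory, so the kernel's mount module registers the corresponding superblock. How mounting works is a topic for a later post.
3. To write /testmount/testdir/testfile1.txt, the kernel first has to look up the dentries along the full path and create the missing inodes where necessary.
3.1 Use the full path to find the superblock of the matching mount point. Here the most precise match is the sb mounted on /testmount.
3.2 From the sb, take its root dentry and the inode behind it, then read from disk through the inode's address_space. For a directory, the stored content is the list of its child entries, which is used to populate the children of the root dentry. No testdir entry is found, so a "directory does not exist" error is reported. The user then creates the directory, and the new information is written back to the device behind the inode. In the same way, testfile1.txt has to be created under /testdir and written back to the device behind the inode of /testdir.
4. Once the inode is found, the file is opened with the open system call. The process allocates a file descriptor through next_fd in files_struct and then calls inode->i_fop->open(inode, file), which produces a file object, passes the inode's address_space on to the file, and ties the user-space fd to that file object.
5. After the file is open, all later reads and writes go through the fd. Inside the kernel they operate on the corresponding file structure; to write hello world, the kernel calls file->f_op->write.
6. What file->f_op->write actually does is copy the bytes into the memory of the address_space, and the address_space writes them back to disk at a suitable time. This is the cache we usually talk about. We can also force the data out to storage with the fsync system call. The functions in f_op carry the __user annotation, which says that the data comes from a user-space address. That data ends up in the page memory of the kernel's address_space cache, which is the kernel copy people usually mention. Later came the well-known zero-copy sendfile, which copies data directly between two fds and works only on pages inside the kernel, without a round trip through the user address space.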
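The two calls mentioned at the end, sendfile and fsync, can be combined in a short Linux-specific sketch: copy a file descriptor's contents to another descriptor without passing the data through a user-space buffer, then force the result to storage. copy_in_kernel() and copy_demo() are names made up for this example, and it assumes a kernel recent enough that sendfile() accepts a regular file as its output descriptor.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Copy everything from in_fd to out_fd inside the kernel, then force
 * the result to stable storage. Returns bytes copied, or -1 on error.
 */
long copy_in_kernel(int in_fd, int out_fd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(in_fd, &st) != 0)
        return -1;
    while (offset < st.st_size) {
        /* sendfile() reads from offset and advances it by the amount sent. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0)
            return -1;
        if (n == 0)
            break;                   /* unexpected end of input */
    }
    if (fsync(out_fd) != 0)          /* flush cached pages to the device */
        return -1;
    return (long)offset;
}

/* Self-check: copy "hello world" between two temporary files. */
long copy_demo(void)
{
    char src[] = "/tmp/vfs-src-XXXXXX";
    char dst[] = "/tmp/vfs-dst-XXXXXX";
    long n = -1;

    int in = mkstemp(src);
    int out = mkstemp(dst);
    if (in >= 0 && out >= 0 && write(in, "hello world", 11) == 11)
        n = copy_in_kernel(in, out);
    if (in >= 0) {
        close(in);
        unlink(src);
    }
    if (out >= 0) {
        close(out);
        unlink(dst);
    }
    return n;
}
```

Note that no read() or write() with a user buffer appears in copy_in_kernel(); the program only passes descriptors and an offset, and the data itself stays in kernel pages.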

Posted 2021-01-20 23:21 by 等不到的口琴
