代码改变世界

Linux虚拟文件系统(VFS)

2013-02-24 00:25 by 横刀天笑, ... 阅读, ... 评论, 收藏, 编辑

文件系统是操作系统里非常重要的一个子系统。虚拟文件系统,顾名思义。它为应用程序员提供一层抽象,屏蔽底层各种文件系统的差异。Linux的文件系统采用面向对象的方式设计,这使得Linux的文件系统非常容易扩展,我们可以非常容易将一个新的文件系统添加到Linux中。

Linux本身主要是C语言编写的(少量汇编),而大家都知道C语言是典型的结构化语言,不属于面向对象语言,那为什么又说Linux的文件系统采用面向对象设计呢?(从这里可以看出,面向对象设计和面向对象编程语言其实关系并不大,殊不知很多人使用者Java之类的面向对象语言却编写着结构化的代码,也有人用C之类的结构化语言做出精妙的面向对象设计)。这在后面会讲解。

文件系统类型

VFS所支持的文件系统类型可以归结为以下三大类:

  • 基于磁盘的文件系统(Ext2, Ext3等)
  • 网络文件系统(NFS等)
  • 特殊文件系统(proc, sysfs)

Linux的目录形成一个树形结构,根目录是 /。根目录位于根文件系统中。在Linux中通常是Ext2或Ext3。其他文件系统挂载在根文件系统的子目录下。

通用文件模型

可以说,Linux VFS等强大扩展能力,正是因为这种通用文件模型的设计。新支持的文件系统,只需要将自己的结构转换成这种通用模型即可插入到Linux中。首先我们来看看VFS中几个主要的对象结构:

super_block

位于<linux/fs.h>中

struct super_block {
	struct list_head	s_list;		/* Keep this first */
	dev_t			s_dev;		/* search index; _not_ kdev_t */
	unsigned char		s_blocksize_bits;
	unsigned long		s_blocksize;
	loff_t			s_maxbytes;	/* Max file size */
	struct file_system_type	*s_type;
	const struct super_operations	*s_op;
	const struct dquot_operations	*dq_op;
	const struct quotactl_ops	*s_qcop;
	const struct export_operations *s_export_op;
	unsigned long		s_flags;
	unsigned long		s_magic;
	struct dentry		*s_root;
	struct rw_semaphore	s_umount;
	int			s_count;
	atomic_t		s_active;
#ifdef CONFIG_SECURITY
	void                    *s_security;
#endif
	const struct xattr_handler **s_xattr;

	struct list_head	s_inodes;	/* all inodes */
	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
#ifdef CONFIG_SMP
	struct list_head __percpu *s_files;
#else
	struct list_head	s_files;
#endif
	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
	/* s_dentry_lru, s_nr_dentry_unused protected by dcache.c lru locks */
	struct list_head	s_dentry_lru;	/* unused dentry lru */
	int			s_nr_dentry_unused;	/* # of dentry on lru */

	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
	spinlock_t		s_inode_lru_lock ____cacheline_aligned_in_smp;
	struct list_head	s_inode_lru;		/* unused inode lru */
	int			s_nr_inodes_unused;	/* # of inodes on lru */

	struct block_device	*s_bdev;
	struct backing_dev_info *s_bdi;
	struct mtd_info		*s_mtd;
	struct hlist_node	s_instances;
	struct quota_info	s_dquot;	/* Diskquota specific options */

	struct sb_writers	s_writers;

	char s_id[32];				/* Informational name */
	u8 s_uuid[16];				/* UUID */

	void 			*s_fs_info;	/* Filesystem private info */
	unsigned int		s_max_links;
	fmode_t			s_mode;

	/* Granularity of c/m/atime in ns.
	   Cannot be worse than a second */
	u32		   s_time_gran;

	/*
	 * The next field is for VFS *only*. No filesystems have any business
	 * even looking at it. You had been warned.
	 */
	struct mutex s_vfs_rename_mutex;	/* Kludge */

	/*
	 * Filesystem subtype.  If non-empty the filesystem type field
	 * in /proc/mounts will be "type.subtype"
	 */
	char *s_subtype;

	/*
	 * Saved mount options for lazy filesystems using
	 * generic_show_options()
	 */
	char __rcu *s_options;
	const struct dentry_operations *s_d_op; /* default d_op for dentries */

	/*
	 * Saved pool identifier for cleancache (-1 means none)
	 */
	int cleancache_poolid;

	struct shrinker s_shrink;	/* per-sb shrinker handle */

	/* Number of inodes with nlink == 0 but still referenced */
	atomic_long_t s_remove_count;

	/* Being remounted read-only */
	int s_readonly_remount;
};

super_block存储对应的文件系统信息,对于基于磁盘的文件系统通常对应着存储在磁盘上的文件系统控制块(filesystem control block)。所有的super_block会放在一个循环双链表中,链表的第一个元素是super_blocks(在<linux/fs.h>中)。struct list_head s_list指向前后邻居。s_fs_info指向具体文件系统的superblock信息,这个一般保存在磁盘里,比如Ext2就会指向ext2_sb_info。

在这个结构里我们看到有一个s_op指针。这个指针指向的结构是具体的文件系统对应的super_block的操作。也就是说,当注册一个新的文件系统时,新的文件系统会提供这些操作。这就是如何在C里实现面向对象的设计(虽然在C里可以做到面向对象设计,但并没有真正的面向对象语言做的那么自然)。下面是struct super_operations结构:

struct super_operations {
   	struct inode *(*alloc_inode)(struct super_block *sb);
	void (*destroy_inode)(struct inode *);

   	void (*dirty_inode) (struct inode *, int flags);
	int (*write_inode) (struct inode *, struct writeback_control *wbc);
	int (*drop_inode) (struct inode *);
	void (*evict_inode) (struct inode *);
	void (*put_super) (struct super_block *);
	int (*sync_fs)(struct super_block *sb, int wait);
	int (*freeze_fs) (struct super_block *);
	int (*unfreeze_fs) (struct super_block *);
	int (*statfs) (struct dentry *, struct kstatfs *);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*umount_begin) (struct super_block *);

	int (*show_options)(struct seq_file *, struct dentry *);
	int (*show_devname)(struct seq_file *, struct dentry *);
	int (*show_path)(struct seq_file *, struct dentry *);
	int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
	int (*nr_cached_objects)(struct super_block *);
	void (*free_cached_objects)(struct super_block *, int);
};

比如我们要调用alloc_inode方法:

sb->s_op->alloc_inode(sb)。

这里与面向对象语言的差别是,面向对象语言里实例方法可以访问到this,这样就可以访问到自身的所有成员,但是在C里却做不到,所以需要将自身作为参数传入到函数中。

inode

位于<linux/fs.h>中

struct inode {
	umode_t			i_mode;
	unsigned short		i_opflags;
	kuid_t			i_uid;
	kgid_t			i_gid;
	unsigned int		i_flags;

#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif

	const struct inode_operations	*i_op;
	struct super_block	*i_sb;
	struct address_space	*i_mapping;

#ifdef CONFIG_SECURITY
	void			*i_security;
#endif

	/* Stat data, not accessed from path walking */
	unsigned long		i_ino;
	/*
	 * Filesystems may only read i_nlink directly.  They shall use the
	 * following functions for modification:
	 *
	 *    (set|clear|inc|drop)_nlink
	 *    inode_(inc|dec)_link_count
	 */
	union {
		const unsigned int i_nlink;
		unsigned int __i_nlink;
	};
	dev_t			i_rdev;
	loff_t			i_size;
	struct timespec		i_atime;
	struct timespec		i_mtime;
	struct timespec		i_ctime;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	unsigned short          i_bytes;
	unsigned int		i_blkbits;
	blkcnt_t		i_blocks;

#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t		i_size_seqcount;
#endif

	/* Misc */
	unsigned long		i_state;
	struct mutex		i_mutex;

	unsigned long		dirtied_when;	/* jiffies of first dirtying */

	struct hlist_node	i_hash;
	struct list_head	i_wb_list;	/* backing dev IO list */
	struct list_head	i_lru;		/* inode LRU list */
	struct list_head	i_sb_list;
	union {
		struct hlist_head	i_dentry;
		struct rcu_head		i_rcu;
	};
	u64			i_version;
	atomic_t		i_count;
	atomic_t		i_dio_count;
	atomic_t		i_writecount;
	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
	struct file_lock	*i_flock;
	struct address_space	i_data;
#ifdef CONFIG_QUOTA
	struct dquot		*i_dquot[MAXQUOTAS];
#endif
	struct list_head	i_devices;
	union {
		struct pipe_inode_info	*i_pipe;
		struct block_device	*i_bdev;
		struct cdev		*i_cdev;
	};

	__u32			i_generation;

#ifdef CONFIG_FSNOTIFY
	__u32			i_fsnotify_mask; /* all events this inode cares about */
	struct hlist_head	i_fsnotify_marks;
#endif

#ifdef CONFIG_IMA
	atomic_t		i_readcount; /* struct files open RO */
#endif
	void			*i_private; /* fs or device private pointer */
};

inode存储的是特定文件的信息。对于基于磁盘的文件系统,这个通常对应着存储在磁盘上的文件控制块(file control block)。每个inode对应着一个inode编号,用来唯一标识文件系统中的文件。需要注意的是在Linux中文件夹也是一个文件,所以它也有一个inode对应着。

注意上面的结构,与super_block相比较,同样也有一个i_op指针。它指向的是下面的结构:

struct inode_operations {
	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
	void * (*follow_link) (struct dentry *, struct nameidata *);
	int (*permission) (struct inode *, int);
	struct posix_acl * (*get_acl)(struct inode *, int);

	int (*readlink) (struct dentry *, char __user *,int);
	void (*put_link) (struct dentry *, struct nameidata *, void *);

	int (*create) (struct inode *,struct dentry *, umode_t, bool);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct inode *,struct dentry *,const char *);
	int (*mkdir) (struct inode *,struct dentry *,umode_t);
	int (*rmdir) (struct inode *,struct dentry *);
	int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
	int (*rename) (struct inode *, struct dentry *,
			struct inode *, struct dentry *);
	void (*truncate) (struct inode *);
	int (*setattr) (struct dentry *, struct iattr *);
	int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
	int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
	ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
	ssize_t (*listxattr) (struct dentry *, char *, size_t);
	int (*removexattr) (struct dentry *, const char *);
	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
		      u64 len);
	int (*update_time)(struct inode *, struct timespec *, int);
	int (*atomic_open)(struct inode *, struct dentry *,
			   struct file *, unsigned open_flag,
			   umode_t create_mode, int *opened);
} ____cacheline_aligned;

这个操作上就有一些我们常用的文件操作命令,比如rename, mkdir等。这些命令通过系统调用最后都会映射到这里来。

file

位于<linux/fs.h>

struct file {
	/*
	 * fu_list becomes invalid after file_free is called and queued via
	 * fu_rcuhead for RCU freeing
	 */
	union {
		struct list_head	fu_list;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
#define f_dentry	f_path.dentry
#define f_vfsmnt	f_path.mnt
	const struct file_operations	*f_op;

	/*
	 * Protects f_ep_links, f_flags, f_pos vs i_size in lseek SEEK_CUR.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
#ifdef CONFIG_SMP
	int			f_sb_list_cpu;
#endif
	atomic_long_t		f_count;
	unsigned int 		f_flags;
	fmode_t			f_mode;
	loff_t			f_pos;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;

	u64			f_version;
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links;
	struct list_head	f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
#ifdef CONFIG_DEBUG_WRITECOUNT
	unsigned long f_mnt_write_state;
#endif
};

file即表示一个进程内打开的文件,所以与之前的两个结构不同,它是与进程对应的。对于这个结构来说,最重要的就是当前位置:f_pos。进程对文件的下一次操作就是从这个位置开始的。file对象会放在进程的文件描述符(file descriptions)表里。而我们使用read等函数时所使用的文件描述符即进程的文件描述符的索引。系统内部通过这个索引得到对应的file对象,然后使用对应的file_operations里的函数进行操作。

在task_struct结构(即用来表示进程描述符那个结构)里有一个struct files_struct *files:

struct files_struct {
  /*
   * read mostly part
   */
	atomic_t count;
	struct fdtable __rcu *fdt;
	struct fdtable fdtab;
  /*
   * written part on a separate cache line in SMP
   */
	spinlock_t file_lock ____cacheline_aligned_in_smp;
	int next_fd;
	unsigned long close_on_exec_init[1];
	unsigned long open_fds_init[1];
	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

fdtable:

struct fdtable {
	unsigned int max_fds;
	struct file __rcu **fd;      /* current fd array */
	unsigned long *close_on_exec;
	unsigned long *open_fds;
	struct rcu_head rcu;
	struct fdtable *next;
};

这个fdtable结构里的fd指向进程打开的文件,一般这个fd会指向files_struct的fd_array。fd_array的大小是NR_OPEN_DEFAULT,默认值是BITS_PER_LONG。在32位上默认是32。这里的策略是:如果进程打开的文件数小于等于32,fd指向fd_array,file对象也都是放在fd_array里,当打开的文件数超过32时,系统会开辟一个更大的数组,并更新fdtable的fd指针以及max_fds。我们程序里使用的那个文件描述符也就是这里数组的索引了。

值得注意的是有可能存在多个文件描述符指向同一个file。

另外task_struct还关联到一个fs_struct结构上:

struct fs_struct {
	int users;
	spinlock_t lock;
	seqcount_t seq;
	int umask;
	int in_exec;
	struct path root, pwd;
};
struct path {
	struct vfsmount *mnt;
	struct dentry *dentry;
};

这个结构里也有几个有意思的字段:root, pwd。root是进程的根目录,pwd是进程的当前工作目录(working directory)。还记得linux里的pwd命令不?很多人奇怪这个命令为什么叫这个名字呢?这个名字应该是跟密码什么的相关的啊。这里就是原因了,pwd的意思猜测应该是process working directory。

这里的root和pwd有什么用呢?比如你写了一个程序想访问文件,如果你的路径是以 / 开始的,则系统会从root开始查找该文件,如果没有以 / 开始,则系统会从pwd开始查找文件。  

file_operations

struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
	int (*readdir) (struct file *, void *, filldir_t);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *, fl_owner_t id);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, loff_t, loff_t, int datasync);
	int (*aio_fsync) (struct kiocb *, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
	int (*check_flags)(int);
	int (*flock) (struct file *, int, struct file_lock *);
	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
	int (*setlease)(struct file *, long, struct file_lock **);
	long (*fallocate)(struct file *file, int mode, loff_t offset,
			  loff_t len);
};

dentry

位于<linux/dcache.h>

struct dentry {
	/* RCU lookup touched fields */
	unsigned int d_flags;		/* protected by d_lock */
	seqcount_t d_seq;		/* per dentry seqlock */
	struct hlist_bl_node d_hash;	/* lookup hash list */
	struct dentry *d_parent;	/* parent directory */
	struct qstr d_name;
	struct inode *d_inode;		/* Where the name belongs to - NULL is
					 * negative */
	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */

	/* Ref lookup also touches following */
	unsigned int d_count;		/* protected by d_lock */
	spinlock_t d_lock;		/* per dentry lock */
	const struct dentry_operations *d_op;
	struct super_block *d_sb;	/* The root of the dentry tree */
	unsigned long d_time;		/* used by d_revalidate */
	void *d_fsdata;			/* fs-specific data */

	struct list_head d_lru;		/* LRU list */
	/*
	 * d_child and d_rcu can share memory
	 */
	union {
		struct list_head d_child;	/* child of parent list */
	 	struct rcu_head d_rcu;
	} d_u;
	struct list_head d_subdirs;	/* our children */
	struct hlist_node d_alias;	/* inode alias list */
};

dentry_operations

struct dentry_operations {
	int (*d_revalidate)(struct dentry *, unsigned int);
	int (*d_hash)(const struct dentry *, const struct inode *,
			struct qstr *);
	int (*d_compare)(const struct dentry *, const struct inode *,
			const struct dentry *, const struct inode *,
			unsigned int, const char *, const struct qstr *);
	int (*d_delete)(const struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_prune)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);
	char *(*d_dname)(struct dentry *, char *, int);
	struct vfsmount *(*d_automount)(struct path *);
	int (*d_manage)(struct dentry *, bool);
} ____cacheline_aligned;

内核给进程访问的路径名上的每个部分都创建一个dentry,比如/users/yuyijq/test,这样就会有 /, users, yuyijq, test四个dentry。dentry对象保存在slab cache里,这个也叫directory entry cache(or dcahce)。这样就可以快速的寻找一个文件。一个或多个dentry会关联到一个inode上,比如硬链接(hard link)。

路径名查找(pathname lookup)

当我们使用open(), mkdir()等系统调用的时候,需要根据路径名找出对应的VFS对象(inode, dentry等),这个过程就涉及pathname lookup。前面提到过,如果操作的这个pathname,是以"/"开头的,则从current(当前进程)->fs->root开始查找,否则就是从current->fs->pwd开始。有了开始点,就可以递归的开始查找了。另外,为了加速这个查找过程,内核还使用了dentry cache(dcache)。 

注册文件系统

要注册或反注册一个文件系统,使用下面的函数<linux/fs.h>:

extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);

从/proc/filesystems里可以看出系统里已经注册的文件系统类型。

file_system_type

struct file_system_type {
	const char *name;
	int fs_flags;
#define FS_REQUIRES_DEV		1 
#define FS_BINARY_MOUNTDATA	2
#define FS_HAS_SUBTYPE		4
#define FS_REVAL_DOT		16384	/* Check the paths ".", ".." for staleness */
#define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
	struct dentry *(*mount) (struct file_system_type *, int,
		       const char *, void *);
	void (*kill_sb) (struct super_block *);
	struct module *owner;
	struct file_system_type * next;
	struct hlist_head fs_supers;

	struct lock_class_key s_lock_key;
	struct lock_class_key s_umount_key;
	struct lock_class_key s_vfs_rename_key;
	struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];

	struct lock_class_key i_lock_key;
	struct lock_class_key i_mutex_key;
	struct lock_class_key i_mutex_dir_key;
};

当使用mount挂载一个文件系统时,内部会调用对应的file_system_type的mount函数来挂载该文件系统。mount函数调用后会返回一个dentry,这个dentry会对应着一个super_block。mount方法会设置super_block的s_op指针,指向具体的实现。  

struct file_system_type *get_fs_type(const char *name)函数会根据文件系统类型的名称,遍历已注册的文件系统类型,返回对应的file_system_type对象。

在系统启动阶段,内核直接挂载根文件系统,其他文件系统由启动脚本或用户执行命令挂载到根文件系统。

设备文件

上面介绍的都是VFS的传统作用,对应的也是什么磁盘文件系统Ext2, Ext3之类的。可我们都知道,在Unix/Linux里一切都是文件(当然除了网络除外)。你要操作串口设备?打开一个设备文件,读取啊,写入啊就像操作普通文件一样:open, read, write等。这又是怎么办到的呢?这还是要归功VFS的抽象能力。VFS很好的隐藏了这些设备之间的差异。

设备文件分为两种类型

块设备(block) 可以随机寻址,传输一个数据块的时间基本一样。典型的块设备有硬盘,软盘,光盘等。

字符设备(character) 字符设备有的不可以随机寻址(比如声卡);有的可以随机寻址,但随机访问一块数据的时间依赖数据在设备中的位置(比如磁带)。

设备文件和普通文件不同的是,它只是存储在文件系统中。比如它的inode没有指向磁盘上数据的指针,但是它包含一个标识设备的标识符。一般这个标识符由这些数据祖成:类型(块或字符),一对数字。第一个是主数字,确定设备类型。一般,具有相同类型和相同主数字的设备文件具有相同的文件操作函数(想想VFS是怎么抽象的)。第二个数字是次数字,用来区分主数字相同的一组设备。比如,由相同的磁盘控制器控制的磁盘主数字是相同的,但次数字不同。

mknod系统调用可以用来创建设备文件:

SYSCALL_DEFINE3(mknod, const char __user *, filename, umode_t, mode, unsigned, dev)
{
	return sys_mknodat(AT_FDCWD, filename, mode, dev);
}

设备文件一般位于/dev目录(擦,我以前一直以为这个目录是developer的意思)。要注意一点是,字符设备和块设备之间的主次号是分开的。设备文件也不一定就对应着真实的物理设备。比如/dev/null就对应着一个黑洞,向这个文件写东西会直接丢弃。对于内核来说文件的名称没有什么意义。