EXT2 Notes

| Category: Kernel  | Tags: ext2  filesystem 

Some notes on the Ext2 filesystem.

1 Specification

  • Blocks

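A disk or file is divided into blocks. The block size is fixed when the filesystem is created and may be 1024, 2048, 4096, or 8192 bytes. A smaller block size wastes less space per file, but it also means the kernel has more blocks to manage and therefore spends more resources on bookkeeping and accounting.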
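To make the trade-off concrete, here is a small sketch (not taken from the kernel; blocks_needed and slack_bytes are names made up for this note) that computes how many blocks a file occupies and how many bytes stay unused in its last block:

```c
#include <assert.h>
#include <stdint.h>

/* Number of blocks needed to hold file_size bytes of data. */
static uint64_t blocks_needed(uint64_t file_size, uint32_t block_size)
{
    assert(block_size != 0);
    return (file_size + block_size - 1) / block_size;
}

/* Bytes left unused in the last block of the file (the "slack"). */
static uint64_t slack_bytes(uint64_t file_size, uint32_t block_size)
{
    return blocks_needed(file_size, block_size) * block_size - file_size;
}
```

For a 5000-byte file, 1024-byte blocks give 5 blocks with 120 bytes of slack, while 4096-byte blocks give 2 blocks with 3192 bytes of slack.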

  • Block Groups
    • Blocks are clustered into block groups in order to reduce fragmentation and minimise the amount of head seeking when reading a large amount of consecutive data.
    • Information about each block group is kept in a descriptor table stored in the block(s) immediately after the superblock.
    • Two blocks near the start of each group are reserved for the block usage bitmap and the inode usage bitmap which show which blocks and inodes are in use.
    • The block(s) following the bitmaps in each block group are designated as the inode table for that block group and the remainder are the data blocks.
    • The block allocation algorithm attempts to allocate data blocks in the same block group as the inode which contains them.

    block1.png
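The layout rules above translate into a little arithmetic. The sketch below is only an illustration: struct fs_geometry and the three helpers are hypothetical names, and the fields are assumed to carry the same meaning as the corresponding superblock values. It assumes each bitmap occupies exactly one block, as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bundle of the values needed for the calculation. */
struct fs_geometry {
    uint32_t blocks_count;      /* total number of blocks in the filesystem */
    uint32_t first_data_block;  /* block holding the superblock: 1 for 1 KiB blocks, else 0 */
    uint32_t blocks_per_group;  /* blocks in each block group */
    uint32_t block_size;        /* bytes per block */
};

/* Number of block groups: the blocks after first_data_block, rounded up. */
static uint32_t group_count(const struct fs_geometry *g)
{
    assert(g->blocks_per_group != 0);
    uint32_t usable = g->blocks_count - g->first_data_block;
    return (usable + g->blocks_per_group - 1) / g->blocks_per_group;
}

/* A one-block bitmap has block_size * 8 bits, which caps the group size. */
static uint32_t max_blocks_per_group(uint32_t block_size)
{
    return block_size * 8;
}

/* The descriptor table starts in the block right after the superblock's block. */
static uint32_t descriptor_table_block(const struct fs_geometry *g)
{
    return g->first_data_block + 1;
}
```

With 1 KiB blocks and 8192 blocks per group, a filesystem of 8193 blocks has a single group and its descriptor table starts in block 2.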

  • Superblock

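    The superblock holds all of the filesystem's configuration information. It is stored at an offset of 1024 bytes from the start of the device and is read by the kernel when the filesystem is mounted. Backup copies of the superblock are also kept: they can be stored in every block group or, alternatively, only in groups 0 and 1 and in groups whose number is a power of 3, 5, or 7.

    The data stored in the superblock includes:

    • the number of inodes and blocks in the filesystem, and how many of them are in use
    • the number of inodes and blocks in each block group
    • when the filesystem was last mounted and last modified
    • when the filesystem was created, and its version information.

    All superblock fields are written to disk in little-endian format.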
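The backup placement rule can be sketched as a predicate. This is a simplified illustration rather than the kernel's code; group_has_sb_backup and is_power_of are made-up names, and the sparse_super flag stands for "backups only in groups 0, 1 and powers of 3, 5, 7".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if n equals base^k for some k >= 1. */
static bool is_power_of(uint32_t n, uint32_t base)
{
    assert(base > 1);
    uint64_t p = base;          /* 64-bit so the multiplication cannot wrap */
    while (p < n)
        p *= base;
    return p == n;
}

/* Does this block group carry a copy of the superblock? */
static bool group_has_sb_backup(uint32_t group, bool sparse_super)
{
    if (!sparse_super || group <= 1)
        return true;            /* every group, or groups 0 and 1 */
    return is_power_of(group, 3) ||
           is_power_of(group, 5) ||
           is_power_of(group, 7);
}
```

Under the sparse layout, groups 0, 1, 3, 5, 7, 9, 25, 27, 49, ... hold a copy, while groups such as 2, 4, 6 or 10 do not.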

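  • Inodes (Index Nodes)

    The inode is a fundamental concept in Ext2: every object in the filesystem is represented by an inode. The inode holds pointers to the blocks that contain the object's data, together with all of the object's metadata except its name.

  • Directories

    Like a file, a directory is a filesystem object with an inode. The difference is that a directory is a specially formatted file whose contents are the names of the files belonging to that directory and the inode number that goes with each name.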
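A rough sketch of how a name is resolved inside one directory block. The record layout used here (4-byte inode number, 2-byte record length, 1-byte name length, 1-byte file type, then the name) is the ext2 on-disk directory entry format, which these notes do not spell out; dir_block_lookup and the get_le helpers are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { DIRENT_HEADER_LEN = 8 };   /* inode(4) + rec_len(2) + name_len(1) + file_type(1) */

static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return the inode number stored for `name` in one directory block, or 0. */
static uint32_t dir_block_lookup(const uint8_t *block, size_t block_size,
                                 const char *name)
{
    assert(block != NULL && name != NULL);
    size_t want = strlen(name);
    size_t off = 0;

    while (off + DIRENT_HEADER_LEN <= block_size) {
        const uint8_t *de = block + off;
        uint32_t ino = get_le32(de);           /* 0 marks an unused entry */
        uint16_t rec_len = get_le16(de + 4);   /* distance to the next entry */
        uint8_t name_len = de[6];

        /* Stop on a corrupt record instead of walking off the block. */
        if (rec_len < DIRENT_HEADER_LEN || off + rec_len > block_size)
            break;
        if (name_len > rec_len - DIRENT_HEADER_LEN)
            break;

        if (ino != 0 && name_len == want &&
            memcmp(de + DIRENT_HEADER_LEN, name, want) == 0)
            return ino;
        off += rec_len;
    }
    return 0;
}
```

Names are not NUL-terminated on disk, which is why the comparison uses name_len and memcmp rather than strcmp.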

  • Special files
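    • Symbolic links: if the target of a symbolic link is shorter than 60 bytes, it is stored in the inode itself, in the space normally used for block pointers, so no extra data block is wasted.
    • Character and block devices: like short symbolic links, these have no data blocks of their own; their information is kept directly in the inode.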
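The 60-byte limit follows from the size of the block pointer array: 15 pointers of 4 bytes each. A minimal sketch of the check, with made-up names (symlink_fits_in_inode, INLINE_BYTES), might look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum {
    N_BLOCK_POINTERS = 15,                  /* entries in the inode's block pointer array */
    INLINE_BYTES = N_BLOCK_POINTERS * 4     /* 60 bytes of space inside the inode */
};

/* True if the link target (plus its terminating NUL) fits inside the inode. */
static bool symlink_fits_in_inode(const char *target)
{
    assert(target != NULL);
    return strlen(target) < INLINE_BYTES;
}
```

A 59-character target fits in the inode; a 60-character target needs a data block.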

2 Module Initialization

    static int __init init_ext2_fs(void)
    {
        int err = init_ext2_xattr();
        if (err)
            return err;
        err = init_inodecache();
        if (err)
            goto out1;
        err = register_filesystem(&ext2_fs_type);
        if (err)
            goto out;
        return 0;
    out:
        destroy_inodecache();
    out1:
        exit_ext2_xattr();
        return err;
    }

After the xattr cache and the inode cache have been initialized, ext2_fs_type is registered with the kernel.

fs_type.png

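    static struct file_system_type ext2_fs_type = {
        .owner      = THIS_MODULE,
        .name       = "ext2",
        .mount      = ext2_mount,
        .kill_sb    = kill_block_super,
        .fs_flags   = FS_REQUIRES_DEV,
    };

That is all there is to initialization: the function returns and the filesystem quietly waits to be triggered. What triggers it, and when? A mount.

3 Mount

3.1 From the system call to the registered function pointer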

    MOUNT(2)                               Linux Programmer's Manual                              MOUNT(2)

    NAME
           mount - mount filesystem

    SYNOPSIS
           #include <sys/mount.h>

           int mount(const char *source, const char *target,
                     const char *filesystemtype, unsigned long mountflags,
                     const void *data);

    DESCRIPTION
           mount() attaches the filesystem specified by source (which is often a device name, but can also
           be a directory name or a dummy) to the directory specified by target.

           Appropriate privilege (Linux: the CAP_SYS_ADMIN capability) is required to mount filesystems.

    SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
                    char __user *, type, unsigned long, flags, void __user *, data)
    {
        // ...
        ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
                       (void *) data_page);
        // ...
        return ret;
    }

mount_flow.png

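do_new_mount looks up the filesystem type in the list of registered filesystems and finds the ext2_fs_type registered earlier, then uses vfs_kern_mount to mount the filesystem. vfs_kern_mount allocates a new mount object and then, through mount_fs(), calls the function pointer stored in ext2_fs_type.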

vfs_mount.png
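The register-then-look-up-by-name pattern can be sketched in user space as follows. This is a toy model, not the kernel implementation: struct fs_type, register_fs, find_fs, do_mount_sketch and the demo filesystem are all hypothetical names, and the mount callback is reduced to a single argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a filesystem type: a name plus a mount callback. */
struct fs_type {
    const char *name;
    int (*mount)(const char *dev_name);
    struct fs_type *next;
};

static struct fs_type *fs_types;        /* head of the registered list */

/* Add a type to the list; returns -1 if the name is already taken. */
static int register_fs(struct fs_type *fs)
{
    assert(fs != NULL && fs->name != NULL);
    for (struct fs_type *p = fs_types; p != NULL; p = p->next)
        if (strcmp(p->name, fs->name) == 0)
            return -1;
    fs->next = fs_types;
    fs_types = fs;
    return 0;
}

/* Find a registered type by name, or NULL. */
static struct fs_type *find_fs(const char *name)
{
    for (struct fs_type *p = fs_types; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Resolve the type by name, then call through its function pointer. */
static int do_mount_sketch(const char *dev_name, const char *type)
{
    struct fs_type *fs = find_fs(type);
    if (fs == NULL)
        return -1;
    return fs->mount(dev_name);
}

/* A demo filesystem that only records which device it was asked to mount. */
static const char *last_mounted;

static int demo_mount(const char *dev_name)
{
    last_mounted = dev_name;
    return 0;
}

static struct fs_type demo_fs = { .name = "demo", .mount = demo_mount };
```

The caller never names demo_mount directly; it only passes the string "demo", just as mount(2) passes "ext2" and the kernel ends up in ext2_mount.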

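3.2 Mounting an ext2 filesystem

    static struct dentry *ext2_mount(struct file_system_type *fs_type,
                                     int flags, const char *dev_name, void *data)
    {
        return mount_bdev(fs_type, flags, dev_name, data, ext2_fill_super);
    }

This function simply calls mount_bdev to mount the filesystem.

3.3 mount_bdev

mount_bdev is a generic helper provided by the filesystem layer. It first calls blkdev_get_by_path to obtain the block device for the given disk (or partition) path, taking a reference on that block device. It then tries to create a new super_block object and fills in some related information. super_block is defined as follows: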

    struct super_block {
        struct list_head    s_list;     /* Keep this first */
        dev_t           s_dev;      /* search index; _not_ kdev_t */
        unsigned char       s_blocksize_bits;
        unsigned long       s_blocksize;
        loff_t          s_maxbytes; /* Max file size */
        struct file_system_type *s_type;
        const struct super_operations   *s_op;
        const struct dquot_operations   *dq_op;
        const struct quotactl_ops   *s_qcop;
        const struct export_operations *s_export_op;
        unsigned long       s_flags;
        unsigned long       s_magic;
        struct dentry       *s_root;
        struct rw_semaphore s_umount;
        int         s_count;
        atomic_t        s_active;
    #ifdef CONFIG_SECURITY
        void                    *s_security;
    #endif
        const struct xattr_handler **s_xattr;

        struct list_head    s_inodes;   /* all inodes */
        struct hlist_bl_head    s_anon;     /* anonymous dentries for (nfs) exporting */
        struct list_head    s_mounts;   /* list of mounts; _not_ for fs use */
        struct block_device *s_bdev;
        struct backing_dev_info *s_bdi;
        struct mtd_info     *s_mtd;
        struct hlist_node   s_instances;
        struct quota_info   s_dquot;    /* Diskquota specific options */

        struct sb_writers   s_writers;

        char s_id[32];              /* Informational name */
        u8 s_uuid[16];              /* UUID */

        void            *s_fs_info; /* Filesystem private info */
        unsigned int        s_max_links;
        fmode_t         s_mode;

        /* Granularity of c/m/atime in ns.
           Cannot be worse than a second */
        u32        s_time_gran;

        /*
         * The next field is for VFS *only*. No filesystems have any business
         * even looking at it. You had been warned.
         */
        struct mutex s_vfs_rename_mutex;    /* Kludge */

        /*
         * Filesystem subtype.  If non-empty the filesystem type field
         * in /proc/mounts will be "type.subtype"
         */
        char *s_subtype;

        /*
         * Saved mount options for lazy filesystems using
         * generic_show_options()
         */
        char __rcu *s_options;
        const struct dentry_operations *s_d_op; /* default d_op for dentries */

        /*
         * Saved pool identifier for cleancache (-1 means none)
         */
        int cleancache_poolid;

        struct shrinker s_shrink;   /* per-sb shrinker handle */

        /* Number of inodes with nlink == 0 but still referenced */
        atomic_long_t s_remove_count;

        /* Being remounted read-only */
        int s_readonly_remount;

        /* AIO completions deferred from interrupt context */
        struct workqueue_struct *s_dio_done_wq;

        /*
         * Keep the lru lists last in the structure so they always sit on their
         * own individual cachelines.
         */
        struct list_lru     s_dentry_lru ____cacheline_aligned_in_smp;
        struct list_lru     s_inode_lru ____cacheline_aligned_in_smp;
        struct rcu_head     rcu;
    };

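mount_bdev then calls the ext2_fill_super callback that was passed in, which reads and fills in the superblock information.

3.4 ext2_fill_super

3.4.1 super_block, ext2_sb_info and ext2_super_block

After initializing a few data structures, ext2_fill_super reads a fixed amount of data from the given block device. This data is the ext2_super_block as stored on disk; its contents were outlined earlier. It is worth noting that all of its fields are stored little-endian, so they must be converted when the persistent data is turned into in-memory data, for example: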

    sb->s_magic = le16_to_cpu(es->s_magic);
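Outside the kernel, the same conversion can be done portably by assembling the value byte by byte. The sketch below checks a raw device image for the ext2 magic number; the offset of s_magic within the superblock (56) and the magic value 0xEF53 come from the ext2 on-disk format, while le16_decode and looks_like_ext2 are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    SB_OFFSET       = 1024,   /* superblock starts 1024 bytes into the device */
    SB_MAGIC_OFFSET = 56,     /* position of s_magic inside the superblock */
    EXT2_MAGIC      = 0xEF53
};

/* Decode a little-endian 16-bit value regardless of host byte order. */
static uint16_t le16_decode(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* image: the first `len` bytes of a device or image file. */
static bool looks_like_ext2(const uint8_t *image, size_t len)
{
    assert(image != NULL);
    if (len < SB_OFFSET + SB_MAGIC_OFFSET + 2)
        return false;         /* too short to contain the magic field */
    return le16_decode(image + SB_OFFSET + SB_MAGIC_OFFSET) == EXT2_MAGIC;
}
```

On disk the magic appears as the byte 0x53 followed by 0xEF, which is exactly why a plain memory read would give the wrong value on a big-endian host.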

In memory, the superblock contents are kept in an ext2_sb_info, which is itself a private member of the super_block (its s_fs_info field).

ext2_sb.png

Once the basic information has been read and converted, ext2_fill_super sets up several function pointers for operating on the super_block; these functions will be used by the VFS.

    static const struct super_operations ext2_sops = {
        .alloc_inode    = ext2_alloc_inode,
        .destroy_inode  = ext2_destroy_inode,
        .write_inode    = ext2_write_inode,
        .evict_inode    = ext2_evict_inode,
        .put_super      = ext2_put_super,
        .sync_fs        = ext2_sync_fs,
        .freeze_fs      = ext2_freeze,
        .unfreeze_fs    = ext2_unfreeze,
        .statfs         = ext2_statfs,
        .remount_fs     = ext2_remount,
        .show_options   = ext2_show_options,
    #ifdef CONFIG_QUOTA
        .quota_read     = ext2_quota_read,
        .quota_write    = ext2_quota_write,
    #endif
    };

    static const struct export_operations ext2_export_ops = {
        .fh_to_dentry = ext2_fh_to_dentry,
        .fh_to_parent = ext2_fh_to_parent,
        .get_parent = ext2_get_parent,
    };

    static int ext2_fill_super(struct super_block *sb, void *data, int silent)
    {
        // ...
        /*
         * set up enough so that it can read an inode
         */
        sb->s_op = &ext2_sops;
        sb->s_export_op = &ext2_export_ops;
        sb->s_xattr = ext2_xattr_handlers;
        // ...
    }

After that, it goes on to read the root inode.

3.4.2 root inode

The ext2 filesystem predefines a few special inode numbers:

    /*
     * Special inode numbers
     */
    #define EXT2_BAD_INO         1  /* Bad blocks inode */
    #define EXT2_ROOT_INO        2  /* Root inode */
    #define EXT2_BOOT_LOADER_INO     5  /* Boot loader inode */
    #define EXT2_UNDEL_DIR_INO   6  /* Undelete directory inode */

    /* First non-reserved inode for old ext2 filesystems */
    #define EXT2_GOOD_OLD_FIRST_INO 11

The function ext2_iget reads a given inode from a given super_block. If that inode has not been loaded before, it calls the function pointers set up earlier to create a new one: here sb->s_op->alloc_inode (that is, ext2_alloc_inode) creates and initializes the new inode.

What ext2_alloc_inode actually allocates is an ext2_inode_info; the VFS inode is a member variable embedded in ext2_inode_info.
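Because the generic inode is embedded in the filesystem-specific structure, code that is handed only the generic pointer can get back to the enclosing structure by subtracting the member's offset. A user-space sketch of this idea follows; all type and macro names here are stand-ins invented for the example, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the generic VFS inode. */
struct vfs_inode_sketch {
    unsigned long i_ino;
};

/* Stand-in for the filesystem-specific inode that embeds the generic one. */
struct fs_inode_info_sketch {
    uint32_t i_data[15];                 /* filesystem-private data */
    struct vfs_inode_sketch vfs_inode;   /* embedded generic inode */
};

/* Recover a pointer to the enclosing structure from a pointer to a member. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct fs_inode_info_sketch *info_of(struct vfs_inode_sketch *inode)
{
    assert(inode != NULL);
    return container_of_sketch(inode, struct fs_inode_info_sketch, vfs_inode);
}
```

The generic layer only ever sees the embedded member, yet the filesystem can reach its private fields without storing an extra back pointer.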

Both ext2_inode_info and inode are in-memory representations of an inode. On disk, an ext2 inode has yet another representation: ext2_inode, also called the raw inode:

    /*
     * Structure of an inode on the disk
     */
    struct ext2_inode {
        __le16  i_mode;     /* File mode */
        __le16  i_uid;      /* Low 16 bits of Owner Uid */
        __le32  i_size;     /* Size in bytes */
        __le32  i_atime;    /* Access time */
        __le32  i_ctime;    /* Creation time */
        __le32  i_mtime;    /* Modification time */
        __le32  i_dtime;    /* Deletion Time */
        __le16  i_gid;      /* Low 16 bits of Group Id */
        __le16  i_links_count;  /* Links count */
        __le32  i_blocks;   /* Blocks count */
        __le32  i_flags;    /* File flags */
        union {
            struct {
                __le32  l_i_reserved1;
            } linux1;
            struct {
                __le32  h_i_translator;
            } hurd1;
            struct {
                __le32  m_i_reserved1;
            } masix1;
        } osd1;             /* OS dependent 1 */
        __le32  i_block[EXT2_N_BLOCKS];/* Pointers to blocks */
        __le32  i_generation;   /* File version (for NFS) */
        __le32  i_file_acl; /* File ACL */
        __le32  i_dir_acl;  /* Directory ACL */
        __le32  i_faddr;    /* Fragment address */
        union {
            struct {
                __u8    l_i_frag;   /* Fragment number */
                __u8    l_i_fsize;  /* Fragment size */
                __u16   i_pad1;
                __le16  l_i_uid_high;   /* these 2 fields    */
                __le16  l_i_gid_high;   /* were reserved2[0] */
                __u32   l_i_reserved2;
            } linux2;
            struct {
                __u8    h_i_frag;   /* Fragment number */
                __u8    h_i_fsize;  /* Fragment size */
                __le16  h_i_mode_high;
                __le16  h_i_uid_high;
                __le16  h_i_gid_high;
                __le32  h_i_author;
            } hurd2;
            struct {
                __u8    m_i_frag;   /* Fragment number */
                __u8    m_i_fsize;  /* Fragment size */
                __u16   m_pad1;
                __u32   m_i_reserved2[2];
            } masix2;
        } osd2;             /* OS dependent 2 */
    };

ext2_get_inode() is used to read the raw inode. The relationship among the three:

ext2_inode.png
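Finding a raw inode on disk comes down to arithmetic on the inode number, since inode numbers start at 1 and each block group holds a fixed number of inodes in its inode table. The sketch below shows that calculation in isolation; struct inode_location and locate_inode are hypothetical names, and looking up where the group's inode table starts (from the group descriptor) is left out.

```c
#include <assert.h>
#include <stdint.h>

struct inode_location {
    uint32_t group;           /* block group that owns the inode */
    uint32_t block_in_table;  /* block index relative to the start of the group's inode table */
    uint32_t offset;          /* byte offset of the raw inode within that block */
};

static struct inode_location locate_inode(uint32_t ino,
                                          uint32_t inodes_per_group,
                                          uint32_t inode_size,
                                          uint32_t block_size)
{
    assert(ino >= 1 && inodes_per_group != 0 && block_size != 0);

    uint32_t index = (ino - 1) % inodes_per_group;  /* slot within the group */
    uint64_t byte = (uint64_t)index * inode_size;   /* byte offset into the table */

    struct inode_location loc = {
        .group = (ino - 1) / inodes_per_group,
        .block_in_table = (uint32_t)(byte / block_size),
        .offset = (uint32_t)(byte % block_size),
    };
    return loc;
}
```

With 2048 inodes per group, 128-byte inodes and 1024-byte blocks, the root inode (number 2) sits in group 0, in the first block of the inode table, at byte offset 128.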

Besides allocating and reading the inode, ext2_iget also assigns, according to the inode type, the function pointers that later operations on the inode will go through:

  1: struct inode_operations {
  2:     struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
  3:     void * (*follow_link) (struct dentry *, struct nameidata *);
  4:     int (*permission) (struct inode *, int);
  5:     struct posix_acl * (*get_acl)(struct inode *, int);
  6: 
  7:     int (*readlink) (struct dentry *, char __user *,int);
  8:     void (*put_link) (struct dentry *, struct nameidata *, void *);
  9: 
 10:     int (*create) (struct inode *,struct dentry *, umode_t, bool);
 11:     int (*link) (struct dentry *,struct inode *,struct dentry *);
 12:     int (*unlink) (struct inode *,struct dentry *);
 13:     int (*symlink) (struct inode *,struct dentry *,const char *);
 14:     int (*mkdir) (struct inode *,struct dentry *,umode_t);
 15:     int (*rmdir) (struct inode *,struct dentry *);
 16:     int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
 17:     int (*rename) (struct inode *, struct dentry *,
 18:                    struct inode *, struct dentry *);
 19:     int (*setattr) (struct dentry *, struct iattr *);
 20:     int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
 21:     int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
 22:     ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
 23:     ssize_t (*listxattr) (struct dentry *, char *, size_t);
 24:     int (*removexattr) (struct dentry *, const char *);
 25:     int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 26:                   u64 len);
 27:     int (*update_time)(struct inode *, struct timespec *, int);
 28:     int (*atomic_open)(struct inode *, struct dentry *,
 29:                        struct file *, unsigned open_flag,
 30:                        umode_t create_mode, int *opened);
 31:     int (*tmpfile) (struct inode *, struct dentry *, umode_t);
 32:     int (*set_acl)(struct inode *, struct posix_acl *, int);
 33: };
 34: 
 35: struct file_operations {
 36:     struct module *owner;
 37:     loff_t (*llseek) (struct file *, loff_t, int);
 38:     ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
 39:     ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 40:     ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 41:     ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 42:     int (*iterate) (struct file *, struct dir_context *);
 43:     unsigned int (*poll) (struct file *, struct poll_table_struct *);
 44:     long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 45:     long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 46:     int (*mmap) (struct file *, struct vm_area_struct *);
 47:     int (*open) (struct inode *, struct file *);
 48:     int (*flush) (struct file *, fl_owner_t id);
 49:     int (*release) (struct inode *, struct file *);
 50:     int (*fsync) (struct file *, loff_t, loff_t, int datasync);
 51:     int (*aio_fsync) (struct kiocb *, int datasync);
 52:     int (*fasync) (int, struct file *, int);
 53:     int (*lock) (struct file *, int, struct file_lock *);
 54:     ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
 55:     unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 56:     int (*check_flags)(int);
 57:     int (*flock) (struct file *, int, struct file_lock *);
 58:     ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 59:     ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 60:     int (*setlease)(struct file *, long, struct file_lock **);
 61:     long (*fallocate)(struct file *file, int mode, loff_t offset,
 62:                       loff_t len);
 63:     int (*show_fdinfo)(struct seq_file *m, struct file *f);
 64: };
 65: 
 66: struct inode {
 67:     // ...
 68:     const struct inode_operations   *i_op;
 69:     // ...
 70:     const struct file_operations    *i_fop; /* former ->i_op->default_file_ops */
 71:     // ...
 72: };
 73: 
 74: extern const struct inode_operations ext2_file_inode_operations;
 75: extern const struct file_operations ext2_file_operations;
 76: extern const struct file_operations ext2_xip_file_operations;
 77: 
 78: struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 79: {
 80: 
 81:     // ...
 82:     if (S_ISREG(inode->i_mode)) {
 83:         inode->i_op = &ext2_file_inode_operations;
 84:         if (ext2_use_xip(inode->i_sb)) {
 85:             inode->i_mapping->a_ops = &ext2_aops_xip;
 86:             inode->i_fop = &ext2_xip_file_operations;
 87:         } else if (test_opt(inode->i_sb, NOBH)) {
 88:             inode->i_mapping->a_ops = &ext2_nobh_aops;
 89:             inode->i_fop = &ext2_file_operations;
 90:         } else {
 91:             inode->i_mapping->a_ops = &ext2_aops;
 92:             inode->i_fop = &ext2_file_operations;
 93:         }
 94:     } else if (S_ISDIR(inode->i_mode)) {
 95:         inode->i_op = &ext2_dir_inode_operations;
 96:         inode->i_fop = &ext2_dir_operations;
 97:         if (test_opt(inode->i_sb, NOBH))
 98:             inode->i_mapping->a_ops = &ext2_nobh_aops;
 99:         else
100:             inode->i_mapping->a_ops = &ext2_aops;
101:     }
102:     //...
103: }

Once the root inode has been read, a dentry is created for it and the super_block information is written back to disk, which completes the initialization of the super_block.

4 Open

4.1 Open Flow

ext2_open_flow.png

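inode->i_op->create is a function pointer. For the ext2 filesystem it points to ext2_dir_inode_operations.create (see line 95 of the ext2_iget listing above, where that table is assigned).

4.2 ext2_create (fs/ext2/namei.c)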

