June 22, 2011

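I have been reading some kernel fs code lately. Using a real example, this post briefly walks through the on-disk structure of an ext2 file system; I hope it is useful.

The structures involved are mainly:

Superblock    struct ext2_super_block { }

Group descriptor    struct ext2_group_desc { }

Inode    struct ext2_inode { }

Directory entry    struct ext2_dir_entry_2 { }

1. Preparation

For this analysis I formatted an ext2 file system of about 100 MB with a block size of 1024 bytes.

Here is the main information about the partition: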

debugfs:   stats
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          9c0e702c-f80e-4382-a95d-444fafaab34c
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      resize_inode filetype sparse_super
Default mount options:    (none)
Filesystem state:         not clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              26104
Block count:              104296
Reserved block count:     5214
Free blocks:              99442
Free inodes:              26091
First block:              1
Block size:               1024
Fragment size:            1024
Reserved GDT blocks:      256
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         2008
Inode blocks per group:   251
Filesystem created:       Sun Feb 13 22:36:23 2011
Last mount time:          Mon Feb 14 04:49:41 2011
Last write time:          Mon Feb 14 04:49:41 2011
Mount count:              2
Maximum mount count:      34
Last checked:             Sun Feb 13 22:36:23 2011
Check interval:           15552000 (6 months)
Next check after:         Fri Aug 12 22:36:23 2011
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Default directory hash:   tea
Directory Hash Seed:      de62f19c-eaa2-4cb7-a2c6-84fca69baafd
Directories:              2
Group  0: block bitmap at 259, inode bitmap at 260, inode table at 261
           7665 free blocks, 1995 free inodes, 2 used directories

Group  1: block bitmap at 8451, inode bitmap at 8452, inode table at 8453
           7681 free blocks, 2008 free inodes, 0 used directories
Group  2: block bitmap at 16385, inode bitmap at 16386, inode table at 16387
           7939 free blocks, 2008 free inodes, 0 used directories
Group  3: block bitmap at 24835, inode bitmap at 24836, inode table at 24837
           7681 free blocks, 2008 free inodes, 0 used directories
Group  4: block bitmap at 32769, inode bitmap at 32770, inode table at 32771
           7939 free blocks, 2008 free inodes, 0 used directories
Group  5: block bitmap at 41219, inode bitmap at 41220, inode table at 41221
           7681 free blocks, 2008 free inodes, 0 used directories
Group  6: block bitmap at 49153, inode bitmap at 49154, inode table at 49155
           7939 free blocks, 2008 free inodes, 0 used directories
Group  7: block bitmap at 57603, inode bitmap at 57604, inode table at 57605
           7681 free blocks, 2008 free inodes, 0 used directories
Group  8: block bitmap at 65537, inode bitmap at 65538, inode table at 65539
           7939 free blocks, 2008 free inodes, 0 used directories
Group  9: block bitmap at 73987, inode bitmap at 73988, inode table at 73989
           7681 free blocks, 2008 free inodes, 0 used directories
Group 10: block bitmap at 81921, inode bitmap at 81922, inode table at 81923
           7939 free blocks, 2008 free inodes, 0 used directories
Group 11: block bitmap at 90113, inode bitmap at 90114, inode table at 90115
           7939 free blocks, 2008 free inodes, 0 used directories
Group 12: block bitmap at 98305, inode bitmap at 98306, inode table at 98307
           5738 free blocks, 2008 free inodes, 0 used directories

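2. Inspecting the super block

Start with the superblock. The first 1024 bytes of the partition are reserved for the boot block, so the superblock starts at byte offset 1024.
The inode count and the other fields all match the debugfs output (note that the values are little endian).
-------------------------------------------------------------
struct ext2_super_block {
     __u32     s_inodes_count;          /* Inodes count */ //   f8 65 00 00 = 26104
     __u32     s_blocks_count;          /* Blocks count */  //   68 97 01 00 = 104296
......
    __u16 s_magic;  /* Magic signature */  // 53 ef on disk = 0xEF53
......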
 
[root@ms3003 ~]# dd if=/dev/hdb1 bs=1 count=1024 skip=1024 | od -t x1 -Ax
1024+0 records in
1024+0 records out
000000 f8 65 00 00 68 97 01 00 5e 14 00 00 72 84 01 00
000010 eb 65 00 00 01 00 00 00 00 00 00 00 00 00 00 00
000020 00 20 00 00 00 20 00 00 d8 07 00 00 e5 43 58 4d
000030 e5 43 58 4d 02 00 22 00 53 ef 00 00 01 00 00 00
000040 67 ec 57 4d 00 4e ed 00 00 00 00 00 01 00 00 00
000050 00 00 00 00 0b 00 00 00 80 00 00 00 10 00 00 00
000060 02 00 00 00 01 00 00 00 9c 0e 70 2c f8 0e 43 82
000070 a9 5d 44 4f af aa b3 4c 00 00 00 00 00 00 00 00
000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
......
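The byte-to-field mapping above can also be checked mechanically. Here is a minimal Python sketch (Python is my choice for the examples, not something the original analysis used) that unpacks the first 64 bytes of the od dump with the standard `struct` module. The field order follows `struct ext2_super_block` in the kernel headers; `SUPERBLOCK_HEAD`, `FIELDS` and `sb` are names made up for this example, and the two signed kernel fields are read as unsigned here because their values are small.

```python
import struct

# First 64 bytes of the superblock, copied from the od dump above.
SUPERBLOCK_HEAD = bytes.fromhex(
    "f8650000 68970100 5e140000 72840100"
    "eb650000 01000000 00000000 00000000"
    "00200000 00200000 d8070000 e543584d"
    "e543584d 02002200 53ef0000 01000000"
)

# 13 little-endian 32-bit fields followed by 6 16-bit fields (64 bytes total).
FIELDS = (
    "s_inodes_count", "s_blocks_count", "s_r_blocks_count",
    "s_free_blocks_count", "s_free_inodes_count", "s_first_data_block",
    "s_log_block_size", "s_log_frag_size", "s_blocks_per_group",
    "s_frags_per_group", "s_inodes_per_group", "s_mtime", "s_wtime",
    "s_mnt_count", "s_max_mnt_count", "s_magic", "s_state", "s_errors",
    "s_minor_rev_level",
)
sb = dict(zip(FIELDS, struct.unpack("<13I6H", SUPERBLOCK_HEAD)))

# The block size is stored as a shift count relative to 1024 bytes.
block_size = 1024 << sb["s_log_block_size"]
print(hex(sb["s_magic"]), sb["s_inodes_count"], sb["s_blocks_count"], block_size)
```

The "<" in the format string forces little-endian decoding without padding, which is what the on-disk layout needs regardless of the machine running the script.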

3. Next, the group descriptors

Now look at the group descriptors, 32 bytes each. We can compare the debugfs output with what is actually on disk.

/*
* Structure of a blocks group descriptor
*/
struct ext2_group_desc
{
     __u32     bg_block_bitmap;          /* Blocks bitmap block */
     __u32     bg_inode_bitmap;          /* Inodes bitmap block */
     __u32     bg_inode_table;          /* Inodes table block */
     __u16     bg_free_blocks_count;     /* Free blocks count */
     __u16     bg_free_inodes_count;     /* Free inodes count */
     __u16     bg_used_dirs_count;     /* Directories count */
     __u16     bg_pad;
     __u32     bg_reserved[3];
};


Group  0: block bitmap at 259, inode bitmap at 260, inode table at 261
           7665 free blocks, 1995 free inodes, 2 used directories
Group  1: block bitmap at 8451, inode bitmap at 8452, inode table at 8453
           7681 free blocks, 2008 free inodes, 0 used directories
......

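The on-disk values all match:
03 01 00 00 = 259
04 01 00 00 = 260
f1 1d          = 7665
cb 07           = 1995
02 00           = 2
03 21 00 00 = 8451
04 21 00 00 = 8452

The group descriptor table follows the superblock (so it starts at byte 2048). There is one 32-byte descriptor per block group (two lines of the dump per descriptor).

The disk is about 100 MB and each block group covers 8192 blocks of 1 KB (8192 KB), so there are 13 block groups. 32 * 13 = 0x1A0, so everything after offset 0x1A0 is zero.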

[root@ms3003 ~]# dd if=/dev/hdb1 bs=1 count=1024 skip=2048 | od -t x1 -Ax
1024+0 records in
1024+0 records out
000000 03 01 00 00 04 01 00 00 05 01 00 00 f1 1d cb 07
000010 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000020 03 21 00 00 04 21 00 00 05 21 00 00 01 1e d8 07
000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000040 01 40 00 00 02 40 00 00 03 40 00 00 03 1f d8 07
000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000060 03 61 00 00 04 61 00 00 05 61 00 00 01 1e d8 07
000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000080 01 80 00 00 02 80 00 00 03 80 00 00 03 1f d8 07
000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000a0 03 a1 00 00 04 a1 00 00 05 a1 00 00 01 1e d8 07
0000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000c0 01 c0 00 00 02 c0 00 00 03 c0 00 00 03 1f d8 07
0000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000e0 03 e1 00 00 04 e1 00 00 05 e1 00 00 01 1e d8 07
0000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000100 01 00 01 00 02 00 01 00 03 00 01 00 03 1f d8 07
000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000120 03 21 01 00 04 21 01 00 05 21 01 00 01 1e d8 07
000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000140 01 40 01 00 02 40 01 00 03 40 01 00 03 1f d8 07
000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000160 01 60 01 00 02 60 01 00 03 60 01 00 03 1f d8 07
000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000180 01 80 01 00 02 80 01 00 03 80 01 00 6a 16 d8 07
000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
000400
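As a cross-check of the descriptor layout and of the "13 groups, 0x1A0 bytes" arithmetic, here is a small Python sketch. It decodes the first two descriptors from the dump and derives the group count from the debugfs numbers. `GDT_HEAD` and `GROUP_DESC` are names invented for this example; the ceiling division is my own way of writing "a partial last group still counts".

```python
import struct

# The first two 32-byte descriptors, copied from the od dump above.
GDT_HEAD = bytes.fromhex(
    "03010000 04010000 05010000 f11dcb07"
    "02000000 00000000 00000000 00000000"
    "03210000 04210000 05210000 011ed807"
    "00000000 00000000 00000000 00000000"
)

# bg_block_bitmap, bg_inode_bitmap, bg_inode_table (32 bits each), then
# bg_free_blocks_count, bg_free_inodes_count, bg_used_dirs_count (16 bits each);
# bg_pad and bg_reserved (14 bytes in total) are skipped.
GROUP_DESC = struct.Struct("<3I3H14x")

descriptors = [GROUP_DESC.unpack_from(GDT_HEAD, off)
               for off in range(0, len(GDT_HEAD), GROUP_DESC.size)]

# Values taken from the debugfs stats output.
blocks_count, first_data_block, blocks_per_group = 104296, 1, 8192
# Ceiling division: a partial last group still needs a descriptor.
group_count = -(-(blocks_count - first_data_block) // blocks_per_group)
table_bytes = group_count * GROUP_DESC.size

for d in descriptors:
    print(d)
print(group_count, hex(table_bytes))
```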

4. Inspecting the root directory inode
 
The root directory of ext2 has inode number 2, so this inode is in group 0 and is the second inode in its inode table.
Group  0: block bitmap at 259, inode bitmap at 260, inode table at 261
Each inode is a struct ext2_inode of 128 bytes, so its offset on disk is 261*1024 + 128 = 267392
 
[root@ms3003 ext2]# dd if=/dev/hdb1 bs=1 count=256 skip=267392 | od -t x1 -Ax  
256+0 records in
256+0 records out
000000 ed 41 00 00 00 04 00 00 52 13 5b 4d e1 82 58 4d
000010 e1 82 58 4d 00 00 00 00 00 00 03 00 02 00 00 00
000020 00 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00
000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

i_mode = 0x41ed = 0040755 (octal), i.e. a directory with permissions 755 (drwxr-xr-x)

The first block pointer, i_block[0], is 00 02 00 00, i.e. block 512

Compare with struct ext2_inode:

struct ext2_inode {
 __u16 i_mode;  /* File mode */
 __u16 i_uid;  /* Low 16 bits of Owner Uid */
 __u32 i_size;  /* Size in bytes */

......

__u32 i_block[EXT2_N_BLOCKS];/* Pointers to blocks */ /* starts at byte offset 40 */

.....

}
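The offset calculation generalizes to any inode number. Here is a rough Python sketch, assuming the values from the debugfs output (128-byte inodes, 2008 inodes per group, 1024-byte blocks). `inode_offset` and the `INODE_TABLE_BLOCK` lookup are hypothetical helpers written for this example, and only groups 0 and 1 are filled in.

```python
import stat

BLOCK_SIZE = 1024
INODE_SIZE = 128
INODES_PER_GROUP = 2008
# Inode table start block per group, from the debugfs output (groups 0 and 1 only).
INODE_TABLE_BLOCK = {0: 261, 1: 8453}

def inode_offset(ino):
    """Byte offset of inode number `ino` on the partition (inode numbers start at 1)."""
    group, index = divmod(ino - 1, INODES_PER_GROUP)
    return INODE_TABLE_BLOCK[group] * BLOCK_SIZE + index * INODE_SIZE

print(inode_offset(2))           # root directory inode
print(stat.filemode(0x41ed))     # i_mode of the root inode as an ls-style string
```

Subtracting 1 before dividing matters because inode numbering starts at 1, which is why inode 2 sits one slot (128 bytes) into the table rather than two.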

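5. Inspecting the directory entries of the root directory

Let us see which files the root directory contains (a directory is a special kind of file) and compare them with what is on the partition.
Note that the leftmost number is the inode number.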

[root@ms3003 ext2]# ll -ia
total 23
     2 drwxr-xr-x  3 root root  1024 Feb 14 09:18 .
812001 drwxr-xr-x  4 root root  4096 Feb 13 22:36 ..
    11 drwx------  2 root root 12288 Feb 13 22:36 lost+found
    15 -rw-r--r--  1 root root    72 Feb 14 09:18 test
    12 -rw-r--r--  1 root root    69 Feb 13 22:45 test2

[root@ms3003 ext2]#  dd if=/dev/hdb1 bs=1024 count=1 skip=512 | hexdump -C
1+0 records in
1+0 records out
00000000  02 00 00 00 0c 00 01 02  2e 00 00 00 02 00 00 00  |................|
00000010  0c 00 02 02 2e 2e 00 00  0b 00 00 00 14 00 0a 02  |................|
00000020  6c 6f 73 74 2b 66 6f 75  6e 64 00 00 0c 00 00 00  |lost+found......|
00000030  10 00 05 01 74 65 73 74  32 2e 73 77 0f 00 00 00  |....test2.sw....|
00000040  c4 03 04 01 74 65 73 74  74 65 73 74 00 00 00 00  |....testtest....|
00000050  b4 03 09 01 2e 74 65 73  74 2e 73 77 70 70 00 00  |.....test.swpp..|
00000060  00 00 00 00 a0 03 05 01  74 65 73 74 7e 2e 73 77  |........test~.sw|
00000070  70 78 78 00 00 00 00 00  00 00 00 00 00 00 00 00  |pxx.............|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Compare with ext2_dir_entry_2. Note that this is a variable-length structure: each record is rec_len bytes long and the name is name_len bytes long.
struct ext2_dir_entry_2 {
     __u32     inode;               /* Inode number */
     __u16     rec_len;          /* Directory entry length */
     __u8     name_len;          /* Name length */
     __u8     file_type;
     char     name[EXT2_NAME_LEN];     /* File name */
};

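The entries at offsets 0x00 and 0x0c are the directories . and .. , both with inode number 2
The entry at 0x18 is lost+found , inode number 11 (0b 00 00 00)
The entry at 0x2c is test2     , inode number 12 (0c 00 00 00)
The entry at 0x3c is test     , inode number 15 (0f 00 00 00)

The rec_len of test is c4 03 = 964, which runs to the end of the 1024-byte block (0x3c + 964 = 1024); the names after it in the dump appear to be leftovers of deleted entries (editor swap files).

The contents of this block match the files in the ext2 root directory.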
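Walking the block by rec_len can be sketched in a few lines of Python. This is an illustration, not how the kernel does it: `walk_dir_block` is a made-up helper, and the block is rebuilt from the first 0x48 bytes of the hexdump with the remainder zero-filled (on disk that area holds stale names, which the walk never visits anyway).

```python
import struct

# Bytes 0x00-0x47 of block 512 from the hexdump above; the rest of the
# 1024-byte block is zero-filled here for brevity.
BLOCK = bytes.fromhex(
    "02000000 0c000102 2e000000 02000000"
    "0c000202 2e2e0000 0b000000 14000a02"
    "6c6f7374 2b666f75 6e640000 0c000000"
    "10000501 74657374 322e7377 0f000000"
    "c4030401 74657374"
).ljust(1024, b"\x00")

def walk_dir_block(block):
    """Yield (inode, name) for every live ext2_dir_entry_2 in one directory block."""
    off = 0
    while off + 8 <= len(block):
        inode, rec_len, name_len, _file_type = struct.unpack_from("<IHBB", block, off)
        if rec_len < 8:          # corrupt record; stop rather than loop forever
            break
        if inode != 0:           # inode 0 marks an unused entry
            yield inode, block[off + 8:off + 8 + name_len].decode("ascii")
        off += rec_len           # rec_len, not name_len, locates the next entry

for inode, name in walk_dir_block(BLOCK):
    print(inode, name)
```

Note that test2 is read as 5 bytes even though ".sw" follows it in the dump: name_len decides where the name ends, and the extra bytes are only padding inside rec_len.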

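6. Inspecting the contents of the regular file test

The inode number of the file test is 15, so let us compute where its inode lives. It is the 15th inode of group 0, so its disk offset is

261*1024 + 128*14 = 269056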

[root@ms3003 ~]# dd if=/dev/hdb1 bs=1 count=256 skip=269056 | od -t x1 -Ax      
256+0 records in
256+0 records out
000000 a4 81 00 00 48 00 00 00 df 0d 5b 4d e1 82 58 4d
000010 e1 82 58 4d 00 00 00 00 00 00 01 00 02 00 00 00
000020 00 00 00 00 00 00 00 00 09 14 00 00 00 00 00 00
000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

This is the inode: i_mode = 0x81a4 = 0100644 (octal), i.e. a regular file with permissions 644 (-rw-r--r--)

The first block number is 09 14 00 00 = 0x1409 = 5129

Now look at the contents of block 5129; they match the output of cat test

[root@ms3003 ext2]# cat test
apeonaaaaaaaaaaaaaaaaaaab
bbbbbbbbbbbbbbbbbbbb
ccccccccccccccccccc:q!
\

[root@ms3003 ~]# dd if=/dev/hdb1 bs=1024 count=1 skip=5129 | od -t x1 -aAx
1+0 records in
1+0 records out
000000 61 70 65 6f 6e 61 61 61 61 61 61 61 61 61 61 61
         a   p   e   o   n   a   a   a   a   a   a   a   a   a   a   a
000010 61 61 61 61 61 61 61 61 62 0a 62 62 62 62 62 62
         a   a   a   a   a   a   a   a   b  nl   b   b   b   b   b   b
000020 62 62 62 62 62 62 62 62 62 62 62 62 62 62 0a 63
         b   b   b   b   b   b   b   b   b   b   b   b   b   b  nl   c
000030 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63 63
         c   c   c   c   c   c   c   c   c   c   c   c   c   c   c   c
000040 63 63 3a 71 21 0a 5c 0a 00 00 00 00 00 00 00 00
         c   c   :   q   !  nl   \  nl nul nul nul nul nul nul nul nul
000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
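To tie the pieces of this section together, here is a short Python sketch that decodes the inode bytes shown earlier and recovers the same numbers: the mode string, the file size (72 bytes, as in the ll output) and the data block used in the dd command. `INODE_15` is just a name for the copied bytes; only i_block[0] is read, so this would not be enough for a file larger than one block.

```python
import stat
import struct

# First 48 bytes of inode 15, copied from the od dump in this section.
INODE_15 = bytes.fromhex(
    "a4810000 48000000 df0d5b4d e182584d"
    "e182584d 00000000 00000100 02000000"
    "00000000 00000000 09140000 00000000"
)

# i_mode, i_uid (16 bits each) and i_size (32 bits) open the structure.
i_mode, i_uid, i_size = struct.unpack_from("<HHI", INODE_15, 0)
# i_block[] starts at byte offset 40; the first entry is the first data block.
(first_block,) = struct.unpack_from("<I", INODE_15, 40)

print(stat.filemode(i_mode), i_size, first_block, first_block * 1024)
```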

 

Published with 菊子曰
posted @ 2011-06-22 00:04 加菲猫
June 12, 2011

An excellent article; sadly it is blocked here, so how could I not repost it?

Scaling Redis

When a database is limited to running on a single computer, only a certain load can be served with acceptable latency. Adding hardware to this single computer will help only so much. Doubling or tripling the load may require significantly more than twice or thrice the cost of hardware to scale up. Such an approach is expensive to scale and it eventually hits its limit. Ideally, we would start with a single inexpensive computer and, as load increases, keep adding the same inexpensive computers, resulting in a near-linear function between load and cost. Such horizontal scaling out is commonplace in today’s web applications because it provides a more predictable cost model.

Clusters of inexpensive commodity hardware have led to a disruption in the database ecosystem. What further amplifies this disruption is the falling prices of RAM and solid state storage. A number of NoSQL databases are truly leveraging this disruption. At Meshin, one of our key requirements is the lightning-fast delivery of query results. Think of Google Instant search. Showing results with “as you type” latency enables a whole new class of use cases. This led us to consider various in-memory database engines. Redis, with its simple and elegant data model and very transparent performance characteristics, came out on top. For in-memory databases, single-threaded design becomes an obvious choice, essentially removing significant overhead common to traditional database architectures. Another assumption in the Meshin design is that if the index is partially or completely lost it can be recovered by re-indexing the original data. Sure, this introduces short-term inconsistencies, but on the other hand it allows us to relax durability and further simplify the design. Such a trade-off is a perfect fit for the Redis persistence strategy.

Single-threaded design of Redis brings up another interesting issue. Scaling of a single instance of Redis is not limited just by the computer it runs on, which would amount to RAM size, but is further limited by the single core or hardware thread it will utilize. So if your load exceeds the single hardware thread or size of available RAM, you need a second instance of Redis. This requires some approach to clustering. Effective clustering is all about figuring out the right partitioning scheme. Ideally, you need to split load into identically-sized partitions. If load becomes biased towards one or more partitions you will have a bottleneck. Such a bottleneck will limit your system’s scalability or in other words will make your load-to-cost function non-linear. It is important that partitions are as independent as possible. If an operation spans a group of partitions its scalability will be limited by the number of such groups in the cluster. In extreme cases of spanning all partitions in a cluster, the scalability will be as good as running on a single partition. 

Another aspect is differentiating the load by reads and writes. If the Redis hardware thread capacity is exceeded by reads, it is easy to scale out by adding read-only replicas of the same data. It is much harder to scale out writes by replication, where trading off consistency is often required. Redis again takes a simple approach with its master-slave replication. A writable master replica asynchronously updates one or more read-only slave replicas. Right after replying to a write operation the master notifies all replicas. No acknowledgment is required by the master. This means that there is a short period of time when slave replicas may return old data. This provides write-write consistency guarantees that are as strong as without replication. However, the write-read consistency guarantees are now weaker, or “eventual”. It is important to note that replication not only helps scalability of reads, but also improves reliability when replicas are placed on different machines. For the Meshin application, eventual write-read consistency is an acceptable tradeoff for higher reliability.

With the Meshin application, we have a large number of user indexes, each of roughly equal size. Meshin keeps a sliding window index of email, Twitter, Facebook etc. messages to maintain a predictable maximum index size. This provides a great opportunity for partitioning. If each partition handles a roughly equal number of users, we have balanced storage and load requirements and at the same time have the most frequent operations directed at only one partition. One approach to choosing among N partitions is hashing a unique user identifier, for example an email address, such that the hash space is in the [0..N) range. If the quality of the hash function is sufficient, we will get a well-balanced distribution.
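A minimal Python sketch of this fixed scheme follows. The article does not say which hash function or reduction was used at this stage, so SHA1 followed by a modulo is purely an assumption for illustration, and `fixed_partition` is a made-up name.

```python
import hashlib

def fixed_partition(user_id, n_partitions):
    """Map a user identifier to a partition index in [0, n_partitions)."""
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n_partitions

print(fixed_partition("alice@example.com", 8))
```

The weakness of this form is visible in the signature: the result depends on `n_partitions`, so changing N remaps most users to a different partition.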

[Figure: hash-fixed]

Now we can place our Redis instances on M = N / K computer nodes in the cluster. K is the number of Redis instances that we can comfortably run on a single node. This number is limited by the number of cores or hardware threads. With a large number of hardware threads (8 and up) it makes sense to map each Redis instance to one thread. RAM is then equally divided between these instances. Now, knowing each user’s maximum required RAM, we can calculate how many users our cluster can accommodate. One thing to keep in mind is that if you use Redis snapshotting for persistence (which employs the OS copy-on-write mechanism), taking a snapshot may require as much as twice the amount of RAM dedicated to a partition.

It is important to note that in the case of the Meshin application, given the specific RAM and thread count per computer, the load is “memory-bound”, meaning that a partition with a full memory pool still won’t saturate the hardware thread with Redis. For some applications the load may instead become “thread-bound”. In this case the hardware thread is saturated before the memory pool is maxed out. Interestingly, in the “memory-bound” case we have underutilized CPUs and in the “thread-bound” case we may have underutilized RAM.

Great so far, but what if we keep on adding users and they start exceeding the capacity of a partition? In the “memory-bound” case they exceed the memory pool allocated for a partition and in the “thread-bound” case they fully saturate the corresponding hardware thread. In our case, the number of partitions is fixed. Also, since partitions are equally populated with same-size users, they are all about to overflow. As we add new nodes to the cluster, we may try to move some partitions from existing nodes and rebalance so that our nodes have an equal number of partitions. There are two problems with this approach. In the “thread-bound” case this won’t help at all: we can’t really utilize new cores, since partitions are single-threaded! In the “memory-bound” case the machines will have fewer Redis instances than hardware threads. This leaves those threads idle and CPUs even more underutilized! One way to fix this problem is to overprovision partitions initially, so that K is a multiple of the number of hardware threads. The multiplier is essentially the number of times we want to grow our cluster by. Another approach is to allow changing the number of partitions.

At Meshin we took an approach where we are able to add partitions over time. This is accomplished by changing the way initial partitioning is done. Instead of limiting the hash function space to [0..N) we allow it to generate 64-bit unsigned integers. These hashes are mapped to a ring that wraps from 2^64-1 to 0. We compared the homogeneity of several hash functions on a large set of email addresses used as unique user identifiers. The best homogeneity was achieved by the SHA1 hash reduced to 64 bits by XOR-ing its parts.
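One possible reading of "SHA1 reduced to 64 bits by XOR-ing its parts" is sketched below in Python. The article does not say how the 160-bit digest is split, so the 8-byte chunking (with a short 4-byte last chunk) is an assumption of mine, and `ring_position` is a hypothetical name.

```python
import hashlib

def ring_position(user_id):
    """Fold the 160-bit SHA1 digest into a 64-bit unsigned ring position."""
    digest = hashlib.sha1(user_id.encode("utf-8")).digest()   # 20 bytes
    position = 0
    for i in range(0, len(digest), 8):
        # The last chunk is only 4 bytes, so it affects the low 32 bits only.
        position ^= int.from_bytes(digest[i:i + 8], "big")
    return position

print(ring_position("alice@example.com"))
```

Unlike the modulo form, this position does not depend on the number of partitions, which is what makes adding partitions later possible.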

[Figure: hash-ring]

Each partition is placed on the ring with equal distances in between. When a user identifier is hashed we get its position on the ring and thus can find the closest partition where the user’s index resides by going clockwise.
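The clockwise search can be sketched with a sorted list and a binary search. This Python example assumes that "clockwise" means increasing position and that the first partition sits at position 0; neither detail is stated in the article, and `ring_points` and `closest_clockwise` are names invented here.

```python
import bisect

RING_SIZE = 2 ** 64

def ring_points(n_partitions):
    """Positions of n equally spaced partitions on the ring, in ascending order."""
    return [i * RING_SIZE // n_partitions for i in range(n_partitions)]

def closest_clockwise(points, position):
    """Index of the first partition at or after `position`, wrapping past 2^64-1 to 0."""
    idx = bisect.bisect_left(points, position)
    return idx % len(points)

points = ring_points(4)
print(closest_clockwise(points, 1), closest_clockwise(points, RING_SIZE - 1))
```

The modulo at the end implements the wrap-around: a position beyond the last partition point belongs to the partition at the start of the ring.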

[Figure: hash-ring-distance]

When new partitions are added they are placed in between existing partitions to maintain equal distances. We also have the flexibility to add partitions at multiple points on the ring if partitions are of variable size, so that more users’ indexes are allocated in larger partitions. Now, when looking up an existing user’s index you may hit a newly added partition and miss the old partition where the index was originally allocated. One way to address this problem is to move the corresponding user indexes to new partitions or rebalance the whole hash ring. We decided to employ simple routing instead. If a partition doesn’t have the user index, the lookup proceeds to the next partition in the clockwise direction.

[Figure: hash-ring-old-forward]

Such routing is fairly scalable. It does not have a single point of contention since each partition participates in routing. We also use a caching layer on the application server to remember partition addresses for recent user identities. As with lookups, it is also possible to route new user index allocations. When a partition is full it will simply forward to the next partition clockwise until a partition available for allocation is found. If new partitions’ points are added in between old ones this will keep routing paths short.
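Both forwarding rules (lookups that miss, allocations that hit a full partition) can be modelled with an in-memory toy. This Python sketch replaces real Redis instances with a hypothetical `Partition` class that counts users against a capacity; in the real system each partition does the forwarding itself, whereas here a single loop stands in for that.

```python
class Partition:
    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of user indexes
        self.users = set()

def lookup(partitions, start, user_id):
    """Walk clockwise from `start` until a partition holding the user is found."""
    n = len(partitions)
    for hop in range(n):
        idx = (start + hop) % n
        if user_id in partitions[idx].users:
            return idx
    return None                       # user has no index anywhere

def allocate(partitions, start, user_id):
    """Place a new user on the first partition clockwise from `start` with room."""
    n = len(partitions)
    for hop in range(n):
        idx = (start + hop) % n
        if len(partitions[idx].users) < partitions[idx].capacity:
            partitions[idx].users.add(user_id)
            return idx
    raise RuntimeError("all partitions are full")

ring = [Partition(capacity=1) for _ in range(3)]
print(allocate(ring, 0, "a"), allocate(ring, 0, "b"), lookup(ring, 0, "b"))
```

Because lookups and allocations walk in the same direction from the same starting point, a user that was forwarded during allocation is found again by the same walk.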

[Figure: hash-ring-full-forward]

To add reliability and further scale reads we replicate each partition to a different node in the cluster. Replicas are placed so that each node has an equal number of masters and slaves. All writes are directed at the master replica, while reads are balanced between master and slave replicas. Balancing uses stickiness by user identity to keep CPU caches warm. For example, replicating N partitions twice, with K partitions per node, gives 2 * K replicas per node across the M nodes in the cluster.

[Figure: cluster]

Partitioned replicas take snapshots asynchronously. We place the snapshots on redundant networked storage. All snapshot traffic goes through a dedicated network. Taking snapshots requires additional memory to be allocated by Redis. The amount of this memory depends on the write activity on a partition and the I/O latency of the snapshot operation. To minimize snapshotting overhead on a specific node in the cluster we run only one snapshotting operation at a time. A dedicated agent calls the SAVE command for each partition on a node in a round-robin fashion. The SAVE command initiates snapshotting and returns when it is completed. The period of a full snapshotting cycle defines the maximum time slice of changes that could be lost if a cluster node fails.

Cluster configuration, including partition IP addresses and ports, positions on the hash ring and health check status, is stored in dedicated Redis instances. We call this dedicated instance the “directory”. The directory is replicated as well, with the master doing the writes, which are quite rare. The directory reads are served by the master and one or more slave replicas in a round-robin manner. When the application fails to access a certain partition replica, it marks it as offline in the directory. This means other instances will stop trying to access it since it has been removed from the list of available replicas for a partition. If a failed replica was in the slave role the cluster will remain fully functional, just with less redundancy for the failed partition. We call this partial outage. In turn, if the replica was in the master role, the partition will switch to read-only outage mode. Read-only outage mode still allows the partition to handle read queries. If all replicas in a partition have failed, the cluster is in full outage. Full outage is a rare event since it requires two nodes to fail simultaneously. A special agent monitors partitions for outages by polling the directory. If there is a master-less partition it will elect one of the slave replicas to become master, reconfigure the rest of the replicas and adjust the directory to clear the read-only outage. While we attempt to recover from read-only outages automatically, partial outages and full outages must be addressed manually. At some point in scaling this will not work and we realize the need for a fully-automated recovery system down the road.
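The three outage modes reduce to a small decision function. The Python sketch below classifies a single partition from the health flags the directory would hold. The `(role, online)` representation and the name `partition_state` are assumptions made for this example, not the actual directory schema.

```python
def partition_state(replicas):
    """Classify one partition from a list of (role, online) pairs.

    role is "master" or "slave"; online is the health flag kept in the directory.
    """
    if not any(online for _role, online in replicas):
        return "full outage"
    if not any(online for role, online in replicas if role == "master"):
        return "read-only outage"
    if not all(online for _role, online in replicas):
        return "partial outage"
    return "healthy"

print(partition_state([("master", True), ("slave", False)]))
print(partition_state([("master", False), ("slave", True)]))
```

The order of the checks matters: the full-outage test has to come first, otherwise a partition with every replica down would be reported as merely read-only.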

Currently the cluster runs on 14 nodes, each with 96GB of RAM, which amounts to a total of 1.3TB. The nodes have 16 hardware threads each. We deployed 160 partitions, each replicated twice. Separate computers handle the directory and networked storage for snapshotting.
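The deployment figures can be sanity-checked with a few lines of Python arithmetic. This reads "each replicated twice" as two copies per partition (one master, one slave), which matches the earlier "2 * K replicas per node" example but is still an interpretation.

```python
nodes, ram_per_node_gb, threads_per_node = 14, 96, 16
partitions, copies_per_partition = 160, 2   # assumes "replicated twice" means two copies

total_ram_gb = nodes * ram_per_node_gb
total_replicas = partitions * copies_per_partition
total_threads = nodes * threads_per_node

print(total_ram_gb, total_ram_gb / 1024)        # total RAM in GB and in TB
print(total_replicas, total_replicas / nodes)   # replicas in total and per node
print(total_threads)                            # hardware threads in the cluster
```

Under that reading there are more Redis instances than hardware threads, which would be consistent with the overprovisioning idea described earlier, although the article does not state this explicitly.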

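posted @ 2011-06-12 22:43 加菲猫
June 7, 2011

 A reading list for Linux file systems. It contains a large number of articles on file system topics and is a good way to round out your understanding of file systems.

http://linuxfs.pbworks.com/w/page/17647516/ReadingList  (this address is blocked)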

This page lists useful reading for people interested in Linux file systems development. Please include a description with each link and keep things organized.

 

Ext2/Ext3 improvement project is dealing with 64 bitness

 

File and Storage System Design at Harvard has some publication links.

 

The 2006 Linux File Systems Workshop Summary at Linux Weekly News is a good summary of the problems Linux file systems will be facing in the next 5 years, and the ideas people have for tackling them. Includes information on chunkfs, doublefs, and more.

 

Here is a description of chunkfs, an idea for splitting up file systems into small, individually fsck-able chunks, which results in many advantages in both performance and repairability.

 

Some documentation on how to write a Linux filesystem includes this OLS Tutorial and a paper describing a simple Linux fs.

 

The adaptable filesystems framework paper explores the possibility of easing the problem of switching formats by supporting on-disk layout flexibility at a per-element level.

 

Exploring High Bandwidth Filesystems on Large Systems - some testing of filesystem throughput by SGI. Focuses on XFS, of course, but there's some comparison with other common Linux filesystems.

 

Practical File System Design with the Be File System - A book (now out of print) on how to design your own filesystem, which is illustrated with examples from the author's own experience designing BeOS's BFS. The author also provides the filesystem construction kit mentioned in the book (a userspace test harness for experimenting with on-disk layout).

 

LinLogFS - A Log-Structured Filesystem For Linux gives an overview and discusses some implementation issues of LinLogFS. The lessons learned led to new, but yet unimplemented ideas for a log-structured/copy-on-write file system.

 

Log-structured file system for Linux 2.6 has fairly extensive documentation covering an implementation of a log-structured file system for Linux which has a working garbage collector. The site also contains the file system source code and results of a number of measurements that nicely show how it performs, often compared to ext3.

 

WinFS - WinFS was a data storage and management system based on relational databases, developed by Microsoft from 2003 to 2006 for use as an advanced storage subsystem for the Microsoft Windows operating system. It has since been cancelled as a separate product, and some of its technologies will be moved into future versions of ADO.NET and Microsoft SQL Server.

 

LogFS - LogFS is the next generation file system for flash media and a replacement for jffs2. It provides support for large partitions and has good scalability to the size of the file system, while keeping all the good stuff from jffs2, like wear levelling, compression, and robustness.

 

Union Mounts - VFS based union mounts let you mount multiple file systems to a single mount point and see the files on each of them simultaneously. This approach adds a much cleaner abstraction than UnionFS, which has similar goals but is implemented as a stacked file system.

 

Abstractions

The Box: A Replacement for Files - The file is a well-understood and very simple abstraction, which has been successfully used in both centralized and distributed systems. However, files are not expressive enough to capture high-level data semantics and relationships, including sharing and data dependencies. How many times has your operating system translated a JPEG figure into postscript, automatically when needed? How do files help applications to deal with heterogeneity? Moreover, optimization opportunities are lost due to the low-level nature of read and write; e.g. in NFS data is sent twice through the network when copying a remote file within the same file server. We feel that a better abstraction is needed and propose an alternate abstraction, the "box", to solve the problems mentioned above. Boxes, while being as simple as files, are more expressive. We intend to use boxes as a central abstraction for a distributed adaptable operating system we are building.

 

High level

Design Issues: Architectural and philosophical points - These statements of architectural principle explain the thinking behind the specifications. These are personal notes by Tim Berners-Lee: they are not endorsed by W3C. They are aimed at the technical community, to explain reasons, provide a framework to provide consistency for future developments, and avoid repetition of discussions once resolved.

 

The Naming System Venture - The utility of an operating system is more proportional to the number of connections possible between its components than it is to the number of those components. Namespace fragmentation is the most important determinant of that number of possible connections between OS components. Unix at its beginning increased the integration of I/O by putting devices into the file system name space. This is a winning strategy, let's take the file system name space, and one-by-one eliminate the reasons why the filesystem is inadequate for what other name spaces are used for, one missing feature at a time. Only once we have done so will the hobbles be removed from OS architects, or even OS conspiracies.

 

HCI focused writings

This may be too far off topic for this page. Edit it out if that is the case. But consider that HCI considerations should have an impact on FS design.

 

Exploring the Mnemonic user interface - This article is an offspring of personal reflections over too much time being spent traversing file structures and organizing windows in the user interfaces of today’s modern operating systems.

 

File types revisited - An analysis of Apple's "Uniform Type Identifiers" (kind of a replacement for file extensions and mime-types)

 

GNOME Storage - Storage is an exciting project to replace the traditional filesystem with a new document store. Storage is part of a larger design for a new desktop environment.

 

Workshop Notes

 

LSF07 Workshop Notes

 

 

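posted @ 2011-06-07 22:30 加菲猫
June 6, 2011

Audiobooks have several advantages over plain MP3 files, such as playback speed control and automatically resuming from where you last stopped. There are many conversion methods and tools for audiobooks online, but iTunes alone is actually enough, and it is simple and convenient.

Right-click the track -> "Get Info" -> Options -> set the media kind to "Audiobook". This can be done in batch.

For formats such as WMA, just import and convert them (to AAC), then set the media kind. The bit rate is configured under the menu "Edit" -> "Preferences" -> "General" tab -> "Import Settings" next to "When you insert a CD".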

 

posted @ 2011-06-06 12:32 加菲猫
March 4, 2011

I have been working on progressive download playback lately, so I am copying these spec requirements to the blog to save having to look them up every time.

《Transparent end-to-end packet switched streaming service (PSS);3GPP file format (3GP)(Release 7)》

5.4.5 Progressive-download profile
The 3GP Progressive-download profile is branded ‘3gr7’. It is used to label 3GP files that are suitable for progressive download, i.e. a scenario where a file may be played during download (with some delay).

 
The following constraints shall apply to 3GP files conforming to Progressive-download profile: 
    - the ‘moov’ box shall be placed right after the ‘ftyp’ box in the beginning of the file; 
    - all media tracks (if more than one) shall be interleaved with an interleaving depth of one second or less.

NOTE 1: This profile functions as an aid and not a requirement for progressive download, which has been an inherent feature of the 3GPP file format since the first version in Release 4. By parsing a 3GP file, a client can always determine whether a file can be progressively downloaded, and then calculate the interleaving depth from the meta-data in the ‘moov’ box.

NOTE 2: The ‘interleaving depth of one second or less’ means that: 
    -  Each chunk contains one or more samples, with the total duration of the samples being either: no greater than 1 second, or the duration of a single sample if that sample’s duration is greater than 1 second; 
    -  Within a track, chunks must be in decoding time order within the media-data box ‘mdat’; 
    -  It is recommended that, in ‘mdat’, regardless of media type, the chunks for all tracks are stored in ascending order by decoding time.  However, this order may be perturbed so that, when two chunks from different tracks overlap in time, the chunk of one track (e.g. audio) is stored before the chunk of the other track (e.g. video), even if the first sample in the second track has a slightly earlier timestamp than the first sample in the first track.
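The first constraint (the ‘moov’ box right after ‘ftyp’) is easy to test by walking the top-level boxes of a file. Here is a rough Python sketch, assuming the usual ISO base media box header (32-bit big-endian size plus a four-character type, with size 1 meaning a 64-bit size follows and size 0 meaning "to end of file"). It does not check the interleaving depth, and the helper names and the synthetic test data are made up for this example.

```python
import struct

def top_level_boxes(data):
    """Yield (type, offset, size) for each top-level box in an ISO base media file."""
    off = 0
    while off + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, off)
        header = 8
        if size == 1:                       # 64-bit largesize follows the type
            if off + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, off + 8)
            header = 16
        elif size == 0:                     # box extends to the end of the file
            size = len(data) - off
        if size < header:                   # malformed box; stop parsing
            break
        yield box_type.decode("latin-1"), off, size
        off += size

def moov_right_after_ftyp(data):
    """Check the first constraint of the profile: 'ftyp' first, 'moov' second."""
    types = [t for t, _off, _size in top_level_boxes(data)][:2]
    return types == ["ftyp", "moov"]

def box(box_type, payload=b""):
    """Build a minimal box for the demo below (hypothetical helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

good = box(b"ftyp", b"3gr7\x00\x00\x00\x00") + box(b"moov") + box(b"mdat", b"\x00" * 16)
bad = box(b"ftyp", b"3gr7\x00\x00\x00\x00") + box(b"mdat", b"\x00" * 16) + box(b"moov")
print(moov_right_after_ftyp(good), moov_right_after_ftyp(bad))
```

For a real file only the first few box headers need to be read, so in practice one would seek through the file rather than load it into memory as this sketch does.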

-----------------------------------------

《ISMA2.0》

Progressive Download Guidelines:
To enable progressive download, session information should always be put at the beginning of the file and the media tracks should be interleaved within the file. The following guidelines should be followed during the creation of ISMA 2.0 files for progressive download: 
    •  The Movie box (‘moov’) SHOULD be placed at the start of the file, right after the File Type box (‘ftyp’). 
    •  All media data SHOULD be present within the file (the data source of every Data Reference box should be ‘self-contained’); 
    •  The media tracks SHOULD be interleaved inside the file. The typical interleaving depth is 1 second.

The meaning of interleaving depth is given as follows: Each chunk contains one or more samples, with the total duration of the samples being either: no greater than 1 second, or the duration of a single sample if that sample’s duration is greater than 1 second. Within a track, chunks must be in decoding time order within the Media Data box (‘mdat’).
A client that does not support progressive download can always play the file once it has been completely downloaded. A progressive download client can start playing a ISMA 2.0 file that has been created along the progressive download guidelines once it has received a first chunk of all media in the session.

 

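posted @ 2011-03-04 10:40 加菲猫
February 19, 2011
Summary: Origin Pull is not really a new technique; it roughly means "PULL from Original Server". Content is transferred automatically from the origin site into the CDN, as opposed to uploading content manually. Take a domain such as www.duba.net, whose CNAME is www.duba.net.cachecn.com: when we visit the page, which includes the home page icon, the browser requests the logo image http://www.dub... [Read more]
posted @ 2011-02-19 11:22 加菲猫
February 9, 2011
Summary: While reading "A Hierarchical Internet Object Cache" last night, I saw the paper boasting about how great Harvest was, so I googled it, and it really is impressive: Harvest later evolved into the famous Squid; the NetCache product is also based on Harvest (NetCache was once a very well-known NetApp product and was later sold to Blue Coat); Harvest ... [Read more]
posted @ 2011-02-09 08:29 加菲猫
February 6, 2011
Summary: These are mainly notes on the paper, plus some thoughts on applying access patterns to the design of a real video CDN system. [Key new observations] with 77-79% of media sessions being less than 10 min long, 7-12% of the sessions being 10-30 min, and 6-13% of sessions continued for... [Read more]
posted @ 2011-02-06 18:30 加菲猫
February 1, 2011
Summary: This section comes mainly from Chapter 7, Cache Hierarchies, of Web Caching, and is mostly notes. Caches are arranged in hierarchies mainly to raise the hit rate. A cache is limited in both capacity and performance; on a cache miss, fetching the content from a neighbor cache is generally better than fetching it from the origin server, since it saves latency and WAN bandwidth. A hierarchy implies multiple caches, and there are two common relationships between caches... [Read more]
posted @ 2011-02-01 11:02 加菲猫
January 8, 2011
Summary: After learning about Apple's HTTP Live Streaming I also tested it, using http://iphone.akamai.com/ . Once QuickTime was installed the video played; because I was going through an XX proxy I did not capture packets, and I left it at that. Recently I wanted to understand it thoroughly, only to find that on Windows, whether I use IE / Firefox / Safari / Chrome or QuickTime / VLC, I cannot watch it; capturing... [Read more]
posted @ 2011-01-08 11:21 加菲猫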