Linux内核中的slab/slob/slub-- 在搞晕前先记下来

很久很久以前:一个叫做Mark Hemment的哥儿们写了Slab。在接下来的一些年里,其他人对Slab进行了完善。一年半以前,SLOB问世了。SLOB的目标是针对嵌入式系统的,主要是适用于那些内存非常有限的系统,比如32MB以下的内存,它不太注重large smp系统,虽然最近在这方面有一些小的改进。几个月之前,SLUB闪亮登场。它基本上属于对Slab的重设计(redesign),但是代码更少,并且能更好的适应large NUMA系统。SLUB被很认为是Slab和Slob的取代者,大概在2.6.24/2.6.25将会被同志们抛弃。而SLUB将是未来Linux Kernel中的首选。

Linux内核真是变化太快了,内存管理这块就是一个好例子。

本来Linux内核只有Slab的,现在好了,Slab多了两个兄弟:Slob和Slub。瞧!这就是内核的命名风格,让你光看名字就糊涂了!这也是我这两天读内核源代码的深刻体会,什么cache啊,cache_cache啊,free_area啊,绕不晕你才怪呢~!

以前搞不懂这三个到底什么关系,为什么要有这三个。今天搜了一下,明白了一些。简单的说:Slab是基础,是最早从Sun OS那引进的;Slub是在Slab上进行的改进,在大型机上表现出色(不知道在普通PC上如何),据说还被IA-64作为默认;而Slob是针对小型系统设计的,当然了,主要是嵌入式。相关文章如下:

Anatomy of the Linux slab allocator
The SLUB allocator
The SLOB allocator

这也正好体现了一个Linux内核开发一贯的思想:提供一种机制,而不是一种策略(Provide mechanism not policy)。其它软件开发又何尝不是如此呢?

----------------------------------------------------------华丽的分割线之The SLOB allocator----------------------------------------------------------
slob: introduce the SLOB allocator
configurable replacement for slab allocator

This adds a CONFIG_SLAB option under CONFIG_EMBEDDED. When CONFIG_SLAB
is disabled, the kernel falls back to using the 'SLOB' allocator.

SLOB is a traditional K&R/UNIX allocator with a SLAB emulation layer,
similar to the original Linux kmalloc allocator that SLAB replaced.
It's signicantly smaller code and is more memory efficient. But like
all similar allocators, it scales poorly and suffers from
fragmentation more than SLAB, so it's only appropriate for small
systems.

It's been tested extensively in the Linux-tiny tree. I've also
stress-tested it with make -j 8 compiles on a 3G SMP+PREEMPT box (not
recommended).

Here's a comparison for otherwise identical builds, showing SLOB
saving nearly half a megabyte of RAM:

$ size vmlinux*
   text    data     bss     dec     hex filename
3336372  529360  190812 4056544  3de5e0 vmlinux-slab
3323208  527948  190684 4041840  3dac70 vmlinux-slob

$ size mm/{slab,slob}.o
   text    data     bss     dec     hex filename
  13221     752      48   14021    36c5 mm/slab.o
   1896      52       8    1956     7a4 mm/slob.o

/proc/meminfo:
                  SLAB          SLOB      delta
MemTotal:        27964 kB      27980 kB     +16 kB
MemFree:         24596 kB      25092 kB    +496 kB
Buffers:            36 kB         36 kB       0 kB
Cached:           1188 kB       1188 kB       0 kB
SwapCached:          0 kB          0 kB       0 kB
Active:            608 kB        600 kB      -8 kB
Inactive:          808 kB        812 kB      +4 kB
HighTotal:           0 kB          0 kB       0 kB
HighFree:            0 kB          0 kB       0 kB
LowTotal:        27964 kB      27980 kB     +16 kB
LowFree:         24596 kB      25092 kB    +496 kB
SwapTotal:           0 kB          0 kB       0 kB
SwapFree:            0 kB          0 kB       0 kB
Dirty:               4 kB         12 kB      +8 kB
Writeback:           0 kB          0 kB       0 kB
Mapped:            560 kB        556 kB      -4 kB
Slab:             1756 kB          0 kB   -1756 kB
CommitLimit:     13980 kB      13988 kB      +8 kB
Committed_AS:     4208 kB       4208 kB       0 kB
PageTables:         28 kB         28 kB       0 kB
VmallocTotal:  1007312 kB    1007312 kB       0 kB
VmallocUsed:        48 kB         48 kB       0 kB
VmallocChunk:  1007264 kB    1007264 kB       0 kB

(this work has been sponsored in part by CELF)

Signed-off-by: Matt Mackall <mpm@selenic.com>

Index: 2.6.14-slob/mm/slob.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ 2.6.14-slob/mm/slob.c 2005-11-01 09:42:01.000000000 -0800
@@ -0,0 +1,395 @@
+/*
+ * SLOB Allocator: Simple List Of Blocks
+ *
+ * Matt Mackall <mpm@selenic.com> 12/30/03
+ *
+ * How SLOB works:
+ *
+ * The core of SLOB is a traditional K&R style heap allocator, with
+ * support for returning aligned objects. The granularity of this
+ * allocator is 8 bytes on x86, though it's perhaps possible to reduce
+ * this to 4 if it's deemed worth the effort. The slob heap is a
+ * singly-linked list of pages from __get_free_page, grown on demand
+ * and allocation from the heap is currently first-fit.
+ *
+ * Above this is an implementation of kmalloc/kfree. Blocks returned
+ * from kmalloc are 8-byte aligned and prepended with a 8-byte header.
+ * If kmalloc is asked for objects of PAGE_SIZE or larger, it calls
+ * __get_free_pages directly so that it can return page-aligned blocks
+ * and keeps a linked list of such pages and their orders. These
+ * objects are detected in kfree() by their page alignment.
+ *
+ * SLAB is emulated on top of SLOB by simply calling constructors and
+ * destructors for every SLAB allocation. Objects are returned with
+ * the 8-byte alignment unless the SLAB_MUST_HWCACHE_ALIGN flag is
+ * set, in which case the low-level allocator will fragment blocks to
+ * create the proper alignment. Again, objects of page-size or greater
+ * are allocated by calling __get_free_pages. As SLAB objects know
+ * their size, no separate size bookkeeping is necessary and there is
+ * essentially no allocation space overhead.
+ */
+
+#include <linux/config.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/cache.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/timer.h>
+
+struct slob_block {
+ int units;
+ struct slob_block *next;
+};
+typedef struct slob_block slob_t;
+
+#define SLOB_UNIT sizeof(slob_t)
+#define SLOB_UNITS(size) (((size) + SLOB_UNIT - 1)/SLOB_UNIT)
+#define SLOB_ALIGN L1_CACHE_BYTES
+
+struct bigblock {
+ int order;
+ void *pages;
+ struct bigblock *next;
+};
+typedef struct bigblock bigblock_t;
+
+static slob_t arena = { .next = &arena, .units = 1 };
+static slob_t *slobfree = &arena;
+static bigblock_t *bigblocks;
+static DEFINE_SPINLOCK(slob_lock);
+static DEFINE_SPINLOCK(block_lock);
+
+static void slob_free(void *b, int size);
+
+static void *slob_alloc(size_t size, gfp_t gfp, int align)
+{
+ slob_t *prev, *cur, *aligned = 0;
+ int delta = 0, units = SLOB_UNITS(size);
+ unsigned long flags;
+
+ spin_lock_irqsave(&slob_lock, flags);
+ prev = slobfree;
+ for (cur = prev->next; ; prev = cur, cur = cur->next) {
+ if (align) {
+ aligned = (slob_t *)ALIGN((unsigned long)cur, align);
+ delta = aligned - cur;
+ }
+ if (cur->units >= units + delta) { /* room enough? */
+ if (delta) { /* need to fragment head to align? */
+ aligned->units = cur->units - delta;
+ aligned->next = cur->next;
+ cur->next = aligned;
+ cur->units = delta;
+ prev = cur;
+ cur = aligned;
+ }
+
+ if (cur->units == units) /* exact fit? */
+ prev->next = cur->next; /* unlink */
+ else { /* fragment */
+ prev->next = cur + units;
+ prev->next->units = cur->units - units;
+ prev->next->next = cur->next;
+ cur->units = units;
+ }
+
+ slobfree = prev;
+ spin_unlock_irqrestore(&slob_lock, flags);
+ return cur;
+ }
+ if (cur == slobfree) {
+ spin_unlock_irqrestore(&slob_lock, flags);
+
+ if (size == PAGE_SIZE) /* trying to shrink arena? */
+ return 0;
+
+ cur = (slob_t *)__get_free_page(gfp);
+ if (!cur)
+ return 0;
+
+ slob_free(cur, PAGE_SIZE);
+ spin_lock_irqsave(&slob_lock, flags);
+ cur = slobfree;
+ }
+ }
+}
+
+static void slob_free(void *block, int size)
+{
+ slob_t *cur, *b = (slob_t *)block;
+ unsigned long flags;
+
+ if (!block)
+ return;
+
+ if (size)
+ b->units = SLOB_UNITS(size);
+
+ /* Find reinsertion point */
+ spin_lock_irqsave(&slob_lock, flags);
+ for (cur = slobfree; !(b > cur && b < cur->next); cur = cur->next)
+ if (cur >= cur->next && (b > cur || b < cur->next))
+ break;
+
+ if (b + b->units == cur->next) {
+ b->units += cur->next->units;
+ b->next = cur->next->next;
+ } else
+ b->next = cur->next;
+
+ if (cur + cur->units == b) {
+ cur->units += b->units;
+ cur->next = b->next;
+ } else
+ cur->next = b;
+
+ slobfree = cur;
+
+ spin_unlock_irqrestore(&slob_lock, flags);
+}
+
+static int FASTCALL(find_order(int size));
+static int fastcall find_order(int size)
+{
+ int order = 0;
+ for ( ; size > 4096 ; size >>=1)
+ order++;
+ return order;
+}
+
+void *kmalloc(size_t size, gfp_t gfp)
+{
+ slob_t *m;
+ bigblock_t *bb;
+ unsigned long flags;
+
+ if (size < PAGE_SIZE - SLOB_UNIT) {
+ m = slob_alloc(size + SLOB_UNIT, gfp, 0);
+ return m ? (void *)(m + 1) : 0;
+ }
+
+ bb = slob_alloc(sizeof(bigblock_t), gfp, 0);
+ if (!bb)
+ return 0;
+
+ bb->order = find_order(size);
+ bb->pages = (void *)__get_free_pages(gfp, bb->order);
+
+ if (bb->pages) {
+ spin_lock_irqsave(&block_lock, flags);
+ bb->next = bigblocks;
+ bigblocks = bb;
+ spin_unlock_irqrestore(&block_lock, flags);
+ return bb->pages;
+ }
+
+ slob_free(bb, sizeof(bigblock_t));
+ return 0;
+}
+
+EXPORT_SYMBOL(kmalloc);
+
+void *kzalloc(size_t size, gfp_t flags)
+{
+ void *ret = kmalloc(size, flags);
+ if (ret)
+ memset(ret, 0, size);
+ return ret;
+}
+
+EXPORT_SYMBOL(kzalloc);
+
+void kfree(const void *block)
+{
+ bigblock_t *bb, **last = &bigblocks;
+ unsigned long flags;
+
+ if (!block)
+ return;
+
+ if (!((unsigned int)block & (PAGE_SIZE-1))) {
+ /* might be on the big block list */
+ spin_lock_irqsave(&block_lock, flags);
+ for (bb = bigblocks; bb; last = &bb->next, bb = bb->next) {
+ if (bb->pages == block) {
+ *last = bb->next;
+ spin_unlock_irqrestore(&block_lock, flags);
+ free_pages((unsigned long)block, bb->order);
+ slob_free(bb, sizeof(bigblock_t));
+ return;
+ }
+ }
+ spin_unlock_irqrestore(&block_lock, flags);
+ }
+
+ slob_free((slob_t *)block - 1, 0);
+ return;
+}
+
+EXPORT_SYMBOL(kfree);
+
+unsigned int ksize(const void *block)
+{
+ bigblock_t *bb;
+ unsigned long flags;
+
+ if (!block)
+ return 0;
+
+ if (!((unsigned int)block & (PAGE_SIZE-1))) {
+ spin_lock_irqsave(&block_lock, flags);
+ for (bb = bigblocks; bb; bb = bb->next)
+ if (bb->pages == block) {
+ spin_unlock_irqrestore(&slob_lock, flags);
+ return PAGE_SIZE << bb->order;
+ }
+ spin_unlock_irqrestore(&block_lock, flags);
+ }
+
+ return ((slob_t *)block - 1)->units * SLOB_UNIT;
+}
+
+struct kmem_cache_s {
+ unsigned int size, align;
+ const char *name;
+ void (*ctor)(void *, kmem_cache_t *, unsigned long);
+ void (*dtor)(void *, kmem_cache_t *, unsigned long);
+};
+
+kmem_cache_t *kmem_cache_create(const char *name, size_t size, size_t align,
+ unsigned long flags,
+ void (*ctor)(void*, kmem_cache_t *, unsigned long),
+ void (*dtor)(void*, kmem_cache_t *, unsigned long))
+{
+ kmem_cache_t *c;
+
+ c = slob_alloc(sizeof(kmem_cache_t), flags, 0);
+
+ if (c) {
+ c->name = name;
+ c->size = size;
+ c->ctor = ctor;
+ c->dtor = dtor;
+ /* ignore alignment unless it's forced */
+ c->align = (flags & SLAB_MUST_HWCACHE_ALIGN) ? SLOB_ALIGN : 0;
+ if (c->align < align)
+ c->align = align;
+ }
+
+ return c;
+}
+EXPORT_SYMBOL(kmem_cache_create);
+
+int kmem_cache_destroy(kmem_cache_t *c)
+{
+ slob_free(c, sizeof(kmem_cache_t));
+ return 0;
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+void *kmem_cache_alloc(kmem_cache_t *c, gfp_t flags)
+{
+ void *b;
+
+ if (c->size < PAGE_SIZE)
+ b = slob_alloc(c->size, flags, c->align);
+ else
+ b = (void *)__get_free_pages(flags, find_order(c->size));
+
+ if (c->ctor)
+ c->ctor(b, c, SLAB_CTOR_CONSTRUCTOR);
+
+ return b;
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+void kmem_cache_free(kmem_cache_t *c, void *b)
+{
+ if (c->dtor)
+ c->dtor(b, c, 0);
+
+ if (c->size < PAGE_SIZE)
+ slob_free(b, c->size);
+ else
+ free_pages((unsigned long)b, find_order(c->size));
+}
+EXPORT_SYMBOL(kmem_cache_free);
+
+unsigned int kmem_cache_size(kmem_cache_t *c)
+{
+ return c->size;
+}
+EXPORT_SYMBOL(kmem_cache_size);
+
+const char *kmem_cache_name(kmem_cache_t *c)
+{
+ return c->name;
+}
+EXPORT_SYMBOL(kmem_cache_name);
+
+static struct timer_list slob_timer = TIMER_INITIALIZER(
+ (void (*)(unsigned long))kmem_cache_init, 0, 0);
+
+void kmem_cache_init(void)
+{
+ void *p = slob_alloc(PAGE_SIZE, 0, PAGE_SIZE-1);
+
+ if (p)
+ free_page((unsigned int)p);
+
+ mod_timer(&slob_timer, jiffies + HZ);
+}
+
+atomic_t slab_reclaim_pages = ATOMIC_INIT(0);
+EXPORT_SYMBOL(slab_reclaim_pages);
+
+#ifdef CONFIG_SMP
+
+void *__alloc_percpu(size_t size, size_t align)
+{
+ int i;
+ struct percpu_data *pdata = kmalloc(sizeof (*pdata), GFP_KERNEL);
+
+ if (!pdata)
+ return NULL;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_possible(i))
+ continue;
+ pdata->ptrs[i] = kmalloc(size, GFP_KERNEL);
+ if (!pdata->ptrs[i])
+ goto unwind_oom;
+ memset(pdata->ptrs[i], 0, size);
+ }
+
+ /* Catch derefs w/o wrappers */
+ return (void *) (~(unsigned long) pdata);
+
+unwind_oom:
+ while (--i >= 0) {
+ if (!cpu_possible(i))
+ continue;
+ kfree(pdata->ptrs[i]);
+ }
+ kfree(pdata);
+ return NULL;
+}
+EXPORT_SYMBOL(__alloc_percpu);
+
+void
+free_percpu(const void *objp)
+{
+ int i;
+ struct percpu_data *p = (struct percpu_data *) (~(unsigned long) objp);
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_possible(i))
+ continue;
+ kfree(p->ptrs[i]);
+ }
+ kfree(p);
+}
+EXPORT_SYMBOL(free_percpu);
+
+#endif
Makefile补丁略。。。

----------------------------------------------------------华丽的分割线之The SLUB allocator----------------------------------------------------------
The SLUB allocator
The slab allocator has been at the core of the kernel's memory management for many years. This allocator (sitting on top of the low-level page allocator) manages caches of objects of a specific size, allowing for fast and space-efficient allocations. Kernel hackers tend not to wander into the slab code because it's complex and because, for the most part, it works quite well.

Christoph Lameter is one of those people for whom the slab allocator does not work quite so well. Over time, he has come up with a list of complaints that is getting impressively long. The slab allocator maintains a number of queues of objects; these queues can make allocation fast but they also add quite a bit of complexity. Beyond that, the storage overhead tends to grow with the size of the system:

 

SLAB Object queues exist per node, per CPU. The alien cache queue even has a queue array that contain a queue for each processor on each node. For very large systems the number of queues and the number of objects that may be caught in those queues grows exponentially. On our systems with 1k nodes / processors we have several gigabytes just tied up for storing references to objects for those queues This does not include the objects that could be on those queues. One fears that the whole memory of the machine could one day be consumed by those queues.

Beyond that, each slab (a group of one or more continuous pages from which objects are allocated) contains a chunk of metadata at the beginning which makes alignment of objects harder. The code for cleaning up caches when memory gets tight adds another level of complexity. And so on.

Christoph's response is the SLUB allocator, a drop-in replacement for the slab code. SLUB promises better performance and scalability by dropping most of the queues and related overhead and simplifying the slab structure in general, while retaining the current slab allocator interface.

In the SLUB allocator, a slab is simply a group of one or more pages neatly packed with objects of a given size. There is no metadata within the slab itself, with the exception that free objects are formed into a simple linked list. When an allocation request is made, the first free object is located, removed from the list, and returned to the caller.

Given the lack of per-slab metadata, one might well wonder just how that first free object is found. The answer is that the SLUB allocator stuffs the relevant information into the system memory map - the page structures associated with the pages which make up the slab. Making struct page larger is frowned upon in a big way, so the SLUB allocator makes this complicated structure even more so with the addition of another union. The end result is that struct page gets three new fields which only have meaning when the associated page is part of a slab:

void *freelist; short unsigned int inuse; short unsigned int offset;

For slab use, freelist points to the first free object within a slab, inuse is the number of objects which have been allocated from the slab, and offset tells the allocator where to find the pointer to the next free object. The SLUB allocator can use RCU to free objects, but, to do so, it must be able to put the "next object" pointer outside of the object itself; the offset pointer is the allocator's way of tracking where that pointer was put.

When a slab is first created by the allocator, it has no objects allocated from it. Once an object has been allocated, it becomes a "partial" slab which is stored on a list in the kmem_cache structure. Since this is a patch aimed at scalability, there is, in fact, one "partial" list for each NUMA node on the system. The allocator tries to keep allocations node-local, but it will reach across nodes before filling the system with partial slabs.

There is also a per-CPU array of active slabs, intended to prevent cache line bouncing even within a NUMA node. There is a special thread which runs (via a workqueue) which monitors the usage of per-CPU slabs; if a per-CPU slab is not being used, it gets put back onto the partial list for use by other processors.

If all objects within a slab are allocated, the allocator simply forgets about the slab altogether. Once an object in a full slab is freed, the allocator can relocate the containing slab via the system memory map and put it back onto the appropriate partial list. If all of the objects within a given slab (as tracked by the inuse counter) are freed, the entire slab is given back to the page allocator for reuse.

One interesting feature of the SLUB allocator is that it can combine slabs with similar object sizes and parameters. The result is fewer slab caches in the system (a 50% reduction is claimed), better locality of slab allocations, and less fragmentation of slab memory. The patch does note:

 

Note that merging can expose heretofore unknown bugs in the kernel because corrupted objects may now be placed differently and corrupt differing neighboring objects. Enable sanity checks to find those.

Causing bugs to stand out is generally considered to be a good thing, but wider use of the SLUB allocator could lead to some quirky behavior until those new bugs are stamped out.

Wider use may be in the cards: the SLUB allocator is in the -mm tree now and could hit the mainline as soon as 2.6.22. The simplified code is attractive, as is the claimed 5-10% performance increase. If merged, SLUB is likely to coexist with the current slab allocator (and the SLOB allocator intended for small systems) for some time. In the longer term, the current slab code may be approaching the end of its life.

posted on 2018-08-15 13:14  lucelu  阅读(1221)  评论(0编辑  收藏

导航