
Some libc malloc work...


From: Venkatesh Srinivas <me@xxxxxxxxxxxxxxxxxxx>
Date: Wed, 17 Mar 2010 11:24:35 -0400 (EDT)

Hi world,

I've spent some time over the last few weeks working on the libc malloc; this is just a note about the work I've done.

The existing nmalloc handles multithreaded applications by having four SLGlobalData structures, each of which controls an independent set of allocation zones. Each SLGlobalData structure has a spinlock protecting all of its associated zones. On allocation, a thread try-locks the last SLGlobalData structure it allocated from; if that lock is already held, it moves on to the next global data structure, this time using _SPINLOCK rather than _SPINTRYLOCK.
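Roughly, that lock-selection pattern looks like the following (a minimal sketch with hypothetical names; a pthread mutex stands in for the libc spinlock primitives, so this is not the actual nmalloc code):

#include <pthread.h>

#define NSLGD	4

struct slglobaldata {
        pthread_mutex_t lock;           /* protects this SLGD's zones */
        /* per-SLGD zone lists would live here */
};

static struct slglobaldata slgd[NSLGD];
static __thread int last_slgd;          /* SLGD this thread last allocated from */

/*
 * Trylock the last SLGD used; on contention, move to the next one and
 * take its lock unconditionally (the _SPINLOCK rather than _SPINTRYLOCK
 * step described above).
 */
static struct slglobaldata *
slgd_lock(void)
{
        struct slglobaldata *g = &slgd[last_slgd];

        if (pthread_mutex_trylock(&g->lock) != 0) {
                last_slgd = (last_slgd + 1) % NSLGD;
                g = &slgd[last_slgd];
                pthread_mutex_lock(&g->lock);
        }
        return (g);
}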

Each SLGlobalData structure also contains a small (4-element) cache of free zones; this is meant to avoid calling _vmem_alloc (ultimately mmap) when recently freed zones are available. Since this cache is per-SLGD, it's possible for mmap requests to happen while other SLGDs still have free zones, but there is a strict bound on that behavior. The free-zone list is only populated when a zone becomes free; a burst of different-sized allocations will end up bypassing it and calling mmap a number of times.
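In sketch form (hypothetical names and layout, not the structures in nmalloc.c), the per-SLGD free-zone cache behaves like this:

#define ZONE_CACHE	4		/* cache depth per SLGD */

struct slzone;				/* opaque zone header */

struct slgd_zone_cache {
        struct slzone  *freezones[ZONE_CACHE];
        int             nfree;
};

/* Return a cached zone, or NULL to tell the caller to _vmem_alloc(). */
static struct slzone *
cache_get(struct slgd_zone_cache *c)
{
        return ((c->nfree > 0) ? c->freezones[--c->nfree] : NULL);
}

/* Park a freed zone; if the cache is full, the caller unmaps it instead. */
static int
cache_put(struct slgd_zone_cache *c, struct slzone *z)
{
        if (c->nfree < ZONE_CACHE) {
                c->freezones[c->nfree++] = z;
                return (1);
        }
        return (0);
}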

I tried to improve these two areas; my work is available at http://endeavour.zapto.org/dfly/nmalloc.c (the diff is also there, at nmalloc.c.diff, if you prefer).
Here's what I've changed:


I implemented a magazine layer, in the style of Solaris's libumem; each thread has pointers to two magazines, 'loaded' and 'previous', per zone size (struct thr_mags is the structure; it is accessed via a __thread variable, thread_mags). In addition to the __thread variable, there is a pthread_key associated with thread_mags; we don't use the pthread_getspecific mechanism to access thread_mags, but we do have the key so that a destructor, mtmagazine_destructor(), is called on thread exit to drain all thread-specific resources.
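The TLS arrangement is roughly the following (illustrative layout and zone count; only the thread_mags and mtmagazine_destructor() names come from the description above):

#include <pthread.h>

#define NZONES	72			/* assumed number of size classes */

struct magazine;			/* opaque magazine type */

struct thr_mags {
        struct magazine *mags[NZONES][2];   /* loaded / previous per class */
        int              init;
};

static __thread struct thr_mags thread_mags;    /* all fast-path access */
static pthread_key_t thread_mags_key;           /* exists only for its destructor */
static pthread_once_t thread_mags_once = PTHREAD_ONCE_INIT;

static void
mtmagazine_destructor(void *arg)
{
        struct thr_mags *tp = arg;

        /* drain tp->mags[] back to the depots here */
        (void)tp;
}

static void
mtmagazine_init_key(void)
{
        pthread_key_create(&thread_mags_key, mtmagazine_destructor);
}

static struct thr_mags *
mtmagazine_get(void)
{
        struct thr_mags *tp = &thread_mags;

        if (!tp->init) {
                pthread_once(&thread_mags_once, mtmagazine_init_key);
                /* register a non-NULL value so the destructor fires at exit */
                pthread_setspecific(thread_mags_key, tp);
                tp->init = 1;
        }
        return (tp);
}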

There are per-size-class Depot structures, which are lists of the full and empty magazines associated with a size class, plus a pthread spinlock. I use a pthread spinlock rather than a spinlock_t since only a static number of spinlocks is available, and using spinlock_t here would consume many of them (it works with libthread_xu here, but I did development on FreeBSD 6.3 with libpthread, which runs out). Unlike the Solaris libumem, there is no contention counter and no counts of the sizes of the full/empty lists; I haven't implemented magazine scaling, so they weren't necessary.
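A depot, in sketch form (field names are illustrative), looks something like this:

#include <pthread.h>

struct magazine {
        struct magazine *next;          /* link in a depot's full/empty list */
        int              rounds;        /* objects currently in the magazine */
        void            *objs[];        /* the cached objects themselves */
};

struct magazine_depot {
        struct magazine    *full;       /* full magazines ready for threads */
        struct magazine    *empty;      /* empties waiting to be refilled */
        pthread_spinlock_t  lock;       /* one spinlock per size class, set
                                         * up with pthread_spin_init() */
};

/* Pop a full magazine for this size class, or NULL if the depot is dry. */
static struct magazine *
depot_fetch_full(struct magazine_depot *d)
{
        struct magazine *m;

        pthread_spin_lock(&d->lock);
        if ((m = d->full) != NULL)
                d->full = m->next;
        pthread_spin_unlock(&d->lock);
        return (m);
}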

I removed the existing MT strategy; there is only one SLGlobalData structure.

The multithreaded path for allocation now looks like:

_slaballoc calls mtmagazine_alloc(); if there is a loaded magazine for the current size class (found via zoneindex), it allocates the first object from that magazine and returns. If that magazine is empty, it swaps in a full 'previous' magazine and retries. Failing that, it attempts to lock the depot and pull a full magazine for that size class; if it cannot, it gives _slaballoc a chance to handle the allocation as it always has. (For more detail on why there are two magazines per thread, see the Solaris libumem paper.)
_slabfree calls mtmagazine_free, as expected; its path is pretty much symmetric with mtmagazine_alloc.
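Roughly, the allocation side looks like this (helper names and signatures other than zoneindex() are illustrative, and the empty-magazine bookkeeping is omitted; the free side is the mirror image):

#include <stddef.h>

struct magazine {
        struct magazine *next;          /* link when parked in a depot */
        int              rounds;        /* objects currently in the magazine */
        void            *objs[];
};

enum { LOADED = 0, PREV = 1 };

extern int              zoneindex(size_t size);         /* request size -> class */
extern struct magazine *thread_mag(int zi, int which);  /* per-thread slots */
extern void             thread_mag_set(int zi, int which, struct magazine *m);
extern struct magazine *depot_fetch(int zi);    /* locked pull of a full magazine */

static void *
mtmagazine_alloc(size_t size)
{
        int zi = zoneindex(size);
        struct magazine *m, *prev;

        for (;;) {
                /* 1. Allocate from the loaded magazine if it has rounds. */
                m = thread_mag(zi, LOADED);
                if (m != NULL && m->rounds > 0)
                        return (m->objs[--m->rounds]);

                /* 2. Loaded is empty; if 'previous' has rounds, swap the
                 *    two and retry. */
                prev = thread_mag(zi, PREV);
                if (prev != NULL && prev->rounds > 0) {
                        thread_mag_set(zi, PREV, m);
                        thread_mag_set(zi, LOADED, prev);
                        continue;
                }

                /* 3. Pull a full magazine from the depot; the now-empty
                 *    loaded magazine would go back to the depot's empty
                 *    list (omitted here). */
                m = depot_fetch(zi);
                if (m == NULL)
                        return (NULL);  /* let _slaballoc handle it */
                thread_mag_set(zi, LOADED, m);
        }
}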


To try to reduce the number of mmaps, I added one more magazine, a cache of whole zones: when a zone becomes free, zone_free now inserts it into zone_magazine, and zone_alloc attempts to pull from this cache before hitting mmap. The zone magazine has two flags set, M_BURST and M_BURST_EARLY. When M_BURST is set and the magazine fails to allocate from itself, it calls the underlying allocator (_vmem_alloc here) to allocate a number of objects at once, initially 8. M_BURST_EARLY causes that 8 to be scaled down by NSCALE on each burst until it hits one. This behavior is meant to help with program startup and with bursts of uneven allocation sizes that aren't connected to the steady-state behavior of a program. The zone magazine also sheds load when full, releasing zones via _vmem_free until low_factor (currently 48) is reached. All of this code is in zone_alloc / zone_free.
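In sketch form (the structure layout, the NSCALE value of 2, and the zone size here are assumptions; the flag names, the initial burst of 8, and the taper-to-one behavior come from the description above), the burst path is roughly:

#include <stddef.h>

#define M_BURST		0x01
#define M_BURST_EARLY	0x02
#define NSCALE		2		/* assumed scale-down factor */
#define ZONE_SIZE	(64 * 1024)	/* assumed zone size */

extern void *vmem_alloc_stub(size_t bytes);     /* stands in for _vmem_alloc() */

struct zone_mag {
        int    flags;
        int    burst;                   /* zones to map per burst, initially 8 */
        int    rounds;                  /* free zones currently cached */
        int    capacity;
        void  *zones[64];
};

static void *
zone_alloc_sketch(struct zone_mag *zm)
{
        int i, n;

        /* Serve from the cache of free zones when possible. */
        if (zm->rounds > 0)
                return (zm->zones[--zm->rounds]);

        if ((zm->flags & M_BURST) == 0)
                return (vmem_alloc_stub(ZONE_SIZE));

        /* Burst: map several zones at once and cache the extras. */
        n = zm->burst;
        for (i = 0; i < n && zm->rounds < zm->capacity; i++) {
                void *z = vmem_alloc_stub(ZONE_SIZE);
                if (z == NULL)
                        break;
                zm->zones[zm->rounds++] = z;
        }

        /* With M_BURST_EARLY, taper the burst size toward one so the
         * startup-sized bursts don't persist into steady state. */
        if ((zm->flags & M_BURST_EARLY) && zm->burst > 1) {
                zm->burst /= NSCALE;
                if (zm->burst < 1)
                        zm->burst = 1;
        }

        return (zm->rounds > 0 ? zm->zones[--zm->rounds] : NULL);
}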

A few quick tests: I captured all of the allocation traffic during the startup of gnome-terminal and the Epiphany browser; these are available in gt.c and ep.c in my home directory on leaf. Fair warning: compiling ep.c causes gcc to eat ~4.6GB of RAM, so be careful.

Gnome-terminal makes ~85000 allocations/frees on startup; when run with the existing nmalloc, this results in 69 mmaps (54 of them for zone structures) and 11 munmaps. With my patched version, there are 66 mmaps and 11 munmaps; not a major difference. When M_BURST is raised to 16 and the allocator scales by subtraction rather than division (16, 14, ...), we drop to 22 mmaps and 9 munmaps (for comparison, FreeBSD's jemalloc calls mmap 10 times and munmap 4 times). I'll post the Epiphany numbers a little later today, along with the virt/res sizes for each.

I've not done any particular MT scaling measurements; I used the malloc-test tool from Lever & Boreham (referenced from the jemalloc paper) and got an excellent upper bound on scalability, but that test is a fairly cheesy one. Any suggestions for good things to try would be pretty cool.

Any comments on the code or ideas would rock...

Thanks,
-- vs


