SWAPCACHE(8) DragonFly System Manager's Manual SWAPCACHE(8)
swapcache -- a mechanism to use fast swap to cache filesystem data and
meta-data
swapcache is a system capability which allows a solid state disk (SSD) in
a swap space configuration to be used to cache clean filesystem data and
meta-data in addition to its normal function of backing anonymous memory.
Sysctls are used to manage operational parameters and can be adjusted at
any time. Typically a large initial burst is desired after system boot,
controlled by the initial vm.swapcache.curburst parameter. This parame-
ter is reduced as data is written to swap by the swapcache and increased
at a rate specified by vm.swapcache.accrate. Once this parameter reaches
zero, write activity ceases until it has recovered sufficiently for write
activity to resume.
vm.swapcache.meta_enable enables the writing of filesystem meta-data to
the swapcache. Filesystem metadata is any data which the filesystem
accesses via the disk device using buffercache. Meta-data is cached
globally regardless of file or directory flags.
vm.swapcache.data_enable enables the writing of clean filesystem file-
data to the swapcache. Filesystem filedata is any data which the
filesystem accesses via a regular file. In technical terms, this is data
accessed when the buffer cache is used to reach a regular file through its
vnode. Please
do not blindly turn on this option, see the PERFORMANCE TUNING section
for more information.
vm.swapcache.use_chflags enables the use of the cache and noscache
chflags(1) flags to control which files will be data-cached. If this
sysctl is disabled and data_enable is enabled, the system will ignore
file flags and attempt to swapcache all regular files.
vm.swapcache.read_enable enables reading from the swapcache and should be
set to 1 for normal operation.
vm.swapcache.maxfilesize controls which files are to be cached based on
their size. If set to a non-zero value, only files smaller than the
specified size will be cached. Larger files will not be cached.
vm.swapcache.maxlaunder controls the maximum number of clean VM pages
which will be added to the swap cache and written out to swap on each
poll. Swapcache polls ten times a second.
vm.swapcache.hysteresis controls how many pages swapcache waits to be
added to the inactive page queue before continuing its scan. Once it
decides to scan it continues subject to the above limitations until it
reaches the end of the inactive page queue. This parameter is designed
to make swapcache generate more bulky bursts to swap which helps SSDs
reduce write amplification effects.
PERFORMANCE TUNING
Best operation is achieved when the active data set fits within the
swapcache.
vm.swapcache.accrate
This specifies the burst accumulation rate in bytes per second and
ultimately controls the write bandwidth to swap averaged over a
long period of time. This parameter must be carefully chosen to
manage the write endurance of the SSD in order to avoid wearing it
out too quickly. Even though SSDs have limited write endurance,
there is massive cost/performance benefit to using one in a swapcache
configuration.
Let's use the old Intel X25V 40GB MLC SATA SSD as an example. This
device has approximately a 40TB (40 terabyte) write endurance, but
see later notes on this, it is more a minimum value. Limiting the
long term average bandwidth to 100KB/sec leads to no more than
~9GB/day writing which calculates approximately to a 12 year
endurance. Endurance scales linearly with size. The 80GB version
of this SSD will have a write endurance of approximately 80TB.
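The arithmetic above can be checked with a short sketch. It assumes decimal units (1KB = 1000 bytes, 1TB = 10^12 bytes) and a 365-day year; the function names are illustrative and are not part of any DragonFly tool.

```python
SECONDS_PER_DAY = 86_400

def daily_write_gb(accrate_bytes_per_sec):
    """Upper bound on data written per day at a sustained average rate."""
    return accrate_bytes_per_sec * SECONDS_PER_DAY / 1e9

def endurance_years(endurance_tb, accrate_bytes_per_sec):
    """Years until the rated write endurance is consumed at that rate."""
    bytes_per_year = accrate_bytes_per_sec * SECONDS_PER_DAY * 365
    return endurance_tb * 1e12 / bytes_per_year

print(round(daily_write_gb(100_000), 2))       # 8.64 GB/day at 100KB/sec
print(round(endurance_years(40, 100_000), 1))  # 12.7 years for a 40TB rating
```

Doubling the rated endurance (the 80GB case) doubles the result, which is the linear scaling described above.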
MLC SSDs have a write endurance of roughly 1000-10000 erase cycles
per cell, while the lower-density, higher-cost SLC SSDs have roughly
10000-100000. MLC SSDs can be used for the swapcache (and swap)
as long as the system manager is cognizant of their limitations.
However, over the years tests have shown the SLC SSDs do not really
live up to their hype and are no more reliable than MLC SSDs.
Instead of worrying about SLC vs MLC, just use MLC (or TLC or
whatever), leave more space unpartitioned which the SSD can utilize to
improve durability, and be cognizant of the SSD's rate of wear.
Turning on just meta_enable causes only filesystem meta-data to be
cached and will result in very fast directory operations even over
millions of inodes and even in the face of other invasive opera-
tions being run by other processes.
For HAMMER filesystems meta-data includes the B-Tree, directory
entries, and data related to tiny files. Approximately 6 GB of
swapcache is needed for every 14 million or so inodes cached,
effectively giving one the ability to cache all the meta-data in a
multi-terabyte filesystem using a fairly small SSD.
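As a rough sizing aid, the ratio quoted above (about 6 GB per 14 million inodes) can be turned into a per-inode estimate. This is only a sketch: it assumes the ratio scales linearly, and the 50 million inode figure is a hypothetical workload.

```python
BYTES_PER_GB = 1e9

# Ratio quoted above: roughly 6 GB of swapcache per 14 million inodes.
SWAPCACHE_BYTES_PER_INODE = 6 * BYTES_PER_GB / 14e6

def metadata_swapcache_gb(inodes):
    """Estimated swapcache space needed to hold meta-data for 'inodes'."""
    return inodes * SWAPCACHE_BYTES_PER_INODE / BYTES_PER_GB

print(round(SWAPCACHE_BYTES_PER_INODE))         # 429 bytes per inode
print(round(metadata_swapcache_gb(50e6), 1))    # 21.4 GB for 50M inodes
```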
Turning on data_enable (with or without other features) allows bulk
file data to be cached. This feature is very useful for web server
operation when the operational data set fits in swap. However,
care must be taken to avoid thrashing the swapcache. In almost all
cases you will want to leave chflags mode enabled and use 'chflags
cache' on governing directories to control which directory subtrees
file data should be cached for.
DragonFly uses generously large kern.maxvnodes values, typically in
excess of 400K vnodes, but large numbers of small files can still
cause problems for swapcache. When operating on a filesystem con-
taining a large number of small files, vnode recycling by the ker-
nel will cause related swapcache data to be lost and also cause the
swapcache to potentially thrash. Cache thrashing due to vnode
recycling can occur whether chflags mode is used or not.
To solve the thrashing problem you can turn on HAMMER's double
buffering feature via vfs.hammer.double_buffer. This causes HAMMER
to cache file data via its block device. HAMMER cannot avoid also
caching file data via individual vnodes but will try to expire the
second copy more quickly (hence why it is called double buffer
mode), but the key point here is that swapcache will only cache the
data blocks via the block device when double_buffer mode is used
and since the block device is associated with the mount, vnode
recycling will not mess with it. This allows the data for any num-
ber (potentially millions) of files to be swapcached. You still
should use chflags mode to control the size of the dataset being
cached to remain under 75% of configured swap space.
Data caching is definitely more wasteful of the SSD's write dura-
bility than meta-data caching. If not carefully managed the swap-
cache may exhaust its burst and smack against the long term average
bandwidth limit, causing the SSD to wear out at the maximum rate
you programmed. Data caching is far less wasteful and more effi-
cient if you provide a sufficiently large SSD.
When caching large data sets you may want to use a medium-sized SSD
with good write performance instead of a small SSD to accommodate
the higher burst write rate data caching incurs and to reduce
interference between reading and writing. Write durability also
tends to scale with larger SSDs, but keep in mind that newer flash
technologies use smaller feature sizes on-chip which reduce the
write durability of the chips, so pay careful attention to the type
of flash employed by the SSD when making durability assumptions.
For example, an Intel X25-V only has 40MB/s in write performance
and burst writing by swapcache will seriously interfere with
concurrent read operation on the SSD. The 80GB X25-M, on the other
hand, has double the write performance. Higher-capacity and larger
form-factor SSDs tend to have better write performance. But the Intel
310 series SSDs use flash chips with a smaller feature size, so an
80G 310 series SSD will wind up with a durability relatively close to
the older 40G X25-V.
When data caching is turned on you can fine-tune what gets swap-
cached by also turning on swapcache's chflags mode and using
chflags(1) with the cache flag to enable data caching on a direc-
tory-tree (recursive) basis. This flag is tracked by the namecache
and does not need to be recursively set in the directory tree.
Simply setting the flag in a top level directory or mount point is
usually sufficient. However, the flag does not track across mount
points. A typical setup is something like this:
chflags cache /etc /sbin /bin /usr /home
chflags noscache /usr/obj
It is possible to tell swapcache to ignore the cache flag by leav-
ing vm.swapcache.use_chflags set to zero. In many situations it is
convenient to simply not use chflags mode, but if you have numerous
mixed SSDs and HDDs you may want to use this flag to enable
swapcache on the HDDs and disable it on the SSDs even if you do not
care about fine-grained control.
Filesystems such as NFS which do not support flags generally have a
cache mount option which enables swapcache operation on the mount.
vm.swapcache.maxfilesize
This may be used to reduce cache thrashing when a focus on a small
potentially fragmented filespace is desired, leaving the larger
(more linearly accessed) files alone.
vm.swapcache.minburst
This controls hysteresis and prevents nickel-and-dime write bursting.
Once curburst drops to zero, writing to the swapcache ceases
until it has recovered past minburst. The idea here is to avoid
creating a heavily fragmented swapcache where reading data from a
file must alternate between the cache and the primary filesystem.
Doing so does not save disk seeks on the primary filesystem so we
want to avoid doing small bursts. This parameter allows us to do
larger bursts. The larger bursts also tend to improve SSD perfor-
mance as the SSD itself can do a better job write-combining and
vm.swapcache.maxswappct
This controls the maximum amount of swap space swapcache may use, in
percentage terms. The default is 75%, leaving the remaining 25% of
swap available for normal paging operations.
It is important to ensure that your swap partition is nicely aligned.
The standard DragonFly disklabel(8) program guarantees high alignment
(~1MB) automatically. Swap-on HDDs benefit because HDDs tend to use a
larger physical sector size than 512 bytes, and proper alignment for SSDs
will reduce write amplification and write-combining inefficiencies.
Finally, interleaved swap (multiple SSDs) may be used to increase swap
and swapcache performance even further. A single SATA-II SSD is typi-
cally capable of reading 120-220MB/sec. Configuring two SSDs for your
swap will improve aggregate swapcache read performance by 1.5x to 1.8x.
In tests with two Intel 40GB SSDs 300MB/sec was easily achieved. With
two SATA-III SSDs it is possible to achieve 600MB/sec or better and well
over 400MB/sec random-read performance (versus the ~3MB/sec random read
performance a hard drive gives you). Faster SATA interfaces or newer
NVMe technologies have significantly more read bandwidth (3GB/sec+ for
NVMe), but may still lag on the write bandwidth. With newer technolo-
gies, one swap device is usually plenty.
DragonFly defaults to a maximum of 512G of configured swap. Keep in mind
that each 1GB of actually configured swap requires approximately 1MB of
wired ram to manage.
In addition there will be periods of time where the system is in steady
state and not writing to the swapcache. During these periods curburst
will inch back up but will not exceed maxburst. Thus the maxburst value
controls how large a repeated burst can be. Remember that curburst
dynamically tracks the available burst and rises and falls with write
activity.
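The interaction between curburst, accrate, minburst and maxburst can be illustrated with a toy model. This is a simplified sketch of the behavior described in this manual page, not the kernel's algorithm: it ticks once per second rather than ten times, uses arbitrary units, and the simulate() function and its demand parameter are hypothetical.

```python
def simulate(seconds, demand, accrate, curburst, minburst, maxburst):
    """Return the amount written on each tick under a simple burst budget."""
    writing = True
    written = []
    for _ in range(seconds):
        # The budget recovers at accrate but does not grow past maxburst.
        if curburst < maxburst:
            curburst = min(maxburst, curburst + accrate)
        # After exhaustion, writing resumes only once minburst is reached.
        if not writing and curburst >= minburst:
            writing = True
        amount = 0
        if writing:
            amount = min(demand, curburst)
            curburst -= amount
            if curburst <= 0:
                writing = False   # budget exhausted, stop writing
        written.append(amount)
    return written

# Hypothetical numbers: a 1000-unit budget, 100 units/sec recovery,
# and a workload that wants to write 400 units/sec.
print(simulate(12, 400, 100, 1000, 500, 1000))
```

In this model the writes arrive as a few large bursts separated by idle ticks instead of a steady trickle, and the long-term total can never exceed the initial budget plus accrate times the elapsed time.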
A second bursting parameter called vm.swapcache.minburst controls
bursting when the maximum write bandwidth has been reached. When curburst
reaches zero write activity ceases and curburst is allowed to recover up
to minburst before write activity resumes. The recommended range for the
minburst parameter is 1MB to 50MB. This parameter has a relationship to
how fragmented the swapcache gets when not in a steady state. Large
bursts reduce fragmentation and reduce incidences of excessive seeking on
the hard drive. If set too low the swapcache will become fragmented
within a single regular file and the constant back-and-forth between the
swapcache and the hard drive will result in excessive seeking on the hard
drive.
SWAPCACHE SIZE & MANAGEMENT
The swapcache feature will use up to 75% of configured swap space by
default. The remaining 25% is reserved for normal paging operations.
The system operator should configure swap space of at least 4 times main
memory, and no less than 8GB. A typical 128GB SSD
might use 64GB for boot + base and 56GB for swap, with 8GB left unparti-
tioned. The system might then have a large additional hard drive for
bulk data. Even with many packages installed, 64GB is comfortable for
boot + base.
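The sizing rules quoted in this page (at least 4 times main memory, no less than 8GB, 75% usable by swapcache by default, and roughly 1MB of wired RAM per 1GB of configured swap) can be combined into a small planning sketch. The swap_plan() helper is hypothetical and the 14GB machine is only an example that happens to yield the 56GB swap figure used above.

```python
def swap_plan(ram_gb, maxswappct=75):
    """Minimum swap layout implied by the guidelines in this page."""
    swap_gb = max(4 * ram_gb, 8)   # at least 4x RAM and no less than 8GB
    return {
        "swap_gb": swap_gb,
        # Share of swap that swapcache may use (75% by default).
        "swapcache_gb": swap_gb * maxswappct / 100,
        # Remainder left for normal paging operations.
        "paging_gb": swap_gb * (100 - maxswappct) / 100,
        # Approximately 1MB of wired RAM per 1GB of configured swap.
        "wired_ram_mb": swap_gb * 1,
    }

print(swap_plan(14))
```

For 14GB of RAM this gives 56GB of swap, of which 42GB may hold swapcache data and 14GB remains for paging, at a cost of about 56MB of wired RAM.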
When configuring a SSD that will be used for swap or swapcache it is a
good idea to leave around 10% unpartitioned to improve the SSD's
durability.
You do not need to use swapcache if you have no hard drives in the
system, though in fact swapcache can help if you use NFS heavily as a
client.
The vm.swapcache.maxswappct sysctl may be used to change the default.
You may have to change this default if you also use tmpfs(5), vn(4), or
if you have not allocated enough swap for reasonable normal paging
activity to occur (in which case you probably shouldn't be using
swapcache).
If swapcache reaches the 75% limit it will begin tearing down swap in
linear bursts by iterating through available VM objects, until swap space
use drops to 70%. The tear-down is limited by the rate at which new data
is written and this rate in turn is often limited by
vm.swapcache.accrate, resulting in an orderly replacement of cached data
and meta-data. The limit is typically only reached when doing full
data+meta-data caching with no file size limitations and serving primar-
ily large files, or bumping kern.maxvnodes up to very high values.
NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
This is not a function of swapcache per se but instead a normal function
of the system. Most systems have sufficient memory that they do not need
to page memory to swap. These types of systems are the ones best suited
for MLC SSD configured swap running with a swapcache configuration. Sys-
tems which modestly page to swap, in the range of a few hundred megabytes
a day worth of writing, are also well suited for MLC SSD configured swap.
Desktops usually fall into this category even if they page out a bit more
because swap activity is governed by the actions of a single person.
Systems which page anonymous memory heavily when swapcache would other-
wise be turned off are not usually well suited for MLC SSD configured
swap. Heavy paging activity is not governed by swapcache bandwidth con-
trol parameters and can lead to excessive uncontrolled writing to the
SSD, causing premature wearout. This isn't to say that swapcache would
be ineffective, just that the aggregate write bandwidth required to sup-
port the system might be too large to be cost-effective for a SSD.
With this caveat in mind, SSD based paging on systems with insufficient
RAM can be extremely effective in extending the useful life of the sys-
tem. For example, a system with a measly 192MB of RAM and SSD swap can
run a -j 8 parallel build world in a little less than twice the time it
would take if the system had 2GB of RAM, whereas it would take 5x to 10x
as long with normal HDD based swap.
USING SWAPCACHE WITH NORMAL HARD DRIVES
Although swapcache is designed to work with SSD-based storage it can also
be used with HD-based storage as an aid for offloading the primary stor-
age system. Here we need to make a distinction between using RAID for
fanning out storage versus using RAID for redundancy. There are numerous
situations where RAID-based redundancy does not make sense.
A good example would be in an environment where the servers themselves
are redundant and can suffer a total failure without affecting ongoing
operations. When the primary storage requirements easily fit onto a sin-
gle large-capacity drive it doesn't make a whole lot of sense to use RAID
if your only desire is to improve performance. If you had a farm of,
say, 20 servers supporting the same facility adding RAID to each one
would not accomplish anything other than to bloat your deployment and
In these sorts of situations it may be desirable and convenient to have
the primary filesystem for each machine on a single large drive and then
use the swapcache facility to offload the drive and make the machine more
effective without actually distributing the filesystem itself across mul-
tiple drives. For the purposes of offloading while a SSD would be the
most effective from a performance standpoint, a second medium sized HD
with its much lower cost and higher capacity might actually be more cost
effective.
EXPLANATION OF STATIC VS DYNAMIC WEAR LEVELING, AND WRITE-COMBINING
Modern SSDs keep track of space that has never been written to. This
would also include space freed up via TRIM, but simply not touching a bit
of storage in a factory fresh SSD works just as well. Once you touch
(write to) the storage all bets are off, even if you reformat/repartition
later. It takes sending the SSD a whole-device TRIM command or special
format command to take it back to its factory-fresh condition (sans wear
SSDs have wear leveling algorithms which are responsible for trying to
even out the erase/write cycles across all flash cells in the storage.
The better a job the SSD can do the longer the SSD will remain usable.
The more unused storage there is from the SSD's point of view the easier a
time the SSD has running its wear leveling algorithms. Basically the
wear leveling algorithm in a modern SSD (say Intel or OCZ) uses a combi-
nation of static and dynamic leveling. Static is the best, allowing the
SSD to reuse flash cells that have not been erased very much by moving
static (unchanging) data out of them and into other cells that have more
wear. Dynamic wear leveling involves writing data to available flash
cells and then marking the cells containing the previous copy of the data
as being free/reusable. Dynamic wear leveling is the worst kind but the
easiest to implement. Modern SSDs use a combination of both algorithms
plus also do write-combining.
USB sticks often use only dynamic wear leveling and have short life spans
because of that.
In any case, any unused space in the SSD effectively makes the dynamic
wear leveling the SSD does more efficient by giving the SSD more 'unused'
space above and beyond the physical space it reserves beyond its stated
storage capacity to cycle data through, so the SSD lasts longer in
theory.
Write-combining is a feature whereby the SSD is able to reduce write
amplification effects by combining OS writes of smaller, discrete,
non-contiguous logical sectors into a single contiguous 128KB physical
flash block.
On the flip side write-combining also results in more complex lookup
tables which can become fragmented over time and reduce the SSD's read
performance. Fragmentation can also occur when write-combined blocks are
rewritten piecemeal. Modern SSDs can regain the lost performance by de-
combining previously write-combined areas as part of their static wear
leveling algorithm, but at the cost of extra write/erase cycles which
slightly increase write amplification effects. Operating systems can
also help maintain the SSD's performance by utilizing larger blocks.
Write-combining results in a net-reduction of write-amplification effects
but due to having to de-combine later and other fragmentary effects it
isn't 100%. From testing with Intel devices write-amplification can be
well controlled in the 2x-4x range with the OS doing 16K writes, versus a
worst-case 8x write-amplification with 16K blocks, 32x with 4K blocks,
and a truly horrid worst-case with 512 byte blocks.
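The worst-case figures above follow from dividing the physical flash block size by the OS write size. The sketch below assumes the 128KB physical block mentioned in this page and a worst case in which every OS write forces a full physical block to be rewritten; the function name is illustrative.

```python
PHYSICAL_BLOCK_KB = 128  # physical flash block size used in this page

def worst_case_write_amplification(os_write_kb,
                                   physical_block_kb=PHYSICAL_BLOCK_KB):
    """Worst case: each OS write costs one full physical block rewrite."""
    return physical_block_kb / os_write_kb

for size_kb in (16, 4, 0.5):
    print(size_kb, worst_case_write_amplification(size_kb))
# 16 8.0
# 4 32.0
# 0.5 256.0
```

Under this simple model the 512 byte case works out to 256x, and a 128KB write has no amplification at all, which is why large clustered writes are preferred.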
The DragonFly swapcache feature utilizes 64K-128K writes and is specifi-
cally designed to minimize write amplification and write-combining
stresses. In terms of placing an actual filesystem on the SSD, the
DragonFly hammer(8) filesystem utilizes 16K blocks and is well behaved as
long as you limit reblocking operations. For UFS you should create the
filesystem with at least a 4K fragment size, versus the default 2K. Mod-
ern Windows filesystems use 4K clusters but it is unclear how SSD-
friendly NTFS is.
EXPLANATION OF FLASH CHIP FEATURE SIZE VS ERASE/REWRITE CYCLE DURABILITY
Manufacturers continue to produce flash chips with smaller feature sizes.
Smaller flash cells means reduced erase/rewrite cycle durability which in
turn reduces the durability of the SSD.
The older 34nm flash typically had a durability of 10,000 erase cycles
per cell while the newer 25nm flash is closer to 1000. The newer flash
uses larger ECCs and
more sensitive voltage comparators on-chip to increase the durability
closer to 3000 cycles. Generally speaking you should assume a durability
of around 1/3 for the same storage capacity using the new chips versus
the older chips. If you can squeeze out a 400TB durability from an older
40GB X25-V using 34nm technology then you should assume around a 400TB
durability from a newer 120GB 310 series SSD using 25nm technology.
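The comparison above is capacity multiplied by per-cell erase cycles. The sketch below assumes perfect wear leveling and no write amplification, so it gives a theoretical ceiling rather than a prediction; the function name is illustrative.

```python
def theoretical_durability_tb(capacity_gb, cycles_per_cell):
    """Capacity times per-cell erase cycles, ignoring write amplification."""
    return capacity_gb * cycles_per_cell / 1000

print(theoretical_durability_tb(40, 10_000))   # 400.0 (40GB, 34nm)
print(theoretical_durability_tb(120, 3_000))   # 360.0 (120GB, 25nm)
```

Three times the capacity at roughly one third of the cycle count lands in the same neighborhood, which is the point made above.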
I am going to repeat and expand a bit on SSD wear. Wear on SSDs is a
function of the write durability of the cells, whether the SSD implements
static or dynamic wear leveling (or both), write amplification effects
when the OS does not issue write-aligned 128KB ops or when the SSD is
unable to write-combine adjacent logical sectors, or if the SSD has a
poor write-combining algorithm for non-adjacent sectors. In addition
some additional erase/rewrite activity occurs from cleanup operations the
SSD performs as part of its static wear leveling algorithms and its
write-decombining algorithms (necessary to maintain performance over
time). MLC flash uses 128KB physical write/erase blocks while SLC flash
typically uses 64KB physical write/erase blocks.
The algorithms the SSD implements in its firmware are probably the most
important part of the device and a major differentiator between e.g. SATA
and USB-based SSDs. SATA form factor drives will universally be far
superior to USB storage sticks. SSDs can also have wildly different
wearout rates and wildly different performance curves over time. For
example the performance of a SSD which does not implement write-decombin-
ing can seriously degrade over time as its lookup tables become severely
fragmented. For the purposes of this manual page we are primarily using
Intel and OCZ drives when describing performance and wear issues.
swapcache parameters should be carefully chosen to avoid early wearout.
For example, the Intel X25V 40GB SSD has a minimum write durability of
40TB and an actual durability that can be quite a bit higher. Generally
speaking, you want to select parameters that will give you at least 10
years of service life. The most important parameter to control this is
vm.swapcache.accrate. swapcache uses a very conservative 100KB/sec
default but even a small X25V can probably handle 300KB/sec of continuous
writing and still last 10 years.
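To pick an accrate for a target service life, the same arithmetic can be run in the other direction. This is a sketch assuming decimal units and a 365-day year, with illustrative function names; it ignores normal paging writes, which are not governed by accrate.

```python
SECONDS_PER_YEAR = 365 * 86_400

def max_accrate_kb_per_sec(endurance_tb, service_years):
    """Highest sustained write rate that stays within the endurance."""
    return endurance_tb * 1e12 / (service_years * SECONDS_PER_YEAR) / 1e3

def required_endurance_tb(accrate_kb_per_sec, service_years):
    """Total data written at a sustained rate over the service life."""
    return accrate_kb_per_sec * 1e3 * service_years * SECONDS_PER_YEAR / 1e12

print(round(max_accrate_kb_per_sec(40, 10), 1))   # 126.8 KB/sec
print(round(required_endurance_tb(300, 10), 1))   # 94.6 TB
```

Ten years at 300KB/sec therefore needs roughly 95TB of endurance, which is more than the 40TB minimum rating and relies on the actual durability being higher than the rated minimum.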
Depending on the wear leveling algorithm the drive uses, durability and
performance can sometimes be improved by configuring less space (in a
manufacturer-fresh drive) than the drive's probed capacity. For example,
by only using 32GB of a 40GB SSD. SSDs typically implement 10% more
storage than advertised and use this storage to improve wear leveling.
As cells begin to fail this overallotment slowly becomes part of the pri-
mary storage until it has been exhausted. After that the SSD has basi-
cally failed. Keep in mind that if you use a larger portion of the SSD's
advertised storage the SSD will not know if/when you decide to use less
unless appropriate TRIM commands are sent (if supported), or a low level
factory erase is issued.
smartctl (from dports(7)'s sysutils/smartmontools) may be used to
retrieve the wear indicator from the drive. One usually runs something
like `smartctl -d sat -a /dev/daXX' (for AHCI/SILI/SCSI), or `smartctl -a
/dev/adXX' for NATA. Some SSDs (particularly the Intels) will brick the
SATA port when smart operations are done while the drive is busy with
normal activity, so the tool should only be run when the SSD is idle.
ID 232 (0xe8) in the SMART data dump indicates available reserved space
and ID 233 (0xe9) is the wear-out meter. Reserved space typically starts
at 100 and decrements to 10, after which the SSD is considered to operate
in a degraded mode. The wear-out meter typically starts at 99 and decre-
ments to 0, after which the SSD has failed.
swapcache tends to use large 64KB writes and tends to cluster multiple
writes linearly. The SSD is able to take significant advantage of this
and write amplification effects are greatly reduced. If we take a 40GB
Intel X25V as an example the vendor specifies a write durability of
approximately 40TB, but swapcache should be able to squeeze out upwards
of 200TB due to the fairly optimal write clustering it does. The
theoretical limit for the Intel X25V is 400TB (10,000 erase cycles per MLC cell,
40GB drive, with 34nm technology), but the firmware doesn't do perfect
static wear leveling so the actual durability is less. In tests over
several hundred days we have validated a write endurance greater than
200TB on the 40G Intel X25V using swapcache.
In contrast, filesystems directly stored on a SSD could have fairly
severe write amplification effects and will have durabilities ranging
closer to the vendor-specified limit.
Tests have shown that power cycling (with proper shutdown) and read
operations do not adversely affect a SSD. Writing within the wearout
constraints provided by the vendor also does not make a powered SSD any
less reliable over time. Time itself seems to be a factor as the SSD
encounters defects and weak cells in the flash chips. Writes to a SSD will
affect cold durability (a typical flash chip has 10 years of cold data
retention when fresh and less than 1 year of cold data retention near the
end of its wear life). Keeping a SSD cool improves its data retention.
Beware the standard comparison between SLC, MLC, and TLC-based flash in
terms of wearout and durability. Over the years, tests have shown that
SLC is not actually any more reliable than MLC, despite having a signifi-
cantly larger theoretical durability. Cell and chip failures seem to
trump theoretical wear limitations in terms of device reliability. With
that in mind, we do not recommend using SLC for anything any more.
Instead we recommend that the flash simply be over-provisioned to provide
the needed durability. This is already done in numerous NVMe solutions
for the vendor to be able to provide certain minimum wear guarantees.
Durability scales with the amount of flash storage (but the fab process
typically scales the opposite... smaller feature sizes for flash cells
greatly reduce their durability). When wear calculations are in years,
these differences become huge, but often the quantity of storage needed
trumps the wear life so we expect most people will be using MLC.
Beware the huge difference between larger (e.g. 2.5") form-factor SSDs
and smaller SSDs such as USB sticks or very small M.2 storage. Smaller
form-factor devices have fewer flash chips, much lower write
bandwidths, and less RAM for caching and write-combining, and USB sticks
in particular will usually have unsophisticated wear-leveling algorithms
compared to a 2.5" SSD. It is generally not a good idea to make a USB stick
your primary storage. Long-form-factor NGFF/M.2 devices will be better,
and 2.5" form factor devices even better. The read-bandwidth for a SATA
SSD caps out more quickly than the read-bandwidth for a NVMe SSD, but the
larger form factor of a 2.5" SATA SSD will often have superior write per-
formance to a NGFF NVMe device. There are 2.5" NVMe devices as well,
requiring a special connector or PCIe adapter, which give you the best of
both worlds.
SEE ALSO
chflags(1), fstab(5), disklabel64(8), hammer(8), swapon(8)
HISTORY
swapcache first appeared in DragonFly 2.5.
DragonFly 5.5 February 7, 2010 DragonFly 5.5