DragonFly BSD

performance

Please note: Many of the figures on this page utilize SVG, if your browser does not show these a plugin needs to be installed or your browser updated.

DragonFly BSD has numerous attributes that make it compare favorably to other operating systems under a great diversity of workloads. Some select benchmarks that represent the general performance attributes of DragonFly are included in this page. If you have a DragonFly BSD performance success story, the developers would like to hear about it on the mailing lists!

Symmetric Multi-Processor Scaling (2012)

It is true that one of the original goals of the DragonFly BSD project was performance-oriented, the project sought to do SMP in more straightforward, composable, understandable and algorithmically superior ways to the work being done in other operating system kernels. The results of this process have become staggeringly obvious with the 3.0 and 3.2 releases of DragonFly, which saw a significant amount of polishing and general scalability work, and the culmination of which can be seen in the following graph.

The following graph charts the performance of the PostgreSQL 9.3 development version as of late June 2012 on DragonFly BSD 3.0 and 3.2, FreeBSD 9.1, NetBSD 6.0 and Scientific Linux 6.2 running Linux kernel version 2.6.32. The tests were performed using system defaults on each platform with pgbench as the test client with a scaling factor of 800. The test system in question was a dual-socket Intel Xeon X5650 with 24GB RAM.

NetBSD 6.0 was unable to complete the benchmark run.

In this particular test, which other operating systems have often utilized to show how well they scale, you can see the immense improvement in scalability of DragonFly BSD 3.2. PostgreSQL's scalability in 3.2 far surpasses other BSD-derived codebases and is roughly on par with the Linux kernel (which has been the object of expensive, multi-year optimization efforts). DragonFly performs better than Linux at lower concurrencies and the small performance hit at high client counts was given up willingly to ensure that we maintain acceptable interactivity even at extremely high throughput levels.

Note: single-host pgbench runs like the ones above are not directly useful for estimating PostgreSQL scaling behavior. For example, any real-world setup that needs to handle that many transactions would be using multiple PostgreSQL servers and the client themselves would be running on a set of different hosts. This would be much less demanding of the underlying OS. If you plan to use PostgreSQL on DragonFly and are targeting high-throughput, we encourage you to do your own testing and would appreciate any reports of inadequate performance. That said, the above workload does demonstrate the effect of algorithmic improvements that have been incorporated into the 3.2 kernel and should positively affect many real-world setups (not just PostgreSQL ones).


Swapcache

One of the novel features in DragonFly that is able to boost the throughput of a large number of workloads is called swapcache. Swapcache gives the kernel the ability to retire cached pages to one or more interleaved swap devices, usually using commodity solid state disks. By caching filesystem metadata, data or both on an SSD the performance of many read-centric workloads is improved and worst case performance is kept well bounded.

The following chart depicts relative performance of a system with and without swapcache. The application being tested is a PostgreSQL database under a read-only workload, with varying database sizes ranging from smaller than the total ram in the system to double the size of total available memory.

As you can plainly see, performance with swapcache is more than just well bounded, it is dramatically improved. Similar gains can be seen in many other scenarios. As with all benchmarks, the above numbers are indicative only of the specific test performed and to get a true sense of whether or not it will be a benefit to a specific workload it must be tested in that environment. Disclaimers aside, swapcache is appropriate for a huge variety of common workloads, the DragonFly team invites you to try it and see what a difference it can make.

Symmetric Multi-Processor Scaling (2018)

It's time to update! We ran a new set of pgbench tests on a Threadripper 2990WX (32-core/64-thread) running with 128G of ECC 2666C14 memory (8 sticks) and a power cap of 250W at the wall. The power cap was set below stock at a more maximally efficient point in the power curve. Stock power consumption usually runs in 330W range.

With the huge amount of SMP work done since 2012, the entire query path no longer has any SMP contention whatsoever and we expect a roughly linear curve until we run out of CPU cores and start digging into hyperthreads. The only possible sources of contention are increasing memory latencies as more cores load memory, and probably a certain degree of cache mastership ping-ponging that occurs even when locks are shared and do not contend.

The purple curve is the TPS, the green curve is the measured TPS/cpu-thread (with 16 threads selected as the baseline 1.0 ratio), and the light blue curve shows the approximate cpu performance for the system (load x CPU frequency). This particular CPU was running at around 4.16 GHz with one thread, 3.73 GHz @ 16 threads, 3.12 GHz @ 32 threads, and 2.90 GHz or so at 64 threads. The horizontal thread count is the number of client processes. There is also a server process for each client process so the actual number of CPU cores in use begins to use hyperthreads past 16, but don't fully load full cores until it hits 32 due to the IPC between client and server. pgbench was allowed to heat-up during each run, an average of the last three 60-second samples is then graphed.

To best determine actual cpu performance, the load factor percentage is multiplied against the CPU frequency of the non-idle cores and then normalized to fit against the TPS graph. The match-up starts to diverge significantly at just below 32 clients (32 client and 32 server processes). At n=1 the load is 1.6%, at n=16 the load is 25.7%, at n=32 the load is 51%, and at 64 the load is 100% x 2.9 GHz.

From this it is reasonable to assume, roughly, that the hyperthreads really get dug into past n=32. At n=32 we had around 750K TPS and at n=64 we had 1180K TPS, 57% higher. So, roughly, hyperthreading adds around 50-60% more TPS to the results. While this test is a bit wishy-washy on when hyperthreads actually start to get exercised, the DragonFlyBSD scheduler is very, very good at its job and will use hyperthread pairs for correlated clients and servers at N=32, with one or the other idle at any given moment. This bears out in where the graph begins to diverge. Since there is no SMP contention, the divergence left over from this is likely caused by memory latency as more and more cpu threads bang on memory at the same time.