DragonFly users List (threaded) for 2011-10
Performance results / VM related SMP locking work - committed (3)
Another huge performance improvement for many-core systems. I removed
the last bottleneck spinlock in the VM system. This spinlock was only
locking the PQ_INACTIVE vm_page_queue for a very short period of time,
but with 48 cores that was enough to limit the VM fault rate. With the
fix, concurrent compiles go much, MUCH faster, a major improvement on
top of the major improvement the prior commits already delivered.
Test Machine specs:
monster: CPU: AMD Opteron(tm) Processor 6168 (1900.01-MHz K8-class CPU)
48-cores, 64G ram, running 64-bit DFly 1.13 master
test29: CPU: AMD Phenom(tm) II X4 820 Processor (2799.91-MHz K8-class CPU)
4-cores, 8G ram, running 64-bit DFly 1.13 master
Monster is a 48-core Opteron (4 cpu sockets). Test29 is a quad-core
Phenom II 820 (Deneb). On a per-core basis Test29 is about 1.47x
faster, and of course it is significantly faster dealing with contention
since it is a single-chip cpu vs the 4-socket monster. The many-core
monster is a very good environment for contention testing.
The tests below do not test paging to swap. There is plenty of memory
to cache the source trees, object trees, and the worst-case run-time
memory footprint. These are strictly kernel/cpu contention tests
running in a heavily fork/exec'd environment (aka buildworld -j N).
Because even parallel (-j) buildworlds have a lot of bottlenecks, there
just isn't a whole lot of difference between -j 40 and -j 8 on monster.
Usually the CC's run in parallel but then it bottlenecks at the LD line
(as well as waiting for the 'last' cc to finish, which happens a lot
when the buildworld is working on GCC). The slower opteron cores
become very obvious during these bottleneck moments. Only libraries
like libc or a kernel NO_MODULES build have enough source files to
actually fan out to all available cpus.
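The bottleneck effect described above is what Amdahl's law models: once part of the build is serialized (the LD step, waiting on the last cc), adding more jobs buys less and less. The sketch below is only an illustration. The parallel fractions are hypothetical values chosen for the example, not measurements of buildworld, and amdahl_speedup is a helper name invented here.

```python
def amdahl_speedup(parallel_fraction, jobs):
    """Ideal speedup when only parallel_fraction of the work scales with jobs."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / jobs)


# HYPOTHETICAL parallel fractions, not measured from any buildworld run.
for parallel_fraction in (0.5, 0.9, 0.99):
    s8 = amdahl_speedup(parallel_fraction, 8)
    s40 = amdahl_speedup(parallel_fraction, 40)
    print(f"parallel={parallel_fraction:.2f}  -j 8: {s8:.2f}x  "
          f"-j 40: {s40:.2f}x  gain from 8 to 40 jobs: {s40 / s8:.2f}x")
```

The lower the parallel fraction, the closer the -j 8 and -j 40 results get, which is the behaviour seen with a single buildworld on monster.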
That said, buildworlds exercise more edge cases in the kernel than
a kernel NO_MODULES build, so I prefer using buildworlds for general
testing. I have some -j 4 tests on test29 and monster for a buildkernel
at the end.
To better utilize the available cores on monster, the main VM contention
test runs FOUR buildworld -j 40's in parallel instead of one.
The realtime numbers are what matter here (the 'real' column).
Note that the 4x numbers are taken from just one of the builds,
but all four finish at the same time.
monster buildworld -j 40 timings 1x prepatch: (BASELINE)
2975.43 real 4409.48 user 12971.42 sys
2999.20 real 4423.16 user 13014.44 sys
monster buildworld -j 40 timings 1x postpatch: (+14.9% improvement)
2587.42 real 4328.87 user 8551.91 sys
monster buildworld -j 40 timings 1x commit 1: (+15.4% improvement)
2577.46 real 4125.42 user 13079.62 sys
2552.94 real 4087.60 user 13085.19 sys
monster buildworld -j 40 timings 1x commit 3: (+43.8% improvement)<<<
2068.67 real 4124.96 user 4227.38 sys
2062.34 real 4139.10 user 4301.78 sys
monster buildworld -j 40 timings 4x prepatch: (BASELINE)
8302.17 real 4629.97 user 17617.84 sys
8308.01 real 4716.70 user 22330.26 sys
monster buildworld -j 40 timings 4x postpatch: (+43.2% improvement)
5799.53 real 5254.76 user 23651.73 sys
5800.49 real 5314.23 user 23499.59 sys
monster buildworld -j 40 timings 4x commit 1: (+96.8% improvement)
4207.85 real 4869.90 user 20673.71 sys
4248.45 real 4899.08 user 21697.11 sys
monster buildworld -j 40 timings 4x commit 2: (+107% improvement)
3943.25 real 4630.76 user 21062.91 sys
monster buildworld -j 40 timings 4x commit 3: (+229% improvement)<<<
2518.78 real 4344.02 user 4674.45 sys
test29 (quad cpu) buildworld -j 8 timings 1x prepatch: (BASELINE)
1964.60 real 3004.07 user 1388.79 sys
1963.29 real 3002.82 user 1386.75 sys
test29 (quad cpu) buildworld -j 8 timings 1x postpatch: (+11.07% improvement)
1768.93 real 2864.34 user 1212.24 sys
1771.11 real 2875.10 user 1203.29 sys
test29 (quad cpu) buildworld -j 8 timings 1x commit 1: (+13.4% improvement)
1731.45 real 2749.91 user 1106.81 sys
1729.24 real 2756.48 user 1100.90 sys
test29 (quad cpu) buildworld -j 8 timings 1x commit 3: (+7.4% improvement)
1828.75 real 2737.53 user 1387.10 sys
The results show a truly massive improvement in performance on our
48-core machine. A +229% improvement is well over 3x as fast. The
build time for four concurrent buildworlds (that is, all four finish at
the same time, about 2500 seconds after all four were started) is only
about 450 seconds longer than for one, meaning that we are
getting very good concurrency now.
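For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the percentages are computed as baseline 'real' time divided by new 'real' time, minus one, which appears consistent with the figures quoted (small differences are likely down to which of the repeated runs is used). The helper name improvement_pct is made up for this example.

```python
def improvement_pct(baseline_real, new_real):
    """Percent improvement in build rate (assumed formula: baseline/new - 1)."""
    return (baseline_real / new_real - 1.0) * 100.0


# 'real' seconds copied from the monster timings (first run of each set).
four_x_prepatch = 8302.17
four_x_commit3 = 2518.78
one_x_commit3 = 2068.67

speedup_4x = four_x_prepatch / four_x_commit3
pct_4x = improvement_pct(four_x_prepatch, four_x_commit3)

# Extra wall time for four concurrent builds compared with a single build.
extra_seconds = four_x_commit3 - one_x_commit3

# Work completed per second: four builds at once versus one build alone.
throughput_scaling = 4 * one_x_commit3 / four_x_commit3

print(f"4x commit 3 vs 4x prepatch: {speedup_4x:.2f}x as fast (+{pct_4x:.1f}%)")
print(f"extra wall time, 4 builds vs 1: {extra_seconds:.0f} seconds")
print(f"throughput, 4 concurrent builds vs 1: {throughput_scaling:.2f}x")
```

The last figure is the interesting one for concurrency: running four builds at once gets well over three builds' worth of work done in the time of one.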
BUILDKERNEL NO_MODULES=YES TESTS
This set of tests is using a buildkernel without modules, which has
much greater compiler concurrency versus a buildworld test since
the make can keep N gcc's running most of the time.
137.95 real 277.44 user 155.28 sys monster -j4 (prepatch)
143.44 real 276.47 user 126.79 sys monster -j4 (patch)
122.24 real 281.13 user 97.74 sys monster -j4 (commit)
127.16 real 274.20 user 108.37 sys monster -j4 (commit 3)
89.61 real 196.30 user 59.04 sys test29 -j4 (patch)
86.55 real 195.14 user 49.52 sys test29 -j4 (commit)
93.77 real 195.94 user 67.68 sys test29 -j4 (commit 3)
167.62 real 360.44 user 4148.45 sys monster -j48 (prepatch)
110.26 real 362.93 user 1281.41 sys monster -j48 (patch)
101.68 real 380.67 user 1864.92 sys monster -j48 (commit 1)
59.66 real 349.45 user 208.59 sys monster -j48 (commit 3)<<<
96.37 real 209.52 user 63.77 sys test29 -j48 (patch)
85.72 real 196.93 user 52.08 sys test29 -j48 (commit 1)
90.01 real 196.91 user 70.32 sys test29 -j48 (commit 3)
Kernel build results are as expected for the most part. -j 48 build
times on the many-core monster are GREATLY improved, from 101 seconds
to 59.66 seconds (and down from 167 seconds before this work began).
Measured against that 167 second prepatch time it's a +181%
improvement, almost 3x as fast.
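The 'sys' column tells the same story from a different angle. The sketch below takes the monster -j48 rows from the table above and works out the speedup against prepatch plus the share of cpu time spent in the kernel. The sys_share helper and the report layout are just illustrative choices, not anything from the tree.

```python
# (real, user, sys) seconds for monster -j48, copied from the table above.
runs = {
    "prepatch": (167.62, 360.44, 4148.45),
    "patch": (110.26, 362.93, 1281.41),
    "commit 1": (101.68, 380.67, 1864.92),
    "commit 3": (59.66, 349.45, 208.59),
}


def sys_share(user, sys_time):
    """Fraction of total cpu time (user + sys) spent in the kernel."""
    return sys_time / (user + sys_time)


baseline_real = runs["prepatch"][0]
report = {}
for label, (real, user, sys_time) in runs.items():
    speedup = baseline_real / real
    share = sys_share(user, sys_time)
    report[label] = (speedup, share)
    print(f"{label:9s} {speedup:5.2f}x vs prepatch, "
          f"{share * 100:5.1f}% of cpu time in sys")
```

User time barely moves between rows, so nearly all of the gain comes from the kernel no longer burning cpu on the contended lock.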
The -j 4 build and the quad-core test29 build were not expected to show
any improvement since there isn't really any spinlock contention with
only 4 cores. There was a slight nerf on test29 (the quad-core box) but
that might be related to some of the lwkt_yield()s added and not so
much the PQ_INACTIVE/PQ_ACTIVE vm_page_queue changes.