Re: Performance results / VM related SMP locking work - committed (3)

From: Alex Hornung <ahornung@xxxxxxxxx>
Date: Sat, 29 Oct 2011 05:00:41 +0100

Great work!

Nonetheless, I feel that the last few changes nerf a quad-core machine
way too much: on test29 you are killing 50% of what you gained in the
-j48 buildkernel case, and with -j4, which is the most common case, the
result is even worse than the original. buildworld -j8 on test29 also
loses 50% of the original improvement with commit 2 or 3.

I don't think this is a good trade-off at all; are we really optimizing
for 4-socket 48-core machines and leaving the far more common 4-8 core
machines behind?

Simply adding lwkt_yield()s all over the place doesn't really sound like
a great strategy in the first place. It sounds more like a stopgap or
debug solution for a 48-core machine than something that should be
committed right away.

Alex Hornung

On 29/10/11 00:28, Matthew Dillon wrote:
>       89.61 real       196.30 user        59.04 sys  test29 -j4 (patch)
>       86.55 real       195.14 user        49.52 sys  test29 -j4 (commit)
>       93.77 real       195.94 user        67.68 sys  test29 -j4 (commit 3)
>      167.62 real       360.44 user      4148.45 sys  monster -j48 (prepatch)
>      110.26 real       362.93 user      1281.41 sys  monster -j48 (patch)
>      101.68 real       380.67 user      1864.92 sys  monster -j48 (commit 1)
>       59.66 real       349.45 user       208.59 sys  monster -j48 (commit 3)<<<
>       96.37 real       209.52 user        63.77 sys  test29 -j48 (patch)
>       85.72 real       196.93 user        52.08 sys  test29 -j48 (commit 1)
>       90.01 real       196.91 user        70.32 sys  test29 -j48 (commit 3)
>
>     Kernel build results are as expected for the most part.  -j 48 build
>     times on the many-cores monster are GREATLY improved, from 101 seconds
>     to 59.66 seconds (and down from 167 seconds before this work began).
>     That's a +181% improvement, almost 3x faster.
>
>     The -j 4 build and the quad-core test29 build were not expected to show
>     any improvement since there isn't really any spinlock contention with
>     only 4 cores.  There was a slight nerf on test28 (the quad-core box) but
>     that might be related to some of the lwkt_yield()s added and not so
>     much the PQ_INACTIVE/PQ_ACTIVE vm_page_queues[] changes.
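
For reference, the percentages being argued about can be recomputed from the
"real" column of the quoted table. This is only a rough sketch: the helper
names are made up for illustration, and it assumes the "(patch)" rows are the
baseline for test29 (the table has no prepatch row for that box) and that the
unnumbered "(commit)" row for -j4 means commit 1.

```python
# Wall-clock ("real") seconds copied from the quoted table.
MONSTER_J48 = {"prepatch": 167.62, "commit 1": 101.68, "commit 3": 59.66}
TEST29_J48 = {"patch": 96.37, "commit 1": 85.72, "commit 3": 90.01}
# Assumption: the row labelled just "(commit)" is commit 1.
TEST29_J4 = {"patch": 89.61, "commit 1": 86.55, "commit 3": 93.77}


def speedup(before, after):
    """How many times faster 'after' is than 'before' (hypothetical helper)."""
    return before / after


def fraction_of_gain_lost(baseline, best, final):
    """Share of the (baseline - best) gain given back by 'final'.

    A value above 1.0 means 'final' is slower than the baseline itself.
    """
    gain = baseline - best
    loss = final - best
    return loss / gain


if __name__ == "__main__":
    s = speedup(MONSTER_J48["prepatch"], MONSTER_J48["commit 3"])
    print(f"monster -j48: {s:.2f}x, i.e. +{(s - 1) * 100:.0f}%")

    for name, runs in (("test29 -j48", TEST29_J48), ("test29 -j4", TEST29_J4)):
        lost = fraction_of_gain_lost(runs["patch"], runs["commit 1"], runs["commit 3"])
        print(f"{name}: {lost * 100:.0f}% of the commit 1 gain lost by commit 3")
```

With these inputs the monster figure comes out at roughly 2.81x, matching the
quoted +181%. On test29, -j48 gives back about 40% of the commit 1 gain, and
-j4 gives back more than 100%, i.e. it ends up slower than the patch baseline.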

