DragonFly kernel List (threaded) for 2011-07
Re: Blogbench RAID benchmarks
Ok, after much experimentation I've figured out what is going on.
First, why is UFS skewed towards writing to the extreme detriment of
reads while HAMMER is skewed towards reading to the extreme detriment
of writes? In a word: flushing meta-data out in UFS doesn't require
as many locks to be held as flushing meta-data out in HAMMER does.
The issue in UFS can be somewhat controlled by an I/O scheduler
but it isn't straightforward due to the way disk drives handle write
I/O's versus read I/O's. Write I/O's tend to get acknowledged instantly
by the hard drive up until the point where the hard drive's own ram
cache fills up with dirty data, and there is no way to gauge and control
the backlog. One must also ensure that some hardware protocol tags are
reserved for reading and some are reserved for writing so read I/O isn't
able to completely stall out write I/O or vice versa. DragonFly does
this in its CAM layer (I don't know about FreeBSD, it is something I
added recently). It's very difficult to control write bandwidth in an
I/O scheduler without simulating/calculating probable seek times for
random vs linear write I/O.
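To make the tag reservation idea concrete, here is a minimal sketch
in C. It is not the DragonFly CAM code; the structure, the function
names and the reserve sizes are all hypothetical. Reads may use every
tag except the write reserve, and writes may use every tag except the
read reserve, so neither side can take the whole device queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device tag accounting, not the real CAM structure. */
struct tag_pool {
	int total;		/* tags the device accepts at once */
	int rd_reserve;		/* tags that only reads may occupy */
	int wr_reserve;		/* tags that only writes may occupy */
	int rd_active;		/* reads currently holding a tag */
	int wr_active;		/* writes currently holding a tag */
};

/* Try to take a tag for a read.  Reads never eat the write reserve. */
bool
tag_try_read(struct tag_pool *tp)
{
	if (tp->rd_active + tp->wr_active >= tp->total)
		return false;
	if (tp->rd_active >= tp->total - tp->wr_reserve)
		return false;
	tp->rd_active++;
	return true;
}

/* Try to take a tag for a write.  Writes never eat the read reserve. */
bool
tag_try_write(struct tag_pool *tp)
{
	if (tp->rd_active + tp->wr_active >= tp->total)
		return false;
	if (tp->wr_active >= tp->total - tp->rd_reserve)
		return false;
	tp->wr_active++;
	return true;
}

/* Give a tag back when the I/O completes. */
void
tag_done(struct tag_pool *tp, bool is_read)
{
	if (is_read) {
		assert(tp->rd_active > 0);
		tp->rd_active--;
	} else {
		assert(tp->wr_active > 0);
		tp->wr_active--;
	}
}
```

Note that this only bounds how many tags each direction can hold. It
does nothing about the dirty data already sitting in the drive's own
cache, which is the part that cannot be gauged.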
For HAMMER the problem is that HAMMER's flusher threads are constantly
getting stalled out by B-Tree locks being held by the ~100 reader
threads (in the blogbench test). Fixing this in HAMMER cannot be done
in the I/O scheduler, because stalling out read I/O's in the I/O
scheduler (in order to try to make more bw available for writing)
will simply cause the related B-Tree locks to be held even longer and
cause write activity to actually go down. The fix has to be in HAMMER
itself. NOTE: I cannot solve this by giving the flusher's exclusive
locks priority over the frontend's shared locks without creating
major 3-thread deadlock chains, and using exclusive locks in the readers
results in reduced read concurrency.
So, I am going to commit some experimental code to HAMMER which tries
to manage the locking conflicts between the frontend reader threads and
the backend flusher threads. I am going to do this by creating a
pulse-width modulated time-domain multiplexer in HAMMER which tries
to 'slot in' reads and writes based on the number of inodes backlogged
in the flusher.
Basically the idea of using a PWM is this: You take a fixed period of
time, say 1/5 of a second:
You allot a portion of the time slice to the backend flusher and the
remainder to the frontend.
[wwwwwwrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr] Flusher lightly loaded
[wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwrrrrrrr] Flusher heavily loaded
Read I/O operations in a heavily loaded system can stall for much
more than 1/5 of a second. Even so, making the read operations delay
a certain number of ticks before being initiated gives the flusher
a chance to win locking conflicts, and thus the flusher is able to
gain performance over the frontend reads.
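The slotting described above can be sketched roughly as follows in C.
This is an illustration only, not the code that was committed: the
function names, the 100 Hz tick rate (which makes 1/5 of a second 20
ticks), the minimum and maximum write slots and the backlog limit
parameter are all assumptions. The write portion sits at the start of
each period and grows linearly with the number of inodes backlogged in
the flusher. A reader arriving inside the write portion is told how
many ticks to wait before issuing its I/O.

```c
#include <assert.h>

/* Hypothetical constants: 1/5 second at an assumed 100 ticks/second. */
#define PWM_PERIOD_TICKS	20
#define PWM_MIN_WRITE		2	/* flusher always gets a little */
#define PWM_MAX_WRITE		18	/* readers always get a little */

/*
 * Number of ticks at the start of each period given to the flusher,
 * scaled linearly by the flusher backlog (inodes queued).
 */
int
pwm_write_ticks(int backlog, int backlog_limit)
{
	assert(backlog_limit > 0);
	if (backlog < 0)
		backlog = 0;
	if (backlog > backlog_limit)
		backlog = backlog_limit;
	return PWM_MIN_WRITE +
	    (PWM_MAX_WRITE - PWM_MIN_WRITE) * backlog / backlog_limit;
}

/*
 * Ticks a reader arriving at tick 'now' should delay before it
 * initiates its I/O.  Returns 0 when 'now' already falls in the
 * read portion of the period.
 */
int
pwm_read_delay(int now, int backlog, int backlog_limit)
{
	int phase, w;

	assert(now >= 0);
	phase = now % PWM_PERIOD_TICKS;
	w = pwm_write_ticks(backlog, backlog_limit);
	return (phase < w) ? (w - phase) : 0;
}
```

With these made-up numbers an idle flusher gets 2 of every 20 ticks and
a fully backlogged one gets 18. The delay only holds a read back before
it is initiated, so the reader is not sitting on B-Tree locks while it
waits.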
This change isn't just to help blogbench out. It also appears to solve
some major issues with namecache stalls that occur when HAMMER is
heavily write-loaded, and issues with things like vi ':wq' operations
(which fsync()) seem to be improved. My commit message also mentions
it helping with 'ls' and 'find' but I think the 'ls' and 'find' issue
needs a bit more work.
The effect on the blogbench tests is basically to improve write
performance a little at the cost of read performance. This tradeoff
is due to hard drive seek times and is unavoidable.
Approximate values only, for blogbench in stage 2 after the system
caches are blown out. R articles vs W articles.

                    read    write
    UFS:             600     4000   (freebsd)
    HAMMER BEFORE: 20000       50   (dragonfly)
    HAMMER AFTER:   2500      150   (dragonfly)  <-- this is an improvement
                                                     even though it may
                                                     not seem that way.
As you can see HAMMER still prioritizes reads, and that is precisely
what I want to have happen... reads are far more important than writes.
We don't want writes to stall out completely but neither do we want
writes to be able to stall reads out completely. In the blogbench test
one basically has ~100 threads issuing random reads, but the read issued
by each thread is for a whole file and is thus linear. In other words,
increasing the write activity by a little decreases the disk bandwidth
(due to spindles/seeks) by a lot.
So, Francois, let's see how the stuff I committed works out. You need
to remove the temporary patches I forwarded to you on IRC. What I
committed is the final version. I dunno if the graphs will look any
better since they are so badly skewed towards the pre-system-cache-
blowout numbers, but things should run more smoothly.