DragonFly kernel List (threaded) for 2004-02
Re: lkwt in DragonFly
On Sun, Feb 08, 2004 at 04:01:32PM -0800, Matthew Dillon wrote:
> : It should be theoretically possible to disable CPU migration with a
> : simple interlock even on FreeBSD 5.x. That, along with masking out
> : interrupts, could be used to protect pcpu data without the use of
> : bus-locked instructions.
> : However - and this may not be the case for DragonFly - I have
> : previously noticed that when I ripped out the PCPU mutexes
> : protecting my pcpu uma_keg structures (just basic per-cpu
> : structures), thereby replacing the xchgs on the pcpu mutex with
> : critical sections, performance in general on a UP machine decreased
> : by about 8%. I can only assume at this point that the pessimisation
> : is due to the cost of interrupt unpending when compared to the nasty
> : scheduling requirements for an xchg (a large and mostly empty
> : pipeline).
> :Bosko Milekic * bmilekic@xxxxxxxxxxxxxxxx * bmilekic@xxxxxxxxxxx
> :TECHNOkRATIS Consulting Services * http://www.technokratis.com/
> It would only add cost to unpending if an interrupt occurred
> while the critical section was active. This implies that the
> critical section is being held for too long a period of time. I
> would expect this since the UMA code rightfully assumes that it can
> hold the pcpu uma mutex for as long as it wants without penalty.
No, my observations have nothing to do with the way UMA currently
holds PCPU locks. Sorry, I should have been more specific.
What I did in a side branch, in order to accommodate grouped mbuf and
cluster allocations within the framework of UMA, is define some
basic extensions to UMA. The main idea is to overlay something
I ended up calling a "uma_keg" on top of two distinct zones. The
keg itself only contains small fast-caches of objects taken from
each of the two zones, one fast-cache for each CPU. This allows me
to define a grouped-mbuf-and-cluster allocation routine, like
m_getcl(), which tries to always allocate from the overlayed keg
before dipping into the zones. The overlayed kegs use the same pcpu
locks as UMA does for its regular pcpu caches in the zone.
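The keg overlay described above might be sketched roughly as follows. This is only an illustration of the idea, not the actual side-branch code: all names (`uma_keg`, `keg_pcpu`, `pkt`) and sizes here are invented stand-ins, and the underlying two zones are elided entirely.

```c
#define MAXCPU        4
#define KEG_CACHE_MAX 8     /* objects per per-CPU fast-cache (hypothetical) */

/* A grouped object: an mbuf-like header paired with a cluster. */
struct pkt {
    void *hdr;
    void *cluster;
};

/* Per-CPU fast-cache of pre-constructed pairs taken from the two zones. */
struct keg_pcpu {
    struct pkt cache[KEG_CACHE_MAX];
    int        count;       /* objects currently in this CPU's cache */
};

/* The keg overlays two distinct zones; only the fast-caches live here. */
struct uma_keg {
    struct keg_pcpu kc[MAXCPU];
};

/*
 * m_getcl()-style grouped allocation: always try the current CPU's
 * fast-cache first. Returns 1 on a keg hit; 0 means the caller must
 * dip into the underlying zones (elided).
 */
static int
keg_alloc(struct uma_keg *keg, int cpu, struct pkt *out)
{
    struct keg_pcpu *kc = &keg->kc[cpu];

    if (kc->count > 0) {
        *out = kc->cache[--kc->count];
        return 1;
    }
    return 0;               /* miss: fall back to the zones */
}

static void
keg_free(struct uma_keg *keg, int cpu, struct pkt p)
{
    struct keg_pcpu *kc = &keg->kc[cpu];

    if (kc->count < KEG_CACHE_MAX)
        kc->cache[kc->count++] = p;
    /* else: return the pair to its zones (elided) */
}
```

The point of the overlay is that a grouped allocation hits one per-CPU cache once, instead of visiting two zones separately.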
Since the locking required to grab objects from the keg is pretty
straightforward (and since there is no interlocking involved), it
was fairly trivial to merely replace the PCPU lock acquisitions and
releases around the keg manipulation with mere critical enters/exits
and gauge performance. This is not the case for the regular pcpu
caches in UMA, which are protected by the PCPU cache locks but for
which there may be interlocking and so merely replacing them with
critical sections is more difficult. It is more difficult because
the interlocking, unless removed, means that you have to grab a
mutex while inside a critical section; blocking on that mutex exits
the critical section and may end up migrating you to another CPU.
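The hazard just described can be modelled in a few lines. This is a toy simulation, not kernel code: `crit_enter`/`crit_exit`/`mtx_lock_blocking` are hypothetical stand-ins, with migration reduced to a CPU-number variable, purely to show why a per-CPU reference taken inside a critical section goes stale once you must block on a mutex.

```c
#include <assert.h>

/* Simulated per-thread state (invented names, not the real kernel APIs). */
static int crit_count;      /* critical-section nesting depth */
static int cur_cpu;         /* CPU the thread currently runs on */

static void crit_enter(void) { crit_count++; }
static void crit_exit(void)  { crit_count--; }

/*
 * Simulated blocking mutex: sleeping on it may migrate the thread,
 * and blocking inside a critical section is not permitted at all.
 */
static void
mtx_lock_blocking(void)
{
    assert(crit_count == 0);    /* the interlocking hazard */
    cur_cpu = 1;                /* model migration to another CPU */
}

/*
 * Refill path with interlocking: the critical section must be dropped
 * before the zone mutex is taken, after which any cached per-CPU
 * reference may point at the wrong CPU's data.
 */
static int
cache_refill(void)
{
    int cpu;

    crit_enter();
    cpu = cur_cpu;          /* pcpu reference is only stable in here */
    crit_exit();            /* must leave before blocking... */
    mtx_lock_blocking();    /* ...which may migrate us */
    return cpu != cur_cpu;  /* nonzero: our pcpu reference went stale */
}
```

In the real allocator the consequence is that the per-CPU cache pointer must be re-fetched after the zone lock is taken, which is exactly the complication the keg path avoids by having no interlocking.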
Anyway, so for the UMA _keg_ extensions, since there's no
interlocking, the replacement was next to trivial and the amount of
code within the critical section was minimal (all you do is check
the keg and if non-empty, allocate from it or exit the
critical section and do something else). And it is precisely with
this change that I noticed a slight pessimisation. So either I
grossly underestimated the number of interrupts that occur on
average while in that critical section or the cost of
entering/exiting the critical section is at least as high as that of
grabbing/dropping a mutex. Again, this may not be the case for
DragonFly. [Ed.: And now that I've read what follows, I predict it
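The trivial replacement described above, with the critical section covering nothing but a check-and-pop on the keg, might look roughly like this. Again, the names are invented stand-ins rather than the actual FreeBSD primitives:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real critical-section primitives. */
static int crit_count;
static void crit_enter(void) { crit_count++; }
static void crit_exit(void)  { crit_count--; }

#define CACHE_MAX 4
static void *cache[CACHE_MAX];  /* one CPU's keg fast-cache */
static int   cache_cnt;

/*
 * The entire critical section is: check the keg and, if non-empty,
 * allocate from it; otherwise exit and do something else.
 */
static void *
keg_tryalloc(void)
{
    void *obj = NULL;

    crit_enter();
    if (cache_cnt > 0)
        obj = cache[--cache_cnt];
    crit_exit();
    return obj;             /* NULL: caller dips into the zones */
}
```

With a window this small, an interrupt landing inside the section should be rare, which is why the observed ~8% pessimisation points at the per-enter/exit cost rather than at unpending frequency.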
> The DragonFly slab allocator does not need to use a mutex or token
> at all and only uses a critical section for very, very short periods
> of time within the code. I have suggested that Jeff recode UMA to
> remove the UMA per-cpu mutex on several occasions.
I have been in touch with Jeff regarding UMA issues for a long while
and he has mentioned that he did exactly that, several months ago.
However, I'm not sure exactly what prevented that work from going in.
It's very possible that the interlocking issues involving being in
the critical section and having to grab the zone lock in the case
where the pcpu cache was empty remained unresolved. Also, since I
did something similar (and simpler) and noticed a pessimisation,
actual performance would have to be evaluated prior to making a
change like that - and perhaps it was only to find that performance
> You should also check the critical section API overhead in FreeBSD-5.
> If it is a subroutine call and if it actually disables interrupts
> physically, the overhead is going to be horrendous... probably similar
> to the overhead of a mutex (sti and cli are very expensive instructions).
> In DFly the critical section code is only two or three 'normal' inlined
> instructions and does not physically disable interrupts.
> Matthew Dillon
If you are already in a critical section, the cost is negligible.
If you are not, which is ALWAYS when it comes to the UMA keg code,
then you always disable interrupts. I remember you a while back
committing changes that made the general critical section enter and
exit faster in the common case, deferring the cli to the scenario
where an interrupt actually occurs. I don't remember the details
behind the backout. I guess I'd have to dig up the archives.
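The deferred scheme recalled above (and, as described, the DragonFly design) can be modelled like this. This is a sketch of the general technique only, under invented names, with interrupt delivery reduced to a counter: the fast path is a plain increment with no cli/sti, and physical masking or replay happens only when an interrupt actually arrives inside a section.

```c
/* Hypothetical model of a deferred-interrupt critical section. */
static int nesting;         /* critical-section depth; fast path only */
static int pending;         /* interrupts deferred while nesting > 0 */
static int handled;         /* interrupts actually processed */

static void crit_enter(void) { nesting++; }     /* no cli: one inc */

static void
crit_exit(void)
{
    /* On final exit, replay anything that unpended meanwhile. */
    if (--nesting == 0 && pending > 0) {
        handled += pending;
        pending = 0;
    }
}

/* An arriving interrupt is deferred only if we are inside a section. */
static void
interrupt(void)
{
    if (nesting > 0)
        pending++;          /* the unpending cost discussed above */
    else
        handled++;          /* common case: runs immediately */
}
```

The trade is visible in the model: the common path costs a couple of ordinary instructions, and the expensive work is paid only on the (presumed rare) interrupt-while-inside case.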
Perhaps CPU migration should not be permitted as a side-effect of
being pre-empted within the kernel; then we could consider similar
optimisations in FreeBSD 5.x. Prior to that, however, I wonder what
measurable gains there are from allowing full-blown pre-emption with
CPU migration within the kernel, if any. I'll assume for the moment
that there is a reasonable rationale behind that design decision.
Bosko Milekic * bmilekic@xxxxxxxxxxxxxxxx * bmilekic@xxxxxxxxxxx
TECHNOkRATIS Consulting Services * http://www.technokratis.com/