DragonFly BSD
DragonFly kernel List (threaded) for 2004-02

Re: lkwt in DragonFly

From: Bosko Milekic <bmilekic@xxxxxxxxxxxxxxxx>
Date: Sun, 8 Feb 2004 19:41:56 -0500

On Sun, Feb 08, 2004 at 04:01:32PM -0800, Matthew Dillon wrote:
> :    It should be theoretically possible to disable CPU migration with a
> :    simple interlock even on FreeBSD 5.x.  That, along with masking out
> :    interrupts, could be used to protect pcpu data without the use of
> :    bus-locked instructions.
> :
> :    However - and this may not be the case for DragonFly - I have
> :    previously noticed that when I ripped out the PCPU mutexes
> :    protecting my pcpu uma_keg structures (just basic per-cpu
> :    structures), thereby replacing the xchgs on the pcpu mutex with
> :    critical sections, performance in general on a UP machine decreased
> :    by about 8%.  I can only assume at this point that the pessimisation
> :    is due to the cost of interrupt unpending when compared to the nasty
> :    scheduling requirements for an xchg (a large and mostly empty
> :    pipeline).
> :...
> :Bosko Milekic  *  bmilekic@xxxxxxxxxxxxxxxx  *  bmilekic@xxxxxxxxxxx
> :TECHNOkRATIS Consulting Services  *  http://www.technokratis.com/
>     It would only add cost to unpending if an interrupt occurred
>     while the critical section was active.  This implies that the
>     critical section is being held for too long a period of time.  I
>     would expect this since the UMA code rightfully assumes that it can
>     hold the pcpu uma mutex for as long as it wants without penalty.

    No, my observations have nothing to do with the way UMA currently
    holds PCPU locks.  Sorry, I should have been more specific.

    What I did in a side branch, in order to accommodate grouped mbuf and
    cluster allocations within the framework of UMA, is define some
    basic extensions to UMA.  The main idea is to overlay something
    I ended up calling a "uma_keg" on top of two distinct zones.  The
    keg itself only contains small fast-caches of objects taken from
    each of the two zones, one fast-cache for each CPU.  This allows me
    to define a grouped-mbuf-and-cluster allocation routine, like
    m_getcl(), which always tries to allocate from the overlaid keg
    before dipping into the zones.  The overlaid kegs use the same pcpu
    locks as UMA does for its regular pcpu caches in the zone.
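    To make the overlay concrete, here is a minimal userland sketch of
    the keg idea described above.  All of the names (uma_keg, keg_cache,
    keg_alloc_pair) are illustrative, not the real UMA API, and the
    underlying zones are elided; only the per-CPU fast-cache and the
    grouped fast path in the spirit of m_getcl() are modeled:

```c
/* Hypothetical sketch of the "uma_keg" overlay: one small fast-cache
 * of mbufs and clusters per CPU, sitting on top of two zones (the
 * zones themselves are not modeled here). */
#define MAXCPU        4
#define KEG_CACHE_SIZE 8

struct keg_cache {
    void *mbufs[KEG_CACHE_SIZE];     /* fast-cache of mbufs    */
    void *clusters[KEG_CACHE_SIZE];  /* fast-cache of clusters */
    int   nmbufs, nclusters;
};

struct uma_keg {
    struct keg_cache kc_cpu[MAXCPU]; /* one fast-cache per CPU */
};

/* Grouped allocation: hand back one mbuf and one cluster from the
 * per-CPU keg cache if both are cached, otherwise report a miss so
 * the caller can dip into the two zones separately. */
static int
keg_alloc_pair(struct uma_keg *keg, int cpu, void **mbuf, void **clust)
{
    struct keg_cache *kc = &keg->kc_cpu[cpu];

    if (kc->nmbufs > 0 && kc->nclusters > 0) {
        *mbuf  = kc->mbufs[--kc->nmbufs];
        *clust = kc->clusters[--kc->nclusters];
        return 0;               /* fast path: keg hit */
    }
    return -1;                  /* miss: fall back to the zones */
}
```

    The point of the layering is that the common case touches only this
    per-CPU state, which is why the choice of lock (or critical section)
    around it dominates the cost.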

    Since the locking required to grab objects from the keg is pretty
    straightforward (and since there is no interlocking involved), it
    was fairly trivial to replace the PCPU lock acquisitions and
    releases around the keg manipulation with mere critical enters/exits
    and gauge performance.  This is not the case for the regular pcpu
    caches in UMA, which are protected by the PCPU cache locks but for
    which there may be interlocking, so merely replacing those locks
    with critical sections is more difficult.  It is more difficult
    because the interlocking, unless removed, means that you have to
    grab a mutex while in a critical section, which may end up migrating
    you to another CPU, and which will exit the critical section anyway.

    Anyway, so for the UMA _keg_ extensions, since there's no
    interlocking, the replacement was next to trivial and the amount of
    code within the critical section was minimal (all you do is check
    the keg and if non-empty, allocate from it or exit the
    critical section and do something else).  And it is precisely with
    this change that I noticed a slight pessimization.  So either I
    grossly underestimated the number of interrupts that occur on
    average while in that critical section or the cost of
    entering/exiting the critical section is at least as high as that of
    grabbing/dropping a mutex.  Again, this may not be the case for
    DragonFly. [Ed.: And now that I've read what follows, I predict it
    likely isn't]
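    A sketch of the trivial replacement described above, with
    crit_enter()/crit_exit() stubbed as a plain nesting counter (in the
    kernel they would defer preemption on the current CPU; the names and
    the keg_fetch_one() helper are hypothetical):

```c
#include <stddef.h>

static int crit_nesting;    /* stand-in for a per-thread nest count */

static void crit_enter(void) { crit_nesting++; }
static void crit_exit(void)  { crit_nesting--; }

/* The critical section covers only the keg check and removal: if the
 * cache is non-empty, take an object; otherwise exit immediately and
 * let the caller take the (slower) zone path.  No interlocking, so no
 * mutex is ever acquired inside the section. */
static void *
keg_fetch_one(void **cache, int *count)
{
    void *obj = NULL;

    crit_enter();
    if (*count > 0)
        obj = cache[--(*count)];    /* keg hit: a few instructions */
    crit_exit();
    return obj;                     /* NULL => dip into the zone */
}
```

    The body of the section really is this small, which is what makes
    the observed 8% pessimization surprising: either interrupts land in
    that tiny window far more often than expected, or crit enter/exit
    itself costs about as much as a lock pair.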

>     The DragonFly slab allocator does not need to use a mutex or token
>     at all and only uses a critical section for very, very short periods 
>     of time within the code.  I have suggested that Jeff recode UMA to 
>     remove the UMA per-cpu mutex on several occasions.

    I have been in touch with Jeff regarding UMA issues for a long while
    and he has mentioned that he did exactly that, several months ago.
    However, I'm not sure exactly what kept that work from going in.
    It's very possible that the interlocking issues, being in the
    critical section while having to grab the zone lock in the case
    where the pcpu cache was empty, remained unresolved.  Also, since I
    did something similar (and simpler) and noticed a pessimization,
    actual performance would have to be evaluated prior to making a
    change like that - and perhaps it was, only to find that performance
    was worse.

>     You should also check the critical section API overhead in FreeBSD-5.
>     If it is a subroutine call and if it actually disables interrupts
>     physically, the overhead is going to be horrendous... probably similar
>     to the overhead of a mutex (sti and cli are very expensive instructions).
>     In DFly the critical section code is only two or three 'normal' inlined 
>     instructions and does not physically disable interrupts.
> 					-Matt
> 					Matthew Dillon 
> 					<dillon@xxxxxxxxxxxxx>

    If you are already in a critical section, the cost is negligible.
    If you are not, which is ALWAYS the case for the UMA keg code,
    then you always disable interrupts.  I remember you committing
    changes a while back that made the general critical section enter
    and exit faster in the common case, deferring the cli to the
    scenario where an interrupt actually occurs.  I don't remember the
    details behind the backout; I guess I'd have to dig up the archives.
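    For reference, the deferral scheme being discussed can be modeled
    like this.  This is a userland illustration, not the DragonFly or
    FreeBSD code, and every name here is hypothetical: enter/exit touch
    only a counter (no cli/sti), and an interrupt arriving inside a
    section is recorded as pending and run by the outermost exit:

```c
static int critnest;      /* per-thread nesting count             */
static int ipending;      /* interrupts deferred while nested     */
static int ints_handled;  /* interrupts actually dispatched       */

static void run_pending(void) { ints_handled += ipending; ipending = 0; }

/* The common case is just an increment/decrement; interrupts are
 * never physically masked. */
static void crit_enter(void) { critnest++; }

static void
crit_exit(void)
{
    if (--critnest == 0 && ipending > 0)
        run_pending();    /* the rare, expensive unpend path */
}

/* Interrupt entry point in this model. */
static void
interrupt(void)
{
    if (critnest > 0)
        ipending++;       /* defer: cost paid later at crit_exit */
    else
        ints_handled++;   /* dispatch immediately */
}
```

    The trade this makes explicit: the common enter/exit path is a few
    normal instructions, and the cli-equivalent cost is only paid when
    an interrupt actually lands inside a section.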

    Perhaps CPU migration should not be permitted as a side-effect of
    being preempted within the kernel; then we could consider similar
    optimizations in FreeBSD 5.x.  Prior to that, however, I wonder what
    measurable gains there are, if any, from allowing full-blown
    preemption with CPU migration within the kernel.  I'll assume for
    the moment that there is a reasonable rationale behind that design
    decision.

Bosko Milekic  *  bmilekic@xxxxxxxxxxxxxxxx  *  bmilekic@xxxxxxxxxxx
TECHNOkRATIS Consulting Services  *  http://www.technokratis.com/
