DragonFly kernel List (threaded) for 2004-02
Re: lkwt in DragonFly
On Fri, Feb 06, 2004 at 06:33:47PM -0800, Matthew Dillon wrote:
> The performance trade off is basically the ability for the cpu owning a
> token to operate on the token without having to use bus-locked
> instructions, versus having to send an IPI message to the owning cpu
> when the requesting cpu does not own the token. This is not the major
> reason for using tokens, though. Still, judging by the benchmark
> comparisons we've done between 4.x and DFly, the overhead of a 'miss'
> appears to be very small. It turns out that most locks (like vnode locks)
> hit the same-cpu case a *LOT* more than they hit the foreign-cpu case
> so the overall benefit is at least equivalent to FreeBSD-5's mutex
> overhead. It certainly isn't worse. We do lose when it comes to
> IPC operations between two processes running on different CPUs, but
> if this is the only thing that winds up being a problem it is a simple
> enough case to fix.
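To make the trade-off above concrete, here is a minimal single-threaded
model of the decision being described. It is only a sketch and not
DragonFly code: `token_model`, `token_model_acquire`, and the counters
are hypothetical names, and the foreign-cpu path merely counts a
would-be IPI request and hands ownership over instead of sending a
message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a cpu-owned token; not the DragonFly structure. */
struct token_model {
    int owner_cpu;               /* cpu that currently owns the token */
    unsigned long local_hits;    /* acquisitions by the owning cpu */
    unsigned long ipi_requests;  /* acquisitions that needed the owner's help */
};

/*
 * Returns true when the requesting cpu already owns the token (the
 * same-cpu case, which needs no bus-locked instruction) and false when
 * it had to take the foreign-cpu path.
 */
bool token_model_acquire(struct token_model *tok, int mycpu)
{
    assert(mycpu >= 0);

    if (tok->owner_cpu == mycpu) {
        /* Same-cpu case: plain loads and stores are enough. */
        tok->local_hits++;
        return true;
    }

    /*
     * Foreign-cpu case: a real implementation would message the owning
     * cpu and wait for it to hand the token over. This model only
     * records the request and transfers ownership directly.
     */
    tok->ipi_requests++;
    tok->owner_cpu = mycpu;
    return false;
}
```

If most acquisitions land in the first branch, as reported for vnode
locks, `ipi_requests` stays small relative to `local_hits`, which is the
case the quoted paragraph argues for.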
But if your vnode lock observations are accurate (i.e., that we hit
the same-cpu case a lot more than the foreign-cpu case), then even
with lock-prefixed instructions we should not see the same penalty
that we would if cache invalidations had to occur.
I have always wondered what effect a lock-prefixed instruction has on
the data caches. In other words, say I atomically grab a mutex, release
it, and then grab it again on the same cpu a little while later. The
cost of atomically regrabbing the lock should not be the same as the
cost of atomically grabbing a mutex previously owned by another CPU.
So I dug and dug, and as it turns out, on processors later than the
Pentium Pro, my assumption seems to be correct:
(Specifically referring to the post timestamped 06-23-2003, 11:45PM,
So if two processors have the value stored in their respective caches and
it's shareable in both caches, a bus operation will be required so that
the processor doing the write has exclusive ownership of the memory area.
This is not particularly expensive and not nearly as expensive as locking
the entire front-side bus for the duration of the read-modify-write
operation.
The Pentium Pro and prior processors did just this. They locked the
entire bus for the duration of the locked operation; in fact, this was
what the lock prefix meant.
Later processors were much smarter. They don't actually lock out the
entire front-side bus. They simply acquire the necessary cache state
using normal transactions and lock the element in the cache.
If there's no attempt by another processor to steal the cache line,
the bus impact of a locked operation is no different than the
corresponding unlocked operation.
Unfortunately, the impact on the processor pipelines is not the same.
So I guess the point is merely that for reasonably warm caches, the
overhead of a bus-locked instruction is mitigated. Although, as also
noted, the fact that ordering needs to be ensured still sucks.
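
For reference, the grab/release/regrab pattern I have in mind looks
roughly like the sketch below. It is plain C11, not FreeBSD or DragonFly
code, and `spin_mutex` and its functions are hypothetical names. On x86,
compilers typically emit a lock-prefixed cmpxchg for the compare-exchange,
so calling `regrab()` from one cpu exercises the warm-cache case: the
line stays in that cpu's cache and nobody tries to steal it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spin mutex: 0 = free, 1 = held. */
struct spin_mutex {
    atomic_int held;
};

void spin_mutex_init(struct spin_mutex *m)
{
    atomic_init(&m->held, 0);
}

/* One acquisition attempt via an atomic compare-exchange (0 -> 1). */
bool spin_mutex_try(struct spin_mutex *m)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&m->held, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Release with a plain store that has release ordering. */
void spin_mutex_release(struct spin_mutex *m)
{
    assert(atomic_load_explicit(&m->held, memory_order_relaxed) == 1);
    atomic_store_explicit(&m->held, 0, memory_order_release);
}

/*
 * Grab and release the mutex n times from the calling thread and return
 * the number of successful acquisitions. With no contention every
 * attempt succeeds and the cache line never leaves the local cpu.
 */
unsigned long regrab(struct spin_mutex *m, unsigned long n)
{
    unsigned long got = 0;

    for (unsigned long i = 0; i < n; i++) {
        if (spin_mutex_try(m)) {
            got++;
            spin_mutex_release(m);
        }
    }
    return got;
}
```

Timing `regrab()` on one cpu against the same loop with a second thread
bouncing the line between cpus would be one way to measure the
difference, though I have not included numbers here.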
Bosko Milekic * bmilekic@xxxxxxxxxxxxxxxx * bmilekic@xxxxxxxxxxx
TECHNOkRATIS Consulting Services * http://www.technokratis.com/
"Of course people don't want war... that is understood. But voice or
no voice, the people can always be brought to the bidding of the
leaders. That's easy. All you have to do is tell them they are
being attacked, and denounce the pacifists for lack of patriotism
and for exposing the country to danger. It works the same in any
country." -- Hermann Goering