DragonFly kernel List (threaded) for 2004-04
Re: pipe testing and kernel copyin/copyout/bcopy performance
:I've also read something about caches not being updated by using SSE
:instructions such that if you refer to the memory you just copied that the
:wins for having used SSE in the copy are much diminished.
These are the so-called 'non-temporal' instructions. So, for example,
the standard 128 bit move instruction is 'movdqa' or 'movdqu' (for
double-quad-aligned or double-quad-unaligned). The non-temporal
version is 'movntdq'.
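
To make the distinction concrete, here is a minimal sketch using the SSE2 compiler intrinsics rather than hand-written assembly. The function names copy_cached() and copy_nontemporal() are made up for illustration. Both assume 16-byte-aligned buffers and a length that is a multiple of 16. _mm_store_si128() typically compiles to an aligned 128-bit store such as movdqa, and _mm_stream_si128() to movntdq, though the exact instruction chosen is up to the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_load_si128, _mm_store_si128, _mm_stream_si128 */
#include <stddef.h>

/* Hypothetical helper: copy n bytes with ordinary (cached) 128-bit stores.
 * Assumes dst and src are 16-byte aligned and n is a multiple of 16. */
static void copy_cached(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < n / 16; i++)
        _mm_store_si128(&d[i], _mm_load_si128(&s[i]));   /* normal aligned store */
}

/* Hypothetical helper: same copy, but with non-temporal stores.
 * Same alignment and length assumptions as copy_cached(). */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));  /* non-temporal store */

    /* Non-temporal stores are weakly ordered; fence before anyone reads dst. */
    _mm_sfence();
}
```

Both functions produce the same bytes in dst. The only difference is whether the stores go through the cache hierarchy.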
The non-temporal instructions supposedly queue directly to memory and
do not 'pollute' the caches. You can max out memory bandwidth using
non-temporal instructions (on the Athlon 64 this is about double the
write bandwidth you can get using normal writes). However, the problem
with this is that even maxed out memory only has 1/4 the bandwidth of
the L1 cache, so if you write a general bcopy() function using
non-temporal writes it will have great performance for huge multi-megabyte
copies but horrendously bad performance for block sizes that fit in
the L1/L2 caches, like 16K, 32K, 64K, even 256K (which easily fits in
an Athlon 64's L2 cache).
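
One way to act on this tradeoff is to choose the store type by copy size. The sketch below is only an illustration, not something I tested. The name copy_by_size() and the 1 MB cutoff NT_THRESHOLD are assumptions, and a real cutoff would have to be measured per CPU. Anything small, misaligned, or not a multiple of 16 bytes falls back to memcpy(), so a single call never mixes the two kinds of store.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff: below this, assume the data fits in cache and
 * ordinary stores win. The value is a guess, not a measured number. */
#define NT_THRESHOLD ((size_t)1024 * 1024)

/* Hypothetical helper: cached copy for small blocks, non-temporal
 * copy for large, 16-byte-aligned blocks. */
static void copy_by_size(void *dst, const void *src, size_t n)
{
    /* Use the plain path unless the whole copy can be done with
     * aligned 128-bit non-temporal stores. */
    if (n < NT_THRESHOLD || ((uintptr_t)dst & 15) != 0 || (n & 15) != 0) {
        memcpy(dst, src, n);
        return;
    }

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < n / 16; i++) {
        /* Unaligned load is allowed; the store must be aligned. */
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
    }
    _mm_sfence();  /* make the streamed data visible before returning */
}
```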
You also cannot mix normal writes with non-temporal writes. Well, you
*can* mix the instructions, but the result will be truly hideous (versus
simply horrible) memory performance... I saw a 3 GByte/sec test drop to
< 100 MBytes/sec when I replaced half the movdqa's with movntdq's.
I tried using non-temporals with XMM, MMX, and even integer registers
(movnti instruction), with the same hideous results.
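
For reference, a mixed loop of this kind might look roughly like the sketch below. This is not the actual test code. The names copy_mixed() and copy_movnti() are invented, and alternating every other 16-byte store is just one guess at what "replacing half" means. The second function shows the integer-register variant, where _mm_stream_si32() is the intrinsic for movnti.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_stream_si32 (movnti) */
#include <stddef.h>

/* Hypothetical reconstruction: alternate normal and non-temporal
 * 128-bit stores. Assumes 16-byte-aligned buffers and n a multiple of 32. */
static void copy_mixed(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i + 1 < n / 16; i += 2) {
        _mm_store_si128(&d[i], _mm_load_si128(&s[i]));           /* normal store */
        _mm_stream_si128(&d[i + 1], _mm_load_si128(&s[i + 1]));  /* non-temporal store */
    }
    _mm_sfence();
}

/* Hypothetical helper: non-temporal copy through integer registers,
 * one 32-bit word at a time. */
static void copy_movnti(int *dst, const int *src, size_t count)
{
    for (size_t i = 0; i < count; i++)
        _mm_stream_si32(&dst[i], src[i]);
    _mm_sfence();
}
```

Both functions copy correctly; the point of the sketch is only the store pattern, which is what the bandwidth numbers above are about.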
The effects are probably different on Intel chips. All the testing I did
was on Athlon 64's (Athlon 3200+).