DragonFly kernel List (threaded) for 2004-02
In-pipeline instruction timing tests

From: Matthew Dillon <dillon@xxxxxxxxxxxxxxxxxxxx>
Date: Mon, 9 Feb 2004 10:17:58 -0800 (PST)

    Here are some basic instruction timing tests.  My particular interest
    is in the compare/jz/addl-to-mem test versus a cmpxchgl or locked
    cmpxchgl.  The compare/jz/addl-to-mem test simulates the new token
    code overhead (minus the %fs load-from-memory, which I cannot easily
    simulate from userland), while a locked compare-exchange simulates
    a mutex.

    Note in particular that a cmp/jz/addl sequence seems to be far better
    pipelined on both the AMD64 and the P4 than a cmpxchgl no matter which
    way you turn it, and that *ANY* locked bus cycle instruction does really
    horrible things to the cpu's pipeline.

                                2xP3     AMD64    1xP4
                                1.2GHz   3200+    1.7GHz

cpu_add, addl to mem            1.535ns  0.194ns  0ns (1)
cpu_ladd, lock; addl to mem     37.50ns  7.869ns  69.660ns
cpu_call, 1 call/ret            3.934ns  1.921ns  3.550ns
cpu_cmpadd, cmp/jz/addl to mem  4.027ns  0.583ns  0.765ns
cpu_cmpexg, cmpxchgl            6.420ns  2.169ns  7.100ns
cpu_lcmpext, lock; cmpxchgl     42.84ns  7.479ns  72.35ns

	note(1): addl to mem is completely absorbed or almost
	completely absorbed by the cpu's pipeline in this test.
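
    For anyone who wants to try something similar from userland, here is
    a rough C sketch of the two cases of interest.  It is not the test
    code that produced the numbers above; the names cmp_add,
    try_lock_cmpxchg, time_cmp_add and time_locked_cmpxchg are made up
    for illustration.  It assumes a C11 compiler, and that
    atomic_compare_exchange_strong is emitted as a lock; cmpxchgl on x86,
    which is what GCC and Clang typically do.

```c
#include <stdatomic.h>
#include <time.h>

/* Token-style path: compare, branch, then a plain (unlocked) add to memory. */
static void cmp_add(volatile int *owner, int me, volatile int *count)
{
    if (*owner == me)
        *count += 1;
}

/* Mutex-style path: try to move the lock word from 0 to 1 with an atomic
 * compare-exchange.  Returns 1 on success, 0 if the lock was already held. */
static int try_lock_cmpxchg(atomic_int *lock)
{
    int expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, 1);
}

/* Wall-clock time in nanoseconds, via C11 timespec_get. */
static double now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average cost in ns of one cmp/branch/add; volatile keeps the loop alive. */
static double time_cmp_add(long loops)
{
    volatile int owner = 1, count = 0;
    double t0 = now_ns();
    for (long i = 0; i < loops; i++)
        cmp_add(&owner, 1, &count);
    return (now_ns() - t0) / (double)loops;
}

/* Average cost in ns of one atomic compare-exchange.  Each iteration does
 * two of them: 0 -> 1 to acquire, then 1 -> 0 to release. */
static double time_locked_cmpxchg(long loops)
{
    atomic_int lock = 0;
    double t0 = now_ns();
    for (long i = 0; i < loops; i++) {
        int held = 1;
        try_lock_cmpxchg(&lock);
        atomic_compare_exchange_strong(&lock, &held, 0);
    }
    return (now_ns() - t0) / (2.0 * (double)loops);
}
```

    Calling time_cmp_add() and time_locked_cmpxchg() with a large loop
    count (tens of millions) from a small main() and printing the results
    gives per-operation figures.  These include loop overhead and depend
    on compiler flags, so treat them as relative rather than absolute.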

    In any case, this really solidifies my desire to avoid locked bus
    cycle instructions.

						Matthew Dillon

