On Tue, Apr 29, 2003 at 10:39:46PM +0200, Andi Kleen wrote:
> > Don't get me wrong, we would certainly drop any notions of this if we
> > found that it was slower and I will be glad to post any results. The
> > goal is to take advantage of the hardware to make things faster.
> You have no hardware to make the remote TLB flushes fast ;)
> I'm sure you can show it being an advantage with a single threaded process.
> But when you run it on a multithreaded application just with two threads
> it may look very different.
Last time I checked, the IA64 processor provides a ptc.g instruction for
exactly this. The only hit we take from using it is that Intel limits it
to a single outstanding ptc.g machine-wide, which is enforced with a
global spinlock. I would love to convince Intel to change this instruction,
but that probably will not happen any time soon.
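
To make the serialization concrete, here is a rough user-space sketch of
the idea, not the actual kernel code. The names global_purge_lock,
issue_global_purge and global_tlb_purge are made up for illustration, and
the purge itself is a stub that only counts calls where the real thing
would execute ptc.g.

```c
#include <stdatomic.h>

/* Hypothetical machine-wide lock: only one global purge may be in flight. */
static atomic_flag global_purge_lock = ATOMIC_FLAG_INIT;

/* Stand-in for the hardware side effect: counts broadcast purges issued. */
static unsigned long purges_issued;

/* Placeholder for the broadcast purge instruction (ptc.g on IA64). */
static void issue_global_purge(unsigned long addr, unsigned int log2_size)
{
    (void)addr;
    (void)log2_size;
    purges_issued++;
}

/*
 * Purge the range [start, end) one page at a time, where a page is
 * 2^page_shift bytes. The whole loop runs under the global lock, so
 * every other CPU wanting a purge spins until this one is finished.
 * Returns the number of purges issued.
 */
static unsigned long global_tlb_purge(unsigned long start, unsigned long end,
                                      unsigned int page_shift)
{
    unsigned long step = 1UL << page_shift;
    unsigned long count = 0;

    /* Spin until we own the lock. */
    while (atomic_flag_test_and_set_explicit(&global_purge_lock,
                                             memory_order_acquire))
        ;

    for (unsigned long addr = start; addr < end; addr += step) {
        issue_global_purge(addr, page_shift);
        count++;
    }

    atomic_flag_clear_explicit(&global_purge_lock, memory_order_release);
    return count;
}
```

Calling global_tlb_purge(0x10000, 0x14000, 12) would issue four purges for
four 4K pages. The point of the sketch is only that the cost of the lock is
paid by every CPU in the machine, not just the ones sharing the mm.
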
I will concede that the ptc.g instruction takes a considerable amount of
time on our 64-processor machines, but 64 processors come out to a lot of
local TLB coherence domains that need to be updated.
I believe there is a similar instruction for x86. Could someone verify