I took the liberty of asking one of the IA64 gurus about the indirect calls.
This is what he had to say (reposted with his permission if not my complete
McKinley-type cores (includes Madison, etc.)
do not have indirect branch target hardware. Instead, indirect
branches are executed as follows:
At the time an indirect branch is fetched, the frontend reads the
contents of the branch register that contains the branch target. The
contents of that register are then used as the predicted target.
For example, "br.call.sptk.many rp=b6" would read register "b6" at the
time the "br.call" is fetched by the frontend and then the contents of
"b6" are used as the predicted target.
This has the following implications:
(1) To _guarantee_ correct prediction, the branch register has to be
loaded way before the indirect branch instruction (at least 6
front-end L1I cache accesses; which is up to 6 bundle-pairs or 36
instructions, I believe).
(2) If (1) isn't possible (it often isn't, in small functions),
another possibility is to test whether the branch targets one of a
few common targets and, if so, invoke those targets via direct
branches. This is generally done automatically by compilers (at
least if there is PBO info or a programmer-provided hint
available), but sadly GCC doesn't do this at the moment.
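
Done by hand, option (2) might look roughly like the C sketch below. The
handler names (tcp_rcv, udp_rcv, raw_rcv), the handler_t type and the
dispatch() helper are all made up for illustration; they are not taken
from any real code. The idea is only that the common targets are compared
against explicitly and reached through direct calls, while everything else
still goes through the indirect call.

```c
/* Hypothetical handler type and handlers, for illustration only. */
typedef int (*handler_t)(int len);

int tcp_rcv(int len) { return len + 1; }
int udp_rcv(int len) { return len + 2; }
int raw_rcv(int len) { return len + 3; }

/*
 * Call h(len), but compare h against the targets assumed to be common
 * first, so that those cases are written as direct calls.
 */
int dispatch(handler_t h, int len)
{
    if (h == tcp_rcv)
        return tcp_rcv(len);    /* direct call, assumed most common */
    if (h == udp_rcv)
        return udp_rcv(len);    /* direct call, assumed second most common */
    return h(len);              /* uncommon target: indirect call */
}
```

Whether this is a win depends on how often the guesses are right: each
compare adds a little work to the uncommon path, so it only makes sense
for a small number of targets that really do dominate.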
The good news is that since McKinley-type cores don't have
complicated branch-target predictors, the misprediction penalty is
_relatively_ small (10 cycles). The bad news is that the network path
is extremely sensitive to even such relatively small penalties, so it
does make a significant difference.
As mentioned earlier, we could fix some of the most egregious effects
with a "call_likely" macro which hints which target(s) are the most
netperf feedback always welcome...