Jim Houston wrote:
> The real problem is the non-maskable part of the non-
> maskable IPI. When there are multiple processors hitting
> breakpoints at the same time, you never know how much of
> the initial entry code the slave processor got to execute
> before it was hit by the NMI.
This is what I explicitly considered fixed by these changes. In kdb(),
line 1342 is the beginning of the code where kdb_initial_cpu is grabbed.
After this block, you either are the kdb_initial_cpu, or you entered kdb
because of the IPI. So the future slave processor could not have gotten
past this if () clause before it was hit by the NMI.
Looking back before this, there are very few lines of code that examine
global state, and none that modify global state. The few references to
KDB_STATE before line 1342 can, I believe, all be justified. Either the
code knows that it is kdb_initial_cpu, or it is DOING_SS, in which case
we cannot have received an IPI from KDB, or it is HOLD_CPU. HOLD_CPU is
used to generate "reentry", and I'm not sure why, but it seems harmless.
Can you suggest a code path through kdb() which could lead to harm for
a CPU which hits a breakpoint, fails to win the race for
kdb_initial_cpu, and gets an IPI?
> I have a couple of ideas in the works. First, I wonder about
> having the kdb_ipi() check if it has interrupted a
> breakpoint entry. If it has, it could just set a flag and
> return. I might do this with a stack trace back or by
> setting a flag early in the breakpoint handling (e.g. entry.S).
I don't see how this helps -- whoever won the race for kdb_initial_cpu
is expecting all the CPUs to gather up and enter kdb. I would expect
that everyone who hits a breakpoint should enter kdb.
> Ethan, I'm curious if you're using an NMI on the Sparc.
Sparc doesn't have an NMI, but the interrupt I use (an IPI) is rarely
blocked in the kernel. Certainly not blocked by local_irq_save() and