Ethan Solomita wrote:
>
> Jim Houston wrote:
> >
> > The real problem is the non-maskable part of the non-
> > maskable IPI. When there are multiple processors hitting
> > breakpoints at the same time, you never know how much of
> > the initial entry code the slave processor got to execute
> > before it was hit by the NMI.
> >
> This is exactly what I intended these changes to fix. In kdb(),
> line 1342 is the beginning of the code where kdb_initial_cpu is grabbed.
> After this block, you either are the kdb_initial_cpu, or you entered kdb
> because of the IPI. So the future slave processor could not have gotten
> past this if () clause before it was hit by the NMI.
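>
> For concreteness, the serialization I am relying on amounts to
> something like this (a minimal sketch only -- the lock name and the
> slave helper are placeholders, not the actual kdb source):
>
>     spin_lock(&kdb_lock);
>     if (kdb_initial_cpu == -1)
>         kdb_initial_cpu = smp_processor_id();
>     spin_unlock(&kdb_lock);
>
>     if (kdb_initial_cpu != smp_processor_id()) {
>         /*
>          * We lost the race (or are here because the winner
>          * sent the IPI): take the slave path and wait to be
>          * released by the initial cpu.
>          */
>         kdb_slave_entry();
>     }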
>
> Looking back before this, there are very few lines of code that
> examine global state, and none that modify global state. The few
> references to KDB_STATE before line 1342 can, I believe, all be
> justified. Either the code knows that it is kdb_initial_cpu, or it
> is DOING_SS, in which case we cannot have received an IPI from KDB,
> or it is HOLD_CPU. HOLD_CPU is used to generate "reentry", and I'm
> not sure why, but it seems harmless.
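>
> Schematically, the case analysis is (just an illustration of the
> argument; KDB_STATE here stands in for however the flags are
> actually tested):
>
>     if (smp_processor_id() == kdb_initial_cpu) {
>         /* we already own kdb; touching our own state is safe */
>     } else if (KDB_STATE(DOING_SS)) {
>         /* single stepping: kdb is not gathering cpus,
>          * so no kdb IPI can be in flight for us */
>     } else if (KDB_STATE(HOLD_CPU)) {
>         /* a held cpu re-entering kdb ("reentry") */
>     }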
>
> Can you suggest a code path through kdb() which could lead to harm
> for a CPU which hits a breakpoint, fails to win the race for
> kdb_initial_cpu, and gets an IPI?
>
> > I have a couple of ideas in the works. First, I wonder about
> > having the kdb_ipi() check if it has interrupted a
> > breakpoint entry. If it has, it could just set a flag and
> > return. I might do this with a stack traceback or by
> > setting a flag early in the breakpoint handling (e.g. entry.S).
>
> I don't see how this helps -- whoever won the race for kdb_initial_cpu
> is expecting all the CPUs to gather up and enter kdb. I would expect
> that everyone who hits a breakpoint should enter kdb.
>
> > Ethan, I'm curious if you're using an NMI on the Sparc.
> >
> Sparc doesn't have an NMI, but the interrupt I use (an IPI) is rarely
> blocked in the kernel. Certainly not blocked by local_irq_save() and
> family.
> -- Ethan
Hi Ethan,
I have been in hack mode, and I probably have some self-inflicted
problems. Your analysis seems correct, but I still had problems
with the combination of your patch + the version of kdba_bp.c that
I sent out on Friday. I did not mean to impugn your patch, and I
apologize if I have.
The initial enthusiasm wore off once I started putting breakpoints
at places like do_schedule or sys_open. More often than not, it hung.
I also ran into the panic from processing breakpoints that had been
removed; these are described in the comment before kdba_db_trap().
It still hung even when I used bd (disable) rather than bc (clear).
I went on to experiment with splitting kdb_state into separate
variables for per-cpu-private state vs. inter-cpu synchronization.
I was hoping to simplify the problem by eliminating the interactions
between most of the flags. In particular, I was worried about
interactions between processors leaving kdb and new arrivals.
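Roughly what I am trying (names are provisional, and this is
simplified to a sketch -- the real declarations would live in the
arch kdb code):

    /*
     * State only ever read and written by the owning cpu,
     * e.g. single-step bookkeeping (DOING_SS).
     */
    static int kdb_private_state[NR_CPUS];

    /*
     * State used to signal between cpus, e.g. HOLD_CPU;
     * written only by the initial cpu or under the kdb lock.
     */
    static volatile int kdb_sync_state[NR_CPUS];
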
Regarding the NMI racing with normal breakpoints - I want to
solve a larger problem. If I can avoid the extra layer of nesting,
I will solve the deleted breakpoint problem. It seems ugly to
switch to the other cpu, do a stack trace, and see part of kdb
rather than what that cpu was doing. I would also like to switch
to the other cpu and then single step. I also worry about what
happens if the NMI interrupts the spinlock which protects
kdb_initial_cpu.
I have some changes maybe 50% done. I'm using a flag set in
entry.S to detect that the NMI has interrupted the breakpoint
entry; a rough sketch of the C side is below. Hopefully I will
have something useful in another day or so.
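Here in_kdb_entry, kdb_ipi_pending and kdb_slave_entry are
provisional names for code that does not exist yet, not anything
in the current kdb source:

    void kdb_ipi(struct pt_regs *regs)
    {
        int cpu = smp_processor_id();

        if (in_kdb_entry[cpu]) {
            /*
             * We interrupted the breakpoint entry path (the
             * flag was set in entry.S before any global state
             * was touched).  Just note that the IPI arrived
             * and return; the interrupted entry code will
             * finish joining kdb on its own.
             */
            kdb_ipi_pending[cpu] = 1;
            return;
        }

        /* Normal case: enter kdb as a slave processor. */
        kdb_slave_entry(regs);
    }
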
Jim Houston - Concurrent Computer Corp.