Re: [lkcd-general] Converting lkcd from a pull to a push model

To: Keith Owens <kaos@xxxxxxx>
Subject: Re: [lkcd-general] Converting lkcd from a pull to a push model
From: "S Vamsikrishna" <vamsi_krishna@xxxxxxxxxx>
Date: Tue, 1 Apr 2003 13:09:15 +0530
Cc: lkcd-general@xxxxxxxxxxxxxxxxxxxxx, kdb@xxxxxxxxxxx
Sender: kdb-bounce@xxxxxxxxxxx
Hello Keith,

I am not familiar with IA64, so what I say may only apply to IA32.

As far as I can tell, in kdb v4.0, you call kdb_save_running() /
kdb_unsave_running() around kdb_main_loop() in kdba_main_loop(). We will
execute kdba_main_loop() on all processors only if the KDB IPI (NMI class)
to stop other processors works.

Current lkcd does something similar: it sends an NMI class IPI to capture
the register state of the other processors. I agree that timing out the IPI
is a bug (I thought I had removed it, but maybe I never submitted the patch
for inclusion), but this method will capture the state of all CPUs as long
as the IPI reaches all CPUs.

So, how is the push model any better, when the push itself relies on the
IPI being delivered and handled on all cpus? Did I miss something? Can you
please explain?


Vamsi Krishna S.
Linux Technology Center,
IBM Software Lab, Bangalore.
Ph: +91 80 5044959
Internet: vamsi_krishna@xxxxxxxxxx

Keith Owens <kaos@xxxxxxx>
Sent by: lkcd-general-admin@xxxxxxxxxxceforge.net
03/19/03 07:41 PM

To:       lkcd-general@xxxxxxxxxxxxxxxxxxxxx
cc:
Subject:  [lkcd-general] Converting lkcd from a pull to a push model

This is a heads up about an alternative (and hopefully much more reliable)
method of capturing per cpu data for debugging.  Obviously I would like
to feed this back to lkcd for inclusion eventually.

You may have seen my announcement on l-k[1] about kdb v4.0, where the
big change is converting from a "pull the other cpus into kdb" model to
a "push this cpu's state to a known place" model.  Although that
announcement was only for kdb, I will be changing lkcd in SGI kernels
to use a push instead of a pull mode.

[1] http://marc.theaimsgroup.com/?l=linux-kernel&m=104787987626450&w=2

The pull model works as long as inter processor interrupts (IPI) are
working and the other cpus are responding to interrupts.  It fails
badly when cpus are hung and are not responding to interrupts.  Of
course, this is precisely the case when you want debug data.  lkcd 2.4
hangs solid if any one cpu is not responding.  lkcd 2.5 will time out
the hung cpus and continue, with two nasty side effects :-

(1) You get no data for the hung cpus, which makes the crash dump
    almost useless.

(2) The timeout code can result in corrupt IPI data, especially when
    lkcd releases the other cpus at the end of the dump.  I have seen
    several oops at the end of dumping as handle_IPI() is allowed to
    proceed but its data no longer exists (it was deleted during the
    time out).
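The oops in (2) is a lifetime problem: the initiator gives up and discards
the request while a slow cpu can still run the handler against it.  One
common way to avoid that class of bug, sketched here as an assumption and
not as what lkcd 2.5 does, is to reference count the request so that it is
only released by whoever drops the last reference.  All names below
(ipi_request, ipi_request_put, ipi_handler) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical cross-cpu request; not lkcd's or the kernel's real structure. */
struct ipi_request {
    atomic_int refs;      /* one per target cpu plus one for the initiator */
    atomic_bool released; /* set when the last reference is dropped */
    int payload;          /* stands in for whatever the handler reads */
};

/* Initiator sets up the request before sending the IPI. */
void ipi_request_init(struct ipi_request *req, int ntargets, int payload)
{
    assert(ntargets >= 0);
    atomic_init(&req->refs, ntargets + 1);
    atomic_init(&req->released, false);
    req->payload = payload;
}

/* Drop one reference; only the last holder releases the data. */
void ipi_request_put(struct ipi_request *req)
{
    if (atomic_fetch_sub(&req->refs, 1) == 1)
        atomic_store(&req->released, true); /* real code would free here */
}

/* Handler on a target cpu; it may run long after the initiator timed out. */
int ipi_handler(struct ipi_request *req)
{
    int seen = req->payload; /* still valid: this cpu holds a reference */
    ipi_request_put(req);
    return seen;
}
```

With this shape a timeout on the initiator is just an early
ipi_request_put(); a cpu that wakes up later still finds intact data.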

kdb v4.1 (maybe v4.2) will attempt to get the attention of the other
cpus via a graduated set of probes.  The pseudo code is :-

if (safe_to_send_interrupts) {
    if (!all_cpus_in_kdb && arch_supports_nmi)
        send nmi to the other cpus;
    if (!all_cpus_in_kdb && arch_supports_pmi)
        send pmi to the other cpus;
}
if (!all_cpus_in_kdb)
    kdb_enter_debugger = 1;       // trip spin_locks and scheduler into kdb
if (!all_cpus_in_kdb && arch_supports_init &&
    (user_requested_destructive_ipi || destructive_kernel_error_detected))
    send init to the other cpus;
if (!all_cpus_in_kdb)
    kdb_printf("not all cpus entered the debugger");

pmi and init are ia64 specific interrupts, even harder to mask than
nmi.  A big problem with ia64 is that nmi is masked :( An even bigger
problem on ia64 is that MCA and INIT interrupts have mandated
requirements which can prevent the OS from getting full control.  A
pull model requires that the OS be in control on each cpu; a push model
lets each cpu save its state before the OS surrenders control.
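To make the push idea concrete, here is a minimal user-space sketch,
assuming one fixed slot per cpu that only that cpu writes.  It is not the
kdb v4.0 code: the names (push_save_running, push_unsave_running,
running_state, SKETCH_NR_CPUS) and the two-field register layout are
hypothetical, and a real SMP version would also need memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4 /* hypothetical cpu count for the sketch */

/* Hypothetical register snapshot; real kernels use an arch-specific layout. */
struct cpu_regs {
    unsigned long ip;
    unsigned long sp;
};

struct cpu_slot {
    bool valid;           /* true while this cpu has pushed its state */
    struct cpu_regs regs;
};

/* The "known place": one slot per cpu, written only by that cpu. */
static struct cpu_slot running_state[SKETCH_NR_CPUS];

/* Called by a cpu on itself when it enters the debug path. */
void push_save_running(int cpu, const struct cpu_regs *regs)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    running_state[cpu].regs = *regs;
    running_state[cpu].valid = true;
}

/* Called by the same cpu when it leaves the debug path. */
void push_unsave_running(int cpu)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    running_state[cpu].valid = false;
}

/* Dumper side: read what was pushed, without interrupting any cpu. */
const struct cpu_regs *saved_regs(int cpu)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    return running_state[cpu].valid ? &running_state[cpu].regs : NULL;
}

/* Number of cpus whose state is currently available to a dump tool. */
int count_saved_cpus(void)
{
    int n = 0;
    for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        if (running_state[cpu].valid)
            n++;
    return n;
}
```

The point of the layout is that the dumping cpu never has to ask another
cpu for anything; it only reads slots that were filled in earlier.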

The next part of the kdb patch is changing the ia64 spinlock model to
be faster in the uncontended path with a single bit of out of line code
to handle the contended path.  Debuggers like kdb can enhance the out
of line code to detect hung spin locks or kdb_enter_debugger == 1 and
enter the debugger, even when interrupts are disabled.

Obviously the data that kdb captures is generic debugging data and can
be used to get the state of each cpu for any debugging tool, not just
kdb.  Once I finish the spinlock and init slave handlers on ia64 and
implement the above pseudo code, I will hook lkcd into the data that
kdb has already captured.  This will go a long way to improving the
reliability of both kdb and lkcd, to the extent that lkcd should be
usable from anywhere, including panic, oops, interrupt context, nmi,
pmi, init etc.

Have a look at the kdb patches in [1] above.  Think about what
additional data lkcd needs, if any.  kdb v4.0 already captures all
application registers at the time of error from most cpus, v4.[12] will
reduce the window of lost cpus to almost zero.
