
Crash Dump : Non-disruptive dumps vs standalone dumps

To: "Matt D. Robinson" <yakker@xxxxxxxxxxxxxx>, lkcd@xxxxxxxxxxx
Subject: Crash Dump : Non-disruptive dumps vs standalone dumps
From: bsuparna@xxxxxxxxxx
Date: Fri, 13 Jul 2001 18:50:01 +0530
Sender: owner-lkcd@xxxxxxxxxxx
Hello,

Here are a few thoughts on some of the crash dump requirements that we've
been looking into.

In a broad sense, these requirements appear to address two different
aspects of dump which come into play in very different situations. This
makes it possible to try to tackle each of these independently, which
simplifies our job a little.

(a) One of these applies to panic-type dumps, which can occur when the
system is in a damaged state and a reboot is necessary (where it may not
even be safe to continue running/using parts of the OS, due to the risk of
corruption or further damage). Addressing the extreme end of the spectrum
here would involve some kind of a "standalone dump capability". (I'll send
out a separate note discussing some prevailing design options/possibilities
for achieving this.) In such a situation, we don't need to worry about
disruption to the system, as loss of information or in-progress work (e.g.
through forced resets of devices) is acceptable, because the system cannot
continue as is anyhow.

(b) The other, which the rest of this note is dedicated to, applies to
situations where a dump needs to be taken (preferably as an accurate
snapshot of the state at an event or point of execution), but where the
basic OS is expected to continue running after the dump is taken, and where
it is desirable that it indeed do so.  This is what we refer to as
"Non-disruptive dump support". A key assumption we make is that, in this
situation, it is all right, in principle, to depend on the basic OS
infrastructure, in order to take the dump.

Of course, these are the two extremes in terms of how much we can rely on
the OS, and it is perhaps the shades of grey in between that occur in
reality; but if we can address these two ends, we will have covered a lot
of ground.
If we could achieve both in one shot, that would be great, but to start
with, even solving these independently in a clean manner would be good
progress.

Does this sound reasonable?


Now, a little about (b):

Non-disruptive dumps:
---------------------

In an ideal world, this would mean being able to take an *accurate*
snapshot dump of the system (probably selective sections of it, along the
lines of OS/2 process dump, i.e. flexible dump) *without disrupting* the
operation of the system - i.e. have the system continue normally after the
dump is taken.

This is something we expect to help with serviceability of live/remote
customer systems - the dump could be sent over for analysis of problems
that are non-fatal in terms of system availability, but that require
visibility into kernel data structures and system state to resolve. For
example, rather than have a person on-site running a kernel debugger to
examine the system, one could use dprobes to gather data about the history
of a situation that is recreatable only at the customer site, then trigger
a dump and let the system continue to run. In this situation the
malfunction is not crippling and does not affect the integrity of the
system.

From a requirements perspective, we do need to clearly establish why we
need this to be an accurate snapshot, rather than what the livedump
capability in lcrash (i.e. the ability to generate a crash dump from the
running kernel memory core) already gives us. The example above is
indicative, but there may be a tradeoff between accuracy and
non-disruption; it would be good to have some inputs to help understand
where to position ourselves there.

To appreciate this tradeoff, consider that at the instant the dump is
triggered, the system may be in a state where the i/o layer, the driver,
or the device where the dump is to be stored is not prepared to
immediately accept dump commands (e.g. there could be some i/os in flight
or DMAs in progress; even if we have a dedicated dump device, the bus may
still be in an intermediate state; besides, locks might have just been
taken, interrupt servicing may be in progress, etc. - think of the
problems crash dump has been having, and of what could happen if we wanted
network dumps). A certain amount of quiescing (I use this term for want of
a better word) may be required to get to a state where it is safe to dump
(even if we attempt to switch to a software path that is independent of
the current OS state, we may have the h/w state to think of). However,
during this quiescing, the system state could - or rather would - change,
affecting accuracy. And then, of course, if we go via the block i/o path,
system state is changing even as we are dumping. If we could predict
exactly which parts of memory would change, we could perhaps save those
off in a reserved area, but that could get messy or too tied to the
implementation.

Another point here is that if we do freeze the system while the dump is
going on, we also need to understand the consequences of such a freeze when
we want to resume normalcy (there would be some amount of disruption -
perhaps some thinking along the lines of power management suspension
handling could give us some clues).

Actually, if the added effort/complexity of effecting a memory snapshot
mechanism seems worthwhile, then we could design a way to achieve both of
these (well, almost) with some extra complexity and some extra memory
space. Such a scheme (i.e. the interesting possibility of implementing a
snapshot memory feature through page/segment protections and copy-on-write
mechanisms - think of the way snapshotting for block
devices/lvm/filesystems happens) may even turn out to be useful outside of
just crash dump. But at the same time, I realise that it may be a little
intrusive and an added complexity in the dump path. It also requires extra
memory to save modified state - though the smaller the drift during a
quiesce, the less of this we would need.
However, one question at the moment is whether it is worth it from a
user's point of view, and where the priority for this lies.

A practical compromise in the form of a partial quiesce, which essentially
involves taking locks to ensure consistent dump data (e.g. for the subset
of state being dumped, as with flexible dump options), is another
possibility, in addition to providing an option to run on as is, even if
data is in an inconsistent state at the point of dump. This also means
that when we (eventually) have customized snapshots involving a small
portion of memory, the amount of drift would be reduced (e.g. if we aren't
dumping memory that changes during the dump i/o path). However, this has
to be worked out further.

In any case, the steps may be something like this:
     1. Send an IPI to the other CPUs to get them to capture their
register state (i.e. save it in a memory area).
     2. Quiesce (to the minimal extent necessary for the following steps
to work without disrupting the system).
     3. Set things up for i/o to the dump device to work (e.g. change IRQ
affinity settings, make the device ready, etc.). This may involve waiting
till the setup is done. Also, before changing anything on the system, save
the associated state so we can restore the original settings after we are
done.
     4. Perform the actual dumping. Wait for completion.
     5. Restore the settings changed just for the dump (i.e. get the
system ready to continue normal operation).
     6. Release the system (i.e. let it continue).

Does this capture the overall process correctly?

Each of these steps may involve some alternatives, which may differ
depending on the degree/strictness of accuracy required and the resources
available.
I'd like to discuss them in some detail in separate notes (just to have
multiple threads of discussion going and to keep the overall view separate
from the details of how we work out the elements). Some of the recent work
that's been happening, in terms of blocking scheduling and stopping other
processors, would fit in under these points.


Regards
Suparna



  Suparna Bhattacharya
  IBM Software Lab, India
  E-mail : bsuparna@xxxxxxxxxx
  Phone : 91-80-5267117, Extn : 2525



