Hello Suparna,
I think the idea of capturing crash dumps on a live system seems
quite attractive to end-users (e.g. customers), since it implies
no system downtime or unavailability. From my limited
exposure/experience at NCR supporting the Teradata DBMS product
on SVR4 MPRAS systems, developers were from time to time asked
to dial in to the customer system (typically an MPP system) to
diagnose performance problems or system slowdowns/hangs. We
often explicitly asked the customer or site personnel to force
the particular node in question, out of the X nodes in the MPP
system, to panic so that we could capture a crash dump for
post-mortem analysis. Each node is typically configured with
4/2/1 GB of memory. SVR4 MPRAS also has a utility to take a
memory dump of a live system, but we could not use it at all
because:
1. The utility basically writes the entire memory to
a flat file, and often there is no system disk space
that can hold 4/2/1 GB of data.
2. The utility takes a long time to finish (several hours)
because it is a user-level application.
3. Lastly, I don't know whether the utility quiesces the
system or not, but I was told that it may not take a clean
snapshot of memory, so there is a small chance that the
live crash dump file will not contain the information one
is after.
Please note that this email is not meant to discourage any of
your proposals below, but to share my limited experience. Aside
from this, I am eager to see the basic functionality of lkcd
integrated into the main Linux source tree or the standard Linux
distributions as part of a RAS (Reliability, Availability,
Supportability) feature set for Linux soon, since being able to
reliably save the panic dump and identify/debug software
problems (i.e. post-mortem analysis) has been recognized by
management as one of the key RAS features for supporting the
Teradata RDBMS product at NCR.
Thanks,
Moo Kim Moo.Kim@xxxxxxx
NCR Corporation
On Fri, Jul 13, 2001 at 06:50:01PM +0530, bsuparna@xxxxxxxxxx wrote:
>
> Hello,
>
> Here are a few thoughts on some of the crash dump requirements that we've
> been looking into.
>
> In a broad sense, these requirements appear to address two different
> aspects of dump which come into play in very different situations. This
> makes it possible to try to tackle each of these independently, which
> simplifies our job a little.
>
> (a) One of these applies to panic type dumps, which can occur when the
> system is in a damaged state and a reboot is necessary (where it may not
> even be safe to continue running/using parts of the OS, due to the risk of
> corruption or further damage). Addressing the extreme end of the spectrum
> here would involve some kind of a "standalone dump capability". (I'll send
> out a separate note discussing some prevailing design options/possibilities
> for achieving this.) In such a situation, we don't need to bother about
> disruption to the system, as loss of information or of in-progress work
> (e.g. through forced resets of devices) is acceptable, because the system
> cannot continue as is anyhow.
>
> (b) The other, which the rest of this note is dedicated to, applies to
> situations where a dump needs to be taken (preferably as an accurate
> snapshot of the state at an event or point of execution), but where the
> basic OS is expected to continue running after the dump is taken, and where
> it is desirable that it indeed do so. This is what we refer to as
> "Non-disruptive dump support". A key assumption we make is that, in this
> situation, it is all right, in principle, to depend on the basic OS
> infrastructure, in order to take the dump.
>
> Of course these are two extreme possibilities in terms of how much we can
> rely on the OS, and it is perhaps the shades of grey in between that occur
> in reality, but if we can address these two ends we would have a lot of
> ground covered.
> If we could achieve both in one shot, it would be great, but maybe to start
> with even solving these independently in a nice manner would be good
> progress.
>
> Does this sound reasonable?
>
>
> Now, a little about (b):
>
> Non-disruptive dumps:
> ---------------------------------
>
> In an ideal world, this would mean being able to take an *accurate*
> snapshot dump of the system (probably selective sections of it, along the
> lines of OS/2 process dump, i.e. flexible dump) *without disrupting* the
> operation of the system - i.e. have the system continue normally after the
> dump is taken.
>
> This is something that we expect to help with serviceability of live/remote
> customer systems - the dump could be sent over for analysis of problems
> that are non-fatal in terms of system availability, but that require
> visibility into kernel data structures and system state to resolve. For
> example, rather than have a person on-site running a kernel debugger to
> examine the system, one could use dprobes to gather data about the history
> of a certain situation that is recreatable only on the customer site, then
> trigger a dump, and let the system continue to run. In this situation the
> malfunctioning is not crippling and does not affect the integrity of the
> system.
>
> From a requirements perspective, we do need to clearly establish why we
> need this to be an accurate snapshot, rather than what the livedump
> capability in lcrash (i.e. the ability to generate a crash dump from the
> running kernel memory core) already gives us. The example above is
> indicative, but there may be a tradeoff between accuracy and
> non-disruption; it would be good to have some inputs to help understand
> where to position ourselves there.
>
> To appreciate this tradeoff, consider the fact that at the instant when the
> dump is triggered, the system may be in a state where the i/o
> layer/driver/the device where the dump is to be stored is not prepared to
> immediately accept dump commands (e.g. there could be some i/os in flight
> or DMAs in progress; even if we have a dedicated dump device, it may still
> be possible for the bus to be in an intermediate state; besides, locks
> might have just been held, interrupt servicing may be in progress, etc. -
> think of the problems crash dump has been having - and think of what could
> happen if we wanted network dumps). A certain amount of quiescing (I use
> this term here for want of a better word) may be required to get to a state
> where it is safe to dump (even if we attempt to switch to a software path
> that is independent of the current OS state, we may have the h/w state to
> think of). However, during this quiescing, the system state could, or
> rather would, change, affecting accuracy. And then, of course, during the
> process of dumping, if we go via the block i/o path, system state is
> changing as we are dumping. If we could predict exactly which parts of
> memory would change, we could perhaps save that off in a reserved area, but
> that could get kind of messy or too tied to the implementation.
>
> Another point here is that if we do freeze the system while the dump is
> going on, we also need to understand the consequences of such a freeze when
> we want to resume normalcy (there would be some amount of disruption -
> perhaps some thinking along the lines of power management suspension
> handling could give us some clues).
>
> Actually, if the added effort/complexity for effecting a memory snapshot
> mechanism seems worthwhile, then we could design a way to achieve both of
> these (well, almost) with some extra complexity and some extra memory
> space. Such a scheme - i.e. the interesting possibility of implementing a
> snapshot memory feature through page/segment protections and copy-on-write
> mechanisms (think of the way snapshotting for block devices/lvm/filesystems
> happens) - may even turn out to be useful outside of just crash dump. But
> at the same time I realise that it may be a little intrusive and an added
> complexity in the dump path. It also requires extra memory to save modified
> state - though the less the drift during a quiesce, the less of this would
> be needed.
> However, one question at the moment is whether it is worth it from a user's
> point of view, and where the priority for this lies.
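>
> (To make the page-protection idea concrete, here is a minimal user-space
> sketch of the copy-on-write snapshot technique. This is illustrative only
> - not kernel code and not an existing lkcd interface; it write-protects a
> region and preserves the pre-snapshot contents of each page on the first
> write to it:)
>
>     /* cow_snap.c - toy copy-on-write snapshot via mprotect + SIGSEGV */
>     #define _GNU_SOURCE
>     #include <signal.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <sys/mman.h>
>     #include <unistd.h>
>
>     #define NPAGES 16
>
>     static long pagesz;
>     static char *region;       /* the memory being snapshotted */
>     static char *shadow;       /* pre-snapshot copies of dirtied pages */
>     static int saved[NPAGES];  /* 1 if the page was copied to shadow */
>
>     /* The first write to a protected page faults in here: preserve the
>      * old contents, then unprotect so the writer continues undisturbed. */
>     static void cow_fault(int sig, siginfo_t *si, void *ctx)
>     {
>         char *page = (char *)((unsigned long)si->si_addr & ~(pagesz - 1));
>         long i = (page - region) / pagesz;
>
>         memcpy(shadow + i * pagesz, page, pagesz);
>         saved[i] = 1;
>         mprotect(page, pagesz, PROT_READ | PROT_WRITE);
>     }
>
>     int main(void)
>     {
>         struct sigaction sa;
>
>         pagesz = sysconf(_SC_PAGESIZE);
>         region = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
>                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         shadow = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
>                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
>         memset(&sa, 0, sizeof(sa));
>         sa.sa_sigaction = cow_fault;
>         sa.sa_flags = SA_SIGINFO;
>         sigemptyset(&sa.sa_mask);
>         sigaction(SIGSEGV, &sa, NULL);
>
>         strcpy(region, "state at snapshot time");
>
>         /* "Trigger the dump": logically freeze the region's contents */
>         mprotect(region, NPAGES * pagesz, PROT_READ);
>
>         /* The system keeps running and modifying memory meanwhile */
>         strcpy(region, "state after the snapshot");
>
>         /* The dumper reads dirtied pages from shadow, the rest from
>          * region - so it still sees the instant of the snapshot */
>         printf("dump sees: \"%s\"\n", saved[0] ? shadow : region);
>         return 0;
>     }
>
> (Error handling is omitted for brevity; in the kernel the same effect
> would come from page table protections and the fault path rather than
> from signals.)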
>
> A practical compromise in the form of a partial quiesce, which essentially
> involves taking locks to ensure consistent dump data (e.g. for the subset
> of state being dumped, as with flexible dump options), is another
> possibility, in addition to providing an option to run on as is, even if
> data is in an inconsistent state at the point of dump. This also means that
> when we (eventually) have customized snapshots involving a small portion of
> memory, the amount of drift would be reduced (e.g. if we aren't dumping
> memory that changes during the dump i/o path). However, this has to be
> worked out further. A small sketch of the locking idea follows.
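>
> (As an illustration only - assuming the subset being dumped is the process
> table, using the 2.4 tasklist_lock / for_each_task as the guard and
> traversal, and with dump_write_buffer() being a purely hypothetical helper
> - the pattern is: copy the guarded state under its lock, release the lock,
> then write the private copy out at leisure:)
>
>     #include <linux/sched.h>
>     #include <linux/string.h>
>
>     struct task_summary {
>             int  pid;
>             char comm[16];
>     };
>
>     /* Partial quiesce: tasklist_lock is held only while copying; the
>      * dump i/o itself operates on the private, consistent buffer. */
>     static int dump_task_table(struct task_summary *buf, int max)
>     {
>             struct task_struct *p;
>             int n = 0;
>
>             read_lock(&tasklist_lock);
>             for_each_task(p) {
>                     if (n >= max)
>                             break;
>                     buf[n].pid = p->pid;
>                     memcpy(buf[n].comm, p->comm, sizeof(buf[n].comm));
>                     n++;
>             }
>             read_unlock(&tasklist_lock);
>
>             return dump_write_buffer(buf, n * sizeof(*buf)); /* hypothetical */
>     }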
>
> In any case, the steps may be something like this (a rough code outline
> follows the list):
> 1. Send an IPI to the other CPUs to get them to capture their register
> state (i.e. save it in a memory area).
> 2. Quiesce (to the minimal extent necessary for the following steps to
> work without disrupting the system).
> 3. Set things up for i/o to the dump device to work (e.g. change IRQ
> affinity settings, make the device ready, etc.). This may involve waiting
> until the setup is done. Also, before changing anything on the system, save
> the associated state so we can restore the original settings after we
> are done.
> 4. Perform the actual dumping. Wait for completion.
> 5. Restore the settings changed just for the dump (i.e. get the system
> ready to continue normal operation).
> 6. Release the system (i.e. let it continue).
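>
> (Purely as an outline - every helper named below is hypothetical and
> simply stands for the corresponding step in the list; none of them is an
> existing kernel or lkcd interface:)
>
>     /* Hypothetical skeleton of a non-disruptive dump sequence */
>     static int nondisruptive_dump(struct dump_config *cfg)
>     {
>             int err;
>
>             capture_remote_cpu_state();     /* step 1: IPI + save regs */
>             quiesce_system(cfg);            /* step 2: minimal quiesce */
>
>             err = prepare_dump_device(cfg); /* step 3: save, then change
>                                                IRQ/device settings */
>             if (err)
>                     goto out;
>
>             err = write_dump(cfg);          /* step 4: dump and wait
>                                                for completion */
>
>             restore_dump_device(cfg);       /* step 5: undo step 3 */
>     out:
>             unquiesce_system();             /* step 6: release system */
>             return err;
>     }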
>
> Does this capture the overall process correctly?
>
> Each of these steps may involve some alternatives, which may differ
> depending on the degree/strictness of accuracy required and the resources
> available.
> I'd like to discuss them in some detail in separate notes (just to have
> multiple threads of discussion going and to keep the overall view separate
> from the details of how we work out the elements). Some of the recent work
> that's been happening, in terms of blocking scheduling and stopping other
> processors, would fit in under these points.
>
>
> Regards
> Suparna
>
>
>
> Suparna Bhattacharya
> IBM Software Lab, India
> E-mail : bsuparna@xxxxxxxxxx
> Phone : 91-80-5267117, Extn : 2525
>
>