
Re: Crash Dump : Non-disruptive dumps vs standalone dumps

To: bsuparna@xxxxxxxxxx
Subject: Re: Crash Dump : Non-disruptive dumps vs standalone dumps
From: Moo Kim <Moo.Kim@xxxxxxx>
Date: Fri, 13 Jul 2001 14:36:27 -0700
Cc: "Matt D. Robinson" <yakker@xxxxxxxxxxxxxx>, lkcd@xxxxxxxxxxx
In-reply-to: <CA256A88.00501923.00@xxxxxxxxxxxxxxxxxxx>; from bsuparna@xxxxxxxxxx on Fri, Jul 13, 2001 at 06:50:01PM +0530
References: <CA256A88.00501923.00@xxxxxxxxxxxxxxxxxxx>
Sender: owner-lkcd@xxxxxxxxxxx
User-agent: Mutt/1.2.5i
 Hello Suparna,

 I think the idea of capturing crash dumps on a live system seems
 quite attractive to end-users (e.g. customers), since it implies
 no system downtime or unavailability.  From my limited
 exposure/experience at NCR supporting the Teradata DBMS product
 on SVR4 MPRAS systems, developers were from time to time asked
 to dial in to the customer system (typically an MPP system) to
 diagnose performance problems or system slowdowns/hangs.  We
 often explicitly requested that the customer or site personnel
 force the particular node in question, out of the X nodes in the
 MPP system, to panic so as to capture a crash dump for
 post-mortem analysis.  Each node is typically configured with
 4, 2, or 1 GB of memory.  SVR4 MPRAS also has a utility to take
 a memory dump on a live system, and we could not use it at all
 because:

 1. The utility basically writes the entire memory to
    a flat file, and often there is no system disk space
    that can hold 4, 2, or 1 GB of data.

 2. The utility takes a long time to finish (several hours)
    because it is a user-level application (see the sketch
    below).

 3. Lastly, I don't know whether the utility quiesces the
    system or not, but I was told that it may not take a
    clean snapshot of memory, so there is a small chance
    that the live crash dump file may not contain the
    information one is after.
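
 Just to illustrate points 1 and 2: a user-level dump utility of
 that kind is essentially the copy loop sketched below (the file
 names and chunk size are placeholders, not the actual MPRAS
 utility).  It is easy to see why pushing several GB of memory
 through read()/write() takes hours and needs RAM-sized free disk
 space, and why nothing stops the system from changing underneath.

     /* Naive user-level memory dump: read physical memory
      * through /dev/mem and append it to a flat file.
      * (Needs root; reads may fail over memory holes.) */
     #include <stdio.h>
     #include <fcntl.h>
     #include <unistd.h>

     int main(void)
     {
         char buf[65536];
         ssize_t n;
         int in  = open("/dev/mem", O_RDONLY);
         int out = open("/var/tmp/memdump",
                        O_WRONLY | O_CREAT | O_TRUNC, 0600);

         if (in < 0 || out < 0) {
             perror("open");
             return 1;
         }
         /* every byte crosses the user/kernel boundary twice */
         while ((n = read(in, buf, sizeof(buf))) > 0)
             write(out, buf, n);
         close(in);
         close(out);
         return 0;
     }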

 Please note that this email is not meant to discourage any of
 your proposals below, but to share my limited experience.  Aside
 from this, I am eager to see the basic functionality of lkcd
 being integrated into the main Linux source tree or standard
 Linux distributions soon, as part of a RAS (Reliability,
 Availability, Supportability) feature set for Linux, since being
 able to reliably save the panic dump and identify/debug any
 software problem (e.g. post-mortem analysis) has been recognized
 by management as one of the key RAS features for supporting the
 Teradata RDBMS product at NCR.

 Thanks,

 Moo Kim    Moo.Kim@xxxxxxx
 NCR Corporation

On Fri, Jul 13, 2001 at 06:50:01PM +0530, bsuparna@xxxxxxxxxx wrote:
> 
> Hello,
> 
> Here are a few thoughts on some of the crash dump requirements that we've
> been looking into.
> 
> In a broad sense, these requirements appear to address two different
> aspects of dump which come into play in very different situations. This
> makes it possible to try to tackle each of these independently, which
> simplifies our job a little.
> 
> (a) One of these applies to panic-type dumps, which can occur when the
> system is in a damaged state and a reboot is necessary (where it may not
> even be safe to continue running/using parts of the OS, due to the risk of
> corruption or further damage). Addressing the extreme end of the spectrum
> here would involve some kind of a "standalone dump capability".  (I'll
> send out a separate note discussing some prevailing design
> options/possibilities for achieving this.) In such a situation, we don't
> need to bother about disruption to the system, as loss of information or
> in-progress work (e.g. through forced resets of devices) is acceptable,
> because the system cannot continue as is anyhow.
> 
> (b) The other, which the rest of this note is dedicated to, applies to
> situations where a dump needs to be taken (preferably as an accurate
> snapshot of the state at an event or point of execution), but where the
> basic OS is expected to continue running after the dump is taken, and where
> it is desirable that it indeed do so.  This is what we refer to as
> "Non-disruptive dump support". A key assumption we make is that, in this
> situation, it is all right, in principle, to depend on the basic OS
> infrastructure, in order to take the dump.
> 
> Of course these are two extreme possibilities in terms of how much we can
> rely on the OS, and it is perhaps the shades of grey in between that occur
> in reality, but if we can address these two ends we would have a lot of
> ground covered.
> If we could achieve both in one shot, it would be great, but to start
> with, even solving these two independently in a clean manner would be
> good progress.
> 
> Does this sound reasonable?
> 
> 
> Now, a little about (b):
> 
> Non-disruptive dumps:
> ---------------------------------
> 
> In an ideal world, this would mean being able to take an *accurate*
> snapshot dump of the system (probably selective sections of it, along the
> lines of OS/2 process dump, i.e. flexible dump) *without disrupting* the
> operation of the system - i.e. have the system continue normally after the
> dump is taken.
> 
> This is something that we expect to help with the serviceability of
> live/remote customer systems - the dump could be sent over for analysis
> of problems that are non-fatal in terms of system availability, but that
> require visibility into kernel data structures and system state to
> resolve. For example, rather than have a person on-site running a kernel
> debugger to examine the system, use dprobes to gather data about the
> history of a certain situation that is recreatable only at the customer
> site, then trigger a dump, and let the system continue to run.  In this
> situation the malfunction is not crippling and does not affect the
> integrity of the system.
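> 
> As a rough sketch of what the trigger side of this might look like
> (the probe registration is elided, and dump_execute() is just a
> stand-in for whatever entry point the dump driver exports - these
> are not real dprobes or lkcd interfaces):
> 
>      /* Hypothetical probe handler: once the interesting state
>       * has been logged a few times, trigger a non-disruptive
>       * dump and let the system carry on - no panic, no reboot. */
>      static int hits;
> 
>      static void probe_handler(struct pt_regs *regs)
>      {
>              if (++hits == 5)
>                      dump_execute("dprobes trigger", regs);
>              /* execution resumes after the dump completes */
>      }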
> 
> From a requirements perspective, we do need to clearly establish why we
> need this to be an accurate snapshot, rather than what the livedump
> capability in lcrash (i.e. the ability to generate a crash dump from the
> running kernel memory core) already gives us. The example above is
> indicative, but there may be a tradeoff between accuracy and
> non-disruption; it would be good to have some inputs to help us
> understand where to position ourselves there.
> 
> To appreciate this tradeoff, consider the fact that at the instant the
> dump is triggered, the system may be in a state where the i/o layer, the
> driver, or the device where the dump is to be stored is not prepared to
> immediately accept dump commands (e.g. there could be some i/os in flight
> or DMAs in progress; even if we have a dedicated dump device, it may
> still be possible for the bus to be in an intermediate state; besides,
> locks might have just been taken, interrupt servicing may be in progress,
> etc. - think of the problems crash dump has been having, and think of
> what could happen if we wanted network dumps). A certain amount of
> quiescing (I use this term here for want of a better word) may be
> required to get to a state where it is safe to dump (even if we attempt
> to switch to a software path that is independent of the current OS state,
> we may have the h/w state to think of). However, during this quiescing,
> the system state could - or rather would - change, affecting accuracy.
> And then, of course, during the process of dumping, if we go via the
> block i/o path, system state is changing as we are dumping. If we could
> predict exactly which parts of memory would change, we could perhaps save
> those off in a reserved area, but that could get kind of messy or too
> tied to the implementation.
> 
> Another point here is that if we do freeze the system while the dump is
> going on, we also need to understand the consequences of such a freeze when
> we want to resume normalcy (there would be some amount of disruption -
> perhaps some thinking along the lines of power management suspension
> handling could give us some clues).
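> 
> (For the freeze itself, a rough sketch using the existing 2.4
> smp_call_function() interface might look like the following - each
> of the other CPUs disables interrupts and spins until the dumping
> CPU releases it; the resume-side consequences are exactly what we
> would need to think through:)
> 
>      static volatile int dump_release;
> 
>      /* runs on every CPU except the one taking the dump */
>      static void freeze_cpu(void *info)
>      {
>              __cli();                 /* interrupts off locally */
>              while (!dump_release)
>                      barrier();       /* spin until the dump is done */
>              __sti();
>      }
> 
>      /* on the dumping CPU: */
>      dump_release = 0;
>      smp_call_function(freeze_cpu, NULL, 0, 0);  /* don't wait */
>      /* ... take the dump ... */
>      dump_release = 1;                /* thaw the other CPUs */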
> 
> Actually, if the added effort/complexity of a memory snapshot mechanism
> seems worthwhile, then we could design a way to achieve both of these
> (well, almost), at the cost of some extra complexity and some extra
> memory space. It is possible that such a scheme (i.e. the interesting
> possibility of implementing a memory snapshot feature through
> page/segment protections and copy-on-write mechanisms - think of the way
> snapshotting for block devices/lvm/filesystems happens) may even turn out
> to be useful outside of just crash dump.  But at the same time I realise
> that it may be a little intrusive and an added complexity in the dump
> path. It also requires extra memory to save modified state - though the
> smaller the drift during a quiesce, the less memory this would take.
> However, one question at the moment is whether it is worth it from a
> user's point of view, and where the priority for this lies.
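> 
> To make the idea concrete, here is a minimal userspace sketch of
> the protection-plus-copy-on-write mechanism (a kernel version would
> hook the page-fault path rather than SIGSEGV, and calling mprotect()
> from a signal handler is only demo-grade, but the principle is the
> same):
> 
>      #include <signal.h>
>      #include <string.h>
>      #include <sys/mman.h>
>      #include <unistd.h>
> 
>      #define NPAGES 16
>      static char *region;    /* memory being snapshotted     */
>      static char *shadow;    /* reserved area for old copies */
>      static long pagesize;
> 
>      /* the first write to a protected page lands here */
>      static void cow_handler(int sig, siginfo_t *si, void *uc)
>      {
>              long off = (((char *)si->si_addr - region)
>                          / pagesize) * pagesize;
>              /* save the original contents, then allow the write */
>              memcpy(shadow + off, region + off, pagesize);
>              mprotect(region + off, pagesize,
>                       PROT_READ | PROT_WRITE);
>      }
> 
>      int main(void)
>      {
>              struct sigaction sa;
>              pagesize = sysconf(_SC_PAGESIZE);
>              region = mmap(0, NPAGES * pagesize,
>                            PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>              shadow = mmap(0, NPAGES * pagesize,
>                            PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
>              memset(&sa, 0, sizeof(sa));
>              sa.sa_sigaction = cow_handler;
>              sa.sa_flags = SA_SIGINFO;
>              sigaction(SIGSEGV, &sa, 0);
> 
>              /* snapshot begins: write-protect the region */
>              mprotect(region, NPAGES * pagesize, PROT_READ);
> 
>              region[0] = 'x';  /* faults; old page copied to shadow */
>              return 0;
>      }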
> 
> A practical compromise in the form of a partial quiesce, which
> essentially involves taking locks to ensure consistent dump data (e.g.
> for the subset of state being dumped, as with flexible dump options), is
> another possibility, in addition to providing an option to run on as is,
> even if the data is in an inconsistent state at the point of dump. This
> also means that when we (eventually) have customized snapshots involving
> a small portion of memory, the amount of drift would be reduced (e.g. if
> we aren't dumping memory that changes during the dump i/o path). However,
> this has to be worked out further.
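> 
> For instance, dumping just the task list under its lock might look
> roughly like this (dump_copy_block() is a made-up helper that copies
> into a reserved buffer, since we must not sleep for i/o while
> holding the lock; the rest is standard 2.4 kernel interface):
> 
>      struct task_struct *p;
> 
>      read_lock(&tasklist_lock);       /* consistent view of tasks */
>      for_each_task(p)
>              dump_copy_block(p, sizeof(*p));
>      read_unlock(&tasklist_lock);
>      /* the actual dump i/o happens after the lock is dropped */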
> 
> In any case, the steps may be something like this (a rough code skeleton
> follows the list):
>      1. Send an IPI to the other CPUs to get them to capture their
> register state (i.e. save it in a memory area).
>      2. Quiesce (to the minimal extent necessary for the following to
> work without disrupting the system).
>      3. Set things up for i/o to the dump device to work (e.g. change IRQ
> affinity settings, make the device ready, etc.). This may involve waiting
> till the setup is done. Also, before changing anything on the system,
> save the associated state so we can restore the original settings after
> we are done.
>      4. Perform the actual dumping. Wait for completion.
>      5. Restore the settings changed just for the dump (i.e. get the
> system ready to continue normal operation).
>      6. Release the system (i.e. let it continue).
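> 
> In rough code form, the sequence might be a skeleton like this
> (every dump_* function here is a placeholder for one of the steps
> above, not existing code; smp_call_function() is the existing 2.4
> interface):
> 
>      int dump_nondisruptive(void)
>      {
>              int err;
> 
>              /* steps 1+2: other CPUs save their registers and
>               * stay quiesced until released */
>              smp_call_function(dump_save_regs_and_spin, NULL, 0, 0);
> 
>              /* step 3: prepare the dump device, saving whatever
>               * settings we change (irq affinity, etc.) */
>              err = dump_device_setup();
>              if (!err) {
>                      /* step 4: write the dump, wait for completion */
>                      err = dump_write_all();
> 
>                      /* step 5: undo the step-3 changes */
>                      dump_device_restore();
>              }
> 
>              /* step 6: release the other CPUs */
>              dump_release_cpus();
>              return err;
>      }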
> 
> Does this capture the overall process correctly?
> 
> Each of these steps may involve some alternatives, which may differ
> depending on the degree/strictness of accuracy required and the resources
> available.
> I'd like to discuss them in some detail in separate notes (just to have
> multiple threads of discussion going and to keep the overall view
> separate from the details of how we work out the elements). Some of the
> recent work that's been happening, in terms of blocking scheduling and
> stopping other processors, would fit in under these points.
> 
> 
> Regards
> Suparna
> 
> 
> 
>   Suparna Bhattacharya
>   IBM Software Lab, India
>   E-mail : bsuparna@xxxxxxxxxx
>   Phone : 91-80-5267117, Extn : 2525
> 
> 
