
Re: Non-disruptive dumps - expanding on the steps

To: bsuparna@xxxxxxxxxx
Subject: Re: Non-disruptive dumps - expanding on the steps
From: "Matt D. Robinson" <yakker@xxxxxxxxxxxxxx>
Date: Wed, 18 Jul 2001 14:24:06 -0700
Cc: lkcd@xxxxxxxxxxx
Organization: Alacritech, Inc.
References: <CA256A8D.0052DE88.00@xxxxxxxxxxxxxxxxxxx>
Sender: owner-lkcd@xxxxxxxxxxx
bsuparna@xxxxxxxxxx wrote:
> 
> This note expands on the steps listed in the earlier posting on
> non-disruptive (vs standalone) dumps (with non-disruptive dumps, the
> system is expected to continue running after the dump is taken, but it is
> desirable to have an accurate, consistent snapshot of the system).
> 
> A slight digression on the point of accuracy vs consistency:
> 
> Perfect accuracy would imply capturing the snapshot of the system, as is,
> at the exact instant when the dump was triggered, even if it is in an
> intermediate/inconsistent state.

This is always going to be unlikely for SMP systems, but it's a nice
goal.  Interrupts on secondary processors at the time of the system
panic can cause all kinds of weird results when you try to analyze
the state after the crash.

> Consistency, on the other hand, would apply to related data/state within
> the snapshot being consistent with each other, so that correct
> interpretation of some of the data values is possible. If a dump is
> triggered when state modifications are in progress and have not reached a
> consistency point, then in order to get consistent data, one may wait for
> the consistency point to be reached, typically by acquiring related
> (read) locks before capturing the state. Obviously, the kind of state
> that needs to be consistent depends on the kind of related data that the
> interpretation mechanism, or dump marshalling logic, hinges on and hence
> requires consistency for.
> 
> There is a requirement level tradeoff between the two, so in general, the
> objective is to attain maximum accuracy (minimum drift) while maintaining
> the minimum consistency requirements.
> 
> Now onto the steps:
> 
> This is again a high-level view - more details on quiesce to follow as
> separate notes (in a drill-down philosophy):
> 
> 1. Send an IPI to other CPUs to get them to capture their register state
> (i.e. save it in a memory area).

Sure ... if you're going this route, also capture:

        - system mode (kernel, user, interrupt, other)
        - CPU number
        - any task running on the CPU
        - stack frame pointer (or exception frame) for task per CPU
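
To make that concrete, here's roughly the shape of the per-CPU save
area and handler I'd picture.  All of the names below are invented for
illustration; this isn't existing LKCD code:

    #include <linux/sched.h>
    #include <linux/interrupt.h>
    #include <asm/ptrace.h>

    #define DUMP_MODE_KERNEL  0
    #define DUMP_MODE_USER    1
    #define DUMP_MODE_INTR    2

    /* Hypothetical per-CPU save area, filled in by each processor
     * from the dump IPI (or NMI) handler. */
    struct dump_cpu_state {
            int                     cpu;    /* CPU number */
            int                     mode;   /* kernel/user/interrupt */
            struct pt_regs          regs;   /* registers at dump time */
            struct task_struct      *task;  /* task running on this CPU */
            unsigned long           frame;  /* stack/exception frame */
    };

    static struct dump_cpu_state dump_cpu_states[NR_CPUS];

    /* Called on each CPU from the dump IPI/NMI handler, with the
     * interrupted register state passed in. */
    static void dump_save_this_cpu(struct pt_regs *regs)
    {
            struct dump_cpu_state *s = &dump_cpu_states[smp_processor_id()];

            s->cpu   = smp_processor_id();
            s->regs  = *regs;
            s->task  = current;
            s->frame = (unsigned long)regs;
            s->mode  = user_mode(regs) ? DUMP_MODE_USER :
                       (in_interrupt() ? DUMP_MODE_INTR : DUMP_MODE_KERNEL);
    }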

>     Mechanism to initiate this:
>      - Using an NMI IPI would help with accuracy (though it still would
> take non-zero time for the other CPUs to receive the IPI). However, during
> the later steps, spinning the CPU within this NMI handler while the dump
> is in progress may not be appropriate. In such a case, we could also
> consider saving the stack contents, in case spinning is postponed to
> outside the NMI (which could affect consistency).
>      - With a non-NMI IPI, there is a lesser likelihood of the above
> problem, though there would be a little more drift as the other CPUs
> wouldn't service the IPI as long as interrupts are disabled on those CPUs.
> 
> BTW, the register state for the dumping CPU should also be saved in memory
> likewise.
> 
> 2. Quiesce (to the minimal extent necessary for the following to work
> without disrupting the system).

If you're talking about non-disruptive dumps here, I agree.  Otherwise,
I'm not so certain.  I don't believe you should wait for anything.  Stable
system state when a panic() or die_if_kernel() is activated is completely
ambiguous.  Therefore, you should start acting right away, regardless of
system state.

>      Here "quiesce" means waiting till the system reaches a (stable) state
> where the dump can be initiated. There are 3 aspects:
>      i.   h/w state quiesce for i/o to work (transient states, in-flight
> DMAs or i/os)
>      ii.  s/w state quiesce for the dump i/o s/w path to work (relevant
> locks / data structure consistencies, if we use the standard i/o path
> rather than a separate raw dump interface)
>      iii. s/w state quiesce for dump data consistency
> Because a dump can be triggered at any point, for ii and iii,
> self-deadlocks with interrupted code on the same processor need to be
> avoided (self-deadlocks may sometimes appear in not-so-obvious ways, as in
> the problem with dumping from interrupt context). For this reason it may
> help to have the actual dump execute under a legal, preferably separate,
> thread of context, which gets activated on reaching a quiesce* on at least
> one CPU and waits for a complete quiesce prior to the dump. (In (1), the
> register state, stack etc. at the instant of dumping would have already
> been saved, so this is not lost during the quiesce, though other memory
> state drift may need to be dealt with.)

If you're using your own I/O driver for dumping, then it eliminates the
uncertainty associated with disruptive dumps.
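
A rough sketch of what that "separate thread of context" could look
like, for the sake of discussion (dump_trigger(), dump_kthread() and
the dump_* helpers below are all made-up names, not anything that
exists today):

    #include <asm/semaphore.h>

    /* Hypothetical helpers, implemented elsewhere. */
    extern void dump_wait_for_quiesce(void);
    extern void dump_execute(void);
    extern void dump_release_system(void);

    static DECLARE_MUTEX_LOCKED(dump_trigger_sem);

    /* Pre-created at boot (e.g. via kernel_thread()); sleeps until a
     * dump is triggered, then does the real work in a legal context. */
    static int dump_kthread(void *unused)
    {
            for (;;) {
                    down(&dump_trigger_sem);  /* wait for a trigger */

                    dump_wait_for_quiesce();  /* other CPUs check in */
                    dump_execute();           /* write the snapshot out */
                    dump_release_system();    /* let the system continue */
            }
            return 0;
    }

    /* Called from the trapping context.  The register and stack state
     * was already saved in step 1, so the I/O can safely be deferred
     * to the thread above. */
    void dump_trigger(void)
    {
            up(&dump_trigger_sem);
    }

Since the thread exists before any dump is needed, triggering a dump
never has to allocate or set anything up from the failing context.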

> [*Will treat quiesce point detection as a separate subject ]
> 
> Some of the things this involves:
>      - Stop fresh operations from happening (some kind of way to make the
> system and drivers hold off new work)
>      - Wait for relevant existing in-flight operations to drain out (some
> mechanism/support for getting notified about completion, or explicitly
> polling for it)

Turn off interrupts, and prevent schedule() from doing anything.  We do
the latter today, but not really the former, as we can't determine which
interrupts affect which I/O device.
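
In code terms, something as simple as this is what I mean, with
dump_in_progress being a hypothetical flag that a small patch to
schedule() would have to check:

    #include <asm/system.h>

    /* Checked by a (patched) schedule() so nothing else gets picked
     * to run while the dump is in progress.  Hypothetical. */
    volatile int dump_in_progress = 0;

    static unsigned long dump_saved_flags;

    void dump_local_quiesce(void)
    {
            local_irq_save(dump_saved_flags);  /* no more interrupts here */
            dump_in_progress = 1;              /* and no more scheduling */
    }

    void dump_local_unquiesce(void)
    {
            dump_in_progress = 0;
            local_irq_restore(dump_saved_flags);
    }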

> Now, if we go via standard device driver interfaces then the wait required
> for (i) would happen automatically. If the dump device has a separate
> queue, and we issue the i/o from a legal context as far as the system is
> concerned (avoiding self-deadlock type situations), then (ii) would also
> get taken care of.
> As far as (iii) is concerned, this is specific to the kind of data dumped,
> so we won't discuss that right away.
> 
> The tradeoff which we have to consider as we quiesce is the drift that
> occurs during this time:
> 
> We do have a segment/page protection based COW scheme in mind for being
> able to retain a snapshot at the point in time of dump, while system state
> changes, which we can detail in a separate note, but haven't yet figured
> out if the complexity (and additional resource requirements) for this is
> worth it.
> With that in place we should be able to do this with minimal disruption,
> but it's not without tradeoffs.
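
I'd like to see that note.  Just to check we're picturing the same
thing, I'm guessing the copy-before-write side of it looks roughly
like this (everything below, dump_cow_alloc() and friends included, is
invented for illustration):

    #include <linux/mm.h>
    #include <linux/string.h>
    #include <linux/errno.h>

    /* Hypothetical helpers, implemented elsewhere. */
    extern void *dump_cow_alloc(void);                   /* pre-reserved pool */
    extern void dump_cow_record(unsigned long, void *);  /* pfn -> saved copy */
    extern void dump_unprotect_page(unsigned long);      /* let writer go on */

    /* Pages of interest are write-protected when the dump starts.  The
     * first write fault against one of them copies the pristine page
     * into a reserved pool before the write proceeds, so the dumper
     * always sees the page as it was at dump time. */
    int dump_cow_fault(unsigned long pfn)
    {
            void *copy = dump_cow_alloc();

            if (!copy)
                    return -ENOMEM;  /* snapshot degrades for this page */

            memcpy(copy, __va(pfn << PAGE_SHIFT), PAGE_SIZE);
            dump_cow_record(pfn, copy);
            dump_unprotect_page(pfn);
            return 0;
    }
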
> 
> The other aspect is to hold off new operations until the dump is complete.
>      - Let other CPUs spin when they reach a quiesce point, so that they
> don't generate any more activity
>      - (Debatable/Optional) Interface with drivers to delay processing
> requests till the dump is done (if this is possible). In some cases, this
> may involve blocking the handling of interrupts / incoming packets, or
> avoiding feeding further (non-dump) i/os down after interrupt handling /
> at non-task time. This calls for caution because such delays may result
> in some disruption of the system (e.g. loss of packets on a network
> interface).
>      - Turn off scheduling/switches to other tasks (so that waits in the
> dump code path don't schedule anything else to run)

Again, something to keep in mind.
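
The spin itself is trivial, of course, using the same hypothetical
dump_in_progress flag mentioned earlier:

    /* What a non-dumping CPU does once it reaches its quiesce point:
     * sit tight until the dumping CPU clears the flag. */
    void dump_spin_here(void)
    {
            while (dump_in_progress)
                    barrier();  /* compiler barrier; keeps the loop honest */
    }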

Fundamentally, if your goal is to create a dump that a customer support
person (or anyone for that matter) can review, then normally it's okay
if a CPU continues operating, even for a few instructions, while the
system prepares to dump.  The reason being, most of the time the state of
the system that caused the panic() or die_if_kernel() in the first place
is not going to change drastically between the time that the crash dump
situation takes place, and the time when you actually start dumping.

There is only a _tiny_ set of cases where the amount of time would
be so critical as to require that type of granularity.  Most of the time,
the other CPUs will try to continue their execution.  The other CPUs
are either:

        - running in another part of the kernel (so they don't matter)
        - running in the same part of the kernel (but don't touch the
          same data structures, so they don't matter)
        - touching the same data structures, but the corruption has
          already taken place, so you can see who did what already
          without worrying about timing.

The only time that amount of time would matter is if the CPU that caused
the problem (not the one that is detecting the problem) leaves an
interrupt handler between the time that the crash took place and the
time you've frozen the system CPUs.  I think it's a corner case.

> 3. Set things up for i/o to the dump device to work. This may involve
> waiting till the setup is done. Also, before changing anything on the
> system, save the associated state so we can restore the original settings
> after we are done.
> 
>      This may not be required as a separate step in general if we are using
> existing i/o paths/interfaces and not disabling interrupt handling on other
> CPUs.  If we did know exactly which interrupts are involved, then we could
> temporarily block all others (provided that does not affect continuation of
> operation after the dump), but that requires special support from the
> driver, which may be hard.
>      With a raw dump interface, there may be a dump_prepare request issued
> on the interface.
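
If you go the raw interface route, the driver-side hooks don't need to
be much more than this (a sketch only, not an interface LKCD defines
today):

    /* Hypothetical operations a dump-capable driver could export.
     * dump_prepare() does any one-time setup (step 3), dump_write()
     * pushes blocks out synchronously/polled (step 4), and
     * dump_finish() undoes whatever dump_prepare() changed (step 5). */
    struct dump_dev_ops {
            int (*dump_prepare)(void *dev);
            int (*dump_write)(void *dev, unsigned long offset,
                              void *buf, unsigned long len);
            int (*dump_finish)(void *dev);
    };
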
> 
> 4. Perform actual dumping. Wait for completion.
> 
>      This is where the dump logic fits in - it picks up the dump snapshot
> pages and writes them out to the dump device, via either
>           - The standard i/o path
>                or
>           - Raw dump interface (if the dump device driver supports one)
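
The core of step 4 then doesn't need to be much more than a loop over
the snapshot pages, e.g. using the hypothetical dump_dev_ops above
(dump_page_wanted() is also made up, standing in for whatever
filtering policy is used):

    #include <linux/mm.h>

    extern int dump_page_wanted(unsigned long pfn);  /* hypothetical */

    static int dump_write_memory(struct dump_dev_ops *ops, void *dev)
    {
            unsigned long pfn, offset = 0;
            int err;

            for (pfn = 0; pfn < max_mapnr; pfn++) {
                    if (!dump_page_wanted(pfn))
                            continue;

                    err = ops->dump_write(dev, offset,
                                          __va(pfn << PAGE_SHIFT), PAGE_SIZE);
                    if (err)
                            return err;
                    offset += PAGE_SIZE;
            }
            return 0;
    }

A page that went through a copy-on-write step earlier would of course
be written from its saved copy rather than straight from __va().
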
> 
> 5. Restore the settings changed just for dump (i.e. get system ready to
> continue normal operation)
> 
>      This applies if we had any special device setup or interrupt
> disabling / irq changes in step 3 above.
> 
> 6. Release the system (i.e. let it continue)
> 
>      Unfreeze whatever was held up in step 2.

Everything else looks great. :)

>   Suparna Bhattacharya
>   IBM Software Lab, India
>   E-mail : bsuparna@xxxxxxxxxx
>   Phone : 91-80-5267117, Extn : 2525

--Matt
