This note expands on the steps listed in the earlier posting on
non-disruptive (vs standalone) dumps. With non-disruptive dumps, the
system is expected to continue running after the dump is taken, but it is
desirable to have an accurate, consistent snapshot of the system.
A slight digression on the point of accuracy vs consistency:
Perfect accuracy would imply capturing a snapshot of the system, as is,
at the exact instant when the dump was triggered, even if it is in an
intermediate/inconsistent state.
Consistency, on the other hand, applies to related data/state within the
snapshot being consistent with each other, so that correct interpretation
of some of the data values is possible. If a dump is triggered while state
modifications are in progress and have not reached a consistency point,
then in order to get consistent data one may wait for the consistency
point to be reached, typically by acquiring the related (read) locks before
capturing the state. Obviously, the kind of state that needs to be
consistent depends on the kind of related data that the interpretation
mechanism, or dump marshalling logic, hinges on and hence requires
consistency for.
There is a tradeoff between these two requirements, so in general the
objective is to attain maximum accuracy (minimum drift) while maintaining
the minimum consistency requirements.
Now onto the steps:
This is again a high-level view - more details on quiesce to follow as
separate notes (in a drill-down philosophy):
1. Send an IPI to the other CPUs to get them to capture their register state
(i.e. save it in a memory area).
Mechanism to initiate this:
    - Using an NMI IPI would help with accuracy (though it would still
take non-zero time for the other CPUs to receive the IPI). However, during
the later steps, spinning the CPU within this NMI handler while the dump is
in progress may not be appropriate. In such a case we could also consider
saving the stack contents, in case spinning is postponed to outside of the
NMI (which could affect consistency).
    - With a non-NMI IPI, there is less likelihood of the above
problem, though there would be a little more drift, as the other CPUs
wouldn't service the IPI as long as interrupts are disabled on those CPUs.
BTW, the register state for the dumping CPU should also likewise be saved
in memory.
2. Quiesce (to the minimal extent necessary for the following to work
without disrupting the system).
Here "quiesce" means waiting till the system reaches a (stable) state
where dump can be initiated. There are 3 aspects:
i. h/w state quiesce for i/o to work (transient states, in-flight
DMAs or I/Os)
    ii. s/w state quiesce for the dump i/o s/w path to work (relevant locks /
data structure consistencies, if we use the standard i/o path rather than a
separate raw dump interface)
    iii. s/w state quiesce for dump data consistency
Because a dump can be triggered at any point, for ii and iii,
self-deadlocks with interrupted code on the same processor need to be
avoided (self-deadlocks may sometimes appear in not-so-obvious ways, as in
the problem with dumping from interrupt context). For this reason it may
help to have the actual dump execute under a legal, preferably separate,
thread context, which gets activated on reaching a quiesce* on at least
one CPU and waits for a complete quiesce prior to the dump. (In (1), the
register state, stack etc. at the instant of dumping would already have
been saved, so this is not lost during the quiesce, though other memory
state drift may need to be dealt with.)
[*Will treat quiesce point detection as a separate subject]
Some of the things this involves:
- Stop fresh operations from happening (some way to make the
system and drivers hold off new work)
    - Wait for relevant existing in-flight operations to drain out (some
mechanism/support for getting notified about, or explicitly polling for,
completion)
Now, if we go via standard device driver interfaces, then the wait required
for (i) would happen automatically. If the dump device has a separate
queue, and we issue the i/o from a legal context as far as the system is
concerned (avoiding self-deadlock type situations), then (ii) would also
get taken care of.
As far as (iii) is concerned, this is specific to the kind of data dumped,
so I won't discuss that right away.
The tradeoff which we have to consider as we quiesce is the drift that
occurs during this time:
We do have a segment/page protection based COW scheme in mind for being
able to retain a snapshot at the point in time of dump, while system state
changes, which we can detail in a separate note, but haven't yet figured
out if the complexity (and additional resource requirements) for this is
worth it.
With that in place we should be able to make do with minimal disruption,
but it's not without tradeoffs.
The other aspect is to hold off new operations until the dump is complete.
- Let other CPUs spin when they reach a quiesce point, so that they
don't generate any more activity
    - (Debatable/Optional) Interface with drivers to delay processing
requests till the dump is done (if this is possible). In some cases, this
may involve blocking the handling of interrupts / incoming packets, or
avoiding feeding further (non-dump) i/os down post interrupt
handling/non-task time. This calls for caution because the effects of such
delays may result in some disruption of the system (e.g. loss of packets
on a network interface).
    - Turn off scheduling/switches to other tasks (so that waits in the
dump code path don't schedule anything else to run)
3. Set things up for i/o to the dump device to work. This may involve
waiting till the setup is done. Also, before changing anything on the
system, save the associated state so we can restore the original settings
after we are done.
This may not be required as a separate step in general if we are using
existing i/o paths/interfaces and not disabling interrupt handling on other
CPUs. If we knew exactly which interrupts are involved, then we could
temporarily block all others (provided that does not affect continuation
of operation after a dump), but that requires special support from the
driver, which may be hard.
With a raw dump interface, there may be a dump_prepare request issued on
the interface.
4. Perform actual dumping. Wait for completion.
This is where the dump logic fits in - it picks up the dump snapshot
pages and writes them out to the dump device, via either
- The standard i/o path
or
- Raw dump interface (if the dump device driver supports one)
5. Restore the settings changed just for the dump (i.e. get the system
ready to continue normal operation)
    This applies if we had any special device setup or interrupt
disabling / irq changes in 3 above.
6. Release the system (i.e. let it continue)
Unfreeze whatever was held up in step 2.
Suparna Bhattacharya
IBM Software Lab, India
E-mail : bsuparna@xxxxxxxxxx
Phone : 91-80-5267117, Extn : 2525