Hi Matt,
We have integrated our changes for non-disruptive dumps into the
dump_silence/dump_resume framework. We have a mostly working version,
but we still need to sort out some issues.
In our current approach, the dumping CPU sends a call-function-vector IPI
to the other CPUs to make them spin. When the dump is complete, the other
CPUs are released from the spin and continue. We found that spinning
with interrupts disabled does not always work, because disk interrupts are
sometimes delivered to the spinning CPUs and get lost, which causes the
dumping process to hang. As a workaround, we now enable interrupts and set
local_irq_count to zero during the spin (so that disk interrupts are not
missed and softirqs are not prevented from running), and restore
local_irq_count at the end of the spin. This approach works in most cases.
The drawback is that the system state can drift during the dump, to the
extent that the other CPUs can still handle interrupts and run softirqs.
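To make the workaround concrete, here is a rough user-space model of what a spinning CPU does. It is only a sketch and not our kernel code: struct cpu_model, struct spin_saved, spin_enter(), spin_exit() and dump_spin() are hypothetical names made up for illustration, and the interrupt flag and local_irq_count are modelled as plain struct fields.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model of one non-dumping CPU; not kernel code. */
struct cpu_model {
    bool irqs_enabled;            /* models the CPU interrupt flag */
    unsigned int local_irq_count; /* models the per-CPU local_irq_count */
};

/* State saved on entry to the spin so it can be restored afterwards. */
struct spin_saved {
    bool irqs_enabled;
    unsigned int local_irq_count;
};

/* Save the current state, then enable interrupts and zero the count so
 * that interrupts and softirqs can still be serviced while spinning. */
struct spin_saved spin_enter(struct cpu_model *cpu)
{
    struct spin_saved saved = { cpu->irqs_enabled, cpu->local_irq_count };

    cpu->local_irq_count = 0;
    cpu->irqs_enabled = true;
    return saved;
}

/* Put back exactly what spin_enter() saved. */
void spin_exit(struct cpu_model *cpu, struct spin_saved saved)
{
    cpu->irqs_enabled = saved.irqs_enabled;
    cpu->local_irq_count = saved.local_irq_count;
}

/* Spin until the dumping CPU sets *dump_done, then restore the state. */
void dump_spin(struct cpu_model *cpu, atomic_bool *dump_done)
{
    struct spin_saved saved = spin_enter(cpu);

    while (!atomic_load(dump_done))
        ; /* busy-wait; a real CPU could take interrupts here */

    spin_exit(cpu, saved);
}
```

In this model the dumping side would simply store true into the flag passed to dump_spin() once the dump has been written. Anything that runs on the spinning CPU between spin_enter() and spin_exit() is where the state drift comes from.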
We are not sure whether this is the right approach. As an alternative,
we are also considering changing the IRQ affinity of the disk interrupts
(or of some or all interrupts) to the dumping CPU. We are currently
trying this approach.
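As an illustration of the affinity idea only, the same effect can be tried from user space through /proc/irq/<irq>/smp_affinity, which takes a hexadecimal CPU bitmask. This is an assumption-laden sketch and not the in-kernel change we are working on: single_cpu_affinity_mask() and route_irq_to_cpu() are hypothetical helper names, the mask is limited to the first 32 CPUs, and writing the file needs root.

```c
#include <stdio.h>

/* Format the hex bitmask that selects only `cpu` (0-based) into buf.
 * Returns 0 on success, -1 if cpu >= 32 or buf is too small. */
int single_cpu_affinity_mask(unsigned int cpu, char *buf, size_t len)
{
    int n;

    if (cpu >= 32)
        return -1;

    n = snprintf(buf, len, "%x", 1u << cpu);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write that mask to /proc/irq/<irq>/smp_affinity so the interrupt is
 * routed to `cpu` only. Returns 0 on success, -1 on any failure. */
int route_irq_to_cpu(unsigned int irq, unsigned int cpu)
{
    char path[64];
    char mask[16];
    FILE *f;
    int ok;

    if (single_cpu_affinity_mask(cpu, mask, sizeof mask) != 0)
        return -1;

    snprintf(path, sizeof path, "/proc/irq/%u/smp_affinity", irq);

    f = fopen(path, "w");
    if (f == NULL)
        return -1;

    ok = fputs(mask, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

For example, route_irq_to_cpu(irq, 0) with the disk controller's IRQ number would ask for that interrupt to go to CPU 0 only. Whether the hardware honours the new mask immediately is a separate question that this sketch does not address.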
Comments?
Regards,
Bharata.