Hello Moo Kim,
Thank you for sharing your experiences with us. It's good to hear that this
would be a useful feature to have.
Let's hope that we can come up with at least a best-effort solution which
addresses some of the difficulties you've observed.
One of the enhancements we are also looking at further down the line, in
the direction of non-disruptive dumps, is increased flexibility in
customizing the kind of data to dump. This feature should make it possible
to bring down the size of the dumps considerably by focusing on the
relevant subset of system data that is of interest for a particular
kind of problem.
And yes, of course, like many others who subscribe to this list, we too are
seriously interested in seeing RAS support in Linux become a standard
feature rather than remain confined to isolated patches. One important area
in this regard is interoperability between, and coexistence of, different RAS
facilities. Specifically on the subject of these crash dump discussions,
I think an important design consideration to keep in mind for any of these
extensions is to look for minimally intrusive ways of achieving our goals,
so that it is easier to integrate with the kernel when the time comes.
Regards
Suparna
>Hello Suparna,
>
>I think the idea of capturing crash dumps on a live system seems
>to be quite attractive to end-users (e.g. customers) since it
>implies there is no system downtime or unavailability. From
>my limited exposure/experience at NCR supporting the Teradata
>DBMS product on SVR4 MPRAS systems, developers were from time
>to time requested to dial in to the customer system (typically
>an MPP system) to diagnose performance or system slowdown/hang
>related problems. We often explicitly request that the
>customer or site personnel force the particular node
>in question, out of the X nodes in the MPP system, to panic to capture
>the crash dump for post-mortem analysis. Each node is
>typically configured with 4/2/1 GB of memory. SVR4 MPRAS
>also has a utility to take a memory dump on a live system,
>but we could not use it at all because:
>
>1. The utility basically writes the entire memory to
> a flat file and often there is no system disk space
> that could handle 4/2/1 GB of data.
>
>2. The utility takes a long time to finish (several hours)
> because it is a user-level application.
>
>3. Lastly, I don't know whether the utility quiesces the system
> or not, but I was told that it may not take a clean
> snapshot of the memory, so there is a small chance
> that the live crash dump file may not contain the
> information one is after.
>
>Please note that this email is not meant to discourage any of your
>proposals below, but to share my limited experience. Aside
>from this, I am more eager to see the basic functionality of
>lkcd integrated into the main Linux source tree or a
>standard Linux distribution as part of the RAS (Reliability,
>Availability, Supportability) features of Linux soon, since
>being able to reliably save the panic dump and identify/debug
>any software problem (e.g. post-mortem analysis) has been
>recognized by management as one of the key RAS features for
>supporting the Teradata RDBMS product at NCR.
>
>Thanks,
>Moo Kim Moo.Kim@xxxxxxx
>NCR Corporation
Suparna Bhattacharya
IBM Software Lab, India
E-mail : bsuparna@xxxxxxxxxx
Phone : 91-80-5267117, Extn : 2525