
RE: Suggestion Box + SMP blues

To: "'Matt D. Robinson'" <yakker@xxxxxxxxxxx>
Subject: RE: Suggestion Box + SMP blues
From: "Schaal, Richard" <richard.schaal@xxxxxxxxx>
Date: Tue, 18 Sep 2001 15:04:52 -0700
Cc: "'lkcd@xxxxxxxxxxx'" <lkcd@xxxxxxxxxxx>
Sender: owner-lkcd@xxxxxxxxxxx
To respond to your questions,

I'm capturing the console output through a serial port.  The serial console
is set up in the ordinary manner through lilo and boot-time arguments.

The panic is occurring in the file system, not in the console code.  I'm
checking whether I can send the test script out so you can see the activity
generated.

I've been looking at your code in dump_silence_system().  Let's say that
CPU 3 wants to panic and take a dump.  It calls dump_silence_system(), which
calls __dump_silence_system(), which calls smp_call_function(dump_stop_cpu...).
This causes CPUs 1 and 2 to execute the HLT loop, while CPU 0 comes in and
then leaves.  CPU 3 then goes off to try to do the dump while CPU 0 is busy
doing whatever it was doing, possibly blurring the dump image.  I don't
think that was what you had in mind.  Did you want CPU 0 to do the dump?  If
so, we'll have to send him to the dump routine directly after getting his
attention with the smp_call_function() call.
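
To make the sequence concrete, here is a minimal user-space sketch of it.
It is not the LKCD code: the model_* names are made up for illustration,
and the early return for CPU 0 is my reading of "comes in and then leaves".

```c
#include <assert.h>

#define NR_CPUS 4

enum cpu_state { RUNNING, HALTED };

static enum cpu_state state[NR_CPUS];

/* Hypothetical stand-in for dump_stop_cpu(): CPU 0 returns, the rest park. */
static void model_stop_cpu(int cpu)
{
    if (cpu == 0)
        return;              /* assumed: CPU 0 comes in and then leaves */
    state[cpu] = HALTED;     /* stands in for the HLT loop */
}

/* Stand-in for smp_call_function(): run fn on every CPU except the caller. */
static void model_call_others(int caller, void (*fn)(int))
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu != caller)
            fn(cpu);
}

/* Returns how many CPUs other than the dumping CPU are still running. */
static int model_silence(int panic_cpu)
{
    int stray = 0;

    assert(panic_cpu >= 0 && panic_cpu < NR_CPUS);
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        state[cpu] = RUNNING;

    model_call_others(panic_cpu, model_stop_cpu);

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu != panic_cpu && state[cpu] == RUNNING)
            stray++;
    return stray;
}
```

In this model, model_silence(3) returns 1 (CPU 0 is still running while
CPU 3 dumps), and model_silence(0) returns 0 because every other CPU parks.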

I've hooked up a call to do the dump from the unknown NMI routine - I have
an NMI switch on the front panel of my box.  I'm figuring if we can dump
from the NMI, then dumping after the smp_call_function() call should also
work ok.

So far, I have managed to park the other CPUs and start dumping the header
(according to printk), but then I hit a null pointer dereference in
schedule().

Regards,
Richard








-----Original Message-----
From: Matt D. Robinson [mailto:yakker@xxxxxxxxxxx]
Sent: Monday, September 17, 2001 11:12 PM
To: Schaal, Richard
Cc: 'Matt D. Robinson'; 'lkcd@xxxxxxxxxxx'
Subject: Re: Suggestion Box + SMP blues


On Mon, 17 Sep 2001, Schaal, Richard wrote:
|>Hi Matt,
|>
|>In doing my development and testing, now that the dump recovery seems to be
|>working, I find my disk is filling pretty rapidly because the same dump is
|>recovered more than once.  I dump to a separate device and not to the swap
|>area.  I wonder if the dump save step shouldn't set some sort of flag in the
|>dump header on the dump device that would say this is a stale dump, which
|>might need some --force flag in order to "save" it again.

This should always be done.  A special flag overwrites something in
the dump header to prevent re-saves from taking place (well, let me
re-state that ... re-saves can take place, but they won't go too
far, they'll stop as soon as they see the dump header's magic number
overwritten).
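
A minimal sketch of that check, assuming a made-up header layout and
made-up magic values (the real LKCD header and constants differ, and the
model_* names are illustrative only):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical markers; not the real LKCD magic numbers. */
#define MODEL_DUMP_MAGIC 0x44554d50u  /* header describes an unsaved dump */
#define MODEL_DUMP_STALE 0u           /* header was overwritten after a save */

/* Hypothetical header layout, reduced to what the sketch needs. */
struct model_dump_header {
    uint32_t magic;
    uint32_t pages;
};

/* Returns 1 if the dump was saved, 0 if the header was already stale. */
static int model_save_dump(struct model_dump_header *hdr, int *saved_count)
{
    assert(hdr != NULL && saved_count != NULL);

    if (hdr->magic != MODEL_DUMP_MAGIC)
        return 0;                    /* re-save stops at the header */

    (*saved_count)++;                /* stands in for copying the pages out */
    hdr->magic = MODEL_DUMP_STALE;   /* overwrite so the next run stops early */
    return 1;
}
```

Calling model_save_dump() twice on the same header saves once and then
returns 0 on the second call.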

|>I seem to be dumping ok on my SMP system when I have a relatively simple
|>"oops" to cause a dump for testing, but with increased activity and possible
|>multiple processor panics, the dump is still failing with the console
|>messages pretty well scrambled - apparently messages being intermixed from
|>multiple processors.  Here's a sample...

I've never seen anything scrambled like this before.  This is
absolutely bizarre.  Are you re-directing your console output
or klogd/syslogd to this console?  Are you doing anything
special in your kernel builds related to serial consoles?
Is the crash taking place in any console code?

This is very weird to me.  Of course, I don't have an 8P to
test it on, but if you have a spare one available, I'm more
than willing to help. :)  Seriously, let me know what you're
doing to crash the system so I can help provide more details.
I've seriously never seen a console this jumbled before.
It's like you've got something redirected in character/raw
mode to the console.

|>Red Hat Linux release 7.1 (Seawolf)
|>Kernel 2.4.8 on an 8-processor i686
|>
|>dopey login: (scsi0:A:1:0): Locking max tag count at 64
|>U<n1a<>1b<l1eU> nUna<t<>b11U>nalb<oa<>1e1U bln> l>UethneoaaUU  tnb
|>tnhblaalbnoe
|>de l ltaeehtobaloaon  n  edkdlth e hhrataonod enad llneee lnl kkhhed a el
|>aereN
|>krknnkndeederlUelnlrn neerLNLe e lUnl k Lp ekeNLoleUiernL  lnLpNrtNUUoneee
|>iLlNl
|>LL UL nr  tpNLN  pdpUeLoroLLi   ni tpopniao
|>dneio tntreaLt te ir ndtrtev re e prirdvdo iredee
|>tefnedrriuarertteenfeelcerf ur
|>eaarel fre eefddrdeeeraerrernene aenftnecsccd cereseevdire  nre at tca
|>uetvsaa0l
|>it  sa0  vr ia00tdvrdt0ua0rut0i0 l0rvt00 ia0ea0rdlu0tdsua s0ard
|>l0 e
|> s0dal s0 pi p0r0:pp
|>0ri0ni0ent0:s0ts0i0ie
|>
|>s0nn0s g0 p 0g0e0i0e 0s00s0p e000:0000
|>0ir
|>00 i
|>pp0n0tr:000ii
|>n0n0
|>
|>c820
|>npg pr ieren i sipen pttrs:iii pnnr0tg0i0i n
|>pg:e0ini nd00*ep0 =00c
|>1001090040619e06010c8
|>0
|>0
|>041908810040g :i p00c
|>i910118:p1
|>144:81
|>*=dppe9c8990*1=eee0 0> *p =
|> ==0dp<*ed0p d10e0 e  0==0 =0>0O20c
|>00
|>000001101
|>104100dppd0*d0ep0e
|>ede  = = =*  =0000P :   C   0*=0 *00p1p00d0e
|>0
|>0>*0=p0d 0d0e0e
|>:
|>
|>Oddly enough, if you take every third or fourth character, you can assemble
|>some of the common error messages. :-)
|>
|>I'll take a look at the panic and dump path to see if there's a window of
|>opportunity for the processors to wander about after a panic.

There is the possibility, but the printk()s shouldn't criss-cross
like this.

|>Regards,
|>Richard

Thanks, Richard.  Let me know.

--Matt
