I tried to get this to Matt but the message was bounced.
-Larry Cohen
--- Begin Message ---
Your message
To: Matt D. Robinson
Subject: Re: Cant seem to create crash dumps .... help
Sent: Fri, 2 Mar 2001 12:18:45 -0500
did not reach the following recipient(s):
Matt D. Robinson on Fri, 2 Mar 2001 12:29:57 -0500
The recipient name is not recognized
The MTS-ID of the original message is: c=us;a= ;p=storigen
systems;l=XCHANGESERVER0103021729F6F4RC8G
MSEXCH:IMS:Storigen Systems Inc.:Lowell:XCHANGESERVER 3550
(000B09AA) 550 5.1.1 <matt@xxxxxxxxxxxxxx>... User unknown
--- Begin Message ---
Hi Matt,
Sorry I did not get back to you sooner. Sadly, the reason is that I did not
notice this message ... sigh.
Anyhow, it looks like "unused_list_lock" was 1 ... I remember diving into
that and recalling that it was not locked.
-Larry
"Matt D. Robinson" wrote:
> Larry Cohen wrote:
> >
> > Hi Matt,
> > It seems gdb works in some capacity. I was able to break in just before
> > the hang point and got the following trace:
> >
> > (gdb) bt
> > #0 wait_kio (rw=1, nr=16, bh=0xcd797a38, size=4096) at buffer.c:2015
> > #1 0xc013d79b in brw_kiovec (rw=1, nr=1, iovec=0xc03831c8, dev=774,
> >     b=0xcd797c74, size=4096) at buffer.c:2147
> > #2 0xc019f836 in dump_kernel_write () at vmdump.c:558
> > #3 0xc019fb6d in dump_write_header () at vmdump.c:728
> > #4 0xc019fc3f in dump_execute_memdump (panic_str=0xc035d820 "testing crash",
> >     regs=0x0) at vmdump.c:780
> > #5 0xc019feb7 in dump_execute (panic_str=0xc035d820 "testing crash", regs=0x0)
> >     at vmdump.c:944
> > #6 0xc011a661 in panic (fmt=0xc02e3aa8 "testing crash") at panic.c:77
> > #7 0xc0236e54 in my_function(start=0xcd797f14) at my_function.c:74
> > #8 0xd0909070 in ?? ()
> > #9 0xc011bad6 in sys_init_module (name_user=0x8058580 "my_function_module",
> >     mod_user=0x8060920) at module.c:544
> > #10 0xc01092ff in system_call () at af_packet.c:1876
> > #11 0x804ab0e in ?? () at af_packet.c:1876
> > #12 0x804b1a7 in ?? () at af_packet.c:1876
> > #13 0x804b3f2 in ?? () at af_packet.c:1876
> > #14 0x400349cb in ?? () at af_packet.c:1876
>
> Yea, looks like someone's holding one of the I/O locks in the
> kiobuf path. Do you know what the value of unused_list_lock is?
> It's very likely that there's nothing LKCD can do (in this
> version of the driver) to resolve the problem, since it's the
> kiobuf code that's locked up, not LKCD. :(
>
> --Matt
>
> > "Matt D. Robinson" wrote:
> >
> > > Larry Cohen wrote:
> > > >
> > > > Hi Matt,
> > > >
> > > > Thanks much for responding so quickly.
> > > > I did check below and I believe everything is correct.
> > > >
> > > > /proc/sys/dumpdev contains /dev/vmdump.
> > > > /dev/vmdump is a link to /dev/hda6, which is the swap device.
> > > > ls -l /dev/hda6 yields:
> > > > brw-rw---- 1 root disk 3, 6 May 5 1998 /dev/hda6
> > > > swapon -s yields:
> > > > [root@ipa vmdump]# swapon -s
> > > > Filename                Type            Size    Used    Priority
> > > > /dev/hda6               partition       2097136 0       -1
> > > >
> > > > /proc/sys/kernel/panic is 5.
> > > >
> > > > rc.sysinit has been updated.
> > > >
> > > > I am running Linux 2.4.0 with the 3.1.1 patches, so I presume I do
> > > > not need the raw I/O patch.
> > > > I asked the fellow who put the system together to mail me the
> > > > hardware specs, and I'll send those along as soon as I get them.
> > > > It is an Intel machine.
> > > >
> > > > The output I see is:
> > > >
> > > > (gdb) c
> > > > Continuing.
> > > > CPU: 1
> > > > EIP: 0010:[<c02207ee>]
> > > > EFLAGS: 00010246
> > > > eax: 00000000 ebx: c7c36000 ecx: 00000010 edx: 00000000
> > > > esi: c89c4000 edi: 0000009c ebp: c7c97ef0 esp: c7c979a8
> > > > ds: 0018 es: 0018 ss: 0018
> > > > Process mount.sfs (pid: 1062, stackpage=c7c95000)
> > > > Stack: c89e2a00 00000008 c0302adc c037d680 c037d640 50193d03 c037d680 4088c800
> > > >        c7c97ae0 c01da8d8 c7c97a14 c01da935 c037d680 cff26ae0 00000001 c037d640
> > > >        c037d680 c1473e60 c0300980 20e466e0 0000003c 00000005 0000009c c037d640
> > > > Call Trace: [<c01da8d8>] [<c01da935>] [<c0139f14>] [<c014d88d>]
> > > > [<c014daf5>] [<c013d5cc>] [<c013d735>]
> > > > [<c013d9ff>] [<c013e583>] [<c013e394>] [<c013e737>] [<c0109183>]
> > > >
> > > > Code: f3 ab 8b 85 28 fb ff ff 83 b8 14 10 00 00 00 75 11 68 40 2e
> > > > Dumping to device 0x306 [ide0(3,6)] ...
> > > > Writing dump header ...
> > > >
> > > > .... the system is wedged at this point and I cannot break in with gdb.
> > > > The hang appears to happen while waiting in wait_on_buffer(). A few
> > > > blocks appear to be written, but then everything locks up.
> > >
> > > Looks like your configuration is correct. My presumption here is that
> > > you're dying somewhere in the page cache code -- otherwise, you
> > > wouldn't be hung up waiting to write out to disk. This means that
> > > someone's holding the io_request_lock (or one of the other locks in
> > > the I/O space), which is preventing us from being able to write out
> > > the dump pages.
> > >
> > > Since a dump may not be possible, what you should do is take the built
> > > kernel, run 'lcrash' on the live system, and disassemble the
> > > instructions at the addresses in the "Call Trace" above, which should
> > > give you a good idea of what the stack trace was at the time of the
> > > panic. So if you 'dis c01da8d8', you should see the name of the
> > > function where the problem occurred. Do the same for the rest of the
> > > addresses (or just use 'ksymoops' in the meantime).
> > >
> > > If this isn't a page cache/buffer/request queue lock scenario, then
> > > we shouldn't have hung up.
> > >
> > > Oh, one other thing ... gdb probably won't work after you call down
> > > into our function, because die() has been called, and you're beyond
> > > the kgdb breakpoint area (and smp_send_stop() should have stopped all
> > > CPUs except the one we are executing on).
> > >
> > > Let's see what the disassembled functions show ... let me know as
> > > soon as you can. Thanks. :)
> > >
> > > --Matt
> > >
> > > > Thanks very much for your help,
> > > > -Larry
> > > >
> > > > > Larry Cohen wrote:
> > > > > >
> > > > > > Hi, I'm pretty sure I have installed the lkcd patch and
> > > > > > utilities OK, but when the system panics it hangs while trying
> > > > > > to write out the dump to the swap device.
> > > > > > My disks are IDE, not SCSI. Is this a problem?
> > > > > > Any other thoughts?
> > > > > >
> > > > > > Thanks,
> > > > > > Larry Cohen
> > > > >
> > > > > Do you have the output from the console? It sounds like you
> > > > > have everything configured correctly. If you run
> > > > > "/sbin/vmdump config", are the right values showing up in
> > > > > /proc/sys/vmdump? IDE and SCSI disks should work with the 3.1.1
> > > > > patches.
> > > > >
> > > > > Typically there are problems when the right devices aren't
> > > > > configured in /proc/sys/vmdump (which are updated when
> > > > > "/sbin/vmdump config" is run), or /proc/sys/kernel/panic is zero,
> > > > > which means the system won't reset after taking a dump.
> > > > >
> > > > > Note that you have to modify your /etc/rc.d/rc.sysinit (or a
> > > > > similar script, depending on the Linux variant you are running)
> > > > > to add the /sbin/vmdump calls. Check out the README.
> > > > >
> > > > > If you've done all this and it still fails, send me the console
> > > > > output, along with your hardware specifications, and we'll see
> > > > > what we can do about getting to the bottom of your problem.
> > > > > Thanks, Larry.
> > > > >
> > > > > --Matt
--- End Message ---
--- End Message ---
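
A minimal sketch of the address lookup Matt suggests in the thread above
('dis c01da8d8' in lcrash, or ksymoops), for cases where neither tool is at
hand. It assumes a System.map-style symbol table for the exact kernel that
panicked. The helper names and the two symbols in the demo data are
hypothetical and do not come from the thread, so the printed function names
say nothing about the real hang.

```python
import bisect
import re


def parse_system_map(text):
    """Parse System.map-style lines ('c01da800 T name') into a sorted list
    of (address, name) tuples. Lines with fewer than three fields are skipped."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def extract_trace_addresses(oops_text):
    """Return every 8-digit [<hex>] address in the oops text, in order.
    Note this also matches the EIP line, which uses the same bracket form."""
    return [int(m, 16) for m in re.findall(r"\[<([0-9a-fA-F]{8})>\]", oops_text)]


def resolve(addr, symbols):
    """Map an address to 'name+0xoffset' using the nearest symbol at or
    below it. Returns None if the address precedes every known symbol."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return f"{name}+{addr - start:#x}"


if __name__ == "__main__":
    # Hypothetical symbol table; a real one comes from the kernel build.
    demo_map = "c01da800 T hypothetical_func_a\nc01da900 T hypothetical_func_b\n"
    # The first two Call Trace entries from the console output in the thread.
    demo_oops = "Call Trace: [<c01da8d8>] [<c01da935>]"

    table = parse_system_map(demo_map)
    for address in extract_trace_addresses(demo_oops):
        print(f"{address:08x} {resolve(address, table)}")
```

The lookup only names the function containing each address. It does not
unwind the stack, so entries in a 2.4 "Call Trace" that are stale stack
contents will still be reported as if they were real callers.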