xfs
[Top] [All Lists]

Oops with 2.4.16 [update]

To: linux-xfs@xxxxxxxxxxx
Subject: Oops with 2.4.16 [update]
From: Pascal Haakmat <a.haakmat@xxxxxxxxx>
Date: Fri, 11 Jan 2002 22:58:16 +0100
Sender: owner-linux-xfs@xxxxxxxxxxx
User-agent: Mutt/1.2.5i
It seems I hit the jackpot.

I've been running Andre's script overnight. When I woke up to look at the
machine this morning, the entire disk had gone south. xfs_repair finds
something resembling a secondary superblock, but then gives up with a fatal
error.

The machine did not crash. /var/log/messages says:


Jan 11 06:55:10 awacs syslogd 1.3-3: restart.
....
Jan 11 11:03:53 awacs kernel: xfs_iunlink_remove: XFS_TEST_ERROR()  returned an 
error on ide1(22,1).  Returning EFSCORRUPTED.
Jan 11 11:03:53 awacs kernel: xfs_inactive:  xfs_ifree() returned an error = 
990 on ide1(22,1)
Jan 11 11:03:53 awacs kernel: xfs_force_shutdown(ide1(22,1),0x1) called from 
line 1952 of file xfs_vnodeops.c.  Return address = 0xc01cf942
Jan 11 11:03:53 awacs kernel: I/O Error Detected.  Shutting down filesystem: 
ide1(22,1)
Jan 11 11:03:53 awacs kernel: Please umount the filesystem, and rectify the 
problem(s)


So it took approx. three hours for the error to occur with 80 simultaneous
cp processes. I turned off the sound drivers during this run, because with
the sound drivers enabled, trying to record audio while the test was running
would cause the kernel to Oops:


Jan 11 06:53:34 awacs kernel: Unable to handle kernel paging request at virtual 
address 01dae918
Jan 11 06:53:34 awacs kernel:  printing eip:
Jan 11 06:53:34 awacs kernel: c01792d9
Jan 11 06:53:34 awacs kernel: *pde = 00000000
Jan 11 06:53:34 awacs kernel: Oops: 0002
Jan 11 06:53:34 awacs kernel: CPU:    0
Jan 11 06:53:34 awacs kernel: EIP:    0010:[convert_page+65/160]    Not tainted
Jan 11 06:53:34 awacs kernel: EFLAGS: 00010203
Jan 11 06:53:34 awacs kernel: eax: 0000000a   ebx: 01dae900   ecx: 005ae3b8 
edx: c0328f80
Jan 11 06:53:34 awacs kernel: esi: c0328f80   edi: 00000002   ebp: c100ca80 
esp: d9fb5be4
Jan 11 06:53:34 awacs kernel: ds: 0018   es: 0018   ss: 0018
Jan 11 06:53:34 awacs kernel: Process cp (pid: 2506, stackpage=d9fb5000)
Jan 11 06:53:34 awacs kernel: Stack: 00000065 d9fb5c40 00000065 ddb04080 
c01793fa ddb04080 c100ca80 d9fb5c40 
Jan 11 06:53:34 awacs kernel:        00000000 00000000 c100f640 d9fb5c40 
00000800 c01794d1 ddb04080 c100f640 
Jan 11 06:53:34 awacs kernel:        d9fb5c40 00000000 00010002 ddb04080 
c100f640 d9fb5c3c 00000001 005ae3b8 
Jan 11 06:53:34 awacs kernel: Call Trace: [cluster_write+194/216] 
[pagebuf_delalloc_convert+193/208] [pagebuf_write_full_page+107/372] 
[linvfs_read_full_page+4/24] [xfs_read+443/556]  
Jan 11 06:53:34 awacs kernel:    [linvfs_read_full_page+4/24] 
[_write_buffer+85/112] [write_some_buffers+121/300] [<e097d97e>] 
[kmem_cache_alloc+93/260] [get_unused_buffer_head+78/156] 
Jan 11 06:53:34 awacs kernel:    [create_buffers+93/236] 
[__refile_buffer+84/92] [balance_dirty_state+15/68] [balance_dirty+32/56] 
[hook_buffers_to_page_delay+63/72] [__pb_block_commit_write_async+71/76] 
Jan 11 06:53:34 awacs kernel:    [__pagebuf_do_delwri+342/496] 
[_pagebuf_file_write+365/552] [pagebuf_generic_file_write+179/776] 
[linvfs_read_full_page+4/24] [xfs_write+1231/1344] [linvfs_read_full_page+4/24] 
Jan 11 06:53:34 awacs kernel:    [xfs_ilock_nowait+30/196] 
[linvfs_readdir+252/596] [sys_write+146/200] [system_call+51/56] 



So I upgraded the sound drivers (from OSS 3.9.5f to 3.9.6b) and applied
Andre Hedrick's IDE patch for 2.4.16, available at
http://www.linuxdiskcert.org/.

This configuration (i.e. 2.4.16+lowlatency+ide+xfs) has now successfully
been running the cp script for a couple of hours. The last two hours I have
been running it with 160 cp processes, and although the machine is very
slow, it seems to chug along.

Both upgrades (sound & IDE patch) solve separate parts of the problem: with
the new sound drivers I can record while running the test; with the older
sound drivers this causes an Oops. In both cases, there are no longer any
"hdc: lost interrupt" messages and there is no filesystem corruption yet
(well, the filesystem is now completely empty, that might make a
difference).

In any case the problem does not seem to be with XFS; also, Andre Hedrick
has been quite vocal on lkml about problems with the block layer/IDE/ATA (I
think) support in the 2.4 kernel. 

I will see if the machine survives the 160 cp processes test tonight. If it
does, then I would consider the problem solved. Meanwhile thanks for all
your help.

(I am typing this message from another machine that might be misconfigured
for sending mail, so I hope this message gets through. I can't read mail yet
until I've restored from backups, probably tomorrow.)

Attaching dmesg output for Andre who wanted to know my hardware
configuration:

Linux version 2.4.16-xfs-ll (root@xxxxxxxxxxxxx) (gcc version egcs-2.91.66 
19990314/Linux (egcs-1.1.2 release)) #20 SMP Fri Jan 11 16:17:01 CET 2002
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 0000000020000000 (usable)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000fffc0000 - 0000000100000000 (reserved)
found SMP MP-table at 000fb4c0
hm, page 000fb000 reserved twice.
hm, page 000fc000 reserved twice.
hm, page 000f2000 reserved twice.
hm, page 000f3000 reserved twice.
On node 0 totalpages: 131072
zone(0): 4096 pages.
zone(1): 126976 pages.
zone(2): 0 pages.
Intel MultiProcessor Specification v1.1
    Virtual Wire compatibility mode.
OEM ID: INTEL    Product ID: 440BX        APIC at: 0xFEE00000
Processor #0 Pentium(tm) Pro APIC version 17
Processor #1 Pentium(tm) Pro APIC version 17
I/O APIC #2 Version 17 at 0xFEC00000.
Processors: 2
Kernel command line: auto BOOT_IMAGE=linux-ll-ide ro root=301 
BOOT_FILE=/boot/vmlinuz-2.4.16-xfs-ll-ide
Initializing CPU#0
Detected 601.380 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1199.30 BogoMIPS
Memory: 513048k/524288k available (1636k kernel code, 10852k reserved, 363k 
data, 224k init, 0k highmem)
Dentry-cache hash table entries: 65536 (order: 7, 524288 bytes)
Inode-cache hash table entries: 32768 (order: 6, 262144 bytes)
Mount-cache hash table entries: 8192 (order: 4, 65536 bytes)
Buffer-cache hash table entries: 32768 (order: 5, 131072 bytes)
Page-cache hash table entries: 131072 (order: 7, 524288 bytes)
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU:     After generic, caps: 0383fbff 00000000 00000000 00000000
CPU:             Common caps: 0383fbff 00000000 00000000 00000000
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch (rgooch@xxxxxxxxxxxxx)
mtrr: detected mtrr type: Intel
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
Intel machine check reporting enabled on CPU#0.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU:     After generic, caps: 0383fbff 00000000 00000000 00000000
CPU:             Common caps: 0383fbff 00000000 00000000 00000000
CPU0: Intel Pentium III (Coppermine) stepping 06
per-CPU timeslice cutoff: 731.53 usecs.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000004
ESR value after enabling vector: 00000000
Booting processor 1/1 eip 2000
Initializing CPU#1
masked ExtINT on CPU#1
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 1202.58 BogoMIPS
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
Intel machine check reporting enabled on CPU#1.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU:     After generic, caps: 0383fbff 00000000 00000000 00000000
CPU:             Common caps: 0383fbff 00000000 00000000 00000000
CPU1: Intel Pentium III (Coppermine) stepping 06
Total of 2 processors activated (2401.89 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 2 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 2 ... ok.
init IO_APIC IRQs
 IO-APIC (apicid-pin) 2-0, 2-5, 2-9, 2-10, 2-11, 2-20, 2-21, 2-22, 2-23 not 
connected.
..TIMER: vector=0x31 pin1=2 pin2=0
number of MP IRQ sources: 17.
number of IO-APIC #2 registers: 24.
testing the IO APIC.......................

IO APIC #2......
.... register #00: 02000000
.......    : physical APIC id: 02
.... register #01: 00170011
.......     : max redirection entries: 0017
.......     : PRQ implemented: 0
.......     : IO APIC version: 0011
.... register #02: 00000000
.......     : arbitration: 00
.... IRQ redirection table:
 NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:   
 00 000 00  1    0    0   0   0    0    0    00
 01 003 03  0    0    0   0   0    1    1    39
 02 003 03  0    0    0   0   0    1    1    31
 03 003 03  0    0    0   0   0    1    1    41
 04 003 03  0    0    0   0   0    1    1    49
 05 000 00  1    0    0   0   0    0    0    00
 06 003 03  0    0    0   0   0    1    1    51
 07 003 03  0    0    0   0   0    1    1    59
 08 003 03  0    0    0   0   0    1    1    61
 09 000 00  1    0    0   0   0    0    0    00
 0a 000 00  1    0    0   0   0    0    0    00
 0b 000 00  1    0    0   0   0    0    0    00
 0c 003 03  0    0    0   0   0    1    1    69
 0d 003 03  0    0    0   0   0    1    1    71
 0e 003 03  0    0    0   0   0    1    1    79
 0f 003 03  0    0    0   0   0    1    1    81
 10 003 03  1    1    0   1   0    1    1    89
 11 003 03  1    1    0   1   0    1    1    91
 12 003 03  1    1    0   1   0    1    1    99
 13 003 03  1    1    0   1   0    1    1    A1
 14 000 00  1    0    0   0   0    0    0    00
 15 000 00  1    0    0   0   0    0    0    00
 16 000 00  1    0    0   0   0    0    0    00
 17 000 00  1    0    0   0   0    0    0    00
IRQ to pin mappings:
IRQ0 -> 0:2
IRQ1 -> 0:1
IRQ3 -> 0:3
IRQ4 -> 0:4
IRQ5 -> 0:19
IRQ6 -> 0:6
IRQ7 -> 0:7
IRQ8 -> 0:8
IRQ9 -> 0:18
IRQ10 -> 0:16
IRQ11 -> 0:17
IRQ12 -> 0:12
IRQ13 -> 0:13
IRQ14 -> 0:14
IRQ15 -> 0:15
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 601.3388 MHz.
..... host bus clock speed is 100.2228 MHz.
cpu: 0, clocks: 1002228, slice: 334076
CPU0<T0:1002224,T1:668144,D:4,S:334076,C:1002228>
cpu: 1, clocks: 1002228, slice: 334076
CPU1<T0:1002224,T1:334064,D:8,S:334076,C:1002228>
checking TSC synchronization across CPUs: passed.
Waiting on wait_init_idle (map = 0x2)
All processors have done init_idle
PCI: PCI BIOS revision 2.10 entry at 0xfdb81, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Starting kswapd
SGI XFS with ACLs, EAs, no debug enabled
pty: 256 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS SHARE_IRQ SERIAL_PCI 
enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
Real Time Clock Driver v1.10e
block: 128 slots per queue, batch=32
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
PIIX4: IDE controller on PCI bus 00 dev 39
PIIX4: chipset revision 1
PIIX4: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:DMA
    ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:pio
hda: MAXTOR 4K060H3, ATA DISK drive
hdb: QUANTUM FIREBALL EX6.4A, ATA DISK drive
hdc: Maxtor 91531U3, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: 117266688 sectors (60041 MB) w/2000KiB Cache, CHS=7299/255/63, UDMA(33)
hdb: 12594960 sectors (6449 MB) w/418KiB Cache, CHS=784/255/63, UDMA(33)
hdc: 30015216 sectors (15368 MB) w/512KiB Cache, CHS=29777/16/63, UDMA(33)
Partition check:
 hda: hda1
 hdb: hdb1 hdb2 hdb3
 hdc: unknown partition table
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
8139too Fast Ethernet driver 0.9.22
eth0: RealTek RTL8139 Fast Ethernet at 0xe0800f00, 00:30:ab:02:fd:e7, IRQ 9
eth0:  Identified 8139 chip type 'RTL-8139C'
eth1: RealTek RTL8139 Fast Ethernet at 0xe0802e00, 00:50:fc:0e:22:4d, IRQ 9
eth1:  Identified 8139 chip type 'RTL-8139C'
Linux agpgart interface v0.99 (c) Jeff Hartmann
agpgart: Maximum main memory to use for agp memory: 440M
agpgart: Detected Intel 440GX chipset
agpgart: AGP aperture is 64M @ 0xf8000000
SCSI subsystem driver Revision: 1.00
(scsi0) <Adaptec AIC-7890/1 Ultra2 SCSI host adapter> found at PCI 0/14/0
(scsi0) Wide Channel, SCSI ID=7, 32/255 SCBs
(scsi0) Downloading sequencer code... 398 instructions downloaded
scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.4/5.2.0
       <Adaptec AIC-7890/1 Ultra2 SCSI host adapter>
  Vendor: PLEXTOR   Model: CD-ROM PX-40TS    Rev: 1.10
  Type:   CD-ROM                             ANSI SCSI revision: 02
  Vendor: PLEXTOR   Model: CD-R   PX-W8220T  Rev: 1.03
  Type:   CD-ROM                             ANSI SCSI revision: 02
Attached scsi CD-ROM sr0 at scsi0, channel 0, id 3, lun 0
Attached scsi CD-ROM sr1 at scsi0, channel 0, id 4, lun 0
(scsi0:0:3:0) Synchronous at 20.0 Mbyte/sec, offset 15.
sr0: scsi-1 drive
Uniform CD-ROM driver Revision: 3.12
(scsi0:0:4:0) Synchronous at 10.0 Mbyte/sec, offset 8.
sr1: scsi3-mmc drive: 20x/20x writer cd/rw xa/form2 cdda tray
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP
IP: routing cache hash table of 4096 buckets, 32Kbytes
TCP: Hash tables configured (established 131072 bind 65536)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
XFS mounting filesystem ide0(3,1)
XFS: WARNING: recovery required on readonly filesystem.
XFS: write access will be enabled during mount.
Starting XFS recovery on filesystem: ide0(3,1) (dev: 3/1)
Ending XFS recovery on filesystem: ide0(3,1) (dev: 3/1)
VFS: Mounted root (xfs filesystem) readonly.


<Prev in Thread] Current Thread [Next in Thread>