Hi,
I have a problem with an XFS partition on a Debian stable system. The system
was running kernel 2.4.20, and I compiled 2.4.26. Because performance took a
hit when running 2.4.26, I rebooted several times to switch from 2.4.26 to
2.4.20 and vice versa.
After each reboot, I found this in the logs:
Jun 23 19:24:39 dotnet kernel: XFS mounting filesystem sd(8,38)
Jun 23 19:24:39 dotnet kernel: Starting XFS recovery on filesystem: sd(8,38) (dev: sd(8,38))
Jun 23 19:24:39 dotnet kernel: Ending XFS recovery on filesystem: sd(8,38) (dev: sd(8,38))
Does this mean the recovery was successful? I suppose not, since it came back
at subsequent reboots. Or does it mean the partition is not being cleanly
unmounted at each reboot?
The problem became worse:
Jun 23 19:24:39 dotnet kernel: XFS mounting filesystem sd(8,38)
Jun 23 19:24:39 dotnet kernel: Starting XFS recovery on filesystem: sd(8,38) (dev: sd(8,38))
Jun 23 19:24:39 dotnet kernel: Ending XFS recovery on filesystem: sd(8,38) (dev: sd(8,38))
Jun 23 19:24:39 dotnet kernel: kjournald starting. Commit interval 5 seconds
Jun 23 19:24:39 dotnet kernel: EXT3 FS 2.4-0.9.19, 19 August 2002 on sd(8,33), internal journal
Jun 23 19:24:39 dotnet kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jun 23 19:24:39 dotnet kernel: XFS mounting filesystem sd(8,35)
Jun 23 19:24:39 dotnet kernel: Ending clean XFS mount for filesystem: sd(8,35)
Jun 23 19:24:48 dotnet kernel: eth0: no IPv6 routers present
Jun 23 19:24:56 dotnet kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1583 of file xfs_alloc.c. Caller 0xc017cd0b
Jun 23 19:24:56 dotnet kernel: dd0d7d9c c01a636e c017c063 c02fd946 00000001 00000000 c02fd920 0000062f
Jun 23 19:24:56 dotnet kernel: c017cd0b 00000000 0001b1ce dfa01554 00000000 df9e1800 00000000 df9d2e8c
Jun 23 19:24:56 dotnet kernel: dfe3c580 00000000 00000001 0001b1ce 00000010 00000001 00000001 c017cd0b
Jun 23 19:24:56 dotnet kernel: Call Trace: [add_entropy_words+62/200] [count_semncnt+43/104] [sys_semop+7/28] [sys_semop+7/28] [acpi_leave_sleep_state+247/596]
Jun 23 19:24:56 dotnet kernel: [do_con_write+1466/1672] [ide_taskfile_ioctl+1057/1204] [flagged_taskfile+264/616] [init_irq+370/772] [fput+77/244] [filp_close+156/168]
Jun 23 19:24:56 dotnet kernel: [sys_close+91/112] [system_call+51/56]
Jun 23 19:24:56 dotnet kernel: xfs_force_shutdown(sd(8,38),0x8) called from line 4049 of file xfs_bmap.c. Return address = 0xc01d52ea
Jun 23 19:24:56 dotnet kernel: Filesystem "sd(8,38)": Corruption of in-memory data detected. Shutting down filesystem: sd(8,38)
Jun 23 19:24:56 dotnet kernel: Please umount the filesystem, and rectify the problem(s)
This happened when booting 2.4.26, but I suspect it could just as well have
happened with 2.4.20.
The partition causing the problem is mounted on /usr. The partition was
mounted, but listing the directory contents returned nothing.
I then unmounted the filesystem and remounted it. Here is what dmesg reported:
XFS mounting filesystem sd(8,38)
Starting XFS recovery on filesystem: sd(8,38) (dev: 8/38)
XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1577 of file xfs_alloc.c. Caller 0xc0182098
dcbf9c0c c0181076 c02b4b80 00000001 00000000 c02b4b5a 00000629 c0182098
00000001 df5aa104 00000000 00000001 00000000 0000b36f 00000027 00000001
00000001 00000000 df5aa0d0 dfab0574 00000000 c0182098 df5aa0d0 df76d680
Call Trace: [<c0181076>] [<c0182098>] [<c0182098>] [<c01dae3f>] [<c01be11c>] [<c01be1b3>] [<c01bf2b4>] [<c01b7d03>] [<c01c0895>]
[<c01bfbcb>] [<c01b5928>] [<c01c7a11>] [<c01d9f01>] [<c01d9d36>] [<c013ce90>] [<c013d147>] [<c014f1fa>] [<c014f46b>] [<c014f30b>]
[<c014f86e>] [<c0108c83>]
Ending XFS recovery on filesystem: sd(8,38) (dev: 8/38)
As everything seems to be running fine now, I don't dare unmount the
partition before I migrate everything to another server.
So I don't know what xfs_check would report now (it showed problems when I ran
it earlier, but I didn't keep the output).
At boot, I also get some error messages from the SCSI controller:
SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.8
<Adaptec aic7896/97 Ultra2 SCSI adapter>
aic7896/97: Ultra2 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.8
<Adaptec aic7896/97 Ultra2 SCSI adapter>
aic7896/97: Ultra2 Wide Channel B, SCSI Id=7, 32/253 SCBs
Vendor: QUANTUM Model: ATLAS_V_18_SCA Rev: 0200
Type: Direct-Access ANSI SCSI revision: 03
(scsi0:A:1): 80.000MB/s transfers (40.000MHz, offset 63, 16bit)
Vendor: QUANTUM Model: ATLAS_V_18_SCA Rev: 0200
Type: Direct-Access ANSI SCSI revision: 03
(scsi0:A:2): 80.000MB/s transfers (40.000MHz, offset 63, 16bit)
Vendor: QUANTUM Model: ATLAS_V__9_SCA Rev: 0230
Type: Direct-Access ANSI SCSI revision: 03
(scsi0:A:3): 80.000MB/s transfers (40.000MHz, offset 63, 16bit)
Vendor: VA Linux Model: Fullon 2x2 Rev: 1.01
Type: Processor ANSI SCSI revision: 02
scsi0:A:1:0: Tagged Queuing enabled. Depth 253
scsi0:A:2:0: Tagged Queuing enabled. Depth 253
scsi0:A:3:0: Tagged Queuing enabled. Depth 253
PCI: Enabling device 00:0c.0 (0116 -> 0117)
PCI: Enabling device 00:0c.1 (0116 -> 0117)
Attached scsi disk sda at scsi0, channel 0, id 1, lun 0
Attached scsi disk sdb at scsi0, channel 0, id 2, lun 0
Attached scsi disk sdc at scsi0, channel 0, id 3, lun 0
scsi0: PCI error Interrupt at seqaddr = 0x9
scsi0: Signaled a Target Abort
scsi1: PCI error Interrupt at seqaddr = 0x9
scsi1: Signaled a Target Abort
SCSI device sda: 35861388 512-byte hdwr sectors (18361 MB)
Partition check:
sda:
SCSI device sdb: 35861388 512-byte hdwr sectors (18361 MB)
sdb:
SCSI device sdc: 17930694 512-byte hdwr sectors (9181 MB)
sdc: sdc1 sdc2 sdc3 sdc4 < sdc5 sdc6 >
From what I have read on the web, this could be due to a suboptimal cable,
but could it also be the cause of the XFS problems? I don't get these SCSI
error messages later, once the system is running (I also read that this could
mean the driver has adapted the transfer speed to the quality of the cable).
I hope you can help me identify the cause of the problem. I'm not sure
exactly where it lies:
- hardware?
- the kernel upgrade?
- XFS only?
Thanks in advance.
Raph