We noticed this morning that NFS mounts from the fileserver had gone
stale. The affected exports correspond to two hardware RAID LUNs (info
below). I logged into the fileserver and found that the mountpoints were
dead there as well, even though mount still listed them. The kernel log
showed a whole slew of SCSI errors that started shortly after 4am (hmm,
cron time), continued when a user showed up to work, and culminated in an
xfs_force_shutdown of the filesystem at 9am, which of course triggered a
further slew of I/O errors.

After rebooting (with NFS shares disabled), the two RAID volumes mounted
as clean. xfs_check found no errors and exited silently. The data appears
to be there, although I haven't run anything that generates much file I/O
and haven't yet re-opened the NFS shares.

Should I upgrade to a new kernel and XFS release before investigating this
further? System info and some kernel log excerpts are below; the full
kernel log (events related to this) can be downloaded from
http://cryoem.berkeley.edu/~slaton/kernel.040915.scsicrash.gz

thanks,
slaton

system info:
hardware: dual 32-bit Xeon system
OS: Red Hat Linux 8.0
kernel: custom 2.4.19 kernel compiled with SGI XFS 1.2pre5
kernel args: max_scsi_luns=255
host adapter: Adaptec 29160
RAID volume: 3.7 TB hardware RAID5+0 box, SATA drives, SCSI system
  interface, divided into two LUNs of 2.0 and 1.7 TB

kernel log excerpts:
scsi1:0:3:0: Attempting to queue an ABORT message
scsi1: Dumping Card State while idle, at SEQADDR 0x8
DevQ(0:3:0): 0 waiting
DevQ(0:3:1): 0 waiting
scsi1:A:3: parity error detected in DT Data-in phase. SEQADDR(0x1a2) SCSIRATE(0x0)
^IUnexpected non-DT Data Phase
scsi1:0:3:0: Attempting to queue an ABORT message
scsi1: Dumping Card State in Message-in phase, at SEQADDR 0x168
scsi1:0:3:0: Cmd aborted from QINFIFO
aic7xxx_abort returns 0x2002
scsi: device set offline - not ready or command retry failed after bus reset: host 1 channel 0 id 3 lun 0
SCSI disk error : host 1 channel 0 id 3 lun 0 return code = 70002
I/O error: dev 08:11, sector 671088736
I/O error in filesystem ("sd(8,17)") meta-data dev 0x811 block
0x28000060^I ("xfs_trans_read_buf") error 5 buf count 4096
EFSCORRUPTED returned from file xfs_ialloc.c line 1313
last message repeated 29 times
xfs_btree_check_sblock: Not OK:
magic 0x3a0eb8a5 level 47532 numrecs 50791 leftsib -1188756534 rightsib -1171161293
nfsd: non-standard errno: -990
xfs_force_shutdown(sd(8,17),0x2) called from line 957 of file xfs_log.c. Return address = 0xf8bc4b2f
Log I/O Error Detected. Shutting down filesystem: sd(8,17)
Please umount the filesystem, and rectify the problem(s)
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 64
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 72
I/O error in filesystem ("sd(8,33)") meta-data dev 0x821 block 0x40^I
("xfs_trans_read_buf") error 5 buf count 8192
XFS unmount got error 5
linvfs_put_super: vfsp/0xc28df640 left dangling!
VFS: Busy inodes after unmount. Self-destruct in 5 seconds. Have a nice day...
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 0
XFS: bad magic number
XFS: SB validate failed
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 0
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 1
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 2
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 3
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 4
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 5
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 6
SCSI disk error : host 1 channel 0 id 3 lun 1 return code = 70002
I/O error: dev 08:21, sector 7
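
For anyone reading the excerpts above, here is a rough sketch of how the device numbers and return codes can be decoded. It is only an illustration, not something taken from the kernel source: the helper names (decode_dev, sd_name, decode_result) are made up for this example, and it assumes the usual conventions that "dev MM:mm" is hex major:minor, that SCSI disks use major 8 with 16 minors per disk, and that the SCSI result word packs status, message, host and driver bytes from low to high.

```python
def decode_dev(dev):
    """Split a 'MM:mm' device string (assumed to be hex) into (major, minor)."""
    major_hex, minor_hex = dev.split(":")
    return int(major_hex, 16), int(minor_hex, 16)


def sd_name(major, minor):
    """Name a SCSI disk or partition, assuming major 8 and 16 minors per disk."""
    if major != 8:
        raise ValueError("this sketch only handles SCSI disk major 8")
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    # Partition 0 means the whole disk, so no suffix in that case.
    return name + (str(part) if part else "")


def decode_result(code):
    """Split a SCSI result word into its four bytes (assumed layout)."""
    return {
        "status": code & 0xFF,           # raw SCSI status byte
        "msg": (code >> 8) & 0xFF,       # message byte
        "host": (code >> 16) & 0xFF,     # host adapter byte
        "driver": (code >> 24) & 0xFF,   # mid-level driver byte
    }


if __name__ == "__main__":
    for dev in ("08:11", "08:21"):
        major, minor = decode_dev(dev)
        print(dev, "->", sd_name(major, minor))

    # "return code = 70002" is printed in hex.
    print(decode_result(0x70002))

    # The failing sector and the XFS block number are the same address.
    print(671088736 == 0x28000060)
```

Under those assumptions, 08:11 and 08:21 correspond to sdb1 and sdc1, matching the sd(8,17) and sd(8,33) names in the XFS messages. The return code 70002 splits into host byte 0x07 and status byte 0x02, which as far as I know are DID_ERROR and CHECK CONDITION. Sector 671088736 is the same number as block 0x28000060 in the first XFS I/O error.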