I've been noticing a problem that I have narrowed down to a reliable
test case. We have a SAN with three servers (two dual-CPU Dell 6450s and a
Dell Optiplex Gx, running RedHat 7.1 with the 2.4.5 kernel (SMP enabled on
the duals) and XFS 1.0.1), a Qualstar tape library, and a Zzyzx RocketStor
2000 disk array. These machines are all connected together with a SANbox.
The Dells have Qlogic 2100 cards and are running the stock qlogicfc kernel
driver as a module. Apart from adding XFS, our only kernel change is to
scsi_scan.c (patch below). Everything else is unchanged. I originally
observed the problem with the tape library and created the following test
case:
1) unload st, sg, qlogicfc from host
2) unplug Qualstar library from SAN
3) load qlogicfc on host
4) mount a disk on host from the Zzyzx array
5) start a large dd write to the disk
6) load st, sg modules on host
7) plug Qualstar library into SAN
8) unload sg (or st)
This sequence of actions causes the host running the dd to report I/O
errors and eventually forces the disk offline. On occasion, I have been
able to reproduce this by simply connecting the library to the SAN while a
disk write is occurring. We worked around the problem by installing a
second HBA in the Optiplex and plugging the tape library directly into that
host, effectively isolating the library from the rest of the SAN (this
machine is the backup server, so it's no big deal). However, earlier today
one of the Dell servers crashed, and when we rebooted it the *other* Dell,
which serves the disk space on the SAN via NFS, reported the errors below
in its log. When I tried to unexport the file system and unmount the disk,
the VFS died with the all too pleasant "Have a nice day" error. We then
rebooted the machine and all of the disks recovered OK. Should I run
xfs_repair manually, or is the recovery XFS runs at mount sufficient? Any
ideas on how we can avoid this in the future?
Thanks!
-poul
Sep 7 14:01:05 albatross kernel: qlogicfc0 : RSCN Received
Sep 7 14:01:05 albatross kernel: qlogicfc0 : Fabric found.
Sep 7 14:01:05 albatross kernel: qlogicfc0 : Error performing port login 4008
Sep 7 14:01:05 albatross kernel: qlogicfc0 : Port Database
Sep 7 14:01:05 albatross kernel: wwn: 210000e08b02bcf9 scsi_id: 0 loop_id: 0
Sep 7 14:01:05 albatross kernel: wwn: 210000e08b02e5f9 scsi_id: 1 loop_id: Not Available
Sep 7 14:01:05 albatross kernel: wwn: 201200208d010d71 scsi_id: 2 loop_id: Not Available
Sep 7 14:01:05 albatross kernel: wwn: 201000208d010d71 scsi_id: 3 loop_id: 81
Sep 7 14:01:05 albatross kernel: wwn: 201300208d010d71 scsi_id: 4 loop_id: 82
Sep 7 14:01:05 albatross kernel: wwn: 201100208d010d71 scsi_id: 5 loop_id: 83
Sep 7 14:01:05 albatross kernel: wwn: 210000e08b02dcf9 scsi_id: 6 loop_id: 84
Sep 7 14:01:37 albatross kernel: qlogicfc0 : scsi abort failure: 4006
Sep 7 14:01:37 albatross kernel: qlogicfc0 : abort failed
Sep 7 14:01:37 albatross kernel: qlogicfc0 : firmware status is 4000 3
Sep 7 14:01:37 albatross kernel: qlogicfc0 : scsi abort failure: 4006
Sep 7 14:01:37 albatross kernel: qlogicfc0 : abort failed
Sep 7 14:01:37 albatross kernel: qlogicfc0 : firmware status is 4000 3
..etc..
Sep 7 14:01:38 albatross kernel: scsi: device set offline - command error recover failed: host 2 channel 0 id 2 lun 0
Sep 7 14:01:38 albatross kernel: SCSI disk error : host 2 channel 0 id 2 lun 0 return code = 6040000
Sep 7 14:01:38 albatross kernel: I/O error: dev 08:21, sector 125941000
Sep 7 14:01:38 albatross kernel: XFS: device 0x821- XFS write error in file system meta-data block 0x781b508 in sd(8,33)
Sep 7 14:01:38 albatross kernel: I/O error: dev 08:21, sector 100740168
Sep 7 14:01:38 albatross kernel: I/O error: dev 08:21, sector 100740400
Sep 7 14:01:38 albatross kernel: I/O error: dev 08:21, sector 100741088
...etc... many many lines here...
Sep 7 14:01:40 albatross kernel: I/O error: dev 08:21, sector 16864360
Sep 7 14:01:40 albatross kernel: I/O error: dev 08:21, sector 25165825
Sep 7 14:01:40 albatross kernel: I/O error: dev 08:21, sector 25165848
Sep 7 14:01:40 albatross kernel: I/O error: dev 08:21, sector 25168072
Sep 7 14:01:40 albatross kernel: I/O error: dev 08:21, sector 25173472
Sep 7 14:01:44 albatross kernel: SCSI disk error : host 2 channel 0 id 2 lun 0 return code = 6040000
Sep 7 14:01:44 albatross kernel: I/O error: dev 08:21, sector 125829376
Sep 7 14:01:44 albatross kernel: xfs_force_shutdown(sd(8,33),0x2) called from line 942 of file xfs_log.c. Return address = 0xc01d580d
Sep 7 14:01:44 albatross kernel: I/O Error Detected. Shutting down filesystem: sd(8,33)
Sep 7 14:01:45 albatross kernel: Please umount the filesystem, and rectify the problem(s)
Sep 7 14:01:45 albatross kernel: I/O error: dev 08:21, sector 42011112
Sep 7 14:01:45 albatross kernel: xfs_force_shutdown(sd(8,33),0x2) called from line 714 of file xfs_log.c. Return address = 0xc01d5527
Sep 7 14:01:45 albatross kernel: xfs_force_shutdown(sd(8,33),0x2) called from line 714 of file xfs_log.c. Return address = 0xc01d5527
Sep 7 14:01:45 albatross kernel: SCSI disk error : host 2 channel 0 id 2 lun 0 return code = 6040000
Sep 7 14:01:45 albatross kernel: I/O error: dev 08:21, sector 41969056
Sep 7 14:01:45 albatross kernel: SCSI disk error : host 2 channel 0 id 2 lun 0 return code = 6040000
Sep 7 14:01:45 albatross kernel: I/O error: dev 08:21, sector 92286648
Sep 7 14:01:45 albatross kernel: I/O error: dev 08:21, sector 92286655
Sep 7 14:01:45 albatross kernel: xfs_force_shutdown(sd(8,33),0x2) called from line 942 of file xfs_log.c. Return address = 0xc01d580d
Sep 7 14:01:45 albatross kernel: I/O error: dev 08:21, sector 67525328
Sep 7 14:01:45 albatross kernel: I/O error in filesystem ("sd(8,33)") meta-data dev 0x821 block 0x4065ad0:
Sep 7 14:01:45 albatross kernel: xfs_trans_read_buf
--- scsi_scan.c.orig Mon Jul 23 09:24:53 2001
+++ scsi_scan.c Thu Jul 26 16:29:14 2001
@@ -153,6 +153,8 @@
{"DELL", "PSEUDO DEVICE .", "*", BLIST_SPARSELUN}, // Dell PV 530F
{"DELL", "PV530F", "*", BLIST_SPARSELUN}, // Dell PV 530F
{"EMC", "SYMMETRIX", "*", BLIST_SPARSELUN},
+ {"CMD", "CRA-7280", "*", BLIST_SPARSELUN}, // CMD RAID Controller
+ {"Zzyzx", "RocketStor 2000S", "*", BLIST_SPARSELUN}, // Zzyzx RocketStor Raid
{"SONY", "TSL", "*", BLIST_FORCELUN}, // DDS3 & DDS4 autoloaders
{"DELL", "PERCRAID", "*", BLIST_FORCELUN},
{"HP", "NetRAID-4M", "*", BLIST_FORCELUN},
@@ -565,20 +567,26 @@
}
/*
- * Check the peripheral qualifier field - this tells us whether LUNS
- * are supported here or not.
+ * Check for SPARSELUN before checking the peripheral qualifier,
+ * so sparse lun devices are completely scanned.
*/
- if ((scsi_result[0] >> 5) == 3) {
- scsi_release_request(SRpnt);
- return 0; /* assume no peripheral if any sort of error */
- }
/*
* Get any flags for this device.
*/
bflags = get_device_flags (scsi_result);
-
+ if (bflags & BLIST_SPARSELUN) {
+ *sparse_lun = 1;
+ }
+ /*
+ * Check the peripheral qualifier field - this tells us whether LUNS
+ * are supported here or not.
+ */
+ if ((scsi_result[0] >> 5) == 3) {
+ scsi_release_request(SRpnt);
+ return 0; /* assume no peripheral if any sort of error */
+ }
/* The Toshiba ROM was "gender-changed" here as an inline hack.
This is now much more generic.
This is a mess: What we really want is to leave the scsi_result