http://oss.sgi.com/bugzilla/show_bug.cgi?id=411
Summary: Reproducible memory corruption, oops, panic with xfs on
sata raid5 in 2.6
Product: Linux XFS
Version: unspecified
Platform: PC
URL: http://chardonnay.math.bme.hu/~korn/xfs-oops/
OS/Version: Linux
Status: NEW
Severity: major
Priority: P2
Component: XFS kernel code
AssignedTo: xfs-master@xxxxxxxxxxx
ReportedBy: korn-sgi.com@xxxxxxxxxxxxxxxxxxxxxx
Hi,
Setup:
I have 4 SATA disks in a software raid5 setup, running a 2.6.11.7 kernel (also
tested with 2.6.12-rc4) on a UP box. The raid5 array is /dev/md2. I created an
LVM PV from
/dev/md2 and then proceeded to create a volume group called raid5 from that
single PV. I created some LVs and put xfs on them with 'mkfs.xfs -d unwritten=0
-i size=512 -l version=2'.
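For anyone trying to reproduce this, the setup above can be sketched roughly as
the following dry run, which only prints the commands rather than executing
them. The member partitions and the LV name and size are placeholders (the
report does not state them); the array device, the volume group name and the
mkfs.xfs options are taken from the description above.

```shell
#!/bin/sh
# Dry-run sketch of the storage stack described in the report.
# It only echoes the commands; nothing is created or formatted.

# Assumptions, NOT from the report: member partitions, LV name, LV size.
MEMBERS="/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1"
LV_NAME="test"
LV_SIZE="10G"

print_setup() {
    # 4-disk software raid5 array, named /dev/md2 as in the report
    echo "mdadm --create /dev/md2 --level=5 --raid-devices=4 $MEMBERS"
    # LVM physical volume on the array, then a volume group called raid5
    echo "pvcreate /dev/md2"
    echo "vgcreate raid5 /dev/md2"
    # One logical volume (name and size are placeholders)
    echo "lvcreate -L $LV_SIZE -n $LV_NAME raid5"
    # XFS with the options quoted in the report
    echo "mkfs.xfs -d unwritten=0 -i size=512 -l version=2 /dev/raid5/$LV_NAME"
}

print_setup
```

Running the printed commands for real requires root and destroys any data on
the listed partitions, so the device names need to be checked first.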
Actions:
I wanted to migrate to this raid5 array from a different volume group that had
no redundancy. I issued an rsync -avHP source dest command, and after
transferring a few hundred megabytes, the box oopsed, panicked and rebooted. I
could reproduce this behaviour as often as I wanted.
Experiments:
I tried JFS instead of XFS and things appeared to work, which leads me to
believe that XFS may either be the culprit or (more likely) somehow trigger a
kernel bug that JFS doesn't.
I tried putting xfs on /dev/md2 without LVM, and still I hit the oops, so it
doesn't seem to be related to LVM.
I could reproduce the problem both with 2.6.11.7 and 2.6.12-rc4. Unfortunately,
the oops is not committed to disk and only partially written to netconsole; it
is also too long for the display of my vt320 to hold. (Alas, I can't plug the
serial cable into anything with a scrollback buffer.)
Some of the oopses appeared to be related to kernel preemption, so I compiled a
kernel without PREEMPT. The problem persisted.
A friend tried to rsync a large directory to an xfs residing on a raid5 array
composed of four disks and was not able to reproduce the crash, so it must be
related to something specific to my configuration.
I occasionally see "APIC error on CPU0: 40(40)" and "APIC error on CPU0: 00(40)"
in my kernel log, especially during times of heavy disk activity. I tried
booting with "noapic acpi=off", and the messages disappeared, but not the oops.
I recall reading that these messages could be related to an undersized power
supply; however, my box runs off two units rated 400W, and neither of them draws
more than 120W from mains.
Unfortunately, I currently have no way to verify whether the problem occurs with
raid5 on IDE or SCSI.
Logs:
I'll upload the oopses I was able to capture to the URL above (along with output
from lspci, mdadm etc.). What is interesting is that the text shown in 'process name'
is often a recognizable fragment of one of the files I was rsyncing. How does it
get there?
I have now deployed JFS, but still have some space left in the volume group to
experiment with XFS if necessary.
Best regards,
Andras
--
Configure bugmail: http://oss.sgi.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.