To: xfs-master@xxxxxxxxxxx
Subject: [xfs-masters] [Bug 411] New: Reproducible memory corruption, oops, panic with xfs on sata raid5 in 2.6
From: bugzilla-daemon@xxxxxxxxxxx
Date: Wed, 11 May 2005 12:42:51 -0700
Reply-to: xfs-masters@xxxxxxxxxxx
Sender: xfs-masters-bounce@xxxxxxxxxxx
http://oss.sgi.com/bugzilla/show_bug.cgi?id=411

           Summary: Reproducible memory corruption, oops, panic with xfs on
                    sata raid5 in 2.6
           Product: Linux XFS
           Version: unspecified
          Platform: PC
               URL: http://chardonnay.math.bme.hu/~korn/xfs-oops/
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: XFS kernel code
        AssignedTo: xfs-master@xxxxxxxxxxx
        ReportedBy: korn-sgi.com@xxxxxxxxxxxxxxxxxxxxxx


Hi,

Setup:

I have 4 SATA disks in a software raid5 setup on a uniprocessor (UP) box running
kernel 2.6.11.7 (I also tried 2.6.12-rc4). The raid5 array is /dev/md2. I created
an LVM PV from
/dev/md2 and then proceeded to create a volume group called raid5 from that
single PV. I created some LVs and put xfs on them with 'mkfs.xfs -d unwritten=0
-i size=512 -l version=2'.
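
The storage stack described above could be rebuilt roughly as follows. This is
only a sketch: the report does not name the member partitions, the LV name or
the LV size, so /dev/sd[a-d]1, "test" and 10G are placeholders, and the mdadm
and LVM invocations are assumed rather than quoted from the reporter. Only the
mkfs.xfs options are taken verbatim. The script prints the commands by default
and runs them only when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# DRY_RUN=0 really runs them and DESTROYS data on the listed disks.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical values; the report does not state them.
DISKS="/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1"
LV_NAME="test"
LV_SIZE="10G"

# Print a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# md raid5 -> LVM PV -> VG "raid5" -> LV -> XFS, stopping at the first failure.
build_stack() {
    run mdadm --create /dev/md2 --level=5 --raid-devices=4 $DISKS &&
    run pvcreate /dev/md2 &&
    run vgcreate raid5 /dev/md2 &&
    run lvcreate -L "$LV_SIZE" -n "$LV_NAME" raid5 &&
    run mkfs.xfs -d unwritten=0 -i size=512 -l version=2 "/dev/raid5/$LV_NAME"
}

build_stack
```

To test the variant without LVM mentioned under Experiments, the pvcreate,
vgcreate and lvcreate steps would be dropped and mkfs.xfs pointed at /dev/md2
directly.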

Actions:

I wanted to migrate to this raid5 array from a different volume group that had
no redundancy. I issued an rsync -avHP source dest command, and after
transferring a few hundred megabytes, the box oopsed, panicked and rebooted. I
could reproduce this behaviour as often as I wanted.

Experiments:

I tried JFS instead of XFS and things appeared to work, which leads me to
believe that XFS may either be the culprit or (more likely) somehow trigger a
kernel bug that JFS doesn't.

I tried putting xfs on /dev/md2 without LVM, and still I hit the oops, so it
doesn't seem to be related to LVM.

I could reproduce the problem both with 2.6.11.7 and 2.6.12-rc4. Unfortunately,
the oops is not committed to disk and only partially written to netconsole; it
is also too long for the display of my vt320 to hold. (Alas, I can't plug the
serial cable into anything with a scrollback buffer.)

Some of the oopses appeared to be related to kernel preemption, so I compiled a
kernel without PREEMPT. The problem persisted.

A friend tried to rsync a large directory to an xfs residing on a raid5 array
composed of four disks and was not able to reproduce the crash, so it is
presumably related to something specific to my configuration.

I occasionally see "APIC error on CPU0: 40(40)" and "APIC error on CPU0: 00(40)"
in my kernel log, especially during times of heavy disk activity. I tried
booting with "noapic acpi=off", and the messages disappeared, but not the oops.
I recall reading that these messages could be related to an undersized power
supply; however, my box runs off two power supplies rated at 400W each, and
neither of them draws more than 120W from mains.

Unfortunately, I currently have no way to verify whether the problem occurs with
raid5 on IDE or SCSI.

Logs:

I'll upload the oopses I was able to capture to the URL above (along with output
from lspci, mdadm etc.). Interestingly, the text shown in 'process name'
is often a recognizable fragment of one of the files I was rsyncing. How does it
get there?

I have now deployed JFS, but still have some space left in the volume group to
experiment with XFS if necessary.

Best regards,

Andras

-- 
Configure bugmail: http://oss.sgi.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

