XFS File system in trouble

Gim Leong Chin chingimleong at yahoo.com.sg
Mon Jul 20 08:08:15 CDT 2015


Hi Leslie,
My two cents: it appears you are using an AMD FX CPU on an ASUS Sabertooth motherboard?
I would strongly suggest using unbuffered ECC DIMMs in your system.  mcelog will warn of ECC errors in your DIMMs.  ECC will correct single-bit errors and will at least detect double-bit errors.
I have had AMD Opteron servers with registered ECC DIMMs that logged continuous correctable ECC errors while running HPC jobs for up to a month without any crashes, until I could schedule downtime to replace the DIMMs.  The errors are flagged either in the BMC (service processor) log or by mcelog.
All my PCs and workstations, at work and at home, with consumer AMD Athlon 64 and AMD Phenom II CPUs, had unbuffered ECC DIMMs on ASUS motherboards.  I never had any memory errors, and I know that if there are memory errors I will be notified.
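
As a rough sketch of what "getting notified" can look like alongside mcelog: assuming the kernel EDAC driver for your memory controller is loaded and exposes its usual per-controller counters under /sys/devices/system/edac/mc/ (an assumption about the machine, not something I have checked on your board), a few lines of Python can read the corrected and uncorrected error counts.  The function name and return shape here are made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumes the standard EDAC layout (mc0/ce_count, mc0/ue_count, ...).
    Returns an empty dict if the directory is missing, e.g. when no EDAC
    driver is loaded or the platform has no ECC support.
    """
    base = Path(root)
    counts = {}
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are unreadable or malformed.
            continue
        counts[mc.name] = (ce, ue)
    return counts
```

Calling read_edac_counts() from a cron job and mailing yourself when any count is non-zero would give roughly the same early warning I described, but treat it as a starting point only.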

Chin Gim Leong

From: Leslie Rhorer <lrhorer at mygrande.net>
To: Martin Papik <mp6058 at gmail.com>
Cc: xfs at oss.sgi.com
Sent: Monday, 20 July 2015, 16:35
Subject: Re: XFS File system in trouble

On 7/20/2015 3:05 AM, Martin Papik wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
>
> Since you've already found one HW related fault, would you consider
> booting into memtest for a couple of passes just to be on the safe
> side.

    I did that after confirming the one stick of memory was bad.  Twice.  I 
got over 20,000 errors on the bad stick, and 0 on the good one.  I also 
swapped the locations on the motherboard, and the bad stick still failed 
while the good one passed 100%.

> And did you by any chance look at SMART if applicable and
> possibly running a test on the drives.

    Yes. SMART found no errors, but think about it.  Every time tar 
untars that file in that location, the file system croaks when tar 
tries to create a directory.  Not when reading, and not when writing, 
other than when it creates a directory.  When I create the directory 
manually, the process quits failing at that point and fails later on 
during a different directory create.  The array remains intact when 
reading, and dmesg shows no drive errors.  I've re-synced the array 
several times, which reads every byte on all 8 drives, without a single 
mismatch.  To my knowledge, no read has ever failed except after the 
filesystem goes offline.  I thought reads were failing during the CRC 
checks, but that was a red herring.

> Another test I sometimes do
> when I'm unsure about disks is "cat /dev/sda > /dev/null" (i.e. a
> whole disk read test)

echo repair > /sys/block/md0/md/sync_action reads not one drive, but 
every byte on all 8 drives.
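
For anyone following along, here is a minimal sketch of reading the result of such a pass, assuming the array is md0 and the kernel exposes the usual md sysfs files sync_action and mismatch_cnt next to each other.  The helper name is hypothetical.  Note that writing "repair" rewrites any mismatched stripes, whereas writing "check" only counts them.

```python
from pathlib import Path


def md_scrub_status(md_dir="/sys/block/md0/md"):
    """Return (current sync action, mismatch count) for an md array.

    md_dir defaults to md0, which is an assumption; pass another path
    for a different array.  sync_action reads "idle" once a check or
    repair pass has finished.
    """
    base = Path(md_dir)
    action = (base / "sync_action").read_text().strip()
    mismatches = int((base / "mismatch_cnt").read_text().strip())
    return action, mismatches
```

A result of ("idle", 0) after a pass corresponds to the "no mismatches" outcome described here.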

> and see (dmesg) if any errors show up, unless

    'Nary one, and no mismatches.



> you're willing to run badblocks in a read-write nondestructive mode.
> In my experience the read test or badblocks can be run simultaneously
> with smartctl -t long. But as a start I'd look at smartctl --all
> /dev/sd? and see if there are any bad signs. I hope this helps. Good luck
>
>
> On 07/20/2015 10:41 AM, Leslie Rhorer wrote:
>> On 7/19/2015 6:27 PM, Dave Chinner wrote:
>>> On Sat, Jul 18, 2015 at 08:02:50PM -0500, Leslie Rhorer wrote:
>>>>
>>>> I found the problem with md5sum (and probably nfs, as well).
>>>> One of the memory modules in the server was bad.  The problem
>>>> with XFS persists.  Every time tar tried to create the
>>>> directory:
>>>
>>> Now you need to run xfs_repair.
>>
>> I do that every time the array implodes.  It makes no difference.
>> It never mentions cleaning the structure tar says needs cleaning,
>> and the next time I run tar on that file, the filesystem craters.
>>
>> _______________________________________________ xfs mailing list
>> xfs at oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iQIcBAEBCgAGBQJVrKuzAAoJELsEaSRwbVYrdjoP/3n1W9YtcpdiDoylp6tDYcjF
> vEVz7IWLv2cOky8Lp+0WAZ4Z0WMhcutFzT571H1Vc+jT/UgO25pQHa3yLYTboPuZ
> +tBidVUycs7ZIr9QCZFs2uPQ/7YstamB+F7paCTMKtOJJr5CZLiYX4iyJ9sFmWVY
> UFPAIhyoqD5CFgoaAkwCmk50kNiT0aPM7egizIUVEt14cWuxZxMN0NIJ5b0WJfAk
> qtNQjstVI/xYDgsImm2ZAm19SfOG9ltm2G9zafRr6lR6rRtXjtZX8zEg0l/o9XUw
> OifghjoSup8OCzvX6+4+Soj/3mCKZv4rkBm3exf4YzfQ9eVG6Ktele2rLIs1sl3O
> hUrZUNEl8hYGJeb5gBHFV/TLWDMMwNde/6JiBVy0V8EbDF1lvR4jYpUwThOE0jyL
> ZbzZe4N/B0qvB1OpLDkHrMVm9NPtDkfXdTtM2kRmo5955xtkK09yHF/v64kz7IKc
> 2rM5pOwTR6HWE8RF2j9UujgPjw6nEUuY01TvIMGYzMfkJTI+sVjeDQfwnPG8tzIa
> x4uLa4vTrBD5IaICjAmQiY69qqmt5Vg42G4latZVTYQLelvWQ774mXZfgfT/GtbT
> RKzVwvYowWr/EBhtp7ix/1rWANTFiX0lxOPnRmUFvu8UJnyZhR0/EYbJYy1+jTt7
> O7hZMfAayQBsnVcSK1JC
> =3Ubd
> -----END PGP SIGNATURE-----
>
