
Re: PROBLEM: XFS on ARM corruption 'Structure needs cleaning'

To: Eric Sandeen <sandeen@xxxxxxxxxxx>, Brian Foster <bfoster@xxxxxxxxxx>
Subject: Re: PROBLEM: XFS on ARM corruption 'Structure needs cleaning'
From: Török Edwin <edwin@xxxxxxxxxxxx>
Date: Thu, 11 Jun 2015 19:32:04 +0300
Cc: Christopher Squires <christopher.squires@xxxxxxxx>, Wayne Burri <wayne.burri@xxxxxxxx>, Luca Gibelli <luca@xxxxxxxxxxxx>, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <5579B034.4070503@xxxxxxxxxxx>
Organization: Skylable Ltd.
References: <5579296A.8010208@xxxxxxxxxxxx> <20150611151620.GB59168@xxxxxxxxxxxxxxx> <5579A904.3020204@xxxxxxxxxxxx> <5579AE85.5080203@xxxxxxxxxxx> <5579B034.4070503@xxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.7.0
On 06/11/2015 06:58 PM, Eric Sandeen wrote:
> On 6/11/15 10:51 AM, Eric Sandeen wrote:
>> On 6/11/15 10:28 AM, Török Edwin wrote:
>>> On 06/11/2015 06:16 PM, Brian Foster wrote:
>>>> On Thu, Jun 11, 2015 at 09:23:38AM +0300, Török Edwin wrote:
>>>>> [1.] XFS on ARM corruption 'Structure needs cleaning'
>>>>> [2.] Full description of the problem/report:
>>>>>
>>>>> I have been running XFS successfully on x86-64 for years; however, I'm
>>>>> having trouble running it on ARM.
>>>>>
>>>>> Running the testcase below [7.] reliably reproduces the filesystem
>>>>> corruption starting from a freshly created XFS filesystem: running ls
>>>>> after 'sxadm node --new --batch /export/dfs/a/b' shows a 'Structure
>>>>> needs cleaning' error, and dmesg shows a corruption error [6.].
>>>>> xfs_repair 3.1.9 is not able to repair the corruption: after mounting
>>>>> the repaired filesystem I still get the 'Structure needs cleaning'
>>>>> error.
>>>>>
>>>>> Note: using /export/dfs/a/b is important for reproducing the problem:
>>>>> if I only use one level of directories in /export/dfs then the problem
>>>>> doesn't reproduce. Also, if I use a tuned version of sxadm that creates
>>>>> fewer database files, the problem doesn't reproduce either.
>>>>>
>>>>> [3.] Keywords: filesystems, XFS corruption, ARM
>>>>> [4.] Kernel information
>>>>> [4.1.] Kernel version (from /proc/version):
>>>>> Linux hornet34 3.14.3-00088-g7651c68 #24 Thu Apr 9 16:13:46 MDT 2015 armv7l GNU/Linux
>>>>>
>>>> ...
>>>>> [5.] Most recent kernel version which did not have the bug: unknown;
>>>>> this is the first kernel I've tried on ARM.
>>>>>
>>>>> [6.] dmesg stacktrace
>>>>>
>>>>> [4627578.440000] XFS (sda4): Mounting Filesystem
>>>>> [4627578.510000] XFS (sda4): Ending clean mount
>>>>> [4627578.440000] dd6ee000: 58 46 53 42 00 00 10 00 00 00 00 00 37 40 21 00  XFSB........7@!.
>>>>> [4627621.480000] dd6ee010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>>>> [4627621.490000] dd6ee020: 5b 08 7f 79 0e 3a 46 3d 9b ea 26 ad 9d 62 17 8d  [..y.:F=..&..b..
>>>>> [4627621.490000] dd6ee030: 00 00 00 00 20 00 00 04 00 00 00 00 00 00 00 80  .... ...........
>>>>
>>>> Just a data point... the magic number here looks like a superblock magic
>>>> (XFSB) rather than one of the directory magic numbers. I'm wondering if
>>>> a buffer disk address has gone bad somehow or another.
>>>>
>>>> Does this happen to be a large block device? I don't see any partition
>>>> or xfs_info data below. If so, it would be interesting to see if this
>>>> reproduces on a smaller device. It does appear that the large block
>>>> device option is enabled in the kernel config above, however, so maybe
>>>> that's unrelated.
>>>
>>> This is mkfs.xfs /dev/sda4:
>>> meta-data=/dev/sda4              isize=256    agcount=4, agsize=231737408 blks
>>>          =                       sectsz=512   attr=2, projid32bit=0
>>> data     =                       bsize=4096   blocks=926949632, imaxpct=5
>>>          =                       sunit=0      swidth=0 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0
>>> log      =internal log           bsize=4096   blocks=452612, version=2
>>>          =                       sectsz=512   sunit=0 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>
>>> But it also reproduces with this small loopback file:
>>> meta-data=/tmp/xfs.test          isize=256    agcount=2, agsize=5120 blks
>>>          =                       sectsz=512   attr=2, projid32bit=0
>>> data     =                       bsize=4096   blocks=10240, imaxpct=25
>>>          =                       sunit=0      swidth=0 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0
>>> log      =internal log           bsize=4096   blocks=1200, version=2
>>>          =                       sectsz=512   sunit=0 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>
>> ok so not a block number overflow issue, thanks.
>>
>>> You can have a look at xfs.test here: 
>>> http://vol-public.s3.indian.skylable.com:8008/armel/testcase/xfs.test.gz
>>>
>>> If I loopback mount that on an x86-64 box it doesn't show the corruption 
>>> message though ...
>>
>> FWIW, this is the 2nd report we've had of something similar, both on Armv7, 
>> both ok on x86_64.
>>
>> I'll take a look at your xfs.test; that's presumably copied after it 
>> reported the error, and you unmounted it before uploading, correct?  And it 
>> was mkfs'd on armv7, never mounted or manipulated in any way on x86_64?

Thanks, yes it was mkfs.xfs on ARMv7 and unmounted.

> 
> Oh, and what were the kernel messages when you produced the corruption with 
> xfs.txt?

It takes only a couple of minutes to reproduce the issue, so I've prepared a
fresh xfs2.test and the corresponding kernel messages to make sure it's all
consistent.
Freshly created XFS by mkfs.xfs: 
http://vol-public.s3.indian.skylable.com:8008/armel/testcase/xfs2.test.orig.gz
The corrupted XFS: 
http://vol-public.s3.indian.skylable.com:8008/armel/testcase/xfs2.test.corrupted.gz

All commands below were run on armv7. After unmounting, the files were copied
from /tmp to an x86-64 box, gzipped, and uploaded; they were never mounted on
x86-64:

# dd if=/dev/zero of=/tmp/xfs2.test bs=1M count=40
40+0 records in
40+0 records out
41943040 bytes (42 MB) copied, 0.419997 s, 99.9 MB/s
# mkfs.xfs /tmp/xfs2.test
meta-data=/tmp/xfs2.test         isize=256    agcount=2, agsize=5120 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=10240, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=1200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# cp /tmp/xfs2.test /tmp/xfs2.test.orig
# umount /export/dfs
# mount -o loop -t xfs /tmp/xfs2.test /export/dfs
# mkdir /export/dfs/a
# sxadm node --new --batch /export/dfs/a/b
# ls /export/dfs/a/b
ls: reading directory /export/dfs/a/b: Structure needs cleaning
# umount /export/dfs
# cp /tmp/xfs2.test /tmp/xfs2.test.corrupted
# dmesg >/tmp/dmesg
# exit
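Incidentally, the 'Structure needs cleaning' message ls prints and the
'error 117' in dmesg below are the same thing: 117 is EUCLEAN on Linux, the
errno XFS returns when a metadata verifier rejects a buffer. A quick way to
confirm the mapping:

```python
# EUCLEAN (117 on Linux) is what strerror renders as "Structure needs
# cleaning" -- the errno behind both the ls failure and the dmesg line.
import errno
import os

print(errno.EUCLEAN)                  # 117 on Linux
print(os.strerror(errno.EUCLEAN))     # Structure needs cleaning
```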

The latest corruption message from dmesg:
[4744604.870000] XFS (loop0): Mounting Filesystem
[4744604.900000] XFS (loop0): Ending clean mount
[4745016.610000] dc61e000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 28 00  XFSB..........(.
[4745016.620000] dc61e010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[4745016.630000] dc61e020: 64 23 d2 06 32 2e 4c 20 82 6e f0 36 a7 d9 54 f9  d#..2.L .n.6..T.
[4745016.640000] dc61e030: 00 00 00 00 00 00 20 04 00 00 00 00 00 00 00 80  ...... .........
[4745016.640000] XFS (loop0): Internal error xfs_dir3_data_read_verify at line 274 of file fs/xfs/xfs_dir2_data.c.  Caller 0xc01c1528
[4745016.650000] CPU: 0 PID: 37 Comm: kworker/0:1H Not tainted 3.14.3-00088-g7651c68 #24
[4745016.650000] Workqueue: xfslogd xfs_buf_iodone_work
[4745016.650000] [<c0013948>] (unwind_backtrace) from [<c0011058>] (show_stack+0x10/0x14)
[4745016.650000] [<c0011058>] (show_stack) from [<c01c3dc4>] (xfs_corruption_error+0x54/0x70)
[4745016.650000] [<c01c3dc4>] (xfs_corruption_error) from [<c01f7854>] (xfs_dir3_data_read_verify+0x60/0xd0)
[4745016.650000] [<c01f7854>] (xfs_dir3_data_read_verify) from [<c01c1528>] (xfs_buf_iodone_work+0x7c/0x94)
[4745016.650000] [<c01c1528>] (xfs_buf_iodone_work) from [<c00309f0>] (process_one_work+0xf4/0x32c)
[4745016.650000] [<c00309f0>] (process_one_work) from [<c0030fb4>] (worker_thread+0x10c/0x388)
[4745016.650000] [<c0030fb4>] (worker_thread) from [<c0035e10>] (kthread+0xbc/0xd8)
[4745016.650000] [<c0035e10>] (kthread) from [<c000e8f8>] (ret_from_fork+0x14/0x3c)
[4745016.650000] XFS (loop0): Corruption detected. Unmount and run xfs_repair
[4745016.650000] XFS (loop0): metadata I/O error: block 0xa000 ("xfs_trans_read_buf_map") error 117 numblks 8
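One more data point on Brian's magic-number observation: daddr 0xa000 in the
failing read is in 512-byte sectors, so it's byte offset 0xa000 * 512 = 20 MiB,
which with agsize=5120 blocks * 4096 bytes is exactly the start of AG 1 -- so
the buffer the directory verifier rejected holds AG 1's superblock copy, which
would explain the XFSB magic in the hexdump. A minimal sketch for inspecting
the image (the local filename is an assumption; I used my unpacked copy of
xfs2.test.corrupted):

```python
# Sketch: inspect the magic bytes at the block the verifier complained about.
# The daddr from "metadata I/O error: block 0xa000" is in 512-byte sectors.

def read_magic(image_path, daddr, nbytes=4):
    """Return the first nbytes stored at basic block daddr of an XFS image."""
    with open(image_path, "rb") as f:
        f.seek(daddr * 512)  # XFS daddr units are 512-byte sectors
        return f.read(nbytes)

# On this filesystem (filename assumed):
#   read_magic("xfs2.test.corrupted", 0)       -> b"XFSB" (primary superblock)
#   read_magic("xfs2.test.corrupted", 0xa000)  -> b"XFSB" per the hexdump,
#                                                 not a directory data magic
```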

Best regards,
--Edwin
