In message "Re: data corruption on nfs+xfs"
(04/06/11 09:38:53),
cattelan@xxxxxxx wrote...
>This looks really promising.
>I'm currently reading through the code again to
>see what kind of implications this might have.
>I'm worried that your patch might increase file fragmentation,
>but that is just at first glance. I'll look some more and run
>some testing with and without your patch.
Thank you.
>I'm looking at xfs_inactive_free_eofblocks again; I think
>there may be an issue with the xfs_inode di_size and the Linux
>inode i_size.
I think so too. I believe the following issue can occur
(I have not reproduced it myself).
This problem is separate from the previous one; it stems from the
asynchronous updating of the Linux inode i_size and the xfs_inode
di_size. The steps below are in time order.
1. write 8KB
The test program writes 8KB of data to the file.
First, the 1st 4KB of data is processed [do_generic_file_write].
file image
offset=0
+----+
| |
+----+
inode i_size : -----> (4KB)
xfs_inode di_size : (0)
inode i_size is 4KB, but xfs_inode di_size is still zero,
because a_ops->commit_write updates only i_size.
Next, the 2nd 4KB of data is processed [do_generic_file_write].
file image
offset=0
+----+----+
| | |
+----+----+
inode i_size : ----------> (8KB)
xfs_inode di_size : (0)
inode i_size is 8KB, but xfs_inode di_size remains zero.
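(For reference, the 2.4 commit path looks roughly like this; a
simplified sketch based on the generic code, not the exact XFS
source. Only the Linux inode's i_size is advanced here.)

    /* simplified sketch of a 2.4-era a_ops->commit_write */
    static int commit_write_sketch(struct file *file, struct page *page,
                                   unsigned from, unsigned to)
    {
        struct inode *inode = page->mapping->host;
        loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;

        __block_commit_write(inode, page, from, to);
        if (pos > inode->i_size) {
            inode->i_size = pos;      /* Linux inode size grows */
            mark_inode_dirty(inode);
        }
        /* xfs_inode di_size is NOT updated here; that happens
         * only later, at the end of xfs_write. */
        return 0;
    }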
2. revalidate
A revalidate runs (e.g. triggered by ls) [linvfs_revalidate].
At this time, inode i_size is set back to the same value as
xfs_inode di_size [vn_revalidate].
As a result, i_size is zero!
file image
offset=0
+----+----+
| | |
+----+----+
inode i_size : (0)
xfs_inode di_size : (0)
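(The clobbering happens roughly like this inside vn_revalidate; a
sketch with the attribute names assumed from the 1.3.x source:)

    vattr_t va;

    va.va_mask = XFS_AT_STAT;
    VOP_GETATTR(vp, &va, 0, NULL, error); /* va.va_size <- di_size (0) */
    if (!error)
        inode->i_size = va.va_size;       /* i_size: 8KB -> 0 ! */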
3. flush data
Flushing is triggered by memory pressure [balance_dirty].
It tries to flush the buffer (the 1st page) of the 8KB write,
but the result is EIO, because the 1st page's index is beyond
inode i_size [xfs_page_state_convert].
On 2.4.26 the buffer is then discarded.
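(The failing check is roughly the following; simplified from the
size test in the flush path:)

    /* pages wholly beyond i_size are rejected */
    loff_t offset = (loff_t)page->index << PAGE_CACHE_SHIFT;

    if (offset >= inode->i_size)  /* i_size was rolled back to 0, */
        return -EIO;              /* so even the 1st page fails   */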
4. write 8KB (continued)
inode i_size and xfs_inode di_size are updated to 8KB at the end
of write processing [xfs_write].
But the written data has been lost.
file image
offset=0
+----+----+
| | |
+----+----+
inode i_size : ----------> (8KB)
xfs_inode di_size : ----------> (8KB)
+----+
no data
I think this issue occurs only rarely, and one possible solution
is to update inode i_size and xfs_inode di_size simultaneously in
a_ops->commit_write.
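A minimal sketch of that idea (untested; the accessor from the Linux
inode to the xfs_inode and the required inode locking are assumed):

    /* in a_ops->commit_write, after the data has been copied: */
    if (pos > inode->i_size) {
        xfs_inode_t *ip = linux_inode_to_xfs_inode(inode); /* assumed */

        inode->i_size = pos;
        ip->i_d.di_size = pos;  /* keep di_size in lockstep so that a
                                 * concurrent vn_revalidate can never
                                 * roll i_size back */
        mark_inode_dirty(inode);
    }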
>BTW what tracing did you use to find this?
There was some trial and error, but in the end I investigated it
with the tracing shown in the attached patch.
Regards,
Tsuda
>On Wed, 2004-06-09 at 20:30, Masanori TSUDA wrote:
>> Hi,
>>
>> I have reproduced a similar problem on xfs1.3.1 (based on 2.4.21),
>> my environment is as follows.
>>
>> nfs server :
>> OS : RedHat9 + xfs1.3.1 (based on 2.4.21)
>> CPU : Xeon(2.4GHz) x 2
>> MEM : 1GB
>> NIC : Intel PRO/1000
>> Local Filesystem : XFS, the refcache is disabled.
>>
>> nfs client :
>> OS : RedHat9 (based on 2.4.20-8)
>> NIC : Intel PRO/1000
>> NFS Ver. : 3
>> NFS Mount Options : udp,hard,intr,wsize=8192
>>
>> Within 1 hour of running the test, the corruption was detected.
>> (To make the corruption easy to detect, we umount nfs, umount xfs,
>> then mount xfs and mount nfs before comparing data, i.e. purge the memory cache.)
>> The corruption width was a multiple of 4KB, starting at a 4KB boundary.
>> In many cases it occurred at the start of a physical extent.
>>
>> I have investigated the issue using local tracing embedded in the kernel.
>> I think the issue was caused by the delayed allocation mechanism.
>> Below is an example corruption scenario that I suspect.
>> Each step of the scenario is in time order.
>>
>> 1. open and write in nfsd (for write1)
>> The nfs client writes 8KB of data to a file (called write1).
>> The write request is processed in nfsd. nfsd calls open [linvfs_open]
>> and then write [linvfs_write]. After the write, the file has several
>> delayed allocation blocks beyond the end of the file, due to allocation
>> in chunks and writeiosize alignment.
>>
>> file image
>> offset=0 eof
>> +----+----+----+----+----+- ... +----+
>> | | | | | | | |
>> +----+----+----+----+----+- ... +----+
>> 4KB 4KB
>> +---------+
>> write data (write1)
>> +------------------------------------+
>> delayed allocation blocks
>>
>> 2. allocate disk space in kupdated (for write1)
>> Disk space is allocated for the delayed allocation blocks before the
>> data is flushed to disk [linvfs_writepage, page_state_convert].
>>
>> file image
>> offset=0 eof
>> +----+----+----+----+----+- ... +----+
>> | | | | | | | |
>> +----+----+----+----+----+- ... +----+
>> 4KB 4KB
>> +---------+
>> write data (write1)
>> +------------------------------------+
>> allocated disk space
>> +---------+
>> called disk space1
>> +--------------------------+
>> called disk space2
>>
>> 3. close in nfsd (for write1)
>> nfsd calls close [linvfs_release]. At this time, the allocated disk
>> space beyond the end of the file (disk space2) is truncated, because
>> the refcache is disabled [xfs_inactive_free_eofblocks].
>>
>> file image
>> offset=0 eof
>> +----+----+
>> | | |
>> +----+----+
>> 4KB 4KB
>> +---------+
>> write data (write1)
>> +---------+
>> disk space1
>>
>> 4. open and write in nfsd (for write2)
>> The nfs client then writes a further 8KB of data to the file (called
>> write2). nfsd calls open [linvfs_open] and then write [linvfs_write].
>>
>> file image
>> offset=0 eof
>> +----+----+----+----+----+- ... +----+
>> | | | | | | | |
>> +----+----+----+----+----+- ... +----+
>> 4KB 4KB 4KB 4KB
>> +---------+
>> write data (write1)
>> +---------+
>> write data (write2)
>> +--------------------------+
>> delayed allocation blocks
>> +---------+
>> disk space1
>>
>> 5. flush data to disk in kupdated (for write1)
>> The write data (write1) is flushed to disk space1 [page_state_convert].
>> And the write data (write2) is flushed to disk space2 [cluster_write] !!!,
>> because the buffers of the write data (write2) are marked dirty and
>> delayed-allocation. But disk space2 does not exist at this time;
>> it may now be used by another file, or be free space.
>>
>> I think one possible solution for the issue is to flush only buffers
>> within the end of the file before allocating disk space for delayed
>> allocation blocks, and not to flush buffers beyond that.
>> I made a patch for xfs1.3.1. I am running the test on a kernel with the
>> patch applied; it has been running for over 16 hours with no corruption.
>>
>> Could you please comment on the attached patch?
>>
>> Regards,
>> Tsuda
>>
>> In message "data corruption on nfs+xfs"
>> (04/05/27 15:58:48),
>> kazuyuki@xxxxxxxxxxxxxxxxxxx wrote...
>> >We are experiencing the same problem as No.198.
>> > http://oss.sgi.com/bugzilla/show_bug.cgi?id=198
>> > http://marc.theaimsgroup.com/?t=108343605300001&r=1&w=2
>> >
>> >We have confirmed that even when the refcache is disabled, by setting
>> >fs.xfs.refcache_size to zero through sysctl, the problem does not disappear.
>> >Running Linux in single-CPU mode makes the problem slightly harder to
>> >trigger, but it still occurs.
>> >
>> >Two types of corruption we've seen:
>> >
>> > 1) Width is a multiple of 8kB, starting at an 8kB boundary.
>> > *Maybe the same trouble as No.198.
>> >
>> > 2) Width is 964 bytes, ending at a 4kB boundary.
>> > *I'm not sure the cause is the same as 1) above.
>> >
>> >We have tested on 2.4.20-20.9.XFS1.3.1, 2.4.20-30.9.sgi1 XFS1.3.3, and
>> >other kernels based on 2.4.20-20 on which we made some changes.
>> >
>> >Does anyone know where the cause lies: in the page cache, in disk
>> >block handling, or elsewhere?
>> >Or does anyone know how to avoid this with some setting or another version?
>> >
trace.patch
Description: Binary data