
Re: data corruption on nfs+xfs

To: Russell Cattelan <cattelan@xxxxxxx>
Subject: Re: data corruption on nfs+xfs
From: Masanori TSUDA <tsuda@xxxxxxxxxxxxxx>
Date: Fri, 11 Jun 2004 17:26:46 +0900
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <1086914331.1160.63.camel@xxxxxxxxxxxxxxxxxxxxx>
References: <1086914331.1160.63.camel@xxxxxxxxxxxxxxxxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
In message "Re: data corruption on nfs+xfs" 
(04/06/11 09:38:53),
cattelan@xxxxxxx wrote...
>This looks really promising.
>I'm currently reading through the code again to
>see what kind of implications this might have.
>I'm worried that your patch might increase file fragmentation,
>but that is just at first glance. I'll look some more and run
>some testing with and without your patch.

Thank you.

>I'm looking at xfs_inactive_free_eofblocks again, I think
>there may be an issue with the xfs_inode di_size and the linux
>inode i_size.

I also think so. I think the following issue may occur
(I have not reproduced it).
This problem is separate from the previous one; it comes from
inode i_size and xfs_inode di_size being updated asynchronously.
The steps below are in chronological order; a small sketch of the
size check involved follows the four steps.

  1. write 8KB
     The TP writes 8KB of data to the file.
     First, the 1st 4KB of data is processed [do_generic_file_write].

                         file image
                         offset=0 
                             +----+
                             |    |
                             +----+
        inode i_size      :  -----> (4KB)
        xfs_inode di_size : (0)

     inode i_size is now 4KB, but xfs_inode di_size is still zero,
     because a_op->commit_write updates only i_size.

     Next, the 2nd 4KB of data is processed [do_generic_file_write].

                         file image
                         offset=0 
                             +----+----+
                             |    |    |
                             +----+----+
        inode i_size      :  ----------> (8KB)
        xfs_inode di_size : (0)

     inode i_size is now 8KB, but xfs_inode di_size remains zero.

  2. revalidate
     A revalidate runs (e.g. triggered by ls) [linvfs_revalidate].
     At this time, inode i_size is set back to the same value as
     xfs_inode di_size [vn_revalidate].
     As a result, i_size becomes zero!

                         file image
                         offset=0 
                             +----+----+
                             |    |    |
                             +----+----+
        inode i_size      : (0)
        xfs_inode di_size : (0)

  3. flush data
     A flush runs because of memory pressure [balance_dirty].
     It tries to flush the buffer (1st page) of the 8KB write,
     but the result is EIO, because the 1st page index is beyond
     inode i_size [xfs_page_state_convert].
     On 2.4.26 the buffer is then discarded.

  4. write 8KB (continued)
     inode i_size and xfs_inode di_size are updated to 8KB at the end
     of the write processing [xfs_write].
     But the write data has been lost.

                         file image
                         offset=0 
                             +----+----+
                             |    |    |
                             +----+----+
        inode i_size      :  ----------> (8KB)
        xfs_inode di_size :  ----------> (8KB)
                             +----+
                              no data
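
For reference, here is a minimal sketch of the size check behind step 3
(a hypothetical helper, not the actual xfs_page_state_convert code): once
vn_revalidate has pulled i_size back down to di_size (still zero), every
dirty page of the 8KB write appears to sit beyond EOF, the flush fails
with EIO, and 2.4.26 then discards the buffer.

#include <linux/fs.h>
#include <linux/pagemap.h>

/* Hypothetical helper, for illustration only. */
static int page_within_isize(struct inode *inode, unsigned long index)
{
        unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
        unsigned long offset    = inode->i_size & ~PAGE_CACHE_MASK;

        if (index > end_index || (index == end_index && offset == 0))
                return -EIO;    /* page starts at or past EOF: refuse the flush */
        return 0;               /* page holds data inside i_size */
}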

I think the issue occurs only rarely, and one possible solution
is to update inode i_size and xfs_inode di_size simultaneously
in a_op->commit_write; a rough sketch of that idea follows below.
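
The sketch below is only to illustrate the idea (hypothetical code, not
the real linvfs commit_write path; locking and transaction logging are
omitted, and inode_to_xfs_inode() is an assumed accessor): after the
generic commit has grown i_size, the new size is mirrored into the XFS
in-core di_size, so a later vn_revalidate can no longer shrink i_size
back to a stale value.

#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical commit_write variant, for illustration only.
 * xfs_inode_t and i_d.di_size come from the XFS headers. */
static int commit_write_sync_disize(struct file *file, struct page *page,
                                    unsigned from, unsigned to)
{
        struct inode *inode = page->mapping->host;
        xfs_inode_t *ip = inode_to_xfs_inode(inode);    /* assumed accessor */
        int error;

        error = generic_commit_write(file, page, from, to); /* updates i_size */
        if (error)
                return error;

        if (inode->i_size > ip->i_d.di_size)
                ip->i_d.di_size = inode->i_size;        /* keep di_size in step */

        return 0;
}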

>BTW what tracing did you use to find this?

After some trial and error, I finally investigated it with the tracing
in the attached patch.

Regards,
Tsuda

>On Wed, 2004-06-09 at 20:30, Masanori TSUDA wrote:
>> Hi,
>> 
>> I have reproduced a similar problem on xfs1.3.1 (based on 2.4.21);
>> my environment is as follows.
>> 
>>  nfs server :
>>    OS  : RedHat9 + xfs1.3.1 (based on 2.4.21) 
>>    CPU : Xeon(2.4GHz) x 2
>>    MEM : 1GB
>>    NIC : Intel PRO/1000
>>    Local Filesystem : XFS, the refcache is disabled.
>> 
>>  nfs client :
>>    OS  : RedHat9 (based on 2.4.20-8)
>>    NIC : Intel PRO/1000
>>    NFS Ver. : 3
>>    NFS Mount Options : udp,hard,intr,wsize=8192
>> 
>> Within 1 hour of running the test, corruption was detected.
>> (To make the corruption easy to detect, I umount nfs, umount xfs,
>>  mount xfs and mount nfs before comparing data, i.e. purge the memory cache.)
>> The corruption width was a multiple of 4KB, starting at a 4KB boundary.
>> In many cases it occurred at the start of a physical extent.
>> 
>> I have investigated the issue using a local trace embedded in the kernel.
>> I think the issue is caused by the delayed allocation mechanism.
>> Below is an example of the corruption scenario that I suspect.
>> The steps of the scenario are in chronological order.
>> 
>>  1. open and write in nfsd (for write1)
>>     The nfs client writes 8KB of data to the file (called write1).
>>     The write request is processed in nfsd. The nfsd calls open [linvfs_open]
>>     and write [linvfs_write]. After the write, the file has several delayed
>>     allocation blocks beyond the end of the file, due to allocation in
>>     chunks and alignment to writeiosize.
>> 
>>      file image
>>      offset=0     eof
>>          +----+----+----+----+----+- ... +----+
>>          |    |    |    |    |    |      |    |
>>          +----+----+----+----+----+- ... +----+
>>           4KB  4KB
>>          +---------+
>>           write data (write1)
>>          +------------------------------------+
>>           delayed allocation blocks
>> 
>>  2. allocate disk space in kupdated (for write1)
>>     Disk space is allocated for the delayed allocation blocks before the
>>     data is flushed to disk [linvfs_writepage, page_state_convert].
>> 
>>      file image
>>      offset=0     eof
>>          +----+----+----+----+----+- ... +----+
>>          |    |    |    |    |    |      |    |
>>          +----+----+----+----+----+- ... +----+
>>           4KB  4KB
>>          +---------+
>>           write data (write1)
>>          +------------------------------------+
>>           allocated disk space
>>          +---------+
>>           called disk space1
>>                    +--------------------------+
>>                     called disk space2
>> 
>>  3. close in nfsd (for write1)
>>     The nfsd calls close [linvfs_release]. At this time, the allocated disk
>>     space beyond the end of the file (disk space2) is truncated, since the
>>     refcache is disabled [xfs_inactive_free_eofblocks].
>> 
>>      file image
>>      offset=0     eof
>>          +----+----+
>>          |    |    |
>>          +----+----+
>>           4KB  4KB
>>          +---------+
>>           write data (write1)
>>          +---------+
>>           disk space1
>> 
>>  4. open and write in nfsd (for write2)
>>     The nfs client then writes another 8KB of data to the file (called write2).
>>     The nfsd calls open [linvfs_open] and write [linvfs_write].
>> 
>>      file image
>>      offset=0                eof
>>          +----+----+----+----+----+- ... +----+
>>          |    |    |    |    |    |      |    |
>>          +----+----+----+----+----+- ... +----+
>>           4KB  4KB  4KB  4KB
>>          +---------+
>>           write data (write1)
>>                    +---------+
>>                     write data (write2)
>>                    +--------------------------+
>>                     delayed allocation blocks
>>          +---------+
>>           disk space1
>> 
>>  5. flush data to disk in kupdated (for write1)
>>     The write data (write1) is flushed to disk space1 [page_state_convert].
>>     And the write data (write2) is also flushed, to disk space2
>>     [cluster_write] !!!, because the buffer state of the write data (write2)
>>     is dirty and delayed-allocation.
>>     But disk space2 no longer exists at this time; it may already be in
>>     use by another file, or be free space.
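
For illustration, here is a very rough sketch of the step-5 clustering
decision (hypothetical code, not the real cluster_write; buffer_delay()
is assumed to be the delalloc-flag test added by the XFS patches for
2.4): a neighbouring buffer is pulled into the current writeout merely
because it is dirty and delayed-allocation, and nothing re-checks that
the extent mapped in step 2 still exists after the truncation in step 3.

#include <linux/fs.h>

/* Hypothetical predicate, for illustration only. */
static int should_cluster(struct buffer_head *bh)
{
        /* dirty + delalloc is enough to be swept into the I/O; the stale
         * mapping from step 2 is trusted without re-validation */
        return buffer_dirty(bh) && buffer_delay(bh);
}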
>> 
>> I think that one possible solution is to flush only the buffers within the
>> end of the file before allocating disk space for the delayed allocation
>> blocks, and not to flush the buffers beyond that.
>> I made a patch for xfs1.3.1. I am running the test on a kernel with the
>> patch applied; it has been running for over 16 hours with no corruption.
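
As a minimal sketch of that idea (a hypothetical helper, not the actual
code in my patch for xfs1.3.1; it glosses over how XFS 1.3.1 really walks
dirty buffers): only pages whose index lies inside i_size are pushed out,
and the speculative pages past EOF are left alone, so truncating the
extra blocks on close cannot strand data already queued against them.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Hypothetical helper, for illustration only. */
static void flush_dirty_pages_within_eof(struct inode *inode)
{
        struct address_space *mapping = inode->i_mapping;
        unsigned long last, index;

        if (inode->i_size == 0)
                return;
        last = (inode->i_size - 1) >> PAGE_CACHE_SHIFT;

        for (index = 0; index <= last; index++) {
                struct page *page = find_lock_page(mapping, index);

                if (!page)
                        continue;
                if (PageDirty(page)) {
                        ClearPageDirty(page);
                        mapping->a_ops->writepage(page);  /* unlocks the page */
                } else {
                        UnlockPage(page);
                }
                page_cache_release(page);
        }
}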
>> 
>> Could you please comment on the attached patch?
>> 
>> Regards,
>> Tsuda
>> 
>> In message "data corruption on nfs+xfs" 
>> (04/05/27 15:58:48),
>> kazuyuki@xxxxxxxxxxxxxxxxxxx wrote...
>> >We are experiencing the same problem as No.198.
>> >  http://oss.sgi.com/bugzilla/show_bug.cgi?id=198
>> >  http://marc.theaimsgroup.com/?t=108343605300001&r=1&w=2
>> >
>> >We have confirmed that even when the refcache is disabled, by setting
>> >fs.xfs.refcache_size to zero through sysctl, the problem does not disappear.
>> >Running linux in single-CPU mode makes the problem slightly harder to
>> >reproduce, but it still occurs.
>> >
>> >Two types of corruption we've seen:
>> >
>> > 1) Width is a multiple of 8kB, starting at an 8kB boundary.
>> >   *Maybe the same trouble as No.198.
>> >
>> > 2) Width is 964 bytes, ending at a 4kB boundary.
>> >   *I'm not sure the cause is the same as 1) above.
>> >
>> >We have tested on 2.4.20-20.9.XFS1.3.1, 2.4.20-30.9.sgi1 XFS1.3.3, and
>> >other kernels based on 2.4.20-20 to which we made some changes.
>> >
>> >Does anyone know where the cause lies: in the page cache, disk block
>> >handling, or other parts?
>> >Or does anyone know how to avoid this with some setting or another version?
>> >

Attachment: trace.patch
Description: Binary data
