Bug 198 - file corruption over NFSv3 UDP using 1.2pre3
: file corruption over NFSv3 UDP using 1.2pre3
Status: RESOLVED FIXED
Product: XFS
Classification: Unclassified
Component: XFS kernel code
: 1.2.x
: All Linux
: P1 critical
: ---
Assigned To: Russell Cattelan
:
:
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2002-11-18 09:03 CST by Stephane Harnois
Modified: 2005-04-20 14:43 CDT (History)
1 user (show)

See Also:


Attachments
configuration file (20.81 KB, text/plain)
2002-11-18 09:06 CST, Stephane Harnois
Details
console output from kernel. (224 bytes, text/plain)
2002-11-18 09:11 CST, Stephane Harnois
Details
Log file from corruption (179.63 KB, text/plain)
2002-11-18 10:24 CST, Stephane Harnois
Details
log file from today's corruption, with PAGEBUF_DEBUG enabled. (536.51 KB, text/plain)
2002-11-18 10:36 CST, Stephane Harnois
Details
error out from consol when doing strace rm (6.50 KB, text/plain)
2002-11-28 09:41 CST, Knut J
Details
xfs corruption locally, without nfs (20.00 KB, application/octet-stream)
2002-12-05 13:54 CST, Stephane Harnois
Details
test program for local corruption on linux xfs (3.80 KB, application/x-compressed)
2003-02-27 10:36 CST, Stephane Harnois
Details
Fix from Masanori TSUDA (4.09 KB, patch)
2004-06-15 10:50 CDT, Russell Cattelan
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Stephane Harnois 2002-11-18 09:03:28 CST
If this is a userspace bug, what version of the package are you using:

What kernel are you using: 
2.4.18-17SGI_XFS1.2pre3 and 2.4.19-SGI_XFS1.2pre3

Where did the XFS code come from?  (CVS, Linus, your distribution, etc):
Downloaded the kernel source RPM from sgi web site.

Description of Problem:
	Using Dell Poweredge 4600 server and exporting the second system drive, file
integrity verification fails.  The failure occure on linux RH 7.3 clients,  O2
with IRIX and Sun Netra5.  The server is using Intel E1000 PCI-X (2X) with or
without bonding.

The test is ran concurently on 5 clients over gigabit ethernet or 100BT.
The test is perform by creating a 3 Gigabyte file on the client using local
storage.  The file is then copied onto the NFS mount point and then copied back
under a different name.  Then the original file is compared with both file, the
one on the NFS server and the copy locally.  Within 3 to 4 hours of running the
test so after 10 to 20 sucessful pass, the test fails.


How Reproducible:
	All the time, nothing reported by enabling CONFIG_XFS_DEBUG,
CONFIG_PAGEBUF_DEBUG, CONFIG_KERNEL_ASSERT.  Tested with max_cpus=1 also.

Steps to Reproduce:
1.  Mount the server over NFS with rsize=8192,wsize=8192,hard,udp under /ts1
2.  Create a local test file: dd if=/dev/random of=/var/tmp/test bs=1024k
count=3000
3.  Copy the file over to the server: cp /var/tmp/test /ts1/test-`hostname`
4. Copy the file back: cp /ts1/test-`hostname` /var/tmp/test-`hostname`.BACK
5. compare the files and their csum:

Actual Results:
	A difference between the original file and the file over NFS

Expected Results:
	No differences, please.

Additional Information:
	Running the same test with RH kernel without XFS using EXt2 din't fails after
30HRS.
	The internal SCSI is AIX7900.

.config file changes:  SCSI and AIC7XXX built in instead of as module.  Tested
also with Hardware Raid 5 over Fiber Channel, it just takes longer to hit the
Bug.

email me if you need more info.

Stephane Harnois
Comment 1 Stephane Harnois 2002-11-18 09:06:59 CST
Created attachment 48 [details]
configuration file

reproduced the problem with or without XFS_QUOTA, or XFS_DEBUG, PAGEBUF_DEBUG
Comment 2 Stephane Harnois 2002-11-18 09:11:58 CST
Created attachment 49 [details]
console output from kernel.

Restarted the test after mkfs.xfs /dev/sdb and got these messages...
Comment 3 Stephane Harnois 2002-11-18 09:13:20 CST
Comment on attachment 49 [details]
console output from kernel.

And corruption was detected on 1 of the NFS client.
Comment 4 Russell Cattelan 2002-11-18 09:35:13 CST
Subject: Re:  New: file corruption over NFSv3 UDP using 1.2pre3

Hmmm this does not look good.
I would be helpful to see exactly what went wrong in the file.
A cmp -l might be a good start or a hexdump -v on both the good and the
bad files and them diffing the output.
Basically what we are looking for is what "bad" data ended up in the
file and start position and length of the bad data.

On Mon, 2002-11-18 at 11:03, bugzilla-daemon@oss.sgi.com wrote:
> http://oss.sgi.com/bugzilla/show_bug.cgi?id=198
> 
>            Summary: file corruption over NFSv3 UDP using 1.2pre3
>            Product: Linux XFS
>            Version: 1.2.x
>           Platform: IA32
>         OS/Version: Linux
>             Status: NEW
>           Severity: critical
>           Priority: High
>          Component: XFS kernel code
>         AssignedTo: xfs-master@oss.sgi.com
>         ReportedBy: stephy32@videotron.ca
> 
> 
> If this is a userspace bug, what version of the package are you using:
> 
> What kernel are you using: 
> 2.4.18-17SGI_XFS1.2pre3 and 2.4.19-SGI_XFS1.2pre3
> 
> Where did the XFS code come from?  (CVS, Linus, your distribution, etc):
> Downloaded the kernel source RPM from sgi web site.
> 
> Description of Problem:
> 	Using Dell Poweredge 4600 server and exporting the second system drive, file
> integrity verification fails.  The failure occure on linux RH 7.3 clients,  O2
> with IRIX and Sun Netra5.  The server is using Intel E1000 PCI-X (2X) with or
> without bonding.
> 
> The test is ran concurently on 5 clients over gigabit ethernet or 100BT.
> The test is perform by creating a 3 Gigabyte file on the client using local
> storage.  The file is then copied onto the NFS mount point and then copied back
> under a different name.  Then the original file is compared with both file, the
> one on the NFS server and the copy locally.  Within 3 to 4 hours of running the
> test so after 10 to 20 sucessful pass, the test fails.
> 
> 
> How Reproducible:
> 	All the time, nothing reported by enabling CONFIG_XFS_DEBUG,
> CONFIG_PAGEBUF_DEBUG, CONFIG_KERNEL_ASSERT.  Tested with max_cpus=1 also.
> 
> Steps to Reproduce:
> 1.  Mount the server over NFS with rsize=8192,wsize=8192,hard,udp under /ts1
> 2.  Create a local test file: dd if=/dev/random of=/var/tmp/test bs=1024k
> count=3000
> 3.  Copy the file over to the server: cp /var/tmp/test /ts1/test-`hostname`
> 4. Copy the file back: cp /ts1/test-`hostname` /var/tmp/test-`hostname`.BACK
> 5. compare the files and their csum:
> 
> Actual Results:
> 	A difference between the original file and the file over NFS
> 
> Expected Results:
> 	No differences, please.
> 
> Additional Information:
> 	Running the same test with RH kernel without XFS using EXt2 din't fails after
> 30HRS.
> 	The internal SCSI is AIX7900.
> 
> .config file changes:  SCSI and AIC7XXX built in instead of as module.  Tested
> also with Hardware Raid 5 over Fiber Channel, it just takes longer to hit the
> Bug.
> 
> email me if you need more info.
> 
> Stephane Harnois
> 
> 
> 
> ------- You are receiving this mail because: -------
> You are the assignee for the bug, or are watching the assignee.
> 


Comment 5 Russell Cattelan 2002-11-18 10:07:57 CST
Hmmm this does not look good.
I would be helpful to see exactly what went wrong in the file.
A cmp -l might be a good start or a hexdump -v on both the good and the
bad files and them diffing the output.
Basically what we are looking for is what "bad" data ended up in the
file and start position and length of the bad data.
Comment 6 Stephane Harnois 2002-11-18 10:24:05 CST
Created attachment 50 [details]
Log file from corruption

output from cmp64, something we wrote.	The bad Data is from another offset in
the file.  Do you want me to upload the 2 files (total 6 gigabytes)?
Comment 7 Stephane Harnois 2002-11-18 10:25:27 CST
Comment on attachment 50 [details]
Log file from corruption

first colomn: offset: original_data bad_data
Comment 8 Stephane Harnois 2002-11-18 10:36:29 CST
Created attachment 51 [details]
log file from today's corruption, with PAGEBUF_DEBUG enabled.

log file matching the pagebuf error messages run.
Comment 9 Knut J 2002-11-28 09:41:40 CST
Created attachment 52 [details]
error out from consol when doing strace rm

consol output when doing strace rm /var/lock/subsys/smb
Comment 10 Russell Cattelan 2002-12-03 11:36:27 CST
Additional comments from Stephane

I do have one bit of positive progress to report (finally)! I commented out the
purging of the XFS refcache and got no corruption (ran about 56 hours).

As you know, the refcache is just a table of open files kept by XFS,
specifically to speedup nfsd. Otherwise, whenever a file is closed XFS does a
truncate (and perhaps other stuff, probably to clear out any over allocation...)  
and the truncate takes a bit of time as it goes through all the file's pages. So
by keeping them open, this penalty is avoided. To make sure that file info gets
flushed, entries are purged regularly from the refcache: in short our files get
closed every
30seconds. Well I just commented that part out. That's a test of course, not a
solution.

--- linux/fs/xfs/xfs_vfsops.c.orig	Sat Nov 30 01:09:15 2002
+++ linux/fs/xfs/xfs_vfsops.c	Sat Nov 30 01:09:43 2002
@@ -1515,7 +1515,7 @@
 	 * eventually kicked out of the cache.
 	 */
 	if (flags & SYNC_BDFLUSH) {
-		xfs_refcache_purge_some(mp);
+		//xfs_refcache_purge_some(mp);
 	}
 
 	/*

I'll look at the code this morning and try to understand what's going on.  My
guess is that the file is being closed (truncated and perhaps other stuff) and
at the same time some other request is extending it, and this confuses XFS
somehow, perhaps some locking is missing. It's still surprising that data jumps
from one file to another though.
Comment 11 Stephane Harnois 2002-12-05 13:54:01 CST
Created attachment 56 [details]
xfs corruption locally, without nfs

simply extract the tar, make and ./tst.
This write 2 files in parallel, closing and opening the file every 16 X 16K.
The 16k blocks are written out of order, b2, b1, b4, b3, b6, b5.
The file get corrupted right after 1 or 2 iteration.

The test doesn't corrupt the file when ran on Ext2 or Ext3 or on xfs on IRIX
6.5.
Comment 12 Stephane Harnois 2002-12-16 13:07:59 CST
The patch provided by Russel fixed the corruption locally for out test case but
"not" over NFS.  I've ran the corrupter over night without a failure but got
corruption over NFS the same as before.  

 The difference between the test program and nfs is that nfsd uses
linvfs_open(inode, *fd) and the test program goes via libc from user space.

 With the patch you provided, you can also get the corruption by mounting over
NFS on the loopback interface.  In this case, I can still get the corruption
with corruptor without requiring large setup to test.  I simply needed to run 2
thread per CPU.

I noticed that the refcache_purge close the files without holding the  i_sem. 
Is this acquired latter on in the close path?  I was thinking that doing a
try_down on the i_sem and only closing the file when the semaphore could be
acquired would prevent closing/writing race condition while optimizing by not
closing a file and doing all the block cleanup while the file is still being
read/written to/from.
Make sense?
Comment 13 Russell Cattelan 2003-01-09 10:32:16 CST
I have not looked at this but I think this problem may be related to the 
fact that the xfs_inode is not torn down at the same time as the linux inode.
A recient bug in our remount ro code (shutdown code for root fs) has shown 
we don't clean up our inodes as often as we should.
This problem may be a related, if an inode has it's refcache dropped we may 
be flushing the data on the inode. 
Comment 14 Stephane Harnois 2003-02-27 10:36:19 CST
Created attachment 67 [details]
test program for local corruption on linux xfs

This can also show holes in the latest cvs xfs integrity.
Comment 15 SGI linux xfs people 2004-01-09 16:56:11 CST
Not that is really any progress on this bug.
But the test program also fails on linux 2.6.0
Comment 16 Shiv Sikand 2004-02-06 12:13:09 CST
Using the test program for local corruption, I can replicate the bug using the 
Mandrake 9.1 kernel when running multiple instances. 
 
2.4.21-0.13mdkenterprise 
 
 
Comment 17 Shiv Sikand 2004-02-09 13:02:50 CST
I've never seen it in the real world though. 
Comment 18 tsuda 2004-06-15 10:00:22 CDT
Subject:  Re: data corruption on nfs+xfs

Hi,

I have reproduced similar problem on xfs1.3.1 (based on 2.4.21),
my environment is as follows.

 nfs server :
   OS  : RedHat9 + xfs1.3.1 (based on 2.4.21)$B!!(B
   CPU : Xeon(2.4GHz) x 2
   MEM : 1GB
   NIC : Intel PRO/1000
   Local Filesystem : XFS, the refcache is disabled.

 nfs client :
   OS  : RadHat9 (based on 2.4.20-8)
   NIC : Intel PRO/1000
   NFS Ver. : 3
   NFS Mount Options : udp,hard,intr,wsize=8192

Within 1 hour of running the test, the corruption was detected.
(to make it easy to detect the corruption, umount nfs, umount xfs, 
 mount xfs and mount nfs before comparing data, i.e. purge memory cache.)
The corruption width was a multiple of 4KB, starting at 4KB boundary.
In many cases, it was caused in the start part of the physical extent. 

I have investigated the issue using the kernel embeded local trace.
I think that the issue was caused by the delayed allocation mechanism.
I explain the example of curruption scenario which I guess.
Each process of the scenario is in order of time.

 1. open and write in nfsd (for write1)
    The nfs client write 8KB data to file (called write1).
    The write request is processed in nfsd. The nfsd call open [linvfs_open],
    and call write [linvfs_write]. After calling write, the file has several
    delayed allocation blocks over end of the file, by allocation in chunks and
    alignment of writeiosize.

     file image
     offset=0     eof
         +----+----+----+----+----+- ... +----+
         |    |    |    |    |    |      |    |
         +----+----+----+----+----+- ... +----+
          4KB  4KB
         +---------+
          write data (write1)
         +------------------------------------+
          delayed allocation blocks

 2. allocate disk space in kupdated (for write1)
    The disk space is allocated for delayed allocotion blocks before data
    flushed to disk [linvfs_writepage, page_state_convert].

     file image
     offset=0     eof
         +----+----+----+----+----+- ... +----+
         |    |    |    |    |    |      |    |
         +----+----+----+----+----+- ... +----+
          4KB  4KB
         +---------+
          write data (write1)
         +------------------------------------+
          allocated disk space
         +---------+
          called disk space1
                   +--------------------------+
                    called disk space2

 3. close in nfsd (for write1)
    The nfsd call close [linvfs_release]. At this time, allocated disk space
    over end of the file (disk space2) is truncated, when the refcache is disabled
    [xfs_inactive_free_eofblocks].

     file image
     offset=0     eof
         +----+----+
         |    |    |
         +----+----+
          4KB  4KB
         +---------+
          write data (write1)
         +---------+
          disk space1

 4. open and write in nfsd (for write2)
    Furthermore the nfs client write 8KB data to file (called write2).
    The nfsd call open [linvfs_open], and call write [linvfs_write].

     file image
     offset=0                eof
         +----+----+----+----+----+- ... +----+
         |    |    |    |    |    |      |    |
         +----+----+----+----+----+- ... +----+
          4KB  4KB  4KB  4KB
         +---------+
          write data (write1)
                   +---------+
                    write data (write2)
                   +--------------------------+
                    delayed allocation blocks
         +---------+
          disk space1

 5. flush data to disk in kupdated (for write1)
    The write data (write1) is flushed to disk space1 [page_state_convert].
    And the write data (write2) is flushed to disk space2 [cluster_write] !!!,
    because the buffer status of write data (write2) is dirty and delay.
    But, the disk space2 dose not exist at this time.
    The disk space2 may be used by the other file or free space.

I think that one of solution for the issue is to flush only buffers in
end of the file before allocating disk space for delayed allocation blocks,
don't flush buffers over that.
I made patch for xfs1.3.1. I am running the test on the kernel added the
patch, it has been running for over 16 hours with no corruption.

Could you please comment the attached patch.

Regards,
Tsuda

Comment 19 tsuda 2004-06-15 10:03:34 CDT
Subject:  Re: data corruption on nfs+xfs


In message "Re: data corruption on nfs+xfs" 
(04/06/11 09:38:53),
cattelan@xfs.org wrote...
>This looks really promising.
>I'm currently reading through the code again to
>see what kind of implications this might have.
>I'm worried that you're patch might increase file fragmentation, 
>but that is just at first glance. I'll look some more and run
>some testing with and with out you're patch.

Thank you.

>I'm looking at xfs_inactive_free_eofblocks again, I think
>there may be an issue with the xfs_inode di_size and the linux
>inode i_size.

I also think so. I think that following issue may be caused
(I don't reproduce following issue).
This problem is another with the previous problem and is based
on asynchronous updating inode i_size and xfs_inode di_size.
Each process is in order of time.

  1. write 8KB
     The TP write 8KB data to file.
     First, 1st 4KB data is processed [do_generic_file_write].

                         file image
                         offset=0 
                             +----+
                             |    |
                             +----+
        inode i_size      :  -----> (4KB)
        xfs_inode di_size : (0)

     inode i_size is 4KB, but xfs_inode di_size is zero,
     because a_op->write_commit update only i_size.

     Next, 2nd 4KB data is processed [do_generic_file_write].

                         file image
                         offset=0 
                             +----+----+
                             |    |    |
                             +----+----+
        inode i_size      :  ----------> (8KB)
        xfs_inode di_size : (0)

     inode i_size is 8KB, but xfs_inode di_size remain zero.

  2. revalidate
     The revalidate runs (ex. by ls) [linvfs_revalidate].
     At this time, inode i_size is changed to same value as
     xfs_inode di_size [vn_revalidate].
     As result i_size is zero!

                         file image
                         offset=0 
                             +----+----+
                             |    |    |
                             +----+----+
        inode i_size      : (0)
        xfs_inode di_size : (0)

  3. flush data
     The flushing runs by memory overload [balance_dirty].
     Although it is going to flush the buffer (1st page) of write 8KB, 
     its result is EIO, because 1st page index is over inode i_size
     [xfs_page_state_convert].
     And the buffer is discarded on 2.4.26.

  4. write 8KB (continue)
     inode i_size and xfs_inode di_size is updated 8KB at the last of
     write processing [xfs_write].
     But write data is lossed.

                         file image
                         offset=0 
                             +----+----+
                             |    |    |
                             +----+----+
        inode i_size      :  ----------> (8KB)
        xfs_inode di_size :  ----------> (8KB)
                             +----+
                              no data

I think that the issue rarely is caused, and 
one of solution for the issue is to simultaneously update
inode i_size and xfs_inode di_size at a_op->write_commit.

>BTW what tracing did you use to find this?

Although there were trial and error, finally I investigated with the tracing
(attached patch).

Regards,
Tsuda

Comment 20 Russell Cattelan 2004-06-15 10:50:34 CDT
Created attachment 124 [details]
Fix from Masanori TSUDA 

Masanaori's patch effectively  fixes the problem!

We still want to investigate the problem some more 
because we believe this fixes a symptom of a hole
in i_size != di_size / vaild page handling.


-Russell