Bugzilla – Bug 198
file corruption over NFSv3 UDP using 1.2pre3
Last modified: 2005-04-20 14:43:56 CDT
If this is a userspace bug, what version of the package are you using: What kernel are you using: 2.4.18-17SGI_XFS1.2pre3 and 2.4.19-SGI_XFS1.2pre3 Where did the XFS code come from? (CVS, Linus, your distribution, etc): Downloaded the kernel source RPM from sgi web site. Description of Problem: Using Dell Poweredge 4600 server and exporting the second system drive, file integrity verification fails. The failure occure on linux RH 7.3 clients, O2 with IRIX and Sun Netra5. The server is using Intel E1000 PCI-X (2X) with or without bonding. The test is ran concurently on 5 clients over gigabit ethernet or 100BT. The test is perform by creating a 3 Gigabyte file on the client using local storage. The file is then copied onto the NFS mount point and then copied back under a different name. Then the original file is compared with both file, the one on the NFS server and the copy locally. Within 3 to 4 hours of running the test so after 10 to 20 sucessful pass, the test fails. How Reproducible: All the time, nothing reported by enabling CONFIG_XFS_DEBUG, CONFIG_PAGEBUF_DEBUG, CONFIG_KERNEL_ASSERT. Tested with max_cpus=1 also. Steps to Reproduce: 1. Mount the server over NFS with rsize=8192,wsize=8192,hard,udp under /ts1 2. Create a local test file: dd if=/dev/random of=/var/tmp/test bs=1024k count=3000 3. Copy the file over to the server: cp /var/tmp/test /ts1/test-`hostname` 4. Copy the file back: cp /ts1/test-`hostname` /var/tmp/test-`hostname`.BACK 5. compare the files and their csum: Actual Results: A difference between the original file and the file over NFS Expected Results: No differences, please. Additional Information: Running the same test with RH kernel without XFS using EXt2 din't fails after 30HRS. The internal SCSI is AIX7900. .config file changes: SCSI and AIC7XXX built in instead of as module. Tested also with Hardware Raid 5 over Fiber Channel, it just takes longer to hit the Bug. email me if you need more info. Stephane Harnois
Created attachment 48 [details] configuration file reproduced the problem with or without XFS_QUOTA, or XFS_DEBUG, PAGEBUF_DEBUG
Created attachment 49 [details] console output from kernel. Restarted the test after mkfs.xfs /dev/sdb and got these messages...
Comment on attachment 49 [details] console output from kernel. And corruption was detected on 1 of the NFS client.
Subject: Re: New: file corruption over NFSv3 UDP using 1.2pre3 Hmmm this does not look good. I would be helpful to see exactly what went wrong in the file. A cmp -l might be a good start or a hexdump -v on both the good and the bad files and them diffing the output. Basically what we are looking for is what "bad" data ended up in the file and start position and length of the bad data. On Mon, 2002-11-18 at 11:03, bugzilla-daemon@oss.sgi.com wrote: > http://oss.sgi.com/bugzilla/show_bug.cgi?id=198 > > Summary: file corruption over NFSv3 UDP using 1.2pre3 > Product: Linux XFS > Version: 1.2.x > Platform: IA32 > OS/Version: Linux > Status: NEW > Severity: critical > Priority: High > Component: XFS kernel code > AssignedTo: xfs-master@oss.sgi.com > ReportedBy: stephy32@videotron.ca > > > If this is a userspace bug, what version of the package are you using: > > What kernel are you using: > 2.4.18-17SGI_XFS1.2pre3 and 2.4.19-SGI_XFS1.2pre3 > > Where did the XFS code come from? (CVS, Linus, your distribution, etc): > Downloaded the kernel source RPM from sgi web site. > > Description of Problem: > Using Dell Poweredge 4600 server and exporting the second system drive, file > integrity verification fails. The failure occure on linux RH 7.3 clients, O2 > with IRIX and Sun Netra5. The server is using Intel E1000 PCI-X (2X) with or > without bonding. > > The test is ran concurently on 5 clients over gigabit ethernet or 100BT. > The test is perform by creating a 3 Gigabyte file on the client using local > storage. The file is then copied onto the NFS mount point and then copied back > under a different name. Then the original file is compared with both file, the > one on the NFS server and the copy locally. Within 3 to 4 hours of running the > test so after 10 to 20 sucessful pass, the test fails. > > > How Reproducible: > All the time, nothing reported by enabling CONFIG_XFS_DEBUG, > CONFIG_PAGEBUF_DEBUG, CONFIG_KERNEL_ASSERT. Tested with max_cpus=1 also. > > Steps to Reproduce: > 1. Mount the server over NFS with rsize=8192,wsize=8192,hard,udp under /ts1 > 2. Create a local test file: dd if=/dev/random of=/var/tmp/test bs=1024k > count=3000 > 3. Copy the file over to the server: cp /var/tmp/test /ts1/test-`hostname` > 4. Copy the file back: cp /ts1/test-`hostname` /var/tmp/test-`hostname`.BACK > 5. compare the files and their csum: > > Actual Results: > A difference between the original file and the file over NFS > > Expected Results: > No differences, please. > > Additional Information: > Running the same test with RH kernel without XFS using EXt2 din't fails after > 30HRS. > The internal SCSI is AIX7900. > > .config file changes: SCSI and AIC7XXX built in instead of as module. Tested > also with Hardware Raid 5 over Fiber Channel, it just takes longer to hit the > Bug. > > email me if you need more info. > > Stephane Harnois > > > > ------- You are receiving this mail because: ------- > You are the assignee for the bug, or are watching the assignee. >
Hmmm this does not look good. I would be helpful to see exactly what went wrong in the file. A cmp -l might be a good start or a hexdump -v on both the good and the bad files and them diffing the output. Basically what we are looking for is what "bad" data ended up in the file and start position and length of the bad data.
Created attachment 50 [details] Log file from corruption output from cmp64, something we wrote. The bad Data is from another offset in the file. Do you want me to upload the 2 files (total 6 gigabytes)?
Comment on attachment 50 [details] Log file from corruption first colomn: offset: original_data bad_data
Created attachment 51 [details] log file from today's corruption, with PAGEBUF_DEBUG enabled. log file matching the pagebuf error messages run.
Created attachment 52 [details] error out from consol when doing strace rm consol output when doing strace rm /var/lock/subsys/smb
Additional comments from Stephane I do have one bit of positive progress to report (finally)! I commented out the purging of the XFS refcache and got no corruption (ran about 56 hours). As you know, the refcache is just a table of open files kept by XFS, specifically to speedup nfsd. Otherwise, whenever a file is closed XFS does a truncate (and perhaps other stuff, probably to clear out any over allocation...) and the truncate takes a bit of time as it goes through all the file's pages. So by keeping them open, this penalty is avoided. To make sure that file info gets flushed, entries are purged regularly from the refcache: in short our files get closed every 30seconds. Well I just commented that part out. That's a test of course, not a solution. --- linux/fs/xfs/xfs_vfsops.c.orig Sat Nov 30 01:09:15 2002 +++ linux/fs/xfs/xfs_vfsops.c Sat Nov 30 01:09:43 2002 @@ -1515,7 +1515,7 @@ * eventually kicked out of the cache. */ if (flags & SYNC_BDFLUSH) { - xfs_refcache_purge_some(mp); + //xfs_refcache_purge_some(mp); } /* I'll look at the code this morning and try to understand what's going on. My guess is that the file is being closed (truncated and perhaps other stuff) and at the same time some other request is extending it, and this confuses XFS somehow, perhaps some locking is missing. It's still surprising that data jumps from one file to another though.
Created attachment 56 [details] xfs corruption locally, without nfs simply extract the tar, make and ./tst. This write 2 files in parallel, closing and opening the file every 16 X 16K. The 16k blocks are written out of order, b2, b1, b4, b3, b6, b5. The file get corrupted right after 1 or 2 iteration. The test doesn't corrupt the file when ran on Ext2 or Ext3 or on xfs on IRIX 6.5.
The patch provided by Russel fixed the corruption locally for out test case but "not" over NFS. I've ran the corrupter over night without a failure but got corruption over NFS the same as before. The difference between the test program and nfs is that nfsd uses linvfs_open(inode, *fd) and the test program goes via libc from user space. With the patch you provided, you can also get the corruption by mounting over NFS on the loopback interface. In this case, I can still get the corruption with corruptor without requiring large setup to test. I simply needed to run 2 thread per CPU. I noticed that the refcache_purge close the files without holding the i_sem. Is this acquired latter on in the close path? I was thinking that doing a try_down on the i_sem and only closing the file when the semaphore could be acquired would prevent closing/writing race condition while optimizing by not closing a file and doing all the block cleanup while the file is still being read/written to/from. Make sense?
I have not looked at this but I think this problem may be related to the fact that the xfs_inode is not torn down at the same time as the linux inode. A recient bug in our remount ro code (shutdown code for root fs) has shown we don't clean up our inodes as often as we should. This problem may be a related, if an inode has it's refcache dropped we may be flushing the data on the inode.
Created attachment 67 [details] test program for local corruption on linux xfs This can also show holes in the latest cvs xfs integrity.
Not that is really any progress on this bug. But the test program also fails on linux 2.6.0
Using the test program for local corruption, I can replicate the bug using the Mandrake 9.1 kernel when running multiple instances. 2.4.21-0.13mdkenterprise
I've never seen it in the real world though.
Subject: Re: data corruption on nfs+xfs Hi, I have reproduced similar problem on xfs1.3.1 (based on 2.4.21), my environment is as follows. nfs server : OS : RedHat9 + xfs1.3.1 (based on 2.4.21)$B!!(B CPU : Xeon(2.4GHz) x 2 MEM : 1GB NIC : Intel PRO/1000 Local Filesystem : XFS, the refcache is disabled. nfs client : OS : RadHat9 (based on 2.4.20-8) NIC : Intel PRO/1000 NFS Ver. : 3 NFS Mount Options : udp,hard,intr,wsize=8192 Within 1 hour of running the test, the corruption was detected. (to make it easy to detect the corruption, umount nfs, umount xfs, mount xfs and mount nfs before comparing data, i.e. purge memory cache.) The corruption width was a multiple of 4KB, starting at 4KB boundary. In many cases, it was caused in the start part of the physical extent. I have investigated the issue using the kernel embeded local trace. I think that the issue was caused by the delayed allocation mechanism. I explain the example of curruption scenario which I guess. Each process of the scenario is in order of time. 1. open and write in nfsd (for write1) The nfs client write 8KB data to file (called write1). The write request is processed in nfsd. The nfsd call open [linvfs_open], and call write [linvfs_write]. After calling write, the file has several delayed allocation blocks over end of the file, by allocation in chunks and alignment of writeiosize. file image offset=0 eof +----+----+----+----+----+- ... +----+ | | | | | | | | +----+----+----+----+----+- ... +----+ 4KB 4KB +---------+ write data (write1) +------------------------------------+ delayed allocation blocks 2. allocate disk space in kupdated (for write1) The disk space is allocated for delayed allocotion blocks before data flushed to disk [linvfs_writepage, page_state_convert]. file image offset=0 eof +----+----+----+----+----+- ... +----+ | | | | | | | | +----+----+----+----+----+- ... +----+ 4KB 4KB +---------+ write data (write1) +------------------------------------+ allocated disk space +---------+ called disk space1 +--------------------------+ called disk space2 3. close in nfsd (for write1) The nfsd call close [linvfs_release]. At this time, allocated disk space over end of the file (disk space2) is truncated, when the refcache is disabled [xfs_inactive_free_eofblocks]. file image offset=0 eof +----+----+ | | | +----+----+ 4KB 4KB +---------+ write data (write1) +---------+ disk space1 4. open and write in nfsd (for write2) Furthermore the nfs client write 8KB data to file (called write2). The nfsd call open [linvfs_open], and call write [linvfs_write]. file image offset=0 eof +----+----+----+----+----+- ... +----+ | | | | | | | | +----+----+----+----+----+- ... +----+ 4KB 4KB 4KB 4KB +---------+ write data (write1) +---------+ write data (write2) +--------------------------+ delayed allocation blocks +---------+ disk space1 5. flush data to disk in kupdated (for write1) The write data (write1) is flushed to disk space1 [page_state_convert]. And the write data (write2) is flushed to disk space2 [cluster_write] !!!, because the buffer status of write data (write2) is dirty and delay. But, the disk space2 dose not exist at this time. The disk space2 may be used by the other file or free space. I think that one of solution for the issue is to flush only buffers in end of the file before allocating disk space for delayed allocation blocks, don't flush buffers over that. I made patch for xfs1.3.1. I am running the test on the kernel added the patch, it has been running for over 16 hours with no corruption. Could you please comment the attached patch. Regards, Tsuda
Subject: Re: data corruption on nfs+xfs In message "Re: data corruption on nfs+xfs" (04/06/11 09:38:53), cattelan@xfs.org wrote... >This looks really promising. >I'm currently reading through the code again to >see what kind of implications this might have. >I'm worried that you're patch might increase file fragmentation, >but that is just at first glance. I'll look some more and run >some testing with and with out you're patch. Thank you. >I'm looking at xfs_inactive_free_eofblocks again, I think >there may be an issue with the xfs_inode di_size and the linux >inode i_size. I also think so. I think that following issue may be caused (I don't reproduce following issue). This problem is another with the previous problem and is based on asynchronous updating inode i_size and xfs_inode di_size. Each process is in order of time. 1. write 8KB The TP write 8KB data to file. First, 1st 4KB data is processed [do_generic_file_write]. file image offset=0 +----+ | | +----+ inode i_size : -----> (4KB) xfs_inode di_size : (0) inode i_size is 4KB, but xfs_inode di_size is zero, because a_op->write_commit update only i_size. Next, 2nd 4KB data is processed [do_generic_file_write]. file image offset=0 +----+----+ | | | +----+----+ inode i_size : ----------> (8KB) xfs_inode di_size : (0) inode i_size is 8KB, but xfs_inode di_size remain zero. 2. revalidate The revalidate runs (ex. by ls) [linvfs_revalidate]. At this time, inode i_size is changed to same value as xfs_inode di_size [vn_revalidate]. As result i_size is zero! file image offset=0 +----+----+ | | | +----+----+ inode i_size : (0) xfs_inode di_size : (0) 3. flush data The flushing runs by memory overload [balance_dirty]. Although it is going to flush the buffer (1st page) of write 8KB, its result is EIO, because 1st page index is over inode i_size [xfs_page_state_convert]. And the buffer is discarded on 2.4.26. 4. write 8KB (continue) inode i_size and xfs_inode di_size is updated 8KB at the last of write processing [xfs_write]. But write data is lossed. file image offset=0 +----+----+ | | | +----+----+ inode i_size : ----------> (8KB) xfs_inode di_size : ----------> (8KB) +----+ no data I think that the issue rarely is caused, and one of solution for the issue is to simultaneously update inode i_size and xfs_inode di_size at a_op->write_commit. >BTW what tracing did you use to find this? Although there were trial and error, finally I investigated with the tracing (attached patch). Regards, Tsuda
Created attachment 124 [details] Fix from Masanori TSUDA Masanaori's patch effectively fixes the problem! We still want to investigate the problem some more because we believe this fixes a symptom of a hole in i_size != di_size / vaild page handling. -Russell