Bugzilla – Bug 418
unwritten extents remain unwritten after mmap() modifies them
Last modified: 2007-12-04 10:45:08 CST
If a file is allocated using XFS_IOC_RESVSP64, then mmap()ed with PROT_WRITE and written to (by altering bytes in the mmap()ed memory region), the extents remain marked as unwritten.
Created an attachment (id=159)
test program to trigger the bug

Reserves a region using either XFS_IOC_RESVSP64 or XFS_IOC_ALLOCSP64, then writes to it using several possible methods.

Particular invocations that demonstrate the bug:

write()ing to the reserved area gives us the expected results:

    puin:/tmp/pruts% gcc -Wall -Os -s resv.c -o resv && rm -f blaat && do_write= ./resv 167772160 blaat && ls -lh blaat && du -sh blaat && xfs_bmap -vp blaat
    -rw-r--r-- 1 wsl wsl 160M 2005-07-27 22:44 blaat
    160M blaat
    blaat:
     EXT: FILE-OFFSET        BLOCK-RANGE       AG AG-OFFSET           TOTAL  FLAGS
       0: [0..7]:            1149760..1149767  0  (1149760..1149767)  8
       1: [8..163839]:       1149768..1313599  0  (1149768..1313599)  163832 10000
       2: [163840..163847]:  1313600..1313607  0  (1313600..1313607)  8
       3: [163848..327671]:  1313608..1477431  0  (1313608..1477431)  163824 10000
       4: [327672..327679]:  1477432..1477439  0  (1477432..1477439)  8

mmap()ing it then writing to the mapped memory gives us a SIGBUS:

    puin:/tmp/pruts% gcc -Wall -Os -s resv.c -o resv && rm -f blaat && do_mmap= ./resv 167772160 blaat && ls -lh blaat && du -sh blaat && xfs_bmap -vp blaat
    zsh: 1464 bus error do_write= do_mmap= ./resv 167772160 blaat

ftruncate()ing it, then mmap()ing it and writing to the mapped memory gives us:

    puin:/tmp/pruts% gcc -Wall -Os -s resv.c -o resv && rm -f blaat && do_trunc= do_mmap= ./resv 167772160 blaat && ls -lh blaat && du -sh blaat && xfs_bmap -vp blaat
    -rw-r--r-- 1 wsl wsl 160M 2005-07-27 22:44 blaat
    160M blaat
    blaat:
     EXT: FILE-OFFSET        BLOCK-RANGE       AG AG-OFFSET           TOTAL  FLAGS
       0: [0..327679]:       1149760..1477439  0  (1149760..1477439)  327680 10000
Used kernel: stock 2.6.12.3
We probably encountered the same problem. Our program is a kind of 'file distribution agent' used in distributed computing. It receives data from a network connection, stores it in a file and forwards it over another network connection to the next computer on a list. Its daily job is to distribute about 300 files with an average size of 600MB.

Each file is created with ftruncate(), then preallocated with xfs_alloc() and then mmap()ed. Data is written into the mmap()ed memory and then msync()ed. All worked fine until we started using the xfs_alloc() or xfs_resvsp() functions to pre-allocate files (this was necessary to ensure low fragmentation of the files). The problem was that parts of the files were not written to disk and were regularly lost! The problem showed up especially when the volume of the transferred files exceeded the size of main memory (probably some broken caches started to flush or release pages). It seemed to me as if the OS thought that the modified page had already been written even though it had not, or the OS did not notice that the page had changed at all.

After lots of testing we changed from mmap()ing files to open/write/close and the problem disappeared (we must use xfs_alloc() to lower fragmentation). I hope this helps you solve the problem.
I've just looked at the test program provided in comment #2 and I see there are several bugs in it.

First of all, XFS_IOC_RESVSP64 does not change the file size. Hence doing:

    fd = open(xxx, O_TRUNC|O_CREAT|O_RDWR);
    ioctl(fd, XFS_IOC_RESVSP64, &space);

leaves you with a zero-length file. If you then try to fault a page from the file, you _correctly_ get a SIGBUS because you tried to read beyond EOF. You _must_ ftruncate() the file to the correct size (the range you preallocated) before you fault any pages. The test program provided even does this if you tell it to, and it is commented that it works. It works because the file size is set correctly!

Secondly, the XFS_IOC_ALLOCSP64 case in the test program uses:

    space.l_whence = SEEK_SET;
    space.l_start = 0;
    space.l_len = o;

If you read the man page for XFS_IOC_ALLOCSP64 carefully:

    .... l_whence is 0, 1, or 2 to indicate that the relative offset l_start will be measured from the start of the file. .... l_len is the size of the section. An l_len value of zero frees up to the end of the file; ..... The l_len field is currently ignored, and should be set to zero.

The key is that last sentence: l_len is ignored and should be set to zero. And a zero value says "truncate file to l_start". So the effect of the test code is the equivalent of ftruncate64(fd, 0); i.e. it sets the file size to zero, and hence page faults occur past EOF and the process _correctly_ gets a SIGBUS.

IOWs, I don't see an XFS bug in the test case provided; just user error. Comment #3 tends to imply a different problem. A test case for that would be helpful....
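To illustrate the point about file size, here is a minimal C sketch of a guard that refuses to map a range the file size does not yet cover, since touching a shared mapping beyond EOF is what raises the SIGBUS. The helper names map_within_eof() and set_size_and_touch() are hypothetical and are not taken from the attached test program. The XFS reservation ioctl itself is left out (it needs the XFS headers) and only its position is marked in a comment.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the first len bytes of fd read/write, but only
 * if the file size already covers them. Returns MAP_FAILED (errno EINVAL)
 * when the file is too short, instead of handing back a mapping whose
 * pages would raise SIGBUS on first touch.
 */
void *map_within_eof(int fd, size_t len)
{
    struct stat st;

    assert(len > 0);
    if (fstat(fd, &st) != 0)
        return MAP_FAILED;
    if (st.st_size < (off_t)len) {
        errno = EINVAL;
        return MAP_FAILED;
    }
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/*
 * Hypothetical helper: set the file size, then map it and dirty the first
 * and last byte. Returns 0 on success, -1 on failure.
 */
int set_size_and_touch(int fd, size_t len)
{
    unsigned char *p;
    int rc;

    /* On XFS, the space reservation ioctl would be issued around here.
     * It reserves blocks but leaves the file size unchanged. */
    if (ftruncate(fd, (off_t)len) != 0)
        return -1;

    p = map_within_eof(fd, len);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 1;
    p[len - 1] = 1;

    rc = msync(p, len, MS_SYNC);   /* push the dirty pages out */
    munmap(p, len);
    return rc;
}
```

Calling map_within_eof() on a freshly created (zero-length) file fails cleanly, while set_size_and_touch() on the same descriptor succeeds because ftruncate() has extended EOF first.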
Ok, so running the test program properly (silly cut'n'paste problem), I see the problem. Sorry, my fault.

I modified the test case to a 128k file (smaller, simpler), then ran it once to get an unwritten extent on disk. I then wrote from /dev/urandom into the file and sync'd that to disk. Then, using direct I/O from the block device, I read the data that was written to disk to make sure it was there. It was.

Taking advantage of the fact that if you truncate a file and then write back to it, you typically get the same extent, I modified the test program to use O_TRUNC on open() rather than needing to rm it every time. This puts the unwritten extent straight back over the top of the blocks that were freed at open and that we wrote random data to.

Then I ran the test program again and read the data back off disk from the block device to see whether the changes made by mmap actually hit the disk. The first 16k (this is an Altix with 16k page size I'm testing on) of the blocks on disk had 0xc3 as the first byte and zeros for the rest. The last byte of the blocks on disk was 0xa5 and the rest of the last 16k of the blocks on disk was zero. The modification in the middle of 0xb4 was there, surrounded by zeros as well.

IOWs, the data written by mmap hit the disk but we did not do unwritten extent conversion when we wrote the data out. Clearly, this is because the page is initially read from disk by the page fault code, and then later it marks the page dirty. The problem here is that the read from disk does not mark the buffers on the page unwritten if we are reading from an unwritten extent. Hence when it gets dirtied and written out, all we do is allocate the underlying disk space; we don't actually do unwritten extent conversion on it because it is not an unwritten extent.
Given this, I think that if I read() from an unwritten extent, then write() to that same region, the write will not cause unwritten extent conversion as there will already be mapped buffers on the page and so we won't remap the page and hence can't get the unwritten state set.

From my history:

    1112 do_trunc= do_mmap= ./resv 131072 blaat
    1113 dd if=/dev/urandom of=blaat bs=128k count=1 conv=notrunc
    1114 do_trunc= do_mmap= ./resv 131072 blaat
    1115 ls -lh blaat && du -sh blaat && xfs_bmap -v blaat
         [get block number from xfs_bmap -v output]
         [single unwritten extent]
    1116 dd if=/dev/mapper/test_vg-fred of=t.t bs=512 skip=770776 count=256 iflag=direct;
    1117 od -x -A x t.t > t.tt
         [check t.tt for zero regions in start middle and end]
         [still a single unwritten extent]
    1119 dd if=blaat of=/dev/null bs=16384 skip=2 count=1; dd if=/dev/zero of=blaat bs=16384 seek=2 count=1 conv=notrunc; sync
    1120 dd if=/dev/mapper/test_vg-fred of=t.t bs=512 skip=770776 count=256 iflag=direct; od -x -A x t.t > t.tt
    1121 ls -lh blaat && du -sh blaat && xfs_bmap -v blaat
         [still a single unwritten extent]
         [check t.tt - new zero region at offset 32k for 16k]

So, I can reproduce the mmap behaviour with read() and write().

Hmmmm - I think that delalloc has the same problem as well, only that will cause unreserved allocation during writeback - that could cause problems near ENOSPC, I think.

So I know what the problem is, I'll have to think about how to fix it now. More later.
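The manual inspection of the od dump for the three marker bytes could also be done programmatically. The following C sketch checks a buffer (for example the contents of t.t read into memory) for 0xc3 in the first byte, 0xb4 in the middle and 0xa5 in the last byte, with zeros everywhere else. The function name has_marker_pattern() is hypothetical, and placing the middle marker at offset len / 2 is an assumption, since the comments above do not state the exact offset the test program uses.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if buf holds exactly the marker pattern
 * described above (0xc3 first, 0xb4 in the middle, 0xa5 last, zeros
 * elsewhere), and 0 otherwise.
 */
int has_marker_pattern(const unsigned char *buf, size_t len)
{
    size_t mid = len / 2;   /* ASSUMPTION: middle marker sits at len / 2 */
    size_t i;

    assert(buf != NULL);
    if (len < 3)
        return 0;           /* too short to hold three distinct markers */

    for (i = 0; i < len; i++) {
        unsigned char want = 0x00;

        if (i == 0)
            want = 0xc3;
        else if (i == mid)
            want = 0xb4;
        else if (i == len - 1)
            want = 0xa5;

        if (buf[i] != want)
            return 0;
    }
    return 1;
}
```

A match only shows that the mmap()ed data reached the disk blocks. It says nothing about whether the extent was converted from unwritten, which still has to be read off the xfs_bmap output.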
> So, I can reproduce the mmap behaviour with read() and write().

No, I can't actually. xfs_bmap -v <file> can be misleading. xfs_bmap -vp <file> will actually break down contiguous extents in different states and show that there is a mix of written and unwritten extents. That makes sense, though: from reading the code I could not work out why write() wasn't doing the right thing....
I believe you wanted to push a bugfix in .21 or .22, as noted in "RESVSP problems" thread on the ML. I still get this misbehaviour with .23. What's the hold up?
The fix is in 2.6.23. What is your test case that is failing?
Okay, retested and it indeed works in .23. Mea culpa. Thus, the bug can be closed now, I think.
Fixed in 2.6.23