Bug 418 - unwritten extents remain unwritten after mmap() modifies them
Status: RESOLVED FIXED
Product: XFS
Component: XFS kernel code
Version: Current
Platform: PC Linux
Importance: P2 normal
 
Reported: 2005-07-28 15:42 CST
Modified: 2007-12-04 10:45 CST


Attachments
test program to trigger the bug (1.68 KB, text/plain)
2005-07-28 15:53 CST, Wessel Dankers

Description From 2005-07-28 15:42:40 CST
If a file is allocated using XFS_IOC_RESVSP64, then mmap()ed with PROT_WRITE and
written to (by altering bytes in the mmap()ed memory region), the extents remain
marked as unwritten.
------- Comment #1 From 2005-07-28 15:53:39 CST -------
Created an attachment (id=159)
test program to trigger the bug

Reserves a region using either XFS_IOC_RESVSP64 or XFS_IOC_ALLOCSP64, then
writes to it using several possible methods.
Particular invocations that demonstrate the bug:

write()ing to the reserved area gives us the expected results:

puin:/tmp/pruts% gcc -Wall -Os -s resv.c -o resv && rm -f blaat && do_write= ./resv 167772160 blaat && ls -lh blaat && du -sh blaat && xfs_bmap -vp blaat
-rw-r--r--  1 wsl wsl 160M 2005-07-27 22:44 blaat
160M	blaat
blaat:
 EXT: FILE-OFFSET	BLOCK-RANGE	 AG AG-OFFSET		TOTAL FLAGS
   0: [0..7]:		1149760..1149767  0 (1149760..1149767)	    8
   1: [8..163839]:	1149768..1313599  0 (1149768..1313599) 163832 10000
   2: [163840..163847]: 1313600..1313607  0 (1313600..1313607)	    8
   3: [163848..327671]: 1313608..1477431  0 (1313608..1477431) 163824 10000
   4: [327672..327679]: 1477432..1477439  0 (1477432..1477439)	    8


mmap()ing it then writing to the mapped memory gives us a SIGBUS:

puin:/tmp/pruts% gcc -Wall -Os -s resv.c -o resv && rm -f blaat && do_mmap= ./resv 167772160 blaat && ls -lh blaat && du -sh blaat && xfs_bmap -vp blaat
zsh: 1464 bus error  do_write= do_mmap= ./resv 167772160 blaat


ftruncate()ing it, then mmap()ing it and writing to the mapped memory gives us:


puin:/tmp/pruts% gcc -Wall -Os -s resv.c -o resv && rm -f blaat && do_trunc= do_mmap= ./resv 167772160 blaat && ls -lh blaat && du -sh blaat && xfs_bmap -vp blaat
-rw-r--r--  1 wsl wsl 160M 2005-07-27 22:44 blaat
160M	blaat
blaat:
 EXT: FILE-OFFSET      BLOCK-RANGE	AG AG-OFFSET	       TOTAL FLAGS
   0: [0..327679]:     1149760..1477439  0 (1149760..1477439) 327680 10000
------- Comment #2 From 2005-07-28 15:55:30 CST -------
Used kernel: stock 2.6.12.3
------- Comment #3 From 2005-09-02 14:06:25 CST -------
We probably encountered the same problem.

Our program is a kind of 'file distribution agent' used in distributed
computing. It receives data from a network connection, stores it in a file,
and forwards it over another network connection to the next computer on
a list.

Its daily job is to distribute about 300 files with an average size of 600MB.
Each file is created with ftruncate(), then preallocated with xfs_alloc(), and
then mmap()ed. Data is written into the mmap()ed memory and then msync()ed.

Everything worked fine until we started using the xfs_alloc() or xfs_resvsp()
functions to pre-allocate files (this was necessary to keep file fragmentation
low). The problem was that parts of the files were not written to disk and were
regularly lost! It showed up especially when the volume of transferred files
exceeded the size of main memory (probably some broken caches started to
flush or release pages).

It seemed to me as if the OS thought that the modified page had already been
written when it really had not, or the OS did not notice that the page had
changed at all.

After a lot of testing we switched from mmap()ing files to open/write/close
and the problem disappeared (we must use xfs_alloc() to reduce
fragmentation).

I hope this helps you solve the problem.
------- Comment #4 From 2007-02-05 22:48:33 CST -------
I've just looked at the test program provided in comment #1 and I see
there are several bugs in it.
 
First of all, XFS_IOC_RESVSP64 does not change the file size. Hence
doing:
 
        fd = open(xxx, O_TRUNC|O_CREAT|O_RDWR);
        ioctl(fd, XFS_IOC_RESVSP64, &space);
 
leaves you with a zero-length file. If you then try to fault a page
from the file, you _correctly_ get a SIGBUS because you tried to
read beyond EOF.
 
You _must_ ftruncate() the file to the correct size (the range you
preallocated) before you fault any pages. The test program provided
even does this if you tell it to, and it is noted that this works. It works
because the file size is set correctly!
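
The ordering described above can be sketched in C as follows. This is only an illustration, not the attached resv.c: the helper name mmap_touch() is hypothetical, the XFS preallocation ioctl is left out so that the sketch builds without the XFS headers, and the marker bytes 0xc3, 0xb4 and 0xa5 are borrowed from the values reported in comment #5. The point is only that the file size is set before any page of the mapping is touched.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of resv.c): set the file size to len,
 * map the file shared and writable, store one marker byte at the start,
 * middle and end of the mapping, then msync() it.
 * Returns 0 on success, -1 on any failure.
 */
int mmap_touch(int fd, size_t len)
{
    struct stat st;
    unsigned char *p;
    int rc;

    assert(len >= 3);

    /*
     * A preallocation call such as ioctl(fd, XFS_IOC_RESVSP64, &space)
     * would go here (omitted in this sketch). It reserves blocks but
     * leaves the file size unchanged.
     */

    /* Set the file size first; touching pages past EOF raises SIGBUS. */
    if (ftruncate(fd, (off_t)len) != 0)
        return -1;

    /* Double-check that the whole range now lies inside the file. */
    if (fstat(fd, &st) != 0 || (size_t)st.st_size < len)
        return -1;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 0xc3;        /* first byte */
    p[len / 2] = 0xb4;  /* middle byte */
    p[len - 1] = 0xa5;  /* last byte */

    /* Push the dirty pages out before unmapping. */
    rc = msync(p, len, MS_SYNC);
    if (munmap(p, len) != 0)
        rc = -1;
    return rc;
}
```

The helper expects a descriptor opened with O_RDWR. On an XFS filesystem, running xfs_bmap -vp on the file afterwards shows whether the touched ranges were converted from unwritten extents.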
 
Secondly, the XFS_IOC_ALLOCSP64 case in the test program uses:
 
         space.l_whence = SEEK_SET;
         space.l_start = 0;
         space.l_len = o;
 
If you read the man page for XFS_IOC_ALLOCSP64 carefully:
 
.... l_whence is 0, 1, or 2 to indicate that the relative offset l_start 
will be measured from the start  of  the  file. ....  l_len is the size 
of the section.  An l_len value of zero frees up to the end of the file; 
.....  The l_len field is currently ignored, and should be set to zero. 
 
The key is that last sentence - l_len is ignored and should be set to zero. 
And a zero value says "truncate file to l_start". 
 
So the effect of the test code is the equivalent of ftruncate64(fd, 0); 
i.e. set the file size to zero and hence page faults occur past EOF 
and the process _correctly_ gets a sigbus. 
 
IOWs, I don't see an XFS bug in the test case provided; just user error. 
 
Comment #3 tends to imply a different problem - a test case for that 
would be helpful.... 
 
------- Comment #5 From 2007-02-06 01:35:38 CST -------
Ok, so running the test program properly (silly cut'n'paste problem), 
I see the problem. Sorry, my fault. 
 
I modified the test case to use a 128k file (smaller, simpler), then
ran it once to get an unwritten extent on disk. I then wrote
from /dev/urandom into the file and sync'd that to disk. Then,
using direct I/O from the block device, I read the data that was
written to disk to make sure it was there. It was.
 
Taking advantage of the fact that if you truncate a file and then
write back to it, you typically get the same extent, I modified the
test program to use O_TRUNC on open() rather than needing to rm it
every time. This puts the unwritten extent straight back over the
top of the blocks that were freed at open and that we wrote random
data to.
 
Then I ran the test program again and read the data back off disk
from the block device to see whether the changes made by mmap actually
hit the disk. The first 16k (this is an Altix with a 16k page size I'm
testing on) of the blocks on disk had 0xc3 as the first byte and zeros
for the rest. The last byte of the blocks on disk was 0xa5 and the
rest of the last 16k of the blocks on disk was zero. The modification
in the middle of 0xb4 was there, surrounded by zeros as well.
 
IOWs, the data written by mmap hit the disk but we did not do unwritten 
extent conversion when we wrote the data out. 
 
Clearly, this is because the page is initially read from disk by the 
page fault code, and then later it marks the page dirty. The problem here 
is the read from disk does not mark the buffers on the page unwritten 
if we are reading from an unwritten extent. Hence when it gets dirtied 
and written out, all we do is allocate the underlying disk space; we 
don't actually do unwritten extent conversion on it because it is 
not an unwritten extent. 
 
Given this, I think that if I read() from an unwritten extent, then
write() to that same region, the write will not cause unwritten extent 
conversion as there will already be mapped buffers on the page and 
so we won't remap the page and hence can't get the unwritten state set. 
 
From my history: 
 
 
 1112  do_trunc= do_mmap= ./resv 131072 blaat 
 1113  dd if=/dev/urandom of=blaat bs=128k count=1 conv=notrunc 
 1114  do_trunc= do_mmap= ./resv 131072 blaat 
 1115  ls -lh blaat && du -sh blaat && xfs_bmap -v blaat 
[get block number from xfs_bmap -v output] 
[single unwritten extent] 
 1116  dd if=/dev/mapper/test_vg-fred of=t.t bs=512 skip=770776 count=256 iflag=direct;
 1117  od -x -A x t.t > t.tt
[check t.tt for zero regions in start middle and end]
[still a single unwritten extent]
 1119  dd if=blaat of=/dev/null bs=16384 skip=2 count=1; dd if=/dev/zero of=blaat bs=16384 seek=2 count=1 conv=notrunc; sync
 1120  dd if=/dev/mapper/test_vg-fred of=t.t bs=512 skip=770776 count=256 iflag=direct; od -x -A x t.t > t.tt
 1121  ls -lh blaat && du -sh blaat && xfs_bmap -v blaat 
[ still a single unwritten extent] 
[ check t.tt - new zero region at offset 32k for 16k ] 
 
So, I can reproduce the mmap behaviour with read() and write(). 
 
Hmmmm - I think that delalloc has the same problem as well, only
that will cause unreserved allocation during writeback - that could 
cause problems near ENOSPC, I think. 
 
So I know what the problem is, I'll have to think about how to fix it 
now. More later. 
------- Comment #6 From 2007-02-06 03:48:57 CST -------
> So, I can reproduce the mmap behaviour with read() and write(). 
 
No, I can't actually. xfs_bmap -v <file> can be misleading. 
xfs_bmap -vp <file> will actually break down contiguous extents 
in different states and show that there is a mix of written and
unwritten extents.
 
Makes sense, though, from reading the code I could not work out why 
write() wasn't doing the right thing.... 
------- Comment #7 From 2007-11-30 05:24:31 CST -------
I believe you wanted to push a bugfix in .21 or .22, as noted in "RESVSP 
problems" thread on the ML. I still get this misbehaviour with .23. What's the 
hold up?
------- Comment #8 From 2007-12-02 14:41:20 CST -------
The fix is in 2.6.23. What is your test case that is failing? 
------- Comment #9 From 2007-12-04 08:22:58 CST -------
Okay, retested and it indeed works in .23. Mea culpa.

Thus, the bug can be closed now, I think.
------- Comment #10 From 2007-12-04 08:45:08 CST -------
Fixed in 2.6.23