[Top] [All Lists]

Re: fs corruption exposed by "xfs: increase prealloc size to double that

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
From: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Date: Mon, 17 Mar 2014 01:28:04 +0000
Cc: Brian Foster <bfoster@xxxxxxxxxx>, linux-fsdevel@xxxxxxxxxxxxxxx, Dave Chinner <dchinner@xxxxxxxxxx>, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20140317002918.GT18016@xxxxxxxxxxxxxxxxxx>
References: <20140315210216.GP18016@xxxxxxxxxxxxxxxxxx> <20140317001130.GA7072@dastard> <20140317002918.GT18016@xxxxxxxxxxxxxxxxxx>
Sender: Al Viro <viro@xxxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Mar 17, 2014 at 12:29:18AM +0000, Al Viro wrote:

> I think I know what's going on - O_DIRECT write starting a bit before
> EOF on a file with the last extent that can be grown.  It fills
> a buffer_head with b_size extending quite a bit past the EOF; the
> blocks are really allocated.  What causes the problem is that we
> have the flags set for the *first* block.  IOW, buffer_new(bh) is
> false - the first block has already been allocated.  And for
> direct-io.c it means "no zeroing the tail of the last block".

BTW, that's something I have directly observed - xfs_get_blocks_direct()
called with iblock corresponding to a bit under 16Kb below EOF and
returning with ->b_size equal to 700K and ->b_flags not containing BH_New.

Once that has happened, the rest is inevitable - do_direct_IO() will
shove the userland data (~80Kb) into the blocks starting from one
in ->b_blocknr and dio_zero_block() called after it will do nothing,
since as far as it knows these blocks were not new - BH_New isn't there.

We may or may not get a visible junk (after all, the block tail might have
contained zeros), but the situation above is easily reproduced and when
it happens, there will be no zeroing out.  Actually, I'm not sure if
the comment in direct-io.c (just before get_more_blocks()) is correct; it
 * If the fs has mapped a lot of blocks, it should populate bh->b_size to  
 * indicate how much contiguous disk space has been made available at
 * bh->b_blocknr.  
 * If *any* of the mapped blocks are new, then the fs must set buffer_new().
 * This isn't very efficient...
but that is unsafe - consider a situation when you are writing 20Kb from e.g.
0.5Kb offset from the beginning of last (4Kb) block.  You have 6 blocks
affected, right?  One old, five new.  And you want the last half-kilobyte
of the new last block zeroed out.  What you do _not_ want is zeroing out the
half-kilobyte just before the affected area - the data there should stay.

IOW, we really can't mix new and old blocks in that interface - not enough
information is passed back to caller to be able to decide what does and
what does not need zeroing out.  It should be either all-new or all-old.

And it's not just the EOF, of course - the beginning of a hole in a sparse
file isn't any different from the end of file in that respect.

<Prev in Thread] Current Thread [Next in Thread>