On Wed, May 15, 2013 at 10:03:55AM +0200, Andras Korn wrote:
> On Wed, May 15, 2013 at 10:47:36AM +1000, Dave Chinner wrote:
> > > I have thousands of files on xfs whose inode claims their size is zero,
> > > but
> > > they have blocks allocated to them; du(1) reports a nonzero size.
> > > xfs_repair
> > > 3.1.9 ignores this. xfs_db can be used to recover the contents of the
> > > files
> > > (using commands like inode 1234; write core.size 4567).
> > >
> > > David Chinner explained to me that xfs_repair ignores these files because
> > > it's legitimate to have blocks beyond eof (e.g. due to fallocate()), and
> > Actually due to speculative preallocation done by delayed
> > allocation.
> But if the space is preallocated speculatively and then remains unused,
> shouldn't it get freed up somehow, eventually?
It does, whichever happens first:
- the file is closed if the "keep on close" heuristic hasn't
- the file has been clean for 5 minutes (background sweeper)
- the inode is reclaimed from cache.
If the system crashes, it is not cleaned up on reboot until the file
is read into cache again and the above can happen...
> > > that unwritten extent flagging doesn't help because such extents don't
> > > need
> > > to be flagged as it's impossible to read them.
> > fallocate will leave unwritten extents beyond EOF, in which case we
> > can detect it, but we know there's nothing to be done as there's no
> > data....
> OK, so we have the following cases:
> 1. fallocate. The file has more space allocated to it than the size field in
> its inode says, and the extra extents are flagged as unwritten.
> 2. speculative preallocation. Same as above, but the extents are not flagged
> as unwritten.
> 3. corruption. The file's inode reports an incorrect size for whatever
> reason. If it's too much, that's easy to detect; the problem is telling if
> it's too little.
> Distinguishing between #1 and #2 is possible based on the unwritten flags.
> In case #2, I think the extra space should be possible to free up (perhaps
> by xfs_repair?). Of course, I suppose you could just run truncate(1) on all
> The problem is recognising #3.
> Do I have that right?
But iin case #2 it is not xfs_repair's place to remove blocks that
are legitimately attached to an inode and correctly referenced. If
it does that, then you lose any hope of recovering from a missing
file size update on a crash as you lose all references to the data
that is on disk. i.e. you can't find the data to recover it.
> > > I think it would be useful to have the ability to distinguish between
> > > files
> > > that legitimately have extents beyond EOF and files whose inode
> > > incorrectly
> > > reports a too-small size.
> > How? Add a transaction to track the data that has been written?
> > Well, we already do that - with the inode size.
> Ah, yes, that's true.
> OK, thinking about it I realise there doesn't appear to be a good way of
> preventing the problem, but I'm still not sure some heuristic couldn't be
> invented to detect and partially remedy it after the fact.
Trying to remedy it in xfs_repair does more harm than good. What
happens now allows recovery of data if the inode size was wrong. If
we remove the blocks beyond EOF, we lose that ability and hence make
unrecoverable data loss more likely in common failure scenarios.