Re: ftruncate() Writes Last Block of File

To: Alan Cook <acook@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: ftruncate() Writes Last Block of File
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Wed, 21 Mar 2012 15:22:03 +1100
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <loom.20120319T153420-855@xxxxxxxxxxxxxx>
References: <loom.20120319T153420-855@xxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Mar 19, 2012 at 02:44:33PM +0000, Alan Cook wrote:
> I have three questions regarding the XFS implementation of ftruncate().  In
> the block device driver, I can see that writes are being performed to the
> last block of previously written file when ftruncate() is called.  I believe
> that I found ftruncate() in the XFS sources, but all I see is the filesize
> being updated in the inode.  So if ftruncate() is writing to the last block,
> it appears to be a triggered event.

Sure, you're triggering a flush-on-truncate heuristic because the
on-disk size does not match what is about to be logged from the
in-memory size.

Say for example, I write 1MB to a file, then truncate it back to 8k.
In memory before the truncate, you have this data:

   0    4k    8k    12k 1020k  1M
   +----+-----+-----+.....+-----+
                                ^ inode size = 1048576

And on disk you have this:

  0
  +
  ^ inode size = 0

because no data has been written back yet and the on disk inode size
does not get updated until after the data IO completes.

Hence if you now run a truncate, we have this in memory:

   0    4k    8k
   +----+-----+
              ^ inode size = 8192

And we have this on disk:

  0
  +
  ^ inode size = 0

And we have this in the log:

   0    4k    8k
   +          +
              ^ inode size = 8192

So if we crash at this point, log recovery will set the inode size
to 8192 but there is no data in the file because it never got
written by the kernel. Hence reading the file after recovery would
expose stale data in the file (bad!).

Therefore, before the truncate is done, we write to disk the dirty
data that lies between the current on-disk EOF and the new EOF that
is about to be logged, so we have this state on disk:

   0    4k    8k
   +----+-----+
   ^ inode size = 0

where the blocks are allocated and the data is on disk. Hence when
the truncate transaction is completed, the state in the log:

   0    4k    8k
   +          +
              ^ inode size = 8192

overlaid with the state on disk gives the correct result if a crash
occurs and log recovery is run.

> To test, I added printk() statements in the block device driver that
> outputs jiffies for write operations.  A file is created and written
> (~1 MiB), and then truncated to 8192 via ftruncate().  The original write
> to file happens about 20 jiffies before the call to ftruncate().  When
> looking at the output, there is an additional write to what is the last
> block of the truncated file, which reports the same jiffies as the call to
> ftruncate().

That's what I'd expect from the behaviour described above.

> Does ftruncate() actually write to the last block of the file?  If not, any
> thoughts on what would be?  It only happens when ftruncate() is called.

It depends on the state of the file. If you do
write/fsync/ftruncate, then you won't see ftruncate write any data
because the state on disk is consistent with what is in memory.

> Where in the XFS kernel code is ftruncate() implemented?

xfs_setattr_size().

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
