
Re: xfs corruption with XFS_IOC_RESVSP

To: Nathan Scott <nathans@xxxxxxx>
Subject: Re: xfs corruption with XFS_IOC_RESVSP
From: Miquel van Smoorenburg <miquels@xxxxxxxxxxx>
Date: Fri, 26 Nov 2004 13:58:10 +0000
Cc: Miquel van Smoorenburg <miquels@xxxxxxxxxxx>, linux-xfs@xxxxxxxxxxx
In-reply-to: <20041126000611.GB953@frodo> (from nathans@xxxxxxx on Fri Nov 26 01:06:11 2004)
References: <1101413726l.13697l.0l@xxxxxxxxxxxxxxxxxxxxx> <20041126000611.GB953@frodo>
Sender: linux-xfs-bounce@xxxxxxxxxxx

On 2004.11.26 01:06, Nathan Scott wrote:
Hi Mike,

On Thu, Nov 25, 2004 at 08:15:26PM +0000, Miquel van Smoorenburg wrote:

> Now I installed this on a Dual Xeon with 4GB memory that has a really
> high load (hundreds of simultaneous database connections, 6-disk RAID5
> 100% loaded all of the time). Suddenly, after a week or so of
> running, the database files got corrupt - NULs and random binary junk
> in the middle of the files.

Is this also on a 2.6.9 kernel or could it be an older one?

It's a 2.6.9 kernel, the only patch it has is the xfs-shutup-kmem one.

I fixed a problem like this some time ago.  The reason for the
NULLs on read is likely to be as follows...

When we do the space reservation, we create unwritten extents.
Reads into those are defined to return zeroes (just like reads
into a hole would), and writes cause the extent to be adjusted so
that the written part is now marked as a regular extent, and any
remaining unwritten parts continue to be marked as unwritten.
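
The read semantics described above can be sketched from user space. This is only an illustration under assumptions: it uses posix_fallocate() as a portable stand-in for XFS_IOC_RESVSP (unlike the XFS ioctl, it also extends the file size), and the function name and scratch path are made up for the example. On a healthy filesystem the written part should read back as data and the rest of the preallocated range as NULs.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Preallocate `len` bytes in a scratch file, overwrite the first `wlen`
 * bytes with 'x', then read the whole range back.
 * Returns the number of bytes that differ from the expectation
 * ('x' in the written part, NUL in the still-unwritten part),
 * or -1 on error.
 */
long check_prealloc_reads(size_t len, size_t wlen)
{
    char path[] = "/tmp/resv-demo-XXXXXX";  /* hypothetical scratch location */
    long bad = 0;
    char *buf;
    int fd;

    assert(wlen <= len);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file goes away when fd closes */

    buf = malloc(len ? len : 1);
    if (buf == NULL || posix_fallocate(fd, 0, (off_t)len) != 0) {
        free(buf);
        close(fd);
        return -1;
    }

    memset(buf, 'x', wlen);
    if (pwrite(fd, buf, wlen, 0) != (ssize_t)wlen ||
        pread(fd, buf, len, 0) != (ssize_t)len) {
        bad = -1;
    } else {
        for (size_t i = 0; i < len; i++)
            if (buf[i] != (i < wlen ? 'x' : '\0'))
                bad++;
    }

    free(buf);
    close(fd);
    return bad;
}
```

Calling check_prealloc_reads(65536, 4096) preallocates 64 KiB, writes the first 4 KiB, and counts any byte that is neither the written data nor a NUL in the untouched tail.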

So, in your case a reader after the write must be still seeing an
extent marked as unwritten.  This could be either a bug in your
program :) with the read happening before the write

Well actually, I let the 4 processes write the file in parallel,
each adding 500-1200 bytes at the end, until the file has grown
to 10 M after which I stop the write process and read the file
to check for inconsistencies.
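
A rough sketch of that kind of test, for anyone who wants to try it. The real test program is not shown in this thread, so everything here is an assumption: the function names, the fill-byte record format, and the check itself. The sketch also leaves out the XFS_IOC_RESVSP call and only shows the parallel O_APPEND writers and the final scan.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWRITERS 4      /* four parallel writers, as in the test described */

/* Child: append records of 500..1200 fill bytes until the file reaches target. */
static void append_until(const char *path, off_t target, int id)
{
    char rec[1200];
    int fd = open(path, O_WRONLY | O_APPEND);

    if (fd < 0)
        _exit(1);
    memset(rec, 'A' + id, sizeof rec);  /* writer 0 writes 'A', 1 writes 'B', ... */
    srand((unsigned)getpid());
    while (lseek(fd, 0, SEEK_END) < target) {
        size_t n = 500 + (size_t)(rand() % 701);    /* 500..1200 bytes */
        if (write(fd, rec, n) != (ssize_t)n)
            _exit(1);
    }
    _exit(0);
}

/*
 * Run the writers against `path`, then scan the file.
 * Returns the number of bytes that are not any writer's fill byte
 * (NULs or junk), or -1 on error.
 */
long run_append_check(const char *path, off_t target)
{
    char buf[65536];
    long bad = 0;
    ssize_t n;
    int started = 0, failed = 0, status, fd;

    assert(target > 0);
    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    for (int i = 0; i < NWRITERS; i++) {
        pid_t pid = fork();
        if (pid == 0)
            append_until(path, target, i);  /* never returns */
        if (pid < 0) {
            failed = 1;
            break;
        }
        started++;
    }
    while (started-- > 0)
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    if (failed)
        return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] < 'A' || buf[i] >= 'A' + NWRITERS)
                bad++;
    close(fd);
    return n < 0 ? -1 : bad;
}
```

Something like run_append_check("/path/on/xfs/test.dat", 10 * 1024 * 1024) would match the 10 M target mentioned above; a non-zero result means NULs or junk ended up in the file. The file is left in place so it can be inspected afterwards.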

The particular issue I fixed a while ago is codified in test 084
(xfs-cmds/xfstests/{084,src/resvtest.c}) and matches yours pretty
well, hence my question about kernel version.  The trigger there
was VM pressure to reclaim dirty pages, where those pages are
backed by unwritten extents.

I ran that (with -l $((512*1024*1024)) on a 512M RAM machine) a
few times, but I still can't reproduce the problem .. alas.

But, backing up to your original problem.  XFS actually does
some preallocation for writes beyond EOF already (in "biosize"
chunks).  When the file is closed, we go down the xfs_release
path, and any excess is trimmed off the end (trimmed back to a
filesystem block - xfs_inactive_free_eofblocks).  But, if you
are writing to those files append-only, you could mark them as
such (i.e. chattr +a) -- this will skip the trim-back step and
reduce fragmentation for you.
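
For reference, the same flag that chattr +a sets can be toggled from a program with the generic inode-flags ioctls. This is a minimal sketch, not anything taken from the XFS code: set_append_only is a made-up helper name, and the call normally fails with EPERM unless the process has the CAP_LINUX_IMMUTABLE capability.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_APPEND_FL */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Set (on != 0) or clear (on == 0) the append-only inode flag on `path`.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_append_only(const char *path, int on)
{
    int flags, ret, saved, fd;  /* the kernel reads and writes an int here */

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ret = ioctl(fd, FS_IOC_GETFLAGS, &flags);
    if (ret == 0) {
        if (on)
            flags |= FS_APPEND_FL;
        else
            flags &= ~FS_APPEND_FL;
        ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
    }

    saved = errno;              /* keep the ioctl's errno across close() */
    close(fd);
    errno = saved;
    return ret == 0 ? 0 : -1;
}
```

Once the flag is set, the file can only be opened for appending and cannot be removed or renamed until the flag is cleared again with set_append_only(path, 0).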

Yes, but only root can do that - also, chattr +a has all kinds
of other side effects I don't want. Couldn't this be made a
mount option, or an ioctl?

I understand that ext3 also has preallocation support nowadays,
perhaps a unified mount option or ioctl would be a good idea.

Attached is a first cut at a patch for a mount option - I
understand that functional changes like this might not be
added (right away, at least) to stable filesystems like XFS,
and perhaps it's the wrong approach but it might be useful
to other people. Right now I have my filesystem mounted with
-o rw,noatime,biosize=18,pkeep and it appears to do what I
want without application changes.
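
A note on the numbers, assuming biosize is read as the base-2 logarithm of the preferred buffered I/O size, which is how the XFS mount documentation of this era describes it. The helper name below is invented for the example, and pkeep is the option added by the attached patch, not a mainline one.

```c
#include <assert.h>

/* Convert a log2-style size (as assumed for biosize) to a byte count. */
unsigned long long biosize_to_bytes(unsigned log2_size)
{
    assert(log2_size < 64);     /* shift must stay within 64 bits */
    return 1ULL << log2_size;
}
```

Under that assumption, biosize=18 works out to 1 << 18 = 262144 bytes, i.e. preallocation in 256 KiB chunks.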



Attachment: linux-2.6.9-xfs-pkeep.patch
Description: Text Data
