

To: Miquel van Smoorenburg <miquels@xxxxxxxxxxx>
Subject: Re: xfs corruption with XFS_IOC_RESVSP
From: Nathan Scott <nathans@xxxxxxx>
Date: Fri, 26 Nov 2004 11:06:11 +1100
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <1101413726l.13697l.0l@xxxxxxxxxxxxxxxxxxxxx>
References: <1101413726l.13697l.0l@xxxxxxxxxxxxxxxxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: Mutt/1.5.3i

Hi Mike,

On Thu, Nov 25, 2004 at 08:15:26PM +0000, Miquel van Smoorenburg wrote:
>  ...
> I have an application that appends slowly and randomly to tens of
> thousands of database files, which are later read sequentially. Because
> the files are opened, written to (a few hundred bytes) and closed,
> all randomly, fragmentation is enormous.
>  ...
> This actually works as expected and has been running on several machines
> for quite some time with a 2.6.9 Linux kernel.
> 
> Now I installed this on a Dual Xeon with 4GB memory that has a really
> high load (hundreds of simultaneous database connections, 6-disk RAID5
> that is 100% loaded all of the time). Suddenly, after a week or so of
> running, the database files got corrupt - NULs and random binary junk
> in the middle of the files.

Is this also on a 2.6.9 kernel or could it be an older one?
I fixed a problem like this some time ago.  The reason for the
NULLs on read is likely to be as follows...

> I wrote a small app to recreate the I/O patterns and sure enough, the
> same damage to the files:
> 
>       lockf(fd, LOCK);
>       pos = lseek(fd, 0, SEEK_END);
>       if (pos + num_bytes_to_write_would_cross_modulo_256K_boundary)
>               prealloc(another 256K at boundary)
>       write(fd, buf, num_bytes_to_write);
>       lockf(fd, UNLOCK);
> 
> I ran this in 4 processes on the same file and corruption would show up -
> usually a bunch of NULs were detected in the file.
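
For reference, a fleshed-out sketch of that pattern (names are hypothetical;
it assumes the xfsprogs <xfs/xfs.h> header for xfs_flock64_t and
XFS_IOC_RESVSP64, and uses flock() for the whole-file lock).  Running
several concurrent copies of this against one file mimics the test above:

/*
 * Sketch only: append a record, preallocating the next 256K chunk with
 * XFS_IOC_RESVSP64 whenever the write would cross a 256K boundary.
 * Error handling is abbreviated.
 */
#include <sys/types.h>
#include <sys/file.h>           /* flock() */
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <xfs/xfs.h>            /* xfs_flock64_t, XFS_IOC_RESVSP64 */

#define CHUNK   (256 * 1024)

static int append_record(int fd, const char *buf, size_t len)
{
        xfs_flock64_t fl;
        off_t pos, end;

        if (flock(fd, LOCK_EX) == -1)           /* whole-file lock */
                return -1;
        pos = lseek(fd, 0, SEEK_END);
        end = pos + (off_t)len;

        if (pos / CHUNK != end / CHUNK) {       /* crossing a 256K boundary? */
                memset(&fl, 0, sizeof(fl));
                fl.l_whence = SEEK_SET;
                fl.l_start  = (end / CHUNK) * CHUNK;
                fl.l_len    = CHUNK;            /* reserve the next chunk */
                ioctl(fd, XFS_IOC_RESVSP64, &fl);
        }
        write(fd, buf, len);
        return flock(fd, LOCK_UN);
}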

When we do the space reservation, we create unwritten extents.
A read into those is defined to return zeroes (just like a read
into a hole would), and a write causes the extent to be adjusted
so that the written part is now marked as a regular extent, while
any remaining unwritten parts continue to be marked as unwritten.
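
To make that concrete, a small sketch (same assumed xfsprogs header as the
example earlier; the ftruncate() just moves EOF past the reservation so the
reads land inside the file):

/* fragment: fd is a file opened O_RDWR on an XFS filesystem */
xfs_flock64_t fl;
char buf[4096];

memset(&fl, 0, sizeof(fl));
fl.l_whence = SEEK_SET;
fl.l_start  = 0;
fl.l_len    = 1024 * 1024;

ioctl(fd, XFS_IOC_RESVSP64, &fl);         /* 1M of blocks, marked unwritten */
ftruncate(fd, 1024 * 1024);               /* move EOF past the reservation */
pread(fd, buf, sizeof(buf), 512 * 1024);  /* returns zeroes, like a hole */
pwrite(fd, "data", 4, 512 * 1024);        /* this block becomes a regular
                                             (written) extent; the rest of
                                             the reservation stays unwritten */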

So, in your case a reader after the write must still be seeing an
extent marked as unwritten.  This could be either a bug in your
program :) with the read happening before the write, or an XFS
problem, or a VM-behaving-badly problem.

The particular issue I fixed a while ago is codified in test 084
(xfs-cmds/xfstests/{084,src/resvtest.c}) and matches yours pretty
well, hence my question about kernel version.  The trigger there
was VM pressure to reclaim dirty pages, where those pages are
backed by unwritten extents.

> After a few days of testing, the server locked up - I rebooted it,

Hmm, kdb is your friend. :)  (for a backtrace on the lockup)


But, backing up to your original problem: XFS actually does
some preallocation for writes beyond EOF already (in "biosize"
chunks).  When the file is closed, we go down the xfs_release
path, and any excess is trimmed off the end (trimmed back to a
filesystem block - see xfs_inactive_free_eofblocks).  But, if you
are writing to those files append-only, you could mark them as
such (i.e. chattr +a) -- this will skip the trim-back step and
reduce fragmentation for you.
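
If it is more convenient to set that from the application instead, here is
a sketch of the in-code equivalent of chattr +a (it assumes the xfsprogs
fsxattr ioctls; setting the append-only flag needs CAP_LINUX_IMMUTABLE,
i.e. root):

#include <sys/ioctl.h>
#include <xfs/xfs.h>    /* struct fsxattr, XFS_IOC_FSGETXATTR/FSSETXATTR */

/* Mark an open file append-only so xfs_release skips the EOF trim-back.
 * Sketch only; requires CAP_LINUX_IMMUTABLE to set the flag. */
static int make_append_only(int fd)
{
        struct fsxattr fsx;

        if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) == -1)
                return -1;
        fsx.fsx_xflags |= XFS_XFLAG_APPEND;
        return ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
}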

"biosize" is an XFS mount option.  It current maxes out at the
default, which is 64K.  I have had a patch kicking around for
ages which bumps up the allowed maximum value (see below) - a
combination of this patch, biosize=18 (i.e. 256K), and use of
the append-only inode attribute, should get you just what you
want without any application changes (I think) and in a way
that isn't susceptible to your earlier problems.

cheers.

-- 
Nathan


Index: xfs-linux/xfs_mount.h
===================================================================
--- xfs-linux.orig/xfs_mount.h  2004-06-16 10:35:26.000000000 +1000
+++ xfs-linux/xfs_mount.h       2004-08-13 09:40:19.637846024 +1000
@@ -426,10 +426,10 @@
 #define XFS_WRITEIO_LOG_LARGE  16
 
 /*
- * Max and min values for UIO and mount-option defined I/O sizes;
- * min value can't be less than a page.  Currently unused.
+ * Max and min values for mount-option defined I/O
+ * preallocation sizes.
  */
-#define XFS_MAX_IO_LOG         16      /* 64K */
+#define XFS_MAX_IO_LOG         26      /* 64M */
 #define XFS_MIN_IO_LOG         PAGE_SHIFT
 
 /*

