Marc Lehmann wrote:
On Sun, Apr 03, 2005 at 12:03:27PM -0700, Chris Wedgwood <cw@xxxxxxxx> wrote:
for really slow writes i found a large biosize helped. i've had this
Is this something inside xfs or is this just setting the st_blksize stat
data field? If the latter, then it's unlikely to help, as I configured
mythtv to use fairly large writes already (minimum 512KB, usually around
2MB). But thanks for the tip, it might help some other XFS filesystems I
have (although it isn't a problem there).
This is inside XFS and is different from doing a large I/O. What is happening
to you is that your writes (independent of size) are just going into cache and
no specific disk space is being allocated at write time. XFS is creating
a delayed allocation extent; this will be a single extent for the range
of data which has not been flushed yet.
At some point the kernel decides to flush a page in this range. XFS will
determine that the page is within a delayed allocation extent and allocate
space for the whole thing. However, if you exceed a fairly small size, you
will end up allocating just that space and not preallocating anything
beyond EOF. The patch Chris sent lets the code use a larger value for
this preallocation beyond EOF. Used in combination with the biosize
mount option, it will let XFS do optimistic allocation for future
Note this space is freed when the file goes inactive. Also note that this will
affect the size returned in st_blksize which may confuse libc if absurdly
large values are used.
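
To see what value a given mount actually reports in st_blksize, something like
the following can be run against a file on that filesystem. This is only a
minimal sketch: the helper name preferred_io_size is made up for illustration,
and the temporary file stands in for a real recording on the XFS mount.

```python
import os
import tempfile


def preferred_io_size(path):
    """Return st_blksize, the preferred I/O size the kernel reports for path."""
    return os.stat(path).st_blksize


# A temporary file is used here only so the example runs anywhere;
# point this at a file on the filesystem you care about instead.
with tempfile.NamedTemporaryFile() as f:
    blksize = preferred_io_size(f.name)
    print(f"st_blksize for {f.name}: {blksize} bytes")
```

Comparing the number before and after changing the mount option shows whether
applications that size their buffers from st_blksize will see a difference.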
Allocation groups in XFS are basically the units of parallelism within the
allocator. Only one thread can be active in an allocation group at a time.
The basic placement rules are: file inodes are placed in the same group
as their parent directory; directory inodes are placed in a different
group than their parent. Extents start out in the same group as the
inode; they will scan around allocation groups looking for free space
When allocating an extent for an inode, xfs will attempt to place the data
as close as it can to the end of the previous extent (it should use increasing
Now, if you have multiple inodes in the same allocation group allocating
extents, they are competing for free space. The algorithm here could be
better: chances are that on a filesystem like yours, they are all carving
space off the front of the free space in the allocation group. This is where
the checkerboarding comes from.
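
The checkerboard effect is easy to picture with a toy model. The sketch below
is not XFS code; it just assumes several writers take turns, and each one is
handed the next chunk off the front of a single free region. All names in it
are invented for the illustration.

```python
def carve_from_front(num_files, rounds, extent_blocks):
    """Writers take turns; each gets the next extent at the front of free space.

    Returns a list with one entry per block, holding the owning file id.
    """
    layout = []
    for _ in range(rounds):
        for file_id in range(num_files):
            layout.extend([file_id] * extent_blocks)
    return layout


def render(layout):
    """Draw the block map with one letter per block (A = file 0, B = file 1, ...)."""
    return "".join(chr(ord("A") + file_id) for file_id in layout)


def extent_count(layout, file_id):
    """Count the physically separate extents that belong to file_id."""
    return sum(
        1
        for i, owner in enumerate(layout)
        if owner == file_id and (i == 0 or layout[i - 1] != file_id)
    )


two_writers = carve_from_front(num_files=2, rounds=3, extent_blocks=2)
one_writer = carve_from_front(num_files=1, rounds=3, extent_blocks=2)

print(render(two_writers))           # AABBAABBAABB
print(extent_count(two_writers, 0))  # 3
print(extent_count(one_writer, 0))   # 1
```

A single writer ends up with one contiguous extent, while two concurrent
writers each get one extent per flush, which is the pattern described above.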
The realtime allocator uses a different binary chop algorithm which, while
wasteful, makes it very hard to fragment realtime files. Hmm, buffered I/O
works on realtime, so maybe you could try hacking mythtv to set the realtime
flag in inodes and see what happens. It occurs to me that while it would
fragment the free space more quickly, having the allocator use a binary chop
to carve up the free space, rather than always starting at the beginning of
it, would make it more likely for there to be unused space just beyond the
last extent allocated for a file. Implementing that would take some time by
someone experienced inside the allocator (Glen, are you reading this? ;-)),
and at the end of the day, there may actually not be much of a payoff.
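
As a rough illustration of the binary chop idea, here is a toy comparison of
the two placement policies. It is a sketch under simplifying assumptions, not
the XFS or realtime allocator: free space is one flat array of blocks, writers
take turns, each writer first tries to extend right after its previous extent,
and "binary chop" is taken to mean starting at the midpoint of the largest free
run. Every name in it is hypothetical.

```python
def free_runs(owner):
    """Return (start, length) for each maximal run of free blocks."""
    runs, start = [], None
    for i, o in enumerate(owner + ["end"]):  # sentinel closes a trailing run
        if o is None and start is None:
            start = i
        elif o is not None and start is not None:
            runs.append((start, i - start))
            start = None
    return runs


def front_policy(owner, n):
    """Start at the beginning of the first free run that is large enough."""
    for start, length in free_runs(owner):
        if length >= n:
            return start
    raise RuntimeError("no free run large enough")


def chop_policy(owner, n):
    """Start at the midpoint of the largest free run, leaving room to grow."""
    start, length = max(free_runs(owner), key=lambda run: run[1])
    if length < n:
        raise RuntimeError("no free run large enough")
    half = length // 2
    return start + half if length - half >= n else start


def simulate(policy, total_blocks=64, num_files=2, rounds=4, extent_blocks=4):
    """Interleave writers; each extends in place if it can, else asks policy."""
    owner = [None] * total_blocks
    last_end = {}
    for _ in range(rounds):
        for fid in range(num_files):
            start = last_end.get(fid)
            fits = (
                start is not None
                and start + extent_blocks <= total_blocks
                and all(owner[b] is None for b in range(start, start + extent_blocks))
            )
            if not fits:
                start = policy(owner, extent_blocks)
            for b in range(start, start + extent_blocks):
                owner[b] = fid
            last_end[fid] = start + extent_blocks
    return owner


def extent_count(owner, fid):
    """Count the physically separate extents that belong to fid."""
    return sum(
        1 for i, o in enumerate(owner) if o == fid and (i == 0 or owner[i - 1] != fid)
    )


front = simulate(front_policy)
chop = simulate(chop_policy)
print("front:", [extent_count(front, f) for f in range(2)])  # front: [4, 4]
print("chop: ", [extent_count(chop, f) for f in range(2)])   # chop:  [1, 1]
```

In this model the chop policy keeps both files contiguous because each one has
free blocks just past its last extent, at the cost of splitting the free space
into smaller pieces sooner. Whether a real allocator would see a similar gain
is exactly the open question.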