Re: O_DIRECT & rpm (for xfs)
On Fri, 2003-08-29 at 11:29, Axel Thimm wrote:
> Hi Jeff,
>
> thanks for the background on the O_DIRECT problem. As I wrote, there
> is some active discussion on the xfs-list about how to deal with
> O_DIRECT/ext3/xfs/rpm, and your explanation will be very valuable to
> the XFS community, especially since it comes from the rpm master
> himself :)
>
> For your background: The hanging of rpm with recent XFS patches due to
> enabled O_DIRECT support generated some confusion about where this bug
> came from.
>
> Current XFS policy is to enable O_DIRECT only for XFS partitions.
>
> Thanks!
Having had O_DIRECT in the filesystems I have worked on for the last
decade (that's getting scary ;-), I might comment on some of Jeff's
message:
> O_DIRECT -- as instantiated in linux -- is goofy and under development.
> The behavior before was just to ignore O_DIRECT, it did not matter
> whether it was used or not. Life was good.
I tend to agree with this to some extent, although O_DIRECT has been
there for some filesystems in Linux for a few years now.
> O_DIRECT imposes size and alignment constraints on the I/O request, just
> like character devices. The painful constraint is page aligned address,
> necessitating valloc, not just any buffer address.
The missing link here is a standard call to determine what the alignment
constraints are for a particular filesystem. XFS has one of these, but
it is XFS specific and not generally provided. The memory alignment
constraint is somewhat difficult to justify in its current form, since the
only real reason I know of for constraining the alignment is the
capability of the hardware to do DMA. A page is way too large;
a cache line is more likely to be the true boundary, and a sector is
also a possible constraint. Again, the real issue is the lack of
an interface to determine the allowed alignment.
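
Without such an interface, an application is left guessing a worst-case
alignment and applying it to the buffer address, file offset and length
of every request. A minimal sketch of that, assuming 4096 bytes is
sufficient (a guess, not something a filesystem promises); the names
DIO_ALIGN_GUESS, dio_request_ok and dio_alloc are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Assumed worst-case alignment. The real constraint depends on the
 * filesystem and device, and there is no portable call to query it. */
#define DIO_ALIGN_GUESS 4096

/* Nonzero if x is a multiple of align (align must be a power of two). */
static int is_multiple(uint64_t x, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x & (align - 1)) == 0;
}

/* Check the buffer address, file offset and length of one request
 * against a single alignment value. */
static int dio_request_ok(const void *buf, off_t off, size_t len, size_t align)
{
    return is_multiple((uintptr_t)buf, align) &&
           is_multiple((uint64_t)off, align) &&
           is_multiple(len, align);
}

/* Allocate a buffer whose address satisfies align, rounding the size up
 * to a multiple of align. Returns NULL on failure; release with free(). */
static void *dio_alloc(size_t size, size_t align)
{
    void *p = NULL;
    size_t rounded = (size + align - 1) & ~(align - 1);

    if (posix_memalign(&p, align, rounded) != 0)
        return NULL;
    return p;
}
```

posix_memalign is used here in place of valloc because it takes the
alignment as a parameter, so a smaller value (a sector, say) can be
passed if a filesystem is known to accept it.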
> The other pain with O_DIRECT is that the EINVAL returned with alignment
> failure comes from read/write, not from the open. This is different than
> most syscalls, which either fail immediately, or the cause of failure is
> obvious for other reasons.
Since the open does not constrain what you do with read and write calls
afterwards, you cannot really expect the open to predict that you will
disobey the rules later. If you pass an invalid buffer address into a
read or a write you get the error on that call, not the open.
What might be a better approach, and what some of the big players probably
expect (read Oracle here), is that O_DIRECT fall back to buffered I/O
when the alignment constraints are not met on a call. You can conceive
of two forms of open call - a hard and a soft form of O_DIRECT. The hard
one would behave as it does now: if you disobey the rules, your I/O
fails. The soft one would be more forgiving and do buffered I/O. This
comes from the Solaris implementation, by the way. We are definitely
getting pushback from database folks to follow this model, since the
apps do not want to special-case different operating systems.
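
Until something like the soft form exists in the kernel, an application
can approximate it in user space. A rough sketch, not how Solaris or any
kernel implements it: soft_direct_pread is a hypothetical helper, and
the align argument is whatever constraint the caller assumes. It issues
the read as-is when the request looks aligned, and otherwise (or if the
kernel returns EINVAL anyway) clears O_DIRECT with fcntl and retries
buffered.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0    /* no O_DIRECT on this platform: flag becomes a no-op */
#endif

/* Nonzero if x is a multiple of align (align must be a power of two). */
static int aligned_to(uint64_t x, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x & (align - 1)) == 0;
}

/*
 * Hypothetical "soft" direct read, emulated in user space.
 * If the request meets align (an assumed constraint), issue it as-is.
 * Otherwise, or if the kernel rejects it with EINVAL, clear O_DIRECT on
 * the descriptor and retry as buffered I/O.
 * Note: the flag stays cleared afterwards; a fuller version would
 * restore it or keep separate direct and buffered descriptors.
 */
static ssize_t soft_direct_pread(int fd, void *buf, size_t len,
                                 off_t off, size_t align)
{
    int ok = aligned_to((uintptr_t)buf, align) &&
             aligned_to((uint64_t)off, align) &&
             aligned_to(len, align);

    if (ok) {
        ssize_t n = pread(fd, buf, len, off);
        if (n >= 0 || errno != EINVAL)
            return n;           /* success, or an unrelated error */
    }

    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags & ~O_DIRECT) == -1)
        return -1;
    return pread(fd, buf, len, off);   /* buffered retry */
}
```

The caller sees the same return convention as pread either way, which is
the point of the soft form: the app does not need a special case for
filesystems that reject the request.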
> What is/was really, really broken is that O_DIRECT came out of no place,
> was released without adequate warning, and even kernel folks disagree on
> the implementation, and the implementation is not complete. Very not good.
> Works on some file systems, not on others, etc etc.
See Christoph's response for this one. But the reason it is on for XFS
and not the other filesystems is that the bug in O_DIRECT which made
them turn it off does not affect the XFS implementation.
Steve
--
Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: lord@sgi.com