On Tue, Sep 23, 2008 at 06:05:04PM +1000, Dave Chinner wrote:
> Another failure that we often have reported is that XFS has 'hung' and
> triage indicates that the filesystem appears to be waiting for a metadata
> I/O completion to occur. We have seen in the past I/O errors not being
> propagated from the lower layers back into the filesystem causing these
> sorts of problems. We have also seen cases where there have been silent
> I/O errors and the first thing to go wrong is 'XFS has hung'.
> To catch situations like this, we need to track all I/O we have in flight and
> have some method of timing them out. That is, if we haven't completed the I/O
> in N seconds, issue a warning and enter an exception handling process that
> attempts to deal with the problem.
> My initial thought on this is that it could be implemented via the MRU cache
> without much extra code being needed. The complexity with this is that we
> can't catch data read I/O because we use the generic I/O path for read. We do
> our own data write and metadata read/write, so we can easily add hooks to
> all these types of I/O. Hence we will initially target just metadata I/O as
> this would only need to hook into the xfs_buf I/O submission layer.
I don't think this is something we want to do in XFS itself, as this
would fit much better in the bio layer (and propagation through the
pagecache). That way we have it in one place instead of growing this
in various filesystems later on.
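
To make the idea under discussion concrete, here is a rough user-space
sketch of "track every I/O in flight and flag the ones that have not
completed in N seconds". It is only a model under stated assumptions: the
names (io_tracker, io_submit, io_complete, io_check_timeouts) are
hypothetical and exist in neither XFS nor the bio layer, the table is a
fixed-size array rather than the MRU cache, and the caller supplies the
clock. Nothing here depends on which layer owns the tracking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model only: these names are not real XFS or block-layer APIs. */
#define MAX_INFLIGHT 64

struct io_record {
    int  in_use;     /* slot holds a submitted, not yet completed I/O */
    long id;         /* caller-chosen identifier for the I/O */
    long submitted;  /* submission time in seconds (caller's clock) */
};

struct io_tracker {
    struct io_record slots[MAX_INFLIGHT];
    long timeout;    /* the "N seconds" after which an I/O is overdue */
};

void io_tracker_init(struct io_tracker *t, long timeout_secs)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_INFLIGHT; i++)
        t->slots[i].in_use = 0;
    t->timeout = timeout_secs;
}

/* Record an I/O at submission. Returns 0, or -1 if the table is full. */
int io_submit(struct io_tracker *t, long id, long now)
{
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].in_use = 1;
            t->slots[i].id = id;
            t->slots[i].submitted = now;
            return 0;
        }
    }
    return -1;
}

/* Drop an I/O at completion. Returns 0, or -1 if the id is not in flight. */
int io_complete(struct io_tracker *t, long id)
{
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (t->slots[i].in_use && t->slots[i].id == id) {
            t->slots[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}

/*
 * Scan for I/Os in flight for at least t->timeout seconds. Stores up to
 * max overdue ids in out and returns how many were stored.
 */
size_t io_check_timeouts(const struct io_tracker *t, long now,
                         long *out, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < MAX_INFLIGHT && n < max; i++) {
        if (t->slots[i].in_use &&
            now - t->slots[i].submitted >= t->timeout)
            out[n++] = t->slots[i].id;
    }
    return n;
}
```

A periodic caller would run io_check_timeouts() and, for each id returned,
issue the warning and start whatever exception handling is chosen. A real
implementation would also need locking and a timer, which this sketch
leaves out.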