[PATCH] Prevent lookup from finding bad buffers
Eric Sandeen
sandeen at sandeen.net
Wed Nov 25 14:26:19 CST 2009
Lachlan McIlroy wrote:
> There's a bug in _xfs_buf_find() that will cause it to return buffers
> that failed to be initialised.
>
> If a thread has a buffer locked and is waiting for I/O to initialise
> it and another thread wants the same buffer the second thread will
> wait on the buffer lock in _xfs_buf_find(). If the initial thread
> gets an I/O error it marks the buffer in error and releases the
> buffer lock. The second thread gets the buffer lock, assumes the
> buffer has been successfully initialised, and then tries to use it.
>
> Some callers of xfs_buf_get_flags() will check for B_DONE, and if
> it's not set then re-issue the I/O, bust most callers assume the
> buffer and it's contents are good and then use the uninitialised
> data.
>
> The solution I've come up with is if we lookup a buffer and find
> it's got b_error set or has been marked stale then unhash it from
> the buffer hashtable and retry the lookup. Also if we fail to setup
> the buffer correctly in xfs_buf_get_flags() then mark the buffer in
> error and unhash it. If the buffer is marked stale then in
> xfs_buf_free() inform the page cache that the contents of the pages
> are no longer valid.
I managed to come up with a sorta-kinda testcase for this.
Fragmented freespace, many files in a dir, on raid5; simply doing
drop caches / ls in a loop triggered it.
I guess raid5 is bad in this respect; in it's make_request() we have:
} else {
/* cannot get stripe for read-ahead, just give-up */
clear_bit(BIO_UPTODATE, &bi->bi_flags);
finish_wait(&conf->wait_for_overlap, &w);
break;
}
and this happens fairly often. This probably explains a large
percentage of our xfs_da_do_buf(2) errors we've seen on the list.
More information about the xfs
mailing list