The XFS filesystem hang that Nick's reproducer triggers has two
different causes. Both manifest in the same way, so it was difficult to tell
them apart - each left an item that could not be flushed stuck in the AIL,
which prevented the tail of the log from being moved forward.
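
For anyone unfamiliar with why a single stuck item matters, here is a toy
model of the effect. It is only an illustrative sketch, not XFS code: the
names LogItem, ToyAIL, push() and the flushable flag are made up for this
example, and it assumes only that the log tail is the lowest LSN still
present in the AIL.

```python
from dataclasses import dataclass


@dataclass
class LogItem:
    name: str
    lsn: int                 # log sequence number at which the item was logged
    flushable: bool = True   # hypothetical flag: False models a stuck item


class ToyAIL:
    """Toy stand-in for the AIL: items kept in LSN order."""

    def __init__(self, items):
        self.items = sorted(items, key=lambda item: item.lsn)

    def tail_lsn(self):
        # The tail of the log is pinned by the oldest item still in the list.
        return self.items[0].lsn if self.items else None

    def push(self):
        # Write back everything that can be flushed; stuck items remain.
        self.items = [item for item in self.items if not item.flushable]


ail = ToyAIL([
    LogItem("inode", 10),
    LogItem("stale buffer", 12, flushable=False),
    LogItem("inode", 15),
])

before = ail.tail_lsn()
ail.push()
ail.push()  # pushing again makes no further progress
after = ail.tail_lsn()
print(before, after)
```

In this model the tail advances once and then stays at the stuck item's LSN
no matter how often the list is pushed. In a real log, the space behind the
tail cannot be reused, so the log eventually fills up and new transactions
block waiting for space.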

The most frequently tripped problem (generally within 5 minutes) was the CIL
commit race, where the checkpoint could commit before the transaction items
were unlocked, resulting in stale information in log items and incorrect
processing of stale buffers. I could not reproduce this race condition with a
debug kernel, which explains why my previous dbench testing did not uncover it.

Less frequently tripped was the inode cluster freeing problem - it would take
around 2 hours on average to trip this one, and it does not require delayed
logging to hit.

Nick, you'll need both patches to avoid the hangs you reported - I've had
your test case running now for just over ten hours without any issues so far.