On Fri, Aug 08, 2014 at 02:49:24PM -0400, Brian Foster wrote:
> Hi all,
> I've seen collapse range fall over during some recent stress testing.
> I'm running fsx and 16 fsstress threads in parallel to reproduce. Note
> that the fsstress workload doesn't need to be on the same fs (I suspect
> a sync() is a trigger). These patches are what has fallen out so far...
> The first patch stems from the fact that the error caused an fs shutdown
> that appeared to be unnecessary. I was initially going to skip the inode
> log on any error, but on closer inspection it seems like we expect to
> abort/shutdown if something has in fact been changed, so this modifies
> the code to reduce that shutdown window. The second patch deals with the
> actual collapse failure by fixing up the locking.
> Note that I still reproduced at least one collapse failure even with
> these fixes, so there could be more at play here with the
>
> XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 5535 of file
> fs/xfs/libxfs/xfs_bmap.c. Caller xfs_collapse_file_space+0x1af/0x280 [xfs]
>
> This took significantly longer to reproduce and I don't yet have a feel
> for how reproducible it is in general. In the meantime, these two seemed
> relatively straightforward and incremental...
They look good, but it's too late for the 3.17 merge window.
However, given that we've got other fixes that need to go to 3.17
but are also too late (Chris Mason's direct IO invalidation fixes),
I'll plan these for 3.17-rc2 or so.