garbage block(s) after powercycle/reboot + sparse writes
Sage Weil
sage at inktank.com
Tue Jun 18 22:12:01 CDT 2013
On Wed, 19 Jun 2013, Dave Chinner wrote:
> On Wed, Jun 12, 2013 at 10:02:52AM -0700, Sage Weil wrote:
> > Hi guys,
> >
> > I reproduced this on two more boxes and have more data. The full set of
> > notes/logs is at
> >
> > http://newdream.net/~sage/bug-4976/notes.txt
>
> case c:
>
> /var/lib/ceph/osd/ceph-0/current/3.0_head/DIR_9/plana8021941-457__head_834C52B9__3:
> 0: [0..7]: 244206232..244206239
> 1: [8..1351]: hole
> 2: [1352..2431]: 244252824..244253903
> 3: [2432..3255]: hole
> 4: [3256..4855]: 244254728..244256327
>
> bad data starts at offset 1179648:
>
> Which lies within the middle of an allocated extent (offset 2304bb).
>
> IIUC, then there was a write at offset 700466 for 445465 bytes,
> (i.e start @ 1368bb, end @ 2239bb), but given the block count of
> the file didn't change, this must have been an overwrite of existing
> data. It's well within EOF, too, so it's not clear what has happened
> here - the bad data was not written by the ceph journal replay, and
> the extent was already allocated on disk...
>
> case d:
>
> 0: [0..7]: 732632192..732632199
> 1: [8..783]: hole
> 2: [784..2943]: 733513808..733515967
> 3: [2944..3511]: hole
> 4: [3512..4703]: 733516536..733517727
>
>
> INFO:teuthology.task.rados.rados.0.out:incorrect buffer at pos 1179648 (((same offset as previous case!!!))
>
> So, similar layout, we have a write at offset 404703 for 773518
> bytes (start @790bb, end @2302). Ok, that makes a little more sense
> to have corruption starting @ 2304. The pages in the page cache that
> cover this write would cover offsets 784bb to 2303bb, having zeroed
> the head and tail sections.
>
> The file size and number of blocks didn't change again and it's
> well within the file size, so this is also an overwrite without
> allocat. IOWs the overwrite ended at 2303bb, with corruption
> starting at 2304bb.
>
> Let's play a "what-if" game. there are 3 writes to the file:
>
> -@ 0 for 57 bytes (1FSB)
> -@ 404703 for 773518 bytes (
> -@ 1801584 for 603119 bytes
>
> These translate to the following regions in BB:
>
> - [0..1]
> - [790..2302]
> - [3518..4697]
>
> If we round these to filesystem blocks, we have:
>
> - [0..7]
> - [784..2303]
> - [3512..4703]
>
> And in filesystem blocks:
>
> Should have Actually have
> ----------- ------------
> - [0..1] - [0..1]
> - [98..287] - [98..367]
> - [439..587] - [439..587]
>
> So there's an extra 80 filesystem blocks in the middle extent. Does
> that tally with the block count of 3360bb = 420fsb? 1 + 190 + 149 =
> 340fsb. it doesn't, so there's another 60fsb beyond EOF.
>
> And that's pretty telling. There's non-zero'd speculative
> preallocation there. The *big* question is this: why was the file
> size extented (i.e. user data IO completed) and updated and the
> file size update logged to the journal *before* the zeroed regions
> of the file had been written to zero?
>
> And why only on 3.8/3.9?
I haven't gone back and tested on older kernels. It's possible this isn't
a new problem and previous failures were lost in the noise. :/
> > Any ideas?
>
> I'd love to see the code that is writing the files in the first
> place - it's not playing silly buggers with sync_file_range(), is
> it?
Actually, it is!
...
::sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
...
int fa_r = posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
...and a call to close(2).
The intent was just to push things out of the disk so that a syncfs()
that comes later will take less time. I wouldn't have expected this to
change the commit ordering semantics, though; only to initiate writeback
sooner.
(Coincidentally, all of this code was just recently rewritten to do a
simple fsync. I haven't retested the powercycling since then, though.)
sage
More information about the xfs
mailing list