xfs
[Top] [All Lists]

Re: garbage block(s) after powercycle/reboot + sparse writes

To: Sage Weil <sage@xxxxxxxxxxx>
Subject: Re: garbage block(s) after powercycle/reboot + sparse writes
From: Dave Chinner <david@xxxxxxxxxxxxx>
Date: Wed, 19 Jun 2013 11:46:46 +1000
Cc: Eric Sandeen <sandeen@xxxxxxxxxxx>, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <alpine.DEB.2.00.1306120955460.15386@xxxxxxxxxxxxxxxxxx>
References: <alpine.DEB.2.00.1306041210070.15156@xxxxxxxxxxxxxxxxxx> <51AE84C9.5030903@xxxxxxxxxxx> <alpine.DEB.2.00.1306120955460.15386@xxxxxxxxxxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
On Wed, Jun 12, 2013 at 10:02:52AM -0700, Sage Weil wrote:
> Hi guys,
> 
> I reproduced this on two more boxes and have more data.  The full set of 
> notes/logs is at
> 
>       http://newdream.net/~sage/bug-4976/notes.txt

case c:

/var/lib/ceph/osd/ceph-0/current/3.0_head/DIR_9/plana8021941-457__head_834C52B9__3:
        0: [0..7]: 244206232..244206239
        1: [8..1351]: hole
        2: [1352..2431]: 244252824..244253903
        3: [2432..3255]: hole
        4: [3256..4855]: 244254728..244256327

bad data starts at offset 1179648:

Which lies within the middle of an allocated extent (offset 2304bb).

IIUC, then there was a write at offset 700466 for 445465 bytes,
(i.e start @ 1368bb, end @ 2239bb), but given the block count of
the file didn't change, this must have been an overwrite of existing
data. It's well within EOF, too, so it's not clear what has happened
here - the bad data was not written by the ceph journal replay, and
the extent was already allocated on disk...

case d:

        0: [0..7]: 732632192..732632199
        1: [8..783]: hole
        2: [784..2943]: 733513808..733515967
        3: [2944..3511]: hole
        4: [3512..4703]: 733516536..733517727


INFO:teuthology.task.rados.rados.0.out:incorrect buffer at pos 1179648  (((same 
offset as previous case!!!))

So, similar layout, we have a write at offset 404703 for 773518
bytes (start @790bb, end @2302). Ok, that makes a little more sense
to have corruption starting @ 2304. The pages in the page cache that
cover this write would cover offsets 784bb to 2303bb, having zeroed
the head and tail sections.

The file size and number of blocks didn't change again and it's
well within the file size, so this is also an overwrite without
allocat. IOWs the overwrite ended at 2303bb, with corruption
starting at 2304bb.

Let's play a "what-if" game. there are 3 writes to the file:

        -@ 0 for 57 bytes       (1FSB)
        -@ 404703 for 773518 bytes (
        -@ 1801584 for 603119 bytes

These translate to the following regions in BB:

        - [0..1]
        - [790..2302]
        - [3518..4697]

If we round these to filesystem blocks, we have:

        - [0..7]
        - [784..2303]
        - [3512..4703]

And in filesystem blocks:

        Should have             Actually have
        -----------             ------------
        - [0..1]                - [0..1]
        - [98..287]             - [98..367]
        - [439..587]            - [439..587]

So there's an extra 80 filesystem blocks in the middle extent. Does
that tally with the block count of 3360bb = 420fsb? 1 + 190 + 149 =
340fsb. it doesn't, so there's another 60fsb beyond EOF.

And that's pretty telling. There's non-zero'd speculative
preallocation there. The *big* question is this: why was the file
size extented (i.e. user data IO completed) and updated and the
file size update logged to the journal *before* the zeroed regions
of the file had been written to zero?

And why only on 3.8/3.9?

> Any ideas?

I'd love to see the code that is writing the files in the first
place - it's not playing silly buggers with sync_file_range(), is
it?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

<Prev in Thread] Current Thread [Next in Thread>