xfs_repair deleting realtime files.

Anand Tiwari tiwarikanand at gmail.com
Tue Sep 25 22:45:07 CDT 2012


On Tue, Sep 25, 2012 at 8:44 PM, Dave Chinner <david at fromorbit.com> wrote:

> On Tue, Sep 25, 2012 at 07:26:32PM -0600, Anand Tiwari wrote:
> > On Mon, Sep 24, 2012 at 6:51 AM, Anand Tiwari <tiwarikanand at gmail.com> wrote:
> >
> > > On Mon, Sep 24, 2012 at 1:55 AM, Dave Chinner <david at fromorbit.com> wrote:
> > >
> > >> On Fri, Sep 21, 2012 at 12:00:13AM -0500, Eric Sandeen wrote:
> > >> > On 9/20/12 7:40 PM, Anand Tiwari wrote:
> > >> > > Hi All,
> > >> > >
> > >> > > I have been looking into an issue with xfs_repair on a realtime
> > >> > > subvolume. Sometimes, while running xfs_repair, I see the following
> > >> > > errors:
> > >> > >
> > >> > > ----------------------------
> > >> > > data fork in rt inode 134 claims used rt block 19607
> > >> > > bad data fork in inode 134
> > >> > > would have cleared inode 134
> > >> > > data fork in rt inode 135 claims used rt block 29607
> > >> > > bad data fork in inode 135
> > >> > > would have cleared inode 135
> > >> .....
> > >> > > xfs_db> inode 135
> > >> > > xfs_db> bmap
> > >> > > data offset 0 startblock 13062144 (12/479232) count 2097000 flag 0
> > >> > > data offset 2097000 startblock 15159144 (14/479080) count 2097000 flag 0
> > >> > > data offset 4194000 startblock 17256144 (16/478928) count 2097000 flag 0
> > >> > > data offset 6291000 startblock 19353144 (18/478776) count 2097000 flag 0
> > >> > > data offset 8388000 startblock 21450144 (20/478624) count 2097000 flag 0
> > >> > > data offset 10485000 startblock 23547144 (22/478472) count 2097000 flag 0
> > >> > > data offset 12582000 startblock 25644144 (24/478320) count 2097000 flag 0
> > >> > > data offset 14679000 startblock 27741144 (26/478168) count 2097000 flag 0
> > >> > > data offset 16776000 startblock 29838144 (28/478016) count 2097000 flag 0
> > >> > > data offset 18873000 startblock 31935144 (30/477864) count 1607000 flag 0
> > >> > > xfs_db> inode 134
> > >> > > xfs_db> bmap
> > >> > > data offset 0 startblock 7942144 (7/602112) count 2097000 flag 0
> > >> > > data offset 2097000 startblock 10039144 (9/601960) count 2097000 flag 0
> > >> > > data offset 4194000 startblock 12136144 (11/601808) count 926000 flag 0
> > >> >
> > >> > It's been a while since I thought about realtime, but -
> > >> >
> > >> > That all seems fine, I don't see anything overlapping there, they are
> > >> > all perfectly adjacent, though of interesting size.
> > >>
> > >> Yeah, the size is the problem.
> > >>
> > >> ....
> > >> > Every extent above is length 2097000 blocks, and they are adjacent.
> > >> > But you say your realtime extent size is 512 blocks ... which doesn't
> > >> > go into 2097000 evenly.  So that's odd, at least.
> > >>
> > >> Once you realise that the bmapbt is recording multiples of FSB (4k)
> > >> rather than rtextsz (2MB), it becomes more obvious what the problem
> > >> is: the extent size is being rounded near MAXEXTLEN. 2097000 is only
> > >> 152 blocks short of 2^21 (2097152).
> > >>
> > >> I haven't looked at the kernel code yet to work out why it is
> > >> rounding to a non-rtextsz multiple, but that is the source of the
> > >> problem.
> > >>
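[Editorial sketch, not part of the original message.] The arithmetic above can be checked with a few lines of C. This is only an illustration under the thread's stated assumptions (4k filesystem blocks, a realtime extent size of 512 blocks); the helper names rt_extent_is_aligned() and rt_blocks_short() are hypothetical and do not come from xfsprogs or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative helpers only; all quantities are in filesystem blocks. */

/* True when an extent both starts and ends on a realtime-extent boundary. */
bool rt_extent_is_aligned(uint64_t startblock, uint64_t blockcount,
                          uint32_t rextsize)
{
    return startblock % rextsize == 0 && blockcount % rextsize == 0;
}

/* Blocks missing to reach the next multiple of rextsize (0 if aligned). */
uint64_t rt_blocks_short(uint64_t blockcount, uint32_t rextsize)
{
    uint64_t rem = blockcount % rextsize;

    return rem ? rextsize - rem : 0;
}
```

Fed the first extent of inode 135 from the bmap output (startblock 13062144, count 2097000) with rextsize 512, the start is aligned but the length is 152 blocks short of a boundary, so every following extent in that file starts unaligned.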
> > >> The repair code is detecting that extents are not of the
> > >> correct granularity, but the error message indicates that this was
> > >> only ever expected for duplicate blocks occurring rather than a
> > >> kernel bug. So "fixing repair" is not what is needed here - finding
> > >> and fixing the kernel bug is what you should be looking at.
> > >>
> > >> Cheers,
> > >>
> > >> Dave.
> > >> --
> > >> Dave Chinner
> > >> david at fromorbit.com
> > >>
> > >
> > >
> > > Thanks, I started looking at the allocator code and will report if I
> > > see something.
> > >
> >
> >
> > I think this is what is happening.  Suppose the following conditions hold:
> >   1) more than 8GB of contiguous space is available to allocate (i.e.
> >      more than 2^21 4k blocks), and
> >   2) only one file is open for writing on the real-time volume.
> >
> > To satisfy the first condition, I just took an empty filesystem.
> >
> > Now let's start allocating, say in chunks of 25000 blocks.  The realtime
> > allocator will have no problem allocating the "exact" block while
> > searching forward (xfs_rtfind_forw()).  It will allocate 49 "real-time
> > extents", where the 49th "real-time extent" is partially full
> > (25000/512 is about 48.8).
> >
> > Everything is fine for the first 83 allocations, as we are able to grow
> > the extent.  Now we have 2075000 (25000*83) blocks in the first extent,
> > i.e. 4053 "real-time extents" (where the last "real-time extent" is
> > partially full).
> >
> > For the 84th allocation, the real-time allocator will allocate another 49
> > "real-time extents", as it does not know about the maximum extent size,
> > but we cannot grow the extent in xfs_bmap_add_extent_unwritten_real().
> > So we insert a new extent (case BMAP_LEFT_FILLING).  The new extent now
> > starts at 2075000, which is not aligned with rextsize (512 in this case).
>
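[Editorial sketch, not part of the original message.] The allocation sequence described above can be modelled in a few lines. This is a simplified stand-in, assuming only the rule quoted in this thread (an extent may grow while the merged length stays at or below MAXEXTLEN, the 21-bit limit of 2^21 - 1 blocks); struct ext and simulate_appends() are hypothetical names, not kernel code, and the model ignores the unwritten/written state transition entirely.

```c
#include <stdint.h>

#define MAXEXTLEN ((1u << 21) - 1)  /* 21-bit extent length limit, in blocks */

struct ext {
    uint64_t startoff;    /* file offset, in blocks */
    uint64_t blockcount;  /* length, in blocks */
};

/*
 * Append nwrites sequential writes of chunk blocks each. A write is merged
 * into the previous extent only if the merged length fits in MAXEXTLEN;
 * otherwise a new extent is started at the current offset.
 * Returns the number of extents used, or -1 if exts[] is too small.
 */
int simulate_appends(struct ext *exts, int max_exts, uint64_t chunk,
                     int nwrites)
{
    int n = 0;
    uint64_t off = 0;

    for (int i = 0; i < nwrites; i++, off += chunk) {
        if (n > 0 && exts[n - 1].blockcount + chunk <= MAXEXTLEN) {
            exts[n - 1].blockcount += chunk;  /* left-contiguous merge */
            continue;
        }
        if (n == max_exts)
            return -1;
        exts[n].startoff = off;
        exts[n].blockcount = chunk;
        n++;
    }
    return n;
}
```

With chunk = 25000, 83 writes leave a single extent of 2075000 blocks; the 84th write cannot be merged (2100000 exceeds MAXEXTLEN) and starts a second extent at offset 2075000, which is not a multiple of 512.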
> Ok, so it's a problem with using unwritten extents and converting
> them. That is, the issue probably has nothing to do with the
> realtime allocator at all.
>
> Basically, when the unwritten extent occurs, we end up with a map
> like this:
>
> ext 0:  offset 0, length 2075000 state written
> ext 1:  offset 2075000 length 25000 state unwritten
>
> This will occur because you can't mix written/unwritten state in a
> single extent.
>
> What xfs_bmap_add_extent_unwritten_real() is attempting to do is
> convert the unwritten extent to written state and merge it with its
> siblings. In this case, 2075000 + 25000 > MAXEXTLEN, so it does not
> merge them because of this check:
>
>         if ((state & BMAP_LEFT_VALID) && !(state & BMAP_LEFT_DELAY) &&
>             LEFT.br_startoff + LEFT.br_blockcount == new->br_startoff &&
>             LEFT.br_startblock + LEFT.br_blockcount == new->br_startblock &&
>             LEFT.br_state == newext &&
> >>>>>>      LEFT.br_blockcount + new->br_blockcount <= MAXEXTLEN)
>                 state |= BMAP_LEFT_CONTIG;
>
> Which means that BMAP_LEFT_CONTIG is not set, indicating that no
> merging with the adjacent left extent should occur. Hence we end
> up with this:
>
> ext 0:  offset 0, length 2075000 state written
> ext 1:  offset 2075000 length 25000 state written
>
> That's fine for normal operation, but it means that large contiguous
> regions written via direct IO with non-rtextsz aligned/sized IO will
> have this problem.
>
> What technically should happen for these real time files is that the
> LEFT extent should be shortened to be aligned, and the new extent be
> lengthened and have its startblock adjusted accordingly.
>
> i.e. we should end up with this:
>
> ext 0:  offset 0, length 2074624 state written
> ext 1:  offset 2074624 length 25376 state written
>
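[Editorial sketch, not part of the original message.] The numbers in the desired layout above follow from rounding the left extent down to a realtime-extent boundary and handing the spilled blocks to the new extent. A minimal illustration, assuming a realtime extent size of 512 blocks; rt_realign_split() is a hypothetical helper, not a proposed kernel patch.

```c
#include <stdint.h>

/*
 * Shorten the left extent to a multiple of rextsize and lengthen the new
 * extent by the same number of blocks, so the boundary between the two is
 * realtime-extent aligned. All quantities are in filesystem blocks.
 */
void rt_realign_split(uint64_t left_len, uint64_t new_len, uint32_t rextsize,
                      uint64_t *left_out, uint64_t *new_out)
{
    uint64_t spill = left_len % rextsize;  /* blocks past the last boundary */

    *left_out = left_len - spill;
    *new_out = new_len + spill;
}
```

For left_len = 2075000, new_len = 25000 and rextsize = 512, the spill is 376 blocks, giving lengths 2074624 and 25376 as in the layout above. The total of 2100000 blocks is unchanged; only the boundary moves.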
> > To fix this, I see two options:
> > 1) fix the real-time allocator and teach it about the maximum extent size.
> > 2) for real-time files, align the new extent before inserting it.
>
> 3) Fix the {BMAP_LEFT_CONTIG,MAXEXTLEN,rtextsz} handling in
> xfs_bmap_add_extent_unwritten_real().
>
> It's possible that the BMAP_RIGHT_CONTIG case also needs similar
> fixing...
>
> > In my opinion, we should not worry about either of the above, as this
> > looks like a good method for allocation.  I can fix the xfs_repair tool
> > and make it aware of these conditions ("real-time extents" shared by two
> > or more extents).
>
> Personally, I'd prefer that 3) is done because RT extents should
> always be rtextsz aligned and sized, and the bmapbt should respect
> that requirement in all cases.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david at fromorbit.com
>

Thanks, Dave, for the prompt reply. I meant to implement option 2 the way
you explained it (your option 3).  I will start working on it tomorrow. In
the meantime, I also had to put something in xfs_repair for the files which
already exist on disk. Would you guys be willing to review/comment on
that?

anand