<br><br><div class="gmail_quote">On Tue, Sep 25, 2012 at 8:44 PM, Dave Chinner <span dir="ltr"><<a href="mailto:david@fromorbit.com" target="_blank">david@fromorbit.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb"><div class="h5">On Tue, Sep 25, 2012 at 07:26:32PM -0600, Anand Tiwari wrote:<br>
> On Mon, Sep 24, 2012 at 6:51 AM, Anand Tiwari <<a href="mailto:tiwarikanand@gmail.com">tiwarikanand@gmail.com</a>>wrote:<br>
><br>
> ><br>
> ><br>
> > On Mon, Sep 24, 2012 at 1:55 AM, Dave Chinner <<a href="mailto:david@fromorbit.com">david@fromorbit.com</a>> wrote:<br>
> ><br>
> >> On Fri, Sep 21, 2012 at 12:00:13AM -0500, Eric Sandeen wrote:<br>
> >> > On 9/20/12 7:40 PM, Anand Tiwari wrote:<br>
> >> > > Hi All,<br>
> >> > ><br>
> >> > > I have been looking into an issue with xfs_repair with realtime sub<br>
> >> volume. some times while running xfs_repair I see following errors<br>
> >> > ><br>
> >> > > ----------------------------<br>
> >> > > data fork in rt inode 134 claims used rt block 19607<br>
> >> > > bad data fork in inode 134<br>
> >> > > would have cleared inode 134<br>
> >> > > data fork in rt inode 135 claims used rt block 29607<br>
> >> > > bad data fork in inode 135<br>
> >> > > would have cleared inode 135<br>
> >> .....<br>
> >> > > xfs_db> inode 135<br>
> >> > > xfs_db> bmap<br>
> >> > > data offset 0 startblock 13062144 (12/479232) count 2097000 flag 0<br>
> >> > > data offset 2097000 startblock 15159144 (14/479080) count 2097000<br>
> >> flag 0<br>
> >> > > data offset 4194000 startblock 17256144 (16/478928) count 2097000<br>
> >> flag 0<br>
> >> > > data offset 6291000 startblock 19353144 (18/478776) count 2097000<br>
> >> flag 0<br>
> >> > > data offset 8388000 startblock 21450144 (20/478624) count 2097000<br>
> >> flag 0<br>
> >> > > data offset 10485000 startblock 23547144 (22/478472) count 2097000<br>
> >> flag 0<br>
> >> > > data offset 12582000 startblock 25644144 (24/478320) count 2097000<br>
> >> flag 0<br>
> >> > > data offset 14679000 startblock 27741144 (26/478168) count 2097000<br>
> >> flag 0<br>
> >> > > data offset 16776000 startblock 29838144 (28/478016) count 2097000<br>
> >> flag 0<br>
> >> > > data offset 18873000 startblock 31935144 (30/477864) count 1607000<br>
> >> flag 0<br>
> >> > > xfs_db> inode 134<br>
> >> > > xfs_db> bmap<br>
> >> > > data offset 0 startblock 7942144 (7/602112) count 2097000 flag 0<br>
> >> > > data offset 2097000 startblock 10039144 (9/601960) count 2097000 flag<br>
> >> 0<br>
> >> > > data offset 4194000 startblock 12136144 (11/601808) count 926000 flag<br>
> >> 0<br>
> >> ><br>
> >> > It's been a while since I thought about realtime, but -<br>
> >> ><br>
> >> > That all seems fine, I don't see anything overlapping there, they are<br>
> >> > all perfectly adjacent, though of interesting size.<br>
> >><br>
> >> Yeah, the size is the problem.<br>
> >><br>
> >> ....<br>
> >> > Every extent above is length 2097000 blocks, and they are adjacent.<br>
> >> > But you say your realtime extent size is 512 blocks ... which doesn't go<br>
> >> > into 2097000 evenly. So that's odd, at least.<br>
> >><br>
> >> Once you realise that the bmapbt is recording multiples of FSB (4k)<br>
> >> rather than rtextsz (2MB), it becomes more obvious what the problem<br>
> >> is: rounding of the extent size at MAXEXTLEN - 2097000 is only 152<br>
> >> blocks short of 2^21 (2097152).<br>
> >><br>
> >> I haven't looked at the kernel code yet to work out why it is<br>
> >> rounding to a non-rtextsz multiple, but that is the source of the<br>
> >> problem.<br>
> >><br>
> >> The repair code is detecting that extents are not of the<br>
> >> correct granularity, but the error message indicates that this was<br>
> >> only ever expected for duplicate blocks occurring rather than a<br>
> >> kernel bug. So "fixing repair" is not what is needed here - finding<br>
> >> and fixing the kernel bug is what you should be looking at.<br>
> >><br>
> >> Cheers,<br>
> >><br>
> >> Dave.<br>
> >> --<br>
> >> Dave Chinner<br>
> >> <a href="mailto:david@fromorbit.com">david@fromorbit.com</a><br>
> >><br>
> ><br>
> ><br>
> > thanks, I started looking at the allocator code and will report if I see<br>
> > something<br>
> ><br>
><br>
><br>
> I think this is what is happening. If we have the following conditions,<br>
> 1) we have more than 8gb contiguous space available to allocate. ( i.e.<br>
> more than 2^21 4k blocks)<br>
> 2) only one file is open for writing in real-time volume.<br>
><br>
> To satisfy the first condition, I just took an empty file-system.<br>
><br>
> Now let's start allocating, say in chunks of 25000 blocks. The realtime allocator<br>
> has no problem allocating the "exact" blocks while searching forward in<br>
> xfs_rtfind_forw(). It will allocate 49 "real-time extents", where the 49th<br>
> "real-time extent" is partially full (25000/512 = 48, remainder 424).<br>
><br>
> everything is fine for the first 83 allocations, as we were able to grow the<br>
> extent. Now we have 2075000 (25000*83) blocks in the first extent, i.e. 4053<br>
> "real-time extents" (where the last "real-time extent" is partially full).<br>
><br>
> for the 84th allocation, the real-time allocator will allocate another 49<br>
> "real-time extents" as it does not know about the maximum extent size, but we<br>
> cannot grow the extent in xfs_bmap_add_extent_unwritten_real(), so we<br>
> insert a new extent (case BMAP_LEFT_FILLING). Now the new extent starts<br>
> at 2075000, which is not aligned with rextsize (512 in this case).<br>
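
The arithmetic in the walk-through above can be checked with a small C sketch.<br>
It assumes MAXEXTLEN is 2^21 - 1 blocks (a 21-bit block count per bmapbt record)<br>
and a 512-block rt extent; the helper names are made up for illustration and are<br>
not kernel functions.<br>

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: largest block count a single bmapbt record can hold (21 bits). */
#define SKETCH_MAXEXTLEN ((uint64_t)0x1fffff)

/* Number of rt extents touched by `blocks` blocks starting at block 0 (rounds up). */
static uint64_t rtextents_spanned(uint64_t blocks, uint64_t rextsize)
{
    assert(rextsize > 0);
    return (blocks + rextsize - 1) / rextsize;
}

/* How many equal-sized chunks fit in one record before the next one overflows it. */
static uint64_t chunks_before_split(uint64_t chunk, uint64_t maxlen)
{
    assert(chunk > 0);
    return maxlen / chunk;
}
```

With 25000-block chunks, chunks_before_split() gives 83, so the first record ends<br>
at block 2075000, which is 376 blocks past the last 512-block boundary.<br>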
<br>
</div></div>Ok, so it's a problem with using unwritten extents and converting<br>
them. That is, the issue probably has nothing to do with the<br>
realtime allocator at all.<br>
<br>
Basically, when the unwritten extent occurs, we end up with a map<br>
like this:<br>
<br>
ext 0: offset 0, length 2075000 state written<br>
ext 1: offset 2075000 length 25000 state unwritten<br>
<br>
This will occur because you can't mix written/unwritten state in a<br>
single extent.<br>
<br>
What xfs_bmap_add_extent_unwritten_real() is attempting to do is<br>
convert the unwritten extent to written state and merge it with its<br>
siblings. In this case, 2075000 + 25000 > MAXEXTLEN, so it does not<br>
merge them because of this check:<br>
<br>
if ((state & BMAP_LEFT_VALID) && !(state & BMAP_LEFT_DELAY) &&<br>
LEFT.br_startoff + LEFT.br_blockcount == new->br_startoff &&<br>
LEFT.br_startblock + LEFT.br_blockcount == new->br_startblock &&<br>
LEFT.br_state == newext &&<br>
>>>>>> LEFT.br_blockcount + new->br_blockcount <= MAXEXTLEN)<br>
state |= BMAP_LEFT_CONTIG;<br>
<br>
Which means that BMAP_LEFT_CONTIG is not set, indicating that no<br>
merging with the adjacent left extent should occur. Hence we end<br>
up with this:<br>
<br>
ext 0: offset 0, length 2075000 state written<br>
ext 1: offset 2075000 length 25000 state written<br>
<br>
That's fine for normal operation, but it means that large contiguous<br>
regions written via direct IO with non-rtextsz aligned/sized IO will<br>
have this problem.<br>
<br>
What technically should happen for these real time files is that the<br>
LEFT extent should be shortened to be aligned, and the new extent be<br>
lengthened and have its startblock adjusted accordingly.<br>
<br>
i.e. we should end up with this:<br>
<br>
ext 0: offset 0, length 2074624 state written<br>
ext 1: offset 2074624 length 25376 state written<br>
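
The numbers above can be reproduced with a short C sketch. It assumes the left<br>
extent starts on an rt extent boundary and that the two extents are adjacent<br>
both logically and physically; the struct and the function realign_boundary()<br>
are hypothetical, not existing kernel code, and the sketch only moves the<br>
shared boundary.<br>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified extent record; all values in filesystem blocks. */
struct sketch_rt_extent {
    uint64_t startoff;
    uint64_t startblock;
    uint64_t blockcount;
};

/*
 * Move the boundary between two adjacent extents down to the previous
 * rextsize multiple: the left extent shrinks, the right extent grows and
 * its start moves back by the same number of blocks.
 */
static void realign_boundary(struct sketch_rt_extent *left,
                             struct sketch_rt_extent *right,
                             uint64_t rextsize)
{
    assert(rextsize > 0);
    assert(left->startoff + left->blockcount == right->startoff);
    assert(left->startblock + left->blockcount == right->startblock);

    /* Blocks by which the left extent overshoots an rt extent boundary. */
    uint64_t slop = left->blockcount % rextsize;

    left->blockcount -= slop;
    right->startoff -= slop;
    right->startblock -= slop;
    right->blockcount += slop;
}
```

With rextsize 512, the 2075000/25000 pair becomes 2074624/25376, since<br>
2075000 mod 512 is 376.<br>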
<div class="im"><br>
> To fix this, I see two options,<br>
> 1) fix real-time allocator and teach it about maximum extent size.<br>
> 2) for real-time files, align the new extent before inserting.<br>
<br>
</div>3) Fix the {BMAP_LEFT_CONTIG,MAXEXTLEN,rtextsz} handling in<br>
xfs_bmap_add_extent_unwritten_real().<br>
<br>
It's possible that the BMAP_RIGHT_CONTIG case also needs similar<br>
fixing...<br>
<div class="im"><br>
> In my opinion, we should not worry about either of the above, as this looks<br>
> like a good method for allocation. I can fix the xfs_repair tool and make it aware of<br>
> these conditions ("real-time extents" shared by two or more extents).<br>
<br>
</div>Personally, I'd prefer that 3) is done because RT extents should<br>
always be rtextsz aligned and sized, and the bmapbt should respect<br>
that requirement in all cases.<br>
<div class="HOEnZb"><div class="h5"><br>
Cheers,<br>
<br>
Dave.<br>
--<br>
Dave Chinner<br>
<a href="mailto:david@fromorbit.com">david@fromorbit.com</a><br>
</div></div></blockquote></div><br><div>Thanks, Dave, for the prompt reply. By option 2 I meant what you describe as option 3. I will start working on it tomorrow. In the meantime, I also had to put something in xfs_repair for the files that already exist on disk. Would you be willing to review/comment on that?</div>
<div><br></div><div>anand</div>