Stephen Lord wrote:
>
> On Tue, 2002-10-29 at 01:30, Luben Tuikov wrote:
> > Maybe like this?
> >
> > --- linux/fs/xfs/pagebuf/page_buf.c.orig Tue Oct 29 01:48:57 2002
> > +++ linux/fs/xfs/pagebuf/page_buf.c Tue Oct 29 01:52:17 2002
> > @@ -1665,22 +1665,18 @@
> > bh = bufferlist[i];
> >
> > /* Complete the buffer_head, then submit the IO */
> > - if (psync) {
> > - init_buffer(bh, callback, psync);
> > - } else {
> > - init_buffer(bh, callback, pb);
> > - }
> > + init_buffer(bh, callback, psync ? : pb);
> >
>
> Well that's a little cryptic, a conditional
> expression which I have to deduce is defaulting
> to the left hand side in the case where there is
> no right hand side.
a ? a : b;  <==>  a ? : b;   (a GNU C extension; `a' gets evaluated only once)
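Spelled out as a tiny sketch (the helper function is made up; psync and
pb are just the names from the patch):

    /* GNU C "conditional with omitted operand": x ? : y evaluates x
     * exactly once and yields x if it is non-zero, otherwise y. */
    static void *pick_callback_arg(void *psync, void *pb)
    {
            return psync ? : pb;    /* same as psync ? psync : pb */
    }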
>
> > bh->b_rdev = bh->b_dev;
> > bh->b_rsector = bh->b_blocknr;
> > set_bit(BH_Mapped, &bh->b_state);
> > set_bit(BH_Req, &bh->b_state);
> > -
> > - if (rw == WRITE) {
> > - set_bit(BH_Uptodate, &bh->b_state);
> > - }
> > - generic_make_request(rw, bh);
> > + if (rw == WRITE)
> > + set_bit(BH_Dirty, &bh->b_state);
>
> In general callers of the block layer
> clear the dirty bit on buffers they submit,
> see __block_write_full_page. In reality,
> since we use generic_make_request it does
> not appear these flags matter.
I consider the md code to be a correct implementation. Please
see the code I referred to in the original message.
If the buffer should be written out, the dirty bit should be turned on.
That makes sense any way you look at it: 1. it is what the code I've
seen in the kernel (md) does, 2. it follows logically, and 3. it matches
my old OS course textbook (``the dinosaur book'').
Regarding BH_Uptodate: as you can see elsewhere in the kernel,
this bit is set from the completion callback, when the IO finishes.
Thus it makes sense to leave it off until the IO has completed, and
let the callback set it to the proper value.
IOW, I want any stacking driver that _does_ check BH_Uptodate and
BH_Dirty to get the right picture. I don't want IO silently skipped
just because some layer looked at BH_Uptodate or BH_Dirty and handed
the bh back without doing any actual IO. I want to cover those
contingencies; I'm arguing from principle here.
Please take this the right way.
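In sketch form, the ordering I'm arguing for looks roughly like this
(the function names here are made up; the flag and helper calls are
the standard 2.4 ones):

    #include <linux/fs.h>
    #include <linux/locks.h>
    #include <asm/bitops.h>

    /* before submission: a write is pending, so mark the bh dirty;
     * do not claim BH_Uptodate yet, since the data is not known good
     * until the IO actually completes */
    static void prepare_bh_for_io(int rw, struct buffer_head *bh)
    {
            if (rw == WRITE)
                    set_bit(BH_Dirty, &bh->b_state);
    }

    /* completion callback (b_end_io): only now reflect the real state */
    static void sample_end_io(struct buffer_head *bh, int uptodate)
    {
            mark_buffer_uptodate(bh, uptodate);
            clear_bit(BH_Dirty, &bh->b_state);
            unlock_buffer(bh);
    }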
> > }
> > + for (i = 0; i < cnt; i++)
> > + generic_make_request(rw, bufferlist[i]);
>
> On the one hand you are doing fancy optimizations
> of the init_buffer call, and on the other you add
> a new loop - which is not needed once the other
> synchronization calls are in.
This isn't ``fancy optimization'', it's just the usual idiom for
submitting bh's to generic_make_request(): set them all up first,
then submit them. You can see it done elsewhere in the kernel, and
it is the proper way of doing it.
Had the original author used that idiom, the above for() loop could
have stayed the same without the patch from Asano...
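To be concrete, the shape of the idiom, as a rough sketch (the helper
is hypothetical, not the actual pagebuf code):

    #include <linux/fs.h>
    #include <linux/blkdev.h>

    /* bhs[] are assumed to be fully initialized buffer_heads; setup is
     * done in its own loop, so submission here is one tight pass and
     * the block layer sees the requests back to back */
    static void submit_bh_batch(int rw, struct buffer_head **bhs, int cnt)
    {
            int i;

            for (i = 0; i < cnt; i++)
                    generic_make_request(rw, bhs[i]);
    }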
Just as we stick to K&R2 C layout (Documentation/CodingStyle), I'd
argue there are certain Linux kernel programming idioms that are
*good* to abide by (in kernel code).
Please understand that I'm arguing from principle here, _and_ from
good practice. All I want is for XFS to work and be faster. We use
it, so I do care.
This is all similar to the locking example I gave in an email here
some time ago, and to the if() example, also given here, from the very
function we are talking about.
It's all about a clear logical concept implemented in C. Clear code
reflects a clear logical foundation and a well-thought-out plan.
Declaring a struct in the middle of the source file is not my idea
of good programming practice. pagesync_t can be part of pb: ``locking''
is implicit in pb, and ``remain'' can be part of pb too; in fact both
can be folded into one by keeping locking = remain + 1 (atomically,
of course).
I've done similar things in my ``sbs'', hence the discourse.
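To sketch what I mean by folding them into one (all the names below
are made up, it's just the counting scheme):

    #include <asm/atomic.h>

    /* one atomic counter does double duty: it starts at 1 while the
     * submitter still holds the buffer (the implicit ``locking''),
     * each submitted bh adds 1, and each completion drops 1.  Whoever
     * brings it to 0, be it the last completion or the submitter's own
     * final drop, finishes off the I/O. */
    typedef struct pb_iocount {
            atomic_t        io_remaining;
    } pb_iocount_t;

    static inline void pb_io_start(pb_iocount_t *c)
    {
            atomic_set(&c->io_remaining, 1);
    }

    static inline void pb_io_add(pb_iocount_t *c)
    {
            atomic_inc(&c->io_remaining);
    }

    /* returns non-zero when the whole I/O is complete */
    static inline int pb_io_drop(pb_iocount_t *c)
    {
            return atomic_dec_and_test(&c->io_remaining);
    }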
> > + run_task_queue(&tq_disk);
>
> ^^^^^^^^^
>
> This makes XFS more synchronous in terms of disk I/O
> than it needs to be. Try feeding this code a large
> dbench load or something similar.
Yes, I'd like that very much. Please tell me, or point me at, the way
_you_ guys do your testing. We'd like to try it here in addition to
our own tests.
> I think the only part of this which makes a difference to the
> execution of the code is the run_task_queue, please explain why
> it is in there.
This follows the same principle as the locking example I gave a few
emails ago.
For that same reason I'd like to see one *well-defined* place for
run_task_queue(&tq_disk), rather than seeing the call scattered all
over the place.
I realize this may be hard, since some IO is done in process context
(by the process which initiated it), but a kernel thread *should* be
doing it; see the comment above __get_request_wait() in ll_rw_blk.c.
So when in process context, let the kernel do the IO throttling, i.e.
don't try to be smarter than the kernel; but when XFS's own thread is
doing the IO, running the disk task queue is essential for performance
(I have a similar thing in sbs).
As you can see, this was my attempt at getting closer to that model.
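Concretely, something along these lines is what I have in mind (the
helper and its flag are hypothetical):

    #include <linux/blkdev.h>
    #include <linux/tqueue.h>

    /* one well-defined place that unplugs the disk queues.  Process
     * context submitters pass 0 and let the block layer throttle them
     * (see __get_request_wait()); only the pagebuf daemon passes 1. */
    static void pb_run_disk_queue(int from_daemon)
    {
            if (from_daemon)
                    run_task_queue(&tq_disk);
    }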
> This means xfs has no way to coalesce any
> metadata I/O beyond an individual page into the device queue.
> So our 32K log writes become 8 seperate disk I/Os for starters.
> Now, if you are sitting on top of lvm it is probably slicing
> and dicing under the covers and masking this, if you try a
> single disk I think you will notice a performance drop.
Right.
In which case the clear separation model outlined above is
necessary, if not a MUST.
That is, let's not try to out-smart the kernel. It knows best, and
we should do only the minimum required on our side.
This also protects us from changes elsewhere in the kernel.
--
Luben
P.S. I couldn't see the justification for XFS treating md and lvm
differently. The block layer certainly doesn't treat them differently;
to it they're just more stacking drivers...