xfs
[Top] [All Lists]

Re: Unexpected xfs_buf_item_init initialization and inflexible xfs_buf_i

To: Timothy Shimmin <tes@xxxxxxx>
Subject: Re: Unexpected xfs_buf_item_init initialization and inflexible xfs_buf_item_zone
From: Shailendra Tripathi <stripathi@xxxxxxxxx>
Date: Wed, 11 Jan 2006 16:29:59 +0530
Cc: linux-xfs@xxxxxxxxxxx
In-reply-to: <43C4BFAF.1060507@xxxxxxx>
References: <14BC3454F4B4614FBC4F3FB19A84C372050C5F@xxxxxxxxxxxxxxxxxxxxxxx> <43BA04AF.2060703@xxxxxxxxx> <43C4BFAF.1060507@xxxxxxx>
Sender: linux-xfs-bounce@xxxxxxxxxxx
User-agent: Mozilla Thunderbird 0.9 (X11/20041127)
Hi Tim,
        Thanks so much for the response.  Comments inlined.
Yes I agree, that the map_size is going to be unnecessarily large
(potentially by 1).
where N = number of total bits in map = number of 128 byte chunks ("chunks")
and  n = number of bits in an int ("NBWORD")
map_size = ((N + (n-1))/n)  number of ints

The existing calculation of ((N + n)/n) will be wrong when N is a multiple of n,
it will go over by 1. i.e. it will be wrong when N is a multiple of 32.
The question, however, is if this will cause a problem and not just
a slight waste of space.
        
Given that the chunks calculation relies on buf count (not on page count), this extra addition appears mandatory. So, it appears that it has been intentionally set to extra one to compensate its usage along with buf count.

If non-aligned (non 4k aligned) request for full 64K buffer comes in xfs_trans_read_buf, the buffer actually created may be of higher size as buffers are created in multiples of PAGE_SIZE (4k). For example, if xfs_buf_get_flags gets ioff= 2K, and length is 16 blocks(4k), the buffer created will be of 17 pages. Hence, the dirtied chunks can be in 17th page as well.

The chunks calculated here should be for complete buffer (not the allocation request) as this is what logging code will finally see (it does not know the original request) and the maps will account for the complete buffer (first 2k as well). However, chunks & map calculated using buf count will only suffice for 16 pages. This forces to have the map_size off by 1.

chunks = (int)((XFS_BUF_COUNT(bp) + (XFS_BLI_CHUNK - 1)) >> XFS_BLI_SHIFT);
        map_size = (int)((chunks + NBWORD) >> BIT_TO_WORD_SHIFT);

Isn't understanding this will be far more easier if (and logically more correct?) if it would have been

chunks = (int)((XFS_PAGE_COUNT(bp) (XFS_BLI_CHUNK - 1)) >> XFS_BLI_SHIFT);
and map_size = (int)((chunks + NBWORD -1) >> BIT_TO_WORD_SHIFT);
        
I realize now the current code is correct, however, it took me a while to figure out this indirect relation. When I changed the map_size calculation to correct it, I got "unable to handle page request" because of the above chunks calculation.
        
And, the second part which I was interested to know was (not very clear in my mail) as to where from this XFS_MAX_BLOCKSIZE comes and how buf log item code is dependent upon this. I am unable to get the correlation between this size and the max buffer size.

To be precise, can I figure what is the maximum size of a buf item which can be created with the block size of 4k (or what is the maximum number of pages which can be dirtied in any transaction in worst case). If it is dependent upon blocksize, why not to make the buf_item_zone initialization dependent upon the blocksize someway to avoid the wastage of space. It is really significant if the number of inodes supported on the system is high. (If it can be avoided, it can be an icing on cake).

xfs_buf_item_zone =
                kmem_zone_init((sizeof(xfs_buf_log_item_t) +
                                (((XFS_MAX_BLOCKSIZE / XFS_BLI_CHUNK) /
                                  NBWORD) * sizeof(int))),
                               "xfs_buf_item");

I also need to know this because I have changed the size of the inode allocated at once from 64 inodes to full stripe width allocation. I noticed that I have been crossing the 16 + 1 page limit. I allocate an ad hoc higher number to avoid this. This creates more wastage of space. Alternatively, I will have to go for dynamic allocation of maps which may be costlier as the comment above this initialization says so.
        
-shailendra

Some concerns that spring to mind:

1) Will the allocation of space for the blf_data_map be matched by the accesses
of the map?
The allocation size is set in the setting up of the xfs_buf_item_zone.
It does this to:
sizeof(xfs_buf_log_item_t) + XFS_MAX_BLOCK_SIZE / XFS_BLI_CHUNK / NBWORD * sizeof(int)
= sizeof(xfs_buf_log_item_t) + 64K / 128 / 32 * 4
And this basically sizes it for a buffer of the largest block (64K) and
for 1 extra in xfs_buf_log_item_t's blf_data_map[1].
So we do seem to actually cater for 1 extra here.
And so the accesses which use xfs_next_bit & blf_map_size when they go over
what covers the buffer, they will not go over the zone limit.

2) Will the oversized map cause us to access beyond the buffer and its pages?
Well, looking at xfs_buf_item_format() and xfs_buf_item_size(), they call
upon xfs_buf_offset(). However, it looks like they would do so only for a bit in the map which was turned on (1). And the extra bits at the end of the map
will be zeroed at xfs_buf_item_init() stage using kmem_zone_zalloc().
And should never be turned on by xfs_buf_item_log() as the byte offsets
given would always be within the buffer.
So I think xfs_next_bit() will just scan over these final zeroes and then
return -1.
But maybe I've missed something?

Also, chunks calculation may go well off the mark of the blocks are not page aligned. The code does not block alignment check. It only checks for sectoral alignment. For example, lets say I need 32768 pb_buffer_length. However, the offset is not 4096 aligned. Lets say it starts at, 2048. Given this, _pagebuf_lookup_pages will prepare 8 + 1 pages.
page_count = page_buf_btoc(end) - page_buf_btoct(bp->pb_file_offset);

In this case, chunks calculated will suffice for only 8 pages. In this particular case, the statement 2 makes amend for it and allocates correct map size. But, it oversteps when the offset is block (4096) aligned. That is, if offset was 0, still it will allocate 9 pages.


I'm not that au fait with the page buf code.
But I do not understand your page alignment concern. The map is just storing
which parts of the buffer (down to a 128 byte level) need to be logged.
(During xfs_buf_item_format() when it transforms the map into vectors it
will look up the regions to see if the adjacent bits are actually
contiguous (and not in another page for example)
and if not then it will use an extra vector)

It affects the xfs_next_bit called in xfs_buf_item_size and xfs_buf_item_format. If last chunk of the 8 blocks allocated are dirty, the code will over step into 9th (which might be not even cached) page and potentially return in "unable to handle paging request". This is because the xfs_next_bit will not return -1 as it does not see the end due to map size overstepping.

But I would think it won't overstep because this extra end part of the map will
be zeroed out.
Have I missed something?

Thanks muchly,
Tim.


<Prev in Thread] Current Thread [Next in Thread>