Hi Tim,
I figured out the max_buf_len. It is limited to
max(XFS_INODE_CLUSTER_SIZE, blocksize), which will always be 2
blocks/pages for a filesystem created with a 4k blocksize on Linux.
I added some small instrumentation, recorded 8192 such entries, and they
confirmed the same:
(gdb) p xfs_max_buf_len
$1 = 2
(gdb) p xfs_buf_len (number of pages dirtied)
$2 = {1 <repeats 712 times>, 2, 2, 2, 2, 1 <repeats 92 times>, 2, 2, 2, 2,
1 <repeats 822 times>, 2, 2, 2, 2, 1 <repeats 1084 times>, 2, 2, 2, 2,
1 <repeats 1734 times>, 2, 2, 2, 2, 1 <repeats 118 times>, 2, 2, 2, 2,
1 <repeats 942 times>, 2, 2, 2, 2, 1 <repeats 1325 times>, 2, 2, 2, 2,
1 <repeats 740 times>, 2, 2, 2, 2, 1 <repeats 587 times>}
From this, it does appear that 60 bytes are wasted per transactionally
active inode by xfs_buf_item_zone. These could potentially be saved,
though at the cost of one inflexibility: a system could not have
filesystems of different blocksizes. On Linux, 4k seems to be the
default, and I am not sure whether changing this is really a point of
concern. Any comments?
Regards,
Shailendra
Timothy Shimmin wrote:
Hi Shailendra,
Thanks for the report.
Shailendra Tripathi wrote:
Section A:
----------
xfs_buf_item_init()
/*
* chunks is the number of XFS_BLI_CHUNK size pieces
* the buffer can be divided into. Make sure not to
* truncate any pieces. map_size is the size of the
* bitmap needed to describe the chunks of the buffer.
*/
chunks = (int)((XFS_BUF_COUNT(bp) + (XFS_BLI_CHUNK - 1)) >>
XFS_BLI_SHIFT);
map_size = (int)((chunks + NBWORD) >> BIT_TO_WORD_SHIFT); ---> 2
For a 4096-byte block/page (the Linux default), one map word is required
per page. Given this, statement 2 appears odd. Instead, it should be:
map_size = (int)((chunks + NBWORD - 1) >> BIT_TO_WORD_SHIFT);
Yes, I agree that the map_size is going to be unnecessarily large
(potentially by 1),
where N = total number of bits in the map = number of 128-byte chunks
("chunks")
and n = number of bits in an int ("NBWORD"):
map_size = ((N + (n-1))/n) number of ints
The existing calculation of ((N + n)/n) over-allocates by 1 when N is a
multiple of n, i.e. when N is a multiple of 32.
The question, however, is whether this will cause a problem rather than
just a slight waste of space.
Some concerns that spring to mind:
1) Will the allocation of space for the blf_data_map be matched by the
accesses of the map?
The allocation size is set when the xfs_buf_item_zone is set up.
It is sized as:
sizeof(xfs_buf_log_item_t) + XFS_MAX_BLOCK_SIZE / XFS_BLI_CHUNK /
NBWORD * sizeof(int)
= sizeof(xfs_buf_log_item_t) + 64K / 128 / 32 * 4
So it basically sizes the item for a buffer of the largest block (64K),
plus 1 extra word from xfs_buf_log_item_t's blf_data_map[1].
So we do seem to actually cater for 1 extra word here, and the accesses
which use xfs_next_bit and blf_map_size will not go over the zone limit
even when they go past what covers the buffer.
2) Will the oversized map cause us to access beyond the buffer and its
pages?
Well, looking at xfs_buf_item_format() and xfs_buf_item_size(), they
call upon xfs_buf_offset(). However, it looks like they would do so only
for a bit in the map which was turned on (1), and the extra bits at the
end of the map will be zeroed at xfs_buf_item_init() stage by
kmem_zone_zalloc(). They should never be turned on by xfs_buf_item_log(),
as the byte offsets given would always be within the buffer.
So I think xfs_next_bit() will just scan over these final zeroes and then
return -1.
But maybe I've missed something?
Also, the chunks calculation may go well off the mark if the blocks are
not page aligned. The code does not check for block alignment; it only
checks for sector alignment. For example, let's say I need a
pb_buffer_length of 32768, but the offset is not 4096-aligned; say it
starts at 2048. Given this, _pagebuf_lookup_pages will prepare 8 + 1
pages:
page_count = page_buf_btoc(end) - page_buf_btoct(bp->pb_file_offset);
In this case, the chunks calculated will suffice for only 8 pages. In
this particular case, statement 2 compensates for it and allocates the
correct map size. But it oversteps when the offset is block (4096)
aligned; that is, if the offset were 0, it would still allocate a map
covering 9 pages.
I'm not that au fait with the page buf code, but I do not understand
your page alignment concern. The map is just storing which parts of the
buffer (down to a 128-byte granularity) need to be logged.
(During xfs_buf_item_format(), when it transforms the map into vectors,
it will look up the regions to see whether the adjacent bits are actually
contiguous (and not in another page, for example), and if not it will
use an extra vector.)
It affects the xfs_next_bit calls in xfs_buf_item_size and
xfs_buf_item_format. If the last chunk of the 8 allocated blocks is
dirty, the code will overstep into the 9th page (which might not even be
cached) and could fault with "unable to handle paging request". This is
because xfs_next_bit will not return -1: it does not see the end of the
buffer, due to the oversized map.
But I would think it won't overstep, because this extra end part of the
map will be zeroed out.
Have I missed something?
Thanks muchly,
Tim.