Hi Tim,
Thanks so much for the response. Comments inlined.
Yes, I agree that the map_size is going to be unnecessarily large
(potentially by one word).
where N = total number of bits in the map = number of 128-byte chunks
("chunks"), and n = number of bits in an int ("NBWORD"):

map_size = (N + (n - 1)) / n ints
The existing calculation of ((N + n)/n) goes over by one whenever N is
a multiple of n, i.e. it will be wrong whenever N is a multiple of 32.
The question, however, is whether this will cause a real problem or
just a slight waste of space.
Given that the chunks calculation relies on the buf count (not on the
page count), this extra addition appears mandatory. It looks as if the
extra one was set intentionally, to compensate for its use together
with the buf count.
If a non-aligned (not 4k-aligned) request for a full 64K buffer comes
in via xfs_trans_read_buf, the buffer actually created may be larger,
because buffers are created in multiples of PAGE_SIZE (4k). For
example, if xfs_buf_get_flags gets ioff = 2K and a length of 16 blocks
(4k each), the buffer created will span 17 pages. Hence, dirtied
chunks can be in the 17th page as well.
The chunks calculated here should cover the complete buffer (not just
the allocation request), as this is what the logging code finally sees
(it does not know the original request), and the map must account for
the complete buffer (the first 2k as well). However, chunks and map
size calculated from the buf count will only suffice for 16 pages.
This forces the map_size to be off by one.
chunks = (int)((XFS_BUF_COUNT(bp) + (XFS_BLI_CHUNK - 1)) >>
XFS_BLI_SHIFT);
map_size = (int)((chunks + NBWORD) >> BIT_TO_WORD_SHIFT);
Wouldn't this be far easier to understand (and logically more correct?)
if it had been

chunks = (int)((XFS_PAGE_COUNT(bp) + (XFS_BLI_CHUNK - 1)) >> XFS_BLI_SHIFT);
map_size = (int)((chunks + NBWORD - 1) >> BIT_TO_WORD_SHIFT);
I realize now that the current code is correct; however, it took me a
while to figure out this indirect relation. When I changed the map_size
calculation to "correct" it, I got an "unable to handle paging request"
because of the above chunks calculation.
The second thing I wanted to know (not very clear in my mail) was where
this XFS_MAX_BLOCKSIZE comes from and how the buf log item code depends
on it. I am unable to see the correlation between this size and the
maximum buffer size.
To be precise: can I figure out the maximum size of a buf item that can
be created with a block size of 4k (or the maximum number of pages that
can be dirtied in any transaction in the worst case)? If it depends on
the block size, why not make the buf_item_zone initialization depend on
the block size somehow, to avoid the wasted space? The waste is really
significant if the number of inodes supported on the system is high.
(If it can be avoided, that would be icing on the cake.)
xfs_buf_item_zone =
kmem_zone_init((sizeof(xfs_buf_log_item_t) +
(((XFS_MAX_BLOCKSIZE / XFS_BLI_CHUNK) /
NBWORD) * sizeof(int))),
"xfs_buf_item");
I also need to know this because I have changed the number of inodes
allocated at once from 64 to a full stripe width. I noticed that I have
been crossing the 16 + 1 page limit. For now I allocate an ad hoc
higher number to avoid this, which wastes even more space.
Alternatively, I will have to go for dynamic allocation of the maps,
which may be costlier, as the comment above this initialization
suggests.
-shailendra
Some concerns that spring to mind:

1) Will the allocation of space for the blf_data_map be matched by the
accesses of the map?
The allocation size is set when the xfs_buf_item_zone is set up. It
comes to:

sizeof(xfs_buf_log_item_t) + XFS_MAX_BLOCKSIZE / XFS_BLI_CHUNK /
NBWORD * sizeof(int)
= sizeof(xfs_buf_log_item_t) + 64K / 128 / 32 * 4

This basically sizes it for a buffer of the largest block size (64K),
plus one extra word from xfs_buf_log_item_t's blf_data_map[1]. So we
do seem to cater for one extra here.
So the accesses that use xfs_next_bit and blf_map_size will not go past
the zone allocation, even when they go beyond what covers the buffer.
2) Will the oversized map cause us to access beyond the buffer and its
pages?

Well, looking at xfs_buf_item_format() and xfs_buf_item_size(), they
call upon xfs_buf_offset(). However, it looks like they would do so
only for a bit in the map which was turned on (1). The extra bits at
the end of the map will be zeroed at the xfs_buf_item_init() stage by
kmem_zone_zalloc(), and they should never be turned on by
xfs_buf_item_log(), as the byte offsets given will always be within the
buffer.

So I think xfs_next_bit() will just scan over these final zeroes and
then return -1.

But maybe I've missed something?
Also, the chunks calculation may go well off the mark if the blocks are
not page aligned. The code does not do a block alignment check; it only
checks for sector alignment. For example, let's say I need a
pb_buffer_length of 32768, but the offset is not 4096-aligned; say it
starts at 2048. Given this, _pagebuf_lookup_pages will prepare 8 + 1
pages:
page_count = page_buf_btoc(end) - page_buf_btoct(bp->pb_file_offset);
In this case, the chunks calculated will suffice for only 8 pages. In
this particular case, statement 2 makes amends for it and allocates the
correct map size. But it oversteps when the offset is block (4096)
aligned: if the offset were 0, only 8 pages would be set up, yet the
map would still be sized for 9.
I'm not that au fait with the pagebuf code, but I do not understand
your page alignment concern. The map is just recording which parts of
the buffer (down to a 128-byte level) need to be logged. (During
xfs_buf_item_format(), when it transforms the map into vectors, it will
look at the regions to see whether the adjacent bits are actually
contiguous (and not in another page, for example), and if not it will
use an extra vector.)
It affects the xfs_next_bit called in xfs_buf_item_size and
xfs_buf_item_format. If the last chunk of the 8 allocated blocks is
dirty, the code will overstep into the 9th page (which might not even
be cached) and potentially die with an "unable to handle paging
request". This is because xfs_next_bit will not return -1, as it does
not see the end, due to the map size overstepping.
But I would think it won't overstep, because this extra end part of the
map will be zeroed out.

Have I missed something?
Thanks muchly,
Tim.