file journal fadvise
Mark Nelson
mark.nelson at inktank.com
Mon Dec 1 16:31:18 CST 2014
On 12/01/2014 01:23 PM, Sage Weil wrote:
> On Mon, 1 Dec 2014, Mark Nelson wrote:
>> On 11/30/2014 09:26 PM, Sage Weil wrote:
>>> On Mon, 1 Dec 2014, ??? wrote:
>>>> Hi Sage,
>>>> fadvise_random only changes the file readahead. I think it makes
>>>> no sense for XFS.
>>>> Because XFS is not like btrfs, journal writes always go to the old
>>>> place (where the blocks were first allocated). We can only make those
>>>> places contiguous.
>>>
>>> I'm thinking of the OSD journal, which can be a regular file. I guess it
>>> would probably be an allocator mode, set via a XFS_XFLAG_* flag passed to
>>> an ioctl, which makes the delayed allocation especially unconcerned with
>>> keeping blocks contiguous. It would need to be combined with the discard
>>> ioctl so that any journal write can be allocated wherever it is most
>>> convenient (hopefully contiguous to some other write).
>>>
>>> sage
>>
>> Hi Sage,
>>
>> Could you quickly write down the steps you are thinking we'd take to implement
>> this? I'm concerned about the amount of overhead this could cause but I want
>> to make sure I'm thinking about it correctly. Especially when trim happens and
>> what you think/expect to happen at the FS and device levels.
>
> 1- set journal_discard = true
> 2- add journal_preallocate = true config option, set it to false, and make
> the fallocate(2) call on journal create conditional on that.
> 3- test with defaults (discard = false, preallocate = true) and
> compare it to discard = true + preallocate = false (with file journal).
> 4- possibly add a call to set extsize to something small on the journal
> file. Or give xfs some other appropriate hint, if one exists.
>
> sage
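
For anyone following along, step 2 above (making preallocation conditional on
a journal_preallocate option) could be sketched roughly as below. This is only
an illustration in Python, not the actual OSD code: create_journal and its
preallocate parameter are hypothetical names, and it uses posix_fallocate(3)
via os.posix_fallocate as a stand-in for the fallocate(2) call mentioned above.

```python
import os
import tempfile


def create_journal(path, size, preallocate=True):
    """Create a journal file of `size` bytes and return its open fd.

    preallocate=True  -> reserve the blocks up front (current behaviour)
    preallocate=False -> only set the file length, leaving a sparse file
                         so the filesystem allocates blocks at write time
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        if preallocate:
            # Stand-in for fallocate(2): allocate [0, size) now.
            os.posix_fallocate(fd, 0, size)
        else:
            # No allocation; blocks appear only as the journal is written.
            os.ftruncate(fd, size)
    except OSError:
        os.close(fd)
        raise
    return fd


if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    for flag in (True, False):
        fd = create_journal(os.path.join(tmpdir, "journal"), 1 << 20, flag)
        st = os.fstat(fd)
        print("preallocate=%s size=%d blocks=%d" % (flag, st.st_size, st.st_blocks))
        os.close(fd)
        os.unlink(os.path.join(tmpdir, "journal"))
```

Both modes produce a file of the same length; what differs is whether the
blocks are reserved at create time, which is the variable step 3 would test.
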

CCing XFS devel so we can get some feedback from those guys too.

Question: Looking through our discard code in common/blkdev.cc, it
looks like the new discard implementation is using blkdiscard. For
co-located journals should we be using fstrim_range?
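
To make the difference in the question concrete, here is a rough sketch of
the two interfaces: the BLKDISCARD ioctl takes a {start, length} pair on a
block device, while FITRIM takes a struct fstrim_range {start, len, minlen}
on a mounted filesystem. The helper names are made up for illustration, and
the ioctl number encoding assumes the common Linux layout (x86, ARM); some
architectures encode it differently. Both calls need appropriate privileges
and discard data, so treat this as a sketch rather than something to run
against a live device.

```python
import fcntl
import os
import struct

_IOC_WRITE, _IOC_READ = 1, 2


def ioc(direction, type_, nr, size):
    # Common Linux ioctl number layout (assumption: x86/ARM style encoding).
    return (direction << 30) | (size << 16) | (type_ << 8) | nr


BLKDISCARD = ioc(0, 0x12, 119, 0)                       # _IO(0x12, 119)
FITRIM = ioc(_IOC_READ | _IOC_WRITE, ord("X"), 121,
             struct.calcsize("=QQQ"))                   # _IOWR('X', 121, fstrim_range)


def blkdiscard_arg(start, length):
    # uint64_t range[2] = {start, length}, in bytes
    return struct.pack("=QQ", start, length)


def fstrim_arg(start, length, minlen=0):
    # struct fstrim_range {u64 start; u64 len; u64 minlen;}
    return struct.pack("=QQQ", start, length, minlen)


def discard_block_range(dev_path, start, length):
    """Discard a byte range on a raw block device (journal on a partition)."""
    fd = os.open(dev_path, os.O_WRONLY)
    try:
        fcntl.ioctl(fd, BLKDISCARD, blkdiscard_arg(start, length))
    finally:
        os.close(fd)


def trim_filesystem(mount_point, start=0, length=2**64 - 1, minlen=0):
    """Ask the filesystem to discard its free space; returns bytes trimmed."""
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        buf = bytearray(fstrim_arg(start, length, minlen))
        fcntl.ioctl(fd, FITRIM, buf)  # kernel writes bytes trimmed into len
        return struct.unpack("=QQQ", bytes(buf))[1]
    finally:
        os.close(fd)
```

Note that FITRIM only covers space the filesystem already considers free, so
for a file journal the blocks would have to be released first.
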

FWIW, there were some performance tests done quite a while ago:
http://people.redhat.com/lczerner/discard/files/Performance_evaluation_of_Linux_DIscard_support_Dev_Con2011_Brno.pdf
>
>>
>> Mark
>>
>>>
>>>
>>>>
>>>> Thanks!
>>>> Jianpeng
>>>>
>>>> 2014-12-01 2:46 GMT+08:00 Sage Weil <sweil at redhat.com>:
>>>>> Currently, when an OSD journal is stored as a file, we preallocate it as
>>>>> a large contiguous extent. That means that for every journal write we're
>>>>> seeking back to wherever the journal is. That's possibly not ideal for
>>>>> writes. For reads it's great, but that's the last thing we care about
>>>>> optimizing (we only read the journal after a failure, which is very
>>>>> rare).
>>>>>
>>>>> I wonder if we would do better if we:
>>>>>
>>>>> 1- trim/discard the old journal contents,
>>>>> 2- posix_fadvise RANDOM
>>>>>
>>>>> I'm not sure what the XFS behavior is in this case, but ideally it seems
>>>>> what we want it to do is write the journal wherever on disk it is most
>>>>> convenient... ideally contiguous with some other write that it is
>>>>> already doing. If fadvise random doesn't do that, perhaps there is
>>>>> another allocator hint we can give it that will get us that behavior...
>>>>>
>>>>> sage
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo at vger.kernel.org
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>
>>