xfs
[Top] [All Lists]

Re: [PATCH] xfs: allow logical-sector sized O_DIRECT for any fs sector s

To: Dave Chinner <david@xxxxxxxxxxxxx>, Eric Sandeen <sandeen@xxxxxxxxxx>
Subject: Re: [PATCH] xfs: allow logical-sector sized O_DIRECT for any fs sector size
From: Eric Sandeen <sandeen@xxxxxxxxxxx>
Date: Wed, 15 Jan 2014 16:52:05 -0600
Cc: xfs-oss <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <20140115223848.GZ3469@dastard>
References: <52D6CC91.6000408@xxxxxxxxxx> <20140115223848.GZ3469@dastard>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
On 1/15/14, 4:38 PM, Dave Chinner wrote:
> On Wed, Jan 15, 2014 at 11:59:45AM -0600, Eric Sandeen wrote:
>> Some time ago, mkfs.xfs started picking the storage physical
>> sector size as the default filesystem "sector size" in order
>> to avoid RMW costs incurred by doing IOs at logical sector
>> size alignments.
>>
>> However, this means that for a filesystem made with i.e.
>> a 4k sector size on an "advanced format" 4k/512 disk,
>> 512-byte direct IOs are no longer allowed.  This means
>> that XFS has essentially turned this AF drive into a hard
>> 4K device, from the filesystem on up.
>>
>> XFS's mkfs-specified "sector size" is really just controlling
>> the minimum size & alignment of filesystem metadata IO.
>>
>> There is no real need to tightly couple XFS's minimal
>> metadata size to the minimum allowed direct IO size;
>> XFS can continue doing metadata in optimal sizes, but
>> still allow smaller DIOs for apps which issue them,
>> for whatever reason.
> 
> Given that we already serialise sub-block IO completely, doing
> "sub-sector" IO will also be serialised so there shouldn't be any
> new issues introduced by this change.
> 
>> This patch adds 2 new fields to the xfs_buftarg, so that
>> we now track 3 sizes:
>>
>>  1) The device logical sector size
>>  2) The device physical sector size
>>  3) The filesystem sector size, which is the minimum unit and
>>     alignment of IO which will be performed by metadata operations.
> 
> I wouldn't call it the "filesystem sector size" because it's clear
> that it doesn't apply to everything in the filesystem. I'd prefer
> that we call it the "metadata sector size", similar to the "log
> sector size" we keep for situations where the external log device
> has a different sector size to the data device.

Ok, fair enough.

>> The first is used for the minimum allowed direct IO alignment,
>> the 2nd is used to report DIO sizes in XFS_IOC_DIOINFO
>> (the theory being, if an app actually asks, we can give them
>> the optimal answer, even if we allow smaller IOs), and the
>> 3rd is used internally by the filesystem for metadata IOs.
> 
> Terminology nit: the 3rd is set by mkfs to define the physical
> format constraints that metadata in the filesystem must obey. If we
> ever have to allocate sector sized metadata, then it will be used
> for more than just IO....

Ok.  I can fix comments.  :D

>> This has passed xfstests on filesystems made with 4k sectors,
>> including when run under the patch I sent to ignore
>> XFS_IOC_DIOINFO, and issue 512 DIOs anyway.  I also directly
>> tested end of block behavior on preallocated, sparse, and
>> existing files when we do a 512 IO into a 4k file on a 
>> 4k-sector filesystem, to be sure there were no unexpected
>> behaviors.
>>
>> Signed-off-by: Eric Sandeen <sandeen@xxxxxxxxxx>
>> ---
>>
>> NB: This depends on this patch which is in the xfs tree,
>> but not yet upstream:
>>  xfs: simplify xfs_setsize_buftarg callchain; remove unused arg
>>
>>
>>
>> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
>> index 9fccfb5..a89dcdf 100644
>> --- a/fs/xfs/xfs_buf.c
>> +++ b/fs/xfs/xfs_buf.c
>> @@ -1599,6 +1599,7 @@ xfs_setsize_buftarg(
>>      unsigned int            blocksize,
>>      unsigned int            sectorsize)
>>  {
>> +    /* Set up filesystem block and sector sizes */
>>      btp->bt_bsize = blocksize;
>>      btp->bt_sshift = ffs(sectorsize) - 1;
> 
> The two places that use bt_sshift convert it back to a byte count.
> We should change this simply to being a byte count.

Ok, yeah, good point.  I think I considered that along the way.

>>      btp->bt_smask = sectorsize - 1;
>> @@ -1614,6 +1615,9 @@ xfs_setsize_buftarg(
>>              return EINVAL;
>>      }
>>  
>> +    /* Set up device logical & physical sector size info */
>> +    btp->bt_lsmask = bdev_logical_block_size(btp->bt_bdev) - 1;
>> +    btp->bt_pssize = bdev_physical_block_size(btp->bt_bdev);
>>      return 0;
>>  }
>>  
>> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
>> index 1cf21a4..29a0db9 100644
>> --- a/fs/xfs/xfs_buf.h
>> +++ b/fs/xfs/xfs_buf.h
>> @@ -88,14 +88,28 @@ typedef unsigned int xfs_buf_flags_t;
>>   */
>>  #define XFS_BSTATE_DISPOSE   (1 << 0)       /* buffer being discarded */
>>  
>> +/*
>> + * The xfs_buftarg contains 3 notions of "sector size" -
>> + *
>> + * 1) The device logical sector size
>> + * 2) The device physical sector size
>> + * 3) The filesystem sector size, which is the minimum unit and
>> + *    alignment of IO which will be performed by metadata operations.
>> + *
>> + * The latter is specified at mkfs time, stored on-disk in the
>> + * superblock's sb_sectsize, and is set from there.
>> + */
>> +
>>  typedef struct xfs_buftarg {
>>      dev_t                   bt_dev;
>>      struct block_device     *bt_bdev;
>>      struct backing_dev_info *bt_bdi;
>>      struct xfs_mount        *bt_mount;
>> -    unsigned int            bt_bsize;
>> -    unsigned int            bt_sshift;
>> -    size_t                  bt_smask;
>> +    unsigned int            bt_bsize;       /* fs block size */
>> +    unsigned int            bt_sshift;      /* fs sector size shift */
>> +    size_t                  bt_smask;       /* fs sector size mask */
>> +    size_t                  bt_lsmask;      /* dev logical sectsz mask */
>> +    unsigned int            bt_pssize;      /* dev physical sector size */
> 
> This patch makes bt_smask unused, so it can be removed. bt_bsize is
> also unused, so that should be removed, too. indeed, the buftarg has
> a backpointer to the xfs_mount, so we can get the block size from
> there if it is ever necessary.

bt_bsize is used, until your suggestion below.  :)  But I can
remove it with that change, sure.


> And I think we should make these names a little more descriptive
> while we are touching them so comments to describe them are
> unnecessary:
> 
>       unsigned int            bt_meta_sectorsize;
>       unsigned int            bt_physical_sectorsize;
>       size_t                  bt_logical_sectormask;

Ok.  I was just sticking with the tradtional (lack of) verbosity.

>> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
>> index 33ad9a7..1f3431f 100644
>> --- a/fs/xfs/xfs_ioctl.c
>> +++ b/fs/xfs/xfs_ioctl.c
>> @@ -1587,7 +1587,12 @@ xfs_file_ioctl(
>>                      XFS_IS_REALTIME_INODE(ip) ?
>>                      mp->m_rtdev_targp : mp->m_ddev_targp;
>>  
>> -            da.d_mem = da.d_miniosz = 1 << target->bt_sshift;
>> +            /*
>> +             * Report device physical sector size as "optimal" minimum,
>> +             * unless blocksize is smaller than that.
>> +             */
>> +            da.d_miniosz = min(target->bt_pssize, target->bt_bsize);
> 
> Just grab the filesysetm block size from the xfs_mount:
> 
>               da.d_miniosz = min(target->bt_pssize, mp->m_sb.sb_blocksize);
> 
>> +            da.d_mem = da.d_miniosz;
> 
> I'd suggest that this should be PAGE_SIZE - it's for memory buffer
> alignment, not IO alignment, so using the IO alignment just seems
> wrong to me...

Ok.  Was just sticking close to what we had before.

So:
                da.d_miniosz = min(target->bt_pssize, mp->m_sb.sb_blocksize);
                da.d_mem = PAGE_SIZE;

?  Then we can have a minimum IO size of 512, and a memory alignment of
4k, isn't that a little odd?

(IOWs we could do 512-aligned memory before, right?  What's the downside,
or the value in changing it now?)

Thanks,
-Eric

> Cheers,
> 
> Dave.
> 

<Prev in Thread] Current Thread [Next in Thread>