Re: [PATCH] Introduce SEEK_DATA/SEEK_HOLE to XFS V3

To: Jeff Liu <jeff.liu@xxxxxxxxxx>
Subject: Re: [PATCH] Introduce SEEK_DATA/SEEK_HOLE to XFS V3
From: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Date: Wed, 21 Dec 2011 11:32:51 -0500
Cc: xfs@xxxxxxxxxxx, Christoph Hellwig <hch@xxxxxxxxxxxxx>, Chris Mason <chris.mason@xxxxxxxxxx>
In-reply-to: <4EE7691D.6040807@xxxxxxxxxx>
References: <4EE7691D.6040807@xxxxxxxxxx>
User-agent: Mutt/1.5.21 (2010-09-15)
Hi Jeff,

Sorry for the delay getting back to this - last week has been very busy
for me.

> Tests:
> ======
> 1. General tests I have mentioned before.
> 2. Cover look up DIRTY pages through fallocate(2).

Can you send your actual test cases to the list?  Preferably wired up
to xfstests, and also verified against btrfs and ocfs2, which already
have the feature.
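
Something like the following would do as a starting point - an untested
sketch only, not an actual xfstests case; the file name and the expected
offsets assume a freshly created 4k blocksize filesystem:

	/*
	 * Minimal SEEK_DATA/SEEK_HOLE smoke test (illustrative sketch,
	 * error checking mostly omitted).  Layout created below:
	 * [0,4k) data, [4k,1M) hole, [1M,1M+4k) data.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <fcntl.h>

	#ifndef SEEK_DATA
	#define SEEK_DATA	3
	#endif
	#ifndef SEEK_HOLE
	#define SEEK_HOLE	4
	#endif

	int main(void)
	{
		char	buf[4096];
		off_t	pos;
		int	fd;

		fd = open("/mnt/test/sparsefile",
			  O_CREAT | O_TRUNC | O_RDWR, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		memset(buf, 0xaa, sizeof(buf));
		pwrite(fd, buf, sizeof(buf), 0);
		pwrite(fd, buf, sizeof(buf), 1024 * 1024);

		pos = lseek(fd, 0, SEEK_HOLE);
		printf("first hole at %lld (expect 4096)\n", (long long)pos);

		pos = lseek(fd, 4096, SEEK_DATA);
		printf("next data at %lld (expect 1048576)\n", (long long)pos);

		close(fd);
		return 0;
	}

Note that without an intervening fsync the first extent is still
delalloc, which is exactly one of the cases the patch needs to handle.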

> The issue is I have not yet successfully worked out a test case can
> cover look up WRITEBACK pages easily,

It's probably fairly hard to hit, as you'd need a relatively slow device
to hit it reliably.  Doing a sync_file_range call with
SYNC_FILE_RANGE_WRITE as the flags argument just before the lseek call
is probably the best way to get it.
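
E.g. (a userspace sketch, not from the patch; the helper name is made
up):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	#ifndef SEEK_DATA
	#define SEEK_DATA	3
	#endif

	static off_t seek_data_during_writeback(int fd, off_t start, off_t len)
	{
		/* Queue the range for writeback without waiting for it. */
		sync_file_range(fd, start, len, SYNC_FILE_RANGE_WRITE);

		/*
		 * On a slow enough device the pages should still be under
		 * WRITEBACK when the seek runs.
		 */
		return lseek(fd, start, SEEK_DATA);
	}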

> +/*
> + * Try to find out the data buffer offset in page cache for unwritten
> + * extents. Firstly, try to probe the DIRTY pages in current extent range,
> + * and iterate each page to lookup all theirs data buffers, if a buffer
> + * head status is unwritten, return its offset. If there is no DIRTY pages
> + * found or lookup done, we need to lookup the WRITEBACK pages again and
> + * perform the same operation as previously to avoid data loss.
> + */
> +STATIC loff_t
> +xfs_probe_unwritten_buffer(
> +     struct inode            *inode,
> +     struct xfs_bmbt_irec    *map,
> +     int                     *found)
> +{
> +     struct xfs_inode        *ip = XFS_I(inode);
> +     struct xfs_mount        *mp = ip->i_mount;
> +     struct pagevec          pvec;
> +     pgoff_t                 index;
> +     pgoff_t                 end;
> +     loff_t                  offset = 0;
> +     int                     tag = PAGECACHE_TAG_DIRTY;
> +
> +     pagevec_init(&pvec, 0);
> +
> +probe_writeback_pages:
> +     index = XFS_FSB_TO_B(mp, map->br_startoff) >> PAGE_CACHE_SHIFT;
> +     end = XFS_FSB_TO_B(mp, map->br_startoff + map->br_blockcount)
> +                        >> PAGE_CACHE_SHIFT;
> +
> +     do {
> +             unsigned        nr_pages;
> +             unsigned int    i;
> +             int             want = min_t(pgoff_t, end - index,
> +                                          PAGEVEC_SIZE - 1) + 1;
> +             nr_pages = pagevec_lookup_tag(&pvec, inode->i_mapping,
> +                                           &index, tag, want);
> +             if (nr_pages == 0) {
> +                     /*
> +                      * No dirty pages returns for this extent, try
> +                      * to lookup the writeback pages again.
> +                      * FIXME: If this is the first time for probing
> +                      * DIRTY pages but nothing returned, we need to
> +                      * search the WRITEBACK pages from the extent
> +                      * beginning offset, but if we have found out
> +                      * some DIRTY pages before, maybe we should
> +                      * continue to probe the WRITEBACK pages from
> +                      * the current page index rather than beginning?
> +                      */
> +                     if (tag == PAGECACHE_TAG_DIRTY) {
> +                             tag = PAGECACHE_TAG_WRITEBACK;
> +                             goto probe_writeback_pages;
> +                     }

I don't think this is correct.  Even if we have dirty pages we might have
writeback pages at a lower index.  Given that we can't look for
multiple tags at the same time it seems like we need a normal pagecache
lookup and iterate over all pages.  Later we could optimize this by
adding a multiple tag lookup helper, but let's get the code functional
for now.  Btw, did you look at what btrfs and ocfs2 do here?
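
I.e. something along these lines, reusing the names from your patch
(untested sketch only):

	/*
	 * Walk every page in the extent's range once instead of doing
	 * separate DIRTY and WRITEBACK passes.
	 */
	index = XFS_FSB_TO_B(mp, map->br_startoff) >> PAGE_CACHE_SHIFT;
	end = XFS_FSB_TO_B(mp, map->br_startoff + map->br_blockcount)
			   >> PAGE_CACHE_SHIFT;

	pagevec_init(&pvec, 0);
	do {
		unsigned int	i;
		unsigned	nr_pages;
		unsigned	want = min_t(pgoff_t, end - index,
					     PAGEVEC_SIZE - 1) + 1;

		/* Unlike pagevec_lookup_tag this returns pages in any state. */
		nr_pages = pagevec_lookup(&pvec, inode->i_mapping, index, want);
		if (!nr_pages)
			break;

		for (i = 0; i < nr_pages; i++) {
			/* ... check pvec.pages[i]'s buffers for BH_Unwritten ... */
		}

		/* Continue after the last page we were handed. */
		index = pvec.pages[nr_pages - 1]->index + 1;
		pagevec_release(&pvec);
	} while (index < end);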

> +                             /*
> +                              * In XFS, if an extent in XFS_EXT_UNWRITTEN
> +                              * state, that means the disk blocks were
> +                              * already mapped for it, but the data is
> +                              * still lived at page caches. For buffers
> +                              * resides at DIRTY pages, their BH state
> +                              * should be in (dirty && mapped && unwritten
> +                              * && uptodate) status. For buffers resides
> +                              * at WRITEBACK pages, their BH state should
> +                              * be in (mapped && unwritten && uptodate)
> +                              * status. So we only need to check unwritten
> +                              * buffer status here.

Remove the "In XFS" - this is XFS code so that part is redundant.

Extents in XFS_EXT_UNWRITTEN state do not need to have data at all; in
fact they most likely don't.  So I'd reword this to:

An extent in XFS_EXT_UNWRITTEN state has disk blocks already mapped to
it, but no data has been committed to them yet.  If it has dirty data
in the pagecache it can be identified by having BH_Unwritten set in
each buffer.
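
The per-page check then reduces to something like this (sketch; the
helper name is made up, and the caller is assumed to hold the page
lock and to have verified page_has_buffers()):

	/*
	 * Return the file offset of the first unwritten buffer on the
	 * page, or -1 if there is none.
	 */
	static loff_t page_first_unwritten(struct page *page)
	{
		struct buffer_head	*bh, *head;
		loff_t			offset = page_offset(page);

		bh = head = page_buffers(page);
		do {
			/* BH_Unwritten: data not yet committed to disk. */
			if (buffer_unwritten(bh))
				return offset;
			offset += bh->b_size;
		} while ((bh = bh->b_this_page) != head);

		return -1;
	}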

> +STATIC loff_t
> +xfs_seek_data(
> +     struct file             *file,
> +     loff_t                  start,
> +     u32                     type)
> +{
> +     struct inode            *inode = file->f_mapping->host;
> +     struct xfs_inode        *ip = XFS_I(inode);
> +     struct xfs_mount        *mp = ip->i_mount;
> +     xfs_fsize_t             isize = i_size_read(inode);
> +     loff_t                  offset = 0;
> +     struct xfs_ifork        *ifp;
> +     xfs_fileoff_t           fsbno;
> +     xfs_filblks_t           len;
> +     int                     lock;
> +     int                     error;
> +
> +     if (start >= isize)
> +             return -ENXIO;
> +
> +     lock = xfs_ilock_map_shared(ip);

I'd move the check after acquiring the lock just to be sure.
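
I.e., roughly:

	/*
	 * Sample i_size only after taking the ilock, so a racing
	 * truncate can't move EOF under the check.
	 */
	lock = xfs_ilock_map_shared(ip);

	isize = i_size_read(inode);
	if (start >= isize) {
		error = ENXIO;
		goto out_lock;
	}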

> +     fsbno = XFS_B_TO_FSBT(mp, start);
> +     ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> +     len = XFS_B_TO_FSB(mp, isize);
> +
> +     for (;;) {
> +             struct xfs_bmbt_irec    map[2];
> +             int                     nmap = 2;
> +             int                     found = 0;
> +             loff_t                  seekoff;
> +
> +             error = xfs_bmapi_read(ip, fsbno, len - fsbno, map, &nmap,
> +                                    XFS_BMAPI_ENTIRE);
> +             if (error)
> +                     goto out_lock;
> +
> +             /* No extents at given offset, must be beyond EOF */
> +             if (!nmap) {
> +                     error = ENXIO;
> +                     goto out_lock;
> +             }
> +
> +             seekoff = XFS_FSB_TO_B(mp, fsbno);
> +             /*
> +              * Landed in a hole, skip to check the next extent.
> +              * If the next extent landed in an in-memory data extent,
> +              * or it is a normal extent, its fine to return.
> +              * If the next extent landed in a hole extent, calculate
> +              * the start file system block number for next bmapi read.
> +              * If the next extent landed in an unwritten extent, we
> +              * need to probe the page cache to find out the data buffer
> +              * offset, if nothing found, treat it as a hole extent too.
> +              */
> +             if (map[0].br_startblock == HOLESTARTBLOCK) {
> +                     if (map[1].br_startblock == HOLESTARTBLOCK) {
> +                             fsbno = map[1].br_startoff +
> +                                     map[1].br_blockcount;
> +                     } else if (map[1].br_state == XFS_EXT_UNWRITTEN) {
> +                             offset = xfs_probe_unwritten_buffer(inode,
> +                                                                 &map[1],
> +                                                                 &found);
> +                             if (found) {
> +                                     offset = max_t(loff_t, seekoff, offset);
> +                                     break;
> +                             }
> +                             /*
> +                              * No data buffer found in pagecache, treate it
> +                              * as a hole.
> +                              */
> +                             fsbno = map[1].br_startoff +
> +                                     map[1].br_blockcount;
> +                     } else {
> +                             offset = max_t(loff_t, seekoff,
> +                                     XFS_FSB_TO_B(mp, map[1].br_startoff));
> +                             break;
> +                     }

It seems like the hole handling is the same for this case and what we
handle below.
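
(Sketch of what I mean, based only on the quoted hunk - both hole-like
cases end up advancing fsbno the same way, so they could share one
branch:)

	/* Funnel both hole-like cases into a single advance. */
	fsbno = map[1].br_startoff + map[1].br_blockcount;
	continue;	/* keep scanning from the next extent */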

> +             /* Landed in a delay allocated extent or a read data extent */

s/read/real/

> +STATIC loff_t
> +xfs_seek_hole(
> +     struct file             *file,
> +     loff_t                  start,
> +     u32                     type)
> +{
> +     struct inode            *inode = file->f_mapping->host;
> +     struct xfs_inode        *ip = XFS_I(inode);
> +     struct xfs_mount        *mp = ip->i_mount;
> +     xfs_fsize_t             isize = i_size_read(inode);
> +     xfs_fileoff_t           fsbno;
> +     loff_t                  holeoff;
> +     loff_t                  offset = 0;
> +     int                     lock;
> +     int                     error;
> +
> +     if (start >= isize)
> +             return -ENXIO;
> +
> +     lock = xfs_ilock_map_shared(ip);

I'd move the check after acquiring the lock just to be sure.

> +     error = xfs_bmap_first_unused(NULL, ip, 1, &fsbno, XFS_DATA_FORK);
> +     if (error)
> +             goto out_lock;

Hmm - this misses the unwritten extent cases that we handle in
SEEK_DATA.  But if we want to handle them we probably can't use the
simple xfs_bmap_first_unused call that Dave suggested, and need to use
the xfs_bmapi_read loop here, too.

Sorry for not finding this earlier.
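
Roughly the same skeleton as xfs_seek_data should work - an untested
sketch only, reusing the helper from your patch; locking and the
details of clamping the result are as in xfs_seek_data and omitted
here:

	/*
	 * Stop at the first real hole, or at an unwritten extent that
	 * has no data buffered in the pagecache.
	 */
	fsbno = XFS_B_TO_FSBT(mp, start);
	len = XFS_B_TO_FSB(mp, isize);

	for (;;) {
		struct xfs_bmbt_irec	map[2];
		int			nmap = 2;
		int			found = 0;

		error = xfs_bmapi_read(ip, fsbno, len - fsbno, map, &nmap,
				       XFS_BMAPI_ENTIRE);
		if (error)
			goto out_lock;

		/* No mapping left at all: the implicit hole at EOF. */
		if (!nmap) {
			offset = isize;
			break;
		}

		if (map[0].br_startblock == HOLESTARTBLOCK) {
			offset = max_t(loff_t, start,
				       XFS_FSB_TO_B(mp, map[0].br_startoff));
			break;
		}

		/* Unwritten extents without cached data are holes, too. */
		if (map[0].br_state == XFS_EXT_UNWRITTEN) {
			xfs_probe_unwritten_buffer(inode, &map[0], &found);
			if (!found) {
				offset = max_t(loff_t, start,
					XFS_FSB_TO_B(mp, map[0].br_startoff));
				break;
			}
		}

		fsbno = map[0].br_startoff + map[0].br_blockcount;
	}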
