
Re: Preallocation with direct IO?

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: Re: Preallocation with direct IO?
From: Amit Sahrawat <amit.sahrawat83@xxxxxxxxx>
Date: Sat, 31 Dec 2011 18:16:00 +0530
Cc: "hch@xxxxxxxxxxxxx" <hch@xxxxxxxxxxxxx>, "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
In-reply-to: <20111230204307.GN23662@dastard>
References: <4efc665b.d139e30a.2f32.fffff97a@xxxxxxxxxxxxx> <20111229205745.GH12731@dastard> <CADDb1s18XsCRkjq_spMKx0-4g2H51mRJXn=u=boqUe4TXZw-MQ@xxxxxxxxxxxxxx> <20111230204307.GN23662@dastard>
On Sat, Dec 31, 2011 at 2:13 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Fri, Dec 30, 2011 at 08:37:00AM +0530, Amit Sahrawat wrote:
>> On Fri, Dec 30, 2011 at 2:27 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > On Thu, Dec 29, 2011 at 01:10:49PM +0000, amit.sahrawat83@xxxxxxxxx wrote:
>> >> Hi, I am using a test setup which writes using multiple
>> >> threads with direct IO. The buffer size used for writing is
>> >> 512KB. After running this continuously for a long duration, I
>> >> observe that the number of extents in each file gets huge
>> >> (2K..4K..), and that each extent is 512KB (aligned to the
>> >> write buffer size). I wish to have a low number of extents
>> >> (i.e., reduce fragmentation)... With buffered IO,
>> >> preallocation works well along with the 'allocsize' mount
>> >> option. Is there anything that can be done for direct IO?
>> >> Please advise on reducing fragmentation with direct IO.
>> >
>> > Direct IO does not do any implicit preallocation. The filesystem
>> > simply gets out of the way of direct IO as it is assumed you know
>> > what you are doing.
>> This is the supporting line I was looking for.
>> >
>> > i.e. you know how to use the fallocate() or ioctl(XFS_IOC_RESVSP64)
>> > calls to preallocate space or to set up extent size hints to use
>> > larger allocations than the IO being done during syscalls...
>> I tried preallocating space using ioctl(XFS_IOC_RESVSP64), but
>> over time this also does not work well with direct I/O.
> Without knowing how you are using preallocation, I cannot comment on
> this. Can you describe how your application does IO (size,
> frequency, location in file, etc) and preallocation (same again), as
> well as xfs_bmap -vp <file> output of fragmented files? That way I
> have some idea of what your problem is and so might be able to
> suggest fixes...
Preallocation was done using snippets like this:
        xfs_flock64_t fl;                       /* from <xfs/xfs.h> */
        fl.l_whence = SEEK_SET;
        fl.l_start  = 0;
        fl.l_len    = (long long) PREALLOC;     /* 1GB */
        printf("Preallocating %lld MB\n", fl.l_len / (1024 * 1024));
        err = ioctl(hFile, XFS_IOC_RESVSP64, &fl);
I verified that the preallocation worked by checking the file size
(ls -l), the disk usage (df -kh), and the file extents with xfs_bmap;
xfs_bmap shows an extent of the preallocated length, i.e.,
preallocation was working as expected.

As for the test case: for certain reasons I cannot share the exact
code, but it works like this:

In the test case there are 5 threads.

1st Thread - this one does the actual writing amongst all the threads:
        buffer = valloc(WRITE_SIZE);
        fd = open64(file, O_CREAT|O_DIRECT|O_WRONLY|O_TRUNC);
        /* initial write to the file of 5GB of data using a 512KB buffer */
        for (i = 0; i < WRITE_COUNT; i++) {
                write(fd, buffer, WRITE_SIZE);
                if (ncount++ < TRUNCSIZE) {
                        open(fd, O_RDWR|O_CREAT);
                        gettimeofday();  /* start point */
                        sync(fd);        /* at times this sync takes around 5sec,
                                            even though the test does I/O with O_DIRECT */
                        gettimeofday();  /* end point */
                        if (sync time greater than 2sec)
                                /* ... */
                        gettimeofday();  /* start point */
                        /* truncate the file */
                        gettimeofday();  /* end point */
                        if (truncate time greater than 2sec)
                                /* ... */
                        open64(file, O_WRONLY|O_APPEND|O_DIRECT);
                        ncount = 0;
                }
        }

2nd Thread - writing to a file in a while loop:
        while (1) {
                write(10 bytes);
                usleep(100 * 1000);
        }
3rd Thread - reading the file written by the 2nd thread:
        read(file, buffer, 10);
        lseek(file, 0, 0);
4th Thread - just printing the size information for the two files
being written
5th Thread - also reading the file from the 2nd thread

>> Is there any call to set up the extent size as well? Please let me
>> know; I can try to make use of that too.
> `man xfsctl` and search for XFS_IOC_FSSETXATTR.
Thanks Dave, this is exactly what was needed - it is working as of now.

But there continues to be a problem with the sync time. Even though
there is no dirty data - but still sync is taking time around 5sec(but
this is very rare - and observed very few times in overnight runnings)
So, also very difficult to debug what could be the issue and who could
be culprit. At one time - tried to check the trace during this sync
time issue - please find as given below:

(dump_backtrace+0x0/0x11c) from [<c0389520>] (dump_stack+0x20/0x24)
(dump_stack+0x0/0x24) from [<c0067b70>] (__schedule_bug+0x7c/0x8c)
(__schedule_bug+0x0/0x8c) from [<c0389bc0>] (schedule+0x88/0x5fc)
(schedule+0x0/0x5fc) from [<c020a0c8>] (_xfs_log_force+0x238/0x28c)
(_xfs_log_force+0x0/0x28c) from [<c020a320>] (xfs_log_force+0x20/0x40)
(xfs_log_force+0x0/0x40) from [<c02308c4>] (xfs_commit_dummy_trans+0xc8/0xd4)
(xfs_commit_dummy_trans+0x0/0xd4) from [<c0231468>] (xfs_quiesce_data+0x60/0x88)
(xfs_quiesce_data+0x0/0x88) from [<c022e080>] (xfs_fs_sync_fs+0x2c/0xe8)
(xfs_fs_sync_fs+0x0/0xe8) from [<c015cccc>] (__sync_filesystem+0x8c/0xa8)
(__sync_filesystem+0x0/0xa8) from [<c015cd1c>] (sync_one_sb+0x34/0x38)
(sync_one_sb+0x0/0x38) from [<c013b1f0>] (iterate_supers+0x7c/0xc0)
(iterate_supers+0x0/0xc0) from [<c015cbf4>] (sync_filesystems+0x28/0x34)
(sync_filesystems+0x0/0x34) from [<c015cd68>] (sys_sync+0x48/0x78)
(sys_sync+0x0/0x78) from [<c003b4c0>] (ret_fast_syscall+0x0/0x48)

In order to resolve this, I applied the patch below:
xfs: dummy transactions should not dirty VFS state
but I still continued to observe the sync timing issue.

One thing: do we need fsync() when performing writes using O_DIRECT?
I think 'no'.
Also, should sync() be taking this long when there is no dirty data?

Please share your opinion.

Thanks & Regards,
Amit Sahrawat

> Cheers,
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
