
RE: XFS Preallocation

To: Dave Chinner <david@xxxxxxxxxxxxx>
Subject: RE: XFS Preallocation
From: Peter Vajgel <pv@xxxxxx>
Date: Tue, 1 Feb 2011 19:20:18 +0000
Accept-language: en-US
Cc: Jef Fox <jef.fox@xxxxxxxxxx>, "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
In-reply-to: <20110201080354.GM11040@dastard>
References: <155CAEA5D902E7429569DD197567724A01534D42@xxxxxxxxxxxxxxxxxxx> <20110128045205.GR21311@dastard> <155CAEA5D902E7429569DD197567724A01534D60@xxxxxxxxxxxxxxxxxxx> <20110129001700.GZ21311@dastard> <3F5ACD12257C714E9C0535D0A839171802A9B4@xxxxxxxxxxxxxxxxxxxxxxxxxx> <20110201080354.GM11040@dastard>
Thread-topic: XFS Preallocation
> -----Original Message-----
> From: Dave Chinner [mailto:david@xxxxxxxxxxxxx]
> Sent: Tuesday, February 01, 2011 12:04 AM
> To: Peter Vajgel
> Cc: Jef Fox; xfs@xxxxxxxxxxx
> Subject: Re: XFS Preallocation
> 
> On Tue, Feb 01, 2011 at 04:45:09AM +0000, Peter Vajgel wrote:
> > > Preallocation is the only option. Allowing preallocation without
> > > marking extents as unwritten opens a massive security hole (i.e.
> > > exposes stale data) so I say no to any request for addition of such
> > > functionality (and have for years).
> >
> > How about opening this option to at least root (root can already read
> > the device anyway)?.
> 
> # ls -l foo
> -rw-r--r-- 1 dave dave        0 Aug 16 10:44 foo
> #
> # prealloc_without_unwritten 0 1048576 foo
> # ls -l foo
> -rw-r--r-- 1 dave dave  1048576 Aug 16 10:44 foo
> #
> 
> Now user dave can read the stale data exposed by the root only operation.
> Any combination of making the file available to a non-root user after a
> preallocation-without-unwritten-extents operation has this problem. IOWs,
> just making such a syscall "root only" doesn't solve the security problem.

Correct - if an admin made prealloc_without_unwritten runnable by any user then 
yes - but I would argue that such an admin should not even have root 
privileges. VxFS has had this ability since version 1 and I don't remember a 
single customer complaint about this feature. Most of the time the feature was 
used by databases to preallocate large amounts of space, knowing that they 
would not incur any overhead (even transactional) when doing direct I/O to the 
preallocated range. It could be that back then even the transactional overhead 
was significant enough that we wanted to eliminate it.

> 
> To fix it, we have to require inodes have 0600 perms, owned by root, and
> cannot be chmod/chowned to anyone else, ever. At that point, we're requiring
> applications to run as root to use this functionality. Same requirement as
> fiemap + reading from the block device, which you can do right now without
> any kernel mods or filesystem hacks...
> 
> > There are cases when creating large
> > files without writing to them is important. A good example is testing
> > xfs overhead when doing a specific workload (like random
> > reads) to large files.
> 
> For testing it doesn't matter how long it takes you to write the file in
> the first place.

At the scale we operate at, it does. We have multiple variables, so the number 
of combinations is large. We have hit every possible hardware and software 
problem, and resolving a problem can take months if it takes days to reproduce 
it. Hardware vendors (disk, controller, motherboard manufacturers) are much 
more responsive when you can reproduce a problem on the fly in seconds 
(especially in comparative benchmarking). The tests usually run for only a 
couple of minutes. With 12x3TB (possibly multiplied by a factor of X with our 
new platform) it would be unacceptable to wait for writes to finish.
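
To put a rough number on that wait, here is a back-of-the-envelope sketch. The 100 MB/s sustained write rate per disk is purely an assumption for illustration (it is not a figure from this thread), and fill_seconds is a hypothetical helper. It also assumes all 12 disks are written in parallel, so the total is bounded by the time to fill one 3 TB disk.

```shell
#!/bin/bash
# Rough estimate of the time needed to fill one disk with real data
# before a test can start. Disks are assumed to be written in parallel.
# ASSUMPTION: 100 MB/s sustained sequential write per disk (illustrative only).

fill_seconds() {
    local disk_bytes=$1      # capacity of one disk, in bytes
    local bytes_per_sec=$2   # assumed sustained write rate per disk
    echo $(( disk_bytes / bytes_per_sec ))
}

secs=$(fill_seconds $(( 3 * 1000**4 )) $(( 100 * 1000**2 )))
echo "fill time per 3 TB disk: ${secs} s (~$(( secs / 3600 )) h)"
# prints: fill time per 3 TB disk: 30000 s (~8 h)
```

Under that assumption the setup takes hours, against a test that runs for a couple of minutes.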

> 
> > In this case we want to hit the disk on every request. Currently we
> > have a workaround (below) but official support would be preferable.
> 
> Officially, we _removed_ the unwritten=0 option from mkfs because of the
> security problems. Not to mention that it was never, ever tested...
> 
> >
> > --pv
> >
> >
> > # create_xfs_files
> >
> > dev=$1
> > mntpt=$2
> > dircount=$3
> > filecount=$4
> > size=$5
> >
> > # Umount.
> > umount $2
> >
> > # Create the fs.
> > mkfs -t xfs -f -d unwritten=0,su=256k,sw=10 -l su=256k -L "/hay" $dev
> 
> Which fails due to:
> 
> unknown option -d unwritten=0
> /* blocksize */         [-b log=n|size=num]
> /* data subvol */       [-d agcount=n,agsize=n,file,name=xxx,size=num,
>                             (sunit=value,swidth=value|su=num,sw=num),
>                             sectlog=n|sectsize=num .....

It still works for us but we tend to be conservative in moving our releases.

> 
> > # Clear unwritten flag - current xfs ignores this flag
> > typeset -i agcount=$(xfs_db -c "sb" -c "print" $dev | grep agcount)
> > typeset -i i=0
> > while [[ $i != $agcount ]]
> > do
> >   xfs_db -x -c "sb $i" -c "write versionnum 0xa4a4" $dev
> >   i=i+1
> > done
> >
> > # Mount the filesystem.
> > mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g $dev $mntpt
> >
> > i=0
> > while [[ $i != $dircount ]]
> > do
> >   mkdir $mntpt/dir$i
> >   typeset -i j=0
> >   while [[ $j != $filecount ]]
> >   do
> >     file=$mntpt/dir$i/file$j
> >     xfs_io -f -c "resvsp 0 $size" $file
> >     inum=$(ls -i $file | awk '{print $1}')
> >     umount $mntpt
> >     xfs_db -x -c "inode $inum" -c "write core.size $size" $dev
> >     mount -t xfs -o nobarrier,noatime,nodiratime,inode64,allocsize=1g $dev $mntpt
> 
> That's quite a hack to work around the EOF zeroing that extending the file
> size after allocating would do because the preallocated extents beyond EOF
> are not marked unwritten. Perhaps truncating the file first, then
> preallocating is what you want:
> 
>       xfs_io -f -c "truncate $size" -c "resvsp 0 $size" $file


I think I had it the other way round before - allocate, then truncate - but 
the truncate got stuck in a loop (probably zeroing out the extents?), making 
the node so unresponsive that it was impossible to ssh to it. It eventually 
returned but it took a while. But that was about 3 years ago. If I get to it 
I'll try the other order.
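
For reference, a minimal sketch of what the per-file step might look like with the truncate-then-resvsp order, which would drop the umount / xfs_db / mount cycle from the inner loop. It is untested against the modified filesystem described above. make_prealloc_file and DRY_RUN are hypothetical names introduced here, the path in the usage line is a placeholder, and the sketch defaults to printing the command rather than running it.

```shell
#!/bin/bash
# Sketch only: set the file size first, then reserve blocks underneath it,
# so no later size extension (and no EOF zeroing) is needed.
# make_prealloc_file and DRY_RUN are hypothetical, not part of xfsprogs.

DRY_RUN=${DRY_RUN:-1}   # 1 = print the command, anything else = run it

make_prealloc_file() {
    local file=$1 size=$2
    # Same xfs_io invocation as suggested in the quoted reply.
    local -a cmd=(xfs_io -f -c "truncate $size" -c "resvsp 0 $size" "$file")
    if [[ $DRY_RUN == 1 ]]; then
        printf '%q ' "${cmd[@]}"; echo
    else
        "${cmd[@]}"
    fi
}

# Placeholder path and size (1 GiB in bytes); prints the command in dry-run mode.
make_prealloc_file /mnt/test/dir0/file0 1073741824
```

Inside the original loops this would replace the resvsp, ls -i, umount, xfs_db and mount lines with one call per file.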

> 
> >     j=j+1
> >   done
> >   i=i+1
> > done
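
As an aside on the per-AG loop earlier in the script: it feeds the whole "agcount = N" line from grep into an integer variable and relies on the shell evaluating that line as an arithmetic expression. A more explicit variant might extract just the number first. The sketch below runs on a canned sample instead of real xfs_db output, and parse_agcount is a hypothetical helper, not an xfsprogs command.

```shell
#!/bin/bash
# parse_agcount (hypothetical helper) reads xfs_db superblock "print" output
# on stdin and prints only the numeric agcount value.
parse_agcount() {
    awk '$1 == "agcount" && $2 == "=" { print $3; exit }'
}

# Canned sample standing in for: xfs_db -c "sb" -c "print" $dev
sample='magicnum = 0x58465342
blocksize = 4096
agcount = 4'

agcount=$(printf '%s\n' "$sample" | parse_agcount)

# Print, rather than run, the per-AG superblock write from the original script.
for (( ag = 0; ag < agcount; ag++ )); do
    echo "would run: xfs_db -x -c \"sb $ag\" -c \"write versionnum 0xa4a4\" \$dev"
done
```

With the sample above the loop prints four lines, one per allocation group.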
> 
> Regardless of all this, perhaps the most important point is that your
> proposed use of XFS is fundamentally unsupportable by the linux XFS
> community: you've got proprietary software on some external hardware writing
> to the disk without going through the linux XFS kernel code.
> You're basically in the same boat as people running proprietary kernel
> modules - unless you can prove the problem is not caused by your hw/sw or
> manual filesystem modifications, then it's a waste of our (limited)
> resources to even look at the problem. That generally comes down to being
> able to reproduce the problem on a vanilla kernel on a filesystem created
> with a supported mkfs....

Understood. That's why I limit this hack to testing only. I would never even 
dream of putting this into production. One could assume that if 
xfs_check/xfs_repair bless the filesystem before it's mounted you would be 
safe, but then you might be exposing yourself to bugs in xfs_check/xfs_repair 
that have been overlooked because this is not the usual way of using xfs.

Thank you,

Peter

> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
