The current work items for XFS for Linux are listed below. Many of the items on the list have been classified according to the type of issue they address and according to priority.
The classification types are as follows:
The items on the list have been prioritized. There are four prioritization levels: P1, P2, P3, and P4.
The following work items remain for XFS for Linux. The items are described in detail following the summary.
Classification: A, C, D, E
Priority: P1
O_SYNC I/O will return control to user space without ensuring that data is on disk. This will potentially break some applications.
The Irix code will not help much here because its buffering implementation is so different. The fastest fix is to do an fsync after the write has buffered its data; ideally, though, we only want to wait for the data written by this call.
75% pagebuf work, 25% additional XFS code.
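The semantics at stake can be illustrated from user space. The sketch below is not the kernel fix; it only shows the "buffered write, then fsync" shape of the fastest fix next to a plain O_SYNC open. The function names are hypothetical, and it assumes a Unix-like system where Python's `os.O_SYNC` is available.

```python
import os
import tempfile


def write_then_fsync(path, data):
    """Buffered write followed by fsync (the 'fastest fix' shape).

    fsync waits for all dirty data of the file, not only the bytes
    written here, which is the inefficiency noted in the text.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        written = os.write(fd, data)
        os.fsync(fd)  # block until the file's dirty data is on the device
        return written
    finally:
        os.close(fd)


def write_o_sync(path, data):
    """Write through a descriptor opened with O_SYNC.

    An application using this expects write() to return only after
    the data is on disk; that is the guarantee the item is about.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_then_fsync(os.path.join(d, "a"), b"hello"))
        print(write_o_sync(os.path.join(d, "b"), b"hello"))
```

Both functions should leave the same bytes on disk; the difference is only in how much the caller waits for.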
Owner: Eric Sandeen
Classification: A, D, E
Priority: P1
Some basic read code has been written, and it could also be the core of a write path. It does not yet maintain cache coherency with buffered I/O. The Irix approach was to flush buffered data to disk before a read and to flush and invalidate it before a write. Stephen Tweedie has some fairly grandiose schemes for doing this in an fs independent way using kiobufs. These will not show up before 2.5, so we need our own implementation. One concept we can copy from his idea is to detect cached data and use it in the I/O - i.e. on read, copy from the cache rather than the disk; on write, copy from user memory to the cache and then do the I/O to disk.
We also have more locking in place than Irix direct writes have - the Linux inode semaphore will single thread direct writes as it stands. Without changing the fs independent code we would have to drop locks and reobtain them in the xfs code.
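The Irix-style coherency rule described above (flush before a direct read, flush and invalidate before a direct write) can be shown with a toy model. This is only a sketch: the `ToyFile` class and its dictionaries are hypothetical stand-ins for the page cache and the disk, and no locking is modelled.

```python
class ToyFile:
    """Toy model of one file: block number -> bytes."""

    def __init__(self):
        self.disk = {}      # what is actually on disk
        self.cache = {}     # buffered (cached) copies
        self.dirty = set()  # cached blocks not yet written to disk

    def buffered_write(self, blk, data):
        self.cache[blk] = data
        self.dirty.add(blk)

    def buffered_read(self, blk):
        if blk not in self.cache:
            self.cache[blk] = self.disk.get(blk)
        return self.cache[blk]

    def flush(self, blk):
        """Write a dirty cached block to disk."""
        if blk in self.dirty:
            self.disk[blk] = self.cache[blk]
            self.dirty.discard(blk)

    def direct_read(self, blk):
        # Flush first so the disk copy is current, then read the disk.
        self.flush(blk)
        return self.disk.get(blk)

    def direct_write(self, blk, data):
        # Flush, then invalidate, so no stale cached copy survives.
        self.flush(blk)
        self.cache.pop(blk, None)
        self.disk[blk] = data


if __name__ == "__main__":
    f = ToyFile()
    f.buffered_write(0, b"old")
    print(f.direct_read(0))    # sees the buffered data after the flush
    f.direct_write(0, b"new")
    print(f.buffered_read(0))  # cache was invalidated, so this rereads disk
```

Without the flush in `direct_read`, the direct reader would miss buffered data; without the invalidate in `direct_write`, a later buffered reader would see the stale cached copy.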
Owner: Russell Cattelan
Classification: C
Priority: P1
The test9 kernel has a new VM layer - this will affect the XFS code. Previous discussions about the VM system had talked about abstracting away the use of buffer heads to manage write ordering, and adding a flush method where the filesystem gets to choose what gets flushed. This fits in pretty well with the way we write delalloc data now. We should maybe consider joining this effort and implementing the flush method. This will help with some of the performance issues.
Owner: Rajagopal Ananthanarayanan
Priority: P1
We have the situation currently with the user tools where we have a number of "unstable" interfaces which are used in some of the tools and will destabilize the tools (have done before and will do again). We should rearrange what we currently have in xfs-cmds into more than one package. It was always known that this _must_ happen for those interfaces which are not specific to XFS (libattr, libacl, libdm) in order for them to be more widely accepted, and the way we ship these components in the xfs-cmds package currently is a recipe for packaging dependency headaches down the track.
Owner: Nathan Scott
Classification: A, B, C, E
Priority: P1, P2, P3
We can only support filesystems with a block size of one page, and the page size is architecture specific. This is more pagebuf work for the most part, although once we have metadata chunks bigger than a single system page size we have some problems to solve.
We use pages to cache metadata. Pages are allocated one at a time, so each page sized chunk of memory is usually not adjacent in the address space to the page covering the next block of the disk. We do have some code to do memory remapping in the kernel to get around this. However, this is code which would never be accepted into Linus's tree. In general there is major resistance to doing address space remapping - it is fairly expensive, and impacts the whole system, not just the thread doing the remapping.
We already have some metadata bigger than a page - inode clusters. However, because these clusters are actually just arrays of fixed sized objects, we do not make accesses across page boundaries and it was fairly simple to modify xfs to not need the inode buffers to appear as one chunk of memory. In the case of directory blocks and other structures, this is not going to work and another solution must be found.
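The reason inode clusters are the easy case can be checked with a little arithmetic. The sketch below assumes a 4096 byte page and a 256 byte inode purely for illustration; both values and the helper names are assumptions, not taken from the XFS code.

```python
PAGE_SIZE = 4096   # assumed; the real value is architecture specific
INODE_SIZE = 256   # assumed fixed on-disk inode size, for illustration


def locate_inode(index, page_size=PAGE_SIZE, inode_size=INODE_SIZE):
    """Return (page_number, offset_in_page) for inode `index` in a cluster.

    The pages need not be adjacent in memory: the caller picks the
    page by number and then indexes inside that one page.
    """
    if page_size % inode_size != 0:
        raise ValueError("inode size must divide the page size")
    byte_off = index * inode_size
    return byte_off // page_size, byte_off % page_size


def straddles_page(byte_off, length, page_size=PAGE_SIZE):
    """True if the byte range [byte_off, byte_off + length) crosses a page."""
    return byte_off // page_size != (byte_off + length - 1) // page_size


if __name__ == "__main__":
    print(locate_inode(17))
    # Fixed-size inodes never cross a page boundary...
    print(any(straddles_page(i * INODE_SIZE, INODE_SIZE) for i in range(64)))
    # ...but an arbitrary record inside a larger block can.
    print(straddles_page(4000, 200))
```

Because the object size divides the page size, every inode lives wholly inside one page. A variable-length record in a directory block has no such guarantee, which is why that case needs a different solution.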
Block size < page size shouldn't be too hard. Block size > page size will require a lot of work. Could map multiple pages together. Could kmemalloc a pool of pages. Could do a page cache size change using 64K blocks to solve this. Key is getting contiguous chunks of memory for this. Kanoj Sarcar would be a reference for this.
5.1 16K block size support (for SN-IA64) [P1]
5.2 Block size < page size support [P2]
5.3 Block size > page size support [P3]
Owners: Russell Cattelan, Glen Overby, Rajagopal Ananthanarayanan
Priority: P2
There are a number of performance things we can do to pagebuf. We can also do some things which will benefit XVM performance.
Could use EAGAIN to solve this. Is a potential deadlock problem. Ananth is working on a patch.
Owner: Rajagopal Ananthanarayanan
Priority: P2
Jens Axboe is working on the IDE support. Almost ready for inclusion in the tree.
The generic kiobuf changes to the block I/O request layer are a temporary solution and will most likely be rewritten for 2.5.
Owner: Martin Petersen
Priority: P2
Except for RAID-5, both MD and LVM will be ready and kiobuf-aware soon.
Owner: Martin Petersen
Classification: C
Priority: P2
Linux has no defined interface for manipulating extended attributes. We have added all the Irix system calls to our tree, but not reserved system call numbers for them. This means that we have to change our numbers every time we move to a new kernel version and someone else has added system calls.
At the same time, another project (http://acl.bestbits.at/) is working on an ACL interface for Linux. They also have extended attributes, with their own, different API. We need to collaborate with them on a common API (ours has more calls than theirs).
There is also some discussion about what a suitable API would be - or even whether an API for extended attributes should be available to user space at all. There seem to be philosophical differences between some filesystem developers and GUI developers who want to be able to tag arbitrary data to a file.
Another option is to obtain an XFS system call in the interim and push all the calls through this while the final 'official' position is worked out.
Regarding ACLs, merge Danny's ACL patch to provide this functionality in a similar way to IRIX. Once an 'official' interface has been decided on, use that.
Owners: Tim Shimmin, Andrew Gildfind
Priority: P2
XFS needs to be supported on IA64. It should work on Alpha, Sparc, and MIPS64 first. Try it out on 64 bit MIPS first.
Owners: Martin Petersen, Rajagopal Ananthanarayanan
Priority: P2
Need to support XFS as root. Steve's been running this way for two months - copied from ext2 (find/cpio), then edited fstab and lilo.conf.
Thomas Graichen has a mini-root capability.
Owners: Tom Duffy, Russell Cattelan, Eric Sandeen
Classification: A, B, D, E
Priority: P3
This requires a transaction at I/O completion time for the first write to an extent. In Irix, writes which flush cache go through the filesystem. In Linux they do not have to. A proposed extension to the new VM just introduced into the kernel could help here - a flush call which the VM makes to the filesystem to tell it to push data out to disk. We really need clustering of writes for this to be effective: the callback should cover as large a chunk of data as possible, otherwise lots of transactions will get executed.
This includes zeroing allocated unwritten disk space.
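The effect of clustering on the transaction count can be sketched as follows. The model is deliberately simple and is an assumption, not the XFS design: each flush callback covers one contiguous run of dirty pages and costs one transaction. The function name is hypothetical.

```python
def cluster_dirty_pages(pages):
    """Coalesce dirty page indices into contiguous (start, count) runs."""
    runs = []
    for p in sorted(set(pages)):
        if runs and p == runs[-1][0] + runs[-1][1]:
            # Page directly follows the current run: extend it.
            start, count = runs[-1]
            runs[-1] = (start, count + 1)
        else:
            # Gap in the page numbers: start a new run.
            runs.append((p, 1))
    return runs


if __name__ == "__main__":
    dirty = [3, 1, 2, 7, 8, 20]
    runs = cluster_dirty_pages(dirty)
    print(runs)
    print("transactions without clustering:", len(set(dirty)))
    print("transactions with clustering:   ", len(runs))
```

In this model six dirty pages collapse into three runs, so a per-run callback would execute half as many transactions as a per-page callback.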
Owners: Glen Overby, Russell Cattelan, Rajagopal Ananthanarayanan
Priority: P3
Inode numbers are 32 bits and block devices have a 2 Tbyte size limitation.
Extending the Linux inode number to greater than 32 bits is probably not an option. Changing inode allocation on xfs to restrict inode clusters to the lower 2 Tbytes of AGs would fix it for new filesystems on Linux, but not for moving large filesystems from Irix.
If NFS gets changed to use opaque file handles then the inode number will only be visible outside the filesystem in calls such as stat and getdents. In this case we could use larger inode numbers internally and be left with the issue of getting them out to user space correctly. NFS opaque file handle patches were done by Neil Brown.
On device addressing there may be some options in the kiobuf changes currently being worked on.
Block addresses are 4 bytes (32 bits) wide and count 512 byte blocks, which is where the 2 Tbyte limit comes from. dksc could be a way around the 2 Tbyte device limitation. Stephen Tweedie may also have some ideas on this.
This is a 32 bit system problem since this field is 64 bits on 64 bit systems.
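The 2 Tbyte figure follows directly from a 32 bit block address counting 512 byte blocks. A short worked check, with hypothetical helper names:

```python
SECTOR_SIZE = 512  # bytes per addressable block, as in the text
TIB = 1 << 40      # one Tbyte, taken here as 2**40 bytes


def max_device_bytes(address_bits, sector_size=SECTOR_SIZE):
    """Largest device addressable with `address_bits` of block address."""
    return (1 << address_bits) * sector_size


if __name__ == "__main__":
    # 2**32 blocks * 2**9 bytes = 2**41 bytes = 2 Tbytes
    print(max_device_bytes(32) // TIB, "Tbytes with a 32 bit address")
    # With a 64 bit field the limit is far beyond any current device.
    print(max_device_bytes(64) // TIB, "Tbytes with a 64 bit address")
```

The same reasoning applies to inode numbers: a 32 bit field can name at most 2**32 distinct inodes, regardless of how large the filesystem is.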
Owner: Martin Petersen
Classification: A, B, C
Priority: P3
There was talk of which quota implementation we should use: the one in XFS or the Linux implementation. I think we have to use the XFS on-disk format and the quota code within XFS, as you really want to update quota files in the same transaction as the filesystem modification which changed them. I am not sure if it is possible, but being able to use the existing Linux quota utilities, or a modified version of them, would be good.
14.1 User quotas
14.2 Group (& project?) quotas
Owner: Nathan Scott
Classification: C
Priority: P3
XFS has support for shutting down a filesystem when it detects corruption or I/O failures. This does not work on Linux right now - we need to get to the point where we can unmount the filesystem without any disk I/O. There are error injection calls in XFS which can be turned on to simulate I/O errors, and options on fsstress to exercise them, although this probably went through syssgi.
Owner: Mark Nordstrand
Classification: C
Priority: P4
We are currently stuck at a specific compiler level. There are parts of XFS which do not compile with later compilers, and parts which generate bad code. There are also other parts of the kernel which do not work correctly with later compilers, but this should not stop us from fixing our code.
This is a stretch goal and is very low priority.
Owner: Russell Cattelan
Priority: P4
Needs some extensions to the Linux device interface in terms of request priorities.
Needs to be looked at more. This should probably be a low priority feature.
Priority: P4
The internal DMAPI port for xfs and the external open XDSM project need to sync up somehow. XDSM may be in ReiserFS. This will be a merge effort. Don't support this merge until the 2.5 kernel.
Owner: Dean Roehrich
Classification: C
Priority: P4
The way we pass around the saved interrupt state between function calls will not work on some Linux platforms.
May be fixed - appears to be working.
Owner: Martin Petersen
Classification: C
Priority: P4
We implemented our own version of mrlocks, and they are heavyweight compared to the Linux equivalents. The Linux equivalents are missing some functionality, but code has been written for this. One downside is that CXFS uses even more locking variants.
Ben LaHaise may have code for the additional lock functionality. Using Linux locks might be important to get accepted into Linus's tree.
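For readers unfamiliar with the term, a multi-reader lock allows many concurrent readers or one exclusive writer. The sketch below shows those semantics in user space. It is not the XFS mrlock code or any Linux primitive; the class is hypothetical, and the non-blocking and `demote` operations are included only as examples of the kind of extra operation such a lock may offer (the text does not say which functionality the Linux equivalents lack).

```python
import threading


class MultiReaderLock:
    """Many readers or one writer; user-space illustration only."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self, blocking=True):
        with self._cond:
            while self._writer:
                if not blocking:
                    return False
                self._cond.wait()
            self._readers += 1
            return True

    def acquire_write(self, blocking=True):
        with self._cond:
            while self._writer or self._readers:
                if not blocking:
                    return False
                self._cond.wait()
            self._writer = True
            return True

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

    def demote(self):
        """Turn a held write lock into a read lock without a gap."""
        with self._cond:
            self._writer = False
            self._readers += 1
            self._cond.notify_all()


if __name__ == "__main__":
    lock = MultiReaderLock()
    lock.acquire_read()
    lock.acquire_read()                        # readers share the lock
    print(lock.acquire_write(blocking=False))  # writer is refused
```

This sketch has no fairness policy, so a steady stream of readers can starve a writer; a real implementation would have to decide how to handle that.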
Owner: Daniel Moore
Priority: P4
We can still die due to this. Memory allocations in pagebuf are more of a problem than those in XFS. In some failure cases XFS will attempt to flush other memory users; pagebuf does not.
This may not be a problem. Haven't seen anything in this area in a while. May come back with file systems with block size > page size.
Priority: P4
Should shoot for having XFS in standard Linux distributions.
Owners: Martin Petersen, Rajagopal Ananthanarayanan
Priority: P4
Should shoot for having XFS in a future kernel distribution.
Owners: Martin Petersen, Rajagopal Ananthanarayanan