reordering file operations for performance
Phil Karn
karn at philkarn.net
Sun Jan 30 22:47:03 CST 2011
I have written a file deduplicator, dupmerge, that walks through a file
system (or reads a list of files from stdin), sorts the files by size, and
compares each pair of files of the same size, looking for duplicates. When it
finds two distinct files with identical contents on the same file system, it
deletes the newer copy and recreates its path name as a hard link to the
older version.
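
The link-replacement step looks roughly like this (only a sketch, not the
actual dupmerge code; replace_with_link() is just an illustrative name, and
linking to a temporary name before rename()ing avoids losing the path if
link() fails):

#include <stdio.h>
#include <unistd.h>

/* Replace "newer" with a hard link to "older". Returns 0 on success. */
static int replace_with_link(const char *older, const char *newer)
{
    char tmp[4096];

    snprintf(tmp, sizeof tmp, "%s.dupmerge-tmp", newer);
    if (link(older, tmp) < 0)        /* make a second name for the old file */
        return -1;
    if (rename(tmp, newer) < 0) {    /* atomically replace the newer copy */
        unlink(tmp);
        return -1;
    }
    return 0;
}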
For performance it actually compares SHA1 hashes rather than the file
contents themselves. To avoid unnecessary full-file reads, it first compares
hashes of just the first page (4 KiB) of each file; only if those match does
it compute and compare the full-file hashes. Each file is read fully at most
once, and sequentially, so if the file occupies a single extent it can be
read in one large contiguous transfer. This is noticeably faster than a
direct compare that seeks back and forth between two files at opposite ends
of the disk.
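
The first-page screen looks roughly like this (again just a sketch rather
than the actual dupmerge code, assuming OpenSSL's SHA1(); first_page_hash()
is an illustrative name):

#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <openssl/sha.h>

#define FIRST_PAGE 4096

/* Hash the first 4 KiB of a file; returns 0 on success, -1 on error. */
static int first_page_hash(const char *path,
                           unsigned char digest[SHA_DIGEST_LENGTH])
{
    unsigned char buf[FIRST_PAGE];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof buf);
    close(fd);
    if (n < 0)
        return -1;
    SHA1(buf, (size_t)n, digest);   /* hash only the bytes actually read */
    return 0;
}

/* Two same-size files become full-hash candidates only if their first
 * pages match. */
static int first_pages_match(const char *a, const char *b)
{
    unsigned char da[SHA_DIGEST_LENGTH], db[SHA_DIGEST_LENGTH];
    if (first_page_hash(a, da) < 0 || first_page_hash(b, db) < 0)
        return 0;
    return memcmp(da, db, SHA_DIGEST_LENGTH) == 0;
}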
I am looking for additional performance enhancements, and I don't mind using
filesystem-specific features. For example, I am now stashing the file hashes
in xfs extended attributes.
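
A sketch of what I mean by that (not the actual dupmerge code;
user.dupmerge.sha1 is just an illustrative attribute name, and the cached
hash of course has to be thrown away if the file's size or mtime changes):

#include <sys/xattr.h>
#include <openssl/sha.h>

/* Cache a computed hash in an extended attribute of the file itself. */
static int stash_hash(const char *path,
                      const unsigned char digest[SHA_DIGEST_LENGTH])
{
    /* flags == 0 means create the attribute or replace an existing one */
    return setxattr(path, "user.dupmerge.sha1",
                    digest, SHA_DIGEST_LENGTH, 0);
}

/* Retrieve a previously cached hash; returns 0 if one was found. */
static int fetch_hash(const char *path,
                      unsigned char digest[SHA_DIGEST_LENGTH])
{
    ssize_t n = getxattr(path, "user.dupmerge.sha1",
                         digest, SHA_DIGEST_LENGTH);
    return (n == SHA_DIGEST_LENGTH) ? 0 : -1;
}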
I regularly run xfs_fsr and have added fallocate() calls to the major file
copy utilities, so all of my files are in single extents. Is there an easy
way to ask xfs where those extents are located so that I could sort a set of
files by location and then access them in a more efficient order?
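
Concretely, what I have in mind is something like this (only a sketch,
assuming the generic FIEMAP ioctl works for this; get_first_physical() is
just an illustrative name, and I gather xfs also has its own XFS_IOC_GETBMAP
interface):

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>

/* Return the physical byte offset of the file's first extent, or 0 on error. */
static uint64_t get_first_physical(int fd)
{
    uint64_t phys = 0;
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    if (fm == NULL)
        return 0;

    fm->fm_start = 0;
    fm->fm_length = ~0ULL;           /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC; /* flush any delayed allocation first */
    fm->fm_extent_count = 1;         /* only the first extent is needed */

    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
        phys = fm->fm_extents[0].fe_physical;

    free(fm);
    return phys;
}

I could then sort the candidate files by that offset before reading any data.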
I know that there's more to reading a file than accessing its data extents.
But by the time I'm comparing files I have already lstat()'ed them all, so
their inodes and directory paths are probably still in the cache.
Thanks,
Phil