On Sat, Dec 28, 2013 at 12:20:39AM +0100, Arkadiusz Miśkiewicz wrote:
> On Friday 27 of December 2013, Dave Chinner wrote:
> > On Fri, Dec 27, 2013 at 09:07:22AM +0100, Arkadiusz Miśkiewicz wrote:
> > > On Friday 27 of December 2013, Jeff Liu wrote:
> > > > On 12/27 2013 14:48 PM, Stor?? wrote:
> > > > > Hey:
> > > > >
> > > > > 20T xfs file system
> > > > >
> > > > >
> > > > >
> > > > > /usr/sbin/xfs_check: line 28: 14447 Killed
> > > > > xfs_db$DBOPTS -i -p xfs_check -c "check$OPTS" $1
> > > >
> > > > xfs_check is deprecated; please use xfs_repair -n instead.
> > > >
> > > > The following back traces show that your system seems to have run out
> > > > of memory when executing xfs_check, so the snmp daemon and xfs_db were
> > > > killed.
> > >
> > > This reminds me of a question...
> > >
> > > Could xfs_repair store its temporary data (some of that data, the biggest
> > > part) on disk instead of in memory?
> > Where on disk?
> In a directory/file that I'll tell it to use (since I usually have a few xfs
> filesystems on a single server and so far only one at a time breaks).
How is that any different from just adding swap space to the server?
> Could xfs_repair tell the kernel that this data should always end up on swap
> (allowing other programs/daemons to use regular memory) perhaps? (I don't know
> of an interface that would allow doing that in the kernel though.) That would
> be a half-baked solution.
It's up to the kernel to manage what gets swapped and what doesn't.
I suppose you could use control groups to constrict the RAM
xfs_repair uses, but how to configure such policy is way outside my
area of expertise.
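
One possible shape of the control-group idea, as a sketch only. Nothing in this thread confirms it: it assumes a host running systemd with the unified (v2) cgroup hierarchy, and the helper name, the 8G limit and the device path are all hypothetical placeholders.

```shell
# Sketch only: run xfs_repair in a transient cgroup with a RAM cap so that
# memory above the cap is reclaimed (pushed to swap, if swap is available).
# Assumption: systemd with cgroup v2; MemoryMax= is the systemd property used.
# limited_repair, the limit and the device are illustrative placeholders.
limited_repair() {
    limit=$1
    dev=$2
    # With DRY_RUN=1 the command is printed instead of executed.
    ${DRY_RUN:+echo} systemd-run --scope -p MemoryMax="$limit" xfs_repair -n "$dev"
}
```

Running `DRY_RUN=1 limited_repair 8G /dev/sdX` only prints the command line; without DRY_RUN it needs root. Without enough swap behind it, a cap like this makes an OOM kill more likely, not less.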
> > > I don't know if that would make sense, so asking. Not sure if xfs_repair
> > > needs to access that data frequently (so on disk makes no sense) or
> > > maybe it needs only for iteration purposes in some later phase (so on
> > > disk should work).
> > >
> > > Anyway, memory usage of xfs_repair was always a problem for me (e.g. 16GB
> > > is not enough for a 7TB fs due to the huge number of files stored). With
> > > parallel scan it's even worse, obviously.
> > Yes, your problem is that the filesystem you are checking contains
> > 40+GB of metadata and a large amount of that needs to be kept in
> > memory from phase 3 through to phase 6.
> Is that data (or most of that data) frequently accessed? Or is it something
> that's iterated over, let's say, once in each phase?
Free/used space is tracked in a btree. It gets set up, for example,
in phase 3, then iterated in phase 4 where inode bmap btrees are
validated, and then phase 5 rebuilds the on-disk free space trees
from what is validated as used/free space in phase 4.

So, the free space information is used in each phase where it is
required, and then it is discarded from memory.

Inodes are tracked in an AVL tree. They get set up and validated
against the AGI inode btrees in phase 3, then validated against the
directory structure in phase 6. Most get tossed out of memory during
phase 6, but those with multiple link counts are held on to until
phase 7 where the link counts are validated.

So, the data that is pulled into memory during phases 2 and 3 (i.e.
all the metadata in the filesystem) cannot be fully validated and
freed until later phases complete. The indexes are regularly
traversed, so should not get swapped. The leaves should only get hit
once per phase, so should be swapped in and out only once per phase
that uses the information.
[snip trial and error xfs_repair OOM complaints]
Basically, you don't know how much metadata is in your filesystem,
so you don't know how much swap space to add up front. Simple: add
100GB of swap file on a fast drive (e.g. an SSD) and that will make
repair run to completion faster than any amount of work I could do
to make it faster.
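
A rough sketch of the swap-file approach described above, not a tested recipe. The path /mnt/ssd/repair.swap, the 100 GiB default and the helper names are placeholders; it assumes root and a filesystem on which a dd-created file can be used as swap.

```shell
# Sketch only: create and enable a temporary swap file before a large
# xfs_repair run. SWAPFILE and SIZE_GB are hypothetical defaults.
SWAPFILE=${SWAPFILE:-/mnt/ssd/repair.swap}
SIZE_GB=${SIZE_GB:-100}

# Convert a size in GiB to the number of 1 MiB blocks dd should write.
swap_blocks() {
    echo $(( $1 * 1024 ))
}

# Write the file with dd (no holes), restrict permissions, format it as
# swap and enable it. Must be run as root.
make_swap() {
    dd if=/dev/zero of="$SWAPFILE" bs=1M count="$(swap_blocks "$SIZE_GB")" &&
    chmod 600 "$SWAPFILE" &&
    mkswap "$SWAPFILE" &&
    swapon "$SWAPFILE"
}
```

With these defined, call `make_swap` as root before starting xfs_repair. Once the repair has finished, `swapoff "$SWAPFILE"` followed by `rm "$SWAPFILE"` removes the temporary swap again.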
Basically, you are asking us to make xfs_repair omniscient so it
always either succeeds or fails immediately so that you don't have
to plan for disaster recovery...
> - what's worse tools give no estimations of ram needed etc but that's afaik
# xfs_repair -vv -m 1 -n /dev/<foo>
> > If you really want to add
> > some kind of database interface to store this information somewhere
> > else, then I'll review the patches. ;)
> Right. So the only "easy" task left is finding someone who understands the
> code and can write such an interface. Anyone?
> IMO ram usage is a real problem for xfs_repair and there has to be some
> upstream solution other than the "buy more" (and waste more) approach.
I think you are forgetting that developer time is *expensive* and
*scarce*. This is essentially a solved problem: An SSD in a USB3
enclosure as a temporary swap device is by far the most cost
effective way to make repair scale to arbitrary amounts of metadata.
It certainly scales far better than developer time and testing