
Re: Premature "No Space left on device" on XFS

To: stan@xxxxxxxxxxxxxxxxx
Subject: Re: Premature "No Space left on device" on XFS
From: Bernhard Schmidt <berni@xxxxxxxxxxxxx>
Date: Fri, 07 Oct 2011 02:47:12 +0200
Cc: xfs@xxxxxxxxxxx
In-reply-to: <4E8E4630.8030108@xxxxxxxxxxxxxxxxx>
References: <4E8E079B.4040103@xxxxxxxxxxxxx> <4E8E4630.8030108@xxxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1
On 07.10.2011 02:22, Stan Hoeppner wrote:

Hi,

On 10/6/2011 2:55 PM, Bernhard Schmidt wrote:
Hi,

this is an XFS-related summary of a problem report I sent to the Postfix
mailing list a few minutes ago after a bulk mail test system blew up
during a stress test.

We have a few MTAs running SLES11.1 amd64 (2.6.32.45-0.3-default) with a
10 GB XFS spool directory with the default block size (4k). It was
bombarded with mails faster than it could send them on, which eventually
led to almost 2 million files of ~1.5kB in one directory. Suddenly, this
started to happen:

lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # touch a
touch: cannot touch `a': No space left on device
lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # df .
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb              10475520   7471160   3004360  72% /var/spool/postfix-bulk
lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # df -i .
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sdb             10485760 1742528 8743232   17% /var/spool/postfix-bulk

So we could not create any file in the spool directory anymore despite
df claiming to have both free blocks and inodes. This led to a pretty
spectacular lockup of the mail processing chain.
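
Since df reported free blocks and free inodes while touch still failed, a direct creation probe is a more reliable check than the df figures. The following is only a minimal sketch: `can_create` and the `SPOOL_DIR` variable are hypothetical names, and it assumes a `mktemp` that accepts a template path (GNU coreutils and the BSDs provide one).

```sh
#!/bin/sh
# Hypothetical helper: succeeds only if a new file can actually be
# created in the directory given as $1. Any failure (ENOSPC, missing
# directory, permissions) makes it return 1.
can_create() {
    dir=$1
    # mktemp creates the file itself, so this exercises inode allocation.
    probe=$(mktemp "$dir/.probe.XXXXXX" 2>/dev/null) || return 1
    rm -f "$probe"
    return 0
}

# Assumption: the spool path from the transcript above, overridable
# through the hypothetical SPOOL_DIR variable.
spool=${SPOOL_DIR:-/var/spool/postfix-bulk}

if can_create "$spool"; then
    echo "ok: file creation works in $spool"
else
    echo "FAIL: cannot create files in $spool" >&2
fi
```

A monitoring job could run such a probe next to the usual df thresholds, so that a filesystem in this state is noticed even though both df metrics look healthy.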

My theory is that XFS is using a full 4k block for each 1.5kB file,
which accounts for some loss. But still, 10GB / 4kB makes 2.5 million
files, which has surely not been reached here. Is the overhead really
that high? Why does neither df metric report this problem? Is there any
way to get reasonable readings out of df in this case? The system would
have stopped accepting mail from outside if the free space had dropped
below 2GB, so out-of-space happened way too early for it.
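
The estimate above can be checked against the df figures with a little shell arithmetic. This is a rough sketch, not an exact model of XFS accounting: the 256-byte inode size is an assumption (it is not stated in the post), and every used inode is treated as a file occupying exactly one 4k block.

```sh
#!/bin/sh
# Figures copied from the df and df -i output above.
total_kb=10475520     # 1K-blocks
used_kb=7471160       # Used
inodes_used=1742528   # IUsed

block_kb=4            # 4k filesystem block size, as stated in the post
inode_bytes=256       # ASSUMPTION: inode size is not given in the post

# Upper bound on the number of one-block files the filesystem could hold.
max_blocks=$((total_kb / block_kb))

# Space consumed if every used inode owned exactly one 4k data block.
data_kb=$((inodes_used * block_kb))

# Space taken by the inodes themselves under the assumed inode size.
inode_kb=$((inodes_used * inode_bytes / 1024))

# Remainder: directory blocks and other metadata (rough estimate only).
other_kb=$((used_kb - data_kb - inode_kb))

echo "4k blocks in filesystem:   $max_blocks"
echo "data blocks (1 per file):  $data_kb KB"
echo "inodes (assumed 256 B):    $inode_kb KB"
echo "unexplained remainder:     $other_kb KB"
```

Under these assumptions, one block per file plus the inodes already explains nearly all of the reported usage, so per-file overhead alone does not account for the early ENOSPC with about 3 GB still shown as available.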

Dig deeper so you can get past theory and find facts.  Do you see any
errors in dmesg?

No, nothing in dmesg. As soon as I delete one file, mail processing continues. That is more or less the expected outcome in this situation: it is a classic two-MTA setup with queues and a content filter in between. The before-filter instance connects through the filter to the post-filter instance and tries to deliver mail. During that period each mail occupies two files (active queue in the before-filter instance, incoming queue in the post-filter instance). If the second file cannot be created, the mail is never delivered and the before-filter queue is never processed.

Bernhard
