On 10/6/2011 2:55 PM, Bernhard Schmidt wrote:
> Hi,
>
> this is an XFS-related summary of a problem report I sent to the postfix
> mailing list a few minutes ago after a bulk mail test system blew up
> during a stress test.
>
> We have a few MTAs running SLES11.1 amd64 (2.6.32.45-0.3-default) with a
> 10 GB XFS spool directory using the default block size (4k). It was
> bombarded with mails faster than it could send them on, which eventually
> led to almost 2 million files of ~1.5kB in one directory. Suddenly, this
> started to happen:
>
> lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # touch a
> touch: cannot touch `a': No space left on device
> lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # df .
> Filesystem     1K-blocks    Used Available Use% Mounted on
> /dev/sdb        10475520 7471160   3004360  72% /var/spool/postfix-bulk
> lxmhs45:/var/spool/postfix-bulk/postfix-bulkinhss # df -i .
> Filesystem       Inodes   IUsed   IFree IUse% Mounted on
> /dev/sdb       10485760 1742528 8743232   17% /var/spool/postfix-bulk
>
> So we could not create any file in the spool directory anymore despite
> df claiming to have both free blocks and inodes. This led to a pretty
> spectacular lockup of the mail processing chain.
>
> My theory is that XFS is using a full 4k block for each 1.5kB file,
> which accounts for some loss. But still, 10GB / 4kB makes 2.5 million
> files, which has surely not been reached here. Is the overhead really
> that high? Why does neither df metric report this problem? Is there any
> way to get reasonable readings out of df in this case? The system would
> have stopped accepting mail from outside if free space had dropped
> below 2GB, so out-of-space happened way too early for it.
Dig deeper so you can get past theory and find facts. Do you see any
errors in dmesg?
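
As a first step, the arithmetic in the report can be checked against the
df figures quoted above. This is only a rough sketch in Python: the numbers
are copied from that output, the function and key names are made up for
illustration, and it assumes every file occupies at least one 4k block.
Note that Used also includes the log and other metadata, and IUsed counts
every inode on the filesystem, not just the files in that one directory.

```python
# Figures copied from the quoted `df .` and `df -i .` output.
TOTAL_KIB = 10475520    # 1K-blocks
USED_KIB = 7471160      # Used
AVAIL_KIB = 3004360     # Available
INODES_USED = 1742528   # IUsed
BLOCK_KIB = 4           # default block size stated in the report


def summarize(total_kib, used_kib, avail_kib, inodes_used, block_kib):
    """Derive rough per-file space figures from df output (illustrative only)."""
    return {
        # Upper bound on file count if each file took exactly one block.
        "max_files_one_block_each": total_kib // block_kib,
        # Average space consumed per allocated inode, metadata included.
        "avg_kib_per_inode": used_kib / inodes_used,
        # Whole blocks that df still reports as available.
        "free_blocks_reported": avail_kib // block_kib,
    }


stats = summarize(TOTAL_KIB, USED_KIB, AVAIL_KIB, INODES_USED, BLOCK_KIB)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The average comes out a little above 4 KiB per inode, which fits the
one-block-per-file theory plus some overhead, yet df still reports several
hundred thousand free blocks. So the arithmetic alone does not account for
the "No space left on device" error, which is why the kernel log is the
next place to look.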
--
Stan