xfs
[Top] [All Lists]

Re: ARC-1120 and MD very sloooow

To: stan@xxxxxxxxxxxxxxxxx
Subject: Re: ARC-1120 and MD very sloooow
From: Jimmy Thrasibule <thrasibule.jimmy@xxxxxxxxx>
Date: Mon, 25 Nov 2013 09:56:36 +0100
Cc: Linux RAID <linux-raid@xxxxxxxxxxxxxxx>, "xfs@xxxxxxxxxxx" <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=message-id:subject:from:to:cc:date:in-reply-to:references :content-type:mime-version:content-transfer-encoding; bh=wNC59DHaXMpEwcKf72sgVggX7F2XKXykjYP5E/XNkEM=; b=VXP5wUNO4Itzw+3X4Dvc0PuiIyvYfq8pJnhudhMrFtNG/7/5IKrHA+QPWX+5QggO7O smLtmwzQK5awIRj/lBOWZhQv3IfNarFPit8xdUblnlY07jHx96J+mATvLXntzzb6rJ4H stVTOZBe2eXnA0zrpjDSCxv8F5zuLzvcWKEqT3v+z4aRlxujjinU5llV2K/bBU/BhxiY FUrMj8xclRcy1mCGvGzyyjml9eEOIriOeAOvZ2RbE3YxsU+MAmBCh2raJ0tt0ifwvTvg NbG7XFBhYRQDLK6HykgJP8W8w77ksXyJ4DCghFoNNsHQmIwFKAJ15uHnj+a3JP1UXr04 1UvA==
In-reply-to: <528FBBE5.80404@xxxxxxxxxxxxxxxxx>
References: <1385118796.8091.31.camel@xxxxxxxxxxxxxxxxxxxx> <528FBBE5.80404@xxxxxxxxxxxxxxxxx>
Hello Stan,

> This may not be an md problem.  It appears you've mangled your XFS
> filesystem alignment.  This may be a contributing factor to the low
> write throughput.
> 
> >         md3 : active raid10 sdc1[0] sdf1[3] sde1[2] sdd1[1]
> >               7813770240 blocks super 1.2 512K chunks 2 near-copies [4/4] 
> > [UUUU]
> ...
> >         /dev/md3 on /srv type xfs 
> > (rw,nosuid,nodev,noexec,noatime,attr2,delaylog,inode64,sunit=2048,swidth=4096,noquota)
> 
> Beyond having a ridiculously unnecessary quantity of mount options, it
> appears you've got your filesystem alignment messed up, still.  Your
> RAID geometry is 512KB chunk, 1MB stripe width.  Your override above is
> telling the filesystem that the RAID geometry is chunk size 1MB and
> stripe width 2MB, so XFS is pumping double the IO size that md is
> expecting.

The nosuid, nodev, noexec, noatime and inode64 options are mine, the
others are added by the system.


> >         # xfs_info /dev/md3 
> >         meta-data=/dev/md3               isize=256    agcount=32, 
> > agsize=30523648 blks
> >                  =                       sectsz=512   attr=2
> >         data     =                       bsize=4096   blocks=976755712, 
> > imaxpct=5
> >                  =                       sunit=256    swidth=512 blks
> >         naming   =version 2              bsize=4096   ascii-ci=0
> >         log      =internal               bsize=4096   blocks=476936, 
> > version=2
> >                  =                       sectsz=512   sunit=8 blks, 
> > lazy-count=1
> 
> You created your filesystem with stripe unit of 128KB and stripe width
> of 256KB which don't match the RAID geometry.  I assume this is the
> reason for the fstab overrides.  I suggest you try overriding with
> values that match the RAID geometry, which should be sunit=1024 and
> swidth=2048.  This may or may not cure the low write throughput but it's
> a good starting point, and should be done anyway.  You could also try
> specifying zeros to force all filesystem write IOs to be 4KB, i.e. no
> alignment.
> 
> Also, your log was created with a stripe unit alignment of 4KB, which is
> 128 times smaller than your chunk.  The default value is zero, which
> means use 4KB IOs.  This shouldn't be a problem, but I do wonder why you
> manually specified a value equal to the default.
> 
> mkfs.xfs automatically reads the stripe geometry from md and sets
> sunit/swidth correctly (assuming non-nested arrays).  Why did you
> specify these manually?

It is said to trust mkfs.xfs, that's what I did. No options have been
specified by me and mkfs.xfs guessed everything by itself.


> > The issue is that disk access is very slow and I cannot spot why. Here
> > is some data when I try to access the file system.
> > 
> > 
> >         # dd if=/dev/zero of=/srv/test.zero bs=512K count=6000
> >         6000+0 records in
> >         6000+0 records out
> >         3145728000 bytes (3.1 GB) copied, 82.2142 s, 38.3 MB/s
> >         
> >         # dd if=/srv/store/video/test.zero of=/dev/null
> >         6144000+0 records in
> >         6144000+0 records out
> >         3145728000 bytes (3.1 GB) copied, 12.0893 s, 260 MB/s
> 
> What percent of the filesystem space is currently used?

Very small, 3GB / 6TB, something like 0.05%.

> >         First run:
> >         $ time ls /srv/files
> >         [...]
> >         real        9m59.609s
> >         user        0m0.408s
> >         sys 0m0.176s
> 
> This is a separate problem and has nothing to do with the hardware, md,
> or XFS.  I assisted with a similar, probably identical, ls completion
> time issue last week on the XFS list.  I'd guess you're storing user and
> group data on a remote LDAP server and it is responding somewhat slowly.
>  Use 'strace -T' with ls and you'll see lots of poll calls and the time
> taken by each.  17,189 files at 35ms avg latency per LDAP query yields
> 10m02s, if my math is correct, so 35ms is your current avg latency per
> query.  Be aware that even if you get the average LDAP latency per file
> down to 2ms, you're still looking at 34s for ls to complete on this
> directory.  Much better than 10 minutes, but nothing close to the local
> speed you're used to.
> 
> >         Second run:
> >         $ time ls /srv/files
> >         [...]
> >         real        0m0.257s
> >         user        0m0.108s
> >         sys 0m0.088s
> 
> Here the LDAP data has been cached.  Wait an hour, run ls again, and
> it'll be slow again.
> 
> >         $ ls -l /srv/files | wc -l
> >         17189
> 
> > I guess the controller is what's is blocking here as I encounter the
> > issue only on servers where it is installed. I tried many settings like
> > enabling or disabling cache but nothing changed.

Just using the old good `/etc/passwd` and `/etc/group` files here. There
is no special permissions configuration.


> The controller is not the cause of the 10 minute ls delay.  If you see
> the ls delay only on servers with this controller it is coincidence.
> The cause lay elsewhere.
> 
> Areca are pretty crappy controllers generally, but I doubt they're at
> fault WRT your low write throughput, though it is possible.

Well I have issues only on those servers. Strange enough.


I see however that I messed the outputs concerning the filesystem
details. Let me put everything in order.


Server 1
--------

# xfs_info /dev/md3
meta-data=/dev/mapper/data-video isize=256    agcount=33, agsize=50331520 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1610612736, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mdadm -D /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Thu Oct 24 14:33:59 2013
     Raid Level : raid10
     Array Size : 7813770240 (7451.79 GiB 8001.30 GB)
  Used Dev Size : 3906885120 (3725.90 GiB 4000.65 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Fri Nov 22 12:30:20 2013
          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : srv1:data  (local to host srv1)
           UUID : ea612767:5870a6f5:38e8537a:8fd03631
         Events : 22

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1

# grep md3 /etc/fstab
/dev/md3        /srv        xfs        defaults,inode64        0        0


Server 2
--------

# xfs_info /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=30523648 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=976755712, imaxpct=5
         =                       sunit=256    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=476936, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Thu Nov  8 11:20:57 2012
     Raid Level : raid10
     Array Size : 3907022848 (3726.03 GiB 4000.79 GB)
  Used Dev Size : 1953511424 (1863.01 GiB 2000.40 GB)
   Raid Devices : 4
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Mon Nov 25 08:37:33 2013
          State : active 
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2
     Chunk Size : 1024K

           Name : srv2:0
           UUID : 0bb3f599:e414f7ae:0ba93fa2:7a2b4e67
         Events : 280490

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       5       8       65        3      active sync   /dev/sde1

       4       8       81        -      spare   /dev/sdf1

# grep md0 /etc/fstab
/dev/md0        /srv       noatime,nodev,nosuid,noexec,inode64        0        0


--
Jimmy


<Prev in Thread] Current Thread [Next in Thread>