Inode and dentry cache behavior
Shrinand Javadekar
shrinand at maginatics.com
Thu Apr 23 14:50:15 CDT 2015
Hi,
I am running OpenStack Swift on a single server with 8 disks. All 8
disks are formatted with default XFS parameters. Each disk has a
capacity of 3TB. The machine has 64GB of RAM.
Here's what OpenStack Swift does (a rough Python sketch of the sequence follows the list):
1. The file-system is mounted at /srv/node/r0.
2. Creates a temp file: /srv/node/r0/tmp/tmp_sdfsdf
3. Writes to this file: four writes of 64K each, then an fsync and a
close. The final size of the file is 256K.
4. Creates the path /srv/node/r0/objects/1004/eef/deadbeef. The
directory /srv/node/r0/objects/1004 already existed, so it only needs
to create "eef" and "deadbeef". Before creating each directory, it
verifies that the directory does not exist.
5. Renames the file /srv/node/r0/tmp/tmp_sdfsdf to
/srv/node/r0/objects/1004/eef/deadbeef/foo.data.
6. fsync /srv/node/r0/objects/1004/eef/deadbeef/foo.data.
7. It then does a directory listing for /srv/node/r0/objects/1004/eef.
8. Opens the file /srv/node/r0/objects/1004/hashes.pkl
9. Writes to the file /srv/node/r0/objects/1004/hashes.pkl
10. Closes the file /srv/node/r0/objects/1004/hashes.pkl.
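For clarity, here is a minimal Python sketch of that sequence as described above (the write_object helper, the paths, and the hashes.pkl contents are simplified for illustration; this is not the actual Swift code):

import os
import pickle
import tempfile

ROOT = "/srv/node/r0"  # mount point of one of the 8 disks

def write_object(partition, suffix, obj_hash, name, data):
    # Steps 2-3: temp file under <mount>/tmp, written in 64K chunks and fsync'd.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.join(ROOT, "tmp"), prefix="tmp")
    with os.fdopen(fd, "wb") as f:
        for off in range(0, len(data), 65536):
            f.write(data[off:off + 65536])
        f.flush()
        os.fsync(f.fileno())

    # Step 4: create the target directories, checking for existence first.
    obj_dir = os.path.join(ROOT, "objects", partition, suffix, obj_hash)
    if not os.path.isdir(obj_dir):
        os.makedirs(obj_dir)

    # Steps 5-6: rename the temp file into place, then fsync the final file.
    final_path = os.path.join(obj_dir, name)
    os.rename(tmp_path, final_path)
    with open(final_path, "rb") as f:
        os.fsync(f.fileno())

    # Step 7: directory listing of the suffix directory.
    os.listdir(os.path.join(ROOT, "objects", partition, suffix))

    # Steps 8-10: open, write and close hashes.pkl for the partition.
    with open(os.path.join(ROOT, "objects", partition, "hashes.pkl"), "wb") as f:
        pickle.dump({suffix: None}, f)

# Example: one 256K object (4 x 64K writes) into partition 1004.
write_object("1004", "eef", "deadbeef", "foo.data", b"\0" * 262144)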
Writes are getting sharded across ~1024 directories. Essentially,
there are directories 0000-1024 under /srv/node/r0/objects/; 1004 in
the example above is one of them.
This works great when the filesystem is newly formatted and mounted.
However, as more and more data gets written to the system, the above
sequence of events progressively gets slower.
* We observe that the time for fsync remains pretty much constant throughout.
* What seems to be causing performance to nosedive is that inode and
dentry caching does not seem to be working.
* As an experiment, we set vfs_cache_pressure to 0 so that inode and
dentry cache entries would not be reclaimed. However, that does not
seem to help.
* We see openat() calls taking close to 1 second (a rough sketch of
how we sample the caches and time these opens follows below).
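In case it is relevant, this is roughly how we watch the caches and the open() latency while the workload runs (the slabinfo parsing and the sampled path are illustrative, not part of Swift):

import os
import time

def slab_counts(names=("xfs_inode", "dentry")):
    """Active object counts for the given slab caches (needs root)."""
    counts = {}
    with open("/proc/slabinfo") as f:
        for line in f:
            fields = line.split()
            # slabinfo 2.1 format: name <active_objs> <num_objs> ...
            if fields and fields[0] in names:
                counts[fields[0]] = int(fields[1])
    return counts

def open_latency(path):
    """Wall-clock time of a single open()/close() of an existing file."""
    start = time.time()
    os.close(os.open(path, os.O_RDONLY))
    return time.time() - start

# vfs_cache_pressure was set with: sysctl -w vm.vfs_cache_pressure=0
while True:
    print(slab_counts(),
          "open+close: %.3fs" % open_latency("/srv/node/r0/objects/1004/hashes.pkl"))
    time.sleep(10)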
Any ideas what might be causing this behavior? Are there other
parameters, specifically XFS parameters, that can be tuned for this workload?
The sequence of events above is the typical workload, at high
concurrency.
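The load generator is essentially the write_object sketch above driven from many processes at once. A hypothetical driver (the worker count and the way partition/suffix/hash are derived are made up; real Swift uses the ring) looks like this:

import hashlib
from multiprocessing import Pool

def one_object(i):
    # Derive partition/suffix/hash from an md5 of the object id; purely
    # illustrative -- real Swift derives these from the ring and object path.
    h = hashlib.md5(str(i).encode()).hexdigest()
    write_object(partition=str(int(h, 16) % 1024).zfill(4),
                 suffix=h[-3:],
                 obj_hash=h,
                 name="foo.data",
                 data=b"\0" * 262144)

if __name__ == "__main__":
    with Pool(processes=64) as pool:  # 64 concurrent writers, illustrative
        pool.map(one_object, range(100000))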
Here are the answers to the other questions requested on the XFS wiki page:
* kernel version (uname -a)
3.13.0-39-generic #66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014 x86_64
x86_64 x86_64 GNU/Linux
* xfsprogs version
xfs_repair version 3.1.7
* number of CPUs
16
* contents of /proc/meminfo
See attached file mem_info.
* contents of /proc/mounts
/dev/mapper/troll_data_vg_23578621012a_1-troll_data_lv_1 /srv/node/r0
xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota
0 0
/dev/mapper/troll_data_vg_23578621012a_2-troll_data_lv_2 /srv/node/r1
xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota
0 0
/dev/mapper/troll_data_vg_23578621012a_3-troll_data_lv_3 /srv/node/r2
xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota
0 0
/dev/mapper/troll_data_vg_23578621012a_4-troll_data_lv_4 /srv/node/r3
xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota
0 0
/dev/mapper/troll_data_vg_23578621012a_5-troll_data_lv_5 /srv/node/r4
xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota
0 0
/dev/mapper/troll_data_vg_23578621012a_6-troll_data_lv_6 /srv/node/r5
xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota
0 0
/dev/mapper/troll_data_vg_23578621012a_7-troll_data_lv_7 /srv/node/r6
xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota
0 0
/dev/mapper/troll_data_vg_23578621012a_8-troll_data_lv_8 /srv/node/r7
xfs rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,inode64,logbufs=8,noquota
0 0
* contents of /proc/partitions
See attached file partitions_info.
* RAID layout (hardware and/or software)
NO RAID!!
* LVM configuration
See attached file lvm_info (obtained with lvdisplay).
* type of disks you are using
sdm disk 2.7T ST3000NXCLAR3000
sdm1 part 1M
sdm2 part 2.7T
dm-1 lvm 2.7T
* write cache status of drives
Drives have no write cache.
* size of BBWC and mode it is running in
No BBWC
* xfs_info output on the filesystem in question
meta-data=/dev/mapper/troll_data_vg_23578621012a_8-troll_data_lv_8 isize=256 agcount=4, agsize=183141376 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=732565504, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=357698, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
* dmesg output showing all error messages and stack traces
No errors.
* iostat and vmstat output
See the attached files iostat_log and vmstat_log.
-Shri
Attachments (scrubbed by the mailing list archive):
* mem_info: <http://oss.sgi.com/pipermail/xfs/attachments/20150423/87cfe83e/attachment-0004.obj>
* partitions_info: <http://oss.sgi.com/pipermail/xfs/attachments/20150423/87cfe83e/attachment-0005.obj>
* iostat_log: <http://oss.sgi.com/pipermail/xfs/attachments/20150423/87cfe83e/attachment-0006.obj>
* vmstat_log: <http://oss.sgi.com/pipermail/xfs/attachments/20150423/87cfe83e/attachment-0007.obj>