I'm including some more data for you to look at. It was produced by a
tool I just got the go-ahead to open source; if you want to pull a copy
to try out, it's at sourceforge.net/projects/collectl [shameless plug].
I'm sending along a few selected extracts (I hope you don't mind).
As for the testing timeline: the write of 1.6M files started at 5:19,
the stats at about 5:41, and the doomed reads at 5:51. At 5:52:26 you
can see the cpu hit 25% and the nfs calls almost completely stop.
The first file contains the cpu, disk, network and nfs server data for
the duration of the run (unfortunately I forgot to change my selections
and missed some of the starting nfs data). The second file is a
snapshot of the slabs, but only during the read portion where things
went bad. For the slab data I chose the collectl option that shows only
the slabs that changed between samples, rather than limiting it to the
ones you requested, since the list isn't really that long. If you'd
prefer the complete list of slabs I can send that as well.
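In case you want to collect something similar yourself, the invocation
would look roughly like this - I'm reciting the subsystem letters from
memory (c=cpu, d=disk, f=nfs, n=network, Y=slab detail), so treat them
as an assumption and double-check the man page:

# record cpu/disk/nfs/network summaries plus slab detail to a raw file
# (the output path is just an example)
collectl -scdfnY -f /tmp/nfstest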
In any event, if you think this sort of data is useful I'd be happy to
send along the complete data set, which you can then play back with
collectl, displaying the data of your choice in any of several formats.
I even have data on all the nfs calls!
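As a sketch of what playback looks like (option letters again from
memory, and the filename is made up), replaying a raw file and
selecting just the cpu and nfs data would be something like:

# play back a previously recorded raw file, showing cpu + nfs only
collectl -p nfstest-*.raw.gz -scf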
One last thing - when things went bad I tried changing the values for
'drop_caches', which really didn't have any effect other than a short
blip in cpu utilization that you can observe in cdfn.gz at the
following times:
5:53:30 echo 1 > /proc/sys/vm/drop_caches
5:54:00 echo 2 > /proc/sys/vm/drop_caches
5:54:30 echo 2 > /proc/sys/vm/drop_caches
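For reference, the meanings of those values, as documented in
Documentation/sysctl/vm.txt, are:

echo 1 > /proc/sys/vm/drop_caches   # free pagecache only
echo 2 > /proc/sys/vm/drop_caches   # free dentries and inodes
echo 3 > /proc/sys/vm/drop_caches   # free pagecache, dentries and inodes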
As for your comment about the load being on one cpu, I initially
thought that too, but according to collectl that's not the case. The
load is split across all of them, as you can see from this very short
sample:
# SINGLE CPU STATISTICS
#Time     CPU  USER NICE  SYS IDLE WAIT
05:52:30    0     0    0   29   70    0
05:52:30    1     0    0   31   68    0
05:52:30    2     0    0    0  100    0
05:52:30    3     0    0   39   61    0
05:52:31    0     0    0   29   70    0
05:52:31    1     0    0   26   74    0
05:52:31    2     0    0   14   86    0
05:52:31    3     0    0   30   70    0
05:52:32    0     0    0   19   81    0
05:52:32    1     0    0   32   68    0
05:52:32    2     0    0   27   73    0
05:52:32    3     1    0   20   79    0
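If you'd like to double-check collectl's numbers with a second tool,
mpstat from the sysstat package shows the same per-cpu breakdown:

# per-cpu utilization, one-second samples
mpstat -P ALL 1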
I'll be out of town the rest of the week and not spending too much time
on mail, so any lack of responses on my part shouldn't be interpreted as
a lack of interest.
-mark
David Chinner wrote:
> On Mon, Jun 18, 2007 at 08:35:45PM -0400, Mark Seger wrote:
>
>> Thanks for the reply. I guess my general question is that if this is
>> indeed a memory issue, wouldn't you agree that it's a bug if the server
>> essentially becomes incapable of servicing data?
>>
>
> Yup.
>
>
>> Maybe I wasn't clear
>> that as long as any of the clients are trying to do reads, the cpu
>> essentially locks up at 25% utilization across 4 cpus. It's not until I
>> kill all the readers that the server returns to normal.
>>
>
> All nfsds reading a single directory - there's a single sleeping
> lock for the directory that they are contending on (i_mutex). Hence
> one CPU busy maximum.
>
>
>>> Sounds like you are running out of memory to cache the workload in.
>>> The readdir load indicates that you are probably running out of
>>> dentry/inode cache space, and so every lookup is having to re-read
>>> the inodes from disk. i.e. readdir and stat are necessary.
>>
>> I hear what you're saying, but why then isn't the original stat slower?
>>
>
> Because memory reclaim can effectively put random holes in the cache.
> Hence the second read becomes a random I/O workload instead of a
> more sequential workload where readahead can hide most latencies.
>
>
>> After creating the 1M+ files I can umount/mount the file system or
>> simply reboot the server, ensuring nothing is cached, and can then
>> either stat or read all the files in about 15 minutes, so why would
>> rereading the inodes from disk happen at such a slow rate?
>>
>
> Because reading into empty memory in a sequential manner is much
> faster than filling random holes in an already full cache that
> may be thrashing....
>
>
>>> I'd suggest looking at /proc/slabinfo (slabtop helps here) and
>>> /proc/meminfo to determine how much of your working set of inodes
>>> are being held in cache and how quickly they are being recycled.
>>
>> One of the things I do monitor is memory and slab info, and I can
>> even send you a detailed trace on a per-slab basis. Are there any
>> specific slabs I should be looking at?
>>
>
> # egrep '[xrdb][fanu][sdnf][i_f]' /proc/slabinfo
>
>
>>> perhaps fiddling with /proc/sys/vm/vfs_cache_pressure will help
>>> keep inodes/dentries in memory over page cache pages...
>>
>> any suggestions for settings?
>>
>
> Whatever is suggested in Documentation/filesystems/proc.txt for keeping
> 1.5-2x more dentries/inodes around under memory pressure.
>
> Also, when it is thrashing, can you try these combinations:
>
> # echo 1 > /proc/sys/vm/drop_caches
> # echo 2 > /proc/sys/vm/drop_caches
> # echo 3 > /proc/sys/vm/drop_caches
>
> And see if any of them improve the throughput....
>
> Cheers,
>
> Dave.
>