I don't normally top post, but I just wanted to quickly say I _really_
enjoyed reading your reply below. It was seriously educational. I
really enjoyed your note about the 24p Altix system. I've been a fan of
SGI NUMA machines since the Origin 2k, due to the uniqueness of this
scalable interconnect, though I've never been an SGI user. :(
Keep up the great work, and keep us educated, as you've done so very
well here. :)
Dave Chinner put forth on 9/2/2010 6:29 AM:
> On Thu, Sep 02, 2010 at 03:41:59AM -0500, Stan Hoeppner wrote:
>> Dave Chinner put forth on 9/2/2010 2:01 AM:
>>> No, that's definitely not the case. A different kernel in the
>>> same 8p VM, 12x2TB SAS storage, w/ 4 threads, mount options
>>> FSUse% Count Size Files/sec App Overhead
>>> 0 800000 0 39554.2 7590355
>>> 4 threads with mount options "logbsize=262144,delaylog"
>>> FSUse% Count Size Files/sec App Overhead
>>> 0 800000 0 67269.7 5697246
>> What happens when you bump each of these to 8 threads, 1 per core? If
> FSUse% Count Size Files/sec App Overhead
> 0 1600000 0 127979.3 13156823
> So, 1 thread does 19k files/s, 2 thread does 37k files/s, 4 gets
> 67k, and 8 gets 128k. I'd say that's almost linear scaling and CPU
> bound at each load point ;)
>> the test consumes all cpus/cores, what instrumentation are you viewing
>> that tells you the cpu utilization _isn't_ due to memory b/w starvation?
> 1) profiling like 'perf top' or oprofile, using hardware counters to
> profile on cpu cycles, l1/l2 cache misses, etc
> 2) the delayed logging code uses significantly more memory bandwidth
> than the original code because it copies changed information twice
> (instead of once) before it is written to disk. Given that single
> threaded performance of delayed logging is identical to the original
> code and scalability from 1 to 8 cores is almost linear, it cannot
> be memory bandwidth bound....
> The code might be memory _latency_ bound (i.e on cache misses), but
> it is certainly not stressing pure memory bandwidth.
>> A modern 64 bit 2 GHz core from AMD or Intel has an L1 instruction issue
>> rate of 8 bytes/cycle * 2,000 MHz = 16,000 MB/s = 16 GB/s per core. An
>> 8 core machine would therefore have an instruction issue rate of 8 * 16
>> GB/s = 128 GB/s. A modern dual socket system is going to top out at
>> 24-48 GB/s, well short of the instruction issue rate. Now, this doesn't
>> even take the b/w of data load/store operations into account, but I'm
>> guessing the data size per directory operation is smaller than the total
>> instruction sequence, which operates on the same variable(s).
>> So, if the CPUs are pegging, and we're not running out of memory b/w,
>> then this would lead me to believe that the hot kernel code, core
>> fs_mark code and the filesystem data are fully, or near fully, contained
>> in level 2 and 3 CPU caches. Is this correct, more or less?
> However (and it is a big however!), I generally don't care to
> analyse performance at this level because it's getting into
> micro-optimisation territory. Sure, it will get you a few percent
> here and there, but then you lose focus on improving the algorithms.
> An algorithmic change can provide an order of magnitude improvement,
> not a few percent. The delayed logging code is a clear example of
> Another example - perf top shows this on the above 8p load on
> a plain 2.6.36-rc3 kernel (and it gets about 40k files/s):
> 426043.00 27.4% _xfs_buf_find
> 87491.00 5.6% __ticket_spin_lock
> 67204.00 4.3% xfs_dir2_node_addname
> 60434.00 3.9% dso__find_symbol
> 48407.00 3.1% kmem_cache_alloc
> 37006.00 2.4% __d_lookup
> 31625.00 2.0% xfs_trans_buf_item_match
> 20036.00 1.3% xfs_log_commit_cil
> 18728.00 1.2% _raw_spin_unlock_irqrestore
> 18428.00 1.2% __memset
> 18001.00 1.2% __memcpy
> 17781.00 1.1% xfs_da_do_buf
> 17732.00 1.1% xfs_iflush_cluster
> 16831.00 1.1% kmem_cache_free
> 14836.00 1.0% __kmalloc
> It is clear that buffer lookup is consuming the most CPU of any
> operation. Why? Because the buffer hash table is too small. I've
> already posted patches for a short term solution (increase the size
> of the hash table) and the above 127k files/s result is using that
> patch. hence it is clear that the micro-optimisation works, but at
> the cost of 16x increase in memory usage for the hash table. And
> that still isn't really large enough, because now the load is
> already pushing the limits of the enlarged hash table.
> As Christoph has already suggested, the correct way to fix the
> problem is to change the caching algorithm to something that is
> self-scaling (e.g. a rb-tree or a btree). That will keep memory
> usage low on small filesystems, yet scale efficiently to large
> numbers of buffers, something a hash cannot easily do.
> IOWs, an algorithmic change will solve the problem far better for
> more situations than the micro-optimisation of tweaking the hash
> sizes. Reduced to the simplest argument, scalability is all
> about choosing the right algorithm so you don't have to care about
> minute details to obtain the performance you require.
>> I'll have to dig around. I've never even looked for the archives for
>> this list. It's hopefully mirrored in the usual places.
>> Out of curiosity, have you ever run into memory b/w starvation before
>> peaking all CPUs while running this test?
> No. Last time I ran out of bandwidth doing an IO workloads was doing
> 6-7GB/s of buffered writes to disk on a 24p ia64 Altix. The disk
> subsystem could handle 11GB/s, we got 10GB/s with direct IO, but
> buffered IO was limited by the cross-sectional memory bandwidth of
> the machine (25GB/s) because of the extra copy buffered IO
>> I could see that maybe
>> occurring with dual 1GHz+ P3 class systems with their smallish caches
>> and lowly single channel PC100, back before the switch to DDR memory,
>> but those machines were probably gone before XFS was open sourced, IIRC,
>> so you may not have had the pleasure (if you could call it that).
> The ratio between CPU cycles and memory bandwidth really hasn't
> changed much since then. The CPUs weren't powerful enough then,
> either, to run enough metadata ops to get near memory bandwidth