High latencies writing to a memory mapped file
Shawn Bohrer
shawn.bohrer at gmail.com
Mon Jul 26 17:09:09 CDT 2010
Hey Dave,
Thanks for the suggestions.
On Fri, Jul 23, 2010 at 09:28:46AM +1000, Dave Chinner wrote:
> On Thu, Jul 22, 2010 at 09:47:06AM -0500, Shawn Bohrer wrote:
> > Hello,
> >
> > We have an application that receives data over the network, and then
> > writes data out to a shared memory mapped file on xfs. Previously
> > everything worked great when we were running a CentOS 5.3 kernel
> > (2.6.18-128.7.1.el5), but when we upgraded to the newer CentOS 5.4 or
> > 5.5 kernels we started to see occasional spikes (up to 1 second) in
> > the time it takes to write the data to the memory mapped file. I
> > noticed that between the CentOS 5.3 and 5.4 kernels the upstream xfs
> > changes were backported. Testing on 2.6.32.16 shows the same
> > occasional latency spikes.
> >
> > After doing some debugging on 2.6.32.16 I have found the cause of the
> > latency spikes. The spikes occur every time pdflush runs to flush the
> > dirty pages. Note our app only writes data at about 1.5 MB/s and we
> > are well below both the dirty_background_ratio and the dirty_ratio so
> > these flushes are only of pages that are dirty_expire_centisecs old.
> > Capturing one of the spikes with ftrace shows:
> >
> > ...
> > flush-8:0-601 [011] 1291592.127590: funcgraph_entry: | xfs_ilock() {
> > flush-8:0-601 [011] 1291592.127590: funcgraph_entry: | down_write() {
> > flush-8:0-601 [011] 1291592.127590: funcgraph_entry: 0.213 us | _cond_resched();
> > flush-8:0-601 [011] 1291592.127591: funcgraph_exit: 0.657 us | }
> > flush-8:0-601 [011] 1291592.127591: funcgraph_exit: 1.174 us | }
> > flush-8:0-601 [011] 1291592.127764: sched_switch: task flush-8:0:601 [120] (D) ==> swapper:0 [120]
> > ...
>
> what is the lead-in context to this hunk? i.e. how does xfs_ilock()
> get called? That will tell me if it is blocking on the iolock or the
> ilock...
I don't have the lead-in context here. I'm using the function_graph
tracer, and with stalls up to a second long it easily overflows the
ftrace ring buffer. Once I found that my process was blocking on
xfs_ilock(), I limited my traces to include xfs_ilock, xfs_iunlock,
handle_mm_fault, and the scheduler events. If you are still
interested in seeing some more detail here I can try to collect some
more captures.
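In case it helps, this is roughly the tracing setup I'm using now (a
sketch from memory; it assumes debugfs is mounted at /sys/kernel/debug
and may need tweaking for your kernel):

    cd /sys/kernel/debug/tracing
    echo 0 > tracing_on
    echo function_graph > current_tracer
    # only trace the functions of interest to keep the ring buffer small
    echo xfs_ilock xfs_iunlock handle_mm_fault > set_ftrace_filter
    # also record scheduler switch/wakeup events
    echo 1 > events/sched/enable
    echo 1 > tracing_on
    # drain the buffer to a file while reproducing a stall
    cat trace_pipe > /tmp/trace.out

The "write_delay" markers in the traces below come from the
application writing its own annotations to trace_marker around the
timed write.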
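For reference, the writeback knobs I referred to above can all be
checked with sysctl:

    # confirm the dirty page thresholds and expiry interval in effect
    sysctl vm.dirty_background_ratio vm.dirty_ratio \
           vm.dirty_expire_centisecs vm.dirty_writeback_centisecs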
> > Here you can see flush-8:0 acquires the xfs_ilock() then I believe it
> > blocks waiting for the I/O to complete. Next you can see my
> > file_writer process generates a page fault while writing new data to
> > the memory mapped file. The page fault tries to acquire the
> > xfs_ilock() and then blocks because it is held by the flush-8:0
> > process.
> >
> > ...
> > file_writer-10549 [001] 1291593.100416: print: 00000000 write_delay start
> > file_writer-10549 [001] 1291593.100419: funcgraph_entry: | handle_mm_fault() {
> > file_writer-10549 [001] 1291593.100420: funcgraph_entry: | __do_fault() {
> > file_writer-10549 [001] 1291593.100420: funcgraph_entry: | filemap_fault() {
> > file_writer-10549 [001] 1291593.100421: funcgraph_entry: 0.775 us | find_get_page();
> > file_writer-10549 [001] 1291593.100422: funcgraph_entry: 0.191 us | _cond_resched();
> > file_writer-10549 [001] 1291593.100422: funcgraph_exit: 1.980 us | }
> > file_writer-10549 [001] 1291593.100422: funcgraph_entry: | unlock_page() {
> > file_writer-10549 [001] 1291593.100423: funcgraph_entry: 0.154 us | page_waitqueue();
> > file_writer-10549 [001] 1291593.100423: funcgraph_entry: 0.313 us | __wake_up_bit();
> > file_writer-10549 [001] 1291593.100423: funcgraph_exit: 1.080 us | }
> > file_writer-10549 [001] 1291593.100424: funcgraph_entry: | xfs_vm_page_mkwrite() {
> > file_writer-10549 [001] 1291593.100424: funcgraph_entry: | block_page_mkwrite() {
> > file_writer-10549 [001] 1291593.100424: funcgraph_entry: 0.240 us | _cond_resched();
> > file_writer-10549 [001] 1291593.100425: funcgraph_entry: | block_prepare_write() {
> > file_writer-10549 [001] 1291593.100425: funcgraph_entry: | __block_prepare_write() {
> > file_writer-10549 [001] 1291593.100426: funcgraph_entry: | create_empty_buffers() {
> > file_writer-10549 [001] 1291593.100426: funcgraph_entry: | alloc_page_buffers() {
> > file_writer-10549 [001] 1291593.100426: funcgraph_entry: | alloc_buffer_head() {
> > file_writer-10549 [001] 1291593.100426: funcgraph_entry: | kmem_cache_alloc() {
> > file_writer-10549 [001] 1291593.100427: funcgraph_entry: 0.491 us | _cond_resched();
> > file_writer-10549 [001] 1291593.100427: funcgraph_exit: 1.043 us | }
> > file_writer-10549 [001] 1291593.100428: funcgraph_entry: 0.185 us | recalc_bh_state();
> > file_writer-10549 [001] 1291593.100428: funcgraph_exit: 2.003 us | }
> > file_writer-10549 [001] 1291593.100428: funcgraph_entry: 0.137 us | set_bh_page();
> > file_writer-10549 [001] 1291593.100428: funcgraph_exit: 2.701 us | }
> > file_writer-10549 [001] 1291593.100429: funcgraph_entry: 0.271 us | _spin_lock();
> > file_writer-10549 [001] 1291593.100429: funcgraph_exit: 3.751 us | }
> > file_writer-10549 [001] 1291593.100429: funcgraph_entry: | xfs_get_blocks() {
> > file_writer-10549 [001] 1291593.100430: funcgraph_entry: | __xfs_get_blocks() {
> > file_writer-10549 [001] 1291593.100430: funcgraph_entry: | xfs_iomap() {
> > file_writer-10549 [001] 1291593.100430: funcgraph_entry: | xfs_ilock() {
>
> So this is trying to lock the ilock exclusively to map the range or
> do delayed allocation reservations.
>
> > Finally xfslogd/1 wakes up the flush-8:0 process which releases the
> > xfs_ilock() waking up my file_writer process and allowing it to
> > continue.
> > ...
> > xfslogd/1-213 [001] 1291593.158606: sched_wakeup: task flush-8:0:601 [120] success=1 [011]
> > <idle>-0 [011] 1291593.158609: sched_switch: task swapper:0 [120] (R) ==> flush-8:0:601 [120]
>
> xfslogd? Inodes do not get locked across log writes; they are
> unlocked in the transaction commit.
>
> Hmmm - perhaps the issue is that the transaction commit has not been
> able to get an iclog buffer to write the transaction into and so is
> sleeping waiting for an iclog IO to complete. That would be woken by
> the xfslogd, and then the transaction would write into the
> now-available iclog and unlock the inode.
>
> So, I'd say that in this case your application is waiting for log IO
> completion to occur. Nothing you can do to avoid that, but you can
> minimise the impact - increase the log buffer size via the mount
> option logbsize (i.e. -o logbsize=262144) to reduce the frequency of
> iclog buffer IOs and make it less likely the system will stall
> having run out of iclog space.
>
> The other possibility is that the log IO is being slowed by barriers
> being issued with the log writes. If you've got non-volatile caches
> on your storage, you can turn off barriers and that should help
> reduce log IO latency significantly.
I've tried mounting with "noatime,nobarrier,logbufs=8,logbsize=262144"
but I don't notice much difference and still see the stalls.
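For completeness, that was along the lines of the following, where
/dev/sdb1 and /data are just placeholders for our actual device and
mount point:

    umount /data
    # remount the data filesystem with barriers off and larger log buffers
    mount -t xfs -o noatime,nobarrier,logbufs=8,logbsize=262144 /dev/sdb1 /data
    # verify the options took effect
    grep /data /proc/mounts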
> Another possibility is to switch to the deadline or no-op IO
> scheduler to avoid log IO being delayed by (the ever changing)
> CFQ IO scheduling idiosyncrasies.
I also tried the noop IO scheduler, but didn't see any improvement.
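That was switched at runtime through sysfs, roughly as below, with
sdb again standing in for the actual device backing the filesystem:

    # show the available elevators; the active one is in brackets
    cat /sys/block/sdb/queue/scheduler
    # switch to noop for this device
    echo noop > /sys/block/sdb/queue/scheduler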
Thanks,
Shawn