Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Sat, 18 Jun 2011 16:19:50 +0200
Running the latest git kernel (3.0-rc3), my machine hangs for long periods (1-2 sec) whenever I delete a large directory recursively on my xfs partition. During the hang I cannot move the mouse pointe
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Sun, 19 Jun 2011 10:16:35 +0200
I've tested this a little further. The behavior is independent of the kernel version used (tested back to 2.6.37). My SSD is also fine, and a freshly created xfs partition shows no problems either. Plea
So what is the system doing when it "hangs"? Is it CPU bound (e.g. cpu scheduler issue)? Is the system running out of memory and stalling everything in memory reclaim? What IO is occurring? How does
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Mon, 20 Jun 2011 02:54:15 +0200
It's totally idle otherwise; just a desktop with a single xterm. The machine has four cores (and also runs with "CONFIG_PREEMPT=y"), so I don't think it is CPU bound at all. It has 8GB of memory (and
Sure, the system might be otherwise idle, but what I was asking is what load the "rm -rf" causes. What IO does it cause? Is it CPU bound? etc. Oh, joy. Another user having strange performance pro
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Mon, 20 Jun 2011 04:02:36 +0200
I have not measured this, so I cannot tell. No, it's none of those (it's a Seagate Barracuda Green ST1500). Sector 0 is 4k aligned for sure. The odd 512 byte offset was present only on some first gen
And so you are speculating as to the cause of the problem. What I'm trying to do is work from the bottom up to ensure that the layers below the fs are not the cause of the problem. Which also means y
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Mon, 20 Jun 2011 08:03:51 +0200
Understood. Yes. OK. I was able to reproduce the same hang on a conventional drive with 512-byte sectors. The partition that I've used was the predecessor to the one on the 4k drive. So it saw roughly the same
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Mon, 20 Jun 2011 13:13:59 +0200
Here are two more examples. The time when the hang occurs is marked with

--cpu0-usage--cpu1-usage--cpu2-usage--cpu3-usage-- --dsk/sdb-- --system-- --load-avg-- --dsk/sdb--
usr sys idl wai hiq siq:usr
Author: Michael Monnerie <michael.monnerie@xxxxxxxxxxxxxxxxxxx>
Date: Mon, 20 Jun 2011 13:45:28 +0200
Could it be that some sectors on the disk are not easy to read for the drive, and that it simply retries several times until it works again? SATA disks can show that behaviour. You could try with "dd
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Mon, 20 Jun 2011 14:31:32 +0200
Thanks for the hint, Michael. I've just checked the SMART status on both disks and the 4kb drive looks indeed suspicious:

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8
198 Offline_
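
The SMART check described above can be automated. The following is a minimal sketch, assuming the attribute rows follow the ten-column layout of the row quoted in the message (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value). The function names and the choice to watch only attributes 197 and 198 are illustrative assumptions, not part of the original report.

```python
# Hypothetical helper: flag non-zero raw counts for the two SMART
# attributes mentioned in the thread (IDs 197 and 198).
WATCHED_IDS = {197, 198}  # assumption: only the attributes quoted above


def parse_smart_attributes(text):
    """Parse attribute rows such as
    '197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8'."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # anything else (headers, truncated rows) is skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "raw": " ".join(fields[9:]),
        })
    return rows


def suspicious(rows):
    """Return the watched attributes whose raw value is a non-zero integer."""
    flagged = []
    for row in rows:
        first = row["raw"].split()[0]
        if row["id"] in WATCHED_IDS and first.isdigit() and int(first) > 0:
            flagged.append(row)
    return flagged
```

A non-zero raw value is only a hint. As the follow-up in this thread shows, such counters can return to zero, so a surface read test is still needed to confirm or rule out a failing disk.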
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Mon, 20 Jun 2011 23:16:07 +0200
Hmm, both tests ran fine without any errors. And the two SMART attributes above are back to zero again (must have been a temporary firmware hiccup). As you can see in the data I've posted, the disk w
What is the resolution of the samples here? Where did the hang occur during this output? FWIW, Can you capture the hang while running 'iostat -x -d -m 1' so we can see what is happening with queue de
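
To answer the question of where in the output the hang occurred, the slow samples in a capture of 'iostat -x -d -m 1' can be picked out mechanically. The following is a rough sketch, assuming the capture contains a header line starting with "Device" followed by one row per device per sample. The function name, the default column ("await") and the 500 ms threshold are hypothetical choices for illustration.

```python
def flag_slow_samples(lines, column="await", threshold=500.0):
    """Return (line_number, device, value) for every data row whose
    `column` value is at or above `threshold`.

    Column positions are taken from the most recent header line, so the
    sketch does not hard-code the iostat column layout.
    """
    header = None
    flagged = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            header = fields  # remember the column names
            continue
        # Skip anything seen before a header or with a different column count.
        if header is None or len(fields) != len(header):
            continue
        try:
            value = float(fields[header.index(column)])
        except ValueError:
            # Column not present in the header, or a non-numeric cell.
            continue
        if value >= threshold:
            flagged.append((number, fields[0], value))
    return flagged
```

With a one-second sampling interval, the reported line numbers map directly onto seconds since the capture started, which makes it easier to line up a stall with the moment the desktop froze.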
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Tue, 21 Jun 2011 10:02:19 +0200
Resolution is 1 sec. I think the hang occurred when cpu3 wait switched from 41% to 99% (but I'm not sure).

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %uti
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Tue, 21 Jun 2011 12:40:53 +0200
And another example:

Cause                          Maximum       Percentage
[xlog_state_get_iclog_space]   2098.7 msec   38.0 %
Waiting for TTY data           1953.4 msec   29.7 %
Syncing filesystem              723.0 msec   12.1 %
Unlinking file                   25.8 msec    7.7
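
When several such latency tables are collected, it helps to extract the worst offender from each one. The following is a small sketch, assuming each complete row has the form "<cause> <number> msec <number> %" as in the table above. The function names are hypothetical, and rows that do not match (for example truncated ones) are silently ignored.

```python
import re

# One complete row: free-text cause, maximum latency in msec, percentage.
ROW = re.compile(
    r"^(?P<cause>.+?)\s+(?P<max_ms>\d+(?:\.\d+)?) msec\s+(?P<pct>\d+(?:\.\d+)?) %$"
)


def parse_latency_rows(text):
    """Return a list of (cause, max_msec, percentage) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((
                match.group("cause"),
                float(match.group("max_ms")),
                float(match.group("pct")),
            ))
    return rows


def worst(rows):
    """Return the row with the largest maximum latency, or None if empty."""
    return max(rows, key=lambda row: row[1]) if rows else None
```

Applied to the complete rows quoted in this message, the sketch would single out the wait in xlog_state_get_iclog_space, which is the entry that points at the XFS log rather than at the disk.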
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Tue, 21 Jun 2011 20:24:14 +0200
FWIW here are two links to svg time-charts produced by:

perf timechart record /home/markus/rm_sync

http://trippelsdorf.de/timechart1.svg
http://trippelsdorf.de/timechart2.svg

--
Markus
Author: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
Date: Tue, 21 Jun 2011 20:57:01 +0200
And this is what the mysterious kworker is doing during the sync. It's the one consuming most of the CPU time.

39.96% kworker/3:0 [kernel.kallsyms] 0xffffffff811da9da k [k] xfs_trans_ail_update_bulk