<br><div class="gmail_quote"><div class="gmail_quote"><div class="im">On Sun, Jul 29, 2012 at 8:03 PM, Dave Chinner <span dir="ltr"><<a href="mailto:david@fromorbit.com" target="_blank">david@fromorbit.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>On Fri, Jul 27, 2012 at 01:14:17AM -0700, Jason Newton wrote:<br>
> Hi,<br>
><br>
> I think the following bug is still around:<br>
><br>
> <a href="http://oss.sgi.com/archives/xfs/2011-11/msg00179.html" target="_blank">http://oss.sgi.com/archives/xfs/2011-11/msg00179.html</a><br>
><br>
> I get the same stack trace.<br>
<br>
</div>Not surprising, I doubt anyone has looked at it much. Indeed,<br>
xfs/090 assert fails immediately in the rt allocator for me....<br>
>
> > There's another report out there somewhere
> > with another similar stack trace. I know the realtime code is not
> > maintained so much, but it seems a waste to let it fall out of
> > maintenance when it's the only thing on Linux that seems to fill the
> > realtime IO niche.
>
> The XFS "realtime" device has nothing to do with "realtime IO".
>
> If anything, it's probably much worse at "realtime IO" than the
> normal data device, especially at scale, because it is bitmap rather
> than btree based. And it is single threaded.
>
> That's why it really isn't maintained - the data device is as good
> as or better than the "realtime" device in RT workloads....

This wasn't expected; thanks for the clarifications. What was the original point of RT files?
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><br>
> So this email is mainly about the null pointer deref on the spinlock in<br>
> _xfs_buf_find on realtime files, but I figure I might also ask a few more<br>
> questions.<br>
><br>
> What kind of differences should one expect between GRIO and realtime files?<br>
<br>
</div>Linux doesn't support GRIO. It's an Irix only thing, and that<br>
required special hardware support for bandwidth reservation, special<br>
frame schedulers in the IO path, etc. The XFS realtime device was<br>
just one part of the whole GRIO framework. Anyway, if you don't have<br>
15 year old SGI hardware you can't use GRIO.<br>
<br>
If you are talking about GRIOv2, then, well, you aren't running<br>
CXFS...<br>
<div><br>
> What kind of on latencies of writes should one expect for realtime files vs<br>
> normal?<br>
<br>
</div>How long is a piece of string?<br></blockquote></div><div>Well, I had meant with say one block of io.<br> </div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
> > raw video to disk (3 high-res 10bit video streams, 5.7MB per frame, at 20Hz
> > each, so effectively 60fps total). I use 2 512GB OCZ Vertex 4 SSDs which
> > support ~450MB/s each. I've soft-raided them together (raid0) with a 4k
> > chunksize
>
> There's your first problem. You are storing 5.7MB files, so why
> would you use a 4k chunk size? You'd do better with something on the
> order of a 1MB chunk size (2MB stripe width) so that you are forming
> as large IOs as possible with the minimum of software overhead (i.e.
> no merging of 4k IOs into larger IOs in the IO scheduler).

I switched to the Intel built-in raid0 and found that chunk sizes of 4k, 64k, and 128k don't actually make much difference to latency, throughput, or CPU with the simulation application I've written. Even streaming directly to the raid partition still gobbles 40% CPU (single thread, single stream @ 60fps, with higher average latency than XFS). XFS at any of these chunk sizes runs 60-70% CPU with 3 streams, 1 per thread. XFS with a single thread and single stream @ 60fps looks about the same as direct, occasionally reaching 45-50% CPU. All of these numbers seem to depend on the mood of the SSD, along with how often there were latency overruns (sometimes none for 45 minutes, sometimes every second - perhaps there's a pattern to the behavior). I'd also be interested in trying filesystem block sizes larger than 4k (I don't mean the raid0 chunk size), but that doesn't seem possible on x86_64 Linux, since the block size is capped by the 4k page size...
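For reference, the measurement loop in my simulator is roughly the following (a minimal sketch: FRAME_SIZE and the 60fps budget match my setup; the path, loop count, and names are illustrative):

/* Minimal sketch of the simulator's timing loop: append one frame per
 * write(2) and flag any call that blows the ~16.6ms/frame budget.
 * FRAME_SIZE and BUDGET_NS reflect my 60fps setup; the path and loop
 * count are illustrative. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define FRAME_SIZE (5700UL * 1024)      /* ~5.7MB per frame */
#define BUDGET_NS  (1000000000L / 60)   /* 60fps -> ~16.6ms per frame */

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000L + ts.tv_nsec;
}

int main(void)
{
    char *frame = malloc(FRAME_SIZE);
    if (!frame)
        return 1;
    for (size_t j = 0; j < FRAME_SIZE; j++)
        frame[j] = (char)rand();        /* defeat any compression in the pipeline */

    int fd = open("/video/stream0.raw",
                  O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    for (int i = 0; i < 3600; i++) {    /* one minute @ 60fps */
        int64_t t0 = now_ns();
        if (write(fd, frame, FRAME_SIZE) != (ssize_t)FRAME_SIZE)
            return 1;
        int64_t dt = now_ns() - t0;
        if (dt > BUDGET_NS)
            fprintf(stderr, "overrun: frame %d took %.1fms\n", i, dt / 1e6);
    }
    close(fd);
    free(frame);
    return 0;
}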
</div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Note that you are also writing hundreds of GB to the SSDs, which<br>
will be triggering internal garbage collection, and that will have<br>
significant impact on Io completion latency. It's not uncommon to<br>
see 500ms IO latencies occur on consumer level SSDs when garbage<br>
collect kicks in. If you are going to use SATA SSDs, then you're<br>
going to have to design your application to be able to handle such<br>
write latencies...<br>
<div><br></div></blockquote></div><div>500ms does look like to be in the neighborhood for the garbage collection for these drives. Maybe 4-450 on the avg. This neighborhood is an obvious outlier in some tests.<br>
</div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>
> and I get about 900MB/s avg in a benchmark program I wrote to<br>
> simulate my videostream logging needs. I only save one file per<br>
> videostream (only 1 videostream modeled in simulation), which I append to<br>
> in a loop with a single write call, which records the frame, over and over<br>
> while keeping track of timing.<br>
<br>
</div>The typical format for high bandwidth video stream is file per<br>
frame. That's exactly what the filestreams allocator is designed for<br>
- ingest of multiple streams and keeping them in separate locations<br>
(AGs) on disk. This means allocation remains concurrent and doesn't<br>
serialise, causing excess, unpredicatble latencies.<br>
<br></blockquote></div><div>Ah, that is interesting. I used to save tiffs but I figured that would be more variable in latency and cpu usage since it's opening and closing files constantly. However you have a definite point since it's not serialized to one stream, that there's some extra concurrency to exploit. I'll have to benchmark with multiple files again.<br>
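Something like this is what I'll try (a sketch only; the -o filestreams mount option is as documented, but the directory layout, names, and FRAME_SIZE are placeholders for my setup):

/* Sketch of file-per-frame logging. Assumes the filesystem is
 * mounted with -o filestreams so that each stream directory's
 * allocations stay in their own AGs and the three writers don't
 * serialise on allocation. Paths and FRAME_SIZE are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define FRAME_SIZE (5700UL * 1024)   /* ~5.7MB per frame */

/* one directory per camera stream, one file per frame */
static int write_frame(const char *stream_dir, unsigned long frameno,
                       const void *buf)
{
    char path[256];
    snprintf(path, sizeof(path), "%s/frame-%08lu.raw",
             stream_dir, frameno);

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, FRAME_SIZE);
    close(fd);
    return n == (ssize_t)FRAME_SIZE ? 0 : -1;
}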
<br></div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Indeed, if you use file per frame, and a RAID0 chunk size of 3MB<br>
(6MB stripe width), then XFs will align the data in each file to the<br>
same stripe unit boundary for all files. There will be 300kb of free<br>
space between them, but having everything nicely aligned to the<br>
underlying geometry tends to help maintain allocation determinism<br>
until the filesystem is 5.7/6 * 100% = 95% full.....<br>
<div> <br></div></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>
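If I've read that right, the setup would be something along these lines (device names are placeholders for my array):

mdadm --create /dev/md2 --level=0 --raid-devices=2 --chunk=3072 \
      /dev/sda2 /dev/sdb2          # 3MB chunk x 2 SSDs = 6MB stripe width
mkfs.xfs -d su=3m,sw=2 /dev/md2    # tell XFS the stripe geometry

i.e. each 5.7MB frame file starts on a 3MB stripe unit boundary and occupies one 6MB stripe with ~300KB of slack, which is where the 95% figure comes from.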
> > The frame is in memory and nonzero with
> > some interesting pattern to defeat compression if it's in the pipeline
> > anywhere. I get 180-300MB/s with O_DIRECT, so better performance without
> > O_DIRECT (maybe because it's soft-raid?).
>
> It sounds like you are using inline write(2) calls, which means the
> IO is synchronous (i.e. occurs within the write syscall), which
> means throughput is bound by IO completion latency. AIO+DIO solves
> this problem as it implies application level frame buffering - this
> is a common way of ensuring that IO latencies don't cause dropped
> frames.

Yes, though I don't really want to complicate the main program with AIO; it's complex enough as is.
> Using buffered IO means the write(2) operates at memory speed, but
> you then have no control over allocation and writeback, and memory
> allocation and reclaim become a major source of latency that direct
> IO does not have. Doing buffered IO to the realtime device is, well,
> even less well tested than the realtime device itself, as historically
> the RT device only supported direct IO. It's supposed to work, but it's
> never really been well tested, and I don't know anyone who uses it
> in production....
>
> > The problem is that I
> > occasionally get hiccups in latency... there's nothing else using the disk
> > (embedded system, no other pids running + root is RO). I use the deadline
> > IO scheduler on both my SSDs.
>
> Yep, that'll be because you are using buffered IO. It'll be faster
> than a naive direct IO implementation, but you'll have latency
> issues that cannot be avoided or predicted.

Interesting - what constitutes a proper direct IO implementation? AIO plus record structures whose sizes are multiples of (in this case) 4k?
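i.e. something like this? (A sketch of my understanding using libaio - untested, and the 4k alignment/padding assumptions and names are mine.)

/* Sketch: direct IO frame writes via libaio. My assumptions: buffer,
 * file offset, and IO size all 4k-aligned, with the 5.7MB frame padded
 * up to a 4k multiple. Link with -laio. Untested - just the shape of
 * the AIO+DIO approach as I understand it. */
#define _GNU_SOURCE                 /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <unistd.h>

#define FRAME_RAW  (5700UL * 1024)
#define ALIGN      4096UL
#define FRAME_PAD  (((FRAME_RAW + ALIGN - 1) / ALIGN) * ALIGN)

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, ALIGN, FRAME_PAD))   /* 4k-aligned buffer */
        return 1;

    int fd = open("/video/stream0.raw",
                  O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    io_context_t ctx = 0;
    if (io_setup(8, &ctx))                        /* up to 8 IOs in flight */
        return 1;

    off_t off = 0;
    for (int i = 0; i < 1200; i++) {              /* 20s of one stream */
        struct iocb cb, *cbp = &cb;
        io_prep_pwrite(&cb, fd, buf, FRAME_PAD, off);
        if (io_submit(ctx, 1, &cbp) != 1)         /* returns immediately */
            return 1;

        /* real code would grab the next frame here instead of
         * reaping the completion synchronously */
        struct io_event ev;
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
            return 1;
        off += FRAME_PAD;
    }

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}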
</div><div class="im">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><br>
> xfs_info of my video raid:<br>
> meta-data=/dev/md2 isize=256 agcount=32, agsize=7380047<br>
<br>
</div>Lots of little AGs - that will stress the freespace management of<br>
the filesystem pretty quickly.....<br>
<div><br>
> blks<br>
> = sectsz=512 attr=2<br>
> data = bsize=4096 blocks=236161504, imaxpct=25<br>
> = sunit=1 swidth=2 blks<br>
> naming =version 2 bsize=4096 ascii-ci=0<br>
> log =internal bsize=4096 blocks=115313, version=2<br>
> = sectsz=512 sunit=1 blks, lazy-count=1<br>
> realtime =none extsz=4096 blocks=0, rtextents=0<br>
<br>
</div>And no realtime device. It doesn't look like you're testing what you<br>
think you are testing....<br>
<br></blockquote></div><div>Sorry, the topic quickly moved from something of a bug report / query to an involved benchmark and testing. This xfs_info was not when I had the realtime section, it was just for 4k chunksize raid0. After a few crashes on the realtime section I moved on to other testing since I doubted there was little that could be done. I've since performed alot of testing (to be discussed hopefully in the next week, I'm getting to be pretty short on time) and rewrote the framelogging component of the application with average bandwidth in mind and decoupled the saving of frame data from the framegrabber threads. Basically I just have a configurable circular buffer of up to 2 seconds of frames. I think that is the best answer for now as from my naive point of view, its some combination of linux related (FS path was never RT) and SSD (garbage collection was unplanned... who knows what else the firmware is doing).<br>
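The shape of the new framelogger is roughly this (a sketch only; RING_DEPTH and the names are placeholders, and the real component sizes the buffer from config and allocates the slabs dynamically - a 120-deep ring of 5.7MB frames is ~680MB):

/* Sketch of the decoupled frame logger: framegrabber threads push
 * into a bounded ring (up to ~2s of frames), a dedicated writer
 * thread drains it to disk. */
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define FRAME_SIZE (5700UL * 1024)
#define RING_DEPTH 120                 /* 2s @ 60fps aggregate */

struct frame_ring {
    unsigned char   buf[RING_DEPTH][FRAME_SIZE];  /* heap-allocated in reality */
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
};

void ring_init(struct frame_ring *r)
{
    r->head = r->tail = r->count = 0;
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->not_empty, NULL);
}

/* called from a framegrabber thread; drops the frame if the writer
 * has fallen more than 2 seconds behind */
bool ring_push(struct frame_ring *r, const void *frame)
{
    bool ok = false;
    pthread_mutex_lock(&r->lock);
    if (r->count < RING_DEPTH) {
        memcpy(r->buf[r->head], frame, FRAME_SIZE);
        r->head = (r->head + 1) % RING_DEPTH;
        r->count++;
        ok = true;
        pthread_cond_signal(&r->not_empty);
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}

/* called from the writer thread; blocks until a frame is available,
 * copies it out so the slot frees up while the write(2) is in flight */
void ring_pop(struct frame_ring *r, void *frame_out)
{
    pthread_mutex_lock(&r->lock);
    while (r->count == 0)
        pthread_cond_wait(&r->not_empty, &r->lock);
    memcpy(frame_out, r->buf[r->tail], FRAME_SIZE);
    r->tail = (r->tail + 1) % RING_DEPTH;
    r->count--;
    pthread_mutex_unlock(&r->lock);
}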
I'm still interested in finding out why streaming a few hundred MB to disk has so much overhead compared to the calculations I do in userspace, though. Straight copies of frames (copied in the real program because of limitations of the framegrabber driver's DMA engine) don't use as much CPU as writing to a single SSD. It takes a little over a millisecond to copy a frame - roughly 5.7MB per 1ms, or about 5.7GB/s of memory bandwidth, an order of magnitude beyond what one SSD sustains. As for the hardware: while it's an embedded system, it's got a 2.2GHz 2-core i7 in it, and the southbridge is a BD82QM67 PCH.

-Jason