On 7/27/2012 3:14 AM, Jason Newton wrote:
> raw video to disk (3 high res 10bit video streams, 5.7MB per frame, at 20hz
> so effectively 60fps total). I use 2 512GB OCZ vertex 4 SSDs which
> support ~450MB/s each. I've soft-raided them together (raid 0) with a 4k
> chunksize and I get about 900MB/s avg in a benchmark program I wrote to
> simulate my videostream logging needs.
> I only have 50 milliseconds per frame and latencies exceeding this would
> result in dropped frames (bad).
> xfs_info of my video raid:
> meta-data=/dev/md2               isize=256    agcount=32, agsize=7380047
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=236161504, imaxpct=25
>          =                       sunit=1      swidth=2 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=115313, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> I'm using 3.2.22 with the rt34 patchset.
> If it's desired I can post my benchmark code. I intend to rework it a
> little so it only does 60fps capped since this is my real workload.
> If anyone has any tips for reducing latencies of the write calls or cpu
> usage, I'd be interested for sure.

I don't think your write latency problem is software related.

What do you think the odds are that the wear leveling routine is kicking
in and causing your half second max latencies? In one test you wrote
over 90% of the user cells of the devices, and most of your test writes
were over 100GB, about 10% of the user cells. That's an extremely large
wear load for an SSD over a short period.
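
To put rough numbers on that wear load, here is a back-of-the-envelope
sketch in Python. It uses only the figures quoted above (5.7MB frames,
3 streams, 20Hz, two 512GB SSDs). Treating MB and GB as decimal units is
my assumption, and the function names are made up for illustration.

```python
# Figures quoted in this thread. Decimal units (1 GB = 1000 MB) are an assumption.
FRAME_MB = 5.7      # size of one frame
STREAMS = 3         # number of video streams
RATE_HZ = 20        # frames per second per stream
SSD_GB = 512        # capacity of one SSD
NUM_SSDS = 2        # SSDs in the array


def aggregate_mb_per_s(frame_mb=FRAME_MB, streams=STREAMS, rate_hz=RATE_HZ):
    """Sustained write rate of all streams combined, in MB/s."""
    return frame_mb * streams * rate_hz


def seconds_to_write(gigabytes, mb_per_s):
    """Time needed to write the given volume at the given rate."""
    return gigabytes * 1000 / mb_per_s


if __name__ == "__main__":
    rate = aggregate_mb_per_s()
    print(f"aggregate rate: {rate:.0f} MB/s")
    print(f"100 GB takes {seconds_to_write(100, rate) / 60:.1f} minutes")
    total_gb = SSD_GB * NUM_SSDS
    print(f"{total_gb} GB takes {seconds_to_write(total_gb, rate) / 60:.1f} minutes")
```

At roughly 342MB/s, 100GB goes by in under five minutes, and a
continuous capture would overwrite the full capacity of both SSDs in
about 50 minutes.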

What happens if you format each SSD directly, without md/RAID0, and
write to the two XFS filesystems: two streams to one SSD and one to the
other? That will free up serious cycles, allowing you to eliminate CPU
load as a factor.
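
If it helps with that experiment, here is a minimal Python sketch of a
per-stream latency check. It is not your benchmark, just an
illustration: `write_frames` and its parameters are hypothetical names,
and it writes through the page cache, so the timings include caching
effects that a direct I/O test would not see.

```python
import os
import time


def write_frames(path, frame_bytes, n_frames, deadline_s=0.050):
    """Write n_frames zero-filled frames to path.

    Returns (worst write latency in seconds, number of writes over deadline).
    The 50 ms default deadline is the per-frame budget mentioned in the thread.
    """
    buf = bytes(frame_bytes)
    worst = 0.0
    late = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(n_frames):
            start = time.perf_counter()
            view = memoryview(buf)
            # os.write may write fewer bytes than requested, so loop until done.
            while len(view) > 0:
                written = os.write(fd, view)
                view = view[written:]
            elapsed = time.perf_counter() - start
            worst = max(worst, elapsed)
            if elapsed > deadline_s:
                late += 1
    finally:
        os.close(fd)
    return worst, late
```

For example, `write_frames("/mnt/ssd0/stream0.raw", 5_700_000, 1200)`
would emulate one minute of one stream (1200 frames at 20Hz), with
`/mnt/ssd0` standing in for wherever the first SSD is mounted. Run one
thread or process per stream. Note that the loop writes as fast as it
can rather than pacing itself at 20Hz. At about 114MB/s per stream, the
SSD carrying two streams would see about 228MB/s, well under the
~450MB/s quoted for each device.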

WRT CPU consumption: at these data rates md/RAID0 is going to eat
massive cycles, even though it is not bound to a single thread as
RAID1/10/5/6 are. A linear concat will eat about the same as RAID0. The
other levels would simply max out one core and scale no further. Both
RAID0 and linear are fully threaded and simply pass an offset to the
block layer, so an embedded CPU with more cores would help. A faster
clock would help as well, obviously, but not as much as more cores.
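
To illustrate what "pass an offset" means, here is a simplified Python
model of the address mapping for striping and for a linear concat. It is
a sketch for equal-sized members, not the md driver's code. The function
names are mine, and reading 5.7MB as 5,700,000 bytes is an assumption.

```python
def raid0_map(offset, chunk_bytes=4096, n_devices=2):
    """Map a logical byte offset to (device index, byte offset on that device)."""
    chunk_index, within = divmod(offset, chunk_bytes)
    # Chunks are dealt out to the devices round-robin, one stripe at a time.
    stripe, device = divmod(chunk_index, n_devices)
    return device, stripe * chunk_bytes + within


def linear_map(offset, device_sizes):
    """Map a logical byte offset onto devices concatenated end to end."""
    for device, size in enumerate(device_sizes):
        if offset < size:
            return device, offset
        offset -= size
    raise ValueError("offset beyond end of array")


def chunks_per_frame(frame_bytes, chunk_bytes=4096):
    """Number of chunks a frame of frame_bytes touches, rounded up."""
    return -(-frame_bytes // chunk_bytes)


if __name__ == "__main__":
    for off in (0, 4096, 8192, 12288):
        print(off, raid0_map(off))
    print("chunks per frame:", chunks_per_frame(5_700_000))
```

With a 4k chunk size, each frame is spread over roughly 1,400 chunks
that alternate between the two SSDs, so the mapping itself is cheap but
it is performed a very large number of times per second.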

Interesting topic, Jason.