On Wed, Jul 25, 2007 at 08:14:05PM +1000, Mark Goodwin wrote:
>
>
> David Chinner wrote:
> >
> >The next question is the hard one. What do we do when we detect
> >a log record CRC error? Right now it just warns and sets a flag
> >in the log. I think it should probably prevent log replay from
> >replaying past this point (i.e. trim the head back to the last
> >good log record) but I'm not sure what the best thing to do here.
> >
> >Comments?
>
> 1. perhaps use a new flag XLOG_CRC_MISMATCH instead of XLOG_CHKSUM_MISMATCH
Yeah, that's easy to do. What to do with the error is more important,
though.
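To make the "trim the head back to the last good record" option concrete, here's a minimal sketch of that policy. The struct and function names are hypothetical, not the real xlog_rec_header or recovery code; it only illustrates stopping replay at the first record whose CRC doesn't verify:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record descriptor; field names are illustrative. */
struct log_rec {
	uint32_t stored_crc;	/* CRC written with the record */
	uint32_t computed_crc;	/* CRC recomputed during recovery */
};

/*
 * Return the number of leading records that are safe to replay:
 * everything before the first CRC mismatch. Records at and past
 * the mismatch are discarded, i.e. the log head is trimmed back.
 */
static size_t trim_to_last_good(const struct log_rec *recs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (recs[i].stored_crc != recs[i].computed_crc)
			return i;
	return n;
}
```

The trade-off is that any transactions past the bad record are lost, but replaying past a known-corrupt record risks propagating garbage into metadata.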
> 2. is there (or could there be if we added it), correction for n-bit errors?
Nope. To do that, we'd need to implement some type of Reed-Solomon
coding and would need to use more bits on disk to store the ECC
data. That would have a much bigger impact on log throughput than a
table-based CRC on a chunk of data that is hot in the CPU cache. And
we'd have to write the code as well. ;)
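For reference, the table-based CRC in question looks roughly like this. I'm assuming the CRC32c (Castagnoli) polynomial here, which is the usual choice for storage checksums; the actual polynomial in the patch may differ:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC32c (Castagnoli) polynomial -- an assumption,
 * not necessarily what the patch under discussion uses. */
#define CRC32C_POLY	0x82F63B78u

static uint32_t crc_table[256];

/* Precompute one table entry per byte value. */
static void crc32c_init(void)
{
	for (uint32_t i = 0; i < 256; i++) {
		uint32_t crc = i;
		for (int j = 0; j < 8; j++)
			crc = (crc & 1) ? (crc >> 1) ^ CRC32C_POLY
					: crc >> 1;
		crc_table[i] = crc;
	}
}

/* One table lookup per input byte -- cheap on data already hot
 * in the CPU cache, which is the point made above. */
static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len--)
		crc = crc_table[(crc ^ *p++) & 0xff] ^ (crc >> 8);
	return ~crc;
}
```

This detects bit errors but cannot correct them; correction would need the extra on-disk ECC bits described above.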
However, I'm not convinced that this sort of error correction is the
best thing to do at a high level, as all the low-level storage
already does Reed-Solomon based bit error correction. I'd much
prefer to use a different method of redundancy in the filesystem so
the error detection and correction schemes at different levels don't
share the same weaknesses.
That means the filesystem needs strong enough CRCs to detect bit
errors and sufficient structure validity checking to detect gross
errors. XFS already does pretty good structure checking; we don't
have bit error detection though....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group