Bug 328 - freeze on XFS recovery on sparc64 with linux kernel 2.4.26
: freeze on XFS recovery on sparc64 with linux kernel 2.4.26
Status: RESOLVED FIXED
: XFS
XFS kernel code
: unspecified
: Other Linux
: major
: ---
Assigned To:
:
:
:
:
:
:
  Show dependency treegraph
 
Reported: 2004-05-08 13:56 CST by
Modified: 2004-06-15 11:53 CST (History)


Attachments
A patch to fix XFS recovery on 64-bit little-endian architectures (like sparc64) (1.06 KB, patch)
2004-05-15 05:33 CST, Nicolas Boullis
Details | Diff
A patch to fix XFS recovery on 64-bit little-endian architectures (like sparc64) (2.56 KB, patch)
2004-05-15 05:35 CST, Nicolas Boullis
Details | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2004-05-08 13:56:45 CST
Hi,

On sparc64, I tried several versions of the XFS patch for 2.4 kernels, up to the
version currently in the 2.4.26 kernel. All of them fail when a recovery is
attempted, first by freezing and sometimes by oopsing. It is so badly broken
that even the openboot hotkey (STOP-A) does not work.

I just rebuilt my kernel and xfs module with DEBUG. And now, at recovery time, I
get an assertion failure:
XFS assertion failed: item->ri_buf[i].i_len % XFS_BLI_CHUNK == 0, file:
xfs_log_recover.c, line: 1955

(I never experienced such a problem on i386.)

Is there something I can do to help track this bug?


Regards,

Nicolas
------- Comment #1 From 2004-05-15 05:33:35 CST -------
Created an attachment (id=120) [details]
A patch to fix XFS recovery on 64-bit little-endian architectures (like
sparc64)

As it is currently written, the xfs_contig_bits function in xfs_bit.c, when
BITS_PER_LONG != 32, implicitely assumes that the arch is little-endian, as it
considers that the LSB of *((uint *)map) is the same as the LSB of *((unsigned
long int *)map).

This patch implements a rewrite of xfs_contig_bits, that is supposed to work on
all arches. Moreover, it adds two ASSERTs that would have been useful to track
this bug.
------- Comment #2 From 2004-05-15 05:35:40 CST -------
Created an attachment (id=121) [details]
A patch to fix XFS recovery on 64-bit little-endian architectures (like
sparc64)

As it is currently written, the xfs_contig_bits function in xfs_bit.c, when
BITS_PER_LONG != 32, implicitely assumes that the arch is little-endian, as it
considers that the LSB of *((uint *)map) is the same as the LSB of *((unsigned
long int *)map).

This patch implements a rewrite of xfs_contig_bits, that is supposed to work on
all arches. Moreover, it adds two ASSERTs that would have been useful to track
this bug.
------- Comment #3 From 2004-05-15 05:36:26 CST -------
(From update of attachment 121 [details])
As it is currently written, the xfs_contig_bits function in xfs_bit.c, when
BITS_PER_LONG != 32, implicitely assumes that the arch is little-endian, as it
considers that the LSB of *((uint *)map) is the same as the LSB of *((unsigned
long int *)map).

This patch implements a rewrite of xfs_contig_bits, that is supposed to work on
all arches. Moreover, it adds two ASSERTs that would have been useful to track
this bug.
------- Comment #4 From 2004-05-15 05:38:24 CST -------
Sorry for the extra noise, that's roughly the first time I use bugzilla. I hope
everything is now fine. If it is not, please ask.

Nicolas
------- Comment #5 From 2004-05-16 21:59:40 CST -------
Hi Nicolas,

Thanks very much for the patch.

I see the problem that you point out.
For xfs_buf_item's we store a blf_data_map which is an array of 32bit ints
and we use xfs_buf_item_log() to add to the map. When we call xfs_contig_bits on
a 64 bit word system we have to be careful. On little endian systems the LSB of
a 32 bit word and a 64 bit word for the same data is the same but on a big
endian system they differ. So one can't just call the linux find_next_zero_bit()
function when we have 64 bit words on big endian boxes (as you mentioned).
I noticed that for xfs_next_bit() we are doing the right thing and doing our own
32 bit int code.
The xfs_contig_bits() is fairly similar to xfs_next_bit() and is largely just a
xfs_next_zero_bit() except that we want a count of all the 1s so far even if we
never find a zero (which would return -1 in the xfs_next* case).
So I compared your xfs_contig_bits() code with our existing xfs_next_bit() code
and I realised that we have a few simplifications to make which you've already
done in your code. The thing is, unlike in the kernel functions, the size
parameter in xfs code is in 32 bit ints and NOT in bits. We only do ops on size
which keep it a multiple of 32 bits and therefore some size related tests in
xfs_next_bit() can be simplified/removed. I'll make some changes for
xfs_next_bit() in a separate incident :-)

So the only thing I noticed in your code is that
  while (size >= NBWORD) {
could probably be simplified to
  while (size) {
and the assert removed,
since size must be a multiple of NBWORD.

Did you write this code based on xfs_next_bit() or from
somewhere else?

--Tim
------- Comment #6 From 2004-05-17 11:48:01 CST -------
Hi Tim,

I do agree with your explanation of the bug.
About the
  while (size >= NBWORD) {
that could be simplified to
  while (size) {
I agree that there should be no difference, but I'd rather keep the assert,
"just in case"... I can't think of a situation where this would be needed, but,
as far as I know, asserts don't hurt when debug is disabled...

And you're right, that code is actually based on xfs_next_bit.


Regards,

Nicolas
------- Comment #7 From 2004-06-15 09:53:16 CST -------
Tim's patch has been in for a while an verified on a few 64bit BE plattforms.