According to Ragnar Kjørstad:
> When using the utilities from Release-1.0.2 we got the following
> message:
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> xfs_repair: xfs_log_recover.c:159: xlog_find_verify_log_record:
> Assertion `start_blk != 0 || *last_blk != start_blk' failed.
>
> I've upgraded to the CVS version, and now the error-message is gone, but
Cool, this is _exactly_ the same phenomenon we have been observing for
the last two weeks...
What we found out so far:
The problem begins when xfs_repair tries to find the head and tail
blocks of the log, specifically in "xlog_find_zeroed".
Here is the call stack:
main
phase2
zero_log
xlog_find_tail
xlog_find_head
xlog_find_zeroed
If I understand correctly, xlog_find_zeroed should return the first
block with cycle number 0. Unfortunately, under some circumstances[1],
it doesn't set "blk_no", and leaves it at whatever uninitialized value
it has from its automatic declaration. In our case, this is:
(gdb) p *blk_no
$1 = 970809832858808788
(which is rather large...)
Later on, back in "xlog_find_tail", when *blk_no has become head_blk,
we have this for loop:
    /*
     * Search backwards looking for log record header block
     */
    ASSERT(*head_blk < INT_MAX);
    for (i=(int)(*head_blk)-1; i>=0; i--) {
        if ((error = xlog_bread(log, i, 1, bp)))
            goto bread_err;
        if (INT_GET(*(uint *)(XFS_BUF_PTR(bp)), ARCH_CONVERT)
            == XLOG_HEADER_MAGIC_NUM) {
            found = 1;
            break;
        }
    }
Now, with the above value in *head_blk (truncated to 4 bytes), this
loop runs for several hours (as Ragnar Kjørstad observed), reading
blocks from the filesystem.
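To put a number on "several hours", here is a minimal sketch of what the
(int) cast does to the value gdb printed. The helper names are made up
for illustration and are not part of xfs_repair; the sketch also assumes
the usual behaviour where the cast keeps the low 32 bits.

```c
#include <stdint.h>

/* The garbage value gdb printed for *blk_no in this report. */
#define GARBAGE_BLK_NO 970809832858808788ULL

/*
 * Hypothetical helper: keep only the low 32 bits, which is what the
 * (int) cast in the for loop does on common platforms (assumption).
 */
static int32_t truncate_blk_no(uint64_t blk_no)
{
    return (int32_t)(uint32_t)(blk_no & 0xffffffffULL);
}

/*
 * Hypothetical helper: how many times the body of
 *     for (i = head - 1; i >= 0; i--)
 * runs if no log record header is ever found.
 */
static int64_t worst_case_reads(int32_t head)
{
    return head > 0 ? (int64_t)head : 0;
}
```

For the value above, the low 32 bits are 0x400171d4, so the loop starts
at roughly 1.07 billion and issues one xlog_bread() per iteration until
it happens to hit a header block.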
[1] Now for the "some circumstances" in "xlog_find_zeroed".
These are the last lines of this function:
    if ((error = xlog_find_verify_log_record(log, start_blk,
                                             &last_blk, 0)))
        goto bp_err;
    *blk_no = last_blk;
bp_err:
    xlog_put_bp(bp);
    if (error)
        return error;
    return -1;
} /* xlog_find_zeroed */
If "xlog_find_verify_log_record" returns -1, it jumps over the
assignment "*blk_no = last_blk" and returns "error" (which, in this
case is -1). So we have the case that "xlog_find_zeroed" returns -1,
in spite of the fact that *blk_no is _not_ set. Yet, according to the
function's comment, a return value of -1 means that *blk_no is valid:
* Return:
* 0 => the log is completely written to
* -1 => use *blk_no as the first block of the log
* >0 => error has occurred
*/
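To make the collision between the two meanings of -1 concrete, here is a
small self-contained model of the function tail. Everything in it is a
stand-in: verify_stub() only mimics the return codes of
xlog_find_verify_log_record, and mapping -1 to EIO in the guarded
variant is my assumption about one possible way out, not a tested fix.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for xlog_find_verify_log_record(): passes
 * through the requested result and sets *last_blk only on success.
 */
static int verify_stub(int result, int64_t *last_blk)
{
    if (result == 0)
        *last_blk = 42;     /* arbitrary block number for illustration */
    return result;
}

/* Same control flow as the tail of xlog_find_zeroed quoted above. */
static int find_zeroed_buggy(int verify_result, int64_t *blk_no)
{
    int64_t last_blk = 0;
    int error;

    if ((error = verify_stub(verify_result, &last_blk)))
        goto bp_err;
    *blk_no = last_blk;
bp_err:
    if (error)
        return error;       /* -1 leaks out and reads as "use *blk_no" */
    return -1;
}

/* Guarded variant: a -1 from the helper becomes a positive error. */
static int find_zeroed_guarded(int verify_result, int64_t *blk_no)
{
    int64_t last_blk = 0;
    int error;

    error = verify_stub(verify_result, &last_blk);
    if (error == -1)
        error = EIO;        /* assumption: treat "not found" as an error */
    if (error)
        return error;
    *blk_no = last_blk;
    return -1;              /* now -1 always means *blk_no was set */
}
```

With the guarded variant the caller can rely on the documented
contract: -1 only ever comes back together with a valid *blk_no.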
My conclusion was that the log is corrupt in such a way that
xfs_repair can't handle it. I decided to forget the data in the log
and tried to delete it:
(gdb) p *log
$5 = {l_tail_lsn = 0, l_last_sync_lsn = 0, l_mp = 0xbfffdf50, l_dev = 2064,
l_logBBstart = 32, l_logsize = 2097152, l_logBBsize = 4096, ...}
I closed the xfs_repair session and called
# cat /dev/zero | dd of=/dev/sdb bs=512 seek=32 count=1
And then the really strange thing happened:
The next run of xfs_repair seemed to repair the filesystem. If anyone
is interested, the output can be downloaded at:
http://slime.wu-wien.ac.at/xfs/repair.003.out
But after that, the log was corrupted again! I got the same effect as
described above. It seems that running xfs_repair destroys the log!
Now I have no idea what to try next...
If I can do anything to shed some light on this issue, please let
me know!
Thanks,
\wlang{}
--
Willi.Langenberger@xxxxxxxxxxxxx Fax: +43/1/31336/702
Zentrum fuer Informatikdienste, Wirtschaftsuniversitaet Wien, Austria