On Mon, Feb 29, 2016 at 11:47:54AM +0200, Alex Lyakas wrote:
> Hello Dave,
> I have tried the same scenario with the 4.5 kernel from about a week
> ago, latest commit being [1].
> The same crash is happening, stack trace being [2].
>
> I am not proficient with xfstests, unfortunately. I tried running
> them several times, but I am not sure I was doing that properly.
>
> As for your question why the "block beyond end of the filesystem
> fails". I tried to debug it further and added a print into
> _xfs_buf_find. What happens is that at some point, the sb_dblocks
> value is updated to the new value, but the in-memory pag object is
> not created. So the test:
>
> eofs = XFS_FSB_TO_BB(btp->bt_mount, btp->bt_mount->m_sb.sb_dblocks);
> if (blkno < 0 || blkno >= eofs) {
> ...
>
> still holds, but the needed pag does not exist.
>
> Here are the results of the prints that I added:
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.546542] SGI XFS with
> ACLs, security attributes, realtime, no debug enabled
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.551243] XFS (dm-0):
> Mounting V4 Filesystem
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.576677] XFS (dm-0):
> Starting recovery (logdev: internal)
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.577866]
> _xfs_buf_find: blkno=0 eofs=200704 >m_sb.sb_dblocks=25088
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.577872]
> _xfs_buf_find: blkno=0 eofs=200704 >m_sb.sb_dblocks=25088
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.577882]
> _xfs_buf_find: blkno=0 eofs=200704 >m_sb.sb_dblocks=25088
> ===> Here we start seeing the new value of "sb_dblocks" and hence
> the new value of "eofs":
We shouldn't see sb_dblocks change until log recovery completes the
first phase of log recovery and the in-core superblock is re-read
from disk in xlog_do_recover().
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.606796]
> _xfs_buf_find: blkno=1 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.606804]
> _xfs_buf_find: blkno=1 eofs=204800 >m_sb.sb_dblocks=25600
looking up AGF 0.
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.606988]
> _xfs_buf_find: blkno=2 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.606991]
> _xfs_buf_find: blkno=2 eofs=204800 >m_sb.sb_dblocks=25600
Now AGI 0.
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607205]
> _xfs_buf_find: blkno=50177 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607210]
> _xfs_buf_find: blkno=50177 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607375]
> _xfs_buf_find: blkno=50178 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607378]
> _xfs_buf_find: blkno=50178 eofs=204800 >m_sb.sb_dblocks=25600
AGF/AGI 1.
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607512]
> _xfs_buf_find: blkno=100353 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607515]
> _xfs_buf_find: blkno=100353 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607652]
> _xfs_buf_find: blkno=100354 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607655]
> _xfs_buf_find: blkno=100354 eofs=204800 >m_sb.sb_dblocks=25600
AGF/AGI 2.
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607829]
> _xfs_buf_find: blkno=150529 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607832]
> _xfs_buf_find: blkno=150529 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607980]
> _xfs_buf_find: blkno=150530 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.607982]
> _xfs_buf_find: blkno=150530 eofs=204800 >m_sb.sb_dblocks=25600
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.608073]
AGF/AGI 4.
> _xfs_buf_find: blkno=200705 eofs=204800 >m_sb.sb_dblocks=25600
> ===> and here we crash, but as you see, blkno is valid WRT eofs value.
> Feb 29 11:47:09 vc-00-00-350-dev kernel: [ 91.608120] BUG: unable
> to handle kernel NULL pointer dereference at 0000000000000098
AGF 5 goes splat.
Which means it's through the first phase of log recovery and it's
not failing in log recovery. i.e. we are now running
xfs_initialize_perag_data() after log recovery. So, as I said a
couple of posts back up this thread:
| If log recovery succeeds, then yes, I can see that there is a
| problem here because the per-ag tree is not reinitialised after
| the superblock is re-read. That's a pretty easy fix, though (3-4
| lines of code in xlog_do_recover() to detect a change in
| filesystem block count and call xfs_initialize_perag() again..
Cheers,
Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
|