Hi Steve,
Further to my last emails on this, I think I've tracked down why the
crashes occur, but I don't know how to fix them. I eliminated the SCSI
hardware, Ethernet card, etc. that Seth Mos suggested might be the problem
(I got loans of completely different hardware). I can reliably crash my
test machine in under an hour by running test 013 in a loop, and letting
the "/etc/cron.hourly/sysstat" cron job run. Doing some random other
commands during the process helps speed the crash up.
The crashes I see are related to the machine having highmem support, and
buffers allocated with pages in high memory making their way onto the
(fs/buffer.c) free_list. I added an extra field to struct buffer_head
that records in the buffer head who created it (in create_empty_buffers),
and what function called put_last_free. In every instance, the
buffer_head that causes the crash was created by
hook_buffers_to_page_delay, and put onto the free list later by a call to
__invalidate_buffers. (I then added code to record in the bh who called
that, and crashed it again: the caller was blkdev_put this time, but I'll
run a few more tests.)
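
For reference, the tagging idea looks roughly like the sketch below. This is
a cut-down userspace model, not the actual patch: struct tagged_bh, the
dbg_* fields and the model_* functions are made-up names, and the list
handling is simplified (the real free list ordering is not modelled).

```c
#include <stdio.h>
#include <stddef.h>

/* Cut-down stand-in for struct buffer_head; only the debug tags matter. */
struct tagged_bh {
    const char *dbg_creator;    /* hypothetical field: who created the bh */
    const char *dbg_freed_by;   /* hypothetical field: who called put_last_free */
    struct tagged_bh *next_free;
};

static struct tagged_bh *model_free_list;

/* Stand-in for the creation path: stamp the creator's name. */
static void model_create(struct tagged_bh *bh, const char *caller)
{
    bh->dbg_creator = caller;
    bh->dbg_freed_by = NULL;
    bh->next_free = NULL;
}

/* Stand-in for put_last_free: stamp the caller, then link the bh in. */
static void model_put_last_free(struct tagged_bh *bh, const char *caller)
{
    bh->dbg_freed_by = caller;
    bh->next_free = model_free_list;
    model_free_list = bh;
}

/* Print the recorded history of a bh, e.g. from a crash handler. */
static void model_report(const struct tagged_bh *bh)
{
    printf("bh %p: created by %s, freed by %s\n", (const void *)bh,
           bh->dbg_creator,
           bh->dbg_freed_by ? bh->dbg_freed_by : "(not freed)");
}
```

With tags like these, printing the offending bh at crash time shows both
where it came from and who put it on the free list.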
When one of these bh's with bh->b_page in high memory is given to ext2 by
getblk, and a "bread" performed, bh->b_data gets set to values < PAGE_SIZE
by a call to set_bh_page. This is why it looked like the bh's were
corrupted in my previous backtraces. The actual disk I/O that was
performed on these pages proceeds okay though, as the ll_rw_blk code does
create_bounce's for the real disk I/O (which is why the dereferences
you saw came after a successful call to bread).
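
To illustrate why b_data ends up below PAGE_SIZE, here is a simplified
userspace model of the behaviour as I understand it. It is not the kernel
source: page_model, bh_model, MODEL_PAGE_SIZE and the model_* functions are
all stand-ins, and the assumption is that a highmem page has no permanent
kernel mapping, so only the offset within the page can be stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL   /* assumed page size for the model */

struct page_model {
    int   highmem;   /* 1 if the page has no permanent kernel mapping */
    char *vaddr;     /* kernel virtual address; only valid when !highmem */
};

struct bh_model {
    struct page_model *b_page;
    char              *b_data;
};

/* Model of set_bh_page: attach a page and offset to a buffer head. */
static void model_set_bh_page(struct bh_model *bh, struct page_model *page,
                              unsigned long offset)
{
    assert(offset < MODEL_PAGE_SIZE);
    bh->b_page = page;
    if (page->highmem)
        /* No mapping to add the offset to, so b_data is just the offset. */
        bh->b_data = (char *)(uintptr_t)offset;
    else
        bh->b_data = page->vaddr + offset;
}

/* A b_data below the page size cannot be a real mapped address. */
static int model_b_data_usable(const struct bh_model *bh)
{
    return (uintptr_t)bh->b_data >= MODEL_PAGE_SIZE;
}
```

Anything that later dereferences b_data directly, without mapping b_page
first, hits an address in the first page, which matches what the earlier
backtraces looked like.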
I can seemingly make the crashes go away (none after a weekend of
repeated runs) by replacing GFP_HIGHUSER with GFP_USER in both clean_inode
(fs/inode.c) and _pagebuf_lookup_pages (fs/pagebuf/page_buf.c).
Changing only one of them makes no difference.
Hope that this makes some sense to you, and you can just say aha, and wave
the magic wand :). I hope you can replicate it locally with this
information.
Regards,
Chris