xfs: invalid requests to request_fn from xfs_repair

To: xfs@xxxxxxxxxxx
Subject: xfs: invalid requests to request_fn from xfs_repair
From: Jamie Pocas <pocas.jamie@xxxxxxxxx>
Date: Tue, 1 Apr 2014 16:16:39 -0400
Hi folks,

I have a very simple block device driver that uses the request_fn style of processing instead of the older bio handling or newer multiqueue approach. I have been using this with ext3 and ext4 for years with no issues, but scalability requirements have dictated that I move to xfs to better support larger devices.

I'm observing something weird in my request_fn. It seems like the block layer is issuing invalid requests to my request function, and it really manifests when I use xfs_repair. Here's some info:

blk_queue_physical_block_size(q, 512); /* should be no surprise */
blk_queue_logical_block_size(q, 512); /* should be no surprise */
blk_queue_max_segments(q, 128); /* 128 memory segments (page + offset/length pairs) per request! */
blk_queue_max_hw_sectors(q, CA_MAX_REQUEST_SECTORS); /* Up to 1024 sectors (512k) per request hard limit in the kernel */
blk_queue_max_segment_size(q, CA_MAX_REQUEST_BYTES); /* 512k (1024 sectors) is the hard limit in the kernel */
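
For reference, the limits above work out as follows in a small user-space sketch. The value 1024 for CA_MAX_REQUEST_SECTORS is taken from the comments above; SECTOR_SIZE, CA_MAX_SEGMENTS and request_within_limits() are names made up for illustration and are not part of the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the comments in the queue setup above. */
#define SECTOR_SIZE            512u
#define CA_MAX_REQUEST_SECTORS 1024u
#define CA_MAX_REQUEST_BYTES   (CA_MAX_REQUEST_SECTORS * SECTOR_SIZE)
/* Hypothetical name for the 128 passed to blk_queue_max_segments(). */
#define CA_MAX_SEGMENTS        128u

/* Hypothetical helper: does a request of this shape fit the advertised limits? */
static bool request_within_limits(uint32_t nr_sectors, uint32_t nr_segments)
{
    return nr_sectors <= CA_MAX_REQUEST_SECTORS &&
           nr_segments <= CA_MAX_SEGMENTS;
}
```

With these values CA_MAX_REQUEST_BYTES is 524288, which matches the size of the read shown in the strace further down.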

While iterating through segments in rq_for_each_segment(), for some requests I am seeing some odd behavior.

segment 0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 903   // Ok, this looks normal
segment 1: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 7 // Whoa... this doesn't look right to me

You can see with segment 1 that the start sector is *NOT* adjacent to the previous segment's sectors (there's a gap from sector 903 through 1022) and that the "sparse" request, for lack of a better term, extends beyond the max I/O boundary of 512k. Furthermore, this doesn't seem to jibe with what userspace is doing, which is a simple 512k read all in one chunk with a single userspace address.

But when you look at the strace of what xfs_repair is doing, it's just an innocuous read of 512k from sector 0.

write(2, "Phase 1 - find and verify superb"..., 40Phase 1 - find and verify superblock...
) = 40
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f00e2f42000
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f00e2ec1000
lseek(4, 0, SEEK_SET)                   = 0
read(4, 0x7f00e2ec1200, 524288)         = -1 EIO (Input/output error)
write(2, "superblock read failed, offset 0"..., 61superblock read failed, offset 0, size 524288, ag 0, rval -1
) = 61

The reason you see the EIO is because I am failing a request in the driver since it violates the restrictions I set earlier, is non-adjacent, and so I am unable to satisfy it.
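
One detail in the trace above may be worth a look: the read buffer at 0x7f00e2ec1200 sits 0x200 bytes past a page boundary, so the 512k transfer does not map onto exactly 128 pages. The sketch below only does that arithmetic. It assumes 4 KiB pages, pages_spanned() is a made-up helper, and whether this has anything to do with the odd requests is a guess on my part.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u /* assumption: 4 KiB pages, as on x86-64 */

/* Number of pages touched by a buffer of len bytes (len > 0) starting at addr. */
static uint64_t pages_spanned(uint64_t addr, uint64_t len)
{
    uint64_t first = addr / PAGE_SIZE_BYTES;
    uint64_t last  = (addr + len - 1) / PAGE_SIZE_BYTES;

    return last - first + 1;
}
```

For the buffer in the trace, pages_spanned(0x7f00e2ec1200, 524288) gives 129, which would be one more than the 128-segment limit if every page ended up as its own segment. A page-aligned buffer of the same size would span 128.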

Point 1: Shouldn't a request contain only segments that are adjacent on disk? For example, if blk_rq_pos(rq) is 10 and blk_rq_cur_sectors(rq) is 10 before the rq_for_each_segment() loop, then on the next iteration (if any) iter.bio->bi_sector should be 10+10=20. Is my understanding correct? Or are these some kind of special requests that should be handled differently? (I know, for example, that DISCARD requests have to be handled differently and shouldn't be run through rq_for_each_segment(), and that FLUSH requests are often empty.) The cmd_type says that they are normal REQ_TYPE_FS requests.
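
The adjacency expectation in Point 1 can be written as a small user-space check: a piece starting at sector 10 and covering 10 sectors (10 through 19) should be followed by one starting at sector 20. This is only a model of the check, not driver code; struct seg and segments_contiguous() are made up for illustration and are not kernel structures or APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one piece of a request; not a kernel structure. */
struct seg {
    uint64_t start_sector; /* first sector covered by this piece */
    uint32_t nr_sectors;   /* number of 512-byte sectors in this piece */
};

/* Returns true if every piece starts exactly where the previous one ended. */
static bool segments_contiguous(const struct seg *segs, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        uint64_t expected = segs[i - 1].start_sector + segs[i - 1].nr_sectors;

        if (segs[i].start_sector != expected)
            return false;
    }
    return true;
}
```

Fed the two segments reported earlier, {0, 903} followed by {1023, 7}, this check fails because the second piece would be expected to start at sector 903.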

Point 2: If I ignore the incorrect iter.bio->bi_sector, and just read/write the request out as if it were adjacent, xfs_repair reports corruption, and sure enough there are inodes which are zeroed out instead of having the inode magic 0x494e ("IN") as expected. So mkfs.xfs, while not sending what appear to be illegal requests, is still resulting in corruption.
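
For anyone who wants to reproduce the check for zeroed inodes, a minimal user-space sketch is below. It assumes buf points at the start of an on-disk inode that has already been read from the device, and it relies on XFS storing its on-disk metadata big-endian. has_inode_magic() and INODE_MAGIC are names made up here for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INODE_MAGIC 0x494eu /* "IN", the value quoted above */

/* Read the first two bytes as a big-endian 16-bit value and compare to the magic. */
static bool has_inode_magic(const uint8_t *buf, size_t len)
{
    if (len < 2)
        return false;

    uint16_t magic = (uint16_t)((buf[0] << 8) | buf[1]);

    return magic == INODE_MAGIC;
}
```

A zero-filled buffer fails this check, which is what the corrupted inodes look like.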

Point 3: Interestingly this goes away when I set blk_queue_max_segments(q, 1), but this obviously cuts down on clustering, and this of course kills performance. Is this indicative of anything in particular that I could be doing wrong?

Please cut me some slack when I say something like xfs_repair is "sending" invalid requests. I know that there is the C library, system call interface, block layer, etc. in between, but I just mean to say simply that using this tool results in this unexpected behavior. I don't mean to point blame at xfs or xfsprogs. If this turns out to be a block layer issue, and this posting needs to be sent elsewhere, I apologize and would appreciate being pointed in the right direction.

It almost feels like the block layer is splitting the bios up wrongly, is corrupting the bvecs, or is introducing a race. What's strange, again, is that I have only seen this behavior with the xfs tools, and not with ext3 or ext4 and e2fsprogs, which have been working for years. It really shouldn't matter though, because mkfs.xfs and xfs_repair are user space tools, so this shouldn't cause the block layer in the kernel to send down invalid requests. I have been grappling with this for a few weeks, and I am tempted to go to the old bio handling function instead just to see if that would work out better for me, but that would be a big rewrite of the LLD. I am using an older Ubuntu 12.04 kernel (3.2.x), so I am not able to go to the newer multiqueue implementation.

Any ideas/suggestions?
Need more information?

Thanks and Regards,
Jamie Pocas
