On Wed, Aug 15, 2012 at 7:35 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Tue, Aug 14, 2012 at 10:42:21PM +0200, Stefan Priebe wrote:
>> Hello list,
>> I'm testing KVM with qemu, libiscsi, virtio-scsi-pci and
>> scsi-generic on top of a Nexenta storage solution. While doing
>> mkfs.xfs on an already used LUN / block device I discovered that the
>> unmapping / discard commands mkfs.xfs sends take a long time which
>> results in a lot of aborted scsi commands.
> Sounds like a problem with your storage being really slow at
> discards.
>> Would it make sense to let mkfs.xfs send these unmapping commands in
>> small portions (e.g. 100MB)
> No, because the underlying implementation (blkdev_issue_discard())
> already breaks the discard request up into the granularity that is
> supported by the underlying storage.....
>> or is there another problem in the
>> patch to the block device? Any suggestions or ideas?
> .... which, of course, had bugs in it so is a much more likely cause
> of your problems.
> That said, the discard granularity is derived from information the
> storage supplies the kernel in its SCSI mode page, so if the
> discard granularity is too large, that's a storage problem, not a
> linux problem at all, let alone a mkfs.xfs problem.
That is true.
But the network traces for this particular issue show that, on this
particular storage array, when a huge train of discards is sent to
discard essentially the entire LUN, the array may take many minutes
to perform them. During that time the array is unresponsive to any
other I/O, on the same LUN or on other LUNs.
This is definitely an issue with the I/O scheduler in the storage
array and not strictly in the Linux kernel or mkfs.xfs.
And this means that for this kind of array, with this discard
behaviour, running a command that issues a huge number of discards
to discard the entire device will effectively act as a full
denial-of-service attack, since every LUN and every host that is
attached to the array will experience a full outage for minutes.
This is definitely an issue with the array, BUT the Linux kernel and/or
userspace utilities can, and very often do, implement tweaks to be
friendlier towards hardware and to avoid triggering unfortunate
hardware behaviour.
For example, the Linux kernel contains a "fix" for the Pentium FDIV bug,
even though there was never any issue in Linux that needed fixing.
The only other realistic alternative is to provide warnings such as:
"Some storage arrays may have major performance problems if you run
mkfs.xfs, which can cause a full outage for every single LUN on that
array that lasts for many minutes. Unless you KNOW for a fact that your
storage array does not have such an issue, you should never run
mkfs.xfs on a production system outside of a full scheduled outage
window. The full set of storage arrays where this is a potential issue
is not known".
I think it would be more storage-array friendly to implement tweaks to
avoid triggering such issues on this, and possibly other, arrays.
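
To make the kind of tweak I have in mind concrete, here is a rough
sketch in Python of issuing discards in bounded chunks with a short
pause between them, so the array gets a chance to service other I/O.
This is not how mkfs.xfs does it today. The function names, the 100 MB
chunk size (taken from Stefan's suggestion) and the pause length are
all assumptions for illustration. The only real interface used is the
BLKDISCARD ioctl from <linux/fs.h>. Note that it destroys data on the
device, and that offsets and lengths must be sector aligned.

```python
import fcntl
import os
import struct
import time

# BLKDISCARD as defined in <linux/fs.h>: _IO(0x12, 119)
BLKDISCARD = 0x1277


def discard_chunks(start, length, chunk_size):
    """Yield (offset, size) pairs that together cover [start, start + length).

    Hypothetical helper: every chunk is chunk_size bytes except possibly
    the last one, which holds the remainder.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    end = start + length
    offset = start
    while offset < end:
        size = min(chunk_size, end - offset)
        yield offset, size
        offset += size


def throttled_discard(path, start, length,
                      chunk_size=100 * 1024 * 1024, pause=0.1):
    """Discard a byte range of a block device one chunk at a time.

    Hypothetical sketch: chunk_size and pause are assumed tuning knobs,
    not values recommended by any array vendor. DESTROYS DATA.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        for offset, size in discard_chunks(start, length, chunk_size):
            # The ioctl argument is a uint64_t[2]: {start offset, length}.
            fcntl.ioctl(fd, BLKDISCARD, struct.pack("QQ", offset, size))
            # Give the array time to service other hosts and LUNs.
            time.sleep(pause)
    finally:
        os.close(fd)
```

Whether a pause between chunks is enough to keep a given array
responsive is something that would have to be measured on that array.
The sketch only shows that the splitting itself is cheap to do in
userspace.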