What are the RAW I/O Enhancements?
Current file-system-based disk I/O performs fixed-size I/O operations
(typically 1024 bytes) into kernel buffers, then copies the data from
the kernel buffer to the user program's address space. While this
allows the file system to cache frequently accessed data, the copy
from the kernel buffer(s) into the user address space also consumes
extra system bus bandwidth. Both the small size of each I/O
(2 sectors) and the copy operation throttle I/O subsystem throughput
for database operations, where transactions and full-table scans
run more quickly when the operating system does not intervene in
the data path.
To help alleviate this problem, Stephen C. Tweedie of Red Hat developed a
mechanism that allows disk I/O directly to a buffer in the application
address space (historically known as raw (or unprocessed) I/O). This
mechanism locks the required pages of memory to prevent them from
being paged out or swapped during the I/O operation. Applications that
need to perform this type of disk I/O open the character special
device /dev/raw and bind the disk device to a special raw
device using an ioctl(2) system call.
This mechanism, however, is cumbersome to use and suffers from some
deficiencies. The primary deficiency comes from its continued use
of the file-system buffer header data structures and associated
device queueing routines. Although reusing the buffer headers was
straightforward, it means that I/O operations must still be
fragmented into 1024 bytes per operation, which increases the
kernel overhead significantly. The mechanism for binding an existing
block device to a new raw device is also somewhat cumbersome
and counterintuitive to Unix system administrators, who expect
to find a relationship in the device namespace between a block
device and its corresponding raw device.
To address these concerns, SGI has added capabilities
to Stephen's raw I/O patch, which allow large I/O operations
directly to the user address space and bypass the bulk
of the kernel I/O queueing code for SCSI and Fibre Channel devices.
The raw I/O enhancements work by registering the SCSI major numbers
as character devices in addition to the normal registration as
block devices. Once the proper nodes are created in the
device namespace (/dev), the character device (raw) version of
a SCSI (or Fibre Channel) device may be accessed. The major
and minor numbers of the block device to which raw I/O access
is desired are used when creating the associated character special
device.
First, you must enable the raw disk I/O enhancements with
the CONFIG_RAW kernel configuration parameter. Second, you must
create the special device nodes for the character special disk
devices, preserving the major/minor number relationship between
the block and character special devices.
A patch is in development to automatically create the appropriate
entries in the /dev/rsd namespace for raw SCSI and Fibre Channel
disk devices. Stay tuned.
The current version of fileutils released with Red Hat distributions
and the SGI Linux Environment 1.1 contains a version of the 'dd'
command which doesn't align the input and output buffers correctly.
Here is a version of dd that should work with the aforementioned
distributions:
Because the SCSI or Fibre Channel disk controller accesses the
buffer in the application directly for read and write operations,
special requirements must be met when allocating the buffer.
The buffer must be aligned on a byte boundary which is congruent
to zero modulo the sector size of the raw device. The size of
the input/output request must be congruent to zero modulo the
sector size of the raw device. The file offset (lseek) value
must be congruent to zero modulo the sector size of the raw
device.
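The three conditions above can be checked before issuing a request.
The following is a minimal sketch; the function name raw_io_args_ok
is hypothetical, and the sector size is assumed to be known to the
caller (typically 512 bytes).

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Return 1 if the buffer address, request size, and file offset are
 * all congruent to zero modulo sector_size, otherwise 0.
 * (Hypothetical helper for illustration only.)
 */
int raw_io_args_ok(const void *buf, size_t len, off_t offset,
                   size_t sector_size)
{
    if (sector_size == 0)
        return 0;
    if ((uintptr_t)buf % sector_size != 0)
        return 0;                  /* buffer address misaligned */
    if (len % sector_size != 0)
        return 0;                  /* size is not a whole number of sectors */
    if (offset < 0 || (uint64_t)offset % sector_size != 0)
        return 0;                  /* offset is not on a sector boundary */
    return 1;
}
```

A request that fails any of these checks is the kind of request for
which read(2) or write(2) on a raw device would be expected to
return EINVAL.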
The requirement on buffer address alignment is due to the inability
of DMA controllers to split a sector across multiple scatter-gather
list (page) boundaries. The requirements on buffer size and offset
value are due to the SCSI common command set (CCS) using a
sector as the fundamental transfer unit (i.e. a value of one in the
SCSI command transfer size field indicates 1 sector, typically 512
bytes).
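To illustrate why size and offset must be whole multiples of the
sector size, the sketch below converts a byte offset and length into
a starting sector and a sector count, as a SCSI command would need
them. The struct and function names are hypothetical and do not come
from the patch.

```c
#include <stdint.h>

/* Starting sector and number of sectors for one transfer. */
struct sector_extent {
    uint64_t first_sector;
    uint64_t sector_count;
};

/*
 * Convert a byte offset and length into sector units.
 * Returns 0 on success, or -1 if either value is not a whole
 * multiple of sector_size and so cannot be expressed in sectors.
 */
int bytes_to_sectors(uint64_t offset, uint64_t len, uint32_t sector_size,
                     struct sector_extent *out)
{
    if (sector_size == 0 ||
        offset % sector_size != 0 ||
        len % sector_size != 0)
        return -1;

    out->first_sector = offset / sector_size;
    out->sector_count = len / sector_size;
    return 0;
}
```

With 512-byte sectors, an offset of 1024 and a length of 4096 map to
a transfer of 8 sectors starting at sector 2, while a length of 700
cannot be represented at all.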
With normal filesystem-based input and output operations, the above
requirements are met by the buffer (or page) cache subsystem whose
buffers are all typically page-aligned and sized to the filesystem
block size (for ext2, 1024 or 4096 bytes). However, use of the
buffer cache does require additional system bus bandwidth due to
the extraneous data copy involved.
Allocating a buffer on a page boundary will work correctly for
all raw I/O operations (e.g. using mmap on /dev/zero).
Note that the buffer will be locked into physical memory for the
duration of the input or output request.
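One way to obtain such a page-aligned buffer is sketched below,
following the mmap-on-/dev/zero approach mentioned above. The helper
names are hypothetical, and error handling is kept to a minimum.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Allocate len bytes of zero-filled, page-aligned memory by mapping
 * /dev/zero. Returns NULL on failure. (Hypothetical helper.)
 */
void *alloc_page_aligned(size_t len)
{
    void *p;
    int fd = open("/dev/zero", O_RDWR);

    if (fd < 0)
        return NULL;

    /* mmap(2) always returns an address on a page boundary. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                     /* the mapping outlives the descriptor */

    return p == MAP_FAILED ? NULL : p;
}

/* Release a buffer obtained from alloc_page_aligned(). */
void free_page_aligned(void *p, size_t len)
{
    if (p != NULL)
        munmap(p, len);
}
```

Because the page size is a multiple of the usual 512-byte sector
size, a buffer allocated this way also satisfies the sector alignment
requirement.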
This patch was developed to provide high-throughput,
low-overhead I/O access for database products. Due to
development time pressures and the absolute uselessness
of IDE devices in a high-performance database environment,
development was concentrated on the SCSI subsystem (which
supports both SCSI and Fibre Channel devices).
While it hasn't been tested, the ide-scsi pseudo-host-adapter
should be usable for raw I/O to IDE devices.
How do the RAW I/O Enhancements work?
How do I use the RAW I/O Enhancements?
Do these changes work with CONFIG_DEVFS_FS?
Why can't I use the dd command with raw devices?
Why do read(2) and write(2) return EINVAL for raw devices?
Why doesn't this patch support IDE devices?

