Raw I/O Enhancements FAQ

What are the RAW I/O Enhancements?
How do the RAW I/O Enhancements work?
How do I use the RAW I/O Enhancements?
Do these changes work with CONFIG_DEVFS_FS?
Why can't I use the dd command with raw devices?
Why do read(2) and write(2) return EINVAL for Raw Devices?
Why doesn't this patch work for IDE devices?


What are the RAW I/O Enhancements?

Current file-system-based disk I/O performs fixed-size I/O operations (typically 1024 bytes) into kernel buffers, after which the data is copied from the kernel buffer into the user program's address space. While this allows the file system to cache frequently accessed data, the copy from the kernel buffer(s) into the user address space consumes additional system bus bandwidth. Both the small size of each I/O (2 sectors) and the copy operation throttle I/O subsystem throughput for database workloads, where transactions and full-table scans run faster when the operating system does not handle the data in between.

To help alleviate this problem, Stephen C. Tweedie of Red Hat developed a mechanism that allows disk I/O directly to a buffer in the application address space (historically known as raw, or unprocessed, I/O). This mechanism locks the required pages of memory to prevent them from being paged out or swapped during the I/O operation. Applications that need to perform this type of disk I/O open the character special device /dev/raw and bind the disk device to a special raw device using an ioctl(2) system call.

This mechanism, however, is cumbersome to use and has some deficiencies. The primary deficiency is its continued use of the file-system buffer header data structures and the associated device queueing routines. While using the buffer headers was straightforward, it means that I/O operations must still be fragmented into 1024 bytes per operation, which increases kernel overhead significantly. The mechanism for binding an existing block device to a new raw device is also counterintuitive to Unix system administrators, who expect to find a relationship in the device namespace between a block device and its corresponding raw device.

To address these concerns, SGI has added capabilities to Stephen's raw I/O patch that allow large I/O operations directly to the user address space and bypass the bulk of the kernel I/O queueing code for SCSI and Fibre Channel devices.

How do the RAW I/O Enhancements work?

The raw I/O enhancements work by registering the SCSI major numbers as character devices in addition to the normal registration as block devices. Once the proper nodes are created in the device namespace (/dev), the character device (raw) version of a SCSI (or Fibre Channel) device may be accessed. The character special device is created with the same major and minor numbers as the block device to which raw I/O access is desired.

How do I use the RAW I/O Enhancements?

First, enable the raw disk I/O enhancements with the CONFIG_RAW kernel configuration parameter. Second, create the special device nodes for the character special disk devices, preserving the major/minor number relationship between the block and character special devices.
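
The second step can be sketched in C. This is a minimal illustration and not part of the patch: the function names raw_node_params and make_raw_node, the 0600 permissions, and any paths passed in are assumptions chosen for the example. It reads the major/minor pair from an existing block device node and creates a character special node carrying the same pair with mknod(2). Creating the node requires root privileges, and the node only provides raw access on a kernel built with these enhancements.

```c
#define _DEFAULT_SOURCE   /* expose mknod() and S_IFCHR under strict -std= modes on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>  /* major(), minor(), makedev() on glibc */

/*
 * Hypothetical helper: from the stat data of a block device, compute the
 * mode and device number for a character node with the same major/minor.
 * Returns 0 on success, -1 if st does not describe a block device.
 */
int raw_node_params(const struct stat *st, mode_t perms,
                    mode_t *mode_out, dev_t *dev_out)
{
    assert(st != NULL && mode_out != NULL && dev_out != NULL);

    if (!S_ISBLK(st->st_mode))
        return -1;

    *mode_out = S_IFCHR | (perms & 0777);
    /* Same major and minor as the block device, spelled out explicitly. */
    *dev_out = makedev(major(st->st_rdev), minor(st->st_rdev));
    return 0;
}

/*
 * Hypothetical helper: create char_path as the character special twin of
 * block_path. Returns 0 on success, -1 on failure (errno is set by
 * stat() or mknod() when they are the cause).
 */
int make_raw_node(const char *block_path, const char *char_path)
{
    struct stat st;
    mode_t mode;
    dev_t dev;

    if (stat(block_path, &st) != 0)
        return -1;
    if (raw_node_params(&st, 0600, &mode, &dev) != 0)
        return -1;
    return mknod(char_path, mode, dev);
}
```

For example, make_raw_node("/dev/sda", "/dev/rsda") would create a character node matching /dev/sda; the name /dev/rsda is only illustrative.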

Do these changes work with CONFIG_DEVFS_FS?

A patch is in development to automatically create the appropriate entries in the /dev/rsd namespace for raw SCSI and Fibre Channel disk devices. Stay tuned.

Why can't I use the dd command with raw devices?

The current version of fileutils released with Red Hat distributions and the SGI Linux Environment 1.1 contains a version of the 'dd' command that does not align its input and output buffers correctly. Here is a version of dd that should work with the aforementioned distributions:

dd(1) command

Why do read(2) and write(2) return EINVAL for Raw Devices?

Because the SCSI or Fibre Channel disk controller accesses the application's buffer directly for read and write operations, certain requirements must be met when allocating the buffer and issuing the request.

The buffer must be aligned on a byte boundary that is congruent to zero modulo the sector size of the raw device. The size of the input/output request must be congruent to zero modulo the sector size of the raw device. The file offset (lseek) value must be congruent to zero modulo the sector size of the raw device.

The requirement on buffer address alignment is due to the inability of DMA controllers to split a sector across multiple scatter-gather list (page) boundaries. The requirements on buffer size and offset value are due to the SCSI common command set (CCS) using a sector as the fundamental transfer unit (i.e. a value of one in the SCSI command transfer size field indicates 1 sector, typically 512 bytes).
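
The three conditions can be checked in the application before calling read(2) or write(2), which makes the cause of an EINVAL easier to find. The following is a minimal sketch, not code from the patch: the function name raw_request_ok is hypothetical, and the caller is assumed to know the device's sector size (512 bytes is typical, as noted above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Hypothetical helper: returns 1 if a raw I/O request satisfies all three
 * rules (buffer address, request length, and file offset each congruent
 * to zero modulo the sector size), and 0 otherwise.
 */
int raw_request_ok(const void *buf, size_t len, off_t offset,
                   size_t sector_size)
{
    assert(sector_size > 0);

    if ((uintptr_t)buf % sector_size != 0)
        return 0;               /* buffer address is misaligned */
    if (len % sector_size != 0)
        return 0;               /* length is not a whole number of sectors */
    if (offset < 0 || (uintmax_t)offset % sector_size != 0)
        return 0;               /* offset does not fall on a sector boundary */
    return 1;
}
```

With a 512-byte sector, a 1024-byte request at offset 512 from a buffer at a 4096-byte boundary passes, while a 1000-byte request or an offset of 100 fails.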

With normal filesystem-based input and output operations, the above requirements are met by the buffer (or page) cache subsystem, whose buffers are all typically page-aligned and sized to the filesystem block size (for ext2, 1024 or 4096 bytes). However, use of the buffer cache does require additional system bus bandwidth due to the extra data copy involved.

Allocating a buffer on a page boundary will work correctly for all raw I/O operations (e.g. using mmap on /dev/zero).
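
A page-aligned buffer can be obtained with the mmap approach mentioned above. The sketch below is only an illustration under stated assumptions: the function names alloc_raw_buffer and free_raw_buffer are hypothetical, and the mapping is private and read/write. Because mmap(2) returns page-aligned addresses, the buffer satisfies the address alignment rule for any sector size that divides the page size.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map len bytes of zero-filled, page-aligned memory
 * backed by /dev/zero. Returns NULL on failure.
 */
void *alloc_raw_buffer(size_t len)
{
    assert(len > 0);

    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping remains valid after close */

    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: release a buffer obtained from alloc_raw_buffer(). */
void free_raw_buffer(void *p, size_t len)
{
    if (p != NULL)
        munmap(p, len);
}
```

The length passed to free_raw_buffer must be the same value that was passed to alloc_raw_buffer.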

Note that the buffer will be locked into physical memory for the duration of the input or output request.

Why doesn't this patch work for IDE devices?

This patch was developed to provide high-throughput, low-overhead I/O access for database products. Due to development time pressures and the absolute uselessness of IDE devices in a high-performance database environment, development was concentrated on the SCSI subsystem (which supports both SCSI and Fibre Channel devices).

While it hasn't been tested, it should be possible to use the ide-scsi pseudo-host-adapter to do raw I/O to IDE devices.

Copyright © 1999 Silicon Graphics, Inc. All rights reserved. | Trademark Information