> A related question is how to handle message boundaries in read()
> and write() consistently. If write() in 2.3.x implicitly sets MSG_EOR,
> which is interpreted as `each write() should generate a single, complete
> message in terms of the underlying protocol' by many protocol families,
> I think read() from a SEQ_PACKET socket should behave consistently. That
> means it should only return if the last fragment was received (unless
> the read buffer space is too small, in which case read() should return
> an error). But as linux maps all read() to recvmsg() internally, the
> socket layer only sees a recvmsg() call and cannot determine whether
> it originated from a read(). Thus, it will be necessary to add a flag
> to recvmsg, which is always set when recvmsg is called on behalf of read().
> This flag would request that recvmsg should return only if either the
> final part of the message has arrived or the receive buffer size is exceeded.
> Is this what MSG_WAITALL is intended for?
The question of read() and write() is a difficult one. For write() to work
without assistance from other syscalls (i.e. sendmsg(), or perhaps one might
invent ioctl(SIOEOR)) it seemed that adding an implicit MSG_EOR was the only
sensible option. There is nothing in POSIX to say either way, and I didn't
have access to any other OS which implemented SEQPACKET sockets in order to
check what they did (although that's probably a good thing to try and find
out if possible - I'd be interested to know).
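To illustrate the implicit-EOR behaviour, here is a sketch in Python using an
AF_UNIX SOCK_SEQPACKET socketpair (supported on current Linux kernels; this is
purely illustrative and not something 2.3.x userspace had at the time). Each
write()/send() delimits exactly one record:

```python
import socket

# A connected SEQPACKET pair; each send() implicitly ends a record (MSG_EOR).
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)

a.send(b"first record")   # one write() == one complete record
a.send(b"second record")

# Record boundaries are preserved: each recv() returns at most one record,
# even though the buffer is large enough to hold both.
r1 = b.recv(4096)
r2 = b.recv(4096)
print(r1, r2)  # b'first record' b'second record'

a.close()
b.close()
```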
I feel that read() is less of a problem. The behaviour of recvmsg() is
retained to a certain extent by the fact that recvmsg() is called to
"do the work" as you said. This means that any read() call will only
contain the whole, or part of a single record, and never more than one
record. The philosophy here was simply "if you want to know where the
record boundaries are, use a function which returns flags, if you don't
care where the boundaries are, or you already know because your protocol
determines the record size in advance, use read()".
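As a sketch of that "use a function which returns flags" philosophy (again
using a modern AF_UNIX SOCK_SEQPACKET pair in Python, purely for
illustration): recvmsg() hands back per-call flags such as MSG_TRUNC, which a
bare read() has no way to report.

```python
import socket

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
a.send(b"a record longer than the buffer")

# Undersized receive buffer: the record is cut short, and MSG_TRUNC in the
# returned flags reports that fact - information read() would silently drop.
data, ancdata, msg_flags, _ = b.recvmsg(8)
print(data, bool(msg_flags & socket.MSG_TRUNC))

a.close()
b.close()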
If you are suggesting (I think you are, but I'm not 100% sure)
that the protocol should not transfer a single byte of data to userspace in a
read() call until the EOR marker has been seen, this has problems. Firstly,
upon the "buffer not big enough" error the userspace program has to
somehow find out how big the buffer needs to be (probably another ioctl()).
Secondly, the kernel side buffers now have to be big enough to store
a complete record from the transmitting application. There is nothing to
say how large a record may be - it could be many times larger than the
physical memory of the receiving machine. Within a specific protocol
there may well be limits, but in some there aren't. DECnet is one of
the protocols that are unlimited in this way, which is most of the reason
for the current behaviour.
Overall, I prefer the option of keeping the behaviour of read() as simple as
possible and just using the more comprehensive recvmsg() when more
information is required.
MSG_WAITALL means don't return until the specified number of bytes have
been read. For SEQPACKET, that has to be amended so that an early return
occurs at message boundaries; otherwise the rule of no more than one
record per recvmsg() call could be broken. However I don't think that
MSG_WAITALL should be merged into read() for SEQPACKET sockets, simply
because it gives no more information to userland than the current scheme,