Nikita Danilov wrote:
> Xuan Baldauf writes:
> >
> >
> > Nikita Danilov wrote:
> >
> > > Hans Reiser writes:
> > > > Russell Coker wrote:
> > > > >
> > > > > On Mon, 16 Jul 2001 11:00, Chris Wedgwood wrote:
> > > > > > On Sun, Jul 15, 2001 at 02:01:09PM +0400, Hans Reiser wrote:
> > > > > >
> > > > > > Making the server stateless is wrong
> > > > > >
> > > > > > why?
> > > > >
> > > > > Because it leads to all the problems we have seen! Why not have the client
> > > > > have an open file handle (the way Samba works and the way the Unix file
> > > > > system API works)? Then when the server goes down the client sends a request
> > > > > to open the file again...
> > >
> > > If you have 10000 clients each opening 100 files you got 1e6 opened
> > > files on the server---it wouldn't work. NFS was designed to be stateless
> > > to be scalable.
> >
> > Every existing file has at least one name (or is a member of the hidden
> > to-be-deleted directory, and so has a name, too), and an object_id. Suppose
> > the object_id is 32 bytes long. A virtual file descriptor may be 4 bytes
> > long, some housekeeping metadata 28 bytes, so we will have 64MB occupied in
> > your scenario. Where's the problem? 80% of those 64MB can be swapped out.
>
> For each open file you have:
>
> struct file (96b)
> struct inode (460b)
> struct dentry (112b)
>
> at least. This totals to 668M of kernel memory, that is, unpageable.
As I said below, the NFS server should be a user-space daemon.
>
> All files are kept in several hash tables and hash tables are known to
> degrade. Well, actually, I am afraid current Linux kernels cannot open
> 1e6 files.
This is one thing I cannot understand, too.
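
To put the two estimates side by side, here is a back-of-the-envelope check in Python. The per-object sizes are simply the figures quoted in this thread; the variable and function names are only illustrative, and the totals are no better than those inputs.

```python
# Back-of-the-envelope check of the two memory estimates quoted above.
# All sizes are the figures given in this thread, not measured values.

CLIENTS = 10_000
FILES_PER_CLIENT = 100
open_files = CLIENTS * FILES_PER_CLIENT  # 1e6 open files on the server

# Estimate 1: object_id + virtual file descriptor + housekeeping metadata
minimal_bytes = 32 + 4 + 28

# Estimate 2: struct file + struct inode + struct dentry
kernel_bytes = 96 + 460 + 112


def total_mb(per_file_bytes, n_files):
    """Total memory in MB (1 MB = 10**6 bytes) for n_files open files."""
    return per_file_bytes * n_files / 1_000_000


minimal_total_mb = total_mb(minimal_bytes, open_files)
kernel_total_mb = total_mb(kernel_bytes, open_files)

print(f"{open_files} open files")
print(f"minimal per-file state: {minimal_bytes} bytes -> {minimal_total_mb} MB")
print(f"kernel structures:      {kernel_bytes} bytes -> {kernel_total_mb} MB")
```

The gap between the two totals is roughly a factor of ten, and only the smaller one could be swapped out if the state lived in a user-space daemon.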
>
>
> >
> > >
> > >
> > > > >
> > > > > > making the readdir a multioperation act is wrong
> > > > > >
> > > > > > why? i have 3M directories... are you saying clients should read the
> > > > > > whole thing at once?
> > > > >
> > > > > No. findfirst()/findnext() is the correct way of doing this. Forcing the
> > > > > client to read through 3M directory entries to see if "foo.*" matches
> > > > > anything is just wrong. The client should be able to ask for a list of file
> > > > > names matching certain criteria (time stamp, name, ownership, etc). The
> > > > > findfirst() and findnext() APIs on DOS, OS/2, and Windows do this quite well.
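
For readers who have not used that style of API, here is a minimal Python sketch of server-side filtering in the findfirst()/findnext() spirit. The class `DirServer` and its methods are hypothetical names made up for illustration; this is not the DOS, OS/2, Windows, SMB, or NFS interface, and it matches on name only.

```python
import fnmatch


class DirServer:
    """Hypothetical directory server that filters names on the server side."""

    def __init__(self, names):
        self.names = sorted(names)
        self.searches = {}      # search handle -> iterator over matching names
        self.next_handle = 1

    def find_first(self, pattern):
        """Start a search; return (handle, first match or None)."""
        handle = self.next_handle
        self.next_handle += 1
        self.searches[handle] = (
            n for n in self.names if fnmatch.fnmatchcase(n, pattern)
        )
        return handle, self.find_next(handle)

    def find_next(self, handle):
        """Return the next match for this search, or None when exhausted."""
        return next(self.searches[handle], None)

    def find_close(self, handle):
        """Drop the per-search state held by the server."""
        self.searches.pop(handle, None)


server = DirServer(["bar.txt", "foo.c", "foo.h", "readme"])
handle, name = server.find_first("foo.*")
matches = []
while name is not None:
    matches.append(name)
    name = server.find_next(handle)
server.find_close(handle)
print(matches)
```

Only the matching names cross the wire, but the server has to remember every search in progress, which is exactly the kind of per-client state the stateless design avoids.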
> > > >
> > > > there is a fundamental conflict between having cookies, shrinkable directories, and the ability to
> > > > find foo.* without reading the whole directory, all at the same time.
> > > >
> > > > NFS V4 is designed by braindead twerps incapable of layering software when designing it.
> > >
> > > Just cannot stand it. You mean that NFS v4 features database in a kernel
> > > too? (It's our insider joke.)
> >
> > NFS should not be kernel-bound. The nfs server application mimics the applications on the NFS clients. If
> > this is not possible, something is wrong.
> >
> > >
> > >
> > > >
> > > > >
> > > > > If you have 3M directory entries then SMB should kick butt over NFS.
> > > > >
> > > > > Also while we're at it, one of the worst things about NFS is the issue of
> > > > > delete. Because it's stateless NFS servers implement unlink as "mv" and
> > > > > things get nasty from there...
> > > > >
> > > > > --
> > > > > http://www.coker.com.au/bonnie++/ Bonnie++ hard drive benchmark
> > > > > http://www.coker.com.au/postal/ Postal SMTP/POP benchmark
> > > > > http://www.coker.com.au/projects.html Projects I am working on
> > > > > http://www.coker.com.au/~russell/ My home page
> > >
> > > Nikita.
> >
> > Being "stateless" is only a weak way to implement disconnected
> > operation. If there was state (if the server could know that a client
> > has a file descriptor to a file), the client could be informed that its
> > virtual file descriptors to the files to be deleted are invalid now.
> > This only fails if the network is down, so this is a disconnected
> > operation problem.
> >
> > By the way: if NFS was scalable, why doesn't it allow every handle
> > from the server (like inode numbers, directory position cookies) to be
> > of variable, dynamic, server-determined size? This would be scalable.
> >
> > P.S.: Hans, how do you prevent object_id|inode reusing? Using
> > mount_id+generation counter per mount?
>
> There is a generation counter stored persistently with the super-block,
> incremented on each inode deletion.
Is the superblock always logged on inode-deletion for other reasons than the
generation counter? If not, would the above method be more efficient, because
it does not require logging superblocks?
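
For reference, here is a toy Python model of the generation-counter scheme as I read it from the description above. It is an assumption-laden sketch, not reiserfs code: `ToyFS`, `StaleHandle`, and the (inode number, generation) handle layout are invented for illustration, and persistence and logging are left out entirely.

```python
class StaleHandle(Exception):
    """Raised when a handle refers to an inode that has since been deleted."""


class ToyFS:
    """Toy model: one filesystem-wide counter, bumped on every inode deletion."""

    def __init__(self):
        self.generation = 0   # stored with the super-block in the real scheme
        self.inodes = {}      # live inode number -> generation stamped at creation
        self.free = []        # inode numbers available for reuse
        self.next_ino = 1

    def create(self):
        """Allocate an inode, reusing a freed number when one is available."""
        if self.free:
            ino = self.free.pop()
        else:
            ino = self.next_ino
            self.next_ino += 1
        self.inodes[ino] = self.generation
        return (ino, self.generation)   # the handle given to the client

    def resolve(self, handle):
        """Return the inode number if the handle is still valid."""
        ino, gen = handle
        if self.inodes.get(ino) != gen:
            raise StaleHandle(handle)
        return ino

    def delete(self, handle):
        ino = self.resolve(handle)
        del self.inodes[ino]
        self.free.append(ino)
        self.generation += 1   # every deletion bumps the counter


fs = ToyFS()
old = fs.create()
fs.delete(old)
new = fs.create()   # same inode number, different generation
```

Because the counter moves on every deletion, a reused inode number always gets a different generation from the one in any handle issued before the deletion, so the old handle is rejected instead of silently resolving to the new file.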
>
>
> >
>
> Nikita.