
Re: Linux 2.4.17-xfs vs previous XFS versions and certain non-us char

To: "XFS: linux-xfs@xxxxxxxxxxx" <linux-xfs@xxxxxxxxxxx>
Subject: Re: Linux 2.4.17-xfs vs previous XFS versions and certain non-us characters in filenames
From: "D. Stimits" <stimits@xxxxxxxxxx>
Date: Sun, 27 Jan 2002 13:43:08 -0700
References: <1012101803.1045.28.camel@steelnest> <1012102374.1045.35.camel@steelnest> <3C536F44.1020301@sgi.com> <20020127152120.A1490@s2y4n2c.de> <20020127154745.A20990@wotan.suse.de> <1012143898.923.1.camel@steelnest> <1012146858.923.6.camel@steelnest> <20020127172958.A8796@wotan.suse.de> <3C5437F5.3553AAC5@idcomm.com> <3C543B45.3000308@sgi.com>, <3C54463E.EE19BF02@idcomm.com> <20020127193205.C086A125E6@rebutia.sweeney.demon.co.uk>
Reply-to: stimits@xxxxxxxxxx
Sender: owner-linux-xfs@xxxxxxxxxxx
Keith Matthews wrote:
> 
> On Sun, 27 Jan 2002 11:26:06 -0700 D. Stimits <stimits@xxxxxxxxxx>
> wrote:
> 
> > Stephen Lord wrote:
> > >
> > > D. Stimits wrote:
> > >
> [SNIP]
> 
> > To some degree it reminds me of PostgreSQL. Somewhere in the docs I
> > recall seeing it mention that it works with non-C locales (non-English
> > basically), but that it would then run slower due to no hashing. I guess
> > instead of trying to support other locales at full performance,
> > PostgreSQL (at least the version I read docs on a year or so ago)
> > completely eliminated hashing if different character sets were used.
> 
> > With all the internationalizing going on, and the "world economy" being
> > so much more important in the tech industry, I have to wonder how long
> > it will be before hash routines for all these different character sets
> 
> One of the problems of designing hashing routines for different
> character sets is that different languages use different character
> frequencies (even when they use the same character set).
> 
> I discovered this some years ago when looking into the problems of
> producing Arabic language applications on systems whose hashing
> algorithms had been designed for English. While the hashing worked it
> caused some nasty performance problems due to bunching, which did not
> occur with English language data (after all the algorithm had been
> designed to give an even distribution with English words).
> 
> I would imagine that heavy use of German/French/Dutch names for
> files/directories would have similar effects, but possibly not so
> marked. Eastern European languages with the greater incidence of the
> letter 'z' would be another problem area.
> 
> Hence I suspect that the universal optimal hashing algorithm may never
> be found.
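
The bunching described above can be made concrete with a rough sketch.
Everything in it is an assumption for illustration: the toy hash that
looks only at the first character, the bucket count, and both word
lists are hypothetical and have nothing to do with the hash XFS or any
database actually uses. It only shows how a hash that spreads one kind
of input evenly can pile another kind into a single bucket.

```python
from collections import Counter

BUCKETS = 8  # hypothetical table size


def first_letter_hash(name, buckets=BUCKETS):
    # Toy hash (hypothetical): only the first character picks the bucket.
    return ord(name[0].lower()) % buckets


def bucket_loads(names, buckets=BUCKETS):
    # Count how many names land in each bucket.
    return Counter(first_letter_hash(n, buckets) for n in names)


# Hypothetical sample data, invented for illustration only.
varied = ["alpha", "bravo", "charlie", "delta",
          "echo", "foxtrot", "golf", "hotel"]
skewed = ["zamek", "zebra", "zegar", "zima",
          "zloty", "zupa", "zycie", "zadanie"]

# Largest bucket for each list: 1 means a perfectly even spread,
# len(list) means every name collided in one bucket.
print("varied:", max(bucket_loads(varied).values()))  # prints 1
print("skewed:", max(bucket_loads(skewed).values()))  # prints 8
```

A real hash looks at far more than one character, so the effect is
normally much milder than this, but the mechanism is the same: the
input distribution the hash was tuned for is not the one it receives.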

Without a doubt, no hash routine will fit all languages. Even within a
single language, specific subject areas can end up being a problem. A
basic hashing routine would probably work well on a fiction novel; the
same hash applied to a manual on the Qt widget set or Xlib would have
problems, since so many names start with Q or X there, which is not
true of the language in general.

I am thinking in terms of two areas. One would be to take the string
hashing that is commonly available for 7 bit ASCII and expand it into
similar versions for 8 bit Latin-1 and for 16 bit wide characters. That
could be called "breadth" of support: more encodings for a single
language, English. The other would be to add hash routines that are
effective for various non-English languages in wide character format;
this would be adding "depth" of support, where a single encoding works
for more languages (my description could easily have the naming turned
around). If one were to look at the commonly available hash schemes for
string data, almost all of them would be intended for use with English;
if more were available in a common, standard library, it would be
easier to design software for various languages and encoding schemes.

I am still curious how much of a penalty something like a Japanese
encoding would be for XFS. Obviously, the same hashing can't be used,
since the characters are not even 7 or 8 bit (until you consider that
Japanese text usually embeds a mix of character sets, e.g., ordinary
ASCII/English characters embedded directly in kanji pages). I wonder
whether anyone using a Japanese or other wide character encoding has
even used XFS. I'm very curious about the difficulties of using these
more complicated (especially mixed) character sets.
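
One way to start poking at the mixed character set question is to hash
the encoded bytes of a name instead of its characters. The sketch below
is only that, a sketch: it uses 32 bit FNV-1a as a stand-in byte hash
(not the hash XFS uses, which I am not claiming to know here), Python's
standard "utf-8" and "shift_jis" codecs, and a made-up file name and
bucket count. It shows that a byte-oriented hash accepts any encoding,
and that the same name encoded two ways is simply two different byte
strings as far as the hash is concerned.

```python
FNV_OFFSET_32 = 0x811C9DC5  # FNV-1a 32 bit offset basis
FNV_PRIME_32 = 0x01000193   # FNV-1a 32 bit prime


def fnv1a_32(data: bytes) -> int:
    # Byte-oriented hash: no knowledge of characters or locales needed.
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


def bucket_for(name: str, encoding: str, buckets: int = 64) -> int:
    # Encode first, then hash the raw bytes (bucket count is hypothetical).
    return fnv1a_32(name.encode(encoding)) % buckets


mixed = "日本語_notes.txt"   # hypothetical name mixing kanji and ASCII
plain = "readme.txt"         # pure ASCII name

for enc in ("utf-8", "shift_jis"):
    raw = mixed.encode(enc)
    print(enc, len(raw), "bytes, bucket", bucket_for(mixed, enc))

# The ASCII-only name encodes to the same bytes in both encodings,
# so it hashes identically; the mixed name does not share its bytes.
print(plain.encode("utf-8") == plain.encode("shift_jis"))  # prints True
print(mixed.encode("utf-8") == mixed.encode("shift_jis"))  # prints False
```

This says nothing about how evenly such names would spread in practice;
that would need real directory listings to measure.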

D. Stimits, stimits@xxxxxxxxxx

> 
> --
> Keith Matthews
> Frequentous Consultants  - Linux Services,
>                 Oracle development & database administration

