To: linux-xfs@xxxxxxxxxxx
Subject: Re[2]: Linux 2.4.17-xfs vs previous XFS versions and certain non-us characters in filenames
From: Keith Matthews <keith_m@xxxxxxxxxxxxxxxxxxx>
Date: Sun, 27 Jan 2002 19:32:05 +0000 (GMT)
In-reply-to: <3C54463E.EE19BF02@idcomm.com>
References: <1012101803.1045.28.camel@steelnest> <1012102374.1045.35.camel@steelnest> <3C536F44.1020301@sgi.com> <20020127152120.A1490@s2y4n2c.de> <20020127154745.A20990@wotan.suse.de> <1012143898.923.1.camel@steelnest> <1012146858.923.6.camel@steelnest> <20020127172958.A8796@wotan.suse.de> <3C5437F5.3553AAC5@idcomm.com> <3C543B45.3000308@sgi.com>, <3C54463E.EE19BF02@idcomm.com>
Sender: owner-linux-xfs@xxxxxxxxxxx
On Sun, 27 Jan 2002 11:26:06 -0700 D. Stimits <stimits@xxxxxxxxxx> wrote:

> Stephen Lord wrote:
> > 
> > D. Stimits wrote:
> > 
[SNIP]

> To some degree it reminds me of PostgreSQL. Somewhere in the docs I
> recall seeing it mention that it works with non-C locales (non-English
> basically), but that it would then run slower due to no hashing. I guess
> instead of trying to support other locales at full performance,
> PostgreSQL (at least the version I read docs on a year or so ago)
> completely eliminated hashing if different character sets were used.

> With all the internationalizing going on, and the "world economy" being
> so much more important in the tech industry, I have to wonder how long
> it will be before hash routines for all these different character sets

One of the problems of designing hashing routines for different
character sets is that different languages use different character
frequencies (even when they use the same character set).   

I discovered this some years ago when looking into the problems of
producing Arabic language applications on systems whose hashing
algorithms had been designed for English. While the hashing worked, it
caused some nasty performance problems due to bunching, which did not
occur with English language data (after all, the algorithm had been
designed to give an even distribution with English words).
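
To make the bunching concrete, here is a rough sketch in Python. It is
not the algorithm from any of the systems I mentioned; the additive hash,
the bucket count and both name lists are made up purely for illustration.
It counts how many names land in each bucket and compares the fullest
bucket with the average load.

```python
from collections import Counter


def additive_hash(name: str, n_buckets: int) -> int:
    """Toy hash (an assumption for illustration): sum of code points mod buckets."""
    return sum(ord(ch) for ch in name) % n_buckets


def bucket_loads(names, n_buckets, hash_fn=additive_hash):
    """Return a list holding the number of names that fall into each bucket."""
    counts = Counter(hash_fn(name, n_buckets) for name in names)
    return [counts.get(b, 0) for b in range(n_buckets)]


def skew(loads):
    """Ratio of the fullest bucket to the mean load; 1.0 means perfectly even."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Two hypothetical name sets of equal size. The second is built from a much
# narrower set of letters, standing in for data with skewed frequencies.
varied = ["alpha", "bravo", "delta", "echo", "golf", "hotel", "kilo", "lima"]
narrow = ["zaza", "azaz", "zzaa", "aazz", "zaaz", "azza", "zaza2", "azaz2"]

if __name__ == "__main__":
    for label, names in (("varied", varied), ("narrow", narrow)):
        loads = bucket_loads(names, 8)
        print(label, loads, "skew =", skew(loads))
```

With these particular inputs the narrow set piles most of its names into
a single bucket, while the varied set spreads out more evenly. A real
hash would mix its input far better than this one, but the same effect
shows up whenever the data violates the frequency assumptions the hash
was tuned for.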

I would imagine that heavy use of German/French/Dutch names for
files/directories would have similar effects, though possibly less
marked. Eastern European languages, with their greater incidence of the
letter 'z', would be another problem area.

Hence I suspect that the universal optimal hashing algorithm may never
be found.

--
Keith Matthews
Frequentous Consultants  - Linux Services, 
                Oracle development & database administration


