Re: [RFC v2] Unicode/UTF-8 support for XFS

To: Olaf Weber <olaf@xxxxxxx>
Subject: Re: [RFC v2] Unicode/UTF-8 support for XFS
From: Andi Kleen <andi@xxxxxxxxxxxxxx>
Date: Tue, 23 Sep 2014 22:15:40 +0200
Cc: Andi Kleen <andi@xxxxxxxxxxxxxx>, Ben Myers <bpm@xxxxxxx>, linux-fsdevel@xxxxxxxxxxxxxxx, tinguely@xxxxxxx, xfs@xxxxxxxxxxx
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <54219C17.3090104@xxxxxxx>
References: <20140918195650.GI19952@xxxxxxx> <87lhpbhfgg.fsf@xxxxxxxxxxxxxxxxxxxx> <20140922184145.GH4482@xxxxxxx> <20140922192958.GJ4120@xxxxxxxxxxxxxxxxxx> <54219C17.3090104@xxxxxxx>
User-agent: Mutt/1.5.20 (2009-06-14)
> You only pay the space cost if you use it, similar to the nls tables.

The way Linux module loading works, these tables are loaded by default,
not when someone actually needs them.

So no, you (or rather every unfortunate Linux XFS user) would always pay it,
unless you blacklist the module or rebuild the kernel.

> A big part of the table does decompositions for Korean: eliminating
> the Hangul decompositions removes 156320 bytes, leaving 89936 bytes.

Are there regular ranges or other redundancies in the Korean encoding
that could be used to compress paths?
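
For what it's worth, the Unicode Standard (section 3.12, Conjoining Jamo
Behavior) defines the decomposition of precomposed Hangul syllables
arithmetically, so in principle that range needs no table entries at all.
A minimal sketch, assuming nothing about how the proposed XFS tables are
laid out; the function name hangul_decompose is hypothetical, only the
constants come from the standard:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode Standard, section 3.12. */
#define S_BASE  0xAC00u              /* first precomposed syllable */
#define L_BASE  0x1100u              /* first leading consonant */
#define V_BASE  0x1161u              /* first vowel */
#define T_BASE  0x11A7u              /* one before the first trailing consonant */
#define L_COUNT 19u
#define V_COUNT 21u
#define T_COUNT 28u
#define N_COUNT (V_COUNT * T_COUNT)  /* 588 */
#define S_COUNT (L_COUNT * N_COUNT)  /* 11172 */

/*
 * Hypothetical helper: write the full canonical decomposition of a
 * precomposed Hangul syllable into out[] and return the number of code
 * points written (2 or 3). Returns 0 if cp is not in the syllable range.
 */
static size_t hangul_decompose(uint32_t cp, uint32_t out[3])
{
	uint32_t s, t;

	assert(out != NULL);
	if (cp < S_BASE || cp >= S_BASE + S_COUNT)
		return 0;

	s = cp - S_BASE;
	out[0] = L_BASE + s / N_COUNT;              /* leading consonant */
	out[1] = V_BASE + (s % N_COUNT) / T_COUNT;  /* vowel */
	t = s % T_COUNT;                            /* 0 means no trailing consonant */
	if (t == 0)
		return 2;
	out[2] = T_BASE + t;
	return 3;
}
```

With this, U+D4DB decomposes to U+1111 U+1171 U+11B6, which is the worked
example given in the standard. Whether it covers everything the quoted
156320 bytes are used for is something I have not checked.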

After some basic research, it turns out other people have already answered this:

Please use the ICU or google tables referenced below. Apparently
smaller is possible too, but 40-50k seems more reasonable.

I'm just gonna make the claim that whatever performance you
get from a larger table is dwarfed by the cache miss overhead.

NFC normalization requires large tables, right?
Like many other cases, there is a tradeoff between size and performance.
You can use very small tables, at some cost in performance. (Even there,
the actual performance cost depends on how often normalization needs to
be invoked, as discussed above.)

To see an analysis of the situation, see Normalization Footprint. It is
a bit out of date, but gives a sense of the magnitude. For comparison,
ICU's optimized tables for NFC take 44 kB (UTF-16) and Google's
optimized tables for NFC take 46 kB (UTF-8).
