On 24-09-14 13:07, Olaf Weber wrote:
On 23-09-14 22:15, Andi Kleen wrote:
A big part of the table does decompositions for Korean: eliminating
the Hangul decompositions removes 156320 bytes, leaving 89936 bytes.
Are there regular ranges or other redundancies in the Korean encoding
that could be used to compress paths?
Yes, though at the expense of more complicated code and interfaces. In
particular, lookups that want a normalized string would need to provide a
10-byte buffer to store it in.
I spent some time working on this, and the effect on the lookup code isn't
as bad as I'd thought. The updated code should be posted early next week.
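For readers wondering where the 10-byte buffer comes from: a precomposed Hangul syllable decomposes arithmetically into at most three jamo, each of which takes three bytes in UTF-8, plus a terminating NUL. The following is only a minimal sketch of that arithmetic, using the constants from the Unicode standard; the names hangul_decompose(), put_utf8_3() and HANGUL_BUF_SIZE are made up for this illustration and are not the interface of the posted patch.

```c
#include <stddef.h>

/* Constants from the Unicode standard's Hangul syllable decomposition. */
#define HANGUL_SBASE	0xAC00u
#define HANGUL_LBASE	0x1100u
#define HANGUL_VBASE	0x1161u
#define HANGUL_TBASE	0x11A7u
#define HANGUL_VCOUNT	21u
#define HANGUL_TCOUNT	28u
#define HANGUL_NCOUNT	(HANGUL_VCOUNT * HANGUL_TCOUNT)	/* 588 */
#define HANGUL_SCOUNT	(19u * HANGUL_NCOUNT)		/* 11172 */

/* Hypothetical name: 3 jamo * 3 UTF-8 bytes + terminating NUL. */
#define HANGUL_BUF_SIZE	10

/* Encode a code point in U+0800..U+FFFF as three UTF-8 bytes. */
static char *put_utf8_3(char *p, unsigned int cp)
{
	*p++ = (char)(0xE0 | (cp >> 12));
	*p++ = (char)(0x80 | ((cp >> 6) & 0x3F));
	*p++ = (char)(0x80 | (cp & 0x3F));
	return p;
}

/*
 * Decompose a precomposed Hangul syllable into its jamo, written to
 * buf as NUL-terminated UTF-8. Returns the number of bytes written
 * (excluding the NUL), or 0 if cp is not a precomposed syllable.
 */
static size_t hangul_decompose(unsigned int cp, char buf[HANGUL_BUF_SIZE])
{
	unsigned int si;
	char *p = buf;

	if (cp < HANGUL_SBASE || cp >= HANGUL_SBASE + HANGUL_SCOUNT)
		return 0;
	si = cp - HANGUL_SBASE;

	/* Leading consonant and vowel are always present. */
	p = put_utf8_3(p, HANGUL_LBASE + si / HANGUL_NCOUNT);
	p = put_utf8_3(p, HANGUL_VBASE + (si % HANGUL_NCOUNT) / HANGUL_TCOUNT);
	/* The trailing consonant is optional: index 0 means "none". */
	if (si % HANGUL_TCOUNT)
		p = put_utf8_3(p, HANGUL_TBASE + si % HANGUL_TCOUNT);
	*p = '\0';
	return (size_t)(p - buf);
}
```

Because the result is computed rather than stored, the caller has to supply the storage, which is why the interface grows a buffer argument.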
With this change, the table size for the full trie becomes 89952 bytes. Of
this, 66400 bytes are spent on NFKD + Ignorables, and an additional 20992
bytes on NFKD + Ignorables + Case Fold. The remainder, 2560 bytes, is
additional information for older Unicode versions.
Note that the NFKD + Ignorables + Case Fold trie forwards to the NFKD +
Ignorables trie where they overlap. A stand-alone version would be 71750 bytes.
As noted before, these tables also contain the Canonical Combining Class and
Unicode version information for the code points. The latter allows for
supporting multiple Unicode versions using a single combined table.
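To illustrate how per-code-point version information can let one table serve several Unicode versions, here is a rough sketch. It assumes each leaf stores a small generation index next to the combining class, and that a lookup treats anything newer than the requested version as unassigned. The struct layout, the ages[] table and the function names are assumptions made for this example, not a description of the actual table format.

```c
/* Pack major.minor.revision into one comparable integer. */
#define UNICODE_AGE(maj, min, rev) \
	(((unsigned int)(maj) << 16) | ((unsigned int)(min) << 8) | \
	 (unsigned int)(rev))

/* Hypothetical leaf layout: generation index plus combining class. */
struct leaf {
	unsigned char gen;	/* index into ages[]; 0 = never assigned */
	unsigned char ccc;	/* Canonical Combining Class */
};

/* Hypothetical generation table; only a few versions are listed. */
static const unsigned int ages[] = {
	0,
	UNICODE_AGE(1, 1, 0),
	UNICODE_AGE(6, 3, 0),
	UNICODE_AGE(7, 0, 0),
};

/* Is this code point assigned in the version the caller asked for? */
static int leaf_assigned(const struct leaf *leaf, unsigned int maxage)
{
	return leaf->gen != 0 && ages[leaf->gen] <= maxage;
}

/* Code points newer than maxage behave like unassigned ones (ccc 0). */
static unsigned int leaf_ccc(const struct leaf *leaf, unsigned int maxage)
{
	return leaf_assigned(leaf, maxage) ? leaf->ccc : 0;
}
```

For example, U+0301 (combining class 230, present since Unicode 1.1) would be visible to every caller, while U+20BD, added in Unicode 7.0, would look unassigned to a caller asking for 6.3.0.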
Olaf Weber SGI Phone: +31(0)30-6696796
Veldzigt 2b Fax: +31(0)30-6696799
Technical Lead 3454 PW de Meern Vnet: 955-6796
Storage Software The Netherlands Email: olaf@xxxxxxx