Re: [RFC v2] Unicode/UTF-8 support for XFS

To: Andi Kleen <andi@xxxxxxxxxxxxxx>, Ben Myers <bpm@xxxxxxx>
Subject: Re: [RFC v2] Unicode/UTF-8 support for XFS
From: Olaf Weber <olaf@xxxxxxx>
Date: Tue, 23 Sep 2014 15:01:20 +0200
Cc: <linux-fsdevel@xxxxxxxxxxxxxxx>, <tinguely@xxxxxxx>, <xfs@xxxxxxxxxxx>
Delivered-to: xfs@xxxxxxxxxxx
In-reply-to: <87lhpbhfgg.fsf@xxxxxxxxxxxxxxxxxxxx>
Organization: SGI
References: <20140918195650.GI19952@xxxxxxx> <87lhpbhfgg.fsf@xxxxxxxxxxxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1
On 22-09-14 16:55, Andi Kleen wrote:
> Ben Myers <bpm@xxxxxxx> writes:
>
>> Strings are normalized using a trie that stores the relevant
>> information.  The trie itself is about 250kB in size, and lives in a
>> separate module.
>
> So 250kB bloat -- and what does this fix exactly?
>
> Someone putting random ligatures into their file names and expecting
> the file to be the same as before. Can't they just not do that?

I like the 'office' example because it is applicable to English and easy to explain. Once you move away from English, examples are much easier to come by. Take a Dutch name like 'Renée Soutendijk'.

These two forms both spell Renée in UTF-8:
  0x52 0x65 0x6E 0xC3 0xA9 0x65
  0x52 0x65 0x6E 0x65 0xCC 0x81 0x65
The difference is
 LATIN SMALL LETTER E WITH ACUTE (U+00E9)
 LATIN SMALL LETTER E (U+0065) COMBINING ACUTE ACCENT (U+0301)
and corresponds to the difference between NFC and NFD.
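
As a quick check of the two byte sequences above, here is a minimal sketch using Python's standard unicodedata module. It is only an illustration, not the code from the XFS patches, and the variable names are made up for the example. It shows that the two spellings differ as raw strings but agree once brought to the same canonical form:

```python
import unicodedata

# The two UTF-8 spellings of "Renée" listed above (illustrative names).
precomposed = bytes([0x52, 0x65, 0x6E, 0xC3, 0xA9, 0x65]).decode("utf-8")
decomposed = bytes([0x52, 0x65, 0x6E, 0x65, 0xCC, 0x81, 0x65]).decode("utf-8")

# Compared code point by code point, the two strings are different.
raw_equal = precomposed == decomposed

# NFC composes e + U+0301 into U+00E9; NFD does the reverse.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(raw_equal, nfc_equal, nfd_equal)
```

A byte-for-byte name lookup behaves like the first comparison, which is why the two spellings would otherwise name two different files.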

These two forms both spell Soutendijk in UTF-8:
  0x53 0x6F 0x75 0x74 0x65 0x6E 0x64 0x69 0x6A 0x6B
  0x53 0x6F 0x75 0x74 0x65 0x6E 0x64 0xC4 0xB3 0x6B
The difference is
  LATIN SMALL LETTER I (U+0069) LATIN SMALL LETTER J (U+006A)
  LATIN SMALL LIGATURE IJ (U+0133)
and the former is the compatibility decomposition of the latter, the 'K' in NFKC/NFKD.
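
The same kind of check can be sketched for the ligature case, again with Python's standard unicodedata module and made-up variable names, purely as an illustration. Canonical normalization alone leaves U+0133 untouched; only the compatibility forms map it to the two-letter sequence:

```python
import unicodedata

# The two UTF-8 spellings of "Soutendijk" listed above (illustrative names).
two_letters = bytes(
    [0x53, 0x6F, 0x75, 0x74, 0x65, 0x6E, 0x64, 0x69, 0x6A, 0x6B]
).decode("utf-8")
ligature = bytes(
    [0x53, 0x6F, 0x75, 0x74, 0x65, 0x6E, 0x64, 0xC4, 0xB3, 0x6B]
).decode("utf-8")

# NFC/NFD apply canonical mappings only, so the ligature survives.
canonical_equal = unicodedata.normalize("NFC", ligature) == two_letters

# NFKC/NFKD also apply compatibility mappings: U+0133 becomes U+0069 U+006A.
compat_equal = unicodedata.normalize("NFKC", ligature) == two_letters

print(canonical_equal, compat_equal)
```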

Do accented letters count as random ligatures that people should just not use?

The bulk of the table deals with Korean.

Olaf

--
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@xxxxxxx
