[getdns-users] U-labels, A-labels, conversion, and the API helper functions
ajs at anvilwalrusden.com
Sun Jul 17 02:03:00 UTC 2016
I'd planned to help out a little today with the "universal acceptance"
stuff for getdns. I discovered a small wrinkle. As usual with i18n
issues, this wrinkle turned out to be a giant sinkhole.

The current dependency is on libidn1. That version of libidn was
designed to support the original IDNA2003. This causes three problems:

1. In principle, it means that some names that IDNA2008 can handle
   can't be handled by this tool. To begin with, IDNA2003 is
   supposed to stop at Unicode 3.2, which should mean that
   approximately all emoji (just for instance) shouldn't work. In
   practice, most systems can't tell you which version of Unicode
   you're running, so this is probably not a huge deal, though it's
   certainly messy.

2. Characters that are mapped in IDNA2003 but not in IDNA2008 will
   be handled incorrectly. For instance, IDNA2003 simply maps the
   label fußball to the label fussball. In IDNA2008, it becomes
   xn--fuball-cta. These cases seem more troubling.

3. Some labels are OK in IDNA2003 but not allowed in IDNA2008. For
   instance, the label i♥ny is legal under IDNA2003 (and yields
   xn--iny-zx5a) but DISALLOWED under IDNA2008.
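
To make the cases above concrete, here is a minimal Python sketch. It
assumes that Python's built-in "idna" codec, which implements IDNA2003,
behaves like libidn1 for these particular labels; the helper names
idna2003 and raw_alabel are made up for illustration and are not part of
getdns.

```python
# ASSUMPTION: the stdlib "idna" codec (IDNA2003, RFC 3490) is used here
# as a stand-in for libidn1's behaviour.

def idna2003(label: str) -> str:
    """Convert one label with the stdlib IDNA2003 codec."""
    return label.encode("idna").decode("ascii")

def raw_alabel(label: str) -> str:
    """Punycode-encode a label with no mapping step at all."""
    return "xn--" + label.encode("punycode").decode("ascii")

print(idna2003("fußball"))    # fussball       (ß mapped away by nameprep)
print(raw_alabel("fußball"))  # xn--fuball-cta (ß preserved)
print(idna2003("i♥ny"))       # xn--iny-zx5a   (accepted without complaint)
```

Note that raw_alabel performs no validity check either, so it shows only
what the encoded form looks like, not whether IDNA2008 would allow it.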
It seems to me that the helper functions
getdns_convert_ulabel_to_alabel and getdns_convert_alabel_to_ulabel
ought to give an error when asked to convert to or from a label that
is not actually legal. But this raises a more fundamental question.

I think there are three (mostly mutually exclusive) things we could
expect getdns to do. The first is to provide an ultra-low-level
function that simply converts Punycode-encoded labels to Unicode
labels and conversely. In this case, the API would have no opinion on
whether a putative label was valid or invalid for the protocol. This
would be a good thing to do if we think that the library needs to
support not just IDNA2008 labels, but also non-IDNA2008 schemes.
FWIW, the WHATWG seems to want to use
IDNA2003-ignoring-the-Unicode-version (so a completely undefined
protocol, to be frank) for URI encoding. If we embrace this model, I
think the API ought to be changed not to use function names referring
to U-labels and A-labels, and we should make it clear that the
application needs to do something more intelligent with IDNA because
we're not offering the facility.
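
A rough sketch of what such an opinion-free converter might look like,
in Python rather than C for brevity. The names encode_label and
decode_label are hypothetical, and deliberately avoid the words U-label
and A-label, since nothing here checks validity.

```python
ACE_PREFIX = "xn--"

def encode_label(label: str) -> str:
    """Punycode-encode a label; pure-ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def decode_label(label: str) -> str:
    """Decode an xn-- label; anything else passes through unchanged."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

print(encode_label("i♥ny"))          # xn--iny-zx5a
print(decode_label("xn--iny-zx5a"))  # i♥ny
```

The converter happily round-trips i♥ny, which is exactly the point: any
judgement about whether the label is legal is left to the application.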
The second path is to provide an IDNA2008 facility that makes
transparent U-label/A-label use possible but optional. In this case,
I think, the library we're using needs to change so that it properly
detects bad putative U-label or A-label input (e.g. it should throw an
error if it gets i♥ny since that's not a legal U-label). This is, I
think, what at least some of us had in mind when the API spec was
written. Note that the API still needs to be able to handle the
Unicode string "i♥ny" and use it as a DNS label in itself, since such
a label is in fact a legal DNS label (it's a series of octets) and is
actually encouraged for use by DNS-SD and mDNS.
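
The behaviour described for this second path could be sketched roughly
as follows. This is only an approximation: the set of allowed Unicode
general categories is an assumption standing in for the real IDNA2008
derived-property tables (RFC 5892), and it ignores contextual rules,
bidi rules, and the exceptions table. All function names are
hypothetical.

```python
import unicodedata

# ASSUMPTION: a crude stand-in for the RFC 5892 derived-property tables.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

def check_ulabel(label: str) -> None:
    """Raise ValueError if the label is clearly not a U-label."""
    if label.isascii():
        raise ValueError("a U-label needs at least one non-ASCII character")
    if unicodedata.normalize("NFC", label) != label:
        raise ValueError("label is not in NFC")
    for ch in label:
        if ch != "-" and unicodedata.category(ch) not in ALLOWED_CATEGORIES:
            raise ValueError(f"U+{ord(ch):04X} is not permitted in a U-label")

def ulabel_to_alabel(label: str) -> str:
    """Validate first, then Punycode-encode."""
    check_ulabel(label)
    return "xn--" + label.encode("punycode").decode("ascii")

def raw_dns_label(label: str) -> bytes:
    """Use the string as plain UTF-8 octets, with no IDNA processing."""
    octets = label.encode("utf-8")
    if not 1 <= len(octets) <= 63:
        raise ValueError("a DNS label must be 1 to 63 octets long")
    return octets
```

With these definitions, ulabel_to_alabel("i♥ny") raises ValueError
because U+2665 is a symbol, while raw_dns_label("i♥ny") still returns
the six UTF-8 octets for use as an ordinary DNS label.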
The third path is to include (presumably as a long-term goal) not just
the support in the above, but a fully-developed mapping layer as well,
so that the system could take something like fußball and do both the
IDNA2003-compatible and IDNA2008-compatible lookups. (This is
approximately a recommendation of Unicode's UTS#46, though there's
much more to it. I could expand on this if anyone thinks it's a good
idea.) I note that this kind of effort would be much more complicated
and would take a long time, because to do it well you probably have
to become locale-aware. The API would therefore need to provide a lot
more facilities for interacting with the user's environment than I
think anyone has contemplated (it would almost certainly require
considerably more work on the API spec). OTOH, it'd be a big advance
in providing the sort of "local mapping" facilities that IDNA2008 said
were needed to replace the in-protocol mapping that IDNA2003 did.
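
The "do both lookups" idea might look something like the sketch below.
It is not an implementation of UTS#46 and does no IDNA2008 validation:
lower-casing plus NFC is an assumed placeholder for a real, locale-aware
local mapping, Python's stdlib "idna" codec supplies the IDNA2003 form,
and lookup_candidates is a hypothetical name.

```python
import unicodedata

def lookup_candidates(label: str) -> list:
    """Return the distinct ASCII labels to try, IDNA2008-style first."""
    # ASSUMPTION: lower-casing plus NFC stands in for real local mapping.
    mapped = unicodedata.normalize("NFC", label.lower())
    if mapped.isascii():
        idna2008_form = mapped
    else:
        idna2008_form = "xn--" + mapped.encode("punycode").decode("ascii")
    # The stdlib "idna" codec applies the IDNA2003 (nameprep) mapping.
    idna2003_form = label.encode("idna").decode("ascii")
    candidates = [idna2008_form]
    if idna2003_form != idna2008_form:
        candidates.append(idna2003_form)
    return candidates

print(lookup_candidates("fußball"))  # ['xn--fuball-cta', 'fussball']
```

An application would then query each candidate in turn; which order is
right, and when to stop, is the kind of policy question that would need
to be settled in the API spec.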
ajs at anvilwalrusden.com