INTERNET-DRAFT                                          Adam M. Costello
draft-ietf-idn-amc-ace-m-00.txt                              2001-Feb-12
Expires 2001-Aug-14

AMC-ACE-M version 0.1.0

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note
that other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

Distribution of this document is unlimited. Please send comments
to the author at amc@cs.berkeley.edu, or to the idn working
group at idn@ops.ietf.org. A non-paginated (and possibly
newer) version of this specification may be available at
http://www.cs.berkeley.edu/~amc/charset/amc-ace-m

Abstract

AMC-ACE-M is a reversible map from a sequence of Unicode [UNICODE]
characters to a sequence of letters (A-Z, a-z), digits (0-9), and
hyphen-minus (-), henceforth called LDH characters. Such a map
(called an "ASCII-Compatible Encoding", or ACE) might be useful for
internationalized domain names [IDN], because host name labels are
currently restricted to LDH characters by [RFC952] and [RFC1123].

AMC-ACE-M is a cross between BRACE [BRACE00] (which is efficient
but complex) and DUDE [DUDE00] (which is simple and provides case
preservation). AMC-ACE-M is much simpler than BRACE but similarly
efficient, and provides case preservation like DUDE.

Besides domain names, there might also be other contexts where it is
useful to transform Unicode characters into "safe" (delimiter-free)
ASCII characters. (If other contexts consider hyphen-minus to be
unsafe, a different character could be used to play its role, like
underscore.)

Contents

    Features
    Name
    Overview
    Base-32 characters
    Encoding procedure
    Decoding procedure
    Signature
    Case sensitivity models
    Comparison with RACE, BRACE, LACE, and DUDE
    Example strings
    Security considerations
    References
    Author
    Example implementation

Features

Uniqueness: Every Unicode string maps to at most one LDH string.

Completeness: Every Unicode string maps to an LDH string.
Restrictions on which Unicode strings are allowed, and on length,
may be imposed by higher layers.

Efficient encoding: The ratio of encoded size to original size is
small for all Unicode strings. This is important in the context
of domain names because [RFC1034] restricts the length of a domain
label to 63 characters.

Simplicity: The encoding and decoding algorithms are reasonably
simple to implement. The goals of efficiency and simplicity are at
odds; AMC-ACE-M aims at a good balance between them.

Case-preservation: If the Unicode string has been case-folded prior
to encoding, it is possible to record the case information in the
case of the letters in the encoding, allowing a mixed-case Unicode
string to be recovered if desired, but a case-insensitive comparison
of two encoded strings is equivalent to a case-insensitive
comparison of the Unicode strings. This feature is optional; see
section "Case sensitivity models".

Readability: The letters A-Z and a-z and the digits 0-9 appearing
in the Unicode string are represented as themselves in the label.
This comes for free because it is usually the most efficient encoding
anyway.

Name

AMC-ACE-M is a working name that should be changed if it is adopted.
(The M merely indicates that it is the thirteenth ACE devised by
this author. BRACE was the third. D through L did not deliver
enough efficiency to justify their complexity.) Rather than waste
good names on experimental proposals, let's wait until one proposal
is chosen, then assign it a good name. Suggestions (assuming the
primary use is in domain names):

    UniHost
    UTF-A ("A" for "ASCII" or "alphanumeric",
           but unfortunately UTF-A sounds like UTF-8)
    UTF-H ("H" for "host names",
           but unfortunately UTF-H sounds like UTF-8)
    UTF-D ("D" for "domain names")
    NUDE  (Normal Unicode Domain Encoding)

Overview

AMC-ACE-M maps characters to characters--it does not consume or
produce code points, code units, or bytes, although the algorithm
makes use of code points, and implementations will of course need to
represent the input and output characters somehow, usually as bytes
or other code units.

Each character in the Unicode string is represented by an
integral number of characters in the encoded string. There is no
intermediate bit string or octet string.

The encoded string alternates between two modes: literal mode and
base-32 mode. LDH characters in the Unicode string are encoded
literally, except that hyphen-minus is doubled. Non-LDH characters
in the Unicode string are encoded using base-32, in which each
character of the encoded string represents five bits (a "quintet").
A non-paired hyphen-minus in the encoded string indicates a mode
change.

In base-32 mode a group of one to five quintets is used to
represent a number, which is added to an offset to yield a
Unicode code point, which in turn represents a Unicode character.
(Surrogates, which are code units used by UTF-16 in pairs to
refer to code points, are not used and not allowed in AMC-ACE-M.)
Similarities between the code points are exploited to make the
encoding more compact.

Base-32 characters

    "a" =  0 = 0x00 = 00000      "s" = 16 = 0x10 = 10000
    "b" =  1 = 0x01 = 00001      "t" = 17 = 0x11 = 10001
    "c" =  2 = 0x02 = 00010      "u" = 18 = 0x12 = 10010
    "d" =  3 = 0x03 = 00011      "v" = 19 = 0x13 = 10011
    "e" =  4 = 0x04 = 00100      "w" = 20 = 0x14 = 10100
    "f" =  5 = 0x05 = 00101      "x" = 21 = 0x15 = 10101
    "g" =  6 = 0x06 = 00110      "y" = 22 = 0x16 = 10110
    "h" =  7 = 0x07 = 00111      "z" = 23 = 0x17 = 10111
    "i" =  8 = 0x08 = 01000      "2" = 24 = 0x18 = 11000
    "j" =  9 = 0x09 = 01001      "3" = 25 = 0x19 = 11001
    "k" = 10 = 0x0A = 01010      "4" = 26 = 0x1A = 11010
    "m" = 11 = 0x0B = 01011      "5" = 27 = 0x1B = 11011
    "n" = 12 = 0x0C = 01100      "6" = 28 = 0x1C = 11100
    "p" = 13 = 0x0D = 01101      "7" = 29 = 0x1D = 11101
    "q" = 14 = 0x0E = 01110      "8" = 30 = 0x1E = 11110
    "r" = 15 = 0x0F = 01111      "9" = 31 = 0x1F = 11111

The digits "0" and "1" and the letters "o" and "l" are not used, to
avoid transcription errors.

All decoders must recognize both the uppercase and lowercase
forms of the base-32 characters. The case may or may not convey
information, as described in section "Case sensitivity models".
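
The table above can be held as a single 32-character string whose index is the quintet value. The following Python fragment is only an illustrative sketch; the names BASE32, quintet_to_char, and char_to_quintet are hypothetical and not part of this specification.

```python
# Base-32 alphabet from the table above: index = quintet value.
# The characters "l", "o", "0", and "1" are deliberately absent.
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"


def quintet_to_char(value):
    """Map a quintet (0..31) to its lowercase base-32 character."""
    if not 0 <= value <= 31:
        raise ValueError("quintet out of range")
    return BASE32[value]


def char_to_quintet(ch):
    """Map a base-32 character of either case to its quintet value."""
    index = BASE32.find(ch.lower()) if len(ch) == 1 else -1
    if index < 0:
        raise ValueError("not a base-32 character: %r" % ch)
    return index
```

Lowercasing before the lookup reflects the requirement that decoders accept both cases; whether the case carries information is decided separately by the case sensitivity model.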

Encoding procedure

The encoder first examines the Unicode string and chooses some
parameters. It writes these parameters into the output string, then
proceeds to encode each Unicode character, one at a time. The exact
sequence of steps is given below. All ordering of bits and quintets
is big-endian (most significant first). The >> and << operators
used below mean bit shift, as in C. For >> there is no question of
logical versus arithmetic shift because AMC-ACE-M makes no use of
negative numbers.

0) Determine the Unicode code point for each non-LDH character in
   the Unicode string. Since LDH characters are encoded literally,
   their code points are not needed. Depending on how the Unicode
   string is presented to the encoder, this step may be a no-op.

1) Verify that there are no invalid code points in the input;
   that is, none exceed 0x10FFFF (the highest code point in the
   Unicode code space) and none are in the range D800..DFFF
   (surrogates).
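
As an illustration only, the check in step 1 might be written as follows in Python. The function names are hypothetical, and the sketch assumes the input is a Python string whose elements are code points (Python permits lone surrogates in a str, so the check is not vacuous).

```python
def is_valid_code_point(cp):
    """Step 1: inside the Unicode code space and not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def check_input(text):
    """Return the code points of text, raising on the first invalid one."""
    code_points = [ord(ch) for ch in text]
    for cp in code_points:
        if not is_valid_code_point(cp):
            raise ValueError("invalid code point U+%04X" % cp)
    return code_points
```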

2) Determine the most populous row: Row n is defined as the 256
   code points starting with n << 8, except that this definition
   would make rows D8..DF useless, because they would contain only
   surrogates. Therefore AMC-ACE-M defines rows D8..DF to be the
   following non-aligned blocks of 256 code points:

       row D8 = 0020..011F
       row D9 = 005B..015A
       row DA = 007B..017A
       row DB = 00A0..019F
       row DC = 00C0..01BF
       row DD = 00DF..01DE
       row DE = 0134..0233
       row DF = 0270..036F

   (Rationale: Whereas almost every small script is confined to
   a single row, the Latin script is split across a few rows,
   and the row boundaries are not especially convenient for many
   languages.)

   Determine the row containing the most non-LDH input code points,
   breaking ties in favor of smaller-numbered rows. (If a code
   point appears multiple times in the input, it counts multiple
   times. This applies to steps 3 and 4 also.) Call it row B.
   Let offsetB be the first code point of row B.
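
A minimal Python sketch of step 2 follows. It is not the reference implementation; the names SPECIAL_ROWS, row_offset, is_ldh, and most_populous_row are hypothetical. Note that a code point can fall in more than one row (its aligned row and one or more of the non-aligned rows D8..DF), so every containing row is credited. When there are no non-LDH code points every row has a count of zero, and the tie-break rule then selects row 0.

```python
from collections import Counter

# First code point of each non-aligned row D8..DF, from the table above.
SPECIAL_ROWS = {0xD8: 0x0020, 0xD9: 0x005B, 0xDA: 0x007B, 0xDB: 0x00A0,
                0xDC: 0x00C0, 0xDD: 0x00DF, 0xDE: 0x0134, 0xDF: 0x0270}


def row_offset(row):
    """First code point of a row (aligned unless the row is D8..DF)."""
    return SPECIAL_ROWS.get(row, row << 8)


def is_ldh(cp):
    """True for the code points of A-Z, a-z, 0-9 and hyphen-minus."""
    return (cp == 0x2D or 0x30 <= cp <= 0x39
            or 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A)


def most_populous_row(code_points):
    """Return (B, offsetB) for a list of validated input code points."""
    counts = Counter()
    for cp in code_points:
        if is_ldh(cp):
            continue
        counts[cp >> 8] += 1  # aligned row (never D8..DF for valid input)
        for row, start in SPECIAL_ROWS.items():
            if start <= cp <= start + 0xFF:
                counts[row] += 1
    if not counts:
        return 0, 0  # all rows tie at zero; the smallest row wins
    best = min(counts, key=lambda row: (-counts[row], row))
    return best, row_offset(best)
```

For the Czech example string in section "Example strings", the non-LDH code points U+010D, U+011B, U+00ED, U+010D all fall in row D8, which is consistent with the parameters at the start of that example's encoding.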

3) Determine the most populous 16-window: For each n in 0..31 let
   offset = ((offsetB >> 3) + n) << 3 and count the number of code
   points in the range offset through offset + 0xF. Let A be the
   value of n that maximizes this count, breaking ties in favor
   of smaller values of n, and let offsetA be the corresponding
   offset.
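
Step 3 can be sketched in Python as below. The function names are hypothetical. The sketch assumes that only non-LDH code points are counted, as in step 2; the text of step 3 says only "code points", so this is an interpretation rather than a statement of the specification.

```python
def is_ldh(cp):
    """True for the code points of A-Z, a-z, 0-9 and hyphen-minus."""
    return (cp == 0x2D or 0x30 <= cp <= 0x39
            or 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A)


def most_populous_16_window(code_points, offset_b):
    """Return (A, offsetA): the best of 32 windows spaced 8 apart."""
    best_n, best_count = 0, -1
    for n in range(32):
        offset = ((offset_b >> 3) + n) << 3
        count = sum(1 for cp in code_points
                    if not is_ldh(cp) and offset <= cp <= offset + 0xF)
        if count > best_count:  # strict comparison keeps the smaller n on ties
            best_n, best_count = n, count
    return best_n, ((offset_b >> 3) + best_n) << 3
```

With the Czech example (offsetB = 0x20) this yields A = 28 and offsetA = 0x100, so U+010D can be encoded as a single base-32 character.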

4) Determine the most populous 20k-window: If the input is empty,
   then let C = 0. Otherwise, for each input code point, let n =
   code_point >> 11, and count the number of non-LDH input code
   points that are not in row B and are in the range (n << 11)
   through (n << 11) + 0x4FFF. Determine the value of n that
   maximizes the count, breaking ties in favor of smaller values of
   n, and let C be that value.
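
The following Python sketch of step 4 is illustrative only, with hypothetical function names. Following the wording of the step, candidate values of n are taken from every input code point, while only non-LDH code points outside row B are counted.

```python
def is_ldh(cp):
    """True for the code points of A-Z, a-z, 0-9 and hyphen-minus."""
    return (cp == 0x2D or 0x30 <= cp <= 0x39
            or 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A)


def most_populous_20k_window(code_points, offset_b):
    """Return C, the start (in units of 0x800) of the best 0x5000 window."""
    if not code_points:
        return 0

    def counted(cp):
        in_row_b = offset_b <= cp <= offset_b + 0xFF
        return not is_ldh(cp) and not in_row_b

    best_n, best_count = 0, -1
    # Visiting candidates in ascending order, with a strict comparison,
    # breaks ties in favor of the smaller n.
    for n in sorted({cp >> 11 for cp in code_points}):
        low = n << 11
        count = sum(1 for cp in code_points
                    if counted(cp) and low <= cp <= low + 0x4FFF)
        if count > best_count:
            best_n, best_count = n, count
    return best_n
```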

5) Choose a style: One of the base-32 codes used in step 7.3 has
   two variants, and so base-32 mode is subdivided into two styles,
   narrow and wide, depending on which variant is used. Compute
   the total number of base-32 characters that would be produced
   if narrow style were used, and the number if wide style were
   used. The easiest way to do this is to mimic the logic of steps
   6 and 7.3. Use whichever style would produce fewer base-32
   characters. In case of a tie, use narrow style.
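
One way to mimic steps 6 and 7.3 for the purpose of step 5 is to count characters without producing them. The Python below is a sketch under that reading; the names are hypothetical, and code_points is assumed to hold only the non-LDH input code points (LDH characters produce no base-32 characters).

```python
# First code point of each non-aligned row D8..DF (see step 2).
SPECIAL_ROWS = {0xD8: 0x0020, 0xD9: 0x005B, 0xDA: 0x007B, 0xDB: 0x00A0,
                0xDC: 0x00C0, 0xDD: 0x00DF, 0xDE: 0x0134, 0xDF: 0x0270}


def code_length(cp, narrow, offset_a, offset_b, offset_c):
    """Number of base-32 characters step 7.3 would emit for cp."""
    if narrow and offset_a <= cp <= offset_a + 0xF:
        return 1
    if offset_b <= cp <= offset_b + 0xFF:
        return 2
    if offset_c <= cp <= offset_c + 0xFFF:
        return 3
    if not narrow and offset_c + 0x1000 <= cp <= offset_c + 0x4FFF:
        return 3
    return 4 if cp <= 0xFFFF else 5


def total_length(code_points, narrow, a, b, c):
    """Parameter characters (step 6) plus code characters (step 7.3)."""
    offset_b = SPECIAL_ROWS.get(b, b << 8)
    if narrow:
        offset_a = ((offset_b >> 3) + a) << 3
        offset_c = (offset_b >> 12) << 12
        header = 3 if b <= 0xFF else 4
    else:
        offset_a = 0  # unused in wide style
        offset_c = c << 11
        header = 3 if (b <= 0xFF and c <= 0x1F) else 5
    return header + sum(code_length(cp, narrow, offset_a, offset_b, offset_c)
                        for cp in code_points)


def choose_style(code_points, a, b, c):
    """Return "narrow" or "wide"; ties go to narrow."""
    narrow = total_length(code_points, True, a, b, c)
    wide = total_length(code_points, False, a, b, c)
    return "narrow" if narrow <= wide else "wide"
```

For example (A) in section "Example strings", with A = 8, B = 0x51 and C = 10, this count gives 21 characters for narrow style and 19 for wide style, which agrees with the wide-style encoding shown there.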

6) Encode the parameters. If narrow style is used, then let
   offsetC = (offsetB >> 12) << 12, and encode B and A as three or
   four base-32 characters:

       00bbb bbbbb aaaaa            if B <= 0xFF
       01bbb bbbbb bbbbb aaaaa      otherwise

   If wide style is used, then let offsetC = C << 11, and encode B
   and C as three or five base-32 characters:

       10bbb bbbbb ccccc                  if B <= 0xFF and C <= 0x1F
       11bbb bbbbb bbbbb ccccc ccccc      otherwise
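
The four parameter layouts in step 6 can be sketched in Python by packing the two-bit tag, B, and A or C into one integer and writing it out as quintets. The names to_base32 and encode_parameters are hypothetical and serve only as an illustration.

```python
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"


def to_base32(value, count):
    """Write value as `count` quintets, most significant first."""
    return "".join(BASE32[(value >> (5 * i)) & 0x1F]
                   for i in reversed(range(count)))


def encode_parameters(narrow, b, a_or_c):
    """Encode B with A (narrow style) or B with C (wide style)."""
    if narrow:
        if b <= 0xFF:
            # 00bbb bbbbb aaaaa
            return to_base32((0b00 << 13) | (b << 5) | a_or_c, 3)
        # 01bbb bbbbb bbbbb aaaaa
        return to_base32((0b01 << 18) | (b << 5) | a_or_c, 4)
    if b <= 0xFF and a_or_c <= 0x1F:
        # 10bbb bbbbb ccccc
        return to_base32((0b10 << 13) | (b << 5) | a_or_c, 3)
    # 11bbb bbbbb bbbbb ccccc ccccc
    return to_base32((0b11 << 23) | (b << 10) | a_or_c, 5)
```

As a check against section "Example strings": narrow style with B = 0x05 and A = 26 gives "af4" (the Hebrew example), and wide style with B = 0x51 and C = 10 gives "utk" (example A).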

7) Encode each input character in turn, using the first of the
   following cases that applies. The mode is initially base-32.

   7.1) The character is a hyphen-minus (U+002D). Encode it as
        two hyphen-minuses.

   7.2) The character is an LDH character. If in base-32 mode
        then output a hyphen-minus and switch to literal mode.
        Copy the character to the output.

   7.3) The character is a non-LDH character. If in literal
        mode then output a hyphen-minus and switch to base-32
        mode. Encode the character's code point using the
        first of the following cases that applies. Square
        brackets enclose quintets that can be used to record
        the upper/lowercase attribute of the Unicode character
        (because the corresponding base-32 characters are
        guaranteed to be letters rather than digits) (see section
        "Case sensitivity models").

        7.3.1) Narrow style was chosen and the code point is in
               the range offsetA through offsetA + 0xF. Subtract
               offsetA and encode the difference as a single
               base-32 character:

                   [0xxxx]

        7.3.2) The code point is in the range offsetB through
               offsetB + 0xFF. Subtract offsetB and encode the
               difference as two base-32 characters:

                   1xxxx [0xxxx]

        7.3.3) The code point is in the range offsetC through
               offsetC + 0xFFF. Subtract offsetC and encode the
               difference as three base-32 characters:

                   1xxxx 1xxxx [0xxxx]

        7.3.4) Wide style was chosen and the code point is in
               the range offsetC + 0x1000 through offsetC +
               0x4FFF. Subtract offsetC + 0x1000 and encode the
               difference as three base-32 characters:

                   [0xxxx] xxxxx xxxxx

        7.3.5) The code point is in the range 0 through 0xFFFF.
               Encode it as four base-32 characters:

                   1xxxx 1xxxx 1xxxx [0xxxx]

        7.3.6) If we've come this far, the code point must be
               in the range 0x10000 through 0x10FFFF. Subtract
               0x10000 and encode the difference as five base-32
               characters:

                   1xxxx 1xxxx 1xxxx 1xxxx [0xxxx]
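
Step 7 can be sketched in Python as follows. This is an illustrative fragment, not the reference implementation: the function names are hypothetical, the parameters are assumed to have been chosen by steps 2 through 6, and all output is lowercase (no case information is recorded in the bracketed quintets).

```python
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"


def is_ldh(ch):
    """True for A-Z, a-z, 0-9 and hyphen-minus."""
    return ch.isascii() and (ch.isalnum() or ch == "-")


def flagged(value, count):
    """Emit `count` quintets of 4 payload bits; all but the last start with 1."""
    out = []
    for i in reversed(range(count)):
        nibble = (value >> (4 * i)) & 0xF
        out.append(BASE32[nibble | (0x10 if i else 0)])
    return "".join(out)


def encode_code_point(cp, narrow, offset_a, offset_b, offset_c):
    """Steps 7.3.1 through 7.3.6, tried in order."""
    if narrow and offset_a <= cp <= offset_a + 0xF:
        return flagged(cp - offset_a, 1)
    if offset_b <= cp <= offset_b + 0xFF:
        return flagged(cp - offset_b, 2)
    if offset_c <= cp <= offset_c + 0xFFF:
        return flagged(cp - offset_c, 3)
    if not narrow and offset_c + 0x1000 <= cp <= offset_c + 0x4FFF:
        v = cp - (offset_c + 0x1000)  # 14 bits: [0xxxx] xxxxx xxxxx
        return BASE32[v >> 10] + BASE32[(v >> 5) & 0x1F] + BASE32[v & 0x1F]
    if cp <= 0xFFFF:
        return flagged(cp, 4)
    return flagged(cp - 0x10000, 5)


def encode_body(text, narrow, offset_a, offset_b, offset_c):
    """Steps 7.1 through 7.3; the parameter prefix of step 6 is not included."""
    out, literal = [], False  # the mode is initially base-32
    for ch in text:
        if ch == "-":
            out.append("--")  # 7.1: doubled, no mode change
        elif is_ldh(ch):
            if not literal:
                out.append("-")
                literal = True
            out.append(ch)  # 7.2: copied literally
        else:
            if literal:
                out.append("-")
                literal = False
            out.append(encode_code_point(ord(ch), narrow,
                                         offset_a, offset_b, offset_c))
    return "".join(out)
```

Applied to the Czech example with narrow style, offsetA = 0x100, offsetB = 0x20 and offsetC = 0, encode_body produces the part of the AMC-M string that follows the three parameter characters "g26".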

Decoding procedure

The details of the decoding procedure are implied by the encoding
procedure. The overall sequence of steps is as follows.

1) Undo the encoder's step 6: From the first few base-32
   characters, determine whether narrow or wide style is used, and
   determine the offsets.

2) Set the mode to base-32. For each remaining input character, use
   the first of the following cases that applies:

   2.1) The character is a hyphen-minus, and the following
        character is also a hyphen-minus. Consume them both and
        output a hyphen-minus.

   2.2) The character is a hyphen-minus. Consume it and toggle
        the mode flag.

   2.3) The current mode is literal. Consume the input character
        and output it.

   2.4) Interpret the input character and up to four of its
        successors as base-32. Consume characters until one is
        found whose value has the form 0xxxx. That is the one
        that carries the upper/lowercase information. Remember
        the length of the code. If the length is one and wide
        style is being used, consume two more characters.
        Decode the base-32 characters into an integer, add the
        appropriate offset (which depends on the remembered code
        length), and output the Unicode character corresponding to
        the resulting code point.

        If the case-flexible or case-preserving model is being
        used (see section "Case sensitivity models"), the decoder
        must either perform the case conversion as it is decoding,
        or construct a separate record of the case information to
        accompany the output string.

3) Before returning the output (be it a string or a string plus
   case information), the decoder must invoke the encoder on it,
   and compare the result to the input string. The comparison
   must be case-sensitive if the case-sensitive or case-flexible
   model is being used, case-insensitive if the case-insensitive
   or case-preserving model is being used. If the two strings do
   not match, it is an error. This check is necessary to guarantee
   the uniqueness property (there cannot be two distinct encoded
   strings representing the same Unicode string).

If the decoder at any time encounters an unexpected character, or
unexpected end of input, then the input is invalid.
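
Decoder step 2.4 might look like the following in Python. This is a sketch with hypothetical names; it decodes a single base-32 code and leaves mode handling, case recovery, and the mandatory re-encoding check of step 3 to the caller.

```python
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"


def quintet_at(text, pos):
    """Return the quintet at text[pos], or raise if the input is invalid."""
    if pos >= len(text):
        raise ValueError("unexpected end of input")
    value = BASE32.find(text[pos].lower())
    if value < 0:
        raise ValueError("unexpected character %r" % text[pos])
    return value


def decode_code(text, pos, narrow, offset_a, offset_b, offset_c):
    """Decode one code starting at text[pos]; return (code_point, next_pos)."""
    quintets = []
    while True:
        q = quintet_at(text, pos)
        pos += 1
        quintets.append(q)
        if q < 0x10:  # form 0xxxx: this quintet ends the run
            break
        if len(quintets) == 5:
            raise ValueError("code longer than five quintets")
    length = len(quintets)
    if length == 1 and not narrow:
        # Wide style: [0xxxx] xxxxx xxxxx, so two more characters follow.
        q1, q2 = quintet_at(text, pos), quintet_at(text, pos + 1)
        value = (quintets[0] << 10) | (q1 << 5) | q2
        return offset_c + 0x1000 + value, pos + 2
    value = 0
    for q in quintets:
        value = (value << 4) | (q & 0xF)  # 4 payload bits per quintet
    base = {1: offset_a, 2: offset_b, 3: offset_c, 4: 0, 5: 0x10000}[length]
    return base + value, pos
```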

Signature

The issue of how to distinguish ACE strings from unencoded strings
is largely orthogonal to the encoding scheme itself, and is
therefore not specified here. In the context of domain name labels,
a standard prefix and/or suffix (chosen to be unlikely to occur
naturally) would presumably be attached to ACE labels. (In that
case, it would probably be good to forbid the encoding of Unicode
strings that appear to match the signature, to avoid confusing
humans about whether they are looking at a Unicode string or an ACE
string.)

In order to use AMC-ACE-M in domain names, the choice of signature
must be mindful of the requirement in [RFC952] that labels never
begin or end with hyphen-minus. The raw encoded string will never
begin with a hyphen-minus, and will end with a hyphen-minus iff the
Unicode string ends with a hyphen-minus. The easiest solution is
to use a suffix as the signature. Alternatively, if the Unicode
strings were forbidden from ending with a hyphen-minus, a prefix
could be used.

It appears that "---" is extremely rare in domain names; among the
four-character prefixes of all the second-level domains under .com,
.net, and .org, "---" never appears at all. Therefore, perhaps the
signature should be of the form ?--- (prefix) or ---? (suffix),
where ? could be "u" for Unicode, or "i" for internationalized, or
"a" for ACE, or maybe "q" or "z" because they are rare.

Case sensitivity models

The higher layer must choose one of the following four models.

Models suitable for domain names:

  * Case-insensitive: Before a string is encoded, all its non-LDH
    characters must be case-folded so that any strings differing
    only in case become the same string (for example, strings could
    be forced to lowercase). Folding LDH characters is optional.
    The case of base-32 characters and literal-mode characters is
    arbitrary and not significant. Comparisons between encoded
    strings must be case-insensitive. The original case of non-LDH
    characters cannot be recovered from the encoded string.

  * Case-preserving: The case of the Unicode characters is not
    considered significant, but it can be preserved and recovered,
    just like in non-internationalized host names. Before a string
    is encoded, all its non-LDH characters must be case-folded
    as in the previous model. LDH characters are naturally able
    to retain their case attributes because they are encoded
    literally. The case attribute of a non-LDH character is
    recorded in one of the base-32 characters that represent
    it (section "Encoding procedure" tells which one). If the
    base-32 character is uppercase, it means the Unicode character
    is caseless or should be forced to uppercase after being
    decoded (which is a no-op if the case folding already forces
    to uppercase). If the base-32 character is lowercase, it
    means the Unicode character is caseless or should be forced to
    lowercase after being decoded (which is a no-op if the case
    folding already forces to lowercase). The case of the other
    base-32 characters in a multi-quintet encoding is arbitrary
    and not significant. Only uppercase and lowercase attributes
    can be recorded, not titlecase. Comparisons between encoded
    strings must be case-insensitive, and are equivalent to
    case-insensitive comparisons between the Unicode strings. The
    intended mixed-case Unicode string can be recovered as long as
    the encoded characters are unaltered, but altering the case of
    the encoded characters is not harmful--it merely alters the case
    of the Unicode characters, and such a change is not considered
    significant.

    In this model, the input to the encoder and the output of the
    decoder can be the unfolded Unicode string (in which case the
    encoder and decoder are responsible for performing the case
    folding and recovery), or can be the folded Unicode string
    accompanied by separate case information (in which case the
    higher layer is responsible for performing the case folding and
    recovery). Whichever layer performs the case recovery must
    first verify that the Unicode string is properly folded, to
    guarantee the uniqueness of the encoding.

    It is easy to extend the nameprep algorithm [NAMEPREP02] to
    remember case information. It merely requires an additional
    bit to be associated with each output code point in the mapping
    table.

The case-insensitive and case-preserving models are interoperable.
If a domain name passes from a case-preserving entity to a
case-insensitive entity, the case information will be lost, but
the domain name will still be equivalent. This phenomenon already
occurs with non-internationalized domain names.
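
In the case-preserving model, the character that carries the case attribute is the first one in a code whose quintet has the form 0xxxx (the bracketed quintet in section "Encoding procedure"). The Python below sketches how that attribute might be written and read; the function names are hypothetical, and the sketch assumes each argument is exactly one complete base-32 code.

```python
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"


def case_flag_index(code):
    """Index of the first character whose quintet has the form 0xxxx."""
    for i, ch in enumerate(code):
        value = BASE32.find(ch.lower())
        if value < 0:
            raise ValueError("not a base-32 character: %r" % ch)
        if value < 0x10:
            return i  # values 0..15 are always letters ("a" through "r")
    raise ValueError("code has no case-carrying character")


def set_case(code, uppercase):
    """Record an uppercase or lowercase attribute in a code."""
    i = case_flag_index(code)
    flag = code[i].upper() if uppercase else code[i].lower()
    return code[:i] + flag + code[i + 1:]


def wants_uppercase(code):
    """Read the recorded attribute back from a code."""
    return code[case_flag_index(code)].isupper()
```

For instance, set_case("8ze", True) gives "8zE", while for the wide-style code "hke" the attribute is carried by the first character, giving "Hke". The case of the remaining characters is left untouched because the model treats it as not significant.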

Models unsuitable for domain names, but possibly useful in other
contexts:

  * Case-sensitive: Unicode strings may contain both uppercase and
    lowercase characters, which are not folded. Base-32 characters
    must be lowercase. Comparisons between encoded strings must be
    case-sensitive.

  * Case-flexible: Like case-preserving, except that the choice
    of whether the case of the Unicode characters is considered
    significant is deferred. Therefore, base-32 characters must
    be lowercase, except for those used to indicate uppercase
    Unicode characters. Comparisons between encoded strings may be
    case-sensitive or case-insensitive, and such comparisons are
    equivalent to the corresponding comparisons between the Unicode
    strings.

Comparison with RACE, BRACE, LACE, and DUDE

In this section we compare AMC-ACE-M and four other ACEs: RACE
[RACE03], BRACE [BRACE00], LACE [LACE01], and Extended DUDE
[DUDE00]. We do not include SACE [SACE], UTF-5 [UTF5], or UTF-6
[UTF6] in the comparison, because SACE appears obviously too
complex, UTF-5 appears obviously too inefficient, and UTF-6 can
never be more efficient than its similarly simple successor, DUDE.

Case preservation support:

    DUDE, AMC-ACE-M:  all characters
    BRACE:            only the letters A-Z, a-z
    RACE, LACE:       none

RACE, BRACE, and LACE transform the Unicode string to an
intermediate bit string, then into a base-32 string, so there is no
particular alignment between the base-32 characters and the Unicode
characters. DUDE and AMC-ACE-M do not have this intermediate stage,
and enforce alignment between the base-32 characters and the Unicode
characters, which facilitates the case preservation.

Complexity is hard to measure. This author would subjectively
describe the complexity of the algorithms as:

    RACE, LACE, DUDE:  fairly simple but not trivial
    AMC-ACE-M:         moderate
    BRACE:             complex

The complexity of AMC-ACE-M is in the number of rules, but the
individual rules are not very complex, and they are generally
non-interacting.

The relative efficiency of the various algorithms is suggested
by the sizes of the encodings in section "Example strings". For
each ACE there is a graph below showing a horizontal bar for
each example string, representing the ACE length divided by the
minimum length among all the ACEs for that example string (so the
ratio is at least 1). Example R is excluded because it violates
nameprep [NAMEPREP02]. The other example strings all use different
languages, except that there are several Japanese examples. To
avoid skewing the results, each graph collapses all the Japanese
ratios into a single bar representing the median ratio. A ratio r
is represented by a bar of length r/0.04 characters. Since the bar
will always be at least 1/0.04 = 25 characters long, we show the
first 25 characters as "O" and the rest as "@". The bars are sorted
so that the graph looks like a cumulative distribution. Each bar
is labeled with the language of the corresponding example string.
(The difference between the Chinese and Taiwanese strings is that
the former uses simplified characters.)
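
The bar construction described above can be sketched in Python as follows. The function name is hypothetical, and rounding r/0.04 to the nearest integer is an assumption: the text does not state a rounding rule, although nearest-integer rounding is consistent with the bars shown for the example strings whose lengths can be checked.

```python
def ratio_bar(ace_length, min_length):
    """Draw the bar for one example: 25 "O" characters, then "@" for the excess."""
    ratio = ace_length / min_length  # at least 1 by construction
    length = int(ratio / 0.04 + 0.5)  # assumption: round to nearest
    return "O" * 25 + "@" * (length - 25)
```

For the simplified Chinese example, the AMC-ACE-M encoding is 23 characters long and the shortest encoding (BRACE) is 22, so the ratio is about 1.045 and the bar is 26 characters long.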
|
||
|
||
RACE:
|
||
Hindi OOOOOOOOOOOOOOOOOOOOOOOOO@@@
|
||
Korean OOOOOOOOOOOOOOOOOOOOOOOOO@@@
|
||
Arabic OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
|
||
Taiwanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
|
||
Hebrew OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@
|
||
Russian OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
|
||
Japanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
|
||
Spanish OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@
|
||
Chinese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@
|
||
Vietnamese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@
|
||
Czech OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@@@@@@@@@@
|
||
|
||
LACE:
|
||
Korean OOOOOOOOOOOOOOOOOOOOOOOOO@@@
|
||
Hindi OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
|
||
Taiwanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
|
||
Arabic OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
|
||
Hebrew OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
|
||
Chinese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
|
||
Japanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
|
||
Russian OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
|
||
Spanish OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@
|
||
Vietnamese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@
|
||
Czech OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@@@
|
||
|
||
DUDE:
|
||
Russian OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Arabic OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Hebrew OOOOOOOOOOOOOOOOOOOOOOOOO@@
|
||
Vietnamese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@
|
||
Chinese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@
|
||
Japanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@
|
||
Korean OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
|
||
Spanish OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@
|
||
Czech OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
|
||
Hindi OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@
|
||
Taiwanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@
|
||
|
||
AMC-ACE-M:
|
||
Czech OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Hebrew OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Japanese OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Korean OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Russian OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Spanish OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Taiwanese OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Vietnamese OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Chinese OOOOOOOOOOOOOOOOOOOOOOOOO@
|
||
Arabic OOOOOOOOOOOOOOOOOOOOOOOOO@@@
|
||
Hindi OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@
|
||
|
||
BRACE:
|
||
Chinese OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Hindi OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Japanese OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Spanish OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Taiwanese OOOOOOOOOOOOOOOOOOOOOOOOO
|
||
Arabic OOOOOOOOOOOOOOOOOOOOOOOOO@
|
||
Czech OOOOOOOOOOOOOOOOOOOOOOOOO@
|
||
Vietnamese OOOOOOOOOOOOOOOOOOOOOOOOO@
|
||
Hebrew OOOOOOOOOOOOOOOOOOOOOOOOO@@
|
||
Korean OOOOOOOOOOOOOOOOOOOOOOOOO@@
|
||
Russian OOOOOOOOOOOOOOOOOOOOOOOOO@@@

These results suggest that DUDE is preferable to RACE and LACE,
because it has similar simplicity, better support for case
preservation, and is somewhat more efficient.

The results also suggest that AMC-ACE-M is preferable to BRACE,
because it has similar efficiency, better support for case
preservation, and is simpler.

DUDE and AMC-ACE-M have equal support for case preservation, but
AMC-ACE-M offers significantly better efficiency, at the cost of
significantly greater complexity, so choosing between them entails a
value judgement.

Example strings

In the ACE encodings below, signatures (like "bq--" for RACE) are
not shown. Non-LDH characters in the Unicode string are forced to
lowercase before being encoded using BRACE, RACE, and LACE. For
RACE and LACE, the letters A-Z are likewise forced to lowercase.
UTF-8 and UTF-16 are included for length comparisons, with non-ASCII
bytes shown as "?". AMC-ACE-M is abbreviated AMC-M. Backslashes
show where line breaks have been inserted in ACE strings too long
for one line. The RACE and LACE encodings are courtesy of Mark
Davis's online UTF converter [UTFCONV] (slightly modified to remove
the length restrictions).

The first several examples are all names of Japanese music artists,
song titles, and TV programs, just because the author happens to
have them handy (but Japanese is useful for providing examples
of single-row text, two-row text, ideographic text, and various
mixtures thereof).

(A) 3<nen>B<gumi><kinpachi><sensei> (Japanese TV program title)

    <nen> = U+5E74 (kanji)
    <gumi> = U+7D44 (kanji)
    <kinpachi><sensei> = U+91D1 U+516B U+5148 U+751F (kanji)

    UTF-16: ????????????????
    UTF-8:  3???B???????????????
    AMC-M:  utk-3-8ze-B-hkenqtymwifi9
    BRACE:  u-3-ygj-b-ynb6gjc7pp4k5p5w
    DUDE:   j3le74G062nd44p1d1l16bk8n51f
    RACE:   3aadgxtuabrh2rer2fiwwukioupq
    LACE:   74adgxtuabrh2rer2fiwwukioupq

(B) <amuro><namie>-with-SUPER-MONKEYS (Japanese music group name)

    <amuro><namie> = U+5B89 U+5BA4 U+5948 U+7F8E U+6075 (kanji)

    UTF-8:  ??????????????????-with-SUPER-MONKEYS
    AMC-M:  u5m2j4etwif6q2zf---with--SUPER--MONKEYS
    BRACE:  uvj7fuaqcahy982xa---with--SUPER--MONKEYS
    DUDE:   lb89q4p48nf8em075-g077m9n4m8-N3LGM5N2-MdVURLN9J
    UTF-16: ????????????????????????????????????????????????
    LACE:   ajnytjablfeac74oafqhkeyafv3qm5difvzxk4dfoiww233onnsxs4y
    RACE:   3bnysw5elfeh7dtaouac2adxabuqa5aanaac2adtab2qa4aamuaheab\
            nabwqa3yanyagwadfab4qa4y

(C) Hello-Another-Way-<sorezore><no><basho> (Japanese song title)

    <sorezore><no> = U+305D U+308C U+305E U+308C U+306E (hiragana)
    <basho> = U+5834 U+6240 (kanji)

    UTF-8:  Hello-Another-Way-?????????????????????
    BRACE:  ji7-Hello--Another--Way---v3jhaefvd2ufj62
    AMC-M:  bsk-Hello--Another--Way---p2nq2nyqx2veyuwa
    DUDE:   M8lssv-Huvn4m8ln2-Nm1n9-j05docleocmel834m240
    UTF-16: ??????????????????????????????????????????????????
    LACE:   ciagqzlmnrxs2ylon52gqzlsfv3wc6jnauyf3dc6rrxacwbuafrea
    RACE:   3aagqadfabwaa3aan4ac2adbabxaa3yaoqagqadfabzaaliao4agcad\
            zaawtaxjqrqyf4memgbxfqndcia

(D) <hitotsu><yane><no><shita>2 (Japanese TV program title)

    <hitotsu> = U+3072 U+3068 U+3064 (hiragana)
    <yane> = U+5C4B U+6839 (kanji)
    <no> = U+306E (hiragana)
    <shita> = U+4E0B (kanji)

    UTF-16: ????????????????
    UTF-8:  ?????????????????????2
    AMC-M:  bsnzciex6wmy2vjqw8sm-2
    BRACE:  ji96u56uwbhf2wqxnw4s-2
    DUDE:   j072m8klc4bm839j06eke0bg032
    RACE:   3ayhemdigbsfys3iheyg4tqlaaza
    LACE:   74yhemdigbsfys3iheyg4tqlaaza

(E) Maji<de>Koi<suru>5<byou><mae> (Japanese song title)

    <de> = U+3067 (hiragana)
    <suru> = U+3059 U+308B (hiragana)
    <byou><mae> = U+79D2 U+524D (kanji)

    UTF-8:  Maji???Koi??????5??????
    UTF-16: ??????????????????????????
    AMC-M:  bsm-Maji-r-Koi-b2m-5-z37cxuwp
    BRACE:  ji8-Maji-g-Koi-qe7x-5-wx7p6ma
    DUDE:   Mdhqpj067G06bvpj059obg035n9d2l24d
    RACE:   3aag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq
    LACE:   74ag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq

(F) <pafii>de<runba> (Japanese song title)

    <pafii> = U+30D1 U+30D5 U+30A3 U+30FC (katakana)
    <runba> = U+30EB U+30F3 U+30D0 (katakana)

    UTF-16: ??????????????
    BRACE:  3iu8pazt-de-pygi
    AMC-M:  bs3jp4d9n-de-8m9di
    RACE:   gdi5li7475sp6zpl6pia
    DUDE:   j0d1lq3vcg064lj0ebv3t0
    UTF-8:  ????????????de?????????
    LACE:   aqyndvnd7qbaazdfamyox46q

(G) <sono><supiido><de> (Japanese song title)

    <sono> = U+305D U+306E (hiragana)
    <supiido> = U+30B9 U+30D4 U+30FC U+30C9 (katakana)
    <de> = U+3067 (hiragana)

    RACE:   gbow5oou7tewo
    UTF-16: ??????????????
    BRACE:  bidprdmp9wt7mi
    LACE:   a4yf23vz2t6mszy
    AMC-M:  bsmfyq5j7e9n6jr
    DUDE:   j05dmer9t4vcs9m7
    UTF-8:  ?????????????????????

The next several examples are all translations of the sentence "Why
can't they just speak in <language>?" (courtesy of Michael Kaplan's
"provincial" page [PROVINCIAL]). Word breaks and punctuation have
been removed, as is often done in domain names.

(H) Arabic (Egyptian):
    U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644
    U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F

    DUDE:   m44qnli7oqk3kloj4phi8kahf
    BRACE:  28akcjwcmp3ciwb4t3ngd4nbaz
    AMC-M:  agiekhfuhuiukdefivevjvbuiktr
    RACE:   azceur2fe4ucuq2eivediojrfbfb6
    LACE:   cedeisshiutsqksdircuqnbzgeueuhy
    UTF-16: ??????????????????????????????????
    UTF-8:  ??????????????????????????????????

(I) Chinese (simplified):
    U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587

    UTF-16: ??????????????????
    BRACE:  kgcqqsgp26i5h4zn7req5i
    AMC-M:  uqj7g8nvk6awispn9wupdnh
    DUDE:   ked6ucjas0k8gdobf4ke2dm587
    UTF-8:  ???????????????????????????
    LACE:   azhnn3b2ybea2aml6qau4libmwdq
    RACE:   3bhnmtxmjy5e5qcojbha3c7ujywwlby
|
||
|
||
(J) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
|
||
|
||
<ccaron> = U+010D
|
||
<ecaron> = U+011B
|
||
<iacute> = U+00ED
|
||
|
||
UTF-8: Pro??prost??nemluv????esky
|
||
AMC-M: g26-Pro-p-prost-9m-nemluv-6pp-esky
|
||
BRACE: i32-Pro-u-prost-8y-nemluv-29f3n-esky
|
||
DUDE: N0imfh0dg70imfn3kh1bg6eltsn5mudh0dg65n3mbn9
|
||
UTF-16: ????????????????????????????????????????????
|
||
LACE: amaha4tpaeaq2biaobzg643uaearwbyanzsw23dvo3wqcainaqagk43\
|
||
lpe
|
||
RACE: ah7xb73s75xq373q75zp6377op7xig77n37wl73n75wp65p7o3762dp\
|
||
7mx7xh73l754q
|
||
|
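The AMC-M row of example (J) shows how unpaired hyphens switch
between literal LDH text and base-32 codes. The following is a
minimal sketch in C of that switching rule, mirroring decoder cases
2.1 and 2.2 of the example implementation. It assumes an ASCII
execution character set and that the caller has already skipped the
style/offset header (three characters, "g26", in example (J)). The
name split_modes() is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not part of amc-ace-m.c.  Splits the part of */
/* an AMC-ACE-M string that follows the header into its literal      */
/* characters and its base-32 characters.  Returns the number of     */
/* "--" pairs seen (each stands for one hyphen-minus), or -1 if an   */
/* output buffer is too small.  Both outputs are null-terminated.    */
int split_modes(const char *body,
                char *literal_out, size_t literal_cap,
                char *base32_out, size_t base32_cap)
{
  size_t nl = 0, nb = 0;
  int literal = 0;  /* the body starts in base-32 mode */
  int hyphens = 0;

  if (literal_cap == 0 || base32_cap == 0) return -1;

  while (*body) {
    if (*body == '-') {
      if (body[1] == '-') {
        ++hyphens;           /* "--" stands for a hyphen-minus */
        body += 2;
      }
      else {
        literal = !literal;  /* a lone "-" toggles the mode */
        ++body;
      }
      continue;
    }

    if (literal) {
      if (nl + 1 >= literal_cap) return -1;
      literal_out[nl++] = *body++;
    }
    else {
      if (nb + 1 >= base32_cap) return -1;
      base32_out[nb++] = *body++;
    }
  }

  literal_out[nl] = '\0';
  base32_out[nb] = '\0';
  return hyphens;
}
```

For the body "-Pro-p-prost-9m-nemluv-6pp-esky" this collects the
literal characters "Proprostnemluvesky" and the base-32 characters
"p9m6pp", which are the codes for the four non-ASCII letters.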
||
(K) Hebrew:
|
||
U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8
|
||
U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2
|
||
U+05D1 U+05E8 U+05D9 U+05EA
|
||
|
||
AMC-M: af4nqeep8e8jfinaqdb8ijp8cb8ij8k
|
||
DUDE: ldcukktu4pt5osgujhu8t9tu2t1u8t9ua
|
||
BRACE: 27vkyp7bgwmbpfjgc4ynx5nd8xsp5nd9c
|
||
RACE: axon5vgu3xsotvoy3tin5u6r5dm53ywr5dm6u
|
||
LACE: cyc5zxwu2to6j2ov3donbxwt2huntxpc2hunt2q
|
||
UTF-8: ????????????????????????????????????????????
|
||
UTF-16: ????????????????????????????????????????????
|
||
|
||
(L) Hindi:
|
||
U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D
|
||
U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939
|
||
U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947
|
||
U+0939 U+0948 U+0902 (Devanagari)
|
||
|
||
BRACE: 2b7xtenqdr7zc6uma2pmcz7ibage237kdemicnk9gei32
|
||
RACE: bextsmslc44t6kcnezabktjpjmbcqokaaiwewmrycuseookiai
|
||
LACE: dyes6ojsjmltspzijuteafknf5fqekbziabcyszshaksirzzjaba
|
||
AMC-M: ajhurbvcwmthbhuiwpugitfwpurwmscuibiscunwmvcatfuerbwisc
|
||
DUDE: p2fj9ikbh7j9vi8kdi6k0h5kdifkbg2i8j9k0g2ickbj2oh5i4k7j9k\
|
||
8g2
|
||
UTF-16: ???????????????????????????????????????????????????????\
|
||
?????
|
||
UTF-8: ???????????????????????????????????????????????????????\
|
||
???????????????????????????????????
|
||
|
||
(M) Korean:
|
||
U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
|
||
U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
|
||
U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C (Hangul syllables)
|
||
|
||
UTF-16: ????????????????????????????????????????????????
|
||
UTF-8: ???????????????????????????????????????????????????????\
|
||
?????????????????
|
||
AMC-M: yhxcj2w6exiaxi68acfn92n68ezehk6xypdpwam6zehmwhk648eavwd\
|
||
p6aqi23ieemweywn
|
||
BRACE: y394qebjusrcndbs82pkvstf96sxufcr7ffr4vbgdwsxufcx8pdktgb\
|
||
gmnsqydmk7im56arju6pt82
|
||
LACE: 77atrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\
|
||
exj2mlpfzzcyjrsely5ck4ta
|
||
RACE: 3datrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\
|
||
exj2mlpfzzcyjrsely5ck4ta
|
||
DUDE: s138qcc4s758raa8ke0s0acr78cke4s774t55cqd6ds5b4r97cs774t\
|
||
574lcr2e4q74s5bcr9c8g98s88bn44qe4c
|
||
|
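The AMC-M row of example (M) uses the wide style. The following is a
minimal sketch in C of how its first code point can be recovered,
following the arithmetic of the decoder in the example
implementation. It handles only one case: a three-character wide
header (not "large") followed by a three-character code of the form
0xxxx xxxxx xxxxx. The names b32_value() and wide_first_code_point()
are hypothetical, and an ASCII execution character set is assumed.

```c
#include <string.h>

/* Value (0..31) of a lowercase base-32 character, or -1 if the      */
/* character is not in the alphabet (a-k, m-n, p-z, 2-9).            */
/* Hypothetical helper; uppercase forms are not handled here.        */
int b32_value(char c)
{
  static const char alphabet[] = "abcdefghijkmnpqrstuvwxyz23456789";
  const char *p = c ? strchr(alphabet, c) : NULL;

  return p ? (int) (p - alphabet) : -1;
}

/* Decodes the first code point of a wide-style string, or returns   */
/* -1 if the string does not fit the single case described above.    */
long wide_first_code_point(const char *s)
{
  int h0, h2, c0, c1, c2;
  long window_base;

  if (strlen(s) < 6) return -1;

  h0 = b32_value(s[0]);
  h2 = b32_value(s[2]);
  c0 = b32_value(s[3]);
  c1 = b32_value(s[4]);
  c2 = b32_value(s[5]);

  if (h0 < 0 || b32_value(s[1]) < 0 || h2 < 0) return -1;
  if (c0 < 0 || c1 < 0 || c2 < 0) return -1;
  if ((h0 >> 4) != 1) return -1;        /* not wide style */
  if (((h0 >> 3) & 1) != 0) return -1;  /* large header: not handled */
  if (c0 & 0x10) return -1;             /* not a one-character lead */

  /* The third header character selects the 20k-window, which starts */
  /* at (value << 11).  Three-character codes of this form cover the */
  /* part of the window that begins 0x1000 above its start.          */
  window_base = (long) h2 << 11;
  return window_base + 0x1000
       + (((long) c0 << 10) | ((long) c1 << 5) | (long) c2);
}
```

Applied to "yhxcj2", the first six characters of the AMC-M row, the
function returns 0xC138, the first code point listed for example (M).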
||
(N) Russian:
|
||
U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E
|
||
U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440
|
||
U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A
|
||
U+0438 (Cyrillic)
|
||
|
||
DUDE: K3fuk7j5sk3j6lutotljuiuk0vijfuk0jhhjao
|
||
AMC-M: aehHgrvfemvgvfgfafvfvdgvcgiwrkhgimjjca
|
||
BRACE: 269xyjvcyafqfdwyr3xfd8z8byi6z39xyi692s7ug2
|
||
RACE: aq7t4rzvhrbtmnj6hu4d2njthyzd4qcpii7t4qcdifatuoa
|
||
LACE: dqcd6pshgu6egnrvhy6tqpjvgm7depsaj5bd6psainaucory
|
||
UTF-16: ???????????????????????????????????????????????????????\
|
||
???
|
||
UTF-8: ???????????????????????????????????????????????????????\
|
||
???
|
||
|
||
(O) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
|
||
|
||
<eacute> = U+00E9
|
||
<ntilde> = U+00F1
|
||
|
||
UTF-8: Porqu??nopuedensimplementehablarenEspa??ol
|
||
AMC-M: aa7-Porqu-b-nopuedensimplementehablarenEspa-j-ol
|
||
BRACE: 22x-Porqu-9-nopuedensimplementehablarenEspa-j-ol
|
||
DUDE: N0mfn2hlu9mevn0lm5klun3m9tn0mcltlun4m5ohishn2m5uLn3gm1v\
|
||
1mfs
|
||
RACE: abyg64troxuw433qovswizloonuw24dmmvwwk3tumvugcytmmfzgk3t\
|
||
fonygd4lpnq
|
||
LACE: faaha33sof26s3tpob2wkzdfnzzws3lqnrsw2zloorswqylcnrqxezl\
|
||
omvzxayprn5wa
|
||
UTF-16: ???????????????????????????????????????????????????????\
|
||
?????????????????????????
|
||
|
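The AMC-M row of example (O) uses the narrow style, in which the
header selects a row and a 16-window inside it. The following is a
minimal sketch in C of the header arithmetic, following the decoder
in the example implementation. It handles only three-character
narrow headers (not "large"). The names b32_value(),
special_row_base, and narrow_window_base() are hypothetical, and an
ASCII execution character set is assumed.

```c
#include <string.h>

/* Offsets of the special rows 0xD8..0xDF, copied from the           */
/* special_row_offset table of the example implementation.           */
static const long special_row_base[8] =
  { 0x0020, 0x005B, 0x007B, 0x00A0, 0x00C0, 0x00DF, 0x0134, 0x0270 };

/* Value (0..31) of a lowercase base-32 character, or -1 if the      */
/* character is not in the alphabet (a-k, m-n, p-z, 2-9).            */
int b32_value(char c)
{
  static const char alphabet[] = "abcdefghijkmnpqrstuvwxyz23456789";
  const char *p = c ? strchr(alphabet, c) : NULL;

  return p ? (int) (p - alphabet) : -1;
}

/* Returns the offset of the 16-window selected by a three-character */
/* narrow-style header, or -1 if the header is not of that form.     */
long narrow_window_base(const char *header)
{
  int h0, h1, h2;
  long row, row_base;

  if (strlen(header) < 3) return -1;

  h0 = b32_value(header[0]);
  h1 = b32_value(header[1]);
  h2 = b32_value(header[2]);

  if (h0 < 0 || h1 < 0 || h2 < 0) return -1;
  if (h0 >> 3) return -1;  /* wide or large header: not handled */

  row = ((long) (h0 & 7) << 5) | (long) h1;
  row_base = (row >> 3) == 0x1B ? special_row_base[row & 7] : row << 8;

  /* The third character counts 8-code-point steps above the row. */
  return ((row_base >> 3) + h2) << 3;
}
```

For the header "aa7" of example (O) the window starts at 0xE8, so the
single-character codes "b" (value 1) and "j" (value 9) stand for
U+00E9 and U+00F1. For the header "g26" of example (J) the window
starts at 0x100, so "p" (value 13) stands for U+010D.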
||
(P) Taiwanese:
|
||
U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587
|
||
|
||
UTF-16: ??????????????????
|
||
UTF-8: ???????????????????????????
|
||
AMC-M: uqj7g2tbgtu6a385pspnxkupdnh
|
||
BRACE: kgcqui49gatc2wyrn8y7cndgte9
|
||
RACE: 3bhnmuaroize5qe6xvha3cvkjywwlby
|
||
LACE: 75hnmuaroize5qe6xvha3cvkjywwlby
|
||
DUDE: ked6l011n232kec0pebdke0doaaake2dm587
|
||
|
||
(Q) Vietnamese:
|
||
Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>chi\
|
||
<hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t
|
||
|
||
<dotbelow> = U+0323
|
||
<ocirc> = U+00F4
|
||
<ecirc> = U+00EA
|
||
<hookabove> = U+0309
|
||
<acute> = U+0301
|
||
|
||
UTF-8: Ta??isaoho??kh??ngth????chi??no??iti????ngVi????t
|
||
AMC-M: ada-Ta-ud-isaoho-ud-kh-s9e-ngth-s8kj-chi-j-no-b-iti-s8k\
|
||
b-ngVi-s8kud-t
|
||
BRACE: i54-Ta-8-isaoho-ay-kh-29n-ngth-s2xa6i-chi-k-no-2g-iti-2\
|
||
9c29-ngVi-25p48-t
|
||
UTF-16: ???????????????????????????????????????????????????????\
|
||
?????????????????????
|
||
DUDE: N4m1j23g69n3m1vovj23g6bov4menn4m8uaj09g63opj09g6evj01g6\
|
||
9n4m9uaj01g6enN6m9uaj23g74
|
||
LACE: aiahiyibamrqmadjonqw62dpaebsgcaannupi3thoruouaidbebqay3\
|
||
ineaqgcicabxg6aidaecaa2lunhvacaybauag4z3wnhvacazdaeahi
|
||
RACE: ap7xj73bep7wt73t75q76377nd7w6i77np7wr77u75xp6z77ot7wr77\
|
||
kbh7wh73i75uqt73o75xqd73j752p62p75ia763x7m77xn73j77vch7\
|
||
3u
|
||
|
||
The last example is an ASCII string that breaks not only the
existing rules for host name labels but also the rules proposed in
[NAMEPREP02] for internationalized domain names.
|
||
|
||
(R) -> $1.00 <-
|
||
|
||
UTF-8: -> $1.00 <-
|
||
DUDE: -jei0kj1iej0gi0jc-
|
||
RACE: aawt4ibegexdambahqwq
|
||
LACE: bmac2praeqys4mbqea6c2
|
||
UTF-16: ??????????????????????
|
||
AMC-M: aae--vqae-1-q-00-avn--
|
||
BRACE: 229--t2b4-1-w-00-i9i--
|
||
|
||
Security considerations
|
||
|
||
Users expect each domain name in DNS to be controlled by a single
authority. If a Unicode string intended for use as a domain label
could map to multiple ACE labels, then an internationalized domain
name could map to multiple ACE domain names, each controlled by
a different authority, some of which could be spoofs that hijack
service requests intended for another. Therefore AMC-ACE-M is
designed so that each Unicode string has a unique encoding.
|
||
|
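The decoder in the example implementation enforces uniqueness by
re-encoding its output and comparing the result with its input. The
following is a minimal sketch in C of that technique, using a toy
codec (canonical decimal numbers) rather than AMC-ACE-M itself. The
name toy_decode_strict() is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy codec, not AMC-ACE-M: the canonical encoding of a value is    */
/* its decimal form as printed by "%lu".  The permissive parser      */
/* strtoul() also accepts leading zeros, signs, and white space, so  */
/* several strings map to one value.  Re-encoding the parsed value   */
/* and comparing with the input rejects every non-canonical form.    */
/* Returns 1 and stores the value if text is canonical, 0 otherwise. */
int toy_decode_strict(const char *text, unsigned long *value)
{
  char again[3 * sizeof(unsigned long) + 2];  /* enough for "%lu" */
  char *end;
  unsigned long v;

  if (*text == '\0') return 0;

  v = strtoul(text, &end, 10);
  if (*end != '\0') return 0;

  sprintf(again, "%lu", v);
  if (strcmp(again, text) != 0) return 0;

  *value = v;
  return 1;
}
```

With this check "7" is accepted, while "007" and "+7" are rejected
even though the permissive parser maps all three to the same value.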
||
However, there can still be multiple Unicode representations of the
"same" text, for various definitions of "same". This problem is
addressed to some extent by the Unicode standard under the topic
of canonicalization, but some text strings may be misleading or
ambiguous to humans when used as domain names, such as strings
containing dots, slashes, at-signs, etc. These issues are being
further studied under the topic of "nameprep" [NAMEPREP02].
|
||
|
||
References
|
||
|
||
[ACEID01] Yoshiro Yoneya, Naomasa Maruyama, "Proposal for
a determining process of ACE identifier", 2000-Dec-19,
draft-ietf-idn-aceid-01.

[BRACE00] Adam Costello, "BRACE: Bi-mode Row-based
ASCII-Compatible Encoding for IDN version 0.1.2", 2000-Sep-19,
draft-ietf-idn-brace-00.

[DUDE00] Brian Spolarich, Mark Welter, "DUDE: Differential Unicode
Domain Encoding", 2000-Nov-21, draft-ietf-idn-dude-00.

[IDN] Internationalized Domain Names (IETF working group),
http://www.i-d-n.net/, idn@ops.ietf.org.

[LACE01] Paul Hoffman, Mark Davis, "LACE: Length-based ASCII
Compatible Encoding for IDN", 2001-Jan-05, draft-ietf-idn-lace-01.

[NAMEPREP02] Paul Hoffman, Marc Blanchet, "Preparation
of Internationalized Host Names", 2001-Jan-17,
draft-ietf-idn-nameprep-02.

[PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page",
http://www.trigeminal.com/samples/provincial.html.

[RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding
for IDN", 2000-Nov-28, draft-ietf-idn-race-03.

[RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host
Table Specification", 1985-Oct, RFC 952.

[RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities",
1987-Nov, RFC 1034.

[RFC1123] Internet Engineering Task Force, R. Braden (editor),
"Requirements for Internet Hosts -- Application and Support",
1989-Oct, RFC 1123.

[SACE] Dan Oscarsson, "Simple ASCII Compatible Encoding (SACE)",
draft-ietf-idn-sace-*.

[UNICODE] The Unicode Consortium, "The Unicode Standard",
http://www.unicode.org/unicode/standard/standard.html.

[UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a
Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*.

[UTF6] Mark Welter, Brian W. Spolarich, "UTF-6 - Yet Another
ASCII-Compatible Encoding for IDN", draft-ietf-idn-utf6-*.

[UTFCONV] Mark Davis, "UTF Converter",
http://www.macchiato.com/unicode/convert.html.
|
||
|
||
Author
|
||
|
||
Adam M. Costello <amc@cs.berkeley.edu>
|
||
http://www.cs.berkeley.edu/~amc/
|
||
|
||
|
||
Example implementation
|
||
|
||
|
||
/******************************************/
|
||
/* amc-ace-m.c 0.1.0 (2001-Feb-12-Mon) */
|
||
/* Adam M. Costello <amc@cs.berkeley.edu> */
|
||
/******************************************/
|
||
|
||
/* This is ANSI C code implementing AMC-ACE-M version 0.1.*. */
|
||
|
||
|
||
/************************************************************/
|
||
/* Public interface (would normally go in its own .h file): */
|
||
|
||
#include <limits.h>
|
||
|
||
enum amc_ace_status {
|
||
amc_ace_success,
|
||
amc_ace_invalid_input,
|
||
amc_ace_output_too_big
|
||
};
|
||
|
||
enum case_sensitivity { case_sensitive, case_insensitive };
|
||
|
||
#if UINT_MAX >= 0x10FFFF
|
||
typedef unsigned int u_code_point;
|
||
#else
|
||
typedef unsigned long u_code_point;
|
||
#endif
|
||
|
||
int amc_ace_m_encode(
|
||
unsigned int input_length,
|
||
const u_code_point *input,
|
||
const unsigned char *uppercase_flags,
|
||
unsigned int *output_size,
|
||
unsigned char *output );
|
||
|
||
/* amc_ace_m_encode() converts Unicode to AMC-ACE-M. The input */
|
||
/* must be represented as an array of Unicode code points */
|
||
/* (not code units; surrogate pairs are not allowed), and the */
|
||
/* output will be represented as null-terminated ASCII. The */
|
||
/* input_length is the number of code points in the input. The */
|
||
/* output_size is an in/out argument: the caller must pass */
|
||
/* in the maximum number of characters that may be output */
|
||
/* (including the terminating null), and on successful return */
|
||
/* it will contain the number of characters actually output */
|
||
/* (including the terminating null, so it will be one more than */
|
||
/* strlen() would return, which is why it is called output_size */
|
||
/* rather than output_length). The uppercase_flags array must */
|
||
/* hold input_length boolean values, where nonzero means the */
|
||
/* corresponding Unicode character should be forced to uppercase */
|
||
/* after being decoded, and zero means it is caseless or should */
|
||
/* be forced to lowercase. Alternatively, uppercase_flags may */
|
||
/* be a null pointer, which is equivalent to all zeros. The */
|
||
/* letters a-z and A-Z are always encoded literally, regardless */
|
||
/* of the corresponding flags. The encoder always outputs */
|
||
/* lowercase base-32 characters except when nonzero values */
|
||
/* of uppercase_flags require otherwise, so the encoder is */
|
||
/* compatible with any of the case models. The return value */
|
||
/* may be any of the amc_ace_status values defined above; if */
|
||
/* not amc_ace_success, then output_size and output may contain */
|
||
/* garbage. On success, the encoder will never need to write an */
|
||
/* output_size greater than input_length*5+6, because of how the */
|
||
/* encoding is defined. */
|
||
|
||
int amc_ace_m_decode(
|
||
enum case_sensitivity case_sensitivity,
|
||
unsigned char *scratch_space,
|
||
const unsigned char *input,
|
||
unsigned int *output_length,
|
||
u_code_point *output,
|
||
unsigned char *uppercase_flags );
|
||
|
||
/* amc_ace_m_decode() converts AMC-ACE-M to Unicode. The input */
|
||
/* must be represented as null-terminated ASCII, and the output */
|
||
/* will be represented as an array of Unicode code points. */
|
||
/* The case_sensitivity argument influences the check on the */
|
||
/* well-formedness of the input string; it must be case_sensitive */
|
||
/* if case-sensitive comparisons are allowed on encoded strings, */
|
||
/* case_insensitive otherwise (see also section "Case sensitivity */
|
||
/* models" of the AMC-ACE-M specification). The scratch_space */
|
||
/* must point to space at least as large as the input, which will */
|
||
/* get overwritten (this allows the decoder to avoid calling */
|
||
/* malloc()). The output_length is an in/out argument: the */
|
||
/* caller must pass in the maximum number of code points that */
|
||
/* may be output, and on successful return it will contain the */
|
||
/* actual number of code points output. The uppercase_flags */
|
||
/* array must have room for at least output_length values, or it */
|
||
/* may be a null pointer if the case information is not needed. */
|
||
/* A nonzero flag indicates that the corresponding Unicode */
|
||
/* character should be forced to uppercase by the caller, while */
|
||
/* zero means it is caseless or should be forced to lowercase. */
|
||
/* The letters a-z and A-Z are output already in the proper case, */
|
||
/* but their flags will be set appropriately so that applying the */
|
||
/* flags would be harmless. The return value may be any of the */
|
||
/* amc_ace_status values defined above; if not amc_ace_success, */
|
||
/* then output_length, output, and uppercase_flags may contain */
|
||
/* garbage. On success, the decoder will never need to write */
|
||
/* an output_length greater than the length of the input (not */
|
||
/* counting the null terminator), because of how the encoding is */
|
||
/* defined. */
|
||
|
||
|
||
/**********************************************************/
|
||
/* Implementation (would normally go in its own .c file): */
|
||
|
||
#include <string.h>
|
||
|
||
/* Character utilities: */
|
||
|
||
/* is_ldh(codept) returns 1 if the code point represents an LDH */
|
||
/* character (ASCII letter, digit, or hyphen-minus), 0 otherwise. */
|
||
|
||
static int is_ldh(u_code_point codept)
|
||
{
|
||
if (codept == 45) return 1;
|
||
if (codept < 48) return 0;
|
||
if (codept <= 57) return 1;
|
||
if (codept < 65) return 0;
|
||
if (codept <= 90) return 1;
|
||
if (codept < 97) return 0;
|
||
if (codept <= 122) return 1;
|
||
return 0;
|
||
}
|
||
|
||
/* is_AtoZ(c) returns 1 if c is an */
|
||
/* uppercase ASCII letter, zero otherwise. */
|
||
|
||
static unsigned char is_AtoZ(unsigned char c)
|
||
{
|
||
return c >= 65 && c <= 90;
|
||
}
|
||
|
||
/* special_row_offset[n] holds the offset of the */
|
||
/* bottom of special row 0xD8 + n, where n is in 0..7. */
|
||
|
||
static u_code_point special_row_offset[] =
|
||
{ 0x0020, 0x005B, 0x007B, 0x00A0, 0x00C0, 0x00DF, 0x0134, 0x0270 };
|
||
|
||
/* base32[n] is the lowercase base-32 character representing */
|
||
/* the number n from the range 0 to 31. Note that we cannot */
|
||
/* use string literals for ASCII characters because an ANSI C */
|
||
/* compiler does not necessarily use ASCII. */
|
||
|
||
static const unsigned char base32[] = {
|
||
97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, /* a-k */
|
||
109, 110, /* m-n */
|
||
112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, /* p-z */
|
||
50, 51, 52, 53, 54, 55, 56, 57 /* 2-9 */
|
||
};
|
||
|
||
/* base32_decode(c) returns the value of a base-32 character, in the */
|
||
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
|
||
/* base-32 character. */
|
||
|
||
enum { base32_invalid = 32 };
|
||
|
||
static unsigned int base32_decode(unsigned char c)
|
||
{
|
||
if (c < 50) return base32_invalid;
|
||
if (c <= 57) return c - 26;
|
||
if (c < 97) c += 32;
|
||
if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
|
||
return c - 97 - (c > 108) - (c > 111);
|
||
}
|
||
|
||
/* unequal(case_sensitivity,a1,a2,n) returns 0 if the arrays */
|
||
/* a1 and a2 are equal in the first n positions, 1 otherwise. */
|
||
/* If case_sensitivity is case_insensitive, then ASCII A-Z are */
|
||
/* considered equal to a-z respectively. */
|
||
|
||
static int unequal(
|
||
enum case_sensitivity case_sensitivity,
|
||
const unsigned char *a1,
|
||
const unsigned char *a2,
|
||
unsigned int n )
|
||
{
|
||
const unsigned char *end;
|
||
unsigned char c1, c2;
|
||
|
||
if (case_sensitivity != case_insensitive) return memcmp(a1,a2,n);
|
||
|
||
for (end = a1 + n; a1 < end; ++a1, ++a2) {
|
||
c1 = *a1;
|
||
c2 = *a2;
|
||
if (c1 >= 65 && c1 <= 90) c1 += 32;
|
||
if (c2 >= 65 && c2 <= 90) c2 += 32;
|
||
if (c1 != c2) return 1;
|
||
}
|
||
|
||
return 0;
|
||
}
|
||
|
||
|
||
/* Encoder: */
|
||
|
||
int amc_ace_m_encode(
|
||
unsigned int input_length,
|
||
const u_code_point *input,
|
||
const unsigned char *uppercase_flags,
|
||
unsigned int *output_size,
|
||
unsigned char *output )
|
||
{
|
||
unsigned int literal, wide; /* boolean */
|
||
u_code_point codept, n, diff, morebits;
|
||
u_code_point A, B, C, offsetA, offsetB, offsetC, offset;
|
||
const u_code_point *input_end, *p, *pp;
|
||
unsigned int count, max, next_in, next_out, max_out, codelen, i;
|
||
unsigned char c;
|
||
|
||
input_end = input + input_length;
|
||
|
||
/* 1) Verify that only valid code points appear: */
|
||
|
||
for (p = input; p < input_end; ++p) {
|
||
if (*p >> 11 == 0x1B || *p > 0x10FFFF) return amc_ace_invalid_input;
|
||
}
|
||
|
||
/* 2) Determine the most populous row: B and offsetB */
|
||
|
||
/* first check the special rows: */
|
||
|
||
B = 0xD8;
|
||
offsetB = special_row_offset[0];
|
||
max = 0;
|
||
|
||
for (n = 0; n < 8; ++n) {
|
||
offset = special_row_offset[n];
|
||
count = 0;
|
||
|
||
for (p = input; p < input_end; ++p) {
|
||
if (*p - offset <= 0xFF && !is_ldh(*p)) ++count;
|
||
}
|
||
|
||
if (count > max) {
|
||
B = 0xD8 + n;
|
||
offsetB = offset;
|
||
max = count;
|
||
}
|
||
}
|
||
|
||
/* now check the regular rows: */
|
||
|
||
for (pp = input; pp < input_end; ++pp) {
|
||
n = *pp >> 8;
|
||
count = 0;
|
||
|
||
for (p = input; p < input_end; ++p) {
|
||
if (*p >> 8 == n && !is_ldh(*p)) ++count;
|
||
}
|
||
|
||
if (count > max || (count == max && n < B)) {
|
||
B = n;
|
||
offsetB = n << 8;
|
||
max = count;
|
||
}
|
||
}
|
||
|
||
/* 3) Determine the most populous 16-window: A and offsetA */
|
||
|
||
A = 0;
|
||
max = 0;
|
||
|
||
for (n = 0; n <= 0x1F; ++n) {
|
||
offset = ((offsetB >> 3) + n) << 3;
|
||
count = 0;
|
||
|
||
for (p = input; p < input_end; ++p) {
|
||
if (*p - offset <= 0xF && !is_ldh(*p)) ++count;
|
||
}
|
||
|
||
if (count > max) {
|
||
A = n;
|
||
offsetA = offset;
|
||
max = count;
|
||
}
|
||
}
|
||
|
||
/* 4) Determine the most populous 20k-window: C */
|
||
|
||
C = 0;
|
||
max = 0;
|
||
|
||
for (pp = input; pp < input_end; ++pp) {
|
||
count = 0;
|
||
n = *pp >> 11;
|
||
offset = n << 11;
|
||
|
||
for (p = input; p < input_end; ++p) {
if (*p - offset <= 0x4FFF && !is_ldh(*p)) ++count;
}

if (count > max || (count == max && n < C)) {
C = n;
max = count;
}
|
||
}
|
||
|
||
/* 5) Determine the style to use: wide or narrow */
|
||
|
||
/* if narrow style were used: */
|
||
|
||
offsetC = (offsetB >> 12) << 12;
|
||
count = 3 + (B > 0xFF);
|
||
|
||
for (p = input; p < input_end; ++p) {
|
||
if (is_ldh(*p)) { }
|
||
else if (*p - offsetA <= 0xF) count += 1;
|
||
else if (*p - offsetB <= 0xFF) count += 2;
|
||
else if (*p - offsetC <= 0xFFF) count += 3;
|
||
else if (*p <= 0xFFFF) count += 4;
|
||
else count += 5;
|
||
}
|
||
|
||
max = count;
|
||
|
||
/* if wide style were used: */
|
||
|
||
offsetC = C << 11;
|
||
count = B <= 0xFF && C <= 0x1F ? 3 : 5;
|
||
|
||
for (p = input; p < input_end; ++p) {
|
||
if (is_ldh(*p)) { }
|
||
else if (*p - offsetB <= 0xFF) count += 2;
|
||
else if (*p - offsetC <= 0x4FFF) count += 3;
|
||
else if (*p <= 0xFFFF) count += 4;
|
||
else count += 5;
|
||
}
|
||
|
||
wide = (count < max);
|
||
|
||
/* 6) Initialize offsetC, and encode the style and offsets: */
|
||
|
||
max_out = *output_size;
|
||
next_out = 0;
|
||
|
||
if (wide) {
|
||
offsetC = C << 11;
|
||
|
||
if (B <= 0xFF && C <= 0x1F) {
|
||
if (max_out - next_out < 3) return amc_ace_output_too_big;
|
||
output[next_out++] = base32[0x10 | (B >> 5)];
|
||
output[next_out++] = base32[B & 0x1F];
|
||
output[next_out++] = base32[C];
|
||
}
|
||
else {
|
||
if (max_out - next_out < 5) return amc_ace_output_too_big;
|
||
output[next_out++] = base32[0x18 | (B >> 10)];
|
||
output[next_out++] = base32[(B >> 5) & 0x1F];
|
||
output[next_out++] = base32[B & 0x1F];
|
||
output[next_out++] = base32[C >> 5];
|
||
output[next_out++] = base32[C & 0x1F];
|
||
}
|
||
}
|
||
else {
|
||
offsetC = (offsetB >> 12) << 12;
|
||
|
||
if (B <= 0xFF) {
|
||
if (max_out - next_out < 3) return amc_ace_output_too_big;
|
||
output[next_out++] = base32[B >> 5];
|
||
output[next_out++] = base32[B & 0x1F];
|
||
}
|
||
else {
|
||
if (max_out - next_out < 4) return amc_ace_output_too_big;
|
||
output[next_out++] = base32[8 | (B >> 10)];
|
||
output[next_out++] = base32[(B >> 5) & 0x1F];
|
||
output[next_out++] = base32[B & 0x1F];
|
||
}
|
||
|
||
output[next_out++] = base32[A];
|
||
}
|
||
|
||
/* 7) Main encoding loop: */
|
||
|
||
literal = 0;
|
||
|
||
for (next_in = 0; next_in < input_length; ++next_in) {
|
||
codept = input[next_in];
|
||
|
||
if (codept == 45 /* hyphen-minus */) {
|
||
/* case 7.1 */
|
||
if (max_out - next_out < 2) return amc_ace_output_too_big;
|
||
output[next_out++] = 45;
|
||
output[next_out++] = 45;
|
||
continue;
|
||
}
|
||
|
||
if (is_ldh(codept)) {
|
||
/* case 7.2 */
|
||
if (!literal) {
|
||
if (max_out - next_out < 1) return amc_ace_output_too_big;
|
||
output[next_out++] = 45;
|
||
literal = 1;
|
||
}
|
||
|
||
if (max_out - next_out < 1) return amc_ace_output_too_big;
|
||
output[next_out++] = codept;
|
||
continue;
|
||
}
|
||
|
||
/* case 7.3 */
|
||
|
||
if (literal) {
|
||
if (max_out - next_out < 1) return amc_ace_output_too_big;
|
||
output[next_out++] = 45;
|
||
literal = 0;
|
||
}
|
||
|
||
if (!wide) {
|
||
diff = codept - offsetA;
|
||
|
||
if (diff <= 0xF) {
|
||
/* case 7.3.1 */
|
||
codelen = 1;
|
||
goto encoder_base32_bottom;
|
||
}
|
||
}
|
||
|
||
diff = codept - offsetB;
|
||
|
||
if (diff <= 0xFF) {
|
||
/* case 7.3.2 */
|
||
codelen = 2;
|
||
goto encoder_base32_bottom;
|
||
}
|
||
|
||
diff = codept - offsetC;
|
||
|
||
if (diff <= 0xFFF) {
|
||
/* case 7.3.3 */
|
||
codelen = 3;
|
||
goto encoder_base32_bottom;
|
||
}
|
||
|
||
if (wide) {
|
||
diff = codept - offsetC - 0x1000;
|
||
|
||
if (diff <= 0x3FFF) {
|
||
/* case 7.3.4 */
|
||
codelen = 1;
|
||
morebits = diff & 0x3FF;
|
||
diff >>= 10;
|
||
goto encoder_base32_bottom;
|
||
}
|
||
}
|
||
|
||
if (codept <= 0xFFFF) {
|
||
/* case 7.3.5 */
|
||
diff = codept;
|
||
codelen = 4;
|
||
goto encoder_base32_bottom;
|
||
}
|
||
|
||
/* case 7.3.6 */
|
||
diff = codept - 0x10000;
|
||
codelen = 5;
|
||
|
||
encoder_base32_bottom: /* output diff as codelen base-32 characters: */
|
||
if (max_out - next_out < codelen) return amc_ace_output_too_big;
|
||
i = codelen - 1;
|
||
c = base32[diff & 0xF];
|
||
if (uppercase_flags && uppercase_flags[next_in]) c -= 32;
|
||
output[next_out + i] = c;
|
||
|
||
while (i > 0) {
|
||
diff >>= 4;
|
||
output[next_out + --i] = base32[0x10 | (diff & 0xF)];
|
||
}
|
||
|
||
next_out += codelen;
|
||
|
||
if (wide && codelen == 1) {
|
||
/* case 7.3.4 */
|
||
if (max_out - next_out < 2) return amc_ace_output_too_big;
|
||
output[next_out++] = base32[morebits >> 5];
|
||
output[next_out++] = base32[morebits & 0x1F];
|
||
}
|
||
}
|
||
|
||
/* null terminator: */
|
||
if (max_out - next_out < 1) return amc_ace_output_too_big;
|
||
output[next_out++] = 0;
|
||
*output_size = next_out;
|
||
return amc_ace_success;
|
||
}
|
||
|
||
|
||
/* Decoder: */
|
||
|
||
int amc_ace_m_decode(
|
||
enum case_sensitivity case_sensitivity,
|
||
unsigned char *scratch_space,
|
||
const unsigned char *input,
|
||
unsigned int *output_length,
|
||
u_code_point *output,
|
||
unsigned char *uppercase_flags )
|
||
{
|
||
unsigned int literal, wide, large; /* boolean */
|
||
const unsigned char *next_in;
|
||
unsigned char c;
|
||
unsigned int next_out, max_out, codelen, input_size, scratch_size;
|
||
u_code_point q, B, offsets[6], diff, offset;
|
||
enum amc_ace_status status;
|
||
|
||
/* 1) Decode the style and offsets: */
|
||
|
||
next_in = input;
|
||
q = base32_decode(*next_in++);
|
||
if (q == base32_invalid) return amc_ace_invalid_input;
|
||
wide = q >> 4;
|
||
large = (q >> 3) & 1;
|
||
B = q & 7;
|
||
q = base32_decode(*next_in++);
|
||
if (q == base32_invalid) return amc_ace_invalid_input;
|
||
B = (B << 5) | q;
|
||
|
||
if (large) {
|
||
q = base32_decode(*next_in++);
|
||
if (q == base32_invalid) return amc_ace_invalid_input;
|
||
B = (B << 5) | q;
|
||
}
|
||
|
||
/* offsets[codelen] is for base-32 codes with codelen characters */
|
||
/* (not counting the extra two in wide-style 0xxxx xxxxx xxxxx) */
|
||
|
||
offsets[2] = B >> 3 == 0x1B ? special_row_offset[B & 7] : B << 8;
|
||
q = base32_decode(*next_in++);
|
||
if (q == base32_invalid) return amc_ace_invalid_input;
|
||
|
||
if (!wide) {
|
||
offsets[1] = ((offsets[2] >> 3) + q) << 3;
|
||
offsets[3] = (offsets[2] >> 12) << 12;
|
||
}
|
||
else {
|
||
offset = q;

if (large) {
q = base32_decode(*next_in++);
if (q == base32_invalid) return amc_ace_invalid_input;
offset = (offset << 5) | q;
}

/* offset now holds C; the 20k-window starts at C << 11, */
/* matching offsetC = C << 11 in the encoder.            */
offset <<= 11;
offsets[3] = offset;
offsets[1] = offset + 0x1000;
|
||
}
|
||
|
||
offsets[4] = 0;
|
||
offsets[5] = 0x10000;
|
||
|
||
/* 2) Main decoding loop: */
|
||
|
||
max_out = *output_length;
|
||
next_out = 0;
|
||
literal = 0;
|
||
|
||
for (;;) {
|
||
c = *next_in++;
|
||
if (!c) break;
|
||
|
||
if (c == 45 /* hyphen-minus */) {
|
||
if (*next_in == 45) {
|
||
/* case 2.1: "--" decodes to "-" */
|
||
++next_in;
|
||
if (max_out - next_out < 1) return amc_ace_output_too_big;
|
||
if (uppercase_flags) uppercase_flags[next_out] = 0;
|
||
output[next_out++] = 45;
|
||
continue;
|
||
}
|
||
|
||
/* case 2.2: unpaired hyphen-minus toggles mode */
|
||
literal = !literal;
|
||
continue;
|
||
}
|
||
|
||
if (!is_ldh(c)) return amc_ace_invalid_input;
|
||
if (max_out - next_out < 1) return amc_ace_output_too_big;
|
||
|
||
if (literal) {
|
||
/* case 2.3: literal letter/digit */
|
||
if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(c);
|
||
output[next_out++] = c;
|
||
continue;
|
||
}
|
||
|
||
/* case 2.4: base-32 sequence */
|
||
|
||
diff = 0;
|
||
codelen = 1;
|
||
|
||
for (;;) {
|
||
q = base32_decode(c);
|
||
if (q == base32_invalid) return amc_ace_invalid_input;
|
||
diff = (diff << 4) | (q & 0xF);
|
||
if ((q & 0x10) == 0) break;
|
||
if (++codelen > 5) return amc_ace_invalid_input;
|
||
c = *next_in++;
|
||
}
|
||
|
||
/* Now codelen is the number of input characters read, */
|
||
/* and c is the character holding the uppercase flag. */
|
||
|
||
if (wide && codelen == 1) {
|
||
q = base32_decode(*next_in++);
|
||
if (q == base32_invalid) return amc_ace_invalid_input;
|
||
diff = (diff << 5) | q;
|
||
q = base32_decode(*next_in++);
|
||
if (q == base32_invalid) return amc_ace_invalid_input;
|
||
diff = (diff << 5) | q;
|
||
}
|
||
|
||
offset = offsets[codelen];
|
||
if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(c);
|
||
output[next_out++] = offset + diff;
|
||
}
|
||
|
||
/* 3) Re-encode the output and compare to the input: */
|
||
|
||
input_size = next_in - input;
|
||
scratch_size = input_size;
|
||
status = amc_ace_m_encode(next_out, output, uppercase_flags,
|
||
&scratch_size, scratch_space);
|
||
if (status != amc_ace_success ||
|
||
scratch_size != input_size ||
|
||
unequal(case_sensitivity, scratch_space, input, input_size)
|
||
) return amc_ace_invalid_input;
|
||
*output_length = next_out;
|
||
return amc_ace_success;
|
||
}
|
||
|
||
|
||
/******************************************************************/
|
||
/* Wrapper for testing (would normally go in a separate .c file): */
|
||
|
||
#include <assert.h>
|
||
#include <stdio.h>
|
||
#include <stdlib.h>
|
||
#include <string.h>
|
||
|
||
/* For testing, we'll just set some compile-time limits rather than */
|
||
/* use malloc(), and set a compile-time option rather than using a */
|
||
/* command-line option. */
|
||
|
||
enum {
|
||
unicode_max_length = 256,
|
||
ace_max_size = 256,
|
||
test_case_sensitivity = case_insensitive
|
||
};
|
||
|
||
|
||
static void usage(char **argv)
|
||
{
|
||
fprintf(stderr,
|
||
"%s -e reads big-endian UTF-32 and writes AMC-ACE-M ASCII.\n"
|
||
"%s -d reads AMC-ACE-M ASCII and writes big-endian UTF-32.\n"
|
||
"UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n"
|
||
, argv[0], argv[0]);
|
||
exit(EXIT_FAILURE);
|
||
}
|
||
|
||
|
||
static void fail(const char *msg)
|
||
{
|
||
fputs(msg,stderr);
|
||
exit(EXIT_FAILURE);
|
||
}
|
||
|
||
static const char too_large[] =
|
||
"input or output is too large, recompile with larger limits\n";
|
||
|
||
static const char invalid_input[] = "invalid input\n";
|
||
|
||
int main(int argc, char **argv)
|
||
{
|
||
enum amc_ace_status status;
|
||
|
||
if (argc != 2) usage(argv);
|
||
if (argv[1][0] != '-') usage(argv);
|
||
if (argv[1][2] != '\0') usage(argv);
|
||
|
||
if (argv[1][1] == 'e') {
|
||
u_code_point input[unicode_max_length];
|
||
unsigned char uppercase_flags[unicode_max_length];
|
||
unsigned char output[ace_max_size];
|
||
unsigned int input_length, output_size;
|
||
int c0, c1, c2, c3;
|
||
|
||
/* Read the UTF-32 input string: */
|
||
|
||
input_length = 0;
|
||
|
||
for (;;) {
|
||
c0 = getchar();
|
||
c1 = getchar();
|
||
c2 = getchar();
|
||
c3 = getchar();
|
||
|
||
if (c1 == EOF || c2 == EOF || c3 == EOF) {
|
||
if (c0 != EOF) fail("input not a multiple of 4 bytes\n");
|
||
break;
|
||
}
|
||
|
||
if (input_length == unicode_max_length) fail(too_large);
|
||
|
||
if ((c0 != 0 && c0 != 0x80)
|
||
|| c1 < 0 || c1 > 0x10
|
||
|| c2 < 0 || c2 > 0xFF
|
||
|| c3 < 0 || c3 > 0xFF ) {
|
||
fail(invalid_input);
|
||
}
|
||
|
||
input[input_length] = ((u_code_point) c1 << 16) |
|
||
((u_code_point) c2 << 8) | (u_code_point) c3;
|
||
uppercase_flags[input_length] = (c0 >> 7);
|
||
++input_length;
|
||
}
|
||
|
||
/* Encode, and output the result: */
|
||
|
||
output_size = ace_max_size;
|
||
status = amc_ace_m_encode(input_length, input, uppercase_flags,
|
||
&output_size, output);
|
||
if (status == amc_ace_invalid_input) fail(invalid_input);
|
||
if (status == amc_ace_output_too_big) fail(too_large);
|
||
assert(status == amc_ace_success);
|
||
fputs((char *) output, stdout);
|
||
return EXIT_SUCCESS;
|
||
}
|
||
|
||
if (argv[1][1] == 'd') {
|
||
unsigned char input[ace_max_size], scratch[ace_max_size];
|
||
u_code_point output[unicode_max_length], codept;
|
||
unsigned char uppercase_flags[unicode_max_length];
|
||
unsigned int output_length, i;
|
||
size_t n;
|
||
|
||
/* Read the AMC-ACE-M ASCII input string: */
|
||
|
||
n = fread(input, 1, ace_max_size, stdin);
|
||
if (n == ace_max_size) fail(too_large);
|
||
input[n] = 0;
|
||
|
||
/* Decode, and output the result: */
|
||
|
||
output_length = unicode_max_length;
|
||
status = amc_ace_m_decode(test_case_sensitivity, scratch, input,
|
||
&output_length, output, uppercase_flags);
|
||
if (status == amc_ace_invalid_input) fail(invalid_input);
|
||
if (status == amc_ace_output_too_big) fail(too_large);
|
||
assert(status == amc_ace_success);
|
||
|
||
for (i = 0; i < output_length; ++i) {
|
||
putchar(uppercase_flags[i] ? 0x80 : 0);
|
||
codept = output[i];
|
||
putchar(codept >> 16);
|
||
putchar((codept >> 8) & 0xFF);
|
||
putchar(codept & 0xFF);
|
||
}
|
||
|
||
return EXIT_SUCCESS;
|
||
}
|
||
|
||
usage(argv);
|
||
return EXIT_SUCCESS; /* not reached, but quiets a compiler warning */
|
||
}
|
||
|
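The test wrapper's -e mode reads four bytes per code point, with the
force-to-uppercase flag carried in the top bit of the first byte. The
following is a minimal sketch in C of how such input could be
prepared. The name pack_utf32be_flagged() is hypothetical and is not
part of the listing.

```c
/* Hypothetical helper, not part of amc-ace-m.c.  Packs one code     */
/* point into the four bytes that the test wrapper reads with -e.    */
/* Byte 0 is 0x80 when the force-to-uppercase flag is set and 0      */
/* otherwise; the remaining three bytes hold the code point, most    */
/* significant byte first.  Returns 0 on success, or -1 if cp is     */
/* above 0x10FFFF.                                                   */
int pack_utf32be_flagged(unsigned long cp, int uppercase,
                         unsigned char out[4])
{
  if (cp > 0x10FFFFUL) return -1;

  out[0] = uppercase ? 0x80 : 0x00;
  out[1] = (unsigned char) ((cp >> 16) & 0xFF);
  out[2] = (unsigned char) ((cp >> 8) & 0xFF);
  out[3] = (unsigned char) (cp & 0xFF);
  return 0;
}
```

For example, U+041F with the flag set packs to the bytes 80 00 04 1F.
Writing the packed bytes for each code point to a file, for instance
with fwrite(), yields input that the wrapper accepts on standard
input.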
||
|
||
|
||
INTERNET-DRAFT expires 2001-Aug-12
|