a580a835b8
* xraydict functionality and usage improvements
  Add a filter_function to xraydict, allowing fewer big data structures. Make uses of xraydict prefer exclusion sets to exclusion lists, to avoid repeated linear search of a list.
* Make `big5_coded_forms_from_hkscs` a set, remove set trailing commas.
* Remove `big5_coded_forms_from_hkscs` in favour of a filter function.
* Similarly, use sets for 7-bit exclusion lists except when really short.
* Revise mappings for seven 78JIS codepoints. Mappings for 25-23 and 90-22 were previously the same as those used for 97JIS; they have been swapped to correspond with how the IBM extension versus the standard code are mapped in the "old sequence" (78JIS-based) as opposed to the "new sequence". Mappings for 32-70, 34-45, 35-29, 39-77 and 54-02 in 78JIS have been changed to reflect disunifications made in 2000-JIS and 2004-JIS, assigning the 1978-edition unsimplified variants of those characters separate coded forms (where previously, only swaps and disunifications in 83JIS and disunifications in 90JIS (including JIS X 0212) had been considered). This only affects the `jis_encoding` codec (including the decoding direction for `iso-2022-jp-2`, `iso-2022-jp-3` and `iso-2022-jp-2004`), and the decoding is only affected when `ESC $ @` (not `ESC $ B`) is used. The `iso-2022-jp` codec is unaffected, and remains similar to (but more consistently pedantic than) the WHATWG specification, thus using the same table for both 78JIS and 97JIS.
* Make `johab-ebcdic` decoder use many-to-one, not corporate PUA. Many-to-one decodes are not uncommon in CJK encodings (e.g. Windows-31J), and mapping to the IBM Corporate PUA (code page 1449) would probably make it render as completely the wrong character if at all in practice.
* Switch `cp950_no_eudc_encoding_map` away from a hardcoded exclusion list.
* Codec support for `x-mac-korean`.
* Add a test bit for the UTF-8 wrapper.
* Document the unique error-condition definition of the ISO-2022-JP codec.
* Update docs now there is an actual implementation for `x-mac-korean`.
* Further explanations of the hazards of `jis_encoding`.
* Sanitised → Sanitised or escaped.
* Further clarify the status with not verifying Shift In.
* Corrected description of End State 2.
* Changes to MacKorean to avoid mapping non-ASCII using ASCII punctuation.
* Extraneous word "still".
* Fix omitting MacKorean single-byte codes.
190 lines
14 KiB
Python
"""@brief Convert a string to and from various encodings.

The basic supported encodings are roughly as specified in the [WHATWG Encoding Standard
](https://encoding.spec.whatwg.org/), but more are also supported unless restriction to web
encodings is explicitly specified.

Most encodings supported by Python are implemented, but not currently `idna` or `punycode`. Note
however that Python makes `x-mac-japanese` an alias of `shift_jis`; this has not been done here.
Also note that the behaviour in regards to association of encoding names with variants is somewhat
different to Python's, partly due to following WHATWG: this affects most CJK codecs (e.g. Python
treats `shift_jis` and `ms-kanji` differently, while this package does not), but also e.g.
"ISO-8859-1".

Main entry points for the package are `codecs.infrastructure.encode`, `codecs.infrastructure.decode`
and `codecs.infrastructure.lookup`, all three of which are also available as e.g. `codecs.encode`
for convenience.
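
As a rough illustration of the calling pattern, the sketch below runs under Python's standard
`codecs` module, which has functions with the same three names. That this package's entry points
take the same positional arguments (data, label, optional error handler) is an assumption, and the
association of labels with variants differs from Python's as noted above.

```python
import codecs

# Python's standard library stands in for this package here (an assumption).
text = "caf" + chr(0xE9)

encoded = codecs.encode(text, "windows-1252")     # str in, bytes out
decoded = codecs.decode(encoded, "windows-1252")  # bytes in, str out
info = codecs.lookup("windows-1252")              # codec lookup by label

print(encoded, decoded == text, info.name)
```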

The list of codecs (not an exhaustive list of labels, nor close to one) is as follows.

### Single-byte extended ASCII encodings:

|Major label(s)|Meaning|
|---|---|
|`cp437`|8-bit United States (DOS)|
|`cp720`|8-bit Arabic Letters and Box Drawing (DOS)|
|`cp737`|8-bit Greek and Box Drawing (DOS)|
|`cp775`|8-bit Baltic Rim (DOS)|
|`cp850`|8-bit Western Europe and Canada (DOS)|
|`cp852`|8-bit Central European (DOS)|
|`cp855`|8-bit Balkan Cyrillic (DOS)|
|`cp856`|8-bit Hebrew (DOS)|
|`cp857`|8-bit Turkish (DOS)|
|`cp858`|8-bit Western Europe and Canada with Euro (DOS)|
|`cp860`|8-bit European Portuguese (DOS)|
|`cp861`|8-bit Icelandic (DOS)|
|`cp862`|8-bit Hebrew and Box Drawing (DOS)|
|`cp863`|8-bit Quebecois French (DOS)|
|`cp864`|8-bit Arabic Positional Forms (DOS)|
|`cp865`|8-bit Continental Nordic (DOS)|
|`cp866`, `ibm866`|8-bit Russian Cyrillic (DOS)|
|`cp869`|8-bit Greek (DOS)|
|`cp1006`|8-bit Urdu|
|`cp1125`|8-bit Ukrainian Cyrillic (DOS)|
|`ecma-43-dv`, `cp367`, `csascii`|"8-bit Plain ASCII", i.e. ASCII without backspace composition, and with high bit unused. Note: most ASCII labels are mapped to Windows-1252, per WHATWG.|
|`hp-roman8`|8-bit Roman (HP)|
|`iso-8859-2`|8-bit Central European (ISO)|
|`iso-8859-3`|8-bit South European (Maltese/Esperanto)|
|`iso-8859-4`|8-bit North European|
|`iso-8859-5`|8-bit Cyrillic (ISO)|
|`iso-8859-6`|8-bit Arabic (ASMO/ISO)|
|`iso-8859-7`|8-bit Greek (ISO)|
|`iso-8859-8`, `iso-8859-8-i`|8-bit Hebrew (without vowel points). Although some, but not all, of the labels using this mapping request legacy visual-order behaviour (e.g. `iso-8859-8`, `iso-8859-8-e` or even `visual`, but not e.g. `iso-8859-8-i`), bidirectional conversion for any given markup format is beyond the scope of this package: determining from the label whether legacy visual-order behaviour should be used, and responding if so, should be implemented separately if needed.|
|`iso-8859-10`|8-bit Nordic|
|`iso-8859-13`|8-bit Baltic Rim (ISO)|
|`iso-8859-14`|8-bit Celtic|
|`iso-8859-15`|8-bit New Western European|
|`iso-8859-16`|8-bit South-Eastern European (ISO)|
|`koi8-r`|8-bit Russian Cyrillic (KOI8)|
|`koi8-u`, `koi8-ru`|8-bit Ruthenian/Ukrainian/Belarusian Cyrillic (KOI8)|
|`koi8-t`|8-bit Tajik Cyrillic|
|`kz1048`|8-bit Kazakh Cyrillic|
|`macintosh`|8-bit Roman (Macintosh)|
|`palmos`|PalmOS code page|
|`ptcp154`|8-bit Asian Cyrillic (Paratype)|
|`windows-874`, `iso-8859-11`, `tis-620`, `cp874`|8-bit Thai|
|`windows-1250`|8-bit Central European (Windows)|
|`windows-1251`|8-bit Cyrillic (Windows)|
|`windows-1252`, `ascii`, `iso-8859-1`, `latin1`|8-bit Western European. This is in accordance with WHATWG specification _in re_ which mappings to associate with which labels. Note: Python's `latin1` is sometimes used to round-trip arbitrary _sensu stricto_ extended ASCII data; in Kuroko, it is better to use `x-user-defined` for that.|
|`windows-1253`|8-bit Greek (Windows)|
|`windows-1254`, `iso-8859-9`|8-bit Turkish|
|`windows-1255`|8-bit Hebrew (logical with vowel points)|
|`windows-1256`|8-bit Arabic (Windows)|
|`windows-1257`|8-bit Baltic Rim (Windows)|
|`windows-1258`|8-bit Vietnamese (Windows). Basic codec: encoder will accept text in the form generated by the decoder, but neither NFC nor NFD normalised forms. This follows both Python and WHATWG behaviour. Conversion of text in NFC or NFD forms to encodable form may need to be done in a separate step before using the encoder.|
|`x-mac-arabic`|8-bit Arabic (Macintosh)|
|`x-mac-ce`|8-bit Central European (Macintosh)|
|`x-mac-croatian`|8-bit Gajica|
|`x-mac-cyrillic`|8-bit Cyrillic (Macintosh)|
|`x-mac-farsi`|8-bit Persian (Macintosh)|
|`x-mac-greek`|8-bit Greek (Macintosh)|
|`x-mac-icelandic`|8-bit Icelandic (Macintosh)|
|`x-mac-romanian`|8-bit Romanian (Macintosh)|
|`x-mac-turkish`|8-bit Turkish (Macintosh)|
|`x-user-defined`|8-bit User Defined (ASCII based variant: using U+0000–007F, U+F780–F7FF)|
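
The `x-user-defined` mapping in the last row is simple enough to sketch directly: bytes below 0x80
map to the code point of the same value, and bytes 0x80 to 0xFF map to U+F780 to U+F7FF, which is
why it can round-trip arbitrary byte strings. The sketch is in Python with hypothetical function
names; it illustrates the mapping only and is not this package's implementation.

```python
def user_defined_decode(data):
    # 0x00-0x7F map to themselves; 0x80-0xFF map to U+F780-U+F7FF.
    return "".join(chr(b) if b < 0x80 else chr(0xF700 + b) for b in data)

def user_defined_encode(text):
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif 0xF780 <= cp <= 0xF7FF:
            out.append(cp - 0xF700)
        else:
            raise ValueError("not encodable: U+%04X" % cp)
    return bytes(out)

# Every possible byte value survives a decode/encode round trip.
raw = bytes(range(256))
assert user_defined_encode(user_defined_decode(raw)) == raw
```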

### Single-byte symbol or dingbat font encodings:

|Major label(s)|Meaning|
|---|---|
|`cp042`|8-bit User Defined (variant using U+0000–001F, U+F020–F0FF). Windows uses that mapping for symbol fonts in some contexts.|

### 8-bit multi-byte Unicode codecs:

|Major label(s)|Meaning|
|---|---|
|`cesu-8`, `utf8mb3`, `utf8-ucs2`|CESU-8 (to UTF-16 as UTF-8 is to UTF-32). Mostly for interoperability with existing systems that use it.|
|`gb18030`|Chinese GB18030, WHATWG version. Not technically a full UTF in this implementation, since one PUA character is changed to an ideographic space per WHATWG.|
|`utf-8`, `utf8mb4`, `utf8-ucs4`|UTF-8 without a byte order mark|
|`utf-8-sig`|UTF-8 with a byte order mark|
|`utf-16`|UTF-16 with byte order mark, little endian if missing|
|`utf-16be`|UTF-16, big endian, no byte order mark|
|`utf-16le`|UTF-16, little endian, no byte order mark|
|`utf-32`|UTF-32 with byte order mark (though byte order can usually also be detected in its absence)|
|`utf-32be`|UTF-32, big endian, no byte order mark|
|`utf-32le`|UTF-32, little endian, no byte order mark|
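
The description of CESU-8 as being "to UTF-16 as UTF-8 is to UTF-32" can be made concrete with a
short Python sketch: each UTF-16 code unit, including each half of a surrogate pair, is written
out using the UTF-8 bit layout. The function name is hypothetical and this is an illustration of
the format, not this package's encoder.

```python
def cesu8_encode(text):
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        # Rebuild one 16-bit code unit, then apply the UTF-8 bit layout to it.
        unit = (utf16[i] << 8) | utf16[i + 1]
        if unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Surrogates land here, so a non-BMP character takes 3 + 3 bytes.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)

# Within the Basic Multilingual Plane the result is identical to UTF-8.
bmp = "abc" + chr(0xE9) + chr(0x3042)
assert cesu8_encode(bmp) == bmp.encode("utf-8")
```

For a character outside the Basic Multilingual Plane, the result is six bytes (two encoded
surrogates) where UTF-8 would use four.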

### 8-bit multi-byte legacy CJK codecs:

|Major label(s)|Meaning|
|---|---|
|`big5`, `big5-eten`|Traditional Chinese Big-5, ETen version, condoning HKSCS extensions when decoding.|
|`big5-hkscs`|Traditional Chinese Big-5 with HKSCS extensions in both directions.|
|`big5-nonetenkana`, `big5-tw`|Traditional Chinese Big-5, with BIG5.TXT (non-ETen) layout for kana and Cyrillic.|
|`euc-jp`, `x-euc-jp`|Japanese EUC-JP, with Microsoft extensions, permitting JIS X 0212 only when decoding.|
|`euc-jp-full`|Japanese EUC-JP, with Microsoft extensions, permitting JIS X 0212 in both directions.|
|`euc-jisx0213`, `euc-jis-2004`|Japanese EUC-JP, with JIS X 0213 mappings and extensions.|
|`euc-kr`, `uhc`, `windows-949`|Korean Unified Hangul Code (superset of EUC-KR, encodes KS C 5601).|
|`gbk`, `gb2312`|Chinese GBK (GB2312 extension), condoning GB18030 when decoding.|
|`johab`, `johab-ascii`|Korean Johab (ASCII-compatible stateless standard version)|
|`shift_jis`, `ms-kanji`, `windows-31j`|Japanese Shift JIS (Windows compatible version)|
|`shift-jisx0213`, `shift-jis-2004`|Japanese Shift JIS (JIS X 0213 version)|
|`x-mac-chinesesimp`|Simplified Chinese GB2312, Macintosh version|
|`x-mac-chinesetrad`|Traditional Chinese Big5, Macintosh version|
|`x-mac-korean`|Korean HangulTalk (Macintosh encoding, another superset of EUC-KR)|

### 7-bit stateful codecs:

|Major label(s)|Meaning|
|---|---|
|`hz-gb-2312`|HZ (Usenet Simplified Chinese) encoding|
|`iso-2022-cn`|7-bit stateful Chinese (Simplified and Traditional)|
|`iso-2022-jp`|7-bit stateful Japanese, web version|
|`iso-2022-jp-ext`|7-bit stateful Japanese, including JIS X 0212 and preserving katakana width|
|`iso-2022-jp-1`|7-bit stateful Japanese, including JIS X 0212|
|`iso-2022-jp-2`|7-bit stateful Multilingual (Japanese, Korean, Greek, Simplified Chinese, Western European)|
|`iso-2022-jp-3`|7-bit stateful Japanese, including JIS X 0213 (2000 edition format)|
|`iso-2022-jp-2004`|7-bit stateful Japanese, including JIS X 0213 (2004 edition format)|
|`iso-2022-kr`|7-bit stateful Korean|
|`jis_encoding`|7-bit stateful Japanese, comprehensive version|
|`utf-7`|A largely obsolete scheme for mixing ASCII and Base64'd UTF-16BE in e-mail. Included mostly for Python parity.|

### EBCDIC codecs:

|Major label(s)|Meaning|
|---|---|
|`cp037`|EBCDIC Default (United States, Netherlands, Portugal, Brazil, Australia, New Zealand, Canadian ESA/390)|
|`cp273`|EBCDIC German|
|`cp424`|EBCDIC Hebrew|
|`cp500`|EBCDIC "International" (Belgium, Switzerland, Canadian AS/400)|
|`cp875`|EBCDIC Greek|
|`cp933`, `ibm-933`, `ibm-1364`, `johab-ebcdic`|EBCDIC Korean (Johab, IBM stateful version for EBCDIC)|
|`cp1026`|EBCDIC Turkish|
|`cp1140`|EBCDIC with Euro Sign|

### Codecs with unusual behaviour:

|Major label(s)|Meaning|
|---|---|
|`inverse-base64`|Base64 with inverse semantics to preserve type correctness (encoder reads, decoder creates). Error handler is ignored.|
|`inverse-base64hqx`|Same, but using the BinHex4 alphabet (note: does not in and of itself create the BinHex4 *format*)|
|`inverse-base64uu`|Same, but using the uuencode alphabet (note: does not in and of itself create the uuencode *format*)|
|`inverse-quopri`|Quoted-Printable, with inverse semantics (encoder reads, decoder creates). Error handler is ignored.|
|`japanese`|Attempts to detect the encoding of a Japanese document (like the unified "Japanese" option now offered by some browsers' encoding override menus), and raises `ValueError` if it cannot. Not intended to be used in the encode direction, but will behave as `utf-8-sig` in that case.|
|`undefined`, `replacement`|Represents data for which encoding/decoding must not be attempted. Following WHATWG (and differing from Python), error handlers are accepted, though only by the decoder: the encoder will ignore them.|
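
One reading of the "inverse semantics" described for `inverse-base64` is sketched below in Python:
since an encoder takes a string and yields bytes, it is the encoder that reads Base64 text, and
the decoder that creates it. The function names are hypothetical, Python's `base64` module stands
in for the real codec, and details such as line wrapping or padding tolerance are not covered.

```python
import base64

def inverse_base64_encode(text):
    # "Encoder reads": Base64 text (str) in, the bytes it represents out.
    return base64.b64decode(text)

def inverse_base64_decode(data):
    # "Decoder creates": arbitrary bytes in, Base64 text (str) out.
    return base64.b64encode(data).decode("ascii")

payload = bytes([0x00, 0xFF, 0x10, 0x80])
assert inverse_base64_encode(inverse_base64_decode(payload)) == payload
```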

### Notes on error conditions in the ISO-2022-JP family

Like most codecs, the ISO-2022-JP family will generate errors in place of sequences which it cannot interpret. However, they will also generate errors over certain sequences which have no immediate effect on the output stream, so as to prevent them being used to mask syntax that would otherwise be sanitised or escaped (with `errors="replace"`, this will cause a `U+FFFD` to be inserted, thus preserving the interruption in any syntax). That this happens at all is per WHATWG; the specific circumstances in which it happens, however, deliberately vary somewhat from the WHATWG specification.
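
To make "no immediate effect" concrete, the following Python sketch flags two simple kinds of zero-effect escape in plain ISO-2022-JP (the four designations of RFC 1468): one that designates the set which is already designated, and one that is superseded by another escape before any byte is encoded under it. It is a deliberately simplified illustration with a hypothetical function name; it does not reproduce the exact rules applied by this package's codecs.

```python
ESC = 0x1B

# The four designations of plain ISO-2022-JP (RFC 1468), keyed by the two
# bytes that follow ESC.
SETS = {
    (0x28, 0x42): "ASCII",            # ESC ( B
    (0x28, 0x4A): "JIS-Roman",        # ESC ( J
    (0x24, 0x40): "JIS C 6226-1978",  # ESC $ @
    (0x24, 0x42): "JIS X 0208-1983",  # ESC $ B
}

def zero_effect_escapes(data):
    # Returns the offsets of escapes that change nothing about the output.
    active = "ASCII"   # a stream starts with ASCII designated
    unused = None      # offset of the latest escape, until a byte follows it
    flagged = []
    i = 0
    while i < len(data):
        if data[i] != ESC:
            unused = None
            i += 1
            continue
        key = tuple(data[i + 1:i + 3])
        if key not in SETS:
            raise ValueError("unrecognised escape at offset " + str(i))
        if unused is not None:
            flagged.append(unused)   # superseded before encoding anything
        if SETS[key] == active:
            flagged.append(i)        # re-designates the active set
            unused = None
        else:
            active = SETS[key]
            unused = i
        i += 3
    return flagged

# Double-byte, then ASCII immediately followed by double-byte again:
# the ASCII designation at offset 5 encodes nothing.
sample = bytes.fromhex("1b244224221b28421b244224221b2842")
print(zero_effect_escapes(sample))
```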

The `iso-2022-jp` codec is somewhat more pedantic than the WHATWG approach, and is intended to accept a strict subset of what the WHATWG approach accepts (but a superset of what the WHATWG and Python codecs *generate* for a single non-concatenated stream), excluding cases that are unlikely to occur in reality except as masking sequences. The WHATWG approach follows UTR #36 in forbidding exactly those cases which RFC 1468 does not permit, meaning that it forbids some cases that often occur in reality as a result of concatenation and are usually benign, and also permits some cases that are less likely to occur in reality and less likely to be benign. The `jis_encoding` codec (which is also used for decoding, but not for encoding, the `iso-2022-jp*` labels except for `iso-2022-jp` itself) permits the cases that result from concatenation but otherwise behaves the same as the `iso-2022-jp` codec in this regard; this, however, means that Shift Out and Shift In, which are not interpreted by the `iso-2022-jp` codec but are by the `jis_encoding` codec, are not currently checked for zero-effect use.

In reaction to the WHATWG approach, [UTC L2/20-202](https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf) defines two "end states", the first of which forbids no such cases, and the second of which forbids certain additional cases while also being more lenient on the WHATWG-forbidden cases that actually occur in practice. However, end state 2 forbids the ordinary output of Python's `iso-2022-jp` codec under certain circumstances (see bold in table below), and still misses some obvious zero-effect switching sequences (notably, it permits ASCII→ASCII switches anywhere). Accordingly, end state 2 is not followed exactly by either codec.

All of which having been said, this is purely for the sake of theoretical consistency, and **absolutely should not** be used as an excuse to sanitise or escape text while it is encoded as ISO-2022-JP (or any other stateful encoding). Accommodating established and plausible variation in encoders means that some masking sequences may still be possible; furthermore, the `jis_encoding` decoder does not penalise zero-effect Shift In characters. All sanitisation or escaping of data received in ISO-2022-JP must be carried out over the Unicode stream. Furthermore, using the ISO-2022-JP family or other stateful encodings inside an ASCII-delimited structure such as JSON should be avoided if possible; even if it cannot be avoided, one must not substitute an untrusted ISO-2022-JP\* or JIS\_Encoding sequence straight into an ASCII structure without verifying that it in fact returns to Shift In state (if applicable), with ASCII designated, at the end of the sequence, with no trailing single-shifts.
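
Both precautions can be sketched in Python under stated assumptions. The first function checks only the final designation of a plain ISO-2022-JP (RFC 1468) byte string; it does not track Shift Out, Shift In or single-shifts, which the other members of the family would additionally require. The second decodes first and escapes afterwards, using Python's standard `iso2022_jp` codec and `html.escape` as stand-ins for whichever decoder and escaping routine are actually in use. All function names are hypothetical.

```python
import html

ESC = 0x1B

# Designations defined for plain ISO-2022-JP by RFC 1468.
DESIGNATIONS = {
    bytes([ESC, 0x28, 0x42]): "ascii",            # ESC ( B
    bytes([ESC, 0x28, 0x4A]): "jis-roman",        # ESC ( J
    bytes([ESC, 0x24, 0x40]): "jis-c-6226-1978",  # ESC $ @
    bytes([ESC, 0x24, 0x42]): "jis-x-0208-1983",  # ESC $ B
}

def final_designation(data):
    # Walk the escape sequences and report which set is designated at the end.
    state = "ascii"
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = bytes(data[i:i + 3])
            if seq not in DESIGNATIONS:
                raise ValueError("unrecognised escape at offset " + str(i))
            state = DESIGNATIONS[seq]
            i += 3
        else:
            i += 1
    return state

def ends_in_ascii(data):
    # Only then can ASCII delimiters following the data keep their meaning.
    return final_designation(data) == "ascii"

def decode_then_escape(data):
    # Escaping is applied to the decoded Unicode text, never to the raw bytes.
    return html.escape(data.decode("iso2022_jp"))

untrusted = bytes.fromhex("1b244224221b2842") + b"<b>"
assert ends_in_ascii(untrusted)
assert not ends_in_ascii(bytes.fromhex("1b24422422"))
print(decode_then_escape(untrusted))
```

A sequence that fails `ends_in_ascii` would cause any ASCII delimiter placed after it to be read as part of a double-byte character, which is the hazard described above.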

A summary of the differing behaviours is listed in the table below:

| Approach | DB→ASCII→DB | SB→DB→SB | ASCII→JISCII | JISCII→ASCII | ASCII→ASCII | JISCII→JISCII | DB→JISCII |
|---|---|---|---|---|---|---|---|
| `iso-2022-jp` | No Good | No Good | Only at start/end or next to 5C/7E or C0 control code; generated before 5C/7E | Only at start/end or next to 5C/7E or C0 control code; generated before 5C/7E | Only at start | No Good | Okay |
| `jis_encoding` | Okay | No Good | Only at start/end or next to 5C/7E or C0 control code; generated before 5C/7E | Only at start/end or next to 5C/7E or C0 control code; generated **after** 5C/7E | Only at start | No Good | Okay |
| Python | Okay | Okay | Okay; generated before 5C/7E | Okay; generated **after** 5C/7E | Okay | Okay | Okay |
| WHATWG | No Good | No Good | Okay; generated before 5C/7E | Okay; generated before 5C/7E | Okay | Okay | Okay |
| End State 1 | Okay | Okay | Okay | Okay | Okay | Okay | Okay |
| End State 2 | Okay | No Good | Only at end or before 5C/7E | Only at end or **before** 5C/7E | Okay | Only before 5C/7E | Only before 5C/7E |
"""

from codecs.infrastructure import encode, decode, lookup
import codecs.sbencs, codecs.dbdata, codecs.bespokecodecs, codecs.sbextra, codecs.dbextra, codecs.binascii, codecs.pifonts