"""@brief Convert a string to and from various encodings.
The basic supported encodings are roughly as specified in the [WHATWG Encoding Standard
](https://encoding.spec.whatwg.org/), but more are also supported unless restriction to web
encodings is explicitly specified.

Most encodings supported by Python are implemented, but not currently `idna` or `punycode`. Note
however that Python makes `x-mac-japanese` and `x-mac-korean` aliases of `shift_jis` and `euc-kr`;
this has not been done here. Also note that the behaviour with regard to the association of encoding
names with variants is somewhat different to Python's, partly due to following WHATWG: this affects
most CJK codecs (e.g. Python treats `shift_jis` and `ms-kanji` differently, while this package does
not), but also e.g. "ISO-8859-1".

Main entry points for the package are `codecs.infrastructure.encode`, `codecs.infrastructure.decode`
and `codecs.infrastructure.lookup`, all three of which are also available as e.g. `codecs.encode`
for convenience.

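As a rough illustration of the three entry points, the sketch below uses Python's standard-library `codecs` module as a stand-in. The text above only names the functions, so the argument order `(value, label)` is an assumption borrowed from Python, not a statement about this package's signatures.

```python
import codecs  # Python's standard-library module, used here only as a stand-in

# Assumed call shapes: encode(text, label), decode(data, label), lookup(label).
data = codecs.encode("héllo", "utf-8")   # text -> bytes
text = codecs.decode(data, "utf-8")      # bytes -> text
info = codecs.lookup("utf-8")            # find a codec by one of its labels

assert text == "héllo"
```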
The list of codecs (not an exhaustive list of labels, nor close to one) is as follows.

### Single-byte extended ASCII encodings:

|Major label(s)|Meaning|
|---|---|
|`cp437`|8-bit United States (DOS)|
|`cp720`|8-bit Arabic Letters and Box Drawing (DOS)|
|`cp737`|8-bit Greek and Box Drawing (DOS)|
|`cp775`|8-bit Baltic Rim (DOS)|
|`cp850`|8-bit Western Europe and Canada (DOS)|
|`cp852`|8-bit Central European (DOS)|
|`cp855`|8-bit Balkan Cyrillic (DOS)|
|`cp856`|8-bit Hebrew (DOS)|
|`cp857`|8-bit Turkish (DOS)|
|`cp858`|8-bit Western Europe and Canada with Euro (DOS)|
|`cp860`|8-bit European Portuguese (DOS)|
|`cp861`|8-bit Icelandic (DOS)|
|`cp862`|8-bit Hebrew and Box Drawing (DOS)|
|`cp863`|8-bit Quebecois French (DOS)|
|`cp864`|8-bit Arabic Positional Forms (DOS)|
|`cp865`|8-bit Continental Nordic (DOS)|
|`cp866`, `ibm866`|8-bit Russian Cyrillic (DOS)|
|`cp869`|8-bit Greek (DOS)|
|`cp1006`|8-bit Urdu|
|`cp1125`|8-bit Ukrainian Cyrillic (DOS)|
|`ecma-43-dv`, `cp367`, `csascii`|"8-bit Plain ASCII", i.e. ASCII without backspace composition, and with high bit unused. Note: most ASCII labels are mapped to Windows-1252, per WHATWG.|
|`hp-roman8`|8-bit Roman (HP)|
|`iso-8859-2`|8-bit Central European (ISO)|
|`iso-8859-3`|8-bit South European (Maltese/Esperanto)|
|`iso-8859-4`|8-bit North European|
|`iso-8859-5`|8-bit Cyrillic (ISO)|
|`iso-8859-6`|8-bit Arabic (ASMO/ISO)|
|`iso-8859-7`|8-bit Greek (ISO)|
|`iso-8859-8`, `iso-8859-8-i`|8-bit Hebrew (without vowel points). Although some, but not all, of the labels using this mapping request legacy visual-order behaviour (e.g. `iso-8859-8`, `iso-8859-8-e` or even `visual`, but not e.g. `iso-8859-8-i`), bidirectional conversion for any given markup format is beyond the scope of this package: determining from the label whether legacy visual-order behaviour should be used, and responding if so, should be implemented separately if needed.|
|`iso-8859-10`|8-bit Nordic|
|`iso-8859-13`|8-bit Baltic Rim (ISO)|
|`iso-8859-14`|8-bit Celtic|
|`iso-8859-15`|8-bit New Western European|
|`iso-8859-16`|8-bit South-Eastern European (ISO)|
|`koi8-r`|8-bit Russian Cyrillic (KOI8)|
|`koi8-u`, `koi8-ru`|8-bit Ruthenian/Ukrainian/Belarusian Cyrillic (KOI8)|
|`koi8-t`|8-bit Tajik Cyrillic|
|`kz1048`|8-bit Kazakh Cyrillic|
|`macintosh`|8-bit Roman (Macintosh)|
|`palmos`|PalmOS code page|
|`ptcp154`|8-bit Asian Cyrillic (Paratype)|
|`windows-874`, `iso-8859-11`, `tis-620`, `cp874`|8-bit Thai|
|`windows-1250`|8-bit Central European (Windows)|
|`windows-1251`|8-bit Cyrillic (Windows)|
|`windows-1252`, `ascii`, `iso-8859-1`, `latin1`|8-bit Western European. This is in accordance with the WHATWG specification _in re_ which mappings to associate with which labels. Note: Python's `latin1` is sometimes used to round-trip arbitrary _sensu stricto_ extended ASCII data; in Kuroko, it is better to use `x-user-defined` for that.|
|`windows-1253`|8-bit Greek (Windows)|
|`windows-1254`, `iso-8859-9`|8-bit Turkish|
|`windows-1255`|8-bit Hebrew (logical with vowel points)|
|`windows-1256`|8-bit Arabic (Windows)|
|`windows-1257`|8-bit Baltic Rim (Windows)|
|`windows-1258`|8-bit Vietnamese (Windows). Basic codec: encoder will accept text in the form generated by the decoder, but neither NFC nor NFD normalised forms. This follows both Python and WHATWG behaviour. Conversion of text in NFC or NFD forms to encodable form may need to be done in a separate step before using the encoder.|
|`x-mac-arabic`|8-bit Arabic (Macintosh)|
|`x-mac-ce`|8-bit Central European (Macintosh)|
|`x-mac-croatian`|8-bit Gajica|
|`x-mac-cyrillic`|8-bit Cyrillic (Macintosh)|
|`x-mac-farsi`|8-bit Persian (Macintosh)|
|`x-mac-greek`|8-bit Greek (Macintosh)|
|`x-mac-icelandic`|8-bit Icelandic (Macintosh)|
|`x-mac-romanian`|8-bit Romanian (Macintosh)|
|`x-mac-turkish`|8-bit Turkish (Macintosh)|
|`x-user-defined`|8-bit User Defined (ASCII based variant: using U+0000–007F, U+F780–F7FF)|

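To make the `x-user-defined` row concrete, here is a minimal Python sketch of the mapping it describes: bytes 0x00 to 0x7F map to U+0000–007F, and bytes 0x80 to 0xFF map to U+F780–F7FF. The function names are hypothetical and this is not the package's own implementation; it only shows why the mapping can round-trip arbitrary byte values.

```python
def user_defined_decode(data):
    # Low bytes keep their ASCII code point; high bytes land in U+F780-F7FF.
    return "".join(chr(b) if b < 0x80 else chr(0xF780 + b - 0x80) for b in data)

def user_defined_encode(text):
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif 0xF780 <= cp <= 0xF7FF:
            out.append(cp - 0xF780 + 0x80)
        else:
            raise ValueError("not representable: U+%04X" % cp)
    return bytes(out)

# Every possible byte value survives a decode/encode round trip.
every_byte = bytes(range(256))
round_tripped = user_defined_encode(user_defined_decode(every_byte))
```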
### Single-byte symbol or dingbat font encodings:

|Major label(s)|Meaning|
|---|---|
|`cp042`|8-bit User Defined (variant using U+0000–001F, U+F020–F0FF). Windows uses that mapping for symbol fonts in some contexts.|

### 8-bit multi-byte Unicode codecs:

|Major label(s)|Meaning|
|---|---|
|`cesu-8`, `utf8mb3`, `utf8-ucs2`|CESU-8 (to UTF-16 as UTF-8 is to UTF-32). Mostly for interoperability with existing systems that use it.|
|`gb18030`|Chinese GB18030, WHATWG version. Not technically a full UTF in this implementation, since one PUA character is changed to an ideographic space per WHATWG.|
|`utf-8`, `utf8mb4`, `utf8-ucs4`|UTF-8 without a byte order mark|
|`utf-8-sig`|UTF-8 with a byte order mark|
|`utf-16`|UTF-16 with byte order mark, little endian if missing|
|`utf-16be`|UTF-16, big endian, no byte order mark|
|`utf-16le`|UTF-16, little endian, no byte order mark|
|`utf-32`|UTF-32 with byte order mark (though byte order can usually also be detected in its absence)|
|`utf-32be`|UTF-32, big endian, no byte order mark|
|`utf-32le`|UTF-32, little endian, no byte order mark|

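The remark that CESU-8 is "to UTF-16 as UTF-8 is to UTF-32" can be sketched as follows: each UTF-16 code unit, including each half of a surrogate pair, is written out in the UTF-8 byte pattern. This is an illustrative Python sketch with a hypothetical function name, not the package's encoder; it relies on Python's `surrogatepass` error handler to serialise lone surrogates.

```python
def cesu8_encode(text):
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split a supplementary code point into a UTF-16 surrogate pair.
            cp -= 0x10000
            units = (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF))
        else:
            units = (cp,)
        for unit in units:
            # Each 16-bit unit is serialised with the UTF-8 byte pattern.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

# U+1F600 takes four bytes in UTF-8 but six (two three-byte groups) in CESU-8.
emoji = chr(0x1F600)
utf8_length = len(emoji.encode("utf-8"))
cesu8_length = len(cesu8_encode(emoji))
```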
### 8-bit multi-byte legacy CJK codecs:

|Major label(s)|Meaning|
|---|---|
|`big5`, `big5-eten`|Traditional Chinese Big-5, ETen version, condoning HKSCS extensions when decoding.|
|`big5-hkscs`|Traditional Chinese Big-5 with HKSCS extensions in both directions.|
|`big5-nonetenkana`, `big5-tw`|Traditional Chinese Big-5, with BIG5.TXT (non-ETen) layout for kana and Cyrillic.|
|`euc-jp`, `x-euc-jp`|Japanese EUC-JP, with Microsoft extensions, permitting JIS X 0212 only when decoding.|
|`euc-jp-full`|Japanese EUC-JP, with Microsoft extensions, permitting JIS X 0212 in both directions.|
|`euc-jisx0213`, `euc-jis-2004`|Japanese EUC-JP, with JIS X 0213 mappings and extensions.|
|`euc-kr`, `uhc`, `windows-949`|Korean Unified Hangul Code (superset of EUC-KR, encodes KS C 5601).|
|`gbk`, `gb2312`|Chinese GBK (GB2312 extension), condoning GB18030 when decoding.|
|`johab`, `johab-ascii`|Korean Johab (ASCII-compatible stateless standard version)|
|`shift_jis`, `ms-kanji`, `windows-31j`|Japanese Shift JIS (Windows compatible version)|
|`shift-jisx0213`, `shift-jis-2004`|Japanese Shift JIS (JIS X 0213 version)|
|`x-mac-chinesesimp`|Simplified Chinese GB2312, Macintosh version|
|`x-mac-chinesetrad`|Traditional Chinese Big5, Macintosh version|

### 7-bit stateful codecs:

|Major label(s)|Meaning|
|---|---|
|`hz-gb-2312`|HZ (Usenet Simplified Chinese) encoding|
|`iso-2022-cn`|7-bit stateful Chinese (Simplified and Traditional)|
|`iso-2022-jp`|7-bit stateful Japanese, web version|
|`iso-2022-jp-ext`|7-bit stateful Japanese, including JIS X 0212 and preserving katakana width|
|`iso-2022-jp-1`|7-bit stateful Japanese, including JIS X 0212|
|`iso-2022-jp-2`|7-bit stateful Multilingual (Japanese, Korean, Greek, Simplified Chinese, Western European)|
|`iso-2022-jp-3`|7-bit stateful Japanese, including JIS X 0213 (2000 edition format)|
|`iso-2022-jp-2004`|7-bit stateful Japanese, including JIS X 0213 (2004 edition format)|
|`iso-2022-kr`|7-bit stateful Korean|
|`jis_encoding`|7-bit stateful Japanese, comprehensive version|
|`utf-7`|A largely obsolete scheme for mixing ASCII and Base64'd UTF-16BE in e-mail. Included mostly for Python parity.|

### EBCDIC codecs:

|Major label(s)|Meaning|
|---|---|
|`cp037`|EBCDIC Default (United States, Netherlands, Portugal, Brazil, Australia, New Zealand, Canadian ESA/390)|
|`cp273`|EBCDIC German|
|`cp424`|EBCDIC Hebrew|
|`cp500`|EBCDIC "International" (Belgium, Switzerland, Canadian AS/400)|
|`cp875`|EBCDIC Greek|
|`cp933`, `ibm-933`, `ibm-1364`, `johab-ebcdic`|EBCDIC Korean (Johab, IBM stateful version for EBCDIC)|
|`cp1026`|EBCDIC Turkish|
|`cp1140`|EBCDIC with Euro Sign|

### Codecs with unusual behaviour:

|Major label(s)|Meaning|
|---|---|
|`inverse-base64`|Base64 with inverse semantics to preserve type correctness (encoder reads, decoder creates). Error handler is ignored.|
|`inverse-base64hqx`|Same, but using the BinHex4 alphabet (note: does not in and of itself create the BinHex4 *format*)|
|`inverse-base64uu`|Same, but using the uuencode alphabet (note: does not in and of itself create the uuencode *format*)|
|`inverse-quopri`|Quoted-Printable, with inverse semantics (encoder reads, decoder creates). Error handler is ignored.|
|`japanese`|Attempts to detect the encoding of a Japanese document (like the unified "Japanese" option now offered by some browsers' encoding override menus), and raises `ValueError` if it cannot. Not intended to be used in the encode direction, but will behave as `utf-8-sig` in that case.|
|`undefined`, `replacement`|Represents data for which encoding/decoding must not be attempted. Following WHATWG (and differing from Python), error handlers are accepted, though only by the decoder: the encoder will ignore them.|
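
The "inverse semantics" of the `inverse-base64` family can be sketched in Python with the standard `base64` module. Since an encoder always goes from text to bytes and a decoder from bytes to text, the encoder *reads* Base64 text and the decoder *creates* it. The function names below are hypothetical and only illustrate the direction of each operation.

```python
import base64

def inverse_base64_encode(text):
    # Text -> bytes: the "encoder" reads Base64 text and yields the raw bytes.
    return base64.b64decode(text)

def inverse_base64_decode(data):
    # Bytes -> text: the "decoder" creates Base64 text from raw bytes.
    return base64.b64encode(data).decode("ascii")

created = inverse_base64_decode(b"hello")
restored = inverse_base64_encode(created)
```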
"""
from codecs.infrastructure import encode, decode, lookup
import codecs.sbencs, codecs.dbdata, codecs.bespokecodecs, codecs.sbextra, codecs.dbextra, codecs.binascii, codecs.pifonts