Codecs package docs, as well as some assorted tweaks or minor additions (#5)

* Add some docs, and remove second Code page 874 codec (they handled the
non-overridden C1 area differently, but we only need one).

* More docs work.

* Doc stuff.

* Adjusted.

* More tweaks (table padding is not the docstring's problem).

* CSS and docstring tweaks.

* Link from modules to parent packages and vice versa.

* More documentation.

* Docstrings for all `codecs` submodules.

* Move encode_jis7_reduced into dbextra_data_7bit (thus completing the lazy
startup, which was apparently not complete already), and add docstrings to
implementations of base class methods, referring up to the base class.

* Remove FUSE junk that somehow made it into the repo.

* Some more docstrings.

* Fix some broken references to `string` (rather than `data`) which would have
caused a problem if any existing error handler had returned a negative
offset (which no current handler does, but it's worth fixing anyway).

* Add a cp042 codec to accompany the x-user-defined codec, and to pave the
way for maybe adding Adobe Symbol, Zapf Dingbats or Wingdings codecs
in future.

* Better Japanese Autodetect behaviour for ISO-2022-JP (add yet another
condition in which it will be detected, making it able to conclusively
detect it prior to end of stream without being fed an entire escape
sequence in one call). Also some docs tweaks.

* idstr() → _idstr() since it's internal.

* Docs for codecs.pifonts.

* Docstrings for dbextra.

* Document the sbextra classes.

* Docstrings for the web encodings.

* Possibly a fairer assessment of likely reality.

* Docstrings for codecs.binascii

* The *encoding* isn't removed (the BOM is).

* Make it clearer when competing OEM code pages use different letter layouts.

* Fix copied in error.

* Stop generating links to non-existent "← tools" from tools.gendoc.

* Move .fuse_hidden* exclusion to my user-level config.

* Constrain the table style changes to class .markdownTable, to avoid any
effect on other interface tables generated by Doxygen.

* Refer to `__ispackage__` when generating help.
HarJIT 2021-04-02 08:34:10 +01:00 committed by GitHub
parent 0fd2849fd8
commit 614193b8a1
13 changed files with 1313 additions and 66 deletions
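The negative-offset fix mentioned in the commit message can be illustrated with a small Python sketch. It assumes the convention documented for Python's codec error handlers, where a negative resume position is taken relative to the end of the input; the helper name `resolve_offset` is hypothetical and does not appear in the package. In the decoders the input buffer is called `data`, so adjusting by `len(string)` would have referred to a name that does not exist in that scope.

```python
def resolve_offset(offset: int, data: bytes) -> int:
    """Hypothetical helper: turn a possibly negative resume offset
    (as returned by an error handler) into an absolute index into data."""
    if offset < 0:
        # Negative offsets count back from the end of the input buffer.
        offset += len(data)
    return offset


data = b"abc\xffdef"
from_end = resolve_offset(-3, data)   # three bytes before the end
absolute = resolve_offset(4, data)    # already absolute, left unchanged
```

Both calls resolve to the same index, since `len(data)` is 7.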

View File

@ -388,3 +388,14 @@ div.memdoc {
color: #bbb;
content: '->';
}
.markdownTableRowOdd {
background: rgba(204, 204, 204, 0.2);
}
table.markdownTable {
margin-bottom: 1em;
}
.markdownTable th, .markdownTable td {
padding-left: 0.5ex;
padding-right: 0.5ex;
}

View File

@ -1,2 +1,167 @@
"""@brief Convert a string to and from various encodings.
The basic supported encodings are roughly as specified in the [WHATWG Encoding Standard
](https://encoding.spec.whatwg.org/), but more are also supported unless restriction to web
encodings is explicitly specified.
Most encodings supported by Python are implemented, but not currently `idna` or `punycode`. Note
however that Python makes `x-mac-japanese` and `x-mac-korean` aliases of `shift_jis` and `euc-kr`;
this has not been done here. Also note that the behaviour in regards to association of encoding
names with variants is somewhat different to Python's, partly due to following WHATWG: this affects
most CJK codecs (e.g. Python treats `shift_jis` and `ms-kanji` differently, while this package does
not), but also e.g. "ISO-8859-1".
Main entry points for the package are `codecs.infrastructure.encode`, `codecs.infrastructure.decode`
and `codecs.infrastructure.lookup`, all three of which are also available as e.g. `codecs.encode`
for convenience.
The list of codecs (not an exhaustive list of labels, nor close to one) is as follows.
### Single-byte extended ASCII encodings:
|Major label(s)|Meaning|
|---|---|
|`cp437`|8-bit United States (DOS)|
|`cp720`|8-bit Arabic Letters and Box Drawing (DOS)|
|`cp737`|8-bit Greek and Box Drawing (DOS)|
|`cp775`|8-bit Baltic Rim (DOS)|
|`cp850`|8-bit Western Europe and Canada (DOS)|
|`cp852`|8-bit Central European (DOS)|
|`cp855`|8-bit Balkan Cyrillic (DOS)|
|`cp856`|8-bit Hebrew (DOS)|
|`cp857`|8-bit Turkish (DOS)|
|`cp858`|8-bit Western Europe and Canada with Euro (DOS)|
|`cp860`|8-bit European Portuguese (DOS)|
|`cp861`|8-bit Icelandic (DOS)|
|`cp862`|8-bit Hebrew and Box Drawing (DOS)|
|`cp863`|8-bit Quebecois French (DOS)|
|`cp864`|8-bit Arabic Positional Forms (DOS)|
|`cp865`|8-bit Continental Nordic (DOS)|
|`cp866`, `ibm866`|8-bit Russian Cyrillic (DOS)|
|`cp869`|8-bit Greek (DOS)|
|`cp1006`|8-bit Urdu|
|`cp1125`|8-bit Ukrainian Cyrillic (DOS)|
|`ecma-43-dv`, `cp367`, `csascii`|"8-bit Plain ASCII", i.e. ASCII without backspace composition, and with high bit unused. Note: most ASCII labels are mapped to Windows-1252, per WHATWG.|
|`hp-roman8`|8-bit Roman (HP)|
|`iso-8859-2`|8-bit Central European (ISO)|
|`iso-8859-3`|8-bit South European (Maltese/Esperanto)|
|`iso-8859-4`|8-bit North European|
|`iso-8859-5`|8-bit Cyrillic (ISO)|
|`iso-8859-6`|8-bit Arabic (ASMO/ISO)|
|`iso-8859-7`|8-bit Greek (ISO)|
|`iso-8859-8`, `iso-8859-8-i`|8-bit Hebrew (without vowel points). Although some, but not all, of the labels using this mapping request legacy visual-order behaviour (e.g. `iso-8859-8`, `iso-8859-8-e` or even `visual`, but not e.g. `iso-8859-8-i`), bidirectional conversion for any given markup format is beyond the scope of this package: determining from the label whether legacy visual-order behaviour should be used, and responding if so, should be implemented separately if needed.|
|`iso-8859-10`|8-bit Nordic|
|`iso-8859-13`|8-bit Baltic Rim (ISO)|
|`iso-8859-14`|8-bit Celtic|
|`iso-8859-15`|8-bit New Western European|
|`iso-8859-16`|8-bit South-Eastern European (ISO)|
|`koi8-r`|8-bit Russian Cyrillic (KOI8)|
|`koi8-u`, `koi8-ru`|8-bit Ruthenian/Ukrainian/Belarusian Cyrillic (KOI8)|
|`koi8-t`|8-bit Tajik Cyrillic|
|`kz1048`|8-bit Kazakh Cyrillic|
|`macintosh`|8-bit Roman (Macintosh)|
|`palmos`|PalmOS code page|
|`ptcp154`|8-bit Asian Cyrillic (Paratype)|
|`windows-874`, `iso-8859-11`, `tis-620`, `cp874`|8-bit Thai|
|`windows-1250`|8-bit Central European (Windows)|
|`windows-1251`|8-bit Cyrillic (Windows)|
|`windows-1252`, `ascii`, `iso-8859-1`, `latin1`|8-bit Western European. This is in accordance with WHATWG specification _in re_ which mappings to associate with which labels. Note: Python's `latin1` is sometimes used to round-trip arbitrary _sensu stricto_ extended ASCII data; in Kuroko, it is better to use `x-user-defined` for that.|
|`windows-1253`|8-bit Greek (Windows)|
|`windows-1254`, `iso-8859-9`|8-bit Turkish|
|`windows-1255`|8-bit Hebrew (logical with vowel points)|
|`windows-1256`|8-bit Arabic (Windows)|
|`windows-1257`|8-bit Baltic Rim (Windows)|
|`windows-1258`|8-bit Vietnamese (Windows). Basic codec: encoder will accept text in the form generated by the decoder, but neither NFC nor NFD normalised forms. This follows both Python and WHATWG behaviour. Conversion of text in NFC or NFD forms to encodable form may need to be done in a separate step before using the encoder.|
|`x-mac-arabic`|8-bit Arabic (Macintosh)|
|`x-mac-ce`|8-bit Central European (Macintosh)|
|`x-mac-croatian`|8-bit Gajica|
|`x-mac-cyrillic`|8-bit Cyrillic (Macintosh)|
|`x-mac-farsi`|8-bit Persian (Macintosh)|
|`x-mac-greek`|8-bit Greek (Macintosh)|
|`x-mac-icelandic`|8-bit Icelandic (Macintosh)|
|`x-mac-romanian`|8-bit Romanian (Macintosh)|
|`x-mac-turkish`|8-bit Turkish (Macintosh)|
|`x-user-defined`|8-bit User Defined (ASCII based variant: using U+0000–007F, U+F780–F7FF)|
### Single-byte symbol or dingbat font encodings:
|Major label(s)|Meaning|
|---|---|
|`cp042`|8-bit User Defined (variant using U+0000–001F, U+F020–F0FF). Windows uses that mapping for symbol fonts in some contexts.|
### 8-bit multi-byte Unicode codecs:
|Major label(s)|Meaning|
|---|---|
|`cesu-8`, `utf8mb3`, `utf8-ucs2`|CESU-8 (to UTF-16 as UTF-8 is to UTF-32). Mostly for interoperability with existing systems that use it.|
|`gb18030`|Chinese GB18030, WHATWG version. Not technically a full UTF in this implementation, since one PUA character is changed to an ideographic space per WHATWG.|
|`utf-8`, `utf8mb4`, `utf8-ucs4`|UTF-8 without a byte order mark|
|`utf-8-sig`|UTF-8 with a byte order mark|
|`utf-16`|UTF-16 with byte order mark, little endian if missing|
|`utf-16be`|UTF-16, big endian, no byte order mark|
|`utf-16le`|UTF-16, little endian, no byte order mark|
|`utf-32`|UTF-32 with byte order mark (though byte order can usually also be detected in its absence)|
|`utf-32be`|UTF-32, big endian, no byte order mark|
|`utf-32le`|UTF-32, little endian, no byte order mark|
### 8-bit multi-byte legacy CJK codecs:
|Major label(s)|Meaning|
|---|---|
|`big5`, `big5-eten`|Traditional Chinese Big-5, ETen version, condoning HKSCS extensions when decoding.|
|`big5-hkscs`|Traditional Chinese Big-5 with HKSCS extensions in both directions.|
|`big5-nonetenkana`, `big5-tw`|Traditional Chinese Big-5, with BIG5.TXT (non-ETen) layout for kana and Cyrillic.|
|`euc-jp`, `x-euc-jp`|Japanese EUC-JP, with Microsoft extensions, permitting JIS X 0212 only when decoding.|
|`euc-jp-full`|Japanese EUC-JP, with Microsoft extensions, permitting JIS X 0212 in both directions.|
|`euc-jisx0213`, `euc-jis-2004`|Japanese EUC-JP, with JIS X 0213 mappings and extensions.|
|`euc-kr`, `uhc`, `windows-949`|Korean Unified Hangul Code (superset of EUC-KR, encodes KS C 5601).|
|`gbk`, `gb2312`|Chinese GBK (GB2312 extension), condoning GB18030 when decoding.|
|`johab`, `johab-ascii`|Korean Johab (ASCII-compatible stateless standard version)|
|`shift_jis`, `ms-kanji`, `windows-31j`|Japanese Shift JIS (Windows compatible version)|
|`shift-jisx0213`, `shift-jis-2004`|Japanese Shift JIS (JIS X 0213 version)|
|`x-mac-chinesesimp`|Simplified Chinese GB2312, Macintosh version|
|`x-mac-chinesetrad`|Traditional Chinese Big5, Macintosh version|
### 7-bit stateful codecs:
|Major label(s)|Meaning|
|---|---|
|`hz-gb-2312`|HZ (Usenet Simplified Chinese) encoding|
|`iso-2022-cn`|7-bit stateful Chinese (Simplified and Traditional)|
|`iso-2022-jp`|7-bit stateful Japanese, web version|
|`iso-2022-jp-ext`|7-bit stateful Japanese, preserving katakana width|
|`iso-2022-jp-1`|7-bit stateful Japanese, including JIS X 0212|
|`iso-2022-jp-2`|7-bit stateful Multilingual (Japanese, Korean, Greek, Simplified Chinese, Western European)|
|`iso-2022-jp-3`|7-bit stateful Japanese, including JIS X 0213 (2000 edition format)|
|`iso-2022-jp-2004`|7-bit stateful Japanese, including JIS X 0213 (2004 edition format)|
|`iso-2022-kr`|7-bit stateful Korean|
|`jis_encoding`|7-bit stateful Japanese, comprehensive version|
|`utf-7`|A largely obsolete scheme for mixing ASCII and Base64'd UTF-16BE in e-mail. Included mostly for Python parity.|
### EBCDIC codecs:
|Major label(s)|Meaning|
|---|---|
|`cp037`|EBCDIC Default (United States, Netherlands, Portugal, Brazil, Australia, New Zealand, Canadian ESA/390)|
|`cp273`|EBCDIC German|
|`cp424`|EBCDIC Hebrew|
|`cp500`|EBCDIC "International" (Belgium, Switzerland, Canadian AS/400)|
|`cp875`|EBCDIC Greek|
|`cp933`, `ibm-933`, `ibm-1364`, `johab-ebcdic`|EBCDIC Korean (Johab, IBM stateful version for EBCDIC)|
|`cp1026`|EBCDIC Turkish|
|`cp1140`|EBCDIC with Euro Sign|
### Codecs with unusual behaviour:
|Major label(s)|Meaning|
|---|---|
|`inverse-base64`|Base64 with inverse semantics to preserve type correctness (encoder reads, decoder creates). Error handler is ignored.|
|`inverse-base64hqx`|Same, but using the BinHex4 alphabet (note: does not in and of itself create the BinHex4 *format*)|
|`inverse-base64uu`|Same, but using the uuencode alphabet (note: does not in and of itself create the uuencode *format*)|
|`inverse-quopri`|Quoted-Printable, with inverse semantics (encoder reads, decoder creates). Error handler is ignored.|
|`japanese`|Attempts to detect the encoding of a Japanese document (like the unified "Japanese" option now offered by some browsers' encoding override menus), and raises `ValueError` if it cannot. Not intended to be used in the encode direction, but will behave as `utf-8-sig` in that case.|
|`undefined`, `replacement`|Represents data for which encoding/decoding must not be attempted. Following WHATWG (and differing from Python), error handlers are accepted, though only by the decoder: the encoder will ignore them.|
"""
from codecs.infrastructure import encode, decode, lookup
-import codecs.sbencs, codecs.dbdata, codecs.bespokecodecs, codecs.sbextra, codecs.dbextra, codecs.binascii
+import codecs.sbencs, codecs.dbdata, codecs.bespokecodecs, codecs.sbextra, codecs.dbextra, codecs.binascii, codecs.pifonts
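The table entry describing CESU-8 as "to UTF-16 as UTF-8 is to UTF-32" can be made concrete with a rough Python sketch. This is not the package's `Cesu8IncrementalEncoder`; `cesu8_encode` is a hypothetical helper that assumes well-formed input (no lone surrogates) and ignores error handlers and BOM handling.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode well-formed text as CESU-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
        else:
            # Split into a UTF-16 surrogate pair, then encode each
            # surrogate as its own three-byte sequence.
            cp -= 0x10000
            for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


emoji = "\U0001F600"
as_utf8 = emoji.encode("utf-8")    # one four-byte sequence
as_cesu8 = cesu8_encode(emoji)     # two three-byte sequences
```

Text confined to the Basic Multilingual Plane comes out byte-for-byte identical in both encodings; only supplementary characters differ.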

View File

@ -1,11 +1,19 @@
"""Contains various WHATWG-defined codecs which require dedicated implementations.
Also includes `utf-8-sig` which, while not a WHATWG-specified codec _per se_, is detected,
interpreted and handled by WHATWG BOM tag logic, in preference above any label, before the codec
gets to see it. WHATWG BOM tag logic is not implemented here (it is not always sensible in a
non-browser context); hence, they remain separate codecs."""
from codecs.infrastructure import register_kuroko_codec, ByteCatenator, StringCatenator, UnicodeEncodeError, UnicodeDecodeError, lookup_error, lookup, IncrementalDecoder, IncrementalEncoder, lazy_property
from codecs.dbdata import more_dbdata
class Gb18030IncrementalEncoder(IncrementalEncoder):
"""IncrementalEncoder implementation for GB18030 (Mainland Chinese Unicode format)"""
name = "gb18030"
html5name = "gb18030"
four_byte_codes = True
def encode(string, final = False):
"""Implements `IncrementalEncoder.encode`"""
let out = ByteCatenator()
let offset = 0
while 1: # offset can be arbitrarily changed by the error handler, so not a for
@ -64,6 +72,8 @@ class Gb18030IncrementalEncoder(IncrementalEncoder):
offset += 1
class GbkIncrementalEncoder(Gb18030IncrementalEncoder):
"""IncrementalEncoder implementation for GBK (Chinese),
extension of GB2312 (Simplified Chinese)"""
name = "gbk"
html5name = "gbk"
four_byte_codes = False
@ -77,9 +87,12 @@ def _get_gbsurrogate_pointer(leader, i):
return ret
class Gb18030IncrementalDecoder(IncrementalDecoder):
"""IncrementalDecoder implementation for GB18030 (Mainland Chinese Unicode),
extension of GB2312 (Simplified Chinese)"""
name = "gb18030"
html5name = "gb18030"
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let out = StringCatenator()
@ -150,7 +163,7 @@ class Gb18030IncrementalDecoder(IncrementalDecoder):
out.add(errorret[0])
offset = errorret[1]
if offset < 0:
-offset += len(string)
+offset += len(data)
register_kuroko_codec(["gb18030", "gb18030_2000"], Gb18030IncrementalEncoder, Gb18030IncrementalDecoder)
register_kuroko_codec(
@ -159,6 +172,7 @@ register_kuroko_codec(
GbkIncrementalEncoder, Gb18030IncrementalDecoder)
class Iso2022JpIncrementalEncoder(IncrementalEncoder):
"""IncrementalEncoder implementation for ISO-2022-JP (7-bit stateful Japanese JIS)"""
name = "iso-2022-jp"
html5name = "iso-2022-jp"
encodes_sbcs = []
@ -187,6 +201,7 @@ class Iso2022JpIncrementalEncoder(IncrementalEncoder):
raise ValueError("set to invalid state: " + repr(state))
self.state = state
def encode(string, final = False):
"""Implements `IncrementalEncoder.encode`"""
let out = ByteCatenator()
let offset = 0
while 1: # offset can be arbitrarily changed by the error handler, so not a for
@ -261,17 +276,21 @@ class Iso2022JpIncrementalEncoder(IncrementalEncoder):
else:
raise RuntimeError("inconsistently configured encoder")
def reset():
"""Implements `IncrementalEncoder.reset`"""
self.state = 0
self.state_greekmode = False
self.state_desigsupershift = False
def getstate():
"""Implements `IncrementalEncoder.getstate`"""
return (self.state, self.state_desigsupershift, self.state_greekmode)
def setstate(state):
"""Implements `IncrementalEncoder.setstate`"""
self.state = state[0]
self.state_desigsupershift = state[1]
self.state_greekmode = state[2]
class Iso2022JpIncrementalDecoder(IncrementalDecoder):
"""IncrementalDecoder implementation for ISO-2022-JP (7-bit stateful Japanese JIS)"""
name = "iso-2022-jp"
html5name = "iso-2022-jp"
@lazy_property
@ -291,6 +310,7 @@ class Iso2022JpIncrementalDecoder(IncrementalDecoder):
super_shift = False
concat_lenient = False
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let out = StringCatenator()
@ -457,8 +477,9 @@ class Iso2022JpIncrementalDecoder(IncrementalDecoder):
out.add(errorret[0])
offset = errorret[1]
if offset < 0:
-offset += len(string)
+offset += len(data)
def reset():
"""Implements `IncrementalDecoder.reset`"""
self.pending = b""
self.state_set = 0
self.state_greekmode = False
@ -475,9 +496,11 @@ class Iso2022JpIncrementalDecoder(IncrementalDecoder):
self.state_last646seen = None
self.scrutinising_inter646 = False
def getstate():
"""Implements `IncrementalDecoder.getstate`"""
return (self.pending, self.state_set, self.state_greekmode, self.state_shiftoutmode,
self.state_justswitched, self.state_last646seen, self.scrutinising_inter646)
def setstate(state):
"""Implements `IncrementalDecoder.setstate`"""
self.pending = state[0]
self.state_set = state[1]
self.state_greekmode = state[2]
@ -491,6 +514,7 @@ register_kuroko_codec(["iso-2022-jp", "iso2022-jp", "iso2022jp", "csiso2022jp",
class Utf16IncrementalEncoder(IncrementalEncoder):
"""IncrementalEncoder implementation for UTF-16 with Byte Order Mark"""
name = "utf-16"
html5name = "utf-16"
encoding_map = {}
@ -507,6 +531,7 @@ class Utf16IncrementalEncoder(IncrementalEncoder):
else:
raise ValueError("unexpected endian value: " + repr(self.endian))
def encode(string, final = False):
"""Implements `IncrementalEncoder.encode`"""
let out = ByteCatenator()
let offset = 0
if self.include_bom and self.state == -1:
@ -536,13 +561,17 @@ class Utf16IncrementalEncoder(IncrementalEncoder):
if offset < 0:
offset += len(string)
def reset():
"""Implements `IncrementalEncoder.reset`"""
self.state = -1
def getstate():
"""Implements `IncrementalEncoder.getstate`"""
return self.state
def setstate(state):
"""Implements `IncrementalEncoder.setstate`"""
self.state = state
class Utf16IncrementalDecoder(IncrementalDecoder):
"""IncrementalDecoder implementation for UTF-16"""
name = "utf-16"
html5name = "utf-16"
force_endian = None # subclass may set to "little" or "big"
@ -552,6 +581,7 @@ class Utf16IncrementalDecoder(IncrementalDecoder):
state = None
pending = b""
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let out = StringCatenator()
@ -616,34 +646,41 @@ class Utf16IncrementalDecoder(IncrementalDecoder):
out.add(errorret[0])
offset = errorret[1]
if offset < 0:
-offset += len(string)
+offset += len(data)
def reset():
"""Implements `IncrementalDecoder.reset`"""
self.pending = b""
self.state = -1
def getstate():
"""Implements `IncrementalDecoder.getstate`"""
return (self.pending, self.state)
def setstate(state):
"""Implements `IncrementalDecoder.setstate`"""
self.pending = state[0]
self.state = state[1]
class Utf16BeIncrementalEncoder(Utf16IncrementalEncoder):
"""IncrementalEncoder implementation for UTF-16 Big Endian without Byte Order Mark"""
name = "utf-16be"
html5name = "utf-16be"
endian = "big"
include_bom = False
class Utf16BeIncrementalDecoder(Utf16IncrementalDecoder):
"""IncrementalDecoder implementation for UTF-16 Big Endian without Byte Order Mark"""
name = "utf-16be"
html5name = "utf-16be"
force_endian = "big"
class Utf16LeIncrementalEncoder(Utf16IncrementalEncoder):
"""IncrementalEncoder implementation for UTF-16 Little Endian without Byte Order Mark"""
name = "utf-16le"
html5name = "utf-16le"
endian = "little"
include_bom = False
class Utf16LeIncrementalDecoder(Utf16IncrementalDecoder):
"""IncrementalDecoder implementation for UTF-16 Little Endian without Byte Order Mark"""
name = "utf-16le"
html5name = "utf-16le"
force_endian = "little"
@ -660,6 +697,7 @@ register_kuroko_codec(["utf-16be", "utf-16-be", "unicodefffe", "unicodebigunmark
class Utf8IncrementalEncoder(IncrementalEncoder):
"""IncrementalEncoder implementation for UTF-8"""
name = "utf-8"
html5name = "utf-8"
# -1: expecting BOM
@ -667,6 +705,7 @@ class Utf8IncrementalEncoder(IncrementalEncoder):
state = None
include_bom = False
def encode(string, final = False):
"""Implements `IncrementalEncoder.encode`"""
# We use UTF-8 natively, so this is fairly simple
let out = ByteCatenator()
if self.include_bom and self.state == -1:
@ -675,13 +714,17 @@ class Utf8IncrementalEncoder(IncrementalEncoder):
out.add(string.encode())
return out.getvalue()
def reset():
"""Implements `IncrementalEncoder.reset`"""
self.state = -1
def getstate():
"""Implements `IncrementalEncoder.getstate`"""
return self.state
def setstate(state):
"""Implements `IncrementalEncoder.setstate`"""
self.state = state
class Utf8IncrementalDecoder(IncrementalDecoder):
"""IncrementalDecoder implementation for UTF-8"""
name = "utf-8"
html5name = "utf-8"
# -1: expecting BOM
@ -692,6 +735,7 @@ class Utf8IncrementalDecoder(IncrementalDecoder):
def _error_handler(error):
return lookup_error(self.errors)(error)
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
# We use UTF-8 natively, so this only validates it and applies the error handler
# (and removes a BOM if remove_bom is set)
let data = self.pending + data_in
@ -762,7 +806,7 @@ class Utf8IncrementalDecoder(IncrementalDecoder):
out.add(errorret[0])
running_offset = errorret[1]
if running_offset < 0:
-running_offset += len(string)
+running_offset += len(data)
countdown = 0
bolster = 1
first_offset = running_offset
@ -777,20 +821,25 @@ class Utf8IncrementalDecoder(IncrementalDecoder):
self.pending = bytes(dlist[second_offset:])
return out.getvalue()
def reset():
"""Implements `IncrementalDecoder.reset`"""
self.pending = b""
self.state = -1
def getstate():
"""Implements `IncrementalDecoder.getstate`"""
return (self.pending, self.state)
def setstate(state):
"""Implements `IncrementalDecoder.setstate`"""
self.pending = state[0]
self.state = state[1]
class Utf8SigIncrementalEncoder(Utf8IncrementalEncoder):
"""IncrementalEncoder implementation for UTF-8 with Byte Order Mark"""
name = "utf-8-sig"
html5name = None
include_bom = True
class Utf8SigIncrementalDecoder(Utf8IncrementalDecoder):
"""IncrementalDecoder implementation for UTF-8 with Byte Order Mark"""
name = "utf-8-sig"
html5name = None
remove_bom = True
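The difference between the `utf-8` and `utf-8-sig` codecs above comes down to byte order mark handling. The sketch below uses Python's standard-library codecs of the same names as a stand-in; it is an assumption, not something the diff confirms, that the Kuroko codecs behave identically in these particular cases.

```python
BOM = "\ufeff"
payload = (BOM + "hi").encode("utf-8")

# Plain UTF-8 treats the BOM as an ordinary character and keeps it.
kept = payload.decode("utf-8")

# utf-8-sig strips a leading BOM when decoding ...
stripped = payload.decode("utf-8-sig")

# ... and prepends one when encoding.
with_sig = "hi".encode("utf-8-sig")
plain = "hi".encode("utf-8")

# The endian-specific UTF-16 variants never write a BOM.
utf16le = "hi".encode("utf-16-le")
```

Only the BOM is added or removed; the encoding of the remaining text is the same in both codecs.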

View File

@ -1,3 +1,6 @@
"""
Defines functions and codecs pertaining to binary-to-text encodings.
"""
from codecs.infrastructure import StringCatenator, ByteCatenator, IncrementalEncoder, IncrementalDecoder, UnicodeDecodeError, UnicodeEncodeError, register_kuroko_codec
let _base64_alphabet = (
@ -11,10 +14,14 @@ let _base64_alphabet_hqx = [ord(i) for i in
"!\"#$%&'()*+,-012345689@ABCDEFGHIJKLMNPQRSTUVXYZ[`abcdefhijklmpqr"]
class Base64IncrementalCreator(IncrementalDecoder):
"""
IncrementalDecoder implementation to create (yes) Base64 from bytes.
"""
name = "inverse-base64"
alphabet = _base64_alphabet
padchar = "="
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let offset = 0
@ -39,10 +46,14 @@ class Base64IncrementalCreator(IncrementalDecoder):
offset += 3
class Base64IncrementalParser(IncrementalEncoder):
"""
IncrementalEncoder implementation to parse (yes) Base64 from a string to bytes.
"""
name = "inverse-base64"
alphabet = _base64_alphabet
padchar = "="
def encode(string_in, final = False):
"""Implements `IncrementalEncoder.encode`"""
let string = self.pending + string_in
self.pending = ""
let offset = 0
@ -90,19 +101,35 @@ class Base64IncrementalParser(IncrementalEncoder):
raise UnicodeEncodeError(self.name, string, offset, suboffset,
"Base64 truncated or with invalid number of pad characters")
offset = suboffset
-def reset(): self.pending = ""
-def getstate(): return self.pending
-def setstate(state): self.pending = state
+def reset():
+"""Implements `IncrementalEncoder.reset`"""
+self.pending = ""
+def getstate():
+"""Implements `IncrementalEncoder.getstate`"""
+return self.pending
+def setstate(state):
+"""Implements `IncrementalEncoder.setstate`"""
+self.pending = state
register_kuroko_codec(["inverse-base64"],
Base64IncrementalParser, Base64IncrementalCreator)
class Base64UUIncrementalCreator(Base64IncrementalCreator):
"""
IncrementalDecoder implementation to create (yes) the flavour of Base64 used in uuencode.
Note that this does not output the uuencode format, and is only one component of implementing it.
"""
name = "inverse-base64uu"
alphabet = _base64_alphabet_uu
padchar = " "
class Base64UUIncrementalParser(Base64IncrementalParser):
"""
IncrementalEncoder implementation to parse (yes) the flavour of Base64 used in uuencode.
Note that this does not take the uuencode format, and is only one component of implementing it.
"""
name = "inverse-base64uu"
alphabet = _base64_alphabet_uu
padchar = " "
@ -111,10 +138,20 @@ register_kuroko_codec(["inverse-base64uu"],
Base64UUIncrementalParser, Base64UUIncrementalCreator)
class Base64HQXIncrementalCreator(Base64IncrementalCreator):
"""
IncrementalDecoder implementation to create (yes) the flavour of Base64 used in BinHex4.
Note that this does not output the BinHex4 format, and is only one component of implementing it.
"""
name = "inverse-base64hqx"
alphabet = _base64_alphabet_hqx
class Base64HQXIncrementalParser(Base64IncrementalParser):
"""
IncrementalEncoder implementation to parse (yes) the flavour of Base64 used in BinHex4.
Note that this does not take the BinHex4 format, and is only one component of implementing it.
"""
name = "inverse-base64hqx"
alphabet = _base64_alphabet_hqx
@ -123,8 +160,12 @@ register_kuroko_codec(["inverse-base64hqx"],
class QuoPriIncrementalCreator(IncrementalDecoder):
"""
IncrementalDecoder implementation to create (yes) Quoted-Printable from bytes.
"""
name = "inverse-quopri"
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let offset = 0
@ -171,17 +212,24 @@ class QuoPriIncrementalCreator(IncrementalDecoder):
self.linelength += 3
offset += 1
def reset():
"""Implements `IncrementalDecoder.reset`"""
self.linelength = 0
self.pending = b""
def getstate():
"""Implements `IncrementalDecoder.getstate`"""
return (self.linelength, self.pending)
def setstate(state):
"""Implements `IncrementalDecoder.setstate`"""
self.linelength = state[0]
self.pending = state[1]
class QuoPriIncrementalParser(IncrementalEncoder):
"""
IncrementalEncoder implementation to parse (yes) Quoted-Printable from a string to bytes.
"""
name = "inverse-quopri"
def encode(string_in, final = False):
"""Implements `IncrementalEncoder.encode`"""
let string = self.pending + string_in
self.pending = ""
let offset = 0
@ -220,15 +268,26 @@ class QuoPriIncrementalParser(IncrementalEncoder):
let byteval = (hexd.index(procsubst[1]) << 4) | hexd.index(procsubst[2])
out.add(bytes([byteval]))
offset += 3
-def reset(): self.pending = ""
-def getstate(): return self.pending
-def setstate(state): self.pending = state
+def reset():
+"""Implements `IncrementalEncoder.reset`"""
+self.pending = ""
+def getstate():
+"""Implements `IncrementalEncoder.getstate`"""
+return self.pending
+def setstate(state):
+"""Implements `IncrementalEncoder.setstate`"""
+self.pending = state
register_kuroko_codec(["inverse-quopri"],
QuoPriIncrementalParser, QuoPriIncrementalCreator)
def base64_file_create(data, filename=None, mode=0o666):
"""
Create a Base64 string containing the provided data, with lines wrapped as required by some
formats. If a filename and optional UNIX mode are provided, Base64 headers as recognised by
some modern versions of uudecode are added.
"""
let out = StringCatenator()
let creator = Base64IncrementalCreator("strict")
if filename != None:
@ -246,6 +305,9 @@ def base64_file_create(data, filename=None, mode=0o666):
def uu_file_create(data, filename="-", mode=0o666):
"""
Create a string in the uuencode file format containing the provided data.
"""
let out = StringCatenator()
let creator = Base64UUIncrementalCreator("strict")
let octmode = oct(mode)[2:]
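The docstrings above note that the uuencode and BinHex4 flavours share Base64's bit grouping and differ in alphabet (and in the surrounding file format, which these codecs do not produce). A minimal Python sketch of that idea follows. `sextets_encode` is a hypothetical helper that handles whole three-byte groups only, and the uuencode alphabet shown (each 6-bit value plus 32) is the traditional one; it is assumed rather than copied from `_base64_alphabet_uu`, which this diff does not show.

```python
import base64

B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)
# Traditional uuencode alphabet: 6-bit value v maps to chr(32 + v).
UU_ALPHABET = "".join(chr(32 + v) for v in range(64))


def sextets_encode(data: bytes, alphabet: str) -> str:
    """Hypothetical helper: map each 3-byte group to four alphabet characters."""
    if len(data) % 3:
        raise ValueError("this sketch handles whole 3-byte groups only")
    out = []
    for i in range(0, len(data), 3):
        group = int.from_bytes(data[i:i + 3], "big")
        # Emit the four 6-bit values, most significant first.
        out.extend(alphabet[(group >> shift) & 0x3F] for shift in (18, 12, 6, 0))
    return "".join(out)


as_base64 = sextets_encode(b"Cat", B64_ALPHABET)
as_uu = sextets_encode(b"Cat", UU_ALPHABET)
matches_stdlib = as_base64 == base64.b64encode(b"Cat").decode("ascii")
```

A real uuencode line additionally starts with a length character and sits between `begin` and `end` lines, which is the part that `uu_file_create` adds.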

View File

@ -1,6 +1,7 @@
"""
-This module includes some additional variable-width or wide encodings not specified by WHATWG. As
-such, none of the codecs in this module should be used in HTML.
+This module includes some additional variable-width or wide encodings not specified by WHATWG.
+As such, none of the codecs in this module should be used in HTML.
"""
from codecs.dbextra_data_8bit import data_8bit
@ -13,6 +14,17 @@ from collections import xraydict
class Big5NonEtenKanaIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for Big5 with non-ETEN layout of kana, Cyrillic, list markers.
The other ETEN extension section (the one retained by Microsoft's version) is still included.
Although this is the kana/Cyrillic/list marker layout included in the UTC's BIG5.TXT, it is the
less common of the two (most extension schemes for Big5 use the ETEN layout), and has several
problems (katakana lacks the vowel extender, and Cyrillic lacks several capitals) which the
ETEN layout does not have. However, this codec corresponds roughly to Python's `big5`, and more
closely to its built-in `cp950` (as distinct from the case where Python aliases `cp950` to `mbcs`).
"""
name = "big5-nonetenkana"
html5name = None
@lazy_property
@ -20,6 +32,17 @@ class Big5NonEtenKanaIncrementalEncoder(AsciiIncrementalEncoder):
return xraydict(data_8bit.cp950_no_eudc_encoding_map, data_8bit.encode_big5_nonetenkana)
class Big5NonEtenKanaIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for Big5 with non-ETEN layout of kana, Cyrillic, list markers.
The other ETEN extension section (the one retained by Microsoft's version) is still included.
Although this is the kana/Cyrillic/list marker layout included in the UTC's BIG5.TXT, it is the
less common of the two (most extension schemes for Big5 use the ETEN layout), and has several
problems (katakana lacks the vowel extender, and Cyrillic lacks several capitals) which the
ETEN layout does not have. However, this codec corresponds roughly to Python's `big5`, and more
closely to its built-in `cp950` (as distinct from the case where Python aliases `cp950` to `mbcs`).
"""
name = "big5-nonetenkana"
html5name = None
@lazy_property
@ -32,6 +55,13 @@ register_kuroko_codec(["big5-nonetenkana", "big5-tw"],
Big5NonEtenKanaIncrementalEncoder, Big5NonEtenKanaIncrementalDecoder)
class XMacChineseTradIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for Big5 with Apple's additions and reduced lead byte range.
The Unicode mappings are partly changed to be closer to Apple's (as opposed to Microsoft's)
correspondences; however, Microsoft's are retained where following Apple's would have required
PUA transcoding hints to round-trip.
"""
name = "x-mac-chinesetrad"
html5name = None
@lazy_property
@ -54,6 +84,13 @@ class XMacChineseTradIncrementalEncoder(AsciiIncrementalEncoder):
})
class XMacChineseTradIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for Big5 with Apple's additions and reduced lead byte range.
The Unicode mappings are partly changed to be closer to Apple's (as opposed to Microsoft's)
correspondences; however, Microsoft's are retained where following Apple's would have required
PUA transcoding hints to round-trip.
"""
name = "x-mac-chinesetrad"
html5name = None
@lazy_property
@ -83,6 +120,12 @@ register_kuroko_codec(["x-mac-chinesetrad", "x-mac-trad-chinese"],
class XMacChineseSimpIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for EUC-CN, Apple version (hence slightly reduced lead byte range).
Mappings to more-recently added characters are used for the vertical forms, rather than
Apple transcoding hints (or GB18030 private use codes).
"""
name = "x-mac-chinesesimp"
html5name = None
@lazy_property
@ -102,6 +145,12 @@ class XMacChineseSimpIncrementalEncoder(AsciiIncrementalEncoder):
})
class XMacChineseSimpIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for EUC-CN, Apple version (hence slightly reduced lead byte range).
Mappings to more-recently added characters are used for the vertical forms, rather than
Apple transcoding hints (or GB18030 private use codes).
"""
name = "x-mac-chinesesimp"
html5name = None
@lazy_property
@ -128,6 +177,10 @@ register_kuroko_codec(["x-mac-chinesesimp", "x-mac-simp-chinese", "euc-cn", "euc
class Cesu8IncrementalEncoder(IncrementalEncoder):
"""
IncrementalEncoder implementation for CESU-8, a deprecated UTF-8-like encoding still used by
some systems, such as TCL, and still mis-called "utf8" in some places for legacy reasons.
"""
name = "cesu-8"
html5name = None
# -1: expecting BOM
@ -135,6 +188,7 @@ class Cesu8IncrementalEncoder(IncrementalEncoder):
state = None
include_bom = False
def encode(string, final = False):
"""Implements `IncrementalEncoder.encode`"""
let out = ByteCatenator()
if self.include_bom and self.state == -1:
out.add("\uFEFF".encode())
@ -161,13 +215,20 @@ class Cesu8IncrementalEncoder(IncrementalEncoder):
out.add(string[first_offset:second_offset].encode())
return out.getvalue()
def reset():
"""Implements `IncrementalEncoder.reset`"""
self.state = -1
def getstate():
"""Implements `IncrementalEncoder.getstate`"""
return self.state
def setstate(state):
"""Implements `IncrementalEncoder.setstate`"""
self.state = state
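For readers unfamiliar with CESU-8: it differs from UTF-8 only for code points above U+FFFF, which are first split into a UTF-16 surrogate pair and then each surrogate is written as a three-byte UTF-8-style sequence. The following is a minimal Python sketch of that idea, not the Kuroko implementation in this diff; the function name `cesu8_encode` is hypothetical and no BOM or error-handler logic is modelled.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode `text` as CESU-8 (illustration only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split the supplementary code point into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            # Each surrogate becomes its own three-byte sequence.
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)
```

U+1F600, for example, takes six bytes here, where standard UTF-8 would use four.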
class Cesu8IncrementalDecoder(Utf8IncrementalDecoder):
"""
IncrementalDecoder implementation for CESU-8, a deprecated UTF-8-like encoding still used by
some systems, such as TCL, and still mis-called "utf8" in some places for legacy reasons.
"""
name = "cesu-8"
html5name = None
def _error_handler(error):
@ -211,6 +272,10 @@ let _base64_alphabet = (
let _utf7_not_need_hyphen = [ord(i) for i in "(),.:? \r\n"]
class Utf7IncrementalEncoder(IncrementalEncoder):
"""
IncrementalEncoder implementation for UTF-7, a largely obsolete (and forbidden in HTML5)
scheme for mixing ASCII with Base64'd UTF-16BE in e-mail.
"""
name = "utf-7"
html5name = None
utf16encoder = None
@ -220,6 +285,7 @@ class Utf7IncrementalEncoder(IncrementalEncoder):
self.utf16encoder = Utf16BeIncrementalEncoder(errors)
IncrementalEncoder.__init__(self, errors)
def encode(data, final=False):
"""Implements `IncrementalEncoder.encode`"""
let incoming = self.pending + list(self.utf16encoder.encode(data, final=final))
self.pending = []
let offset = 0
@ -254,16 +320,23 @@ class Utf7IncrementalEncoder(IncrementalEncoder):
self.mode = "ascii"
return out.getvalue()
def reset():
"""Implements `IncrementalEncoder.reset`"""
self.utf16encoder.reset()
self.mode = "ascii"
def getstate():
"""Implements `IncrementalEncoder.getstate`"""
return (self.utf16encoder.getstate(), self.mode, self.pending)
def setstate(state):
"""Implements `IncrementalEncoder.setstate`"""
self.utf16encoder.setstate(state[0])
self.mode = state[1]
self.pending = state[2]
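The "Base64'd UTF-16BE" part of UTF-7 can be illustrated in a few lines of Python. This is only a sketch of a single shifted run, assuming the whole input is to be Base64-encoded; the helper name `utf7_shift_run` is hypothetical, and unlike the encoder above it always emits the closing hyphen and never passes ASCII through directly.

```python
import base64

def utf7_shift_run(text: str) -> bytes:
    """Hypothetical helper: encode `text` as one UTF-7 shifted run."""
    raw = text.encode("utf-16-be")
    # UTF-7 uses Base64 without the trailing "=" padding.
    b64 = base64.b64encode(raw).rstrip(b"=")
    # "+" enters the Base64 run and "-" explicitly terminates it.
    return b"+" + b64 + b"-"
```

The output can be checked against Python's own `utf-7` decoder, for instance `utf7_shift_run("日本語").decode("utf-7")`.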
class Utf7IncrementalDecoder(IncrementalDecoder):
"""
IncrementalDecoder implementation for UTF-7, a largely obsolete (and forbidden in HTML5)
scheme for mixing ASCII with Base64'd UTF-16BE in e-mail.
"""
name = "utf-7"
html5name = None
utf16decoder = None
@ -273,6 +346,7 @@ class Utf7IncrementalDecoder(IncrementalDecoder):
self.utf16decoder = Utf16BeIncrementalDecoder(errors)
IncrementalDecoder.__init__(self, errors)
def decode(data_in, final=False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let incoming = list(data)
@ -329,12 +403,15 @@ class Utf7IncrementalDecoder(IncrementalDecoder):
self.mode = "ascii"
return out.getvalue()
def reset():
"""Implements `IncrementalDecoder.reset`"""
self.utf16decoder.reset()
self.mode = "ascii"
self.pending = b""
def getstate():
"""Implements `IncrementalDecoder.getstate`"""
return (self.utf16decoder.getstate(), self.mode, self.pending)
def setstate(state):
"""Implements `IncrementalDecoder.setstate`"""
self.utf16decoder.setstate(state[0])
self.mode = state[1]
self.pending = state[2]
@ -344,6 +421,9 @@ register_kuroko_codec(["utf-7", "utf7", "u7", "unicode-1-1-utf-7"],
class EucJpFullIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for EUC-JP, including JIS X 0212.
"""
name = "euc-jp-full"
html5name = None
@lazy_property
@ -355,6 +435,9 @@ register_kuroko_codec(["euc-jp-full"],
class EucJis2004IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the JIS X 0213 version of EUC-JP.
"""
name = "euc-jis-2004"
html5name = None
@lazy_property
@ -362,6 +445,9 @@ class EucJis2004IncrementalEncoder(AsciiIncrementalEncoder):
return data_8bit.encode_euc04
class EucJis2004IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the JIS X 0213 version of EUC-JP.
"""
name = "euc-jis-2004"
html5name = None
@lazy_property
@ -377,6 +463,9 @@ register_kuroko_codec(["euc-jis-2004", "jisx0213", "eucjis2004", "euc_jis2004",
class ShiftJis2004IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the JIS X 0213 version of Shift_JIS.
"""
name = "shift-jis-2004"
html5name = None
@lazy_property
@ -385,6 +474,9 @@ class ShiftJis2004IncrementalEncoder(AsciiIncrementalEncoder):
ascii_exceptions = (0x5C, 0x7E)
class ShiftJis2004IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the JIS X 0213 version of Shift_JIS.
"""
name = "shift-jis-2004"
html5name = None
@lazy_property
@ -403,6 +495,9 @@ register_kuroko_codec(["shift_jis-2004", "shiftjis2004", "sjis_2004", "s_jis_200
class AsciiJohabIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the PC Johab encoding (code page 1361).
"""
name = "johab-ascii"
html5name = None
@lazy_property
@ -410,6 +505,9 @@ class AsciiJohabIncrementalEncoder(AsciiIncrementalEncoder):
return data_8bit.encode_johab_ascii
class AsciiJohabIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the PC Johab encoding (code page 1361).
"""
name = "johab-ascii"
html5name = None
@lazy_property
@ -424,6 +522,9 @@ register_kuroko_codec(["cp1361", "ms1361", "johab", "x-johab", "johab-ascii"],
class EbcdicJohabIncrementalEncoder(BaseEbcdicIncrementalEncoder):
"""
IncrementalEncoder implementation for code page 1364, a stateful EBCDIC variant of Johab.
"""
name = "johab-ebcdic"
html5name = None
@lazy_property
@ -434,6 +535,9 @@ class EbcdicJohabIncrementalEncoder(BaseEbcdicIncrementalEncoder):
return data_8bit.encode_johab_ebcdic
class EbcdicJohabIncrementalDecoder(BaseEbcdicIncrementalDecoder):
"""
IncrementalDecoder implementation for code page 1364, a stateful EBCDIC variant of Johab.
"""
name = "johab-ebcdic"
html5name = None
@lazy_property
@ -443,12 +547,22 @@ class EbcdicJohabIncrementalDecoder(BaseEbcdicIncrementalDecoder):
def dbcshost_decode():
return data_8bit.decode_johab_ebcdic
register_kuroko_codec(["cp933", "ibm-933", "933", "x-IBM933", "ibm-1364", "x-IBM1364",
register_kuroko_codec(["cp933", "ibm-933", "933", "x-IBM933", "cp1364", "ibm-1364", "x-IBM1364",
"johab-ebcdic"],
EbcdicJohabIncrementalEncoder, EbcdicJohabIncrementalDecoder)
class JisEncodingIncrementalEncoder(Iso2022JpIncrementalEncoder):
"""
IncrementalEncoder implementation for 7-bit stateful Japanese with all features.
This differs from the ISO-2022-JP encoder in that it will:
- Encode forms present in 1978 JIS but simplified by (and absent in) 1983 JIS to 1978 JIS.
- For characters not present in either table, try JIS X 0212, 2000 JIS and 2004 JIS in that order.
- For characters not present in any JIS set, try GB 2312 and Wansung.
- Preserve width of katakana.
"""
name = "jis_encoding"
html5name = None
@lazy_property
@ -477,6 +591,18 @@ class JisEncodingIncrementalEncoder(Iso2022JpIncrementalEncoder):
attitude = "eager"
class JisEncodingIncrementalDecoder(Iso2022JpIncrementalDecoder):
"""
IncrementalDecoder implementation for 7-bit stateful Japanese.
This differs from the ISO-2022-JP decoder in that it will:
- Decode 1978 JIS with a separate table, including 1978 JIS, NEC extensions and IBM backports.
- Accept and decode extensions from ISO-2022-JP-2 (and -1), ISO-2022-JP-3 and ISO-2022-JP-2004.
- Not generate an error for immediately concatenated JIS-KanjiASCIIJIS-Kanji designations.
- Accept katakana via Shift Out / Shift In.
This is used as the decoder for all other ISO-2022-JP variants besides plain ISO-2022-JP.
"""
name = "jis_encoding"
html5name = None
@lazy_property
@ -518,6 +644,12 @@ register_kuroko_codec(["jis_encoding", "csjisencoding", "jis", "jis7"],
class Iso2022Jp1IncrementalEncoder(Iso2022JpIncrementalEncoder):
"""
IncrementalEncoder implementation for 7-bit stateful Japanese with JIS X 0212.
This differs from the ISO-2022-JP encoder in that it will encode to JIS X 0212, and does so
whenever possible (i.e. it will favour it over any web extensions to JIS X 0208).
"""
name = "iso-2022-jp-1"
html5name = None
@lazy_property
@ -525,6 +657,7 @@ class Iso2022Jp1IncrementalEncoder(Iso2022JpIncrementalEncoder):
return [None, None]
@lazy_property
def encodes_dbcs():
# Favour JIS X 0212 over any extensions in the web JIS X 0208 table.
return [None, None, data_7bit.encode_jis90p2, more_dbdata.encode_jis7]
escs_onebyte = {0: 0x42, 1: 0x4A}
escs_twobyte = {3: 0x42, 2: 0x44}
@ -535,6 +668,11 @@ register_kuroko_codec(["iso-2022-jp-1", "iso2022-jp-1", "iso2022jp-1"],
class Iso2022JpExtIncrementalEncoder(Iso2022JpIncrementalEncoder):
"""
IncrementalEncoder implementation for 7-bit stateful Japanese.
This differs from the ISO-2022-JP encoder in that it preserves katakana width.
"""
name = "iso-2022-jp-ext"
html5name = None
@lazy_property
@ -552,6 +690,9 @@ register_kuroko_codec(["iso-2022-jp-ext", "iso2022-jp-ext", "iso2022jp-ext"],
class Iso2022Jp2IncrementalEncoder(Iso2022JpIncrementalEncoder):
"""
IncrementalEncoder implementation for 7-bit stateful Japanese with multilingual extensions.
"""
name = "iso-2022-jp-2"
html5name = None
@lazy_property
@ -559,6 +700,7 @@ class Iso2022Jp2IncrementalEncoder(Iso2022JpIncrementalEncoder):
return [None, None]
@lazy_property
def encodes_dbcs():
# Favour JIS X 0212 over any extensions in the web JIS X 0208 table.
return [None, None,
data_7bit.encode_jis90p2,
more_dbdata.encode_jis7,
@ -579,19 +721,10 @@ register_kuroko_codec(["iso-2022-jp-2", "iso2022-jp-2", "iso2022jp-2", "csISO202
Iso2022Jp2IncrementalEncoder, JisEncodingIncrementalDecoder)
# What this does is not obvious, so to explain:
# The JIS X 0213 variants of ISO-2022-JP should encode to JIS X 0213 before encoding to any
# extension to JIS X 0208 (assuming they "should" encode to extensions at all). So we remove any
# characters that are encoded to different locations in JIS X 0213.
# Since NEC Row 13 is retained in JIS X 0213, but should be encoded in the JIS X 0213 state not
# the JIS X 0208 state, it is also excluded.
# This also removes certain Unicode characters that are mapped differently by Microsoft/WHATWG
# versus by JIS X 0213, e.g. the fullwidth tilde, in the hope that the JIS X 0213 tables would
# more dependably round trip.
let encode_jis7_reduced = xraydict(more_dbdata.encode_jis7, {}, [33537, 33634, 33663, 33735, 33864, 33972, 34012, 34131, 34137, 34224, 35061, 35100, 35346, 35383, 35449, 35495, 35518, 35551, 35574, 35711, 36080, 36084, 36114, 20008, 20193, 20224, 20227, 20310, 20362, 20370, 20372, 20378, 20425, 20544, 20514, 20510, 20550, 20546, 20592, 20628, 37086, 37141, 37159, 20810, 20893, 37335, 37338, 37357, 37358, 37348, 37349, 37386, 37392, 21013, 37434, 37436, 37440, 37433, 37454, 37457, 37465, 37479, 37496, 37512, 21158, 37543, 21167, 37584, 37587, 37591, 21211, 37593, 37600, 37607, 37625, 21248, 37627, 21255, 37631, 37634, 37662, 37661, 21284, 37669, 37665, 37704, 37719, 37744, 21395, 21426, 37830, 37854, 37957, 21642, 21660, 21673, 21759, 21894, 38557, 38575, 38707, 38715, 38733, 38735, 38741, 22444, 22472, 22471, 38999, 39013, 22686, 22795, 22875, 22877, 39326, 22948, 39502, 39644, 23382, 39794, 39797, 39823, 39857, 23488, 23512, 23532, 39936, 23582, 23718, 23738, 23847, 23874, 23891, 40299, 23917, 40304, 23992, 23993, 40473, 40657, 24372, 24389, 24423, 24503, 24542, 24714, 24789, 24818, 8470, 8481, 24880, 24887, 8544, 8545, 8546, 8547, 8548, 8549, 8550, 8551, 8552, 8553, 8560, 8561, 8562, 8563, 8564, 8565, 8566, 8567, 8568, 8569, 24984, 8721, 8730, 8735, 8736, 8741, 8745, 8746, 8747, 8750, 8757, 8786, 8801, 8869, 25254, 8895, 25589, 9312, 9313, 9314, 9315, 9316, 9317, 9318, 9319, 9320, 9321, 9322, 9323, 9324, 9325, 9326, 9327, 9328, 9329, 9330, 9331, 25696, 25757, 25806, 26112, 26121, 26133, 26142, 26148, 26161, 26199, 26201, 26213, 26227, 26265, 26272, 26290, 26303, 26362, 26363, 26470, 26555, 26560, 26625, 26692, 26706, 26824, 26831, 26984, 27032, 27106, 27184, 27206, 27243, 27251, 27262, 27364, 27606, 27711, 27740, 27782, 27866, 27908, 28039, 28076, 28111, 28156, 28199, 28220, 28252, 28351, 28552, 28597, 28661, 12317, 12319, 28677, 28679, 28712, 28859, 28805, 28843, 28943, 28932, 28998, 28999, 29020, 29121, 29182, 12849, 12850, 12857, 12964, 12965, 12966, 12967, 
12968, 29361, 29374, 13059, 13069, 13076, 13080, 13090, 13091, 13094, 13095, 13099, 13110, 13115, 13129, 13130, 13133, 13137, 13143, 29559, 13179, 13180, 13181, 13182, 13198, 13199, 13212, 13213, 13214, 13217, 13252, 29641, 13261, 29654, 29667, 29703, 29734, 29738, 29742, 29794, 29833, 29855, 29953, 29999, 30063, 30363, 30364, 30366, 30374, 30534, 30753, 30798, 30820, 63785, 31024, 31124, 31131, 63964, 64015, 64016, 64017, 64019, 64020, 64021, 64022, 64025, 64026, 64027, 64031, 64032, 64033, 64034, 64036, 64038, 31441, 31463, 31467, 31646, 32072, 32092, 32160, 32183, 32214, 32338, 32394, 65282, 65287, 65293, 65374, 32583, 65508])
class Iso2022Jp3IncrementalEncoder(Iso2022JpIncrementalEncoder):
"""
IncrementalEncoder implementation for 7-bit stateful Japanese with JIS X 0213-2000.
"""
name = "iso-2022-jp-3"
html5name = None
@lazy_property
@ -600,7 +733,7 @@ class Iso2022Jp3IncrementalEncoder(Iso2022JpIncrementalEncoder):
@lazy_property
def encodes_dbcs():
return [None, None, None,
encode_jis7_reduced,
data_7bit.encode_jis7_reduced,
data_7bit.encode_jis00,
data_7bit.encode_jis00p2]
escs_onebyte = {0: 0x42, 1: 0x4A, 2: 0x49}
@ -612,6 +745,9 @@ register_kuroko_codec(["iso-2022-jp-3", "iso2022-jp-3", "iso2022jp-3"],
class Iso2022Jp2004IncrementalEncoder(Iso2022JpIncrementalEncoder):
"""
IncrementalEncoder implementation for 7-bit stateful Japanese with JIS X 0213-2004.
"""
name = "iso-2022-jp-2004"
html5name = None
@lazy_property
@ -620,7 +756,7 @@ class Iso2022Jp2004IncrementalEncoder(Iso2022JpIncrementalEncoder):
@lazy_property
def encodes_dbcs():
return [None, None, None,
encode_jis7_reduced,
data_7bit.encode_jis7_reduced,
data_7bit.encode_jis00p2,
data_7bit.encode_jis04]
escs_onebyte = {0: 0x42, 1: 0x4A, 2: 0x49}
@ -632,6 +768,9 @@ register_kuroko_codec(["iso-2022-jp-2004", "iso2022-jp-2004", "iso2022jp-2004"],
class Utf32IncrementalEncoder(IncrementalEncoder):
"""
IncrementalEncoder implementation for UTF-32 with byte order mark.
"""
name = "utf-32"
html5name = None
@lazy_property
@ -650,6 +789,7 @@ class Utf32IncrementalEncoder(IncrementalEncoder):
else:
raise ValueError("unexpected endian value: " + repr(self.endian))
def encode(string, final = False):
"""Implements `IncrementalEncoder.encode`"""
let out = ByteCatenator()
let offset = 0
if self.include_bom and self.state == -1:
@ -672,13 +812,19 @@ class Utf32IncrementalEncoder(IncrementalEncoder):
if offset < 0:
offset += len(string)
def reset():
"""Implements `IncrementalEncoder.reset`"""
self.state = -1
def getstate():
"""Implements `IncrementalEncoder.getstate`"""
return self.state
def setstate(state):
"""Implements `IncrementalEncoder.setstate`"""
self.state = state
class Utf32IncrementalDecoder(IncrementalDecoder):
"""
IncrementalDecoder implementation for UTF-32, detected byte order, removing any byte order mark.
"""
name = "utf-32"
html5name = None
force_endian = None # subclass may set to "little" or "big"
@ -688,6 +834,7 @@ class Utf32IncrementalDecoder(IncrementalDecoder):
state = None
pending = b""
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let out = StringCatenator()
@ -754,34 +901,49 @@ class Utf32IncrementalDecoder(IncrementalDecoder):
out.add(errorret[0])
offset = errorret[1]
if offset < 0:
offset += len(string)
offset += len(data)
def reset():
"""Implements `IncrementalDecoder.reset`"""
self.pending = b""
self.state = -1
def getstate():
"""Implements `IncrementalDecoder.getstate`"""
return (self.pending, self.state)
def setstate(state):
"""Implements `IncrementalDecoder.setstate`"""
self.pending = state[0]
self.state = state[1]
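The byte-order detection described in the decoder docstring can be sketched in Python as follows. This is a one-shot illustration rather than an incremental decoder; the function name `decode_utf32_with_bom` is hypothetical, and falling back to big-endian when no byte order mark is present is an assumption, not something this diff confirms.

```python
def decode_utf32_with_bom(data: bytes, default_endian: str = "big") -> str:
    """Hypothetical helper: decode UTF-32, sniffing and dropping a BOM."""
    if len(data) % 4:
        raise ValueError("UTF-32 data must be a multiple of four bytes")
    endian = default_endian  # assumption: used only when there is no BOM
    if data[:4] == b"\x00\x00\xfe\xff":
        endian, data = "big", data[4:]
    elif data[:4] == b"\xff\xfe\x00\x00":
        endian, data = "little", data[4:]
    # Each four-byte unit is one code point; chr() rejects values > 0x10FFFF.
    return "".join(
        chr(int.from_bytes(data[i:i + 4], endian))
        for i in range(0, len(data), 4)
    )
```

The `utf-32be` and `utf-32le` subclasses below correspond to skipping the sniffing step and fixing `endian` in advance.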
class Utf32BeIncrementalEncoder(Utf32IncrementalEncoder):
"""
IncrementalEncoder implementation for UTF-32, big endian, without a byte order mark.
"""
name = "utf-32be"
html5name = None
endian = "big"
include_bom = False
class Utf32BeIncrementalDecoder(Utf32IncrementalDecoder):
"""
IncrementalDecoder implementation for UTF-32, big endian, without a byte order mark.
"""
name = "utf-32be"
html5name = None
force_endian = "big"
class Utf32LeIncrementalEncoder(Utf32IncrementalEncoder):
"""
IncrementalEncoder implementation for UTF-32, little endian, without a byte order mark.
"""
name = "utf-32le"
html5name = None
endian = "little"
include_bom = False
class Utf32LeIncrementalDecoder(Utf32IncrementalDecoder):
"""
IncrementalDecoder implementation for UTF-32, little endian, without a byte order mark.
"""
name = "utf-32le"
html5name = None
force_endian = "little"
@ -795,6 +957,11 @@ register_kuroko_codec(["utf-32be", "utf-32-be"],
class HzIncrementalEncoder(IncrementalEncoder):
"""
IncrementalEncoder implementation for HZ-GB-2312 (Usenet simplified Chinese).
This is an old scheme for embedding GB 2312 data into a pure ASCII stream.
"""
name = "hz-gb-2312"
html5name = None
def ensure_state_number(state, out):
@ -818,6 +985,7 @@ class HzIncrementalEncoder(IncrementalEncoder):
raise ValueError("set to invalid state: " + repr(state))
self.state = state
def encode(string, final = False):
"""Implements `IncrementalEncoder.encode`"""
let out = ByteCatenator()
let offset = 0
while 1: # offset can be arbitrarily changed by the error handler, so not a for
@ -850,18 +1018,27 @@ class HzIncrementalEncoder(IncrementalEncoder):
if offset < 0:
offset += len(string)
def reset():
"""Implements `IncrementalEncoder.reset`"""
self.state = 0
self.linelength = 0
def getstate():
"""Implements `IncrementalEncoder.getstate`"""
return (self.state, self.linelength)
def setstate(state):
"""Implements `IncrementalEncoder.setstate`"""
self.state = state[0]
self.linelength = state[1]
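As an illustration of how HZ embeds GB 2312 in an ASCII stream: two-byte characters are written with their high bits cleared between `~{` and `~}`, and a literal tilde is doubled. The Python sketch below assumes that simplified model; the name `hz_encode` is hypothetical, and line-length handling (which the encoder above tracks via `linelength`) is not modelled.

```python
def hz_encode(text: str) -> bytes:
    """Hypothetical helper: encode `text` as HZ-GB-2312 (no line wrapping)."""
    out = bytearray()
    in_gb = False
    for ch in text:
        if ord(ch) < 0x80:
            if in_gb:
                out += b"~}"  # leave GB 2312 mode before any ASCII
                in_gb = False
            out += b"~~" if ch == "~" else ch.encode("ascii")
        else:
            pair = ch.encode("gb2312")  # raises if not in GB 2312
            if not in_gb:
                out += b"~{"  # enter GB 2312 mode
                in_gb = True
            # Clear the high bit so that only 7-bit bytes are emitted.
            out += bytes(b & 0x7F for b in pair)
    if in_gb:
        out += b"~}"  # always finish in ASCII mode
    return bytes(out)
```

The result should round-trip through Python's built-in `hz` codec, which is a convenient cross-check.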
class HzIncrementalDecoder(IncrementalDecoder):
"""
IncrementalDecoder implementation for HZ-GB-2312 (Usenet simplified Chinese).
This is an old scheme for embedding GB 2312 data into a pure ASCII stream.
"""
name = "hz-gb-2312"
html5name = None
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let out = StringCatenator()
@ -920,13 +1097,16 @@ class HzIncrementalDecoder(IncrementalDecoder):
out.add(errorret[0])
offset = errorret[1]
if offset < 0:
offset += len(string)
offset += len(data)
def reset():
"""Implements `IncrementalDecoder.reset`"""
self.pending = b""
self.state_set = 0
def getstate():
"""Implements `IncrementalDecoder.getstate`"""
return (self.pending, self.state_set)
def setstate(state):
"""Implements `IncrementalDecoder.setstate`"""
self.pending = state[0]
self.state_set = state[1]
@ -935,6 +1115,14 @@ register_kuroko_codec(["hz-gb-2312", "hz", "hzgb", "hz_gb"],
class JapaneseAutodetectIncrementalDecoder(IncrementalDecoder):
"""
IncrementalDecoder implementation for the automatic "Japanese" character encoding option.
This will attempt to interpret the stream as the web versions of ISO-2022-JP, Shift_JIS and
EUC-JP, as well as UTF-8, at once, and start returning the data once it has narrowed it down
to one. If it fails to narrow it down conclusively, it will wait until the final call before
making an educated guess. If it doesn't seem to be any of them, it will raise `ValueError`.
"""
name = "japanese"
html5name = None
# State flags:
@ -951,11 +1139,14 @@ class JapaneseAutodetectIncrementalDecoder(IncrementalDecoder):
self.utf = lookup("utf-8-sig").incrementaldecoder("strict")
self.reset()
def decode(data, final = False):
"""Implements `IncrementalDecoder.decode`"""
if not (self.state & 0x01):
try:
self.pendingjis.add(self.jis.decode(data, final))
except UnicodeDecodeError:
self.state |= 0x01
if self.jis.state_set != 0:
self.state |= 0x0E
#
if not (self.state & 0x02):
try:
@ -1029,6 +1220,7 @@ class JapaneseAutodetectIncrementalDecoder(IncrementalDecoder):
return ret
return ""
def reset():
"""Implements `IncrementalDecoder.reset`"""
self.state = 0
self.pending = b""
self.jis.reset()
@ -1040,12 +1232,14 @@ class JapaneseAutodetectIncrementalDecoder(IncrementalDecoder):
self.utf.reset()
self.pendingutf = StringCatenator()
def getstate():
"""Implements `IncrementalDecoder.getstate`"""
return (self.jis.getstate(), self.pendingjis.getvalue(),
self.sjis.getstate(), self.pendingsjis.getvalue(),
self.ujis.getstate(), self.pendingujis.getvalue(),
self.utf.getstate(), self.pendingutf.getvalue(),
self.state)
def setstate(state):
"""Implements `IncrementalDecoder.setstate`"""
self.jis.setstate(state[0])
self.pendingjis = StringCatenator()
self.pendingjis.add(state[1])
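The narrowing strategy described in the `JapaneseAutodetectIncrementalDecoder` docstring can be approximated in Python by running several strict incremental decoders side by side and discarding each one as it fails. This is a simplified sketch using Python's standard codecs, not the web-compatible variants used here; the name `guess_japanese` and the fixed priority order are assumptions, and it does not return data early or inspect the ISO-2022-JP shift state as the real decoder does.

```python
import codecs

# Assumed priority order for the final guess when several candidates survive.
CANDIDATES = ("iso2022_jp", "shift_jis", "euc_jp", "utf-8")

def guess_japanese(chunks):
    """Hypothetical helper: return (codec name, text) for a list of chunks."""
    chunks = list(chunks)
    decoders = {n: codecs.getincrementaldecoder(n)("strict") for n in CANDIDATES}
    outputs = {n: [] for n in CANDIDATES}
    for index, chunk in enumerate(chunks):
        final = index == len(chunks) - 1
        for name in list(decoders):
            try:
                outputs[name].append(decoders[name].decode(chunk, final))
            except UnicodeDecodeError:
                del decoders[name]  # this candidate is ruled out for good
    for name in CANDIDATES:
        if name in decoders:
            return name, "".join(outputs[name])
    raise ValueError("input does not look like any candidate encoding")
```

Pure ASCII input is valid in every candidate, which is why some tie-breaking order is needed at the end.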
@ -1067,6 +1261,9 @@ register_kuroko_codec(["japanese"], Utf8SigIncrementalEncoder,
class Iso2022NonJpIncrementalEncoder(IncrementalEncoder):
"""
IncrementalEncoder subclass, base class for ISO-2022-KR and ISO-2022-CN. Not used directly.
"""
name = None
html5name = None
encodes = []
@ -1095,6 +1292,7 @@ class Iso2022NonJpIncrementalEncoder(IncrementalEncoder):
self.super3_desig = state
def run_prelude(out):
def encode(string, final = False):
"""Implements `IncrementalEncoder.encode`"""
let out = ByteCatenator()
let offset = 0
if self.shift_desig == None and self.super_desig == None and self.super3_desig == None:
@ -1156,19 +1354,25 @@ class Iso2022NonJpIncrementalEncoder(IncrementalEncoder):
if offset < 0:
offset += len(string)
def reset():
"""Implements `IncrementalEncoder.reset`"""
self.shift = False
self.shift_desig = None
self.super_desig = None
self.super3_desig = None
def getstate():
"""Implements `IncrementalEncoder.getstate`"""
return (self.shift, self.shift_desig, self.super_desig, self.super3_desig)
def setstate(state):
"""Implements `IncrementalEncoder.setstate`"""
self.shift = state[0]
self.shift_desig = state[1]
self.super_desig = state[2]
self.super3_desig = state[3]
class Iso2022NonJpIncrementalDecoder(IncrementalDecoder):
"""
IncrementalDecoder subclass, base class for ISO-2022-KR and ISO-2022-CN. Not used directly.
"""
name = None
html5name = None
decodes = []
@ -1176,6 +1380,7 @@ class Iso2022NonJpIncrementalDecoder(IncrementalDecoder):
escs_super = {}
escs_super3 = {}
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let out = StringCatenator()
@ -1285,16 +1490,19 @@ class Iso2022NonJpIncrementalDecoder(IncrementalDecoder):
out.add(errorret[0])
offset = errorret[1]
if offset < 0:
offset += len(string)
offset += len(data)
def reset():
"""Implements `IncrementalDecoder.reset`"""
self.pending = b""
self.shift = False
self.shift_desig = None
self.super_desig = None
self.super3_desig = None
def getstate():
"""Implements `IncrementalDecoder.getstate`"""
return (self.pending, self.shift, self.shift_desig, self.super_desig, self.super3_desig)
def setstate(state):
"""Implements `IncrementalDecoder.setstate`"""
self.pending = state[0]
self.shift = state[1]
self.shift_desig = state[2]
@ -1302,6 +1510,9 @@ class Iso2022NonJpIncrementalDecoder(IncrementalDecoder):
self.super3_desig = state[4]
class Iso2022KrIncrementalEncoder(Iso2022NonJpIncrementalEncoder):
"""
IncrementalEncoder implementation for ISO-2022-KR (7-bit stateful Korean, South).
"""
name = "iso-2022-kr"
html5name = None
@lazy_property
@ -1315,6 +1526,9 @@ class Iso2022KrIncrementalEncoder(Iso2022NonJpIncrementalEncoder):
self.ensure_shift_designation(0, out)
class Iso2022KrIncrementalDecoder(Iso2022NonJpIncrementalDecoder):
"""
IncrementalDecoder implementation for ISO-2022-KR (7-bit stateful Korean, South).
"""
name = "iso-2022-kr"
html5name = None
@lazy_property
@ -1326,6 +1540,11 @@ register_kuroko_codec(["iso-2022-kr", "iso2022-kr", "iso2022kr", "csiso2022kr"],
Iso2022KrIncrementalEncoder, Iso2022KrIncrementalDecoder)
class Iso2022CnIncrementalEncoder(Iso2022NonJpIncrementalEncoder):
"""
IncrementalEncoder implementation for ISO-2022-CN (7-bit stateful Chinese).
ISO-2022-CN-Ext is not included (it requires a much larger set of tables and is very rare).
"""
name = "iso-2022-cn"
html5name = None
@lazy_property
@ -1335,6 +1554,11 @@ class Iso2022CnIncrementalEncoder(Iso2022NonJpIncrementalEncoder):
escs_super = {2: 0x48}
class Iso2022CnIncrementalDecoder(Iso2022NonJpIncrementalDecoder):
"""
IncrementalDecoder implementation for ISO-2022-CN (7-bit stateful Chinese).
ISO-2022-CN-Ext is not included (it requires a much larger set of tables and is very rare).
"""
name = "iso-2022-cn"
html5name = None
@lazy_property

@ -1,3 +1,6 @@
"""
Defines 7-bit mapping data for `codecs.dbextra`.
"""
from collections import xraydict
from codecs.infrastructure import encodesto7bit, decodesto7bit, lazy_property
from codecs.dbdata import more_dbdata, Windows949IncrementalEncoder, Windows949IncrementalDecoder
@ -29,6 +32,20 @@ class _DBExtraData7Bit:
return {65408: 64, 65409: 65, 65410: 66, 65411: 67, 65412: 68, 65413: 69, 65414: 70, 65415: 71, 65416: 72, 65417: 73, 65418: 74, 65419: 75, 65420: 76, 65421: 77, 65422: 78, 65423: 79, 65424: 80, 65425: 81, 65426: 82, 65427: 83, 65428: 84, 65429: 85, 65430: 86, 65431: 87, 65432: 88, 65433: 89, 65434: 90, 65435: 91, 65436: 92, 65437: 93, 65438: 94, 65439: 95, 65377: 33, 65378: 34, 65379: 35, 65380: 36, 65381: 37, 65382: 38, 65383: 39, 65384: 40, 65385: 41, 65386: 42, 65387: 43, 65388: 44, 65389: 45, 65390: 46, 65391: 47, 65392: 48, 65393: 49, 65394: 50, 65395: 51, 65396: 52, 65397: 53, 65398: 54, 65399: 55, 65400: 56, 65401: 57, 65402: 58, 65403: 59, 65404: 60, 65405: 61, 65406: 62, 65407: 63, }
# What this does is not obvious, so to explain:
# The JIS X 0213 variants of ISO-2022-JP should encode to JIS X 0213 before encoding to any
# extension to JIS X 0208 (assuming they "should" encode to extensions at all). So we remove any
# characters that are encoded to different locations in JIS X 0213.
# Since NEC Row 13 is retained in JIS X 0213, but should be encoded in the JIS X 0213 state not
# the JIS X 0208 state, it is also excluded.
# This also removes certain Unicode characters that are mapped differently by Microsoft/WHATWG
# versus by JIS X 0213, e.g. the fullwidth tilde, in the hope that the JIS X 0213 tables would
# more dependably round trip.
@lazy_property
def encode_jis7_reduced():
return xraydict(more_dbdata.encode_jis7, {}, [33537, 33634, 33663, 33735, 33864, 33972, 34012, 34131, 34137, 34224, 35061, 35100, 35346, 35383, 35449, 35495, 35518, 35551, 35574, 35711, 36080, 36084, 36114, 20008, 20193, 20224, 20227, 20310, 20362, 20370, 20372, 20378, 20425, 20544, 20514, 20510, 20550, 20546, 20592, 20628, 37086, 37141, 37159, 20810, 20893, 37335, 37338, 37357, 37358, 37348, 37349, 37386, 37392, 21013, 37434, 37436, 37440, 37433, 37454, 37457, 37465, 37479, 37496, 37512, 21158, 37543, 21167, 37584, 37587, 37591, 21211, 37593, 37600, 37607, 37625, 21248, 37627, 21255, 37631, 37634, 37662, 37661, 21284, 37669, 37665, 37704, 37719, 37744, 21395, 21426, 37830, 37854, 37957, 21642, 21660, 21673, 21759, 21894, 38557, 38575, 38707, 38715, 38733, 38735, 38741, 22444, 22472, 22471, 38999, 39013, 22686, 22795, 22875, 22877, 39326, 22948, 39502, 39644, 23382, 39794, 39797, 39823, 39857, 23488, 23512, 23532, 39936, 23582, 23718, 23738, 23847, 23874, 23891, 40299, 23917, 40304, 23992, 23993, 40473, 40657, 24372, 24389, 24423, 24503, 24542, 24714, 24789, 24818, 8470, 8481, 24880, 24887, 8544, 8545, 8546, 8547, 8548, 8549, 8550, 8551, 8552, 8553, 8560, 8561, 8562, 8563, 8564, 8565, 8566, 8567, 8568, 8569, 24984, 8721, 8730, 8735, 8736, 8741, 8745, 8746, 8747, 8750, 8757, 8786, 8801, 8869, 25254, 8895, 25589, 9312, 9313, 9314, 9315, 9316, 9317, 9318, 9319, 9320, 9321, 9322, 9323, 9324, 9325, 9326, 9327, 9328, 9329, 9330, 9331, 25696, 25757, 25806, 26112, 26121, 26133, 26142, 26148, 26161, 26199, 26201, 26213, 26227, 26265, 26272, 26290, 26303, 26362, 26363, 26470, 26555, 26560, 26625, 26692, 26706, 26824, 26831, 26984, 27032, 27106, 27184, 27206, 27243, 27251, 27262, 27364, 27606, 27711, 27740, 27782, 27866, 27908, 28039, 28076, 28111, 28156, 28199, 28220, 28252, 28351, 28552, 28597, 28661, 12317, 12319, 28677, 28679, 28712, 28859, 28805, 28843, 28943, 28932, 28998, 28999, 29020, 29121, 29182, 12849, 12850, 12857, 12964, 12965, 12966, 12967, 12968, 29361, 29374, 
13059, 13069, 13076, 13080, 13090, 13091, 13094, 13095, 13099, 13110, 13115, 13129, 13130, 13133, 13137, 13143, 29559, 13179, 13180, 13181, 13182, 13198, 13199, 13212, 13213, 13214, 13217, 13252, 29641, 13261, 29654, 29667, 29703, 29734, 29738, 29742, 29794, 29833, 29855, 29953, 29999, 30063, 30363, 30364, 30366, 30374, 30534, 30753, 30798, 30820, 63785, 31024, 31124, 31131, 63964, 64015, 64016, 64017, 64019, 64020, 64021, 64022, 64025, 64026, 64027, 64031, 64032, 64033, 64034, 64036, 64038, 31441, 31463, 31467, 31646, 32072, 32092, 32160, 32183, 32214, 32338, 32394, 65282, 65287, 65293, 65374, 32583, 65508])
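The three-argument `xraydict(...)` call above appears to layer an (empty) set of overrides and a list of excluded keys over a base table without copying it. The Python class below is a hypothetical analogue written under that assumption; `OverlayDict` is not part of Kuroko or of this package, and the exact semantics of Kuroko's `xraydict` are not confirmed by this diff.

```python
from collections.abc import Mapping

class OverlayDict(Mapping):
    """Hypothetical read-only view: base table + overrides - excluded keys."""

    def __init__(self, base, overrides, excluded):
        self._base = base  # shared, not copied
        self._overrides = dict(overrides)
        self._excluded = frozenset(excluded)

    def __getitem__(self, key):
        if key in self._overrides:
            return self._overrides[key]
        if key in self._excluded:
            raise KeyError(key)  # masked out even though the base has it
        return self._base[key]

    def __iter__(self):
        yield from self._overrides
        for key in self._base:
            if key not in self._overrides and key not in self._excluded:
                yield key

    def __len__(self):
        return sum(1 for _ in self)
```

With this model, excluding a code point such as 65374 (the fullwidth tilde) from the JIS X 0208 table means a lookup falls through to whichever table is tried next, which is the effect the comment above describes.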
@lazy_property
def decode_jis78():
return xraydict(more_dbdata.decode_jis7, {(116, 33): 23597, (48, 34): 21854, (116, 34): 27097, (116, 35): 36965, (116, 36): 29814, (44, 36): 9472, (44, 37): 9473, (44, 38): 9474, (44, 39): 9475, (44, 40): 9476, (44, 41): 9477, (44, 42): 9478, (44, 43): 9479, (72, 46): 28497, (44, 44): 9480, (72, 48): 37297, (44, 45): 9481, (44, 50): 9486, (48, 51): 39994, (56, 52): 40572, (44, 46): 9482, (44, 51): 9487, (44, 47): 9483, (44, 48): 9484, (44, 49): 9485, (44, 58): 9494, (44, 59): 9495, (44, 52): 9488, (44, 53): 9489, (44, 62): 9498, (44, 63): 9499, (44, 64): 9500, (44, 65): 9501, (44, 66): 9502, (52, 67): 28748, (44, 67): 9503, (44, 68): 9504, (100, 70): 31725, (44, 69): 9505, (60, 72): 23650, (60, 73): 34306, (44, 54): 9490, (44, 55): 9491, (44, 56): 9492, (76, 77): 40629, (108, 77): 36046, (68, 79): 25681, (44, 57): 9493, (44, 70): 9506, (52, 82): 35563, (44, 71): 9507, (44, 72): 9508, (44, 73): 9509, (80, 86): 20397, (112, 87): 38765, (44, 74): 9510, (44, 75): 9511, (44, 60): 9496, (68, 91): 22778, (44, 61): 9497, (44, 76): 9512, (44, 77): 9513, (44, 78): 9514, (44, 79): 9515, (44, 80): 9516, (44, 81): 9517, (44, 82): 9518, (84, 100): 22775, (44, 83): 9519, (44, 84): 9520, (44, 85): 9521, (44, 86): 9522, (44, 87): 9523, (44, 88): 9524, (44, 107): 9543, (44, 89): 9525, (44, 90): 9526, (44, 91): 9527, (44, 92): 9528, (44, 93): 9529, (44, 94): 9530, (44, 95): 9531, (44, 96): 9532, (112, 116): 38938, (44, 97): 9533, (96, 118): 29796, (44, 98): 9534, (44, 99): 9535, (76, 121): 34282, (44, 100): 9536, (44, 101): 9537, (44, 102): 9538, (44, 103): 9539, (44, 104): 9540, (44, 105): 9541, (44, 106): 9542, (44, 108): 9544, (44, 109): 9545, (44, 110): 9546, (44, 111): 9547, (41, 33): 33, (105, 34): 34122, (41, 34): 34, (41, 35): 35, (41, 36): 36, (41, 37): 37, (41, 38): 38, (65, 40): 36068, (41, 39): 39, (41, 40): 40, (61, 43): 32353, (41, 41): 41, (41, 42): 42, (105, 46): 34222, (41, 43): 43, (73, 48): 27292, (41, 44): 44, (41, 45): 45, (41, 46): 46, (41, 47): 47, (41, 48): 
48, (69, 54): 22625, (41, 49): 49, (41, 50): 50, (41, 51): 51, (41, 52): 52, (41, 53): 53, (41, 54): 54, (41, 55): 55, (41, 56): 56, (69, 63): 39002, (41, 57): 57, (41, 58): 58, (41, 59): 59, (41, 60): 60, (41, 61): 61, (41, 62): 62, (41, 63): 63, (41, 64): 64, (41, 65): 65, (41, 66): 66, (41, 67): 67, (41, 68): 68, (41, 69): 69, (41, 70): 70, (41, 71): 71, (41, 72): 72, (41, 73): 73, (41, 74): 74, (41, 75): 75, (41, 76): 76, (41, 77): 77, (41, 78): 78, (41, 79): 79, (69, 87): 31018, (41, 80): 80, (41, 81): 81, (77, 90): 36953, (105, 90): 34510, (57, 92): 31014, (41, 82): 82, (41, 88): 88, (65, 95): 25620, (41, 83): 83, (41, 84): 84, (41, 85): 85, (41, 86): 86, (41, 87): 87, (41, 89): 89, (41, 90): 90, (41, 91): 91, (41, 92): 165, (65, 105): 30246, (77, 105): 33802, (49, 107): 28976, (41, 93): 93, (57, 109): 40628, (109, 110): 36841, (69, 110): 27310, (41, 94): 94, (41, 95): 95, (41, 96): 96, (69, 115): 28644, (41, 97): 97, (41, 98): 98, (41, 99): 99, (41, 100): 100, (69, 120): 31153, (89, 120): 25785, (41, 101): 101, (41, 102): 102, (41, 103): 103, (41, 104): 104, (41, 105): 105, (41, 106): 106, (41, 107): 107, (41, 108): 108, (41, 109): 109, (41, 110): 110, (41, 111): 111, (41, 112): 112, (41, 113): 113, (41, 114): 114, (41, 115): 115, (41, 116): 116, (41, 117): 117, (41, 118): 118, (41, 119): 119, (41, 120): 120, (41, 121): 121, (41, 122): 122, (41, 123): 123, (41, 124): 124, (41, 125): 125, (41, 126): 8254, (42, 33): 65377, (54, 34): 20448, (42, 34): 65378, (106, 36): 34687, (42, 35): 65379, (42, 36): 65380, (42, 37): 65381, (42, 38): 65382, (50, 41): 40367, (50, 42): 40407, (42, 39): 65383, (42, 40): 65384, (42, 41): 65385, (42, 42): 65386, (42, 43): 65387, (42, 44): 65388, (42, 45): 65389, (42, 46): 65390, (42, 47): 65391, (42, 48): 65392, (42, 49): 65393, (42, 50): 65394, (42, 51): 65395, (42, 52): 65396, (90, 57): 25890, (94, 57): 28059, (42, 53): 65397, (42, 54): 65398, (42, 55): 65399, (42, 56): 65400, (42, 57): 65401, (42, 58): 65402, (42, 59): 65403, 
(70, 66): 28678, (42, 60): 65404, (42, 61): 65405,


@ -1,3 +1,6 @@
"""
Defines 8-bit mapping data for `codecs.dbextra`.
"""
from collections import xraydict
from codecs.infrastructure import lazy_property
from codecs.dbdata import more_dbdata, XEucJpIncrementalEncoder, XEucJpIncrementalDecoder, Windows31JIncrementalEncoder, Windows31JIncrementalDecoder, Big5EtenIncrementalEncoder, Big5HkscsIncrementalDecoder


@ -1,11 +1,19 @@
"""Underpinning infrastructure for the codecs module."""
from codecs.isweblabel import map_weblabel
def idstr(obj):
def _idstr(obj):
let reprd = object.__repr__(obj)
return reprd.split(" at 0x")[1].split(">")[0]
let _encoder_registry = {}
let _decoder_registry = {}
def register_kuroko_codec(labels, incremental_encoder_class, incremental_decoder_class):
"""
Register a given `IncrementalEncoder` subclass and a given `IncrementalDecoder` subclass
with a given list of labels. Usually, this is expected to include the encoding name, along
with a list of labels for aliases and/or subsets of the encoding. Either coder class may be `None`,
if the encoder/decoder labels are being registered asymmetrically.
"""
for label in labels:
let norm = label.replace("_", "-").lower()
if incremental_encoder_class:
@ -28,15 +36,32 @@ def register_kuroko_codec(labels, incremental_encoder_class, incremental_decoder
_decoder_registry[norm] = incremental_decoder_class
class KurokoCodecInfo:
"""
Descriptor for the registered encoder and decoder for a given label. Has five members:
- `name`: the label covered by this descriptor.
- `encode`: encode a complete Unicode sequence.
- `decode`: decode a complete byte sequence.
- `incrementalencoder`: IncrementalEncoder subclass.
- `incrementaldecoder`: IncrementalDecoder subclass.
"""
def __init__(label, encoder, decoder):
self.name = label
self.incrementalencoder = encoder
self.incrementaldecoder = decoder
def encode(string, errors="strict"):
"""
Encode a complete Unicode sequence to a complete byte string.
The semantics of the name passed to `errors=` are as documented for `lookup_error()`.
"""
if self.incrementalencoder:
return self.incrementalencoder(errors).encode(string, True)
raise ValueError(f"unrecognised encoding or decode-only encoding: {self.name!r}")
def decode(data, errors="strict"):
"""
Decode a complete byte sequence to a complete Unicode stream.
The semantics of the name passed to `errors=` are as documented for `lookup_error()`.
"""
if self.incrementaldecoder:
return self.incrementaldecoder(errors).decode(data, True)
raise ValueError(f"unrecognised encoding or encode-only encoding: {self.name!r}")
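The one-shot `encode` and `decode` methods above construct a fresh incremental coder and call it once with the final flag set. A minimal sketch of the same pattern follows, written in Python against the standard-library `codecs` module so that it can be run. That substitution is an assumption for illustration: the Kuroko classes in this diff are not used, and `encode_all` and `decode_all` are hypothetical names.

```python
import codecs

def encode_all(string, label, errors="strict"):
    # Build a fresh incremental encoder and flush it in a single call,
    # mirroring `self.incrementalencoder(errors).encode(string, True)`.
    encoder_class = codecs.getincrementalencoder(label)
    return encoder_class(errors).encode(string, True)

def decode_all(data, label, errors="strict"):
    # Same idea in the other direction: one call with final=True.
    decoder_class = codecs.getincrementaldecoder(label)
    return decoder_class(errors).decode(data, True)

encoded = encode_all("caf\u00e9", "cp437")
decoded = decode_all(encoded, "cp437")
```

Because a new coder instance is created per call, no state leaks between unrelated one-shot conversions.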
@ -66,9 +91,17 @@ class KurokoCodecInfo:
ret += " (HTML5 " + repr(dec.html5name) + ")"
else:
ret += "; no decoder"
return ret + "; at 0x" + idstr(self) + ">"
return ret + "; at 0x" + _idstr(self) + ">"
def lookup(label, web=False):
"""
Obtain a `KurokoCodecInfo` for a given label. If `web=False` (the default), will always succeed,
but the resulting `KurokoCodecInfo` might be unable to encode and/or unable to decode if the
label is not recognised in that direction. If `web=True`, will raise KeyError if the label is
not a WHATWG-permitted label, and will map certain labels to undefined per the WHATWG spec.
Can be simply accessed as `codecs.lookup`.
"""
let proclabel = label.lower()
if web:
proclabel = map_weblabel(label)
@ -85,15 +118,34 @@ def lookup(label, web=False):
return KurokoCodecInfo(proclabel, enc, dec)
def encode(string, label, web=False, errors="strict"):
"""
Encode a complete Unicode sequence to a complete byte string in the given encoding. The
semantics of the `web=` argument are the same as for `lookup()`. The semantics of the name
passed to `errors=` are as documented for `lookup_error()`.
Can be simply accessed as `codecs.encode`.
"""
return lookup(label, web = web).encode(string, errors=errors)
def decode(data, label, web=False, errors="strict"):
"""
Decode a complete byte sequence in the given encoding to a complete Unicode stream. The
semantics of the `web=` argument are the same as for `lookup()`. The semantics of the name
passed to `errors=` are as documented for `lookup_error()`.
Can be simply accessed as `codecs.decode`.
"""
return lookup(label, web = web).decode(data, errors=errors)
# Constructor is e.g. UnicodeEncodeError(encoding, object, start, end, reason)
# Wouldn't it be wonderful if Python bloody documented that anywhere (e.g. manual or docstring)?
# -- Har.
class UnicodeError(ValueError):
"""
Exception raised when an error is encountered or detected in the process of encoding or
decoding. May instead be passed to a handler when not in strict mode. Contains machine-readable
information about the error encountered, allowing error handlers to respond to it.
"""
def __init__(encoding, object, start, end, reason):
self.encoding = encoding
self.object = object
@ -113,27 +165,66 @@ class UnicodeError(ValueError):
return f"codec for {self.encoding!r} cannot process sequence {slice!r}: {self.reason}"
class UnicodeEncodeError(UnicodeError):
"""
UnicodeError subclass raised when an error is encountered in the process of encoding.
"""
class UnicodeDecodeError(UnicodeError):
"""
UnicodeError subclass raised when an error is encountered in the process of decoding.
"""
let _error_registry = {}
def register_error(name, handler):
"""
Register a new error handler. The handler should be a function taking a `UnicodeError` and
either raising an exception or returning a tuple of (substitute, resume_index). The substitute
should be bytes (usually expected to be in ASCII) for a `UnicodeEncodeError`, str otherwise.
"""
_error_registry[name] = handler
def lookup_error(name):
"""
Look up an error handler function registered with a certain name. By default, the following
are registered. It is important to note that nothing obligates a codec to actually *use* the
error handler if it is not deemed possible or appropriate, and so specifying a non-strict
error handler will not guarantee an exception will not be raised, especially when working with
a codec which is not a "normal" text encoding (e.g. `undefined` or `inverse-base64`).
- `strict`: raise an exception.
- `ignore`: skip invalid substrings. Not always recommended: can facilitate masked injection.
- `replace`: insert a replacement character (decoding) or question mark (encoding).
- `warnreplace`: like `replace` but prints a message to stderr; good for debugging.
- `backslashreplace`: replace with Python/Kuroko style Unicode escapes. Note that this only
matches JavaScript escape syntax for Basic Multilingual Plane characters. Encoding only.
- `xmlcharrefreplace`: replace with HTML/XML numerical entities. Note that this will, per
WHATWG, never generate entities for Shift Out, Shift In and Escape (i.e. when encoding to a
stateful encoding which uses them, e.g. ISO-2022-JP), instead generating an entity for the
replacement character. Encoding only.
"""
return _error_registry[name]
def strict_errors(exc):
"""
Handler for `strict` errors: raise the exception.
"""
raise exc
register_error("strict", strict_errors)
def ignore_errors(exc):
"""
Handler for `ignore` errors: skip invalid sequences.
"""
if isinstance(exc, UnicodeEncodeError):
return (b"", exc.end)
return ("", exc.end)
register_error("ignore", ignore_errors)
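The handler contract described for `register_error` (take the error object, then either raise or return a `(substitute, resume_index)` pair) is the same one used by Python's standard `codecs.register_error`. The sketch below uses the Python standard library rather than the Kuroko module, as an assumption for the sake of a runnable example. The handler name `hexmark` is hypothetical.

```python
import codecs

def hexmark_errors(exc):
    # Hypothetical handler: returns (substitute, resume_index).
    if isinstance(exc, UnicodeEncodeError):
        # Substitute is bytes when encoding: mark each bad character in hex.
        bad = exc.object[exc.start:exc.end]
        marks = "".join("<%04X>" % ord(ch) for ch in bad)
        return (marks.encode("ascii"), exc.end)
    if isinstance(exc, UnicodeDecodeError):
        # Substitute is str when decoding.
        return ("\ufffd", exc.end)
    raise exc

codecs.register_error("hexmark", hexmark_errors)

encoded = "a\u3042b".encode("ascii", errors="hexmark")
decoded = b"a\xffb".decode("ascii", errors="hexmark")
```

Returning `exc.end` as the resume index skips exactly the offending span, which is what the `ignore` and `replace` handlers above do as well.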
def replace_errors(exc):
"""
Handler for `replace` errors: insert replacement character (if decoding) or
question mark (if encoding).
"""
if isinstance(exc, UnicodeEncodeError):
return (b"?", exc.end)
else if isinstance(exc, UnicodeDecodeError):
@ -143,6 +234,10 @@ def replace_errors(exc):
register_error("replace", replace_errors)
def warnreplace_errors(exc):
"""
Handler for `warnreplace` errors: insert replacement character (if decoding) or question mark
(if encoding) and print a warning to `stderr`.
"""
import fileio
fileio.stderr.write(type(exc).__name__ + ": " + str(exc) + "\n")
if isinstance(exc, UnicodeEncodeError):
@ -154,6 +249,11 @@ def warnreplace_errors(exc):
register_error("warnreplace", warnreplace_errors)
def backslashreplace_errors(exc):
"""
Handler for `backslashreplace` errors: replace unencodable character with Python/Kuroko style
escape sequence. For Basic Multilingual Plane characters, this also matches JavaScript; beyond
that, they differ.
"""
if isinstance(exc, UnicodeEncodeError):
# Work around str.format not supporting format specifiers
let myhex = hex(ord(exc.object[exc.start])).split("x", 1)[1]
@ -170,6 +270,11 @@ def backslashreplace_errors(exc):
register_error("backslashreplace", backslashreplace_errors)
def xmlcharrefreplace_errors(exc):
"""
Handler for `xmlcharrefreplace` errors: replace unencodable character with XML numeric entity
for the character unless it is Shift Out, Shift In or Escape, in which case insert the XML
numeric entity for the replacement character (as stipulated by WHATWG for ISO-2022-JP).
"""
if isinstance(exc, UnicodeEncodeError):
let codepoint = ord(exc.object[exc.start])
# Per WHATWG (specified in its ISO-2022-JP encoder, the only one that
@ -181,6 +286,10 @@ def xmlcharrefreplace_errors(exc):
register_error("xmlcharrefreplace", xmlcharrefreplace_errors)
class ByteCatenator:
"""
Helper class for maintaining a stream to which `bytes` objects will be repeatedly catenated
in place.
"""
def __init__():
self.list = []
def add(data):
@ -189,6 +298,10 @@ class ByteCatenator:
return b"".join(self.list)
class StringCatenator:
"""
Helper class for maintaining a stream to which `str` objects will be repeatedly catenated
in place.
"""
def __init__():
self.list = []
def add(string):
@ -197,6 +310,13 @@ class StringCatenator:
return "".join(self.list)
class IncrementalEncoder:
"""
Incremental encoder, allowing more encoded data to be generated as more Unicode data is
obtained. Note that the return values from `encode` are not guaranteed to encompass all data
which has been passed in, until it is called with `final=True`.
This is the base class and should not be instantiated directly.
"""
name = None
html5name = None
def __init__(errors):
@ -207,15 +327,41 @@ class IncrementalEncoder:
let w = "(non-HTML5)"
if self.html5name:
w = f"(HTML5 {self.html5name!r})"
let addr = idstr(self)
let addr = _idstr(self)
return f"<{c.__name__} instance: encoder for {self.name!r} {w} at 0x{addr}>"
def encode(string, final = False):
"""
Passes the given string in to the encoder, and returns a sequence of bytes. When
final=False, the return value might not represent the entire input (some of which may
become represented at the start of the value returned by the next call). When final=True,
all of the input will be represented, and any final state change sequence required by the
encoding will be outputted.
"""
raise NotImplementedError("must be implemented by subclass")
def reset():
"""
Reset encoder to initial state, without outputting, discarding any pending data.
"""
pass
def getstate():
"""
Returns an arbitrary object encapsulating encoder state.
"""
pass
def setstate(state):
"""
Sets encoder state to one previously returned by getstate().
"""
pass
class IncrementalDecoder:
"""
Incremental decoder, allowing more Unicode data to be generated as more encoded data is
obtained. Note that the return values from `decode` are not guaranteed to encompass all data
which has been passed in, until it is called with `final=True`.
This is the base class and should not be instantiated directly.
"""
name = None
html5name = None
def __init__(errors):
@ -226,11 +372,20 @@ class IncrementalDecoder:
let w = "(non-HTML5)"
if self.html5name:
w = f"(HTML5 {self.html5name!r})"
let addr = idstr(self)
let addr = _idstr(self)
return f"<{c.__name__} instance: decoder for {self.name!r} {w} at 0x{addr}>"
def decode(data_in, final = False):
"""
Passes the given bytes in to the decoder, and returns a Unicode string. When
final=False, the return value might not represent the entire input (some of which may
become represented at the start of the value returned by the next call). When final=True,
all of the input will be represented, and an error will be generated if it is truncated.
"""
raise NotImplementedError("must be implemented by subclass")
def _handle_truncation(out, unused, final, data, offset, leader):
"""
Helper function used by subclasses to handle any pending data when returning from `decode`.
"""
if len(leader) == 0:
return out.getvalue()
else if final:
@ -242,13 +397,30 @@ class IncrementalDecoder:
self.pending = bytes(leader)
return out.getvalue()
def reset():
"""
Reset decoder to initial state, without outputting, discarding any pending data.
"""
self.pending = b""
def getstate():
"""
Returns an arbitrary object encapsulating decoder state.
"""
return self.pending
def setstate(state):
"""
Sets decoder state to one previously returned by getstate().
"""
self.pending = state
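The pending-data behaviour documented for `IncrementalDecoder` (output may lag behind input until `final=True`, and truncation is only an error at the end) can be observed with any incremental decoder that follows the same protocol. The sketch below uses Python's standard UTF-8 incremental decoder as a stand-in, which is an assumption made for runnability. It does not exercise the Kuroko classes in this diff.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")("strict")

# U+3042 is three bytes in UTF-8 (E3 81 82); feed it in two pieces.
first = decoder.decode(b"\xe3\x81")       # incomplete: nothing emitted yet
state = decoder.getstate()                # opaque object holding the pending bytes
second = decoder.decode(b"\x82", True)    # completes the character

# A truncated stream only becomes an error once final=True is passed.
decoder.reset()
decoder.decode(b"\xe3\x81")
try:
    decoder.decode(b"", True)
    truncated_detected = False
except UnicodeDecodeError:
    truncated_detected = True
```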
class AsciiIncrementalEncoder(IncrementalEncoder):
"""
Encoder for ISO/IEC 4873-DV, and base class for simple _sensu lato_ extended ASCII encoders.
Encoders for more complex cases, such as ISO-2022-JP, do not inherit from this class.
ISO/IEC 4873-DV is, as of the current (third) edition of ISO/IEC 4873, the same as what
people usually mean when they say "ASCII" (i.e. an eighth bit exists but is never used, and
backspace composition is not a thing which exists for encoding characters).
"""
# The obvious labels for ASCII are all Windows-1252 per WHATWG. Also, what people call
# "ASCII" in 8-bit-byte contexts (without backspace combining) is properly ISO-4873-DV.
name = "ecma-43-dv"
@ -266,6 +438,7 @@ class AsciiIncrementalEncoder(IncrementalEncoder):
if isinstance(i, tuple):
self._lead_codes.setdefault(i[0], []).append(i)
def encode(string_in, final = False):
"""Implements `IncrementalEncoder.encode`"""
let string = self.pending_lead + string_in
self.pending_lead = ""
let out = ByteCatenator()
@ -313,13 +486,24 @@ class AsciiIncrementalEncoder(IncrementalEncoder):
if offset < 0:
offset += len(string)
def reset():
"""Implements `IncrementalEncoder.reset`"""
self.pending_lead = ""
def getstate():
"""Implements `IncrementalEncoder.getstate`"""
return self.pending_lead
def setstate(state):
"""Implements `IncrementalEncoder.setstate`"""
self.pending_lead = state
class AsciiIncrementalDecoder(IncrementalDecoder):
"""
Decoder for ISO/IEC 4873-DV, and base class for simple _sensu lato_ extended ASCII decoders.
Decoders for more complex cases, such as ISO-2022-JP, do not inherit from this class.
ISO/IEC 4873-DV is, as of the current (third) edition of ISO/IEC 4873, the same as what
people usually mean when they say "ASCII" (i.e. an eighth bit exists but is never used, and
backspace composition is not a thing which exists for encoding characters).
"""
name = "ecma-43-dv"
html5name = None
# For non-ASCII characters (this should work as a base class)
@ -329,6 +513,7 @@ class AsciiIncrementalDecoder(IncrementalDecoder):
trailrange = ()
ascii_exceptions = ()
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let out = StringCatenator()
@ -392,13 +577,19 @@ class AsciiIncrementalDecoder(IncrementalDecoder):
out.add(errorret[0])
offset = errorret[1]
if offset < 0:
offset += len(string)
offset += len(data)
register_kuroko_codec(["ecma-43-dv", "iso-4873-dv", "646", "cp367", "ibm367", "iso646-us",
"iso-646.irv-1991", "iso-ir-6", "us", "csascii"],
AsciiIncrementalEncoder, AsciiIncrementalDecoder)
class BaseEbcdicIncrementalEncoder(IncrementalEncoder):
"""
Base class for EBCDIC encoders.
On its own, it is only capable of encoding `U+3000` (as ``x'0E', x'40', x'40', x'0F'``); hence,
it should not, generally speaking, be used directly.
"""
name = None
html5name = None
sbcs_encode = {}
@ -407,6 +598,7 @@ class BaseEbcdicIncrementalEncoder(IncrementalEncoder):
shift_to_dbcs = 0x0E
shift_to_sbcs = 0x0F
def encode(string, final = False):
"""Implements `IncrementalEncoder.encode`"""
let out = ByteCatenator()
let offset = 0
while 1: # offset can be arbitrarily changed by the error handler, so not a for
@ -450,13 +642,22 @@ class BaseEbcdicIncrementalEncoder(IncrementalEncoder):
if offset < 0:
offset += len(string)
def reset():
"""Implements `IncrementalEncoder.reset`"""
self.in_dbcshost = False
def getstate():
"""Implements `IncrementalEncoder.getstate`"""
return self.in_dbcshost
def setstate(state):
"""Implements `IncrementalEncoder.setstate`"""
self.in_dbcshost = state
class BaseEbcdicIncrementalDecoder(IncrementalDecoder):
"""
Base class for EBCDIC decoders.
On its own, it is only capable of decoding `U+3000` (from ``x'0E', x'40', x'40', x'0F'``); hence,
it should not, generally speaking, be used directly.
"""
name = None
html5name = None
sbcs_decode = {}
@ -465,6 +666,7 @@ class BaseEbcdicIncrementalDecoder(IncrementalDecoder):
shift_to_dbcs = 0x0E
shift_to_sbcs = 0x0F
def decode(data_in, final = False):
"""Implements `IncrementalDecoder.decode`"""
let data = self.pending + data_in
self.pending = b""
let out = StringCatenator()
@ -537,17 +739,24 @@ class BaseEbcdicIncrementalDecoder(IncrementalDecoder):
out.add(errorret[0])
offset = errorret[1]
if offset < 0:
offset += len(string)
offset += len(data)
def reset():
"""Implements `IncrementalDecoder.reset`"""
self.pending = b""
self.in_dbcshost = False
def getstate():
"""Implements `IncrementalDecoder.getstate`"""
return (self.pending, self.in_dbcshost)
def setstate(state):
"""Implements `IncrementalDecoder.setstate`"""
self.pending = state[0]
self.in_dbcshost = state[1]
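The EBCDIC base classes switch between single-byte and double-byte modes with the shift codes `0x0E` and `0x0F`, which is why `U+3000` on its own comes out as `x'0E', x'40', x'40', x'0F'`. The Python sketch below models that framing only. The function name, the shape of the two maps, and the strict-only error handling are assumptions for illustration, since the real method body is elided from this diff.

```python
SHIFT_TO_DBCS = 0x0E
SHIFT_TO_SBCS = 0x0F

def encode_mixed(string, sbcs_encode, dbcs_encode):
    # Hypothetical model: sbcs_encode maps code point -> byte,
    # dbcs_encode maps code point -> (byte, byte).
    out = bytearray()
    in_dbcs = False
    for index, ch in enumerate(string):
        cp = ord(ch)
        if cp in sbcs_encode:
            if in_dbcs:
                out.append(SHIFT_TO_SBCS)
                in_dbcs = False
            out.append(sbcs_encode[cp])
        elif cp in dbcs_encode:
            if not in_dbcs:
                out.append(SHIFT_TO_DBCS)
                in_dbcs = True
            out.extend(dbcs_encode[cp])
        else:
            raise UnicodeEncodeError("ebcdic-sketch", string, index, index + 1,
                                     "character not supported by target encoding")
    if in_dbcs:
        # Return to single-byte mode at the end of the stream.
        out.append(SHIFT_TO_SBCS)
    return bytes(out)

ideographic_space = {0x3000: (0x40, 0x40)}
alone = encode_mixed("\u3000", {}, ideographic_space)
# 65: 193 is the entry for "A" in the cp037 table in this diff.
mixed = encode_mixed("A\u3000A", {65: 193}, ideographic_space)
```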
class UndefinedIncrementalEncoder(IncrementalEncoder):
"""
Encoder which errors out on all input. For use on input for which encoding should not be
attempted. Error handler is ignored.
"""
name = "undefined"
html5name = "replacement"
# WHATWG doesn't specify an encoder for "replacement" so follow Python "undefined" here.
@ -558,6 +767,10 @@ class UndefinedIncrementalEncoder(IncrementalEncoder):
strict_errors(error)
class UndefinedIncrementalDecoder(IncrementalDecoder):
"""
Decoder which errors out on all input. For use on input for which decoding should not be
attempted. Error handler is honoured, and called once per non-empty `decode` method call.
"""
name = "undefined"
html5name = "replacement"
def decode(data, final = False):
@ -574,6 +787,10 @@ register_kuroko_codec(
def lazy_property(method):
"""
Like property(), but memoises the value returned. The return value is assumed to be
constant at the class level, i.e. the same for all instances.
"""
let memo = None
def retriever(this):
if memo == None:
@ -583,6 +800,9 @@ def lazy_property(method):
class encodesto7bit:
"""
Encoding map for a 7-bit set, wrapping an encoding map for an 8-bit EUC or EUC-superset encoding.
"""
def __init__(base):
self.base = base
def __contains__(key):
@ -613,6 +833,9 @@ class encodesto7bit:
class decodesto7bit:
"""
Decoding map for a 7-bit set, wrapping a decoding map for an 8-bit EUC or EUC-superset encoding.
"""
def __init__(base):
self.base = base
def __contains__(key):


@ -0,0 +1,81 @@
"""
This module includes codecs implementing special handling for symbol fonts.
"""
from codecs.infrastructure import register_kuroko_codec, ByteCatenator, StringCatenator, UnicodeEncodeError, UnicodeDecodeError, lookup_error, lookup, IncrementalEncoder, IncrementalDecoder, lazy_property
from collections import xraydict
class Cp042IncrementalEncoder(IncrementalEncoder):
"""
Encoder for Windows code page 42 (GDI Symbol), and base class for symbol font encoders.
This maps characters to PUA with the low 8 bits matching the original byte encoding, similarly
to `x-user-defined`, but using a different PUA range and including all non-C0 bytes, not
only non-ASCII bytes.
"""
name = "cp042"
html5name = None
encoding_map = {}
def encode(string, final = False):
"""Implements `IncrementalEncoder.encode`"""
let out = ByteCatenator()
let offset = 0
while 1: # offset can be arbitrarily changed by the error handler, so not a for
if offset >= len(string):
return out.getvalue()
let i = string[offset]
if ord(i) in self.encoding_map:
let target = self.encoding_map[ord(i)]
out.add(bytes([target]))
offset += 1
else if ord(i) < 0x100:
# U+0020 thru U+00FF are accepted by GDI itself, but not by Code page 42
# as implemented by Microsoft, which has caused problems:
# http://archives.miloush.net/michkap/archive/2005/11/08/490495.html
out.add(bytes([ord(i)]))
offset += 1
else if (0xF020 <= ord(i)) and (ord(i) < 0xF100):
out.add(bytes([ord(i) - 0xF000]))
offset += 1
else if (0xF780 <= ord(i)) and (ord(i) < 0xF800):
# Accept (not generate) the x-user-defined range as well, because why not?
out.add(bytes([ord(i) - 0xF700]))
offset += 1
else:
let error = UnicodeEncodeError(self.name, string, offset, offset + 1,
"character not supported by target encoding")
let errorret = lookup_error(self.errors)(error)
out.add(errorret[0])
offset = errorret[1]
if offset < 0:
offset += len(string)
class Cp042IncrementalDecoder(IncrementalDecoder):
"""
Decoder for Windows code page 42 (GDI Symbol), and base class for symbol font decoders.
This maps characters to PUA with the low 8 bits matching the original byte encoding, similarly
to `x-user-defined`, but using a different PUA range and including all non-C0 bytes, not
only non-ASCII bytes.
"""
name = "cp042"
html5name = None
decoding_map = {}
def decode(data, final = False):
"""Implements `IncrementalDecoder.decode`"""
self.pending = b""
let out = StringCatenator()
let offset = 0
for i in data:
if i in self.decoding_map:
out.add(chr(self.decoding_map[i]))
else if i < 0x20:
out.add(chr(i))
else:
out.add(chr(i + 0xF000))
return out.getvalue()
register_kuroko_codec(["cp042"], Cp042IncrementalEncoder, Cp042IncrementalDecoder)
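The cp042 mapping above is simple enough to model directly: C0 bytes pass through, and every other byte `b` maps to the Private Use Area code point `0xF000 + b`. The encoder additionally accepts U+0000 through U+00FF and the `x-user-defined` range U+F780 through U+F7FF. The Python functions below are a sketch of that arithmetic under two simplifying assumptions: the overriding maps are empty by default, and only strict error handling is modelled. The function names are hypothetical.

```python
def cp042_decode(data, decoding_map=None):
    # Hypothetical model of Cp042IncrementalDecoder.decode.
    decoding_map = decoding_map or {}
    out = []
    for byte in data:
        if byte in decoding_map:
            out.append(chr(decoding_map[byte]))
        elif byte < 0x20:
            out.append(chr(byte))            # C0 controls pass through
        else:
            out.append(chr(byte + 0xF000))   # everything else goes to the PUA
    return "".join(out)

def cp042_encode(string, encoding_map=None):
    # Hypothetical model of Cp042IncrementalEncoder.encode (strict only).
    encoding_map = encoding_map or {}
    out = bytearray()
    for index, ch in enumerate(string):
        cp = ord(ch)
        if cp in encoding_map:
            out.append(encoding_map[cp])
        elif cp < 0x100:
            out.append(cp)
        elif 0xF020 <= cp < 0xF100:
            out.append(cp - 0xF000)
        elif 0xF780 <= cp < 0xF800:
            out.append(cp - 0xF700)          # accepted, never generated
        else:
            raise UnicodeEncodeError("cp042", string, index, index + 1,
                                     "character not supported by target encoding")
    return bytes(out)

decoded = cp042_decode(b"\tA\xff")
encoded = cp042_encode("\uf041A\uf7ff")
```

Note the asymmetry: several code points encode to the same byte, so an encode followed by a decode is not guaranteed to return the original string.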


@ -1,11 +1,18 @@
"""
This module includes some additional single-byte encodings not specified by WHATWG. As such, none
of the codecs in this module should be used in HTML.
This module includes some additional single-byte encodings not specified by WHATWG.
As such, none of the codecs in this module should be used in HTML.
"""
from codecs.infrastructure import AsciiIncrementalEncoder, AsciiIncrementalDecoder, register_kuroko_codec, BaseEbcdicIncrementalEncoder, BaseEbcdicIncrementalDecoder, lazy_property
class Cp037IncrementalEncoder(BaseEbcdicIncrementalEncoder):
"""
IncrementalEncoder implementation for EBCDIC-037.
This is what might be considered the "default" EBCDIC set, and is used in the United States,
the Netherlands, Portugal, Brazil, Australia and New Zealand, and on the ESA/390 in Canada.
"""
name = 'cp037'
html5name = None
@lazy_property
@ -13,6 +20,12 @@ class Cp037IncrementalEncoder(BaseEbcdicIncrementalEncoder):
return {0: 0, 1: 1, 2: 2, 3: 3, 156: 4, 9: 5, 134: 6, 127: 7, 151: 8, 141: 9, 142: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 157: 20, 133: 21, 8: 22, 135: 23, 24: 24, 25: 25, 146: 26, 143: 27, 28: 28, 29: 29, 30: 30, 31: 31, 128: 32, 129: 33, 130: 34, 131: 35, 132: 36, 10: 37, 23: 38, 27: 39, 136: 40, 137: 41, 138: 42, 139: 43, 140: 44, 5: 45, 6: 46, 7: 47, 144: 48, 145: 49, 22: 50, 147: 51, 148: 52, 149: 53, 150: 54, 4: 55, 152: 56, 153: 57, 154: 58, 155: 59, 20: 60, 21: 61, 158: 62, 26: 63, 32: 64, 160: 65, 226: 66, 228: 67, 224: 68, 225: 69, 227: 70, 229: 71, 231: 72, 241: 73, 162: 74, 46: 75, 60: 76, 40: 77, 43: 78, 124: 79, 38: 80, 233: 81, 234: 82, 235: 83, 232: 84, 237: 85, 238: 86, 239: 87, 236: 88, 223: 89, 33: 90, 36: 91, 42: 92, 41: 93, 59: 94, 172: 95, 45: 96, 47: 97, 194: 98, 196: 99, 192: 100, 193: 101, 195: 102, 197: 103, 199: 104, 209: 105, 166: 106, 44: 107, 37: 108, 95: 109, 62: 110, 63: 111, 248: 112, 201: 113, 202: 114, 203: 115, 200: 116, 205: 117, 206: 118, 207: 119, 204: 120, 96: 121, 58: 122, 35: 123, 64: 124, 39: 125, 61: 126, 34: 127, 216: 128, 97: 129, 98: 130, 99: 131, 100: 132, 101: 133, 102: 134, 103: 135, 104: 136, 105: 137, 171: 138, 187: 139, 240: 140, 253: 141, 254: 142, 177: 143, 176: 144, 106: 145, 107: 146, 108: 147, 109: 148, 110: 149, 111: 150, 112: 151, 113: 152, 114: 153, 170: 154, 186: 155, 230: 156, 184: 157, 198: 158, 164: 159, 181: 160, 126: 161, 115: 162, 116: 163, 117: 164, 118: 165, 119: 166, 120: 167, 121: 168, 122: 169, 161: 170, 191: 171, 208: 172, 221: 173, 222: 174, 174: 175, 94: 176, 163: 177, 165: 178, 183: 179, 169: 180, 167: 181, 182: 182, 188: 183, 189: 184, 190: 185, 91: 186, 93: 187, 175: 188, 168: 189, 180: 190, 215: 191, 123: 192, 65: 193, 66: 194, 67: 195, 68: 196, 69: 197, 70: 198, 71: 199, 72: 200, 73: 201, 173: 202, 244: 203, 246: 204, 242: 205, 243: 206, 245: 207, 125: 208, 74: 209, 75: 210, 76: 211, 77: 212, 78: 213, 79: 214, 80: 215, 81: 216, 82: 217, 185: 218, 
251: 219, 252: 220, 249: 221, 250: 222, 255: 223, 92: 224, 247: 225, 83: 226, 84: 227, 85: 228, 86: 229, 87: 230, 88: 231, 89: 232, 90: 233, 178: 234, 212: 235, 214: 236, 210: 237, 211: 238, 213: 239, 48: 240, 49: 241, 50: 242, 51: 243, 52: 244, 53: 245, 54: 246, 55: 247, 56: 248, 57: 249, 179: 250, 219: 251, 220: 252, 217: 253, 218: 254, 159: 255}
class Cp037IncrementalDecoder(BaseEbcdicIncrementalDecoder):
"""
IncrementalDecoder implementation for EBCDIC-037.
This is what might be considered the "default" EBCDIC set, and is used in the United States,
the Netherlands, Portugal, Brazil, Australia and New Zealand, and on the ESA/390 in Canada.
"""
name = 'cp037'
html5name = None
@lazy_property
@ -23,6 +36,9 @@ register_kuroko_codec(['cp037', '037', 'csibm037', 'ebcdic-cp-ca', 'ebcdic-cp-nl
class Cp273IncrementalEncoder(BaseEbcdicIncrementalEncoder):
"""
IncrementalEncoder implementation for EBCDIC-273 (used in German-speaking locales).
"""
name = 'cp273'
html5name = None
@lazy_property
@ -30,6 +46,9 @@ class Cp273IncrementalEncoder(BaseEbcdicIncrementalEncoder):
return {0: 0, 1: 1, 2: 2, 3: 3, 156: 4, 9: 5, 134: 6, 127: 7, 151: 8, 141: 9, 142: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 157: 20, 133: 21, 8: 22, 135: 23, 24: 24, 25: 25, 146: 26, 143: 27, 28: 28, 29: 29, 30: 30, 31: 31, 128: 32, 129: 33, 130: 34, 131: 35, 132: 36, 10: 37, 23: 38, 27: 39, 136: 40, 137: 41, 138: 42, 139: 43, 140: 44, 5: 45, 6: 46, 7: 47, 144: 48, 145: 49, 22: 50, 147: 51, 148: 52, 149: 53, 150: 54, 4: 55, 152: 56, 153: 57, 154: 58, 155: 59, 20: 60, 21: 61, 158: 62, 26: 63, 32: 64, 160: 65, 226: 66, 123: 67, 224: 68, 225: 69, 227: 70, 229: 71, 231: 72, 241: 73, 196: 74, 46: 75, 60: 76, 40: 77, 43: 78, 33: 79, 38: 80, 233: 81, 234: 82, 235: 83, 232: 84, 237: 85, 238: 86, 239: 87, 236: 88, 126: 89, 220: 90, 36: 91, 42: 92, 41: 93, 59: 94, 94: 95, 45: 96, 47: 97, 194: 98, 91: 99, 192: 100, 193: 101, 195: 102, 197: 103, 199: 104, 209: 105, 246: 106, 44: 107, 37: 108, 95: 109, 62: 110, 63: 111, 248: 112, 201: 113, 202: 114, 203: 115, 200: 116, 205: 117, 206: 118, 207: 119, 204: 120, 96: 121, 58: 122, 35: 123, 167: 124, 39: 125, 61: 126, 34: 127, 216: 128, 97: 129, 98: 130, 99: 131, 100: 132, 101: 133, 102: 134, 103: 135, 104: 136, 105: 137, 171: 138, 187: 139, 240: 140, 253: 141, 254: 142, 177: 143, 176: 144, 106: 145, 107: 146, 108: 147, 109: 148, 110: 149, 111: 150, 112: 151, 113: 152, 114: 153, 170: 154, 186: 155, 230: 156, 184: 157, 198: 158, 164: 159, 181: 160, 223: 161, 115: 162, 116: 163, 117: 164, 118: 165, 119: 166, 120: 167, 121: 168, 122: 169, 161: 170, 191: 171, 208: 172, 221: 173, 222: 174, 174: 175, 162: 176, 163: 177, 165: 178, 183: 179, 169: 180, 64: 181, 182: 182, 188: 183, 189: 184, 190: 185, 172: 186, 124: 187, 8254: 188, 168: 189, 180: 190, 215: 191, 228: 192, 65: 193, 66: 194, 67: 195, 68: 196, 69: 197, 70: 198, 71: 199, 72: 200, 73: 201, 173: 202, 244: 203, 166: 204, 242: 205, 243: 206, 245: 207, 252: 208, 74: 209, 75: 210, 76: 211, 77: 212, 78: 213, 79: 214, 80: 215, 81: 216, 82: 217, 185: 218, 
251: 219, 125: 220, 249: 221, 250: 222, 255: 223, 214: 224, 247: 225, 83: 226, 84: 227, 85: 228, 86: 229, 87: 230, 88: 231, 89: 232, 90: 233, 178: 234, 212: 235, 92: 236, 210: 237, 211: 238, 213: 239, 48: 240, 49: 241, 50: 242, 51: 243, 52: 244, 53: 245, 54: 246, 55: 247, 56: 248, 57: 249, 179: 250, 219: 251, 93: 252, 217: 253, 218: 254, 159: 255}
class Cp273IncrementalDecoder(BaseEbcdicIncrementalDecoder):
"""
IncrementalDecoder implementation for EBCDIC-273 (used in German-speaking locales).
"""
name = 'cp273'
html5name = None
@lazy_property
@ -40,6 +59,9 @@ register_kuroko_codec(['cp273', '273', 'ibm273', 'csibm273'], Cp273IncrementalEn
class Cp424IncrementalEncoder(BaseEbcdicIncrementalEncoder):
"""
IncrementalEncoder implementation for EBCDIC-424 (used in Hebrew-speaking locales).
"""
name = 'cp424'
html5name = None
@lazy_property
@ -47,6 +69,9 @@ class Cp424IncrementalEncoder(BaseEbcdicIncrementalEncoder):
return {0: 0, 1: 1, 2: 2, 3: 3, 156: 4, 9: 5, 134: 6, 127: 7, 151: 8, 141: 9, 142: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 157: 20, 133: 21, 8: 22, 135: 23, 24: 24, 25: 25, 146: 26, 143: 27, 28: 28, 29: 29, 30: 30, 31: 31, 128: 32, 129: 33, 130: 34, 131: 35, 132: 36, 10: 37, 23: 38, 27: 39, 136: 40, 137: 41, 138: 42, 139: 43, 140: 44, 5: 45, 6: 46, 7: 47, 144: 48, 145: 49, 22: 50, 147: 51, 148: 52, 149: 53, 150: 54, 4: 55, 152: 56, 153: 57, 154: 58, 155: 59, 20: 60, 21: 61, 158: 62, 26: 63, 32: 64, 1488: 65, 1489: 66, 1490: 67, 1491: 68, 1492: 69, 1493: 70, 1494: 71, 1495: 72, 1496: 73, 162: 74, 46: 75, 60: 76, 40: 77, 43: 78, 124: 79, 38: 80, 1497: 81, 1498: 82, 1499: 83, 1500: 84, 1501: 85, 1502: 86, 1503: 87, 1504: 88, 1505: 89, 33: 90, 36: 91, 42: 92, 41: 93, 59: 94, 172: 95, 45: 96, 47: 97, 1506: 98, 1507: 99, 1508: 100, 1509: 101, 1510: 102, 1511: 103, 1512: 104, 1513: 105, 166: 106, 44: 107, 37: 108, 95: 109, 62: 110, 63: 111, 1514: 113, 160: 116, 8215: 120, 96: 121, 58: 122, 35: 123, 64: 124, 39: 125, 61: 126, 34: 127, 97: 129, 98: 130, 99: 131, 100: 132, 101: 133, 102: 134, 103: 135, 104: 136, 105: 137, 171: 138, 187: 139, 177: 143, 176: 144, 106: 145, 107: 146, 108: 147, 109: 148, 110: 149, 111: 150, 112: 151, 113: 152, 114: 153, 184: 157, 164: 159, 181: 160, 126: 161, 115: 162, 116: 163, 117: 164, 118: 165, 119: 166, 120: 167, 121: 168, 122: 169, 174: 175, 94: 176, 163: 177, 165: 178, 183: 179, 169: 180, 167: 181, 182: 182, 188: 183, 189: 184, 190: 185, 91: 186, 93: 187, 175: 188, 168: 189, 180: 190, 215: 191, 123: 192, 65: 193, 66: 194, 67: 195, 68: 196, 69: 197, 70: 198, 71: 199, 72: 200, 73: 201, 173: 202, 125: 208, 74: 209, 75: 210, 76: 211, 77: 212, 78: 213, 79: 214, 80: 215, 81: 216, 82: 217, 185: 218, 92: 224, 247: 225, 83: 226, 84: 227, 85: 228, 86: 229, 87: 230, 88: 231, 89: 232, 90: 233, 178: 234, 48: 240, 49: 241, 50: 242, 51: 243, 52: 244, 53: 245, 54: 246, 55: 247, 56: 248, 57: 249, 179: 250, 159: 255}
class Cp424IncrementalDecoder(BaseEbcdicIncrementalDecoder):
"""
IncrementalDecoder implementation for EBCDIC-424 (used in Hebrew-speaking locales).
"""
name = 'cp424'
html5name = None
@lazy_property
@ -57,6 +82,9 @@ register_kuroko_codec(['cp424', '424', 'csibm424', 'ebcdic-cp-he', 'ibm424'], Cp
class Cp437IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-437 (the default, hardware or United States DOS encoding)
"""
name = 'cp437'
html5name = None
@lazy_property
@ -64,16 +92,25 @@ class Cp437IncrementalEncoder(AsciiIncrementalEncoder):
return {199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 229: 134, 231: 135, 234: 136, 235: 137, 232: 138, 239: 139, 238: 140, 236: 141, 196: 142, 197: 143, 201: 144, 230: 145, 198: 146, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151, 255: 152, 214: 153, 220: 154, 162: 155, 163: 156, 165: 157, 8359: 158, 402: 159, 225: 160, 237: 161, 243: 162, 250: 163, 241: 164, 209: 165, 170: 166, 186: 167, 191: 168, 8976: 169, 172: 170, 189: 171, 188: 172, 161: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 9569: 181, 9570: 182, 9558: 183, 9557: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 9564: 189, 9563: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 9566: 198, 9567: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 9575: 207, 9576: 208, 9572: 209, 9573: 210, 9561: 211, 9560: 212, 9554: 213, 9555: 214, 9579: 215, 9578: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 9612: 221, 9616: 222, 9600: 223, 945: 224, 223: 225, 915: 226, 960: 227, 931: 228, 963: 229, 181: 230, 964: 231, 934: 232, 920: 233, 937: 234, 948: 235, 8734: 236, 966: 237, 949: 238, 8745: 239, 8801: 240, 177: 241, 8805: 242, 8804: 243, 8992: 244, 8993: 245, 247: 246, 8776: 247, 176: 248, 8729: 249, 183: 250, 8730: 251, 8319: 252, 178: 253, 9632: 254, 160: 255}
class Cp437IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-437 (the default, hardware or United States DOS encoding)
"""
name = 'cp437'
html5name = None
@lazy_property
def decoding_map():
return {128: 199, 129: 252, 130: 233, 131: 226, 132: 228, 133: 224, 134: 229, 135: 231, 136: 234, 137: 235, 138: 232, 139: 239, 140: 238, 141: 236, 142: 196, 143: 197, 144: 201, 145: 230, 146: 198, 147: 244, 148: 246, 149: 242, 150: 251, 151: 249, 152: 255, 153: 214, 154: 220, 155: 162, 156: 163, 157: 165, 158: 8359, 159: 402, 160: 225, 161: 237, 162: 243, 163: 250, 164: 241, 165: 209, 166: 170, 167: 186, 168: 191, 169: 8976, 170: 172, 171: 189, 172: 188, 173: 161, 174: 171, 175: 187, 176: 9617, 177: 9618, 178: 9619, 179: 9474, 180: 9508, 181: 9569, 182: 9570, 183: 9558, 184: 9557, 185: 9571, 186: 9553, 187: 9559, 188: 9565, 189: 9564, 190: 9563, 191: 9488, 192: 9492, 193: 9524, 194: 9516, 195: 9500, 196: 9472, 197: 9532, 198: 9566, 199: 9567, 200: 9562, 201: 9556, 202: 9577, 203: 9574, 204: 9568, 205: 9552, 206: 9580, 207: 9575, 208: 9576, 209: 9572, 210: 9573, 211: 9561, 212: 9560, 213: 9554, 214: 9555, 215: 9579, 216: 9578, 217: 9496, 218: 9484, 219: 9608, 220: 9604, 221: 9612, 222: 9616, 223: 9600, 224: 945, 225: 223, 226: 915, 227: 960, 228: 931, 229: 963, 230: 181, 231: 964, 232: 934, 233: 920, 234: 937, 235: 948, 236: 8734, 237: 966, 238: 949, 239: 8745, 240: 8801, 241: 177, 242: 8805, 243: 8804, 244: 8992, 245: 8993, 246: 247, 247: 8776, 248: 176, 249: 8729, 250: 183, 251: 8730, 252: 8319, 253: 178, 254: 9632, 255: 160}
register_kuroko_codec(['cp437', '437', 'cspc8codepage437', 'ibm437'], Cp437IncrementalEncoder, Cp437IncrementalDecoder)
register_kuroko_codec(['cp437', '437', 'cspc8codepage437', 'ibm437', 'oem-us'], Cp437IncrementalEncoder, Cp437IncrementalDecoder)
class Cp500IncrementalEncoder(BaseEbcdicIncrementalEncoder):
"""
IncrementalEncoder implementation for EBCDIC-500.
This is the so-called "International" EBCDIC locale, used in Belgium and Switzerland, as well
as on the AS/400 in Canada.
"""
name = 'cp500'
html5name = None
@lazy_property
@ -81,6 +118,12 @@ class Cp500IncrementalEncoder(BaseEbcdicIncrementalEncoder):
return {0: 0, 1: 1, 2: 2, 3: 3, 156: 4, 9: 5, 134: 6, 127: 7, 151: 8, 141: 9, 142: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 157: 20, 133: 21, 8: 22, 135: 23, 24: 24, 25: 25, 146: 26, 143: 27, 28: 28, 29: 29, 30: 30, 31: 31, 128: 32, 129: 33, 130: 34, 131: 35, 132: 36, 10: 37, 23: 38, 27: 39, 136: 40, 137: 41, 138: 42, 139: 43, 140: 44, 5: 45, 6: 46, 7: 47, 144: 48, 145: 49, 22: 50, 147: 51, 148: 52, 149: 53, 150: 54, 4: 55, 152: 56, 153: 57, 154: 58, 155: 59, 20: 60, 21: 61, 158: 62, 26: 63, 32: 64, 160: 65, 226: 66, 228: 67, 224: 68, 225: 69, 227: 70, 229: 71, 231: 72, 241: 73, 91: 74, 46: 75, 60: 76, 40: 77, 43: 78, 33: 79, 38: 80, 233: 81, 234: 82, 235: 83, 232: 84, 237: 85, 238: 86, 239: 87, 236: 88, 223: 89, 93: 90, 36: 91, 42: 92, 41: 93, 59: 94, 94: 95, 45: 96, 47: 97, 194: 98, 196: 99, 192: 100, 193: 101, 195: 102, 197: 103, 199: 104, 209: 105, 166: 106, 44: 107, 37: 108, 95: 109, 62: 110, 63: 111, 248: 112, 201: 113, 202: 114, 203: 115, 200: 116, 205: 117, 206: 118, 207: 119, 204: 120, 96: 121, 58: 122, 35: 123, 64: 124, 39: 125, 61: 126, 34: 127, 216: 128, 97: 129, 98: 130, 99: 131, 100: 132, 101: 133, 102: 134, 103: 135, 104: 136, 105: 137, 171: 138, 187: 139, 240: 140, 253: 141, 254: 142, 177: 143, 176: 144, 106: 145, 107: 146, 108: 147, 109: 148, 110: 149, 111: 150, 112: 151, 113: 152, 114: 153, 170: 154, 186: 155, 230: 156, 184: 157, 198: 158, 164: 159, 181: 160, 126: 161, 115: 162, 116: 163, 117: 164, 118: 165, 119: 166, 120: 167, 121: 168, 122: 169, 161: 170, 191: 171, 208: 172, 221: 173, 222: 174, 174: 175, 162: 176, 163: 177, 165: 178, 183: 179, 169: 180, 167: 181, 182: 182, 188: 183, 189: 184, 190: 185, 172: 186, 124: 187, 175: 188, 168: 189, 180: 190, 215: 191, 123: 192, 65: 193, 66: 194, 67: 195, 68: 196, 69: 197, 70: 198, 71: 199, 72: 200, 73: 201, 173: 202, 244: 203, 246: 204, 242: 205, 243: 206, 245: 207, 125: 208, 74: 209, 75: 210, 76: 211, 77: 212, 78: 213, 79: 214, 80: 215, 81: 216, 82: 217, 185: 218, 
251: 219, 252: 220, 249: 221, 250: 222, 255: 223, 92: 224, 247: 225, 83: 226, 84: 227, 85: 228, 86: 229, 87: 230, 88: 231, 89: 232, 90: 233, 178: 234, 212: 235, 214: 236, 210: 237, 211: 238, 213: 239, 48: 240, 49: 241, 50: 242, 51: 243, 52: 244, 53: 245, 54: 246, 55: 247, 56: 248, 57: 249, 179: 250, 219: 251, 220: 252, 217: 253, 218: 254, 159: 255}
class Cp500IncrementalDecoder(BaseEbcdicIncrementalDecoder):
"""
IncrementalDecoder implementation for EBCDIC-500.
This is the so-called "International" EBCDIC locale, used in Belgium and Switzerland, as well
as on the AS/400 in Canada.
"""
name = 'cp500'
html5name = None
@lazy_property
@ -91,6 +134,13 @@ register_kuroko_codec(['cp500', '500', 'csibm500', 'ebcdic-cp-be', 'ebcdic-cp-ch
class Cp720IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-720 (Arabic Letters with Box Drawing)
Note: OEM-720 competed with OEM-864 (which used a different layout, did not include box drawing
characters, included positional forms rather than general letters for Arabic characters, and
included separate East Arabic digits).
"""
name = 'cp720'
html5name = None
@lazy_property
@ -98,6 +148,13 @@ class Cp720IncrementalEncoder(AsciiIncrementalEncoder):
return {128: 128, 129: 129, 233: 130, 226: 131, 132: 132, 224: 133, 134: 134, 231: 135, 234: 136, 235: 137, 232: 138, 239: 139, 238: 140, 141: 141, 142: 142, 143: 143, 144: 144, 1617: 145, 1618: 146, 244: 147, 164: 148, 1600: 149, 251: 150, 249: 151, 1569: 152, 1570: 153, 1571: 154, 1572: 155, 163: 156, 1573: 157, 1574: 158, 1575: 159, 1576: 160, 1577: 161, 1578: 162, 1579: 163, 1580: 164, 1581: 165, 1582: 166, 1583: 167, 1584: 168, 1585: 169, 1586: 170, 1587: 171, 1588: 172, 1589: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 9569: 181, 9570: 182, 9558: 183, 9557: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 9564: 189, 9563: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 9566: 198, 9567: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 9575: 207, 9576: 208, 9572: 209, 9573: 210, 9561: 211, 9560: 212, 9554: 213, 9555: 214, 9579: 215, 9578: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 9612: 221, 9616: 222, 9600: 223, 1590: 224, 1591: 225, 1592: 226, 1593: 227, 1594: 228, 1601: 229, 181: 230, 1602: 231, 1603: 232, 1604: 233, 1605: 234, 1606: 235, 1607: 236, 1608: 237, 1609: 238, 1610: 239, 8801: 240, 1611: 241, 1612: 242, 1613: 243, 1614: 244, 1615: 245, 1616: 246, 8776: 247, 176: 248, 8729: 249, 183: 250, 8730: 251, 8319: 252, 178: 253, 9632: 254, 160: 255}
class Cp720IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-720 (Arabic Letters with Box Drawing)
Note: OEM-720 competed with OEM-864 (which used a different layout, did not include box drawing
characters, included positional forms rather than general letters for Arabic characters, and
included separate East Arabic digits).
"""
name = 'cp720'
html5name = None
@lazy_property
@ -108,6 +165,12 @@ register_kuroko_codec(['cp720'], Cp720IncrementalEncoder, Cp720IncrementalDecode
class Cp737IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-737 (Greek with Box Drawing).
Note: OEM-737 competed with OEM-869 (which used a different Greek layout and preserved only a
subset of the box drawing characters, but included letters with combined trema/acute).
"""
name = 'cp737'
html5name = None
@lazy_property
@ -115,6 +178,12 @@ class Cp737IncrementalEncoder(AsciiIncrementalEncoder):
return {913: 128, 914: 129, 915: 130, 916: 131, 917: 132, 918: 133, 919: 134, 920: 135, 921: 136, 922: 137, 923: 138, 924: 139, 925: 140, 926: 141, 927: 142, 928: 143, 929: 144, 931: 145, 932: 146, 933: 147, 934: 148, 935: 149, 936: 150, 937: 151, 945: 152, 946: 153, 947: 154, 948: 155, 949: 156, 950: 157, 951: 158, 952: 159, 953: 160, 954: 161, 955: 162, 956: 163, 957: 164, 958: 165, 959: 166, 960: 167, 961: 168, 963: 169, 962: 170, 964: 171, 965: 172, 966: 173, 967: 174, 968: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 9569: 181, 9570: 182, 9558: 183, 9557: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 9564: 189, 9563: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 9566: 198, 9567: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 9575: 207, 9576: 208, 9572: 209, 9573: 210, 9561: 211, 9560: 212, 9554: 213, 9555: 214, 9579: 215, 9578: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 9612: 221, 9616: 222, 9600: 223, 969: 224, 940: 225, 941: 226, 942: 227, 970: 228, 943: 229, 972: 230, 973: 231, 971: 232, 974: 233, 902: 234, 904: 235, 905: 236, 906: 237, 908: 238, 910: 239, 911: 240, 177: 241, 8805: 242, 8804: 243, 938: 244, 939: 245, 247: 246, 8776: 247, 176: 248, 8729: 249, 183: 250, 8730: 251, 8319: 252, 178: 253, 9632: 254, 160: 255}
class Cp737IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-737 (Greek with Box Drawing).
Note: OEM-737 competed with OEM-869 (which used a different Greek layout and preserved only a
subset of the box drawing characters, but included letters with combined trema/acute).
"""
name = 'cp737'
html5name = None
@lazy_property
@ -125,6 +194,9 @@ register_kuroko_codec(['cp737'], Cp737IncrementalEncoder, Cp737IncrementalDecode
class Cp775IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-775 (Baltic Rim)
"""
name = 'cp775'
html5name = None
@lazy_property
@ -132,6 +204,9 @@ class Cp775IncrementalEncoder(AsciiIncrementalEncoder):
return {262: 128, 252: 129, 233: 130, 257: 131, 228: 132, 291: 133, 229: 134, 263: 135, 322: 136, 275: 137, 342: 138, 343: 139, 299: 140, 377: 141, 196: 142, 197: 143, 201: 144, 230: 145, 198: 146, 333: 147, 246: 148, 290: 149, 162: 150, 346: 151, 347: 152, 214: 153, 220: 154, 248: 155, 163: 156, 216: 157, 215: 158, 164: 159, 256: 160, 298: 161, 243: 162, 379: 163, 380: 164, 378: 165, 8221: 166, 166: 167, 169: 168, 174: 169, 172: 170, 189: 171, 188: 172, 321: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 260: 181, 268: 182, 280: 183, 278: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 302: 189, 352: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 370: 198, 362: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 381: 207, 261: 208, 269: 209, 281: 210, 279: 211, 303: 212, 353: 213, 371: 214, 363: 215, 382: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 9612: 221, 9616: 222, 9600: 223, 211: 224, 223: 225, 332: 226, 323: 227, 245: 228, 213: 229, 181: 230, 324: 231, 310: 232, 311: 233, 315: 234, 316: 235, 326: 236, 274: 237, 325: 238, 8217: 239, 173: 240, 177: 241, 8220: 242, 190: 243, 182: 244, 167: 245, 247: 246, 8222: 247, 176: 248, 8729: 249, 183: 250, 185: 251, 179: 252, 178: 253, 9632: 254, 160: 255}
class Cp775IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-775 (Baltic Rim)
"""
name = 'cp775'
html5name = None
@lazy_property
@ -142,6 +217,9 @@ register_kuroko_codec(['cp775', '775', 'cspc775baltic', 'ibm775'], Cp775Incremen
class Cp850IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-850 (Western Europe and Canada)
"""
name = 'cp850'
html5name = None
@lazy_property
@ -149,6 +227,9 @@ class Cp850IncrementalEncoder(AsciiIncrementalEncoder):
return {199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 229: 134, 231: 135, 234: 136, 235: 137, 232: 138, 239: 139, 238: 140, 236: 141, 196: 142, 197: 143, 201: 144, 230: 145, 198: 146, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151, 255: 152, 214: 153, 220: 154, 248: 155, 163: 156, 216: 157, 215: 158, 402: 159, 225: 160, 237: 161, 243: 162, 250: 163, 241: 164, 209: 165, 170: 166, 186: 167, 191: 168, 174: 169, 172: 170, 189: 171, 188: 172, 161: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 193: 181, 194: 182, 192: 183, 169: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 162: 189, 165: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 227: 198, 195: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 164: 207, 240: 208, 208: 209, 202: 210, 203: 211, 200: 212, 305: 213, 205: 214, 206: 215, 207: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 166: 221, 204: 222, 9600: 223, 211: 224, 223: 225, 212: 226, 210: 227, 245: 228, 213: 229, 181: 230, 254: 231, 222: 232, 218: 233, 219: 234, 217: 235, 253: 236, 221: 237, 175: 238, 180: 239, 173: 240, 177: 241, 8215: 242, 190: 243, 182: 244, 167: 245, 247: 246, 184: 247, 176: 248, 168: 249, 183: 250, 185: 251, 179: 252, 178: 253, 9632: 254, 160: 255}
class Cp850IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-850 (Western Europe and Canada)
"""
name = 'cp850'
html5name = None
@lazy_property
@ -159,6 +240,9 @@ register_kuroko_codec(['cp850', '850', 'cspc850multilingual', 'ibm850'], Cp850In
class Cp852IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-852 (Central Europe)
"""
name = 'cp852'
html5name = None
@lazy_property
@ -166,6 +250,9 @@ class Cp852IncrementalEncoder(AsciiIncrementalEncoder):
return {199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 367: 133, 263: 134, 231: 135, 322: 136, 235: 137, 336: 138, 337: 139, 238: 140, 377: 141, 196: 142, 262: 143, 201: 144, 313: 145, 314: 146, 244: 147, 246: 148, 317: 149, 318: 150, 346: 151, 347: 152, 214: 153, 220: 154, 356: 155, 357: 156, 321: 157, 215: 158, 269: 159, 225: 160, 237: 161, 243: 162, 250: 163, 260: 164, 261: 165, 381: 166, 382: 167, 280: 168, 281: 169, 172: 170, 378: 171, 268: 172, 351: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 193: 181, 194: 182, 282: 183, 350: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 379: 189, 380: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 258: 198, 259: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 164: 207, 273: 208, 272: 209, 270: 210, 203: 211, 271: 212, 327: 213, 205: 214, 206: 215, 283: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 354: 221, 366: 222, 9600: 223, 211: 224, 223: 225, 212: 226, 323: 227, 324: 228, 328: 229, 352: 230, 353: 231, 340: 232, 218: 233, 341: 234, 368: 235, 253: 236, 221: 237, 355: 238, 180: 239, 173: 240, 733: 241, 731: 242, 711: 243, 728: 244, 167: 245, 247: 246, 184: 247, 176: 248, 168: 249, 729: 250, 369: 251, 344: 252, 345: 253, 9632: 254, 160: 255}
class Cp852IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-852 (Central Europe)
"""
name = 'cp852'
html5name = None
@lazy_property
@ -176,6 +263,14 @@ register_kuroko_codec(['cp852', '852', 'cspcp852', 'ibm852'], Cp852IncrementalEn
class Cp855IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-855 (Balkan Cyrillic).
Note: OEM-855 competed with OEM-866 for Cyrillic; OEM-866 preserved all box drawing characters
(rather than only a subset) and was more popular for Russian, but did not provide coverage
for all of the different South Slavic Cyrillic orthographies, unlike OEM-855. Their layouts
for Cyrillic are entirely different.
"""
name = 'cp855'
html5name = None
@lazy_property
@ -183,6 +278,14 @@ class Cp855IncrementalEncoder(AsciiIncrementalEncoder):
return {1106: 128, 1026: 129, 1107: 130, 1027: 131, 1105: 132, 1025: 133, 1108: 134, 1028: 135, 1109: 136, 1029: 137, 1110: 138, 1030: 139, 1111: 140, 1031: 141, 1112: 142, 1032: 143, 1113: 144, 1033: 145, 1114: 146, 1034: 147, 1115: 148, 1035: 149, 1116: 150, 1036: 151, 1118: 152, 1038: 153, 1119: 154, 1039: 155, 1102: 156, 1070: 157, 1098: 158, 1066: 159, 1072: 160, 1040: 161, 1073: 162, 1041: 163, 1094: 164, 1062: 165, 1076: 166, 1044: 167, 1077: 168, 1045: 169, 1092: 170, 1060: 171, 1075: 172, 1043: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 1093: 181, 1061: 182, 1080: 183, 1048: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 1081: 189, 1049: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 1082: 198, 1050: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 164: 207, 1083: 208, 1051: 209, 1084: 210, 1052: 211, 1085: 212, 1053: 213, 1086: 214, 1054: 215, 1087: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 1055: 221, 1103: 222, 9600: 223, 1071: 224, 1088: 225, 1056: 226, 1089: 227, 1057: 228, 1090: 229, 1058: 230, 1091: 231, 1059: 232, 1078: 233, 1046: 234, 1074: 235, 1042: 236, 1100: 237, 1068: 238, 8470: 239, 173: 240, 1099: 241, 1067: 242, 1079: 243, 1047: 244, 1096: 245, 1064: 246, 1101: 247, 1069: 248, 1097: 249, 1065: 250, 1095: 251, 1063: 252, 167: 253, 9632: 254, 160: 255}
class Cp855IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-855 (Balkan Cyrillic).
Note: OEM-855 competed with OEM-866 for Cyrillic; OEM-866 preserved all box drawing characters
(rather than only a subset) and was more popular for Russian, but did not provide coverage
for all of the different South Slavic Cyrillic orthographies, unlike OEM-855. Their layouts
for Cyrillic are entirely different.
"""
name = 'cp855'
html5name = None
@lazy_property
@ -193,6 +296,12 @@ register_kuroko_codec(['cp855', '855', 'csibm855', 'ibm855'], Cp855IncrementalEn
class Cp856IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-856 (Hebrew).
Note: OEM-856 competed with OEM-862 for Hebrew, although they encoded the Hebrew letters in the
same layout. OEM-862 preserved all box drawing characters, while OEM-856 preserved a subset only.
"""
name = 'cp856'
html5name = None
@lazy_property
@ -200,6 +309,12 @@ class Cp856IncrementalEncoder(AsciiIncrementalEncoder):
return {1488: 128, 1489: 129, 1490: 130, 1491: 131, 1492: 132, 1493: 133, 1494: 134, 1495: 135, 1496: 136, 1497: 137, 1498: 138, 1499: 139, 1500: 140, 1501: 141, 1502: 142, 1503: 143, 1504: 144, 1505: 145, 1506: 146, 1507: 147, 1508: 148, 1509: 149, 1510: 150, 1511: 151, 1512: 152, 1513: 153, 1514: 154, 163: 156, 215: 158, 174: 169, 172: 170, 189: 171, 188: 172, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 169: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 162: 189, 165: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 164: 207, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 166: 221, 9600: 223, 181: 230, 175: 238, 180: 239, 173: 240, 177: 241, 8215: 242, 190: 243, 182: 244, 167: 245, 247: 246, 184: 247, 176: 248, 168: 249, 183: 250, 185: 251, 179: 252, 178: 253, 9632: 254, 160: 255}
class Cp856IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-856 (Hebrew).
Note: OEM-856 competed with OEM-862 for Hebrew, although they encoded the Hebrew letters in the
same layout. OEM-862 preserved all box drawing characters, while OEM-856 preserved a subset only.
"""
name = 'cp856'
html5name = None
@lazy_property
@ -210,6 +325,9 @@ register_kuroko_codec(['cp856'], Cp856IncrementalEncoder, Cp856IncrementalDecode
class Cp857IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-857 (Turkish).
"""
name = 'cp857'
html5name = None
@lazy_property
@ -217,6 +335,9 @@ class Cp857IncrementalEncoder(AsciiIncrementalEncoder):
return {199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 229: 134, 231: 135, 234: 136, 235: 137, 232: 138, 239: 139, 238: 140, 305: 141, 196: 142, 197: 143, 201: 144, 230: 145, 198: 146, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151, 304: 152, 214: 153, 220: 154, 248: 155, 163: 156, 216: 157, 350: 158, 351: 159, 225: 160, 237: 161, 243: 162, 250: 163, 241: 164, 209: 165, 286: 166, 287: 167, 191: 168, 174: 169, 172: 170, 189: 171, 188: 172, 161: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 193: 181, 194: 182, 192: 183, 169: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 162: 189, 165: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 227: 198, 195: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 164: 207, 186: 208, 170: 209, 202: 210, 203: 211, 200: 212, 205: 214, 206: 215, 207: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 166: 221, 204: 222, 9600: 223, 211: 224, 223: 225, 212: 226, 210: 227, 245: 228, 213: 229, 181: 230, 215: 232, 218: 233, 219: 234, 217: 235, 236: 236, 255: 237, 175: 238, 180: 239, 173: 240, 177: 241, 190: 243, 182: 244, 167: 245, 247: 246, 184: 247, 176: 248, 168: 249, 183: 250, 185: 251, 179: 252, 178: 253, 9632: 254, 160: 255}
class Cp857IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-857 (Turkish).
"""
name = 'cp857'
html5name = None
@lazy_property
@ -227,6 +348,9 @@ register_kuroko_codec(['cp857', '857', 'csibm857', 'ibm857'], Cp857IncrementalEn
class Cp858IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-858 (Western Europe and Canada with the Euro sign).
"""
name = 'cp858'
html5name = None
@lazy_property
@ -234,6 +358,9 @@ class Cp858IncrementalEncoder(AsciiIncrementalEncoder):
return {199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 229: 134, 231: 135, 234: 136, 235: 137, 232: 138, 239: 139, 238: 140, 236: 141, 196: 142, 197: 143, 201: 144, 230: 145, 198: 146, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151, 255: 152, 214: 153, 220: 154, 248: 155, 163: 156, 216: 157, 215: 158, 402: 159, 225: 160, 237: 161, 243: 162, 250: 163, 241: 164, 209: 165, 170: 166, 186: 167, 191: 168, 174: 169, 172: 170, 189: 171, 188: 172, 161: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 193: 181, 194: 182, 192: 183, 169: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 162: 189, 165: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 227: 198, 195: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 164: 207, 240: 208, 208: 209, 202: 210, 203: 211, 200: 212, 8364: 213, 205: 214, 206: 215, 207: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 166: 221, 204: 222, 9600: 223, 211: 224, 223: 225, 212: 226, 210: 227, 245: 228, 213: 229, 181: 230, 254: 231, 222: 232, 218: 233, 219: 234, 217: 235, 253: 236, 221: 237, 175: 238, 180: 239, 173: 240, 177: 241, 8215: 242, 190: 243, 182: 244, 167: 245, 247: 246, 184: 247, 176: 248, 168: 249, 183: 250, 185: 251, 179: 252, 178: 253, 9632: 254, 160: 255}
class Cp858IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-858 (Western Europe and Canada with the Euro sign).
"""
name = 'cp858'
html5name = None
@lazy_property
@ -244,6 +371,9 @@ register_kuroko_codec(['cp858', '858', 'csibm858', 'ibm858'], Cp858IncrementalEn
class Cp860IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-860 (European Portuguese).
"""
name = 'cp860'
html5name = None
@lazy_property
@ -251,6 +381,9 @@ class Cp860IncrementalEncoder(AsciiIncrementalEncoder):
return {199: 128, 252: 129, 233: 130, 226: 131, 227: 132, 224: 133, 193: 134, 231: 135, 234: 136, 202: 137, 232: 138, 205: 139, 212: 140, 236: 141, 195: 142, 194: 143, 201: 144, 192: 145, 200: 146, 244: 147, 245: 148, 242: 149, 218: 150, 249: 151, 204: 152, 213: 153, 220: 154, 162: 155, 163: 156, 217: 157, 8359: 158, 211: 159, 225: 160, 237: 161, 243: 162, 250: 163, 241: 164, 209: 165, 170: 166, 186: 167, 191: 168, 210: 169, 172: 170, 189: 171, 188: 172, 161: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 9569: 181, 9570: 182, 9558: 183, 9557: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 9564: 189, 9563: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 9566: 198, 9567: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 9575: 207, 9576: 208, 9572: 209, 9573: 210, 9561: 211, 9560: 212, 9554: 213, 9555: 214, 9579: 215, 9578: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 9612: 221, 9616: 222, 9600: 223, 945: 224, 223: 225, 915: 226, 960: 227, 931: 228, 963: 229, 181: 230, 964: 231, 934: 232, 920: 233, 937: 234, 948: 235, 8734: 236, 966: 237, 949: 238, 8745: 239, 8801: 240, 177: 241, 8805: 242, 8804: 243, 8992: 244, 8993: 245, 247: 246, 8776: 247, 176: 248, 8729: 249, 183: 250, 8730: 251, 8319: 252, 178: 253, 9632: 254, 160: 255}
class Cp860IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-860 (European Portuguese).
"""
name = 'cp860'
html5name = None
@lazy_property
@ -261,6 +394,9 @@ register_kuroko_codec(['cp860', '860', 'csibm860', 'ibm860'], Cp860IncrementalEn
class Cp861IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-861 (Icelandic).
"""
name = 'cp861'
html5name = None
@lazy_property
@ -268,6 +404,9 @@ class Cp861IncrementalEncoder(AsciiIncrementalEncoder):
return {199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 229: 134, 231: 135, 234: 136, 235: 137, 232: 138, 208: 139, 240: 140, 222: 141, 196: 142, 197: 143, 201: 144, 230: 145, 198: 146, 244: 147, 246: 148, 254: 149, 251: 150, 221: 151, 253: 152, 214: 153, 220: 154, 248: 155, 163: 156, 216: 157, 8359: 158, 402: 159, 225: 160, 237: 161, 243: 162, 250: 163, 193: 164, 205: 165, 211: 166, 218: 167, 191: 168, 8976: 169, 172: 170, 189: 171, 188: 172, 161: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 9569: 181, 9570: 182, 9558: 183, 9557: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 9564: 189, 9563: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 9566: 198, 9567: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 9575: 207, 9576: 208, 9572: 209, 9573: 210, 9561: 211, 9560: 212, 9554: 213, 9555: 214, 9579: 215, 9578: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 9612: 221, 9616: 222, 9600: 223, 945: 224, 223: 225, 915: 226, 960: 227, 931: 228, 963: 229, 181: 230, 964: 231, 934: 232, 920: 233, 937: 234, 948: 235, 8734: 236, 966: 237, 949: 238, 8745: 239, 8801: 240, 177: 241, 8805: 242, 8804: 243, 8992: 244, 8993: 245, 247: 246, 8776: 247, 176: 248, 8729: 249, 183: 250, 8730: 251, 8319: 252, 178: 253, 9632: 254, 160: 255}
class Cp861IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-861 (Icelandic).
"""
name = 'cp861'
html5name = None
@lazy_property
@ -278,6 +417,12 @@ register_kuroko_codec(['cp861', '861', 'cp-is', 'csibm861', 'ibm861'], Cp861Incr
class Cp862IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-862 (Hebrew and Box Drawing).
Note: OEM-862 competed with OEM-856 for Hebrew, although they encoded the Hebrew letters in the
same layout. OEM-862 preserved all box drawing characters, while OEM-856 preserved a subset only.
"""
name = 'cp862'
html5name = None
@lazy_property
@ -285,6 +430,12 @@ class Cp862IncrementalEncoder(AsciiIncrementalEncoder):
return {1488: 128, 1489: 129, 1490: 130, 1491: 131, 1492: 132, 1493: 133, 1494: 134, 1495: 135, 1496: 136, 1497: 137, 1498: 138, 1499: 139, 1500: 140, 1501: 141, 1502: 142, 1503: 143, 1504: 144, 1505: 145, 1506: 146, 1507: 147, 1508: 148, 1509: 149, 1510: 150, 1511: 151, 1512: 152, 1513: 153, 1514: 154, 162: 155, 163: 156, 165: 157, 8359: 158, 402: 159, 225: 160, 237: 161, 243: 162, 250: 163, 241: 164, 209: 165, 170: 166, 186: 167, 191: 168, 8976: 169, 172: 170, 189: 171, 188: 172, 161: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 9569: 181, 9570: 182, 9558: 183, 9557: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 9564: 189, 9563: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 9566: 198, 9567: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 9575: 207, 9576: 208, 9572: 209, 9573: 210, 9561: 211, 9560: 212, 9554: 213, 9555: 214, 9579: 215, 9578: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 9612: 221, 9616: 222, 9600: 223, 945: 224, 223: 225, 915: 226, 960: 227, 931: 228, 963: 229, 181: 230, 964: 231, 934: 232, 920: 233, 937: 234, 948: 235, 8734: 236, 966: 237, 949: 238, 8745: 239, 8801: 240, 177: 241, 8805: 242, 8804: 243, 8992: 244, 8993: 245, 247: 246, 8776: 247, 176: 248, 8729: 249, 183: 250, 8730: 251, 8319: 252, 178: 253, 9632: 254, 160: 255}
class Cp862IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-862 (Hebrew and Box Drawing).
Note: OEM-862 competed with OEM-856 for Hebrew, although they encoded the Hebrew letters in the
same layout. OEM-862 preserved all box drawing characters, while OEM-856 preserved a subset only.
"""
name = 'cp862'
html5name = None
@lazy_property
@ -295,6 +446,9 @@ register_kuroko_codec(['cp862', '862', 'cspc862latinhebrew', 'ibm862'], Cp862Inc
class Cp863IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-863 (Canadian French).
"""
name = 'cp863'
html5name = None
@lazy_property
@ -302,6 +456,9 @@ class Cp863IncrementalEncoder(AsciiIncrementalEncoder):
return {199: 128, 252: 129, 233: 130, 226: 131, 194: 132, 224: 133, 182: 134, 231: 135, 234: 136, 235: 137, 232: 138, 239: 139, 238: 140, 8215: 141, 192: 142, 167: 143, 201: 144, 200: 145, 202: 146, 244: 147, 203: 148, 207: 149, 251: 150, 249: 151, 164: 152, 212: 153, 220: 154, 162: 155, 163: 156, 217: 157, 219: 158, 402: 159, 166: 160, 180: 161, 243: 162, 250: 163, 168: 164, 184: 165, 179: 166, 175: 167, 206: 168, 8976: 169, 172: 170, 189: 171, 188: 172, 190: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 9569: 181, 9570: 182, 9558: 183, 9557: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 9564: 189, 9563: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 9566: 198, 9567: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 9575: 207, 9576: 208, 9572: 209, 9573: 210, 9561: 211, 9560: 212, 9554: 213, 9555: 214, 9579: 215, 9578: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 9612: 221, 9616: 222, 9600: 223, 945: 224, 223: 225, 915: 226, 960: 227, 931: 228, 963: 229, 181: 230, 964: 231, 934: 232, 920: 233, 937: 234, 948: 235, 8734: 236, 966: 237, 949: 238, 8745: 239, 8801: 240, 177: 241, 8805: 242, 8804: 243, 8992: 244, 8993: 245, 247: 246, 8776: 247, 176: 248, 8729: 249, 183: 250, 8730: 251, 8319: 252, 178: 253, 9632: 254, 160: 255}
class Cp863IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-863 (Canadian French).
"""
name = 'cp863'
html5name = None
@lazy_property
@ -312,6 +469,13 @@ register_kuroko_codec(['cp863', '863', 'csibm863', 'ibm863'], Cp863IncrementalEn
class Cp864IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-864 (Arabic Positional Forms)
Note: OEM-864 competed with OEM-720 (which used a different layout, included box drawing
characters, included general letters rather than positional forms of Arabic characters, and
didn't include separate East Arabic digits).
"""
name = 'cp864'
html5name = None
@lazy_property
@ -319,6 +483,13 @@ class Cp864IncrementalEncoder(AsciiIncrementalEncoder):
return {176: 128, 183: 129, 8729: 130, 8730: 131, 9618: 132, 9472: 133, 9474: 134, 9532: 135, 9508: 136, 9516: 137, 9500: 138, 9524: 139, 9488: 140, 9484: 141, 9492: 142, 9496: 143, 946: 144, 8734: 145, 966: 146, 177: 147, 189: 148, 188: 149, 8776: 150, 171: 151, 187: 152, 65271: 153, 65272: 154, 65275: 157, 65276: 158, 160: 160, 173: 161, 65154: 162, 163: 163, 164: 164, 65156: 165, 65166: 168, 65167: 169, 65173: 170, 65177: 171, 1548: 172, 65181: 173, 65185: 174, 65189: 175, 1632: 176, 1633: 177, 1634: 178, 1635: 179, 1636: 180, 1637: 181, 1638: 182, 1639: 183, 1640: 184, 1641: 185, 65233: 186, 1563: 187, 65201: 188, 65205: 189, 65209: 190, 1567: 191, 162: 192, 65152: 193, 65153: 194, 65155: 195, 65157: 196, 65226: 197, 65163: 198, 65165: 199, 65169: 200, 65171: 201, 65175: 202, 65179: 203, 65183: 204, 65187: 205, 65191: 206, 65193: 207, 65195: 208, 65197: 209, 65199: 210, 65203: 211, 65207: 212, 65211: 213, 65215: 214, 65217: 215, 65221: 216, 65227: 217, 65231: 218, 166: 219, 172: 220, 247: 221, 215: 222, 65225: 223, 1600: 224, 65235: 225, 65239: 226, 65243: 227, 65247: 228, 65251: 229, 65255: 230, 65259: 231, 65261: 232, 65263: 233, 65267: 234, 65213: 235, 65228: 236, 65230: 237, 65229: 238, 65249: 239, 65149: 240, 1617: 241, 65253: 242, 65257: 243, 65260: 244, 65264: 245, 65266: 246, 65232: 247, 65237: 248, 65269: 249, 65270: 250, 65245: 251, 65241: 252, 65265: 253, 9632: 254}
class Cp864IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-864 (Arabic Positional Forms).
Note: OEM-864 competed with OEM-720 (which used a different layout, included box drawing
characters, included general letters rather than positional forms of Arabic characters, and
didn't include separate East Arabic digits).
"""
name = 'cp864'
html5name = None
@lazy_property
@ -329,6 +500,9 @@ register_kuroko_codec(['cp864', '864', 'csibm864', 'ibm864'], Cp864IncrementalEn
class Cp865IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-865 (Continental Nordic).
"""
name = 'cp865'
html5name = None
@lazy_property
@ -336,6 +510,9 @@ class Cp865IncrementalEncoder(AsciiIncrementalEncoder):
return {199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 229: 134, 231: 135, 234: 136, 235: 137, 232: 138, 239: 139, 238: 140, 236: 141, 196: 142, 197: 143, 201: 144, 230: 145, 198: 146, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151, 255: 152, 214: 153, 220: 154, 248: 155, 163: 156, 216: 157, 8359: 158, 402: 159, 225: 160, 237: 161, 243: 162, 250: 163, 241: 164, 209: 165, 170: 166, 186: 167, 191: 168, 8976: 169, 172: 170, 189: 171, 188: 172, 161: 173, 171: 174, 164: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 9569: 181, 9570: 182, 9558: 183, 9557: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 9564: 189, 9563: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 9566: 198, 9567: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 9575: 207, 9576: 208, 9572: 209, 9573: 210, 9561: 211, 9560: 212, 9554: 213, 9555: 214, 9579: 215, 9578: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 9612: 221, 9616: 222, 9600: 223, 945: 224, 223: 225, 915: 226, 960: 227, 931: 228, 963: 229, 181: 230, 964: 231, 934: 232, 920: 233, 937: 234, 948: 235, 8734: 236, 966: 237, 949: 238, 8745: 239, 8801: 240, 177: 241, 8805: 242, 8804: 243, 8992: 244, 8993: 245, 247: 246, 8776: 247, 176: 248, 8729: 249, 183: 250, 8730: 251, 8319: 252, 178: 253, 9632: 254, 160: 255}
class Cp865IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-865 (Continental Nordic).
"""
name = 'cp865'
html5name = None
@lazy_property
@ -346,6 +523,12 @@ register_kuroko_codec(['cp865', '865', 'csibm865', 'ibm865'], Cp865IncrementalEn
class Cp869IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-869 (Greek).
Note: OEM-869 competed with OEM-737 (which used a different Greek layout and preserved all of
the box drawing characters rather than a subset, but omitted letters with combined trema/acute).
"""
name = 'cp869'
html5name = None
@lazy_property
@ -353,6 +536,12 @@ class Cp869IncrementalEncoder(AsciiIncrementalEncoder):
return {902: 134, 183: 136, 172: 137, 166: 138, 8216: 139, 8217: 140, 904: 141, 8213: 142, 905: 143, 906: 144, 938: 145, 908: 146, 910: 149, 939: 150, 169: 151, 911: 152, 178: 153, 179: 154, 940: 155, 163: 156, 941: 157, 942: 158, 943: 159, 970: 160, 912: 161, 972: 162, 973: 163, 913: 164, 914: 165, 915: 166, 916: 167, 917: 168, 918: 169, 919: 170, 189: 171, 920: 172, 921: 173, 171: 174, 187: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 922: 181, 923: 182, 924: 183, 925: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 926: 189, 927: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 928: 198, 929: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 931: 207, 932: 208, 933: 209, 934: 210, 935: 211, 936: 212, 937: 213, 945: 214, 946: 215, 947: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 948: 221, 949: 222, 9600: 223, 950: 224, 951: 225, 952: 226, 953: 227, 954: 228, 955: 229, 956: 230, 957: 231, 958: 232, 959: 233, 960: 234, 961: 235, 963: 236, 962: 237, 964: 238, 900: 239, 173: 240, 177: 241, 965: 242, 966: 243, 967: 244, 167: 245, 968: 246, 901: 247, 176: 248, 168: 249, 969: 250, 971: 251, 944: 252, 974: 253, 9632: 254, 160: 255}
class Cp869IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-869 (Greek).
Note: OEM-869 competed with OEM-737 (which used a different Greek layout and preserved all of
the box drawing characters rather than a subset, but omitted letters with combined trema/acute).
"""
name = 'cp869'
html5name = None
@lazy_property
@ -362,24 +551,10 @@ class Cp869IncrementalDecoder(AsciiIncrementalDecoder):
register_kuroko_codec(['cp869', '869', 'cp-gr', 'csibm869', 'ibm869'], Cp869IncrementalEncoder, Cp869IncrementalDecoder)
class Cp874IncrementalEncoder(AsciiIncrementalEncoder):
name = 'cp874'
html5name = None
@lazy_property
def encoding_map():
return {8364: 128, 8230: 133, 8216: 145, 8217: 146, 8220: 147, 8221: 148, 8226: 149, 8211: 150, 8212: 151, 160: 160, 3585: 161, 3586: 162, 3587: 163, 3588: 164, 3589: 165, 3590: 166, 3591: 167, 3592: 168, 3593: 169, 3594: 170, 3595: 171, 3596: 172, 3597: 173, 3598: 174, 3599: 175, 3600: 176, 3601: 177, 3602: 178, 3603: 179, 3604: 180, 3605: 181, 3606: 182, 3607: 183, 3608: 184, 3609: 185, 3610: 186, 3611: 187, 3612: 188, 3613: 189, 3614: 190, 3615: 191, 3616: 192, 3617: 193, 3618: 194, 3619: 195, 3620: 196, 3621: 197, 3622: 198, 3623: 199, 3624: 200, 3625: 201, 3626: 202, 3627: 203, 3628: 204, 3629: 205, 3630: 206, 3631: 207, 3632: 208, 3633: 209, 3634: 210, 3635: 211, 3636: 212, 3637: 213, 3638: 214, 3639: 215, 3640: 216, 3641: 217, 3642: 218, 3647: 223, 3648: 224, 3649: 225, 3650: 226, 3651: 227, 3652: 228, 3653: 229, 3654: 230, 3655: 231, 3656: 232, 3657: 233, 3658: 234, 3659: 235, 3660: 236, 3661: 237, 3662: 238, 3663: 239, 3664: 240, 3665: 241, 3666: 242, 3667: 243, 3668: 244, 3669: 245, 3670: 246, 3671: 247, 3672: 248, 3673: 249, 3674: 250, 3675: 251}
class Cp874IncrementalDecoder(AsciiIncrementalDecoder):
name = 'cp874'
html5name = None
@lazy_property
def decoding_map():
return {128: 8364, 133: 8230, 145: 8216, 146: 8217, 147: 8220, 148: 8221, 149: 8226, 150: 8211, 151: 8212, 160: 160, 161: 3585, 162: 3586, 163: 3587, 164: 3588, 165: 3589, 166: 3590, 167: 3591, 168: 3592, 169: 3593, 170: 3594, 171: 3595, 172: 3596, 173: 3597, 174: 3598, 175: 3599, 176: 3600, 177: 3601, 178: 3602, 179: 3603, 180: 3604, 181: 3605, 182: 3606, 183: 3607, 184: 3608, 185: 3609, 186: 3610, 187: 3611, 188: 3612, 189: 3613, 190: 3614, 191: 3615, 192: 3616, 193: 3617, 194: 3618, 195: 3619, 196: 3620, 197: 3621, 198: 3622, 199: 3623, 200: 3624, 201: 3625, 202: 3626, 203: 3627, 204: 3628, 205: 3629, 206: 3630, 207: 3631, 208: 3632, 209: 3633, 210: 3634, 211: 3635, 212: 3636, 213: 3637, 214: 3638, 215: 3639, 216: 3640, 217: 3641, 218: 3642, 223: 3647, 224: 3648, 225: 3649, 226: 3650, 227: 3651, 228: 3652, 229: 3653, 230: 3654, 231: 3655, 232: 3656, 233: 3657, 234: 3658, 235: 3659, 236: 3660, 237: 3661, 238: 3662, 239: 3663, 240: 3664, 241: 3665, 242: 3666, 243: 3667, 244: 3668, 245: 3669, 246: 3670, 247: 3671, 248: 3672, 249: 3673, 250: 3674, 251: 3675}
register_kuroko_codec(['cp874'], Cp874IncrementalEncoder, Cp874IncrementalDecoder)
class Cp875IncrementalEncoder(BaseEbcdicIncrementalEncoder):
"""
IncrementalEncoder implementation for EBCDIC-875 (used in Greek-speaking locales).
"""
name = 'cp875'
html5name = None
@lazy_property
@ -387,6 +562,9 @@ class Cp875IncrementalEncoder(BaseEbcdicIncrementalEncoder):
return {0: 0, 1: 1, 2: 2, 3: 3, 156: 4, 9: 5, 134: 6, 127: 7, 151: 8, 141: 9, 142: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 157: 20, 133: 21, 8: 22, 135: 23, 24: 24, 25: 25, 146: 26, 143: 27, 28: 28, 29: 29, 30: 30, 31: 31, 128: 32, 129: 33, 130: 34, 131: 35, 132: 36, 10: 37, 23: 38, 27: 39, 136: 40, 137: 41, 138: 42, 139: 43, 140: 44, 5: 45, 6: 46, 7: 47, 144: 48, 145: 49, 22: 50, 147: 51, 148: 52, 149: 53, 150: 54, 4: 55, 152: 56, 153: 57, 154: 58, 155: 59, 20: 60, 21: 61, 158: 62, 26: 253, 32: 64, 913: 65, 914: 66, 915: 67, 916: 68, 917: 69, 918: 70, 919: 71, 920: 72, 921: 73, 91: 74, 46: 75, 60: 76, 40: 77, 43: 78, 33: 79, 38: 80, 922: 81, 923: 82, 924: 83, 925: 84, 926: 85, 927: 86, 928: 87, 929: 88, 931: 89, 93: 90, 36: 91, 42: 92, 41: 93, 59: 94, 94: 95, 45: 96, 47: 97, 932: 98, 933: 99, 934: 100, 935: 101, 936: 102, 937: 103, 938: 104, 939: 105, 124: 106, 44: 107, 37: 108, 95: 109, 62: 110, 63: 111, 168: 112, 902: 113, 904: 114, 905: 115, 160: 116, 906: 117, 908: 118, 910: 119, 911: 120, 96: 121, 58: 122, 35: 123, 64: 124, 39: 125, 61: 126, 34: 127, 901: 128, 97: 129, 98: 130, 99: 131, 100: 132, 101: 133, 102: 134, 103: 135, 104: 136, 105: 137, 945: 138, 946: 139, 947: 140, 948: 141, 949: 142, 950: 143, 176: 144, 106: 145, 107: 146, 108: 147, 109: 148, 110: 149, 111: 150, 112: 151, 113: 152, 114: 153, 951: 154, 952: 155, 953: 156, 954: 157, 955: 158, 956: 159, 180: 160, 126: 161, 115: 162, 116: 163, 117: 164, 118: 165, 119: 166, 120: 167, 121: 168, 122: 169, 957: 170, 958: 171, 959: 172, 960: 173, 961: 174, 963: 175, 163: 176, 940: 177, 941: 178, 942: 179, 970: 180, 943: 181, 972: 182, 973: 183, 971: 184, 974: 185, 962: 186, 964: 187, 965: 188, 966: 189, 967: 190, 968: 191, 123: 192, 65: 193, 66: 194, 67: 195, 68: 196, 69: 197, 70: 198, 71: 199, 72: 200, 73: 201, 173: 202, 969: 203, 912: 204, 944: 205, 8216: 206, 8213: 207, 125: 208, 74: 209, 75: 210, 76: 211, 77: 212, 78: 213, 79: 214, 80: 215, 81: 216, 82: 217, 177: 218, 
189: 219, 903: 221, 8217: 222, 166: 223, 92: 224, 83: 226, 84: 227, 85: 228, 86: 229, 87: 230, 88: 231, 89: 232, 90: 233, 178: 234, 167: 235, 171: 238, 172: 239, 48: 240, 49: 241, 50: 242, 51: 243, 52: 244, 53: 245, 54: 246, 55: 247, 56: 248, 57: 249, 179: 250, 169: 251, 187: 254, 159: 255}
class Cp875IncrementalDecoder(BaseEbcdicIncrementalDecoder):
"""
IncrementalDecoder implementation for EBCDIC-875 (used in Greek-speaking locales).
"""
name = 'cp875'
html5name = None
@lazy_property
@ -397,6 +575,9 @@ register_kuroko_codec(['cp875'], Cp875IncrementalEncoder, Cp875IncrementalDecode
class Cp1006IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-1006 (Urdu).
"""
name = 'cp1006'
html5name = None
@lazy_property
@ -404,6 +585,9 @@ class Cp1006IncrementalEncoder(AsciiIncrementalEncoder):
return {128: 128, 129: 129, 130: 130, 131: 131, 132: 132, 133: 133, 134: 134, 135: 135, 136: 136, 137: 137, 138: 138, 139: 139, 140: 140, 141: 141, 142: 142, 143: 143, 144: 144, 145: 145, 146: 146, 147: 147, 148: 148, 149: 149, 150: 150, 151: 151, 152: 152, 153: 153, 154: 154, 155: 155, 156: 156, 157: 157, 158: 158, 159: 159, 160: 160, 1776: 161, 1777: 162, 1778: 163, 1779: 164, 1780: 165, 1781: 166, 1782: 167, 1783: 168, 1784: 169, 1785: 170, 1548: 171, 1563: 172, 173: 173, 1567: 174, 65153: 175, 65165: 176, 65166: 178, 65167: 179, 65169: 180, 64342: 181, 64344: 182, 65171: 183, 65173: 184, 65175: 185, 64358: 186, 64360: 187, 65177: 188, 65179: 189, 65181: 190, 65183: 191, 64378: 192, 64380: 193, 65185: 194, 65187: 195, 65189: 196, 65191: 197, 65193: 198, 64388: 199, 65195: 200, 65197: 201, 64396: 202, 65199: 203, 64394: 204, 65201: 205, 65203: 206, 65205: 207, 65207: 208, 65209: 209, 65211: 210, 65213: 211, 65215: 212, 65217: 213, 65221: 214, 65225: 215, 65226: 216, 65227: 217, 65228: 218, 65229: 219, 65230: 220, 65231: 221, 65232: 222, 65233: 223, 65235: 224, 65237: 225, 65239: 226, 65241: 227, 65243: 228, 64402: 229, 64404: 230, 65245: 231, 65247: 232, 65248: 233, 65249: 234, 65251: 235, 64414: 236, 65253: 237, 65255: 238, 65157: 239, 65261: 240, 64422: 241, 64424: 242, 64425: 243, 64426: 244, 65152: 245, 65161: 246, 65162: 247, 65163: 248, 65265: 249, 65266: 250, 65267: 251, 64432: 252, 64430: 253, 65148: 254, 65149: 255}
class Cp1006IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-1006 (Urdu).
"""
name = 'cp1006'
html5name = None
@lazy_property
@ -414,6 +598,9 @@ register_kuroko_codec(['cp1006'], Cp1006IncrementalEncoder, Cp1006IncrementalDec
class Cp1026IncrementalEncoder(BaseEbcdicIncrementalEncoder):
"""
IncrementalEncoder implementation for EBCDIC-1026 (used in Turkish-speaking locales).
"""
name = 'cp1026'
html5name = None
@lazy_property
@ -421,6 +608,9 @@ class Cp1026IncrementalEncoder(BaseEbcdicIncrementalEncoder):
return {0: 0, 1: 1, 2: 2, 3: 3, 156: 4, 9: 5, 134: 6, 127: 7, 151: 8, 141: 9, 142: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 157: 20, 133: 21, 8: 22, 135: 23, 24: 24, 25: 25, 146: 26, 143: 27, 28: 28, 29: 29, 30: 30, 31: 31, 128: 32, 129: 33, 130: 34, 131: 35, 132: 36, 10: 37, 23: 38, 27: 39, 136: 40, 137: 41, 138: 42, 139: 43, 140: 44, 5: 45, 6: 46, 7: 47, 144: 48, 145: 49, 22: 50, 147: 51, 148: 52, 149: 53, 150: 54, 4: 55, 152: 56, 153: 57, 154: 58, 155: 59, 20: 60, 21: 61, 158: 62, 26: 63, 32: 64, 160: 65, 226: 66, 228: 67, 224: 68, 225: 69, 227: 70, 229: 71, 123: 72, 241: 73, 199: 74, 46: 75, 60: 76, 40: 77, 43: 78, 33: 79, 38: 80, 233: 81, 234: 82, 235: 83, 232: 84, 237: 85, 238: 86, 239: 87, 236: 88, 223: 89, 286: 90, 304: 91, 42: 92, 41: 93, 59: 94, 94: 95, 45: 96, 47: 97, 194: 98, 196: 99, 192: 100, 193: 101, 195: 102, 197: 103, 91: 104, 209: 105, 351: 106, 44: 107, 37: 108, 95: 109, 62: 110, 63: 111, 248: 112, 201: 113, 202: 114, 203: 115, 200: 116, 205: 117, 206: 118, 207: 119, 204: 120, 305: 121, 58: 122, 214: 123, 350: 124, 39: 125, 61: 126, 220: 127, 216: 128, 97: 129, 98: 130, 99: 131, 100: 132, 101: 133, 102: 134, 103: 135, 104: 136, 105: 137, 171: 138, 187: 139, 125: 140, 96: 141, 166: 142, 177: 143, 176: 144, 106: 145, 107: 146, 108: 147, 109: 148, 110: 149, 111: 150, 112: 151, 113: 152, 114: 153, 170: 154, 186: 155, 230: 156, 184: 157, 198: 158, 164: 159, 181: 160, 246: 161, 115: 162, 116: 163, 117: 164, 118: 165, 119: 166, 120: 167, 121: 168, 122: 169, 161: 170, 191: 171, 93: 172, 36: 173, 64: 174, 174: 175, 162: 176, 163: 177, 165: 178, 183: 179, 169: 180, 167: 181, 182: 182, 188: 183, 189: 184, 190: 185, 172: 186, 124: 187, 175: 188, 168: 189, 180: 190, 215: 191, 231: 192, 65: 193, 66: 194, 67: 195, 68: 196, 69: 197, 70: 198, 71: 199, 72: 200, 73: 201, 173: 202, 244: 203, 126: 204, 242: 205, 243: 206, 245: 207, 287: 208, 74: 209, 75: 210, 76: 211, 77: 212, 78: 213, 79: 214, 80: 215, 81: 216, 82: 217, 185: 218, 
251: 219, 92: 220, 249: 221, 250: 222, 255: 223, 252: 224, 247: 225, 83: 226, 84: 227, 85: 228, 86: 229, 87: 230, 88: 231, 89: 232, 90: 233, 178: 234, 212: 235, 35: 236, 210: 237, 211: 238, 213: 239, 48: 240, 49: 241, 50: 242, 51: 243, 52: 244, 53: 245, 54: 246, 55: 247, 56: 248, 57: 249, 179: 250, 219: 251, 34: 252, 217: 253, 218: 254, 159: 255}
class Cp1026IncrementalDecoder(BaseEbcdicIncrementalDecoder):
"""
IncrementalDecoder implementation for EBCDIC-1026 (used in Turkish-speaking locales).
"""
name = 'cp1026'
html5name = None
@lazy_property
@ -431,6 +621,12 @@ register_kuroko_codec(['cp1026', '1026', 'csibm1026', 'ibm1026'], Cp1026Incremen
class Cp1125IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for OEM-1125 (Ukrainian Cyrillic).
OEM-1125 is the Ukrainian standard RST 2018-91. Because both are modifications of the so-called
Alternative Code Page, OEM-1125 and OEM-866 are compatible for the Russian/Bulgarian letters.
"""
name = 'cp1125'
html5name = None
@lazy_property
@ -438,6 +634,12 @@ class Cp1125IncrementalEncoder(AsciiIncrementalEncoder):
return {1040: 128, 1041: 129, 1042: 130, 1043: 131, 1044: 132, 1045: 133, 1046: 134, 1047: 135, 1048: 136, 1049: 137, 1050: 138, 1051: 139, 1052: 140, 1053: 141, 1054: 142, 1055: 143, 1056: 144, 1057: 145, 1058: 146, 1059: 147, 1060: 148, 1061: 149, 1062: 150, 1063: 151, 1064: 152, 1065: 153, 1066: 154, 1067: 155, 1068: 156, 1069: 157, 1070: 158, 1071: 159, 1072: 160, 1073: 161, 1074: 162, 1075: 163, 1076: 164, 1077: 165, 1078: 166, 1079: 167, 1080: 168, 1081: 169, 1082: 170, 1083: 171, 1084: 172, 1085: 173, 1086: 174, 1087: 175, 9617: 176, 9618: 177, 9619: 178, 9474: 179, 9508: 180, 9569: 181, 9570: 182, 9558: 183, 9557: 184, 9571: 185, 9553: 186, 9559: 187, 9565: 188, 9564: 189, 9563: 190, 9488: 191, 9492: 192, 9524: 193, 9516: 194, 9500: 195, 9472: 196, 9532: 197, 9566: 198, 9567: 199, 9562: 200, 9556: 201, 9577: 202, 9574: 203, 9568: 204, 9552: 205, 9580: 206, 9575: 207, 9576: 208, 9572: 209, 9573: 210, 9561: 211, 9560: 212, 9554: 213, 9555: 214, 9579: 215, 9578: 216, 9496: 217, 9484: 218, 9608: 219, 9604: 220, 9612: 221, 9616: 222, 9600: 223, 1088: 224, 1089: 225, 1090: 226, 1091: 227, 1092: 228, 1093: 229, 1094: 230, 1095: 231, 1096: 232, 1097: 233, 1098: 234, 1099: 235, 1100: 236, 1101: 237, 1102: 238, 1103: 239, 1025: 240, 1105: 241, 1168: 242, 1169: 243, 1028: 244, 1108: 245, 1030: 246, 1110: 247, 1031: 248, 1111: 249, 183: 250, 8730: 251, 8470: 252, 164: 253, 9632: 254, 160: 255}
class Cp1125IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for OEM-1125 (Ukrainian Cyrillic).
OEM-1125 is the Ukrainian standard RST 2018-91. Because both are modifications of the so-called
Alternative Code Page, OEM-1125 and OEM-866 are compatible for the Russian/Bulgarian letters.
"""
name = 'cp1125'
html5name = None
@lazy_property
@ -448,6 +650,9 @@ register_kuroko_codec(['cp1125', '1125', 'ibm1125', 'cp866u', 'ruscii'], Cp1125I
class Cp1140IncrementalEncoder(BaseEbcdicIncrementalEncoder):
"""
IncrementalEncoder implementation for EBCDIC-1140 (EBCDIC with Euro sign).
"""
name = 'cp1140'
html5name = None
@lazy_property
@ -455,6 +660,9 @@ class Cp1140IncrementalEncoder(BaseEbcdicIncrementalEncoder):
return {0: 0, 1: 1, 2: 2, 3: 3, 156: 4, 9: 5, 134: 6, 127: 7, 151: 8, 141: 9, 142: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 157: 20, 133: 21, 8: 22, 135: 23, 24: 24, 25: 25, 146: 26, 143: 27, 28: 28, 29: 29, 30: 30, 31: 31, 128: 32, 129: 33, 130: 34, 131: 35, 132: 36, 10: 37, 23: 38, 27: 39, 136: 40, 137: 41, 138: 42, 139: 43, 140: 44, 5: 45, 6: 46, 7: 47, 144: 48, 145: 49, 22: 50, 147: 51, 148: 52, 149: 53, 150: 54, 4: 55, 152: 56, 153: 57, 154: 58, 155: 59, 20: 60, 21: 61, 158: 62, 26: 63, 32: 64, 160: 65, 226: 66, 228: 67, 224: 68, 225: 69, 227: 70, 229: 71, 231: 72, 241: 73, 162: 74, 46: 75, 60: 76, 40: 77, 43: 78, 124: 79, 38: 80, 233: 81, 234: 82, 235: 83, 232: 84, 237: 85, 238: 86, 239: 87, 236: 88, 223: 89, 33: 90, 36: 91, 42: 92, 41: 93, 59: 94, 172: 95, 45: 96, 47: 97, 194: 98, 196: 99, 192: 100, 193: 101, 195: 102, 197: 103, 199: 104, 209: 105, 166: 106, 44: 107, 37: 108, 95: 109, 62: 110, 63: 111, 248: 112, 201: 113, 202: 114, 203: 115, 200: 116, 205: 117, 206: 118, 207: 119, 204: 120, 96: 121, 58: 122, 35: 123, 64: 124, 39: 125, 61: 126, 34: 127, 216: 128, 97: 129, 98: 130, 99: 131, 100: 132, 101: 133, 102: 134, 103: 135, 104: 136, 105: 137, 171: 138, 187: 139, 240: 140, 253: 141, 254: 142, 177: 143, 176: 144, 106: 145, 107: 146, 108: 147, 109: 148, 110: 149, 111: 150, 112: 151, 113: 152, 114: 153, 170: 154, 186: 155, 230: 156, 184: 157, 198: 158, 8364: 159, 181: 160, 126: 161, 115: 162, 116: 163, 117: 164, 118: 165, 119: 166, 120: 167, 121: 168, 122: 169, 161: 170, 191: 171, 208: 172, 221: 173, 222: 174, 174: 175, 94: 176, 163: 177, 165: 178, 183: 179, 169: 180, 167: 181, 182: 182, 188: 183, 189: 184, 190: 185, 91: 186, 93: 187, 175: 188, 168: 189, 180: 190, 215: 191, 123: 192, 65: 193, 66: 194, 67: 195, 68: 196, 69: 197, 70: 198, 71: 199, 72: 200, 73: 201, 173: 202, 244: 203, 246: 204, 242: 205, 243: 206, 245: 207, 125: 208, 74: 209, 75: 210, 76: 211, 77: 212, 78: 213, 79: 214, 80: 215, 81: 216, 82: 217, 185: 218, 
251: 219, 252: 220, 249: 221, 250: 222, 255: 223, 92: 224, 247: 225, 83: 226, 84: 227, 85: 228, 86: 229, 87: 230, 88: 231, 89: 232, 90: 233, 178: 234, 212: 235, 214: 236, 210: 237, 211: 238, 213: 239, 48: 240, 49: 241, 50: 242, 51: 243, 52: 244, 53: 245, 54: 246, 55: 247, 56: 248, 57: 249, 179: 250, 219: 251, 220: 252, 217: 253, 218: 254, 159: 255}
class Cp1140IncrementalDecoder(BaseEbcdicIncrementalDecoder):
"""
IncrementalDecoder implementation for EBCDIC-1140 (EBCDIC with Euro sign).
"""
name = 'cp1140'
html5name = None
@lazy_property
@ -465,6 +673,9 @@ register_kuroko_codec(['cp1140', '1140', 'ibm1140'], Cp1140IncrementalEncoder, C
class HpRoman8IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the HP 8-bit Roman encoding.
"""
name = 'hp-roman8'
html5name = None
@lazy_property
@ -472,6 +683,9 @@ class HpRoman8IncrementalEncoder(AsciiIncrementalEncoder):
return {128: 128, 129: 129, 130: 130, 131: 131, 132: 132, 133: 133, 134: 134, 135: 135, 136: 136, 137: 137, 138: 138, 139: 139, 140: 140, 141: 141, 142: 142, 143: 143, 144: 144, 145: 145, 146: 146, 147: 147, 148: 148, 149: 149, 150: 150, 151: 151, 152: 152, 153: 153, 154: 154, 155: 155, 156: 156, 157: 157, 158: 158, 159: 159, 160: 160, 192: 161, 194: 162, 200: 163, 202: 164, 203: 165, 206: 166, 207: 167, 180: 168, 715: 169, 710: 170, 168: 171, 732: 172, 217: 173, 219: 174, 8356: 175, 175: 176, 221: 177, 253: 178, 176: 179, 199: 180, 231: 181, 209: 182, 241: 183, 161: 184, 191: 185, 164: 186, 163: 187, 165: 188, 167: 189, 402: 190, 162: 191, 226: 192, 234: 193, 244: 194, 251: 195, 225: 196, 233: 197, 243: 198, 250: 199, 224: 200, 232: 201, 242: 202, 249: 203, 228: 204, 235: 205, 246: 206, 252: 207, 197: 208, 238: 209, 216: 210, 198: 211, 229: 212, 237: 213, 248: 214, 230: 215, 196: 216, 236: 217, 214: 218, 220: 219, 201: 220, 239: 221, 223: 222, 212: 223, 193: 224, 195: 225, 227: 226, 208: 227, 240: 228, 205: 229, 204: 230, 211: 231, 210: 232, 213: 233, 245: 234, 352: 235, 353: 236, 218: 237, 376: 238, 255: 239, 222: 240, 254: 241, 183: 242, 181: 243, 182: 244, 190: 245, 8212: 246, 188: 247, 189: 248, 170: 249, 186: 250, 171: 251, 9632: 252, 187: 253, 177: 254}
class HpRoman8IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the HP 8-bit Roman encoding.
"""
name = 'hp-roman8'
html5name = None
@lazy_property
@ -482,6 +696,9 @@ register_kuroko_codec(['hp-roman8', 'roman8', 'r8', 'csHPRoman8', 'cp1051', 'ibm
class Koi8TIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the KOI8-T (KOI-8 Cyrillic for Tajik) encoding.
"""
name = 'koi8-t'
html5name = None
@lazy_property
@ -489,6 +706,9 @@ class Koi8TIncrementalEncoder(AsciiIncrementalEncoder):
return {1179: 128, 1171: 129, 8218: 130, 1170: 131, 8222: 132, 8230: 133, 8224: 134, 8225: 135, 8240: 137, 1203: 138, 8249: 139, 1202: 140, 1207: 141, 1206: 142, 1178: 144, 8216: 145, 8217: 146, 8220: 147, 8221: 148, 8226: 149, 8211: 150, 8212: 151, 8482: 153, 8250: 155, 1263: 161, 1262: 162, 1105: 163, 164: 164, 1251: 165, 166: 166, 167: 167, 171: 171, 172: 172, 173: 173, 174: 174, 176: 176, 177: 177, 178: 178, 1025: 179, 1250: 181, 182: 182, 183: 183, 8470: 185, 187: 187, 169: 191, 1102: 192, 1072: 193, 1073: 194, 1094: 195, 1076: 196, 1077: 197, 1092: 198, 1075: 199, 1093: 200, 1080: 201, 1081: 202, 1082: 203, 1083: 204, 1084: 205, 1085: 206, 1086: 207, 1087: 208, 1103: 209, 1088: 210, 1089: 211, 1090: 212, 1091: 213, 1078: 214, 1074: 215, 1100: 216, 1099: 217, 1079: 218, 1096: 219, 1101: 220, 1097: 221, 1095: 222, 1098: 223, 1070: 224, 1040: 225, 1041: 226, 1062: 227, 1044: 228, 1045: 229, 1060: 230, 1043: 231, 1061: 232, 1048: 233, 1049: 234, 1050: 235, 1051: 236, 1052: 237, 1053: 238, 1054: 239, 1055: 240, 1071: 241, 1056: 242, 1057: 243, 1058: 244, 1059: 245, 1046: 246, 1042: 247, 1068: 248, 1067: 249, 1047: 250, 1064: 251, 1069: 252, 1065: 253, 1063: 254, 1066: 255}
class Koi8TIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the KOI8-T (KOI-8 Cyrillic for Tajik) encoding.
"""
name = 'koi8-t'
html5name = None
@lazy_property
@ -499,6 +719,11 @@ register_kuroko_codec(['koi8-t'], Koi8TIncrementalEncoder, Koi8TIncrementalDecod
class Kz1048IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for Kazakh standard KZ-1048.
This is a modification of Windows-1251 to add support for Kazakh.
"""
name = 'kz1048'
html5name = None
@lazy_property
@ -506,6 +731,11 @@ class Kz1048IncrementalEncoder(AsciiIncrementalEncoder):
return {1026: 128, 1027: 129, 8218: 130, 1107: 131, 8222: 132, 8230: 133, 8224: 134, 8225: 135, 8364: 136, 8240: 137, 1033: 138, 8249: 139, 1034: 140, 1178: 141, 1210: 142, 1039: 143, 1106: 144, 8216: 145, 8217: 146, 8220: 147, 8221: 148, 8226: 149, 8211: 150, 8212: 151, 8482: 153, 1113: 154, 8250: 155, 1114: 156, 1179: 157, 1211: 158, 1119: 159, 160: 160, 1200: 161, 1201: 162, 1240: 163, 164: 164, 1256: 165, 166: 166, 167: 167, 1025: 168, 169: 169, 1170: 170, 171: 171, 172: 172, 173: 173, 174: 174, 1198: 175, 176: 176, 177: 177, 1030: 178, 1110: 179, 1257: 180, 181: 181, 182: 182, 183: 183, 1105: 184, 8470: 185, 1171: 186, 187: 187, 1241: 188, 1186: 189, 1187: 190, 1199: 191, 1040: 192, 1041: 193, 1042: 194, 1043: 195, 1044: 196, 1045: 197, 1046: 198, 1047: 199, 1048: 200, 1049: 201, 1050: 202, 1051: 203, 1052: 204, 1053: 205, 1054: 206, 1055: 207, 1056: 208, 1057: 209, 1058: 210, 1059: 211, 1060: 212, 1061: 213, 1062: 214, 1063: 215, 1064: 216, 1065: 217, 1066: 218, 1067: 219, 1068: 220, 1069: 221, 1070: 222, 1071: 223, 1072: 224, 1073: 225, 1074: 226, 1075: 227, 1076: 228, 1077: 229, 1078: 230, 1079: 231, 1080: 232, 1081: 233, 1082: 234, 1083: 235, 1084: 236, 1085: 237, 1086: 238, 1087: 239, 1088: 240, 1089: 241, 1090: 242, 1091: 243, 1092: 244, 1093: 245, 1094: 246, 1095: 247, 1096: 248, 1097: 249, 1098: 250, 1099: 251, 1100: 252, 1101: 253, 1102: 254, 1103: 255}
class Kz1048IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for Kazakh standard KZ-1048.
This is a modification of Windows-1251 to add support for Kazakh.
"""
name = 'kz1048'
html5name = None
@lazy_property
@ -516,6 +746,11 @@ register_kuroko_codec(['kz1048', 'kz-1048', 'rk1048', 'strk1048-2002'], Kz1048In
class PalmosIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the PalmOS encoding.
This is a modification of ISO-8859-1 along similar lines to Windows-1252.
"""
name = 'palmos'
html5name = None
@lazy_property
@ -523,6 +758,11 @@ class PalmosIncrementalEncoder(AsciiIncrementalEncoder):
return {8364: 128, 129: 129, 8218: 130, 402: 131, 8222: 132, 8230: 133, 8224: 134, 8225: 135, 710: 136, 8240: 137, 352: 138, 8249: 139, 338: 140, 9830: 141, 9827: 142, 9829: 143, 9824: 144, 8216: 145, 8217: 146, 8220: 147, 8221: 148, 8226: 149, 8211: 150, 8212: 151, 732: 152, 8482: 153, 353: 154, 155: 155, 339: 156, 157: 157, 158: 158, 376: 159, 160: 160, 161: 161, 162: 162, 163: 163, 164: 164, 165: 165, 166: 166, 167: 167, 168: 168, 169: 169, 170: 170, 171: 171, 172: 172, 173: 173, 174: 174, 175: 175, 176: 176, 177: 177, 178: 178, 179: 179, 180: 180, 181: 181, 182: 182, 183: 183, 184: 184, 185: 185, 186: 186, 187: 187, 188: 188, 189: 189, 190: 190, 191: 191, 192: 192, 193: 193, 194: 194, 195: 195, 196: 196, 197: 197, 198: 198, 199: 199, 200: 200, 201: 201, 202: 202, 203: 203, 204: 204, 205: 205, 206: 206, 207: 207, 208: 208, 209: 209, 210: 210, 211: 211, 212: 212, 213: 213, 214: 214, 215: 215, 216: 216, 217: 217, 218: 218, 219: 219, 220: 220, 221: 221, 222: 222, 223: 223, 224: 224, 225: 225, 226: 226, 227: 227, 228: 228, 229: 229, 230: 230, 231: 231, 232: 232, 233: 233, 234: 234, 235: 235, 236: 236, 237: 237, 238: 238, 239: 239, 240: 240, 241: 241, 242: 242, 243: 243, 244: 244, 245: 245, 246: 246, 247: 247, 248: 248, 249: 249, 250: 250, 251: 251, 252: 252, 253: 253, 254: 254, 255: 255}
class PalmosIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the PalmOS encoding.
This is a modification of ISO-8859-1 along similar lines to Windows-1252.
"""
name = 'palmos'
html5name = None
@lazy_property
@ -533,6 +773,11 @@ register_kuroko_codec(['palmos'], PalmosIncrementalEncoder, PalmosIncrementalDec
class Ptcp154IncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for Paratype PTCP-154 (Asian Cyrillic).
This is a modification of Windows-1251 to add support for Asian Cyrillic orthographies.
"""
name = 'ptcp154'
html5name = None
@lazy_property
@ -540,6 +785,11 @@ class Ptcp154IncrementalEncoder(AsciiIncrementalEncoder):
return {1174: 128, 1170: 129, 1262: 130, 1171: 131, 8222: 132, 8230: 133, 1206: 134, 1198: 135, 1202: 136, 1199: 137, 1184: 138, 1250: 139, 1186: 140, 1178: 141, 1210: 142, 1208: 143, 1175: 144, 8216: 145, 8217: 146, 8220: 147, 8221: 148, 8226: 149, 8211: 150, 8212: 151, 1203: 152, 1207: 153, 1185: 154, 1251: 155, 1187: 156, 1179: 157, 1211: 158, 1209: 159, 160: 160, 1038: 161, 1118: 162, 1032: 163, 1256: 164, 1176: 165, 1200: 166, 167: 167, 1025: 168, 169: 169, 1240: 170, 171: 171, 172: 172, 1263: 173, 174: 174, 1180: 175, 176: 176, 1201: 177, 1030: 178, 1110: 179, 1177: 180, 1257: 181, 182: 182, 183: 183, 1105: 184, 8470: 185, 1241: 186, 187: 187, 1112: 188, 1194: 189, 1195: 190, 1181: 191, 1040: 192, 1041: 193, 1042: 194, 1043: 195, 1044: 196, 1045: 197, 1046: 198, 1047: 199, 1048: 200, 1049: 201, 1050: 202, 1051: 203, 1052: 204, 1053: 205, 1054: 206, 1055: 207, 1056: 208, 1057: 209, 1058: 210, 1059: 211, 1060: 212, 1061: 213, 1062: 214, 1063: 215, 1064: 216, 1065: 217, 1066: 218, 1067: 219, 1068: 220, 1069: 221, 1070: 222, 1071: 223, 1072: 224, 1073: 225, 1074: 226, 1075: 227, 1076: 228, 1077: 229, 1078: 230, 1079: 231, 1080: 232, 1081: 233, 1082: 234, 1083: 235, 1084: 236, 1085: 237, 1086: 238, 1087: 239, 1088: 240, 1089: 241, 1090: 242, 1091: 243, 1092: 244, 1093: 245, 1094: 246, 1095: 247, 1096: 248, 1097: 249, 1098: 250, 1099: 251, 1100: 252, 1101: 253, 1102: 254, 1103: 255}
class Ptcp154IncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for Paratype PTCP-154 (Asian Cyrillic).
This is a modification of Windows-1251 to add support for Asian Cyrillic orthographies.
"""
name = 'ptcp154'
html5name = None
@lazy_property
@ -550,6 +800,9 @@ register_kuroko_codec(['ptcp154', 'csptcp154', 'pt154', 'cp154', 'cyrillic-asian
class XMacArabicIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the Macintosh Arabic encoding.
"""
name = 'x-mac-arabic'
html5name = None
@lazy_property
@ -557,6 +810,9 @@ class XMacArabicIncrementalEncoder(AsciiIncrementalEncoder):
return {196: 128, 160: 129, 199: 130, 201: 131, 209: 132, 214: 133, 220: 134, 225: 135, 224: 136, 226: 137, 228: 138, 1722: 139, 171: 140, 231: 141, 233: 142, 232: 143, 234: 144, 235: 145, 237: 146, 8230: 147, 238: 148, 239: 149, 241: 150, 243: 151, 187: 152, 244: 153, 246: 154, 247: 155, 250: 156, 249: 157, 251: 158, 252: 159, 32: 160, 33: 161, 34: 162, 35: 163, 36: 164, 1642: 165, 38: 166, 39: 167, 40: 168, 41: 169, 42: 170, 43: 171, 1548: 172, 45: 173, 46: 174, 47: 175, 1632: 176, 1633: 177, 1634: 178, 1635: 179, 1636: 180, 1637: 181, 1638: 182, 1639: 183, 1640: 184, 1641: 185, 58: 186, 1563: 187, 60: 188, 61: 189, 62: 190, 1567: 191, 10058: 192, 1569: 193, 1570: 194, 1571: 195, 1572: 196, 1573: 197, 1574: 198, 1575: 199, 1576: 200, 1577: 201, 1578: 202, 1579: 203, 1580: 204, 1581: 205, 1582: 206, 1583: 207, 1584: 208, 1585: 209, 1586: 210, 1587: 211, 1588: 212, 1589: 213, 1590: 214, 1591: 215, 1592: 216, 1593: 217, 1594: 218, 91: 219, 92: 220, 93: 221, 94: 222, 95: 223, 1600: 224, 1601: 225, 1602: 226, 1603: 227, 1604: 228, 1605: 229, 1606: 230, 1607: 231, 1608: 232, 1609: 233, 1610: 234, 1611: 235, 1612: 236, 1613: 237, 1614: 238, 1615: 239, 1616: 240, 1617: 241, 1618: 242, 1662: 243, 1657: 244, 1670: 245, 1749: 246, 1700: 247, 1711: 248, 1672: 249, 1681: 250, 123: 251, 124: 252, 125: 253, 1688: 254, 1746: 255}
class XMacArabicIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the Macintosh Arabic encoding.
"""
name = 'x-mac-arabic'
html5name = None
@lazy_property
@ -567,6 +823,9 @@ register_kuroko_codec(['mac-arabic', 'x-mac-arabic'], XMacArabicIncrementalEncod
class XMacCeIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the Macintosh Central European encoding.
"""
name = 'x-mac-ce'
html5name = None
@lazy_property
@ -574,6 +833,9 @@ class XMacCeIncrementalEncoder(AsciiIncrementalEncoder):
return {196: 128, 256: 129, 257: 130, 201: 131, 260: 132, 214: 133, 220: 134, 225: 135, 261: 136, 268: 137, 228: 138, 269: 139, 262: 140, 263: 141, 233: 142, 377: 143, 378: 144, 270: 145, 237: 146, 271: 147, 274: 148, 275: 149, 278: 150, 243: 151, 279: 152, 244: 153, 246: 154, 245: 155, 250: 156, 282: 157, 283: 158, 252: 159, 8224: 160, 176: 161, 280: 162, 163: 163, 167: 164, 8226: 165, 182: 166, 223: 167, 174: 168, 169: 169, 8482: 170, 281: 171, 168: 172, 8800: 173, 291: 174, 302: 175, 303: 176, 298: 177, 8804: 178, 8805: 179, 299: 180, 310: 181, 8706: 182, 8721: 183, 322: 184, 315: 185, 316: 186, 317: 187, 318: 188, 313: 189, 314: 190, 325: 191, 326: 192, 323: 193, 172: 194, 8730: 195, 324: 196, 327: 197, 8710: 198, 171: 199, 187: 200, 8230: 201, 160: 202, 328: 203, 336: 204, 213: 205, 337: 206, 332: 207, 8211: 208, 8212: 209, 8220: 210, 8221: 211, 8216: 212, 8217: 213, 247: 214, 9674: 215, 333: 216, 340: 217, 341: 218, 344: 219, 8249: 220, 8250: 221, 345: 222, 342: 223, 343: 224, 352: 225, 8218: 226, 8222: 227, 353: 228, 346: 229, 347: 230, 193: 231, 356: 232, 357: 233, 205: 234, 381: 235, 382: 236, 362: 237, 211: 238, 212: 239, 363: 240, 366: 241, 218: 242, 367: 243, 368: 244, 369: 245, 370: 246, 371: 247, 221: 248, 253: 249, 311: 250, 379: 251, 321: 252, 380: 253, 290: 254, 711: 255}
class XMacCeIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the Macintosh Central European encoding.
"""
name = 'x-mac-ce'
html5name = None
@lazy_property
@ -584,6 +846,13 @@ register_kuroko_codec(['mac-centeuro', 'x-mac-ce', 'mac-latin2', 'maccentraleuro
class XMacCroatianIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the Macintosh Croatian encoding.
In contrast to the Windows and ISO Central European encodings, the Macintosh Central European
encoding did not include complete coverage for Gajica, hence a separate encoding was used.
The two do not resemble one another except insofar as both derive from Macintosh Roman.
"""
name = 'x-mac-croatian'
html5name = None
@lazy_property
@ -591,6 +860,13 @@ class XMacCroatianIncrementalEncoder(AsciiIncrementalEncoder):
return {196: 128, 197: 129, 199: 130, 201: 131, 209: 132, 214: 133, 220: 134, 225: 135, 224: 136, 226: 137, 228: 138, 227: 139, 229: 140, 231: 141, 233: 142, 232: 143, 234: 144, 235: 145, 237: 146, 236: 147, 238: 148, 239: 149, 241: 150, 243: 151, 242: 152, 244: 153, 246: 154, 245: 155, 250: 156, 249: 157, 251: 158, 252: 159, 8224: 160, 176: 161, 162: 162, 163: 163, 167: 164, 8226: 165, 182: 166, 223: 167, 174: 168, 352: 169, 8482: 170, 180: 171, 168: 172, 8800: 173, 381: 174, 216: 175, 8734: 176, 177: 177, 8804: 178, 8805: 179, 8710: 180, 181: 181, 8706: 182, 8721: 183, 8719: 184, 353: 185, 8747: 186, 170: 187, 186: 188, 937: 189, 382: 190, 248: 191, 191: 192, 161: 193, 172: 194, 8730: 195, 402: 196, 8776: 197, 262: 198, 171: 199, 268: 200, 8230: 201, 160: 202, 192: 203, 195: 204, 213: 205, 338: 206, 339: 207, 272: 208, 8212: 209, 8220: 210, 8221: 211, 8216: 212, 8217: 213, 247: 214, 9674: 215, 63743: 216, 169: 217, 8260: 218, 8364: 219, 8249: 220, 8250: 221, 198: 222, 187: 223, 8211: 224, 183: 225, 8218: 226, 8222: 227, 8240: 228, 194: 229, 263: 230, 193: 231, 269: 232, 200: 233, 205: 234, 206: 235, 207: 236, 204: 237, 211: 238, 212: 239, 273: 240, 210: 241, 218: 242, 219: 243, 217: 244, 305: 245, 710: 246, 732: 247, 175: 248, 960: 249, 203: 250, 730: 251, 184: 252, 202: 253, 230: 254, 711: 255}
class XMacCroatianIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the Macintosh Croatian encoding.
In contrast to the Windows and ISO Central European encodings, the Macintosh Central European
encoding did not include complete coverage for Gajica, hence a separate encoding was used.
The two do not resemble one another except insofar as both derive from Macintosh Roman.
"""
name = 'x-mac-croatian'
html5name = None
@lazy_property
@ -601,6 +877,9 @@ register_kuroko_codec(['mac-croatian', 'x-mac-croatian'], XMacCroatianIncrementa
class XMacFarsiIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the Macintosh Farsi encoding.
"""
name = 'x-mac-farsi'
html5name = None
@lazy_property
@ -608,6 +887,9 @@ class XMacFarsiIncrementalEncoder(AsciiIncrementalEncoder):
return {196: 128, 160: 129, 199: 130, 201: 131, 209: 132, 214: 133, 220: 134, 225: 135, 224: 136, 226: 137, 228: 138, 1722: 139, 171: 140, 231: 141, 233: 142, 232: 143, 234: 144, 235: 145, 237: 146, 8230: 147, 238: 148, 239: 149, 241: 150, 243: 151, 187: 152, 244: 153, 246: 154, 247: 155, 250: 156, 249: 157, 251: 158, 252: 159, 32: 160, 33: 161, 34: 162, 35: 163, 36: 164, 1642: 165, 38: 166, 39: 167, 40: 168, 41: 169, 42: 170, 43: 171, 1548: 172, 45: 173, 46: 174, 47: 175, 1776: 176, 1777: 177, 1778: 178, 1779: 179, 1780: 180, 1781: 181, 1782: 182, 1783: 183, 1784: 184, 1785: 185, 58: 186, 1563: 187, 60: 188, 61: 189, 62: 190, 1567: 191, 10058: 192, 1569: 193, 1570: 194, 1571: 195, 1572: 196, 1573: 197, 1574: 198, 1575: 199, 1576: 200, 1577: 201, 1578: 202, 1579: 203, 1580: 204, 1581: 205, 1582: 206, 1583: 207, 1584: 208, 1585: 209, 1586: 210, 1587: 211, 1588: 212, 1589: 213, 1590: 214, 1591: 215, 1592: 216, 1593: 217, 1594: 218, 91: 219, 92: 220, 93: 221, 94: 222, 95: 223, 1600: 224, 1601: 225, 1602: 226, 1603: 227, 1604: 228, 1605: 229, 1606: 230, 1607: 231, 1608: 232, 1609: 233, 1610: 234, 1611: 235, 1612: 236, 1613: 237, 1614: 238, 1615: 239, 1616: 240, 1617: 241, 1618: 242, 1662: 243, 1657: 244, 1670: 245, 1749: 246, 1700: 247, 1711: 248, 1672: 249, 1681: 250, 123: 251, 124: 252, 125: 253, 1688: 254, 1746: 255}
class XMacFarsiIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the Macintosh Farsi encoding.
"""
name = 'x-mac-farsi'
html5name = None
@lazy_property
@ -618,6 +900,9 @@ register_kuroko_codec(['mac-farsi', 'x-mac-farsi'], XMacFarsiIncrementalEncoder,
class XMacGreekIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the Macintosh Greek encoding.
"""
name = 'x-mac-greek'
html5name = None
@lazy_property
@ -625,6 +910,9 @@ class XMacGreekIncrementalEncoder(AsciiIncrementalEncoder):
return {196: 128, 185: 129, 178: 130, 201: 131, 179: 132, 214: 133, 220: 134, 901: 135, 224: 136, 226: 137, 228: 138, 900: 139, 168: 140, 231: 141, 233: 142, 232: 143, 234: 144, 235: 145, 163: 146, 8482: 147, 238: 148, 239: 149, 8226: 150, 189: 151, 8240: 152, 244: 153, 246: 154, 166: 155, 8364: 156, 249: 157, 251: 158, 252: 159, 8224: 160, 915: 161, 916: 162, 920: 163, 923: 164, 926: 165, 928: 166, 223: 167, 174: 168, 169: 169, 931: 170, 938: 171, 167: 172, 8800: 173, 176: 174, 183: 175, 913: 176, 177: 177, 8804: 178, 8805: 179, 165: 180, 914: 181, 917: 182, 918: 183, 919: 184, 921: 185, 922: 186, 924: 187, 934: 188, 939: 189, 936: 190, 937: 191, 940: 192, 925: 193, 172: 194, 927: 195, 929: 196, 8776: 197, 932: 198, 171: 199, 187: 200, 8230: 201, 160: 202, 933: 203, 935: 204, 902: 205, 904: 206, 339: 207, 8211: 208, 8213: 209, 8220: 210, 8221: 211, 8216: 212, 8217: 213, 247: 214, 905: 215, 906: 216, 908: 217, 910: 218, 941: 219, 942: 220, 943: 221, 972: 222, 911: 223, 973: 224, 945: 225, 946: 226, 968: 227, 948: 228, 949: 229, 966: 230, 947: 231, 951: 232, 953: 233, 958: 234, 954: 235, 955: 236, 956: 237, 957: 238, 959: 239, 960: 240, 974: 241, 961: 242, 963: 243, 964: 244, 952: 245, 969: 246, 962: 247, 967: 248, 965: 249, 950: 250, 970: 251, 971: 252, 912: 253, 944: 254, 173: 255}
class XMacGreekIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the Macintosh Greek encoding.
"""
name = 'x-mac-greek'
html5name = None
@lazy_property
@ -635,6 +923,9 @@ register_kuroko_codec(['mac-greek', 'macgreek', 'x-mac-greek'], XMacGreekIncreme
class XMacIcelandicIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the Macintosh Icelandic encoding.
"""
name = 'x-mac-icelandic'
html5name = None
@lazy_property
@ -642,6 +933,9 @@ class XMacIcelandicIncrementalEncoder(AsciiIncrementalEncoder):
return {196: 128, 197: 129, 199: 130, 201: 131, 209: 132, 214: 133, 220: 134, 225: 135, 224: 136, 226: 137, 228: 138, 227: 139, 229: 140, 231: 141, 233: 142, 232: 143, 234: 144, 235: 145, 237: 146, 236: 147, 238: 148, 239: 149, 241: 150, 243: 151, 242: 152, 244: 153, 246: 154, 245: 155, 250: 156, 249: 157, 251: 158, 252: 159, 221: 160, 176: 161, 162: 162, 163: 163, 167: 164, 8226: 165, 182: 166, 223: 167, 174: 168, 169: 169, 8482: 170, 180: 171, 168: 172, 8800: 173, 198: 174, 216: 175, 8734: 176, 177: 177, 8804: 178, 8805: 179, 165: 180, 181: 181, 8706: 182, 8721: 183, 8719: 184, 960: 185, 8747: 186, 170: 187, 186: 188, 937: 189, 230: 190, 248: 191, 191: 192, 161: 193, 172: 194, 8730: 195, 402: 196, 8776: 197, 8710: 198, 171: 199, 187: 200, 8230: 201, 160: 202, 192: 203, 195: 204, 213: 205, 338: 206, 339: 207, 8211: 208, 8212: 209, 8220: 210, 8221: 211, 8216: 212, 8217: 213, 247: 214, 9674: 215, 255: 216, 376: 217, 8260: 218, 8364: 219, 208: 220, 240: 221, 222: 222, 254: 223, 253: 224, 183: 225, 8218: 226, 8222: 227, 8240: 228, 194: 229, 202: 230, 193: 231, 203: 232, 200: 233, 205: 234, 206: 235, 207: 236, 204: 237, 211: 238, 212: 239, 63743: 240, 210: 241, 218: 242, 219: 243, 217: 244, 305: 245, 710: 246, 732: 247, 175: 248, 728: 249, 729: 250, 730: 251, 184: 252, 733: 253, 731: 254, 711: 255}
class XMacIcelandicIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the Macintosh Icelandic encoding.
"""
name = 'x-mac-icelandic'
html5name = None
@lazy_property
@ -652,6 +946,9 @@ register_kuroko_codec(['mac-iceland', 'maciceland', 'x-mac-icelandic'], XMacIcel
class XMacRomanianIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the Macintosh Romanian encoding.
"""
name = 'x-mac-romanian'
html5name = None
@lazy_property
@ -659,6 +956,9 @@ class XMacRomanianIncrementalEncoder(AsciiIncrementalEncoder):
return {196: 128, 197: 129, 199: 130, 201: 131, 209: 132, 214: 133, 220: 134, 225: 135, 224: 136, 226: 137, 228: 138, 227: 139, 229: 140, 231: 141, 233: 142, 232: 143, 234: 144, 235: 145, 237: 146, 236: 147, 238: 148, 239: 149, 241: 150, 243: 151, 242: 152, 244: 153, 246: 154, 245: 155, 250: 156, 249: 157, 251: 158, 252: 159, 8224: 160, 176: 161, 162: 162, 163: 163, 167: 164, 8226: 165, 182: 166, 223: 167, 174: 168, 169: 169, 8482: 170, 180: 171, 168: 172, 8800: 173, 258: 174, 536: 175, 8734: 176, 177: 177, 8804: 178, 8805: 179, 165: 180, 181: 181, 8706: 182, 8721: 183, 8719: 184, 960: 185, 8747: 186, 170: 187, 186: 188, 937: 189, 259: 190, 537: 191, 191: 192, 161: 193, 172: 194, 8730: 195, 402: 196, 8776: 197, 8710: 198, 171: 199, 187: 200, 8230: 201, 160: 202, 192: 203, 195: 204, 213: 205, 338: 206, 339: 207, 8211: 208, 8212: 209, 8220: 210, 8221: 211, 8216: 212, 8217: 213, 247: 214, 9674: 215, 255: 216, 376: 217, 8260: 218, 8364: 219, 8249: 220, 8250: 221, 538: 222, 539: 223, 8225: 224, 183: 225, 8218: 226, 8222: 227, 8240: 228, 194: 229, 202: 230, 193: 231, 203: 232, 200: 233, 205: 234, 206: 235, 207: 236, 204: 237, 211: 238, 212: 239, 63743: 240, 210: 241, 218: 242, 219: 243, 217: 244, 305: 245, 710: 246, 732: 247, 175: 248, 728: 249, 729: 250, 730: 251, 184: 252, 733: 253, 731: 254, 711: 255}
class XMacRomanianIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the Macintosh Romanian encoding.
"""
name = 'x-mac-romanian'
html5name = None
@lazy_property
@ -669,6 +969,9 @@ register_kuroko_codec(['mac-romanian', 'x-mac-romanian'], XMacRomanianIncrementa
class XMacTurkishIncrementalEncoder(AsciiIncrementalEncoder):
"""
IncrementalEncoder implementation for the Macintosh Turkish encoding.
"""
name = 'x-mac-turkish'
html5name = None
@lazy_property
@ -676,6 +979,9 @@ class XMacTurkishIncrementalEncoder(AsciiIncrementalEncoder):
return {196: 128, 197: 129, 199: 130, 201: 131, 209: 132, 214: 133, 220: 134, 225: 135, 224: 136, 226: 137, 228: 138, 227: 139, 229: 140, 231: 141, 233: 142, 232: 143, 234: 144, 235: 145, 237: 146, 236: 147, 238: 148, 239: 149, 241: 150, 243: 151, 242: 152, 244: 153, 246: 154, 245: 155, 250: 156, 249: 157, 251: 158, 252: 159, 8224: 160, 176: 161, 162: 162, 163: 163, 167: 164, 8226: 165, 182: 166, 223: 167, 174: 168, 169: 169, 8482: 170, 180: 171, 168: 172, 8800: 173, 198: 174, 216: 175, 8734: 176, 177: 177, 8804: 178, 8805: 179, 165: 180, 181: 181, 8706: 182, 8721: 183, 8719: 184, 960: 185, 8747: 186, 170: 187, 186: 188, 937: 189, 230: 190, 248: 191, 191: 192, 161: 193, 172: 194, 8730: 195, 402: 196, 8776: 197, 8710: 198, 171: 199, 187: 200, 8230: 201, 160: 202, 192: 203, 195: 204, 213: 205, 338: 206, 339: 207, 8211: 208, 8212: 209, 8220: 210, 8221: 211, 8216: 212, 8217: 213, 247: 214, 9674: 215, 255: 216, 376: 217, 286: 218, 287: 219, 304: 220, 305: 221, 350: 222, 351: 223, 8225: 224, 183: 225, 8218: 226, 8222: 227, 8240: 228, 194: 229, 202: 230, 193: 231, 203: 232, 200: 233, 205: 234, 206: 235, 207: 236, 204: 237, 211: 238, 212: 239, 63743: 240, 210: 241, 218: 242, 219: 243, 217: 244, 63648: 245, 710: 246, 732: 247, 175: 248, 728: 249, 729: 250, 730: 251, 184: 252, 733: 253, 731: 254, 711: 255}
class XMacTurkishIncrementalDecoder(AsciiIncrementalDecoder):
"""
IncrementalDecoder implementation for the Macintosh Turkish encoding.
"""
name = 'x-mac-turkish'
html5name = None
@lazy_property


@ -14,7 +14,11 @@ with fileio.open('tools/codectools/encodings.json') as f:
for enc in i['encodings']:
aliases[enc['name'].lower()] = enc['labels']
let boilerplate = '''# Generated by tools/codectools/gen_dbdata.krk from WHATWG encodings.json and indexes.json
let boilerplate = '''"""
Defines WHATWG-specified double-byte encodings which do not require dedicated implementations, and
supplies data used by those (in `codecs.bespokecodecs`) which do.
"""
# Generated by tools/codectools/gen_dbdata.krk from WHATWG encodings.json and indexes.json
from collections import xraydict
from codecs.infrastructure import AsciiIncrementalEncoder, AsciiIncrementalDecoder, register_kuroko_codec, encodesto7bit, decodesto7bit, lazy_property
@ -22,6 +26,7 @@ from codecs.infrastructure import AsciiIncrementalEncoder, AsciiIncrementalDecod
let template = '''
class {idname}IncrementalEncoder(AsciiIncrementalEncoder):
"""IncrementalEncoder implementation for {description}"""
name = {mainlabel}
html5name = {weblabel}
@lazy_property
@ -29,6 +34,7 @@ class {idname}IncrementalEncoder(AsciiIncrementalEncoder):
return {encode}
class {idname}IncrementalDecoder(AsciiIncrementalDecoder):
"""IncrementalDecoder implementation for {description}"""
name = {mainlabel}
html5name = {weblabel}
@lazy_property
@ -43,6 +49,7 @@ register_kuroko_codec({labels}, {idname}IncrementalEncoder, {idname}IncrementalD
let template_big5 = '''
class {idnameenc}IncrementalEncoder(AsciiIncrementalEncoder):
"""IncrementalEncoder implementation for {description}"""
name = {mainlabelenc}
html5name = {weblabel}
@lazy_property
@ -50,6 +57,7 @@ class {idnameenc}IncrementalEncoder(AsciiIncrementalEncoder):
return {encode}
class {idnameenc2}IncrementalEncoder(AsciiIncrementalEncoder):
"""IncrementalEncoder implementation for {description2}"""
name = {mainlabelenc2}
html5name = None
@lazy_property
@ -57,6 +65,7 @@ class {idnameenc2}IncrementalEncoder(AsciiIncrementalEncoder):
return xraydict({idnameenc}IncrementalEncoder("strict").encoding_map, {encode2})
class {idnamedec}IncrementalDecoder(AsciiIncrementalDecoder):
"""IncrementalDecoder implementation for {descriptiondec}"""
name = {mainlabeldec}
html5name = {weblabel}
@lazy_property
@ -226,6 +235,7 @@ with fileio.open('modules/codecs/dbdata.krk', 'w') as f:
mainlabel=repr('windows-31j'),
weblabel=repr('shift_jis'),
labels=repr(aliases['shift_jis'] + ["cp932", "932", "mskanji", "shiftjis", "s_jis"]),
description="Windows-31J (Shift_JIS as implemented by Microsoft).",
encode=smartrepr(encode_shiftjis), decode=smartrepr(decode_shiftjis), idname='Windows31J',
dbrange=repr(dbrange_shiftjis), tbrange=repr(tbrange_shiftjis),
trailrange=repr(trailrange_shiftjis)))
@ -233,6 +243,7 @@ with fileio.open('modules/codecs/dbdata.krk', 'w') as f:
mainlabel=repr("x-euc-jp"),
weblabel=repr("euc-jp"),
labels=repr(aliases["euc-jp"] + ["eucjp", "ujis", "u_jis"]),
description="EUC-JP (web version).",
encode=smartrepr(encode_eucjp), decode=smartrepr(decode_eucjp), idname="XEucJp",
dbrange=repr(dbrange_eucjp), tbrange=repr(tbrange_eucjp),
trailrange=repr(trailrange_eucjp)))
@ -241,6 +252,7 @@ with fileio.open('modules/codecs/dbdata.krk', 'w') as f:
weblabel=repr("euc-kr"),
labels=repr(aliases["euc-kr"] + ["cp949", "949", "ms949", "uhc", "euckr",
"ks_c_5601", "ksx1001", "ks_x_1001"]),
description="Unified Hangul Code (extended EUC-KR Wansung, Microsoft's KS C 5601 encoding).",
encode=smartrepr(encode_uhc), decode=smartrepr(decode_uhc), idname="Windows949",
dbrange=repr(dbrange_uhc), tbrange=repr(tbrange_uhc),
trailrange=repr(trailrange_uhc)))
@ -249,6 +261,9 @@ with fileio.open('modules/codecs/dbdata.krk', 'w') as f:
mainlabelenc2=repr("big5-hkscs"),
mainlabeldec=repr("big5-hkscs"),
weblabel=repr("big5"),
description="Big-5 (ETen version).",
description2="Big-5 (HKSCS version).",
descriptiondec="Big-5 (HKSCS version).",
labels=repr(["big5", "cn-big5", "csbig5", "x-x-big5", "big5-eten", "cp950", "950", "ms950"]),
labels2=repr(["big5-hkscs", "big5hkscs", "hkscs"]),
encode=smartrepr(encode_big5eten), idnameenc="Big5Eten",


@ -17,6 +17,9 @@ def build_sbmap(name):
let template = """
class {idname}IncrementalEncoder(AsciiIncrementalEncoder):
'''
IncrementalEncoder implementation for {description}
'''
name = {mainlabel}
html5name = {weblabel}
@lazy_property
@ -24,6 +27,9 @@ class {idname}IncrementalEncoder(AsciiIncrementalEncoder):
return {encode}
class {idname}IncrementalDecoder(AsciiIncrementalDecoder):
'''
IncrementalDecoder implementation for {description}
'''
name = {mainlabel}
html5name = {weblabel}
@lazy_property
@ -36,7 +42,10 @@ register_kuroko_codec(
{idname}IncrementalDecoder)
"""
let boilerplate = """# Generated by tools/codectools/gen_sbencs.krk from WHATWG encodings.json and indexes.json
let boilerplate = """'''
Defines WHATWG-specified single-byte encodings.
'''
# Generated by tools/codectools/gen_sbencs.krk from WHATWG encodings.json and indexes.json
from codecs.infrastructure import AsciiIncrementalEncoder, AsciiIncrementalDecoder, register_kuroko_codec, lazy_property
"""
@ -73,7 +82,7 @@ let parity_labels = {
# do). Also, it aliases "iso-ir-166" to the former (since it cites TIS-620) despite it having
# an NBSP in the registration document (case in point).
"windows-874": ["iso-8859-11-2001", "tis620", "tis-620-0", "tis-620-2529-0",
"tis-620-2529-1", "iso-ir-166", "thai"],
"tis-620-2529-1", "iso-ir-166", "thai", "cp874"],
"iso-8859-13": ["l7", "latin7"],
"iso-8859-14": ["iso-8859-14-1998", "l8", "latin8", "iso-ir-199", "iso_celtic"],
"iso-8859-15": ["latin9"],
@ -82,6 +91,62 @@ let parity_labels = {
"x-mac-cyrillic": ["mac-cyrillic", "maccyrillic"],
}
let descriptions = {
"windows-1250": "Windows-1250 (Central Europe)",
"windows-1251": "Windows-1251 (Cyrillic)",
"windows-1252": "Windows-1252 (Western Europe), ISO-8859-1 modification/extension",
"windows-1253": "Windows-1253 (Greek)",
"windows-1254": "Windows-1254 (Turkish), ISO-8859-9 modification/extension",
"windows-1255": "Windows-1255 (Logical order Hebrew with vowel points)",
"windows-1256": "Windows-1256 (Arabic)",
"windows-1257": "Windows-1257 (Baltic Rim)",
"windows-1258": """Windows-1258 (Vietnam), basic implementation
Note that Windows-1258 includes a mixture of composed forms and combining characters.
Some grapheme clusters must be represented as a composed form followed by a combining
character: although a fully composed form exists in Unicode (taken from other encodings
such as VISCII), Windows-1258 does not include it, and it includes a combining form for
only one of the diacritics involved.
The encoder is a simple mapping which will accept text in the form generated by the decoder
but, due to the above, some grapheme clusters will not be accepted in either NFC or NFD
normalised form. The decoder does not convert its output to any normalised form. This follows
both Python and WHATWG behaviour. Conversion of text between encodable form and either
normalised form may need to be handled in a separate step by any code using this codec.""",
"ibm866": """OEM-866 (Russian Cyrillic).
Note: OEM-866 competed with OEM-855 for Cyrillic; OEM-866 preserved all box drawing characters
(rather than only a subset) and was more popular for Russian, but did not provide coverage
for all of the different South Slavic Cyrillic orthographies, unlike OEM-855. Their layouts
for Cyrillic are entirely different.""",
"iso-8859-2": "ISO/IEC 8859-2 (Central European)",
"iso-8859-3": "ISO/IEC 8859-3 (Maltese and Esperanto)",
"iso-8859-4": "ISO/IEC 8859-4 (North European)",
"iso-8859-5": "ISO/IEC 8859-5 (Cyrillic)",
"iso-8859-6": "ISO/IEC 8859-6 (Arabic ASMO 708)",
"iso-8859-7": "ISO/IEC 8859-7 (Greek ELOT 928)",
"iso-8859-8": "ISO/IEC 8859-8 (Hebrew)",
"iso-8859-8-i": "ISO/IEC 8859-8 (Hebrew)", # Artifact: they do the same thing inside codecs.
"iso-8859-10": "ISO/IEC 8859-10 (Nordic)",
"windows-874": "Windows-874 (Thai), TIS-620 / ISO-8859-11 modification/extension",
"iso-8859-13": "ISO/IEC 8859-13 (Baltic Rim)",
"iso-8859-14": "ISO/IEC 8859-14 (Celtic)",
"iso-8859-15": "ISO/IEC 8859-15 (New Western European)",
"iso-8859-16": "ISO/IEC 8859-16 (South-Eastern European; Romanian SR 14111)",
"koi8-r": "the KOI8-R (KOI-8 Cyrillic for Russian) encoding.",
"koi8-ru": "the KOI8-RU (KOI-8 Cyrillic for Belarusian, Ukrainian and Ruthenian) encoding.",
"macintosh": "the Macintosh Roman encoding.",
"x-mac-cyrillic": "the Macintosh Cyrillic encoding.",
"x-user-defined": """the user-defined extended ASCII encoding.
This maps ASCII bytes as ASCII characters, and non-ASCII bytes to the private use
range U+F780 to U+F7FF, such that the low 8 bits always match the original byte.
This is sometimes useful for round-tripping arbitrary _sensu stricto_ extended
ASCII data without caring about the non-ASCII part. Note however, that _sensu lato_
extended ASCII may for example use ASCII bytes as trail bytes in a multi-byte code.""",
}
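The Windows-1258 note in `descriptions` above can be illustrated with a short sketch. It is written in Python rather than Kuroko so that it runs on stock CPython, and it assumes CPython's built-in `cp1258` codec as a stand-in for the codec generated here; the helper `can_encode` is hypothetical and exists only for this illustration. It takes U+1EC7 (e with circumflex and dot below) and checks which spellings of it the encoder accepts.

```python
import unicodedata

# U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW, fully composed (NFC).
nfc = "\u1ec7"
# Fully decomposed (NFD): "e" + U+0323 (dot below) + U+0302 (circumflex).
nfd = unicodedata.normalize("NFD", nfc)
# The form a Windows-1258 decoder yields: composed U+00EA plus combining U+0323.
encodable = "\u00ea\u0323"


def can_encode(text, codec="cp1258"):
    """Hypothetical helper: True if `text` encodes under `codec` in strict mode."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


for label, text in (("NFC", nfc), ("NFD", nfd), ("decoder form", encodable)):
    print(label, can_encode(text))

# The decoder form is canonically equivalent to the NFC form, so a separate
# normalisation step can recover NFC after decoding.
print(unicodedata.normalize("NFC", encodable) == nfc)
```

Under these assumptions only the mixed form encodes, which is why converting between the encodable form and a normalised form is left to the calling code.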
let encode_xudef = {}
let decode_xudef = {}
for i in range(128):
@ -110,7 +175,7 @@ with fileio.open("modules/codecs/sbencs.krk", "w") as outf:
let decoding_map = built[1]
let idname = name.title().replace("-", "")
outf.write(template.format(mainlabel=repr(name), encode=repr(encoding_map),
weblabel=repr(whatwgname),
weblabel=repr(whatwgname), description=descriptions.get(name, "TODO"),
decode=repr(decoding_map), labels=repr(labels), idname=idname))
else:
for enc in i["encodings"]:
@ -119,18 +184,24 @@ with fileio.open("modules/codecs/sbencs.krk", "w") as outf:
else:
mapped_to_replacement.extend(enc["labels"])
outf.write(template.format(mainlabel=repr("x-user-defined"), encode=repr(encode_xudef),
weblabel=repr("x-user-defined"),
weblabel=repr("x-user-defined"), description=descriptions.get("x-user-defined", "TODO"),
decode=repr(decode_xudef), labels=repr(["x-user-defined"]), idname="XUserDefined"))
with fileio.open("modules/codecs/isweblabel.krk", "w") as outf:
outf.write(f"""
outf.write(f"""'''
Allows checking the WHATWG status of a given label (listed, not listed, or mapped to undefined).
'''
# Generated by tools/codectools/gen_sbencs.krk from WHATWG encodings.json
let weblabels = {all_weblabels!r}
let mapped_to_replacement = {mapped_to_replacement!r}
def map_weblabel(label):
'''
If `label` is a regular WHATWG label, returns it; if it is a label mapped to Replacement,
returns `"undefined"`; otherwise, returns `None`.
'''
if label in mapped_to_replacement:
# WHATWG aliases these following to replacement to prevent their use in injection/XSS attacks.
# WHATWG aliases these to replacement to prevent their use in injection/XSS attacks.
return "undefined"
else if label in weblabels:
return label
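
The `x-user-defined` description earlier in this file can be sketched as follows. This is a Python illustration, not the generated Kuroko codec: the function names `xudef_decode` and `xudef_encode` are hypothetical, and the error handling (raising `ValueError`) is an assumption made to keep the sketch small.

```python
def xudef_decode(data: bytes) -> str:
    """Map ASCII bytes to themselves and bytes 0x80..0xFF to U+F780..U+F7FF."""
    return "".join(chr(b) if b < 0x80 else chr(0xF700 | b) for b in data)


def xudef_encode(text: str) -> bytes:
    """Inverse of xudef_decode; rejects anything outside the two mapped ranges."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif 0xF780 <= cp <= 0xF7FF:
            # The low 8 bits of the private-use code point are the original byte.
            out.append(cp & 0xFF)
        else:
            raise ValueError(f"U+{cp:04X} is not encodable in x-user-defined")
    return bytes(out)


raw = bytes(range(256))
text = xudef_decode(raw)
print(xudef_encode(text) == raw)
```

Because every byte value maps to exactly one code point, arbitrary byte strings survive a decode and encode round trip unchanged, which is the property the description relies on.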


@ -81,14 +81,16 @@ let modules = [
'tools.gendoc',
# Codecs module
'codecs',
'codecs.bespokecodecs',
'codecs.binascii',
'codecs.dbdata',
'codecs.dbextra',
'codecs.dbextra_data_7bit',
'codecs.dbextra_data_8bit',
'codecs.dbextra',
'codecs.infrastructure',
'codecs.isweblabel',
'codecs.pifonts',
'codecs.sbencs',
'codecs.sbextra',
]
@ -156,7 +158,7 @@ def functionDoc(func):
let doc = func.__doc__ if ('__doc__' in dir(func) and func.__doc__) else ''
if '@arguments ' in doc:
doc = '\n'.join([x for x in doc.split('\n') if '@arguments' not in x])
return doc
return "<p>" + doc + "</p>"
def processModules(modules):
@ -176,6 +178,13 @@ def processModules(modules):
output.write('\n')
print('## ' + fixup(modulepath) + ' {#mod_' + modulepath.replace('.','_') + '}')
let rsplit = lambda s,d,l: reversed("".join(reversed(i)) for i in "".join(reversed(s)).split(d, l))
if "." in modulepath:
let parent = rsplit(modulepath, ".", 1)[0]
if parent in modules:
let parentpath = fixup(parent).replace('.','_')
print(f"\n<a href='mod_{parentpath}.html'>← {parent}</a>\n")
if '__doc__' in dir(module) and module.__doc__:
print(module.__doc__.strip())
docString[modulepath] = truncateString(module.__doc__)
@ -279,6 +288,17 @@ def processModules(modules):
else:
other.append(Pair(member,obj))
if hasattr(module, "__ispackage__") and module.__ispackage__:
print("\n### Package contents\n")
print('\htmlonly<div class="krk-class-index"><ul>\n')
for i in modules:
if not i.startswith(name + "."):
continue
let uscored = fixup(i).replace('.','_')
let relative = i[len(name) + 1:]
print(f'<li><a class="el" href="mod_{uscored}.html">{relative}</a></li>\n')
print('</ul></div>\endhtmlonly\n')
if classes:
print('\n### Classes\n')
classes.sort()