kuroko/modules
HarJIT a580a835b8
Codecs revisited (#28)
* xraydict functionality and usage improvements

Add a `filter_function` to xraydict, so that fewer large data structures
are needed. Make uses of xraydict prefer exclusion sets to exclusion lists,
to avoid repeated linear searches of a list.
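
The idea can be sketched in Python as follows. This is only an illustration
under stated assumptions, not the code in `collections.krk`: the class name
`XrayDict`, its constructor parameters and the sample keys are all
hypothetical. It shows a read-only view over a base mapping in which keys can
be hidden either by an exclusion set (constant-time membership on average) or
by a predicate, so no large exclusion collection has to be built.

```python
class XrayDict:
    """Hypothetical sketch: a read-only view of `base` with `overlay` on top."""

    def __init__(self, base, overlay, exclusions=frozenset(), filter_function=None):
        self.base = base
        self.overlay = overlay
        # A set gives O(1) average membership tests; a list would be O(n).
        self.exclusions = exclusions
        # Optional predicate: a base key is visible only if it returns True.
        self.filter_function = filter_function

    def _hidden(self, key):
        if key in self.exclusions:
            return True
        return self.filter_function is not None and not self.filter_function(key)

    def __contains__(self, key):
        if key in self.overlay:
            return True
        return key in self.base and not self._hidden(key)

    def __getitem__(self, key):
        if key in self.overlay:
            return self.overlay[key]
        if key in self.base and not self._hidden(key):
            return self.base[key]
        raise KeyError(key)

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default


# Illustrative data only; the keys are not taken from any real mapping table.
base = {0x41: "A", 0x42: "B", 0x43: "C", 0x8140: "X", 0x8141: "Y"}
overlay = {0x42: "b"}
view = XrayDict(base, overlay,
                exclusions={0x41},
                filter_function=lambda key: key < 0x8000)
```

Here the predicate hides a whole range of base keys without enumerating
them, while the exclusion set handles the one irregular key.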

* Make `big5_coded_forms_from_hkscs` a set, remove set trailing commas.

* Remove `big5_coded_forms_from_hkscs` in favour of a filter function.

* Similarly, use sets for 7-bit exclusion lists except when really short.

* Revise mappings for seven 78JIS codepoints.

Mappings for 25-23 and 90-22 were previously the same as those used for
97JIS; they have been swapped to match how the IBM extension code and the
standard code are mapped in the "old sequence" (78JIS-based), as opposed
to the "new sequence".

Mappings for 32-70, 34-45, 35-29, 39-77 and 54-02 in 78JIS have been
changed to reflect disunifications made in 2000-JIS and 2004-JIS, assigning
the 1978-edition unsimplified variants of those characters separate coded
forms. Previously, only swaps and disunifications in 83JIS and
disunifications in 90JIS (including JIS X 0212) had been considered.

This only affects the `jis_encoding` codec (including the decoding
direction for `iso-2022-jp-2`, `iso-2022-jp-3` and `iso-2022-jp-2004`),
and the decoding is only affected when `ESC $ @` (not `ESC $ B`) is used.
The `iso-2022-jp` codec is unaffected, and remains similar to (but more
consistently pedantic than) the WHATWG specification, thus using the same
table for both 78JIS and 97JIS.
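
To illustrate why the escape sequence matters, here is a minimal Python
sketch that splits an ISO-2022-JP-style byte stream into runs according to
the most recent designation. `ESC $ @` designates JIS C 6226-1978 and
`ESC $ B` designates JIS X 0208-1983; the function name, the run labels and
the sample bytes are assumptions made for this example and are not part of
the codecs package. A decoder that distinguishes the two editions would pick
a different table for the `"78jis"` runs, whereas one following the WHATWG
approach would use one table for both.

```python
# Escape sequences and the (hypothetical) labels this sketch assigns to them.
DESIGNATIONS = {
    b"\x1b(B": "ascii",     # ESC ( B
    b"\x1b$@": "78jis",     # ESC $ @  (JIS C 6226-1978)
    b"\x1b$B": "jisx0208",  # ESC $ B  (JIS X 0208-1983)
}


def split_by_designation(data):
    """Return a list of (label, payload) runs; the initial state is ASCII."""
    runs = []
    state = "ascii"
    start = 0
    i = 0
    while i < len(data):
        if data[i] == 0x1B:
            seq = data[i:i + 3]
            if seq not in DESIGNATIONS:
                raise ValueError("unsupported escape sequence: %r" % seq)
            if i > start:
                runs.append((state, data[start:i]))
            state = DESIGNATIONS[seq]
            i += 3
            start = i
        else:
            i += 1
    if start < len(data):
        runs.append((state, data[start:]))
    return runs


# The same two bytes appear under both designations.
sample = b"A\x1b$@\x30\x21\x1b$B\x30\x21\x1b(BZ"
runs = split_by_designation(sample)
```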

* Make `johab-ebcdic` decoder use many-to-one, not corporate PUA.

Many-to-one decodes are not uncommon in CJK encodings (e.g. Windows-31J),
and mapping to the IBM Corporate PUA (code page 1449) would probably, in
practice, make the text render as completely the wrong character, if it
rendered at all.
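
As an illustration of a many-to-one decode, the following Python sketch uses
CPython's own `cp932` codec (its implementation of Windows-31J), not the
Kuroko codecs package. It relies on the Microsoft mapping in which the
standard code 0x81CA and the two vendor-extension duplicates 0xEEF9 and
0xFA54 all correspond to U+FFE2; the variable names are made up for the
example.

```python
# Three distinct Windows-31J byte sequences for the same character.
duplicates = [b"\x81\xca", b"\xee\xf9", b"\xfa\x54"]

# Decoding is many-to-one: every sequence yields the same code point.
decoded = {raw.decode("cp932") for raw in duplicates}

# Encoding has to pick a single preferred form, so a round trip may not
# reproduce the original bytes for the duplicate forms.
round_trip = next(iter(decoded)).encode("cp932")
```

The cost of this approach is that decoding followed by encoding is not
guaranteed to be byte-for-byte lossless, which is the trade-off accepted in
preference to emitting private-use code points.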

* Switch `cp950_no_eudc_encoding_map` away from a hardcoded exclusion list.

* Codec support for `x-mac-korean`.

* Add a test bit for the UTF-8 wrapper.

* Document the unique error-condition definition of the ISO-2022-JP codec.

* Update docs now that there is an actual implementation for `x-mac-korean`.

* Further explanations of the hazards of `jis_encoding`.

* Sanitised → Sanitised or escaped.

* Further clarify the status of not verifying Shift In.

* Corrected description of End State 2.

* Changes to MacKorean to avoid mapping non-ASCII using ASCII punctuation.

* Remove extraneous word "still".

* Fix omission of MacKorean single-byte codes.
2022-07-23 08:32:54 +09:00
| Name | Last commit | Date |
|---|---|---|
| codecs | Codecs revisited (#28) | 2022-07-23 08:32:54 +09:00 |
| foo | Fix up path search ordering to be more like CPython; implement __main__.krk | 2022-05-28 17:31:43 +09:00 |
| syntax | Add more keywords to the stdlib highlighter | 2022-06-03 13:45:12 +09:00 |
| callgrind.krk | Add docstrings to 'callgrind' module | 2021-03-25 09:54:23 +09:00 |
| collections.krk | Codecs revisited (#28) | 2022-07-23 08:32:54 +09:00 |
| dis.krk | -m dis should recurse | 2022-07-15 08:06:34 +09:00 |
| dummy.krk | Fix tracking what should be 'global' through function calls? | 2021-01-07 10:39:09 +09:00 |
| help.krk | Update copyright years, it's been 2022 for a while now | 2022-05-26 20:46:06 +09:00 |
| json.krk | Fix bug in json float parsing found by WASM IDE's static analyzer | 2021-03-10 14:35:14 +09:00 |
| maindemo.krk | Add basic support for -m argument to interpreter | 2021-01-13 09:08:11 +09:00 |
| string.krk | Fix up string escapes, make sure we're handling nil bytes when printing | 2021-01-11 11:41:26 +09:00 |