Commit Graph

2 Commits

Author SHA1 Message Date
HarJIT
a580a835b8
Codecs revisited (#28)
* xraydict functionality and usage improvements

Add a filter_function to xraydict, allowing fewer big data structures. Make
uses of xraydict prefer exclusion sets to exclusion lists, to avoid
repeated linear search of a list.
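
The idea in the paragraph above can be sketched as follows. This is a minimal illustration only: `FilteredView`, its `keep` parameter, and the sample code points are hypothetical names and data, not the actual `xraydict` API, which this commit message does not show.

```python
from collections.abc import Mapping


class FilteredView(Mapping):
    """Read-only view of `base` hiding entries rejected by `keep` (hypothetical)."""

    def __init__(self, base, keep):
        self._base = base
        self._keep = keep  # callable (key, value) -> bool

    def __getitem__(self, key):
        value = self._base[key]  # a missing key raises KeyError as usual
        if not self._keep(key, value):
            raise KeyError(key)
        return value

    def __iter__(self):
        return (k for k, v in self._base.items() if self._keep(k, v))

    def __len__(self):
        return sum(1 for _ in self)


# Illustrative data only; these are not real mapping entries.
base = {0x8140: "A", 0x8141: "B", 0xA140: "C"}

# An exclusion set gives average constant-time membership tests,
# where a list would be searched linearly on every lookup.
excluded = frozenset({0x8141})
view_by_set = FilteredView(base, lambda k, v: k not in excluded)

# A rule expressed as a function needs no exclusion structure at all.
view_by_rule = FilteredView(base, lambda k, v: k >= 0xA140)
```

A predicate such as the second one is what lets a large precomputed exclusion structure be dropped entirely when the excluded keys follow a simple rule.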

* Make `big5_coded_forms_from_hkscs` a set; remove trailing commas from sets.

* Remove `big5_coded_forms_from_hkscs` in favour of a filter function.

* Similarly, use sets for 7-bit exclusion lists except when really short.

* Revise mappings for seven 78JIS codepoints.

Mappings for 25-23 and 90-22 were previously the same as those used for
97JIS; they have been swapped to correspond with how the IBM extension
versus the standard code are mapped in the "old sequence" (78JIS-based)
as opposed to the "new sequence".

Mappings for 32-70, 34-45, 35-29, 39-77 and 54-02 in 78JIS have been
changed to reflect disunifications made in 2000-JIS and 2004-JIS, assigning
the 1978-edition unsimplified variants of those characters separate coded
forms. Previously, only swaps and disunifications in 83JIS, and
disunifications in 90JIS (including JIS X 0212), had been considered.

This only affects the `jis_encoding` codec (including the decoding
direction for `iso-2022-jp-2`, `iso-2022-jp-3` and `iso-2022-jp-2004`),
and the decoding is only affected when `ESC $ @` (not `ESC $ B`) is used.
The `iso-2022-jp` codec is unaffected, and remains similar to (but more
consistently pedantic than) the WHATWG specification, thus using the same
table for both 78JIS and 97JIS.
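
To illustrate why only `ESC $ @` input is affected: in ISO-2022-JP-style streams, `ESC $ @` designates JIS C 6226-1978 while `ESC $ B` designates JIS X 0208-1983 and later editions, so a decoder can choose its table per designation. The sketch below only reports which designations occur in a byte string; the function name and labels are hypothetical and this is not the codec's actual implementation.

```python
ESC = b"\x1b"

# Escape sequences mapped to illustrative labels (the labels are hypothetical).
DESIGNATIONS = {
    b"\x1b$@": "78JIS",           # JIS C 6226-1978
    b"\x1b$B": "83JIS-or-later",  # JIS X 0208-1983 and later editions
    b"\x1b(B": "ASCII",
}


def designated_sets(data):
    """Return the labels of recognised designations, in order of appearance."""
    found = []
    i = 0
    while i < len(data):
        if data[i:i + 1] == ESC:
            seq = data[i:i + 3]
            if seq in DESIGNATIONS:
                found.append(DESIGNATIONS[seq])
                i += 3
                continue
        i += 1  # not a recognised escape sequence; move on
    return found
```

For example, `designated_sets(b"\x1b$@0!\x1b(B")` reports a 78JIS designation followed by a return to ASCII, which is the only case in which a separate 78JIS table would be consulted.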

* Make `johab-ebcdic` decoder use many-to-one, not corporate PUA.

Many-to-one decodes are not uncommon in CJK encodings (e.g. Windows-31J),
and mapping to the IBM Corporate PUA (code page 1449) would, in practice,
probably make it render as completely the wrong character, if it rendered
at all.
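
A many-to-one decode can be sketched as below: several coded forms decode to the same character, and the encoder picks one of them as canonical, so such input does not round-trip byte for byte. The code values are hypothetical placeholders, not real `johab-ebcdic` assignments, and the "lowest code wins" rule is an assumption made for the example.

```python
# Hypothetical decode table: two distinct codes yield the same character.
decode_table = {0x41: "A", 0xC1: "A", 0x42: "B"}

# Build the encode table, keeping the lowest code for each character
# (an assumed tie-break rule, chosen only for this illustration).
encode_table = {}
for code, char in sorted(decode_table.items()):
    encode_table.setdefault(char, code)


def roundtrip(code):
    """Decode a code, then re-encode the resulting character."""
    return encode_table[decode_table[code]]
```

Here `roundtrip(0xC1)` yields `0x41`: the duplicate code is accepted on input but never produced on output, and the decoded text contains an ordinary character rather than a private-use one.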

* Switch `cp950_no_eudc_encoding_map` away from a hardcoded exclusion list.

* Codec support for `x-mac-korean`.

* Add a test bit for the UTF-8 wrapper.

* Document the unique error-condition definition of the ISO-2022-JP codec.

* Update docs now that there is an actual implementation for `x-mac-korean`.

* Further explanations of the hazards of `jis_encoding`.

* Sanitised → Sanitised or escaped.

* Further clarify the status of not verifying Shift In.

* Corrected description of End State 2.

* Changes to MacKorean to avoid mapping non-ASCII using ASCII punctuation.

* Remove extraneous word "still".

* Fix omitting MacKorean single-byte codes.
2022-07-23 08:32:54 +09:00
HarJIT
5c2de206b9
Codecs package (#4)

Co-authored-by: HarJIT <harjit@harjit.moe>
2021-03-24 04:53:02 -07:00