Commit Graph

9 Commits

Author SHA1 Message Date
HarJIT
a580a835b8
Codecs revisited (#28)
* xraydict functionality and usage improvements

Add a filter_function to xraydict, allowing fewer big data structures. Make
uses of xraydict prefer exclusion sets to exclusion lists, to avoid
repeated linear search of a list.

* Make `big5_coded_forms_from_hkscs` a set, remove set trailing commas.

* Remove `big5_coded_forms_from_hkscs` in favour of a filter function.

* Similarly, use sets for 7-bit exclusion lists except when really short.

* Revise mappings for seven 78JIS codepoints.

Mappings for 25-23 and 90-22 were previously the same as those used for
97JIS; they have been swapped to correspond with how the IBM extension
versus the standard code are mapped in the "old sequence" (78JIS-based)
as opposed to the "new sequence".

Mappings for 32-70, 34-45, 35-29, 39-77 and 54-02 in 78JIS have been
changed to reflect disunifications made in 2000-JIS and 2004-JIS, assigning
the 1978-edition unsimplified variants of those characters separate coded
forms (where previously, only swaps and disunifications in 83JIS and
disunifications in 90JIS (including JIS X 0212) had been considered).

This only affects the `jis_encoding` codec (including the decoding
direction for `iso-2022-jp-2`, `iso-2022-jp-3` and `iso-2022-jp-2004`),
and the decoding is only affected when `ESC $ @` (not `ESC $ B`) is used.
The `iso-2022-jp` codec is unaffected, and remains similar to (but more
consistently pedantic than) the WHATWG specification, thus using the same
table for both 78JIS and 97JIS.

* Make `johab-ebcdic` decoder use many-to-one, not corporate PUA.

Many-to-one decodes are not uncommon in CJK encodings (e.g. Windows-31J),
and mapping to the IBM Corporate PUA (code page 1449) would in practice
probably render as completely the wrong character, if it rendered at all.

* Switch `cp950_no_eudc_encoding_map` away from a hardcoded exclusion list.

* Codec support for `x-mac-korean`.

* Add a test bit for the UTF-8 wrapper.

* Document the unique error-condition definition of the ISO-2022-JP codec.

* Update docs now there is an actual implementation for `x-mac-korean`.

* Further explanations of the hazards of `jis_encoding`.

* Sanitised → Sanitised or escaped.

* Further clarify the status of not verifying Shift In.

* Corrected description of End State 2.

* Changes to MacKorean to avoid mapping non-ASCII using ASCII punctuation.

* Extraneous word "still".

* Fix omitting MacKorean single-byte codes.
2022-07-23 08:32:54 +09:00
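The first few items of the commit above describe replacing hardcoded exclusion lists with sets and with a filter function. A minimal sketch of that general idea follows. It does not reproduce the actual `xraydict` API: the class `ExcludingView`, its parameters `excluded` and `is_excluded`, and the toy Big5-like data are all hypothetical names chosen for illustration.

```python
from collections.abc import Mapping


class ExcludingView(Mapping):
    """Read-only view over `base` that hides some keys (hypothetical helper).

    Keys are hidden if they are in the `excluded` set (O(1) membership test,
    unlike a list, which needs a linear scan) or if the optional predicate
    `is_excluded` returns True for them.
    """

    def __init__(self, base, excluded=frozenset(), is_excluded=None):
        self._base = base
        self._excluded = frozenset(excluded)
        self._is_excluded = is_excluded

    def _hidden(self, key):
        if key in self._excluded:
            return True
        return self._is_excluded is not None and self._is_excluded(key)

    def __getitem__(self, key):
        if self._hidden(key):
            raise KeyError(key)
        return self._base[key]

    def __iter__(self):
        return (key for key in self._base if not self._hidden(key))

    def __len__(self):
        return sum(1 for _ in self)


# Toy data: keys are (lead byte, trail byte) pairs.
big5_like = {
    (0x87, 0x40): "extension-only entry",
    (0xA4, 0x40): "一",
    (0xA4, 0x41): "乙",
}

# A predicate over the lead byte replaces a large enumerated exclusion set.
view = ExcludingView(
    big5_like,
    excluded={(0xA4, 0x41)},
    is_excluded=lambda key: key[0] < 0xA1,
)
```

The predicate avoids materialising one exclusion entry per hidden code, which is the "fewer big data structures" benefit the commit message refers to.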
K. Lange
19927ea13f
Stop using tupleOf in codec tools
2022-07-05 11:42:57 +09:00
HarJIT
14db828233
Fix an oversight in the UTF-32 endian sniffing. (#18)
(I'd commented about the heuristic of characters at the start of a plane
being rare, but failed to actually implement it: only the check that the
high eight bits (which can be expanded to eleven) are all zero had been
implemented.)
2021-10-13 17:04:11 +09:00
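One possible reading of the two heuristics named in the commit above can be sketched as follows. This is not the repository's code. The function name `sniff_utf32_byte_order` and the voting scheme are assumptions made for illustration. A byte order is favoured when only it yields a value with the high eleven bits clear. When both orders pass that check, the order that would yield a code point at the very start of a plane (low sixteen bits zero) is treated as the less likely one.

```python
def _plausible(value):
    # The high eleven bits of a 32-bit unit must be zero for it to be
    # anywhere near the Unicode code point range.
    return (value >> 21) == 0


def sniff_utf32_byte_order(data):
    """Return "big", "little", or None when the evidence is balanced."""
    votes_big = votes_little = 0
    usable = len(data) - len(data) % 4  # ignore a trailing partial unit
    for offset in range(0, usable, 4):
        unit = data[offset:offset + 4]
        big = int.from_bytes(unit, "big")
        little = int.from_bytes(unit, "little")
        big_ok, little_ok = _plausible(big), _plausible(little)
        if big_ok and not little_ok:
            votes_big += 1
        elif little_ok and not big_ok:
            votes_little += 1
        elif big_ok and little_ok and big != little:
            # Both readings pass the high-bits check, e.g. 00 01 00 00 is
            # U+10000 big-endian but U+0100 little-endian. Code points at
            # the start of a plane are rare, so vote for the other reading.
            big_at_plane_start = (big & 0xFFFF) == 0
            little_at_plane_start = (little & 0xFFFF) == 0
            if big_at_plane_start and not little_at_plane_start:
                votes_little += 1
            elif little_at_plane_start and not big_at_plane_start:
                votes_big += 1
    if votes_big > votes_little:
        return "big"
    if votes_little > votes_big:
        return "little"
    return None
```

Without the second branch, a unit such as `00 01 00 00` contributes no evidence at all, which matches the oversight the commit message describes.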
HarJIT
fa6dbc8365
Expansion and fixes to codecs.sbextra docs. (#16)
Mostly expanding docs with more information, but also correcting a mistake
where the cp424 docstrings refer to cp273.
2021-10-07 07:18:11 +09:00
HarJIT
f5f314a42d
One fix and one improvement to GB18030: (#15)
* The codec had been failing to decode 0x81308130 to U+0080, even though
it successfully encoded it. Since U+0080 is not used for anything in most
contexts (it's allocated as a control code in the ECMA-35 sense, but
ECMA-48 does not use it) this is unlikely to have hurt anything, but I
have fixed it anyway (it arose from 0 and None being conflated in a
conditional).

* The encoding and decoding of GB18030 four-byte codes now use binary
search rather than linear search. This significantly improves performance
on four-byte codes, though performance on two-byte codes is unaffected.
2021-08-12 19:17:59 +09:00
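Both points in the commit above can be illustrated with a short sketch. It is not the codec's actual implementation. The pointer arithmetic follows the usual description of GB18030 four-byte codes, and `RANGES` is a small illustrative excerpt of a (pointer, code point) range table, so results are only meaningful for pointers below the start of the next range (not shown). The function names are hypothetical. Note the explicit `is None` check: pointer 0, which 0x81308130 produces, is falsy but valid.

```python
from bisect import bisect_right

# Illustrative excerpt: each entry is (first pointer, first code point)
# of a run of consecutive four-byte codes. A real table is much longer.
RANGES = [
    (0, 0x0080),
    (36, 0x00A5),
    (38, 0x00A9),
    (45, 0x00B2),
    (50, 0x00B8),
    (81, 0x00D8),
]
POINTERS = [pointer for pointer, _ in RANGES]


def four_byte_pointer(b1, b2, b3, b4):
    """Return the linear pointer for a four-byte code, or None if malformed."""
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        return None
    return ((b1 - 0x81) * 12600 + (b2 - 0x30) * 1260
            + (b3 - 0x81) * 10 + (b4 - 0x30))


def pointer_to_codepoint(pointer):
    """Binary search for the range containing `pointer`."""
    index = bisect_right(POINTERS, pointer) - 1
    if index < 0:
        return None
    base_pointer, base_codepoint = RANGES[index]
    return base_codepoint + (pointer - base_pointer)


def decode_four_byte(b1, b2, b3, b4):
    pointer = four_byte_pointer(b1, b2, b3, b4)
    # `if not pointer` would wrongly reject pointer 0 (0x81308130).
    if pointer is None:
        return None
    codepoint = pointer_to_codepoint(pointer)
    return None if codepoint is None else chr(codepoint)
```

`bisect_right` finds the containing range in logarithmic time, where a scan over the table from the start would be linear in the number of ranges.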
HarJIT
0ef38bb6ee
Corrected documentation for iso-2022-jp-ext (implementation unchanged) (#8)
2021-04-09 17:59:39 +09:00
HarJIT
614193b8a1
Codecs package docs, as well as some assorted tweaks or minor additions (#5)
* Add some docs, and remove second Code page 874 codec (they handled the
non-overridden C1 area differently, but we only need one).

* More docs work.

* Doc stuff.

* Adjusted.

* More tweaks (table padding is not the docstring's problem).

* CSS and docstring tweaks.

* Link from modules to parent packages and vice versa.

* More documentation.

* Docstrings for all `codecs` submodules.

* Move encode_jis7_reduced into dbextra_data_7bit (thus completing the lazy
startup, which was apparently not already complete), and add docstrings to
implementations of base class methods which refer up to the base class.

* Remove FUSE junk that somehow made it into the repo.

* Some more docstrings.

* Fix some broken references to `string` (rather than `data`) which would have
caused a problem if any existing error handler had returned a negative
offset (which no current handler does, but it's worth fixing anyway).

* Add a cp042 codec to accompany the x-user-defined codec, and to pave the
way for maybe adding Adobe Symbol, Zapf Dingbats or Wingdings codecs
in future.

* Better Japanese Autodetect behaviour for ISO-2022-JP (add yet another
condition in which it will be detected, making it able to conclusively
detect it prior to end of stream without being fed an entire escape
sequence in one call). Also some docs tweaks.

* idstr() → _idstr() since it's internal.

* Docs for codecs.pifonts.

* Docstrings for dbextra.

* Document the sbextra classes.

* Docstrings for the web encodings.

* Possibly a fairer assessment of likely reality.

* Docstrings for codecs.binascii

* The *encoding* isn't removed (the BOM is).

* Make it clearer when competing OEM code pages use different letter layouts.

* Fix copied in error.

* Stop generating links to non-existent "← tools" from tools.gendoc.

* Move .fuse_hidden* exclusion to my user-level config.

* Constrain the table style changes to class .markdownTable, to avoid any
effect on other interface tables generated by Doxygen.

* Refer to `__ispackage__` when generating help.
2021-04-02 16:34:10 +09:00
K. Lange
40836cba21
Implement Python 3 division semantics
2021-04-02 16:02:05 +09:00
HarJIT
5c2de206b9
Codecs package (#4)
Co-authored-by: HarJIT <harjit@harjit.moe>
2021-03-24 04:53:02 -07:00