Commit Graph

49 Commits

Author SHA1 Message Date
HarJIT
a580a835b8
Codecs revisited (#28)
* xraydict functionality and usage improvements

Add a filter_function to xraydict, allowing fewer big data structures. Make
uses of xraydict prefer exclusion sets to exclusion lists, to avoid
repeated linear search of a list.

* Make `big5_coded_forms_from_hkscs` a set, remove set trailing commas.

* Remove `big5_coded_forms_from_hkscs` in favour of a filter function.

* Similarly, use sets for 7-bit exclusion lists except when really short.

* Revise mappings for seven 78JIS codepoints.

Mappings for 25-23 and 90-22 were previously the same as those used for
97JIS; they have been swapped to correspond with how the IBM extension
versus the standard code are mapped in the "old sequence" (78JIS-based)
as opposed to the "new sequence".

Mappings for 32-70, 34-45, 35-29, 39-77 and 54-02 in 78JIS have been
changed to reflect disunifications made in 2000-JIS and 2004-JIS, assigning
the 1978-edition unsimplified variants of those characters separate coded
forms (where previously, only swaps and disunifications in 83JIS and
disunifications in 90JIS (including JIS X 0212) had been considered).

This only affects the `jis_encoding` codec (including the decoding
direction for `iso-2022-jp-2`, `iso-2022-jp-3` and `iso-2022-jp-2004`),
and the decoding is only affected when `ESC $ @` (not `ESC $ B`) is used.
The `iso-2022-jp` codec is unaffected, and remains similar to (but more
consistently pedantic than) the WHATWG specification, thus using the same
table for both 78JIS and 97JIS.

* Make `johab-ebcdic` decoder use many-to-one, not corporate PUA.

Many-to-one decodes are not uncommon in CJK encodings (e.g. Windows-31J),
and mapping to the IBM Corporate PUA (code page 1449) would in practice
probably make it render as completely the wrong character, if at all.

* Switch `cp950_no_eudc_encoding_map` away from a hardcoded exclusion list.

* Codec support for `x-mac-korean`.

* Add a test bit for the UTF-8 wrapper.

* Document the unique error-condition definition of the ISO-2022-JP codec.

* Update docs now there is an actual implementation for `x-mac-korean`.

* Further explanations of the hazards of `jis_encoding`.

* Sanitised → Sanitised or escaped.

* Further clarify the status of not verifying Shift In.

* Corrected description of End State 2.

* Changes to MacKorean to avoid mapping non-ASCII using ASCII punctuation.

* Extraneous word "still".

* Fix omitting MacKorean single-byte codes.
2022-07-23 08:32:54 +09:00
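
The first item of the commit above (a filter function, and exclusion sets instead of exclusion lists) can be illustrated with a minimal Python sketch. `FilteredView` is a hypothetical stand-in, not the actual `xraydict` class, and the assumption that the filter is a predicate over keys is mine rather than something the commit message states.

```python
# Hypothetical stand-in for a layered mapping; this is NOT the real xraydict.
class FilteredView:
    """Read-only view of `base` that hides excluded keys."""

    def __init__(self, base, exclusions=frozenset(), filter_function=None):
        self._base = base
        # A set gives constant-time average membership tests, whereas a list
        # would be searched linearly on every lookup.
        self._exclusions = exclusions
        # Assumed shape: a predicate taking a key and returning True to keep it.
        self._filter = filter_function

    def __contains__(self, key):
        if key in self._exclusions:
            return False
        if self._filter is not None and not self._filter(key):
            return False
        return key in self._base

    def __getitem__(self, key):
        if key not in self:
            raise KeyError(key)
        return self._base[key]


# Toy data, invented for illustration only.
base = {0x41: "A", 0x5C: "\\", 0x7E: "~", 0xA1: "!"}
view = FilteredView(
    base,
    exclusions={0x5C, 0x7E},               # a few individually excluded keys
    filter_function=lambda key: key < 0x80,  # a rule instead of a big enumerated set
)
```

A predicate of this kind can stand in for a large enumerated exclusion structure when the excluded keys follow a rule, which is presumably the sense of "allowing fewer big data structures".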
K. Lange
d73d7bdef9 -m dis should recurse 2022-07-15 08:06:34 +09:00
K. Lange
a20c89fe2f Support -m dis with a dis.krk pseudomodule 2022-07-13 09:22:19 +09:00
K. Lange
19927ea13f Stop using tupleOf in codec tools 2022-07-05 11:42:57 +09:00
K. Lange
b4349303e9 Add more keywords to the stdlib highlighter 2022-06-03 13:45:12 +09:00
K. Lange
285f7cd496 Fix up path search ordering to be more like CPython; implement __main__.krk 2022-05-28 17:31:43 +09:00
K. Lange
a4e2b5c881 Relative imports 2022-05-28 15:41:47 +09:00
K. Lange
1c5fc954ed Update copyright years, it's been 2022 for a while now 2022-05-26 20:46:06 +09:00
HarJIT
14db828233
Fix an oversight in the UTF-32 endian sniffing. (#18)
(I'd commented about the heuristic of characters at the start of the plane
being rare, but failed to actually implement said heuristic, having only
implemented the check that the high eight bits (which can be expanded to
eleven) must be zero.)
2021-10-13 17:04:11 +09:00
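
The two checks mentioned in the commit above can be sketched roughly as follows. This is a Python illustration under my own assumptions, not the codec's actual code: the function names, the scoring scheme and the treatment of surrogates are all hypothetical. The idea is that a 32-bit unit above U+10FFFF (top eleven bits not all zero) rules a byte order out, and that when both byte orders give a valid code point, the one landing on the first code point of a plane is the less likely reading.

```python
import struct

MAX_CODE_POINT = 0x10FFFF  # any larger value has some of the top eleven bits set


def plausibility(value):
    """Return 0 if impossible, 1 if valid but unlikely, 2 if valid and ordinary."""
    if value > MAX_CODE_POINT:
        return 0
    if 0xD800 <= value <= 0xDFFF:
        return 0  # surrogates are not valid in UTF-32
    if (value & 0xFFFF) == 0:
        return 1  # first code point of a plane: assumed to be rare
    return 2


def sniff_utf32_endianness(data):
    """Guess 'be' or 'le' for BOM-less UTF-32 bytes, or None if undecided."""
    be_score = le_score = 0
    usable = len(data) - len(data) % 4  # ignore a trailing partial unit
    for offset in range(0, usable, 4):
        unit = data[offset:offset + 4]
        be = plausibility(struct.unpack(">I", unit)[0])
        le = plausibility(struct.unpack("<I", unit)[0])
        if be == 0 and le == 0:
            return None  # not UTF-32 in either byte order
        if be == 0:
            return "le"
        if le == 0:
            return "be"
        be_score += be
        le_score += le
    if be_score == le_score:
        return None
    return "be" if be_score > le_score else "le"
```

The range check alone cannot separate, say, big-endian U+0100 from little-endian U+10000, since both are valid code points; the start-of-plane check is what breaks that tie in this sketch.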
HarJIT
fa6dbc8365
Expansion and fixes to codecs.sbextra docs. (#16)
Mostly expanding docs with more information, but also correcting a mistake
where the cp424 docstrings refer to cp273.
2021-10-07 07:18:11 +09:00
HarJIT
f5f314a42d
One fix and one improvement to GB18030: (#15)
— The codec had been failing to decode 0x81308130 to U+0080, even though
it successfully encoded it. Since U+0080 is not used for anything in most
contexts (it's allocated as a control code in the ECMA-35 sense, but
ECMA-48 does not use it) this is unlikely to have hurt anything, but I
have fixed it anyway (it arose from 0 and None being conflated in a
conditional).

— The encoding and decoding of GB18030 four-byte codes now uses binary
search rather than linear search. This significantly improves performance
on four-byte codes, though performance on two-byte codes is unaffected.
2021-08-12 19:17:59 +09:00
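
Both points of the commit above can be illustrated with a short Python sketch. It is a simplified illustration, not the codec's actual code: the function names are hypothetical, and the range table is a toy in which only the first row (pointer 0, U+0080) follows from the commit message, while the other rows are invented. The linearisation formula is the standard one for GB18030 four-byte codes.

```python
from bisect import bisect_right

# Toy table of (first pointer, first code point), sorted by pointer.
# Only the first row comes from the commit message (0x81308130 <-> U+0080);
# the other rows are invented for illustration.
RANGES = [(0, 0x0080), (100, 0x0200), (500, 0x1000)]
POINTER_STARTS = [pointer for pointer, _ in RANGES]


def four_byte_pointer(b1, b2, b3, b4):
    """Linearise a four-byte code; return None if any byte is out of range."""
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        return None
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def decode_pointer(pointer):
    """Map a pointer to a code point using binary search over the range table."""
    # Writing `if not pointer:` here would also reject pointer 0, which is the
    # kind of 0/None conflation the commit describes; test for None explicitly.
    if pointer is None:
        return None
    index = bisect_right(POINTER_STARTS, pointer) - 1  # last range starting <= pointer
    if index < 0:
        return None
    start_pointer, start_code_point = RANGES[index]
    return start_code_point + (pointer - start_pointer)
```

In this toy version every range runs up to the start of the next row, so gaps and upper bounds are not modelled. `bisect_right` finds the enclosing range in logarithmic rather than linear time, which only matters for four-byte codes because two-byte codes do not go through a range table of this kind.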
K Lange
791cdb321c Update help module with correct LICENSE text... 2021-04-12 21:00:15 +09:00
HarJIT
0ef38bb6ee
Corrected documentation for iso-2022-jp-ext (implementation unchanged) (#8) 2021-04-09 17:59:39 +09:00
HarJIT
614193b8a1
Codecs package docs, as well as some assorted tweaks or minor additions (#5)
* Add some docs, and remove second Code page 874 codec (they handled the
non-overridden C1 area differently, but we only need one).

* More docs work.

* Doc stuff.

* Adjusted.

* More tweaks (table padding is not the docstring's problem).

* CSS and docstring tweaks.

* Link from modules to parent packages and vice versa.

* More documentation.

* Docstrings for all `codecs` submodules.

* Move encode_jis7_reduced into dbextra_data_7bit (thus completing the lazy
startup which was apparently not complete already) and docstrings added to
implementations of base class methods referring up to the base class.

* Remove FUSE junk that somehow made it into the repo.

* Some more docstrings.

* Fix some broken references to `string` (rather than `data`) which would have
caused a problem if any existing error handler had returned a negative
offset (which no current handler does, but it's worth fixing anyway).

* Add a cp042 codec to accompany the x-user-defined codec, and to pave the
way for maybe adding Adobe Symbol, Zapf Dingbats or Wingdings codecs
in future.

* Better Japanese Autodetect behaviour for ISO-2022-JP (add yet another
condition in which it will be detected, making it able to conclusively
detect it prior to end of stream without being fed an entire escape
sequence in one call). Also some docs tweaks.

* idstr() → _idstr() since it's internal.

* Docs for codecs.pifonts.

* Docstrings for dbextra.

* Document the sbextra classes.

* Docstrings for the web encodings.

* Possibly a fairer assessment of likely reality.

* Docstrings for codecs.binascii

* The *encoding* isn't removed (the BOM is).

* Make it clearer when competing OEM code pages use different letter layouts.

* Fix copied in error.

* Stop generating links to non-existent "← tools" from tools.gendoc.

* Move .fuse_hidden* exclusion to my user-level config.

* Constrain the table style changes to class .markdownTable, to avoid any
effect on other interface tables generated by Doxygen.

* Refer to `__ispackage__` when generating help.
2021-04-02 16:34:10 +09:00
K. Lange
40836cba21 Implement Python 3 division semantics 2021-04-02 16:02:05 +09:00
K. Lange
04a95fa779 Add docstrings to 'callgrind' module 2021-03-25 09:54:23 +09:00
K Lange
a58ca88bd3 Cleanup callgrind output 2021-03-24 21:52:10 +09:00
HarJIT
5c2de206b9
Codecs package (#4)
Codecs package

Co-authored-by: HarJIT <harjit@harjit.moe>
2021-03-24 04:53:02 -07:00
K Lange
2ed8e65c89 Rework KrkValue to use NaN-boxing 2021-03-24 20:49:44 +09:00
K. Lange
d05dc4fd08 Include codeobject pointer reference in callgrind output so lambdas, etc. can be differentiated 2021-03-24 12:37:25 +09:00
K. Lange
49066f7ae2 Trace function calls 2021-03-23 19:17:54 +09:00
K. Lange
540a9aea0d Fix bug in json float parsing found by WASM IDE's static analyzer 2021-03-10 14:35:14 +09:00
K. Lange
c9aa17e119 Rename __get__, __set__ to match Python's __getitem__, __setitem__ and make room for future addition of descriptors 2021-03-10 14:24:22 +09:00
K. Lange
a5ff538dc1 Write a bunch more docs 2021-02-20 20:44:07 +09:00
K. Lange
d5d3d721e7 The big documentation system overhaul 2021-02-20 14:10:36 +09:00
K. Lange
a7e110cad9 Change how module imports work to support a package importing contained modules 2021-02-14 08:13:53 +09:00
K. Lange
90b219cdce Eliminate builtins.krk 2021-02-05 17:22:08 +09:00
K. Lange
3e90615021 Also more built-in functions 2021-01-23 08:41:53 +09:00
K. Lange
9c15ac6638 Add missing keywords to highlighter 2021-01-23 08:39:02 +09:00
K. Lange
460d1c39b9 Fix bad multiline handling in syntax.highlighter 2021-01-22 11:10:03 +09:00
K. Lange
85e7c667b4 C-ify some more collection methods 2021-01-19 22:27:05 +09:00
K. Lange
fdc1a500fe Add pass statement just for compatibility. 2021-01-19 18:22:13 +09:00
K. Lange
87c99d5c8f Add support for \U escape 2021-01-19 14:05:21 +09:00
K. Lange
ef7fb215b2 Add a module that does simple Kuroko syntax highlighting with flexible outputs 2021-01-18 20:45:26 +09:00
K. Lange
abfaa50bee Implement module packages 2021-01-17 22:01:58 +09:00
K. Lange
2b02ef457e Add basic support for -m argument to interpreter 2021-01-13 09:08:11 +09:00
K. Lange
f43eff0f2e JSON module should be able to support unicode strings easily 2021-01-12 22:19:24 +09:00
K. Lange
5517162a93 Add json module 2021-01-11 19:02:51 +09:00
K. Lange
991ed99e78 Add a basic collections module 2021-01-11 14:08:05 +09:00
K. Lange
213c496372 Fix up string escapes, make sure we're handling nil bytes when printing 2021-01-11 11:41:26 +09:00
K Lange
2d012e4126 Add more help information and some startup text to the repl 2021-01-07 20:00:57 +09:00
K. Lange
fbf4dda818 Fix tracking what should be 'global' through function calls? 2021-01-07 10:39:09 +09:00
K. Lange
902d2222b5 Make modules work like in Python. TODO: module class for better repring 2021-01-07 09:50:58 +09:00
K. Lange
0966a21c7a Move sleep, uname to Pythonic module names 2021-01-03 09:49:19 +09:00
K. Lange
7f47224bd9 remove superfluous range module 2020-12-29 22:06:29 +09:00
K. Lange
eb17af8076 Embed __builtins__ source directly 2020-12-29 18:50:18 +09:00
K. Lange
3ba8025eeb lots of fixups so we can create dicts from the vm 2020-12-28 20:38:26 +09:00
K. Lange
b3ad2e1f22 Second pass at cleaning up built-ins 2020-12-28 19:26:01 +09:00
K. Lange
cdcbf6cf54 First pass at module/builtin cleanup 2020-12-28 19:01:28 +09:00