More mentions of unicode stuff

2021-01-12 19:44:18 +09:00 · 2021-01-12 19:44:18 +09:00 · de71ada519
commit de71ada519
parent be8b8adbc6
1 changed files with 3 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -22,6 +22,7 @@ On top of this, Kuroko adds a number of features inspired by Python, such as:
 - Pseudo-classes for basic values (eg. strings are pseudo-instances of a `str` class providing methods like `.format()`)
 - Exception handling, with `try`/`except`/`raise`.
 - Modules, both for native C code and managed Kuroko code.
+- Unicode strings and identifiers.

 ## Building Kuroko

@@ -183,6 +184,8 @@ print("t".__ord__())
 # → 116
 ```

+Invalid UTF-8 sequences will most likely result in a `ValueError` during decoding or parsing.
+
 _**Implementation Note:** Generally, the internal representation of strings is their UTF-8 encoded form. When an indexing or slicing operation requires converting a codepoint index to an offset in the string, the most appropriate 'canonical' format is generated and remains with the interned string until it is garbage collected. For strings containing only ASCII characters, no conversion is done and no additional copy is created. For all other strings, the smallest width that can represent the largest codepoint is used, among the options of 1, 2, or 4 bytes per codepoint. This approach is similar to CPython's flexible string representation, introduced in Python 3.3 (PEP 393)._

 Strings can be encoded to _bytes_ objects to get their UTF-8 representation:
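
Reviewer note on the implementation note in the second hunk: the width-selection rule it describes can be sketched roughly as follows. This is written in Python rather than Kuroko, and it is only an illustration, not Kuroko's actual C implementation. The names `canonical_width` and `byte_offset` are hypothetical, and the thresholds (`0xFF` for 1 byte, `0xFFFF` for 2 bytes) are an assumption based on how many codepoints fit in each width.

```python
def canonical_width(text: str) -> int:
    """Smallest of 1, 2, or 4 bytes that can hold the largest codepoint.

    Assumed thresholds: up to U+00FF fits in 1 byte, up to U+FFFF in 2.
    """
    largest = max((ord(ch) for ch in text), default=0)
    if largest <= 0xFF:
        return 1
    if largest <= 0xFFFF:
        return 2
    return 4


def byte_offset(text: str, index: int) -> int:
    """Offset of codepoint `index` in the fixed-width form.

    With a fixed width per codepoint this is a single multiplication,
    whereas the UTF-8 form would need a scan from the start of the string.
    """
    return index * canonical_width(text)


# Compare the chosen fixed width with the length of the UTF-8 encoding.
for sample in ("hello", "café", "日本語", "🐱"):
    print(sample, canonical_width(sample), len(sample.encode("utf-8")))
```

The trade-off this is meant to show: the fixed-width copy costs extra memory for non-ASCII strings, but it turns codepoint indexing and slicing into constant-time arithmetic.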