If CHARSET_IS_UTF8 is not set, read_char() is broken in a large
number of ways:
1. The isascii(3) check can yield false positives. If a string in
an arbitrary encoding contains a byte in the range 0..127,
that does not at all imply that it forms a character all by
itself, and even less that it represents the same character
as in ASCII. Consequently, read_char() may return characters
the user never typed.
Even if the encoding is not state dependent, the assumption that
bytes in the range 0..127 represent ASCII characters is broken.
Consider UTF-16, for example.
2. The reverse problem can also occur. In an arbitrary encoding,
there is no guarantee that a character that can be represented
by ASCII is represented by a seven-bit byte, and even less by
the same byte as in ASCII.
Even for single-byte encodings, these assumptions are broken.
Consider the ISO 646 national variants, for example.
Consequently, the current code is insufficient to keep ASCII
characters working even for single-byte encodings.
3. The condition "++cbp != 1" can never trigger (because initially,
cbp is 0, and the code can only go back up via the final goto,
which has another cbp = 0 right before it) and it has no effect
(because cbp isn't used afterwards).
4. bytes = ct_mbtowc(cp, cbuf, cbp) is broken. If this returns -1,
the code assumes that is can just call mbtowc(3) again for later
input bytes. In some implementations, that may even be broken
for state-independent encodings, but trying again after mbtowc(3)
failure certainly produces completely erratic and meaningless
results in state-dependent encodings.
5. The assignment "*cp = (Char)(unsigned char)cbuf[0]" is
completely bogus. Even if the byte cbuf[0] represents a
character all by itself, which it usually will not, whether
or not the cast produces the desired result depends on the
internal representation of wchar_t in the C library, which
the application program can know nothing about. Even for ASCII
in the C/POSIX locale, an ASCII character other than '\0' ==
L'\0' == 0 need not have the same numeric value as a char and
as a wchar_t.
To summarize, this code only works if all of the following
conditions hold:
- The encoding is a single-byte encoding.
- ASCII is a subset of the encoding.
- The implementation of mbtowc(3) in the C library does not
require re-initialization after encoding errors.
- The implementation of wchar_t in the C library uses the
same numerical values as ASCII.
Otherwise, it silently produces wrong results.
The simplest way to fix this is to just use the same code as for
UTF-8 (right above). Of course, that causes functional changes
but that shouldn't matter since current behaviour is undefined.
The patch below provides the following improvements:
- It works for all stateless single-byte encodings, no matter
whether they are somehow related to ASCII, no matter how
mb[r]towc(3) are internally implemented, and no matter how
wchar_t is internally represented.
- Instead of producing unpredictable and definitely wrong
results for non-UTF-8 multibyte characters, it behaves in
a well-defined way: It aborts input processing, sets errno,
and returns failure.
Note that short of providing full support for arbitrary locales,
it is impossible to do better. We cannot know whether a given
unsupported locale is state-dependent, and for a state-dependent
locale, it makes no sense to retry parsing after an encoding
error, so the best we can do is abort processing for *any*
unsupported multi-byte character.
- Note that single-byte characters in arbitrary state-independent
locales still work, even in locales that may potentially also
contain multibyte characters, as long as those don't occur in
input. I'm not sure whether any such locales exist in practice...
Tested with UTF-8 and C/POSIX on OpenBSD. Also tested that in the
C/POSIX locale, non-ASCII bytes get through unmangled. You may
wish to test with ISO-LATIN on NetBSD if NetBSD supports that.
----
Also use a constant for meta to avoid warnings.
Not optimal though - for some reason the framebuffer's endianness in 32bit
colour is wrong and I have no idea (yet) how to change that, so many apps
using xrender will crash.
list of changes, they can be found at
http://pcc.ludd.ltu.se/fisheye/changelog/pcc
Along with numerous bug fixes, the highlights might be a rewrite
of the CPP parser, updated backends for arm, pdp11, m68k, vax and
mips along with new backend for 8086. PCC now builds itself as a
2-pass compiler. There have been fixes for use with musl, C11
support added and use of UTF8 internally. PE/COFF target was fixed,
and Minix target added.
This change intends to run the whole network stack in softint context
(or normal LWP), not hardware interrupt context. Note that the work is
still incomplete by this change; to that end, we also have to softint-ify
if_link_state_change (and bpf) which can still run in hardware interrupt.
This change softint-ifies at ifp->if_input that is called from
each device driver (and ieee80211_input) to ensure Layer 2 runs
in softint (e.g., ether_input and bridge_input). To this end,
we provide a framework (called percpuq) that utlizes softint(9)
and percpu ifqueues. With this patch, rxintr of most drivers just
queues received packets and schedules a softint, and the softint
dequeues packets and does rest packet processing.
To minimize changes to each driver, percpuq is allocated in struct
ifnet for now and that is initialized by default (in if_attach).
We probably have to move percpuq to softc of each driver, but it's
future work. At this point, only wm(4) has percpuq in its softc
as a reference implementation.
Additional information including performance numbers can be found
in the thread at tech-kern@ and tech-net@:
http://mail-index.netbsd.org/tech-kern/2016/01/14/msg019997.html
Acknowledgment: riastradh@ greatly helped this work.
Thank you very much!
1. Assume that errno is non-zero when entering read_char()
and that read(2) returns 0 (indicating end of file).
Then, the code will clear errno before returning.
(Obviously, the statement "errno = 0" is almost always
a bug unless there is save_errno = errno right before it
and the previous value is properly restored later,
in all reachable code paths.)
2. When encountering an invalid byte sequence, the code discards
all following bytes until MB_LEN_MAX overflows; consider, for
example, 0xc2 immediately followed by a few valid ASCII bytes.
Three of those ASCII bytes will be discarded.
3. On a POSIX system, EILSEQ will always be set after reading a
valid (yes, valid, not invalid!) UTF-8 character. The reason
is that mbtowc(3) will first be called with a length limit
(third argument) of 1, which will fail, return -1, and - on
a POSIX system - set errno to EILSEQ.
This third bug is mitigated a bit because i couldn't find any
system that actually conforms to POSIX in this respect: None
of OpenBSD, NetBSD, FreeBSD, Solaris 11, and glibc set errno
when an incomplete character is passed to mbtowc(3), even though
that is required by POSIX.
Anyway, that mbtowc(3) bug will be fixed at least in OpenBSD
after release unlock, so it would be good to fix this bug in
libedit before fixing the bug in mbtowc(3).
How can these three bugs be fixed?
1. As far as i understand it, the intention of the bogus errno = 0
is to undo the effects of failing system calls in el_wset(),
sig_set(), and read__fixio() if the subsequent read(2) indicates
end of file. So, restoring errno has to be moved right after
read__fixio(). Of course, neither 0 nor e is the right value
to restore: 0 is wrong if errno happened to be set on entry, e
would be wrong because if one read(2) fails but a second attempt
succeeds after read__fixio(), errno should not be touched. So,
the errno to be restored in this case has to be saved before
calling read(2) for the first time.
2. Solving the second issue requires distinguishing invalid and
incomplete characters, but that is impossible with the function
mbtowc(3) because it returns -1 in both cases and sets errno
to EILSEQ in both cases (once properly implemented).
It is vital that each input character is processed right away.
It is not acceptable to wait for the next input character before
processing the previous one because this is an interactive
library, not a batch system. Consequently, the only situation
where it is acceptable to wait for the next byte without first
processing the previous one(s) is when the previous one(s) form
an incomplete sequence that can be continued to form a valid
character.
Consequently, short of reimplementing a full UTF-8 state machine
by hand, the only correct way forward is to use mbrtowc(3).
Even then, care is needed to always have the state object
properly initialized before using it, and to not discard a valid
ASCII or UTF-8 lead byte if it happens to follow an invalid
sequence.
3. Fortunately, solution 2. also solves issue 3. as a side effect,
by no longer using mbtowc(3) in the first place.
and return a single error value EEXIST. When making a recursive
call (to load required modules), treat a pre-existing module as
success.
Without this change, when a module was loaded by specific request
(as opposed to being loaded as a requirement of some other module),
we would always load the module from the file-system, and then
after making various sanity/compatability checks we would destroy
the new copy if there was a pre-existing copy.
Fixes PR kern/40764
XXX Note that if the module exists, we bypass all of the various
XXX "compatability" checks, including whether or not the existing
XXX module is of any particular class! (In the previous code, we
XXX checked to see if the newly-loaded copy had the correct class,
XXX but not the pre-existing copy, which could have been loaded
XXX from a different path/filename.)