661 Commits

Author SHA1 Message Date
christos bc4f01ce82 - don't set _GNU_SOURCE. We are not supposed to make decisions for others.
- don't special-case wcsdup()
From Ingo Schwarze.
2016-02-16 19:29:51 +00:00
christos 23f3e7075d get rid of bool_t (Ingo Schwarze) 2016-02-16 19:11:25 +00:00
christos 747f681109 more include file cleanup (Ingo Schwarze) 2016-02-16 19:08:41 +00:00
christos a539b892c3 include errno.h 2016-02-16 15:54:15 +00:00
christos aefc1e4460 From Ingo Schwarze:
Let "el.h" include everything needed for struct editline,
and don't include that stuff multiple times.  That improves
consistency, avoids circular inclusions, and makes it easier
to follow what is going on, even though it is not quite as nice.
But it seems like the best we can do...
2016-02-16 15:53:48 +00:00
christos f09cb8c626 cleanup chartype.h includes (Ingo Schwarze) 2016-02-16 14:08:25 +00:00
christos c807fdff98 one more 2016-02-16 14:07:47 +00:00
christos 40850369f8 cleanup inclusion of histedit.h (Ingo Schwarze) 2016-02-16 14:06:05 +00:00
christos 89cffc1532 include explicitly errno.h since we use it. 2016-02-16 14:04:58 +00:00
christos 67bb823526 No need to include "sys.h" from here; it is included from config.h 2016-02-16 14:04:24 +00:00
christos d8252c8b23 attribute unused 2016-02-15 23:36:30 +00:00
christos 67b10d3e9e OpenBSD term.c rev. 1.7 2002/11/29 20:13:39 deraadt
spelling
2016-02-15 22:53:38 +00:00
christos f91f480498 OpenBSD readline.c rev. 1.14 2015/02/06 23:21:58 millert
use SIZE_MAX
2016-02-15 22:48:59 +00:00
christos 92b1772005 OpenBSD readline.c rev. 1.13 2015/01/13 08:33:12 reyk
rl_set_keyboard_input_timeout() for readline 4.2 compat
2016-02-15 21:58:37 +00:00
christos f8ee3c5528 OpenBSD eln.c rev. 1.3 2011/11/27 21:46:44 pascal
kill a C++-style comment
2016-02-15 21:56:35 +00:00
christos 42e2a4d875 Compile with WIDECHAR the same way the main Makefile does (Ingo Schwarze) 2016-02-15 21:38:07 +00:00
christos 1e12a8d1ca Don't free getline memory (Ingo Schwarze). 2016-02-15 21:35:52 +00:00
christos e8d0e8c012 forgot one fgetln define 2016-02-15 17:35:39 +00:00
christos 5390c8faa5 change tests for fgetln. 2016-02-15 16:14:39 +00:00
christos a7ab79fbe5 Use getline for better portability. 2016-02-15 15:53:45 +00:00
christos c0d16449e0 OpenBSD tokenizer.c rev. 1.8 2003/08/11 18:21:40 deraadt
don't increase amax on realloc failure
2016-02-15 15:37:20 +00:00
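The "don't increase amax on realloc failure" fix above follows a common pattern: record the new capacity only after realloc() has succeeded. A minimal sketch of that pattern, assuming a hypothetical `struct argvec` and `argvec_grow()` that are not libedit's actual tokenizer code:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable argument vector, for illustration only. */
struct argvec {
	char	**argv;		/* allocated slots */
	size_t	  amax;		/* number of slots allocated */
};

/*
 * Grow the vector by `step` slots.  Returns 0 on success, -1 on failure.
 * On failure neither argv nor amax is touched, so the recorded capacity
 * never exceeds what is really allocated.
 */
static int
argvec_grow(struct argvec *v, size_t step)
{
	size_t newmax;
	char **p;

	if (step == 0)
		return 0;
	/* Refuse requests whose byte count would overflow size_t. */
	if (step > SIZE_MAX / sizeof(*v->argv) - v->amax)
		return -1;
	newmax = v->amax + step;
	p = realloc(v->argv, newmax * sizeof(*p));
	if (p == NULL)
		return -1;	/* old block is still valid and still amax slots */
	v->argv = p;
	v->amax = newmax;	/* update the capacity only after success */
	return 0;
}
```

Assigning the result of realloc() to a temporary also avoids leaking the old block when the call fails.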
christos efeef4e587 OpenBSD term.c rev. 1.13 2009/12/11 18:58:59 jacekm
fix two memory leaks
2016-02-15 15:35:03 +00:00
christos c825536317 Change the test for the size of encoded buffer to include the NUL, from
OpenBSD; no functional change.
2016-02-15 15:30:50 +00:00
christos 5367da5f9e OpenBSD sig.c rev. 1.6 2001/12/06 04:26:00 deraadt
save and restore errno in signal handler
2016-02-15 15:29:25 +00:00
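Saving and restoring errno in a signal handler, as in the sig.c change above, keeps the interrupted code from seeing an errno value set by calls made inside the handler. A minimal sketch, assuming a hypothetical handler `on_signal()` and wakeup descriptor `notify_fd` that are not part of libedit:

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal;	/* last signal seen, 0 if none */
static int notify_fd = -1;	/* hypothetical wakeup pipe; -1 = not set up */

static void
on_signal(int signo)
{
	int save_errno = errno;		/* value the interrupted code relies on */
	ssize_t n;

	got_signal = signo;
	/* write(2) is async-signal-safe, but it may overwrite errno. */
	n = write(notify_fd, "x", 1);
	(void)n;
	errno = save_errno;		/* hide the handler's side effects */
}
```

With `notify_fd` left at -1 the write(2) call fails and sets errno inside the handler, yet code that installs the handler with signal(SIGUSR1, on_signal) and then calls raise(SIGUSR1) still sees its own errno value afterwards.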
christos 87240809e9 Use fparseln to avoid newline hacks. 2016-02-15 15:26:48 +00:00
christos 70a36d136a use fparseln() to avoid needing to deal with missing \n in the last line
and also to handle comments automatically.
2016-02-15 15:18:01 +00:00
christos 2884af9fee From Ingo Schwarze:
This concerns el_getc() for the WIDECHAR case, that is, the version
in eln.c.  For a UTF-8 locale, it is broken in four ways:

 1. If the character read is outside the ASCII range, the function
    does an undefined cast from wchar_t to char.  Even if wchar_t
    is internally represented as UCS-4, that is wrong and dangerous
    because characters beyond codepoint U+00FF get their high bits
    truncated, meaning that perfectly valid printable Unicode
    characters get mapped to arbitrary bytes, even the ASCII escape
    character for some Unicode characters.  But wchar_t need not
    be implemented in terms of UCS-4, so the outcome of this function
    is undefined for any and all input.

 2. If insufficient space is available for the result, the function
    fails to detect failure and returns garbage rather than -1 as
    specified in the documentation.

 3. The documentation says that errno will be set on failure, but
    that doesn't happen either in the above case.

 4. Even for ASCII characters, the results may be wrong if wchar_t
    is not using UCS-4.
2016-02-14 17:06:24 +00:00
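The four problems listed above all stem from casting a wchar_t directly to char. One way to avoid the cast is to ask the C library whether the wide character has a single-byte representation in the current locale. The following is only a sketch under that assumption; `wide_to_byte()` is a hypothetical helper, and the choice of ERANGE as the error code is an assumption rather than something stated in the commit message:

```c
#include <errno.h>
#include <stdio.h>
#include <wchar.h>

/*
 * Store the single-byte form of wc in *cp and return 1.
 * If wc has no single-byte representation in the current locale,
 * store '\0', set errno, and return -1.
 */
static int
wide_to_byte(wchar_t wc, char *cp)
{
	int b = wctob((wint_t)wc);	/* EOF if wc does not fit in one byte */

	if (b == EOF) {
		*cp = '\0';
		errno = ERANGE;		/* assumed error code */
		return -1;
	}
	*cp = (char)b;
	return 1;
}
```

Because the conversion goes through wctob(3), the result does not depend on how wchar_t is represented internally, and failure is reported instead of returning a truncated value.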
christos f54e4f97f9 From Ingo Schwarze:
As we have seen before, "histedit.h" can never get rid of including
the <wchar.h> header because using the data types defined there is
deeply ingrained in the public interfaces of libedit.

Now POSIX unconditionally requires that <wchar.h> defines the type
wint_t.  Consequently, it can be used unconditionally, no matter
whether WIDECHAR is active or not.  Consequently, the #define Int
is pointless.

Note that removing it is not gratuitous churn.  Auditing for
integer signedness problems is already hard when only fundamental
types like "int" and "unsigned" are involved.  It gets very hard
when types come into the picture that have platform-dependent
signedness, like "char" and "wint_t".  Adding yet another layer
on top, changing both the signedness and the width in a platform-
dependent way, makes auditing yet harder, which IMHO is really
dangerous.  Note that while removing the #define, i already found
one bug caused by this excessive complication - in the function
re_putc() in refresh.c.  If WIDECHAR was defined, it printed an
Int = wint_t value with %c.  Fortunately, that bug only affects
debugging, not production.  The fix is contained in the patch.

With WIDECHAR, this doesn't change anything.  For the case without
WIDECHAR, i checked that none of the places wants to store values
that might not fit in wint_t.

This only changes internal interfaces; public ones remain unchanged.
2016-02-14 14:49:34 +00:00
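The re_putc() bug mentioned above, printing a wint_t value with %c, is a format/argument mismatch: %c expects an int that is converted to unsigned char, while %lc expects a wint_t. A minimal sketch of the matching conversion, using a hypothetical `format_wint()` helper rather than the actual debugging code in refresh.c:

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Format one wide character into buf using the current locale.
 * Returns what snprintf(3) returns: the number of bytes needed,
 * or a negative value if the character cannot be converted.
 */
static int
format_wint(char *buf, size_t size, wint_t c)
{
	/* %lc takes a wint_t argument; %c would take an int. */
	return snprintf(buf, size, "%lc", c);
}
```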
christos 61ee30487d From Ingo Schwarze:
Next step:  Remove #ifdef'ing in read_char(), in the same style
as we did for setlocale(3) in el.c.

A few remarks are required to explain the choices made.

 * On first sight, handling mbrtowc(3) seems a bit less trivial
   than handling setlocale(3) because its prototype uses the data
   type mbstate_t from <wchar.h>.  However, it turns out that
   "histedit.h" already includes <wchar.h> unconditionally (i don't
   like headers including other headers, but that ship has sailed,
   people are by now certainly used to the fact that including
   "histedit.h" doesn't require including <wchar.h> before), and
   "histedit.h" is of course included all over the place.  So from
   that perspective, there is no problem with using mbrtowc(3)
   unconditionally even for !WIDECHAR.

 * However, <wchar.h> also defines the mbrtowc(3) prototype,
   so we cannot just #define mbrtowc away, or including the header
   will break.  It would also be a bad idea to provide a local
   implementation of mbrtowc() and hope that it overrides the one
   in libc.  Besides, the required prototype is subtly different:
   While mbrtowc(3) takes "wchar_t *" as its first argument, we
   need a function that takes "Char *".  So unfortunately, we have
   to keep a ct_mbrtowc #define, at least until we can maybe get
   rid of "Char *" in the more remote future.

 * After getting rid of the #else clause in read_char(), we can
   pull "return 1;" into the default: clause.  After that, we can
   get rid of the ugly "goto again_lastbyte;" and just "break;".
   As a bonus, that also gets rid of the ugly CONSTCOND.

 * While here, delete the unused ct_mbtowc() from chartype.h.
2016-02-14 14:47:48 +00:00
christos cc7f005f24 Avoid c99 for now. 2016-02-12 17:23:21 +00:00
christos 57c556fd79 GC IGNORE_EXTCHARS and simplify code (Ingo Schwarze) 2016-02-12 15:36:08 +00:00
christos 0e1288d7c8 From Ingo Schwarze:
If CHARSET_IS_UTF8 is not set, read_char() is broken in a large
number of ways:

 1. The isascii(3) check can yield false positives.  If a string in
    an arbitrary encoding contains a byte in the range 0..127,
    that does not at all imply that it forms a character all by
    itself, and even less that it represents the same character
    as in ASCII.  Consequently, read_char() may return characters
    the user never typed.
    Even if the encoding is not state dependent, the assumption that
    bytes in the range 0..127 represent ASCII characters is broken.
    Consider UTF-16, for example.

 2. The reverse problem can also occur.  In an arbitrary encoding,
    there is no guarantee that a character that can be represented
    by ASCII is represented by a seven-bit byte, and even less by
    the same byte as in ASCII.
    Even for single-byte encodings, these assumptions are broken.
    Consider the ISO 646 national variants, for example.
    Consequently, the current code is insufficient to keep ASCII
    characters working even for single-byte encodings.

 3. The condition "++cbp != 1" can never trigger (because initially,
    cbp is 0, and the code can only go back up via the final goto,
    which has another cbp = 0 right before it) and it has no effect
    (because cbp isn't used afterwards).

 4. bytes = ct_mbtowc(cp, cbuf, cbp) is broken.  If this returns -1,
    the code assumes that it can just call mbtowc(3) again for later
    input bytes.  In some implementations, that may even be broken
    for state-independent encodings, but trying again after mbtowc(3)
    failure certainly produces completely erratic and meaningless
    results in state-dependent encodings.

 5. The assignment "*cp = (Char)(unsigned char)cbuf[0]" is
    completely bogus.  Even if the byte cbuf[0] represents a
    character all by itself, which it usually will not, whether
    or not the cast produces the desired result depends on the
    internal representation of wchar_t in the C library, which
    the application program can know nothing about.  Even for ASCII
    in the C/POSIX locale, an ASCII character other than '\0' ==
    L'\0' == 0 need not have the same numeric value as a char and
    as a wchar_t.

To summarize, this code only works if all of the following
conditions hold:

 - The encoding is a single-byte encoding.
 - ASCII is a subset of the encoding.
 - The implementation of mbtowc(3) in the C library does not
   require re-initialization after encoding errors.
 - The implementation of wchar_t in the C library uses the
   same numerical values as ASCII.

Otherwise, it silently produces wrong results.

The simplest way to fix this is to just use the same code as for
UTF-8 (right above).  Of course, that causes functional changes
but that shouldn't matter since current behaviour is undefined.

The patch below provides the following improvements:

 - It works for all stateless single-byte encodings, no matter
   whether they are somehow related to ASCII, no matter how
   mb[r]towc(3) are internally implemented, and no matter how
   wchar_t is internally represented.
 - Instead of producing unpredictable and definitely wrong
   results for non-UTF-8 multibyte characters, it behaves in
   a well-defined way: It aborts input processing, sets errno,
   and returns failure.
   Note that short of providing full support for arbitrary locales,
   it is impossible to do better.  We cannot know whether a given
   unsupported locale is state-dependent, and for a state-dependent
   locale, it makes no sense to retry parsing after an encoding
   error, so the best we can do is abort processing for *any*
   unsupported multi-byte character.
 - Note that single-byte characters in arbitrary state-independent
   locales still work, even in locales that may potentially also
   contain multibyte characters, as long as those don't occur in
   input.  I'm not sure whether any such locales exist in practice...

Tested with UTF-8 and C/POSIX on OpenBSD.  Also tested that in the
C/POSIX locale, non-ASCII bytes get through unmangled.  You may
wish to test with ISO-LATIN on NetBSD if NetBSD supports that.

----
Also use a constant for meta to avoid warnings.
2016-02-12 15:11:09 +00:00
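Point 5 above says that casting a byte to a wide character relies on the internal representation of wchar_t. As an illustration only (this is not the approach the patch describes, which reuses the UTF-8 code path), the standard btowc(3) function performs that conversion through the locale. `byte_to_wide()` is a hypothetical helper:

```c
#include <wchar.h>

/*
 * Convert one byte to a wide character using the current locale.
 * Returns 1 and stores the result in *wc if the byte forms a complete
 * character by itself, 0 otherwise (*wc is left untouched).
 */
static int
byte_to_wide(unsigned char byte, wchar_t *wc)
{
	wint_t w = btowc(byte);		/* WEOF if not a single-byte character */

	if (w == WEOF)
		return 0;
	*wc = (wchar_t)w;
	return 1;
}
```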
christos 6af8d6733f - Add some more Char casts
- reduce ifdefs by providing empty defs for nls functions (Ingo Schwarze)
2016-02-11 19:21:04 +00:00
christos 28c0290948 remove unused wrapper (Ingo Schwarze) 2016-02-11 19:10:18 +00:00
christos 3ae44d1033 Remove utf8_islead(); mbrtowc() handles this just fine (Ingo Schwarze) 2016-02-11 16:08:47 +00:00
christos 6b42622b31 UTF-8 fixes from Ingo Schwarze:
 1. Assume that errno is non-zero when entering read_char()
    and that read(2) returns 0 (indicating end of file).
    Then, the code will clear errno before returning.
    (Obviously, the statement "errno = 0" is almost always
     a bug unless there is save_errno = errno right before it
     and the previous value is properly restored later,
     in all reachable code paths.)

 2. When encountering an invalid byte sequence, the code discards
    all following bytes until MB_LEN_MAX overflows; consider, for
    example, 0xc2 immediately followed by a few valid ASCII bytes.
    Three of those ASCII bytes will be discarded.

 3. On a POSIX system, EILSEQ will always be set after reading a
    valid (yes, valid, not invalid!) UTF-8 character.  The reason
    is that mbtowc(3) will first be called with a length limit
    (third argument) of 1, which will fail, return -1, and - on
    a POSIX system - set errno to EILSEQ.
    This third bug is mitigated a bit because i couldn't find any
    system that actually conforms to POSIX in this respect:  None
    of OpenBSD, NetBSD, FreeBSD, Solaris 11, and glibc set errno
    when an incomplete character is passed to mbtowc(3), even though
    that is required by POSIX.
    Anyway, that mbtowc(3) bug will be fixed at least in OpenBSD
    after release unlock, so it would be good to fix this bug in
    libedit before fixing the bug in mbtowc(3).

How can these three bugs be fixed?

 1. As far as i understand it, the intention of the bogus errno = 0
    is to undo the effects of failing system calls in el_wset(),
    sig_set(), and read__fixio() if the subsequent read(2) indicates
    end of file.  So, restoring errno has to be moved right after
    read__fixio().  Of course, neither 0 nor e is the right value
    to restore: 0 is wrong if errno happened to be set on entry, e
    would be wrong because if one read(2) fails but a second attempt
    succeeds after read__fixio(), errno should not be touched.  So,
    the errno to be restored in this case has to be saved before
    calling read(2) for the first time.

 2. Solving the second issue requires distinguishing invalid and
    incomplete characters, but that is impossible with the function
    mbtowc(3) because it returns -1 in both cases and sets errno
    to EILSEQ in both cases (once properly implemented).

    It is vital that each input character is processed right away.
    It is not acceptable to wait for the next input character before
    processing the previous one because this is an interactive
    library, not a batch system.  Consequently, the only situation
    where it is acceptable to wait for the next byte without first
    processing the previous one(s) is when the previous one(s) form
    an incomplete sequence that can be continued to form a valid
    character.

    Consequently, short of reimplementing a full UTF-8 state machine
    by hand, the only correct way forward is to use mbrtowc(3).
    Even then, care is needed to always have the state object
    properly initialized before using it, and to not discard a valid
    ASCII or UTF-8 lead byte if it happens to follow an invalid
    sequence.

 3. Fortunately, solution 2. also solves issue 3. as a side effect,
    by no longer using mbtowc(3) in the first place.
2016-02-08 17:18:43 +00:00
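The second fix described above depends on mbrtowc(3) distinguishing an incomplete sequence (return value (size_t)-2) from an invalid one ((size_t)-1). The following sketch shows that distinction on an in-memory buffer. It is an assumption-laden illustration, not the read_char() code: `decode_one()` is a hypothetical helper, and the caller is assumed to have selected a locale with setlocale(3) beforehand.

```c
#include <errno.h>
#include <locale.h>	/* setlocale(3), to be called by the application */
#include <string.h>
#include <wchar.h>

/*
 * Decode one character from buf[0..len), feeding one byte at a time.
 * Returns  1: character stored in *wc, *used bytes consumed.
 * Returns  0: buf is an incomplete but so far valid prefix; read more.
 * Returns -1: invalid sequence, errno is EILSEQ, *used bytes should be
 *             dropped.  The byte that exposed the error is kept unless
 *             it is the only byte, because it may start a valid character.
 */
static int
decode_one(const char *buf, size_t len, wchar_t *wc, size_t *used)
{
	mbstate_t st;
	size_t i, r;

	memset(&st, 0, sizeof(st));	/* always begin in the initial state */
	for (i = 0; i < len; i++) {
		r = mbrtowc(wc, buf + i, 1, &st);
		if (r == (size_t)-2)
			continue;	/* incomplete: feed the next byte */
		if (r == (size_t)-1) {
			*used = i > 0 ? i : 1;
			errno = EILSEQ;
			return -1;
		}
		*used = i + 1;		/* complete character */
		return 1;
	}
	*used = 0;
	return 0;
}
```

In a UTF-8 locale, the bytes 0xc2 'a' yield -1 with `*used` set to 1, so a second call starting at 'a' still decodes the ASCII character instead of discarding it.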
christos ef555cf8bb Whitespace fix (Ingo Schwarze) 2016-01-30 15:05:27 +00:00
christos 7ce9f672f2 Fix misplaced parentheses (Ingo Schwarze) 2016-01-30 04:02:51 +00:00
christos 65691b0e16 One macro is enough (Ingo Schwarze) 2016-01-29 19:59:11 +00:00
gson 07d2388506 unbreak the build 2015-12-08 16:53:27 +00:00
christos 8d14d38c26 If we did not setup the tty, don't reset it. 2015-12-08 12:57:16 +00:00
christos a2993d741e Only reset the terminal if we have a tty (Boris Ranto) 2015-12-08 12:56:55 +00:00
christos 8ec268554e Fix descriptions of el_set functions.
Americanise initialise :-)
2015-11-03 21:36:59 +00:00
christos 0fe5419e98 Use the full buffer for the conversion; ideally we should be dynamically
allocating this. From Jilles Tjoelker
2015-10-21 21:45:30 +00:00
christos 234792da04 make sure we have space for NUL and NUL terminate buffer array (Jilles Tjoelker) 2015-10-19 00:36:27 +00:00
christos 14ccb7c1cc remove duplicate declaration 2015-06-02 15:36:45 +00:00
christos 0804279dff Adjust API to a more modern readline (Ryo Onodera) 2015-06-02 15:35:31 +00:00
christos 0b61093115 - fix types of rl_completion_entry_function and rl_add_defun
- call update pos before completion to refresh the screen
From Thomas Eriksson
2015-05-26 19:59:21 +00:00
christos bdf16bca92 make el_gets() return the number of characters read in wide mode (not the
number of wide characters) From khorben@ by FreeBSD:
https://svnweb.freebsd.org/ports/head/devel/libedit/files/patch-src_eln.c?\
revision=382458&view=markup
XXX: Pullup-7
2015-05-18 15:07:04 +00:00
christos 5113710e5b add FreeBSD 2015-05-17 13:14:41 +00:00