NetBSD/bin/sh/syntax.h

/*	$NetBSD: syntax.h,v 1.9 2017/08/21 13:20:49 kre Exp $	*/

/*-
 * Copyright (c) 1991, 1993
 *	The Regents of the University of California.  All rights reserved.
 *
 * This code is derived from software contributed to Berkeley by
 * Kenneth Almquist.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

#include <sys/cdefs.h>
#include <ctype.h>

/* Syntax classes */
#define CWORD 0			/* character is nothing special */
#define CNL 1			/* newline character */
#define CBACK 2			/* a backslash character */
#define CSQUOTE 3		/* single quote */
#define CDQUOTE 4		/* double quote */
#define CBQUOTE 5		/* backwards single quote */
#define CVAR 6			/* a dollar sign */
#define CENDVAR 7		/* a '}' character */
#define CLP 8			/* a left paren in arithmetic */
#define CRP 9			/* a right paren in arithmetic */
#define CEOF 10			/* end of file */
#define CCTL 11			/* like CWORD, except it must be escaped */
#define CSPCL 12		/* these terminate a word */
#define CSBACK 13		/* a backslash in a single quote syntax */

/* Syntax classes for is_ functions */
#define ISDIGIT 01		/* a digit */
#define ISUPPER 02		/* an upper case letter */
#define ISLOWER 04		/* a lower case letter */
#define ISUNDER 010		/* an underscore */
#define ISSPECL 020		/* the name of a special parameter */
#define ISSPACE 040		/* a white space character */

#define PEOF (CHAR_MIN - 1)
#define SYNBASE (-PEOF)


#define BASESYNTAX (basesyntax + SYNBASE)
#define DQSYNTAX (dqsyntax + SYNBASE)
#define SQSYNTAX (sqsyntax + SYNBASE)
#define ARISYNTAX (arisyntax + SYNBASE)

/* These defines assume that the digits are contiguous (which is guaranteed) */
#define	is_digit(c)	((unsigned)((c) - '0') <= 9)
#define	sh_ctype(c)	(is_type+SYNBASE)[(int)(c)]
#define	is_upper(c)	(sh_ctype(c) & ISUPPER)
#define	is_lower(c)	(sh_ctype(c) & ISLOWER)
#define	is_alpha(c)	(sh_ctype(c) & (ISUPPER|ISLOWER))
#define	is_name(c)	(sh_ctype(c) & (ISUPPER|ISLOWER|ISUNDER))
#define	is_in_name(c)	(sh_ctype(c) & (ISUPPER|ISLOWER|ISUNDER|ISDIGIT))
#define	is_special(c)	(sh_ctype(c) & (ISSPECL|ISDIGIT))
#define	is_space(c)	(sh_ctype(c) & ISSPACE)
#define	digit_val(c)	((c) - '0')

extern const char basesyntax[];
extern const char dqsyntax[];
extern const char sqsyntax[];
extern const char arisyntax[];
extern const char is_type[];
Add support for $'...' quoting (based upon C "..." strings, with \ expansions.) Implementation largely obtained from FreeBSD, with adaptations to meet the needs and style of this sh, some updates to agree with the current POSIX spec, and a few other minor changes. The POSIX spec for this ( http://austingroupbugs.net/view.php?id=249 ) [see note 2809 for the current proposed text] is yet to be approved, so might change. It currently leaves several aspects as unspecified, this implementation handles those as: Where more than 2 hex digits follow \x this implementation processes the first two as hex, the following characters are processed as if the \x sequence was not present. The value obtained from a \nnn octal sequence is truncated to the low 8 bits (if a bigger value is written, eg: \456.) Invalid escape sequences are errors. Invalid \u (or \U) code points are errors if known to be invalid, otherwise can generate a '?' character. Where any escape sequence generates nul ('\0') that char, and the rest of the $'...' string is discarded, but anything remaining in the word is processed, ie: aaa$'bbb\0ccc'ddd produces the same as aaa'bbb'ddd. Differences from FreeBSD: FreeBSD allows only exactly 4 or 8 hex digits for \u and \U (as does C, but the current sh proposal differs.) reeBSD also continues consuming as many hex digits as exist after \x (permitted by the spec, but insane), and reject \u0000 as invalid). Some of this is possibly because that their implementation is based upon an earlier proposal, perhaps note 590 - though that has been updated several times. Differences from the current POSIX proposal: We currently always generate UTF-8 for the \u & \U escapes. We should generate the equivalent character from the current locale's character set (and UTF8 only if that is what the current locale uses.) If anyone would like to correct that, go ahead. We (and FreeBSD) generate (X & 0x1F) for \cX escapes where we should generate the appropriate control character (SOH for \cA for example) with whatever value that has in the current character set. Apart from EBCDIC, which we do not support, I've never seen a case where they differ, so ... 2017-08-21 16:20:49 +03:00			`/* $NetBSD: syntax.h,v 1.9 2017/08/21 13:20:49 kre Exp $ */`
Put syntax.h under CVS instead of having it generated by mksyntax. Use CHAR_MIN (from limits.h) to determine whether target char are signed or unsigned - the syntax tables will not be indexed properly. Rip out all the stuff from mksyntax.c that wrote syntax.h. syntax.c can stiff be generated incorrectly... 2004-01-17 18:40:09 +03:00
			`/*-`
			`* Copyright (c) 1991, 1993`
			`* The Regents of the University of California. All rights reserved.`
			`*`
			`* This code is derived from software contributed to Berkeley by`
			`* Kenneth Almquist.`
			`*`
			`* Redistribution and use in source and binary forms, with or without`
			`* modification, are permitted provided that the following conditions`
			`* are met:`
			`* 1. Redistributions of source code must retain the above copyright`
			`* notice, this list of conditions and the following disclaimer.`
			`* 2. Redistributions in binary form must reproduce the above copyright`
			`* notice, this list of conditions and the following disclaimer in the`
			`* documentation and/or other materials provided with the distribution.`
			`* 3. Neither the name of the University nor the names of its contributors`
			`* may be used to endorse or promote products derived from this software`
			`* without specific prior written permission.`
			`*`
			* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
			`* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE`
			`* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE`
			`* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE`
			`* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL`
			`* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS`
			`* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)`
			`* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT`
			`* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY`
			`* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF`
			`* SUCH DAMAGE.`
			`*/`

			`#include <sys/cdefs.h>`
			`#include <ctype.h>`

			`/* Syntax classes */`
			`#define CWORD 0 /* character is nothing special */`
			`#define CNL 1 /* newline character */`
			`#define CBACK 2 /* a backslash character */`
			`#define CSQUOTE 3 /* single quote */`
			`#define CDQUOTE 4 /* double quote */`
			`#define CBQUOTE 5 /* backwards single quote */`
			`#define CVAR 6 /* a dollar sign */`
			`#define CENDVAR 7 /* a '}' character */`
			`#define CLP 8 /* a left paren in arithmetic */`
			`#define CRP 9 /* a right paren in arithmetic */`
			`#define CEOF 10 /* end of file */`
			`#define CCTL 11 /* like CWORD, except it must be escaped */`
			`#define CSPCL 12 /* these terminate a word */`
Add support for $'...' quoting (based upon C "..." strings, with \ expansions.) Implementation largely obtained from FreeBSD, with adaptations to meet the needs and style of this sh, some updates to agree with the current POSIX spec, and a few other minor changes. The POSIX spec for this ( http://austingroupbugs.net/view.php?id=249 ) [see note 2809 for the current proposed text] is yet to be approved, so might change. It currently leaves several aspects as unspecified, this implementation handles those as: Where more than 2 hex digits follow \x this implementation processes the first two as hex, the following characters are processed as if the \x sequence was not present. The value obtained from a \nnn octal sequence is truncated to the low 8 bits (if a bigger value is written, eg: \456.) Invalid escape sequences are errors. Invalid \u (or \U) code points are errors if known to be invalid, otherwise can generate a '?' character. Where any escape sequence generates nul ('\0') that char, and the rest of the $'...' string is discarded, but anything remaining in the word is processed, ie: aaa$'bbb\0ccc'ddd produces the same as aaa'bbb'ddd. Differences from FreeBSD: FreeBSD allows only exactly 4 or 8 hex digits for \u and \U (as does C, but the current sh proposal differs.) reeBSD also continues consuming as many hex digits as exist after \x (permitted by the spec, but insane), and reject \u0000 as invalid). Some of this is possibly because that their implementation is based upon an earlier proposal, perhaps note 590 - though that has been updated several times. Differences from the current POSIX proposal: We currently always generate UTF-8 for the \u & \U escapes. We should generate the equivalent character from the current locale's character set (and UTF8 only if that is what the current locale uses.) If anyone would like to correct that, go ahead. We (and FreeBSD) generate (X & 0x1F) for \cX escapes where we should generate the appropriate control character (SOH for \cA for example) with whatever value that has in the current character set. Apart from EBCDIC, which we do not support, I've never seen a case where they differ, so ... 2017-08-21 16:20:49 +03:00			`#define CSBACK 13 /* a backslash in a single quote syntax */`
Put syntax.h under CVS instead of having it generated by mksyntax. Use CHAR_MIN (from limits.h) to determine whether target char are signed or unsigned - the syntax tables will not be indexed properly. Rip out all the stuff from mksyntax.c that wrote syntax.h. syntax.c can stiff be generated incorrectly... 2004-01-17 18:40:09 +03:00
			`/* Syntax classes for is_ functions */`
			`#define ISDIGIT 01 /* a digit */`
			`#define ISUPPER 02 /* an upper case letter */`
			`#define ISLOWER 04 /* a lower case letter */`
			`#define ISUNDER 010 /* an underscore */`
			`#define ISSPECL 020 /* the name of a special parameter */`
Add support for $'...' quoting (based upon C "..." strings, with \ expansions.) Implementation largely obtained from FreeBSD, with adaptations to meet the needs and style of this sh, some updates to agree with the current POSIX spec, and a few other minor changes. The POSIX spec for this ( http://austingroupbugs.net/view.php?id=249 ) [see note 2809 for the current proposed text] is yet to be approved, so might change. It currently leaves several aspects as unspecified, this implementation handles those as: Where more than 2 hex digits follow \x this implementation processes the first two as hex, the following characters are processed as if the \x sequence was not present. The value obtained from a \nnn octal sequence is truncated to the low 8 bits (if a bigger value is written, eg: \456.) Invalid escape sequences are errors. Invalid \u (or \U) code points are errors if known to be invalid, otherwise can generate a '?' character. Where any escape sequence generates nul ('\0') that char, and the rest of the $'...' string is discarded, but anything remaining in the word is processed, ie: aaa$'bbb\0ccc'ddd produces the same as aaa'bbb'ddd. Differences from FreeBSD: FreeBSD allows only exactly 4 or 8 hex digits for \u and \U (as does C, but the current sh proposal differs.) reeBSD also continues consuming as many hex digits as exist after \x (permitted by the spec, but insane), and reject \u0000 as invalid). Some of this is possibly because that their implementation is based upon an earlier proposal, perhaps note 590 - though that has been updated several times. Differences from the current POSIX proposal: We currently always generate UTF-8 for the \u & \U escapes. We should generate the equivalent character from the current locale's character set (and UTF8 only if that is what the current locale uses.) If anyone would like to correct that, go ahead. We (and FreeBSD) generate (X & 0x1F) for \cX escapes where we should generate the appropriate control character (SOH for \cA for example) with whatever value that has in the current character set. Apart from EBCDIC, which we do not support, I've never seen a case where they differ, so ... 2017-08-21 16:20:49 +03:00			`#define ISSPACE 040 /* a white space character */`
Put syntax.h under CVS instead of having it generated by mksyntax. Use CHAR_MIN (from limits.h) to determine whether target char are signed or unsigned - the syntax tables will not be indexed properly. Rip out all the stuff from mksyntax.c that wrote syntax.h. syntax.c can stiff be generated incorrectly... 2004-01-17 18:40:09 +03:00
			`#define PEOF (CHAR_MIN - 1)`
			`#define SYNBASE (-PEOF)`


			`#define BASESYNTAX (basesyntax + SYNBASE)`
			`#define DQSYNTAX (dqsyntax + SYNBASE)`
			`#define SQSYNTAX (sqsyntax + SYNBASE)`
			`#define ARISYNTAX (arisyntax + SYNBASE)`

Revert (kind of) the change in 1.12 of the ancient mksyntax.sh script (undoing the effect of that commit on syntax.h when it was being dynamically generated) from 1996. This means that the shell parser is now locale independent, so scripts that work anywhere will work consistently everywhere. Inspired by a similar change in FreeBSD's sh (from 2010) - the original change in the other direction came from FreeBSD as well.... Note that this does not in any way add any kind of support for locales to sh (which is a whole different problem.) (from kre) 2016-03-16 18:45:40 +03:00			`/* These defines assume that the digits are contiguous (which is guaranteed) */`
			`#define is_digit(c) ((unsigned)((c) - '0') <= 9)`
Add support for $'...' quoting (based upon C "..." strings, with \ expansions.) Implementation largely obtained from FreeBSD, with adaptations to meet the needs and style of this sh, some updates to agree with the current POSIX spec, and a few other minor changes. The POSIX spec for this ( http://austingroupbugs.net/view.php?id=249 ) [see note 2809 for the current proposed text] is yet to be approved, so might change. It currently leaves several aspects as unspecified, this implementation handles those as: Where more than 2 hex digits follow \x this implementation processes the first two as hex, the following characters are processed as if the \x sequence was not present. The value obtained from a \nnn octal sequence is truncated to the low 8 bits (if a bigger value is written, eg: \456.) Invalid escape sequences are errors. Invalid \u (or \U) code points are errors if known to be invalid, otherwise can generate a '?' character. Where any escape sequence generates nul ('\0') that char, and the rest of the $'...' string is discarded, but anything remaining in the word is processed, ie: aaa$'bbb\0ccc'ddd produces the same as aaa'bbb'ddd. Differences from FreeBSD: FreeBSD allows only exactly 4 or 8 hex digits for \u and \U (as does C, but the current sh proposal differs.) reeBSD also continues consuming as many hex digits as exist after \x (permitted by the spec, but insane), and reject \u0000 as invalid). Some of this is possibly because that their implementation is based upon an earlier proposal, perhaps note 590 - though that has been updated several times. Differences from the current POSIX proposal: We currently always generate UTF-8 for the \u & \U escapes. We should generate the equivalent character from the current locale's character set (and UTF8 only if that is what the current locale uses.) If anyone would like to correct that, go ahead. We (and FreeBSD) generate (X & 0x1F) for \cX escapes where we should generate the appropriate control character (SOH for \cA for example) with whatever value that has in the current character set. Apart from EBCDIC, which we do not support, I've never seen a case where they differ, so ... 2017-08-21 16:20:49 +03:00			`#define sh_ctype(c) (is_type+SYNBASE)[(int)(c)]`
DEBUG mode shell update (changes nothing for shells which are not compiled for DEBUG.) Add debug builtin command, and corresponding -D command line option. As usual, for DEBUG related stuff, read the source for info, that's all there is about this. This completes the infrastructure changes for the updated DEBUG TRACE mechanism, so now converting the rest of the shell's internal tracing can happen as desired - piecemeal. 2017-05-15 23:00:36 +03:00			`#define is_upper(c) (sh_ctype(c) & ISUPPER)`
			`#define is_lower(c) (sh_ctype(c) & ISLOWER)`
factor out common code in macro. 2016-03-16 18:48:01 +03:00			`#define is_alpha(c) (sh_ctype(c) & (ISUPPER\|ISLOWER))`
			`#define is_name(c) (sh_ctype(c) & (ISUPPER\|ISLOWER\|ISUNDER))`
			`#define is_in_name(c) (sh_ctype(c) & (ISUPPER\|ISLOWER\|ISUNDER\|ISDIGIT))`
Add support for $'...' quoting (based upon C "..." strings, with \ expansions.) Implementation largely obtained from FreeBSD, with adaptations to meet the needs and style of this sh, some updates to agree with the current POSIX spec, and a few other minor changes. The POSIX spec for this ( http://austingroupbugs.net/view.php?id=249 ) [see note 2809 for the current proposed text] is yet to be approved, so might change. It currently leaves several aspects as unspecified, this implementation handles those as: Where more than 2 hex digits follow \x this implementation processes the first two as hex, the following characters are processed as if the \x sequence was not present. The value obtained from a \nnn octal sequence is truncated to the low 8 bits (if a bigger value is written, eg: \456.) Invalid escape sequences are errors. Invalid \u (or \U) code points are errors if known to be invalid, otherwise can generate a '?' character. Where any escape sequence generates nul ('\0') that char, and the rest of the $'...' string is discarded, but anything remaining in the word is processed, ie: aaa$'bbb\0ccc'ddd produces the same as aaa'bbb'ddd. Differences from FreeBSD: FreeBSD allows only exactly 4 or 8 hex digits for \u and \U (as does C, but the current sh proposal differs.) reeBSD also continues consuming as many hex digits as exist after \x (permitted by the spec, but insane), and reject \u0000 as invalid). Some of this is possibly because that their implementation is based upon an earlier proposal, perhaps note 590 - though that has been updated several times. Differences from the current POSIX proposal: We currently always generate UTF-8 for the \u & \U escapes. We should generate the equivalent character from the current locale's character set (and UTF8 only if that is what the current locale uses.) If anyone would like to correct that, go ahead. We (and FreeBSD) generate (X & 0x1F) for \cX escapes where we should generate the appropriate control character (SOH for \cA for example) with whatever value that has in the current character set. Apart from EBCDIC, which we do not support, I've never seen a case where they differ, so ... 2017-08-21 16:20:49 +03:00			`#define is_special(c) (sh_ctype(c) & (ISSPECL\|ISDIGIT))`
			`#define is_space(c) (sh_ctype(c) & ISSPACE)`
			`#define digit_val(c) ((c) - '0')`
Put syntax.h under CVS instead of having it generated by mksyntax. Use CHAR_MIN (from limits.h) to determine whether target char are signed or unsigned - the syntax tables will not be indexed properly. Rip out all the stuff from mksyntax.c that wrote syntax.h. syntax.c can stiff be generated incorrectly... 2004-01-17 18:40:09 +03:00
			`extern const char basesyntax[];`
			`extern const char dqsyntax[];`
			`extern const char sqsyntax[];`
			`extern const char arisyntax[];`
			`extern const char is_type[];`