Import nawk as of 2003/07/29

Changes:
* internationalization improvements
* [:digit:] addition
* some bugfixes
2003-08-02 22:21:23 +00:00
commit 4ea2a427d1
parent 9d1ca6d8d9
1 changed files with 136 additions and 0 deletions
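
Note on the Jun 1, 2003 entry in the FIXES hunk of this commit: it says that when the source string given to split is empty, the element count is always 0. A minimal shell sketch of how one might check that with whatever awk is on the PATH follows. The helper name count_fields is hypothetical and not part of nawk, and the sketch only observes the return value of split, not whether the array is touched.

```shell
#!/bin/sh
# count_fields STRING SEP
# Hypothetical helper (not part of nawk): prints the value returned by
# awk's split() when STRING is split on the separator SEP.
count_fields() {
    awk -v s="$1" -v sep="$2" 'BEGIN { n = split(s, parts, sep); print n }'
}

# Non-empty source: three elements are expected.
count_fields "a:b:c" ":"

# Empty source: per the FIXES entry, the count should be 0.
count_fields "" ":"
```

With a conforming awk the two calls should print 3 and then 0. Other awk implementations may differ in corner cases, so treat this as a quick check rather than a specification.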
--- a/dist/nawk/FIXES
+++ b/dist/nawk/FIXES
@@ -25,6 +25,142 @@ THIS SOFTWARE.
 This file lists all bug fixes, changes, etc., made since the AWK book
 was sent to the printers in August, 1987.

+Jul 29, 2003:
+	fixed (i think) the long-standing botch that included the beginning of
+	line state ^ for RE's in the set of valid characters; this led to a
+	variety of odd problems, including failure to properly match certain
+	regular expressions in non-US locales.  thanks to ruslan for keeping
+	at this one.
+
+Jul 28, 2003:
+	n-th try at getting internationalization right, with thanks to volker
+	kiefel, arnold robbins and ruslan ermilov for advice, though they
+	should not be blamed for the outcome.  according to posix, "."  is the
+	radix character in programs and command line arguments regardless of
+	the locale; otherwise, the locale should prevail for input and output
+	of numbers.  so it's intended to work that way.
+	
+	i have rescinded the attempt to use strcoll in expanding shorthands in
+	regular expressions (cclenter).  its properties are much too
+	surprising; for example [a-c] matches aAbBc in locale en_US but abBcC
+	in locale fr_CA.  i can see how this might arise by implementation
+	but i cannot explain it to a human user.  (this behavior can be seen
+	in gawk as well; we're leaning on the same library.)
+
+	the issue appears to be that strcoll is meant for sorting, where
+	merging upper and lower case may make sense (though note that unix
+	sort does not do this by default either).  it is not appropriate
+	for regular expressions, where the goal is to match specific
+	patterns of characters.  in any case, the notations [:lower:], etc.,
+	are available in awk, and they are more likely to work correctly in
+	most locales.
+
+	a moratorium is hereby declared on internationalization changes.
+	i apologize to friends and colleagues in other parts of the world.
+	i would truly like to get this "right", but i don't know what
+	that is, and i do not want to keep making changes until it's clear.
+
+Jul 4, 2003:
+	fixed bug that permitted non-terminated RE, as in "awk /x".
+
+Jun 1, 2003:
+	subtle change to split: if source is empty, number of elems
+	is always 0 and the array is not set.
+
+Mar 21, 2003:
+	added some parens to isblank, in another attempt to make things
+	internationally portable.
+
+Mar 14, 2003:
+	the internationalization changes, somewhat modified, are now
+	reinstated.  in theory awk will now do character comparisons
+	and case conversions in national language, but "." will always
+	be the decimal point separator on input and output regardless
+	of national language.  isblank(){} has an #ifndef.
+
+	this no longer compiles on windows: LC_MESSAGES isn't defined
+	in vc6++.
+
+	fixed subtle behavior in field and record splitting: if FS is
+	a single character and RS is not empty, \n is NOT a separator.
+	this tortuous reading is found in the awk book; behavior now
+	matches gawk and mawk.
+
+Dec 13, 2002:
+	for the moment, the internationalization changes of nov 29 are
+	rolled back -- programs like x = 1.2 don't work in some locales,
+	because the parser is expecting x = 1,2.  until i understand this
+	better, this will have to wait.
+
+Nov 29, 2002:
+	modified b.c (with tiny changes in main and run) to support
+	locales, using strcoll and iswhatever tests for posix character
+	classes.  thanks to ruslan ermilov (ru@freebsd.org) for code.
+	the function isblank doesn't seem to have propagated to any
+	header file near me, so it's there explicitly.  not properly
+	tested on non-ascii character sets by me.
+
+Jun 28, 2002:
+	modified run/format() and tran/getsval() to do a slightly better
+	job on using OFMT for output from print and CONVFMT for other
+	number->string conversions, as promised by posix and done by 
+	gawk and mawk.  there are still places where it doesn't work
+	right if CONVFMT is changed; by then the STR attribute of the
+	variable has been irrevocably set.  thanks to arnold robbins for
+	code and examples.
+
+	fixed subtle bug in format that could get core dump.  thanks to
+	Jaromir Dolecek <jdolecek@NetBSD.org> for finding and fixing.
+	minor cleanup in run.c / format() at the same time.
+
+	added some tests for null pointers to debugging printf's, which
+	were never intended for external consumption.  thanks to dave
+	kerns (dkerns@lucent.com) for pointing this out.
+
+	GNU compatibility: an empty regexp matches anything (thanks to
+	dag-erling smorgrav, des@ofug.org).  subject to reversion if
+	this does more harm than good.
+
+	pervasive small changes to make things more const-correct, as
+	reported by gcc's -Wwrite-strings.  as it says in the gcc manual,
+	this may be more nuisance than useful.  provoked by a suggestion
+	and code from arnaud desitter, arnaud@nimbus.geog.ox.ac.uk
+
+	minor documentation changes to note that this now compiles out
+	of the box on Mac OS X.
+
+Feb 10, 2002:
+	changed types in posix chars structure to quiet solaris cc.
+
+Jan 1, 2002:
+	fflush() or fflush("") flushes all files and pipes.
+
+	length(arrayname) returns number of elements; thanks to 
+	arnold robbins for suggestion.
+
+	added a makefile.win to make it easier to build on windows.
+	based on dan allen's buildwin.bat.
+
+Nov 16, 2001:
+	added support for posix character class names like [:digit:],
+	which are not exactly shorter than [0-9] and perhaps no more
+	portable.  thanks to dag-erling smorgrav for code.
+
+Feb 16, 2001:
+	removed -m option; no longer needed, and it was actually
+	broken (noted thanks to volker kiefel).
+
+Feb 10, 2001:
+	fixed an appalling bug in gettok: any sequence of digits, +,-, E, e,
+	and period was accepted as a valid number if it started with a period.
+	this would never have happened with the lex version.
+
+	other 1-character botches, now fixed, include a bare $ and a
+	bare " at the end of the input.
+
+Feb 7, 2001:
+	more (const char *) casts in b.c and tran.c to silence warnings.
+
 Nov 15, 2000:
 	fixed a bug introduced in august 1997 that caused expressions
 	like $f[1] to be syntax errors.  thanks to arnold robbins for